The 1.0 release is finally here! It's been a little over a year since our first public release, and we're ready to say that SKLL is 1.0. Read our massive release notes:
:warning: We did make some API- and config-file-breaking changes. They are listed at the end of the release notes. They should all be addressable by a quick find-and-replace.
Bug fixes
- Fixed path problems in iris example (issue 103, PR 171)
- Fixed bug where `ablated_features` field was incorrect when config file contained multiple feature sets (issue 125)
- Fixed bug where CV would crash with rare classes (issue 109, PR 165)
- Fixed issue where warning about extremely large feature values was being issued before rescaling
- Fixed issue where some warning messages used a mix of new-style and old-style replacement strings with old-style formatting
- Fixed a number of bugs with filtering `FeatureSet` objects and writing filtered sets to files.
- Fixed bug in `FeatureSet.__sub__` where feature names were being passed instead of indices.
- Fixed issue where `MegaMWriter` could not print numbers in Python 2.7.
New features
- SKLL releases are now for specific versions of scikit-learn. 1.0.0 requires scikit-learn 0.15.2 (issue 138, PR 170)
- Added [tutorial](https://skll.readthedocs.org/en/master/tutorial.html) to documentation that walks new users through using SKLL in much the same way as our PyData talks (issue 153)
- Added support for custom learners (issue 92, PR 183)
- Added two command-line utilities, `join_features` and `filter_features`, for joining and filtering feature files. These replace `join_megam` and `filter_megam` (issue 79, PR 198)
- Added support for specifying the field in ARFF, CSV, or TSV files that contains the IDs for each instance (issue 204, PR 206)
- Added train/test set sizes to result files (issue 150, PR 161)
- Added intercept to `print_model_weights` output (issue 155, PR 163)
- Added total time and end time-stamp to experiment results (issue 91, PR 167)
- Added exception when `featureset_name` is longer than 210 characters (issue 121, PR 168)
- Added regression example data, `boston` (issue 162)
- Added ability to specify number of grid search folds (issue 122, PR 175)
- Added warning message when the number of features in the trained model differs from the number in the `FeatureSet` passed to `Learner.predict()` (issue 145)
- Added `conda.yaml` file to repository to make conda package creation simpler (issue 159, PR 173)
- Added loads more unit tests, greatly increased unit test coverage, and generally cleaned up test modules (issues 97, 148, 157, 188, and 202; PRs 176, 184, 196, 203, and 205)
- Added `train_file` and `test_file` fields to config files, which can be used to specify single file feature sets. This greatly simplifies running simple experiments (issue 12, PR 197)
- Added support for merging feature sets with IDs in different orders (issue 149, PR 177)
- Added `ValueError` when invalid tuning objective is specified (issues 117 and 179; PRs 174 and 181)
- Added `shuffle` option to config files to decide whether training data should be shuffled before training. By default this is `False`, but if `grid_search` is `True`, we will automatically `shuffle`. Previously, the default was `True`, and there was no option in the config files. (issue 189, PR 190)
- Updated documentation to indicate that we're using `StratifiedKFold` (issue 160)
- Added `FeatureSet.__eq__` and `FeatureSet.__getitem__` methods.
Minor changes without issues
- Overhauled and cleaned up all documentation. [Look](https://skll.readthedocs.org) how pretty it is!
- Updated docstrings all over the place to be more accurate.
- Updated `generate_predictions` to use new `Reader` API.
- Added `argv` optional argument to all utility script `main` functions to simplify testing.
- Added `mock` tests, so SKLL now requires `mock` to work with Python 2.7.
- Added prettier SVG badges to README.
- Added link to Data Science at the Command Line to README.
- `LibSVMReader` now converts UTF-8 replacement characters that are used by `LibSVMWriter` when a feature name contains an `=`, `|`, `#`, `:`, or ` ` back to the original ASCII characters.
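
The `argv` change mentioned above follows a common `argparse` pattern, sketched here. The script, its arguments, and the file names are hypothetical and are not taken from any actual SKLL utility; only the idea of an optional `argv` parameter comes from the notes.

```python
import argparse


def main(argv=None):
    """Entry point for a hypothetical utility script.

    When argv is None, argparse reads sys.argv[1:], so command-line
    behavior is unchanged. Tests can pass a list of strings instead.
    """
    parser = argparse.ArgumentParser(description="Hypothetical feature-file tool.")
    parser.add_argument("infiles", nargs="+", help="input feature files")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="suppress informational output")
    args = parser.parse_args(argv)
    return args


# In a test, call main() directly rather than patching sys.argv.
args = main(["train.csv", "test.csv", "--quiet"])
print(args.infiles, args.quiet)
```

Because `parse_args(None)` falls back to `sys.argv[1:]`, the same function works unchanged as a console entry point.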
:warning: API breaking changes :warning:
- `FeatureSetWriter` :arrow_right: `Writer`
- `load_examples(path)` :arrow_right: `Reader.for_path(path).read()`
- `write_feature_file(...)` :arrow_right: `Writer.for_path(path, FeatureSet(...)).write()`
- `FeatureSet.classes` :arrow_right: `FeatureSet.labels`
- All other instances of word "classes" changed to "labels" (166)
- `FeatureSet.feat_vectorizer` :arrow_right: `FeatureSet.vectorizer`
- `run_ablation(all_combos=True)` :arrow_right: `run_configuration(ablation=None)`
- `run_ablation()` :arrow_right: `run_configuration(ablation=1)`
- `ExamplesTuple(ids, classes, features, vectorizer)` :arrow_right: `FeatureSet(name, ids, labels, features, vectorizer)`
- Removed `feature_hasher` argument to all `Learner` methods, because it's unnecessary
- `Learner.model_type` is now the actual type of the underlying model instead of just a string.
- `FeatureSet.__len__` now returns the number of examples instead of the number of features.
- Removed `skll.learner._REGRESSION_MODELS`; we now check for regression by seeing if the model is a subclass of `RegressorMixin`.
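
For the purely one-to-one renames in this list, a find-and-replace pass might look like the rough sketch below. The helper name `migrate_source` and the sample snippet are hypothetical. The patterns are textual, so they will also rewrite any unrelated `.classes` attribute, and the changes that alter arguments (`run_ablation`, `ExamplesTuple`, `write_feature_file`) are left for manual editing.

```python
import re

# (pattern, replacement) pairs for the one-to-one renames listed above.
# Purely textual: review the result, since unrelated names may match too.
API_RENAMES = [
    (re.compile(r"\bFeatureSetWriter\b"), "Writer"),
    (re.compile(r"\.feat_vectorizer\b"), ".vectorizer"),
    # \b keeps scikit-learn's `classes_` attribute untouched.
    (re.compile(r"\.classes\b"), ".labels"),
    # Only handles calls without nested parentheses.
    (re.compile(r"\bload_examples\(([^()]*)\)"), r"Reader.for_path(\1).read()"),
]


def migrate_source(source):
    """Apply each rename in turn to a string of Python source code."""
    for pattern, replacement in API_RENAMES:
        source = pattern.sub(replacement, source)
    return source


old_code = (
    "examples = load_examples(path)\n"
    "print(examples.classes, examples.feat_vectorizer)"
)
print(migrate_source(old_code))
```

Imports still need updating by hand, and any `load_examples` call that passes more than a path deserves a second look after the rewrite.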
:warning: Config file breaking changes :warning:
- Removed all short names for learners (PR 199)
- Can no longer use `classifiers` instead of `learners`
- `train_location` :arrow_right: `train_directory`
- `test_location` :arrow_right: `test_directory`
- `cv_folds_location` :arrow_right: `cv_folds_file`
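
The field renames above can be applied to an existing config file with a short script along these lines. This is a minimal sketch: `migrate_config_text` and the sample config are hypothetical, it assumes `test_location` maps to `test_directory`, and it does not expand removed short learner names, which still need to be replaced by hand.

```python
import re

# Old field name -> new field name, taken from the list above.
# Assumption: `test_location` maps to `test_directory`.
RENAMED_FIELDS = {
    "classifiers": "learners",
    "train_location": "train_directory",
    "test_location": "test_directory",
    "cv_folds_location": "cv_folds_file",
}

# Matches "key = value" or "key: value", capturing indent, key, and the rest.
FIELD_LINE = re.compile(r"^(\s*)([A-Za-z_]+)(\s*[=:].*)$")


def migrate_config_text(text):
    """Return config text with pre-1.0 field names replaced by 1.0 names."""
    migrated = []
    for line in text.splitlines():
        match = FIELD_LINE.match(line)
        if match and match.group(2) in RENAMED_FIELDS:
            indent, key, rest = match.groups()
            line = indent + RENAMED_FIELDS[key] + rest
        migrated.append(line)
    return "\n".join(migrated)


# Illustrative pre-1.0 config fragment (values are made up).
old_config = """[Input]
train_location = train
test_location = test
classifiers = ["LogisticRegression"]
cv_folds_location = folds.csv
"""
print(migrate_config_text(old_config))
```

Only the field name at the start of a line is rewritten, so comments and values that happen to mention an old name are left alone.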