Incompatible Changes
- Python 2.7 is no longer supported since the underlying version of scikit-learn no longer supports it (Issue 497, PR 506).
- Configuration field `objective` has been deprecated and replaced with `objectives` which [allows](https://skll.readthedocs.io/en/latest/run_experiment.html#objectives-optional) specifying multiple tuning objectives for grid search (Issue 381, PR 458).
- Grid search is now enabled by default, both in the API and when using a configuration file (Issue 463, PR 465).
- The `Predictor` class previously provided by the `generate_predictions` utility script is no longer available. If you were relying on this class, you should just load the model file and call `Learner.predict()` instead (Issue 562, PR 566).
- There are no longer any default grid search objectives since the choice of objective is best left to the user. Note that since grid search is enabled by default, you must either choose an objective or explicitly disable grid search (Issue 381, PR 458).
- `mean_squared_error` is no longer supported as a metric. Use `neg_mean_squared_error` instead (Issue 382, PR 470).
- The `cv_folds_file` configuration file field is now just called `folds_file` (Issue 382, PR 470).
- Running an experiment with the `learning_curve` task now requires specifying [`metrics`](https://skll.readthedocs.io/en/latest/run_experiment.html#metrics-optional) in the `Output` section instead of `objectives` in the `Tuning` section (Issue 382, PR 470).
- Previously, when reading in CSV/TSV files, missing data was automatically imputed as zeros, which is not appropriate in all cases. This is no longer the case and blanks are retained as is. Missing values will need to be explicitly dropped or replaced (see below) before using the file with SKLL (Issue 364, PRs 475 & 518).
- `pandas` and `seaborn` are now direct dependencies of SKLL, and not optional (Issues 455 & 364, PRs 475 & 508).
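
The field and metric renames listed above can be applied to an existing configuration file mechanically. The following is a minimal sketch using only the standard library; `migrate_config` is a hypothetical helper and not part of SKLL, and it assumes the old `objective` field held a single, unquoted metric name. It does not choose an objective for you, so a configuration that previously relied on the default objective still needs one added by hand (or grid search disabled).

```python
import configparser
import io

# Renames taken from the notes above: old name -> new name.
RENAMED_FIELDS = {"cv_folds_file": "folds_file"}
RENAMED_METRICS = {"mean_squared_error": "neg_mean_squared_error"}


def migrate_config(old_text):
    """Return the text of an old-style config with the renames applied.

    Hypothetical helper for illustration only; not part of SKLL.
    """
    # Disable interpolation so that "%" characters in values are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(old_text)

    for section in parser.sections():
        # Plain field renames, e.g. cv_folds_file -> folds_file.
        for old_name, new_name in RENAMED_FIELDS.items():
            if parser.has_option(section, old_name):
                parser.set(section, new_name, parser.get(section, old_name))
                parser.remove_option(section, old_name)

        # objective (single metric) -> objectives (list of metrics).
        if parser.has_option(section, "objective"):
            metric = parser.get(section, "objective").strip()
            metric = RENAMED_METRICS.get(metric, metric)
            parser.set(section, "objectives", repr([metric]))
            parser.remove_option(section, "objective")

    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


old_config = """\
[Input]
cv_folds_file = folds.csv

[Tuning]
objective = mean_squared_error
"""

new_config = migrate_config(old_config)
print(new_config)
```

Note that `configparser` lowercases field names and drops comments when it rewrites a file, so for a heavily commented configuration it may be preferable to make these edits by hand.
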
New features
- `CSVReader`/`CSVWriter` & `TSVReader`/`TSVWriter` now use `pandas` as the backend rather than custom code that relied on the `csv` module. This leads to significant speedups, especially for very large files (~5x for reading and ~10x for writing)! The speedup comes at the cost of a moderate increase in memory consumption. See detailed benchmarks [here](https://github.com/EducationalTestingService/skll/files/3637196/test_skll.pdf) (Issue 364, PRs 475 & 518).
- SKLL models now have a new [`pipeline` attribute](https://skll.readthedocs.io/en/latest/run_experiment.html#pipeline-optional) which makes it easy to manipulate and use them in `scikit-learn`, if needed (Issue 451, PR 474).
- `scikit-learn` updated to 0.21.3 (Issue 457, PR 559).
- The SKLL conda package is now a [generic Python package](https://www.anaconda.com/condas-new-noarch-packages/) which means the same package works on all platforms and on all Python versions >= 3.6. This package is hosted on the new, public [ETS anaconda channel](https://anaconda.org/ets).
- SKLL learner hyperparameters have been updated to match the new `scikit-learn` defaults and those upcoming in 0.22.0 (Issue 438, PR 533).
- Intermediate results for the grid search process are now available in the [`results.json`](https://skll.readthedocs.io/en/latest/run_experiment.html#results-files) files (Issue 431, PR 471).
- The K models trained for each split of a K-fold cross-validation experiment can now be [saved](https://skll.readthedocs.io/en/latest/run_experiment.html#save-cv-models-optional) to disk (Issue 501, PR 505).
- Missing values in CSV/TSV files can be dropped/replaced both via the [command line](https://skll.readthedocs.io/en/latest/utilities.html#cmdoption-filter-features-db) and the [API](https://skll.readthedocs.io/en/latest/api/data.html#skll.data.readers.CSVReader) (Issue 540, PR 542).
- Warnings from `scikit-learn` are now captured in SKLL log files (Issue 441, PR 480).
- `Learner.model_params()` and, consequently, the [`print_model_weights`](https://skll.readthedocs.io/en/latest/utilities.html#print-model-weights) utility script now work with models trained on hashed features (Issue 444, PR 466).
- The [`print_model_weights`](https://skll.readthedocs.io/en/latest/utilities.html#print-model-weights) utility script can now output feature weights sorted by class labels to improve readability (Issue 442, PR 468).
- The [`skll_convert`](https://skll.readthedocs.io/en/latest/utilities.html#skll-convert) utility script can now convert feature files that do not contain labels (Issue 426, PR 453).
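
Because blanks in CSV/TSV files are no longer imputed as zeros, files with missing values need to be cleaned first. The command-line and API options linked above are the supported route. As an illustration of what dropping versus replacing means, here is a standard-library-only sketch; `clean_rows` and its `replace_with` parameter are hypothetical names and do not mirror SKLL's own options.

```python
import csv
import io


def clean_rows(csv_text, replace_with=None):
    """Drop rows with blank cells, or fill the blanks if replace_with is given.

    Hypothetical helper for illustration only; not part of SKLL.
    Returns the header row and the list of cleaned data rows.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]

    cleaned = []
    for row in body:
        if not row:
            continue  # skip completely empty lines
        blanks = [cell.strip() == "" for cell in row]
        if any(blanks):
            if replace_with is None:
                continue  # drop the whole row
            # Replace only the blank cells and keep everything else.
            row = [replace_with if blank else cell
                   for cell, blank in zip(row, blanks)]
        cleaned.append(row)
    return header, cleaned


data = "id,f1,f2,y\na,1,,0\nb,2,3,1\n"

# Drop rows that contain a missing value.
header, dropped = clean_rows(data)

# Alternatively, restore the old behavior by filling blanks with zeros.
_, filled = clean_rows(data, replace_with="0")
```

Filling with `"0"` reproduces the previous implicit behavior, which is only appropriate when zero is a meaningful value for the feature in question.
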
Bugfixes & Improvements
- Fix several bugs in how various tuning objectives and output metrics were computed (Issues 545 & 548, PR 551).
- Fix how [`pos_label_str`](https://skll.readthedocs.io/en/latest/run_experiment.html#pos-label-str-optional) is documented, read in, and used for classification tasks (Issues 550 & 570, PRs 566 & 571).
- Fix several bugs in the `generate_predictions` utility script and streamline its implementation to _not_ rely on an externally specified positive label or index but rather read it from the model file or infer it (Issues 484 & 562, PR 566).
- Fix bug due to overlap between tuning objectives and metrics that could prevent metric computation (Issue 564, PR 567).
- Using an externally specified `folds_file` for grid search now works for `evaluate` and `predict` tasks, not just `train` (Issue 536, PR 538).
- Fix incorrect application of sampling _before_ feature scaling in `Learner.predict()` (Issue 472, PR 474).
- Disable feature sampling for `MultinomialNB` learner since it cannot handle negative values (Issue 473, PR 474).
- Add missing logger attribute to `Learner.FilteredLeaveOneGroupOut` (Issue 541, PR 543).
- Fix `FeatureSet.has_labels` to recognize a list of `None` objects, which is what happens when you read in an unlabeled data set and pass `label_col=None` (Issue 426, PR 453).
- Fix bug in `ARFFWriter` that adds/removes `label_col` from the field names even if it's `None` to begin with (Issue 452, PR 453).
- Do not produce unnecessary warnings for learning curves (Issue 410, PR 458).
- Show a warning when applying feature hashing to multiple feature files (Issue 461, PR 479).
- Fix loading issue for saved `MultinomialNB` models (Issue 573, PR 574).
- Reduce memory usage for learning curve experiments by explicitly closing `matplotlib` figure instances after they are saved.
- Improve SKLL's cross-platform operation by explicitly reading and writing files as UTF-8 in readers and writers and by using the `newline` parameter when writing files.
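
The UTF-8 and `newline` point above refers to a general Python idiom rather than anything SKLL-specific. The sketch below shows that idiom with the standard `csv` module; it is not SKLL's actual reader/writer code, and `write_rows` and `read_rows` are hypothetical names.

```python
import csv
import os
import tempfile


def write_rows(path, rows):
    # An explicit encoding avoids depending on the platform default.
    # newline="" stops Python from translating line endings, so the csv
    # module alone decides them (no "\r\r\n" on Windows).
    with open(path, "w", encoding="utf-8", newline="") as handle:
        csv.writer(handle).writerows(rows)


def read_rows(path):
    with open(path, "r", encoding="utf-8", newline="") as handle:
        return list(csv.reader(handle))


rows = [["id", "text"], ["1", "naïve café"]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "features.csv")
    write_rows(path, rows)
    round_tripped = read_rows(path)
    with open(path, "rb") as handle:
        raw = handle.read()
```
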
Documentation Updates
- Reorganize documentation to explicitly document all types of output files and link them to the corresponding configuration fields in the `Output` section (Issue 459, PR 568).
- Add new interactive tutorial that uses a Jupyter notebook hosted on binder (Issue 448, PRs 547 & 552).
- Add a new page to official documentation explaining how the SKLL code is organized for new developers (Issue 511, PR 519).
- Update SKLL contribution guidelines and link to them from official documentation (Issues 498 & 514, PRs 503 & 519).
- Update documentation to indicate that `pandas` and `seaborn` are now direct dependencies and not optional (Issue 553, PR 563).
- Update `LogisticRegression` learner documentation to talk explicitly about penalties and solvers (Issue 490, PR 500).
- Properly [document](https://skll.readthedocs.io/en/latest/api/data.html#notes-about-ids-label-conversion) the internal conversion of string labels to ints/floats and possible edge cases (Issue 436, PR 476).
- Add feature scaling to Boston regression example (Issue 469, PR 478).
- Several other additions/updates to documentation (Issue 459, PR 568).
Tests
- Make `tests` into a package so that we can do something like `from skll.tests.utils import X` etc. (Issue 530, PR 531).
- Add new tests based on SKLL examples so that we would know if examples ever break with any SKLL updates (Issues 529 & 544, PR 546).
- Tweak tests to make test suite runnable on Windows (and pass!).
- Add [Azure Pipelines](https://azure.microsoft.com/en-us/services/devops/pipelines/) integration for automated test builds on Windows.
- Add several new comprehensive tests for all new features and bugfixes, and remove older, unnecessary tests. See the various PRs above for details.
- Current code coverage for SKLL tests is at 95%, the highest it has ever been!
Other changes
- Replace `prettytable` with the more actively maintained `tabulate` (Issue 356, PR 467).
- Make sure entire codebase complies with PEP8 (Issue 460, PR 568).
- Update the year to 2019 everywhere (Issue 447, PRs 456 & 568).
- Update TravisCI configuration to use `conda_requirements.txt` for building environment (PR 515).
Contributors
(*Note*: This list is sorted alphabetically by last name and not by the quality/quantity of contributions to this release.)
Supreeth Baliga (SupreethBaliga), Jeremy Biggs (jbiggsets), Aoife Cahill (aoifecahill), Ananya Ganesh (ananyaganesh), R. Gokul (rgokul), Binod Gyawali (bndgyawali), Nitin Madnani (desilinguist), Matt Mulholland (mulhod), Robert Pugh (Lguyogiro), Maxwell Schwartz (maxwell-schwartz), Eugene Tsuprun (etsuprun), Avijit Vajpayee (AVajpayeeJr), Mengxuan Zhao (chaomenghsuan)