Skll

Latest version: v5.1.0

3.1.0

This is a new release with dependency updates, bugfixes, and improvements.

💥 Dependency Updates 💥

- `scikit-learn` has been updated to v1.1.2. As a result, the same SKLL experiments may yield different results when run with SKLL 3.1.0 (Issue 713, PR 716).

🛠 Bugfixes & Improvements 🛠

- SKLL Learners now support a new method `get_feature_names_out()` which returns the _correct_ set of features actually used by the learner. Since some features might be removed by the feature selector, relying on the vectorizer vocabulary is not enough in those cases. This method allows easy access to the names of the actual features used, even if the selector has removed some features (Issue 714, PR 715).
- Updated learning curve code to use the new API for `seaborn` v0.12.0 (PR 716)
- Removed the Boston housing dataset from SKLL examples and tests. This dataset has ethical issues and is being removed from scikit-learn. (Issue 700, 717)
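
To see why the vectorizer vocabulary alone can be misleading once a feature selector is involved, here is a minimal, library-free sketch. The vocabulary, the boolean mask, and the helper `selected_feature_names` are hypothetical stand-ins for illustration and are not SKLL internals; on a trained learner you would call `get_feature_names_out()` instead.

```python
# Hypothetical illustration only: none of these names are SKLL APIs.

def selected_feature_names(vocabulary, support_mask):
    """Return the feature names that survive feature selection.

    vocabulary   -- dict mapping feature name -> column index
    support_mask -- list of booleans, True where the selector kept the column
    """
    # Order the names by column index so that they line up with the mask.
    names_by_index = sorted(vocabulary, key=vocabulary.get)
    return [name for name, keep in zip(names_by_index, support_mask) if keep]


vocabulary = {"length": 0, "num_typos": 1, "avg_word_len": 2}
support_mask = [True, False, True]  # the selector dropped "num_typos"

kept = selected_feature_names(vocabulary, support_mask)
print(kept)  # ['length', 'avg_word_len']
```

The vocabulary still lists three features, but only two of them reach the model, which is the gap the new method is meant to close.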

✔️ Tests ✔️

- Added new tests for `Learner.get_feature_names_out()`. (Issue 714, PR 715)

👩‍🔬 Contributors 👨‍🔬

(*Note*: This list is sorted alphabetically by last name and not by the quality/quantity of contributions to this release.)

Sanjna Kashyap (Frost45), Nitin Madnani (desilinguist), Matt Mulholland (mulhod), and Remo Nitschke (remo-help).

3.0

💥 Breaking Changes 💥

- Python 3.7 is no longer officially supported while official support for Python 3.10 has been added (Issue 701, PR 711).

- `scikit-learn` has been updated to v1.0.1 (Issue 699, PR 702).

- The configuration field `pos_label_str` from the “Tuning” section has been renamed to [`pos_label`](https://skll.readthedocs.io/en/latest/run_experiment.html#pos-label-optional). Older configuration files with `pos_label_str` will now raise an exception (Issue 569, PR 706).

- The configuration field `log` from the “Output” section that was renamed to [`logs`](https://skll.readthedocs.io/en/latest/run_experiment.html#logs-optional) in SKLL v2.5 has now been completely deprecated. Older configuration files with `log` will now raise an exception (Issue 671, PR 705).
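
Older configuration files can be updated by hand, but the two renames above are mechanical enough to script. The following is a rough sketch using only the standard library; the helper `migrate_config_text` and the sample values are hypothetical, and it assumes the file is plain INI text that `configparser` can parse.

```python
import configparser
import io

# Old field -> new field, per section, as described in the notes above.
RENAMES = {
    "Tuning": {"pos_label_str": "pos_label"},
    "Output": {"log": "logs"},
}


def migrate_config_text(text):
    """Return config text with the deprecated field names renamed."""
    # Interpolation is disabled so that values containing "%" pass through.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    for section, fields in RENAMES.items():
        if not parser.has_section(section):
            continue
        for old, new in fields.items():
            if parser.has_option(section, old):
                # Carry the value over unchanged, then drop the old field.
                parser.set(section, new, parser.get(section, old))
                parser.remove_option(section, old)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


old_config = """
[Tuning]
pos_label_str = yes

[Output]
log = output/logs
"""
new_config = migrate_config_text(old_config)
print(new_config)
```

Note that `configparser` discards comments when writing, so a hand edit may be preferable for heavily annotated files.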

💡 New features 💡

- SKLL now supports specifying [custom seed values](https://skll.readthedocs.io/en/latest/run_experiment.html#cv-seed-optional) for cross-validation tasks. This option may be useful for running the same cross-validation experiment multiple times (with the same number of differently constituted folds) to get a sense of the variance across replicates (Issue 593, PR 707).
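
The idea of "the same number of differently constituted folds" can be sketched without any libraries. The helper `make_folds` below is hypothetical and does not reproduce how SKLL or scikit-learn actually build folds; it only shows how a seed controls which examples land in which fold.

```python
import random


def make_folds(example_ids, num_folds, seed):
    """Shuffle the IDs with the given seed and deal them into folds."""
    ids = list(example_ids)
    random.Random(seed).shuffle(ids)
    # Round-robin assignment keeps fold sizes within one of each other.
    return [sorted(ids[i::num_folds]) for i in range(num_folds)]


ids = range(10)
folds_a = make_folds(ids, num_folds=5, seed=1)
folds_b = make_folds(ids, num_folds=5, seed=2)

print(len(folds_a) == len(folds_b))           # True: same number of folds
print(folds_a == make_folds(ids, 5, seed=1))  # True: a given seed is reproducible
```

Running the same experiment with several seeds and comparing the scores gives a rough picture of how much the results depend on the particular split.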

🛠 Bugfixes & Improvements 🛠

- Using the `--drop-blanks` option with [`filter_features`](https://skll.readthedocs.io/en/latest/utilities.html#filter-features) now raises a more useful error for the case when every single row in a tabular feature file has a blank column (Issue 693, PR 703).

- SKLL conda packages are again generic Python packages instead of platform-specific ones (Issue 710, PR 711).

📖 Documentation Updates 📖

- Add a [new section](https://skll.readthedocs.io/en/latest/tutorial.html#create-virtual-environment-with-skll) to the hands-on tutorial explaining how to first install SKLL in a virtual environment (Issue 689, PR 709).

- Add missing link to SKLL repository in the tutorial data section (Issue 688, PR 691).

- Update [`CONTRIBUTING.md`](https://github.com/EducationalTestingService/skll/blob/main/CONTRIBUTING.md) to include more detailed instructions for pushing to the SKLL repository (Issue #680, PR 704).

- Link to the RSMTool implementation of `quadratic_weighted_kappa` which supports continuous values and can be used as a custom metric in SKLL for both hyper-parameter tuning as well as validation. See the **quadratic_weighted_kappa** bullet under the [objectives](https://skll.readthedocs.io/en/latest/run_experiment.html#objectives) section (Issue 512, PR 704).

- Continued readability improvements to function and method docstrings.

✔️ Tests ✔️

- All tests now specify `local=True` when making `run_configuration()` calls. This ensures that tests always run in local mode and prevents an unnecessary check for the `gridmap` library (Issue 616, PR 708).

👩‍🔬 Contributors 👨‍🔬

(*Note*: This list is sorted alphabetically by last name and not by the quality/quantity of contributions to this release.)

Binod Gyawali (bndgyawali), Robbie Imbrie (RobertImbrie), Sanjna Kashyap (Frost45), Sözen Ozkan Grigoras (sozkangrigoras), Nitin Madnani (desilinguist), Matt Mulholland (mulhod), and Damien Xie (damien2012eng).

2.5

💥 Breaking Changes 💥

- Python 3.6 is no longer officially supported since the latest versions of `pandas` and `numpy` have dropped support for it.

- Older top-level imports have been removed and should now be rewritten as follows (Issue 661, PR 662):
  + `from skll import Learner` ➡️ `from skll.learner import Learner`
  + `from skll import FeatureSet` ➡️ `from skll.data import FeatureSet`
  + `from skll import run_configuration` ➡️ `from skll.experiments import run_configuration`

- The default value for the `class_labels` keyword argument for `Learner.predict()` is now `True` instead of `False`. Therefore, for probabilistic classifiers, this method will now return class labels by default instead of class probabilities. To obtain class probabilities, set `class_labels` to `False` when calling this method (Issue 621, PR 622).

- The [`filter_features`](https://skll.readthedocs.io/en/latest/utilities.html#filter-features) script now offers more intuitive command line options. Input files must be specified using `-i`/`--input` and output files must be specified using `-o`/`--output`. Additionally, `--inverse` must now be used to invert the filtering command since `-i` is used for input files (Issue 598, PR 660).

- The `MegaMReader` and `MegaMWriter` classes have been removed from SKLL since `.megam` files are no longer supported by SKLL (Issue 532, PR 557).

- The [`param_grids`](https://skll.readthedocs.io/en/latest/run_experiment.html#param-grids-optional) option in the configuration file is now a list of dictionaries, one for each learner specified in the `learners` option, instead of a list of lists of dictionaries. Correspondingly, the `param_grid` option in [`Learner.train()`](https://skll.readthedocs.io/en/latest/api/learner.html#skll.learner.Learner.train) and [`Learner.cross_validate()`](https://skll.readthedocs.io/en/latest/api/learner.html#skll.learner.Learner.cross_validate) is now a dictionary instead of a list of dictionaries, and the default parameter grids for each learner are also simply dictionaries (Issue 618, PR 619).

- Running a [`learning_curve` task](https://skll.readthedocs.io/en/latest/run_experiment.html#learning-curve) via a configuration file now requires at least 500 examples. Fewer examples will raise a `ValueError`. This behavior can only be overridden when using [`Learner.learning_curve()`](https://skll.readthedocs.io/en/latest/api/learner.html#skll.learner.Learner.learning_curve) directly via the API (Issue 624, PR 631).
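
The change in shape for `param_grids` is easiest to see side by side. The sketch below uses made-up grids for two learners and a hypothetical helper, `flatten_param_grids`, that only covers the simple case of one dictionary per learner; how to merge several dictionaries for one learner is not described in these notes, so the helper refuses to guess.

```python
def flatten_param_grids(old_grids):
    """Convert a list of lists of dicts into a list of dicts.

    Only handles the case where each learner had exactly one
    dictionary; anything else needs a manual decision.
    """
    new_grids = []
    for per_learner in old_grids:
        if len(per_learner) != 1:
            raise ValueError("expected exactly one grid dictionary per learner")
        new_grids.append(dict(per_learner[0]))
    return new_grids


# Old style: one list of dictionaries per learner (values are illustrative).
old_style = [[{"C": [0.1, 1.0]}], [{"C": [1.0, 10.0]}]]

# New style: one dictionary per learner.
new_style = flatten_param_grids(old_style)
print(new_style)  # [{'C': [0.1, 1.0]}, {'C': [1.0, 10.0]}]
```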

💡 New features 💡

- [`VotingClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html) and [`VotingRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingRegressor.html) from scikit-learn are now [available](https://skll.readthedocs.io/en/latest/run_experiment.html#learners) for use in SKLL. This was done by adding a new [`VotingLearner`](https://skll.readthedocs.io/en/latest/api/learner.html#module-skll.learner.voting) class that uses [`Learner`](https://skll.readthedocs.io/en/latest/api/learner.html#module-skll.learner) instances to represent underlying estimators (Issue 488, PR 665).

- SKLL now supports [custom, user-defined metrics](https://skll.readthedocs.io/en/latest/custom_metrics.html) for both hyperparameter tuning as well as evaluation (Issue #606, PR 612).

- The following new built-in classification metrics are now available in SKLL: `f05`, `f05_score_macro`, `f05_score_micro`, `f05_score_weighted`, `jaccard`, `jaccard_macro`, `jaccard_micro`, `jaccard_weighted`, `precision_macro`, `precision_micro`, `precision_weighted`, `recall_macro`, `recall_micro`, and `recall_weighted` (Issues 609 and 610, PRs 607 and 612).

- `scikit-learn` has been updated to 0.24.1 (Issue 653, PR 659).
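
For readers unfamiliar with the `f05` family, F0.5 is the F-beta score with beta set to 0.5, which weights precision more heavily than recall. Below is a from-scratch sketch for the binary case, written as a plain function of true and predicted labels. The name `f05_binary`, the `pos_label` argument, and the choice to return 0.0 when there are no true positives are assumptions made for this illustration, not a description of how SKLL computes its built-in metric.

```python
def f05_binary(y_true, y_pred, pos_label=1):
    """F-beta score with beta = 0.5 for a single positive class."""
    beta_sq = 0.5 ** 2
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    if tp == 0:
        # Convention chosen for this sketch: no true positives scores 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


y_true = [1, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1]
print(round(f05_binary(y_true, y_pred), 4))  # 0.4545
```

Here precision is 1/2 and recall is 1/3, so the score is 5/11, closer to precision than the plain F1 score would be.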

🛠 Bugfixes & Improvements 🛠

- Hyperparameter tuning now uses 5-fold cross-validation, instead of 3, to match the change in the default value of the `cv` parameter for [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). This will marginally increase the time taken for experiments with grid search but should produce more reliable results (Issue #487, PR 667).

- The SKLL codebase now uses sub-packages instead of very long modules which makes it easier to navigate and understand (Issue 600, PR 601).

- The `log` configuration file option has been renamed to [`logs`](https://skll.readthedocs.io/en/latest/run_experiment.html#logs-optional). Using `log` will still work but will raise a warning. The `log` option will be removed entirely in the next release (Issue 520, PR 670).

- Learning curves are now correctly generated for probabilistic classifiers (Issue 648, PR 649).

- Saving models in the current directory via [`Learner.save()`](https://skll.readthedocs.io/en/latest/api/learner.html#skll.learner.Learner.save) no longer requires adding `./` to the path (Issue 572, PR 604).

- The [`filter_features`](https://skll.readthedocs.io/en/latest/utilities.html#filter-features) script no longer automatically assumes labels specified with `-L` or `--label` to be strings (Issue 598, PR 660).

- Remove the `create_label_dict` keyword argument from [`Learner.train()`](https://skll.readthedocs.io/en/latest/api/learner.html#skll.learner.Learner.train) since it did not need to be user-facing (Issue 565, PR 605).

- Do not return 0 from correlation metrics when `NaN` is more appropriate. Doing this resulted in incorrect hyperparameter tuning results (Issue 585, PR 588).

- The `Learner._check_input_formatting()` private method now works correctly for dense featuresets (Issue 656, PR 658).

- SKLL conda packages are again platform-specific and the recipe now uses a `conda_build_config.yaml` to build the Python 3.7, 3.8, and 3.9 variants in one go (Issue 623, PR XXX).

- Several useful changes to the SKLL code style:
  + Standardize string concatenation (Issue 636, PR 645)
  + Use `with` context manager when opening files (Issue 641, PR 644)
  + Use f-strings where possible (Issue 633, PR 634)
  + Follow standard guidelines for sorting imports (Issue 638, PR 650)
  + Use [`pre-commit`](https://pre-commit.com) hooks to enforce code formatting guidelines during development (Issue #646, PR 650)
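
On the correlation fix above: a Pearson correlation is undefined when either input has zero variance, and reporting 0 in that case makes an unscorable result look like a legitimate one. The sketch below shows the distinction with a hypothetical helper, `pearson_or_nan`; it is not SKLL's implementation.

```python
import math


def pearson_or_nan(xs, ys):
    """Pearson correlation that returns NaN when it is undefined."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Zero variance: the correlation is undefined, not zero.
        return math.nan
    return cov / math.sqrt(var_x * var_y)


print(pearson_or_nan([1, 2, 3], [2, 4, 6]))              # 1.0
print(math.isnan(pearson_or_nan([1, 2, 3], [5, 5, 5])))  # True
```

One plausible way a 0 could distort tuning is that it would outrank any genuinely negative correlation, whereas NaN marks the setting as having no usable score.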

📖 Documentation Updates 📖

- Update [`CONTRIBUTING.md`](https://github.com/EducationalTestingService/skll/blob/main/CONTRIBUTING.md) with the new sub-package structure of the SKLL codebase (Issue #611, PR 628).

- Add a [section](https://github.com/EducationalTestingService/skll#citing) to the README that explains how to cite SKLL (Issue 599, PR 672).

- Add Azure Pipelines badge to the README (Issue 608, PR 672).

- Add explicit `.readthedocs.yml` file to configure the auto-built [documentation](https://skll.readthedocs.io) (Issue #668, PR 672).

- Make it clear that not specifying the [`predictions`](https://skll.readthedocs.io/en/latest/run_experiment.html#predictions-optional) configuration file option leads to prediction files being output in the current directory (Issue 664, PR 672).

✔️ Tests ✔️

- Reduce code duplication in tests (Issue 635, PR 642).

- The Linux and Windows CI builds now use Python 3.7 and 3.8 respectively, instead of Python 3.6 (Issue 524, PR 665)

- Both the Linux and Windows CI builds now use consistent `nosetests` commands (Issue 584, PR 665).

- `nose-cov` is now automatically installed via `conda_requirements.txt` when setting up a development environment instead of requiring a separate step (Issue 527, PR 672).

- Add comprehensive new tests for voting learners, custom metrics, new built-in metrics, as well as for new bugfixes.

- Current code coverage for SKLL tests is at 97%, the highest it has ever been!

👩‍🔬 Contributors 👨‍🔬

(*Note*: This list is sorted alphabetically by last name and not by the quality/quantity of contributions to this release.)

Aoife Cahill (aoifecahill), Binod Gyawali (bndgyawali), Nitin Madnani (desilinguist), Matt Mulholland (mulhod), Sree Harsha Ramesh (srhrshr)

2.1

This is a minor release of SKLL with the _only_ change being that it is now compatible with scikit-learn v0.22.2.

⚡️ **There are several [changes](https://scikit-learn.org/stable/whats_new/v0.22.html) in scikit-learn v0.22 that might cause several estimators and functions to produce different results even when fit with the same data and parameters. Therefore, SKLL 2.1 can also yield different results compared to previous versions even with the same data and same settings.** ⚡️

💡 New features 💡

- `scikit-learn` updated to 0.22.2 (Issue 594, PR 595).

🔎 Other minor changes 🔎

- Update imports to align with the new `scikit-learn` API.
- A minor bugfix in `logutils.py`.
- Update some test outputs due to changes in `scikit-learn` models and functions.
- Update some tests to make pre-release testing for conda and PyPI packages possible.

👩‍🔬 Contributors 👨‍🔬

(*Note*: This list is sorted alphabetically by last name and not by the quality/quantity of contributions to this release.)

Aoife Cahill (aoifecahill), Binod Gyawali (bndgyawali), Matt Mulholland (mulhod), Nitin Madnani (desilinguist), and Mengxuan Zhao (chaomenghsuan).

2.0

💥 Incompatible Changes 💥

- Python 2.7 is no longer supported since the underlying version of scikit-learn no longer supports it (Issue 497, PR 506).

- Configuration field `objective` has been deprecated and replaced with `objectives` which [allows](https://skll.readthedocs.io/en/latest/run_experiment.html#objectives-optional) specifying multiple tuning objectives for grid search (Issue 381, PR 458).

- Grid search is now enabled by default in both the API as well as while using a configuration file (Issue 463, PR 465).

- The `Predictor` class previously provided by the `generate_predictions` utility script is no longer available. If you were relying on this class, you should just load the model file and call `Learner.predict()` instead (Issue 562, PR 566).

- There are no longer any default grid search objectives since the choice of objective is best left to the user. Note that since grid search is enabled by default, you must either choose an objective or explicitly disable grid search (Issue 381, PR 458).

- `mean_squared_error` is no longer supported as a metric. Use `neg_mean_squared_error` instead (Issue 382, PR 470).

- The `cv_folds_file` configuration file field is now just called `folds_file` (Issue 382, PR 470).

- Running an experiment with the `learning_curve` task now requires specifying [`metrics`](https://skll.readthedocs.io/en/latest/run_experiment.html#metrics-optional) in the `Output` section instead of `objectives` in the `Tuning` section (Issue 382, PR 470).

- Previously, when reading in CSV/TSV files, missing data was automatically imputed as zeros, which is not appropriate in all cases. This is no longer the case and blanks are retained as is. Missing values will need to be explicitly dropped or replaced (see below) before using the file with SKLL (Issue 364, PRs 475 & 518).

- `pandas` and `seaborn` are now direct dependencies of SKLL, and not optional (Issues 455 & 364, PRs 475 & 508).
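
Since blanks in CSV/TSV files are no longer silently turned into zeros, it can help to check a file for them before running an experiment. The following standard-library sketch shows the two choices, dropping or replacing; the helpers `drop_blank_rows` and `replace_blanks` and the sample data are hypothetical and are not part of SKLL.

```python
import csv
import io

# A tiny in-memory CSV in which row "A" has a blank value for "f2".
raw = "id,f1,f2\nA,1.0,\nB,2.0,3.5\n"
rows = list(csv.DictReader(io.StringIO(raw)))


def drop_blank_rows(rows):
    """Keep only rows in which every cell has a value."""
    return [row for row in rows if all(value != "" for value in row.values())]


def replace_blanks(rows, fill):
    """Return copies of the rows with blank cells replaced by `fill`."""
    return [{k: (fill if v == "" else v) for k, v in row.items()} for row in rows]


print([row["id"] for row in drop_blank_rows(rows)])  # ['B']
print(replace_blanks(rows, "0")[0]["f2"])            # 0
```

Replacing blanks with "0" reproduces the old behavior explicitly, which makes the imputation a visible decision rather than a hidden default.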

💡 New features 💡

- `CSVReader`/`CSVWriter` & `TSVReader`/`TSVWriter` now use `pandas` as the backend rather than custom code that relied on the `csv` module. This leads to significant speedups, especially for very large files (~5x for reading and ~10x for writing)! The speedup comes at the cost of moderate increase in memory consumption. See detailed benchmarks [here](https://github.com/EducationalTestingService/skll/files/3637196/test_skll.pdf) (Issue #364, PRs 475 & 518).

- SKLL models now have a new [`pipeline` attribute](https://skll.readthedocs.io/en/latest/run_experiment.html#pipeline-optional) which makes it easy to manipulate and use them in `scikit-learn`, if needed (Issue 451, PR 474).

- `scikit-learn` updated to 0.21.3 (Issue 457, PR 559).

- The SKLL conda package is now a [generic Python package](https://www.anaconda.com/condas-new-noarch-packages/) which means the same package works on all platforms and on all Python versions >= 3.6. This package is hosted on the new, public [ETS anaconda channel](https://anaconda.org/ets).

- SKLL learner hyperparameters have been updated to match the new `scikit-learn` defaults and those upcoming in 0.22.0 (Issue 438, PR 533).

- Intermediate results for the grid search process are now available in the [`results.json`](https://skll.readthedocs.io/en/latest/run_experiment.html#results-files) files (Issue 431, 471).

- The K models trained for each split of a K-fold cross-validation experiment can now be [saved](https://skll.readthedocs.io/en/latest/run_experiment.html#save-cv-models-optional) to disk (Issue 501, PR 505).

- Missing values in CSV/TSV files can be dropped/replaced both via the [command line](https://skll.readthedocs.io/en/latest/utilities.html#cmdoption-filter-features-db) and the [API](https://skll.readthedocs.io/en/latest/api/data.html#skll.data.readers.CSVReader) (Issue 540, PR 542).

- Warnings from `scikit-learn` are now captured in SKLL log files (Issue 441, PR 480).

- `Learner.model_params()` and, consequently, the [`print_model_weights`](https://skll.readthedocs.io/en/latest/utilities.html#print-model-weights) utility script now work with models trained on hashed features (Issue 444, PR 466).

- The [`print_model_weights`](https://skll.readthedocs.io/en/latest/utilities.html#print-model-weights) utility script can now output feature weights sorted by class labels to improve readability (Issue 442, PR 468).

- The [`skll_convert`](https://skll.readthedocs.io/en/latest/utilities.html#skll-convert) utility script can now convert feature files that do not contain labels (Issue 426, PR 453).

🛠 Bugfixes & Improvements 🛠

- Fix several bugs in how various tuning objectives and output metrics were computed (Issues 545 & 548, PR 551).

- Fix how [`pos_label_str`](https://skll.readthedocs.io/en/latest/run_experiment.html#pos-label-str-optional) is documented, read in, and used for classification tasks (Issues 550 & 570, PRs 566 & 571).

- Fix several bugs in the `generate_predictions` utility script and streamline its implementation to _not_ rely on an externally specified positive label or index but rather read it from the model file or infer it (Issues 484 & 562, PR 566).

- Fix bug due to overlap between tuning objectives and metrics that could prevent metric computation (Issue 564, PR 567).

- Using an externally specified `folds_file` for grid search now works for `evaluate` and `predict` tasks, not just `train` (Issue 536, PR 538).

- Fix incorrect application of sampling _before_ feature scaling in `Learner.predict()` (Issue 472, PR 474).

- Disable feature sampling for `MultinomialNB` learner since it cannot handle negative values (Issue 473, PR 474).

- Add missing logger attribute to `Learner.FilteredLeaveOneGroupOut` (Issue 541, PR 543).

- Fix `FeatureSet.has_labels` to recognize list of `None` objects which is what happens when you read in an unlabeled data set and pass `label_col=None` (Issue 426, PR 453).

- Fix bug in `ARFFWriter` that adds/removes `label_col` from the field names even if it's `None` to begin with (Issue 452, PR 453).

- Do not produce unnecessary warnings for learning curves (Issue 410, PR 458).

- Show a warning when applying feature hashing to multiple feature files (Issue 461, PR 479).

- Fix loading issue for saved `MultinomialNB` models (Issue 573, PR 574).

- Reduce memory usage for learning curve experiments by explicitly closing `matplotlib` figure instances after they are saved.

- Improve SKLL’s cross-platform operation by explicitly reading and writing files as UTF-8 in readers and writers and by using the `newline` parameter when writing files.

📖 Documentation Updates 📖

- Reorganize documentation to explicitly document all types of output files and link them to the corresponding configuration fields in the `Output` section (Issue 459, PR 568).

- Add new interactive tutorial that uses a Jupyter notebook hosted on binder (Issue 448, PRs 547 & 552).

- Add a new page to official documentation explaining how the SKLL code is organized for new developers (Issue 511, PR 519).

- Update SKLL contribution guidelines and link to them from official documentation (Issues 498 & 514, PR 503 & 519).

- Update documentation to indicate that `pandas` and `seaborn` are now direct dependencies and not optional (Issue 553, PR 563).

- Update `LogisticRegression` learner documentation to talk explicitly about penalties and solvers (Issue 490, PR 500).

- Properly [document](https://skll.readthedocs.io/en/latest/api/data.html#notes-about-ids-label-conversion) the internal conversion of string labels to ints/floats and possible edge cases (Issue 436, PR 476).

- Add feature scaling to Boston regression example (Issue 469, PR 478).

- Several other additions/updates to documentation (Issue 459, PR 568).

✔️ Tests ✔️

- Make `tests` into a package so that imports such as `from skll.tests.utils import X` are possible (Issue 530, PR 531).

- Add new tests based on SKLL examples so that we would know if examples ever break with any SKLL updates (Issues 529 & 544, PR 546).

- Tweak tests to make test suite runnable on Windows (and pass!).

- Add [Azure Pipelines](https://azure.microsoft.com/en-us/services/devops/pipelines/) integration for automated test builds on Windows.

- Added several new comprehensive tests for all new features and bugfixes. Also, removed older, unnecessary tests. See various PRs above for details.

- Current code coverage for SKLL tests is at 95%, the highest it has ever been!

🔍 Other changes 🔍

- Replace `prettytable` with the more actively maintained `tabulate` (Issue 356, PR 467).

- Make sure entire codebase complies with PEP8 (Issue 460, PR 568).

- Update the year to 2019 everywhere (Issue 447, PRs 456 & 568).

- Update TravisCI configuration to use `conda_requirements.txt` for building environment (PR 515).

👩‍🔬 Contributors 👨‍🔬

(*Note*: This list is sorted alphabetically by last name and not by the quality/quantity of contributions to this release.)

Supreeth Baliga (SupreethBaliga), Jeremy Biggs (jbiggsets), Aoife Cahill (aoifecahill), Ananya Ganesh (ananyaganesh), R. Gokul (rgokul), Binod Gyawali (bndgyawali), Nitin Madnani (desilinguist), Matt Mulholland (mulhod), Robert Pugh (Lguyogiro), Maxwell Schwartz (maxwell-schwartz), Eugene Tsuprun (etsuprun), Avijit Vajpayee (AVajpayeeJr), Mengxuan Zhao (chaomenghsuan)

1.5.3

This is a minor release of SKLL with the most notable change being compatibility with the latest version of scikit-learn (v0.20.1).

What's new

- SKLL is now compatible with scikit-learn v0.20.1 (Issue 432, PR 439).
- `GradientBoostingClassifier` and `GradientBoostingRegressor` now accept sparse matrices as input (Issue 428, PR 429).
- The `model_params` property now works for SVC learners with a linear kernel (Issue 425, PR 443).
- Improved documentation (Issue 423, PR 437).
- Update `generate_predictions` to output the probabilities for _all_ classes instead of just the first class (Issue 430, PR 433). **Note**: this change breaks backward compatibility with previous SKLL versions since the output file now _always_ includes a column header.

Bugfixes

- Fixed broken links in documentation (Issues 421 and 422, PR 437).
- Fixed data type conversion in `NDJWriter` (Issue 416, PR 440).
- Properly handle the possible combinations of trained model and prediction set vectorizers in `Learner.predict` (Issue 414, PR 445).

Other changes

- Make the tests for `MLPClassifier` and `MLPRegressor` go faster (by turning off grid search) to prevent Travis CI from timing out (issue 434, PR 435).
