Skrub

Latest version: v0.5.3

0.5.3
=============

Changes
-------

- The :class:`SimpleCleaner` has been renamed to :class:`Cleaner`. Use of the
name :class:`SimpleCleaner` is deprecated and will result in an error in some
future release of skrub. :pr:`1275` by :user:`Riccardo Cappuzzo<rcap107>`.

- A new parameter ``max_plot_columns`` has been added to the
:class:`TableReport` and :func:`patch_display` to skip column plots when the
number of columns exceeds the specified value. :pr:`1255` by :user:`Priscilla
Baah<priscilla-b>`.
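
A rename like the one above is commonly handled by keeping the old name as a
thin subclass that warns when it is used. The sketch below is a generic
illustration of that pattern, not skrub's actual implementation: both class
bodies are hypothetical stand-ins, and the choice of ``FutureWarning`` is an
assumption.

```python
import warnings


class Cleaner:
    """Hypothetical stand-in for the renamed class."""


class SimpleCleaner(Cleaner):
    """Deprecated alias: still usable, but warns when instantiated."""

    def __init__(self, *args, **kwargs):
        warnings.warn(
            "SimpleCleaner has been renamed to Cleaner; "
            "the old name will stop working in a future release.",
            FutureWarning,  # assumption: the real warning category may differ
            stacklevel=2,
        )
        super().__init__(*args, **kwargs)


# Record warnings so the deprecation message can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_style = SimpleCleaner()

print(type(old_style).__name__, len(caught))
```

Code written against the old name keeps working during the deprecation
period, and switching the import to ``Cleaner`` silences the warning.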

0.5.2
=============

New features
------------

- The :class:`TableReport` now switches its visual theme between light and dark according to the user preferences.
:pr:`1201` by :user:`rouk1 <rouk1>`.

- The location of the data directory can now be controlled with the environment variable ``SKRUB_DATA_DIRECTORY``.
:pr:`1215` by :user:`Thomas S. <thomass-dev>`.

- The :class:`DatetimeEncoder` now supports periodic encoding of datetime features
using trigonometric functions or B-spline transformers.
:pr:`1235` by :user:`Riccardo Cappuzzo<rcap107>`.

- The :class:`TableReport` now also computes Pearson's correlation for numeric values.
:pr:`1203` by :user:`Reshama Shaikh <reshamas>` and
:user:`Vincent Maladiere <Vincent-Maladiere>`.

- The :class:`SimpleCleaner` is now available (⚠️ it was renamed to
:class:`Cleaner` in skrub ``0.5.3``). This transformer is a lightweight
pre-processor that applies some of the transformations applied by the
:class:`TableVectorizer`, with a simpler interface. :pr:`1266` by
:user:`Riccardo Cappuzzo<rcap107>` and :user:`Jerome Dockes <jeromedockes>`.
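
To illustrate the idea behind the periodic encoding mentioned for the
:class:`DatetimeEncoder` above, here is a minimal sketch of the trigonometric
variant using only the standard library. It does not call skrub, and the
helper names are hypothetical; the B-spline variant is not shown.

```python
import math
from datetime import datetime


def encode_hour_circular(timestamp):
    """Map the hour of day onto the unit circle as a (sin, cos) pair."""
    angle = 2 * math.pi * timestamp.hour / 24
    return math.sin(angle), math.cos(angle)


def circular_distance(a, b):
    """Euclidean distance between two (sin, cos) pairs."""
    return math.dist(a, b)


late = encode_hour_circular(datetime(2024, 1, 1, 23))
midnight = encode_hour_circular(datetime(2024, 1, 2, 0))
noon = encode_hour_circular(datetime(2024, 1, 2, 12))

# 23:00 and 00:00 are one hour apart, so their encodings are close,
# whereas the raw hour values 23 and 0 look far apart.
print(circular_distance(late, midnight) < circular_distance(midnight, noon))
```

The point of the encoding is that the end of one period sits next to the
start of the following one, which a plain integer hour column cannot express.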

Changes
-------

- The estimator returned by :func:`tabular_learner` now uses spline encoding of
datetime features when the supervised learner is not a model based on decision
trees such as random forests or gradient boosting. :pr:`1264` by
:user:`Guillaume Lemaitre <glemaitre>`.

- The "distribution" tab of the ``TableReport`` now stacks cards horizontally to avoid adding
vertical space.
:pr:`1259` by :user:`Gaël Varoquaux <gaelvaroquaux>`

- Progress messages when generating a ``TableReport`` are now written to stderr instead of stdout.
:pr:`1236` by :user:`Priscilla Baah<priscilla-b>`

- The :class:`StringEncoder` has been optimized: lower memory footprint and faster execution in some cases.
:pr:`1248` by :user:`Gaël Varoquaux <gaelvaroquaux>`.

Bug fixes
---------
- :class:`StringEncoder` now works correctly in presence of null values.
:pr:`1224` by :user:`Jérôme Dockès <jeromedockes>`.

- The :meth:`TableVectorizer.get_feature_names_out` method now works when used in a
scikit-learn pipeline by exposing the `input_features` parameter.
:pr:`1258` by :user:`Guillaume Lemaitre <glemaitre>`.

0.5.1
=============

New features
------------
* The :class:`StringEncoder` encodes strings using tf-idf followed by truncated
SVD and provides a cheaper alternative to :class:`GapEncoder`.
:pr:`1159` by :user:`Riccardo Cappuzzo<rcap107>`.

Changes
-------
* New dataset fetching functions have been added: :func:`fetch_videogame_sales`,
:func:`fetch_bike_sharing`, :func:`fetch_flight_delays`, and
:func:`fetch_country_happiness`. :func:`fetch_road_safety` has been removed.
:pr:`1218` by :user:`Vincent Maladiere <Vincent-Maladiere>`.

Bug fixes
---------

Maintenance
-----------

0.4.1
==========================

Major changes
-------------
* :func:`fuzzy_join` and :class:`FeatureAugmenter` can now join on numerical columns based on the euclidean distance.
:pr:`530` by :user:`Jovan Stojanovic <jovan-stojanovic>`

* :func:`fuzzy_join` and :class:`FeatureAugmenter` can perform many-to-many joins on lists of numerical or string key columns.
:pr:`530` by :user:`Jovan Stojanovic <jovan-stojanovic>`

* :func:`GapEncoder.transform` no longer continues fitting the instance.
This makes methods that depend on it (:func:`~GapEncoder.get_feature_names_out`,
:func:`~GapEncoder.score`, etc.) deterministic once fitted.
:pr:`548` by :user:`Lilian Boulard <LilianBoulard>`

* :func:`fuzzy_join` and :class:`FeatureAugmenter` now perform joins on missing values as in `pandas.merge`
but raise a warning. :pr:`522` and :pr:`529` by :user:`Jovan Stojanovic <jovan-stojanovic>`

* Added :func:`get_ken_table_aliases` and :func:`get_ken_types` for exploring
KEN embeddings. :pr:`539` by :user:`Lilian Boulard <LilianBoulard>`.
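
The idea of joining on numerical columns by Euclidean distance, mentioned
above for :func:`fuzzy_join`, can be sketched with the standard library as a
nearest-neighbor lookup. This is a simplified illustration under stated
assumptions, not skrub's implementation: the function name, the ``right_``
prefix and the ``distance`` column are hypothetical.

```python
import math


def nearest_numeric_join(left, right, key):
    """Pair each left row with the right row whose numeric key is closest.

    ``key`` names a field that holds a tuple of numbers in both tables.
    """
    joined = []
    for left_row in left:
        best = min(right, key=lambda row: math.dist(left_row[key], row[key]))
        merged = dict(left_row)
        # Prefix the matched columns to avoid clobbering the left ones.
        merged.update({f"right_{name}": value for name, value in best.items()})
        merged["distance"] = math.dist(left_row[key], best[key])
        joined.append(merged)
    return joined


# Made-up example data: approximate coordinates that do not match exactly.
stations = [
    {"station": "A", "coords": (48.85, 2.35)},
    {"station": "B", "coords": (45.76, 4.84)},
]
cities = [
    {"name": "Paris", "coords": (48.86, 2.35)},
    {"name": "Lyon", "coords": (45.75, 4.85)},
]

joined = nearest_numeric_join(stations, cities, key="coords")
print([(row["station"], row["right_name"]) for row in joined])
```

A brute-force search like this is quadratic in the table sizes; it is only
meant to show what "closest key" means for numerical columns.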

Minor changes
-------------
* Improved date column detection and date format inference in :class:`TableVectorizer`. The
format inference now tries to find a format which works for all non-missing values of the column, and
falls back to pandas default inference only if that fails.
:pr:`543` and :pr:`587` by :user:`Leo Grinsztajn <LeoGrin>`
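
The inference strategy described above (prefer a single format that parses
every non-missing value) can be sketched as follows. The candidate list and
the function name are hypothetical, and the fallback to pandas is left to
the caller; this is an illustration, not the code used by
:class:`TableVectorizer`.

```python
from datetime import datetime

# Hypothetical candidate formats; a real implementation would try more.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d/%m/%Y")


def infer_date_format(values, candidates=CANDIDATE_FORMATS):
    """Return the first format that parses every non-missing value, else None."""
    present = [v for v in values if v is not None and v != ""]
    if not present:
        return None
    for fmt in candidates:
        try:
            for value in present:
                datetime.strptime(value, fmt)
        except ValueError:
            continue  # at least one value does not fit this format
        return fmt
    return None  # the caller would fall back to a default inference here


column = ["13/02/2023", None, "01/03/2023", "25/12/2022"]
print(infer_date_format(column))  # → %d/%m/%Y
```

Taken alone, ``"01/03/2023"`` is ambiguous between day-first and month-first
formats; checking the whole column resolves it because ``"13/02/2023"``
cannot be parsed month-first.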

0.4.0
=========================

Major changes
-------------
* `SuperVectorizer` has been renamed to :class:`TableVectorizer`; a warning is raised when the old name is used.
:pr:`484` by :user:`Jovan Stojanovic <jovan-stojanovic>`

* New experimental feature: joining tables using :func:`fuzzy_join` by approximate key matching. Matches are based
on string similarity, and the nearest-neighbor match is found for each category.
:pr:`291` by :user:`Jovan Stojanovic <jovan-stojanovic>` and :user:`Leo Grinsztajn <LeoGrin>`

* New experimental feature: :class:`FeatureAugmenter`, a transformer
that uses :func:`fuzzy_join` to add features to a main table from auxiliary tables.
:pr:`409` by :user:`Jovan Stojanovic <jovan-stojanovic>`

* Unnecessary API has been made private: everything (files, functions, classes)
starting with an underscore shouldn't be imported in your code. :pr:`331` by :user:`Lilian Boulard <LilianBoulard>`

* The :class:`MinHashEncoder` now supports an `n_jobs` parameter to parallelize
the computation of hashes. :pr:`267` by :user:`Leo Grinsztajn <LeoGrin>` and :user:`Lilian Boulard <LilianBoulard>`.

* New experimental feature: deduplicating misspelled categories using :func:`deduplicate` by clustering string distances.
This function works best when there are significantly more duplicates than underlying categories.
:pr:`339` by :user:`Moritz Boos <mjboos>`.
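
As a rough illustration of approximate key matching as described for
:func:`fuzzy_join` above, the sketch below pairs each key with its most
similar counterpart. It swaps in character trigram Jaccard similarity as the
similarity measure, which is an assumption made for this example: the
changelog does not say which measure skrub uses, and all function names here
are hypothetical.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def fuzzy_match(left_keys, right_keys, n=3):
    """Map each left key to the most similar right key."""
    right_grams = {key: char_ngrams(key, n) for key in right_keys}
    matches = {}
    for key in left_keys:
        grams = char_ngrams(key, n)
        matches[key] = max(
            right_keys, key=lambda candidate: jaccard(grams, right_grams[candidate])
        )
    return matches


left = ["Untied States", "Frnace", "Germany"]
right = ["United States", "France", "Germany", "Spain"]
matches = fuzzy_match(left, right)
print(matches)
```

Every left key receives a match here, even a poor one; a practical join
would usually also apply a similarity threshold before accepting a pair.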

Minor changes
-------------
* Add example `Wikipedia embeddings to enrich the data`. :pr:`487` by :user:`Jovan Stojanovic <jovan-stojanovic>`

* **datasets.fetching**: contains a new function :func:`get_ken_embeddings` that can be used to download Wikipedia
embeddings and filter them by type.

* **datasets.fetching**: contains a new function :func:`fetch_world_bank_indicator` that can be used to download indicators
from the World Bank Open Data platform.
:pr:`291` by :user:`Jovan Stojanovic <jovan-stojanovic>`

* Removed example `Fitting scalable, non-linear models on data with dirty categories`. :pr:`386` by :user:`Jovan Stojanovic <jovan-stojanovic>`

* :class:`MinHashEncoder`'s :func:`minhash` method is no longer public. :pr:`379` by :user:`Jovan Stojanovic <jovan-stojanovic>`

* Fetching functions now have an additional argument ``directory``,
which can be used to specify where datasets are saved to and loaded from.
:pr:`432` and :pr:`453` by :user:`Lilian Boulard <LilianBoulard>`

* The :class:`TableVectorizer`'s default `OneHotEncoder` for low cardinality categorical variables now defaults
to `handle_unknown="ignore"` instead of `handle_unknown="error"` (for sklearn >= 1.0.0).
This means that categories seen only at test time will be encoded by a vector of zeroes instead of raising an error. :pr:`473` by :user:`Leo Grinsztajn <LeoGrin>`
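
The effect of ignoring unknown categories, as described above for the default
`OneHotEncoder`, can be shown with a small pure-Python sketch. It mimics the
behavior only and does not use scikit-learn; the helper names are
hypothetical.

```python
def fit_categories(values):
    """Record the categories seen during fitting, in sorted order."""
    return sorted(set(values))


def one_hot(value, categories):
    """Encode ``value``; a category not seen at fit time gives all zeros."""
    return [1 if value == category else 0 for category in categories]


categories = fit_categories(["red", "green", "blue", "green"])

print(one_hot("green", categories))   # → [0, 1, 0]
print(one_hot("purple", categories))  # → [0, 0, 0]
```

With the previous behavior, the second call would correspond to an error at
prediction time rather than a row of zeros.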

Bug fixes
---------

* The :class:`MinHashEncoder` now considers `None` and empty strings as missing values, rather
than raising an error. :pr:`378` by :user:`Gael Varoquaux <GaelVaroquaux>`

0.3.1
=============

Minor changes
-------------

* For tree-based models, :func:`tabular_learner` now adds
`handle_unknown='use_encoded_value'` to the `OrdinalEncoder`, to avoid
errors with new categories in the test set. This is consistent with the
setting of `OneHotEncoder` used by default in the
:class:`TableVectorizer`. :pr:`1078` by :user:`Gaël Varoquaux <gaelvaroquaux>`

* The reports created by :class:`TableReport`, when inserted in an html page (or
displayed in a notebook), now use the same font as the surrounding page.
:pr:`1038` by :user:`Jérôme Dockès <jeromedockes>`.

* The content of the dataframe corresponding to the currently selected table
cell in the TableReport can be copied without actually selecting the text (as
in a spreadsheet).
:pr:`1048` by :user:`Jérôme Dockès <jeromedockes>`.

* The selection of content displayed in the TableReport's copy-paste boxes has
been removed. Now they always display the value of the selected item. When
copied, the repr of the selected item is copied to the clipboard.
:pr:`1058` by :user:`Jérôme Dockès <jeromedockes>`.

* A "stats" panel has been added to the TableReport, showing summary statistics
for all columns (number of missing values, mean, etc. -- similar to
``pandas.info()`` ) in a table. It can be sorted by each column.
:pr:`1056` and :pr:`1068` by :user:`Jérôme Dockès <jeromedockes>`.

* The credit fraud dataset is now available with the
:func:`fetch_credit_fraud` function.
:pr:`1053` by :user:`Vincent Maladiere <Vincent-Maladiere>`.

* Added zero padding for column names in :class:`MinHashEncoder` to improve column ordering consistency.
:pr:`1069` by :user:`Shreekant Nandiyawar <Shree7676>`.

* The selection in the TableReport's sample table can now be manipulated with
the keyboard. :pr:`1065` by :user:`Jérôme Dockès <jeromedockes>`.

* The ``TableReport`` now displays the pandas (multi-)index, and has a better
display & interaction of pandas columns when the columns are a MultiIndex.
:pr:`1083` by :user:`Jérôme Dockès <jeromedockes>`.

* It is possible to control the number of rows displayed by the TableReport in
the "sample" tab panel by specifying ``n_rows``.
:pr:`1083` by :user:`Jérôme Dockès <jeromedockes>`.

* The `TableReport` used to raise an exception when the dataframe contained
unhashable types such as Python lists. This has been fixed in :pr:`1087` by
:user:`Jérôme Dockès <jeromedockes>`.

* Column names are now displayed in the HTML representation of the fitted TableVectorizer.
This has been fixed in :pr:`1093` by :user:`Shreekant Nandiyawar <Shree7676>`.

* AggTarget now works when y is a Series instead of raising an error.
This has been fixed in :pr:`1094` by :user:`Shreekant Nandiyawar <Shree7676>`.
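
To see why zero padding helps column ordering, as noted above for
:class:`MinHashEncoder`, consider how generated names sort as strings. The
naming scheme below (``<column>_<index>``) is a hypothetical example, not
necessarily the one skrub uses.

```python
def component_names(column, n_components, zero_pad=True):
    """Build one output column name per component index."""
    # Pad to the width of the largest index so that names sort numerically.
    width = len(str(n_components - 1)) if zero_pad else 1
    return [f"{column}_{str(i).zfill(width)}" for i in range(n_components)]


plain = component_names("city", 12, zero_pad=False)
padded = component_names("city", 12)

# Without padding, lexicographic order puts "city_10" before "city_2".
print(sorted(plain)[:4])
# With padding, lexicographic order matches the numeric order.
print(sorted(padded) == padded)
```

Any tool that sorts column names alphabetically will therefore keep the
padded components in their original order.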
