Hlink

Latest version: v3.5.4


Page 1 of 3

3.5.4

What's Changed
* Document column_mappings transform concat_two_cols by riley-harper in https://github.com/ipums/hlink/pull/126. These new docs are here: https://hlink.docs.ipums.org/column_mappings.html#concat-two-cols.
* Document column mapping overrides by riley-harper in https://github.com/ipums/hlink/pull/129. Overrides let you read two columns with different names from the two input files into a single hlink column. See the documentation starting at https://hlink.docs.ipums.org/column_mappings.html#advanced-usage.
* Fix a bug with the override_column_X attributes in conf_validations.py by riley-harper in https://github.com/ipums/hlink/pull/131. Previously config validation was raising spurious errors because it didn't take override_column_a and override_column_b into account.
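
The override fix above can be pictured with a small sketch. The attribute names `column_name`, `override_column_a`, and `override_column_b` come from the notes; the dictionary layout and the helpers `source_columns` and `validate_mapping` are hypothetical and are not hlink's actual validation code.

```python
def source_columns(mapping):
    """Return the input column names to look for in datasets A and B.

    Assumption: an override, when present, replaces column_name for that
    dataset only.
    """
    default = mapping["column_name"]
    col_a = mapping.get("override_column_a", default)
    col_b = mapping.get("override_column_b", default)
    return col_a, col_b


def validate_mapping(mapping, columns_a, columns_b):
    """Return a list of error messages; an empty list means the mapping is valid."""
    col_a, col_b = source_columns(mapping)
    errors = []
    if col_a not in columns_a:
        errors.append(f"column '{col_a}' not found in dataset A")
    if col_b not in columns_b:
        errors.append(f"column '{col_b}' not found in dataset B")
    return errors


# Dataset B stores the same information under a different column name.
mapping = {"column_name": "age", "override_column_b": "age_1910"}
errors = validate_mapping(mapping, {"age", "sex"}, {"age_1910", "sex"})
```

A check that ignored the overrides would look for `age` in dataset B and report an error that isn't real, which is the kind of spurious failure described above.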


**Full Changelog**: https://github.com/ipums/hlink/compare/v3.5.3...v3.5.4

3.5.3

Highlights

In this release we start supporting Python 3.12 and remove the ceiling on most of our dependency versions to support this. We also fix a bug with one-hot encoding and add an additional check to config validation that looks for duplicate output columns for some config sections.
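
As a rough illustration of the duplicate check, here is a minimal sketch. It assumes each column mapping is a dictionary whose output column is its `alias` when one is given and its `column_name` otherwise; the helpers `output_column` and `find_duplicates` are hypothetical names, not hlink functions.

```python
from collections import Counter


def output_column(mapping):
    """Assumed rule: the alias names the output column, else column_name does."""
    return mapping.get("alias", mapping["column_name"])


def find_duplicates(names):
    """Return the names that appear more than once, sorted for stable output."""
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


column_mappings = [
    {"column_name": "namefrst"},
    {"column_name": "namefrst_raw", "alias": "namefrst"},
    {"column_name": "age"},
]
duplicates = find_duplicates([output_column(m) for m in column_mappings])
```

A validator built along these lines could raise an error whenever `duplicates` is non-empty.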

What's Changed
* Refactor to use colorama in a simpler way by jrbalch543 in https://github.com/ipums/hlink/pull/115. User-facing functionality should be unchanged.
* Add checks for duplicated comparison features, feature selection, and column mappings by jrbalch543 in https://github.com/ipums/hlink/pull/113. This will cause a validation error when there are duplicated aliases or output columns for these sections.
* Clean up a couple of core modules by jrbalch543 in https://github.com/ipums/hlink/pull/117. These changes are internal refactoring and don't affect functionality.
* Upgrade dependencies by pinning them more loosely and support Python 3.12 by riley-harper in https://github.com/ipums/hlink/pull/119. This removes the upper limit on almost all of our dependencies so that users can more freely pick versions for themselves. Our tests run on the most recent available versions for each particular version of Python. Unpinning these dependencies allows us to easily support Python 3.12, which we now support and run in CI/CD.
* Update the docs to include Python 3.12 by riley-harper in https://github.com/ipums/hlink/pull/120
* Revert to handleInvalid = "keep" for OneHotEncoder by riley-harper in https://github.com/ipums/hlink/pull/121. This fixes a bug that we introduced in the last release. Although it's not common, our training data sometimes doesn't cover every category present in the matching data. We would rather silently continue and ignore these cases by giving them a coefficient of 0 than error out on them.
* Put the config file name in the script prompt by riley-harper in https://github.com/ipums/hlink/pull/123. This is a small quality of life feature that makes it easier to remember which config file you're running during long hlink runs.
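
To illustrate the `handleInvalid` change, here is a plain-Python sketch of the two behaviors. It is not Spark's `OneHotEncoder`; the function `one_hot` and its `handle_invalid` argument are hypothetical stand-ins. It assumes that "keeping" an unseen category means encoding it as an all-zero vector, so that no learned coefficient contributes for it.

```python
def one_hot(value, categories, handle_invalid="keep"):
    """Encode value against the categories seen during training."""
    if value in categories:
        return [1.0 if value == category else 0.0 for category in categories]
    if handle_invalid == "keep":
        # Unseen category: every position is 0, so no coefficient applies.
        return [0.0] * len(categories)
    raise ValueError(f"unseen category: {value!r}")


training_categories = ["farmer", "laborer", "teacher"]
seen = one_hot("laborer", training_categories)
unseen = one_hot("blacksmith", training_categories)
```

With `handle_invalid="error"` the second call would raise instead, which is the stricter behavior this release backs away from.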


**Full Changelog**: https://github.com/ipums/hlink/compare/v3.5.2...v3.5.3

3.5.2

What's Changed
* Fix a zipping issue in Training step 3 by jrbalch543 in https://github.com/ipums/hlink/pull/104
* Fix a bug in Training step 3 for categorical features by jrbalch543 and riley-harper in https://github.com/ipums/hlink/pull/107. Each categorical feature was getting a single coefficient when each *category* should get its own coefficient instead.
* Error out on invalid categories in training data instead of creating a new category for them by riley-harper in https://github.com/ipums/hlink/pull/109. This bug fix reduces the number of categories created by hlink by 1. The last category represented missing or invalid data, but these categories were pretty much always unused because hlink creates exhaustive categories whenever possible. Users can still manually mark missing data by creating their own category for it, but hlink will not do this by default anymore. This should help prevent silent errors and confusion with missing data.
* Fix a bug where categorical features created by interaction caused Training step 3 to crash by riley-harper in https://github.com/ipums/hlink/pull/111
* Tweak the format of Training step 3's output by riley-harper in https://github.com/ipums/hlink/pull/112. There are now 3 columns: feature_name, category, and coefficient_or_importance. Feature names aren't suffixed with the category value anymore.
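
The new output shape can be sketched as follows. The three column names come from the notes above. Everything else is an assumption made for illustration: the helper `coefficient_rows`, the input layout, and the use of `None` as the category for non-categorical features.

```python
def coefficient_rows(features, coefficients):
    """Build one output row per coefficient.

    features: list of (feature_name, categories) pairs, where categories is a
    list of category labels for categorical features and None otherwise.
    coefficients: flat list of values in the same order.
    """
    expected = sum(len(cats) if cats is not None else 1 for _, cats in features)
    if expected != len(coefficients):
        raise ValueError(f"expected {expected} coefficients, got {len(coefficients)}")

    rows = []
    position = 0
    for name, categories in features:
        labels = categories if categories is not None else [None]
        for label in labels:
            rows.append(
                {
                    "feature_name": name,
                    "category": label,
                    "coefficient_or_importance": coefficients[position],
                }
            )
            position += 1
    return rows


features = [("age_diff", None), ("region", ["north", "south"])]
rows = coefficient_rows(features, [0.5, -1.2, 0.3])
```

Here `region` contributes two rows, one per category, rather than a single row for the whole feature, and the feature name carries no category suffix.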


**Full Changelog**: https://github.com/ipums/hlink/compare/v3.5.1...v3.5.2

3.5.1

What's Changed
* Implement a new Training step that replaces Model Exploration step 3 by jrbalch543 and riley-harper in https://github.com/ipums/hlink/pull/101. This new step replaces the broken "get feature importances" step in Model Exploration, which is now removed. Training step 3 saves model feature importances or coefficients when `training.feature_importances` is set to true in the config file.

New Contributors
* jrbalch543 made their first contribution in https://github.com/ipums/hlink/pull/102! :tada:

**Full Changelog**: https://github.com/ipums/hlink/compare/v3.5.0...v3.5.1

3.5.0

What's Changed
* Make the CI Dockerfile more flexible and maintainable by riley-harper in https://github.com/ipums/hlink/pull/92. This allowed us to support Python 3.11 and also cleared up some questions about which versions of Java are supported by hlink and pyspark.
* Support Python 3.11 by riley-harper in https://github.com/ipums/hlink/pull/94. This required upgrading Spark from 3.3 to 3.5. We are now also less strict about the versions of numpy and pandas used.
* Fix 2 small command-line bugs by riley-harper in https://github.com/ipums/hlink/pull/96. One was a typo in some documentation, and the other was a bug where the autocomplete cache was not reloaded consistently. It is now reloaded after each command.
* Deprecate the `interaction_transformer` module by riley-harper in https://github.com/ipums/hlink/pull/97. This is a backport from when we were on Spark 2. Users of hlink should use Spark's `pyspark.ml.feature.Interaction` class instead. The `interaction_transformer` module will be removed in the future.
* Add a new `multi_jaro_winkler_search` comparison feature by riley-harper in https://github.com/ipums/hlink/pull/99. This is a complex comparison feature that supports conditional Jaro-Winkler comparisons between lists of columns with similar names. You can read more in the documentation at https://hlink.docs.ipums.org/comparison_types.html#multi-jaro-winkler-search.
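
For readers unfamiliar with the underlying metric, here is a minimal pure-Python sketch of Jaro-Winkler similarity, followed by one hypothetical reading of a "conditional search over lists of columns". The function `multi_search`, its `condition` callback, and the any-pair-above-threshold rule are assumptions for illustration only; the linked documentation describes what hlink's comparison feature actually does.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, from 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, scaling=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)


def multi_search(values_a, values_b, threshold, condition=lambda i, j: True):
    """Hypothetical: True if any pair passes condition and meets the threshold."""
    return any(
        condition(i, j) and jaro_winkler(a, b) >= threshold
        for i, a in enumerate(values_a)
        for j, b in enumerate(values_b)
    )


found = multi_search(["jon", "martha"], ["marhta"], threshold=0.9)
```

The `condition` callback stands in for whatever extra requirement a pair must meet before its names are compared, such as another pair of columns being equal.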


**Full Changelog**: https://github.com/ipums/hlink/compare/v3.4.0...v3.5.0

3.4.0

What's Changed

New Features and Improvements
* Add a new `convert_ints_to_longs` configuration setting by riley-harper in https://github.com/ipums/hlink/pull/87. This configuration setting is especially helpful for reading from CSV files, which don't contain an explicit schema. Documentation for `convert_ints_to_longs` can be found at https://hlink.docs.ipums.org/config.html#data-sources.
* Drop the comment column in the hlink script's `desc` command by riley-harper in https://github.com/ipums/hlink/pull/88. This column was always full of `null`s and was just cluttering up the screen.

Documentation Updates

* Add more information to Link Tasks docs page by riley-harper in https://github.com/ipums/hlink/pull/86. See the new and improved page at https://hlink.docs.ipums.org/link_tasks.html!

Developer-Facing Changes

* Pin the Docker image to Debian bullseye by riley-harper in https://github.com/ipums/hlink/pull/84
* Bump the version to 3.4.0 by riley-harper in https://github.com/ipums/hlink/pull/89


**Full Changelog**: https://github.com/ipums/hlink/compare/v3.3.1...v3.4.0
