Hlink

Latest version: v3.5.5


3.4.0

What's Changed

New Features and Improvements
* Add a new `convert_ints_to_longs` configuration setting by riley-harper in https://github.com/ipums/hlink/pull/87. This configuration setting is especially helpful for reading from CSV files, which don't contain an explicit schema. Documentation for `convert_ints_to_longs` can be found at https://hlink.docs.ipums.org/config.html#data-sources.
* Drop the comment column in the hlink script's `desc` command by riley-harper in https://github.com/ipums/hlink/pull/88. This column was always full of `null`s and was just cluttering up the screen.
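To illustrate why `convert_ints_to_longs` helps with CSV input, here is a minimal sketch in plain Python. It is not hlink's implementation. It assumes that schema inference on two CSV files can produce `int` for a column in one file and `bigint` (long) in the other, and it models each schema as a dict of Spark-style type names. The function and column names are hypothetical.

```python
def convert_ints_to_longs(schema: dict[str, str]) -> dict[str, str]:
    """Return a copy of the schema with every "int" column widened to "bigint".

    Hypothetical helper: schemas are modeled as {column_name: type_name}.
    """
    return {
        column: ("bigint" if type_name == "int" else type_name)
        for column, type_name in schema.items()
    }


# Two inferred schemas for the same logical columns (made-up example data).
schema_a = {"histid": "string", "serial": "int", "age": "int"}
schema_b = {"histid": "string", "serial": "bigint", "age": "int"}

# Before conversion the "serial" column has mismatched types.
mismatched = [c for c in schema_a if schema_a[c] != schema_b[c]]

# After conversion both schemas agree on every column.
converted_a = convert_ints_to_longs(schema_a)
converted_b = convert_ints_to_longs(schema_b)
```

With the widening applied to both data sources, integer columns end up with a single consistent type regardless of how each file was inferred.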

Documentation Updates

* Add more information to Link Tasks docs page by riley-harper in https://github.com/ipums/hlink/pull/86. See the new and improved page at https://hlink.docs.ipums.org/link_tasks.html!

Developer-Facing Changes

* Pin the Docker image to Debian bullseye by riley-harper in https://github.com/ipums/hlink/pull/84
* Bump the version to 3.4.0 by riley-harper in https://github.com/ipums/hlink/pull/89


**Full Changelog**: https://github.com/ipums/hlink/compare/v3.3.1...v3.4.0

3.3.1

What's Changed

Bug Fixes
* Fix categorical variable bug by anpumn in https://github.com/ipums/hlink/pull/82. This fixes issue #81, which caused comparison features to be marked as categorical even when the user set `categorical = false` in the configuration file.

Documentation Updates
* Update column_mapping_transforms docs page by riley-harper in https://github.com/ipums/hlink/pull/77
* Update docs for present_both_years and neither_are_null by riley-harper in https://github.com/ipums/hlink/pull/79

Developer-Facing Changes
* Don't reload modules for the reload command by riley-harper in https://github.com/ipums/hlink/pull/78. This removes some old developer-facing functionality for hot-reloading hlink modules. Now the `reload` command in the hlink script just reloads the config file.

New Contributors
* anpumn made their first contribution in https://github.com/ipums/hlink/pull/82! 🎉

**Full Changelog**: https://github.com/ipums/hlink/compare/v3.3.0...v3.3.1

3.3.0

Overview

This release contains several new features, including a separate log file for each run, logging of user input, and looser production dependency requirements. It also contains an important bug fix for Jaro-Winkler scores on blank names and many other smaller enhancements.
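The Jaro-Winkler fix for blank names is easiest to see in code. hlink's own `jw` function is a Scala user-defined function, so the following is only a Python sketch of the standard Jaro-Winkler algorithm with a guard that scores blank input as 0.0 instead of 1.0. The function names and the `prefix_weight` parameter are illustrative assumptions, not hlink's API.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in the range [0.0, 1.0]."""
    # Blank input carries no evidence of a match, so score it 0.0.
    # This check comes before the equality shortcut so that "" vs "" is 0.0.
    if not s1 or not s2:
        return 0.0
    if s1 == s2:
        return 1.0

    # Characters match only if they are within this distance of each other.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Matched characters that appear in a different order count as
    # half a transposition each.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (
        matches / len(s1)
        + matches / len(s2)
        + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1: str, s2: str, prefix_weight: float = 0.1) -> float:
    """Jaro similarity boosted by a shared prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1.0 - score)
```

Without the blank-input guard, an equality shortcut would give two missing names a perfect score, which could make records with no name information look like strong matches.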

Changes

- Started writing to a unique log file for each hlink script run. The name of the log file is `"{config_name}-{session_id}.log"`, where `session_id` is a UUID uniquely generated for the particular run of the script.
- Started logging user input in the main loop. This helps give more context to errors and other logging information.
- Loosened production dependency requirements so that they are not pinned to particular patch versions which may quickly become out of date. Adjusted some development dependency requirements.
- Fixed a bug where the Scala `jw` user-defined function returned a similarity of 1.0 for two empty strings. It now returns 0.0.
- Added syntax highlighting to the TOML example config file in the README (thanks bollwyvl).
- Documented some previously undocumented comparison types: `not_zero_and_not_equals`, `present_and_matching_categorical`, `caution_comp_3_012`, `caution_comp_4_012`, `sql_condition`, `present_and_equal_categorical_in_universe`.
- Updated documentation for a few more comparison types: `caution_comp_3`, `caution_comp_4`, `not_zero_and_not_equals`.
- Updated the Introduction and Installation documentation pages to make them more reader friendly and helpful.
- Updated the tutorial in examples/tutorial and added some small datasets so that it can be run for real. It can now be run with the commands:

      $ cd examples/tutorial
      $ python tutorial.py

- Updated and added type hints for the following classes and modules: `Table`, `LinkRun`, `LinkTask`, `LinkStep`, `linking.util`, `configs.load_config`.
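The per-run log file naming described in the changes above can be sketched as follows. This is a minimal illustration, not hlink's code: it assumes `config_name` is the config file's name without its extension and that the session ID is a random (version 4) UUID, neither of which the release notes spell out. The `log_file_path` function and its parameters are hypothetical.

```python
import uuid
from pathlib import Path


def log_file_path(config_path: str, log_dir: str = ".") -> Path:
    """Build a unique log file path of the form "{config_name}-{session_id}.log"."""
    # Assumption: the config name is the file name without its extension.
    config_name = Path(config_path).stem
    # Assumption: a random UUID identifies this particular run.
    session_id = uuid.uuid4()
    return Path(log_dir) / f"{config_name}-{session_id}.log"


first = log_file_path("conf/my_links.toml", log_dir="logs")
second = log_file_path("conf/my_links.toml", log_dir="logs")
```

Because each call generates a fresh UUID, two runs with the same config file write to different log files instead of interleaving their output.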

Developer-Facing Changes

- Updated developer instructions for generating the Sphinx docs, adding some more context and tips.
- Renamed some private functions and methods to use a single leading underscore instead of two leading underscores. This should complete the transition from two leading underscores to one leading underscore.
- Allowed the Dockerfile to pull the most recent patch version of Python 3.10 for CI instead of pinning to a particular patch version.
- Moved from setup.py and setup.cfg to pyproject.toml for specifying package metadata. Added and tweaked package metadata for installation and PyPI.
- Started using the `build` package for creating an sdist and wheel. Added a step to the CI to run `python -m build` to generate the sdist and wheel.
- Moved the declaration of `pytest_plugins` to a top-level conftest.py file to allow for running tests with just the command `pytest`. Updated CI and the docs from `pytest hlink/tests` to just `pytest`.

3.2.7

Overview

This release of hlink contains some bug fixes and maintenance items, along with some tuning of hlink for large datasets. It modifies the `hlink.spark.session.SparkConnection` class to allow easier adjustment of the `spark.driver.memory` configuration setting, and it upgrades hlink from Spark 3.2 to 3.3.

Changes

- Upgraded from Spark 3.2 to 3.3.0. This required only a few internal changes to hlink.
- Fixed a bug where `feature_selections` was always required in the config file. Now it defaults to `[]` as intended.
- Fixed a bug where an error message in `conf_validations` wasn't formatted correctly.
- Added a check to `conf_validations` to confirm that both data sources contain the id column specified in the config file.
- Improved the project README.
- Capped the number of Spark partitions requested at 10,000 to prevent hlink from requesting too many partitions with very large datasets.
- Added driver memory options to `SparkConnection`.
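The partition cap can be pictured as a simple clamp on a size-based estimate. The sketch below is an assumption about the general shape of such logic, not hlink's implementation: only the 10,000 cap comes from the release notes, while the function name, the `rows_per_partition` target, and the `min_partitions` floor are hypothetical values chosen for illustration.

```python
import math

# Upper bound on requested partitions, as stated in the release notes.
MAX_PARTITIONS = 10_000


def choose_num_partitions(
    num_rows: int,
    rows_per_partition: int = 50_000,  # hypothetical target partition size
    min_partitions: int = 200,  # hypothetical floor for small datasets
) -> int:
    """Scale the partition count with dataset size, clamped to a fixed range."""
    wanted = math.ceil(num_rows / rows_per_partition)
    return min(max(wanted, min_partitions), MAX_PARTITIONS)


small = choose_num_partitions(1_000)
medium = choose_num_partitions(100_000_000)
huge = choose_num_partitions(10**12)
```

The cap matters only at the extreme end: without it, the estimate for a very large dataset would grow without bound and could request far more partitions than the cluster can schedule efficiently.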

Notes

- Added developer documentation on how to push hlink to PyPI.
- Cleaned up some old files and reorganized some test files that were in a confusing location.

3.2.6

Overview

With this release, hlink is now installable from [pypi.org](https://pypi.org) with `pip install hlink`. hlink went through several small intermediate updates to get packaging with PyPI set up correctly.

Changes

- Updated metadata to integrate with PyPI.
- Updated documentation to include instructions on installing from PyPI.

Notes

- Versions 3.2.2 through 3.2.5 are intermediate versions needed to get hlink working on PyPI. They don't have associated releases.

3.2.1

Overview

This is a small patch release with a bug fix and a couple of usability improvements to hlink.

Changes

- Fixed a bug where model exploration's step 3 would run into a `TypeError` due to trying to manually build up a file path.
- Improved logging during startup and for the `LinkTask.run_all_steps()` method.
- Added code to adjust the number of Spark partitions based on the size of input datasets for a few link steps. This should help these steps scale better with large datasets.

Notes

- Updated the pre-commit installation file to work with Python 3.10.
