=====
0.9.0
=====
This release contains a major overhaul of Anonlink’s API and introduces support for multi-party linkage.
The changes are all additive, so the previous API continues to work; however, it is now deprecated and will be removed in a future release. The deprecation timeline is:

- v0.9.0: old API deprecated
- v0.10.0: use of the old API raises a warning
- v0.11.0: old API removed


Major changes
-------------

- Introduce abstract similarity functions. The Sørensen–Dice coefficient is now just one possible similarity function.
- Implement Hamming similarity as a similarity function.
- Permit linkage of records other than CLKs (BYO similarity function).
- Similarity functions now return multiple contiguous arrays instead of a list of tuples.
- Candidate pairs from similarity functions are now always sorted.
- Introduce a standard type for storing candidate pairs. This is now used consistently throughout the API.
- Provide a function for multi-party candidate generation. It takes multiple datasets and compares them against each other using a similarity function.
- Extend the greedy solver to multi-party problems.
- The greedy solver also accepts the new candidate pairs type.
- Implement serialisation and deserialisation of candidate pairs.
- Multiple files with serialised candidate pairs can be merged without loading everything into memory at once.
- Introduce type annotations in the new API.
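
To make the new representation concrete: below is a stdlib-only sketch (not Anonlink's implementation; all names are hypothetical) in which a similarity function is any callable scoring a pair of bit vectors, and candidate pairs come back as sorted, contiguous arrays rather than a list of tuples.

```python
from array import array


def dice(a: int, b: int, nbits: int) -> float:
    """Sorensen-Dice coefficient of two bit vectors stored as ints.

    nbits is unused here but kept so every similarity function
    shares one signature."""
    pa, pb = bin(a).count("1"), bin(b).count("1")
    return 2 * bin(a & b).count("1") / (pa + pb) if pa + pb else 0.0


def hamming_sim(a: int, b: int, nbits: int) -> float:
    """Hamming similarity: fraction of bit positions where a and b agree."""
    return 1 - bin(a ^ b).count("1") / nbits


def find_candidates(dset0, dset1, similarity, threshold, nbits):
    """All cross-dataset pairs scoring at least `threshold`, returned as
    three parallel arrays (similarities, record indices into dset0,
    record indices into dset1), sorted by decreasing similarity."""
    trios = []
    for i, a in enumerate(dset0):
        for j, b in enumerate(dset1):
            sim = similarity(a, b, nbits)
            if sim >= threshold:
                trios.append((sim, i, j))
    trios.sort(key=lambda t: (-t[0], t[1], t[2]))
    sims = array("d", (t[0] for t in trios))
    rec0 = array("I", (t[1] for t in trios))
    rec1 = array("I", (t[2] for t in trios))
    return sims, rec0, rec1
```

Because any callable with this signature works, records other than CLKs can be linked by supplying your own similarity function.
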

Minor changes
-------------

- Automatically test on Python 3.7.
- Remove support for Python 3.5 and below.
- Update Clkhash dependency to 0.11.
- Minor documentation and style improvements in ``anonlink.concurrency``.
- Provide a convenience function for generating valid candidate pairs from a chunk.
- Change the format of a chunk and move the type definition to ``anonlink.typechecking``.
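
The chunk type itself is defined in ``anonlink.typechecking``; as a rough, hypothetical illustration of the idea (the dict shape and function below are not Anonlink's actual format), a chunk names the pair of datasets being compared and the slice of the comparison space it covers, so that candidate generation can be split across workers:

```python
def chunks(n0, n1, max_pairs):
    """Split the n0 x n1 comparison space between two datasets into
    row bands of at most max_pairs comparisons each.

    Hypothetical chunk shape: which two datasets to compare, and the
    half-open record range covered in each."""
    rows_per_chunk = max(1, max_pairs // n1)
    for start in range(0, n0, rows_per_chunk):
        yield {"datasetIndices": (0, 1),
               "ranges": ((start, min(start + rows_per_chunk, n0)),
                          (0, n1))}
```

Each chunk can then be handed to a worker, which generates candidate pairs for just its slice and shifts record indices back into whole-dataset coordinates.
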

New modules
-----------

- ``anonlink.blocking``: Implementation of functions that assign blocks to every record. These are generally used to optimise matching.
- ``anonlink.candidate_generation``: Finding candidate pairs from multiple datasets using a similarity function.
- ``anonlink.serialization``: Tools for serialisation and deserialisation of candidate pairs. Also supports efficient merging of multiple files of serialised candidate pairs.
- ``anonlink.similarities``: Exposes different similarity functions that can be used to compare records. Currently implemented are ``hamming_similarity`` and ``dice_coefficient``.
- ``anonlink.solving``: Exposes solvers that can be used to turn candidate pairs into a concrete matching. Currently, only the ``greedy_solve`` function is exposed.
- ``anonlink.typechecking``: Types for Mypy and other typecheckers.
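
The stream-merge idea behind ``anonlink.serialization`` can be sketched with the standard library (this is not Anonlink's wire format; the fixed-width record below is hypothetical). Because each input file is already sorted by decreasing similarity, ``heapq.merge`` only ever holds one record per file in memory:

```python
import heapq
import struct

# Hypothetical fixed-width record: similarity, record index 0, record index 1.
REC = struct.Struct("<dII")


def dump(pairs, f):
    """Write (similarity, rec0, rec1) trios to a binary stream."""
    for trio in pairs:
        f.write(REC.pack(*trio))


def iter_pairs(f):
    """Lazily read trios back, one record at a time."""
    while chunk := f.read(REC.size):
        yield REC.unpack(chunk)


def merge(files, out):
    """Merge streams of candidate pairs, each sorted by decreasing
    similarity, without loading any stream fully into memory."""
    streams = (iter_pairs(f) for f in files)
    for trio in heapq.merge(*streams, key=lambda t: -t[0]):
        out.write(REC.pack(*trio))
```
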

Deprecated modules
------------------

- ``anonlink.bloommatcher`` is replaced by ``anonlink.similarities``. The Tanimoto coefficient functions currently have no replacement.
- ``anonlink.distributed_processing`` is deprecated with no replacement.
- ``anonlink.network_flow`` is deprecated with no replacement.
- ``anonlink.util`` is deprecated with no replacement.

New usage examples
------------------


Before
~~~~~~


.. code-block:: python

    >>> dataset0[0]
    (bitarray('0111101001001100101001001010101000100100010010011011010110110000'),
     0,
     28)
    >>> dataset1[0]
    (bitarray('1100101101001110100001110000110000110101110010101001010001110100'),
     3,
     30)
    >>> candidate_pairs = anonlink.entitymatch.calculate_filter_similarity(
    ...     dataset0, dataset1, k=len(dataset1), threshold=0.7)
    >>> candidate_pairs[0:3]
    [(1, 0.75, 6), (1, 0.75, 96), (1, 0.7457627118644068, 13)]
    >>> mapping = anonlink.entitymatch.greedy_solver(candidate_pairs)
    >>> mapping
    {1: 6,
     2: 44,
     3: 86,
     4: 4,
     5: 61,
     6: 10,
     ...


After
~~~~~

- The function generating candidate pairs needs only the Bloom filters; it no longer needs the record indices or the popcounts.
- The same function returns a tuple of arrays instead of a list of tuples.
- The solvers return groups of 2-tuples (dataset index, record index) instead of a mapping.

.. code-block:: python

    >>> dataset0[0]
    bitarray('0111101001001100101001001010101000100100010010011011010110110000')
    >>> dataset1[0]
    bitarray('0101001110110000101110101101110000110001010000000011010010100011')
    >>> datasets = [dataset0, dataset1]
    >>> candidate_pairs = anonlink.candidate_generation.find_candidate_pairs(
    ...     datasets,
    ...     anonlink.similarities.dice_coefficient,
    ...     0.7)
    >>> candidate_pairs[0][:3]
    array('d', [1.0, 0.9850746268656716, 0.9841269841269841])
    >>> candidate_pairs[1][0][:3]
    array('I', [0, 0, 0])
    >>> candidate_pairs[1][1][:3]
    array('I', [1, 1, 1])
    >>> candidate_pairs[2][0][:3]
    array('I', [85, 66, 83])
    >>> candidate_pairs[2][1][:3]
    array('I', [82, 62, 79])
    >>> groups = anonlink.solving.greedy_solve(candidate_pairs)
    >>> groups
    ([(0, 85), (1, 82)],
     [(0, 66), (1, 62)],
     [(0, 83), (1, 79)],
     [(0, 49), (1, 44)],
     [(0, 20), (1, 22)],
     ...
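
For two-party problems, the old-style mapping can be recovered from the new groups output. The helper below is a hypothetical convenience, not part of the API:

```python
def groups_to_mapping(groups):
    """Recover the old two-party {record0: record1} mapping from groups
    of (dataset index, record index) tuples. Assumes each group pairs
    at most one record from dataset 0 with one from dataset 1."""
    mapping = {}
    for group in groups:
        records = dict(group)  # dataset index -> record index
        if set(records) == {0, 1}:
            mapping[records[0]] = records[1]
    return mapping
```
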