Almost a year has passed since the initial release of the core package. Over this time we have seen a great deal of interest: 18K downloads and 400 stars. We have therefore decided to make this release a bit special by introducing as many interesting new patterns into Desbordante as possible. While there are several of them, the obvious star of this release is matching dependencies, a pattern that will greatly help in data deduplication, data cleaning, and many other data quality tasks.
Overall, this release contains pattern implementations accumulated over more than half a year. We hope you will find it useful!
Changes:
* Added discovery of matching dependencies—a very expressive type of dependency, capable of capturing subtle inconsistencies in data by using various matching functions.
* Added discovery of many types of approximate functional dependencies. Previously, the error of an approximate FD was always calculated using the $g_1$ metric. Our new definition permits the use of any error metric, since alternative metrics are currently gaining popularity. We are therefore expanding the number of supported metrics in Desbordante, and in this release we added discovery for the $\mu^+$, $\tau$, $pdep$, and $\rho$ metrics.
* Added discovery of soft functional dependencies and correlations.
* Added validation of variable heterogeneous denial constraints.
* Added discovery and validation of approximate inclusion dependencies (using the $g'_3$ metric). Inclusion dependencies can help users to recover foreign keys, or to find joinable columns in a table or a collection of tables. Supporting an approximate version of this pattern will allow users to perform these tasks even when dealing with dirty data.
* Added validation of probabilistic functional dependencies.
* Examples were reorganized into three categories: 1) basic, which showcase a single pattern, 2) advanced, which illustrate various pattern nuances, and 3) expert, which demonstrate instances of complex programs that aim to provide tangible benefits for end-users by solving real-life problems using pattern discovery or validation.
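
To give a feel for the error metrics mentioned in the list above, here is a minimal pure-Python sketch of two of them: $g_1$ for an approximate FD and $g'_3$ for an approximate inclusion dependency. It does not use Desbordante and is not its implementation. The function names `g1_error` and `g3_prime_error` are hypothetical, and the normalizations are assumptions: $g_1$ is taken as the share of ordered row pairs that agree on the left-hand side but differ on the right-hand side, divided by $|r|^2$, and $g'_3$ as the share of distinct left-hand-side values that have no match on the right-hand side. The library may normalize differently.

```python
from collections import Counter


def g1_error(rows, lhs, rhs):
    """g1 error of the FD lhs -> rhs over a list of dict rows.

    Assumed definition: ordered pairs (u, v) with u[lhs] == v[lhs]
    and u[rhs] != v[rhs], divided by n^2.
    """
    n = len(rows)
    if n == 0:
        return 0.0
    # Group sizes by lhs values, and by lhs + rhs values.
    lhs_groups = Counter(tuple(r[c] for c in lhs) for r in rows)
    full_groups = Counter(tuple(r[c] for c in lhs + rhs) for r in rows)
    # Ordered pairs agreeing on lhs, minus those that also agree on rhs.
    agree_lhs = sum(k * k for k in lhs_groups.values())
    agree_both = sum(k * k for k in full_groups.values())
    return (agree_lhs - agree_both) / (n * n)


def g3_prime_error(lhs_values, rhs_values):
    """g'_3 error of the unary IND lhs ⊆ rhs.

    Assumed definition: fraction of distinct lhs values
    that do not occur among the rhs values.
    """
    lhs_distinct = set(lhs_values)
    if not lhs_distinct:
        return 0.0
    missing = lhs_distinct - set(rhs_values)
    return len(missing) / len(lhs_distinct)


if __name__ == "__main__":
    # Toy data, made up for illustration only.
    rows = [
        {"city": "Paris", "zip": "75001"},
        {"city": "Paris", "zip": "75002"},
        {"city": "Rome", "zip": "00100"},
    ]
    print(g1_error(rows, ["city"], ["zip"]))
    print(g3_prime_error([1, 2, 2, 3, 4], [1, 2, 5]))
```

An exact dependency has error 0 under both measures; an approximate one is accepted when its error stays below a user-chosen threshold, which is what makes these patterns usable on dirty data.
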
Miscellaneous:
* Added the HPIValid algorithm for discovery of UCCs. To the best of our knowledge, it is currently the most performant algorithm for this task, so we made it the default.
* Added examples for UCC and AUCC mining.
* Fixed the AR example and added the support output to the Python bindings.
* Fixed lifetime issues with FD and UCC objects.
All new patterns come with usage examples. Please note that the console version of Desbordante will be updated a bit later.