This pre-release previews the changes coming in version 4 of hlink. Because it includes breaking changes and an overhaul of the model exploration task, we'd like to test it a bit before creating a full release. Documentation and code cleanup are still in progress, so the documentation for these changes and new features is incomplete for now. Here is a preview of the version 4 highlights (so far!):
* Completely overhauled the model exploration task, switching to a nested cross-validation algorithm.
* Added support for a third strategy for generating models to test in model exploration. Along with "explicit" (take exactly what's in `training.model_parameters`) and grid search, there is now randomized search, which takes a set number of samples from a distribution defined in `training.model_parameters`.
* Added the F-measure metric to the model exploration output, and simplified the output so that it always has the same columns.
* Removed the `training.output_suspicious_TD` configuration option because it was rarely used and caused code and performance issues. Without it, the model exploration code is more maintainable and runs more quickly.
* Disentangled two core modules (`classifier` and `pipeline`) from the configuration format by changing the arguments to a couple of functions. This should separate those concerns more cleanly and make any future changes to the configuration format easier.
* Changed `SparkConnection` to require a `checkpoint_dir` argument, which fixes a bug related to Spark configuration.
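
To give a feel for two of the ideas above while the documentation catches up, here is a rough, standalone Python sketch of randomized search sampling and the F-measure. It is not hlink's implementation and does not use hlink's configuration format: the function names `sample_parameters` and `f_measure`, and the convention that a list means "pick one" and a `(low, high)` tuple means "draw uniformly", are assumptions made only for illustration.

```python
import random


def f_measure(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall (F1); 0.0 when undefined."""
    denom = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denom if denom else 0.0


def sample_parameters(distributions, num_samples, seed=None):
    """Draw num_samples parameter sets from a hypothetical spec.

    Illustrative convention (not hlink's config format):
      - a list means "choose one element uniformly at random"
      - a (low, high) tuple means "draw a float uniformly from the range"
      - anything else is a fixed value copied into every sample
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        params = {}
        for name, spec in distributions.items():
            if isinstance(spec, list):
                params[name] = rng.choice(spec)
            elif isinstance(spec, tuple):
                low, high = spec
                params[name] = rng.uniform(low, high)
            else:
                params[name] = spec
        samples.append(params)
    return samples


# Hypothetical search space for a tree-based model.
space = {"type": "random_forest", "maxDepth": [3, 5, 7], "threshold": (0.5, 0.9)}
candidates = sample_parameters(space, num_samples=4, seed=0)
```

Unlike grid search, which tests every combination, this approach tests a fixed number of sampled candidates, so the cost stays bounded as the search space grows. Each candidate could then be scored with a metric such as `f_measure`.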