* Some enumeration classes were moved and renamed:
  * `d3m.metadata.pipeline.ArgumentType` to `d3m.metadata.base.ArgumentType`
  * `d3m.metadata.pipeline.PipelineContext` to `d3m.metadata.base.Context`
  * `d3m.metadata.pipeline.PipelineStep` to `d3m.metadata.base.PipelineStepType`

  **Backwards incompatible.**
* Added a `pipeline_run.json` JSON schema which describes the results of running
a pipeline described by the `pipeline.json` JSON schema. Also implemented
reference pipeline run output for the reference runtime.
[165](https://gitlab.com/datadrivendiscovery/d3m/issues/165)
[!59](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/59)
* When computing primitive digests, the primitive's ID is now included in the
hash so that the digest is not the same for all primitives from the same
package.
[154](https://gitlab.com/datadrivendiscovery/d3m/issues/154)
* When datasets are loaded, a digest of their metadata and data can be
computed. To control when this is done, the `compute_digest` argument
to `Dataset.load` can now take the following `ComputeDigest`
enumeration values: `ALWAYS`, `ONLY_IF_MISSING` (default), and `NEVER`.
[!75](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/75)
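As a rough illustration of the three modes, here is a stdlib-only sketch; `resolve_digest` is a hypothetical helper, not the actual `Dataset.load` implementation:

```python
import hashlib
from enum import Enum


class ComputeDigest(Enum):
    ALWAYS = "ALWAYS"
    ONLY_IF_MISSING = "ONLY_IF_MISSING"
    NEVER = "NEVER"


def resolve_digest(recorded_digest, data, mode=ComputeDigest.ONLY_IF_MISSING):
    """Decide whether to (re)compute a digest for loaded dataset bytes."""
    if mode is ComputeDigest.NEVER:
        return recorded_digest
    if mode is ComputeDigest.ONLY_IF_MISSING and recorded_digest is not None:
        return recorded_digest
    # ALWAYS, or ONLY_IF_MISSING with no recorded digest: compute it fresh.
    return hashlib.sha256(data).hexdigest()
```

`ONLY_IF_MISSING` keeps an existing digest when one is already recorded, which avoids rehashing large datasets on every load.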
* Added a `digest` field to pipeline descriptions. The digest is computed
from the pipeline document and helps differentiate between pipelines
with the same `id`. When loading a pipeline, a warning is issued if there
is a digest mismatch. You can use the `strict_digest` argument to request
an exception instead.
[190](https://gitlab.com/datadrivendiscovery/d3m/issues/190)
[!75](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/75)
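A minimal sketch of how such a document digest and the `strict_digest` behaviour can work, assuming SHA-256 over a canonical JSON serialization with the `digest` field itself excluded (the actual d3m algorithm may differ in details):

```python
import hashlib
import json
import warnings


def document_digest(document: dict) -> str:
    """Digest over a canonical JSON serialization, ignoring any embedded 'digest'."""
    stripped = {key: value for key, value in document.items() if key != "digest"}
    canonical = json.dumps(stripped, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf8")).hexdigest()


def check_digest(document: dict, strict_digest: bool = False) -> None:
    """Warn on digest mismatch, or raise when strict checking is requested."""
    recorded = document.get("digest")
    computed = document_digest(document)
    if recorded is not None and recorded != computed:
        message = f"Digest mismatch: recorded {recorded}, computed {computed}"
        if strict_digest:
            raise ValueError(message)
        warnings.warn(message)
```

Excluding the `digest` field from the hash is what makes it possible to store the digest inside the same document it describes.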
* Added a `digest` field to problem description metadata.
The digest is computed from the problem description document
and helps differentiate between problem descriptions with the same `id`.
[190](https://gitlab.com/datadrivendiscovery/d3m/issues/190)
[!75](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/75)
* Moved the `id`, `version`, `name`, `other_names`, and `description` fields
in the problem schema to the top level of the problem description. Moreover,
`id` is now required. This aligns problem descriptions more closely with the
structure of our other descriptions.
[!75](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/75)
**Backwards incompatible.**
* Pipelines can now provide multiple inputs to the same primitive argument.
In that case, the runtime wraps those inputs into a `List` container type and
passes the list to the primitive.
[200](https://gitlab.com/datadrivendiscovery/d3m/issues/200)
[!112](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/112)
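The wrapping behaviour can be illustrated with a hypothetical helper (the real runtime uses the d3m `List` container type, not a plain Python list):

```python
def resolve_argument(input_values):
    """Mimic the runtime: one input is passed through unchanged,
    multiple inputs are wrapped into a list for the primitive argument."""
    if len(input_values) == 1:
        return input_values[0]
    # The d3m runtime would wrap these in its List container type.
    return list(input_values)
```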
* Primitives now have a `fit_multi_produce` method which a primitive author can
override to implement an optimized version of both fitting and producing a primitive on the same data.
The default implementation simply calls `set_training_data`, `fit`, and the produce methods.
If your primitive has non-standard additional arguments in its `produce` method(s), you
will have to override `fit_multi_produce` to accept those additional arguments
as well, just as you already had to do for `multi_produce`.
[117](https://gitlab.com/datadrivendiscovery/d3m/issues/117)
[!110](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/110)
**Could be backwards incompatible.**
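The default behaviour can be sketched with a toy class; `ToyPrimitive` is illustrative only and not the d3m base class, which also deals with timeouts, iteration limits, and result wrapping:

```python
class ToyPrimitive:
    """Illustrates the default fit_multi_produce pattern on a toy primitive
    that centers values around the training mean."""

    def set_training_data(self, *, inputs):
        self._training_inputs = inputs

    def fit(self):
        self._offset = sum(self._training_inputs) / len(self._training_inputs)

    def produce(self, *, inputs):
        return [value - self._offset for value in inputs]

    def fit_multi_produce(self, *, produce_methods, inputs):
        # Default behaviour: set training data, fit, then call every
        # requested produce method on the same inputs.
        self.set_training_data(inputs=inputs)
        self.fit()
        return {name: getattr(self, name)(inputs=inputs) for name in produce_methods}
```

An optimized override could, for example, reuse intermediate computations shared between fitting and producing instead of doing two full passes over the data.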
* The `source`, `timestamp`, and `check` arguments to all metadata functions and
container types' constructors have been deprecated. You do not have to, and
should not, provide them anymore.
[171](https://gitlab.com/datadrivendiscovery/d3m/issues/171)
[172](https://gitlab.com/datadrivendiscovery/d3m/issues/172)
[173](https://gitlab.com/datadrivendiscovery/d3m/issues/173)
[!108](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/108)
[!109](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/109)
* A primitive's constructor is no longer run during import of the primitive's
class, which makes it possible to use the constructor to load things and do any
resource allocation/reservation. The constructor is now the preferred place to do so.
[158](https://gitlab.com/datadrivendiscovery/d3m/issues/158)
[!107](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/107)
* `foreign_key` metadata has been extended with a `RESOURCE` type which allows
referencing another resource in the same dataset.
[221](https://gitlab.com/datadrivendiscovery/d3m/issues/221)
[!105](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/105)
* Updated the supported D3M dataset and problem schemas to version 3.2.0.
Problem description parsing now supports data augmentation metadata.
A new approach for LUPI datasets and problems is now supported,
including runtime support.
Moreover, if a dataset's resource name is `learningData`, it is marked as the
dataset entry point.
[229](https://gitlab.com/datadrivendiscovery/d3m/issues/229)
[225](https://gitlab.com/datadrivendiscovery/d3m/issues/225)
[226](https://gitlab.com/datadrivendiscovery/d3m/issues/226)
[!97](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/97)
* Added support for "raw" datasets.
[217](https://gitlab.com/datadrivendiscovery/d3m/issues/217)
[!94](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/94)
* A warning is issued if a primitive does not provide a description through
its docstring.
[167](https://gitlab.com/datadrivendiscovery/d3m/issues/167)
[!101](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/101)
* A warning is now issued if an installable primitive is lacking contact or bug
tracker URI metadata.
[178](https://gitlab.com/datadrivendiscovery/d3m/issues/178)
[!81](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/81)
* The `Pipeline` class now also has `equals` and `hash` methods which can help
determine whether two pipelines are equal in the sense of isomorphism.
[!53](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/53)
* The `Pipeline` and pipeline step classes now have a `get_all_hyperparams`
method to return all hyper-parameters defined for a pipeline and its steps.
[222](https://gitlab.com/datadrivendiscovery/d3m/issues/222)
[!104](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/104)
* Implemented a check for primitive Python paths to ensure that they adhere
to the new standard of all of them having the form `d3m.primitives.primitive_family.primitive_name.kind`
(e.g., `d3m.primitives.classification.random_forest.SKLearn`).
Currently a warning is issued if a primitive has a different Python path;
after January 2019 this will become an error.
For the `primitive_name` segment there is a [`primitive_names.py`](./d3m/metadata/primitive_names.py)
file containing a list of all allowed primitive names.
Everyone is encouraged to help curate this list and suggest improvements
(merging, removals, additions) to its values. The initial version was mostly
generated automatically from the values used by existing primitives.
[3](https://gitlab.com/datadrivendiscovery/d3m/issues/3)
[!67](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/67)
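A hypothetical sketch of validating this form with a regular expression; the real d3m check is stricter and additionally consults the allowed `primitive_names` list:

```python
import re

# Matches d3m.primitives.<primitive_family>.<primitive_name>.<kind>, where
# family and name are lowercase with underscores and kind is a class name.
PRIMITIVE_PATH_RE = re.compile(
    r"^d3m\.primitives\.(?P<family>[a-z][a-z_]*)\."
    r"(?P<name>[a-z0-9][a-z0-9_]*)\.(?P<kind>\w+)$"
)


def is_standard_python_path(path: str) -> bool:
    """Return True when a primitive's Python path follows the convention."""
    return PRIMITIVE_PATH_RE.match(path) is not None
```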
* Added to semantic types:
  * `https://metadata.datadrivendiscovery.org/types/TokenizableIntoNumericAndAlphaTokens`
  * `https://metadata.datadrivendiscovery.org/types/TokenizableByPunctuation`
  * `https://metadata.datadrivendiscovery.org/types/AmericanPhoneNumber`
  * `https://metadata.datadrivendiscovery.org/types/UnspecifiedStructure`
  * `http://schema.org/email`
  * `http://schema.org/URL`
  * `http://schema.org/address`
  * `http://schema.org/State`
  * `http://schema.org/City`
  * `http://schema.org/Country`
  * `http://schema.org/addressCountry`
  * `http://schema.org/postalCode`
  * `http://schema.org/latitude`
  * `http://schema.org/longitude`

  [!62](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/62)
  [!95](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/95)
  [!94](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/94)
* Updated core dependencies. Some important packages are now at versions:
  * `scikit-learn`: 0.20.2
  * `numpy`: 1.15.4
  * `pandas`: 0.23.4
  * `networkx`: 2.2
  * `pyarrow`: 0.11.1
[106](https://gitlab.com/datadrivendiscovery/d3m/issues/106)
[175](https://gitlab.com/datadrivendiscovery/d3m/issues/175)
* Added to `algorithm_types`:
  * `IDENTITY_FUNCTION`
  * `DATA_SPLITTING`
  * `BREADTH_FIRST_SEARCH`
* Moved a major part of README to Sphinx documentation which is built
and available at [http://docs.datadrivendiscovery.org/](http://docs.datadrivendiscovery.org/).
* Added a `produce_methods` argument to the `Primitive` hyper-parameter class
which allows one to limit matching primitives to only those providing all
of the listed produce methods.
[124](https://gitlab.com/datadrivendiscovery/d3m/issues/124)
[!56](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/56)
* Fixed `sample_multiple` method of the `Hyperparameter` class.
[157](https://gitlab.com/datadrivendiscovery/d3m/issues/157)
[!50](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/50)
* Fixed pickling of `Choice` hyper-parameter.
[!49](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/49)
[!51](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/51)
* Added `Constant` hyper-parameter class.
[186](https://gitlab.com/datadrivendiscovery/d3m/issues/186)
[!90](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/90)
* Added `count` to aggregate values in metafeatures.
[!52](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/52)
* Clarified and generalized some metafeatures, mostly renaming them so that they
can be used on attributes as well:
  * `number_of_classes` to `number_distinct_values`
  * `class_entropy` to `entropy_of_values`
  * `majority_class_ratio` to `value_probabilities_aggregate.max`
  * `minority_class_ratio` to `value_probabilities_aggregate.min`
  * `majority_class_size` to `value_counts_aggregate.max`
  * `minority_class_size` to `value_counts_aggregate.min`
  * `class_probabilities` to `value_probabilities_aggregate`
  * `target_values` to `values_aggregate`
  * `means_of_attributes` to `mean_of_attributes`
  * `standard_deviations_of_attributes` to `standard_deviation_of_attributes`
  * `categorical_joint_entropy` to `joint_entropy_of_categorical_attributes`
  * `numeric_joint_entropy` to `joint_entropy_of_numeric_attributes`
  * `pearson_correlation_of_attributes` to `pearson_correlation_of_numeric_attributes`
  * `spearman_correlation_of_attributes` to `spearman_correlation_of_numeric_attributes`
  * `canonical_correlation` to `canonical_correlation_of_numeric_attributes`

  [!52](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/52)
* Added metafeatures:
  * `default_accuracy`
  * `oner`
  * `jrip`
  * `naive_bayes_tree`
  * `number_of_string_attributes`
  * `ratio_of_string_attributes`
  * `number_of_other_attributes`
  * `ratio_of_other_attributes`
  * `attribute_counts_by_structural_type`
  * `attribute_ratios_by_structural_type`
  * `attribute_counts_by_semantic_type`
  * `attribute_ratios_by_semantic_type`
  * `value_counts_aggregate`
  * `number_distinct_values_of_discrete_attributes`
  * `entropy_of_discrete_attributes`
  * `joint_entropy_of_discrete_attributes`
  * `joint_entropy_of_attributes`
  * `mutual_information_of_discrete_attributes`
  * `equivalent_number_of_discrete_attributes`
  * `discrete_noise_to_signal_ratio`

  [!21](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/21)
  [!52](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/52)
* Added special handling when reading scoring D3M datasets (those with true targets in a separate
`targets.csv` file). When such a dataset is detected, the values from the separate file are now
merged into the dataset, and its ID is changed to end with a `SCORE` suffix. Similarly, the
ID of a scoring problem description gets its suffix changed to `SCORE`.
[176](https://gitlab.com/datadrivendiscovery/d3m/issues/176)
* Organized semantic types and added parent semantic types to some of them to
structure them better. New parent semantic types added: `https://metadata.datadrivendiscovery.org/types/ColumnRole`,
`https://metadata.datadrivendiscovery.org/types/DimensionType`, and `https://metadata.datadrivendiscovery.org/types/HyperParameter`.
* Fixed the mapping of the `dateTime` column type: it now maps to the
`http://schema.org/DateTime` semantic type and not to
`https://metadata.datadrivendiscovery.org/types/Time`.
**Backwards incompatible.**
* Updated the generated [site for metadata](https://metadata.datadrivendiscovery.org/) and
generated sites describing semantic types.
[33](https://gitlab.com/datadrivendiscovery/d3m/issues/33)
[114](https://gitlab.com/datadrivendiscovery/d3m/issues/114)
[!37](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/37)
* Optimized resolving of primitives in `Resolver` so that, in the common case,
loading a pipeline does not require loading all primitives.
[162](https://gitlab.com/datadrivendiscovery/d3m/issues/162)
[!38](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/38)
* Added `NotFoundError`, `AlreadyExistsError`, and `PermissionDeniedError`
exceptions to `d3m.exceptions`.
* `Pipeline`'s `to_json_structure`, `to_json`, and `to_yaml` now have a `nest_subpipelines`
argument which allows conversion with nested sub-pipelines instead of only
referencing them.
* Made sure that Arrow serialization of metadata does not also pickle linked
values (`for_value`).
* Made sure enumerations are picklable.
* The `PerformanceMetric` class now has `best_value` and `worst_value`, which
return the range of possible values for the metric. Moreover, its `normalize`
method normalizes a metric's value to a range between 0 and 1.
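The idea behind such normalization can be sketched as linear rescaling between `worst_value` and `best_value`; this is a simplification, and a full implementation would also need to handle metrics with unbounded ranges:

```python
def normalize(value: float, best_value: float, worst_value: float) -> float:
    """Linearly rescale a metric value into [0, 1], where 1 maps to best_value.

    Works whether higher is better (accuracy) or lower is better (error),
    since the direction is encoded by which bound is 'best'.
    """
    low, high = sorted((best_value, worst_value))
    clipped = min(max(value, low), high)  # keep the value inside the metric range
    return abs(clipped - worst_value) / abs(best_value - worst_value)
```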
* D3M dataset qualities are now loaded only after the data is loaded. This fixes
lazy loading of datasets with qualities, which was broken before.
* Added `load_all_primitives` argument to the default pipeline `Resolver`
which allows one to control loading of primitives outside of the resolver.
* Added a `primitives_blacklist` argument to the default pipeline `Resolver`
which allows one to specify a collection of primitive path prefixes not to
(try to) load.
* Fixed return value of the `fit` method in `TransformerPrimitiveBase`.
It now correctly returns `CallResult` instead of `None`.
* Fixed a typo and renamed `get_primitive_hyparparams` to `get_primitive_hyperparams`
in `PrimitiveStep`.
**Backwards incompatible.**
* Additional methods were added to the `Pipeline` class and step classes
to support the runtime and easier programmatic manipulation of pipelines
(`get_free_hyperparams`, `get_input_data_references`, `has_placeholder`,
`replace_step`, `get_exposable_outputs`).
* Added reference implementation of the runtime. It is available
in the `d3m.runtime` module. This module also has an extensive
command line interface you can access through `python3 -m d3m.runtime`.
[115](https://gitlab.com/datadrivendiscovery/d3m/issues/115)
[!57](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/57)
[!72](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/72)
* The `GeneratorPrimitiveBase` interface has been changed so that the `produce` method
accepts a list of non-negative integers as an input instead of a list of `None` values.
This allows batching and gives the caller control over which outputs to generate.
Previously, outputs depended on the number of calls to `produce` and the number
of outputs requested in each call; now these integers serve as an index into the
set of potential outputs.
**Backwards incompatible.**
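A toy illustration of the new contract; `ToyGenerator` is not the actual `GeneratorPrimitiveBase`, which also involves hyper-parameters and result wrapping:

```python
class ToyGenerator:
    """Produce takes non-negative integers indexing the set of potential outputs."""

    def produce(self, *, inputs):
        # Each integer deterministically selects one output, so results no
        # longer depend on how many produce calls happened before this one.
        return [f"sample-{index}" for index in inputs]
```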
* We now try to preserve the metadata log in the default implementation of `can_accept`.
* Added `sample_rate` field to `dimension` metadata.
* The `python3 -m d3m.index download` command now accepts a `--prefix` argument to limit
the primitives for which static files are downloaded. This is useful for testing.
* Added a `check` argument to `DataMetadata`'s `update` and `remove` methods which allows
one to control whether the selector is checked against `for_value`. When the
selector is known to be valid, skipping the check can speed up those methods.
* Defined a `file_columns` metadata field which allows storing known columns metadata
for tables referenced from columns. This is now used by the D3M dataset reader to store
known columns metadata for collections of CSV files. Previously, this metadata was
lost despite being available in Lincoln Labs dataset metadata.