Tensorflow-datasets

Latest version: v4.9.8

4.8.0

Added

- [API] `DatasetBuilder`'s description and citations can be specified in
dedicated `README.md` and `CITATIONS.bib` files, within the dataset package
(see https://www.tensorflow.org/datasets/add_dataset).
- Tags can be associated with datasets in the `TAGS.txt` file. For
now, they are only used in the generated documentation.
- [API][Experimental] New `ViewBuilder` to define datasets as transformations
of existing datasets. Also adds `tfds.transform` with functionality to apply
transformations.
- Loggers are also called on `tfds.as_numpy(...)`; the base `Logger` class has a
new corresponding method.
- `tfds.core.DatasetBuilder` can have a default limit for the number of
simultaneous downloads. `tfds.download.DownloadConfig` can override it.
- `tfds.features.Audio` supports storing raw audio data for lazy decoding.
- The number of shards can be overridden when preparing a dataset:
`builder.download_and_prepare(download_config=tfds.download.DownloadConfig(num_shards=42))`.
Alternatively, you can configure the min and max shard size if you want TFDS
to compute the number of shards for you, but want to have control over the
shard sizes.
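
As a rough illustration of the shard-size option above, the sketch below works out which shard counts keep equally sized shards within a minimum and maximum size. It is not TFDS's actual selection logic, which these notes do not describe; the function name `shard_count_bounds` and its parameters are hypothetical.

```python
def shard_count_bounds(total_bytes, min_shard_size, max_shard_size):
    """Return (fewest, most) shard counts that respect the size bounds.

    Hypothetical helper, not part of TFDS. Assumes all shards have
    (roughly) the same size, i.e. total_bytes / num_shards.
    """
    if not 0 < min_shard_size <= max_shard_size:
        raise ValueError("expected 0 < min_shard_size <= max_shard_size")
    # Fewer shards than this would push a shard above max_shard_size.
    fewest = max(1, -(-total_bytes // max_shard_size))  # ceiling division
    # More shards than this would push a shard below min_shard_size.
    most = max(1, total_bytes // min_shard_size)
    return fewest, most


MiB = 1024 ** 2
GiB = 1024 ** 3
fewest, most = shard_count_bounds(10 * GiB, 64 * MiB, 1 * GiB)
print(fewest, most)
```

If `fewest` is larger than `most`, no shard count satisfies both bounds for that dataset size, and the bounds would need to be relaxed.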

Changed

Deprecated

Removed

Fixed

Security

4.7.0

Added

- [API] Added
[TfDataBuilder](https://www.tensorflow.org/datasets/format_specific_dataset_builders#datasets_based_on_tfdatadataset)
that is handy for storing experimental ad hoc TFDS datasets in notebook-like
environments such that they can be versioned, described, and easily shared
with teammates.
- [API] Added options to create format-specific dataset builders. The new API
now includes a number of NLP-specific builders, such as:
  - [CoNLL](https://www.tensorflow.org/datasets/format_specific_dataset_builders#conll)
  - [CoNLL-U](https://www.tensorflow.org/datasets/format_specific_dataset_builders#conllu)
- [API] Added `tfds.beam.inc_counter` to reduce `beam.metrics.Metrics.counter`
boilerplate
- [API] Added options to group together existing TFDS datasets into
[dataset collections](https://www.tensorflow.org/datasets/dataset_collections)
and to perform simple operations over them.
- [Documentation] update, specifically:
  - [New guide](https://www.tensorflow.org/datasets/format_specific_dataset_builders)
    on format-specific dataset builders;
  - [New guide](https://www.tensorflow.org/datasets/add_dataset_collection)
    on adding new dataset collections to TFDS;
  - Updated [TFDS CLI](https://www.tensorflow.org/datasets/cli)
    documentation.
- [TFDS CLI] Supports custom config through JSON (e.g.
`tfds build my_dataset --config='{"name": "my_custom_config", "description": "Abc"}'`).
- New datasets:
  - [conll2003](https://www.tensorflow.org/datasets/catalog/conll2003)
  - [universal_dependency 2.10](https://www.tensorflow.org/datasets/catalog/universal_dependency)
  - [bucc](https://www.tensorflow.org/datasets/catalog/bucc)
  - [i_naturalist2021](https://www.tensorflow.org/datasets/catalog/i_naturalist2021)
  - [mtnt](https://www.tensorflow.org/datasets/catalog/mtnt) Machine
    Translation of Noisy Text.
  - [placesfull](https://www.tensorflow.org/datasets/catalog/placesfull)
  - [tatoeba](https://www.tensorflow.org/datasets/catalog/tatoeba)
  - [user_libri_audio](https://www.tensorflow.org/datasets/catalog/user_libri_audio)
  - [user_libri_text](https://www.tensorflow.org/datasets/catalog/user_libri_text)
  - [xtreme_pos](https://www.tensorflow.org/datasets/catalog/xtreme_pos)
  - [yahoo_ltrc](https://www.tensorflow.org/datasets/catalog/yahoo_ltrc)
- Updated datasets:
  - [C4](https://www.tensorflow.org/datasets/catalog/c4) was updated to
    version 3.1.
  - [common_voice](https://www.tensorflow.org/datasets/catalog/common_voice)
    was updated to a more recent snapshot.
  - [wikipedia](https://www.tensorflow.org/datasets/catalog/wikipedia) was
    updated with the `20220620` snapshot.
- New dataset collections, such as
[xtreme](https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/dataset_collections/xtreme/xtreme.py)
and
[LongT5](https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/dataset_collections/longt5/longt5.py)
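
The JSON passed to `--config` in the TFDS CLI entry above has to survive shell quoting. A minimal sketch of assembling that command line from a Python dict follows; the helper name `build_tfds_command` is hypothetical, and only the standard library is used (nothing here calls TFDS itself).

```python
import json
import shlex


def build_tfds_command(dataset, config):
    """Build a `tfds build` command line with a JSON --config value.

    Hypothetical helper: it only formats a string and does not run the CLI.
    """
    config_json = json.dumps(config)
    # shlex.quote wraps the JSON in single quotes so the shell passes it
    # through as a single argument.
    return f"tfds build {shlex.quote(dataset)} --config={shlex.quote(config_json)}"


config = {"name": "my_custom_config", "description": "Abc"}
command = build_tfds_command("my_dataset", config)
print(command)
```

Splitting the resulting string with `shlex.split` and parsing the value after `--config=` with `json.loads` gives back the original dict, which is a quick way to check the quoting.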

Changed

- The base `Logger` class expects more information to be passed to the
`as_dataset` method. This should only be relevant to people who have
implemented and registered custom `Logger` class(es).
- You can set `DEFAULT_BUILDER_CONFIG_NAME` in a `DatasetBuilder` to change
the default config if it shouldn't be the first builder config defined in
`BUILDER_CONFIGS`.
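
The default-config rule described above can be pictured with plain Python stand-ins. This is a sketch of the selection behavior only, not TFDS code: `FakeBuilderConfig`, `FakeBuilder` and `default_config` are hypothetical names, and only `BUILDER_CONFIGS` and `DEFAULT_BUILDER_CONFIG_NAME` come from the entry itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeBuilderConfig:
    """Stand-in for a builder config; only the name matters here."""
    name: str


class FakeBuilder:
    """Stand-in for a dataset builder with several configs."""
    BUILDER_CONFIGS = [FakeBuilderConfig("small"), FakeBuilderConfig("large")]
    DEFAULT_BUILDER_CONFIG_NAME = None

    @classmethod
    def default_config(cls):
        # Without an explicit default, fall back to the first config.
        if cls.DEFAULT_BUILDER_CONFIG_NAME is None:
            return cls.BUILDER_CONFIGS[0]
        for config in cls.BUILDER_CONFIGS:
            if config.name == cls.DEFAULT_BUILDER_CONFIG_NAME:
                return config
        raise ValueError(f"unknown config: {cls.DEFAULT_BUILDER_CONFIG_NAME}")


class LargeByDefault(FakeBuilder):
    # Override the default without reordering BUILDER_CONFIGS.
    DEFAULT_BUILDER_CONFIG_NAME = "large"


print(FakeBuilder.default_config().name, LargeByDefault.default_config().name)
```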

Deprecated

Removed

Fixed

- Various datasets
- In Linux, when loading a dataset from a directory that is not your home
(`~`) directory, a new `~` directory is not created in the current directory
(fixes [4117](https://github.com/tensorflow/datasets/issues/4117)).

Security

4.6.0

Added

- Support for community datasets on GCS.
- [API] `tfds.builder_from_directory` and `tfds.builder_from_directories`, see
https://www.tensorflow.org/datasets/external_tfrecord#directly_from_folder.
- [API] Dash ("-") support in split names.
- [API] `file_format` argument to `download_and_prepare` method, allowing users
to specify an alternative file format to store prepared data (e.g.
"riegeli").
- [API] `file_format` to `DatasetInfo` string representation.
- [API] Expose the return value of Beam pipelines. This allows users to
read the Beam metrics.
- [API] Expose Feature `tf_example_spec` to public.
- [API] `doc` kwarg on `Feature`s, to describe a feature.
- [Documentation] Features description is shown on
[TFDS Catalog](https://www.tensorflow.org/datasets/catalog/overview).
- [Documentation] More metadata about HuggingFace datasets in TFDS catalog.
- [Performance] Parallel load of metadata files.
- [Testing] TFDS tests are now run using GitHub Actions, with miscellaneous
improvements such as caching and sharding.
- [Testing] Improvements to MockFs.
- New datasets.

Changed

- [API] `num_shards` is now optional in the shard name.

Removed

- TFDS pathlib API, migrated to a self-contained `etils.epath` (see
https://github.com/google/etils).

Fixed

- Various datasets.
- Dataset builders that are defined ad hoc (e.g. in Colab).
- Better `DatasetNotFoundError` messages.
- Don't set `deterministic` on a global level but locally in interleave, so it
only applies to interleave and not all transformations.
- Google drive downloader.

4.5.2

Added

- [API] `split=tfds.split_for_jax_process('train')` (alias of
`tfds.even_splits('train', n=jax.process_count())[jax.process_index()]`).
- [Documentation] update.
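
To make the per-process split idea concrete, here is a small pure-Python sketch that divides a split of known size into near-equal index ranges and picks the one for a given process. It does not call TFDS or JAX. The helper names, the explicit `num_examples` argument and the way the remainder is spread over the first ranges are all assumptions made for illustration.

```python
def even_split_ranges(num_examples, n):
    """Divide range(num_examples) into n contiguous, near-equal ranges."""
    base, extra = divmod(num_examples, n)
    ranges = []
    start = 0
    for i in range(n):
        # The first `extra` ranges get one additional example each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def split_for_process(split, num_examples, process_index, process_count):
    """Return an absolute-index split string for one process (hypothetical)."""
    start, stop = even_split_ranges(num_examples, process_count)[process_index]
    return f"{split}[{start}:{stop}]"


print(even_split_ranges(10, 3))
print(split_for_process("train", 10, process_index=1, process_count=3))
```

In a multi-host JAX job the two integers would come from `jax.process_index()` and `jax.process_count()`, as in the alias quoted above.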

Fixed

- Import bug on Windows (3709).

4.5.0

Added

- [API] Better split API:
  - Splits can be selected using shards: `split='train[3shard]'`.
  - Underscore supported in numbers for better readability:
    `split='train[:500_000]'`.
  - Select the union of all splits with `split='all'`.
  - [`tfds.even_splits`](https://www.tensorflow.org/datasets/splits#tfdseven_splits_multi-host_training)
    is more precise and flexible:
    - Return splits exactly of the same size when passed
      `tfds.even_splits('train', n=3, drop_remainder=True)`.
    - Works on subsplits `tfds.even_splits('train[:75%]', n=3)` or even
      nested.
    - Can be composed with other splits: `tfds.even_splits('train', n=3)[0] + 'test'`.
- [API] `serialize_example` / `deserialize_example` methods on features to
encode/decode example to proto: `example_bytes =
features.serialize_example(example_data)`.
- [API] `Audio` feature now supports `encoding='zlib'` for better compression.
- [API] Features specs are exposed in proto for better compatibility with
other languages.
- [API] Create beam pipeline using TFDS as input with
[tfds.beam.ReadFromTFDS](https://www.tensorflow.org/datasets/api_docs/python/tfds/beam/ReadFromTFDS).
- [API] Support setting the file formats in `tfds build
--file_format=tfrecord`.
- [API] Typing annotations exposed in `tfds.typing`.
- [API] `tfds.ReadConfig` has a new `assert_cardinality=False` argument to
disable the cardinality assertion.
- [API] `tfds.display_progress_bar(True)` for functional control.
- [API] DatasetInfo exposes `.release_notes`.
- Support for huge number of shards (>99999).
- [Performance] Faster dataset generation (using tfrecords).
- [Testing] Mock dataset now supports nested datasets
- [Testing] Customize the number of sub examples
- [Documentation] Community datasets:
https://www.tensorflow.org/datasets/community_catalog/overview.
- [Documentation]
[Guide on TFDS and determinism](https://www.tensorflow.org/datasets/determinism).
- [[RLDS](https://github.com/google-research/rlds)] Support for nested
datasets features.
- [[RLDS](https://github.com/google-research/rlds)] New datasets: Robomimic,
D4RL Ant Maze, RLU Real World RL, and RLU Atari with ordered episodes.
- New datasets.
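
The underscore-in-numbers syntax from the split API entry above mirrors Python's own integer literals. The sketch below parses a simplified split string with a regular expression; it is an illustration, not TFDS's parser, and it deliberately handles only absolute index slices (no percentages, shard selectors, `'all'` or `+` composition). The name `parse_split` is hypothetical.

```python
import re

# Matches e.g. "train", "train[:500_000]", "train[10:20]".
_SPLIT_RE = re.compile(
    r"^(?P<name>\w+)(?:\[(?P<start>-?[\d_]+)?:(?P<stop>-?[\d_]+)?\])?$"
)


def parse_split(spec):
    """Parse a simplified split string into (name, start, stop)."""
    match = _SPLIT_RE.match(spec)
    if match is None:
        raise ValueError(f"unsupported split spec: {spec!r}")
    start = match.group("start")
    stop = match.group("stop")
    # int() accepts single underscores between digits, like Python literals.
    return (
        match.group("name"),
        int(start) if start else None,
        int(stop) if stop else None,
    )


print(parse_split("train[:500_000]"))
```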

Deprecated

- Python 3.6 support: this is the last version of TFDS supporting Python 3.6.
Future versions will use Python 3.7.

Fixed

- Misc bugs.

4.4.0

Added

- [API]
[`PartialDecoding` support](https://www.tensorflow.org/datasets/decode#only_decode_a_sub-set_of_the_features),
to decode only a subset of the features (for performance).
- [API] `tfds.features.LabeledImage` for semantic segmentation (like image but
with additional `info.features['image_label'].name` label metadata).
- [API] float32 support for `tfds.features.Image` (e.g. for depth map).
- [API] Loading datasets from files now supports custom
`tfds.features.FeatureConnector`.
- [API] All `FeatureConnector`s can now have a `None` dimension anywhere
(previously restricted to the first position).
- [API] `tfds.features.Tensor()` can have an arbitrary number of dynamic
dimensions (`Tensor(..., shape=(None, None, 3, None))`).
- [API] `tfds.features.Tensor` can now be serialised as bytes, instead of
float/int values (to allow better compression): `Tensor(...,
encoding='zlib')`.
- [API] Support for datasets with `None` in `tfds.as_numpy`.
- Script to add TFDS metadata files to existing TF-record (see
[doc](https://www.tensorflow.org/datasets/external_tfrecord)).
- [TESTING] `tfds.testing.mock_data` now supports:
  - non-scalar tensors with dtype `tf.string`;
  - `builder_from_files` and path-based community datasets.
- [Documentation] Catalog now exposes links to
[KnowYourData visualisations](https://knowyourdata-tfds.withgoogle.com/).
- [Documentation] Guide on
[common implementation gotchas](https://www.tensorflow.org/datasets/common_gotchas).
- Many new reinforcement learning datasets.

Changed

- [API] Datasets generated with `disable_shuffling=True` are now read in
generation order.
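
The `Tensor(..., encoding='zlib')` entry above stores a tensor as compressed bytes rather than as individual float/int values. The standard-library sketch below shows the general idea on a flat array of floats; it is not how TFDS implements the feature, and the array stands in for a real tensor.

```python
import array
import zlib

# A mostly-zero float32 buffer standing in for a sparse tensor.
values = array.array("f", [0.0] * 10_000)
values[123] = 1.5

raw = values.tobytes()            # raw bytes of the float buffer
compressed = zlib.compress(raw)   # far smaller for repetitive data

# Round trip: decompress and rebuild the original values.
restored = array.array("f")
restored.frombytes(zlib.decompress(compressed))

print(len(raw), len(compressed), restored == values)
```

The gain depends entirely on the data: repetitive or sparse tensors shrink a lot, while noisy floating-point data may barely compress.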

Fixed

- File format automatically restored (for datasets generated with
`tfds.builder(..., file_format=)`).
- Dynamically set number of worker threads during extraction.
- Update progress bar during download even if downloads are cached.
- Misc bug fixes.
