llm-dataset-converter

Latest version: v0.2.2

0.2.2

------------------

- requiring seppl>=0.2.3 now

0.2.1

------------------

- filters `split` and `tee` now support `ClassificationData` as well
- added `metadata-from-name` filter to extract meta-data from the current input file name
- added `inspect` filter that allows inspecting data interactively as it passes through the pipeline
- added `empty_str_if_none` helper method to `ldc.text_utils` to ensure no None/null values are output with writers
- upgraded seppl to 0.2.2 and switched to using `seppl.ClassListerRegistry`
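
The `empty_str_if_none` helper listed above could look roughly like the following. This is a minimal sketch based only on the description in this entry; the actual signature and behavior in `ldc.text_utils` may differ.

```python
from typing import Optional


def empty_str_if_none(value: Optional[str]) -> str:
    """Return an empty string for None, otherwise the value unchanged.

    Sketch only: assumes the helper takes and returns a plain string.
    """
    return "" if value is None else value


# A writer could apply this to each field before output, so that
# no None/null values end up in the written file.
record = {"instruction": "Summarize", "input": None}
cleaned = {key: empty_str_if_none(val) for key, val in record.items()}
```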

0.2.0

------------------

- added support for XTuner conversation JSON format: `from-xtuner` and `to-xtuner`
- added filter `update-pair-data` to allow tweaking or rearranging of the data
- introduced `ldc.api` module to separate out abstract superclasses and avoid circular imports
- readers now set the 'file' meta-data value
- added `file-filter` filter for explicitly allowing/discarding records that stem from certain files (entry in meta-data: 'file')
- added `record-files` filter for recording the files that the records are based on (entry in meta-data: 'file')
- filter `pretrain-sentences-to-pairs` can now omit filling the `instruction` when using 0 as prompt step
- requiring seppl>=0.1.2 now
- added global option `-U, --unescape_unicode` to `llm-convert` tool to allow conversion of escaped unicode characters
- the `llm-append` tool now supports appending to JSON, JSON Lines and CSV files in addition to plain-text files (default)
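
To illustrate what the `-U, --unescape_unicode` option is for, the sketch below converts escaped sequences such as `\u00e9` into the characters they represent. The function name `unescape_unicode_sketch` is hypothetical, and the regular-expression approach is an assumption; the tool's own implementation is not shown in this changelog and may work differently (this sketch does not combine surrogate pairs, for instance).

```python
import re

# Matches a literal backslash, 'u', and exactly four hex digits.
_ESCAPED = re.compile(r"\\u([0-9a-fA-F]{4})")


def unescape_unicode_sketch(text: str) -> str:
    """Replace literal \\uXXXX sequences with the corresponding character."""
    return _ESCAPED.sub(lambda m: chr(int(m.group(1), 16)), text)


result = unescape_unicode_sketch("caf\\u00e9")
```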

0.1.1

------------------

- added `classification` domain
- added `from-jsonlines-cl` reader and `to-jsonlines-cl` writer for classification data in JSON lines format
- added filter `pretrain-sentences-to-classification` to turn pretrain data into classification data (with a predefined label)
- added filter `classification-label-map` that can generate a label string/int map
- the `to-llama2-format` filter now has the `--skip_tokens` option to leave out the [INST] [/INST] tokens
- added `from-parquet-cl` reader and `to-parquet-cl` writer for classification data in Parquet database format
- added `from-csv-cl`/`from-tsv-cl` readers and `to-csv-cl`/`to-tsv-cl` writers for classification data in CSV/TSV file format
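
A label string/int map of the kind that `classification-label-map` is described as generating can be sketched as follows. The function name `build_label_map` and the choice to number labels in sorted order are assumptions made for illustration; the filter's actual ordering and output format are not specified here.

```python
from typing import Dict, Iterable


def build_label_map(labels: Iterable[str]) -> Dict[str, int]:
    """Map each distinct label string to an integer index.

    Assumption: indices follow the sorted order of the distinct labels,
    which keeps the mapping stable across runs.
    """
    return {label: index for index, label in enumerate(sorted(set(labels)))}


label_map = build_label_map(["spam", "ham", "spam", "other"])
```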

0.1.0

------------------

- fixed output format of `to-llama2-format` filter
- `llama2-to-pairs` filter has more robust parsing now
- upgraded seppl to 0.1.0
- switched to seppl classes: Splitter, MetaDataHandler, Reader, Writer, StreamWriter, BatchWriter

0.0.5

------------------

- added flag `-b/--force_batch` to the `llm-convert` tool, which forces all data to be read from the
reader before filtering it and then passing it to the writer; useful for batch filters.
- added the `randomize-records` batch filter
- added the `--encoding ENC` option to file readers
- auto-determined encoding is now being logged (`INFO` level)
- the `LDC_ENCODING_MAX_CHECK_LENGTH` environment variable allows overriding the default
number of bytes used for determining the file encoding in auto-detect mode
- default max number of bytes inspected for determining file encoding is now 10 KB
- method `locate_files` in `base_io` no longer includes directories when expanding globs
- added tool `llm-file-encoding` for determining file encodings of text files
- added method `replace_extension` to `base_io` module for changing a file's extension
(removes any supported compression suffix first)
- stream writers (.jsonl/.txt) now work with `--force_batch` mode; the output file name
gets automatically generated from the input file name when just using a directory for
the output
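
The behavior described for `replace_extension` (strip a compression suffix first, then swap the extension) can be sketched as below. The name `replace_extension_sketch` and the list of compression suffixes are hypothetical; the suffixes that `base_io` actually supports are not listed in this changelog.

```python
import os

# Assumed set of compression suffixes, for illustration only.
COMPRESSION_SUFFIXES = (".gz", ".bz2", ".xz", ".zst")


def replace_extension_sketch(path: str, new_ext: str) -> str:
    """Remove a trailing compression suffix, then replace the extension."""
    for suffix in COMPRESSION_SUFFIXES:
        if path.lower().endswith(suffix):
            path = path[: -len(suffix)]
            break
    base, _ = os.path.splitext(path)
    return base + new_ext


converted = replace_extension_sketch("data/train.jsonl.gz", ".txt")
```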
