llm-dataset-converter

Latest version: v0.2.4

0.2.4
------------------

- requiring seppl>=0.2.6 now
- readers use default globs now, allowing the user to simply supply directories as input
- renamed the `split` filter to `split-records` to avoid a name clash with the meta-data key `split` when it is used as a parameter
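
The default-glob behaviour mentioned above can be pictured roughly as follows. This is only a sketch of the general idea, not the library's code: the function name `expand_inputs` and the `*.jsonl` default pattern are assumptions made for illustration.

```python
from pathlib import Path
from typing import List


def expand_inputs(paths: List[str], default_glob: str = "*.jsonl") -> List[str]:
    """Expand directories with a default glob; pass other entries through unchanged.

    Hypothetical helper: the name and the default pattern are illustrative only.
    """
    result: List[str] = []
    for p in paths:
        path = Path(p)
        if path.is_dir():
            # A directory is replaced by the (sorted) files matching the default glob.
            result.extend(sorted(str(f) for f in path.glob(default_glob)))
        else:
            # Files and explicit glob patterns are left for the caller to handle.
            result.append(p)
    return result
```

With a helper along these lines, a reader that is handed a directory would pick up only the files matching its own format, so the user does not have to spell out the pattern.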

0.2.3
------------------

- requiring seppl>=0.2.4 now

0.2.2
------------------

- requiring seppl>=0.2.3 now

0.2.1
------------------

- filters `split` and `tee` now support `ClassificationData` as well
- added `metadata-from-name` filter to extract meta-data from the current input file name
- added `inspect` filter that allows inspecting data interactively as it passes through the pipeline
- added `empty_str_if_none` helper method to `ldc.text_utils` to ensure no None/null values are output with writers
- upgraded seppl to 0.2.2 and switched to using `seppl.ClassListerRegistry`
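
To illustrate what a helper like `empty_str_if_none` guards against, here is a minimal stand-in. The actual signature in `ldc.text_utils` is not given in these notes, so the name `empty_str_if_none_sketch` and its single-argument form are assumptions.

```python
from typing import Optional


def empty_str_if_none_sketch(value: Optional[str]) -> str:
    """Return "" for None, otherwise the value itself.

    Stand-in for illustration; the real helper's signature is assumed, not confirmed.
    """
    return "" if value is None else value


# Typical use in a writer: make sure no None ends up in an output row.
record = {"instruction": "Translate to French", "input": None, "output": "Bonjour"}
row = [empty_str_if_none_sketch(record[key]) for key in ("instruction", "input", "output")]
```

Here `row` contains an empty string in place of the missing `input`, which keeps CSV or JSON output free of `None`/`null` values.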

0.2.0
------------------

- added support for XTuner conversation JSON format: `from-xtuner` and `to-xtuner`
- added filter `update-pair-data` to allow tweaking or rearranging of the data
- introduced `ldc.api` module to separate out abstract superclasses and avoid circular imports
- readers now set the 'file' meta-data value
- added `file-filter` filter for explicitly allowing/discarding records that stem from certain files (entry in meta-data: 'file')
- added `record-files` filter for recording the files that the records are based on (entry in meta-data: 'file')
- filter `pretrain-sentences-to-pairs` can now omit filling the `instruction` when using 0 as prompt step
- requiring seppl>=0.1.2 now
- added global option `-U, --unescape_unicode` to `llm-convert` tool to allow conversion of escaped Unicode characters
- the `llm-append` tool now supports appending for JSON, JSON lines and CSV files in addition to plain-text files (the default)
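
As a rough picture of what converting escaped Unicode characters can mean, the sketch below turns literal `\uXXXX` sequences in a string into the characters they stand for. It is not taken from the tool's source; the function name `unescape_unicode` and the chosen codec round trip are assumptions.

```python
def unescape_unicode(text: str) -> str:
    """Turn literal escape sequences such as \\u00e9 into the characters they denote.

    Illustrative sketch only; how the tool performs the conversion is not stated here.
    """
    # Characters outside Latin-1 are first rewritten as backslash escapes so that
    # the "unicode_escape" decoder (which reads bytes as Latin-1) restores them intact.
    raw = text.encode("latin-1", errors="backslashreplace")
    return raw.decode("unicode_escape")


print(unescape_unicode("caf\\u00e9"))  # prints: café
```

Note that this approach also interprets other backslash escapes (for example a literal `\n` becomes a newline), which may or may not be desirable for a given data set.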

0.1.1
------------------

- added `classification` domain
- added `from-jsonlines-cl` reader and `to-jsonlines-cl` writer for classification data in JSON lines format
- added filter `pretrain-sentences-to-classification` to turn pretrain data into classification data (with a predefined label)
- added filter `classification-label-map` that can generate a label string/int map
- the `to-llama2-format` filter now has the `--skip_tokens` option to leave out the [INST] [/INST] tokens
- added `from-parquet-cl` reader and `to-parquet-cl` writer for classification data in Parquet database format
- added `from-csv-cl`/`from-tsv-cl` readers and `to-csv-cl`/`to-tsv-cl` writers for classification data in CSV/TSV file format
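
A label string/int map of the kind mentioned for `classification-label-map` could be built along these lines. This is a generic sketch: the function name `build_label_map` and the choice to number labels in sorted order are assumptions, not the filter's documented behaviour.

```python
from typing import Dict, Iterable


def build_label_map(labels: Iterable[str]) -> Dict[str, int]:
    """Map each distinct label string to an integer index.

    Hypothetical helper: labels are numbered in sorted order so that the
    mapping is reproducible regardless of the order in which records arrive.
    """
    unique = sorted(set(labels))
    return {label: index for index, label in enumerate(unique)}


label_map = build_label_map(["positive", "negative", "positive", "neutral"])
# label_map == {"negative": 0, "neutral": 1, "positive": 2}
```

Such a map is what a downstream training script typically needs to turn string labels into class indices.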
