- filters `split` and `tee` now support `ClassificationData` as well
- added `metadata-from-name` filter to extract meta-data from the current input file name
- added `inspect` filter that allows inspecting data interactively as it passes through the pipeline
- added `empty_str_if_none` helper method to `ldc.text_utils` to ensure no None/null values are output with writers
- upgraded seppl to 0.2.2 and switched to using `seppl.ClassListerRegistry`
0.2.0
------------------
- added support for XTuner conversation JSON format: `from-xtuner` and `to-xtuner`
- added filter `update-pair-data` to allow tweaking or rearranging of the data
- introduced `ldc.api` module to separate out abstract superclasses and avoid circular imports
- readers now set the 'file' meta-data value
- added `file-filter` filter for explicitly allowing/discarding records that stem from certain files (entry in meta-data: 'file')
- added `record-files` filter for recording the files that the records are based on (entry in meta-data: 'file')
- filter `pretrain-sentences-to-pairs` can now omit filling the `instruction` when using 0 as prompt step
- requiring seppl>=0.1.2 now
- added global option `-U, --unescape_unicode` to the `llm-convert` tool to allow conversion of escaped unicode characters
- the `llm-append` tool now supports appending for JSON, JSON lines and CSV files apart from plain-text files (default)
0.1.1
------------------
- added `classification` domain
- added `from-jsonlines-cl` reader and `to-jsonlines-cl` writer for classification data in JSON lines format
- added filter `pretrain-sentences-to-classification` to turn pretrain data into classification data (with a predefined label)
- added filter `classification-label-map` that can generate a label string/int map
- the `to-llama2-format` filter now has the `--skip_tokens` option to leave out the [INST] [/INST] tokens
- added `from-parquet-cl` reader and `to-parquet-cl` writer for classification data in Parquet database format
- added `from-csv-cl`/`from-tsv-cl` readers and `to-csv-cl`/`to-tsv-cl` writers for classification data in CSV/TSV file format
0.1.0
------------------
- fixed output format of `to-llama2-format` filter
- `llama2-to-pairs` filter has more robust parsing now
- upgraded seppl to 0.1.0
- switched to seppl classes: Splitter, MetaDataHandler, Reader, Writer, StreamWriter, BatchWriter
0.0.5
------------------
- added flag `-b/--force_batch` to the `llm-convert` tool, which forces all data to be read from the reader before filtering it and then passing it to the writer; useful for batch filters
- added the `randomize-records` batch filter
- added the `--encoding ENC` option to file readers
- auto-determined encoding is now being logged (`INFO` level)
- the `LDC_ENCODING_MAX_CHECK_LENGTH` environment variable allows overriding the default number of bytes used for determining the file encoding in auto-detect mode
- default max number of bytes inspected for determining file encoding is now 10KB
- method `locate_files` in `base_io` no longer includes directories when expanding globs
- added tool `llm-file-encoding` for determining file encodings of text files
- added method `replace_extension` to the `base_io` module for changing a file's extension (removes any supported compression suffix first)
- stream writers (.jsonl/.txt) now work with `--force_batch` mode; the output file name gets automatically generated from the input file name when just using a directory for the output