- Semantic Deduplication - Resiliparse for Text Extraction - Improve Distributed Data Classification - Domain classifier is 1.55x faster through intelligent batching - Synthetic data generation for fine-tuning
What's Changed
* Update README by ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/6 * [Tutorials] Add a readme file for the TinyStories tutorial by Maghoumi in https://github.com/NVIDIA/NeMo-Curator/pull/5 * Add workflow for running cpu pytests by ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/13 * Add pre-commit style checks by ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/14 * Add citation by ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/15 * Fix Noisy CUDA Shutdown by ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/20 * Bump Python and RAPIDS versions by ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/16 * Add batched decorator by ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/18 * Add issue templates by ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/22 * Add dependency to fix justext by ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/24 * Fix metadata inference with pandas and dask by ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/35 * Disable PyTorch Compile Multiprocessing by ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/34 * Improve speed of AddId module by ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/36 * Make GPU dependencies optional by ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/27 * Fix failing GPU tests with latest pandas bump by ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/41 * Adds Nemo Curator K8s example by terrykong in https://github.com/NVIDIA/NeMo-Curator/pull/40 * Move common dedup utils and remove unused code by ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/42 * Fix lang id example by ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/37 * Add dataset blending tool by ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/32 * High level fuzzy duplicates module by ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/46 * Fix indexing in PII Modifier by ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/55 * Disable string conversion globally by ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/56 * Fix issue 43 (empty files creation) and improve reading/writing speed by miguelusque in https://github.com/NVIDIA/NeMo-Curator/pull/57 * [Tutorials] Add a tutorial for PEFT data curation by Maghoumi in https://github.com/NVIDIA/NeMo-Curator/pull/45 * Only import PII constants during Curator import by ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/61 * Align `extract_partitioning_index` logic with upstream shuffling by rjzamora in https://github.com/NVIDIA/NeMo-Curator/pull/60 * [REVIEW] Switch Models to use Crossfit by VibhuJawa in https://github.com/NVIDIA/NeMo-Curator/pull/58 * Remove argparse from get_client function signature by ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/12 * Fuzzy Dedup: Use text_field instead of hardcoded text column by ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/74 * Add pull request template by ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/78 * Add jupyter notebook tutorial for single node mulilingual dataset by nicoleeeluo in https://github.com/NVIDIA/NeMo-Curator/pull/30 * Update issue templates by ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/81 * Fix 91 - Incorrect reference to domain_classifier_example.py by miguelusque in https://github.com/NVIDIA/NeMo-Curator/pull/92 * Fix 63. Add --input-meta parameter to explicitly specify the jsonl field dtypes by miguelusque in https://github.com/NVIDIA/NeMo-Curator/pull/75 * Update readme by ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/93 * Update documentation for new version by ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/83 * Update requirements documentation. by ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/98 * Make sure query-planning is disabled for now by rjzamora in https://github.com/NVIDIA/NeMo-Curator/pull/97 * Applying SEO Best Pratices by aschilling-nv in https://github.com/NVIDIA/NeMo-Curator/pull/104 * Shuffle CC result on group before writing out by ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/110 * Added tutorials to index.rst by jgerh in https://github.com/NVIDIA/NeMo-Curator/pull/113 * Pin to numpy<2 to avoid spacy compat issues by ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/119 * Fix 116. Fix broken links by miguelusque in https://github.com/NVIDIA/NeMo-Curator/pull/117 * Update index.rst by aschilling-nv in https://github.com/NVIDIA/NeMo-Curator/pull/129 * Fix nemo_curator import in CPU only environment when GPU packages are installed. by ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/123 * Improve Common Crawl download by ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/82 * Update README.md by Maghoumi in https://github.com/NVIDIA/NeMo-Curator/pull/126 * Allow multiple filenames per partition when separating by metadata by ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/99 * [REVIEW] Add Resiliparse option for text extraction by sarahyurick in https://github.com/NVIDIA/NeMo-Curator/pull/128 * Fix 69 - Refactor how arguments are added to scripts by miguelusque in https://github.com/NVIDIA/NeMo-Curator/pull/102 * Stricter check for query planning. by ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/107 * Add DataFrame example to Distributed Data Classification tutorial by sarahyurick in https://github.com/NVIDIA/NeMo-Curator/pull/137 * Enable Sem-dedup by VibhuJawa in https://github.com/NVIDIA/NeMo-Curator/pull/130 * Remove lxml installation by ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/140 * Nemotron 340 SDG Pipeline Tutorial by chrisalexiuk-nvidia in https://github.com/NVIDIA/NeMo-Curator/pull/144 * Add Synthetic Data Generation Module by ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/136 * Skip explicit comms shuffle for dask-cuda 24.06 by ayushdg in https://github.com/NVIDIA/NeMo-Curator/pull/147 * Add support for NeMo SDK by ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/131 * [REVIEW] Fix SemDedup bugs by VibhuJawa in https://github.com/NVIDIA/NeMo-Curator/pull/151 * [pre-commit.ci] pre-commit suggestions by pre-commit-ci in https://github.com/NVIDIA/NeMo-Curator/pull/135 * Fix bug with torch rmm and nemo by ryantwolf in https://github.com/NVIDIA/NeMo-Curator/pull/155
New Contributors * ayushdg made their first contribution in https://github.com/NVIDIA/NeMo-Curator/pull/13 * terrykong made their first contribution in https://github.com/NVIDIA/NeMo-Curator/pull/40 * rjzamora made their first contribution in https://github.com/NVIDIA/NeMo-Curator/pull/60 * nicoleeeluo made their first contribution in https://github.com/NVIDIA/NeMo-Curator/pull/30 * aschilling-nv made their first contribution in https://github.com/NVIDIA/NeMo-Curator/pull/104 * pre-commit-ci made their first contribution in https://github.com/NVIDIA/NeMo-Curator/pull/135