Dolma

Latest version: v1.0.3



1.0.3

What's Changed
* Fix local shuffling failure by soldni in https://github.com/allenai/dolma/pull/140
* Fix issue in getting started tutorial using wikipedia data by RohitRathore1 in https://github.com/allenai/dolma/pull/117
* Add an option to improve tokenization shuffling by soldni in https://github.com/allenai/dolma/pull/141
* Optionally add total/sum to output of analyzer by soldni in https://github.com/allenai/dolma/pull/144
* Add extra tests for multi-byte unicode spans in deduper. by soldni in https://github.com/allenai/dolma/pull/145
* Bump s3 client lib and parameterize region in s3 tests + devcontainer by undfined in https://github.com/allenai/dolma/pull/147

New Contributors
* RohitRathore1 made their first contribution in https://github.com/allenai/dolma/pull/117
* undfined made their first contribution in https://github.com/allenai/dolma/pull/147

**Full Changelog**: https://github.com/allenai/dolma/compare/v1.0.2...v1.0.3

1.0.2

What's Changed
* Taggers for URL filtering by soldni in https://github.com/allenai/dolma/pull/112
* Updated CFF and Bibtex by soldni in https://github.com/allenai/dolma/pull/118
* Add preliminary Dolma v1.7 configurations, fix corner case in tokens. by soldni in https://github.com/allenai/dolma/pull/120
* Update CITATION.cff by soldni in https://github.com/allenai/dolma/pull/126
* Option to use ngram overlap to dedupe paragraphs by rodneykinney in https://github.com/allenai/dolma/pull/122
* Tagger modules import (fix for #128) by soldni in https://github.com/allenai/dolma/pull/129
* Added Support for JQ syntax in include/exclude mixer config by soldni in https://github.com/allenai/dolma/pull/131
* Added JQ syntax for replacements + added minimum score. by soldni in https://github.com/allenai/dolma/pull/133
* Bump the cargo group with 1 update by dependabot in https://github.com/allenai/dolma/pull/132
* Improves tool to compute statistics; adds deduplication options. by soldni in https://github.com/allenai/dolma/pull/135
* use precompiled regex when loading url blocklists by peterbjorgensen in https://github.com/allenai/dolma/pull/137


**Full Changelog**: https://github.com/allenai/dolma/compare/v1.0.1...v1.0.2

1.0.1

What's Changed
* Update README.md by eltociear in https://github.com/allenai/dolma/pull/115
* Do not overwrite tagger outputs with the same output path, fixes #113 by peterbjorgensen in https://github.com/allenai/dolma/pull/114
* Fix broken data sheet link in README by simonw in https://github.com/allenai/dolma/pull/107
* Modify CI to build when version is incremented; increment to v1.0.1 by soldni in https://github.com/allenai/dolma/pull/116

New Contributors
* eltociear made their first contribution in https://github.com/allenai/dolma/pull/115
* simonw made their first contribution in https://github.com/allenai/dolma/pull/107

**Full Changelog**: https://github.com/allenai/dolma/compare/v1.0.0...v1.0.1

1.0.0

What's Changed
* Add robust median to gopher filter by KennethEnevoldsen in https://github.com/allenai/dolma/pull/98
* Disambiguating that the repo is for the dolma toolkit in various docs by arnavic in https://github.com/allenai/dolma/pull/104
* V1.0 candidate; new deduper options, new taggers by soldni in https://github.com/allenai/dolma/pull/100
* Fixing Errors in Linux Build by soldni in https://github.com/allenai/dolma/pull/105

New Contributors
* KennethEnevoldsen made their first contribution in https://github.com/allenai/dolma/pull/98
* arnavic made their first contribution in https://github.com/allenai/dolma/pull/104

**Full Changelog**: https://github.com/allenai/dolma/compare/v0.9.4...v1.0.0

0.9.4

What's Changed
* Bump h2 from 0.3.20 to 0.3.24 by dependabot in https://github.com/allenai/dolma/pull/101
* BOS/EOS/PAD options in `tokens` cli; speed up tokenization by segmenting paragraphs. by soldni in https://github.com/allenai/dolma/pull/102
* Fixed Dangling CLI Options; E2E Tokenizer Tests by soldni in https://github.com/allenai/dolma/pull/103


**Full Changelog**: https://github.com/allenai/dolma/compare/v0.9.2...v0.9.4

0.9.2

What's Changed
* Remove unnecessary spawn in tokenizer, fix config with multiple paths by soldni in https://github.com/allenai/dolma/pull/67
* Add tagger_modules option to tagger cli by peterbjorgensen in https://github.com/allenai/dolma/pull/69
* Feature to get the complement of a hash sample by IanMagnusson in https://github.com/allenai/dolma/pull/72
* Fix Hardcoded Tokenizer by soldni in https://github.com/allenai/dolma/pull/71
* Fix a few issues of the FixedBucketsValTracker by peterbjorgensen in https://github.com/allenai/dolma/pull/73
* Add attribute correlations by Muennighoff in https://github.com/allenai/dolma/pull/68
* Porting missing code filtering rules to dolma repo by soldni in https://github.com/allenai/dolma/pull/86
* Disable cache in CI to prevent build failures by soldni in https://github.com/allenai/dolma/pull/90
* Reddit processing code by drschwenk in https://github.com/allenai/dolma/pull/74
* update readme by kyleclo in https://github.com/allenai/dolma/pull/95
* code/reasoning evaluation script by benbogin in https://github.com/allenai/dolma/pull/94
* Add The Stack statistics by Muennighoff in https://github.com/allenai/dolma/pull/92
* Fixing Build Config Issues by soldni in https://github.com/allenai/dolma/pull/99

New Contributors
* peterbjorgensen made their first contribution in https://github.com/allenai/dolma/pull/69
* IanMagnusson made their first contribution in https://github.com/allenai/dolma/pull/72
* drschwenk made their first contribution in https://github.com/allenai/dolma/pull/74
* benbogin made their first contribution in https://github.com/allenai/dolma/pull/94

**Full Changelog**: https://github.com/allenai/dolma/compare/v0.9.1...v0.9.2
