Ir-datasets

Latest version: v0.5.7



0.5.1

What's Changed
* [MINOR FIX / TYPO] Update trec-robust04.yaml by cakiki in https://github.com/allenai/ir_datasets/pull/137
* .z compression support for robust04 by seanmacavaney in https://github.com/allenai/ir_datasets/pull/139
* moving msmarco-passage scoreddocs around by seanmacavaney in https://github.com/allenai/ir_datasets/pull/142
* mmarco updates (files hosted elsewhere & new version of some sources) by seanmacavaney in https://github.com/allenai/ir_datasets/pull/145
* new data available for mmarco (scoreddocs, docpairs, and dev/small) by seanmacavaney in https://github.com/allenai/ir_datasets/pull/146
* added tripclick/train/hofstaetter-triples by seanmacavaney in https://github.com/allenai/ir_datasets/pull/147
* additional versions of msmarco-passage triples by seanmacavaney in https://github.com/allenai/ir_datasets/pull/149
* mMARCO v2 by seanmacavaney in https://github.com/allenai/ir_datasets/pull/150
* Anchor Text for msmarco-document and msmarco-document-v2 by seanmacavaney in https://github.com/allenai/ir_datasets/pull/155
* mmarco source files renamed by seanmacavaney in https://github.com/allenai/ir_datasets/pull/153
* TREC CAsT 2019, 2020 by seanmacavaney in https://github.com/allenai/ir_datasets/pull/156
* HC4 by eugene-yang in https://github.com/allenai/ir_datasets/pull/158
* LoTTE dataset by seanmacavaney in https://github.com/allenai/ir_datasets/pull/159
* kilt by seanmacavaney in https://github.com/allenai/ir_datasets/pull/161
* some trec 2021 qrels released by seanmacavaney in https://github.com/allenai/ir_datasets/pull/162
* some trec 2021 qrels released by seanmacavaney in https://github.com/allenai/ir_datasets/pull/171
* CODEC by seanmacavaney in https://github.com/allenai/ir_datasets/pull/172
* improved HTML/XML parser, TREC 7 and 8 by seanmacavaney in https://github.com/allenai/ir_datasets/pull/173
* fixed and tested issue affecting some clueweb lookups by seanmacavaney in https://github.com/allenai/ir_datasets/pull/174
* cache hc4 topics/qrels by seanmacavaney in https://github.com/allenai/ir_datasets/pull/176
* wikiclir by seanmacavaney in https://github.com/allenai/ir_datasets/pull/178
* NeuCLIR Collection 1 (documents and HC4-filtered subset) by eugene-yang in https://github.com/allenai/ir_datasets/pull/179
* neuMARCO by seanmacavaney in https://github.com/allenai/ir_datasets/pull/181

New Contributors
* cakiki made their first contribution in https://github.com/allenai/ir_datasets/pull/137
* eugene-yang made their first contribution in https://github.com/allenai/ir_datasets/pull/158

**Full Changelog**: https://github.com/allenai/ir_datasets/compare/v0.5.0...v0.5.1

0.5.0

New Features:
- Metadata is included for all datasets, including record counts, without needing to download or process the data.
- New entity type (`qlogs`) for query log records

New datasets:
- argsme & touche (thanks heinrichreimer!)
- aol-ia dataset
- tripclick logs
- trec-dl-2021 qrels (active participants only for now)

Miscellaneous:
- No longer updates the root logger instance, allowing other applications to easily customise logging output from this package
- Updates to documentation
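
Because the package no longer touches the root logger, an application can route the package's messages on its own terms. The following is a minimal sketch using only the standard `logging` module. It assumes the package logs through a named logger called `ir_datasets`; that name and the helper `route_package_logs` are assumptions made for illustration and are not stated in these release notes.

```python
import io
import logging

# Assumption: the package logs through a named logger. "ir_datasets" is a
# guess at that name, not something these release notes confirm.
PACKAGE_LOGGER_NAME = "ir_datasets"


def route_package_logs(stream, level=logging.WARNING):
    """Attach a dedicated handler to the package logger only (hypothetical helper)."""
    logger = logging.getLogger(PACKAGE_LOGGER_NAME)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("[%(name)s] %(levelname)s: %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(level)
    # Keep these records out of whatever handlers the application put on the root logger.
    logger.propagate = False
    return logger


# Usage: capture package messages in a buffer, leaving the root logger untouched.
buffer = io.StringIO()
logger = route_package_logs(buffer)
logger.info("filtered out at WARNING level")
logger.warning("disk space is low")
captured = buffer.getvalue()
```

Disabling propagation is a design choice in this sketch: it keeps the package's output separate from the application's own log stream, which only works cleanly when the library leaves the root logger alone.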

0.4.3

Added:
- `trec-fair-2021/eval` topics
- `clinicaltrials/2021/trec-ct-2021`
- `c4` and `c4/en-noclean-tr/trec-misinfo-2021`
- `wikir/en78k` and `wikir/ens78k`
- `msmarco-passage-v2/trec-dl-2021` and `msmarco-document-v2/trec-dl-2021`
- `mr-tydi`
- `mmarco`

Misc:
- some minor changes to `clean` command
- msmarco-passage-v2 lookups now performed by ID instead of lz4
- file linking info not shown when downloading small files
- fixed `cord19/fulltext`
- other minor fixes

0.4.2

Adds the following datasets:
- MS MARCO Passage version 2
- TREC Fair Ranking 2021

A few other minor improvements:
- Progress bars: units + totals in a few more places
- Checks for adequate disk space before big downloads (can be disabled with an environment variable)
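
A disk-space check of this kind can be sketched with the standard library. This is an illustration of the general idea, not the package's actual implementation: the function `has_enough_space` and the environment variable name `EXAMPLE_SKIP_DISK_CHECK` are hypothetical, since these notes do not name the real variable.

```python
import os
import shutil

# Hypothetical variable name, chosen for this example only.
SKIP_ENV_VAR = "EXAMPLE_SKIP_DISK_CHECK"


def has_enough_space(path, required_bytes, environ=None):
    """Return True if `path` has at least `required_bytes` free,
    or if the check has been disabled through the environment."""
    environ = os.environ if environ is None else environ
    if environ.get(SKIP_ENV_VAR, "").lower() in ("1", "true", "yes"):
        return True  # the user explicitly opted out of the check
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes
```

A caller would run this before starting a large download and stop with a clear message when it returns `False`, rather than failing partway through the transfer.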

0.4.1

- Adds version 2 of the MS MARCO document collection.
- Uses mirror.ir-datasets.com as a fallback for some small files
- More examples in the documentation (the Python API is now joined by the CLI and a PyTerrier example)
- Improved BibTeX, including a master bib file that can be imported into papers (e.g., in Overleaf).
- Other minor improvements
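
The mirror fallback mentioned above follows a common pattern: try the primary source first and move to the next one only on failure. The sketch below shows that pattern in generic form. The function `fetch_with_fallback`, the stand-in `fake_fetch`, and the `example` URLs are all hypothetical and do not reflect how the package itself is implemented.

```python
def fetch_with_fallback(sources, fetch):
    """Try each source in order and return the first successful result.

    `fetch` is any callable that takes a source and either returns the
    content or raises OSError (hypothetical interface for this sketch).
    """
    errors = []
    for source in sources:
        try:
            return fetch(source)
        except OSError as exc:
            errors.append((source, exc))  # remember the failure, try the next source
    raise OSError(f"all sources failed: {errors}")


def fake_fetch(url):
    """Stand-in downloader: the primary host is 'down', the mirror works."""
    if "primary" in url:
        raise OSError("connection refused")
    return f"contents from {url}"


result = fetch_with_fallback(
    ["https://primary.example/file.txt", "https://mirror.example/file.txt"],
    fake_fetch,
)
```

Passing the downloader in as a callable keeps the retry logic independent of any particular HTTP library, which also makes it straightforward to exercise without network access.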

0.4.0

New datasets:
- BEIR suite
- Cranfield
- CLIRMatrix
- DPR-W100
- NQ
- TREC DL Hard
- TREC News
- TripClick

Other:
- Download dashboard
- Improved documentation for non-downloadable datasets
- A beta "more pythonic API"
- Speeding up library load time
- Minor bug fixes, improvements, etc.


© 2024 Safety CLI Cybersecurity Inc. All Rights Reserved.