# gpl

Latest version: v0.1.4

## 0.1.4

In this version, we have added easier-to-understand hints when certain exceptions are thrown.

## 0.1.3

Previously, there was a conflict between `easy_elasticsearch` and `beir` on the `elasticsearch` dependency:
- `easy_elasticsearch` requires `elasticsearch==7.12.1`, while
- `beir` requires `elasticsearch==7.9.1`.

In the latest version of `easy_elasticsearch`, the requirements have been changed to resolve this conflict, and we have updated `gpl` to install that version (`easy_elasticsearch==0.0.9`). Another fix in `easy_elasticsearch==0.0.9` is that ES could previously return empty results (because `refresh` was not called during indexing).

## 0.1.0

### Updated paper, accepted by NAACL 2022
The GPL paper has been accepted by NAACL 2022! Major updates:
- Improved the setting: down-sample the corpus if it is too large, and calculate the number of generated queries according to the corpus size;
- Added more analysis about the influence of the number of generated queries: small corpora need more queries;
- Added results on the full 18 BeIR datasets: the conclusions remain the same, and we also tried training GPL on top of the powerful TAS-B model and achieved new improvements.

### Automatic hyper-parameters
Previously, we used the whole corpus and set the number of generated queries per passage to 3, regardless of the corpus size. This results in very poor training efficiency for large corpora. In the new version, we set these two hyper-parameters automatically so that the total number of generated queries is around 250K.
> In detail, we require queries_per_passage >= 3 and uniformly down-sample the corpus if 3 × |C| > 250K, where |C| is the corpus size; otherwise we set queries_per_passage = ⌈250K / |C|⌉. For example, the queries_per_passage values for FiQA (original size = 57.6K) and Robust04 (original size = 528.2K) are 5 and 3, respectively, and the Robust04 corpus is down-sampled to 83.3K.
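
For illustration, the rule amounts to something like the following (a minimal sketch; the function and constant names are ours, not part of the gpl API):

```python
import math

TOTAL_QUERIES = 250_000  # target total number of generated queries
MIN_QPP = 3              # minimum queries_per_passage

def auto_hyperparameters(corpus_size: int) -> tuple:
    """Return (queries_per_passage, effective corpus size)."""
    if MIN_QPP * corpus_size > TOTAL_QUERIES:
        # Corpus too large: uniformly down-sample so that 3 * |C| is ~250K.
        return MIN_QPP, TOTAL_QUERIES // MIN_QPP
    return math.ceil(TOTAL_QUERIES / corpus_size), corpus_size

print(auto_hyperparameters(57_600))   # FiQA:     (5, 57600)
print(auto_hyperparameters(528_200))  # Robust04: (3, 83333)
```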

### Released checkpoints (TAS-B ones)
We have now released the pre-trained GPL models in the Hugging Face organization https://huggingface.co/GPL. They also include the powerful GPL models trained on top of [TAS-B](https://huggingface.co/sentence-transformers/msmarco-distilbert-base-tas-b).
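
The released checkpoints are regular SentenceTransformers models and can be loaded directly; the model id below is illustrative, so please browse the organization page for the actual names:

```python
from sentence_transformers import SentenceTransformer

# The model id is illustrative; see https://huggingface.co/GPL for the
# actual released checkpoint names.
model = SentenceTransformer("GPL/msmarco-distilbert-margin-mse")
embeddings = model.encode(["What is generative pseudo labeling?"])
print(embeddings.shape)
```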

## 0.0.9

### Fixed bug of max.-sequence-length mismatch between student and teacher
Previously, the teacher (i.e. the cross-encoder) received the concatenation of the query and document texts with no limit on the max. sequence length (cf. [here](https://github.com/UKPLab/gpl/blob/8724d2d71c1a0ab7790db0637bba3eee73a6f068/gpl/toolkit/pl.py#L21) and [here](https://github.com/UKPLab/gpl/blob/8724d2d71c1a0ab7790db0637bba3eee73a6f068/gpl/toolkit/pl.py#L42)). However, the student applied max.-sequence-length limits to the query texts and the document texts **separately**. This caused a mismatch between the information visible to the student and to the teacher.

In the new release, we fixed this with "[retokenization](https://github.com/UKPLab/gpl/blob/9c17ecdc7aa6d2b7b34068d50548921bbdbdaac7/gpl/toolkit/pl.py#L34)": [right before pseudo labeling](https://github.com/UKPLab/gpl/blob/9c17ecdc7aa6d2b7b34068d50548921bbdbdaac7/gpl/toolkit/pl.py#L69), the tokenizer of the teacher model tokenizes the query texts and the document texts separately and then decodes the token IDs back into texts. The resulting texts meet the same max.-sequence-length requirements as the student's inputs, which fixes the bug.
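
The idea can be sketched as follows (a simplified version of the linked code; the checkpoint name and the length limits are examples, not necessarily gpl's exact defaults):

```python
from transformers import AutoTokenizer

# Teacher tokenizer; any cross-encoder tokenizer works the same way here.
tokenizer = AutoTokenizer.from_pretrained("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retokenize(text: str, max_length: int) -> str:
    """Truncate `text` to `max_length` tokens and decode it back to a string."""
    token_ids = tokenizer(
        text, truncation=True, max_length=max_length, add_special_tokens=False
    )["input_ids"]
    return tokenizer.decode(token_ids)

query = retokenize("how does gpl train dense retrievers?", max_length=64)
doc = retokenize("GPL is an unsupervised domain adaptation method ...", max_length=300)
# The teacher now scores (query, doc) truncated exactly like the student's input.
```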

### Keep full precision of the pseudo labels
Previously, we saved the pseudo labels from PyTorch's tensors directly, which did not preserve full precision. We have fixed this by [calling `labels.tolist()`](https://github.com/UKPLab/gpl/blob/9c17ecdc7aa6d2b7b34068d50548921bbdbdaac7/gpl/toolkit/pl.py#L75) right before dumping the data. In practice the difference is small, since the previous six-digit precision was already high.
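
A toy illustration of the difference (the exact serialization in the old code differed, keeping about six digits):

```python
import torch

label = torch.tensor([1.2345678])
print(str(label))      # tensor([1.2346])       <- formatted output, precision lost
print(label.tolist())  # [1.2345677614212036]   <- the full float32 value
```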

## 0.0.8

### Independent evaluation and `k_values` supported
One can now run `gpl.toolkit.evaluation` directly. Previously, this was only possible as part of the whole `gpl.train` workflow. Please check [this example](https://github.com/UKPLab/gpl/tree/main/gpl/toolkit#evaluation) for more details.

We have also added a `k_values` argument to `gpl.toolkit.evaluation.evaluate` for specifying the K values in "nDCG@K", "Recall@K", etc.
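
A hypothetical invocation (the argument names below are assumptions based on these notes; check the toolkit readme for the exact signature):

```python
from gpl.toolkit.evaluation import evaluate

# Argument names are assumptions for illustration, not a verified signature.
evaluate(
    data_path="./generated/fiqa",        # BeIR-formatted dataset directory
    output_dir="./evaluation/fiqa",
    model_name_or_path="GPL/msmarco-distilbert-margin-mse",  # illustrative id
    k_values=[10, 100],                  # controls the K in nDCG@K, Recall@K, ...
)
```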

### Fixed bugs & use `load_sbert` in `mnrl` and evaluation
Now almost all methods that require a separation token have an argument called `sep` (previously it was fixed to a blank token `" "`). The two exceptions are `mnrl` (a loss function in the [SBERT repo](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MultipleNegativesRankingLoss.py), also the default training loss for the QGen method) and `qgen`, since they come from the BeIR repo (we will update the BeIR repo in the future if possible).

## 0.0.7

### Rewrite SBERT loading
Previously, GPL loaded starting checkpoints (`--base_ckpt`) by [constructing an SBERT model from scratch](https://www.sbert.net/docs/training/overview.html?highlight=projection#creating-networks-from-scratch). This approach loses some information stored with the checkpoint (e.g. pooling and max_seq_length), so one had to specify these settings carefully.

Now we have created a method called [`load_sbert`](https://github.com/UKPLab/gpl/blob/920c0e82177ceab3a0e574f68b1184ad9a16995c/gpl/toolkit/sbert.py#L15). It uses `SentenceTransformer(base_ckpt)` to load the checkpoint directly and performs some checks & assertions. Loading from a Huggingface-format checkpoint (e.g. "distilbert-base-uncased") is still possible in many cases, **but we recommend loading from an SBERT-format checkpoint whenever possible**, since this makes it less likely to misuse the starting checkpoint.
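
Roughly, the two loading paths compare as follows (standard sentence-transformers API; `load_sbert` adds its checks on top of the direct form):

```python
from sentence_transformers import SentenceTransformer, models

# Old path: build an SBERT model from scratch around a Huggingface checkpoint.
# Pooling and max_seq_length must be specified manually, and any values stored
# with the checkpoint are ignored.
word_embedding = models.Transformer("distilbert-base-uncased", max_seq_length=350)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="mean")
model_from_scratch = SentenceTransformer(modules=[word_embedding, pooling])

# New path: load the checkpoint directly, so an SBERT-format checkpoint keeps
# its stored pooling and max_seq_length.
model_direct = SentenceTransformer("sentence-transformers/msmarco-distilbert-base-tas-b")
```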

### Reformatting examples
In some cases, a Huggingface-format checkpoint cannot be loaded directly by SBERT, e.g. "facebook/dpr-question_encoder-single-nq-base". This is because:
1. they are **not in SBERT format** but in Huggingface format; and
2. **for Huggingface-format checkpoints**, SBERT can only work with models whose **last layer is a Transformer layer**, i.e. whose outputs contain hidden states with shape `(batch_size, sequence_length, hidden_dimension)`.

To use such checkpoints, one needs to reformat them into SBERT format first. We have provided two examples/templates in the new toolkit source file, [gpl/toolkit/reformat.py](gpl/toolkit/reformat.py). Please refer to its readme [here](https://github.com/UKPLab/gpl/tree/main/gpl/toolkit#reformat).
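
The generic pattern is to wrap the Huggingface checkpoint in SBERT modules and save the result (a sketch only; DPR-style encoders need the extra handling shown in reformat.py):

```python
from sentence_transformers import SentenceTransformer, models

# Wrap a Huggingface checkpoint whose last layer is a Transformer layer into
# SBERT format and save it; the saved directory can then be loaded later
# with SentenceTransformer("./my-sbert-checkpoint").
word_embedding = models.Transformer("distilbert-base-uncased", max_seq_length=350)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="cls")
sbert = SentenceTransformer(modules=[word_embedding, pooling])
sbert.save("./my-sbert-checkpoint")
```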

### Solved logging bug
Previously, the logging in GPL was overridden by other loggers and the formatting could not be displayed as intended. We have now solved this by configuring the root logger. The new format shows many useful details:
```python
fmt='[%(asctime)s] %(levelname)s [%(name)s.%(funcName)s:%(lineno)d] %(message)s'
```
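
A standard-library sketch of the root-logger fix (not the exact code in gpl):

```python
import logging

fmt = '[%(asctime)s] %(levelname)s [%(name)s.%(funcName)s:%(lineno)d] %(message)s'

# Configuring the root logger makes child loggers (including gpl's) inherit
# this format instead of whatever handlers other libraries install first.
logging.basicConfig(format=fmt, level=logging.INFO, force=True)
logging.getLogger("gpl.train").info("logging configured")
```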
