**Refactored Tokenization**
- Faster tokenization speed: batched tokenization is now used for training & inference, so all sentences in a batch are tokenized simultaneously.
- Usage of the `SentencesDataset` is no longer needed for training. You can pass your train examples directly to the DataLoader:
```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample

train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
                  InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
```
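  For context, here is a minimal sketch (not part of the release notes) of how such a DataLoader plugs into training, assuming `model` is an already-created `SentenceTransformer` and that `losses.CosineSimilarityLoss` fits the float labels above:
```python
from sentence_transformers import losses

# Assumption: `model` is a SentenceTransformer created elsewhere
train_loss = losses.CosineSimilarityLoss(model=model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```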
- If you use a custom torch `Dataset` class: the dataset class must now return `InputExample` objects instead of tokenized texts (see the sketch after this list).
- The `SentenceLabelDataset` class has been updated to the new tokenization flow: it always returns two or more `InputExample` objects with the same label.
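To illustrate the custom Dataset point above, here is a minimal, hypothetical sketch (the class name and data layout are made up) of a torch Dataset that returns `InputExample` objects:
```python
from torch.utils.data import Dataset
from sentence_transformers import InputExample

class PairScoreDataset(Dataset):
    """Hypothetical dataset wrapping (text_a, text_b, score) tuples."""
    def __init__(self, pairs):
        self.pairs = pairs

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        text_a, text_b, score = self.pairs[idx]
        # With the new tokenization flow, return an InputExample instead of tokenized texts
        return InputExample(texts=[text_a, text_b], label=score)
```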
**Asymmetric Models**
Added a new `models.Asym` class that allows sentences to be encoded differently depending on a tag (e.g. *query* vs. *paragraph*). Minimal example:
```python
from torch import nn
from sentence_transformers import SentenceTransformer, models

base_model = 'bert-base-uncased'  # any Hugging Face transformer model name or path

word_embedding_model = models.Transformer(base_model, max_seq_length=250)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
d1 = models.Dense(word_embedding_model.get_word_embedding_dimension(), 256, bias=False, activation_function=nn.Identity())
d2 = models.Dense(word_embedding_model.get_word_embedding_dimension(), 256, bias=False, activation_function=nn.Identity())
asym_model = models.Asym({'QRY': [d1], 'DOC': [d2]})
model = SentenceTransformer(modules=[word_embedding_model, pooling_model, asym_model])
```
Your input examples have to look like this:
```python
inp_example = InputExample(texts=[{'QRY': 'your query'}, {'DOC': 'your document text'}], label=1)
```
Encoding (note: mixed inputs are not allowed):
```python
model.encode([{'QRY': 'your query1'}, {'QRY': 'your query2'}])
```
Inputs with the key 'QRY' are passed through the `d1` dense layer, while inputs with the key 'DOC' are passed through the `d2` dense layer.
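As a rough sketch (not part of the release, and assuming the `model` built in the example above) of how the two encodings can then be compared:
```python
from sentence_transformers import util

# Encode queries and documents separately; mixed inputs in one call are not allowed
query_emb = model.encode([{'QRY': 'your query1'}, {'QRY': 'your query2'}], convert_to_tensor=True)
doc_emb = model.encode([{'DOC': 'your document text'}], convert_to_tensor=True)

# Cosine similarity between every query and every document embedding
scores = util.pytorch_cos_sim(query_emb, doc_emb)
```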
More documentation on how to design asymmetric models will follow soon.
**New Namespace & Models for Cross-Encoder**
Cross-Encoder models are now hosted at [https://huggingface.co/cross-encoder](https://huggingface.co/cross-encoder). Also, new [pre-trained models](https://www.sbert.net/docs/pretrained_cross-encoders.html) have been added for NLI & QNLI.
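For reference, a minimal sketch of loading one of these models (the model name below is only an illustration; see the pretrained-models page for the actual names):
```python
from sentence_transformers import CrossEncoder

# Model name is an example/assumption; check the pretrained cross-encoders page for available models
model = CrossEncoder('cross-encoder/nli-distilroberta-base')
scores = model.predict([('A man is eating pizza', 'A man eats something'),
                        ('A man is eating pizza', 'The girl is carrying a baby')])
```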
**Logging**
Log messages now use a custom logger from `logging`, thanks to PR 623. This allows you to control which log messages you want to see from which components.
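For example, you can now adjust the verbosity of the library's components through the standard `logging` module (a sketch; logger names follow the package/module names):
```python
import logging

# Show INFO messages by default, but silence sentence-transformers output below WARNING
logging.basicConfig(format='%(asctime)s - %(levelname)s - %(message)s', level=logging.INFO)
logging.getLogger('sentence_transformers').setLevel(logging.WARNING)
```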
**Unit tests**
Many more unit tests have been added to cover the different components of the framework.