|  | Vectors | Tokenizer | Sentencizer | Tagger | Parser | Lemmatizer |
| -- | -- | -- | -- | -- | -- | -- |
| Model | [Word2Vec CBOW `dim=300` `minfreq=10`](https://github.com/oroszgy/hunlp-resources/releases/tag/webcorpuswiki_word2vec_v0.1) | Rule-based, implemented in spaCy | Rule-based | Multi-task CNN | Multi-task CNN | [Lemmy (CST-like)](https://github.com/sorenlind/lemmy/) |
| Training data | Wikipedia dump (2017-04-21) and the [Hungarian Webcorpus](http://mokk.bme.hu/resources/webcorpus/) | - | - | [CoNLL'17 training data](https://github.com/UniversalDependencies/UD_Hungarian-Szeged) | [CoNLL'17 training data](https://github.com/UniversalDependencies/UD_Hungarian-Szeged) | UD-converted Szeged Korpusz |
| Test data | [Hungarian analogical questions](http://corpus.nytud.hu/efnilex-vect/data/questions-words-hu.txt) | [CoNLL'17 test data](https://github.com/UniversalDependencies/UD_Hungarian-Szeged) | [CoNLL'17 test data](https://github.com/UniversalDependencies/UD_Hungarian-Szeged) | [CoNLL'17 test data](https://github.com/UniversalDependencies/UD_Hungarian-Szeged) | [CoNLL'17 test data](https://github.com/UniversalDependencies/UD_Hungarian-Szeged) | [CoNLL'17 test data](https://github.com/UniversalDependencies/UD_Hungarian-Szeged) |
| Accuracy | `ACC` 20.95 | `F1` 99.88 | `F1` 96.64 | `ACC` 95.11 | `UAS` 77.52, `LAS` 68.45 | `ACC` 95.60 |
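A minimal usage sketch exercising the pipeline components summarized above, assuming the model has been installed as a spaCy package (the package name `hu_core_ud_lg` is illustrative; substitute whichever model you installed):

```python
import spacy

# Load the Hungarian pipeline; the package name is illustrative,
# use the model package you actually installed.
nlp = spacy.load("hu_core_ud_lg")

doc = nlp("A kutya az ember legjobb barátja.")

# Tokenizer and sentencizer output
print([token.text for token in doc])
print([sent.text for sent in doc.sents])

# Tagger, parser and lemmatizer annotations per token
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text, token.lemma_)

# Word vectors: tokens carry 300-dimensional vectors
print(doc[0].vector.shape)
```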
`hu_tagger_web_md-0.1.0`: Baseline tagger and parser trained on Universal Dependencies, plus a vocabulary and word vector model generated from the Hungarian Webcorpus and Wikipedia.
Feature | Description
------- | ------------
**Tagger** | 98.23 ACC, trained/tested on the Szeged Corpus (Universal Morphology transcript)
**Word vectors** | word2vec CBOW with 150 dimensions, generated from the Hungarian Webcorpus and Wikipedia
**Brown clusters** | 1024 clusters generated from the Hungarian Webcorpus and Wikipedia
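A brief sketch of querying the tagger, assuming `hu_tagger_web_md` has been installed and linked as a spaCy package (the example sentence is arbitrary):

```python
import spacy

# Assumes the hu_tagger_web_md package is installed and linked.
nlp = spacy.load("hu_tagger_web_md")

doc = nlp("Budapest Magyarország fővárosa.")
for token in doc:
    # Coarse-grained (UPOS) and fine-grained tags produced by the tagger
    print(token.text, token.pos_, token.tag_)
```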
`hu_parser_web_md-0.1.0`: Baseline tagger and parser trained on Universal Dependencies, plus a vocabulary and word vector model generated from the Hungarian Webcorpus and Wikipedia.
Feature | Description
------- | ------------
**Tagger** | 93.95 ACC, trained/tested on the Universal Dependencies corpus
**Parser** | 75.12 UAS and 64.85 LAS, trained/tested on the Universal Dependencies corpus
**Word vectors** | word2vec CBOW with 150 dimensions, generated from the Hungarian Webcorpus and Wikipedia
**Brown clusters** | 1024 clusters generated from the Hungarian Webcorpus and Wikipedia
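A short sketch of reading off the dependency parse, assuming `hu_parser_web_md` is installed as a spaCy package:

```python
import spacy

# Assumes the hu_parser_web_md package is installed and linked.
nlp = spacy.load("hu_parser_web_md")

doc = nlp("A macska felmászott a fára.")
for token in doc:
    # Dependency label and head assigned by the parser
    print(f"{token.text:<12} {token.dep_:<10} head={token.head.text}")
```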
`hu_vectors_web_md-0.1.0`: Vocabulary and word vector model trained on the Hungarian Webcorpus and Wikipedia.
Feature | Description
------- | ------------
Model size | 1360 MB
Pipeline | tokenizer, sentencizer, tagger, parser, lemmatizer, ner
Vectors | 1140008 unique vectors (300 dimensions)
Sources | Universal Dependencies, Szeged Corpus, Web Corpus, Wikipedia, Hunnerwiki, Szeged NER corpora
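A quick sketch of querying the bundled word vectors, assuming `hu_vectors_web_md` (or any of the pipelines above that ship the 300-dimensional vectors) is installed:

```python
import spacy

# hu_vectors_web_md is used for illustration; any package bundling
# the 300-dimensional vectors behaves the same way here.
nlp = spacy.load("hu_vectors_web_md")

kutya = nlp("kutya")[0]
macska = nlp("macska")[0]

print(kutya.vector.shape)        # (300,)
print(kutya.similarity(macska))  # cosine similarity of the two word vectors
```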
2.1.8
Feature | Description
------- | ------------
Model size | 1360 MB
Pipeline | tokenizer, sentencizer, tagger, parser, lemmatizer, ner
Vectors | 1140008 unique vectors (300 dimensions)
Sources | Universal Dependencies, Szeged Corpus, Web Corpus, Wikipedia, Hunnerwiki, Szeged NER corpora
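Releases whose pipeline listing includes `ner` also expose named entities on the processed document; a minimal sketch (the package name and sentence are illustrative):

```python
import spacy

# The package name is illustrative; use a release whose pipeline includes `ner`.
nlp = spacy.load("hu_core_ud_lg")

doc = nlp("Kovács János Budapesten dolgozik a Microsoftnál.")
for ent in doc.ents:
    # Entity span and its predicted label
    print(ent.text, ent.label_)
```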
2.1.0
Feature | Description
------- | ------------
Model size | 1360 MB
Pipeline | tokenizer, sentencizer, tagger, parser, lemmatizer
Vectors | 1140008 unique vectors (300 dimensions)
Sources | Universal Dependencies, Szeged Corpus, Web Corpus, Wikipedia
2.0.0
Feature | Description
------- | ------------
Model size | 1350 MB
Pipeline | tokenizer, sentencizer, tagger, parser, lemmatizer
Vectors | 1140008 unique vectors (300 dimensions)
Sources | Universal Dependencies, Szeged Corpus, Web Corpus, Wikipedia
0.9.0
Changed
- Added support for new models (`hu_core_news_md-v3.5.2`, `hu_core_news_lg-v3.5.2`, `hu_core_news_trf-v3.5.2`, `hu_core_news_trf_xl-v3.5.2`)
- Updated the documentation with `benepar` usage and noun chunking
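A short sketch of the noun chunking mentioned above, assuming one of the listed v3.5.2 models is installed (`hu_core_news_lg` and the sentence are illustrative; for constituency parsing with `benepar`, see the updated documentation):

```python
import spacy

# Assumes one of the v3.5.2 packages listed above is installed;
# hu_core_news_lg is used here for illustration.
nlp = spacy.load("hu_core_news_lg")

doc = nlp("A régi budapesti villamosok lassan haladtak a rakparton.")

# Noun chunking: base noun phrases exposed via doc.noun_chunks
for chunk in doc.noun_chunks:
    print(chunk.text, "->", chunk.root.dep_)
```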