|  | Vectors | Tokenizer | Sentencizer | Tagger | Parser | Lemmatizer |
| -- | -- | -- | -- | -- | -- | -- |
| Model | [Word2Vec CBOW `dim=300` `minfreq=10`](https://github.com/oroszgy/hunlp-resources/releases/tag/webcorpuswiki_word2vec_v0.1) | Rule-based, implemented in spaCy | Rule-based | Multi-task CNN | Multi-task CNN | [Lemmy (CST-like)](https://github.com/sorenlind/lemmy/) |
| Training data | Wikipedia dump (2017-04-21) and the [Hungarian Webcorpus](http://mokk.bme.hu/resources/webcorpus/) | - | - | [CoNLL'17 training data](https://github.com/UniversalDependencies/UD_Hungarian-Szeged) | [CoNLL'17 training data](https://github.com/UniversalDependencies/UD_Hungarian-Szeged) | UD-converted Szeged Korpusz |
| Test data | [Hungarian analogical questions](http://corpus.nytud.hu/efnilex-vect/data/questions-words-hu.txt) | [CoNLL'17 test data](https://github.com/UniversalDependencies/UD_Hungarian-Szeged) | [CoNLL'17 test data](https://github.com/UniversalDependencies/UD_Hungarian-Szeged) | [CoNLL'17 test data](https://github.com/UniversalDependencies/UD_Hungarian-Szeged) | [CoNLL'17 test data](https://github.com/UniversalDependencies/UD_Hungarian-Szeged) | [CoNLL'17 test data](https://github.com/UniversalDependencies/UD_Hungarian-Szeged) |
| Accuracy | `ACC` 20.95 | `F1` 99.88 | `F1` 96.64 | `ACC` 95.11 | `UAS` 77.52, `LAS` 68.45 | `ACC` 95.60 |
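A minimal usage sketch exercising the pipeline components summarized above, assuming the model has been installed as a spaCy package (the package name `hu_core_ud_lg` is illustrative; substitute whichever model you installed):

```python
import spacy

# Load the Hungarian pipeline; the package name is illustrative,
# use the model package you actually installed.
nlp = spacy.load("hu_core_ud_lg")

doc = nlp("A kutya az ember legjobb barátja.")

# Tokenizer and sentencizer output
print([token.text for token in doc])
print([sent.text for sent in doc.sents])

# Tagger, parser and lemmatizer annotations per token
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text, token.lemma_)

# Word vectors: tokens carry 300-dimensional vectors
print(doc[0].vector.shape)
```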
`hu_tagger_web_md-0.1.0`: Baseline tagger and parser trained on Universal Dependencies, plus a vocabulary and word vector model generated from the Hungarian Webcorpus and Wikipedia.
Feature | Description
------- | ------------
**Tagger** | 98.23 ACC, trained/tested on the Szeged Corpus (Universal Morphology transcript)
**Word vectors** | word2vec CBOW with 150 dimensions, generated from the Hungarian Webcorpus and Wikipedia
**Brown clusters** | 1024 clusters generated from the Hungarian Webcorpus and Wikipedia
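A brief sketch of querying the tagger, assuming `hu_tagger_web_md` has been installed and linked as a spaCy package (the example sentence is arbitrary):

```python
import spacy

# Assumes the hu_tagger_web_md package is installed and linked.
nlp = spacy.load("hu_tagger_web_md")

doc = nlp("Budapest Magyarország fővárosa.")
for token in doc:
    # Coarse-grained (UPOS) and fine-grained tags produced by the tagger
    print(token.text, token.pos_, token.tag_)
```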
`hu_parser_web_md-0.1.0`: Baseline tagger and parser trained on Universal Dependencies, plus a vocabulary and word vector model generated from the Hungarian Webcorpus and Wikipedia.
Feature | Description
------- | ------------
**Tagger** | 93.95 ACC, trained/tested on the Universal Dependencies corpus
**Parser** | 75.12 UAS and 64.85 LAS, trained/tested on the Universal Dependencies corpus
**Word vectors** | word2vec CBOW with 150 dimensions, generated from the Hungarian Webcorpus and Wikipedia
**Brown clusters** | 1024 clusters generated from the Hungarian Webcorpus and Wikipedia
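A short sketch of reading off the dependency parse, assuming `hu_parser_web_md` is installed as a spaCy package:

```python
import spacy

# Assumes the hu_parser_web_md package is installed and linked.
nlp = spacy.load("hu_parser_web_md")

doc = nlp("A macska felmászott a fára.")
for token in doc:
    # Dependency label and head assigned by the parser
    print(f"{token.text:<12} {token.dep_:<10} head={token.head.text}")
```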
`hu_vectors_web_md-0.1.0`: Vocabulary and word vector model trained on the Hungarian Webcorpus and Wikipedia.
Feature | Description
------- | ------------
Model size | 1360 MB
Pipeline | tokenizer, sentencizer, tagger, parser, lemmatizer, ner
Vectors | 1140008 unique vectors (300 dimensions)
Sources | Universal Dependencies, Szeged Corpus, Web Corpus, Wikipedia, Hunnerwiki, Szeged NER corpora
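A quick sketch of querying the bundled word vectors, assuming `hu_vectors_web_md` (or any of the pipelines above that ship the 300-dimensional vectors) is installed:

```python
import spacy

# hu_vectors_web_md is used for illustration; any package bundling
# the 300-dimensional vectors behaves the same way here.
nlp = spacy.load("hu_vectors_web_md")

kutya = nlp("kutya")[0]
macska = nlp("macska")[0]

print(kutya.vector.shape)        # (300,)
print(kutya.similarity(macska))  # cosine similarity of the two word vectors
```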
2.1.8
Feature | Description
------- | ------------
Model size | 1360 MB
Pipeline | tokenizer, sentencizer, tagger, parser, lemmatizer, ner
Vectors | 1140008 unique vectors (300 dimensions)
Sources | Universal Dependencies, Szeged Corpus, Web Corpus, Wikipedia, Hunnerwiki, Szeged NER corpora
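Releases whose pipeline listing includes `ner` also expose named entities on the processed document; a minimal sketch (the package name and sentence are illustrative):

```python
import spacy

# The package name is illustrative; use a release whose pipeline includes `ner`.
nlp = spacy.load("hu_core_ud_lg")

doc = nlp("Kovács János Budapesten dolgozik a Microsoftnál.")
for ent in doc.ents:
    # Entity span and its predicted label
    print(ent.text, ent.label_)
```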
2.1.0
Feature | Description
------- | ------------
Model size | 1360 MB
Pipeline | tokenizer, sentencizer, tagger, parser, lemmatizer
Vectors | 1140008 unique vectors (300 dimensions)
Sources | Universal Dependencies, Szeged Corpus, Web Corpus, Wikipedia
2.0.0
Feature | Description
------- | ------------
Model size | 1350 MB
Pipeline | tokenizer, sentencizer, tagger, parser, lemmatizer
Vectors | 1140008 unique vectors (300 dimensions)
Sources | Universal Dependencies, Szeged Corpus, Web Corpus, Wikipedia
0.9.0
Changed
- Added support for new models (`hu_core_news_md-v3.5.2`, `hu_core_news_lg-v3.5.2`, `hu_core_news_trf-v3.5.2`, `hu_core_news_trf_xl-v3.5.2`)
- Updated the documentation with `benepar` usage and noun chunking
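A short sketch of the noun chunking mentioned above, assuming one of the listed v3.5.2 models is installed (`hu_core_news_lg` and the sentence are illustrative; for constituency parsing with `benepar`, see the updated documentation):

```python
import spacy

# Assumes one of the v3.5.2 packages listed above is installed;
# hu_core_news_lg is used here for illustration.
nlp = spacy.load("hu_core_news_lg")

doc = nlp("A régi budapesti villamosok lassan haladtak a rakparton.")

# Noun chunking: base noun phrases exposed via doc.noun_chunks
for chunk in doc.noun_chunks:
    print(chunk.text, "->", chunk.root.dep_)
```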