----------------
Span Classifiers for question answering
Albert, Bert, DeBerta, DistilBert, LongFormer, RoBerta and XlmRoBerta based Transformer architectures are now available for question answering, with almost 1000 models available for 35 unique languages, powered by their corresponding Spark NLP XXXForQuestionAnswering annotator classes and offered in various tuning and dataset flavours.
`<lang>.answer_question.<domain>.<datasets>.<annotator_class><tune_info>.by_<username>`
If multiple datasets or tune parameters are defined, they are joined with a `_`.
These substrings define the `<domain>` part of the NLU reference:
- Legal [cuad](https://arxiv.org/abs/2103.06268)
- COVID-19 Biomedical [biosaq](http://bioasq.org/)
- Biomedical Literature [pubmed](https://pubmed.ncbi.nlm.nih.gov/)
- Twitter [tweet](https://aclanthology.org/P19-1496.pdf)
- Wikipedia [wiki](https://www.wikipedia.org/)
- News [news](https://www.microsoft.com/en-us/research/project/newsqa-dataset/)
- Tech [tech](https://arxiv.org/abs/1911.02984)
These substrings define the `<dataset>` part of the NLU reference:
- Arabic [SQUAD ARCD](https://metatext.io/datasets/arabic-reading-comprehension-dataset-(arcd))
- Turkish [TQUAD](https://github.com/TQuad/turkish-nlp-qa-dataset)
- German [GermanQuad](https://arxiv.org/abs/2104.12741)
- Indonesian [AQG](https://github.com/FerdiantJoshua/question-generator)
- Korean [KLUE](https://arxiv.org/abs/2105.09680), [KORQUAD](https://korquad.github.io/)
- Hindi [CHAI](https://www.kaggle.com/competitions/chaii-hindi-and-tamil-question-answering)
- Multi-Lingual [MLQA](https://github.com/facebookresearch/MLQA)
- Multi-Lingual [tydiqa](https://github.com/google-research-datasets/tydiqa)
- Multi-Lingual [xquad](https://arxiv.org/abs/1910.11856)
These substrings also define the `<dataset>` part of the NLU reference:
- Alternative Eval method [reqa](https://arxiv.org/pdf/1907.04780.pdf)
- Synthetic Data [synqa](https://aclanthology.org/2021.emnlp-main.696/)
- Benchmark / Eval Method ABSA-Bench [roberta_absa](https://arxiv.org/abs/2104.04986)
- Arabic architecture type [soqaol](https://arxiv.org/abs/1906.05394)
These substrings define the `<annotator_class>` part, if it does not map directly to a Spark NLP annotator:
- [sci_bert](https://www.aclweb.org/anthology/D19-1371/)
- [electra](https://arxiv.org/abs/2003.10555)
- [mini_lm](https://arxiv.org/abs/2002.10957)
- [covid_bert](https://arxiv.org/abs/2005.07503)
- [bio_bert](https://arxiv.org/abs/1901.08746)
- [indo_bert](https://arxiv.org/abs/2011.00677)
- [muril](https://arxiv.org/abs/2103.10730)
- [sapbert](https://github.com/cambridgeltl/sapbert)
- [bioformer](https://github.com/WGLab/Bioformer)
- [link_bert](https://arxiv.org/abs/2203.15827)
- [mac_bert](https://aclanthology.org/2020.findings-emnlp.58/)
These substrings define the `<tune_info>` part of the NLU reference:
- Train tweaks: `multilingual`, `mini_lm`, `xtremedistiled`, `distilled`, `xtreme`, `augmented`, `zero_shot`
- Size tweaks: `xl`, `xxl`, `large`, `base`, `medium`, `small`, `tiny`, `cased`, `uncased`
- Dimension tweaks: `1024d`, `768d`, `512d`, `256d`, `128d`, `64d`, `32d`
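As an illustration of how these parts compose, here is a minimal sketch that splits a QA reference string into its components. The helper `parse_qa_reference` is hypothetical and not part of the NLU API; it only mirrors the naming scheme described above.

```python
# Illustrative only: split an NLU question-answering reference into its parts.
# parse_qa_reference is a hypothetical helper, not part of the NLU API.
def parse_qa_reference(ref: str) -> dict:
    parts = ref.split('.')
    return {
        'lang': parts[0],                      # e.g. 'en'
        'task': parts[1],                      # 'answer_question'
        'domain_and_datasets': parts[2:-1],    # e.g. ['squadv2']
        'annotator_and_tune_info': parts[-1],  # e.g. 'deberta'
    }

# Reference used in the prediction example below
print(parse_qa_reference('en.answer_question.squadv2.deberta'))
# {'lang': 'en', 'task': 'answer_question',
#  'domain_and_datasets': ['squadv2'], 'annotator_and_tune_info': 'deberta'}
```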
QA Data Format
You need to use one of the data formats below to pass question and context correctly to the model.
```python
import pandas as pd
import nlu

# Use ||| to separate question and context
data = 'What is my name?|||My name is Clara and I live in Berkeley'

# Pass a tuple (question, context)
data = ('What is my name?', 'My name is Clara and I live in Berkeley')

# Use a pandas DataFrame with one 'question' column and one 'context' column
data = pd.DataFrame({
    'question': ['What is my name?'],
    'context': ["My name is Clara and I live in Berkeley"]
})

# Get your answers with any of the above formats
nlu.load("en.answer_question.squadv2.deberta").predict(data)
```
returns:
| answer | answer_confidence | context | question |
|:---------|--------------------:|:---------------------------------------|:-----------------|
| Clara    |            0.994931 | My name is Clara and I live in Berkeley | What is my name? |
----------------
New NLU helper Methods
You can see all features showcased in the [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/colab/spark_nlp_utilities/NLU_utils_for_Spark_NLP.ipynb) notebook or on [the new docs page for Spark NLP utils](https://nlu.johnsnowlabs.com/docs/en/spellbook/utils_for_spark_nlp)
nlu.viz(pipe,data)
Visualize input data with an already configured Spark NLP pipeline,
for algorithms of type NER, Assertion, Relation, Resolution and Dependency,
using [Spark NLP Display](https://nlp.johnsnowlabs.com/docs/en/display).
Automatically infers the applicable viz type and the output columns to use for visualization.
Example:
```python
import nlu
from sparknlp.pretrained import PretrainedPipeline

# Works with Pipeline, LightPipeline, PipelineModel, PretrainedPipeline and List[Annotator]
ade_pipeline = PretrainedPipeline('explain_clinical_doc_ade', 'en', 'clinical/models')
text = """I have an allergic reaction to vancomycin.
My skin has be itchy, sore throat/burning/itchy, and numbness in tongue and gums.
I would not recommend this drug to anyone, especially since I have never had such an adverse reaction to any other medication."""
nlu.viz(ade_pipeline, text)
```
returns:
<img src="https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/docs/assets/images/releases/4_0_0/nlu_utils_viz_example.png" />
If a pipeline has multiple model candidates that can be used for a viz,
the first visualizable annotator is used to create the viz.
You can specify which type of viz to create with the `viz_type` parameter.
Output columns to use for the viz are automatically deduced from the pipeline, by using the
first annotator that provides the correct output type for a specific viz.
You can specify which columns to use for a viz with the
corresponding `ner_col`, `pos_col`, `dep_untyped_col`, `dep_typed_col`, `resolution_col`, `relation_col` and `assertion_col` parameters.
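A minimal sketch of overriding these defaults, reusing `ade_pipeline` and `text` from the example above. The values `'ner'` for `viz_type` and `'ner_chunk'` for `ner_col` are assumptions chosen for illustration; replace them with the viz type and output column your pipeline actually produces.

```python
# Illustrative values: 'ner' and 'ner_chunk' are assumptions,
# use the viz type and column names your pipeline actually provides.
nlu.viz(ade_pipeline, text, viz_type='ner', ner_col='ner_chunk')
```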
nlu.autocomplete_pipeline(pipe)
Auto-complete a pipeline or a single annotator into a runnable pipeline by harnessing NLU's DAG autocompletion algorithm, and return it as an NLU pipeline.
The standard Spark pipeline is available on the `.vanilla_transformer_pipe` attribute of the returned NLU pipe.
Every annotator and every pipeline of annotators defines a `DAG` of tasks, with various dependencies that must be satisfied in `topological order`.
NLU enables the completion of an incomplete DAG by finding or creating a path between
the very first input node, which is almost always `DocumentAssembler`/`MultiDocumentAssembler`,
and the very last node(s), which are given by topologically sorting the iterable of annotators passed as a parameter.
Paths are created by resolving the input features of annotators to the corresponding providers with matching storage references.
Example:
```python
import nlu
from sparknlp_jsl.annotator import RelationExtractionModel

# Let's autocomplete the pipeline for a RelationExtractionModel, which has many input columns and sub-dependencies
re_model = RelationExtractionModel().pretrained("re_ade_clinical", "en", 'clinical/models').setOutputCol('relation')
text = """I have an allergic reaction to vancomycin.
My skin has be itchy, sore throat/burning/itchy, and numbness in tongue and gums.
I would not recommend this drug to anyone, especially since I have never had such an adverse reaction to any other medication."""
nlu_pipe = nlu.autocomplete_pipeline(re_model)
nlu_pipe.predict(text)
```
returns:
| relation | relation_confidence | relation_entity1 | relation_entity2 | relation_entity2_class |
|------------------------------------------------:|-----------------------------------------------------------:|:--------------------------------------------------------|:--------------------------------------------------------|:--------------------------------------------------------------|
| 1 | 1 | allergic reaction | vancomycin | Drug_Ingredient |
| 1 | 1 | skin | itchy | Symptom |
| 1 | 0.99998 | skin | sore throat/burning/itchy | Symptom |
| 1 | 0.956225 | skin | numbness | Symptom |
| 1 | 0.999092 | skin | tongue | External_body_part_or_region |
| 0 | 0.942927 | skin | gums | External_body_part_or_region |
| 1 | 0.806327 | itchy | sore throat/burning/itchy | Symptom |
| 1 | 0.526163 | itchy | numbness | Symptom |
| 1 | 0.999947 | itchy | tongue | External_body_part_or_region |
| 0 | 0.994618 | itchy | gums | External_body_part_or_region |
| 0 | 0.994162 | sore throat/burning/itchy | numbness | Symptom |
| 1 | 0.989304 | sore throat/burning/itchy | tongue | External_body_part_or_region |
| 0 | 0.999969 | sore throat/burning/itchy | gums | External_body_part_or_region |
| 1 | 1 | numbness | tongue | External_body_part_or_region |
| 1 | 1 | numbness | gums | External_body_part_or_region |
| 1 | 1 | tongue | gums | External_body_part_or_region |
nlu.to_pretty_df(pipe,data)
Annotates a Pandas DataFrame, Pandas Series, NumPy array, Spark DataFrame, Python list of strings or Python string
with a given Spark NLP pipeline, which is assumed to be complete and runnable, and returns the result as a pythonic Pandas DataFrame.
Example:
```python
# Works with Pipeline, LightPipeline, PipelineModel, PretrainedPipeline and List[Annotator]
ade_pipeline = PretrainedPipeline('explain_clinical_doc_ade', 'en', 'clinical/models')
text = """I have an allergic reaction to vancomycin.
My skin has be itchy, sore throat/burning/itchy, and numbness in tongue and gums.
I would not recommend this drug to anyone, especially since I have never had such an adverse reaction to any other medication."""

# Output is the same as nlu.autocomplete_pipeline(ade_pipeline).predict(text)
nlu.to_pretty_df(ade_pipeline, text)
```
returns:
| assertion | asserted_entitiy | entitiy_class | assertion_confidence |
|:------------|:------------------------------------|:------------------------------------------|-----------------------------------------------:|
| present | allergic reaction | ADE | 0.998 |
| present | itchy | ADE | 0.8414 |
| present | sore throat/burning/itchy | ADE | 0.9019 |
| present | numbness in tongue and gums | ADE | 0.9991 |
Annotators are grouped internally by NLU into the output levels `token`, `sentence`, `document`, `chunk` and `relation`.
Output columns of annotators at the same level are zipped and exploded together to create the final output DataFrame.
Additionally, most keys from the metadata dictionary in the result annotations are collected and expanded into their own columns in the resulting DataFrame, with special handling for annotators that encode multiple metadata fields inside one field, separated by strings like `|||` or `:::`.
Some columns are omitted from the metadata to reduce the total number of output columns; these can be re-enabled by setting `metadata=True`.
For a given pipeline, the output level is automatically set to the last annotator's output level by default.
This can be changed by calling `to_pretty_df(pipe, text, output_level='my_level')` with one of the levels `token`, `sentence`, `document`, `chunk` or `relation`.
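For example, a minimal sketch of changing the output level and keeping all metadata columns, reusing `ade_pipeline` and `text` from the example above:

```python
# Token-level output with the full metadata columns enabled
# (output_level and metadata are the parameters described above).
token_df = nlu.to_pretty_df(ade_pipeline, text, output_level='token', metadata=True)
print(token_df.columns)
```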
nlu.to_nlu_pipe(pipe)
Convert a pipeline or a list of annotators into an NLU pipeline, making `.predict()` and `.viz()` available for every Spark NLP pipeline.
Assumes the pipeline is already runnable.
```python
# Works with Pipeline, LightPipeline, PipelineModel, PretrainedPipeline and List[Annotator]
ade_pipeline = PretrainedPipeline('explain_clinical_doc_ade', 'en', 'clinical/models')
text = """I have an allergic reaction to vancomycin.
My skin has be itchy, sore throat/burning/itchy, and numbness in tongue and gums.
I would not recommend this drug to anyone, especially since I have never had such an adverse reaction to any other medication."""
nlu_pipe = nlu.to_nlu_pipe(ade_pipeline)

# Same output as nlu.to_pretty_df(pipe, text)
nlu_pipe.predict(text)

# Same output as nlu.viz(pipe, text)
nlu_pipe.viz(text)

# Access the auto-completed Spark NLP big data pipeline (spark_df is your Spark DataFrame)
nlu_pipe.vanilla_transformer_pipe.transform(spark_df)
```
returns:
| assertion | asserted_entitiy | entitiy_class | assertion_confidence |
|:------------|:------------------------------------|:------------------------------------------|-----------------------------------------------:|
| present | allergic reaction | ADE | 0.998 |
| present | itchy | ADE | 0.8414 |
| present | sore throat/burning/itchy | ADE | 0.9019 |
| present | numbness in tongue and gums | ADE | 0.9991 |
and
<img src="https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/docs/assets/images/releases/4_0_0/nlu_utils_viz_example.png" />
---------------
5 new Demo Notebooks
These notebooks showcase some of the latest classifier models for Banking Queries, Intents in Text, Question and News classification.
* [Notebook for Classification of Banking Queries](https://github.com/JohnSnowLabs/nlu/blob/4.0.0/examples/colab/component_examples/classifiers/Banking_Queries_Classification.ipynb)
* [Notebook for Classification of Intent in Texts ](https://github.com/JohnSnowLabs/nlu/blob/4.0.0/examples/colab/component_examples/classifiers/Identify_intent_in_general_text.ipynb)
* [Notebook for classification of Similar Questions ](https://github.com/JohnSnowLabs/nlu/blob/4.0.0/examples/colab/component_examples/classifiers/Question_Pair_Classification.ipynb)
* [Notebook for Classification of Questions vs Statements](https://github.com/JohnSnowLabs/nlu/blob/4.0.0/examples/colab/component_examples/classifiers/Question_vs_Statement.ipynb)
* [Notebook for Classification of News into 4 classes](https://github.com/JohnSnowLabs/nlu/blob/4.0.0/examples/colab/component_examples/classifiers/News_Classification.ipynb)
----------------------
NLU captures every Annotator of Spark NLP and Spark NLP for healthcare
The entire universe of annotators in Spark NLP and Spark NLP for Healthcare is now covered by NLU components, which internally use generalizable annotation extractor methods and configs to enable the new NLU util methods.
The following annotator classes are newly captured:
- AssertionFilterer
- ChunkConverter
- ChunkKeyPhraseExtraction
- ChunkSentenceSplitter
- ChunkFiltererApproach
- ChunkFilterer
- ChunkMapperApproach
- ChunkMapperFilterer
- DocumentLogRegClassifierApproach
- DocumentLogRegClassifierModel
- ContextualParserApproach
- ReIdentification
- NerDisambiguator
- NerDisambiguatorModel
- AverageEmbeddings
- EntityChunkEmbeddings
- ChunkMergeApproach
- IOBTagger
- NerChunker
- NerConverterInternalModel
- DateNormalizer
- PosologyREModel
- RENerChunksFilter
- ResolverMerger
- AnnotationMerger
- Router
- Word2VecApproach
- WordEmbeddings
- EntityRulerApproach
- EntityRulerModel
- TextMatcherModel
- BigTextMatcher
- BigTextMatcherModel
- DateMatcher
- MultiDateMatcher
- RegexMatcher
- TextMatcher
- NerApproach
- NerCrfApproach
- NerOverwriter
- DependencyParserApproach
- TypedDependencyParserApproach
- SentenceDetectorDLApproach
- SentimentDetector
- ViveknSentimentApproach
- ContextSpellCheckerApproach
- NorvigSweetingApproach
- SymmetricDeleteApproach
- ChunkTokenizer
- ChunkTokenizerModel
- RecursiveTokenizer
- RecursiveTokenizerModel
- Token2Chunk
- WordSegmenterApproach
- GraphExtraction
- Lemmatizer
- Normalizer
--------------------
All NLU 4.0 for Healthcare Models
Some examples:
[en.rxnorm.umls.mapping](https://nlp.johnsnowlabs.com/2022/06/27/rxnorm_umls_mapping_en_3_0.html)
Code:
```python
nlu.load('en.rxnorm.umls.mapping').predict('1161611 315677')
```
Results:
| mapped_entity_umls_code_origin_entity | mapped_entity_umls_code |
|-----------------------------------------:|:--------------------------|
| 1161611 | C3215948 |
| 315677 | C0984912 |
[en.ner.clinical_trials_abstracts](https://nlp.johnsnowlabs.com/2022/06/22/ner_clinical_trials_abstracts_en_3_0.html)
Code:
```python
nlu.load('en.ner.clinical_trials_abstracts').predict('A one-year, randomised, multicentre trial comparing insulin glargine with NPH insulin in combination with oral agents in patients with type 2 diabetes.')
```
Results:
| | entities_clinical_trials_abstracts | entities_clinical_trials_abstracts_class | entities_clinical_trials_abstracts_confidence |
|---:|:-------------------------------------|:-------------------------------------------|------------------------------------------------:|
| 0 | randomised | CTDesign | 0.9996 |
| 0 | multicentre | CTDesign | 0.9998 |
| 0 | insulin glargine | Drug | 0.99135 |
| 0 | NPH insulin | Drug | 0.96875 |
| 0 | type 2 diabetes | DisorderOrSyndrome | 0.999933 |
Code:
```python
nlu.load('en.ner.clinical_trials_abstracts').viz('A one-year, randomised, multicentre trial comparing insulin glargine with NPH insulin in combination with oral agents in patients with type 2 diabetes.')
```
Results:
<img src="https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/docs/assets/images/releases/4_0_0/en.ner.clinical_trials_abstracts.png" />
[en.med_ner.pathogen](https://nlp.johnsnowlabs.com/2022/06/28/ner_pathogen_en_3_0.html)
Code:
```python
nlu.load('en.med_ner.pathogen').predict('Racecadotril is an antisecretory medication and it has better tolerability than loperamide. Diarrhea is the condition of having loose, liquid or watery bowel movements each day. Signs of dehydration often begin with loss of the normal stretchiness of the skin. While it has been speculated that rabies virus, Lyssavirus and Ephemerovirus could be transmitted through aerosols, studies have concluded that this is only feasible in limited conditions.')
```
Results:
| | entities_pathogen | entities_pathogen_class | entities_pathogen_confidence |
|---:|:--------------------|:--------------------------|-------------------------------:|
| 0 | Racecadotril | Medicine | 0.9468 |
| 0 | loperamide | Medicine | 0.9987 |
| 0 | Diarrhea | MedicalCondition | 0.9848 |
| 0 | dehydration | MedicalCondition | 0.6307 |
| 0 | rabies virus | Pathogen | 0.95685 |
| 0 | Lyssavirus | Pathogen | 0.9694 |
| 0 | Ephemerovirus | Pathogen | 0.6917 |
Code:
```python
nlu.load('en.med_ner.pathogen').viz('Racecadotril is an antisecretory medication and it has better tolerability than loperamide. Diarrhea is the condition of having loose, liquid or watery bowel movements each day. Signs of dehydration often begin with loss of the normal stretchiness of the skin. While it has been speculated that rabies virus, Lyssavirus and Ephemerovirus could be transmitted through aerosols, studies have concluded that this is only feasible in limited conditions.')
```
Results:
<img src="https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/docs/assets/images/releases/4_0_0/en.med_ner.pathogen.png" />
[es.med_ner.living_species.roberta](https://nlp.johnsnowlabs.com/2022/06/22/ner_living_species_roberta_es_3_0.html)
Code:
```python
nlu.load('es.med_ner.living_species.roberta').predict('Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual.')
```
Results:
| | entities_living_species | entities_living_species_class | entities_living_species_confidence |
|---:|:--------------------------|:--------------------------------|-------------------------------------:|
| 0 | Lactante varón | HUMAN | 0.93175 |
| 0 | familiares | HUMAN | 1 |
| 0 | personales | HUMAN | 1 |
| 0 | neonatal | HUMAN | 0.9997 |
| 0 | legumbres | SPECIES | 0.9962 |
| 0 | lentejas | SPECIES | 0.9988 |
| 0 | garbanzos | SPECIES | 0.9901 |
| 0 | legumbres | SPECIES | 0.9976 |
| 0 | madre | HUMAN | 1 |
| 0 | Cacahuete | SPECIES | 0.998 |
| 0 | padres | HUMAN | 1 |
Code:
```python
nlu.load('es.med_ner.living_species.roberta').viz('Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual.')
```
Results:
<img src="https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/docs/assets/images/releases/4_0_0/es.med_ner.living_species.roberta.png" />