Releases: JohnSnowLabs/spark-nlp-models
New Russian models and pipelines pack
Russian Models and Pipelines
We are happy to announce Spark NLP pre-trained Russian models and pipelines.
Models:
Model | name | language |
---|---|---|
LemmatizerModel (Lemmatizer) | lemma | ru |
PerceptronModel (POS UD) | pos_ud_gsd | ru |
NerDLModel | wikiner_6B_100 | ru |
NerDLModel | wikiner_6B_300 | ru |
NerDLModel | wikiner_840B_300 | ru |
Pipelines:
Pipeline | name | language |
---|---|---|
Explain Document (Small) | explain_document_sm | ru |
Explain Document (Medium) | explain_document_md | ru |
Explain Document (Large) | explain_document_lg | ru |
Entity Recognizer (Small) | entity_recognizer_sm | ru |
Entity Recognizer (Medium) | entity_recognizer_md | ru |
Entity Recognizer (Large) | entity_recognizer_lg | ru |
Example:
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP
SparkNLP.version()
val pipeline = PretrainedPipeline("explain_document_sm", lang="ru")
val testData = spark.createDataFrame(Seq(
(1, "Пик распространения коронавируса и вызываемой им болезни Covid-19 в Китае прошел, заявил в четверг агентству Синьхуа официальный представитель Госкомитета по гигиене и здравоохранению КНР Ми Фэн.")
)).toDF("id", "text")
val annotation = pipeline.transform(testData)
annotation.show()
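For quick checks on a single string, a pretrained pipeline can also be called without building a DataFrame. The following is a minimal sketch, assuming Spark NLP is on the classpath and the model can be downloaded; the shortened sample sentence is only illustrative, and the output column names are printed rather than assumed because they depend on how the pipeline was built.

```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

// Downloads the Russian pipeline on first use (needs network access).
val nerPipeline = PretrainedPipeline("entity_recognizer_sm", lang = "ru")

// annotate works directly on a String and returns one entry per output column.
val result: Map[String, Seq[String]] =
  nerPipeline.annotate("Пик распространения коронавируса в Китае прошел.")

// List every output column with its annotations.
result.foreach { case (column, values) =>
  println(s"$column: ${values.mkString(", ")}")
}
```

This path is convenient for exploration; for large datasets, `transform` on a DataFrame as in the example above remains the better fit.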
Spark NLP:
- PUBLIC
Last update
12/03/2020
Works with
Spark NLP 2.4.4 and above
New Spanish models and pipelines pack
Spanish Models and Pipelines
We are happy to announce Spark NLP pre-trained Spanish models and pipelines.
Models:
Model | name | language |
---|---|---|
LemmatizerModel (Lemmatizer) | lemma | es |
PerceptronModel (POS UD) | pos_ud_gsd | es |
NerDLModel | wikiner_6B_100 | es |
NerDLModel | wikiner_6B_300 | es |
NerDLModel | wikiner_840B_300 | es |
Pipelines:
Pipeline | name | language |
---|---|---|
Explain Document (Small) | explain_document_sm | es |
Explain Document (Medium) | explain_document_md | es |
Explain Document (Large) | explain_document_lg | es |
Entity Recognizer (Small) | entity_recognizer_sm | es |
Entity Recognizer (Medium) | entity_recognizer_md | es |
Entity Recognizer (Large) | entity_recognizer_lg | es |
Example:
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP
SparkNLP.version()
val pipeline = PretrainedPipeline("explain_document_sm", lang="es")
val testData = spark.createDataFrame(Seq(
(1, "Ésta se convertiría en una amistad de por vida, y Peleo, conociendo la sabiduría de Quirón , más adelante le confiaría la educación de su hijo Aquiles."),
(2, "Durante algo más de 200 años el territorio de la actual Bolivia constituyó la Real Audiencia de Charcas, uno de los centros más prósperos y densamente poblados de los virreinatos españoles.")
)).toDF("id", "text")
val annotation = pipeline.transform(testData)
annotation.show()
Spark NLP:
- PUBLIC
Last update
16/02/2020
Works with
Spark NLP 2.4.0 and above
Introducing new Universal Sentence Encoder models
Universal Sentence Encoder models:
The Universal Sentence Encoder encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering, and other natural language tasks.
The model is trained and optimized for greater-than-word-length text, such as sentences, phrases or short paragraphs. It is trained on a variety of data sources and tasks with the aim of dynamically accommodating a wide range of natural language understanding tasks. The input is variable-length English text and the output is a 512-dimensional vector. We apply this model to the STS benchmark for semantic similarity, and the results can be seen in the example notebook made available. The universal-sentence-encoder model is trained with a deep averaging network (DAN) encoder.
We are very excited to share these 2 new Universal Sentence Encoder models coming from TF Hub:
Model | name | language |
---|---|---|
UniversalSentenceEncoder | tfhub_use | en |
UniversalSentenceEncoder | tfhub_use_lg | en |
Example:
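import com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder
// Load the pretrained English model; it reads the "document" column and writes sentence embeddings.
val useEmbeddings = UniversalSentenceEncoder.pretrained(name="tfhub_use", lang="en")
  .setInputCols("document")
  .setOutputCol("sentence_embeddings")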
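Since the encoder outputs one 512-dimensional vector per text, semantic similarity between two texts is commonly scored with cosine similarity. Below is a minimal plain-Scala sketch of that computation; the `cosineSimilarity` helper and the toy vectors are illustrative stand-ins and are not part of Spark NLP.

```scala
// Cosine similarity between two dense vectors of equal length.
def cosineSimilarity(a: Array[Float], b: Array[Float]): Double = {
  require(a.length == b.length, "vectors must have the same dimension")
  val dot = a.zip(b).map { case (x, y) => x.toDouble * y }.sum
  val normA = math.sqrt(a.map(x => x.toDouble * x).sum)
  val normB = math.sqrt(b.map(x => x.toDouble * x).sum)
  // A zero vector has no direction, so report 0.0 instead of dividing by zero.
  if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
}

// Toy 512-dimensional vectors standing in for real sentence embeddings.
val u = Array.fill(512)(1.0f)
val v = Array.fill(512)(2.0f)                                        // same direction as u
val w = Array.tabulate(512)(i => if (i % 2 == 0) 1.0f else -1.0f)    // orthogonal to u

println(cosineSimilarity(u, v)) // close to 1: same direction
println(cosineSimilarity(u, w)) // 0.0: orthogonal
```

In practice the two arrays would come from the `sentence_embeddings` column produced by the annotator above.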
Spark NLP:
- PUBLIC
Last update
01/02/2020
Works with
Spark NLP 2.4.0 and above
New ICD10 and ICDO Pack: Embeddings, EntityResolver and TextMatcher models
Model or model pack description:
ICD10 and ICDO model pack:
Model | name | language | loc |
---|---|---|---|
TextMatcherModel | textmatch_icdo_ner_n2c4 | en | clinical/models |
TextMatcherModel | textmatch_cpt_token_n2c1 | en | clinical/models |
WordEmbeddingsModel | embeddings_icdoem | en | clinical/models |
EntityResolutionModel | resolve_icd10cm_icdoem | en | clinical/models |
EntityResolutionModel | resolve_icdo_icdoem | en | clinical/models |
EntityResolutionModel | resolve_cpt_icdoem | en | clinical/models |
ChunkEntityResolutionModel | chunkresolve_icdo_icdoem | en | clinical/models |
ChunkEntityResolutionModel | chunkresolve_cpt_icdoem | en | clinical/models |
The textmatch_icdo_ner_n2c4 and textmatch_cpt_token_n2c1 models are Text Matching models trained from comprehensive glossaries for Oncology and Procedural terms.
The embeddings_icdoem WordEmbeddingsModel was trained with a semantically augmented corpus of clinical texts, case studies, and curated datasets.
The resolve_icd10cm_icdoem, resolve_icdo_icdoem and resolve_cpt_icdoem models are EntityResolvers trained with the embeddings_icdoem model and semantically augmented datasets from JSL Data Market.
The chunkresolve_icdo_icdoem and chunkresolve_cpt_icdoem models are ChunkEntityResolvers that connect with the new ChunkEmbeddings annotator.
Spark NLP:
- HEALTHCARE
Last update
26/11/2019
Notes
ChunkEntityResolutionApproach and ChunkEntityResolutionModel are new annotators introduced in Spark NLP 2.3.4.
The main difference with respect to EntityResolutionApproach and EntityResolutionModel is that they expect embeddings from ChunkEmbeddings. This makes WordEmbedding aggregation functions flexible for chunks.
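To illustrate what aggregating word embeddings over a chunk means, here is a minimal plain-Scala sketch that pools the word vectors of one chunk into a single vector. The `poolChunk` helper and its "AVERAGE" / "SUM" strategy names are assumptions made for this illustration and do not reproduce the ChunkEmbeddings implementation.

```scala
// Pool the word vectors of one chunk into a single chunk vector.
// "AVERAGE" and "SUM" are illustrative strategy names for this sketch.
def poolChunk(wordVectors: Seq[Array[Float]], strategy: String): Array[Float] = {
  require(wordVectors.nonEmpty, "a chunk needs at least one word vector")
  val dim = wordVectors.head.length
  require(wordVectors.forall(_.length == dim), "all vectors must share one dimension")
  // Element-wise sum across all word vectors in the chunk.
  val summed = wordVectors.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
  strategy match {
    case "SUM"     => summed
    case "AVERAGE" => summed.map(_ / wordVectors.size)
    case other     => throw new IllegalArgumentException(s"unknown strategy: $other")
  }
}

// Two toy 3-dimensional word vectors for a two-token chunk.
val chunk = Seq(Array(1.0f, 2.0f, 3.0f), Array(3.0f, 2.0f, 1.0f))
println(poolChunk(chunk, "SUM").mkString(", "))
println(poolChunk(chunk, "AVERAGE").mkString(", "))
```

A resolver that consumes one vector per chunk can then compare chunks of any length, whichever aggregation is chosen.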
Works with
Spark NLP 2.3.4 and above
Link
Examples on how to use these models can be found here:
Notebooks
Healthcare Notebooks
New BioNLP NER Model
BioNLP-CG NER model:
The BioNLP Named Entity Recognition (NER) model is the first NER model in the Spark NLP library trained on the Cancer Genetics dataset with a state-of-the-art (SOTA) NER architecture.
The Cancer Genetics (CG) task is an information extraction task organized as part of the BioNLP Shared Task 2013. The CG task aims to advance the automatic extraction of information from statements on the biological processes relating to the development and progression of cancer. Details here: http://2013.bionlp-st.org/tasks/cancer-genetics
There are 16 different entities in this NER model:
Entities |
---|
Gene_or_gene_product |
Organism |
Organ |
Anatomical_system |
Cell |
Multi |
Tissue |
Pathological_formation |
Cancer |
Simple_chemical |
Amino_acid |
Cellular_component |
Organism_subdivision |
Developing_anatomical_structure |
Immaterial_anatomical_entity |
Organism_substance |
Spark NLP:
- HEALTHCARE
- PUBLIC
Last update
27/11/2019
Works with
Spark NLP 2.3.x and above
Link
New BERT Healthcare Embeddings Pack
Model or model pack description:
BioBERT models pack:
We are very excited to share these 5 new BioBERT models with our enterprise users!
Model | name | language | loc |
---|---|---|---|
BertEmbeddingsModel | biobert_pubmed_cased | en | clinical/models |
BertEmbeddingsModel | biobert_pmc_cased | en | clinical/models |
BertEmbeddingsModel | biobert_pubmed_pmc_cased | en | clinical/models |
BertEmbeddingsModel | biobert_clinical_cased | en | clinical/models |
BertEmbeddingsModel | biobert_discharge_cased | en | clinical/models |
The biobert_pubmed_cased, biobert_pmc_cased, and biobert_pubmed_pmc_cased models are based on the pretrained models released with the BioBERT paper: https://arxiv.org/abs/1901.08746
The biobert_clinical_cased and biobert_discharge_cased models come from another amazing release, clinicalBERT, described in this paper: https://www.aclweb.org/anthology/W19-1909/
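As a rough sketch of how the table columns map to code: the name, language and loc values can be passed as the three arguments of `pretrained`. This assumes the models load through the `BertEmbeddings` annotator class and that the environment is already configured with Spark NLP for Healthcare credentials, since `clinical/models` is not a public location; the input and output column names are illustrative.

```scala
import com.johnsnowlabs.nlp.embeddings.BertEmbeddings

// Arguments follow the table columns: name, language, loc.
// Downloading from clinical/models is assumed to require a Healthcare license.
val biobert = BertEmbeddings.pretrained("biobert_pubmed_cased", "en", "clinical/models")
  .setInputCols("sentence", "token") // assumes upstream sentence and token columns
  .setOutputCol("embeddings")
```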
Spark NLP:
- HEALTHCARE
Last update
26/11/2019
Works with
Spark NLP 2.3.1 and above
New BertEmbeddings Models
Model | Name | en |
---|---|---|
BertEmbeddings (base_uncased) | bert_base_uncased | Download |
BertEmbeddings (base_cased) | bert_base_cased | Download |
BertEmbeddings (large_uncased) | bert_large_uncased | Download |
BertEmbeddings (large_cased) | bert_large_cased | Download |
Spark NLP:
- PUBLIC
Last update
24/08/2019
Works with
Spark NLP 2.2.0 and above
New WikiNER models
Models
We have renamed our multi-lingual NerDL models from ner_dl to wikiner_840B_300 in Spark NLP 2.1.0. They are trained on the WikiNER corpus and reach their highest accuracy with the pretrained WordEmbeddings glove_840B_300.
English
Model | Name | en |
---|---|---|
NerDLModel (OntoNotes with GloVe 100d) | onto_100 | Download |
NerDLModel (OntoNotes with GloVe 300d) | onto_300 | Download |
Download |
French
Model | Name | fr |
---|---|---|
NerDLModel (glove_840B_300) | wikiner_840B_300 | Download |
German
Model | Name | de |
---|---|---|
NerDLModel (glove_840B_300) | wikiner_840B_300 | Download |
Italian
Model | Name | it |
---|---|---|
NerDLModel (glove_840B_300) | wikiner_840B_300 | Download |
Multi-language
Model | Name | xx |
---|---|---|
WordEmbeddings (GloVe) | glove_840B_300 | Download |
WordEmbeddings (GloVe) | glove_6B_300 | Download |
WordEmbeddings (BERT) | bert_multi_cased | Download |
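The tables above pair each renamed NerDL model with the embeddings it was trained on. A minimal sketch of loading the French model together with the multi-language GloVe embeddings might look as follows; it assumes Spark NLP 2.1.0 or later on the classpath, network access for the downloads, and upstream "sentence" and "token" columns, whose names are illustrative.

```scala
import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLModel
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel

// Multi-language GloVe embeddings are published under the "xx" language code.
val embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", lang = "xx")
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")

// French NerDL model under its new name; it consumes the embeddings column above.
val ner = NerDLModel.pretrained("wikiner_840B_300", lang = "fr")
  .setInputCols("sentence", "token", "embeddings")
  .setOutputCol("ner")
```

Using embeddings other than the ones named in the model's row would feed the network vectors it was not trained on, so the pairing matters.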
Spark NLP:
- PUBLIC
Last update
03/08/2019
Works with
Spark NLP 2.1.0 and above
New Italian and German pipelines and models
We are happy to announce our new Italian and German pipelines and models. We are also releasing new entity_recognizer_lg and entity_recognizer_md pipelines for Italian and French.
Pipelines
Italian
Pipelines | Name | Language |
---|---|---|
Explain Document Large | explain_document_lg | it |
Explain Document Medium | explain_document_md | it |
Entity Recognizer Large | entity_recognizer_lg | it |
Entity Recognizer Medium | entity_recognizer_md | it |
French
Pipelines | Name | Language |
---|---|---|
Entity Recognizer Large | entity_recognizer_lg | fr |
Entity Recognizer Medium | entity_recognizer_md | fr |
Models
Italian
Model | Name | Language |
---|---|---|
PerceptronModel (POS UD) | pos_ud_isdt | it |
NerDLModel (glove_6B_300 and glove_840B_300) | ner_dl | it |
German
Model | Name | Language |
---|---|---|
LemmatizerModel (Lemmatizer) | lemma | de |
PerceptronModel (POS UD) | pos_ud_hdt | de |
NerDLModel (glove_6B_300 and glove_840B_300) | ner_dl | de |
Dataset
Feature | Description |
---|---|
Lemma | Trained by Lemmatizer annotator on lemmatization-lists by Michal Měchura |
POS | Trained by PerceptronApproach annotator on the Universal Dependencies |
NER | Trained by NerDLApproach annotator with BiLSTM-CNN on the WikiNER corpus and supports the identification of PER, LOC, ORG and MISC entities |
Example
German POS model
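import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
// Load the German POS tagger; it needs sentence and token annotations as input.
val perceptronModel = PerceptronModel.pretrained("pos_ud_hdt", lang="de")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("pos")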
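The POS model expects "sentence" and "token" columns, which earlier pipeline stages have to produce. The sketch below shows one way to wire those stages together; it assumes a spark-shell session (so `spark` exists) with Spark NLP on the classpath, and the sample sentence is made up for illustration.

```scala
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel

// Assumes `spark` is the active SparkSession, as in spark-shell.
import spark.implicits._

// Raw text -> document annotation.
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

// Document -> sentences.
val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

// Sentences -> tokens.
val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

// Pretrained German POS tagger from the table above.
val posTagger = PerceptronModel.pretrained("pos_ud_hdt", lang = "de")
  .setInputCols("sentence", "token")
  .setOutputCol("pos")

val posPipeline = new Pipeline()
  .setStages(Array(documentAssembler, sentenceDetector, tokenizer, posTagger))

val data = Seq("Die Katze schläft auf dem Sofa.").toDF("text")
val tagged = posPipeline.fit(data).transform(data)

// Show tokens next to their predicted tags.
tagged.select("token.result", "pos.result").show(truncate = false)
```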
German NerDL model
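import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLModel
// Load the German NER model; it also needs word embeddings as input.
val ner = NerDLModel.pretrained("ner_dl", lang="de")
  .setInputCols("sentence", "token", "embeddings")
  .setOutputCol("ner")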
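NER models of this kind typically emit one tag per token, so multi-token entities have to be reassembled afterwards. The plain-Scala sketch below groups tokens into PER, LOC, ORG and MISC entities, assuming IOB-style tags such as B-PER and I-PER; the `Entity` case class, the `groupEntities` helper and the sample tags are hypothetical and not part of Spark NLP.

```scala
// One recognized entity: its label (e.g. PER) and the tokens it spans.
case class Entity(label: String, tokens: Seq[String])

// Group (token, tag) pairs into entities, assuming IOB-style tags.
// An I- tag that does not continue an open entity of the same label is ignored.
def groupEntities(tagged: Seq[(String, String)]): Seq[Entity] = {
  // State: entities collected so far, and whether the last one is still open.
  val (entities, _) = tagged.foldLeft((Vector.empty[Entity], false)) {
    case ((acc, _), (token, tag)) if tag.startsWith("B-") =>
      (acc :+ Entity(tag.drop(2), Seq(token)), true)
    case ((acc, true), (token, tag)) if tag.startsWith("I-") && acc.last.label == tag.drop(2) =>
      (acc.init :+ acc.last.copy(tokens = acc.last.tokens :+ token), true)
    case ((acc, _), _) =>
      (acc, false)
  }
  entities
}

// Hand-written sample tags for illustration, not real model output.
val sample = Seq(
  "Angela" -> "B-PER", "Merkel" -> "I-PER", "besuchte" -> "O",
  "Berlin" -> "B-LOC", "." -> "O"
)
groupEntities(sample).foreach(e => println(s"${e.label}: ${e.tokens.mkString(" ")}"))
```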