This repository has been archived by the owner on Aug 12, 2021. It is now read-only.

Releases: JohnSnowLabs/spark-nlp-models

New Russian models and pipelines pack

19 Mar 15:08
eb5e0bc

Russian Models and Pipelines

We are happy to announce Spark NLP pre-trained Russian models and pipelines.

Models:

| Model | Name | Language |
|---|---|---|
| LemmatizerModel (Lemmatizer) | lemma | ru |
| PerceptronModel (POS UD) | pos_ud_gsd | ru |
| NerDLModel | wikiner_6B_100 | ru |
| NerDLModel | wikiner_6B_300 | ru |
| NerDLModel | wikiner_840B_300 | ru |

Pipelines:

| Pipeline | Name | Language |
|---|---|---|
| Explain Document (Small) | explain_document_sm | ru |
| Explain Document (Medium) | explain_document_md | ru |
| Explain Document (Large) | explain_document_lg | ru |
| Entity Recognizer (Small) | entity_recognizer_sm | ru |
| Entity Recognizer (Medium) | entity_recognizer_md | ru |
| Entity Recognizer (Large) | entity_recognizer_lg | ru |

Example:

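import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP

// Check the installed Spark NLP version
SparkNLP.version()

// Download and load the small Russian Explain Document pipeline
val pipeline = PretrainedPipeline("explain_document_sm", lang="ru")

// Build a one-row DataFrame; `spark` is the active SparkSession (e.g. in spark-shell)
val testData = spark.createDataFrame(Seq(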
(1, "Пик распространения коронавируса и вызываемой им болезни Covid-19 в Китае прошел, заявил в четверг агентству Синьхуа официальный представитель Госкомитета по гигиене и здравоохранению КНР Ми Фэн.")
)).toDF("id", "text")

val annotation = pipeline.transform(testData)

annotation.show()
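
For quick checks on a single string, a DataFrame is not strictly needed. The following is a minimal sketch, assuming a spark-shell session with Spark NLP on the classpath and network access to download the pipeline; the variable name `ruPipeline` and the sample sentence are illustrative only. It uses the `annotate` method of `PretrainedPipeline`, which returns the results keyed by output column name.

```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

// Assumption: run inside spark-shell with Spark NLP available and
// network access to download the pretrained pipeline.
val ruPipeline = PretrainedPipeline("explain_document_sm", lang = "ru")

// annotate returns one entry per output column, keyed by column name.
val result: Map[String, Seq[String]] =
  ruPipeline.annotate("Москва является столицей России.")

// Print every output column with its annotation results.
result.foreach { case (column, values) =>
  println(s"$column: ${values.mkString(", ")}")
}
```

Printing the keys this way avoids having to guess which output columns a given pipeline produces.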

Spark NLP:

  • PUBLIC

Last update

12/03/2020

Works with

Spark NLP 2.4.4 and above

New Spanish models and pipelines pack

18 Feb 06:53
9762360

Spanish Models and Pipelines

We are happy to announce Spark NLP pre-trained Spanish models and pipelines.

Models:

| Model | Name | Language |
|---|---|---|
| LemmatizerModel (Lemmatizer) | lemma | es |
| PerceptronModel (POS UD) | pos_ud_gsd | es |
| NerDLModel | wikiner_6B_100 | es |
| NerDLModel | wikiner_6B_300 | es |
| NerDLModel | wikiner_840B_300 | es |

Pipelines:

| Pipeline | Name | Language |
|---|---|---|
| Explain Document (Small) | explain_document_sm | es |
| Explain Document (Medium) | explain_document_md | es |
| Explain Document (Large) | explain_document_lg | es |
| Entity Recognizer (Small) | entity_recognizer_sm | es |
| Entity Recognizer (Medium) | entity_recognizer_md | es |
| Entity Recognizer (Large) | entity_recognizer_lg | es |

Example:

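import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP

// Check the installed Spark NLP version
SparkNLP.version()

// Download and load the small Spanish Explain Document pipeline
val pipeline = PretrainedPipeline("explain_document_sm", lang="es")

// Build a two-row DataFrame; `spark` is the active SparkSession (e.g. in spark-shell)
val testData = spark.createDataFrame(Seq(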
(1, "Ésta se convertiría en una amistad de por vida, y Peleo, conociendo la sabiduría de Quirón , más adelante le confiaría la educación de su hijo Aquiles."),
(2, "Durante algo más de 200 años el territorio de la actual Bolivia constituyó la Real Audiencia de Charcas, uno de los centros más prósperos y densamente poblados de los virreinatos españoles.")
)).toDF("id", "text")

val annotation = pipeline.transform(testData)

annotation.show()

Spark NLP:

  • PUBLIC

Last update

16/02/2020

Works with

Spark NLP 2.4.0 and above

Introducing new Universal Sentence Encoder models

16 Feb 21:14
9229729

Universal Sentence Encoder models:

The Universal Sentence Encoder encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering, and other natural language tasks.

The model is trained and optimized for greater-than-word-length text, such as sentences, phrases, or short paragraphs. It is trained on a variety of data sources and tasks, with the aim of dynamically accommodating a wide variety of natural language understanding tasks. The input is variable-length English text and the output is a 512-dimensional vector. We apply this model to the STS benchmark for semantic similarity; the results can be seen in the example notebook made available. The universal-sentence-encoder model is trained with a deep averaging network (DAN) encoder.
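
Semantic similarity between two encoded texts is commonly measured with cosine similarity. The sketch below is plain Scala and does not depend on Spark NLP; `cosineSimilarity` is a hypothetical helper, and the toy 4-dimensional vectors only stand in for the encoder's real 512-dimensional output.

```scala
// Hypothetical helper: cosine similarity between two embedding vectors.
def cosineSimilarity(a: Array[Float], b: Array[Float]): Double = {
  require(a.length == b.length, "vectors must have the same dimension")
  val dot = a.zip(b).map { case (x, y) => x.toDouble * y }.sum
  val normA = math.sqrt(a.map(x => x.toDouble * x).sum)
  val normB = math.sqrt(b.map(x => x.toDouble * x).sum)
  // A zero vector has no direction, so report no similarity.
  if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
}

// Toy vectors standing in for real sentence embeddings.
val v1 = Array(1.0f, 0.0f, 2.0f, 0.0f)
val v2 = Array(2.0f, 0.0f, 4.0f, 0.0f) // same direction as v1
val v3 = Array(0.0f, 3.0f, 0.0f, 0.0f) // orthogonal to v1

println(cosineSimilarity(v1, v2)) // close to 1.0 for parallel vectors
println(cosineSimilarity(v1, v3)) // 0.0 for orthogonal vectors
```

Values near 1.0 indicate texts the encoder considers similar; values near 0.0 indicate unrelated texts.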

We are very excited to share these 2 new Universal Sentence Encoder models coming from TF Hub:

| Model | Name | Language |
|---|---|---|
| UniversalSentenceEncoder | tfhub_use | en |
| UniversalSentenceEncoder | tfhub_use_lg | en |

Example:

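import com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder

// Load the pretrained English encoder; it reads "document" and writes "sentence_embeddings"
val useEmbeddings = UniversalSentenceEncoder.pretrained(name="tfhub_use", lang="en")
      .setInputCols("document")
      .setOutputCol("sentence_embeddings")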
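
To produce embeddings, the encoder needs a `document` column as input. The following is a minimal end-to-end sketch, assuming a spark-shell session (so `spark` is available) with Spark NLP 2.4.0 or above; the sample sentences, the variable names, and the use of `DocumentAssembler` as the only upstream stage are illustrative assumptions.

```scala
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder
import org.apache.spark.ml.Pipeline

// Turn the raw "text" column into Spark NLP "document" annotations.
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

// Pretrained encoder from the table above.
val encoder = UniversalSentenceEncoder.pretrained(name = "tfhub_use", lang = "en")
  .setInputCols("document")
  .setOutputCol("sentence_embeddings")

val usePipeline = new Pipeline().setStages(Array(documentAssembler, encoder))

// Illustrative input data (not from the release notes).
val sentences = spark.createDataFrame(Seq(
  (1, "The weather is lovely today."),
  (2, "It is sunny and warm outside.")
)).toDF("id", "text")

val embedded = usePipeline.fit(sentences).transform(sentences)

// Each annotation carries its vector in the "embeddings" field.
embedded.select("id", "sentence_embeddings.embeddings").show()
```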

Spark NLP:

  • PUBLIC

Last update

01/02/2020

Works with

Spark NLP 2.4.0 and above

New ICD10 and ICDO Pack: Embeddings, EntityResolver and TextMatcher models

27 Nov 19:33

Model or model pack description:

ICD10 and ICDO model pack:

| Model | Name | Language | Loc |
|---|---|---|---|
| TextMatcherModel | textmatch_icdo_ner_n2c4 | en | clinical/models |
| TextMatcherModel | textmatch_cpt_token_n2c1 | en | clinical/models |
| WordEmbeddingsModel | embeddings_icdoem | en | clinical/models |
| EntityResolutionModel | resolve_icd10cm_icdoem | en | clinical/models |
| EntityResolutionModel | resolve_icdo_icdoem | en | clinical/models |
| EntityResolutionModel | resolve_cpt_icdoem | en | clinical/models |
| ChunkEntityResolutionModel | chunkresolve_icdo_icdoem | en | clinical/models |
| ChunkEntityResolutionModel | chunkresolve_cpt_icdoem | en | clinical/models |

The textmatch_icdo_ner_n2c4 and textmatch_cpt_token_n2c1 models are Text Matching models trained from comprehensive glossaries of Oncology and Procedural terms.

The embeddings_icdoem WordEmbeddingsModel was trained on a semantically augmented corpus of clinical texts, case studies, and curated datasets.

The resolve_icd10cm_icdoem, resolve_icdo_icdoem and resolve_cpt_icdoem models are EntityResolvers trained with the embeddings_icdoem model and semantically augmented datasets from JSL Data Market.
The chunkresolve_icdo_icdoem and chunkresolve_cpt_icdoem models are ChunkEntityResolvers that connect with the new ChunkEmbeddings annotator.

Spark NLP:

  • HEALTHCARE

Last update

26/11/2019

Notes

ChunkEntityResolutionApproach and ChunkEntityResolutionModel are new annotators introduced in Spark NLP 2.3.4.
Unlike EntityResolutionApproach and EntityResolutionModel, they expect embeddings produced by ChunkEmbeddings, which makes the WordEmbeddings aggregation function applied to each chunk configurable.
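
As an illustration of the aggregation step, the public `ChunkEmbeddings` annotator can be configured with a pooling strategy. This is a minimal sketch under stated assumptions: the upstream column names `chunk` and `embeddings` are illustrative, and the wiring of the result into a ChunkEntityResolutionModel is not shown because it depends on the licensed healthcare library.

```scala
import com.johnsnowlabs.nlp.embeddings.ChunkEmbeddings

// Assumed upstream columns: "chunk" (e.g. from a chunker or NER converter)
// and "embeddings" (token-level vectors from a WordEmbeddingsModel).
val chunkEmbeddings = new ChunkEmbeddings()
  .setInputCols("chunk", "embeddings")
  .setOutputCol("chunk_embeddings")
  .setPoolingStrategy("AVERAGE") // "SUM" is the other documented option
```

The output column (`chunk_embeddings` here) would then be the embeddings input of a chunk-level resolver.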

Works with

Spark NLP 2.3.4 and above

Link

Examples on how to use these models can be found here:
Notebooks
Healthcare Notebooks

New BioNLP NER Model

27 Nov 22:17

BioNLP-CG NER model:


The BioNLP Named Entity Recognition (NER) model is the first NER model in the Spark NLP library that is trained on the Cancer Genetics dataset with a state-of-the-art (SOTA) NER architecture.

The Cancer Genetics (CG) task is an information extraction task organized as part of the BioNLP Shared Task 2013. The CG task aims to advance the automatic extraction of information from statements on the biological processes relating to the development and progression of cancer. Details here: http://2013.bionlp-st.org/tasks/cancer-genetics

There are 16 different entities in this NER model:

Entities
Gene_or_gene_product
Organism
Organ
Anatomical_system
Cell
Multi
Tissue
Pathological_formation
Cancer
Simple_chemical
Amino_acid
Cellular_component
Organism_subdivision
Developing_anatomical_structure
Immaterial_anatomical_entity
Organism_substance

Spark NLP:

  • HEALTHCARE
  • PUBLIC

Last update

27/11/2019

Works with

Spark NLP 2.3.x and above

Link

https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/enterprise/healthcare/BioNLP-NER.ipynb

New BERT Healthcare Embeddings Pack

27 Nov 22:13

Model or model pack description:

BioBERT models pack:

We are very excited to share these 5 new BioBERT models with our enterprise users!

| Model | Name | Language | Loc |
|---|---|---|---|
| BertEmbeddingsModel | biobert_pubmed_cased | en | clinical/models |
| BertEmbeddingsModel | biobert_pmc_cased | en | clinical/models |
| BertEmbeddingsModel | biobert_pubmed_pmc_cased | en | clinical/models |
| BertEmbeddingsModel | biobert_clinical_cased | en | clinical/models |
| BertEmbeddingsModel | biobert_discharge_cased | en | clinical/models |

The biobert_pubmed_cased, biobert_pmc_cased, and biobert_pubmed_pmc_cased models are based on the pretrained BioBERT models from the paper: https://arxiv.org/abs/1901.08746
The biobert_clinical_cased and biobert_discharge_cased models come from another release, clinicalBERT, described in the paper: https://www.aclweb.org/anthology/W19-1909/
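
The name, language, and loc columns of the table map onto the three arguments of `pretrained`. The following is a minimal sketch, assuming Spark NLP 2.3.1 or above, valid healthcare credentials for the `clinical/models` location, and that the models load through the `BertEmbeddings` annotator class (the table lists them as BertEmbeddingsModel); the column names are illustrative.

```scala
import com.johnsnowlabs.nlp.embeddings.BertEmbeddings

// Assumption: licensed access to "clinical/models" is already configured.
// Arguments follow the table columns: name, language, loc.
val bioBert = BertEmbeddings.pretrained("biobert_pubmed_cased", "en", "clinical/models")
  .setInputCols("sentence", "token") // assumed upstream column names
  .setOutputCol("embeddings")
```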

Spark NLP:

  • HEALTHCARE

Last update

26/11/2019

Works with

Spark NLP 2.3.1 and above

New BertEmbeddings Models

28 Sep 15:51
732cb25
| Model | Name | en |
|---|---|---|
| BertEmbeddings (base_uncased) | bert_base_uncased | Download |
| BertEmbeddings (base_cased) | bert_base_cased | Download |
| BertEmbeddings (large_uncased) | bert_large_uncased | Download |
| BertEmbeddings (large_cased) | bert_large_cased | Download |

Spark NLP:

  • PUBLIC

Last update

24/08/2019

Works with

Spark NLP 2.2.0 and above

New WikiNER models

13 Jul 18:18
732cb25

Models

In Spark NLP 2.1.0 we renamed our multi-lingual NerDL models from ner_dl to wikiner_840B_300. They are trained on the WikiNER corpus and achieve their highest accuracy with the pretrained glove_840B_300 WordEmbeddings.

English

| Model | Name | en |
|---|---|---|
| NerDLModel (OntoNotes with GloVe 100d) | onto_100 | Download |
| NerDLModel (OntoNotes with GloVe 300d) | onto_300 | Download |

French

| Model | Name | fr |
|---|---|---|
| NerDLModel (glove_840B_300) | wikiner_840B_300 | Download |

German

| Model | Name | de |
|---|---|---|
| NerDLModel (glove_840B_300) | wikiner_840B_300 | Download |

Italian

| Model | Name | it |
|---|---|---|
| NerDLModel (glove_840B_300) | wikiner_840B_300 | Download |

Multi-language

| Model | Name | xx |
|---|---|---|
| WordEmbeddings (GloVe) | glove_840B_300 | Download |
| WordEmbeddings (GloVe) | glove_6B_300 | Download |
| WordEmbeddings (BERT) | bert_multi_cased | Download |
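
A NerDL model has to be paired with the embeddings it was trained on. The following is a minimal sketch for French, assuming a spark-shell session with Spark NLP 2.1.0 or above, and assuming that glove_840B_300 loads through `WordEmbeddingsModel` with language `xx`; the column names and the sample sentence are illustrative.

```scala
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.ner.NerConverter
import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLModel
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

// Multi-language GloVe embeddings named in the French NerDL entry.
val glove = WordEmbeddingsModel.pretrained("glove_840B_300", lang = "xx")
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")

// Renamed French model (formerly ner_dl).
val frenchNer = NerDLModel.pretrained("wikiner_840B_300", lang = "fr")
  .setInputCols("sentence", "token", "embeddings")
  .setOutputCol("ner")

// Merge token-level tags into entity chunks.
val nerConverter = new NerConverter()
  .setInputCols("sentence", "token", "ner")
  .setOutputCol("entities")

val nerPipeline = new Pipeline().setStages(Array(
  documentAssembler, sentenceDetector, tokenizer, glove, frenchNer, nerConverter))

// Illustrative input (not from the release notes).
val frenchData = spark.createDataFrame(Seq(
  (1, "Emmanuel Macron est né à Amiens.")
)).toDF("id", "text")

val annotated = nerPipeline.fit(frenchData).transform(frenchData)
annotated.select("entities.result").show(false)
```

The same layout should apply to the German and Italian models by changing only the `lang` argument of the NerDL model.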

Spark NLP:

  • PUBLIC

Last update

03/08/2019

Works with

Spark NLP 2.1.0 and above

New Italian and German pipelines and models

10 Jun 15:38
732cb25

We are happy to announce our new Italian and German pipelines and models. We are also releasing new entity_recognizer_lg and entity_recognizer_md pipelines for Italian and French.

Pipelines

Italian

| Pipeline | Name | Language |
|---|---|---|
| Explain Document Large | explain_document_lg | it |
| Explain Document Medium | explain_document_md | it |
| Entity Recognizer Large | entity_recognizer_lg | it |
| Entity Recognizer Medium | entity_recognizer_md | it |

French

| Pipeline | Name | Language |
|---|---|---|
| Entity Recognizer Large | entity_recognizer_lg | fr |
| Entity Recognizer Medium | entity_recognizer_md | fr |

Models

Italian

| Model | Name | Language |
|---|---|---|
| PerceptronModel (POS UD) | pos_ud_isdt | it |
| NerDLModel (glove_6B_300 and glove_840B_300) | ner_dl | it |

German

| Model | Name | Language |
|---|---|---|
| LemmatizerModel (Lemmatizer) | lemma | de |
| PerceptronModel (POS UD) | pos_ud_hdt | de |
| NerDLModel (glove_6B_300 and glove_840B_300) | ner_dl | de |

Dataset

| Feature | Description |
|---|---|
| Lemma | Trained with the Lemmatizer annotator on the lemmatization-lists by Michal Měchura |
| POS | Trained with the PerceptronApproach annotator on the Universal Dependencies treebanks |
| NER | Trained with the NerDLApproach annotator (BiLSTM-CNN) on the WikiNER corpus; identifies PER, LOC, ORG and MISC entities |

Example

German POS model

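import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel

// Load the pretrained German part-of-speech tagger
val perceptronModel = PerceptronModel.pretrained("pos_ud_hdt", lang="de")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("pos")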

German NerDL model

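import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLModel

// Load the pretrained German NER model; it also needs word embeddings as input
val ner = NerDLModel.pretrained("ner_dl", lang="de")
    .setInputCols("sentence", "token", "embeddings")
    .setOutputCol("ner")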

New multi-language BERT embeddings

10 Jun 12:49
732cb25

Models

Multi-language

| Model | Name | Language |
|---|---|---|
| WordEmbeddings (BERT) | bert_multi_cased | xx |