Releases · JohnSnowLabs/spark-nlp-models

This repository has been archived by the owner on Aug 12, 2021. It is now read-only.

19 Mar 15:08

maziyarpanahi

2.4.4-russian-pack

eb5e0bc

New Russian models and pipelines pack

Russian Models and Pipelines

We are happy to announce Spark NLP pre-trained Russian models and pipelines.

Models:

Model	name	language
LemmatizerModel (Lemmatizer)	`lemma`	`ru`
PerceptronModel (POS UD)	`pos_ud_gsd`	`ru`
NerDLModel	`wikiner_6B_100`	`ru`
NerDLModel	`wikiner_6B_300`	`ru`
NerDLModel	`wikiner_840B_300`	`ru`

Pipelines:

Pipeline	name	language
Explain Document (Small)	`explain_document_sm`	`ru`
Explain Document (Medium)	`explain_document_md`	`ru`
Explain Document (Large)	`explain_document_lg`	`ru`
Entity Recognizer (Small)	`entity_recognizer_sm`	`ru`
Entity Recognizer (Medium)	`entity_recognizer_md`	`ru`
Entity Recognizer (Large)	`entity_recognizer_lg`	`ru`

Example:

import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP

SparkNLP.version()

val pipeline = PretrainedPipeline("explain_document_sm", lang="ru")

val testData = spark.createDataFrame(Seq(
(1, "Пик распространения коронавируса и вызываемой им болезни Covid-19 в Китае прошел, заявил в четверг агентству Синьхуа официальный представитель Госкомитета по гигиене и здравоохранению КНР Ми Фэн.")
)).toDF("id", "text")

val annotation = pipeline.transform(testData)

annotation.show()
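
For a quick check on a single string, the same pretrained pipeline can also be called without building a DataFrame. The following is a minimal sketch, assuming a `spark-shell` session with Spark NLP 2.4.4 on the classpath. The shortened input sentence is only illustrative, and the output keys depend on the stages inside the pipeline, so they are printed rather than assumed.

```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

// Same pipeline as in the example above; the models are downloaded on first use.
val pipeline = PretrainedPipeline("explain_document_sm", lang = "ru")

// annotate() runs the pipeline on a plain string and returns
// a map from output column name to the annotation results.
val result: Map[String, Seq[String]] =
  pipeline.annotate("Пик распространения коронавируса в Китае прошел.")

// Print every output column with its values instead of assuming column names.
result.foreach { case (column, values) =>
  println(s"$column: ${values.mkString(", ")}")
}
```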

Spark NLP:

PUBLIC

Last update

12/03/2020

Works with

Spark NLP 2.4.4 and above


18 Feb 06:53

maziyarpanahi

2.4.1-spanish-pack

9762360

New Spanish models and pipelines pack

Spanish Models and Pipelines

We are happy to announce Spark NLP pre-trained Spanish models and pipelines.

Models:

Model	name	language
LemmatizerModel (Lemmatizer)	`lemma`	`es`
PerceptronModel (POS UD)	`pos_ud_gsd`	`es`
NerDLModel	`wikiner_6B_100`	`es`
NerDLModel	`wikiner_6B_300`	`es`
NerDLModel	`wikiner_840B_300`	`es`

Pipelines:

Pipeline	name	language
Explain Document (Small)	`explain_document_sm`	`es`
Explain Document (Medium)	`explain_document_md`	`es`
Explain Document (Large)	`explain_document_lg`	`es`
Entity Recognizer (Small)	`entity_recognizer_sm`	`es`
Entity Recognizer (Medium)	`entity_recognizer_md`	`es`
Entity Recognizer (Large)	`entity_recognizer_lg`	`es`

Example:

import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP

SparkNLP.version()

val pipeline = PretrainedPipeline("explain_document_sm", lang="es")

val testData = spark.createDataFrame(Seq(
(1, "Ésta se convertiría en una amistad de por vida, y Peleo, conociendo la sabiduría de Quirón , más adelante le confiaría la educación de su hijo Aquiles."),
(2, "Durante algo más de 200 años el territorio de la actual Bolivia constituyó la Real Audiencia de Charcas, uno de los centros más prósperos y densamente poblados de los virreinatos españoles.")
)).toDF("id", "text")

val annotation = pipeline.transform(testData)

annotation.show()

Spark NLP:

PUBLIC

Last update

16/02/2020

Works with

Spark NLP 2.4.0 and above


16 Feb 21:14

maziyarpanahi

2.4.0-universal-sentence-encoder

9229729

Introducing new Universal Sentence Encoder models

Universal Sentence Encoder models:

The Universal Sentence Encoder encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering, and other natural language tasks.

The model is trained and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs. It is trained on a variety of data sources and a variety of tasks with the aim of dynamically accommodating a wide variety of natural language understanding tasks. The input is variable length English text and the output is a 512 dimensional vector. We apply this model to the STS benchmark for semantic similarity, and the results can be seen in the example notebook made available. The universal-sentence-encoder model is trained with a deep averaging network (DAN) encoder.

We are very excited to share these 2 new Universal Sentence Encoder models coming from TF Hub:

Model	name	language
UniversalSentenceEncoder	`tfhub_use`	en
UniversalSentenceEncoder	`tfhub_use_lg`	en

Example:

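import com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder

// Load the pretrained English model and read from the "document" column.
val useEmbeddings = UniversalSentenceEncoder.pretrained(name = "tfhub_use", lang = "en")
  .setInputCols("document")
  .setOutputCol("sentence_embeddings")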
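
Semantic similarity between two sentence vectors is commonly scored with cosine similarity. The sketch below is plain Scala and does not call Spark NLP. The helper name `cosineSimilarity` is hypothetical, and the short vectors are made-up stand-ins for the 512-dimensional output, not real model output.

```scala
// Cosine similarity between two embedding vectors of equal length.
// Returns 0.0 when either vector has zero norm.
def cosineSimilarity(a: Array[Float], b: Array[Float]): Double = {
  require(a.length == b.length, "vectors must have the same dimension")
  val dot = a.zip(b).map { case (x, y) => x.toDouble * y.toDouble }.sum
  val normA = math.sqrt(a.map(x => x.toDouble * x.toDouble).sum)
  val normB = math.sqrt(b.map(x => x.toDouble * x.toDouble).sum)
  if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
}

// Toy vectors standing in for two sentence embeddings (illustrative values only).
val first = Array(0.2f, 0.1f, 0.7f)
val second = Array(0.3f, 0.0f, 0.6f)
println(cosineSimilarity(first, second))
```

Values near 1 indicate vectors pointing in nearly the same direction. Whether a given score counts as "similar" depends on the task and is not specified here.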

Spark NLP:

PUBLIC

Last update

01/02/2020

Works with

Spark NLP 2.4.0 and above


27 Nov 19:33

maziyarpanahi

2.3.4-icd-embeddings

fc8a8d0

New ICD10 and ICDO Pack: Embeddings, EntityResolver and TextMatcher models

Model or model pack description:

ICD10 and ICDO model pack:

Model	name	language	loc
TextMatcherModel	`textmatch_icdo_ner_n2c4`	en	clinical/models
TextMatcherModel	`textmatch_cpt_token_n2c1`	en	clinical/models
WordEmbeddingsModel	`embeddings_icdoem`	en	clinical/models
EntityResolutionModel	`resolve_icd10cm_icdoem`	en	clinical/models
EntityResolutionModel	`resolve_icdo_icdoem`	en	clinical/models
EntityResolutionModel	`resolve_cpt_icdoem`	en	clinical/models
ChunkEntityResolutionModel	`chunkresolve_icdo_icdoem`	en	clinical/models
ChunkEntityResolutionModel	`chunkresolve_cpt_icdoem`	en	clinical/models

textmatch_icdo_ner_n2c4 and textmatch_cpt_token_n2c1 are text-matching models trained from comprehensive glossaries of oncology and procedural terms.

The embeddings_icdoem WordEmbeddingsModel was trained on a semantically augmented corpus of clinical texts, case studies, and curated datasets.

The resolve_icd10cm_icdoem, resolve_icdo_icdoem and resolve_cpt_icdoem models are EntityResolvers trained with the embeddings_icdoem model and semantically augmented datasets from JSL Data Market.

The chunkresolve_icdo_icdoem and chunkresolve_cpt_icdoem models are ChunkEntityResolvers that connect with the new ChunkEmbeddings annotator.

Spark NLP:

HEALTHCARE

Last update

26/11/2019

Notes

ChunkEntityResolutionApproach and ChunkEntityResolutionModel are new annotators introduced in Spark NLP 2.3.4.
Unlike EntityResolutionApproach and EntityResolutionModel, they expect embeddings produced by ChunkEmbeddings. This makes the aggregation function applied to the word embeddings of each chunk configurable.
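
A minimal sketch of the ChunkEmbeddings stage that would sit in front of a chunk resolver, assuming the public Spark NLP `ChunkEmbeddings` annotator. The column names `chunk`, `embeddings` and `chunk_embeddings` are placeholders, and `AVERAGE` pooling is only an illustrative choice. The licensed resolver stage itself is not shown.

```scala
import com.johnsnowlabs.nlp.embeddings.ChunkEmbeddings

// Aggregates the word embeddings that fall inside each chunk into one vector.
// Input columns (assumed names): a chunk column and a word-embeddings column.
val chunkEmbeddings = new ChunkEmbeddings()
  .setInputCols("chunk", "embeddings")
  .setOutputCol("chunk_embeddings")
  .setPoolingStrategy("AVERAGE") // assumption: "SUM" is the other pooling option
```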

Works with

Spark NLP 2.3.4 and above

Link

Examples on how to use these models can be found here:
Notebooks
Healthcare Notebooks


27 Nov 22:17

saif-ellafi

2.3.4-bionlp-ner

fc8a8d0

New BioNLP NER Model

BioNLP-CG NER model:

The BioNLP Named Entity Recognition (NER) model is the first NER model in the Spark NLP library trained on the Cancer Genetics dataset with a state-of-the-art NER architecture.

The Cancer Genetics (CG) task is an information extraction task organized as part of the BioNLP Shared Task 2013. The CG task aims to advance the automatic extraction of information from statements on the biological processes relating to the development and progression of cancer. Details here: http://2013.bionlp-st.org/tasks/cancer-genetics

There are 16 different entities in this NER model:

Entities
Gene_or_gene_product
Organism
Organ
Anatomical_system
Cell
Multi
Tissue
Pathological_formation
Cancer
Simple_chemical
Amino_acid
Cellular_component
Organism_subdivision
Developing_anatomical_structure
Immaterial_anatomical_entity
Organism_substance

Spark NLP:

HEALTHCARE
PUBLIC

Last update

27/11/2019

Works with

Spark NLP 2.3.x and above

Link

https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/enterprise/healthcare/BioNLP-NER.ipynb


27 Nov 22:13

saif-ellafi

2.3.4-biobert-embeddings

fc8a8d0

New BERT Healthcare Embeddings Pack

Model or model pack description:

BioBERT models pack:

We are very excited to share these 5 new BioBERT models with our enterprise users!

Model	name	language	loc
BertEmbeddingsModel	`biobert_pubmed_cased`	en	clinical/models
BertEmbeddingsModel	`biobert_pmc_cased`	en	clinical/models
BertEmbeddingsModel	`biobert_pubmed_pmc_cased`	en	clinical/models
BertEmbeddingsModel	`biobert_clinical_cased`	en	clinical/models
BertEmbeddingsModel	`biobert_discharge_cased`	en	clinical/models

The biobert_pubmed_cased, biobert_pmc_cased, and biobert_pubmed_pmc_cased models are based on the pretrained BioBERT models from the paper: https://arxiv.org/abs/1901.08746
The biobert_clinical_cased and biobert_discharge_cased models come from the clinicalBERT release described in the paper: https://www.aclweb.org/anthology/W19-1909/

Spark NLP:

HEALTHCARE

Last update

26/11/2019

Works with

Spark NLP 2.3.1 and above


28 Sep 15:51

maziyarpanahi

2.2.0-bert-models

732cb25

New BertEmbeddings Models

Model	Name	Download (en)
BertEmbeddings (base_uncased)	`bert_base_uncased`	Download
BertEmbeddings (base_cased)	`bert_base_cased`	Download
BertEmbeddings (large_uncased)	`bert_large_uncased`	Download
BertEmbeddings (large_cased)	`bert_large_cased`	Download

Spark NLP:

PUBLIC

Last update

24/08/2019

Works with

Spark NLP 2.2.0 and above


13 Jul 18:18

maziyarpanahi

2.1.0-wikiner-models

732cb25

New WikiNER models

Models

In Spark NLP 2.1.0 we renamed our multi-lingual NerDL models from ner_dl to wikiner_840B_300. They are trained on the WikiNER corpus and reach their highest accuracy with the pretrained glove_840B_300 WordEmbeddings.

English

Model	Name	Download (en)
NerDLModel (OntoNotes with GloVe 100d)	`onto_100`	Download
NerDLModel (OntoNotes with GloVe 300d)	`onto_300`	Download

French

Model	Name	Download (fr)
NerDLModel (glove_840B_300)	`wikiner_840B_300`	Download

German

Model	Name	Download (de)
NerDLModel (glove_840B_300)	`wikiner_840B_300`	Download

Italian

Model	Name	Download (it)
NerDLModel (glove_840B_300)	`wikiner_840B_300`	Download

Multi-language

Model	Name	Download (xx)
WordEmbeddings (GloVe)	`glove_840B_300`	Download
WordEmbeddings (GloVe)	`glove_6B_300`	Download
WordEmbeddings (BERT)	`bert_multi_cased`	Download
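
A minimal sketch of loading one of the renamed models together with the GloVe embeddings it is listed with, using the French model as the example. The column names `sentence`, `token`, `embeddings` and `ner` are assumptions and must match the upstream stages of your own pipeline.

```scala
import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLModel
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel

// Multi-language GloVe embeddings listed in the table above.
val embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", lang = "xx")
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")

// French WikiNER model under its new name (previously ner_dl).
val ner = NerDLModel.pretrained("wikiner_840B_300", lang = "fr")
  .setInputCols("sentence", "token", "embeddings")
  .setOutputCol("ner")
```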

Spark NLP:

PUBLIC

Last update

03/08/2019

Works with

Spark NLP 2.1.0 and above


10 Jun 15:38

maziyarpanahi

2.0.8-italian-german-models

732cb25

New Italian and German pipelines and models

We are happy to announce our new Italian and German pipelines and models. We are also going to release new entity_recognizer_lg and entity_recognizer_md pipelines for Italian and French.

Pipelines

Italian

Pipelines	Name	Language
Explain Document Large	`explain_document_lg`	it
Explain Document Medium	`explain_document_md`	it
Entity Recognizer Large	`entity_recognizer_lg`	it
Entity Recognizer Medium	`entity_recognizer_md`	it

French

Pipelines	Name	Language
Entity Recognizer Large	`entity_recognizer_lg`	fr
Entity Recognizer Medium	`entity_recognizer_md`	fr

Models

Italian

Model	Name	Language
PerceptronModel (POS UD)	`pos_ud_isdt`	it
NerDLModel (glove_6B_300 and glove_840B_300)	`ner_dl`	it

German

Model	Name	Language
LemmatizerModel (Lemmatizer)	`lemma`	de
PerceptronModel (POS UD)	`pos_ud_hdt`	de
NerDLModel (glove_6B_300 and glove_840B_300)	`ner_dl`	de

Dataset

Feature	Description
Lemma	Trained by Lemmatizer annotator on lemmatization-lists by `Michal Měchura`
POS	Trained by PerceptronApproach annotator on the Universal Dependencies
NER	Trained by NerDLApproach annotator with BiLSTM-CNN on the WikiNER corpus and supports the identification of `PER`, `LOC`, `ORG` and `MISC` entities

Example

German POS model

import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel

val perceptronModel = PerceptronModel.pretrained("pos_ud_hdt", lang = "de")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("pos")

German NerDL model

import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLModel

val ner = NerDLModel.pretrained("ner_dl", lang = "de")
  .setInputCols("sentence", "token", "embeddings")
  .setOutputCol("ner")


10 Jun 12:49

maziyarpanahi

2.0.3-bert-multi-language-model

732cb25

New multi-language BERT embeddings

Models

Multi-language

Model	Name	Language
WordEmbeddings (BERT)	`bert_multi_cased`	xx
