
48 new Transformer-based models in 9 new languages, including NER for Finance, Industry, Political Policies, COVID and Chemical Trials, various clinical and medical domains in Spanish and English, and much more in NLU 3.3.1

@C-K-Loan C-K-Loan released this 06 Dec 15:42
· 708 commits to master since this release
fd7e73b

We are incredibly excited to announce NLU 3.3.1 has been released with 48 new models in 9 languages!

It comes with 2 new types of state-of-the-art models, DistilBERT and BERT for sequence classification, with various pre-trained weights:
state-of-the-art BERT-based classifiers for problems in the domains of Finance, Sentiment Classification, Industry, News, and much more.

On the healthcare side, NLU features 22 new models for English and Spanish, with
Entity Resolver models for LOINC, MeSH, NDC, SNOMED, and UMLS Diseases,
NER models for Biomarkers, NIHSS guidelines, COVID Trials, and Chemical Trials,
BERT-based Token Classifier models for biological, genetic, cancer, and cellular terms,
BERT for Sequence Classification models for clinical question vs. statement classification,
and finally Spanish Clinical NER and Resolver models.

Once again, we would like to thank our community for making another amazing release possible!

New Open Source Models and Features

Integrates the amazing Spark NLP 3.3.3 and 3.3.2 releases, featuring:

  • New state-of-the-art fine-tuned BERT models for Sequence Classification in English, French, German, Spanish, Japanese, Turkish, and Russian, as well as a multilingual model.
  • DistilBertForSequenceClassification models in English, French and Urdu
  • Word2Vec models.
  • classify.distilbert_sequence.banking77 : Banking intent classifier trained on the BANKING77 dataset, which provides a very fine-grained set of intents in the banking domain. It comprises 13,083 customer service queries labeled with 77 intents and focuses on fine-grained single-domain intent detection. Predicted intents include activate_my_card, age_limit, apple_pay_or_google_pay, atm_support, automatic_top_up, balance_not_updated_after_bank_transfer, balance_not_updated_after_cheque_or_cash_deposit, beneficiary_not_allowed, cancel_transfer, card_about_to_expire, card_acceptance, card_arrival, card_delivery_estimate, card_linking, card_not_working, card_payment_fee_charged, card_payment_not_recognised, card_payment_wrong_exchange_rate, card_swallowed, cash_withdrawal_charge, cash_withdrawal_not_recognised, change_pin, compromised_card, contactless_not_working, country_support, declined_card_payment, declined_cash_withdrawal, declined_transfer, direct_debit_payment_not_recognised, disposable_card_limits, edit_personal_details, exchange_charge, exchange_rate, exchange_via_app, extra_charge_on_statement, failed_transfer, fiat_currency_support, get_disposable_virtual_card, get_physical_card, getting_spare_card, getting_virtual_card, lost_or_stolen_card, lost_or_stolen_phone, order_physical_card, passcode_forgotten, pending_card_payment, pending_cash_withdrawal, pending_top_up, pending_transfer, pin_blocked, receiving_money.
  • classify.distilbert_sequence.industry : Industry classifier which can assign labels like Advertising, Aerospace & Defense, Apparel Retail, Apparel, Accessories & Luxury Goods, Application Software, Asset Management & Custody Banks, Auto Parts & Equipment, Biotechnology, Building Products, Casinos & Gaming, Commodity Chemicals, Communications Equipment, Construction & Engineering, Construction Machinery & Heavy Trucks, Consumer Finance, Data Processing & Outsourced Services, Diversified Metals & Mining, Diversified Support Services, Electric Utilities, Electrical Components & Equipment, Electronic Equipment & Instruments, Environmental & Facilities Services, Gold, Health Care Equipment, Health Care Facilities, Health Care Services.
  • xx.classify.bert_sequence.sentiment : Multi-Lingual Sentiment Classifier. This is a bert-base-multilingual-uncased model fine-tuned for sentiment analysis on product reviews in six languages: English, Dutch, German, French, Spanish and Italian. It predicts the sentiment of the review as a number of stars (between 1 and 5). This model is intended for direct use as a sentiment analysis model for product reviews in any of the six languages above, or for further fine-tuning on related sentiment analysis tasks.
  • classify.distilbert_sequence.policy : Policy Classifier. This model was trained on 129,669 manually annotated sentences to classify text into one of seven political categories: ‘Economy’, ‘External Relations’, ‘Fabric of Society’, ‘Freedom and Democracy’, ‘Political System’, ‘Welfare and Quality of Life’ or ‘Social Groups’.
  • classify.bert_sequence.dehatebert_mono : Hate Speech Classifier. A BERT sequence classification model for detecting hate speech in English text.
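
A minimal sketch of how one of the classifiers listed above might be called from Python, assuming the `nlu` package, PySpark and a Java runtime are installed. `nlu.load` and `predict` are the library's usual entry points; the helper names `with_language` and `classify` are hypothetical and exist only for this example, and the output columns depend on the model and NLU version.

```python
def with_language(ref: str, lang: str = "en") -> str:
    """Prefix an NLU reference with a language code.

    The feature list writes references without a language
    ('classify.distilbert_sequence.banking77'), while the complete model
    list writes them with one ('en.classify.distilbert_sequence.banking77').
    """
    prefix = lang + "."
    return ref if ref.startswith(prefix) else prefix + ref


def classify(texts, ref="en.classify.distilbert_sequence.banking77"):
    """Load a pretrained pipeline by NLU reference and run it on `texts`.

    `nlu` is imported lazily so that this module can be imported on a
    machine without Spark. The first call downloads the model.
    """
    import nlu  # assumed installed via: pip install nlu pyspark

    pipeline = nlu.load(ref)
    # predict() accepts a string or a list of strings and returns a pandas DataFrame
    return pipeline.predict(texts)


# Example call (not executed here because it needs Spark and a model download):
# df = classify(["My card was swallowed by the ATM."],
#               with_language("classify.distilbert_sequence.banking77"))
```

The same pattern should apply to the other references in this release, for example `xx.classify.bert_sequence.sentiment` for the multilingual star-rating model.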

Complete List of Open Source Models :

| Language | NLU Reference | Spark NLP Reference | Task |
|---|---|---|---|
| en | en.classify.bert_sequence.imdb_large | bert_large_sequence_classifier_imdb | Text Classification |
| en | en.classify.bert_sequence.imdb | bert_base_sequence_classifier_imdb | Text Classification |
| en | en.classify.bert_sequence.ag_news | bert_base_sequence_classifier_ag_news | Text Classification |
| en | en.classify.bert_sequence.dbpedia_14 | bert_base_sequence_classifier_dbpedia_14 | Text Classification |
| en | en.classify.bert_sequence.finbert | bert_sequence_classifier_finbert | Text Classification |
| en | en.classify.bert_sequence.dehatebert_mono | bert_sequence_classifier_dehatebert_mono | Text Classification |
| tr | tr.classify.bert_sequence.sentiment | bert_sequence_classifier_turkish_sentiment | Text Classification |
| de | de.classify.bert_sequence.sentiment | bert_sequence_classifier_sentiment | Text Classification |
| ru | ru.classify.bert_sequence.sentiment | bert_sequence_classifier_rubert_sentiment | Text Classification |
| ja | ja.classify.bert_sequence.sentiment | bert_sequence_classifier_japanese_sentiment | Text Classification |
| es | es.classify.bert_sequence.sentiment | bert_sequence_classifier_beto_sentiment_analysis | Text Classification |
| es | es.classify.bert_sequence.emotion | bert_sequence_classifier_beto_emotion_analysis | Text Classification |
| xx | xx.classify.bert_sequence.sentiment | bert_sequence_classifier_multilingual_sentiment | Text Classification |
| en | en.classify.distilbert_sequence.sst2 | distilbert_sequence_classifier_sst2 | Text Classification |
| en | en.classify.distilbert_sequence.policy | distilbert_sequence_classifier_policy | Text Classification |
| en | en.classify.distilbert_sequence.industry | distilbert_sequence_classifier_industry | Text Classification |
| en | en.classify.distilbert_sequence.emotion | distilbert_sequence_classifier_emotion | Text Classification |
| en | en.classify.distilbert_sequence.banking77 | distilbert_sequence_classifier_banking77 | Text Classification |
| en | en.classify.distilbert_sequence.imdb | distilbert_base_sequence_classifier_imdb | Text Classification |
| en | en.classify.distilbert_sequence.amazon_polarity | distilbert_base_sequence_classifier_amazon_polarity | Text Classification |
| en | en.classify.distilbert_sequence.ag_news | distilbert_base_sequence_classifier_ag_news | Text Classification |
| fr | fr.classify.distilbert_sequence.allocine | distilbert_multilingual_sequence_classifier_allocine | Text Classification |
| ur | ur.classify.distilbert_sequence.imdb | distilbert_base_sequence_classifier_imdb | Text Classification |
| en | en.embed_sentence.doc2vec | doc2vec_gigaword_300 | Embeddings |
| en | en.embed_sentence.doc2vec.gigaword_300 | doc2vec_gigaword_300 | Embeddings |
| en | en.embed_sentence.doc2vec.gigaword_wiki_300 | doc2vec_gigaword_wiki_300 | Embeddings |
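
Each entry in the list above pairs a language and an NLU reference with the Spark NLP model name it resolves to. The sketch below uses a few rows copied from that list to show why the NLU reference, and not the Spark NLP name, is the unambiguous key: the same Spark NLP name can appear under more than one language or alias. The names `ROWS`, `spark_nlp_name` and `nlu_refs_for` are hypothetical helpers for illustration and are not part of NLU.

```python
# (language, NLU reference, Spark NLP reference, task), copied from the list above
ROWS = [
    ("en", "en.classify.bert_sequence.finbert", "bert_sequence_classifier_finbert", "Text Classification"),
    ("en", "en.classify.distilbert_sequence.imdb", "distilbert_base_sequence_classifier_imdb", "Text Classification"),
    ("ur", "ur.classify.distilbert_sequence.imdb", "distilbert_base_sequence_classifier_imdb", "Text Classification"),
    ("en", "en.embed_sentence.doc2vec", "doc2vec_gigaword_300", "Embeddings"),
    ("en", "en.embed_sentence.doc2vec.gigaword_300", "doc2vec_gigaword_300", "Embeddings"),
]


def spark_nlp_name(nlu_ref, rows=ROWS):
    """Return the Spark NLP model name listed for an NLU reference."""
    for _lang, ref, name, _task in rows:
        if ref == nlu_ref:
            return name
    raise KeyError(nlu_ref)


def nlu_refs_for(name, rows=ROWS):
    """Return every NLU reference listed for a Spark NLP model name, in list order."""
    return [ref for _lang, ref, row_name, _task in rows if row_name == name]
```

Here `nlu_refs_for("distilbert_base_sequence_classifier_imdb")` yields both the English and the Urdu reference, so the language prefix is what tells the two models apart.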

New Healthcare models and Features

Integrates the incredible Spark NLP for Healthcare releases 3.3.4, 3.3.2 and 3.3.1, featuring:

  • New Clinical NER Models for protected health information (PHI).
  • ner_biomarker : NER model for extracting biomarkers, therapies, oncological, and other general concepts
    • Oncogenes, Tumor_Finding, UnspecificTherapy, Ethnicity, Age, ResponseToTreatment, Biomarker, HormonalTherapy, Staging, Drug, CancerDx, Radiotherapy, CancerSurgery, TargetedTherapy, PerformanceStatus, CancerModifier, Radiological_Test_Result, Biomarker_Measurement, Metastasis, Radiological_Test, Chemotherapy, Test, Dosage, Test_Result, Immunotherapy, Date, Gender, Prognostic_Biomarkers, Duration, Predictive_Biomarkers
  • ner_nihss : NER model that can identify entities according to NIHSS guidelines for clinical stroke assessment to evaluate neurological status in acute stroke patients
    • 11_ExtinctionInattention, 6b_RightLeg, 1c_LOCCommands, 10_Dysarthria, NIHSS, 5_Motor, 8_Sensory, 4_FacialPalsy, 6_Motor, 2_BestGaze, Measurement, 6a_LeftLeg, 5b_RightArm, 5a_LeftArm, 1b_LOCQuestions, 3_Visual, 9_BestLanguage, 7_LimbAtaxia, 1a_LOC
  • redl_nihss_biobert : relation extraction model that can relate scale items and their measurements according to NIHSS guidelines.
  • es.med_ner.roberta_ner_diag_proc : New Spanish Clinical NER Models for extracting the entities DIAGNOSTICO, PROCEDIMIENTO
  • es.resolve.snomed: New Spanish SNOMED Entity Resolvers
  • bert_sequence_classifier_question_statement_clinical : New BertForSequenceClassification model for clinical question vs. statement classification
  • med_ner.covid_trials : This model is trained to extract COVID-specific medical entities in clinical trials. It supports the following entities, ranging from virus type to trial design: Stage, Severity, Virus, Trial_Design, Trial_Phase, N_Patients, Institution, Statistical_Indicator, Section_Header, Cell_Type, Cellular_component, Viral_components, Physiological_reaction, Biological_molecules, Admission_Discharge, Age, BMI, Cerebrovascular_Disease, Date, Death_Entity, Diabetes, Disease_Syndrome_Disorder, Dosage, Drug_Ingredient, Employment, Frequency, Gender, Heart_Disease, Hypertension, Obesity, Pulse, Race_Ethnicity, Respiration, Route, Smoking, Time, Total_Cholesterol, Treatment, VS_Finding, Vaccine.
  • med_ner.chemd : This model extracts the names of chemical compounds and drugs in medical texts. The entities that can be detected are as follows: SYSTEMATIC, IDENTIFIERS, FORMULA, TRIVIAL, ABBREVIATION, FAMILY, MULTIPLE. For reference, see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4331685/
  • bert_token_classifier_ner_bionlp : This model is a BERT-based version of the ner_bionlp model and can detect biological and genetics terms in cancer-related texts (Amino_acid, Anatomical_system, Cancer, Cell, Cellular_component, Developing_anatomical_Structure, Gene_or_gene_product, Immaterial_anatomical_entity, Multi-tissue_structure, Organ, Organism, Organism_subdivision, Simple_chemical, Tissue).
  • bert_token_classifier_ner_cellular : This model is a BERT-based version of the ner_cellular model and can detect molecular biology-related terms (DNA, Cell_type, Cell_line, RNA, Protein) in medical texts.
  • We have updated the med_ner.jsl.enriched model by enriching the training data with clinical trials data to make it more robust. This model is capable of predicting up to 87 different entities and is based on the ner_jsl model. Here are the entities this model can detect: Social_History_Header, Oncology_Therapy, Blood_Pressure, Respiration, Performance_Status, Family_History_Header, Dosage, Clinical_Dept, Diet, Procedure, HDL, Weight, Admission_Discharge, LDL, Kidney_Disease, Oncological, Route, Imaging_Technique, Puerperium, Overweight, Temperature, Diabetes, Vaccine, Age, Test_Result, Employment, Time, Obesity, EKG_Findings, Pregnancy, Communicable_Disease, BMI, Strength, Tumor_Finding, Section_Header, RelativeDate, ImagingFindings, Death_Entity, Date, Cerebrovascular_Disease, Treatment, Labour_Delivery, Pregnancy_Delivery_Puerperium, Direction, Internal_organ_or_component, Psychological_Condition, Form, Medical_Device, Test, Symptom, Disease_Syndrome_Disorder, Staging, Birth_Entity, Hyperlipidemia, O2_Saturation, Frequency, External_body_part_or_region, Drug_Ingredient, Vital_Signs_Header, Substance_Quantity, Race_Ethnicity, VS_Finding, Injury_or_Poisoning, Medical_History_Header, Alcohol, Triglycerides, Total_Cholesterol, Sexually_Active_or_Sexual_Orientation, Female_Reproductive_Status, Relationship_Status, Drug_BrandName, RelativeTime, Duration, Hypertension, Metastasis, Gender, Oxygen_Therapy, Pulse, Heart_Disease, Modifier, Allergen, Smoking, Substance, Cancer_Modifier, Fetus_NewBorn, Height
  • classify.bert_sequence.question_statement_clinical : This model classifies sentences into one of two classes, question (interrogative sentence) or statement (declarative sentence), and is trained with BertForSequenceClassification. It was first trained on the SQuAD and SPAADIA datasets and then fine-tuned on clinical visit documents and the MIMIC-III dataset annotated in-house. Using this model, you can find question sentences and either exclude them or use them in downstream tasks such as NER and relation extraction.
  • classify.token_bert.ner_chemical : This model is a BERT-based version of the ner_chemicals model and can detect chemical compounds (CHEM) in medical texts.
  • resolve.umls_disease_syndrome : This model is trained on the Disease or Syndrome category using sbiobert_base_cased_mli embeddings.

Complete List of Healthcare Models :

| Language | NLU Reference | Spark NLP Reference | Task |
|---|---|---|---|
| en | en.med_ner.deid_subentity_augmented_i2b2 | ner_deid_subentity_augmented_i2b2 | Named Entity Recognition |
| en | en.med_ner.biomarker | ner_biomarker | Named Entity Recognition |
| en | en.med_ner.nihss | ner_nihss | Named Entity Recognition |
| en | en.extract_relation.nihss | redl_nihss_biobert | Relation Extraction |
| en | en.resolve.mesh | sbiobertresolve_mesh | Entity Resolution |
| en | en.resolve.mli | sbiobert_base_cased_mli | Embeddings |
| en | en.resolve.ndc | sbiobertresolve_ndc | Entity Resolution |
| en | en.resolve.loinc.augmented | sbiobertresolve_loinc_augmented | Entity Resolution |
| en | en.resolve.clinical_snomed_procedures_measurements | sbiobertresolve_clinical_snomed_procedures_measurements | Entity Resolution |
| es | es.embed.roberta_base_biomedical | roberta_base_biomedical | Embeddings |
| es | es.med_ner.roberta_ner_diag_proc | roberta_ner_diag_proc | Named Entity Recognition |
| es | es.resolve.snomed | robertaresolve_snomed | Entity Resolution |
| en | en.med_ner.covid_trials | ner_covid_trials | Named Entity Recognition |
| en | en.classify.token_bert.bionlp | bert_token_classifier_ner_bionlp | Named Entity Recognition |
| en | en.classify.token_bert.cellular | bert_token_classifier_ner_cellular | Named Entity Recognition |
| en | en.classify.token_bert.chemicals | bert_token_classifier_ner_chemicals | Named Entity Recognition |
| en | en.resolve.rxnorm_augmented | sbiobertresolve_rxnorm_augmented | Entity Resolution |
| en | en.resolve.umls_disease_syndrome | sbiobertresolve_umls_disease_syndrome | Entity Resolution |
| en | en.resolve.umls_clinical_drugs | sbiobertresolve_umls_clinical_drugs | Entity Resolution |
| en | en.classify.bert_sequence.question_statement_clinical | bert_sequence_classifier_question_statement_clinical | Text Classification |