From 06f400d75b84af2f60f144dfc1b4121b23df1962 Mon Sep 17 00:00:00 2001
From: David Cecchini
Date: Sun, 12 Nov 2023 06:01:49 -0300
Subject: [PATCH] Finance 1.20.0 (#755)

* Add model 2023-08-03-finner_bert_subpoenas_sm_en (#493)

Co-authored-by: gadde5300

* Delete subpoenas ner finance

* Add model 2023-08-30-finpipe_deid_en (#566)

Co-authored-by: Meryem1425

* Add model 2023-08-30-finpipe_deid_en (#570)

Co-authored-by: SKocer

* Add model 2023-08-30-finpipe_deid_en (#571)

Co-authored-by: SKocer

* Delete 2023-08-30-finpipe_deid_en.md

* Add model 2023-08-30-finpipe_deid_en (#572)

Co-authored-by: gokhanturer

* Add model 2023-08-30-finpipe_deid_en (#574)

Co-authored-by: SKocer

* Add model 2023-09-01-finpipe_deid_en (#586)

Co-authored-by: Meryem1425

* Add model 2023-09-01-finpipe_deid_en (#589)

Co-authored-by: SKocer

* Add model 2023-09-01-finpipe_deid_en (#593)

Co-authored-by: gokhanturer

* 2023-10-06-finembedding_e5_base_en (#685)

* Add model 2023-10-06-finembedding_e5_base_en

* Add model 2023-10-06-finner_absa_sm_en

* Add model 2023-10-06-finassertion_absa_sm_en

---------

Co-authored-by: dcecchini

* Add model 2023-11-09-finembedding_e5_large_en (#745)

Co-authored-by: dcecchini

* 2023-11-11-finner_aspect_based_sentiment_md_en (#754)

* Add model 2023-11-11-finner_aspect_based_sentiment_md_en

* Add model 2023-11-11-finassertion_aspect_based_sentiment_md_en

* Update 2023-11-11-finner_aspect_based_sentiment_md_en.md

* Update 2023-11-11-finassertion_aspect_based_sentiment_md_en.md

---------

Co-authored-by: Mary-Sci
Co-authored-by: Merve Ertas Uslu <67653613+Mary-Sci@users.noreply.github.com>

---------

Co-authored-by: jsl-models <74001263+jsl-models@users.noreply.github.com>
Co-authored-by: gadde5300
Co-authored-by: Meryem1425
Co-authored-by: SKocer
Co-authored-by: Merve Ertas Uslu <67653613+Mary-Sci@users.noreply.github.com>
Co-authored-by: gokhanturer
Co-authored-by: Mary-Sci
---
 ...nassertion_aspect_based_sentiment_md_en.md | 131 +++++++++++++++++
 ...-11-finner_aspect_based_sentiment_md_en.md | 136 ++++++++++++++++++
 .../2023-10-06-finembedding_e5_base_en.md     |   1 +
 .../2023-11-09-finembedding_e5_large_en.md    |  90 ++++++++++++
 4 files changed, 358 insertions(+)
 create mode 100644 docs/_posts/Mary-Sci/2023-11-11-finassertion_aspect_based_sentiment_md_en.md
 create mode 100644 docs/_posts/Mary-Sci/2023-11-11-finner_aspect_based_sentiment_md_en.md
 create mode 100644 docs/_posts/dcecchini/2023-11-09-finembedding_e5_large_en.md

diff --git a/docs/_posts/Mary-Sci/2023-11-11-finassertion_aspect_based_sentiment_md_en.md b/docs/_posts/Mary-Sci/2023-11-11-finassertion_aspect_based_sentiment_md_en.md
new file mode 100644
index 0000000000..12ca101255
--- /dev/null
+++ b/docs/_posts/Mary-Sci/2023-11-11-finassertion_aspect_based_sentiment_md_en.md
@@ -0,0 +1,131 @@
+---
+layout: model
+title: Financial Assertion of Aspect-Based Sentiment (md, Medium)
+author: John Snow Labs
+name: finassertion_aspect_based_sentiment_md
+date: 2023-11-11
+tags: [assertion, licensed, en, finance]
+task: Assertion Status
+language: en
+edition: Finance NLP 1.0.0
+spark_version: 3.0
+supported: true
+annotator: AssertionDLModel
+article_header:
+type: cover
+use_language_switcher: "Python-Scala-Java"
+---
+
+## Description
+
+This assertion model assigns an aspect-based sentiment label to each financial entity. It is designed to be used together with the associated NER model.
+
+## Predicted Entities
+
+`POSITIVE`, `NEGATIVE`, `NEUTRAL`
+
+{:.btn-box}
+
+
+[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finassertion_aspect_based_sentiment_md_en_1.0.0_3.0_1699705705778.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
+[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finassertion_aspect_based_sentiment_md_en_1.0.0_3.0_1699705705778.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
+
+## How to use
+
+
+
+
+{% include programmingLanguageSelectScalaPythonNLU.html %}
+```python
+documentAssembler = nlp.DocumentAssembler()\
+    .setInputCol("text")\
+    .setOutputCol("document")
+
+# Sentence detector: splits each document into sentences
+sentenceDetector = nlp.SentenceDetector()\
+    .setInputCols(["document"])\
+    .setOutputCol("sentence")
+
+# Tokenizer: splits each sentence into tokens
+tokenizer = nlp.Tokenizer()\
+    .setInputCols(["sentence"])\
+    .setOutputCol("token")
+
+bert_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")\
+    .setInputCols("sentence", "token")\
+    .setOutputCol("embeddings")\
+    .setMaxSentenceLength(512)
+
+finance_ner = finance.NerModel.pretrained("finner_aspect_based_sentiment_md", "en", "finance/models")\
+    .setInputCols(["sentence", "token", "embeddings"])\
+    .setOutputCol("ner")
+
+ner_converter = finance.NerConverterInternal()\
+    .setInputCols(["sentence", "token", "ner"])\
+    .setOutputCol("ner_chunk")
+
+assertion_model = finance.AssertionDLModel.pretrained("finassertion_aspect_based_sentiment_md", "en", "finance/models")\
+    .setInputCols(["sentence", "ner_chunk", "embeddings"])\
+    .setOutputCol("assertion")
+
+
+nlpPipeline = nlp.Pipeline(
+    stages=[documentAssembler,
+            sentenceDetector,
+            tokenizer,
+            bert_embeddings,
+            finance_ner,
+            ner_converter,
+            assertion_model])
+
+text = "Equity and earnings of affiliates in Latin America increased to $4.8 million in the quarter from $2.2 million in the prior year as the commodity markets in Latin America remain strong through the end of the quarter."
+
+spark_df = spark.createDataFrame([[text]]).toDF("text")
+
+result = nlpPipeline.fit(spark_df).transform(spark_df)
+from pyspark.sql import functions as F
+result.select(F.explode(F.arrays_zip("ner_chunk.result", "ner_chunk.metadata", "assertion.result", "assertion.metadata")).alias("cols"))\
+    .select(F.expr("cols['0']").alias("entity"),
+            F.expr("cols['1']['entity']").alias("label"),
+            F.expr("cols['2']").alias("assertion"),
+            F.expr("cols['3']['confidence']").alias("confidence")).show(50, truncate=False)
+```
+
+
+
+## Results
+
+```bash
++--------+---------+---------+----------+
+|entity  |label    |assertion|confidence|
++--------+---------+---------+----------+
+|Equity  |LIABILITY|POSITIVE |0.9895    |
+|earnings|PROFIT   |POSITIVE |0.995     |
++--------+---------+---------+----------+
+```
+
+{:.model-param}
+## Model Information
+
+{:.table-model}
+|---|---|
+|Model Name:|finassertion_aspect_based_sentiment_md|
+|Compatibility:|Finance NLP 1.0.0+|
+|License:|Licensed|
+|Edition:|Official|
+|Input Labels:|[document, chunk, embeddings]|
+|Output Labels:|[assertion]|
+|Language:|en|
+|Size:|2.7 MB|
+
+## Benchmarking
+
+```bash
+       label  precision  recall  f1-score  support
+    NEGATIVE       0.68    0.43      0.53      232
+     NEUTRAL       0.44    0.65      0.53      441
+    POSITIVE       0.79    0.69      0.74      947
+    accuracy          -       -      0.64     1620
+   macro-avg       0.64    0.59      0.60     1620
+weighted-avg       0.68    0.64      0.65     1620
+```
diff --git a/docs/_posts/Mary-Sci/2023-11-11-finner_aspect_based_sentiment_md_en.md b/docs/_posts/Mary-Sci/2023-11-11-finner_aspect_based_sentiment_md_en.md
new file mode 100644
index 0000000000..fb1df22a2a
--- /dev/null
+++ b/docs/_posts/Mary-Sci/2023-11-11-finner_aspect_based_sentiment_md_en.md
@@ -0,0 +1,136 @@
+---
+layout: model
+title: Financial NER on Aspect-Based Sentiment Analysis
+author: John Snow Labs
+name: finner_aspect_based_sentiment_md
+date: 2023-11-11
+tags: [ner, licensed, finance, en]
+task: Named Entity Recognition
+language: en
+edition: Finance NLP 1.0.0
+spark_version: 3.0
+supported: true
+annotator: FinanceNerModel
+article_header:
+type: cover
+use_language_switcher: "Python-Scala-Java"
+---
+
+## Description
+
+This NER model identifies entities that can be associated with a financial sentiment. The model is designed to be used with the associated Assertion Status model that classifies the entities into a sentiment category.
+
+## Predicted Entities
+
+`ASSET`, `CASHFLOW`, `EXPENSE`, `FREE_CASH_FLOW`, `GAINS`, `KPI`, `LIABILITY`, `LOSSES`, `PROFIT`, `REVENUE`
+
+{:.btn-box}
+
+
+[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_aspect_based_sentiment_md_en_1.0.0_3.0_1699704469251.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
+[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_aspect_based_sentiment_md_en_1.0.0_3.0_1699704469251.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
+
+## How to use
+
+
+
+
+{% include programmingLanguageSelectScalaPythonNLU.html %}
+```python
+documentAssembler = nlp.DocumentAssembler()\
+    .setInputCol("text")\
+    .setOutputCol("document")
+
+# Sentence detector: splits each document into sentences
+sentenceDetector = nlp.SentenceDetector()\
+    .setInputCols(["document"])\
+    .setOutputCol("sentence")
+
+# Tokenizer: splits each sentence into tokens
+tokenizer = nlp.Tokenizer()\
+    .setInputCols(["sentence"])\
+    .setOutputCol("token")
+
+bert_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")\
+    .setInputCols("sentence", "token")\
+    .setOutputCol("embeddings")\
+    .setMaxSentenceLength(512)
+
+
+ner_model = finance.NerModel.pretrained("finner_aspect_based_sentiment_md", "en", "finance/models")\
+    .setInputCols(["sentence", "token", "embeddings"])\
+    .setOutputCol("ner")
+
+ner_converter = nlp.NerConverter()\
+    .setInputCols(["sentence", "token", "ner"])\
+    .setOutputCol("ner_chunk")
+
+nlpPipeline = nlp.Pipeline(stages=[
+    documentAssembler,
+    sentenceDetector,
+    tokenizer,
+    bert_embeddings,
+    ner_model,
+    ner_converter])
+
+empty_data = spark.createDataFrame([[""]]).toDF("text")
+model = nlpPipeline.fit(empty_data)
+
+text = ["""Equity and earnings of affiliates in Latin America increased to $4.8 million in the quarter from $2.2 million in the prior year as the commodity markets in Latin America remain strong through the end of the quarter."""]
+result = model.transform(spark.createDataFrame([text]).toDF("text"))
+
+from pyspark.sql import functions as F
+
+result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.begin, result.ner_chunk.end, result.ner_chunk.metadata)).alias("cols")) \
+    .select(F.expr("cols['0']").alias("chunk"),
+            F.expr("cols['1']").alias("begin"),
+            F.expr("cols['2']").alias("end"),
+            F.expr("cols['3']['entity']").alias("ner_label")
+            ).show(100, truncate=False)
+```
+
+
+
+## Results
+
+```bash
++--------+-----+---+---------+
+|chunk   |begin|end|ner_label|
++--------+-----+---+---------+
+|Equity  |1    |6  |LIABILITY|
+|earnings|12   |19 |PROFIT   |
++--------+-----+---+---------+
+```
+
+{:.model-param}
+## Model Information
+
+{:.table-model}
+|---|---|
+|Model Name:|finner_aspect_based_sentiment_md|
+|Compatibility:|Finance NLP 1.0.0+|
+|License:|Licensed|
+|Edition:|Official|
+|Input Labels:|[sentence, token, embeddings]|
+|Output Labels:|[ner]|
+|Language:|en|
+|Size:|16.5 MB|
+
+## Benchmarking
+
+```bash
+         label  precision  recall  f1-score  support
+         ASSET       0.50    0.72      0.59       53
+      CASHFLOW       0.78    0.60      0.68       30
+       EXPENSE       0.71    0.68      0.70      151
+FREE_CASH_FLOW       1.00    1.00      1.00       19
+         GAINS       0.80    0.78      0.79       55
+           KPI       0.72    0.58      0.64      106
+     LIABILITY       0.65    0.51      0.57       39
+        LOSSES       0.77    0.59      0.67       29
+        PROFIT       0.77    0.74      0.75      101
+       REVENUE       0.74    0.78      0.76      231
+     micro-avg       0.72    0.71      0.71      814
+     macro-avg       0.74    0.70      0.71      814
+  weighted-avg       0.73    0.71      0.71      814
+```
diff --git a/docs/_posts/dcecchini/2023-10-06-finembedding_e5_base_en.md b/docs/_posts/dcecchini/2023-10-06-finembedding_e5_base_en.md
index 29bc41a29e..10141dc87d 100644
--- a/docs/_posts/dcecchini/2023-10-06-finembedding_e5_base_en.md
+++ b/docs/_posts/dcecchini/2023-10-06-finembedding_e5_base_en.md
@@ -87,4 +87,5 @@ result. Select("E5.result").show()
 
 ## References
 
+
 In-house curated financial datasets.
diff --git a/docs/_posts/dcecchini/2023-11-09-finembedding_e5_large_en.md b/docs/_posts/dcecchini/2023-11-09-finembedding_e5_large_en.md
new file mode 100644
index 0000000000..d0641108b7
--- /dev/null
+++ b/docs/_posts/dcecchini/2023-11-09-finembedding_e5_large_en.md
@@ -0,0 +1,90 @@
+---
+layout: model
+title: Finance E5 Embedding Large
+author: John Snow Labs
+name: finembedding_e5_large
+date: 2023-11-09
+tags: [finance, en, licensed, e5, sentence_embedding, onnx]
+task: Embeddings
+language: en
+edition: Finance NLP 1.0.0
+spark_version: 3.0
+supported: true
+engine: onnx
+annotator: E5Embeddings
+article_header:
+  type: cover
+use_language_switcher: "Python-Scala-Java"
+---
+
+## Description
+
+This model is a financial version of the E5 large model fine-tuned on in-house curated financial datasets. Reference: Wang, Liang, et al. “Text embeddings by weakly-supervised contrastive pre-training.” arXiv preprint arXiv:2212.03533 (2022).
+
+## Predicted Entities
+
+
+
+{:.btn-box}
+
+
+[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finembedding_e5_large_en_1.0.0_3.0_1699530885080.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
+[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finembedding_e5_large_en_1.0.0_3.0_1699530885080.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
+
+## How to use
+
+
+
+
+{% include programmingLanguageSelectScalaPythonNLU.html %}
+```python
+document_assembler = (
+    nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
+)
+
+E5_embedding = (
+    nlp.E5Embeddings.pretrained(
+        "finembedding_e5_large", "en", "finance/models"
+    )
+    .setInputCols(["document"])
+    .setOutputCol("E5")
+)
+pipeline = nlp.Pipeline(stages=[document_assembler, E5_embedding])
+
+data = spark.createDataFrame(
+    [["What is the best way to invest in the stock market?"]]
+).toDF("text")
+
+result = pipeline.fit(data).transform(data)
+result.select("E5.embeddings").show()
+```
+
+
+
+## Results
+
+```bash
++----------------------------------------------------------------------------------------------------+
+|                                                                                          embeddings|
++----------------------------------------------------------------------------------------------------+
+|[0.8358813, -1.30341, -0.576791, 0.25893408, 0.26888973, 0.028243342, 0.47971666, 0.47653574, 0.4...|
++----------------------------------------------------------------------------------------------------+
+```
+
+{:.model-param}
+## Model Information
+
+{:.table-model}
+|---|---|
+|Model Name:|finembedding_e5_large|
+|Compatibility:|Finance NLP 1.0.0+|
+|License:|Licensed|
+|Edition:|Official|
+|Input Labels:|[document]|
+|Output Labels:|[E5]|
+|Language:|en|
+|Size:|1.2 GB|
+
+## References
+
+In-house annotated financial datasets.
\ No newline at end of file
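
Reviewer note on the Benchmarking table of `finassertion_aspect_based_sentiment_md`: the `macro-avg` and `weighted-avg` rows can be cross-checked against the three per-class rows. The sketch below is plain Python with the numbers copied from that table. Because the inputs are already rounded to two decimals, agreement is only expected at that precision. Names such as `per_class` and `macro_avg` are illustrative and not part of any library.

```python
# Per-class rows copied from the assertion model's Benchmarking table:
# label -> (precision, recall, f1-score, support)
per_class = {
    "NEGATIVE": (0.68, 0.43, 0.53, 232),
    "NEUTRAL": (0.44, 0.65, 0.53, 441),
    "POSITIVE": (0.79, 0.69, 0.74, 947),
}

total_support = sum(row[3] for row in per_class.values())


def macro_avg(metric_index):
    """Unweighted mean of one metric across classes."""
    values = [row[metric_index] for row in per_class.values()]
    return round(sum(values) / len(values), 2)


def weighted_avg(metric_index):
    """Mean of one metric across classes, weighted by class support."""
    weighted_sum = sum(row[metric_index] * row[3] for row in per_class.values())
    return round(weighted_sum / total_support, 2)


# Index 0 = precision, 1 = recall, 2 = f1-score
macro = tuple(macro_avg(i) for i in range(3))
weighted = tuple(weighted_avg(i) for i in range(3))

print("support:", total_support)
print("macro-avg:", macro)
print("weighted-avg:", weighted)
```

The support-weighted recall coincides with the reported accuracy (0.64), which is expected for single-label classification; the gap between the macro and weighted rows reflects the larger `POSITIVE` class.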