Spacy LLM NER fails on repeated entities. This is a big problem. #260

innocent-charles · 2023-08-11T06:47:46Z

innocent-charles
Aug 11, 2023

I have used the spacy-llm NER task for quite some time, and it fails on repeated entities. I enabled logging via spacy_llm.logger and found that OpenAI GPT-3.5 returns the results I expect, even for repeated entities. So the problem lies in transferring that output to doc.ents: doc.ents does not contain the repeated entities.

Example for the case of extracting work experience in resumes:

Consider these two entries:
Work experience one from the resume: From July 2019 up to August 2019 volunteering at VSO INTERNATIONAL in VIJANA NA AJIRA

Work experience two from the resume: PROJECT ZANZIBAR
from January 2018 up to October 2018 volunteering at Buguruni health center as health secretary

When I log the responses, GPT-3.5 does a pretty good job:

Output from spacy_llm.logger:
Working Start_Date_Org_ONE: July 2019
Working End_Date_Org_ONE: August 2019
Working Position_Type_Org_ONE: volunteering
Organization Name_Org_ONE: VSO INTERNATIONAL

Working Start_Date_Org_TWO: January 2018
Working End_Date_Org_TWO: October 2018
Working Position_Type_Org_TWO: volunteering
Organization Name_Org_TWO: Buguruni health center

Output of the spacy-llm NER component. Here doc.ents does not return the same result:

  "Working Start_Date_Org_ONE": "July 2019",
  "Working End_Date_Org_ONE": "August 2019",
  "Working Position_Type_Org_ONE": "volunteering",
  "Organization Name_Org_ONE": "VSO INTERNATIONAL",
  
  "Working Start_Date_Org_TWO": "January 2018",
  "Working End_Date_Org_TWO": "October 2018",
  "Working Position_Type_Org_TWO": " ", //---The problem is here
  "Organization Name_Org_TWO": "Buguruni health center",

This shows that doc.ents does not return the Working Position_Type_Org_TWO entity, because the same ent.text ("volunteering") has already been returned for Working Position_Type_Org_ONE.
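
A minimal sketch of how a repeated string can get lost, assuming (this is a guess about the cause, not taken from the spacy-llm source) that the parser looks up each returned string in the text and keeps only a single match. The helper names `find_first` and `find_all` are made up for illustration, and the two resume entries are joined into one string here:

```python
import re
from typing import List, Tuple


def find_first(text: str, entity_text: str) -> List[Tuple[int, int]]:
    """Return at most one (start, end) pair: the first occurrence only."""
    start = text.find(entity_text)
    return [] if start == -1 else [(start, start + len(entity_text))]


def find_all(text: str, entity_text: str) -> List[Tuple[int, int]]:
    """Return a (start, end) pair for every non-overlapping occurrence."""
    return [m.span() for m in re.finditer(re.escape(entity_text), text)]


# The two work experience entries from above, joined for the example.
resume = (
    "From July 2019 up to August 2019 volunteering at VSO INTERNATIONAL "
    "in VIJANA NA AJIRA. PROJECT ZANZIBAR from January 2018 up to "
    "October 2018 volunteering at Buguruni health center as health secretary"
)

first_only = find_first(resume, "volunteering")
every_hit = find_all(resume, "volunteering")
print(len(first_only), len(every_hit))  # 1 2
```

With a first-match lookup the second "volunteering" has no span of its own, which would look exactly like the empty Working Position_Type_Org_TWO above.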

So, how can I solve this problem? The LLM does a good job, but the framework handles its output differently. Any ideas, please?

rmitsch · 2023-08-11T07:03:37Z

rmitsch
Aug 11, 2023
Maintainer

Hi @innocent-charles, thanks for reporting this! We are aware of the problem and currently working on improvements to the parsing. I can't give you an exact date as to when this will be published, but my current estimation is that next week we'll release v0.5.0 which should fix this issue.

0 replies

innocent-charles · 2023-08-11T09:12:47Z

innocent-charles
Aug 11, 2023
Author

Thanks @rmitsch. There is another problem: when extracting entities from a document, GPT-3.5 sometimes applies its own interpretation and phrases the answer accordingly. In such cases these entities never make it into doc.ents, since they do not match the text as it appears in the original document.

Example:
Document example: "Kenya Civil Aviation Authority, ICT officer from July - Oct 2015".

Output from the spacy-llm logger (the results returned by GPT-3.5):

Working Start_Date_Org: July 2015 // GPT-3.5 returned the entity as required, even though this string does not appear verbatim in the document.
Working End_Date_Org: October 2015 // GPT-3.5 returned the entity as required, even though this string does not appear verbatim in the document.
Working Position_Type_Org: ICT officer
Organization Name_Org: Tanzania Civil Aviation Authority

The output from doc.ents:

  "Working Start_Date_Org": " ",   // doc.ents has no value here, although GPT-3.5 returned one (see above).
  "Working End_Date_Org": " ",     // doc.ents has no value here, although GPT-3.5 returned one (see above).
  "Working Position_Type_Org": "ICT officer",
  "Organization Name_Org": "Tanzania Civil Aviation Authority",

So the problem is that when the results returned by GPT-3.5 do not match the text as it appears in the document, doc.ents does not show them to the user.

This matters because GPT-3.5 is capable and routinely normalizes or reinterprets what it reads, as shown above.

One possible solution: if we are building a pipeline that integrates LLM capabilities into the spaCy framework, it would be better to have another way of storing the LLM's responses, rather than relying only on doc.ents.

2 replies

rmitsch Aug 11, 2023
Maintainer

This should be resolved by the new parsing approach as well.

To be precise, doc.ents in itself is not involved in how information is parsed from the LLM response. Each task in spacy-llm has a parsing implementation that aims to extract information from the LLM response and maps it to a Doc instance. In this case, the information is stored in Doc.ents. So there is an issue with how this information is extracted from the LLM response in the current NER parsing.
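
To illustrate the kind of mapping step described here, a rough sketch follows. It is not spacy-llm's actual parser. It assumes the response consists of "Label: value" lines and that a value only becomes an entity if it occurs verbatim in the text. The function names (`parse_response`, `align`, `map_to_text`) are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def parse_response(response: str) -> List[Tuple[str, str]]:
    """Split 'Label: value' lines into (label, value) pairs."""
    pairs = []
    for line in response.splitlines():
        label, sep, value = line.partition(":")
        if sep:
            pairs.append((label.strip(), value.strip()))
    return pairs


def align(text: str, value: str) -> Optional[Tuple[int, int]]:
    """Return character offsets of value in text, or None if absent."""
    if not value:
        return None
    start = text.find(value)
    return None if start == -1 else (start, start + len(value))


def map_to_text(
    text: str, response: str
) -> Tuple[Dict[str, Tuple[int, int]], Dict[str, str]]:
    """Separate values that align with the text from those that do not."""
    matched: Dict[str, Tuple[int, int]] = {}
    unmatched: Dict[str, str] = {}
    for label, value in parse_response(response):
        offsets = align(text, value)
        if offsets is None:
            unmatched[label] = value  # e.g. a normalized date
        else:
            matched[label] = offsets
    return matched, unmatched


text = "Kenya Civil Aviation Authority, ICT officer from July - Oct 2015"
response = (
    "Working Start_Date_Org: July 2015\n"
    "Working End_Date_Org: October 2015\n"
    "Working Position_Type_Org: ICT officer"
)
matched, unmatched = map_to_text(text, response)
print(sorted(unmatched))  # ['Working End_Date_Org', 'Working Start_Date_Org']
```

Only values in `matched` have character offsets and could become spans on a Doc. The normalized dates end up in `unmatched`, which is why they cannot appear as entities unless the parser handles them separately.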

I'll update here as soon as the new version is published - as mentioned, likely at some point next week 🙂

rmitsch Aug 11, 2023
Maintainer

If you want to tackle this before we release a new version, I recommend having a look at defining a custom task.
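
A rough sketch of what such a custom task could look like, assuming the two-method task interface (`generate_prompts` and `parse_responses`) described in the spacy-llm documentation for the 0.5.x releases. `StubDoc`, `ResumeFieldsTask`, and the `resume_fields` key are made up for illustration. In a real pipeline the class would be registered with spacy-llm's task registry and receive actual spaCy `Doc` objects, whose `user_data` dict is used here to keep every returned value, whether or not it appears verbatim in the text.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List


@dataclass
class StubDoc:
    """Hypothetical stand-in for spacy.tokens.Doc so the sketch runs without spaCy."""

    text: str
    user_data: Dict[str, Any] = field(default_factory=dict)


class ResumeFieldsTask:
    """Hypothetical task that keeps every 'Label: value' pair from the response."""

    def generate_prompts(self, docs: Iterable[StubDoc]) -> Iterable[str]:
        # One prompt per document.
        for doc in docs:
            yield (
                "Extract the work experience fields from the text below, "
                "one 'Label: value' pair per line.\n\n" + doc.text
            )

    def parse_responses(
        self, docs: Iterable[StubDoc], responses: Iterable[str]
    ) -> Iterable[StubDoc]:
        # Store the raw pairs instead of aligning them to the text,
        # so repeated or normalized values are not dropped.
        for doc, response in zip(docs, responses):
            fields: List[Dict[str, str]] = []
            for line in response.splitlines():
                label, sep, value = line.partition(":")
                if sep and value.strip():
                    fields.append({"label": label.strip(), "value": value.strip()})
            doc.user_data["resume_fields"] = fields
            yield doc


task = ResumeFieldsTask()
docs = [StubDoc("Kenya Civil Aviation Authority, ICT officer from July - Oct 2015")]
prompts = list(task.generate_prompts(docs))
# A hand-written response standing in for the LLM output.
fake_response = (
    "Working Start_Date_Org: July 2015\nWorking Position_Type_Org: ICT officer"
)
parsed = list(task.parse_responses(docs, [fake_response]))
print(parsed[0].user_data["resume_fields"])
```

The trade-off is that values stored this way are not spans, so they carry no token offsets and do not show up in doc.ents.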

innocent-charles · 2023-08-11T09:33:27Z

innocent-charles
Aug 11, 2023
Author

Thanks a lot, @rmitsch, I appreciate this helpful clarification. Thank you once again. Let me try to work on it too.

0 replies

rmitsch · 2023-09-01T09:24:27Z

rmitsch
Sep 1, 2023
Maintainer

Short update: this has been fixed in our develop branch. I can't give an exact release date yet, but we're aiming for next week.

0 replies

innocent-charles · 2023-09-01T10:25:36Z

innocent-charles
Sep 1, 2023
Author

Thanks a lot @rmitsch , for the update.

0 replies

innocent-charles · 2023-09-11T09:54:12Z

innocent-charles
Sep 11, 2023
Author

Hello @rmitsch , has this been fixed on this newly released version?

1 reply

rmitsch Sep 11, 2023
Maintainer

Sorry, forgot to update! Yes, 0.5.0 includes improved parsing for the NER task that should resolve this issue.

innocent-charles · 2023-09-23T08:18:11Z

innocent-charles
Sep 23, 2023
Author

Now I get this:

File "pydantic/main.py", line 342, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for NERCoTExample
spans
  field required (type=value_error.missing)

0 replies

innocent-charles · 2023-09-25T09:22:08Z

innocent-charles
Sep 25, 2023
Author

Hello @rmitsch, I have upgraded to version 0.5.1 and changed the task to spacy.NER.v3, but nothing has changed. I still end up with this error:

File "pydantic/main.py", line 342, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for NERCoTExample
spans
  field required (type=value_error.missing)

Please help!

1 reply

rmitsch Sep 25, 2023
Maintainer

Continuing this in #304.
