Character confidence threshold #3860
LGTM!
@property
def TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD(self) -> int:
    """Tesseract predictions with confidence below this threshold are ignored"""
    return self._get_float("TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD", 0.0)
I wonder whether we'd like to have a really low default threshold, e.g. 0.1, just to filter out complete garbage chars?
I am ok with 0; the default behavior is no filter at all so this PR should just keep that for now. We can use followups to change this value.
image: np.ndarray,
lang: str = "eng",
config: str = "",
character_confidence_threshold: float = 0.5,
Here we are adding some default, so maybe let's also keep it in config?
I see below we again have 0.5 as a default in hocr_to_dataframe, so either way, I would unify those.
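One way to unify the defaults discussed above is to keep a single constant and let every function fall back to it. The sketch below is illustrative only: `DEFAULT_CHARACTER_CONFIDENCE_THRESHOLD` and `resolve_threshold` are hypothetical names, and reading the value straight from an environment mapping is an assumption, not the project's actual config mechanism.

```python
import os
from typing import Mapping, Optional

# Hypothetical single source of truth; 0.0 keeps the "no filtering" behavior.
DEFAULT_CHARACTER_CONFIDENCE_THRESHOLD = 0.0
ENV_NAME = "TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD"


def resolve_threshold(
    explicit: Optional[float] = None,
    env: Optional[Mapping[str, str]] = None,
) -> float:
    """Return the explicit value if given, else the env value, else the default."""
    if explicit is not None:
        return explicit
    env = os.environ if env is None else env
    raw = env.get(ENV_NAME)
    # An unset or empty variable falls back to the shared default.
    return float(raw) if raw else DEFAULT_CHARACTER_CONFIDENCE_THRESHOLD
```

With this shape, function signatures can take `character_confidence_threshold: Optional[float] = None` and call `resolve_threshold(...)`, so no literal such as 0.5 is repeated in more than one place.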
    ocr_df = self.hocr_to_dataframe(hocr, character_confidence_threshold)
    return ocr_df

def hocr_to_dataframe(
What's the compute performance of this code? We were essentially relying on Tesseract's internal C++ code to parse results, but here we do it in Python.
I have not analyzed this. We simply iterate over ~300 words, so I am not sure there is any risk of significant slowdowns. What do you think?
"width": right - left,
"height": bottom - top,
Small nit on performance: we can create the df from the bbox values first, then use vector ops to compute width and height (and overwrite the data for right and bottom).
done
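For readers following the nit above, this is roughly what the vectorized variant could look like. It is a sketch under assumptions: the bounding boxes are taken to be `(left, top, right, bottom)` tuples, the column names and `bboxes_to_dataframe` helper are hypothetical, and the sample values are made up.

```python
import pandas as pd


def bboxes_to_dataframe(bboxes, texts):
    """Build the frame from raw corners, then derive width/height column-wise."""
    df = pd.DataFrame(bboxes, columns=["left", "top", "right", "bottom"])
    df["text"] = texts
    # Column-wise subtraction replaces per-row arithmetic in a Python loop.
    df["width"] = df["right"] - df["left"]
    df["height"] = df["bottom"] - df["top"]
    # The corner columns are no longer needed once width/height exist.
    return df.drop(columns=["right", "bottom"])


# Illustrative input only.
df = bboxes_to_dataframe([(10, 20, 50, 40), (60, 20, 95, 42)], ["Hello", "world"])
```

For a few hundred words the difference is likely small, but the column-wise form keeps the arithmetic in one place.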
This change adds the ability to filter out characters predicted by Tesseract with low confidence scores.
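As a rough illustration of the filtering idea (not the PR's actual implementation), the sketch below drops characters whose confidence falls below a threshold. The `char` and `confidence` column names, the 0 to 1 confidence scale, and the `filter_low_confidence` helper are all assumptions made for the example.

```python
import pandas as pd


def filter_low_confidence(chars: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Keep only rows whose confidence is at least `threshold`."""
    kept = chars[chars["confidence"] >= threshold]
    return kept.reset_index(drop=True)


# Illustrative data: two plausible characters and one low-confidence one.
chars = pd.DataFrame(
    {"char": ["H", "e", "#"], "confidence": [0.95, 0.80, 0.05]}
)
filtered = filter_low_confidence(chars, threshold=0.1)
text = "".join(filtered["char"])
```

With a threshold of 0.0 and non-negative confidences, every row passes, which matches the "no filter by default" behavior discussed in the review.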
Some notes: