-
Notifications
You must be signed in to change notification settings - Fork 245
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding Differential Binarization model from PaddleOCR to Keras3 #1739
base: master
Are you sure you want to change the base?
Conversation
Let's split this up. Start with ResNetVD backbone? Some notes...
|
1826dce
to
753047d
Compare
753047d
to
a5e5d8f
Compare
@gowthamkpr is the PR ready for review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! I have left a reorganization comment.
example for structuring the code - https://github.com/keras-team/keras-hub/tree/master/keras_hub/src/models/sam
Hi @gowthamkpr! can you please refactor the code to KerasHub style?
|
I've refactored using SAM as example.
I've added
I've subclassed
Done. The model is not yet in Kaggle, so I've disabled the presets test for now.
Done. Not sure if there are additional standard test routines other than the ones used in SAM that should be run. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Gowtham! left a few comments!
keras_hub/src/models/differential_binarization/differential_binarization_backbone.py
Show resolved
Hide resolved
56, | ||
256, | ||
), | ||
run_mixed_precision_check=False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does the mixed precision check pass?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. I tried adding an explicit dtype
argument, but the problem remains that the mixed precision check checks against each sublayer of the model. The ResNet backbone, which is instantiated separately, therefore has the wrong dtype
.
keras_hub/src/models/differential_binarization/differential_binarization_test.py
Outdated
Show resolved
Hide resolved
keras_hub/src/models/differential_binarization/differential_binarization_test.py
Outdated
Show resolved
Hide resolved
keras_hub/src/models/differential_binarization/differential_binarization.py
Outdated
Show resolved
Hide resolved
keras_hub/src/models/differential_binarization/differential_binarization.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR Gowtham! Left a few comments. Can you please also add a demo colab in the PR description to verify the model is working before merging?
keras_hub/src/models/differential_binarization/differential_binarization_backbone.py
Show resolved
Hide resolved
keras_hub/src/models/differential_binarization/differential_binarization_backbone.py
Outdated
Show resolved
Hide resolved
keras_hub/src/models/differential_binarization/differential_binarization_backbone.py
Show resolved
Hide resolved
|
||
@keras_hub_export("keras_hub.layers.DifferentialBinarizationImageConverter") | ||
class DifferentialBinarizationImageConverter(ImageConverter): | ||
backbone_cls = DifferentialBinarizationBackbone |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there should be some resizing/rescaling ops here right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depends. Basically these image operations are implemented in the super class, ImageConverter
, and can be used as depicted in the demo colab I've added in the PR description. Dedicated code in this class might make sense to resize to resolutions of multiples of 32, which the model requires. On the other hand, it might be confusing for the user if the masks that are predicted have different resolutions than the input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you might want to look into Segformer for this. The output masks will need to be resized as well
|
||
|
||
@keras_hub_export("keras_hub.models.DifferentialBinarizationOCR") | ||
class DifferentialBinarizationOCR(ImageSegmenter): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to add a new base class for ocr, I don't think ImageSegmenter is a good. one. Do you have a specific reason you chose to subclass ImageSegmenter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually you suggested to subclass ImageSegmenter
(here) if I understood correctly. Technically, the task is somewhat similar to segmentation tasks. We can of course add a separate base class for it to catch the semantic differences, but I would rather name it "scene text detection".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think it is better to add a new base class for OCR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I suggest to create an ImageTextDetector
base class and include in this class the code (from the notebook) for translating the segmentation mask output into polygons, which are often needed in such applications. I'll try to get rid of the OpenCV and shapely dependencies in this code, I believe we don't have them in our requirements.
Does this work for you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the output of the task? Seems like this is not quite OCR right? Not text output?
Is this just a piece of what we would need for a full OCR setup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the output of the task? Seems like this is not quite OCR right? Not text output?
Is this just a piece of what we would need for a full OCR setup?
It performs scene text detection, not OCR in the narrow sense. It belongs to OCR in a wider sense, though.
With text detection, we find where there is text in the image. The model typically outputs polygons or bounding boxes (after postproc), indicating the positions of text fragments. These portions of the image are then fed into an OCR model to get the text in ASCII.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I suggest to create an
ImageTextDetector
base class and include in this class the code (from the notebook) for translating the segmentation mask output into polygons, which are often needed in such applications. I'll try to get rid of the OpenCV and shapely dependencies in this code, I believe we don't have them in our requirements. Does this work for you?
I've added the ImageTextDetector
task with the logic for transformation to polygons. With our dependencies, this requires quite a bit of code, but it works well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gowthamkpr what's the high level use journey we'd expect this to be used with? Is this eventually going to be part of an ocr pipeline? What would the full ocr pipeline code look like? E.g.
ocr = keras_hub.OCR.from_preset("something_paddle_paddle")
ocr.predict(image) # returns text and polygons for text?
Or does this look like a two stage thing where we first go image -> polygons, and second go image and polygons -> text?
I think the thing we are missing is the high level look of the user journey we are trying to cover. Once we have that figured out, the rest of this will become a lot clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest exposing this two stage process to the user, i.e. image -> polygons and image+polygons -> text.
Differential Binarization is not able to perform the actual OCR, so we'll have to staple another model to it to perform the full pipeline. Additionally, it might be useful to have additional preprocessing steps in between, e.g., to rotate a region if a flat, but tilted bounding box has been detected. This is too much detail for hiding it from the user imho.
Maybe something like
# prediction
text_detector = keras_hub.OCR.from_preset("diffbin")
map_output = text_detector.predict(image)
# get polygons
polygons = text_detector.postprocess_to_polygons(map_output)
# OCR (not yet available)
ocr = keras_hub.OCR.from_preset("something")
for polygon in polygons:
image_patch = keras_hub.utils.extract_patch(image, polygon)
text = ocr.predict(image_patch)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly questions and some style stuff.
I am curious what to do with the task here. Does this output a segmentation mask? Maybe most importantly, does this fit into a bigger picture of an OCR system? If so, how do we expect the whole thing to work?
|
||
@keras_hub_export("keras_hub.models.DifferentialBinarizationBackbone") | ||
class DifferentialBinarizationBackbone(Backbone): | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
always start docstring with a one liner
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've improved/added the docstrings here and in losses.py
. ptal
keras_hub/src/models/differential_binarization/differential_binarization_ocr.py
Outdated
Show resolved
Hide resolved
|
||
|
||
@keras_hub_export("keras_hub.models.DifferentialBinarizationOCR") | ||
class DifferentialBinarizationOCR(ImageSegmenter): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the output of the task? Seems like this is not quite OCR right? Not text output?
Is this just a piece of what we would need for a full OCR setup?
backbone=backbone | ||
) | ||
|
||
detector(input_data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does the output of the task look like?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We get the probability map, threshold map and binary map output in the last dimension from the model. I've added some documentation here.
"""Differential Binarization preset configurations.""" | ||
|
||
backbone_presets = { | ||
"diffbin_r50vd_icdar2015": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should use consistent abbreviations. If the class is DifferentialBinarizationXX
, then the preset name would be "differential_binarization"
. Or we could do DiffBinBackbone
and diff_bin_....
for the preset name.
Part of me is actually tempted by the latter, since DifferentialBinarization
is such a mouthful. But I am not sure what is common in other libraries? Is there a common pattern in other projects?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PaddleOCR mostly uses "DB", but I find this abbreviation somewhat indistinctive (in the sense that you have a hard time googling it). I've also seen "DBNet", but I don't know where this name comes from, as the paper never names it "DBNet", but just "Differential Binarization".
Agreed that DifferentialBinarization
is quite long, though, and consistency is always good.
|
||
|
||
class DiceLoss: | ||
def __init__(self, eps=1e-6, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A brief comment explaining these would be helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
return balance_loss | ||
|
||
|
||
class DBLoss(keras.losses.Loss): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again lets be consistent in our abbreviations. We have DiffBin, DB and DifferentialBinarization. Maybe DiffBin is the right middle ground everywhere? But as I mentioned above, not sure what is common in other projects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Can you please see the above comment) I'll rename as soon as we have agreed on a good name.
This adds the Differential Binarization model for scene text detection.
I implemented the architecture based on ResNet50_vd from PaddleOCR and ported the weights.
Demo colab: https://colab.research.google.com/gist/gowthamkpr/bd4a7f7742e92e66cfc57827052b8619/keras_paddleocr_v3.ipynb