Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding Differential Binarization model from PaddleOCR to Keras3 #1739

Open
wants to merge 22 commits into
base: master
Choose a base branch
from

Conversation

gowthamkpr
Copy link
Collaborator

@gowthamkpr gowthamkpr commented Aug 6, 2024

This adds the Differential Binarization model for scene text detection.

I implemented the architecture based on ResNet50_vd from PaddleOCR and ported the weights.

Demo colab: https://colab.research.google.com/gist/gowthamkpr/bd4a7f7742e92e66cfc57827052b8619/keras_paddleocr_v3.ipynb

@mattdangerw mattdangerw changed the base branch from master to keras-hub August 6, 2024 17:36
@mattdangerw
Copy link
Member

Let's split this up. Start with ResNetVD backbone?

Some notes...

  • Remove the aliases. One ResNetVDBackbone can handle all of these with different presets.
  • Conversion scripts as scripts not colabs.
  • Follow the local style for backbones as closely as possible. See some comments here Add VGG16 and VGG19 backbone #1737
  • Keep models a flat directory. No backbones/xx etc.
  • Add some tests.

@divyashreepathihalli
Copy link
Collaborator

@gowthamkpr is the PR ready for review?

Copy link
Collaborator

@divyashreepathihalli divyashreepathihalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! I have left a reorganization comment.

example for structuring the code - https://github.com/keras-team/keras-hub/tree/master/keras_hub/src/models/sam

keras_nlp/src/models/diffbin/diffbin.py Outdated Show resolved Hide resolved
keras_nlp/src/models/diffbin/diffbin.py Outdated Show resolved Hide resolved
keras_nlp/src/models/diffbin/losses.py Outdated Show resolved Hide resolved
@gowthamkpr gowthamkpr changed the base branch from keras-hub to master October 22, 2024 20:24
@divyashreepathihalli
Copy link
Collaborator

Hi @gowthamkpr! can you please refactor the code to KerasHub style?

  • Add a preprocessor flow
  • subclass image segementer model for the task class
  • add preset class
  • add standard test routines

@gowthamkpr
Copy link
Collaborator Author

Hi @gowthamkpr! can you please refactor the code to KerasHub style?

I've refactored using SAM as example.

* [ ]  Add a preprocessor flow

I've added DifferentialBinarizationPreprocessor and DifferentialBinarizationImageConverter.

* [ ]  subclass image segementer model for the task class

I've subclassed ImageSegmenter, but I left the custom compile() method, since we need a different loss than the one used in ImageSegmenter's compile().

* [ ]  add preset class

Done. The model is not yet in Kaggle, so I've disabled the presets test for now.

* [ ]  add standard test routines

Done. Not sure if there are additional standard test routines other than the ones used in SAM that should be run.

Copy link
Collaborator

@divyashreepathihalli divyashreepathihalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Gowtham! left a few comments!

56,
256,
),
run_mixed_precision_check=False,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the mixed precision check pass?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. I tried adding an explicit dtype argument, but the problem remains that the mixed precision check checks against each sublayer of the model. The ResNet backbone, which is instantiated separately, therefore has the wrong dtype.

Copy link
Collaborator

@divyashreepathihalli divyashreepathihalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR Gowtham! Left a few comments. Can you please also add a demo colab in the PR description to verify the model is working before merging?


@keras_hub_export("keras_hub.layers.DifferentialBinarizationImageConverter")
class DifferentialBinarizationImageConverter(ImageConverter):
backbone_cls = DifferentialBinarizationBackbone
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there should be some resizing/rescaling ops here right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depends. Basically these image operations are implemented in the super class, ImageConverter, and can be used as depicted in the demo colab I've added in the PR description. Dedicated code in this class might make sense to resize to resolutions of multiples of 32, which the model requires. On the other hand, it might be confusing for the user if the masks that are predicted have different resolutions than the input.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you might want to look into Segformer for this. The output masks will need to be resized as well



@keras_hub_export("keras_hub.models.DifferentialBinarizationOCR")
class DifferentialBinarizationOCR(ImageSegmenter):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to add a new base class for ocr, I don't think ImageSegmenter is a good. one. Do you have a specific reason you chose to subclass ImageSegmenter?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually you suggested to subclass ImageSegmenter (here) if I understood correctly. Technically, the task is somewhat similar to segmentation tasks. We can of course add a separate base class for it to catch the semantic differences, but I would rather name it "scene text detection".

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think it is better to add a new base class for OCR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I suggest to create an ImageTextDetector base class and include in this class the code (from the notebook) for translating the segmentation mask output into polygons, which are often needed in such applications. I'll try to get rid of the OpenCV and shapely dependencies in this code, I believe we don't have them in our requirements.
Does this work for you?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the output of the task? Seems like this is not quite OCR right? Not text output?

Is this just a piece of what we would need for a full OCR setup?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the output of the task? Seems like this is not quite OCR right? Not text output?

Is this just a piece of what we would need for a full OCR setup?

It performs scene text detection, not OCR in the narrow sense. It belongs to OCR in a wider sense, though.

With text detection, we find where there is text in the image. The model typically outputs polygons or bounding boxes (after postproc), indicating the positions of text fragments. These portions of the image are then fed into an OCR model to get the text in ASCII.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I suggest to create an ImageTextDetector base class and include in this class the code (from the notebook) for translating the segmentation mask output into polygons, which are often needed in such applications. I'll try to get rid of the OpenCV and shapely dependencies in this code, I believe we don't have them in our requirements. Does this work for you?

I've added the ImageTextDetector task with the logic for transformation to polygons. With our dependencies, this requires quite a bit of code, but it works well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gowthamkpr what's the high level use journey we'd expect this to be used with? Is this eventually going to be part of an ocr pipeline? What would the full ocr pipeline code look like? E.g.

ocr = keras_hub.OCR.from_preset("something_paddle_paddle")
ocr.predict(image) # returns text and polygons for text?

Or does this look like a two stage thing where we first go image -> polygons, and second go image and polygons -> text?

I think the thing we are missing is the high level look of the user journey we are trying to cover. Once we have that figured out, the rest of this will become a lot clearer.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest exposing this two stage process to the user, i.e. image -> polygons and image+polygons -> text.

Differential Binarization is not able to perform the actual OCR, so we'll have to staple another model to it to perform the full pipeline. Additionally, it might be useful to have additional preprocessing steps in between, e.g., to rotate a region if a flat, but tilted bounding box has been detected. This is too much detail for hiding it from the user imho.

Maybe something like

# prediction
text_detector = keras_hub.OCR.from_preset("diffbin")
map_output = text_detector.predict(image)
# get polygons
polygons = text_detector.postprocess_to_polygons(map_output)

# OCR (not yet available)
ocr = keras_hub.OCR.from_preset("something")
for polygon in polygons:
  image_patch = keras_hub.utils.extract_patch(image, polygon)
  text = ocr.predict(image_patch)

Copy link
Member

@mattdangerw mattdangerw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly questions and some style stuff.

I am curious what to do with the task here. Does this output a segmentation mask? Maybe most importantly, does this fit into a bigger picture of an OCR system? If so, how do we expect the whole thing to work?


@keras_hub_export("keras_hub.models.DifferentialBinarizationBackbone")
class DifferentialBinarizationBackbone(Backbone):
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

always start docstring with a one liner

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've improved/added the docstrings here and in losses.py. ptal



@keras_hub_export("keras_hub.models.DifferentialBinarizationOCR")
class DifferentialBinarizationOCR(ImageSegmenter):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the output of the task? Seems like this is not quite OCR right? Not text output?

Is this just a piece of what we would need for a full OCR setup?

backbone=backbone
)

detector(input_data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does the output of the task look like?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We get the probability map, threshold map and binary map output in the last dimension from the model. I've added some documentation here.

"""Differential Binarization preset configurations."""

backbone_presets = {
"diffbin_r50vd_icdar2015": {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use consistent abbreviations. If the class is DifferentialBinarizationXX, then the preset name would be "differential_binarization". Or we could do DiffBinBackbone and diff_bin_.... for the preset name.

Part of me is actually tempted by the latter, since DifferentialBinarization is such a mouthful. But I am not sure what is common in other libraries? Is there a common pattern in other projects?

Copy link
Collaborator Author

@gowthamkpr gowthamkpr Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PaddleOCR mostly uses "DB", but I find this abbreviation somewhat indistinctive (in the sense that you have a hard time googling it). I've also seen "DBNet", but I don't know where this name comes from, as the paper never names it "DBNet", but just "Differential Binarization".

Agreed that DifferentialBinarization is quite long, though, and consistency is always good.



class DiceLoss:
def __init__(self, eps=1e-6, **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A brief comment explaining these would be helpful.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

keras_hub/src/models/differential_binarization/losses.py Outdated Show resolved Hide resolved
return balance_loss


class DBLoss(keras.losses.Loss):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again lets be consistent in our abbreviations. We have DiffBin, DB and DifferentialBinarization. Maybe DiffBin is the right middle ground everywhere? But as I mentioned above, not sure what is common in other projects.

Copy link
Collaborator Author

@gowthamkpr gowthamkpr Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Can you please see the above comment) I'll rename as soon as we have agreed on a good name.

@divyashreepathihalli divyashreepathihalli added the kokoro:force-run Runs Tests on GPU label Dec 10, 2024
@kokoro-team kokoro-team removed the kokoro:force-run Runs Tests on GPU label Dec 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants