The way in which the trained pixel classifier for text-image segmentation is integrated here makes these predictions completely unusable. The reason for this is actually quite simple (see `ocrd_anybaseocr/ocrd_anybaseocr/cli/ocrd_anybaseocr_tiseg.py`, lines 130 to 137 at e63f555):

Here, the predictions for the text (1) and image (2) classes compete with the background (0) class. Wherever the argmax favours background over both, all is lost. This would be somewhat expected and acceptable if this method had been trained as a binarization method (on suitable GT and with suitable augmentation). But apparently, it is not.
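To illustrate the problem, here is a minimal numpy sketch (not the actual code from `ocrd_anybaseocr_tiseg.py`; the variable names and the `(H, W, 3)` score layout are assumptions):

```python
import numpy as np

def split_by_argmax(scores):
    """Hypothetical stand-in for the current integration.

    scores: per-pixel class probabilities from the pixel classifier,
    assumed shape (H, W, 3) with channels 0=background, 1=text, 2=image.
    """
    # Three-way argmax: wherever background narrowly beats both text and
    # image, the pixel ends up in neither mask and is lost for segmentation.
    labels = np.argmax(scores, axis=-1)
    text_mask = labels == 1
    image_mask = labels == 2
    return text_mask, image_mask
```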
@mahmed1995 @mjenckel, am I correct in assuming you've used keras_segmentation for this, with 3 classes – 1 for text regions, 2 for image regions and 0 for background? What was the GT?
The obvious fix would be to just compare text vs image scores, and apply the result as an alpha mask on the original image. The result actually does look somewhat better.
(result screenshot: image vs text as alpha mask)
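A minimal sketch of that fix (assuming the same `(H, W, 3)` score array and a PIL image of the page; these names are hypothetical, not the processor's actual API):

```python
import numpy as np
from PIL import Image

def apply_text_alpha(scores, original_img):
    """original_img: PIL RGB image of the raw page (assumed same size as scores)."""
    # Compare only text (1) vs image (2) scores; background (0) is ignored.
    text_wins = scores[..., 1] > scores[..., 2]
    # Use the winner as an alpha mask on the original image.
    alpha = Image.fromarray(np.where(text_wins, 255, 0).astype(np.uint8))
    result = original_img.convert('RGBA')
    result.putalpha(alpha)
    return result
```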
But does any consuming processor actually make use of the alpha channel? I highly doubt it.
Since the model was obviously trained on raw images, we have to apply it to raw images. But we can still take binarized images (from a binarization step in the workflow) and apply our resulting mask to them – by filling with white.
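Applied to an already binarized page, that could look like this (again a hedged sketch with assumed inputs, not the existing processor interface; the binarized page is assumed to be a PIL image in mode 'L'):

```python
import numpy as np
from PIL import Image

def mask_binarized(binarized_img, scores):
    # Pixels where the classifier prefers the image class over the text class.
    image_wins = scores[..., 2] > scores[..., 1]
    mask = Image.fromarray(np.where(image_wins, 255, 0).astype(np.uint8))
    # Fill the predicted image regions of the binarized page with white, so the
    # text-only result stays a plain grayscale image (no alpha channel needed).
    white = Image.new(binarized_img.mode, binarized_img.size, 255)
    return Image.composite(white, binarized_img, mask)
```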
That seems like the better OCR-D interface to me. (Of course, contour-finding and annotation via coordinates would still be better than as clipped derived image.) What do you think, @kba?
Also, I think it's not a good idea to just keep the best-scoring pixels independently of each other. This leaves results unnecessarily noisy and flickery, especially where confidence is already low. Smoothing via morphological post-processing (e.g. by closing the argmax results with a suitable kernel) or filtering (e.g. by a Gaussian filter on the scores) etc. should be applied. (Ideally, the model itself would get trained with a fc-CRF top layer, but that's out of scope here.) What's the "right way" to do this?
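For example (a sketch using scipy, under the same assumed score layout; kernel size and sigma are arbitrary illustrations):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, binary_closing

def smooth_masks(scores, sigma=2, kernel=5):
    # Smooth each class's score map spatially before taking the argmax, so that
    # isolated low-confidence pixels follow their neighbourhood.
    smoothed = gaussian_filter(scores, sigma=(sigma, sigma, 0))
    labels = np.argmax(smoothed, axis=-1)
    # Morphological closing then removes remaining specks/holes in the masks.
    structure = np.ones((kernel, kernel), dtype=bool)
    text_mask = binary_closing(labels == 1, structure=structure)
    image_mask = binary_closing(labels == 2, structure=structure)
    return text_mask, image_mask
```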
Considering that the result shown above is still unusable, I think we need to add post-processing for the neural segmentation.
Lastly, regarding the legacy text-image segmentation that is also integrated here: this one does at least work reliably:
(result screenshots: image part / text part)
However, both of these approaches seem to only look for images, not for line-art separators at all. IMHO the latter task is much more needed (considering the tools currently available in OCR-D).