fix: Title text partially missing issue in recovery_to_markdown.py #14216

Open
wants to merge 3 commits into
base: main

@Coobiw Coobiw commented Nov 13, 2024

When I run the quick tour code below:

import os
import cv2
from PIL import Image
from pathlib import Path
from paddleocr import PPStructure,save_structure_res, draw_structure_result
from paddleocr.ppstructure.recovery.recovery_to_doc import sorted_layout_boxes
from paddleocr.ppstructure.recovery.recovery_to_markdown import convert_info_markdown

# Chinese test image
# table_engine = PPStructure(recovery=True)
# English test image
table_engine = PPStructure(recovery=True, lang='en')

save_folder = './paddleocr_markdown_restore_new'
img_path = './pics/20241113-091849.jpeg'
img = cv2.imread(img_path)
result = table_engine(img)
save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0])

for line in result:
    # line.pop('img')
    print(line)

im_show = draw_structure_result(img, result, font_path='/cpfs/data/user/zhiqi/mm_ocr/got/fonts/simfang.ttf')
im_show = Image.fromarray(im_show)
im_show.save(f'{save_folder}/{Path(img_path).stem}_vis_paddle.jpg')
h, w, _ = img.shape
res = sorted_layout_boxes(result, w)
convert_info_markdown(res, save_folder, os.path.basename(img_path).split('.')[0])

The output Markdown file has minor errors in the title text. The results are shown below.

Test PDF screenshot:
img_v3_02gj_5e6fbf19-198e-47e5-a13f-2c455dd5901g

Part of the originally recovered Markdown:

# 3

# 3.1
...

# 3.1.1
...

The title text is incomplete. The source code appends only the first detected text fragment of each title:

elif region["type"].lower() == "title":
    markdown_string.append(f"""# {region["res"][0]["text"]}""")

So I modified recovery_to_markdown.py. After the change, the title text is complete, as follows:

# 3 Contribucion de la JERS al marco de politicas

# 3.1 Sector bancario
...

# 3.1.1 Dictamenes relativos al articulo 458 del R eglamento sobre Requisitos de Capital
...
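
The idea behind the change can be sketched roughly as follows. This is a minimal illustration, not the exact patch: `title_to_markdown` is a hypothetical helper name, and it assumes `region["res"]` is a list of dicts that each carry a `"text"` key, as in the snippet above.

```python
def title_to_markdown(region):
    """Join every recognized text fragment of a title region into one heading.

    Hypothetical helper for illustration; assumes region["res"] is a list of
    dicts with a "text" key, as in the original snippet.
    """
    texts = [item["text"] for item in region["res"]]
    return "# " + " ".join(texts)


# Example region shaped like the layout result (values are made up).
region = {
    "type": "title",
    "res": [{"text": "3.1"}, {"text": "Sector bancario"}],
}
print(title_to_markdown(region))  # "# 3.1 Sector bancario"
```

Joining with a single space mirrors the output above, where the section number and the title words were detected as separate fragments.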

@CLAassistant

CLAassistant commented Nov 13, 2024

CLA assistant check
All committers have signed the CLA.

@GreatV
Collaborator

GreatV commented Nov 15, 2024

Please fix the code style and sign the CLA.

@Coobiw
Author

Coobiw commented Nov 27, 2024

Hi @GreatV, both are done. Looking forward to your review. Thanks!

markdown_string.append(
f"""# {region['res'][0]['text']}"""
+ "".join(
[" " + one_region["text"] for one_region in region["res"][1:]]
Collaborator


Is it possible for region["res"] to contain only one element, and could this lead to potential out-of-bounds issues?
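
A quick way to check the concern: in Python, slicing past the end of a list returns an empty list rather than raising, so `region["res"][1:]` is safe for a single-element list. Only the `[0]` index would fail, and only if the list were empty. The sketch below illustrates this; `title_heading` and its empty-list guard are hypothetical and not part of the PR as shown.

```python
def title_heading(res):
    """Build a Markdown heading from a title region's `res` list.

    Hypothetical helper; the empty-list guard is an assumption added for
    illustration and is not in the PR as shown.
    """
    if not res:
        return ""
    first, rest = res[0], res[1:]  # res[1:] is [] when len(res) == 1
    return f"# {first['text']}" + "".join(" " + item["text"] for item in rest)


one = [{"text": "3"}]
assert one[1:] == []       # slicing never raises IndexError
print(title_heading(one))  # "# 3"
print(title_heading(one + [{"text": "Sector bancario"}]))
```

Whether an empty `res` can actually occur for a title region depends on the layout pipeline, which is not confirmed here.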
