Ghosting when translating scanned documents / feat (main): supports ocr on scanned document #19

jackiehejian · 2024-11-07T03:32:16Z

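When a PDF consists entirely of images rather than editable (copyable) text, translation fails completely; see the screenshot for details.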

Byaidu · 2024-11-07T03:59:48Z

Image-based PDFs can't be translated yet. For now the focus is still on improving translation quality for e-books and papers.

jackiehejian · 2024-11-07T05:30:21Z

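"Image-based PDFs can't be translated yet. For now the focus is still on improving translation quality for e-books and papers."

OK, thank you very much.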

fireinrain · 2024-11-08T02:11:05Z

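Pages that are nothing but images are a tough ask: OCR quality determines the quality of the extracted text, which in turn determines the quality of the translation.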
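
To make the dependency on OCR quality concrete, here is a minimal sketch that drops low-confidence lines before they would reach a translator. It assumes the OCR engine returns each line as a `[box, (text, score)]` pair, which is the shape the PaddleOCR snippet in this thread indexes with `line[1][0]`. The function name `filter_ocr_lines` and the 0.8 threshold are hypothetical choices for illustration, not part of pdf2zh.

```python
from typing import List, Sequence, Tuple

# One OCR line: (bounding box, (recognized text, confidence score)).
OcrLine = Tuple[Sequence, Tuple[str, float]]


def filter_ocr_lines(lines: List[OcrLine], min_score: float = 0.8) -> List[str]:
    """Keep only the text of lines whose confidence is at least min_score."""
    return [text for _box, (text, score) in lines if score >= min_score]


# Hypothetical OCR output: one clean line and one badly recognized line.
sample = [
    ([[0, 0], [10, 0], [10, 5], [0, 5]], ("Attention is all you need", 0.97)),
    ([[0, 6], [10, 6], [10, 11], [0, 11]], ("Attenti0n 1s a11", 0.41)),
]
print(filter_ocr_lines(sample))
```

Dropping lines is the bluntest option; a real pipeline might instead flag them for review, since discarded text leaves gaps in the translated page.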

xxsunyxx · 2024-11-19T03:24:03Z

Add PaddleOCR as an optional step.
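
One way to make OCR optional is to decide per page whether a usable text layer exists and fall back to OCR only when it does not. The sketch below is an assumption about how such a switch could look: `PageInfo`, `needs_ocr`, `plan_pipeline`, and the 20-character threshold are hypothetical names and values, and the extracted text would come from whatever PDF library the pipeline already uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageInfo:
    """Hypothetical summary of one PDF page."""
    extracted_text: str  # text found in the page's text layer
    image_count: int     # number of embedded images on the page


def needs_ocr(page: PageInfo, min_chars: int = 20) -> bool:
    """Treat a page as scanned when it has images but almost no extractable text."""
    has_text = len(page.extracted_text.strip()) >= min_chars
    return page.image_count > 0 and not has_text


def plan_pipeline(pages: List[PageInfo], ocr_enabled: bool) -> List[str]:
    """Return the processing route chosen for each page."""
    routes = []
    for page in pages:
        if needs_ocr(page):
            # Without OCR the page cannot be translated, so surface a warning instead.
            routes.append("ocr" if ocr_enabled else "skip-with-warning")
        else:
            routes.append("text-layer")
    return routes
```

With this shape, users who don't install an OCR engine keep the existing behavior and get a warning on scanned pages rather than a silent failure.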

xxnuo · 2024-11-20T15:41:20Z

sayura
This model is very accurate, but it needs more compute than PaddleOCR.

Byaidu · 2024-11-20T15:42:33Z

"sayura: This model is very accurate, but it needs more compute than PaddleOCR."

How does it compare with minerU/marker?

xxnuo · 2024-11-21T08:33:36Z

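sayura is the open-source multilingual and table OCR model made by the author of marker 😂
I haven't tested minerU; I only tested the PaddleOCR high-accuracy model. Sayura performs much better than it and also handles multiple languages well.
Judging from minerU's issues, its multilingual support doesn't seem to be good.
The downside is that Sayura needs quite a lot of GPU memory, which is a headache, and I don't really know how to quantize models.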

xxnuo · 2024-12-02T01:57:30Z

How is the OCR work going, folks? I think building it on paddleocr would work well. If someone is already working on it, I won't reinvent the wheel. @reycn @Byaidu

Byaidu · 2024-12-02T02:24:49Z

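"How is the OCR work going, folks? I think building it on paddleocr would work well. If someone is already working on it, I won't reinvent the wheel. @reycn @Byaidu"

Nothing has been done on it yet…

If you get it written, contributions are welcome.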

hellofinch · 2024-12-06T07:35:21Z

from typing import BinaryIO

import numpy as np
import tqdm
from paddleocr import PaddleOCR
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
from pymupdf import Document, Font

from pdf2zh.converter import TranslateConverter
from pdf2zh.pdfinterp import PDFPageInterpreterEx

def extract_text_to_fp(
    inf: BinaryIO,
    pages=None,
    password: str = "",
    debug: bool = False,
    page_count: int = 0,
    vfont: str = "",
    vchar: str = "",
    thread: int = 0,
    doc_en: Document = None,
    model=None,
    lang_in: str = "",
    lang_out: str = "",
    service: str = "",
    resfont: str = "",
    noto: Font = None,
    callback: object = None,
    **kwarg,
) -> tuple[dict, list[str]]:
    ocr = PaddleOCR(use_angle_cls=True, lang="en")
    rsrcmgr = PDFResourceManager()
    layout = {}
    device = TranslateConverter(
        rsrcmgr, vfont, vchar, thread, layout, lang_in, lang_out, service, resfont, noto
    )

    assert device is not None
    obj_patch = {}
    interpreter = PDFPageInterpreterEx(rsrcmgr, device, obj_patch)
    if pages:
        total_pages = len(pages)
    else:
        total_pages = page_count

    parser = PDFParser(inf)
    doc = PDFDocument(parser, password=password)
    result_text = []  # OCR text collected across all processed pages
    with tqdm.tqdm(
        enumerate(PDFPage.create_pages(doc)),
        total=total_pages,
    ) as progress:
        for pageno, page in progress:
            if pages and (pageno not in pages):
                continue
            if callback:
                callback(progress)
            page.pageno = pageno
            pix = doc_en[page.pageno].get_pixmap()
            # np.frombuffer replaces the deprecated binary mode of np.fromstring
            image = np.frombuffer(pix.samples, np.uint8).reshape(
                pix.height, pix.width, 3
            )[:, :, ::-1]
            page_layout = model.predict(image, imgsz=int(pix.height / 32) * 32)[0]
            # A k-d tree is overkill here; rasterize the layout into an array instead, trading space for time
            box = np.ones((pix.height, pix.width))
            h, w = box.shape
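            vcls = ["abandon", "figure", "table", "isolate_formula", "formula_caption"]
            for i, d in enumerate(page_layout.boxes):
                text = ""
                if page_layout.names[int(d.cls)] not in vcls:
                    bx0, by0, bx1, by1 = d.xyxy.squeeze()
                    # The layout mask uses flipped y (PDF origin is bottom-left)
                    x0, y0, x1, y1 = (
                        np.clip(int(bx0 - 1), 0, w - 1),
                        np.clip(int(h - by1 - 1), 0, h - 1),
                        np.clip(int(bx1 + 1), 0, w - 1),
                        np.clip(int(h - by0 + 1), 0, h - 1),
                    )
                    box[y0:y1, x0:x1] = i + 2
                    if page_layout.names[int(d.cls)] == "plain text":
                        # Crop in image coordinates (origin top-left), so use the unflipped y values
                        iy0, iy1 = max(int(by0), 0), min(int(by1), h)
                        # Copy the crop: the BGR view above has a negative channel stride
                        imagex = np.ascontiguousarray(image[iy0:iy1, x0:x1])
                        result = ocr.ocr(imagex, cls=False)
                        # PaddleOCR can return None entries when no text is detected
                        for res in result or []:
                            for line in res or []:
                                text += line[1][0]
                        result_text.append(text)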
            for i, d in enumerate(page_layout.boxes):
                if page_layout.names[int(d.cls)] in vcls:
                    x0, y0, x1, y1 = d.xyxy.squeeze()
                    x0, y0, x1, y1 = (
                        np.clip(int(x0 - 1), 0, w - 1),
                        np.clip(int(h - y1 - 1), 0, h - 1),
                        np.clip(int(x1 + 1), 0, w - 1),
                        np.clip(int(h - y0 + 1), 0, h - 1),
                    )
                    box[y0:y1, x0:x1] = 0
            layout[page.pageno] = box
            # Create a new xref to hold the new instruction stream
            page.page_xref = doc_en.get_new_xref()  # hack: new xref inserted into the page
            doc_en.update_object(page.page_xref, "<<>>")
            doc_en.update_stream(page.page_xref, b"")
            doc_en[page.pageno].set_contents(page.page_xref)
            interpreter.process_page(page)

    device.close()
    return obj_patch, result_text

This only covers the OCR part; I really can't figure out how to pass the OCR results on to the later stages.
:(
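
One possible way to hand OCR output to later stages is to keep each recognized string together with its page number and a bounding box converted to PDF coordinates, so that a translator and a renderer can consume the records independently. This is only a sketch under assumptions: `OcrBlock`, `image_box_to_pdf`, and `collect_blocks` are hypothetical names, not pdf2zh APIs, and the conversion assumes the page was rendered at a uniform scale with no rotation.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class OcrBlock:
    """Hypothetical record carrying one OCR result to later stages."""
    pageno: int
    bbox_pdf: Box  # PDF units, origin at bottom-left
    text: str


def image_box_to_pdf(box: Box, image_height: int, scale: float = 1.0) -> Box:
    """Convert a pixel box (origin top-left) to PDF space (origin bottom-left)."""
    x0, y0, x1, y1 = box
    # Flipping y swaps which edge is the bottom, hence y1 feeds the new y0.
    return (
        x0 / scale,
        (image_height - y1) / scale,
        x1 / scale,
        (image_height - y0) / scale,
    )


def collect_blocks(
    pageno: int,
    image_height: int,
    regions: List[Tuple[Box, str]],
    scale: float = 1.0,
) -> List[OcrBlock]:
    """Build records from (pixel box, text) pairs, dropping empty texts."""
    return [
        OcrBlock(pageno, image_box_to_pdf(box, image_height, scale), text.strip())
        for box, text in regions
        if text.strip()
    ]


blocks = collect_blocks(0, 100, [((10, 20, 50, 40), "Hello"), ((0, 0, 5, 5), "  ")])
print(blocks)
```

Keeping the box alongside the text matters because the translated string eventually has to be drawn back into the same region of the page; a flat list of strings loses that link.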

xxnuo · 2024-12-15T13:43:33Z

https://github.com/jingsongliujing/OnnxOCR

xxnuo · 2024-12-20T10:09:36Z

https://huggingface.co/spaces/stepfun-ai/GOT_official_online_demo

Byaidu added the enhancement New feature or request label Nov 7, 2024

Byaidu mentioned this issue Nov 19, 2024

Progress bar completes but nothing is translated #64

Closed

reycn changed the title ~~Translation fails when every page of the PDF is an image~~ feat (main): supports ocr on scanned document Nov 21, 2024

reycn added the help wanted Extra attention is needed label Nov 21, 2024

Byaidu mentioned this issue Nov 28, 2024

Problem with scanned PDFs #140

Closed

hellofinch mentioned this issue Dec 9, 2024

Cannot translate properly #185

Closed

Byaidu mentioned this issue Dec 11, 2024

Output is still in English after translation and is stacked on top of the original #212

Closed

hellofinch mentioned this issue Dec 13, 2024

Translated text overlaps heavily #62

Closed

Byaidu changed the title ~~feat (main): supports ocr on scanned document~~ Ghosting when translating scanned documents / feat (main): supports ocr on scanned document Dec 13, 2024

Byaidu pinned this issue Dec 13, 2024

Byaidu mentioned this issue Dec 15, 2024

Translated PDF text covers the original (high-quality scan) #235

Closed

This was referenced Dec 16, 2024

Will translation of "image-based PDFs" be supported in the future? #239

Closed

Cannot translate text inside images in a PDF #269

Closed

Byaidu mentioned this issue Dec 18, 2024

Non-standard PDFs cause translation to fail #280

Closed

This was referenced Dec 18, 2024

Scanned-document detection & warning output #264

Closed

Translated output is still in English #296

Closed
