Fix Font Processing Error #313

Merged
merged 1 commit into from
Dec 21, 2024
@7shi (Contributor) commented Dec 21, 2024

Overview

Fixed an issue where certain PDF files would fail with KeyError: 'china-ss' due to incorrect font handling during doc_zh construction.

Problem Details

Certain PDF files fail during translation. For example, the following error occurs:

$ wget https://davidhestenes.net/geocalc/pdf/Tutorial%20on%20Geometric%20Calculus.pdf
$ uv run pdf2zh "Tutorial on Geometric Calculus.pdf"
  6%|███▋                                                      | 1/16 [00:00<00:05,  2.64it/s]
Traceback (most recent call last):
  File "/home/7shi/repos/PDFMathTranslate/.venv/bin/pdf2zh", line 8, in <module>
    sys.exit(main())
  File "/home/7shi/repos/PDFMathTranslate/pdf2zh/pdf2zh.py", line 192, in main
    translate(**vars(parsed_args))
  File "/home/7shi/repos/PDFMathTranslate/pdf2zh/high_level.py", line 366, in translate
    s_mono, s_dual = translate_stream(
  File "/home/7shi/repos/PDFMathTranslate/pdf2zh/high_level.py", line 235, in translate_stream
    obj_patch: dict = translate_patch(fp, prompt=kwarg["prompt"], **locals())
  File "/home/7shi/repos/PDFMathTranslate/pdf2zh/high_level.py", line 165, in translate_patch
    interpreter.process_page(page)
  File "/home/7shi/repos/PDFMathTranslate/pdf2zh/pdfinterp.py", line 270, in process_page
    ops_new = self.device.end_page(page)
  File "/home/7shi/repos/PDFMathTranslate/pdf2zh/converter.py", line 61, in end_page
    return self.receive_layout(self.cur_item)
  File "/home/7shi/repos/PDFMathTranslate/pdf2zh/converter.py", line 409, in receive_layout
    adv = self.fontmap[fcur_].char_width(ord(ch)) * size
KeyError: 'china-ss'

Investigation revealed that fonts were not being added correctly in the translate_stream function. The following debug code shows the difference in font handling between the problematic and working cases:

diff --git a/pdf2zh/high_level.py b/pdf2zh/high_level.py
index 2b65458..289f0c2 100644
--- a/pdf2zh/high_level.py
+++ b/pdf2zh/high_level.py
@@ -230,6 +230,9 @@ def translate_stream(
             except Exception:
                 pass
+    for pnum, page in enumerate(doc_zh, 1):
+        print(pnum, [f[4] for f in page.get_fonts()])
+
     fp = io.BytesIO()
     doc_zh.save(fp)
     obj_patch: dict = translate_patch(fp, prompt=kwarg["prompt"], **locals())

Problematic case:

$ uv run pdf2zh "Tutorial on Geometric Calculus.pdf"
1 ['F2.0', 'F7.0', 'F6.1', 'F1.0', 'F3.0', 'F5.0', 'F4.0', 'F8.0']
MuPDF error: format error: object out of range (310 0 R); xref size 310

MuPDF error: format error: object out of range (311 0 R); xref size 310

2 ['F2.0', 'F1.0', 'F3.0']
MuPDF error: format error: object out of range (310 0 R); xref size 310

MuPDF error: format error: object out of range (311 0 R); xref size 310
(snip)

Working case:

$ wget -O attention.pdf https://arxiv.org/pdf/1706.03762
$ uv run pdf2zh attention.pdf
1 ['F29', 'F56', 'F65', 'F84', 'F87', 'arXivStAmP', 'tiro', 'china-ss']
2 ['F124', 'F23', 'F24', 'F26', 'F27', 'F29', 'F84', 'F87', 'tiro', 'china-ss']
(snip)
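
The comparison above can also be automated. The following is a minimal sketch and not part of this PR: `pages_missing_font` is a hypothetical helper name. It assumes only what the debug code already relies on, namely that iterating the document yields pages and that `page.get_fonts()` returns tuples whose element at index 4 is the font name.

```python
def pages_missing_font(doc, required="china-ss"):
    """Return the 1-based numbers of pages whose font list lacks `required`.

    Hypothetical helper for illustration; `doc` is any iterable of pages
    exposing get_fonts(), as in the debug loop above.
    """
    missing = []
    for pnum, page in enumerate(doc, 1):
        # Index 4 of each font tuple is the name printed by the debug code.
        names = [f[4] for f in page.get_fonts()]
        if required not in names:
            missing.append(pnum)
    return missing
```

Called on `doc_zh` just before `doc_zh.save(fp)`, an empty result would correspond to the working case, while a non-empty list would point at the pages that later trigger the `KeyError`.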

Solution

To address this issue, I modified the code to save doc_en to a temporary BytesIO stream before creating doc_zh. This approach was inspired by the existing pattern in the same function where doc_zh is saved to a BytesIO stream before calling translate_patch. Applying the same pattern earlier in the process ensures proper font initialization and prevents the KeyError while maintaining the original functionality.
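
The round trip described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code merged in this PR: `roundtrip_through_stream` and `open_from_bytes` are hypothetical names, and the document object only needs a `save(file_like)` method like the one `doc_zh.save(fp)` already uses.

```python
import io


def roundtrip_through_stream(doc, open_from_bytes):
    """Serialize `doc` into memory and reopen it from the saved bytes.

    Hypothetical helper for illustration. `doc` must provide
    save(file_like); `open_from_bytes` is a callable that builds a new
    document from a bytes object.
    """
    buf = io.BytesIO()
    doc.save(buf)  # write the complete document into the in-memory buffer
    return open_from_bytes(buf.getvalue())
```

With PyMuPDF, the call might look like `roundtrip_through_stream(doc_en, lambda data: pymupdf.open(stream=data, filetype="pdf"))`, with the returned document then serving as the starting point for `doc_zh`. The exact call used in the merged commit may differ.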

Testing

I have verified that:

  • The previously failing "Tutorial on Geometric Calculus.pdf" now processes successfully
  • Existing working PDFs (like attention.pdf) continue to work as expected with no changes in behavior

@Byaidu Byaidu merged commit 71632d7 into Byaidu:main Dec 21, 2024
2 checks passed
@7shi 7shi deleted the fix-font-error branch December 21, 2024 18:22