UnicodeEncodeError: 'latin-1' codec #25

Open
pszemraj opened this issue Mar 28, 2022 · 0 comments
Files whose names contain characters outside Latin-1 cause the latin-1 codec used by the package to raise a UnicodeEncodeError.

Context

  • need to be able to handle files with special characters in the name

Examples of file names containing such characters:

 'SUMM OCR Pałczyński et al. - 2022 - Study of the Few-S .txt',
 'SUMM OCR Refinetti, Goldt - 2022 - The dynamics of repr .txt',
 'SUMM OCR Sercan, Arık, Pfister - 2019 - TabNet Attenti .txt',
 'SUMM OCR Serhal et al. - 2022 - Overview on prediction, .txt',
 'SUMM OCR Somani et al. - 2021 - Deep learning and the e .txt',
 'SUMM OCR Somepalli et al. - 2021 - SAINT Improved Neura .txt',
 'SUMM OCR Śmigiel, Pałczyński, Ledziński - 2021 - ECG .txt',

Process

  1. The user passes the path to a directory of text files whose names contain special characters.
  2. The files are loaded.
  3. When confectionary tries to write a chapter name, it errors out.

Expected result

  • filenames are cleaned to remove special chars before writing to chapter name.
  • original file names are left intact

Current result

The code fails to run and exits with the traceback below:

(pdf) C:\Users\peter\code-dev-22\misc-repos\text2pdf>python confectionary\text2pdf.py -i "G:\My Drive\ETHZ-2022-S\ml-healthcare\ml4hc-p1-papers\LED_large_5e_[batch=2048]_[nbeams=20]_[max_l=512]\NSC + SBD" -kw "ML4HC Paper Summaries - Project 1 - LED-L-5e"
18 files found matching extension .txt

# entries is 18, < title thresh 39
will use one page for TOC

Building Chapters in PDF file:   0%|                                                            | 0/18 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\peter\code-dev-22\misc-repos\text2pdf\confectionary\text2pdf.py", line 354, in <module>
    _finished_pdf_loc = dir_to_pdf(
  File "C:\Users\peter\code-dev-22\misc-repos\text2pdf\confectionary\text2pdf.py", line 255, in dir_to_pdf
    pdf.print_chapter(filepath=str(textfile.resolve()), num=i, title=out_name)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\confectionary\pdf.py", line 300, in print_chapter
    self.chapter_title(num, title)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\confectionary\pdf.py", line 207, in chapter_title
    self.start_section(total_title)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\fpdf\fpdf.py", line 221, in wrapper
    return fn(self, *args, **kwargs)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\fpdf\fpdf.py", line 4040, in start_section
    self.multi_cell(w=self.epw, h=self.font_size, txt=name, ln=1)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\fpdf\fpdf.py", line 221, in wrapper
    return fn(self, *args, **kwargs)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\fpdf\fpdf.py", line 2375, in multi_cell
    txt = self.normalize_text(txt)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\fpdf\fpdf.py", line 2945, in normalize_text
    return txt.encode(self.core_fonts_encoding).decode("latin-1")
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0131' in position 33: ordinal not in range(256)
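
The failure can be reproduced outside the package with a short sketch. It assumes only that the chapter title eventually reaches a plain `str.encode("latin-1")` call, as the last traceback frame suggests. The helper name `first_unencodable` is hypothetical, and the title is one of the file names listed above without the extension.

```python
# Hypothetical chapter title taken from one of the file names in this issue.
title = "SUMM OCR Sercan, Arık, Pfister - 2019 - TabNet Attenti"


def first_unencodable(text, encoding="latin-1"):
    """Return (index, character) of the first character that `encoding`
    cannot represent, or None if the whole string encodes cleanly."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError as err:
        # err.start is the index of the first offending character.
        return err.start, text[err.start]
    return None


print(first_unencodable(title))
```

The reported index depends on whatever prefix the package adds to the title, so it will not necessarily match the position shown in the traceback.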

Possible Fix

  • use clean() from the clean-text package
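
As a rough illustration of the expected result, here is a standard-library sketch that derives a Latin-1-safe chapter title while leaving the file name itself untouched. It is a stand-in for the `clean()` call proposed above, not the package's implementation. The function name `latin1_safe_title` is hypothetical.

```python
import unicodedata


def latin1_safe_title(name: str) -> str:
    """Return a copy of `name` that can be encoded as latin-1.

    Assumption: dropping accents is acceptable for chapter titles.
    """
    # Split accented letters into base letter + combining mark (e.g. "ń" -> "n" + U+0301).
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks, keeping only the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Characters with no decomposition (such as "ł" or "ı") become "?".
    return stripped.encode("latin-1", errors="replace").decode("latin-1")


filename = "SUMM OCR Śmigiel, Pałczyński, Ledziński - 2021 - ECG .txt"
chapter_title = latin1_safe_title(filename)
print(chapter_title)
```

Only the derived title changes; `filename` is still the original string and can be used to open the file. Letters without a Unicode decomposition are replaced with `?` rather than transliterated, which is one reason a dedicated cleaning package may be the better choice.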
pszemraj added the bug label on Mar 28, 2022
pszemraj self-assigned this on Mar 28, 2022