UnicodeEncodeError: 'latin-1' codec #25

pszemraj · 2022-03-28T17:25:42Z

Files whose names contain characters outside Latin-1 cause the latin-1 codec used by the package to raise a UnicodeEncodeError.

Context

The package needs to handle files with special characters in their names.

Example file names (several contain non-Latin-1 characters such as ł, ı and Ś):

 'SUMM OCR Pałczyński et al. - 2022 - Study of the Few-S .txt',
 'SUMM OCR Refinetti, Goldt - 2022 - The dynamics of repr .txt',
 'SUMM OCR Sercan, Arık, Pfister - 2019 - TabNet Attenti .txt',
 'SUMM OCR Serhal et al. - 2022 - Overview on prediction, .txt',
 'SUMM OCR Somani et al. - 2021 - Deep learning and the e .txt',
 'SUMM OCR Somepalli et al. - 2021 - SAINT Improved Neura .txt',
 'SUMM OCR Śmigiel, Pałczyński, Ledziński - 2021 - ECG .txt',

Process

The user passes the path to a directory of text files whose names contain special characters.
The files are loaded.
When confectionary tries to write a chapter name, it errors out.

Expected result

Filenames are cleaned to remove special characters before they are written as chapter names.
The original file names on disk are left intact.
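
A minimal sketch of that behaviour, using only the standard library. The function name `chapter_title_from_path` is hypothetical and not part of confectionary; it derives a title from the path without renaming the file. Letters that have no Unicode decomposition (for example ł and ı) are dropped rather than transliterated.

```python
import unicodedata
from pathlib import Path


def chapter_title_from_path(path: Path, encoding: str = "latin-1") -> str:
    """Return a chapter title built from the file name that `encoding` can represent.

    Hypothetical helper: the file on disk is never renamed or modified.
    """
    title = path.stem.strip()
    try:
        title.encode(encoding)
        return title  # already safe, keep it unchanged
    except UnicodeEncodeError:
        pass
    # Split accented letters into base letter + combining mark,
    # then drop every code point the target encoding cannot hold.
    decomposed = unicodedata.normalize("NFKD", title)
    return decomposed.encode(encoding, errors="ignore").decode(encoding)


source = Path("SUMM OCR Sercan, Arık, Pfister - 2019 - TabNet Attenti .txt")
title = chapter_title_from_path(source)
title.encode("latin-1")  # no longer raises
```

Only the string passed on as the chapter title changes; `source` still points at the original file name.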

Current result

The run fails with a UnicodeEncodeError while building the first chapter:

(pdf) C:\Users\peter\code-dev-22\misc-repos\text2pdf>python confectionary\text2pdf.py -i "G:\My Drive\ETHZ-2022-S\ml-healthcare\ml4hc-p1-papers\LED_large_5e_[batch=2048]_[nbeams=20]_[max_l=512]\NSC + SBD" -kw "ML4HC Paper Summaries - Project 1 - LED-L-5e"
18 files found matching extension .txt

# entries is 18, < title thresh 39
will use one page for TOC

Building Chapters in PDF file:   0%|                                                            | 0/18 [00:00<?, ?it/s]Traceback (most recent call last):
  File "C:\Users\peter\code-dev-22\misc-repos\text2pdf\confectionary\text2pdf.py", line 354, in <module>
    _finished_pdf_loc = dir_to_pdf(
  File "C:\Users\peter\code-dev-22\misc-repos\text2pdf\confectionary\text2pdf.py", line 255, in dir_to_pdf
    pdf.print_chapter(filepath=str(textfile.resolve()), num=i, title=out_name)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\confectionary\pdf.py", line 300, in print_chapter
    self.chapter_title(num, title)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\confectionary\pdf.py", line 207, in chapter_title
    self.start_section(total_title)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\fpdf\fpdf.py", line 221, in wrapper
    return fn(self, *args, **kwargs)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\fpdf\fpdf.py", line 4040, in start_section
    self.multi_cell(w=self.epw, h=self.font_size, txt=name, ln=1)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\fpdf\fpdf.py", line 221, in wrapper
    return fn(self, *args, **kwargs)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\fpdf\fpdf.py", line 2375, in multi_cell
    txt = self.normalize_text(txt)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\fpdf\fpdf.py", line 2945, in normalize_text
    return txt.encode(self.core_fonts_encoding).decode("latin-1")
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0131' in position 33: ordinal not in range(256)
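
The last frame of the traceback is a plain `str.encode` call, so the failure can be reproduced without fpdf. This sketch assumes the encoding in use is latin-1, as the error message indicates. The title string is taken from the file list above and does not include whatever prefix confectionary adds, so the reported position differs from the traceback.

```python
# One of the file names from the report, without the extension.
title = "SUMM OCR Sercan, Arık, Pfister - 2019 - TabNet Attenti"

failure = None
try:
    title.encode("latin-1")
except UnicodeEncodeError as err:
    failure = err
    bad_char = err.object[err.start]
    print(f"cannot encode {bad_char!r} at position {err.start}")
```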

Possible Fix

Use clean() from the clean-text package to sanitize chapter names before they are written.
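
A rough sketch of how that could look, not the fix as implemented. It assumes `clean` is imported from `cleantext` (the import name of the clean-text package) and accepts the `to_ascii` and `lower` keyword arguments. The helper name `clean_chapter_title` and the standard-library fallback are illustrative additions.

```python
import unicodedata

try:
    # Third-party: installed with `pip install clean-text`, imported as `cleantext`.
    from cleantext import clean
except ImportError:
    clean = None


def clean_chapter_title(title: str) -> str:
    """Return an ASCII-only version of `title` for use as a chapter name (hypothetical helper)."""
    if clean is not None:
        # to_ascii=True converts to ASCII; lower=False keeps the original capitalisation.
        return clean(title, to_ascii=True, lower=False)
    # Fallback when clean-text is not installed: strip combining marks and
    # drop any remaining non-ASCII characters.
    decomposed = unicodedata.normalize("NFKD", title)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


safe_title = clean_chapter_title("SUMM OCR Śmigiel, Pałczyński, Ledziński - 2021 - ECG")
safe_title.encode("latin-1")  # ASCII is a subset of Latin-1, so this succeeds
```

Converting to ASCII is stricter than fpdf's core fonts require, since Latin-1 also covers characters such as é and ü. It is the simplest way to guarantee the title encodes.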


pszemraj added the bug label Mar 28, 2022

pszemraj self-assigned this Mar 28, 2022
