Filenames with non-Latin-1 characters cause the latin-1 encoding step in fpdf to raise UnicodeEncodeError
Context
confectionary needs to handle text files whose names contain special (non-Latin-1) characters.
Examples of filenames from the affected directory (some contain characters such as ł, ń, ı, and Ś):
'SUMM OCR Pałczyński et al. - 2022 - Study of the Few-S .txt',
'SUMM OCR Refinetti, Goldt - 2022 - The dynamics of repr .txt',
'SUMM OCR Sercan, Arık, Pfister - 2019 - TabNet Attenti .txt',
'SUMM OCR Serhal et al. - 2022 - Overview on prediction, .txt',
'SUMM OCR Somani et al. - 2021 - Deep learning and the e .txt',
'SUMM OCR Somepalli et al. - 2021 - SAINT Improved Neura .txt',
'SUMM OCR Śmigiel, Pałczyński, Ledziński - 2021 - ECG .txt',
Process
The user passes the path to a directory of text files whose names contain special characters.
The files are loaded.
When confectionary tries to write a chapter name, it errors out.
Expected result
Filenames are cleaned to remove special characters before they are written as chapter names.
The original filenames are left intact.
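A minimal sketch of that expectation, assuming the chapter title is derived from the file stem: the path is kept as-is and only the display title is made Latin-1 safe. The helper name `chapter_titles` is hypothetical and not part of confectionary; unencodable characters are simply replaced with `?` here.

```python
from pathlib import Path


def chapter_titles(paths):
    """Pair each original path with a Latin-1-safe title (hypothetical helper).

    Nothing is renamed on disk; only the returned title string is altered.
    """
    pairs = []
    for path in paths:
        path = Path(path)
        stem = path.stem.strip()
        # errors="replace" substitutes "?" for anything Latin-1 cannot encode.
        title = stem.encode("latin-1", errors="replace").decode("latin-1")
        pairs.append((path, title))
    return pairs


for original, title in chapter_titles(
    ["SUMM OCR Sercan, Arık, Pfister - 2019 - TabNet Attenti .txt"]
):
    print(original.name, "->", title)
```

Replacing with `?` is lossy; a transliteration step (for example `ı` to `i`) would give more readable titles.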
Current result
The code fails to run and raises a UnicodeEncodeError:
(pdf) C:\Users\peter\code-dev-22\misc-repos\text2pdf>python confectionary\text2pdf.py -i "G:\My Drive\ETHZ-2022-S\ml-healthcare\ml4hc-p1-papers\LED_large_5e_[batch=2048]_[nbeams=20]_[max_l=512]\NSC + SBD" -kw "ML4HC Paper Summaries - Project 1 - LED-L-5e"
18 files found matching extension .txt
# entries is 18, < title thresh 39
will use one page for TOC
Building Chapters in PDF file: 0%| | 0/18 [00:00<?, ?it/s]Traceback (most recent call last):
  File "C:\Users\peter\code-dev-22\misc-repos\text2pdf\confectionary\text2pdf.py", line 354, in <module>
    _finished_pdf_loc = dir_to_pdf(
  File "C:\Users\peter\code-dev-22\misc-repos\text2pdf\confectionary\text2pdf.py", line 255, in dir_to_pdf
    pdf.print_chapter(filepath=str(textfile.resolve()), num=i, title=out_name)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\confectionary\pdf.py", line 300, in print_chapter
    self.chapter_title(num, title)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\confectionary\pdf.py", line 207, in chapter_title
    self.start_section(total_title)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\fpdf\fpdf.py", line 221, in wrapper
    return fn(self, *args, **kwargs)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\fpdf\fpdf.py", line 4040, in start_section
    self.multi_cell(w=self.epw, h=self.font_size, txt=name, ln=1)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\fpdf\fpdf.py", line 221, in wrapper
    return fn(self, *args, **kwargs)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\fpdf\fpdf.py", line 2375, in multi_cell
    txt = self.normalize_text(txt)
  File "C:\Users\peter\miniconda3\envs\pdf\lib\site-packages\fpdf\fpdf.py", line 2945, in normalize_text
    return txt.encode(self.core_fonts_encoding).decode("latin-1")
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0131' in position 33: ordinal not in range(256)
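The same failure can be reproduced without fpdf, since the last frame is a plain `str.encode` call. The sketch below lists which characters in a title Latin-1 cannot represent; `find_unencodable` is a hypothetical diagnostic helper, not part of confectionary or fpdf.

```python
def find_unencodable(text, encoding="latin-1"):
    """Return (index, character) pairs that `encoding` cannot represent."""
    problems = []
    for index, char in enumerate(text):
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            problems.append((index, char))
    return problems


title = "SUMM OCR Sercan, Arık, Pfister - 2019 - TabNet Attenti"
for index, char in find_unencodable(title):
    # U+0131 (dotless i) is the character named in the traceback above.
    print(f"position {index}: {char!r} (U+{ord(char):04X})")
```

Running it over every filename in the input directory before building the PDF would show which titles need cleaning.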
Possible Fix
Use clean() from the clean-text package to clean filenames before they are written as chapter names.
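As a rough illustration of the kind of cleaning meant here, the sketch below uses only the standard library as a stand-in for clean-text's `clean()`; it is not that package's implementation. The function name `latin1_safe` and the small `_FALLBACKS` table are assumptions made for this example. Characters that Latin-1 already supports are kept, accented letters are reduced to their base letter, and anything else becomes `?`.

```python
import unicodedata

# Letters with no Unicode decomposition need an explicit mapping (assumed list,
# covering only the characters seen in the filenames above).
_FALLBACKS = {"ł": "l", "Ł": "L", "ı": "i"}


def latin1_safe(text):
    """Return a version of `text` that encodes to Latin-1 without errors."""
    out = []
    for char in text:
        if ord(char) < 256:
            # Latin-1 covers code points 0-255, so keep these untouched.
            out.append(char)
        elif char in _FALLBACKS:
            out.append(_FALLBACKS[char])
        else:
            # Split into base letter + combining marks, then drop the marks.
            decomposed = unicodedata.normalize("NFKD", char)
            base = "".join(c for c in decomposed if not unicodedata.combining(c))
            # Anything still outside Latin-1 is replaced with "?".
            out.append(base.encode("latin-1", errors="replace").decode("latin-1"))
    return "".join(out)


print(latin1_safe("Śmigiel, Pałczyński, Ledziński - 2021 - ECG"))
```

The cleaned string would be passed as the chapter title only, so the files on disk keep their original names.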