Can't process docx files even though docx2txt is installed #521

ShvetsIvan · 2024-07-31T10:42:34Z

Hello,

I am trying to use textract to extract text from docx files in an AWS Lambda function using Python. The textract library is included in the deployment package, as is its dependency, docx2txt. When I try to get the text out of the file, I still get an ExtensionNotSupported error stating that docx is not supported. I also tried putting the docx2txt library in the parsers folder, but that didn't help.

Using:

Textract version 1.6.3
Python version 3.11
AWS Lambda function
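
One way to narrow this down is to check, inside the Lambda runtime itself, whether the relevant modules import cleanly. The sketch below is only a diagnostic aid. It assumes that textract loads a per-extension parser module and that a failed import there may be reported as ExtensionNotSupported; the module name `textract.parsers.docx_parser` is likewise an assumption about the package layout, and `try_import` is a hypothetical helper, not part of textract.

```python
import importlib
import traceback


def try_import(module_name):
    """Return None if the module imports cleanly, else the traceback text."""
    try:
        importlib.import_module(module_name)
        return None
    except Exception:
        # Keep the full traceback so the real cause (for example a missing
        # transitive dependency) is visible instead of a generic error.
        return traceback.format_exc()


if __name__ == "__main__":
    # The last name is an assumed module path, not confirmed by this thread.
    for name in ("docx2txt", "textract", "textract.parsers.docx_parser"):
        error = try_import(name)
        print(name, "OK" if error is None else "FAILED\n" + error)
```

If any of these fail, the printed traceback should point at the module that is actually missing from the deployment package.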

phil-scholarcy · 2024-08-16T09:58:25Z

Is the file definitely a docx file and not a .doc file masquerading as one? I find there can be issues with the following scenarios:

A .docx file is given a .doc extension
A .doc file is given a .docx extension
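
A quick way to check for a mismatched extension is to look at the file contents rather than the name. This is a minimal sketch using only the standard library: a .docx file is a ZIP archive that normally contains `word/document.xml`, while a legacy .doc file is an OLE2 compound file starting with the bytes `D0 CF 11 E0 A1 B1 1A E1`. The function name `sniff_word_format` is made up for illustration.

```python
import zipfile

# Signature of the OLE2 compound file format used by legacy .doc files
# (other legacy Office formats such as .xls share the same signature).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_word_format(path):
    """Guess 'doc', 'docx', or 'unknown' from file contents, ignoring the extension."""
    with open(path, "rb") as f:
        header = f.read(8)
    if header == OLE2_MAGIC:
        return "doc"
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            # A Word document stores its main body in word/document.xml.
            if "word/document.xml" in zf.namelist():
                return "docx"
    return "unknown"
```

If the result disagrees with the file's extension, renaming the file to match its real format before passing it to textract is worth trying.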
