Textract-py3 #543

Draft: wants to merge 19 commits into base: master
17 changes: 17 additions & 0 deletions README.md
@@ -0,0 +1,17 @@
# Textract-py3

This is a minimally maintained fork of [deanmalmgren/textract](https://github.com/deanmalmgren/textract) to replace '*' dependencies because they block usage of `asdf`, `uvx` and modern `pip` (open issue: https://github.com/deanmalmgren/textract/issues/461).

## Usage

Install with `asdf plugin add textract-py3 https://github.com/amrox/asdf-pyapp.git` and `asdf install textract-py3 latest` or with `uvx` (`uv tool install textract-py3`), `mise`, etc.

## Development

This fork has been migrated to `poetry` and does not have CI/CD. For local testing and release, use:

```sh
poetry install --sync
poetry run bumpversion minor
poetry publish --build
```
2 changes: 1 addition & 1 deletion README.rst
@@ -44,4 +44,4 @@ Originally written by @deanmalmgren. Maintained by the good people at
:target: https://github.com/deanmalmgren/textract/stargazers

.. |Forks| image:: https://img.shields.io/github/forks/deanmalmgren/textract.svg
:target: https://github.com/deanmalmgren/textract/network
:target: https://github.com/deanmalmgren/textract/network
16 changes: 16 additions & 0 deletions docs/changelog.rst
@@ -10,6 +10,22 @@ latest changes in development for next release
----------------------------------------------

.. THANKS FOR CONTRIBUTING; ADD YOUR UNRELEASED CHANGES HERE!

2.1.0
-------------------

* Merge minor bugfixes from upstream PRs and update documentation

2.0.1
-------------------

* Fix package version and changelog

2.0.0
-------------------

* Switch to poetry and relax all dependency constraints

1.6.5
-------------------

2 changes: 1 addition & 1 deletion docs/conf.py
@@ -58,7 +58,7 @@
# built documents.
#
# The short X.Y version.
release = version = "1.6.5"
release = version = "2.1.0"

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
4 changes: 2 additions & 2 deletions docs/installation.rst
@@ -80,10 +80,10 @@ Don't see your operating system installation instructions here?

My apologies! Installing system packages is a bit of a drag and its
hard to anticipate all of the different environments that need to be
accomodated (wouldn't it be awesome if there were a system-agnostic
accommodated (wouldn't it be awesome if there were a system-agnostic
package manager or, better yet, if python could install these system
dependencies for you?!?!). If you're operating system doesn't have
documenation about how to install the textract dependencies, please
documentation about how to install the textract dependencies, please
:ref:`contribute a pull request <contributing>` with:

1. A new section in here with the appropriate details about how to
1,237 changes: 1,237 additions & 0 deletions poetry.lock

Large diffs are not rendered by default.

38 changes: 38 additions & 0 deletions pyproject.toml
@@ -0,0 +1,38 @@
[build-system]
build-backend = "poetry.core.masonry.api"
requires = ["poetry-core"]

[tool.poetry]
authors = ["Dean Malmgren <[email protected]>"]
description = "Minimally maintained fork of deanmalmgren/textract to replace '*' dependencies "
license = "MIT"
name = "textract-py3"
packages = [{include = "textract"}]
readme = "README.md"
repository = "https://github.com/KyleKing/textract-py3"
version = "2.1.0"

[tool.poetry.dependencies]
python = "^3.7"
SpeechRecognition = ">=3.8.1"
argcomplete = ">=1.10.0"
beautifulsoup4 = ">=4.8.0"
chardet = ">=3"
docx2txt = ">=0.8"
extract-msg = ">=0.30.11"
"pdfminer.six" = ">=20221105"
python-pptx = ">=0.6.18"
six = ">=1.16.0"
xlrd = ">=1.2.0"
pocketsphinx = {version = ">=0.1.15", optional = true}

[tool.poetry.extras]
pocketsphinx = ["pocketsphinx"]

[tool.poetry.group.dev.dependencies]
bumpversion = ">=0.6.0"
nose = ">=1.3.7"
pytest = ">=7.2.1"

[tool.poetry.scripts]
textract = "textract.bin.textract:main"
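
The `[tool.poetry.scripts]` entry replaces the old `bin/textract` script: on install, a `textract` command is generated that imports the module named before the colon and calls the attribute named after it. A minimal sketch of that lookup, assuming a hypothetical helper `load_entry_point` (not part of Poetry or textract) and using `os.path:join` as a stand-in target so that it runs without the package installed:

```python
import importlib


def load_entry_point(spec):
    """Resolve a 'package.module:attribute' string to the object it names."""
    module_name, _, attr_path = spec.partition(":")
    target = importlib.import_module(module_name)
    # Walk dotted attribute paths such as "Class.method".
    for attr in attr_path.split("."):
        target = getattr(target, attr)
    return target


# Stand-in target from the standard library. The spec in this PR would be
# "textract.bin.textract:main", which needs the package to be installed.
join = load_entry_point("os.path:join")
print(join("docs", "conf.py"))
```

This is also why the trailing `main()` call can be dropped from `textract/bin/textract.py`: the generated command calls `main` itself, so importing the module should have no side effects.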
6 changes: 5 additions & 1 deletion setup.cfg
@@ -1,8 +1,12 @@
[bumpversion]
current_version = 1.6.5
current_version = 2.1.0
commit = True
tag = True

[bumpversion:file:pyproject.toml]
search = version = "{current_version}"
replace = version = "{new_version}"

[bumpversion:file:setup.py]
search = version="{current_version}"
replace = version="{new_version}"
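
The two `[bumpversion:file:...]` sections differ only in the spacing around `=`, because `pyproject.toml` writes `version = "2.1.0"` while `setup.py` writes `version="2.1.0"`. A simplified sketch of the search-and-replace templating, assuming a hypothetical helper `bump_text`; it illustrates the idea and is not bumpversion's actual implementation:

```python
def bump_text(text, search, replace, current_version, new_version):
    """Fill in both templates, then swap the first for the second."""
    context = {"current_version": current_version, "new_version": new_version}
    old = search.format(**context)
    new = replace.format(**context)
    if old not in text:
        # A template with the wrong spacing would silently match nothing,
        # so this sketch fails loudly instead.
        raise ValueError("search pattern not found: " + old)
    return text.replace(old, new)


pyproject_line = 'version = "2.1.0"\n'
bumped = bump_text(
    pyproject_line,
    search='version = "{current_version}"',
    replace='version = "{new_version}"',
    current_version="2.1.0",
    new_version="2.2.0",
)
print(bumped, end="")  # version = "2.2.0"
```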
4 changes: 3 additions & 1 deletion setup.py
@@ -1,3 +1,5 @@
# FYI: This file is kept for reference, but not used

import glob
import os
from setuptools import setup
@@ -42,7 +44,7 @@ def parse_requirements(requirements_filename):

setup(
    name=textract.__name__,
    version="1.6.5",
    version="2.1.0",
    description="extract text from any document. no muss. no fuss.",
    long_description=long_description,
    url=github_url,
2 changes: 1 addition & 1 deletion textract/__init__.py
@@ -1,3 +1,3 @@
from .parsers import process

VERSION = "1.6.5"
VERSION = "2.1.0"
Empty file added textract/bin/__init__.py
Empty file.
14 changes: 4 additions & 10 deletions bin/textract → textract/bin/textract.py
@@ -1,17 +1,13 @@
#!/usr/bin/env python
# -*- mode: python -*-
# PYTHON_ARGCOMPLETE_OK

"""
Command-line application.
"""

import sys

from textract.cli import get_parser
from textract import process
from textract.exceptions import CommandLineError
from textract.colors import red
from ..cli import get_parser
from .. import process
from ..exceptions import CommandLineError
from ..colors import red


# extract text
@@ -29,5 +25,3 @@ def main():
else:
args.output.write(output)


main()
2 changes: 1 addition & 1 deletion textract/exceptions.py
@@ -3,7 +3,7 @@

# traceback from exceptions that inherit from this class are suppressed
class CommandLineError(Exception):
    """The traceback of all CommandLineError's is supressed when the
    """The traceback of all CommandLineError's is suppressed when the
    errors occur on the command line to provide a useful command line
    interface.
    """
4 changes: 3 additions & 1 deletion textract/parsers/msg_parser.py
@@ -8,13 +8,15 @@
def ensure_bytes(string):
    """Normalize string to bytes.

    `ExtractMsg.Message._getStringStream` can return unicode or bytes depending
    `extract_msg.Message._getStringStream` can return unicode or bytes depending
    on what is originally stored in message file.

    This helper functon makes sure, that bytes type is returned.
    """
    if isinstance(string, six.string_types):
        return string.encode('utf-8')
    if string is None:
        return b''
    return string
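
The new `None` branch means a missing stream yields empty bytes rather than passing `None` on to the caller. A Python 3 only sketch of the same normalization, assuming plain `str` in place of `six.string_types` (the PR itself keeps `six`):

```python
def ensure_bytes(value):
    """Return bytes for str, bytes, or None input."""
    if value is None:
        # A stream that is absent from the message becomes empty bytes.
        return b""
    if isinstance(value, str):
        return value.encode("utf-8")
    # Anything else is assumed to be bytes already and is passed through.
    return value


print(ensure_bytes(None))       # b''
print(ensure_bytes("subject"))  # b'subject'
print(ensure_bytes(b"body"))    # b'body'
```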


9 changes: 6 additions & 3 deletions textract/parsers/pdf_parser.py
@@ -8,7 +8,10 @@
from .utils import ShellParser
from .image import Parser as TesseractParser

from distutils.spawn import find_executable
try:
    from shutil import which
except ImportError:
    from distutils.spawn import find_executable as which

class Parser(ShellParser):
    """Extract text from pdf files using either the ``pdftotext`` method
@@ -49,10 +52,10 @@ def extract_pdfminer(self, filename, **kwargs):
        #Nested try/except loops? Not great
        #Try the normal pdf2txt, if that fails try the python3
        # pdf2txt, if that fails try the python2 pdf2txt
        pdf2txt_path = find_executable('pdf2txt.py')
        pdf2txt_path = which("pdf2txt.py")
        try:
            stdout, _ = self.run(['pdf2txt.py', filename])
        except OSError:
        except (OSError, ShellError):
            try:
                stdout, _ = self.run(['python3',pdf2txt_path, filename])
            except ShellError:
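
`shutil.which` has been in the standard library since Python 3.3, and `distutils` was removed in Python 3.12, so the fallback import only matters on very old interpreters. Both functions return `None` when the executable is not on `PATH`, in which case the `python3` retry above would receive `None` as the script path. A sketch of one possible guard, assuming a hypothetical helper `python3_command` that is not part of this PR:

```python
try:
    from shutil import which
except ImportError:  # only reachable on interpreters older than Python 3.3
    from distutils.spawn import find_executable as which


def python3_command(script, filename):
    """Build the python3 fallback command, or None if the script is missing."""
    script_path = which(script)
    if script_path is None:
        # Nothing to run: let the caller decide how to report this.
        return None
    return ["python3", script_path, filename]


command = python3_command("pdf2txt.py", "example.pdf")
if command is None:
    print("pdf2txt.py is not on PATH")
else:
    print(command)
```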
2 changes: 1 addition & 1 deletion textract/parsers/txt_parser.py
@@ -5,5 +5,5 @@ class Parser(BaseParser):
    """Parse ``.txt`` files"""

    def extract(self, filename, **kwargs):
        with open(filename) as stream:
        with open(filename, "rb") as stream:
            return stream.read()
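
On Python 3, `open(filename)` decodes with the platform's default encoding and returns `str`, whereas `"rb"` returns the raw bytes and leaves decoding to the caller. A small sketch of the difference, assuming a hypothetical helper `read_bytes` and a temporary UTF-8 file created only for the demonstration:

```python
import os
import tempfile


def read_bytes(filename):
    """Read a file without decoding it, as the "rb" mode in the hunk does."""
    with open(filename, "rb") as stream:
        return stream.read()


# Write a short UTF-8 sample to a temporary file.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as handle:
    handle.write("caf\u00e9\n".encode("utf-8"))
    path = handle.name

try:
    raw = read_bytes(path)
finally:
    os.remove(path)

print(type(raw).__name__)  # bytes
text = raw.decode("utf-8")  # decoding is now an explicit, separate step
```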