All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Added extra filtering methods for ElementList
- Make sure tests and docs are not included in binary distribution wheels (PyPI) and source distribution (sdist).
- Added support for opening password protected files (#350)
- Various dependency updates
- PyPI releases now use Trusted Publishers
- Fixed typo in docs (#361)
- Various dependency updates
- Removed unused PyYAML dependency (#262)
- The `visualise` function now properly uses the elements parameter to filter visualised elements. (#256)
- Various dependency updates
- [BREAKING] Switched the visualise tool from pyqt5 to tkinter. This means we no longer need python3-dev as a requirement, and it seems to solve endless issues with pyqt5 not finding the correct Qt bindings. This is a potentially breaking change, although the visualise tool is only in the development version. No code changes are needed, but you will need tkinter installed for visualise to keep working.
- Changed the Python version from 3.6 to 3.8 in `.readthedocs.yml`.
- Various dependency updates (matplotlib, pyqt5)
- Removed all but the tests dockerfile for simplicity. Use Docker BuildKit. We will no longer be pushing images to DockerHub on release. (#203)
- Various dependency updates
- Updated CI to avoid login issue (#182)
- Ensure we only accept LTTextBoxes at the top level (not LTTextLines) (#155)
- Enabled dependabot which should help to keep packages up to date (#124)
- Various dependency updates
- Fixed a typo in simple memo example in the documentation. (#121)
- New functions on `ElementList`, `move_forwards_from` and `move_backwards_from`, to allow moving forwards and backwards from a certain element in the list easily. (#113)
- When the layout parameter all_texts is True, the text inside figures is now also returned as elements in the document. (#99)
- Passing a tolerance less than the width/height of an element no longer causes an error. The tolerance is now capped at half the width/height of the element. (#103)
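
The capping rule in the entry above can be illustrated with a minimal sketch. This is not the library's implementation: `cap_tolerance` and `shrink_box` are hypothetical helpers, and capping each axis separately is an assumption about how "half the width/height" is applied.

```python
from typing import Tuple


def cap_tolerance(tolerance: float, width: float, height: float) -> Tuple[float, float]:
    """Return (horizontal, vertical) tolerances, each capped at half the extent.

    Hypothetical helper: assumes the cap is applied per axis.
    """
    return min(tolerance, width / 2), min(tolerance, height / 2)


def shrink_box(
    x0: float, y0: float, x1: float, y1: float, tolerance: float
) -> Tuple[float, float, float, float]:
    """Shrink a bounding box by the capped tolerance on every side.

    Because each tolerance is at most half the extent, the box can collapse
    to a line or a point but its edges can never cross over.
    """
    tol_x, tol_y = cap_tolerance(tolerance, x1 - x0, y1 - y0)
    return x0 + tol_x, y0 + tol_y, x1 - tol_x, y1 - tol_y
```

Without the cap, a tolerance larger than half the element's height would push the bottom edge above the top edge and produce an invalid box, which is the kind of situation the fix appears to guard against.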
- Added `__len__` and `__repr__` functions to the Section class. (#90)
- Added flag to `extract_simple_table` and `extract_table` functions to remove duplicate header rows. (#89)
- You can now specify `element_ordering` when instantiating a PDFDocument. This defaults to the old behaviour of left to right, top to bottom. (#95)
- Advanced layout analysis is now disabled by default. (#88)
- Published to PyPI as py-pdf-parser.
- Documentation is now hosted here. (#71)
- Added new examples to the documentation. (#74)
- Font filtering now caches the elements by font. (#73) (updated in #78)
- Font filtering now caches the elements by font. (#73)
- The visualise tool now draws an outline around each section on the page. (#69) (updated in #80)
- This product is now complete enough for the needs of Optimor Ltd; however, `jstockwin` is going to continue development as a personal project. The repository has been moved from `optimor/py-pdf-parser` to `jstockwin/py-pdf-parser`.
- It is now possible to specify `font_size_precision` when instantiating a PDFDocument. This is the number of decimal places the font size will be rounded to. (#60)
- `extract_simple_table` now allows extracting tables with gaps, provided there is at least one full row and one full column. This is only the case if you pass `allow_gaps=True`; otherwise the original logic of raising an exception if there is a gap remains. You can optionally pass a `reference_element`, which must be in both a full row and a full column; this defaults to the first (top-left) element. (#57)
- Font sizes are now `float` not `int`. The `font_size_precision` in the additions defaults to 1, and as such all fonts will change to have a single decimal place. To keep the old behaviour, you can pass `font_size_precision=0` when instantiating your PDFDocument.
- Improved performance of `extract_simple_table`, which is now much faster. (#65)
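
To illustrate how the font size precision described in this release affects which sizes are treated as equal, here is a minimal sketch. `round_font_size` is a hypothetical helper, not part of the library, and it only assumes that the size is rounded to the given number of decimal places as stated above.

```python
def round_font_size(raw_size: float, font_size_precision: int = 1) -> float:
    """Round a raw font size to the given number of decimal places.

    Hypothetical helper for illustration; the default of 1 mirrors the
    default precision mentioned in the changelog entry.
    """
    return round(raw_size, font_size_precision)


# Two slightly different raw sizes, as might be reported for the same
# visual font (illustrative values, not taken from a real PDF).
size_a, size_b = 11.96, 12.04

# At the default precision of 1 both sizes collapse to the same value.
same_at_default = round_font_size(size_a) == round_font_size(size_b)

# At precision 2 they remain distinct.
same_at_two = round_font_size(size_a, 2) == round_font_size(size_b, 2)
```

A lower precision groups more elements under one font size, which tends to make font-based filtering more forgiving, while a higher precision keeps near-identical sizes apart.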
- Initial version of the product. Note: The version is less than 1, so this product should not yet be considered stable. API changes and other breaking changes are possible, if not likely.