STY: Minor code-style improvements for _reader.py #123

Open: wants to merge 46 commits into base: main
46 commits
4bd54bd
DEV: Test against Python 3.13 (#2776)
stefan6419846 Jul 28, 2024
d4df20d
STY: Remove boolean value comparison (#2779)
j-t-1 Jul 31, 2024
3ad9234
ROB: Handle images with empty data when processing an image from byte…
williamgagnonpoka Aug 2, 2024
582557e
SEC: Fix GitHub workflow vulnerable to script injection (#2787)
diogoteles08 Aug 2, 2024
38f3925
MAINT: Remove unused paeth_predictor (#2773)
j-t-1 Aug 5, 2024
09f9b7e
MAINT: Remove unused AnnotationFlag
j-t-1 Aug 5, 2024
b2d7204
BUG: Handle Sequence as an IndirectObject when extracting text with l…
owurman Aug 5, 2024
5abd590
STY: Refactor b_ (#2772)
j-t-1 Aug 7, 2024
219eb13
MAINT: Drop Python 3.7 support (#2793)
pubpub-zz Aug 12, 2024
46c89dd
MAINT: Remove b_ and str_ (#2792)
pubpub-zz Aug 12, 2024
a9758ae
MAINT: Improve test coverage (#2796)
pubpub-zz Aug 12, 2024
cf7fcfd
ENH: Compress PDF files merging identical objects (#2795)
pubpub-zz Aug 13, 2024
2eb565d
ROB: Fix extract_text() issues on damaged PDFs (#2760)
pubpub-zz Aug 13, 2024
d9a8c54
ENH: Report PdfReadError instead of RecursionError (#2800)
pubpub-zz Aug 14, 2024
799630d
BUG: Fix sheared image (#2801)
pubpub-zz Aug 15, 2024
454a62a
MAINT: Fix mypy type output (#2799)
pubpub-zz Aug 15, 2024
0c81f3c
ENH: Accept utf strings for metadata (#2802)
pubpub-zz Aug 16, 2024
d2d520b
MAINT: Remove unused code (#2805)
pubpub-zz Aug 22, 2024
9f08cd0
ROB: Raise PdfReadError when missing /Root in trailer (#2808)
BertrandBordage Aug 23, 2024
b7b3c8c
MAINT: Improve wording of set_data error message (#2810)
stefan6419846 Aug 23, 2024
f55d332
ENH: Robustify on missing font for Tf operator in text_extract() (#2…
pubpub-zz Aug 27, 2024
38ea8c5
ENH: Add UniGB-UTF16 encodings (#2819)
pubpub-zz Aug 28, 2024
82eac7e
ROB: Robustify .set_data() (#2821)
pubpub-zz Aug 29, 2024
e694d55
DEV: Fix coverage uploads (#2832)
stefan6419846 Sep 5, 2024
b85c171
DOC: Small changes to PaperSize notes (#2834)
j-t-1 Sep 6, 2024
98d4425
ENH: Add incremental capability to PdfWriter (#2811)
pubpub-zz Sep 11, 2024
9d54f63
ENH: Robustify parsing for Object streams in XRef rebuilding (#2818)
pubpub-zz Sep 13, 2024
c4e95bd
STY: Use f-string = functionality (#2835)
j-t-1 Sep 13, 2024
78baa8f
BUG: Warn when visitor* arguments are ignored (#2845)
kaos-ocs Sep 14, 2024
a790532
ENH: Add capability to remove /Info from PDF (#2820)
pubpub-zz Sep 14, 2024
1bbc301
MAINT: Deprecate PdfMerger, AnnotationBuilder and other deprecations …
pubpub-zz Sep 14, 2024
8ebd311
MAINT: Simplify test with None and NullObject (#2829)
pubpub-zz Sep 14, 2024
ac2983b
STY: Minor code-style improvements for _reader.py
MartinThoma Sep 14, 2024
1f0861f
Merge branch 'main' into reader-minor-sty
MartinThoma Sep 14, 2024
dfa3d5c
Fix tests
MartinThoma Sep 15, 2024
8eefba8
BUG: test_image_without_pillow cannot find pypdf (#2850)
kaos-ocs Sep 15, 2024
6253b4b
Update pypdf/_reader.py
MartinThoma Sep 15, 2024
bc3ae82
fix doc building warning
MartinThoma Sep 15, 2024
7a4409f
Undo is_null_or_none
MartinThoma Sep 15, 2024
dd68fa1
Undo
MartinThoma Sep 15, 2024
7510d54
Undo
MartinThoma Sep 15, 2024
e21eff3
Merge branch 'main' into reader-minor-sty
MartinThoma Sep 15, 2024
637bc44
REL: 5.0.0 (#2851)
pubpub-zz Sep 17, 2024
27df17e
Merge branch 'main' into reader-minor-sty
pubpub-zz Sep 17, 2024
c00ec60
DOC: Tiny changes (#2844)
j-t-1 Sep 17, 2024
847ae54
Merge branch 'main' into reader-minor-sty
pubpub-zz Sep 17, 2024
15 changes: 8 additions & 7 deletions .github/workflows/github-ci.yaml
@@ -57,12 +57,12 @@ jobs:
runs-on: ubuntu-20.04
strategy:
matrix:
python-version: ["3.7", "3.8", "3.9", "3.10", "3.11", "3.12"]
python-version: ["3.8", "3.9", "3.10", "3.11", "3.12", "3.13-dev"]
use-crypto-lib: ["cryptography"]
include:
- python-version: "3.7"
- python-version: "3.8"
use-crypto-lib: "pycryptodome"
- python-version: "3.7"
- python-version: "3.8"
use-crypto-lib: "none"
steps:
- name: Update APT packages
@@ -83,14 +83,14 @@ jobs:
key: cache-downloaded-files
- name: Setup Python
uses: actions/setup-python@v5
if: matrix.python-version == '3.7' || matrix.python-version == '3.8' || matrix.python-version == '3.9' || matrix.python-version == '3.10'
if: matrix.python-version == '3.8' || matrix.python-version == '3.9' || matrix.python-version == '3.10'
with:
python-version: ${{ matrix.python-version }}
cache: 'pip'
cache-dependency-path: '**/requirements/ci.txt'
- name: Setup Python (3.11+)
uses: actions/setup-python@v5
if: matrix.python-version == '3.11' || matrix.python-version == '3.12'
if: matrix.python-version == '3.11' || matrix.python-version == '3.12' || matrix.python-version == '3.13-dev'
with:
python-version: ${{ matrix.python-version }}
allow-prereleases: true
@@ -102,11 +102,11 @@ jobs:
- name: Install requirements (Python 3)
run: |
pip install -r requirements/ci.txt
if: matrix.python-version == '3.7' || matrix.python-version == '3.8' || matrix.python-version == '3.9' || matrix.python-version == '3.10'
if: matrix.python-version == '3.8' || matrix.python-version == '3.9' || matrix.python-version == '3.10'
- name: Install requirements (Python 3.11+)
run: |
pip install -r requirements/ci-3.11.txt
if: matrix.python-version == '3.11' || matrix.python-version == '3.12'
if: matrix.python-version == '3.11' || matrix.python-version == '3.12' || matrix.python-version == '3.13-dev'
- name: Remove pycryptodome and cryptography
run: |
pip uninstall pycryptodome cryptography -y
@@ -135,6 +135,7 @@ jobs:
name: coverage-data.${{ matrix.python-version }}-${{ matrix.use-crypto-lib }}
path: .coverage.*
if-no-files-found: ignore
include-hidden-files: true

codestyle:
name: Check code style issues
7 changes: 5 additions & 2 deletions .github/workflows/release.yaml
@@ -12,6 +12,9 @@ on:
permissions:
contents: write

env:
HEAD_COMMIT_MESSAGE: ${{ github.event.head_commit.message }}

jobs:
build_and_publish:
name: Publish a new version
@@ -24,15 +27,15 @@ jobs:
- name: Extract version from commit message
id: extract_version
run: |
VERSION=$(echo "${{ github.event.head_commit.message }}" | grep -oP '(?<=REL: )\d+\.\d+\.\d+')
VERSION=$(echo "$HEAD_COMMIT_MESSAGE" | grep -oP '(?<=REL: )\d+\.\d+\.\d+')
echo "version=$VERSION" >> $GITHUB_OUTPUT

- name: Extract tag message from commit message
id: extract_message
run: |
VERSION="${{ steps.extract_version.outputs.version }}"
delimiter="$(openssl rand -hex 8)"
MESSAGE=$(echo "${{ github.event.head_commit.message }}" | sed "0,/REL: $VERSION/s///" )
MESSAGE=$(echo "$HEAD_COMMIT_MESSAGE" | sed "0,/REL: $VERSION/s///" )
echo "message<<${delimiter}" >> $GITHUB_OUTPUT
echo "$MESSAGE" >> $GITHUB_OUTPUT
echo "${delimiter}" >> $GITHUB_OUTPUT
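
The change in this hunk passes the commit message through the `HEAD_COMMIT_MESSAGE` environment variable instead of expanding `${{ github.event.head_commit.message }}` inside the script, so the shell treats the message as data rather than as code. As a rough illustration only, the same two extraction steps can be sketched in Python. `split_release_commit` is a hypothetical helper, not part of the workflow, and it only approximates the `grep -oP` and `sed` calls: it takes the first match and removes the marker literally.

```python
import re
from typing import Optional, Tuple


def split_release_commit(message: str) -> Tuple[Optional[str], str]:
    """Return (version, tag message) for a 'REL: x.y.z' commit message.

    Hypothetical helper: mirrors the lookbehind pattern used by the
    workflow's grep call and the "remove the first marker" sed call.
    """
    match = re.search(r"(?<=REL: )\d+\.\d+\.\d+", message)
    if match is None:
        # Not a release commit: no version, message unchanged.
        return None, message
    version = match.group(0)
    # Drop only the first 'REL: <version>' marker, keep the rest as the body.
    return version, message.replace(f"REL: {version}", "", 1)


version, body = split_release_commit(
    "REL: 5.0.0\n\n## What's new\n- Drop Python 3.7 support"
)
```

Because the message is only ever handled as a string value, characters such as backticks or `$(...)` inside it are never interpreted, which is the property the environment-variable change restores in the shell version.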
33 changes: 33 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,38 @@
# CHANGELOG

## Version 5.0.0, 2024-09-15

This version drops support for Python 3.7 (not maintained since July 2023), PdfMerger (use PdfWriter instead) and AnnotationBuilder (use annotations instead).


### Deprecations (DEP)
- Remove the deprecated PdfMerger and AnnotationBuilder classes and other deprecations cleanup (#2813)
- Drop Python 3.7 support (#2793)

### New Features (ENH)
- Add capability to remove /Info from PDF (#2820)
- Add incremental capability to PdfWriter (#2811)
- Add UniGB-UTF16 encodings (#2819)
- Accept utf strings for metadata (#2802)
- Report PdfReadError instead of RecursionError (#2800)
- Compress PDF files merging identical objects (#2795)

### Bug Fixes (BUG)
- Fix sheared image (#2801)

### Robustness (ROB)
- Robustify .set_data() (#2821)
- Raise PdfReadError when missing /Root in trailer (#2808)
- Fix extract_text() issues on damaged PDFs (#2760)
- Handle images with empty data when processing an image from bytes (#2786)

### Developer Experience (DEV)
- Fix coverage uploads (#2832)
- Test against Python 3.13 (#2776)


[Full Changelog](https://github.com/py-pdf/pypdf/compare/4.3.1...5.0.0)

## Version 4.3.1, 2024-07-21

### Bug Fixes (BUG)
1 change: 1 addition & 0 deletions CONTRIBUTORS.md
@@ -19,6 +19,7 @@ history and [GitHub's 'Contributors' feature](https://github.com/py-pdf/pypdf/gr
* [ediamondscience](https://github.com/ediamondscience)
* [Ermeson, Felipe](https://github.com/FelipeErmeson)
* [Freitag, François](https://github.com/francoisfreitag)
* [Gagnon, William G.](https://github.com/williamgagnon)
* [Górny, Michał](https://github.com/mgorny)
* [Grillo, Miguel](https://github.com/Ineffable22)
* [Gutteridge, David H.](https://github.com/dhgutteridge)
2 changes: 1 addition & 1 deletion docs/dev/documentation.md
@@ -53,4 +53,4 @@ The title of the PR will be used as the first line of that combined commit messa

The first comment within the commit will be used as the message body.

See [dev intro](intro.html#commit-messages) for more details.
See [developer intro](intro.html#commit-messages) for more details.
8 changes: 3 additions & 5 deletions docs/modules/PageObject.rst
@@ -6,14 +6,12 @@ The PageObject Class
:undoc-members:
:show-inheritance:

.. autoclass:: pypdf._utils.ImageFile
.. autoclass:: pypdf._page.VirtualListImages
:members:
:undoc-members:
:show-inheritance:
:exclude-members: IndirectObject

.. autoclass:: pypdf._utils.File
.. autoclass:: pypdf._page.ImageFile
:members:
:inherited-members: File
:undoc-members:
:show-inheritance:
:exclude-members: IndirectObject
20 changes: 7 additions & 13 deletions docs/user/file-size.md
@@ -9,23 +9,17 @@ Some PDF documents contain the same object multiple times. For example, if an
image appears three times in a PDF it could be embedded three times. Or it can
be embedded once and referenced twice.

This can be done by reading and writing the file:
When adding data to a PdfWriter, the data is copied while respecting the original format.
For example, if two pages include the same image which is duplicated in the source document, the object will be duplicated in the PdfWriter object.

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("big-old-file.pdf")
writer = PdfWriter()
Additionally, when you delete objects in a document, pypdf cannot easily tell whether they are still used elsewhere or whether the user wants to keep them. When the PDF file is written, such objects remain hidden in it: they are part of the file but are not displayed.

for page in reader.pages:
writer.add_page(page)
In order to reduce the file size, use a compression call: `writer.compress_identical_objects(remove_identicals=True, remove_orphans=True)`

if reader.metadata is not None:
writer.add_metadata(reader.metadata)
* `remove_identicals` enables/disables the merging of identical objects.
* `remove_orphans` enables/disables the removal of unused objects.

with open("smaller-new-file.pdf", "wb") as fp:
writer.write(fp)
```
It is recommended to apply this process just before writing to the file/stream.

It depends on the PDF how well this works, but we have seen an 86% file
reduction (from 5.7 MB to 0.8 MB) within a real PDF.
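
To make the two options concrete, here is a minimal, library-free sketch of the idea. It is not pypdf's implementation: the object table, the `merge_identical` and `drop_orphans` helpers, and the flat list of references are all simplifying assumptions made for this example.

```python
import hashlib
from typing import Dict, List, Tuple


def merge_identical(
    objects: Dict[int, bytes], refs: List[int]
) -> Tuple[Dict[int, bytes], List[int]]:
    """Keep one copy of each distinct payload and remap references to it."""
    first_seen: Dict[str, int] = {}
    remap: Dict[int, int] = {}
    for obj_id in sorted(objects):
        digest = hashlib.sha256(objects[obj_id]).hexdigest()
        # The first object with a given digest becomes the canonical copy.
        remap[obj_id] = first_seen.setdefault(digest, obj_id)
    kept = {i: data for i, data in objects.items() if remap[i] == i}
    return kept, [remap[r] for r in refs]


def drop_orphans(objects: Dict[int, bytes], refs: List[int]) -> Dict[int, bytes]:
    """Remove objects that nothing refers to any more."""
    used = set(refs)
    return {i: data for i, data in objects.items() if i in used}


# Toy "document": objects 1 and 2 are the same image, object 4 is unused.
objects = {1: b"image-bytes", 2: b"image-bytes", 3: b"font-bytes", 4: b"deleted-page"}
page_refs = [1, 2, 3]

merged, new_refs = merge_identical(objects, page_refs)
compact = drop_orphans(merged, new_refs)
```

In this toy model the duplicate image collapses into object 1, both references point at it, and the unreferenced object 4 is dropped. A real PDF has nested references, so an actual implementation has to walk the object graph rather than a flat list.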
24 changes: 24 additions & 0 deletions docs/user/metadata.md
@@ -76,6 +76,30 @@ writer.add_metadata(
}
)

# Clear all data but keep the entry in PDF
writer.metadata = {}

# Replace all entries with new set of entries
writer.metadata = {
"/Author": "Martin",
"/Producer": "Libre Writer",
}

# Save the new PDF to a file
with open("meta-pdf.pdf", "wb") as f:
writer.write(f)
```

## Removing the metadata entry

```python
from pypdf import PdfWriter

writer = PdfWriter("example.pdf")

# Remove Metadata (/Info entry)
writer.metadata = None

# Save the new PDF to a file
with open("meta-pdf.pdf", "wb") as f:
writer.write(f)
39 changes: 17 additions & 22 deletions pypdf/_cmap.py
@@ -3,13 +3,12 @@
from typing import Any, Dict, List, Tuple, Union, cast

from ._codecs import adobe_glyphs, charset_encoding
from ._utils import b_, logger_error, logger_warning
from ._utils import logger_error, logger_warning
from .generic import (
DecodedStreamObject,
DictionaryObject,
IndirectObject,
NullObject,
StreamObject,
is_null_or_none,
)


@@ -127,6 +126,8 @@ def build_char_map_from_dict(
"/ETenms-B5-V": "cp950",
"/UniCNS-UTF16-H": "utf-16-be",
"/UniCNS-UTF16-V": "utf-16-be",
"/UniGB-UTF16-H": "gb18030",
"/UniGB-UTF16-V": "gb18030",
# UCS2 in code
}
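
A minimal sketch of how such a table can be used: look up the predefined CMap name and decode the raw bytes with the matching Python codec. The `decode_with_cmap` helper and the reduced `ENCODING_BY_CMAP` table are hypothetical and only illustrate the lookup, not pypdf's actual code path.

```python
from typing import Dict

# Reduced, illustrative copy of two entries from the table in the diff.
ENCODING_BY_CMAP: Dict[str, str] = {
    "/UniCNS-UTF16-H": "utf-16-be",
    "/UniCNS-UTF16-V": "utf-16-be",
}


def decode_with_cmap(cmap_name: str, raw: bytes) -> str:
    """Decode raw string bytes using the codec registered for a CMap name."""
    codec = ENCODING_BY_CMAP[cmap_name]  # KeyError for unknown CMap names
    return raw.decode(codec)


# Two big-endian UTF-16 code units: U+4E2D and U+6587.
text = decode_with_cmap("/UniCNS-UTF16-H", b"\x4e\x2d\x65\x87")
```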

@@ -258,8 +259,8 @@ def prepare_cm(ft: DictionaryObject) -> bytes:
tu = ft["/ToUnicode"]
cm: bytes
if isinstance(tu, StreamObject):
cm = b_(cast(DecodedStreamObject, ft["/ToUnicode"]).get_data())
elif isinstance(tu, str) and tu.startswith("/Identity"):
cm = cast(DecodedStreamObject, ft["/ToUnicode"]).get_data()
else: # if (tu is None) or cast(str, tu).startswith("/Identity"):
# the full range 0000-FFFF will be processed
cm = b"beginbfrange\n<0000> <0001> <0000>\nendbfrange"
if isinstance(cm, str):
@@ -448,34 +449,27 @@ def compute_space_width(
en: int = cast(int, ft["/LastChar"])
if st > space_code or en < space_code:
raise Exception("Not in range")
if w[space_code - st] == 0:
if w[space_code - st].get_object() == 0:
raise Exception("null width")
sp_width = w[space_code - st]
sp_width = w[space_code - st].get_object()
except Exception:
if "/FontDescriptor" in ft and "/MissingWidth" in cast(
DictionaryObject, ft["/FontDescriptor"]
):
sp_width = ft["/FontDescriptor"]["/MissingWidth"] # type: ignore
sp_width = ft["/FontDescriptor"]["/MissingWidth"].get_object() # type: ignore
else:
# will consider width of char as avg(width)/2
m = 0
cpt = 0
for x in w:
if x > 0:
m += x
for xx in w:
xx = xx.get_object()
if xx > 0:
m += xx
cpt += 1
sp_width = m / max(1, cpt) / 2

if isinstance(sp_width, IndirectObject):
# According to
# 'Table 122 - Entries common to all font descriptors (continued)'
# the MissingWidth should be a number, but according to #2286 it can
# be an indirect object
obj = sp_width.get_object()
if obj is None or isinstance(obj, NullObject):
return 0.0
return obj # type: ignore

if is_null_or_none(sp_width):
sp_width = 0.0
return sp_width
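
The last fallback in this function ("width of char as avg(width)/2") can be isolated as a small sketch. `fallback_space_width` is a hypothetical name, and the sketch assumes plain numbers; it leaves out the `.get_object()` resolution of indirect objects that the hunk adds.

```python
from typing import Sequence


def fallback_space_width(widths: Sequence[float]) -> float:
    """Half the average of the positive glyph widths; 0.0 if there are none."""
    total = 0.0
    count = 0
    for width in widths:
        if width > 0:  # zero-width entries are ignored, as in the loop above
            total += width
            count += 1
    # max(1, count) avoids a division by zero for an empty or all-zero list.
    return total / max(1, count) / 2


estimate = fallback_space_width([0, 500, 700])
```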


@@ -488,8 +482,9 @@ def type1_alternative(
if "/FontDescriptor" not in ft:
return map_dict, space_code, int_entry
ft_desc = cast(DictionaryObject, ft["/FontDescriptor"]).get("/FontFile")
if ft_desc is None:
if is_null_or_none(ft_desc):
return map_dict, space_code, int_entry
assert ft_desc is not None, "mypy"
txt = ft_desc.get_object().get_data()
txt = txt.split(b"eexec\n")[0] # only clear part
txt = txt.split(b"/Encoding")[1] # to get the encoding part
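
The `is_null_or_none` check used in this hunk treats Python `None` and a PDF null object alike. A self-contained sketch of that behaviour: `NullObject` and `Indirect` here are hypothetical stand-ins written for this example, not the classes from `pypdf.generic`, and the real helper may differ in detail.

```python
from typing import Any


class NullObject:
    """Stand-in for a PDF null object (assumption for this sketch)."""

    def get_object(self) -> "NullObject":
        return self


class Indirect:
    """Stand-in for an indirect reference that resolves to a target."""

    def __init__(self, target: Any) -> None:
        self._target = target

    def get_object(self) -> Any:
        return self._target


def is_null_or_none(value: Any) -> bool:
    """True for None, a null object, or a reference resolving to one."""
    return value is None or isinstance(value.get_object(), NullObject)


checks = [
    is_null_or_none(None),
    is_null_or_none(NullObject()),
    is_null_or_none(Indirect(NullObject())),
    is_null_or_none(Indirect(600)),
]
```

Folding both cases into one predicate is what lets the call sites in the diff replace separate `is None` and `isinstance(..., NullObject)` tests with a single check.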