Dealing with invalid bookmarks #90

rgoubet · 2020-09-17T07:43:52Z

I'm dealing with a collection of PDF files that contain invalid bookmarks: in Acrobat, they show in the bookmark tree, but the bookmark properties show that they have no destination.

The getOutlines() method does not return these bookmarks at all. I was hoping that I could extract these invalid bookmarks and fix them with pyPDF4 (in my specific case, I would set them to the destination of the previous bookmark), but since they're not listed, that's not possible.

Is there a tweak I could apply to get these invalid bookmarks (and update them)?

Thanks!

R.

pubpub-zz · 2020-09-17T18:42:32Z

Hi @MrGou,
I've done quite a bit of rewriting and fixing in my fork:
https://github.com/pubpub-zz/PyPDF4
It is still in alpha. A whl is available here:
pypdf4-1.27.0PPzz_1-py2.py3-none-any.whl.zip
If you still get some issues, can you send an example for analysis?

pubpub-zz · 2020-09-24T20:00:37Z

Hi @MrGou,
Have you been able to test my whl? Do you have any feedback? Otherwise, can you send me an example with invalid bookmarks?

rgoubet · 2020-10-01T15:10:42Z

Sorry for the delay in responding. I've tested your whl but couldn't see any difference. Invalid bookmarks seem to be skipped entirely. So they are not returned by the following code:

# `reader` is assumed to be an already opened PdfFileReader
def printBookmarkTree(bookmark_list, depth=0):
    for item in bookmark_list:
        if isinstance(item, list):
            # recursive call: a nested list holds the children
            # of the bookmark that precedes it
            printBookmarkTree(item, depth + 1)
        else:
            page = reader.getDestinationPageNumber(item) + 1
            print('\t' * depth + item.title + '\t' + str(page))

printBookmarkTree(reader.getOutlines())
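
To make the nesting easier to reason about without a PDF at hand, here is a minimal sketch on stand-in data. It assumes the outline list has the shape used above (a sub-list holds the children of the entry just before it). The `flatten` helper and the sample strings are hypothetical and not part of the library.

```python
def flatten(entries, depth=0):
    """Yield (depth, entry) pairs from a nested outline-style list."""
    for entry in entries:
        if isinstance(entry, list):
            # A nested list holds the children of the entry just before it.
            yield from flatten(entry, depth + 1)
        else:
            yield depth, entry


# Stand-in data: plain strings instead of real bookmark objects.
sample = ["Chapter 1", ["Section 1.1", ["Detail"], "Section 1.2"], "Chapter 2"]
flat = list(flatten(sample))
for depth, entry in flat:
    print("  " * depth + str(entry))
```

Passing the depth down as an argument avoids having to increment and decrement a shared counter while recursing.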

By the way, it seems that I had to import PyPDF4 under the name pypdf:

import pypdf as PyPDF4

Not sure whether this is intended.

Unfortunately, I am not able to share an example file, as they are highly confidential. I can't even say how to reproduce the issue: how the issue occurred is what I'm trying to find out in the first place.

pubpub-zz · 2020-10-04T10:37:00Z

No problem about the delay, or about not being able to share documents.
First, about the pypdf renaming: this choice was already implemented in claird's fork.
About your problem, I propose the following code:

# Assumes PdfReadError has been imported from the library beforehand.
def get_outlines1(self, node=None, _outlines=None):
    if _outlines is None:
        _outlines = []
        catalog = self.root_object
        # get the outline dictionary and named destinations
        if "/Outlines" in catalog:
            try:
                lines = catalog["/Outlines"]
            except PdfReadError:
                # This occurs if the /Outlines object reference is
                # incorrect; for an example of such a file, see
                # https://unglueit-files.s3.amazonaws.com/ebf/7552c42e9280b4476e59e77acc0bc812.pdf
                # Continue loading the file without the bookmarks.
                return _outlines
            if "/First" in lines:
                node = lines["/First"]
    if node is None:
        return _outlines
    # see if there are any more outlines
    while True:
        outline = self._build_outline(node)
        if outline:
            # keep the raw node next to the built outline
            _outlines.append((node, outline))
        # check for sub-outlines
        if "/First" in node:
            sub_outlines = []
            get_outlines1(self, node["/First"], sub_outlines)
            if sub_outlines:
                _outlines.append(sub_outlines)
        if "/Next" not in node:
            break
        node = node["/Next"]
    return _outlines

# Assumes the fork has been imported as `pypdf` (import pypdf).
def flatten_and_whiten_outlines(ar, prefix, blanking, line_=0):
    for a in ar:
        if isinstance(a, list):
            flatten_and_whiten_outlines(a, prefix + "  ", blanking, line_ + 1)
        else:
            if blanking:
                # overwrite the titles in both the raw node and the built outline
                a[0]["/Title"] = pypdf.TextStringObject("****BLANKED***")
                a[1]["/Title"] = pypdf.TextStringObject("****BLANKED***")
                try:
                    a[1]["/Dest"]["/Title"] = pypdf.TextStringObject("****BLANKDEST***")
                except Exception:
                    # no /Dest entry, or nothing to blank in it
                    pass
            print("$%d" % line_, prefix, repr(a))
            line_ += 1

Define those two functions directly in the shell.
You can then call the two functions one after the other:

ar=get_outlines1(***pdf_object***)

flatten_and_whiten_outlines(ar,"",False)

These calls will output the raw data from the PDF and then the decrypted data.
This should help you troubleshoot.
I've also implemented some code to blank the data, which you may find useful for sharing some data.
Hope this helps to improve the library.
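
As an illustration of the repair asked about at the top of the thread (giving a bookmark without a destination the destination of the previous bookmark), here is a minimal sketch. It works on plain dicts that only mimic outline items linked through /First and /Next. The function names and the sample data are hypothetical, and a real file would need the library's own objects and writer instead.

```python
def walk(node):
    """Yield every item reachable through /First and /Next, parents first."""
    while node is not None:
        yield node
        if "/First" in node:
            yield from walk(node["/First"])
        node = node.get("/Next")


def has_target(item):
    # An outline item normally points somewhere via /Dest or an /A action.
    return "/Dest" in item or "/A" in item


def inherit_missing_dests(first_item):
    """Copy the closest earlier /Dest onto items that have no target."""
    fixed = []
    last_dest = None
    for item in walk(first_item):
        if has_target(item):
            if "/Dest" in item:
                last_dest = item["/Dest"]
        elif last_dest is not None:
            item["/Dest"] = last_dest
            fixed.append(item.get("/Title"))
    return fixed


# Hypothetical three-item outline; the middle item has no destination.
c = {"/Title": "Third", "/Dest": ["page 5", "/Fit"]}
b = {"/Title": "Second", "/Next": c}
a = {"/Title": "First", "/Dest": ["page 1", "/Fit"], "/Next": b}

fixed = inherit_missing_dests(a)
print(fixed)
```

An item with no earlier destination to inherit from is left untouched, so a broken first bookmark would still need to be handled separately.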
