Dealing with invalid bookmarks #90

Open
rgoubet opened this issue Sep 17, 2020 · 4 comments

rgoubet commented Sep 17, 2020

I'm dealing with a collection of PDF files that contain invalid bookmarks: in Acrobat, they appear in the bookmark tree, but their bookmark properties show that they have no destination.

The getOutlines() method does not return these bookmarks at all. I was hoping that I could extract these invalid bookmarks and fix them with pyPDF4 (in my specific case, I would set them to the destination of the previous bookmark), but since they're not listed, that's not possible.

Is there a tweak I could apply to get these invalid bookmarks (and update them)?

Thanks!

R.
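
The repair described above (give each bookmark without a destination the destination of the previous bookmark) can be sketched independently of any PDF library. This is only an illustration under stated assumptions: the bookmarks are represented as plain `(title, destination)` pairs in document order, `None` stands for a missing destination, and `fill_missing_destinations` is a hypothetical helper name, not part of pyPDF4.

```python
def fill_missing_destinations(bookmarks):
    """Return a copy of `bookmarks` in which every entry without a
    destination inherits the destination of the closest earlier entry
    that has one.

    `bookmarks` is a list of (title, destination) pairs in document
    order; destination is None for an invalid bookmark (an assumed
    representation, not a pyPDF4 type).
    """
    fixed = []
    last_dest = None
    for title, dest in bookmarks:
        if dest is None:
            # Stays None when no earlier bookmark had a destination.
            dest = last_dest
        else:
            last_dest = dest
        fixed.append((title, dest))
    return fixed


sample = [("Intro", 0), ("Broken", None), ("Chapter 1", 4), ("Also broken", None)]
for title, dest in fill_missing_destinations(sample):
    print(title, dest)
```

A leading bookmark with no predecessor keeps `None`, since there is nothing to inherit; how to handle that case is left open here.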

@pubpub-zz

Hi @MrGou,
I've done quite a bit of rewriting and fixing in my fork:
https://github.com/pubpub-zz/PyPDF4
It is still in alpha. A whl is available here:
pypdf4-1.27.0PPzz_1-py2.py3-none-any.whl.zip
If you still have issues, can you send an example for analysis?

@pubpub-zz

Hi @MrGou,
Have you been able to test my whl? Do you have any feedback? Otherwise, can you send me an example with invalid bookmarks?

rgoubet commented Oct 1, 2020

Sorry for the delay in responding. I've tested your whl but couldn't see any difference. Invalid bookmarks seem to be skipped entirely. So they are not returned by the following code:

def printBookmarkTree(bookmark_list, depth=0):
    # A nested list holds the children of the bookmark that precedes it
    for item in bookmark_list:
        if isinstance(item, list):
            # recursive call, one level deeper
            printBookmarkTree(item, depth + 1)
        else:
            page = reader.getDestinationPageNumber(item) + 1
            print('\t' * depth + item.title + '\t' + str(page))

printBookmarkTree(reader.getOutlines())

By the way, it seems that I had to import PyPDF4 under the name pypdf:

import pypdf as PyPDF4

Not sure whether this is intended.
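
Since the package name differs between the fork (`pypdf`) and upstream (`PyPDF4`), a script that should run against either one could try both names in turn. The following is a minimal sketch using only the standard library; `import_first_available` is a hypothetical helper name, and the module names in the commented usage line are simply the two mentioned in this thread.

```python
import importlib


def import_first_available(names):
    """Return the first module in `names` that can be imported."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError(
        "none of these modules could be imported: %s" % ", ".join(names)
    )


# Hypothetical usage: prefer the fork's name, fall back to upstream.
# pdflib = import_first_available(["pypdf", "PyPDF4"])
```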

Unfortunately, I am not able to share an example file, as they are highly confidential. I can't even say how to reproduce the issue: how the issue occurred is what I'm trying to find out in the first place.

pubpub-zz commented Oct 4, 2020

No problem about the delay, or about not being able to share documents.
First, about the pypdf renaming: this choice was already implemented in claird's fork.
About your problem, I propose the following code:

def get_outlines1(self, node=None, _outlines=None):
        if _outlines is None:
            _outlines = []
            catalog = self.root_object
            # get the outline dictionary and named destinations
            if "/Outlines" in catalog:
                try:
                    lines = catalog["/Outlines"]
                except PdfReadError:
                    # This occurs if the /Outlines object reference is
                    # incorrect for an example of such a file, see
                    # https://unglueit-files.s3.amazonaws.com/ebf/7552c42e9280b4476e59e77acc0bc812.pdf
                    # so continue to load the file without the Bookmarks
                    return _outlines
                if "/First" in lines:
                    node = lines["/First"]
        if node is None:
            return _outlines
        # see if there are any more outlines
        while True:
            outline = self._build_outline(node)
            if outline:
                _outlines.append((node,outline))
            # check for sub-outlines
            if "/First" in node:
                sub_outlines = []
                get_outlines1(self,node["/First"], sub_outlines)
                if sub_outlines:
                    _outlines.append(sub_outlines)
            if "/Next" not in node:
                break
            node = node["/Next"]
        return _outlines

def flatten_and_whiten_outlines(ar, prefix, blanking, line_=0):
    # ar is the nested list returned by get_outlines1: (node, outline)
    # tuples, with sub-lists for child bookmarks
    for a in ar:
        if isinstance(a, list):
            flatten_and_whiten_outlines(a, prefix + "  ", blanking, line_ + 1)
        else:
            if blanking:
                # overwrite the titles so the output can be shared
                a[0]["/Title"] = pypdf.TextStringObject("****BLANKED***")
                a[1]["/Title"] = pypdf.TextStringObject("****BLANKED***")
                try:
                    a[1]["/Dest"]["/Title"] = pypdf.TextStringObject("****BLANKDEST***")
                except Exception:
                    pass
            print("$%d" % line_, prefix, repr(a))
            line_ += 1

Define these two functions directly in the shell.
You can then call them one after the other:

ar=get_outlines1(***pdf_object***)

flatten_and_whiten_outlines(ar,"",False)

This will output the raw data from the PDF, followed by the decrypted data.
This should help you troubleshoot.
I've also implemented some code to blank data, which you may find useful for sharing some data.
Hope this helps to improve the library.
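
Note that `get_outlines1` above only appends a node when `_build_outline` returns something for it. If that method returns nothing for a bookmark without a destination (an assumption about the fork, not verified here), such bookmarks would still be skipped. A walk over the raw `/First` and `/Next` links that reports every item, with or without a target, might look like the sketch below. Plain dicts stand in for the PDF dictionary objects so the example runs without any PDF library; `walk_outline_items` is a hypothetical helper name.

```python
def walk_outline_items(first_node, depth=0):
    """Yield (depth, node, has_target) for every outline item reachable
    from `first_node` through /Next (siblings) and /First (children).

    Nodes are dict-like; plain dicts are used here as stand-ins for
    PDF dictionary objects (an assumption of this sketch).
    """
    node = first_node
    seen = set()  # guards against a /Next chain that loops back
    while node is not None and id(node) not in seen:
        seen.add(id(node))
        # An item has a target if it carries a destination or an action.
        has_target = "/Dest" in node or "/A" in node
        yield depth, node, has_target
        if "/First" in node:
            yield from walk_outline_items(node["/First"], depth + 1)
        node = node["/Next"] if "/Next" in node else None


# Toy outline: one chapter with a child, followed by a broken sibling.
child = {"/Title": "1.1", "/Dest": [3]}
broken = {"/Title": "Broken"}
chapter = {"/Title": "Chapter 1", "/Dest": [2], "/First": child, "/Next": broken}

for depth, node, has_target in walk_outline_items(chapter):
    flag = "" if has_target else " (no destination)"
    print("  " * depth + node["/Title"] + flag)
```

With a real reader object, the starting node would be the `/First` entry of the catalog's `/Outlines` dictionary, as in `get_outlines1`.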
