Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Capability to return a BytesIO/filelike even if it isn't encrypted? #85

Open
CDWimmer opened this issue Jan 11, 2024 · 3 comments
Open

Comments

@CDWimmer
Copy link

CDWimmer commented Jan 11, 2024

Hello,

I understand that this is potentially out of scope for the project, but considering the existence of OfficeFile.is_encrypted() I feel this would tie its usage up nicely.

I'll explain a use case via example:
I am using this to load up a set of usually-encrypted Excel files into pandas, this is great, except a handful of these Excel files have randomly have not been password protected. I don't actually care whether or not they have a password, I just want to put them all into dataframes.

Right now, the argument I pass to pandas.read_excel() is either a non-protected Excel file's Path, or a BytesIO objected retrieved using this library.

This is fine but it has resulted in this messy function:

def decrypt_office_file(file: Path, password: str = None) -> Union[io.BytesIO, Path]:
    decrypted_file = io.BytesIO()
    with open(file, 'rb') as f:
        office_file = msoffcrypto.OfficeFile(f)
        if office_file.is_encrypted():
            office_file.load_key(password=password)
            office_file.decrypt(decrypted_file)
        else:
            decrypted_file = file
    return decrypted_file


excel_file = decrypt_office_file("my_file.xlsx")
df = pd.read_excel(excel_file, ...)

And then I just have to hope everything downstream is cool with taking either a BytesIO or a str/Path, which is okay for pandas but I imagine is less okay for other libraries/use cases.

I'm not sure how it would be best to insert the functionality, but something like OfficeFile.to_bytes() (I'm sure there are better ideas for function names available) would be great, then we can have consistent return types.

I also find it really odd that .decrypt() takes the object you want to inject the file into as an argument, rather than returning a BytesIO object? It makes following the code flow feel awkward to me, but that's an issue for another day!

@nolze
Copy link
Owner

nolze commented Jan 11, 2024

Thanks for your suggestion!
I think adding a utility function (as a context manager) similar to your decrypt_office_file() to the library is a good idea and will do it.
In the case of your code example, I might replace decrypted_file = file with decrypted_file = BytesIO(f) so that the function always returns BytesIO.

I also find it really odd that .decrypt() takes the object you want to inject the file into as an argument, rather than returning a BytesIO object? It makes following the code flow feel awkward to me, but that's a for another day!

I understand. The problem is that always creating a BytesIO object can consume unnecessary memory, especially if the document is large in file size.

@CDWimmer
Copy link
Author

CDWimmer commented Jan 11, 2024

Thanks for your suggestion! I think adding a utility function (as a context manager) similar to your decrypt_office_file() to the library is a good idea and will do it. In the case of your code example, I might replace decrypted_file = file with decrypted_file = BytesIO(f) so that the function always returns BytesIO.

Brilliant, I just threw that together to get what I needed, I'm sure you'll make it better 😄

I understand. The problem is that always creating a BytesIO object can consume unnecessary memory, especially if the document is large in file size.

~ snip ~ I wrote a load of nonsense, I'm sure you know what you're doing! Plus, this way can support alternatives to just BytesIO which is nice, I guess you could with open as f and pass f as ofile!

@954-Ivory
Copy link

Thanks for your suggestion! I think adding a utility function (as a context manager) similar to your decrypt_office_file() to the library is a good idea and will do it. In the case of your code example, I might replace decrypted_file = file with decrypted_file = BytesIO(f) so that the function always returns BytesIO.

Brilliant, I just threw that together to get what I needed, I'm sure you'll make it better 😄

I understand. The problem is that always creating a BytesIO object can consume unnecessary memory, especially if the document is large in file size.

~ snip ~ I wrote a load of nonsense, I'm sure you know what you're doing! Plus, this way can support alternatives to just BytesIO which is nice, I guess you could with open as f and pass f as ofile!

There will probably be a lot of people like me who forgot to f.seek(0) 😂.

with open(file_path, 'rb') as f:
    file = msoffcrypto.OfficeFile(f)
    if file.is_encrypted():
        file.load_key(password=password)  # Use password
        file.decrypt(decrypted_io)
    else:
        f.seek(0) # don't forget this
        decrypted_io.write(f.read())
return decrypted_io

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants