Python 3 compatibility #19

Open
wants to merge 11 commits into
base: master


@frenzymadness frenzymadness commented Oct 24, 2019

Hello.

I am trying to make this tool Python 3 compatible while keeping backward compatibility with Python 2.7. I've tested my work with three scenarios and one test Word document. I am not a user of this tool, so I only compared the output under Python 2 and Python 3, and it seems to match.

Tested commands:

officeparser.py --create-manifest --extract-streams test/test.doc
officeparser.py --dump-stream-by-name=WordDocument test/test.doc
officeparser.py --print-streams test/test.doc

If you find something missing, please provide a reproducer (shell command) so I can use it to test my work and backward compatibility.
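The comparison described above can be scripted. The following is only a rough sketch, not part of this PR: the helper names `run_with` and `outputs_match` are made up for illustration, and it assumes both interpreters are on the PATH as `python2` and `python3`.

```python
import subprocess
import sys


def run_with(interpreter, script_args):
    """Run one interpreter with the given arguments.

    Returns (exit code, combined stdout/stderr as bytes).
    """
    proc = subprocess.Popen([interpreter] + list(script_args),
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    out, _ = proc.communicate()
    return proc.returncode, out


def outputs_match(interpreters, script_args):
    """True if every interpreter gives the same exit code and output."""
    results = [run_with(interp, script_args) for interp in interpreters]
    return all(result == results[0] for result in results)


# Hypothetical usage, assuming python2 and python3 are installed:
# outputs_match(['python2', 'python3'],
#               ['officeparser.py', '--print-streams', 'test/test.doc'])
```

Commands that write files (such as `--extract-streams`) would additionally need the written files compared, which this sketch does not do.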

Fixes: #18

@xambroz

xambroz commented Oct 25, 2019

Thank you very much for the help, it is cool.
I am adding one more patch for xrange, one more for the "to_hex" output where the use of binary strings would clash, and one small cosmetic patch to get rid of the annoying error raised for the missing filename when officeparser is executed without parameters.
Still testing.
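For readers following along, the two compatibility points mentioned here (xrange and hex output of binary strings) might look roughly like this. This is a sketch under assumptions, not the actual patch: `lazy_range` is a hypothetical name, and this `to_hex` is an illustrative stand-in rather than officeparser's own function.

```python
# xrange exists only on Python 2; on Python 3 the built-in range is already lazy.
try:
    lazy_range = xrange  # Python 2
except NameError:
    lazy_range = range   # Python 3


def to_hex(data):
    """Format binary data as space-separated hex bytes.

    bytearray() yields integers on both Python 2 (str) and Python 3 (bytes),
    which avoids the ord()-on-int clash when iterating bytes on Python 3.
    """
    return ' '.join('{:02x}'.format(byte) for byte in bytearray(data))
```

Using a separate name instead of rebinding `range` keeps the built-in untouched, at the cost of touching every call site.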

@frenzymadness
Author

@xambroz I can give you commit rights to my repository so you can continue there and your commits appear here. What do you think?

@xambroz

xambroz commented Oct 26, 2019

Thank you - that would work. I will add what I have.

Currently I have patches which make it work on a plain Office file.

The only thing I know is not working yet is the extraction of macros, but I hope to fix that as well.

@xambroz

xambroz commented Oct 27, 2019

In the meantime, this is what I have to add at this point:
frenzymadness#1

I know that --extract-macros is not working in python3.
Tested like this:

  1. download malware sample xls with macros from hybrid-analysis.com
    https://www.hybrid-analysis.com/sample/8db9495dcd5b9ed6a8f1844ffc496f3eb282eb323a00e6d4aa92c58710c5890f?environmentId=100

  2. gunzip the file

gunzip 8db9495dcd5b9ed6a8f1844ffc496f3eb282eb323a00e6d4aa92c58710c5890f.bin.gz
  3. try with python2
python2 officeparser.py --extract-macros 8db9495dcd5b9ed6a8f1844ffc496f3eb282eb323a00e6d4aa92c58710c5890f.bin
Traceback (most recent call last):
  File "officeparser.py", line 1235, in <module>
    _main()
  File "officeparser.py", line 836, in _main
    buffer = StringIO()
NameError: global name 'StringIO' is not defined
  4. try with python3
$ python3 $(which officeparser.py) --extract-macros 8db9495dcd5b9ed6a8f1844ffc496f3eb282eb323a00e6d4aa92c58710c5890f.bin 
Traceback (most recent call last):
  File "/usr/bin/officeparser.py", line 1234, in <module>
    _main()
  File "/usr/bin/officeparser.py", line 835, in _main
    buffer = StringIO()
NameError: name 'StringIO' is not defined

Even adding "from io import StringIO" does not directly fix the situation:
  5. try with python2 and io.StringIO

python2 officeparser.py --extract-macros 8db9495dcd5b9ed6a8f1844ffc496f3eb282eb323a00e6d4aa92c58710c5890f.bin
Traceback (most recent call last):
  File "officeparser.py", line 1235, in <module>
    _main()
  File "officeparser.py", line 837, in _main
    buffer.write(ofdoc.get_stream(project.index))
TypeError: unicode argument expected, got 'str'
  6. try with python3 and io.StringIO
python3 officeparser.py --extract-macros 8db9495dcd5b9ed6a8f1844ffc496f3eb282eb323a00e6d4aa92c58710c5890f.bin
Traceback (most recent call last):
  File "officeparser.py", line 1235, in <module>
    _main()
  File "officeparser.py", line 837, in _main
    buffer.write(ofdoc.get_stream(project.index))
TypeError: string argument expected, got 'bytes'

The original (cStringIO.StringIO) gives this:

$ python2 officeparser.py.orig --extract-macros 8db9495dcd5b9ed6a8f1844ffc496f3eb282eb323a00e6d4aa92c58710c5890f.bin 
INFO: Saving VBA code to ./Sem_1.cls
INFO: Saving VBA code to ./Page1_1.cls
INFO: Saving VBA code to ./Module1_1.bas
INFO: Saving VBA code to ./UserForm1_1.frm
INFO: Saving VBA code to ./Module2_1.bas
INFO: Saving VBA code to ./Module3_1.bas
INFO: Saving VBA code to ./UserForm6_1.frm
INFO: Saving VBA code to ./Page11_1.cls
INFO: Saving VBA code to ./Module6_1.bas
INFO: Saving VBA code to ./Module5_1.bas
INFO: Saving VBA code to ./Module4_1.bas
INFO: Saving VBA code to ./Class1_1.cls
INFO: Saving VBA code to ./Sheet1_1.cls
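The two TypeErrors above come from io.StringIO accepting only text on both Python versions, while the old cStringIO.StringIO accepted Python 2 byte strings. A possible direction, sketched here under the assumption that get_stream() returns raw bytes (as the Python 3 traceback suggests), is io.BytesIO, which exists on 2.7 and 3. The helper names below are hypothetical and the rest of the macro extraction code would still need to handle bytes consistently.

```python
import io


def collect_stream(chunks):
    """Accumulate raw byte chunks in a binary buffer and return them joined."""
    buf = io.BytesIO()
    for chunk in chunks:
        buf.write(chunk)
    return buf.getvalue()


def text_buffer_rejects_bytes():
    """Show that io.StringIO refuses bytes, matching the tracebacks above."""
    try:
        io.StringIO().write(b'abc')
    except TypeError:
        return True
    return False
```

Text would then be decoded explicitly at the point where it is needed, instead of relying on the buffer type to paper over the difference.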

rpmbuild and others added 4 commits October 31, 2019 13:41
for python2 alias range to xrange
This fixes annoying bug/feature that the script crashes when no file attribute is provided
…python3 binarray to ascii/hexdump

This fixes issue with --print-header and --print-directory
@frenzymadness
Author

I am investigating the macros extraction.

The first question I need answered is whether the PROJECT stream in a document should be handled as bytes or as Unicode. Right now the code mixes the two, and that is why it does not work in Python 3. Do I understand correctly that the stream contains some VB script code, so it should be handled as Unicode?

@xambroz

xambroz commented Nov 14, 2019

Hello,
Yes, the PROJECT stream in Office documents seems to hold metadata about the macros in plaintext, in INI format.

$ python2 officeparser.py.orig --dump-stream-by-name PROJECT word_form.doc 
ID="{F71D9A8C-3763-458D-A309-7E5E41C49A1A}"
Document=ThisDocument/&H00000000
Module=NewMacros
Name="Project"
HelpContextID="0"
VersionCompatible32="393222000"
CMG="C1C327AD2BAD2BAD2BAD2B"
DPB="828064A724A824A824"
GC="4341A5E667E767E798"

[Host Extender Info]
&H00000001={3832D640-CF90-11CF-8E43-00A0C911005A};VBE;&H00000000
&H00000002={000209F2-0000-0000-C000-000000000046};Word8.0;&H00000000

[Workspace]
ThisDocument=46, 46, 678, 454, 
NewMacros=69, 69, 678, 506, Z
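Given a dump like the one above, one way to handle the stream would be to read it as bytes and decode it once before parsing. The sketch below is an illustration only: `parse_project_stream` is a hypothetical helper, and the default code page `cp1252` is an assumption (the real code page may differ per document). It parses by hand rather than with configparser because the leading properties have no section header and keys such as Module may repeat.

```python
def parse_project_stream(raw, codepage='cp1252'):
    """Decode a PROJECT stream and split it into properties and sections.

    Returns (properties, sections): properties is a list of (key, value)
    pairs found before the first [Section]; sections maps each section
    name to its own list of (key, value) pairs.
    """
    text = raw.decode(codepage, 'replace')  # assumed code page, see above
    properties = []
    sections = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('[') and line.endswith(']'):
            current = line[1:-1]
            sections[current] = []
            continue
        key, sep, value = line.partition('=')
        if not sep:
            continue  # not a key=value line
        target = properties if current is None else sections[current]
        target.append((key, value.strip('"')))
    return properties, sections
```

With this shape, module names could be read from the `Module` and `Document` entries in the properties list.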

@frenzymadness
Author

Hello.

Unfortunately, I no longer have the capacity to work on this. Could we please merge this PR to make officeparser at least partially Python 3 compatible, so others can continue without repeating the same work?

Development

Successfully merging this pull request may close these issues.

Python3 support
2 participants