Categorize web and script files #121
Conversation
Wondering if:

I was thinking it could be made into its own function, since it isn't really a check based on magic bytes. Using the
Something I wanted to make a note on regarding the regex checks -- I came across a few

In particular, the last one stands out where the end of the line is passing a

This sounds like a lot of work though for an uncommon case, so I'm leaning towards adding that as a GitHub issue for the backlog.
The issue with having arguments in the shebang line is tricky. From what I know:

My next thought would be to check for single/double quotes or backslashes that include the space in the path. Windows doesn't acknowledge shebang lines, so maybe backslash checking won't be a huge issue. For the GitHub issue, is there anything else that would complicate this?
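To illustrate the quote-aware idea above, here is a minimal sketch of extracting an interpreter path from a shebang line while tolerating quoted paths that contain spaces. The function name and regex are illustrative assumptions, not code from this PR, and real shebang handling is ultimately kernel-dependent:

```python
import re

def parse_shebang(line: str) -> str:
    """Extract the interpreter path from a shebang line.

    A quoted path keeps its embedded spaces together; an unquoted
    path ends at the first whitespace (so trailing arguments are
    ignored). Sketch only -- not the PR's implementation.
    """
    line = line.lstrip("#!").strip()
    match = re.match(r'"([^"]+)"|\'([^\']+)\'|(\S+)', line)
    if not match:
        return ""
    # Return whichever alternative (double-quoted, single-quoted,
    # or bare token) actually matched.
    return next(g for g in match.groups() if g is not None)

print(parse_shebang('#!/usr/bin/env python3'))        # /usr/bin/env
print(parse_shebang('#!"/opt/my tools/python3" -u'))  # /opt/my tools/python3
```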
Forgot to make these comments visible.
suffix = pathlib.Path(filepath).suffix.lower()
head = f.read(256)
if suffix in _filetype_extensions:
    return _filetype_extensions[suffix]
I'm thinking that instead of mixing the suffix and #! checks, let's move the suffix-related parts of this to after the try/except block (and remove the return None that's right above the except line).
This will also give priority to whatever the #! line says, which should be more reliable than the file suffix, which could "lie" a bit more easily.
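A hedged sketch of the ordering described above: shebang first, suffix as the fallback. The table contents and the surrounding function shape are assumptions for illustration; only the names `_filetype_extensions` and `_interpreters` come from the snippets in this PR:

```python
import pathlib

# Illustrative tables; the real ones live elsewhere in the codebase.
_filetype_extensions = {".py": "PYTHON", ".sh": "SHELL"}
_interpreters = {"bash": "BASH", "python": "PYTHON"}

def check_filetype(filepath, head: bytes):
    # Prefer the #! line: it is harder to "lie" in than a suffix.
    try:
        first_line = head[: head.index(b"\n")].decode("utf-8")
        for interpreter, filetype in _interpreters.items():
            if interpreter in first_line:
                return filetype
    except (ValueError, UnicodeDecodeError):
        pass  # no newline in head, or the first line isn't valid UTF-8
    # Only after the shebang check fails, fall back to the file suffix.
    suffix = pathlib.Path(filepath).suffix.lower()
    if suffix in _filetype_extensions:
        return _filetype_extensions[suffix]
    return None

print(check_filetype("tool", b"#!/usr/bin/env python\nprint(1)\n"))  # PYTHON
print(check_filetype("run.sh", b"\xff\xfe"))                         # SHELL
```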
head = head[: head.index(b"\n")].decode("utf-8")
for interpreter, filetype in _interpreters.items():
    if interpreter in head:
        return filetype
I think this section should also be wrapped in a try/except block to catch any UnicodeDecodeError that could get thrown, and log a warning with some details on what happened. (Or add a 2nd except block after the existing one -- that then falls through to the file suffix check.)
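A minimal sketch of that suggestion, assuming a helper shape and logger setup that are not in the PR itself; on a decode failure it logs a warning and returns None so the caller can fall through to the suffix check:

```python
import logging

logger = logging.getLogger(__name__)

def interpreter_from_head(head: bytes, interpreters):
    """Decode the first line of `head` defensively. Sketch only."""
    try:
        first_line = head[: head.index(b"\n")].decode("utf-8")
    except ValueError:
        return None  # no newline found in the bytes read
    except UnicodeDecodeError as err:
        # Log some details about what happened, then let the
        # caller fall back to the file-suffix check.
        logger.warning("shebang line is not valid UTF-8: %s", err)
        return None
    for interpreter, filetype in interpreters.items():
        if interpreter in first_line:
            return filetype
    return None
```

(Note that UnicodeDecodeError is a subclass of ValueError, so the decode-error branch must come second or be listed explicitly as above.)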
end_line = head.index(b"\n")
head = head[:end_line]
for interpreter, filetype in _interpreters.items():
    if re.search(interpreter, head):
Any idea how the performance is for re.search vs just using the in keyword?
Okay, so I put together some quick scripts to test with time python3 &lt;script&gt;. I got around 2.3s using re.search, and 1.2s using in.
import re

interpreters = {
    b"sh": "SHELL",
    b"bash": "BASH",
    b"zsh": "ZSH",
    b"php": "PHP",
    b"python": "PYTHON",
    b"python3": "PYTHON",
}
candidate = b"#!/bin/fsdafsdf/local/bas/bas/bas/bas/pytho/python3"
for i in range(0, 1000000):
    for interpreter, filetype in interpreters.items():
        if re.search(interpreter, candidate):
            pass
interpreters = {
    b"sh": "SHELL",
    b"bash": "BASH",
    b"zsh": "ZSH",
    b"php": "PHP",
    b"python": "PYTHON",
    b"python3": "PYTHON",
}
candidate = b"#!/bin/fsdafsdf/local/bas/bas/bas/bas/pytho/python3"
for i in range(0, 1000000):
    for interpreter, filetype in interpreters.items():
        if interpreter in candidate:
            pass
Out of curiosity, I tried out using Aho-Corasick (on strings though... using the pyahocorasick library) to see how fast it could be if we really wanted to get a faster search; with this, the test took 0.35s to run:
import ahocorasick

automaton = ahocorasick.Automaton()
interpreters = {
    "sh": "SHELL",
    "bash": "BASH",
    "zsh": "ZSH",
    "php": "PHP",
    "python": "PYTHON",
    "python3": "PYTHON3",
}
for interpreter, filetype in interpreters.items():
    automaton.add_word(interpreter, filetype)
automaton.make_automaton()
candidate = "#!/bin/fsdafsdf/local/bas/bas/bas/bas/pytho/python3"
for i in range(0, 1000000):
    for item in automaton.iter_long(candidate):
        pass
At some point, it might be nice to switch some of the different file type ID checks (magic bytes, srec/hex, etc.) over to a faster method -- but saving a second or two isn't much compared to the time it takes to generate hashes and analyze large binaries.
Summary
Categorize web files and scripts for later analysis. Related to issue #75.
If merged this pull request will
Proposed changes

Added branches to the magic bytes check that cover Python, Bash, HTML, CSS, and JS. The Bash and HTML branches use magic bytes in case of a missing file type extension.