
Categorize web and script files #121

Merged: 21 commits from script-file-extension-identification into main on Feb 20, 2024

Conversation

@Czatar Czatar commented Jan 27, 2024

Summary

Categorize web files and scripts for later analysis. Related to issue #75.


Proposed changes

Added branches to the magic bytes check that cover Python, Bash, HTML, CSS, and JS. The Bash and HTML branches use magic bytes in case the file type extension is missing.

@Czatar Czatar requested a review from nightlark January 27, 2024 23:24
@Czatar Czatar commented Jan 27, 2024

Wondering if:

  • The file type extension check should be made into its own function, or if it's fine to have it where it is
  • There are any file types to include or exclude for this PR

@Czatar Czatar marked this pull request as ready for review January 27, 2024 23:31
@nightlark

> File type extension should be made into its own function or if it's fine to have it where it is

I was thinking it could be made into its own function, since it isn't really a check based on magic bytes. Using the #! line is also maybe more of a heuristic than magic bytes?
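
A minimal sketch of what pulling the extension check into its own function could look like (the table contents here are illustrative, not the PR's actual entries):

import pathlib

# Illustrative extension table; the PR's _filetype_extensions may differ.
_filetype_extensions = {
    ".py": "PYTHON",
    ".sh": "SHELL",
    ".html": "HTML",
    ".css": "CSS",
    ".js": "JAVASCRIPT",
}

def filetype_from_extension(filepath):
    # Look up the lowercased suffix; None means "not recognized by extension".
    suffix = pathlib.Path(filepath).suffix.lower()
    return _filetype_extensions.get(suffix)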

@Czatar Czatar requested a review from nightlark February 4, 2024 23:12
@nightlark

Something I wanted to make a note of regarding the regex checks -- I came across a few #! lines in a Stack Overflow post:

#!/usr/bin/env bash
#!/bin/bash
#!/bin/sh
#!/bin/sh -

In particular, the last one stands out: the end of the line passes a - as an argument to sh. I think in theory the shebang line could have any set of arguments passed to the given command, so to recognize some corner cases it might be necessary to split the line similarly to how the OS would in order to identify the command to run (plus a special case that uses the first argument when env is the command).

This sounds like a lot of work though for an uncommon case, so I'm leaning towards adding that as a GitHub issue for the backlog.
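
As a rough sketch of that kind of OS-style split (the table and function name below are just illustrative, not code from this PR):

import os

# Illustrative mapping from interpreter basename to file type.
_interpreters = {"sh": "SHELL", "bash": "BASH", "python": "PYTHON", "python3": "PYTHON"}

def filetype_from_shebang_line(line):
    # line is the decoded first line of the file, e.g. "#!/bin/sh -"
    if not line.startswith("#!"):
        return None
    parts = line[2:].strip().split()
    if not parts:
        return None
    command = os.path.basename(parts[0])
    # Special case: "#!/usr/bin/env bash" actually runs its first argument.
    if command == "env" and len(parts) > 1:
        command = os.path.basename(parts[1])
    return _interpreters.get(command)

With this, "#!/bin/sh -" maps to SHELL because the trailing - is treated as an argument, and "#!/usr/bin/env bash" maps to BASH via the env special case.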

@Czatar Czatar marked this pull request as draft February 8, 2024 00:38
@Czatar Czatar marked this pull request as ready for review February 8, 2024 03:25
@Czatar Czatar commented Feb 8, 2024

The issue with having arguments in the shebang line is tricky. From what I know:

  • It's possible for the path to the binary to have spaces, so we can't just split by spaces
  • It's possible for the arguments to have paths, so just splitting it by slashes also won't help

My next thought would be to check for single/double quotes or backslashes that keep the space as part of the path. Windows doesn't acknowledge shebang lines, so maybe backslash checking won't be a huge issue. For the GitHub issue, is there anything else that would complicate this?
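
If quote/backslash handling does become its own issue, one possible approach would be POSIX-style tokenization from the standard library -- purely a heuristic sketch, since the kernel itself doesn't interpret quotes in a shebang line:

import shlex

# shlex.split keeps a quoted path with spaces as one token and honors
# backslash escapes; this is a heuristic, not what the OS actually does.
line = '#!"/opt/my tools/bin/python3" -u'
print(shlex.split(line[2:]))  # ['/opt/my tools/bin/python3', '-u']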


@nightlark nightlark left a comment


Forgot to make these comments visible.

Comment on lines 36 to 39
suffix = pathlib.Path(filepath).suffix.lower()
head = f.read(256)
if suffix in _filetype_extensions:
    return _filetype_extensions[suffix]

I'm thinking that instead of mixing the suffix and #! checks, let's move the suffix-related parts of this to after the try/except block (and remove the return None that's right above the except line).

This will also give priority to whatever the #! line says, which should be more reliable than the file suffix, which could "lie" a bit more easily.
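
Roughly the ordering being suggested, as a sketch using the names from the snippets in this review (not the PR's exact code):

import pathlib

_filetype_extensions = {".py": "PYTHON", ".sh": "SHELL", ".js": "JAVASCRIPT"}  # illustrative
_interpreters = {"bash": "BASH", "sh": "SHELL", "python": "PYTHON"}  # illustrative

def identify_filetype(filepath):
    with open(filepath, "rb") as f:
        head = f.read(256)
    # #! check first, since the shebang is harder to "lie" about than the suffix.
    try:
        if head.startswith(b"#!"):
            line = head[: head.index(b"\n")].decode("utf-8")
            for interpreter, filetype in _interpreters.items():
                if interpreter in line:
                    return filetype
    except ValueError:
        pass  # no newline found in the first 256 bytes
    # Suffix check moved to after the try/except block, as the fallback.
    suffix = pathlib.Path(filepath).suffix.lower()
    return _filetype_extensions.get(suffix)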

Comment on lines 43 to 46
head = head[: head.index(b"\n")].decode("utf-8")
for interpreter, filetype in _interpreters.items():
    if interpreter in head:
        return filetype

I think this section should also be wrapped in a try/except block to catch any UnicodeDecodeError that could get thrown, and log a warning with some details on what happened. (Or add a 2nd except block after the existing one -- that then falls through to the file suffix check).
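
For instance, something along these lines (the logging setup is just the standard library, not necessarily what Surfactant uses; names are illustrative):

import logging

logger = logging.getLogger(__name__)

_interpreters = {"bash": "BASH", "sh": "SHELL", "python": "PYTHON"}  # illustrative

def filetype_from_shebang(filepath, head):
    # head is the first chunk of the file, read as bytes.
    try:
        line = head[: head.index(b"\n")].decode("utf-8")
        for interpreter, filetype in _interpreters.items():
            if interpreter in line:
                return filetype
    except UnicodeDecodeError as err:
        # Warn and fall through so the file suffix check can still run.
        logger.warning("Could not decode #! line in %s: %s", filepath, err)
    return None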

surfactant/filetypeid/id_extension.py (resolved review comment)
end_line = head.index(b"\n")
head = head[:end_line]
for interpreter, filetype in _interpreters.items():
    if re.search(interpreter, head):

Any idea how the performance is for re.search vs just using the in keyword?


Okay, so I put together some quick scripts to test with time python3 <script>. I got around 2.3s using re.search, and 1.2s using in.

# First script: re.search
import re

interpreters = {
    b"sh": "SHELL",
    b"bash": "BASH",
    b"zsh": "ZSH",
    b"php": "PHP",
    b"python": "PYTHON",
    b"python3": "PYTHON",
}

candidate = b"#!/bin/fsdafsdf/local/bas/bas/bas/bas/pytho/python3"

for i in range(0, 1000000):
    for interpreter, filetype in interpreters.items():
        if re.search(interpreter, candidate):
            pass

# Second script: the in keyword
interpreters = {
    b"sh": "SHELL",
    b"bash": "BASH",
    b"zsh": "ZSH",
    b"php": "PHP",
    b"python": "PYTHON",
    b"python3": "PYTHON",
}

candidate = b"#!/bin/fsdafsdf/local/bas/bas/bas/bas/pytho/python3"

for i in range(0, 1000000):
    for interpreter, filetype in interpreters.items():
        if interpreter in candidate:
            pass

Out of curiosity, I tried out using Aho-Corasick (on strings though... using the pyahocorasick library) to see how fast it could be if we really wanted to get a faster search; with this, the test took 0.35s to run:

import ahocorasick

automaton = ahocorasick.Automaton()

interpreters = {
        "sh": "SHELL",
        "bash": "BASH",
        "zsh": "ZSH",
        "php": "PHP",
        "python": "PYTHON",
        "python3": "PYTHON3",
}

for interpreter, filetype in interpreters.items():
    automaton.add_word(interpreter, filetype)

automaton.make_automaton()

candidate = "#!/bin/fsdafsdf/local/bas/bas/bas/bas/pytho/python3"

for i in range(0, 1000000):
    for item in automaton.iter_long(candidate):
        pass

At some point, it might be nice to switch some of the different file type ID checks (magic bytes, srec/hex, etc.) over to a faster search method -- but saving a second or two isn't that much compared to the time it takes to generate hashes and analyze large binaries.

@nightlark nightlark merged commit 5771a40 into main Feb 20, 2024
11 checks passed
@nightlark nightlark deleted the script-file-extension-identification branch February 20, 2024 04:43