
Categorize web and script files #121

Merged: 21 commits from script-file-extension-identification into main on Feb 20, 2024

Conversation

@Czatar Czatar commented Jan 27, 2024

Summary

Categorize web files and scripts for later analysis. Related to issue #75.


Proposed changes

Added branches to the magic bytes check that cover Python, Bash, HTML, CSS, and JS. The Bash and HTML branches use magic bytes in case the file type extension is missing.

@Czatar Czatar requested a review from nightlark January 27, 2024 23:24
@Czatar Czatar commented Jan 27, 2024

Wondering if:

  • The file type extension check should be made into its own function, or if it's fine to have it where it is
  • There are any file types to include or exclude for this PR

@Czatar Czatar marked this pull request as ready for review January 27, 2024 23:31
@nightlark

> File type extension should be made into its own function or if it's fine to have it where it is

I was thinking it could be made into its own function, since it isn't really a check based on magic bytes. Using the #! line is also maybe more of a heuristic than magic bytes?
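
A minimal sketch of what pulling the extension check into its own function could look like (the table contents here are illustrative, not the PR's actual entries):

import pathlib

# Illustrative extension table; the PR's _filetype_extensions may differ.
_filetype_extensions = {
    ".py": "PYTHON",
    ".sh": "SHELL",
    ".html": "HTML",
    ".css": "CSS",
    ".js": "JAVASCRIPT",
}

def filetype_from_extension(filepath):
    # Look up the lowercased suffix; None means "not recognized by extension".
    suffix = pathlib.Path(filepath).suffix.lower()
    return _filetype_extensions.get(suffix)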

@Czatar Czatar requested a review from nightlark February 4, 2024 23:12
@nightlark

Something I wanted to make a note of regarding the regex checks -- I came across a few #! lines in a Stack Overflow post:

#!/usr/bin/env bash
#!/bin/bash
#!/bin/sh
#!/bin/sh -

In particular, the last one stands out: the end of the line passes a - as an argument to sh. I think in theory the shebang line could have any set of arguments passed to the given command, so to recognize some corner cases it might be necessary to split the line similarly to how the OS would in order to identify the command to run (plus a special case that uses the first argument when env is the command).

This sounds like a lot of work though for an uncommon case, so I'm leaning towards adding that as a GitHub issue for the backlog.
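
As a rough sketch of that kind of OS-style split (the table and function name below are just illustrative, not code from this PR):

import os

# Illustrative mapping from interpreter basename to file type.
_interpreters = {"sh": "SHELL", "bash": "BASH", "python": "PYTHON", "python3": "PYTHON"}

def filetype_from_shebang_line(line):
    # line is the decoded first line of the file, e.g. "#!/bin/sh -"
    if not line.startswith("#!"):
        return None
    parts = line[2:].strip().split()
    if not parts:
        return None
    command = os.path.basename(parts[0])
    # Special case: "#!/usr/bin/env bash" actually runs its first argument.
    if command == "env" and len(parts) > 1:
        command = os.path.basename(parts[1])
    return _interpreters.get(command)

With this, "#!/bin/sh -" maps to SHELL because the trailing - is treated as an argument, and "#!/usr/bin/env bash" maps to BASH via the env special case.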

@Czatar Czatar marked this pull request as draft February 8, 2024 00:38
@Czatar Czatar marked this pull request as ready for review February 8, 2024 03:25
@Czatar Czatar commented Feb 8, 2024

The issue with having arguments in the shebang line is tricky. From what I know:

  • It's possible for the path to the binary to have spaces, so we can't just split by spaces
  • It's possible for the arguments to have paths, so just splitting it by slashes also won't help

My next thought would be to check for single/double quotes or backslashes that keep the space as part of the path. Windows doesn't acknowledge shebang lines, so maybe backslash checking won't be a huge issue. For the GitHub issue, is there anything else that would complicate this?
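
If quote/backslash handling does become its own issue, one possible approach would be POSIX-style tokenization from the standard library -- purely a heuristic sketch, since the kernel itself doesn't interpret quotes in a shebang line:

import shlex

# shlex.split keeps a quoted path with spaces as one token and honors
# backslash escapes; this is a heuristic, not what the OS actually does.
line = '#!"/opt/my tools/bin/python3" -u'
print(shlex.split(line[2:]))  # ['/opt/my tools/bin/python3', '-u']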


@nightlark nightlark left a comment


Forgot to make these comments visible.

Comment on lines 36 to 39
suffix = pathlib.Path(filepath).suffix.lower()
head = f.read(256)
if suffix in _filetype_extensions:
    return _filetype_extensions[suffix]

I'm thinking that instead of mixing the suffix and #! checks, let's move the suffix-related parts of this to after the try/except block (and remove the return None that's right above the except line).

This will also give priority to whatever the #! line says, which should be more reliable than the file suffix, which could "lie" a bit more easily.
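
Roughly the ordering being suggested, as a sketch using the names from the snippets in this review (not the PR's exact code):

import pathlib

_filetype_extensions = {".py": "PYTHON", ".sh": "SHELL", ".js": "JAVASCRIPT"}  # illustrative
_interpreters = {"bash": "BASH", "sh": "SHELL", "python": "PYTHON"}  # illustrative

def identify_filetype(filepath):
    with open(filepath, "rb") as f:
        head = f.read(256)
    # #! check first, since the shebang is harder to "lie" about than the suffix.
    try:
        if head.startswith(b"#!"):
            line = head[: head.index(b"\n")].decode("utf-8")
            for interpreter, filetype in _interpreters.items():
                if interpreter in line:
                    return filetype
    except ValueError:
        pass  # no newline found in the first 256 bytes
    # Suffix check moved to after the try/except block, as the fallback.
    suffix = pathlib.Path(filepath).suffix.lower()
    return _filetype_extensions.get(suffix)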

Comment on lines 43 to 46
head = head[: head.index(b"\n")].decode("utf-8")
for interpreter, filetype in _interpreters.items():
    if interpreter in head:
        return filetype

I think this section should also be wrapped in a try/except block to catch any UnicodeDecodeError that could get thrown, and log a warning with some details on what happened. (Or add a 2nd except block after the existing one -- that then falls through to the file suffix check).
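
For instance, something along these lines (the logging setup is just the standard library, not necessarily what Surfactant uses; names are illustrative):

import logging

logger = logging.getLogger(__name__)

_interpreters = {"bash": "BASH", "sh": "SHELL", "python": "PYTHON"}  # illustrative

def filetype_from_shebang(filepath, head):
    # head is the first chunk of the file, read as bytes.
    try:
        line = head[: head.index(b"\n")].decode("utf-8")
        for interpreter, filetype in _interpreters.items():
            if interpreter in line:
                return filetype
    except UnicodeDecodeError as err:
        # Warn and fall through so the file suffix check can still run.
        logger.warning("Could not decode #! line in %s: %s", filepath, err)
    return None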

surfactant/filetypeid/id_extension.py (resolved review comment)
end_line = head.index(b"\n")
head = head[:end_line]
for interpreter, filetype in _interpreters.items():
    if re.search(interpreter, head):

Any idea how the performance is for re.search vs just using the in keyword?


Okay, so I put together some quick scripts to test with time python3 <script>. I got around 2.3s using re.search, and 1.2s using in.

# First script: re.search
import re

interpreters = {
    b"sh": "SHELL",
    b"bash": "BASH",
    b"zsh": "ZSH",
    b"php": "PHP",
    b"python": "PYTHON",
    b"python3": "PYTHON",
}

candidate = b"#!/bin/fsdafsdf/local/bas/bas/bas/bas/pytho/python3"

for i in range(0, 1000000):
    for interpreter, filetype in interpreters.items():
        if re.search(interpreter, candidate):
            pass

# Second script: the in keyword
interpreters = {
    b"sh": "SHELL",
    b"bash": "BASH",
    b"zsh": "ZSH",
    b"php": "PHP",
    b"python": "PYTHON",
    b"python3": "PYTHON",
}

candidate = b"#!/bin/fsdafsdf/local/bas/bas/bas/bas/pytho/python3"

for i in range(0, 1000000):
    for interpreter, filetype in interpreters.items():
        if interpreter in candidate:
            pass

Out of curiosity, I tried out using Aho-Corasick (on strings though... using the pyahocorasick library) to see how fast it could be if we really wanted to get a faster search; with this, the test took 0.35s to run:

import ahocorasick

automaton = ahocorasick.Automaton()

interpreters = {
        "sh": "SHELL",
        "bash": "BASH",
        "zsh": "ZSH",
        "php": "PHP",
        "python": "PYTHON",
        "python3": "PYTHON3",
}

for interpreter, filetype in interpreters.items():
    automaton.add_word(interpreter, filetype)

automaton.make_automaton()

candidate = "#!/bin/fsdafsdf/local/bas/bas/bas/bas/pytho/python3"

for i in range(0, 1000000):
    for item in automaton.iter_long(candidate):
        pass

At some point, it might be nice to switch some of the different file type ID checks (magic bytes, srec/hex, etc.) over to a faster search method -- but saving a second or two isn't that much compared to the time it takes to generate hashes and analyze large binaries.

@nightlark nightlark merged commit 5771a40 into main Feb 20, 2024
11 checks passed
@nightlark nightlark deleted the script-file-extension-identification branch February 20, 2024 04:43