Native library detection plugin #267

Merged
merged 42 commits into from
Dec 17, 2024

Collaborator

@wangmot wangmot commented Oct 14, 2024

Summary

This pull request adds detection of native libraries in files. Matches are found through either the file name or the file content, using regex patterns. Statically linked libraries are also detected.

First, run get_emba_db.py to generate the EMBA database.

Then run surfactant. In the output file, each file has a nativeLibraries: [] entry listing all the libraries that were detected for it. Each result is reported as either isLibrary: [], meaning the file itself is the library, or containsLibrary: [], meaning the libraries were statically linked into the file.
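
A minimal sketch of how the new field might be consumed downstream. It assumes each file entry is a dict whose nativeLibraries list holds objects keyed by isLibrary or containsLibrary; that nesting, the helper name summarize_native_libraries, and the sample values are all assumptions for illustration, not taken from the plugin's actual output.

```python
from typing import Dict, List


def summarize_native_libraries(entry: Dict) -> Dict[str, List[str]]:
    """Split detected libraries into 'isLibrary' and 'containsLibrary' buckets.

    Assumed (hypothetical) shape of an entry:
        {"nativeLibraries": [{"isLibrary": [...]}, {"containsLibrary": [...]}]}
    """
    summary: Dict[str, List[str]] = {"isLibrary": [], "containsLibrary": []}
    for item in entry.get("nativeLibraries", []):
        for key in summary:
            # Entries without the key contribute nothing to that bucket.
            summary[key].extend(item.get(key, []))
    return summary


# Hypothetical sample entry, for illustration only.
sample = {
    "fileName": ["busybox"],
    "nativeLibraries": [{"containsLibrary": ["zlib", "openssl"]}],
}
print(summarize_native_libraries(sample))
```

An entry with no nativeLibraries key simply yields two empty lists, so files without detections need no special handling.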

@wangmot wangmot changed the title Native library detection plugin #267 Native library detection plugin Oct 14, 2024
Collaborator

@nightlark nightlark left a comment

Looks pretty good, mostly a few changes related to where the database of patterns is stored.

Let's add this helper function to the ConfigManager class (in surfactant/configmanager.py):

        """Determines the path to the data directory, for storing things such as databases.

        Returns:
            Path: The path to the data directory.
        """
        if platform.system() == "Windows":
            data_dir = Path(os.getenv("LOCALAPPDATA", os.path.expanduser("~\\AppData\\Local")))
        else:
            data_dir = Path(os.getenv("XDG_DATA_HOME", os.path.expanduser("~/.local/share")))
        data_dir = data_dir / self.app_name
        return data_dir
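
The snippet above starts at the docstring, so here is a self-contained sketch of how it could sit inside a class. The method name get_data_dir_path, the stand-in ConfigManager constructor, and the default app_name are assumptions made for illustration; only the body mirrors the suggested helper.

```python
import os
import platform
from pathlib import Path


class ConfigManager:
    """Stand-in for the real class; only app_name is modeled here."""

    def __init__(self, app_name: str = "surfactant") -> None:
        self.app_name = app_name

    def get_data_dir_path(self) -> Path:  # hypothetical method name
        """Determines the path to the data directory, for storing things such as databases.

        Returns:
            Path: The path to the data directory.
        """
        if platform.system() == "Windows":
            # Prefer %LOCALAPPDATA%, falling back to the conventional location.
            data_dir = Path(os.getenv("LOCALAPPDATA", os.path.expanduser("~\\AppData\\Local")))
        else:
            # Follow the XDG convention, falling back to ~/.local/share.
            data_dir = Path(os.getenv("XDG_DATA_HOME", os.path.expanduser("~/.local/share")))
        return data_dir / self.app_name


print(ConfigManager().get_data_dir_path())
```

The function only computes the path; creating the directory (for example with Path.mkdir(parents=True, exist_ok=True)) would be left to whichever code writes the database.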

(There are also ~3 things flagged by the pre-commit CI check to change).

scripts/native_libraries/get_emba_db.py (resolved)
scripts/native_libraries/get_emba_db.py (outdated, resolved)
scripts/native_libraries/get_emba_db.py (resolved)
scripts/native_libraries/get_emba_db.py (resolved)
surfactant/infoextractors/native_lib_file.py (outdated, resolved)
surfactant/infoextractors/native_lib_file.py (outdated, resolved)
@nightlark nightlark added the enhancement New feature or request label Oct 18, 2024
@wangmot wangmot force-pushed the native-lib-detection branch from 86155bb to c3f7960 Compare November 13, 2024 19:50
@wangmot wangmot force-pushed the native-lib-detection branch 3 times, most recently from f1fda19 to 61092f3 Compare November 26, 2024 02:24
@wangmot wangmot force-pushed the native-lib-detection branch 3 times, most recently from 606e2ca to 8b74a71 Compare December 2, 2024 08:08
@nightlark
Collaborator

Interesting, I came across two patterns that don't compile and cause an error:

  • C++\ RTMP\ Server\ .*\ version\ v[0-9](\.[0-9])+?\ r\.[0-9]+ fails because the ++ at the start is intended to match literally. EMBA may be relying on some non-standard behavior in the tool they use for regex parsing, whereas Python's re module (and every other regexp parser on https://regex101.com/) either rejects the construct with an error or treats it as matching multiple C's.
  • User-Agent: Siemens Canada Limited - ROX2 - [0-9](\.[0-9]+)?+$ is rejected by Python's re module because of the ?+ near the end. Other regexp parsers either give an error or treat it as essentially the same as a plain ?. I'm not sure whether it is meant to match a literal +, or whether they meant to have just + at the end so that it matches e.g. 4.2.4.4 with an arbitrary number of dots, instead of at most one dot in the version number.

Will need to test whatever they are using for matching with the regular expressions, but it is possible the first doesn't have the behavior they intended (and the 2nd is just ambiguous).
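
One way to check this locally is to compile every pattern and collect the failures instead of raising. The sketch below is an assumption about how that could look, not the plugin's code: compile_patterns is a hypothetical helper, and the third pattern is a hypothetical variant with the plus signs escaped. Note that Python 3.11 and later parse ++ and ?+ as possessive quantifiers, so the first two patterns may compile there (without matching a literal +), while older versions raise re.error.

```python
import re
from typing import Dict, List, Tuple


def compile_patterns(patterns: List[str]) -> Tuple[Dict[str, re.Pattern], Dict[str, str]]:
    """Compile each pattern, collecting failures instead of raising."""
    compiled: Dict[str, re.Pattern] = {}
    failed: Dict[str, str] = {}
    for pattern in patterns:
        try:
            compiled[pattern] = re.compile(pattern)
        except re.error as err:
            # Keep the error message so broken patterns can be reported.
            failed[pattern] = str(err)
    return compiled, failed


patterns = [
    r"C++\ RTMP\ Server\ .*\ version\ v[0-9](\.[0-9])+?\ r\.[0-9]+",
    r"User-Agent: Siemens Canada Limited - ROX2 - [0-9](\.[0-9]+)?+$",
    # Hypothetical variant with the plus signs escaped so they match literally.
    r"C\+\+\ RTMP\ Server\ .*\ version\ v[0-9](\.[0-9])+?\ r\.[0-9]+",
]
compiled, failed = compile_patterns(patterns)
for pattern, reason in failed.items():
    print(f"skipped {pattern!r}: {reason}")
```

Collecting failures this way lets a loader skip the broken entries and still use the rest of the database.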

@nightlark
Collaborator

I think they are using grep -o -a -E "<pattern>" for matching, and for both of those patterns grep says repetition-operator operand invalid. So I think these are actually just broken patterns in the EMBA database.

@@ -96,7 +96,7 @@ def match_by_attribute(attribute: str, content: str, patterns_database: Dict) ->
     if attribute == "filename":
         if name == content:
Collaborator

@nightlark nightlark Dec 16, 2024

Suggested change
- if name == content:
+ if name.lower() == content.lower():

For the filename, let's make it a case-insensitive comparison.
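
As a small illustration of the suggested comparison, here is a sketch with a hypothetical helper, filename_matches, that is not part of the PR; the file names are made up.

```python
def filename_matches(name: str, content: str) -> bool:
    """Compare a database file name with an observed file name, ignoring case."""
    return name.lower() == content.lower()


# Names that differ only in case are treated as the same file name.
print(filename_matches("LibSSL.so", "libssl.so"))
print(filename_matches("libssl.so", "libcrypto.so"))
```

For names containing non-ASCII characters, str.casefold() is a somewhat more aggressive alternative to str.lower(); whether that matters here depends on the file names in the database.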

@wangmot wangmot force-pushed the native-lib-detection branch from c57f26a to ef0919f Compare December 16, 2024 23:27
@nightlark nightlark merged commit bff6a2f into main Dec 17, 2024
13 checks passed
@nightlark nightlark deleted the native-lib-detection branch December 17, 2024 19:21