New patterns & ML update #600

Merged
merged 8 commits into from
Sep 19, 2024

Conversation

babenek
Contributor

@babenek babenek commented Aug 27, 2024

Description

Please include a summary of the change and which issue is fixed.

  • Refactored ML features to reduce DC and added regex matching for an attribute
  • Retrained the ML model; summary F1 = 0.952886
  • Added new ML features that use a regex on an attribute
  • ML training now produces a chart with the MD5 sums of the config and model at the bottom, to help manage them
  • Removed the VariableNotAllowedPatternCheck filter and applied its pattern in ML instead. This covers various cases with _ID postfixes and AWS keys.
  • Refactored the Keyword pattern to handle escaped quotes in a value. Tests added. Some cases are still not covered.
  • Extended the ValuePath filter with a morpheme check to reduce FN
  • Removed duplicated filters
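
The escaped-quote handling mentioned in the list above can be illustrated with a minimal sketch. Both patterns and the helper name `first_quoted_value` are assumptions made up for this example; the actual CredSweeper Keyword pattern is not shown in this conversation and is more involved.

```python
import re

# Hypothetical patterns for illustration only, not the CredSweeper Keyword pattern.
# Naive form: stops at the first double quote, even an escaped one.
NAIVE_QUOTED = re.compile(r'"([^"]*)"')
# Escape-aware form: a backslash always consumes the character after it.
ESCAPED_QUOTED = re.compile(r'"((?:\\.|[^"\\])*)"')


def first_quoted_value(pattern: re.Pattern, line: str):
    """Return the content of the first double-quoted value, or None."""
    found = pattern.search(line)
    return found.group(1) if found else None


line = r'password = "ab\"cd"'
naive_value = first_quoted_value(NAIVE_QUOTED, line)      # cut off at the escaped quote
escaped_value = first_quoted_value(ESCAPED_QUOTED, line)  # keeps the whole value
```

The naive pattern truncates the value at the escaped quote, while the escape-aware one keeps it intact.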

How has this been tested?

Please describe the tests that you ran to verify your changes.

  • UnitTest
  • Benchmark

@babenek babenek changed the title Auxiliary ML update feature WordsInPath Aug 28, 2024
@babenek babenek changed the title ML update feature WordsInPath New patterns & ML update Aug 30, 2024
@babenek
Contributor Author

babenek commented Aug 30, 2024

@Samsung/credsweeper_maintainers, please give your opinion on the rule names.

Contributor Author

@babenek babenek left a comment

TODO: roll back after Samsung/CredData#167

.github/workflows/benchmark.yml (4 review threads, outdated and resolved)
@babenek babenek marked this pull request as ready for review September 3, 2024 10:42
@babenek babenek requested a review from a team as a code owner September 3, 2024 10:42
@babenek babenek marked this pull request as draft September 3, 2024 10:47
@codecov-commenter

codecov-commenter commented Sep 4, 2024

Codecov Report

Attention: Patch coverage is 92.30769% with 23 lines in your changes missing coverage. Please review.

Project coverage is 90.57%. Comparing base (a73faaa) to head (355abb2).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| credsweeper/ml_model/features/has_html_tag.py | 77.77% | 2 Missing and 2 partials ⚠️ |
| credsweeper/ml_model/features/feature.py | 86.95% | 2 Missing and 1 partial ⚠️ |
| credsweeper/ml_model/features/word_in.py | 92.50% | 2 Missing and 1 partial ⚠️ |
| credsweeper/ml_model/features/word_in_path.py | 80.00% | 2 Missing and 1 partial ⚠️ |
| credsweeper/credentials/line_data.py | 90.47% | 0 Missing and 2 partials ⚠️ |
| credsweeper/ml_model/features/word_in_line.py | 85.71% | 1 Missing and 1 partial ⚠️ |
| credsweeper/ml_model/features/word_in_value.py | 83.33% | 1 Missing and 1 partial ⚠️ |
| credsweeper/utils/util.py | 60.00% | 1 Missing and 1 partial ⚠️ |
| credsweeper/ml_model/features/file_extension.py | 91.66% | 1 Missing ⚠️ |
| credsweeper/ml_model/features/rule_name.py | 91.66% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #600      +/-   ##
==========================================
- Coverage   90.63%   90.57%   -0.06%     
==========================================
  Files         130      143      +13     
  Lines        4794     4870      +76     
  Branches      783      786       +3     
==========================================
+ Hits         4345     4411      +66     
- Misses        292      298       +6     
- Partials      157      161       +4     
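
The percentages in the coverage diff follow from its Hits and Lines rows. A small sketch of the arithmetic, using only the numbers reported above (the helper name `coverage_pct` is made up for illustration):

```python
def coverage_pct(hits: int, lines: int) -> float:
    """Coverage as a percentage, rounded to two decimals like the report."""
    return round(100.0 * hits / lines, 2)


base = coverage_pct(4345, 4794)  # main before the PR
head = coverage_pct(4411, 4870)  # PR head
delta = round(head - base, 2)    # change shown in the "+/-" column

# The per-file "Missing" and "partials" counts in the report add up to
# the 23 uncovered patch lines mentioned in the summary sentence.
uncovered_patch_lines = (2 + 2) + (2 + 1) + (2 + 1) + (2 + 1) + (0 + 2) \
    + (1 + 1) + (1 + 1) + (1 + 1) + 1 + 1
```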


@babenek
Contributor Author

babenek commented Sep 5, 2024

The benchmark fails before Samsung/CredData#167

@babenek babenek marked this pull request as ready for review September 5, 2024 04:32
Collaborator

@csh519 csh519 left a comment

Please check my comments below.

Next time, please split the PR by feature before requesting a review.
As you know, a huge PR is hard to review well.

while self.value.endswith('}') and '{' in self.line[:self.value_start]:
    self.value = self.value[:-1]
"""Parenthesis, curly and squared brackets may be caught in TOML format and bash. Simple clearing"""
dirty = self.value and self.value[-1] in ['}', ']', ')']
Collaborator

dirty is a good name too, but IMO it would be better to rename it with an is_ prefix.
How about is_cleared or is_clear?
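
The "simple clearing" of trailing brackets described in the docstring above could look roughly like the sketch below. This is not the code from the PR: the helper name `strip_unbalanced_tail` and the balance rule (drop a trailing closing bracket only when the value holds fewer matching opening brackets) are assumptions for illustration.

```python
# Hypothetical helper, not CredSweeper's actual implementation.
CLOSING_TO_OPENING = {"}": "{", "]": "[", ")": "("}


def strip_unbalanced_tail(value: str) -> str:
    """Drop trailing closing brackets that have no opening counterpart in the value."""
    while value and value[-1] in CLOSING_TO_OPENING:
        closing = value[-1]
        opening = CLOSING_TO_OPENING[closing]
        if value.count(opening) >= value.count(closing):
            break  # the brackets are balanced, so the tail belongs to the value
        value = value[:-1]
    return value


toml_like = strip_unbalanced_tail("s3cr3t}")  # stray brace caught from an inline table
balanced = strip_unbalanced_tail("a(b)")      # left untouched
```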

- if "function" in line_data.value or self.PATTERN.search(line_data.value):
+ if "function" in line_data.value or self.PATTERN.match(line_data.value):
Collaborator

Are there any FP cases that include the function keyword in the value?
I think checking for the function keyword with search() seems reasonable.

Contributor Author

Actually, there was an FN case with randomly generated credentials like "12#@5(!)fs", so I refactored the pattern and applied match() so that the pattern is checked from the beginning of a value.

Collaborator

Hmm, I still don't get it clearly.
Do you mean there is a randomly generated credential value that includes the "function" keyword?
Something like "12#@5(!)function3w@91"?
If so, I think the search() method works fine.

Contributor Author

@babenek babenek Sep 6, 2024

No, I mean only the second part of the OR condition in the check.
The function check does not use a regex, so I have not found any FP or FN for function yet.
On the other hand, "12#@5(!)fs", or something like it, became an FN because of the filter.

Collaborator

Ah, I just misunderstood this.
But in a case like the one below, isn't the search() method still more useful?

myPassword = "this_is_not_password"

Contributor Author

The filter does not return True for that line. It would only do so for

myPassword = this_is_not_password()

or

myPassword=password_function

*depends on file

On the other hand, your code line means that a string is assigned to a variable with "password" in its name,
so it may be a password. Even "1234" may be a password, but CredSweeper skips that sequence.

Collaborator

Okay, I agree. Let's keep the change then.
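
The search() versus match() distinction discussed in this thread can be illustrated with a minimal sketch. The pattern below is an assumption made up for the example (a call-like token: word characters followed by an opening parenthesis); the filter's real PATTERN is not shown in this conversation.

```python
import re

# Hypothetical pattern: NOT the real filter PATTERN, which is not shown in the thread.
CALL_LIKE = re.compile(r"\w+\(")


def looks_like_call_search(value: str) -> bool:
    """search() scans the whole value, so a hit anywhere counts."""
    return CALL_LIKE.search(value) is not None


def looks_like_call_match(value: str) -> bool:
    """match() only succeeds when the pattern fits at the very start of the value."""
    return CALL_LIKE.match(value) is not None


random_secret = "12#@5(!)fs"            # random credential containing "5("
method_call = "this_is_not_password()"  # value that really looks like a call
```

With this assumed pattern, search() flags the random secret because of the "5(" in the middle, which would turn it into an FN, while match() flags only the value that starts like a call.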

@@ -1,36 +0,0 @@
import re
Collaborator

Why has this filter been removed?
There is no description for it.

Contributor Author

AWS_KEY_ID is filtered often :(
The feature was migrated to ML

Contributor Author

A mention of VariableNotAllowedPatternCheck was added to the PR description.

Collaborator

Okay, I see.
Are there a lot of patterns that are not allowed?
I think if we delegate this filter's role to ML, it will be harder to reason about and debug the results.
So I'm worried about this.

Contributor Author

The filter eliminated any chance for ML to report a credential like:

PASSWORD_FOR_ID="Th@tHid$mYpDW"

and some sensitive data like an AWS ID.

Collaborator

It seems filtering these credentials with ML will be better.
Okay, let's move the role to ML.
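
The trade-off settled in this thread, a hard filter versus an ML feature, can be sketched as follows. The regex and both function names are assumptions for illustration; the actual pattern of the removed VariableNotAllowedPatternCheck and the real ML feature classes are not shown in this conversation.

```python
import re

# Assumed pattern: the real VariableNotAllowedPatternCheck regex is not shown here.
NOT_ALLOWED = re.compile(r"_id$", re.IGNORECASE)


def hard_filter(variable: str) -> bool:
    """Filter style: True means the candidate is dropped before ML ever sees it."""
    return NOT_ALLOWED.search(variable) is not None


def ml_feature(variable: str) -> float:
    """Feature style: the same signal becomes one input among many for the model."""
    return 1.0 if NOT_ALLOWED.search(variable) else 0.0


variable = "PASSWORD_FOR_ID"
dropped_by_filter = hard_filter(variable)  # the credential would never be reported
feature_value = ml_feature(variable)       # the model can still weigh other evidence
```

As a feature, the pattern only lowers or raises the model's score, so a strong value such as the one in PASSWORD_FOR_ID can still be reported. The cost, as noted in the review, is that the decision becomes harder to trace than an explicit filter.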

@babenek babenek marked this pull request as draft September 5, 2024 07:36
@babenek babenek requested a review from csh519 September 6, 2024 06:06
@babenek babenek force-pushed the auxiliary branch 2 times, most recently from f3241c5 to 330cc04 Compare September 8, 2024 13:23
@babenek babenek mentioned this pull request Sep 9, 2024
2 tasks
@babenek
Contributor Author

babenek commented Sep 10, 2024

[Image: 20240910_143429]

@babenek babenek marked this pull request as ready for review September 10, 2024 13:47
xDizzix
xDizzix previously approved these changes Sep 16, 2024
@babenek babenek marked this pull request as draft September 16, 2024 08:13
@babenek babenek marked this pull request as ready for review September 16, 2024 10:18
@babenek babenek requested a review from xDizzix September 16, 2024 10:30
@babenek babenek mentioned this pull request Sep 17, 2024
2 tasks
Contributor

@Yullia Yullia left a comment

Approved

@babenek babenek merged commit 80da63d into Samsung:main Sep 19, 2024
27 checks passed
@babenek babenek deleted the auxiliary branch September 19, 2024 08:44