Autokeras hyperparam tuning #610

Merged: 32 commits, Dec 18, 2024
Changes from 8 commits
Commits
3e76407
softreset
babenek Dec 3, 2024
c476462
fix
babenek Dec 3, 2024
6649674
BM markup upd
babenek Dec 3, 2024
f3e4f24
BM scores upd
babenek Dec 3, 2024
2fb18be
workaround
babenek Dec 3, 2024
6b5923a
upd
babenek Dec 3, 2024
a8a8aad
upd2
babenek Dec 3, 2024
8cb3ac2
tensorrt==10.6.0
babenek Dec 3, 2024
ba362a7
update action test
babenek Dec 14, 2024
2970ea7
fix keyword pattern with HTML escape quotes
babenek Dec 14, 2024
bfe9c23
MailChimp API Key right border
babenek Dec 16, 2024
15d8791
Merge branch 'main' into fpcase
babenek Dec 16, 2024
9e4c954
Merge branch 'fpcase' into kerastuner
babenek Dec 16, 2024
2d50f76
fix
babenek Dec 16, 2024
bf2c159
Merge branch 'main' into fpcase
babenek Dec 16, 2024
7bd168b
Merge branch 'fpcase' into kerastuner
babenek Dec 16, 2024
30196d4
[skip actions] [kerastuner] 2024-12-16T12:44:35+02:00
babenek Dec 16, 2024
397a748
retrain
babenek Dec 16, 2024
f748583
[skip actions] [doccred] 2024-12-16T13:31:58+02:00
babenek Dec 16, 2024
19b3bb3
[skip actions] [kerastuner] 2024-12-16T13:47:19+02:00
babenek Dec 16, 2024
f3a5396
retrain
babenek Dec 16, 2024
dcb1ac8
md5
babenek Dec 16, 2024
d2001e2
testfix
babenek Dec 16, 2024
95ceebe
BM fix
babenek Dec 16, 2024
9da2333
refactoring_retrain
babenek Dec 17, 2024
b3c11ec
cfgfix
babenek Dec 17, 2024
53b44b8
md5sum
babenek Dec 17, 2024
0e10c7e
testfix
babenek Dec 17, 2024
60daf58
py3.8 workaround
babenek Dec 17, 2024
cc0bf3a
import opt
babenek Dec 17, 2024
3bffcd3
style
babenek Dec 18, 2024
17d1e66
BM scores upd
babenek Dec 18, 2024
32 changes: 16 additions & 16 deletions .ci/benchmark.txt
@@ -1,4 +1,4 @@
META MD5 72b4b7db8a2ffef0f19e802c09032e14
META MD5 414228344bac7e55c5127be7b244e460
DATA MD5 abd9c025d5c323af814fbeb33f469c90
DATA: 16342283 interested lines. MARKUP: 62020 items
FileType FileNumber ValidLines Positives Negatives Templates
@@ -82,7 +82,7 @@ FileType FileNumber ValidLines Positives Negatives Templat
.ipynb 1 134 5
.j 1 241 4
.j2 30 5530 6 186 10
.java 621 134132 362 1363 172
.java 621 134132 362 1365 171
.jenkinsfile 1 58 2 6
.jinja2 1 64 2
.js 659 536413 531 2497 331
@@ -222,26 +222,26 @@ FileType FileNumber ValidLines Positives Negatives Templat
.yml 419 36169 559 889 376
.zsh 6 872 12
.zsh-theme 1 97 1
TOTAL: 10232 16342283 12255 49690 5102
credsweeper result_cnt : 11517, lost_cnt : 0, true_cnt : 11342, false_cnt : 175
TOTAL: 10232 16342283 12255 49692 5101
credsweeper result_cnt : 11679, lost_cnt : 0, true_cnt : 11391, false_cnt : 288
Rules Positives Negatives Templates Reported TP FP TN FN FPR FNR ACC PRC RCL F1
------------------------------ ----------- ----------- ----------- ---------- ----- ---- ----- ---- -------- -------- -------- -------- -------- --------
API 130 3166 188 125 123 2 3352 7 0.000596 0.053846 0.997417 0.984000 0.946154 0.964706
API 130 3166 188 138 126 12 3342 4 0.003578 0.030769 0.995408 0.913043 0.969231 0.940299
AWS Client ID 168 21 0 160 160 0 21 8 0.000000 0.047619 0.957672 1.000000 0.952381 0.975610
AWS Multi 82 10 0 84 82 1 9 0 0.100000 0.000000 0.989130 0.987952 1.000000 0.993939
AWS S3 Bucket 67 23 0 92 67 23 0 0 1.000000 0.000000 0.744444 0.744444 1.000000 0.853503
Atlassian Old PAT token 27 308 3 12 3 8 303 24 0.025723 0.888889 0.905325 0.272727 0.111111 0.157895
Auth 414 2739 82 390 387 3 2818 27 0.001063 0.065217 0.990726 0.992308 0.934783 0.962687
Auth 414 2739 82 407 388 19 2802 26 0.006735 0.062802 0.986090 0.953317 0.937198 0.945189
Azure Access Token 19 0 0 12 12 0 0 7 0.368421 0.631579 1.000000 0.631579 0.774194
BASE64 Private Key 7 4 0 7 7 0 4 0 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000
BASE64 encoded PEM Private Key 7 0 0 5 5 0 0 2 0.285714 0.714286 1.000000 0.714286 0.833333
Bitbucket Client ID 143 2095 9 48 28 19 2085 115 0.009030 0.804196 0.940365 0.595745 0.195804 0.294737
Bitbucket Client Secret 301 807 10 40 29 11 806 272 0.013464 0.903654 0.746869 0.725000 0.096346 0.170088
CMD ConvertTo-SecureString 13 4 0 13 13 0 4 0 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000
CMD Password 21 128 6 18 18 0 134 3 0.000000 0.142857 0.980645 1.000000 0.857143 0.923077
CMD Password 21 128 6 20 20 0 134 1 0.000000 0.047619 0.993548 1.000000 0.952381 0.975610
CMD Secret 1 1 0 1 1 0 1 0 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000
CMD Token 6 0 0 6 6 0 0 0 0.000000 1.000000 1.000000 1.000000 1.000000
Certificate 24 471 0 20 20 0 471 4 0.000000 0.166667 0.991919 1.000000 0.833333 0.909091
Certificate 24 471 0 26 20 6 465 4 0.012739 0.166667 0.979798 0.769231 0.833333 0.800000
Credential 91 421 76 92 91 1 496 0 0.002012 0.000000 0.998299 0.989130 1.000000 0.994536
Docker Swarm Token 2 0 0 1 1 0 0 1 0.500000 0.500000 1.000000 0.500000 0.666667
Dropbox App secret 64 139 1 46 35 10 130 29 0.071429 0.453125 0.808824 0.777778 0.546875 0.642202
@@ -257,18 +257,18 @@ Grafana Provisioned API Key 22 1 0
JSON Web Token 170 61 0 131 131 0 61 39 0.000000 0.229412 0.831169 1.000000 0.770588 0.870432
Jira / Confluence PAT token 0 4 0 0 0 4 0 0.000000 1.000000
Jira 2FA 15 6 1 12 12 0 7 3 0.000000 0.200000 0.863636 1.000000 0.800000 0.888889
Key 3909 15717 485 3944 3893 51 16151 16 0.003148 0.004093 0.996668 0.987069 0.995907 0.991468
Nonce 91 49 0 89 88 1 48 3 0.020408 0.032967 0.971429 0.988764 0.967033 0.977778
Key 3909 15717 485 3982 3896 86 16116 13 0.005308 0.003326 0.995077 0.978403 0.996674 0.987454
Nonce 91 49 0 90 89 1 48 2 0.020408 0.021978 0.978571 0.988889 0.978022 0.983425
Other 8 7445 1 0 0 7446 8 0.000000 1.000000 0.998927 0.000000
PEM Private Key 1019 1483 0 1023 1019 4 1479 0 0.002697 0.000000 0.998401 0.996090 1.000000 0.998041
Password 1869 7535 2680 1776 1758 18 10197 111 0.001762 0.059390 0.989325 0.989865 0.940610 0.964609
Salt 47 76 1 44 44 0 77 3 0.000000 0.063830 0.975806 1.000000 0.936170 0.967033
Secret 1297 1576 802 1288 1283 5 2373 14 0.002103 0.010794 0.994830 0.996118 0.989206 0.992650
Password 1869 7536 2680 1830 1778 52 10164 91 0.005090 0.048689 0.988167 0.971585 0.951311 0.961341
Salt 47 76 1 45 45 0 77 2 0.000000 0.042553 0.983871 1.000000 0.957447 0.978261
Secret 1297 1576 802 1292 1288 4 2374 9 0.001682 0.006939 0.996463 0.996904 0.993061 0.994979
Seed 1 6 0 0 0 6 1 0.000000 1.000000 0.857143 0.000000
Slack Token 4 1 0 4 4 0 1 0 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000
Tencent WeChat API App ID 6 0 0 6 6 0 0 0 0.000000 1.000000 1.000000 1.000000 1.000000
Token 643 4170 454 616 614 2 4622 29 0.000433 0.045101 0.994114 0.996753 0.954899 0.975377
Token 643 4170 454 633 622 11 4613 21 0.002379 0.032659 0.993924 0.982622 0.967341 0.974922
Twilio Credentials 30 39 0 30 30 0 39 0 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000
URL Credentials 210 156 216 205 205 0 372 5 0.000000 0.023810 0.991409 1.000000 0.976190 0.987952
URL Credentials 210 157 215 214 210 4 368 0 0.010753 0.000000 0.993127 0.981308 1.000000 0.990566
UUID 1069 265 0 1068 1067 1 264 2 0.003774 0.001871 0.997751 0.999064 0.998129 0.998596
12255 49690 5102 11524 11342 175 49515 913 0.003522 0.074500 0.982436 0.984805 0.925500 0.954232
12255 49692 5101 11686 11391 288 49404 864 0.005796 0.070502 0.981403 0.975340 0.929498 0.951868
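The derived columns in the table above (FPR, FNR, ACC, PRC, RCL, F1) can be recomputed from the TP/FP/TN/FN counts. The sketch below is not the benchmark's own code; `Confusion`, `safe_div`, and `metrics` are hypothetical names, and the formulas are the standard confusion-matrix definitions, which reproduce the values printed in the rows. An undefined ratio (zero denominator) is returned as `None`, which corresponds to the blank cells in rows such as "Azure Access Token".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Per-rule confusion counts, matching the TP/FP/TN/FN columns."""
    tp: int
    fp: int
    tn: int
    fn: int


def safe_div(num, den):
    """Return num / den, or None when the ratio is undefined."""
    return num / den if den else None


def metrics(c):
    """Standard confusion-matrix metrics (hypothetical helper, not from the PR)."""
    return {
        "FPR": safe_div(c.fp, c.fp + c.tn),    # false positive rate
        "FNR": safe_div(c.fn, c.fn + c.tp),    # false negative rate
        "ACC": safe_div(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn),
        "PRC": safe_div(c.tp, c.tp + c.fp),    # precision
        "RCL": safe_div(c.tp, c.tp + c.fn),    # recall
        "F1": safe_div(2 * c.tp, 2 * c.tp + c.fp + c.fn),
    }


# Counts taken from the updated "API" row of the benchmark table.
api_after = metrics(Confusion(tp=126, fp=12, tn=3342, fn=4))
```

Rounded to six decimals, `api_after` matches the updated "API" row. The same function applied to the updated totals row (TP 11391, FP 288, TN 49404, FN 864) gives the FPR shift from 0.003522 to 0.005796 reported in the last two lines of the table.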
4 changes: 2 additions & 2 deletions .github/workflows/check.yml
@@ -40,8 +40,8 @@ jobs:
- name: Check ml_model.onnx integrity
if: ${{ always() && steps.code_checkout.conclusion == 'success' }}
run: |
md5sum --binary credsweeper/ml_model/ml_config.json | grep 49c4352ae9ec82ad432d49d7e51c27f1
md5sum --binary credsweeper/ml_model/ml_model.onnx | grep ff66e97c446d0f2bbd8d37b7dfff7361
md5sum --binary credsweeper/ml_model/ml_config.json | grep 4a397e4481c409bbff63e6cc7f9bdef9
md5sum --binary credsweeper/ml_model/ml_model.onnx | grep a1f2272381aaaacf4c94447842845984

# # # line ending
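The workflow step above pins the MD5 digests of the model files and fails when `grep` does not find the expected value. For a local check without `md5sum`, the same comparison could be sketched in Python. `md5_hex`, `check_integrity`, and `PINNED` are hypothetical names; the paths and digests are the updated values from the workflow.

```python
import hashlib
from pathlib import Path

# Updated digests from the workflow step (paths relative to the repo root).
PINNED = {
    "credsweeper/ml_model/ml_config.json": "4a397e4481c409bbff63e6cc7f9bdef9",
    "credsweeper/ml_model/ml_model.onnx": "a1f2272381aaaacf4c94447842845984",
}


def md5_hex(path, chunk_size=1 << 16):
    """Hex MD5 digest of a file, read in binary mode chunk by chunk."""
    digest = hashlib.md5()
    with Path(path).open("rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_integrity(expected):
    """Return the paths whose digest differs from the pinned value."""
    return [path for path, want in expected.items() if md5_hex(path) != want]
```

Calling `check_integrity(PINNED)` from the repository root would return an empty list when both files match. MD5 here only detects an accidental mismatch between the committed model and the pinned value; it is not a defence against deliberate tampering.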

54 changes: 27 additions & 27 deletions credsweeper/common/constants.py
@@ -1,3 +1,4 @@
import string
import typing
from enum import Enum
from typing import Optional, Union
@@ -59,41 +60,37 @@ def get(confidence: Union[str, "Confidence"]) -> Optional["Confidence"]:
return None


class Base(Enum):
"""Stores types of character sets in lower case"""
digits = "digits"
ascii_uppercase = "ascii_uppercase"
ascii_lowercase = "ascii_lowercase"
base16upper = "base16upper"
base16lower = "base16lower"
base32 = "base32"
base36 = "base36"
base64 = "base64"
base64std = "base64std"
base64url = "base64url"
hex = "hex"


class Chars(Enum):
"""Stores three types characters sets.
"""
"""Stores enumeration of characters sets of encoding dictionaries"""

# set of characters, hexadecimal numeral system (Base16). Upper- and lowercase
HEX_CHARS = "0123456789ABCDEFabcdef"
HEX_CHARS = string.digits + "ABCDEFabcdef"
# UUID charset in uppercase
UUID_UPPER_CHARS = string.digits + "ABCDEF-"
# UUID charset in lowercase
UUID_LOWER_CHARS = string.digits + "abcdef-"
# set of characters, hexadecimal numeral system (Base16). Uppercase
BASE16UPPER = "0123456789ABCDEF"
BASE16UPPER = string.digits + "ABCDEF"
# set of characters, hexadecimal numeral system (Base16). Lowercase
BASE16LOWER = "0123456789abcdef"
BASE16LOWER = string.digits + "abcdef"
# set of 32 characters, used in Base32 encoding
BASE32_CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"
BASE32_CHARS = string.ascii_uppercase + "234567"
# set of 36 characters, used in Base36 encoding
BASE36_CHARS = "abcdefghijklmnopqrstuvwxyz1234567890"
# standard base64 with padding sign
BASE64_CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="
BASE36_CHARS = string.digits + string.ascii_lowercase
# base62 set https://en.wikipedia.org/wiki/Base62
BASE62_CHARS = string.digits + string.ascii_letters
# URL- and filename-safe standard
BASE64URL_CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"
# standard base64
BASE64STD_CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
BASE64URL_CHARS = string.digits + string.ascii_letters + "-_"
# URL- and filename-safe standard plus padding sign
BASE64URLPAD_CHARS = string.digits + string.ascii_letters + "-_="
# standard base64 charset
BASE64STD_CHARS = string.digits + string.ascii_letters + "+/"
# standard base64 plus padding sign
BASE64STDPAD_CHARS = string.digits + string.ascii_letters + "+/="
# except whitespaces
ASCII_VISIBLE = string.digits + string.ascii_letters + string.punctuation
# all printable symbols
ASCII_PRINTABLE = string.printable


ENTROPY_LIMIT_BASE64 = 4.5
@@ -179,3 +176,6 @@ class DiffRowType(Enum):
# PEM x509 patterns
PEM_BEGIN_PATTERN = "-----BEGIN"
PEM_END_PATTERN = "-----END"

# similar min_line_len in rule_template - no real credential in data less than 8 bytes
MIN_DATA_LEN = 8
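The `Chars` changes in this file replace hand-typed alphabets with concatenations of `string` module constants. A quick way to see what that does and does not change is to compare an old literal with its new expression. The sketch below copies three pairs from the diff; `same_alphabet` and the `*_OLD` / `*_NEW` names are hypothetical.

```python
import string

# Old literals and new expressions, copied from the constants.py diff.
BASE32_OLD = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"
BASE32_NEW = string.ascii_uppercase + "234567"
BASE36_OLD = "abcdefghijklmnopqrstuvwxyz1234567890"
BASE36_NEW = string.digits + string.ascii_lowercase
BASE64STD_OLD = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
BASE64STD_NEW = string.digits + string.ascii_letters + "+/"


def same_alphabet(old, new):
    """True when both strings hold the same symbols with no duplicates."""
    return len(old) == len(new) == len(set(old)) and set(old) == set(new)
```

All three pairs contain the same symbols, but only the Base32 strings are identical character for character; the Base36 and Base64 strings now start with the digits. Membership tests and set construction are unaffected by the reordering. Code that maps a symbol to its position in the string would see different indices; whether any such code exists in the project is not visible in this diff.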
2 changes: 1 addition & 1 deletion credsweeper/deep_scanner/pdf_scanner.py
@@ -38,7 +38,7 @@ def data_scan(
pdf_content_provider = DataContentProvider(
data=element_text.encode(),
file_path=data_provider.file_path,
file_type=".xml",
file_type=data_provider.file_type,
info=f"{data_provider.info}|PDF:{page.pageid}")
new_limit = recursive_limit_size - len(pdf_content_provider.data)
element_candidates = self.recursive_scan(pdf_content_provider, depth, new_limit)
5 changes: 1 addition & 4 deletions credsweeper/file_handler/data_content_provider.py
@@ -8,17 +8,14 @@
import yaml
from bs4 import BeautifulSoup, Tag, XMLParsedAsHTMLWarning

from credsweeper.common.constants import DEFAULT_ENCODING, ASCII
from credsweeper.common.constants import DEFAULT_ENCODING, ASCII, MIN_DATA_LEN
from credsweeper.file_handler.analysis_target import AnalysisTarget
from credsweeper.file_handler.content_provider import ContentProvider
from credsweeper.utils import Util

warnings.filterwarnings("ignore", category=XMLParsedAsHTMLWarning, module='bs4')
logger = logging.getLogger(__name__)

# similar min_line_len in rule_template - no real credential in data less than 8 bytes
MIN_DATA_LEN = 8

# 8 bytes encodes to 12 symbols 12345678 -> MTIzNDU2NzgK
MIN_ENCODED_DATA_LEN = 12
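The relation between `MIN_DATA_LEN = 8` and `MIN_ENCODED_DATA_LEN = 12` follows from padded Base64 turning every started group of 3 input bytes into 4 output symbols. A minimal sketch of that arithmetic, with `encoded_len` as a hypothetical helper:

```python
import base64


def encoded_len(n_bytes):
    """Length of padded Base64 output for n_bytes of input."""
    return 4 * ((n_bytes + 2) // 3)


eight = base64.b64encode(b"12345678")           # the 8 bytes named in the comment
with_newline = base64.b64encode(b"12345678\n")  # same digits plus a trailing newline
```

Here `eight` is `b"MTIzNDU2Nzg="` and `with_newline` is `b"MTIzNDU2NzgK"`, so the sample string in the code comment corresponds to the digits followed by a newline. Both encodings are 12 symbols long, so the constant holds either way.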

14 changes: 7 additions & 7 deletions credsweeper/filters/value_file_path_check.py
@@ -1,5 +1,5 @@
from credsweeper.common.constants import Chars
from credsweeper.common import static_keyword_checklist
from credsweeper.common.constants import Chars
from credsweeper.config import Config
from credsweeper.credentials import LineData
from credsweeper.file_handler.analysis_target import AnalysisTarget
@@ -13,9 +13,9 @@ class ValueFilePathCheck(Filter):
Check if a value contains either '/' or ':\' separators (but not both)
and do not have any special characters ( !$@`&*()+)
"""
base64_possible_set = set(Chars.BASE64_CHARS.value) | set(Chars.BASE64URL_CHARS.value)
unusual_windows_symbols_in_path = "\t\n\r !$@`&*()[]{}<>+=;,~^"
unusual_linux_symbols_in_path = unusual_windows_symbols_in_path + ":\\"
base64stdpad_possible_set = set(Chars.BASE64STDPAD_CHARS.value)
unusual_windows_symbols_in_path = "\t\n\r!$@`&*(){}<>+=;,~^"
unusual_linux_symbols_in_path = "\t\n\r!@`&*<>+=;,~^:\\"

def __init__(self, config: Config = None) -> None:
pass
@@ -48,12 +48,12 @@ def run(self, line_data: LineData, target: AnalysisTarget) -> bool:
# get minimal entropy to compare with shannon entropy of found value
# min_entropy == 0 means that the value cannot be checked with the entropy due high variance
for i in value:
if i not in self.base64_possible_set:
# value contains wrong BASE64STD_CHARS symbols like .
if i not in self.base64stdpad_possible_set:
# value contains wrong BASE64STDPAD_CHARS symbols like -_
break
else:
# all symbols are from base64 alphabet
entropy = Util.get_shannon_entropy(value, Chars.BASE64STD_CHARS.value)
entropy = Util.get_shannon_entropy(value, Chars.BASE64STDPAD_CHARS.value)
if 0 == min_entropy or min_entropy > entropy:
contains_unix_separator = 1 < value.count('/')
else:
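The hunk above changes two things together: the membership test now uses only the padded standard Base64 alphabet (previously the union of `BASE64_CHARS` and `BASE64URL_CHARS`, which also admitted `-` and `_`), and the entropy is computed over that same alphabet. The sketch below illustrates the `for ... else` membership scan and a Shannon entropy over a fixed alphabet. `Util.get_shannon_entropy` is not shown in this diff, so `shannon_entropy` is an assumption about its behavior, and `all_in_alphabet` is a hypothetical helper.

```python
import math
import string

# Same expression as Chars.BASE64STDPAD_CHARS in the diff.
BASE64STDPAD = string.digits + string.ascii_letters + "+/="


def shannon_entropy(value, alphabet):
    """Shannon entropy in bits per symbol, counting only alphabet symbols.

    Assumed to mirror Util.get_shannon_entropy, which this diff does not show.
    """
    if not value:
        return 0.0
    entropy = 0.0
    for symbol in alphabet:
        p = value.count(symbol) / len(value)
        if p > 0:
            entropy -= p * math.log2(p)
    return entropy


def all_in_alphabet(value, alphabet=BASE64STDPAD):
    """Scan with for/else: the else branch runs only if no break occurred."""
    allowed = set(alphabet)
    for ch in value:
        if ch not in allowed:
            break  # a symbol outside the alphabet, e.g. '-' or '_'
    else:
        return True
    return False
```

With this alphabet, a value such as `usr/local/bin` passes the scan because `/` is a Base64 symbol, while `my-file_name` does not. That matches the updated comment in the hunk, which names `-_` as the symbols that now end the scan early.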
7 changes: 3 additions & 4 deletions credsweeper/ml_model/features/__init__.py
@@ -1,12 +1,11 @@
from credsweeper.ml_model.features.char_set import CharSet
from credsweeper.ml_model.features.entropy_evaluation import EntropyEvaluation
from credsweeper.ml_model.features.file_extension import FileExtension
from credsweeper.ml_model.features.hartley_entropy import HartleyEntropy
from credsweeper.ml_model.features.has_html_tag import HasHtmlTag
from credsweeper.ml_model.features.is_secret_numeric import IsSecretNumeric
from credsweeper.ml_model.features.length_of_attribute import LengthOfAttribute
from credsweeper.ml_model.features.morpheme_dense import MorphemeDense
from credsweeper.ml_model.features.search_in_attribute import SearchInAttribute
from credsweeper.ml_model.features.reny_entropy import RenyiEntropy
from credsweeper.ml_model.features.rule_name import RuleName
from credsweeper.ml_model.features.shannon_entropy import ShannonEntropy
from credsweeper.ml_model.features.word_in_line import WordInLine
from credsweeper.ml_model.features.word_in_path import WordInPath
from credsweeper.ml_model.features.word_in_value import WordInValue
41 changes: 0 additions & 41 deletions credsweeper/ml_model/features/char_set.py

This file was deleted.
