datasets: fix dataset clobbering, and file naming - v1 #341

Merged 2 commits on Mar 11, 2024

jasonish (Member) commented Mar 6, 2024

jasonish added 2 commits March 5, 2024 17:11
Because the filename was a hash of the content, a new file was created
every time the dataset was updated, and old files were never cleaned up.
To address this, use a filename that doesn't change based on the
content.

Bug: #6763
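
The difference between the two naming schemes can be sketched as follows. This is a minimal illustration, not the code from the patch: the helper names, the sample dataset contents, and the choice to derive the stable name from a hypothetical dataset path (`datasets/bad-ips.lst`) are all assumptions made for the example.

```python
import hashlib
import os
import tempfile


def write_dataset(directory, name, content):
    """Write one dataset file into the output directory."""
    with open(os.path.join(directory, name), "wb") as fileobj:
        fileobj.write(content)


def simulate(versions, name_for):
    """Write each dataset version in turn; return how many files remain."""
    with tempfile.TemporaryDirectory() as directory:
        for content in versions:
            write_dataset(directory, name_for(content), content)
        return len(os.listdir(directory))


# Three successive versions of the same (made-up) dataset.
versions = [b"1.1.1.1\n", b"1.1.1.1\n2.2.2.2\n", b"3.3.3.3\n"]

# Name depends on the content: every update produces a new file.
by_content = simulate(versions, lambda c: hashlib.md5(c).hexdigest())

# Name depends only on a hypothetical dataset path: updates overwrite in place.
stable = hashlib.md5(b"datasets/bad-ips.lst").hexdigest()
by_path = simulate(versions, lambda c: stable)

print(by_content, by_path)  # 3 1
```

With content-based names, one file is left behind per update; with a name that ignores the content, the directory holds a single file no matter how often the dataset changes.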

To prevent dataset files from different sources from overwriting each
other, give each downloaded and extracted file a prefix based on a hash
of the source URL. This ensures unique filenames across all rulesets.

This mostly matters for datasets, as when datasets are processed we
are working with a merged set of filenames, unlike rules which are
parsed much earlier when we still have a list of files.

Not the most elegant solution, but saves a rather large refactor.

Bug: #6833
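
As a rough sketch of the idea in this commit message, the snippet below merges files from two sources with and without a URL-derived prefix. The function names, the example URLs, the use of MD5, and the `prefix/filename` layout are assumptions for illustration; the actual patch may build the prefix differently.

```python
import hashlib


def prefix_for_url(url):
    """Derive a prefix from the source URL (hypothetical helper)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def with_prefix(url, files):
    """Return a copy of `files` whose keys carry the URL-derived prefix."""
    prefix = prefix_for_url(url)
    return {f"{prefix}/{name}": content for name, content in files.items()}


# Two made-up sources that ship a dataset under the same filename.
url_a = "https://example.com/a/rules.tar.gz"
url_b = "https://example.org/b/rules.tar.gz"
files_a = {"datasets/bad.lst": b"from source a\n"}
files_b = {"datasets/bad.lst": b"from source b\n"}

# Merging by bare filename: the second source clobbers the first.
merged_plain = {**files_a, **files_b}

# Merging by prefixed filename: both datasets survive.
merged_prefixed = {**with_prefix(url_a, files_a), **with_prefix(url_b, files_b)}

print(len(merged_plain), len(merged_prefixed))  # 1 2
```

The prefix only has to be unique per source and stable across runs, which a hash of the URL provides without requiring the merged file set to track where each file came from.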
@jasonish jasonish requested a review from inashivb March 6, 2024 16:21

inashivb (Member) left a comment

Looking good from an overview. :) But, I haven't tested yet.
Do you already have some test data or should I create some? For manual testing.

jasonish (Member, Author) commented Mar 7, 2024

> Looking good from an overview. :) But, I haven't tested yet. Do you already have some test data or should I create some? For manual testing.

rules.tar.gz

This file contains a dataset with the same name as a dataset in pawpatrules; I used it to test the clobbering.

To test the extraneous files, you'd have to update, modify the dataset in the ruleset, then update again and see the new file being created. Then run with the patch and see that it no longer happens. This is the more critical one, as some datasets are pretty large and updated daily.

jasonish (Member, Author) commented Mar 7, 2024

For example, after a few weeks of rulesets with datasets that update frequently, I have these files in my rules directory:

-rw-r--r-- 1 root suricata  58970580 Feb 14 18:26 038e705f72ec4b6467de9fa3cbf829b5
-rw-r--r-- 1 root suricata 155318241 Feb 27 18:26 0beb471fa939b006cdb6ac05fb3a1199
-rw-r--r-- 1 root suricata    569412 Feb 15 18:26 0c70f5bcf2a9735d6a58407f6965494c
-rw-r--r-- 1 root suricata  58546496 Feb 22 18:26 108dc1c2e424dda53fd89d780cac15a5
-rw-r--r-- 1 root suricata    574089 Feb 27 18:26 1185c32107a7b90647fa19e9d888a320
-rw-r--r-- 1 root suricata 150349987 Feb 22 18:26 172dfad5b098700edb639764075ff1ac
-rw-r--r-- 1 root suricata 168763611 Feb 10 18:26 1a7fe436d8b05eb0c8bea640c09fcfa3
-rw-r--r-- 1 root suricata  60047497 Mar  6 18:38 1e85d6915133968796625188c3f52580
-rw-r--r-- 1 root suricata    570510 Feb 17 18:26 203c82f4765979eb69758092c616dfd6
-rw-r--r-- 1 root suricata 149902623 Feb 20 18:26 205bbe69a9b493cd7038d49fe93e3b39
-rw-r--r-- 1 root suricata 152026543 Feb 14 18:26 213e11b91541583682d13bb59b43fd7c
-rw-r--r-- 1 root suricata    192079 Mar  4 18:26 22605fc76152fe66575080ccab7d457f
-rw-r--r-- 1 root suricata    197907 Feb 24 18:26 2582d7659424700e828da8d300d988fd
-rw-r--r-- 1 root suricata  60740385 Feb 25 18:26 27d18b3ca2c1c5d26c167121f0adc6d1
-rw-r--r-- 1 root suricata    573411 Feb 26 09:36 2837e8fab24c794b87fbc1fb312fa2cd
-rw-r--r-- 1 root suricata    572497 Feb 23 18:26 2997d7283013d7483f13e1065a231498
-rw-r--r-- 1 root suricata    190536 Feb 20 18:26 2abd7f3d5d95269e6b2136ced0984b5c
-rw-r--r-- 1 root suricata    189952 Feb 22 18:26 2bf3eb168d21ff24c07647ebae8aaf86
-rw-r--r-- 1 root suricata    199580 Feb 16 18:26 2c924401a046e527ca2efef9cb38a962
-rw-r--r-- 1 root suricata    219641 Feb 11 18:26 2dc9a08c0d5a7af6954e635b8298030f
-rw-r--r-- 1 root suricata    187079 Feb 21 18:26 3282072cc5bdaeaf35c776fb46043a0f
-rw-r--r-- 1 root suricata  61127992 Feb 24 18:26 34ac93e2befcb72ba65ec8fd08d580a4
-rw-r--r-- 1 root suricata 154586706 Mar  5 18:26 353471fc6a634443662c8e1c982dd0b6
-rw-r--r-- 1 root suricata  59746192 Feb 28 18:26 38a1aac456632f56f79117892d3a8f6a
-rw-r--r-- 1 root suricata    575900 Mar  2 18:26 38a775471993b1e8c62bf66a564bcf17
-rw-r--r-- 1 root suricata 157393788 Feb 24 18:26 3c956c07c0c575ebc9f3ad430ff0b4e0
-rw-r--r-- 1 root suricata 148020586 Feb 21 18:26 3cb0e059d27f9a3abb63f49e6a418531
-rw-r--r-- 1 root suricata    569126 Feb 14 18:26 4504b3cb4d5f929d8b539f3936e7b9f9
-rw-r--r-- 1 root suricata 156738417 Feb 29 18:26 4514a9c915f3ecce38077fd8716b19cb
-rw-r--r-- 1 root suricata  60238591 Mar  4 18:26 45e176cf52626cbccf988b551a65686b
-rw-r--r-- 1 root suricata 154912749 Feb 28 18:26 48b7d054fc673308236a77f32b71df4c
-rw-r--r-- 1 root suricata  59919471 Feb 27 18:26 493d9a83128a29005b50392125732f2c
-rw-r--r-- 1 root suricata  64320730 Feb 10 18:26 4b6d4b5df0e15b8f0555b045278fb485
-rw-r--r-- 1 root suricata    577577 Mar  6 18:38 4e1bd71d5293521e49ee2985c180dadc
-rw-r--r-- 1 root suricata 166208718 Feb  9 18:26 501248074f5637865e702886840a29e0
-rw-r--r-- 1 root suricata  59943174 Feb 17 18:26 538116a2d495ef665bb082e9602787d1
-rw-r--r-- 1 root suricata    572240 Feb 22 18:26 54148052ada73e1867394c2a4dbf824a
-rw-r--r-- 1 root suricata 156619071 Feb 25 18:26 5d4e37b999af49aba6400ce5c8152d10
-rw-r--r-- 1 root suricata  66514852 Feb 11 18:26 5e7ba6af68c21b2d12f644f6471066be
-rw-r--r-- 1 root suricata    190952 Feb 27 18:26 5f619ec4b1c4116d91bf0c0c7960a3f4
-rw-r--r-- 1 root suricata  57871853 Feb 21 18:26 610851bc7b04245ee6abbc25a4d19732
-rw-r--r-- 1 root suricata    197819 Mar  2 18:26 6308b5fdaf7ab1b515e7bac4976ef7cd
-rw-r--r-- 1 root suricata  59376331 Feb 18 18:26 64c629c7b1c98b51d150c858817bc1ed
-rw-r--r-- 1 root suricata    573626 Feb 26 18:26 64e8cb4fa39e3b8de9fb29bbd2199e7a
-rw-r--r-- 1 root suricata    188460 Feb 23 18:26 64f81fd270b36d72583bb7374a3b4a3c
-rw-r--r-- 1 root suricata 153678143 Feb 16 18:26 663b0612620ab9bd796ff1f5f5cbf4be
-rw-r--r-- 1 root suricata    574886 Feb 29 18:26 6988175b60e7eb41bf9d580fd9bb407e
-rw-r--r-- 1 root suricata    570856 Feb 18 18:26 69f17d78e3892db3cbcd87afde2b1130
-rw-r--r-- 1 root suricata    571951 Feb 21 18:26 6c2c42043f2c8deaca3f2f31479b7668
-rw-r--r-- 1 root suricata 158640998 Mar  2 18:26 6c67fa4f032b57b053445355757a07b1
-rw-r--r-- 1 root suricata  60568589 Feb 29 18:26 6c88fc18c45515bb0535a701f76807c2
-rw-r--r-- 1 root suricata  59859377 Feb 12 18:26 6d65d56c8e21a4e0575b5ea508a32bdc
-rw-r--r-- 1 root suricata    195770 Mar  3 18:26 711ccc5cf3fabd6157964941f4a4af2f
-rw-r--r-- 1 root suricata    575316 Mar  1 18:26 74a8f13d21a147321890f2abd6de0b68
-rw-r--r-- 1 root suricata    192031 Mar  6 18:38 74d4483ea094134b15641c377c343bf8
-rw-r--r-- 1 root suricata    192458 Feb 26 18:26 7835b2e5072bbe6b3e4e17e9c329200c
-rw-r--r-- 1 root suricata    207719 Feb  9 18:26 7afb5f69249e4d0133f7db141c387f40
-rw-r--r-- 1 root suricata  58984375 Feb 19 18:26 8153657363c85ae6b45a959ef8cb8e45
-rw-r--r-- 1 root suricata 152270457 Feb 15 18:26 8261592ef67778bbb1f54e427df269fc
-rw-r--r-- 1 root suricata    187456 Mar  5 18:26 852169ae823c98be9e3e8bfbb2c075c4
-rw-r--r-- 1 root suricata 153584490 Feb 17 18:26 881015148fef01ddb4caf7765bb6e806
-rw-r--r-- 1 root suricata 153206930 Feb 13 18:26 88b97dfe173dac58a0c9adf6cdc22cc7
-rw-r--r-- 1 root suricata    197480 Feb 13 18:26 8b03e4be862b4c0a05f8d1b831f4982d
-rw-r--r-- 1 root suricata    198115 Feb 14 18:26 9105818ee8b0bb24fd7cb89a95310410
-rw-r--r-- 1 root suricata  58292458 Feb 23 18:26 91b014cbb52dc071a3b2c2935662cf3f
-rw-r--r-- 1 root suricata 157341847 Mar  6 18:38 944c5f8c279bcad5a1a8d683eb3afb6a
-rw-r--r-- 1 root suricata    569693 Feb 16 18:26 974d4b606a51a38afc1cf06fa4b851cb
-rw-r--r-- 1 root suricata  63025346 Feb  9 18:26 981639fac11ef980215e74bf7df914d8
-rw-r--r-- 1 root suricata 173746343 Feb 11 18:26 9ab915a0a5941c78c6001812aaf65814
-rw-r--r-- 1 root suricata 156143111 Feb 26 18:26 9b7b4bc0b09e62f9cc72412858949ba7
-rw-r--r-- 1 root suricata  61307498 Mar  2 18:26 9c43b08f865741210750d53447b9b328
-rw-r--r-- 1 root suricata    574480 Feb 28 18:26 9c6ae80010e8d4cfb92c9da274cbd8f2
-rw-r--r-- 1 root suricata    195226 Feb 25 18:26 a103c69f53955a3335b656df6167c1ab
-rw-r--r-- 1 root suricata 150936787 Feb 19 18:26 a6e07c07d08c53aaae8b6bbd35ef7af1
-rw-r--r-- 1 root suricata 157167126 Mar  1 18:26 a9ec1ee47b9c4c7242bab6f7ac762cb2
-rw-r--r-- 1 root suricata  59241954 Feb 15 18:26 b047c9aa9989409edaf6b193a3363e2b
-rw-r--r-- 1 root suricata    573244 Feb 24 18:26 b4ef478eee2d6a1097372ac501ff286c
-rw-r--r-- 1 root suricata    575992 Mar  3 18:26 b68898aee4f804f1ade43e421a7bc915
-rw-r--r-- 1 root suricata    576241 Mar  4 18:26 b8e816236eef903cd1ec1d1500902001
-rw-r--r-- 1 root suricata    208532 Feb  8 18:26 b949a8c1380699bcb283438c98ef68b3
-rw-r--r-- 1 root suricata 158738574 Mar  3 18:26 ba939d4ec06f7992a7915110c121dfaf
-rw-r--r-- 1 root suricata 166927070 Feb  8 18:26 bd066f294802018c1a6ea138dc20f7b1
-rw-r--r-- 1 root suricata    571698 Feb 21 09:34 bd5f11453d88eccff3439947738d5585
-rw-r--r-- 1 root suricata  59811644 Feb 16 18:26 c0499d5f77f57be959d3a829eea59bd8
-rw-r--r-- 1 root suricata  59440106 Feb 13 18:26 c2df24c0d3a11207311e4facbe77f26d
-rw-r--r-- 1 root suricata  58576544 Feb 20 18:26 c2e590c7fec41d5a6bd43d2694ae3516
-rw-r--r-- 1 root suricata    193257 Feb 18 18:26 cc2f018e47fd9a6cca74bce66b064ce0
-rw-r--r-- 1 root suricata  63217302 Feb  8 18:26 cd2208c927c51c32b8315fb6808e8210
-rw-r--r-- 1 root suricata  59007769 Mar  5 18:26 cd5b903200d27f668f7d68cc2569f550
-rw-r--r-- 1 root suricata 154142948 Feb 12 18:26 d08ff884182b958ebc82dd0775f6f466
-rw-r--r-- 1 root suricata    571178 Feb 19 18:26 d2902f75ee84025067337e876ba25937
-rw-r--r-- 1 root suricata    213977 Feb 10 18:26 d299f230cdd3976e45ecc7e611fb83a5
-rw-r--r-- 1 root suricata    195269 Feb 29 18:26 d90329d991d33aad8bfea0c2a4a084ac
-rw-r--r-- 1 root suricata  60342375 Feb 26 18:26 dc69de7b37cd5e8837fe7fbb83b1e147
-rw-r--r-- 1 root suricata  60781376 Mar  1 18:26 e0557d9d71d706ebde45b5c928a74441
-rw-r--r-- 1 root suricata    199161 Feb 17 18:26 e3f60f6a7575432bb0273f2d2710c6ee
-rw-r--r-- 1 root suricata  61194695 Mar  3 18:26 e8987cfceaf05972f8c9621ae581642b
-rw-r--r-- 1 root suricata    198259 Feb 12 18:26 e99aebaa3579591dbfb64bdc5f094e5a
-rw-r--r-- 1 root suricata 150065367 Feb 23 18:26 ee4ef56838826cf6b8b9f59e8e8b5b33
-rw-r--r-- 1 root suricata 157385772 Mar  4 18:26 f0cc62d1193b8c78d6cb27f0b6c910b2
-rw-r--r-- 1 root suricata    191702 Feb 19 18:26 f193dabe02f6c322c1625da58efde4d0
-rw-r--r-- 1 root suricata    201743 Feb 15 18:26 f2150b43e182938db5b3914a079100d7
-rw-r--r-- 1 root suricata    194812 Mar  1 18:26 f25078a307be4f8164d4da4ef49ef003
-rw-r--r-- 1 root suricata 151719681 Feb 18 18:26 f43540b666ecf8c579477daff0c7da88
-rw-r--r-- 1 root suricata    191110 Feb 28 18:26 f6057075403aaf142b5e937a972ee79f
-rw-r--r-- 1 root suricata    568652 Feb 14 09:19 fbbe9530e8cd08fb60fd7fb551fb66c8
-rw-r--r-- 1 root suricata    576722 Mar  5 18:26 ff8f9344203d0cc0b003bfd3bbe338a3
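
For anyone wanting to check their own rules directory for a similar buildup, a small read-only sketch like the one below could help. It is not part of the patch: the function name is made up, and it assumes that the files of interest are exactly those whose names are 32 lowercase hex characters, as in the listing above. It only reports candidates and does not decide which ones are stale or delete anything.

```python
import os
import re
import tempfile

# Names in the listing above look like 32-character hex digests.
HEX32 = re.compile(r"^[0-9a-f]{32}$")


def find_hash_named_files(directory):
    """Return (name, size, mtime) for hash-named regular files, oldest first."""
    found = []
    for entry in os.scandir(directory):
        if entry.is_file() and HEX32.match(entry.name):
            info = entry.stat()
            found.append((entry.name, info.st_size, info.st_mtime))
    # Sort by modification time, then name, so the order is deterministic.
    return sorted(found, key=lambda item: (item[2], item[0]))


# Demonstration against a throwaway directory with made-up contents.
with tempfile.TemporaryDirectory() as demo_dir:
    samples = {
        "038e705f72ec4b6467de9fa3cbf829b5": b"x" * 10,
        "0beb471fa939b006cdb6ac05fb3a1199": b"y" * 5,
        "suricata.rules": b"alert",
    }
    for name, content in samples.items():
        with open(os.path.join(demo_dir, name), "wb") as fileobj:
            fileobj.write(content)
    hits = find_hash_named_files(demo_dir)

names = sorted(name for name, _size, _mtime in hits)
total_bytes = sum(size for _name, size, _mtime in hits)
print(len(names), total_bytes)  # 2 15
```

Pointing `find_hash_named_files` at a real rules directory gives a quick view of how many such files exist and how much space they take; which of them are still referenced by the current rules has to be checked separately.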

@jasonish jasonish merged commit 8725e56 into OISF:master Mar 11, 2024
12 checks passed
@jasonish jasonish deleted the dataset-filenames/v1 branch March 19, 2024 16:00