datasets: fix dataset clobbering, and file naming - v1 #341

jasonish · 2024-03-06T16:21:07Z

Don't base dataset filenames on the contents of the file, but
instead on the file's path:
https://redmine.openinfosecfoundation.org/issues/6763
Give each file in a source a unique filename by prefixing the files
with a hash of the URL to prevent duplicate filenames from
clobbering each other, in particular dataset files:
https://redmine.openinfosecfoundation.org/issues/6833

Because the filename was a hash of the content, a new file was created every time the dataset was updated, and the old ones were never cleaned up. To address this, use a filename that doesn't change based on the content. Bug: #6763

To prevent dataset files from different sources from overwriting each other, give each downloaded and extracted file a prefix based on a hash of the source URL. This ensures unique filenames across all rulesets. This mostly matters for datasets: by the time datasets are processed we are working with a merged set of filenames, unlike rules, which are parsed much earlier while we still have a per-source list of files. Not the most elegant solution, but it saves a rather large refactor. Bug: #6833
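The prefixing idea can be sketched as follows. This is an illustration under stated assumptions rather than the patch itself: the helper names, the example URLs, the use of MD5, and the "/" separator between prefix and filename are all hypothetical.

```python
import hashlib


def source_prefix(url: str) -> str:
    """Hash of the source URL, used as a per-source prefix (MD5 is an assumption)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def unique_filename(url: str, filename: str) -> str:
    """Prefix a filename with its source's URL hash (separator is an assumption)."""
    return "{}/{}".format(source_prefix(url), filename)


def merge_sources(sources):
    """Merge {url: {filename: content}} into one flat {filename: content} dict.

    Without the prefix, files with the same name from different sources
    would overwrite each other in the merged dict.
    """
    merged = {}
    for url, files in sources.items():
        for name, content in files.items():
            merged[unique_filename(url, name)] = content
    return merged


# Two hypothetical sources that both ship a dataset with the same name.
sources = {
    "https://example.com/a/rules.tar.gz": {"datasets/bad.lst": b"one"},
    "https://example.org/b/rules.tar.gz": {"datasets/bad.lst": b"two"},
}
merged = merge_sources(sources)
```

Both copies of `datasets/bad.lst` survive the merge because their keys differ in the URL-derived prefix.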

inashivb

Looking good from an overview. :) But, I haven't tested yet.
Do you already have some test data or should I create some? For manual testing.

jasonish · 2024-03-07T19:26:20Z

> Looking good from an overview. :) But, I haven't tested yet. Do you already have some test data or should I create some? For manual testing.

rules.tar.gz

This file has a dataset with the same name as a dataset in pawpatrules; I used it to test the clobbering.

To test the extraneous files, you'd have to update, modify the dataset in the ruleset, then update again and see the new file being created. Then run with the patch and see that it no longer happens. This is the more critical one, as some datasets are pretty large and updated daily.
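As a rough stand-in for that manual test, the accumulation can be simulated in a temporary directory. This is only a sketch: `write_dataset` and `simulate_updates` are hypothetical helpers, the dataset path is made up, and MD5 is an assumption; it does not invoke the real updater.

```python
import hashlib
import os
import tempfile


def write_dataset(rules_dir, dataset_path, content, by_content):
    """Write one dataset revision, named by content hash or by path hash."""
    key = content if by_content else dataset_path.encode("utf-8")
    name = hashlib.md5(key).hexdigest()  # MD5 is an assumption for illustration
    with open(os.path.join(rules_dir, name), "wb") as f:
        f.write(content)
    return name


def simulate_updates(by_content, revisions):
    """Apply successive revisions and return how many files remain on disk."""
    with tempfile.TemporaryDirectory() as rules_dir:
        for content in revisions:
            write_dataset(rules_dir, "datasets/bad-domains.lst", content, by_content)
        return len(os.listdir(rules_dir))


revisions = [b"day 1\n", b"day 2\n", b"day 3\n"]
files_with_content_names = simulate_updates(True, revisions)
files_with_path_names = simulate_updates(False, revisions)
```

With content-based names one file is left behind per revision; with path-based names only the latest copy remains.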

jasonish · 2024-03-07T20:34:47Z

For example, after a few weeks of rulesets with datasets that update frequently, I have these files in my rules directory:

-rw-r--r-- 1 root suricata  58970580 Feb 14 18:26 038e705f72ec4b6467de9fa3cbf829b5
-rw-r--r-- 1 root suricata 155318241 Feb 27 18:26 0beb471fa939b006cdb6ac05fb3a1199
-rw-r--r-- 1 root suricata    569412 Feb 15 18:26 0c70f5bcf2a9735d6a58407f6965494c
-rw-r--r-- 1 root suricata  58546496 Feb 22 18:26 108dc1c2e424dda53fd89d780cac15a5
-rw-r--r-- 1 root suricata    574089 Feb 27 18:26 1185c32107a7b90647fa19e9d888a320
-rw-r--r-- 1 root suricata 150349987 Feb 22 18:26 172dfad5b098700edb639764075ff1ac
-rw-r--r-- 1 root suricata 168763611 Feb 10 18:26 1a7fe436d8b05eb0c8bea640c09fcfa3
-rw-r--r-- 1 root suricata  60047497 Mar  6 18:38 1e85d6915133968796625188c3f52580
-rw-r--r-- 1 root suricata    570510 Feb 17 18:26 203c82f4765979eb69758092c616dfd6
-rw-r--r-- 1 root suricata 149902623 Feb 20 18:26 205bbe69a9b493cd7038d49fe93e3b39
-rw-r--r-- 1 root suricata 152026543 Feb 14 18:26 213e11b91541583682d13bb59b43fd7c
-rw-r--r-- 1 root suricata    192079 Mar  4 18:26 22605fc76152fe66575080ccab7d457f
-rw-r--r-- 1 root suricata    197907 Feb 24 18:26 2582d7659424700e828da8d300d988fd
-rw-r--r-- 1 root suricata  60740385 Feb 25 18:26 27d18b3ca2c1c5d26c167121f0adc6d1
-rw-r--r-- 1 root suricata    573411 Feb 26 09:36 2837e8fab24c794b87fbc1fb312fa2cd
-rw-r--r-- 1 root suricata    572497 Feb 23 18:26 2997d7283013d7483f13e1065a231498
-rw-r--r-- 1 root suricata    190536 Feb 20 18:26 2abd7f3d5d95269e6b2136ced0984b5c
-rw-r--r-- 1 root suricata    189952 Feb 22 18:26 2bf3eb168d21ff24c07647ebae8aaf86
-rw-r--r-- 1 root suricata    199580 Feb 16 18:26 2c924401a046e527ca2efef9cb38a962
-rw-r--r-- 1 root suricata    219641 Feb 11 18:26 2dc9a08c0d5a7af6954e635b8298030f
-rw-r--r-- 1 root suricata    187079 Feb 21 18:26 3282072cc5bdaeaf35c776fb46043a0f
-rw-r--r-- 1 root suricata  61127992 Feb 24 18:26 34ac93e2befcb72ba65ec8fd08d580a4
-rw-r--r-- 1 root suricata 154586706 Mar  5 18:26 353471fc6a634443662c8e1c982dd0b6
-rw-r--r-- 1 root suricata  59746192 Feb 28 18:26 38a1aac456632f56f79117892d3a8f6a
-rw-r--r-- 1 root suricata    575900 Mar  2 18:26 38a775471993b1e8c62bf66a564bcf17
-rw-r--r-- 1 root suricata 157393788 Feb 24 18:26 3c956c07c0c575ebc9f3ad430ff0b4e0
-rw-r--r-- 1 root suricata 148020586 Feb 21 18:26 3cb0e059d27f9a3abb63f49e6a418531
-rw-r--r-- 1 root suricata    569126 Feb 14 18:26 4504b3cb4d5f929d8b539f3936e7b9f9
-rw-r--r-- 1 root suricata 156738417 Feb 29 18:26 4514a9c915f3ecce38077fd8716b19cb
-rw-r--r-- 1 root suricata  60238591 Mar  4 18:26 45e176cf52626cbccf988b551a65686b
-rw-r--r-- 1 root suricata 154912749 Feb 28 18:26 48b7d054fc673308236a77f32b71df4c
-rw-r--r-- 1 root suricata  59919471 Feb 27 18:26 493d9a83128a29005b50392125732f2c
-rw-r--r-- 1 root suricata  64320730 Feb 10 18:26 4b6d4b5df0e15b8f0555b045278fb485
-rw-r--r-- 1 root suricata    577577 Mar  6 18:38 4e1bd71d5293521e49ee2985c180dadc
-rw-r--r-- 1 root suricata 166208718 Feb  9 18:26 501248074f5637865e702886840a29e0
-rw-r--r-- 1 root suricata  59943174 Feb 17 18:26 538116a2d495ef665bb082e9602787d1
-rw-r--r-- 1 root suricata    572240 Feb 22 18:26 54148052ada73e1867394c2a4dbf824a
-rw-r--r-- 1 root suricata 156619071 Feb 25 18:26 5d4e37b999af49aba6400ce5c8152d10
-rw-r--r-- 1 root suricata  66514852 Feb 11 18:26 5e7ba6af68c21b2d12f644f6471066be
-rw-r--r-- 1 root suricata    190952 Feb 27 18:26 5f619ec4b1c4116d91bf0c0c7960a3f4
-rw-r--r-- 1 root suricata  57871853 Feb 21 18:26 610851bc7b04245ee6abbc25a4d19732
-rw-r--r-- 1 root suricata    197819 Mar  2 18:26 6308b5fdaf7ab1b515e7bac4976ef7cd
-rw-r--r-- 1 root suricata  59376331 Feb 18 18:26 64c629c7b1c98b51d150c858817bc1ed
-rw-r--r-- 1 root suricata    573626 Feb 26 18:26 64e8cb4fa39e3b8de9fb29bbd2199e7a
-rw-r--r-- 1 root suricata    188460 Feb 23 18:26 64f81fd270b36d72583bb7374a3b4a3c
-rw-r--r-- 1 root suricata 153678143 Feb 16 18:26 663b0612620ab9bd796ff1f5f5cbf4be
-rw-r--r-- 1 root suricata    574886 Feb 29 18:26 6988175b60e7eb41bf9d580fd9bb407e
-rw-r--r-- 1 root suricata    570856 Feb 18 18:26 69f17d78e3892db3cbcd87afde2b1130
-rw-r--r-- 1 root suricata    571951 Feb 21 18:26 6c2c42043f2c8deaca3f2f31479b7668
-rw-r--r-- 1 root suricata 158640998 Mar  2 18:26 6c67fa4f032b57b053445355757a07b1
-rw-r--r-- 1 root suricata  60568589 Feb 29 18:26 6c88fc18c45515bb0535a701f76807c2
-rw-r--r-- 1 root suricata  59859377 Feb 12 18:26 6d65d56c8e21a4e0575b5ea508a32bdc
-rw-r--r-- 1 root suricata    195770 Mar  3 18:26 711ccc5cf3fabd6157964941f4a4af2f
-rw-r--r-- 1 root suricata    575316 Mar  1 18:26 74a8f13d21a147321890f2abd6de0b68
-rw-r--r-- 1 root suricata    192031 Mar  6 18:38 74d4483ea094134b15641c377c343bf8
-rw-r--r-- 1 root suricata    192458 Feb 26 18:26 7835b2e5072bbe6b3e4e17e9c329200c
-rw-r--r-- 1 root suricata    207719 Feb  9 18:26 7afb5f69249e4d0133f7db141c387f40
-rw-r--r-- 1 root suricata  58984375 Feb 19 18:26 8153657363c85ae6b45a959ef8cb8e45
-rw-r--r-- 1 root suricata 152270457 Feb 15 18:26 8261592ef67778bbb1f54e427df269fc
-rw-r--r-- 1 root suricata    187456 Mar  5 18:26 852169ae823c98be9e3e8bfbb2c075c4
-rw-r--r-- 1 root suricata 153584490 Feb 17 18:26 881015148fef01ddb4caf7765bb6e806
-rw-r--r-- 1 root suricata 153206930 Feb 13 18:26 88b97dfe173dac58a0c9adf6cdc22cc7
-rw-r--r-- 1 root suricata    197480 Feb 13 18:26 8b03e4be862b4c0a05f8d1b831f4982d
-rw-r--r-- 1 root suricata    198115 Feb 14 18:26 9105818ee8b0bb24fd7cb89a95310410
-rw-r--r-- 1 root suricata  58292458 Feb 23 18:26 91b014cbb52dc071a3b2c2935662cf3f
-rw-r--r-- 1 root suricata 157341847 Mar  6 18:38 944c5f8c279bcad5a1a8d683eb3afb6a
-rw-r--r-- 1 root suricata    569693 Feb 16 18:26 974d4b606a51a38afc1cf06fa4b851cb
-rw-r--r-- 1 root suricata  63025346 Feb  9 18:26 981639fac11ef980215e74bf7df914d8
-rw-r--r-- 1 root suricata 173746343 Feb 11 18:26 9ab915a0a5941c78c6001812aaf65814
-rw-r--r-- 1 root suricata 156143111 Feb 26 18:26 9b7b4bc0b09e62f9cc72412858949ba7
-rw-r--r-- 1 root suricata  61307498 Mar  2 18:26 9c43b08f865741210750d53447b9b328
-rw-r--r-- 1 root suricata    574480 Feb 28 18:26 9c6ae80010e8d4cfb92c9da274cbd8f2
-rw-r--r-- 1 root suricata    195226 Feb 25 18:26 a103c69f53955a3335b656df6167c1ab
-rw-r--r-- 1 root suricata 150936787 Feb 19 18:26 a6e07c07d08c53aaae8b6bbd35ef7af1
-rw-r--r-- 1 root suricata 157167126 Mar  1 18:26 a9ec1ee47b9c4c7242bab6f7ac762cb2
-rw-r--r-- 1 root suricata  59241954 Feb 15 18:26 b047c9aa9989409edaf6b193a3363e2b
-rw-r--r-- 1 root suricata    573244 Feb 24 18:26 b4ef478eee2d6a1097372ac501ff286c
-rw-r--r-- 1 root suricata    575992 Mar  3 18:26 b68898aee4f804f1ade43e421a7bc915
-rw-r--r-- 1 root suricata    576241 Mar  4 18:26 b8e816236eef903cd1ec1d1500902001
-rw-r--r-- 1 root suricata    208532 Feb  8 18:26 b949a8c1380699bcb283438c98ef68b3
-rw-r--r-- 1 root suricata 158738574 Mar  3 18:26 ba939d4ec06f7992a7915110c121dfaf
-rw-r--r-- 1 root suricata 166927070 Feb  8 18:26 bd066f294802018c1a6ea138dc20f7b1
-rw-r--r-- 1 root suricata    571698 Feb 21 09:34 bd5f11453d88eccff3439947738d5585
-rw-r--r-- 1 root suricata  59811644 Feb 16 18:26 c0499d5f77f57be959d3a829eea59bd8
-rw-r--r-- 1 root suricata  59440106 Feb 13 18:26 c2df24c0d3a11207311e4facbe77f26d
-rw-r--r-- 1 root suricata  58576544 Feb 20 18:26 c2e590c7fec41d5a6bd43d2694ae3516
-rw-r--r-- 1 root suricata    193257 Feb 18 18:26 cc2f018e47fd9a6cca74bce66b064ce0
-rw-r--r-- 1 root suricata  63217302 Feb  8 18:26 cd2208c927c51c32b8315fb6808e8210
-rw-r--r-- 1 root suricata  59007769 Mar  5 18:26 cd5b903200d27f668f7d68cc2569f550
-rw-r--r-- 1 root suricata 154142948 Feb 12 18:26 d08ff884182b958ebc82dd0775f6f466
-rw-r--r-- 1 root suricata    571178 Feb 19 18:26 d2902f75ee84025067337e876ba25937
-rw-r--r-- 1 root suricata    213977 Feb 10 18:26 d299f230cdd3976e45ecc7e611fb83a5
-rw-r--r-- 1 root suricata    195269 Feb 29 18:26 d90329d991d33aad8bfea0c2a4a084ac
-rw-r--r-- 1 root suricata  60342375 Feb 26 18:26 dc69de7b37cd5e8837fe7fbb83b1e147
-rw-r--r-- 1 root suricata  60781376 Mar  1 18:26 e0557d9d71d706ebde45b5c928a74441
-rw-r--r-- 1 root suricata    199161 Feb 17 18:26 e3f60f6a7575432bb0273f2d2710c6ee
-rw-r--r-- 1 root suricata  61194695 Mar  3 18:26 e8987cfceaf05972f8c9621ae581642b
-rw-r--r-- 1 root suricata    198259 Feb 12 18:26 e99aebaa3579591dbfb64bdc5f094e5a
-rw-r--r-- 1 root suricata 150065367 Feb 23 18:26 ee4ef56838826cf6b8b9f59e8e8b5b33
-rw-r--r-- 1 root suricata 157385772 Mar  4 18:26 f0cc62d1193b8c78d6cb27f0b6c910b2
-rw-r--r-- 1 root suricata    191702 Feb 19 18:26 f193dabe02f6c322c1625da58efde4d0
-rw-r--r-- 1 root suricata    201743 Feb 15 18:26 f2150b43e182938db5b3914a079100d7
-rw-r--r-- 1 root suricata    194812 Mar  1 18:26 f25078a307be4f8164d4da4ef49ef003
-rw-r--r-- 1 root suricata 151719681 Feb 18 18:26 f43540b666ecf8c579477daff0c7da88
-rw-r--r-- 1 root suricata    191110 Feb 28 18:26 f6057075403aaf142b5e937a972ee79f
-rw-r--r-- 1 root suricata    568652 Feb 14 09:19 fbbe9530e8cd08fb60fd7fb551fb66c8
-rw-r--r-- 1 root suricata    576722 Mar  5 18:26 ff8f9344203d0cc0b003bfd3bbe338a3

jasonish added 2 commits March 5, 2024 17:11

datasets: use filename based on filename; not content

935d361

Because the filename was a hash of the content, a new file was created every time the dataset was updated, and the old ones were never cleaned up. To address this, use a filename that doesn't change based on the content. Bug: #6763

jasonish requested a review from inashivb March 6, 2024 16:21

inashivb reviewed Mar 7, 2024


jasonish merged commit 8725e56 into OISF:master Mar 11, 2024
12 checks passed

jasonish deleted the dataset-filenames/v1 branch March 19, 2024 16:00
