Add a CI check to validate IP addresses present in tests, and a helper to clean them #165

Open

chadell wants to merge 8 commits into develop
Conversation

@chadell (Collaborator) commented Sep 17, 2022

This PR addresses issue #166 by adding a helper script, anonymize-ip-addresses.py, to be used in CI to detect the presence of IPv4 or IPv6 addresses and to replace them with documentation IPs.

Content

  • New script anonymize-ip-addresses.py to detect IPv4 and IPv6 addresses and replace them with documentation ones (a sketch of this logic follows after this list)
  • Created two invoke tasks: check-anonymize-ips and clean-anonymize-ips
  • Added a first step in the CI to detect the presence of IP addresses in the test data
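
For illustration, here is a minimal sketch of such a detect-and-replace helper, using Python's re and ipaddress modules and an address from the RFC 5737 documentation range 192.0.2.0/24. The names and structure are hypothetical, not lifted from the actual script:

import ipaddress
import re

# Loose IPv4 candidate pattern; real validation is delegated to ipaddress.
IPV4_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def anonymize_ipv4(text: str) -> str:
    """Replace every valid IPv4 address with an RFC 5737 documentation address."""
    def _replace(match):
        try:
            ipaddress.IPv4Address(match.group(0))
        except ipaddress.AddressValueError:
            return match.group(0)  # e.g. "999.1.1.1" is not an address; leave it
        return "192.0.2.1"
    return IPV4_CANDIDATE.sub(_replace, text)

print(anonymize_ipv4("peer 172.16.31.7 is down"))  # peer 192.0.2.1 is down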

@chadell chadell changed the title [WIP] Add a CI check to validate IP address present in tests, and a helper to clean them Add a CI check to validate IP address present in tests, and a helper to clean them Sep 18, 2022
@chadell chadell marked this pull request as ready for review September 18, 2022 17:08
@itdependsnetworks (Contributor)

Just a friendly thought when I think of anonymizing IPs and parsing:

  • When you are parsing, changing the octet size can introduce differences, for example in a fixed-width table
  • When viewing the data, IPs are sometimes related to one another, and anonymizing them all to the same value makes it difficult to correlate them

Does it make sense to consider swapping the first octet as follows:

  • 1.x.x.x for single-digit first octets
  • 10.x.x.x for double-digit first octets
  • 172.x.x.x for triple-digit first octets

and keep everything else the same?
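
A rough sketch of that first-octet swap, purely to illustrate the proposal; the function name and mapping are illustrative, not from the PR:

def swap_first_octet(ip: str) -> str:
    """Swap only the first octet, keyed on its digit count, so the string
    length and every other octet stay exactly as they were."""
    first, rest = ip.split(".", 1)
    return {1: "1", 2: "10", 3: "172"}[len(first)] + "." + rest

print(swap_first_octet("8.8.8.8"))       # 1.8.8.8
print(swap_first_octet("203.0.113.99"))  # 172.0.113.99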

@chadell (Collaborator, Author) commented Sep 19, 2022

“and keep everything else the same?”

The parsing should work the same for any kind of address, so if changing the octet size breaks it, it would be great to catch that error earlier.
So far, across all the test data sets, I have found only one IP address in the actual data (a description field), in the Megaport one, and I don't see a big problem with losing it by replacing it with a generic one.
Maybe I am missing a good point, but so far I think it's easier to do a simple replace to a single reference value for all the IP addresses.

@itdependsnetworks (Contributor)

Case in point:

[screenshot]

Additionally, I have had bad luck with parsing where a column would be based on 15 characters, as an example, and then when the value is shortened it would run into issues such as:

IP               Status
100.100.100.101  active

becoming:

IP               Status
192.2.0.1  active

This creates a state that would never exist in the real world, but can only exist in the testing infra. So what wins? The correct parsing strategy of 15 characters (imagine a real scenario where this is the 4th column and all columns are based on spacing), or the incorrect reality that was created by auto-anonymizing without spatial awareness?

Long and short: parsing is a fickle beast, and I have personally been bitten by this one several times. (See the sketch below for a concrete demonstration.)
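
To make the fixed-width concern concrete, here is a small sketch of a parser that slices a 15-character IP column, with widths assumed from the example above:

def parse_row(row: str) -> tuple[str, str]:
    # Assumed layout: 15-character IP field, 2 spaces, then the Status field.
    return row[:15].strip(), row[17:].strip()

print(parse_row("100.100.100.101  active"))  # ('100.100.100.101', 'active')
print(parse_row("192.2.0.1  active"))        # ('192.2.0.1  acti', '') -- broken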

@glennmatthews (Contributor)

I'm not as concerned as @itdependsnetworks about specifically preserving exact character counts, but I do agree that it would be good to keep some uniqueness across different IPs rather than making all addresses identical. I'd vote for something like just changing the first octet of IPv4 addresses to 10. and changing the first two tokens of IPv6 addresses (do we even have any in our examples?) to 2001:db8:.
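
A naive sketch of that per-prefix rewrite, only to show the intent; note the IPv6 pattern here handles only uncompressed (8-hextet) addresses:

import re

def partially_anonymize(line: str) -> str:
    # IPv4: rewrite only the first octet to 10, keeping the rest unique.
    line = re.sub(r"\b\d{1,3}(?=(?:\.\d{1,3}){3}\b)", "10", line)
    # IPv6 (uncompressed only): rewrite the first two hextets to 2001:db8.
    line = re.sub(
        r"\b[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}(?=(?::[0-9A-Fa-f]{1,4}){6}\b)",
        "2001:db8",
        line,
    )
    return line

print(partially_anonymize("192.168.17.5 and 198.51.100.9"))  # 10.168.17.5 and 10.51.100.9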

report = ""
try:
for line in fileinput.input(filename, inplace=True):
if any(pattern_to_skip in line for pattern_to_skip in PATTERNS_TO_SKIP):
Contributor:

a bit surprising to me that PATTERNS_TO_SKIP is a list of exact substring matches, rather than a list of regex patterns.

Collaborator (Author):

It could be changed to regex if needed at some point. For the current need, the simple substring matching was good enough.
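
For context, the two styles under discussion side by side; the list contents are illustrative, not the PR's actual values:

import re

# Current style: exact substring matches.
PATTERNS_TO_SKIP = ["0.0.0.0", "255.255.255.255"]

def should_skip(line: str) -> bool:
    return any(pattern in line for pattern in PATTERNS_TO_SKIP)

# Regex style, if ever needed: compile once, search per line.
REGEXES_TO_SKIP = [re.compile(r"0\.0\.0\.0"), re.compile(r"\bversion \d+\.\d+\b")]

def should_skip_regex(line: str) -> bool:
    return any(regex.search(line) for regex in REGEXES_TO_SKIP)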

    content = test_file.readlines()
except UnicodeDecodeError as error:
    print(f"Warning: Not able to process {filename}: {error}")
    continue
Contributor:

sys.exit here? Or do we have files that we are expecting to have decoding errors in?

Collaborator (Author):

Yes, the CSV one is not supported for now. We could solve it at some point if needed.
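
If the CSV (or other non-UTF-8) files ever need to be covered, one possible shape for that fix is an encoding fallback; this is a sketch under that assumption, not what the PR does:

def read_lines_tolerant(filename: str) -> list[str]:
    """Try UTF-8 first; fall back to latin-1, which decodes any byte sequence."""
    try:
        with open(filename, encoding="utf-8") as handle:
            return handle.readlines()
    except UnicodeDecodeError:
        with open(filename, encoding="latin-1") as handle:
            return handle.readlines()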

@chadell (Collaborator, Author) commented Sep 21, 2022

I will give https://github.com/intentionet/netconan a try. Thanks @jvanderaa!

@scetron (Contributor) left a comment:

Looks good to me.

@glennmatthews linked an issue on Oct 21, 2022 that may be closed by this pull request: Anonymized IP addresses in tests