Add a CI check to validate IP addresses present in tests, and a helper to clean them #165

Open

chadell wants to merge 8 commits into develop
Conversation

@chadell (Collaborator) commented Sep 17, 2022

This PR addresses issue #166 by adding a helper script, anonymize-ip-addresses.py, to be used in CI to detect the presence of IPv4 or IPv6 addresses and to replace them with documentation IPs.

Content

  • New script anonymize-ip-addresses.py to detect IPv4 and IPv6 addresses and replace them with documentation ones (a sketch of this logic follows after this list)
  • Created two invoke tasks: check-anonymize-ips and clean-anonymize-ips
  • Added a first step in the CI to detect the presence of IP addresses in the test data
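
For illustration, here is a minimal sketch of such a detect-and-replace helper, using Python's re and ipaddress modules and an address from the RFC 5737 documentation range 192.0.2.0/24. The names and structure are hypothetical, not lifted from the actual script:

import ipaddress
import re

# Loose IPv4 candidate pattern; real validation is delegated to ipaddress.
IPV4_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def anonymize_ipv4(text: str) -> str:
    """Replace every valid IPv4 address with an RFC 5737 documentation address."""
    def _replace(match):
        try:
            ipaddress.IPv4Address(match.group(0))
        except ipaddress.AddressValueError:
            return match.group(0)  # e.g. "999.1.1.1" is not an address; leave it
        return "192.0.2.1"
    return IPV4_CANDIDATE.sub(_replace, text)

print(anonymize_ipv4("peer 172.16.31.7 is down"))  # peer 192.0.2.1 is down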

@chadell chadell changed the title [WIP] Add a CI check to validate IP address present in tests, and a helper to clean them Add a CI check to validate IP address present in tests, and a helper to clean them Sep 18, 2022
@chadell chadell marked this pull request as ready for review September 18, 2022 17:08
@itdependsnetworks (Contributor)

Just a friendly thought when I think of anonymizing IPs and parsing:

  • When you are parsing, changing the octet size can introduce differences, for example in a fixed-width table
  • When viewing the data, IPs are sometimes related to one another, and anonymizing them all to the same value makes it difficult to correlate them

Does it make sense to consider swapping the first octet as follows:

  • 1.x.x.x for single-digit first octets
  • 10.x.x.x for double-digit first octets
  • 172.x.x.x for triple-digit first octets

and keep everything else the same?
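
A rough sketch of that first-octet swap, purely to illustrate the proposal; the function name and mapping are illustrative, not from the PR:

def swap_first_octet(ip: str) -> str:
    """Swap only the first octet, keyed on its digit count, so the string
    length and every other octet stay exactly as they were."""
    first, rest = ip.split(".", 1)
    return {1: "1", 2: "10", 3: "172"}[len(first)] + "." + rest

print(swap_first_octet("8.8.8.8"))       # 1.8.8.8
print(swap_first_octet("203.0.113.99"))  # 172.0.113.99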

@chadell (Collaborator, Author) commented Sep 19, 2022

“and keep everything else the same?”

The parsing should work the same for any kind of address, so if changing the octet size breaks it, it would be great to catch that error earlier.
So far, across all the test data sets, I have found only one IP address in the actual data (a description field), in the Megaport one, and I don't see a big problem with losing it by replacing it with a generic one.
Maybe I am missing a good point, but so far I think it's easier to do a simple replace to a single reference value for all the IP addresses.

@itdependsnetworks (Contributor)

Case in point:

[screenshot]

Additionally, I have had bad luck with parsing where a column would be based on 15 characters, as an example, and then when the value is shortened it would run into issues such as:

IP               Status
100.100.100.101  active

becoming:

IP               Status
192.2.0.1  active

This creates a state that would never exist in the real world, but can only exist in the testing infra. So what wins? The correct parsing strategy of 15 characters (imagine a real scenario where this is the 4th column and all columns are based on spacing), or the incorrect reality that was created by auto-anonymizing without spatial awareness?

Long and short: parsing is a fickle beast, and I have personally been bitten by this one several times. (See the sketch below for a concrete demonstration.)
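
To make the fixed-width concern concrete, here is a small sketch of a parser that slices a 15-character IP column, with widths assumed from the example above:

def parse_row(row: str) -> tuple[str, str]:
    # Assumed layout: 15-character IP field, 2 spaces, then the Status field.
    return row[:15].strip(), row[17:].strip()

print(parse_row("100.100.100.101  active"))  # ('100.100.100.101', 'active')
print(parse_row("192.2.0.1  active"))        # ('192.2.0.1  acti', '') -- broken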

@glennmatthews (Contributor)

I'm not as concerned as @itdependsnetworks about specifically preserving exact character counts, but I do agree that it would be good to keep some uniqueness across different IPs rather than making all addresses identical. I'd vote for something like just changing the first octet of IPv4 addresses to 10. and changing the first two tokens of IPv6 addresses (do we even have any in our examples?) to 2001:db8:.
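
A naive sketch of that per-prefix rewrite, only to show the intent; note the IPv6 pattern here handles only uncompressed (8-hextet) addresses:

import re

def partially_anonymize(line: str) -> str:
    # IPv4: rewrite only the first octet to 10, keeping the rest unique.
    line = re.sub(r"\b\d{1,3}(?=(?:\.\d{1,3}){3}\b)", "10", line)
    # IPv6 (uncompressed only): rewrite the first two hextets to 2001:db8.
    line = re.sub(
        r"\b[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}(?=(?::[0-9A-Fa-f]{1,4}){6}\b)",
        "2001:db8",
        line,
    )
    return line

print(partially_anonymize("192.168.17.5 and 198.51.100.9"))  # 10.168.17.5 and 10.51.100.9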

report = ""
try:
for line in fileinput.input(filename, inplace=True):
if any(pattern_to_skip in line for pattern_to_skip in PATTERNS_TO_SKIP):
Contributor:

a bit surprising to me that PATTERNS_TO_SKIP is a list of exact substring matches, rather than a list of regex patterns.

Collaborator (Author):

It could be changed to regex if needed at some point. For the current need, the simple substring matching was good enough.
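
For context, the two styles under discussion side by side; the list contents are illustrative, not the PR's actual values:

import re

# Current style: exact substring matches.
PATTERNS_TO_SKIP = ["0.0.0.0", "255.255.255.255"]

def should_skip(line: str) -> bool:
    return any(pattern in line for pattern in PATTERNS_TO_SKIP)

# Regex style, if ever needed: compile once, search per line.
REGEXES_TO_SKIP = [re.compile(r"0\.0\.0\.0"), re.compile(r"\bversion \d+\.\d+\b")]

def should_skip_regex(line: str) -> bool:
    return any(regex.search(line) for regex in REGEXES_TO_SKIP)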

    content = test_file.readlines()
except UnicodeDecodeError as error:
    print(f"Warning: Not able to process {filename}: {error}")
    continue
Contributor:

sys.exit here? Or do we have files that we are expecting to have decoding errors in?

Collaborator (Author):

Yes, the CSV one is not supported for now. We could solve it at some point if needed.
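
If the CSV (or other non-UTF-8) files ever need to be covered, one possible shape for that fix is an encoding fallback; this is a sketch under that assumption, not what the PR does:

def read_lines_tolerant(filename: str) -> list[str]:
    """Try UTF-8 first; fall back to latin-1, which decodes any byte sequence."""
    try:
        with open(filename, encoding="utf-8") as handle:
            return handle.readlines()
    except UnicodeDecodeError:
        with open(filename, encoding="latin-1") as handle:
            return handle.readlines()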

@chadell (Collaborator, Author) commented Sep 21, 2022

I will give https://github.com/intentionet/netconan a try. Thanks @jvanderaa!

@scetron (Contributor) left a comment:

Looks good to me.

@glennmatthews linked an issue on Oct 21, 2022 that may be closed by this pull request: Anonymized IP addresses in tests