Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add client-side data scrubbing to Sentry configuration #383

Merged
merged 2 commits into from
Dec 8, 2023

Conversation

lfdebrux
Copy link
Member

@lfdebrux lfdebrux commented Dec 8, 2023

What problem does this pull request solve?

Trello card: https://trello.com/c/EW20O2Zx

This PR adds code to the Sentry initializer that filters out anything that looks like an email address before it is sent to the Sentry.io servers.

I've tested this locally by using the DSN from the our forms-debugging-localhost-instances project. If you have access to our Sentry installation you can see examples of events with scrubbed data at https://govuk-forms.sentry.io/issues/4706430125/ and https://govuk-forms.sentry.io/issues/4706430136/. Note that you may need to look at the latest event for those issues to see the client-side filtering in action.

Although I have made sure that the replacement text used to mask out sensitive data client-side is different to the one used for server-side scrubbing, the server-side scrubbing is also very diligent, and will filter out values if the key contains the term "email" even when the value doesn't look like an email address. This does mean that in production usage we might have occasions where we're not sure if the client-side filtering got to the email address first or not; we might want to have a think about that further.

This PR also includes automated tests for the filter; as well as testing the logic of the filter itself we test it's integration with Sentry. Writing these tests was pretty hard-going, note that we had to add a bit of test specific logic to the Sentry configuration to make these tests work, as well as be very careful about how we reach into the Sentry code to test the behaviour we're interested in.

The filter is pretty thorough, it uses a regex to find anything that looks like a valid email address anywhere in the Sentry event object. The regex comes from https://www.regular-expressions.info/email.html, and covers 99% of valid email addresses. There is probably a bit of unnecessary cycles being spent here, however as this code should only be invoked when there is an exception, I think that's acceptable. Also, in production Sentry runs in a background thread so the CPU time being used shouldn't affect threads serving users unless the server is already close to capacity.

Things to consider when reviewing

  • Ensure that you consider the wider context.
  • Does it work when run on your machine?
  • Is it clear what the code is doing?
  • Do the commit messages explain why the changes were made?
  • Are there all the unit tests needed?
  • Has all relevant documentation been updated?

Copy link

sonarqubecloud bot commented Dec 8, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
0.0% 0.0% Duplication

Copy link
Contributor

@DavidBiddle DavidBiddle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good, tests run locally 🎉

@lfdebrux lfdebrux added this pull request to the merge queue Dec 8, 2023
Merged via the queue into main with commit a150ae5 Dec 8, 2023
5 checks passed
@lfdebrux lfdebrux deleted the ldeb-sentry-scrub-email-addresses branch December 8, 2023 09:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants