Add client-side data scrubbing to Sentry configuration #383
What problem does this pull request solve?
Trello card: https://trello.com/c/EW20O2Zx
This PR adds code to the Sentry initializer that filters out anything that looks like an email address before it is sent to the Sentry.io servers.
I've tested this locally using the DSN from our forms-debugging-localhost-instances project. If you have access to our Sentry installation you can see examples of events with scrubbed data at https://govuk-forms.sentry.io/issues/4706430125/ and https://govuk-forms.sentry.io/issues/4706430136/. Note that you may need to look at the latest event for those issues to see the client-side filtering in action.
Although I have made sure that the replacement text used to mask sensitive data client-side is different from the one used for server-side scrubbing, the server-side scrubbing is also very diligent, and will filter out values if the key contains the term "email" even when the value doesn't look like an email address. This means that in production we may sometimes be unable to tell whether the client-side filtering got to an email address first; we might want to think about that further.
This PR also includes automated tests for the filter; as well as testing the logic of the filter itself, we test its integration with Sentry. Writing these tests was pretty hard going. Note that we had to add a bit of test-specific logic to the Sentry configuration to make them work, and we had to be very careful about how we reach into the Sentry code to test the behaviour we're interested in.
The filter is pretty thorough: it uses a regex to find anything that looks like a valid email address anywhere in the Sentry event object. The regex comes from https://www.regular-expressions.info/email.html, and covers 99% of valid email addresses. There are probably some unnecessary cycles being spent here; however, as this code should only be invoked when there is an exception, I think that's acceptable. Also, in production Sentry runs in a background thread, so the CPU time being used shouldn't affect threads serving users unless the server is already close to capacity.
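As a rough illustration of the approach described above, here is a minimal sketch of a recursive scrubber in plain Ruby. It is not the code from this PR: the names `EMAIL_PATTERN`, `CLIENT_SIDE_MASK` and `scrub_emails` are hypothetical, the mask text is made up, and the event is modelled as a plain nested hash rather than a real Sentry event object. The regex is the basic pattern from the regular-expressions.info page linked above.

```ruby
# Hypothetical sketch; names and mask text are assumptions, not the PR's code.

# Basic email pattern from regular-expressions.info, case-insensitive.
EMAIL_PATTERN = /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b/i

# Deliberately different from any server-side mask so the two can be told apart.
CLIENT_SIDE_MASK = "[Filtered (client-side)]"

# Walk a nested structure and mask anything that looks like an email address.
# Only values are scrubbed; hash keys are left untouched.
def scrub_emails(value)
  case value
  when String
    value.gsub(EMAIL_PATTERN, CLIENT_SIDE_MASK)
  when Hash
    value.transform_values { |v| scrub_emails(v) }
  when Array
    value.map { |v| scrub_emails(v) }
  else
    value # numbers, nil, booleans etc. pass through unchanged
  end
end

# A stand-in for a serialised event (assumed shape, for illustration only).
event = {
  "message" => "Could not notify jane.doe@example.gov.uk",
  "extra" => { "recipients" => ["a@example.com", 42] },
}

scrubbed = scrub_emails(event)
puts scrubbed.inspect
```

One plausible way to wire something like this in would be sentry-ruby's `before_send` callback, which receives the event before it is transmitted; whether this PR does it that way is not stated here. Because the walk visits every string in the event, the cost grows with event size, which matches the note above about spending some extra cycles.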
Things to consider when reviewing