Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a blacklist for randomCharFromSetCensorStrategy #84

Closed
wants to merge 1 commit into from

Conversation

eltoder
Copy link
Contributor

@eltoder eltoder commented Nov 28, 2024

Fixes #82

Type of change:

  • Refactor
  • Performance improvement
  • New feature
  • Bug fix
  • Other (please describe):

Please describe the changes this PR makes and why it should be merged:

As suggested in #82, make randomCharFromSetCensorStrategy generate a new replacement string if it generated a bad word.

Status:

  • I've added/modified unit tests relevant to my change / not needed
  • This PR contains breaking changes
  • This PR doesn't include changes to the code

Copy link

codecov bot commented Nov 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (36b6512) to head (e7979d3).
Report is 35 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##              main       #84   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           27        26    -1     
  Lines          505       469   -36     
  Branches        92        81   -11     
=========================================
- Hits           505       469   -36     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Owner

@jo3-l jo3-l left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this looks good. I'll merge this along with #78 when I have time to package a new release, probably during the winter holiday—you don't need to do anything with this PR in the meantime. Sorry for any delay, and if you're hitting this in production you can always use this censor strategy directly in your code until the new version is out.

I do want to note in passing that this isn't going to catch cases like @$$$; I'm not sure how big of an issue this is (and to be clear I'm not asking you to address it in this PR—it is fine as is, I am just thinking of possible followups I might want to consider.) That said, the probability of generating such combinations decreases exponentially with length and are less obviously problematic.

@eltoder
Copy link
Contributor Author

eltoder commented Nov 28, 2024

I do want to note in passing that this isn't going to catch cases like @$$$

Good point. I can pass a list of patterns, compile them into one regexp, and match against the generated string. Not sure if I should anchor the regexp or not. If not, everything will match as a substring, so @\$\$ will match @$$$ and anything else that contains that, including say %*@$$&*. If I anchor the regexp, I can use @+\${2,}. Not anchoring regexp is easier, but the probability of generating something that matches it actually increases with the length instead of decreasing. It's probably still very low for realistic swearing.

Sorry for any delay, and if you're hitting this in production you can always use this censor strategy directly in your code until the new version is out.

No problem at all. I don't know if this was even a big deal, but my users reported it. I already changed my app to use keepStartCensorStrategy(asteriskCensorStrategy()). Now I think that even that is overengineering and I should just replace everything with one "🤬". Another option along the same lines is to use unicode grawlix, then hope that no one complains about "🌀⚡⚡".

@jo3-l
Copy link
Owner

jo3-l commented Nov 28, 2024

Good point. I can pass a list of patterns, compile them into one regexp, and match against the generated string.

Hm. I don't know how comfortable I am with this idea: it feels like we're starting to duplicate the work of the main library (though obviously at a much smaller scale—it's possible this is a slippery slope argument on my part), and I wonder if there is a more general solution. On that note, I discussed this issue briefly with some friends elsewhere and one of them proposed having grawlix never generate the same character twice in a row, which incidentally fixes this issue at the cost of even more complexity. I will have to think about that. In any case, though, please don't let me push this work onto you if you don't have the time, especially if the right 'complete' solution is still up in the air. Like I said, this PR is a fine stopgap as is.

Unicode grawlix is a very fun idea, though 😁

@eltoder
Copy link
Contributor Author

eltoder commented Nov 28, 2024

Yes, it's definitely some duplication of the main library. One other option is then to run the main library on the generated replacement in applyTo.

Not generating the same character twice in a row is actually very easy to do as well. Let me know if you like that idea. It's less code than in this PR.

@jo3-l
Copy link
Owner

jo3-l commented Nov 28, 2024

Not generating the same character twice in a row is actually very easy to do as well. Let me know if you like that idea. It's less code than in this PR.

I'm happy to try this out. It would be good to have an error in the degenerate case of 1 input codepoint (if avoiding consecutive identical characters is enabled.)

@jo3-l
Copy link
Owner

jo3-l commented Dec 3, 2024

I'm going to close this in favor of your new PR #85, which generalizes a little better. Thanks for the fruitful discussion.

@jo3-l jo3-l closed this Dec 3, 2024
@eltoder eltoder deleted the feature/grawlix-blacklist branch December 3, 2024 01:08
@eltoder eltoder restored the feature/grawlix-blacklist branch December 3, 2024 01:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: grawlixCensorStrategy sometimes generates "@$$"
2 participants