-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(batch-exports): replace invalid unicode with ? #21174
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, thanks for looking into this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may want to do this for the JSONLines writer down below too.
return orjson.dumps(d, default=str) | ||
try: | ||
return orjson.dumps(d, default=str) | ||
except (UnicodeEncodeError, TypeError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strange that the exception raised doesn't match the docs:
If orjson.dumps() is given a str that does not contain valid UTF-8, orjson.JSONEncodeError is raised. If loads() receives invalid UTF-8, orjson.JSONDecodeError is raised.
However, I verified and it's indeed raising a TypeError
(and obviously the unit test covers it). Wondering if we should be safe and catch orjson.JSONEncodeError
just in case this is a bug that's fixed upstream at some point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I've figured it out: orjson.JSONEncodeError
is an alias of TypeError
(see: https://github.com/ijl/orjson/blob/master/src/typeref.rs#L197, and this issue). So, really, orjson is raising a TypeError
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means that this:
try:
return orjson.dumps(d, default=str)
except orjson.JSONEncodeError:
...
Would also work (since orjson.JSONEncodeError is TypeError
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One (nitty) thing though: UnicodeEncodeError
is not the exception being raised, but the __cause__
, so I would consider dropping that from the except
and keeping just TypeError
or orjson.JSONEncodeError
.
thought: We really need to stop doing JSON unless user explicitly requests JSON in blob storage exports... |
19a9068
to
53a8418
Compare
Problem
orjson
is very strict about Unicode (but doesn't seem to provide a way to do error replacement on its own). Users have hit this in prod.Changes
Catch rare errors, do Unicode error replacement, then use
orjson.dumps
again.👉 Stay up-to-date with PostHog coding conventions for a smoother review.
Does this work well for both Cloud and self-hosted?
Yes.
How did you test this code?
Unit.