Assume symbols are either ASCII or UTF-8 #248
5d60b21 to 6e172e2
Now I'm afraid that someone will be surprised by a sudden EncodingError after upgrading msgpack. It may break users' applications that currently run with a hidden encoding mismatch.
I'll look into it.
6e172e2 to 48b119f
rescue EncodingError
  # If somehow the string wasn't valid UTF-8 not valid ASCII, we fallback
  # to what has been the historical behavior of creating a binary symbol
  data.force_encoding(Encoding::BINARY).to_sym
Here's the solution. I first try what should be the happy path and deal with the error in a rescue. This should basically never happen, so there is no reason the happy path should pay an extra cost for it.
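A minimal sketch of the pattern described above, for illustration only. `bytes_to_symbol` is a hypothetical helper and not the actual msgpack-ruby code; it assumes the payload arrives as a binary string, and the results noted in the comments are those of MRI.

```ruby
# Hypothetical helper, NOT the actual msgpack-ruby implementation.
def bytes_to_symbol(data)
  # Happy path: assume the payload is UTF-8 (ASCII is a subset of it).
  # No up-front validity check, so well-formed input pays no extra cost.
  data.dup.force_encoding(Encoding::UTF_8).to_sym
rescue EncodingError
  # Rare path: the bytes are not valid UTF-8, so fall back to the
  # historical behavior of creating a binary symbol.
  data.dup.force_encoding(Encoding::BINARY).to_sym
end

ascii_sym  = bytes_to_symbol("foo".b)       # ASCII-only bytes
utf8_sym   = bytes_to_symbol("f\u00E0e".b)  # valid UTF-8 bytes
binary_sym = bytes_to_symbol("\xFF".b)      # invalid UTF-8 bytes

puts ascii_sym.encoding
puts utf8_sym.encoding
puts binary_sym.encoding
```

The `dup` calls are only there so the sketch never mutates the caller's string; whether the real code needs them depends on who owns the buffer.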
Thanks for the fix and awesome code comment!
Oh, CI is failing with JRuby.
Hum, seems like a JRuby bug IMO. Somehow it seems to pick up the existing symbol even though the encoding doesn't match. I'll likely submit a PR to https://github.com/ruby/spec for this. I'll check how it currently behaves on master and try to keep that same behavior.
Right now only US-ASCII symbols are properly preserved. Symbols in any other encoding come back with `ASCII-8BIT` (binary) encoding. After this change UTF-8 symbols are properly preserved as well. Other encodings cause an EncodingError. Since UTF-8 is the only encoding handled by msgpack strings, I think this change makes sense.
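To make the encoding cases above concrete, here is a small plain-Ruby illustration (no msgpack involved, and not taken from the PR). It shows the encodings Ruby assigns to symbols and why the raw bytes of an ISO-8859-1 symbol cannot be turned into a UTF-8 symbol. The variable names are made up for the example.

```ruby
# Illustration only: plain Ruby, no msgpack involved.
ascii  = :foo                                             # ASCII-only symbol
utf8   = :"f\u00E0e"                                      # UTF-8 symbol
latin1 = "f\u00E0e".encode(Encoding::ISO_8859_1).to_sym   # ISO-8859-1 symbol

encodings = [ascii, utf8, latin1].map { |sym| sym.encoding }

# Take the raw bytes of the ISO-8859-1 symbol and reinterpret them as UTF-8,
# which is what a UTF-8-only wire format would do on the reading side.
reinterpreted = latin1.to_s.b.force_encoding(Encoding::UTF_8)
valid = reinterpreted.valid_encoding?

# Converting the invalid string to a symbol is where EncodingError comes from.
error = begin
  reinterpreted.to_sym
  nil
rescue EncodingError => e
  e
end

puts encodings.inspect
puts valid
puts error.class
```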
48b119f to 37949d2
if IS_JRUBY
  # JRuby doesn't quite behave like MRI here.
  # "fàe".force_encoding(Encoding::BINARY).to_sym is able to lookup the existing ISO-8859-1 symbol
  # It likely is a JRuby bug.
  expect(roundtrip(symbol).encoding).to be Encoding::ISO_8859_1
Done.
Ruby spec for this: ruby/spec#919, so it's likely that future versions of JRuby will have this fixed.
I just noticed there is a small bug with symbols used as hash keys:
it 'works with hash keys' do
  expect(roundtrip(symbol: 1)).to be == { symbol: 1 }
end
This is because Ruby freezes strings used as hash keys. I'll submit a fix as soon as possible.
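The Ruby behavior mentioned above can be reproduced in isolation. This sketch only demonstrates that a String used as a Hash key is stored as a frozen copy, so an in-place `force_encoding` on that key raises; it does not reproduce the actual msgpack code path, and the variable names are hypothetical.

```ruby
# Illustration only: shows the Hash key freezing behavior, not msgpack internals.
key = +"symbol"          # unary plus gives an unfrozen String
hash = {}
hash[key] = 1            # Hash#[]= stores a frozen duplicate of a String key

stored = hash.keys.first

raised = begin
  # Changing the encoding in place is a mutation, so it fails on a frozen key.
  stored.force_encoding(Encoding::BINARY)
  false
rescue FrozenError
  true
end

# Working on a duplicate avoids the error.
sym = stored.dup.force_encoding(Encoding::UTF_8).to_sym

puts stored.frozen?
puts key.frozen?
puts raised
puts sym.inspect
```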
Fix: msgpack#248 (comment) Extensions might not expect it, and it's unlikely that they'll be kept as is as binary strings anyway.
@casperisfine I see this was already merged prior to the last fixes. For me this fixed an issue with special characters in YAML files, mentioned here: ruby-i18n/i18n#606 @tagomoris They are waiting for a new release of msgpack. Do you think a new version could be released soon?
@Laykou My original plan was to release a new version including many changes (not merged yet), but yes, I can make a release with enough fixes for symbols.
Shipped
This is a fresh take on #211