Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assume symbols are either ASCII or UTF-8 #248

Merged
merged 1 commit into from
Feb 10, 2022

Conversation

casperisfine
Copy link

This is a fresh take on #211

Right now only US-ASCII are properly preserved. Any other encoding comes back with ASCII-8BIT (binary) encoding.

After this change UTF-8 symbols are properly preserved as well. Other encoding cause an EncodingError. Since UTF-8 is the only encoding handled by msgpack strings, I think this change makes sense.

spec/spec_helper.rb Outdated Show resolved Hide resolved
@casperisfine casperisfine force-pushed the symbols-assume-utf8 branch 2 times, most recently from 5d60b21 to 6e172e2 Compare February 8, 2022 13:17
@tagomoris
Copy link
Member

Now I'm afraid that someone will accidentally be surprised by sudden EncodingError after upgrading msgpack. It may break users' applications that are running now (with hidden encoding mismatch currently).
Can't we avoid raising errors at deserialization?

@casperisfine
Copy link
Author

I'll look into it.

Comment on lines +21 to +24
rescue EncodingError
# If somehow the string wasn't valid UTF-8 not valid ASCII, we fallback
# to what has been the historical behavior of creating a binary symbol
data.force_encoding(Encoding::BINARY).to_sym
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the solution. I first try what should be the happy path, and deal with the error in a rescue as this should basically never happen ever, so no reason the happy path should pay an extra cost for it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix and awesome code comment!

@tagomoris
Copy link
Member

Oh, CI is failing with JRuby.

@casperisfine
Copy link
Author

Hum, seems like a JRuby bug IMO. Somehow it seems to picks up the existing symbol even though the encoding doesn't match.

I'll likely submit a PR to https://github.com/ruby/spec for this.

I'll check how it currently behaves on master and try to keep that same behavior.

Right now only US-ASCII are properly preserved. Any other encoding
comes back with `ASCII-8BIT` (binary) encoding.

After this change UTF-8 symbols are properly preserved as well.
Other encoding cause an EncodingError. Since UTF-8 is the only encoding
handled by msgpack strings, I think this change makes sense.
Comment on lines +337 to +341
if IS_JRUBY
# JRuby doesn't quite behave like MRI here.
# "fàe".force_encoding(Encoding::BINARY).to_sym is able to lookup the existing ISO-8859-1 symbol
# It likely is a JRuby bug.
expect(roundtrip(symbol).encoding).to be Encoding::ISO_8859_1
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RUby Spec for this ruby/spec#919, so it's likely that future versions of JRuby will have this fixed.

@tagomoris tagomoris merged commit a6148c7 into msgpack:master Feb 10, 2022
@casperisfine
Copy link
Author

I just noticed there is a small bug with symbols used as hash keys:

        it 'works with hash keys' do
          expect(roundtrip(symbol: 1)).to be == { symbol: 1 }
        end

  1) MessagePack::Factory the special treatment of symbols with ext type if an ext type is registered for symbols if using the default serializer works with hash keys
     Failure/Error: data.force_encoding(Encoding::UTF_8).to_sym
     
     FrozenError:
       can't modify frozen String: "symbol"
     # ./lib/msgpack/symbol.rb:20:in `force_encoding'
     # ./lib/msgpack/symbol.rb:20:in `from_msgpack_ext'
     # ./lib/msgpack/factory.rb:70:in `call'
     # ./lib/msgpack/factory.rb:70:in `full_unpack'
     # ./lib/msgpack/factory.rb:70:in `load'
     # ./spec/factory_spec.rb:290:in `roundtrip'
     # ./spec/factory_spec.rb:337:in `block (5 levels) in <top (required)>'

This is because Ruby freeze strings used as hash keys. I'll submit a fix as soon as possible.

casperisfine pushed a commit to Shopify/msgpack-ruby that referenced this pull request Feb 10, 2022
Fix: msgpack#248 (comment)

Extensions might not expect it, and it's unlikely that they'll be kept
as is as binary strings anyway.
casperisfine pushed a commit to Shopify/msgpack-ruby that referenced this pull request Feb 10, 2022
Fix: msgpack#248 (comment)

Extensions might not expect it, and it's unlikely that they'll be kept
as is as binary strings anyway.
casperisfine pushed a commit to Shopify/msgpack-ruby that referenced this pull request Feb 10, 2022
Fix: msgpack#248 (comment)

Extensions might not expect it, and it's unlikely that they'll be kept
as is as binary strings anyway.
casperisfine pushed a commit to Shopify/msgpack-ruby that referenced this pull request Feb 10, 2022
Fix: msgpack#248 (comment)

Extensions might not expect it, and it's unlikely that they'll be kept
as is as binary strings anyway.
@Laykou
Copy link

Laykou commented Feb 15, 2022

@casperisfine I see this was already merged prior the last fixes.

For me this fixed an issue with special characters in YAML files, mentioned here: ruby-i18n/i18n#606

@tagomoris They are waiting for new version release for mgspack. Do you think new version could be released soon?

@tagomoris
Copy link
Member

@Laykou My original plan is to release a new version including many changes (not merged yet), but yes, I can make a release with enough fixes for symbols.
I think I'll be able to find time for it today or in 2-3 days.

@tagomoris
Copy link
Member

Shipped v1.4.5 🎉

tagomoris pushed a commit that referenced this pull request Feb 22, 2022
Fix: #248 (comment)

Extensions might not expect it, and it's unlikely that they'll be kept
as is as binary strings anyway.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants