Wrong result when parsing escaped unicode characters #120

Open
m1dnight opened this issue Apr 25, 2023 · 0 comments
I tried parsing some Excel files that contain newlines, and encountered some errors while parsing the file.

Input

If a cell in an Excel sheet contains a newline, that Unicode value is not allowed in the standard, so Excel stores it as _x000D_.
An underscore is also not allowed, and that one is encoded as _x005F_.
This means that a carriage return is encoded as _x005F_x000D_.
A document with a newline is properly parsed by the library.

I do have an Excel file (that I cannot share) that is parsed incorrectly. This might be because it was created by an older version of Excel: as soon as I open and save it with my Excel version, it parses fine.

When a cell contains the literal string _x000D_ it is parsed as _x005F_x000D_.

Guess

I have yet to find a specific reason why this happens, but I found this, which states that some Unicode characters are not allowed in XML 1.0 and are therefore escaped as _xHHHH_. The underscore in the prefix is also escaped, as _x005F_, which results in the entire string being represented as _x005F_x000D_.

This link tells us that CR is escaped and that the first underscore of its escaped representation is also escaped, leading a CR to be represented as _x005F_x000D_.
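
To make the escaping easier to reason about, here is a minimal Python sketch of a decoder. It is not the library's code. It assumes one possible reading of the convention: every `_xHHHH_` sequence stands for the character with that hex code point, and `_x005F_` stands for a literal underscore. The function name `decode_ooxml_escapes` is hypothetical.

```python
import re

# Assumption: an escape is an underscore, "x", exactly four hex digits,
# and a closing underscore, e.g. "_x000D_" for a carriage return.
ESCAPE_RE = re.compile(r"_x([0-9A-Fa-f]{4})_")


def decode_ooxml_escapes(value: str) -> str:
    """Decode _xHHHH_ sequences in a single left-to-right pass.

    Matches do not overlap, so in "_x005F_x000D_" only "_x005F_" is
    consumed; the remaining "x000D_" has no leading underscore left
    and is kept as plain text.
    """
    return ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), value)


if __name__ == "__main__":
    print(repr(decode_ooxml_escapes("_x000D_")))
    print(repr(decode_ooxml_escapes("_x005F_x000D_")))
```

Under that assumption, `_x000D_` decodes to a carriage return, while `_x005F_x000D_` decodes to the literal text `_x000D_`. If the parser hands back the raw stored string instead, that would be consistent with the literal string coming out as `_x005F_x000D_`.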

Proof

I have a test case in this commit (m1dnight@c335061) that shows the behavior.

I'm not sure, though, whether this is a bug in SAX.

Any ideas on how to proceed?
