I tried parsing some Excel files that contain newlines, and encountered some errors while parsing the file.
Input
If a cell in an Excel sheet contains a newline, that Unicode value is not allowed in the standard. Therefore Excel stores it as _x000D_.
An underscore is also not allowed, and it is encoded as _x005F_.
This means that a carriage return is encoded as _x005F_x000D_.
A document with a newline is properly parsed by the library.
I do have an Excel file (which I cannot share) that is parsed incorrectly. This might be because an older version of Excel created the file: as soon as I open and save it with my Excel version, it works fine.
When a cell contains the literal string _x000D_ it is parsed as _x005F_x000D_.
Guess
I have yet to find a specific reason why this happens, but I found this, which states that some Unicode characters are not allowed in XML 1.0 and are therefore escaped as _xHHHH_. The underscore in the prefix is also escaped, as _x005F_, which results in the entire string being represented as _x005F_x000D_.
This link tells us that CR is escaped and that the first underscore of its escaped representation is also escaped, leading a CR to be represented as _x005F_x000D_.
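To make the escaping concrete, here is a minimal Python sketch of how such sequences could be decoded and encoded. It is not the library's code. It assumes the convention I understand from the Office Open XML documentation: `_xHHHH_` stands for the character with that hex code, and `_x005F_` stands for a literal underscore that protects a following escape. The function names are hypothetical. If the problem file behaves differently, the sketch would need adjusting.

```python
import re

# One Excel-style escape: underscore, 'x', four hex digits, underscore.
_ESCAPE = re.compile(r"_x([0-9A-Fa-f]{4})_")

# Control characters that XML 1.0 does not allow, plus CR.
# Tab (0x09) and LF (0x0A) are left alone. This set is an assumption.
_NEEDS_ESCAPE = re.compile(r"[\x00-\x08\x0B-\x1F]")

# An underscore that would otherwise be read as the start of an escape.
_AMBIGUOUS_UNDERSCORE = re.compile(r"_(?=x[0-9A-Fa-f]{4}_)")


def decode_excel_escapes(text: str) -> str:
    """Replace each _xHHHH_ with its character in one left-to-right pass.

    re.sub does not rescan replaced text, so the underscore produced
    by _x005F_ cannot start a new escape.
    """
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


def encode_excel_escapes(text: str) -> str:
    """Hypothetical inverse of decode_excel_escapes."""
    # First protect literal text that looks like an escape sequence.
    protected = _AMBIGUOUS_UNDERSCORE.sub("_x005F_", text)
    # Then escape the control characters themselves.
    return _NEEDS_ESCAPE.sub(lambda m: f"_x{ord(m.group()):04X}_", protected)


if __name__ == "__main__":
    print(repr(decode_excel_escapes("line1_x000D_line2")))  # 'line1\rline2'
    print(repr(decode_excel_escapes("_x005F_x000D_")))      # '_x000D_'
    print(repr(encode_excel_escapes("_x000D_")))            # '_x005F_x000D_'
```

Under these assumptions, `_x000D_` decodes to a carriage return and `_x005F_x000D_` decodes to the literal text `_x000D_`, so comparing the parser's output against both cases may help narrow down where the problem file differs.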
Proof
I have a test case in this commit (m1dnight@c335061) that shows the behavior.
I'm not sure, though, whether this is a bug in SAX or not.
Any ideas on how to proceed?