"csv" files with UTF-8 with BOM encoding are returning the first column with quotes #47
Comments
Just to make sure: you mean that CSV content that has a UTF-8 BOM (3 bytes) will cause the first header name to be reported incorrectly? Could you share a bit of code to show how you are accessing the field name? |
I'm actually using direct mapping to a POJO object:
CsvMapper mapper = new CsvMapper();
mapper.setSerializationInclusion(Include.NON_NULL)
.enable(MapperFeature.AUTO_DETECT_GETTERS)
.enable(MapperFeature.AUTO_DETECT_IS_GETTERS)
.enable(MapperFeature.AUTO_DETECT_SETTERS)
.disable(MapperFeature.AUTO_DETECT_FIELDS)
.disable(SerializationFeature.WRITE_DATE_KEYS_AS_TIMESTAMPS)
.disable(SerializationFeature.FAIL_ON_EMPTY_BEANS)
.setPropertyNamingStrategy(PropertyNamingStrategy.PASCAL_CASE_TO_CAMEL_CASE)
.disable(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES);
CsvSchema schema = CsvSchema.emptySchema().withHeader();
File file = new File("{CSV file UTF-8 with BOM}");
if(!file.exists())
System.out.println("File doesn't exist.");
MappingIterator<MyObject> objects = mapper.reader(MyObject.class)
.with(schema)
.readValues(file);
I know this because I debugged my application, went through most of your source code, and ended up finding the BOM char in CsvReader.java, around line 559. (I was also wondering if there is an option to specify the file encoding; I think that could avoid this problem.) |
I don't offer a way to specify encoding because it's safer to just require use of So: I just want to know the exact BOM bytes in use -- there are kinds of broken content where what looks like a BOM is not a valid one. Is it possible to share the File, or at least the first couple of bytes? It should be easy enough to figure out the problem with that. Thanks again for reporting the problem. |
I can only share an example file, but I tested it and I found the same. I found that you can replicate this by editing a .csv file with Notepad++. I hope it was helpful.
2014-07-08 21:02 GMT+01:00 Tatu Saloranta [email protected]:
Best regards, |
If you could just list the first couple of bytes of the file -- the BOM, and a couple of bytes of the CSV itself. I just want to make 100% sure I use the exact same setup, and it is quite easy to get different files, as different tools have different capabilities wrt detection and handling of BOMs. |
Very well, I understand. 2014-07-09 20:52 GMT+01:00 Tatu Saloranta [email protected]:
Best regards, |
Looks like I can reproduce this easily: one char is prepended. It can be anywhere from 1 to 3 bytes, as the resulting char, 0xFEFF, is the "illegal character" marker. |
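For readers following along, the prepended char can be reproduced with the JDK alone. The sketch below is not the CsvReader code. It assumes the file starts with the standard UTF-8 BOM bytes (EF BB BF) and uses a naive split on the first comma in place of real CSV parsing. The class and method names (`BomDemo`, `withUtf8Bom`, `firstHeader`) are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    /** Prepends the three UTF-8 BOM bytes (EF BB BF) to the UTF-8 encoding of text. */
    static byte[] withUtf8Bom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[body.length + 3];
        out[0] = (byte) 0xEF;
        out[1] = (byte) 0xBB;
        out[2] = (byte) 0xBF;
        System.arraycopy(body, 0, out, 3, body.length);
        return out;
    }

    /** Decodes the bytes as UTF-8 and returns everything before the first comma. */
    static String firstHeader(byte[] csvBytes) {
        String text = new String(csvBytes, StandardCharsets.UTF_8);
        int comma = text.indexOf(',');
        return comma < 0 ? text : text.substring(0, comma);
    }

    public static void main(String[] args) {
        String header = firstHeader(withUtf8Bom("\"ID\",\"Name\",\"Age\"\n"));
        // The JDK's UTF-8 decoder keeps the BOM as an ordinary char (U+FEFF).
        System.out.println(Integer.toHexString(header.charAt(0))); // feff
        System.out.println(header.length()); // 5: BOM + "ID" with its two quotes
    }
}
```

Because the first char of the line is then U+FEFF rather than a double quote, it seems plausible that a parser would treat the first field as unquoted and keep the quotes as part of the value, which would match the reported `""ID""`.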
Interesting. So, |
I know the issue is already closed, but I found quite a neat solution for this problem: the Apache Commons IO library has a decorator class named BOMInputStream that wraps an InputStream and skips the BOM bytes. I tested it, and it works fine with the CsvMapper. |
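Where adding Commons IO is not an option, the same decorator idea can be sketched with the JDK only. This is a stand-in for BOMInputStream, not its implementation. It handles only the UTF-8 BOM (EF BB BF), and the class and method names (`Utf8BomSkipper`, `skipUtf8Bom`) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

public class Utf8BomSkipper {
    /** Returns a stream positioned just after a leading UTF-8 BOM, if one is present. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // Read up to three bytes; a single read() call may return fewer.
        while (n < 3) {
            int r = pushback.read(head, n, 3 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean isBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        // Not a BOM: push the bytes back so the caller sees the content unchanged.
        if (!isBom && n > 0) {
            pushback.unread(head, 0, n);
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '"', 'I', 'D', '"'};
        InputStream clean = skipUtf8Bom(new ByteArrayInputStream(data));
        // The first byte read is now the opening quote of the header.
        System.out.println((char) clean.read());
    }
}
```

The returned stream could then be passed to the reader in place of the File, for example via `readValues(InputStream)`. That wiring is an assumption about the setup shown earlier in the thread and was not tested here.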
Thanks, that should be useful for the general problem, and is good to know of. |
Thanks, this helped me out; my first column was always null. |
Take this CSV header as an example:
"ID", "Name", "Age"
(...)
When the file is encoded as UTF-8 with BOM, the CsvReader returns the first column name with its quotes included, as the string ""ID"".