"csv" files with UTF-8 with BOM encoding are returning the first column with quotes #47

andrealexandre · 2014-07-08T18:29:17Z

Take this CSV header as an example:

"ID", "Name", "Age"
(...)

When the file is encoded as UTF-8 with BOM, the CsvReader returns the first column name as the following string: ""ID"".

cowtowncoder · 2014-07-08T18:45:14Z

Just to make sure: you mean that CSV content that has a UTF-8 BOM (3 bytes) will cause the first header name to be reported incorrectly? Could you share a bit of code to show how you are accessing the field name?

andrealexandre · 2014-07-08T19:00:29Z

I'm actually using direct mapping to a POJO.

CsvMapper mapper = new CsvMapper();

mapper.setSerializationInclusion(Include.NON_NULL)
            .enable(MapperFeature.AUTO_DETECT_GETTERS)
            .enable(MapperFeature.AUTO_DETECT_IS_GETTERS)
            .enable(MapperFeature.AUTO_DETECT_SETTERS)
            .disable(MapperFeature.AUTO_DETECT_FIELDS)
            .disable(SerializationFeature.WRITE_DATE_KEYS_AS_TIMESTAMPS)
            .disable(SerializationFeature.FAIL_ON_EMPTY_BEANS)
            .setPropertyNamingStrategy(PropertyNamingStrategy.PASCAL_CASE_TO_CAMEL_CASE)
            .disable(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES);

CsvSchema schema = CsvSchema.emptySchema().withHeader();

File file = new File("{CSV file UTF-8 with BOM}");

if(!file.exists())
    System.out.println("File doesn't exist.");

MappingIterator<MyObject> objects = mapper.reader(MyObject.class)
                                      .with(schema)
                                      .readValues(file);

I know this because I debugged my application, went through most of your source code, and ended up finding the BOM char in CsvReader.java, around line 559.
You can also find the char inside "_inputBuffer", right at position 0.
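
For readers following along, here is a minimal sketch (plain JDK, not the library's code) of why a BOM char can end up at position 0 of a decoded buffer: Java's UTF-8 decoder does not strip the BOM, so the three bytes EF BB BF come out as the single char U+FEFF in front of the opening quote. The class and method names are made up for illustration, and that this is what trips up the quote handling is an assumption.

```java
import java.nio.charset.StandardCharsets;

final class BomDecodeDemo {
    /** Decodes raw bytes as UTF-8, the same way a plain UTF-8 reader would. */
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // UTF-8 BOM (EF BB BF) followed by the start of the header line: "ID",
        byte[] raw = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '"', 'I', 'D', '"', ','};
        String decoded = decodeUtf8(raw);

        // The BOM survives decoding as one char, so the quote is no longer first.
        System.out.printf("first char: U+%04X%n", (int) decoded.charAt(0)); // U+FEFF
        System.out.println("starts with a quote: " + decoded.startsWith("\"")); // false
    }
}
```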

(I was also wondering if there is an option to specify the file encoding; I think that could avoid this problem.)

cowtowncoder · 2014-07-08T20:02:18Z

I don't offer a way to specify encoding because it's safer to just require use of java.io.Reader, if encoding is already known. With CSV things are more difficult wrt auto-detection (since there's no well-known start sequence), but it should be relatively easy to fix BOM handling. It's just not properly tested I think.
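
As a rough sketch of the java.io.Reader route when the encoding is already known to be UTF-8: the caller builds the Reader with an explicit charset and drops a leading U+FEFF before handing it to the mapper. `Utf8Readers` and `withoutBom` are hypothetical names, not part of any library; the resulting Reader would then be passed to `readValues(...)` in place of the File.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

final class Utf8Readers {
    /** Wraps the stream in a UTF-8 Reader and skips a leading U+FEFF, if any. */
    static Reader withoutBom(InputStream in) throws IOException {
        PushbackReader reader =
                new PushbackReader(new InputStreamReader(in, StandardCharsets.UTF_8), 1);
        int first = reader.read();
        // Put the char back unless it is the BOM (or the stream is empty).
        if (first != -1 && first != 0xFEFF) {
            reader.unread(first);
        }
        return reader;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for a CSV file saved as UTF-8 with BOM.
        byte[] raw = "\uFEFF\"ID\",\"Name\",\"Age\"\n".getBytes(StandardCharsets.UTF_8);
        try (Reader reader = withoutBom(new ByteArrayInputStream(raw))) {
            System.out.println((char) reader.read()); // prints the opening quote: "
        }
    }
}
```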

So: I just want to know the exact BOM bytes in use -- there are kinds of broken content where what looks like a BOM is not a valid one.

Is it possible to share the File, or at least first couple of bytes? It should be easy enough to figure out the problem with that.
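
One way to list the first couple of bytes without sharing the whole file is a small hex dump. This is only a sketch using the plain JDK; `HeadBytes` and `hexHead` are hypothetical names, and the sample bytes stand in for a real file when no path is given.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

final class HeadBytes {
    /** Returns up to {@code count} leading bytes of the stream as spaced hex pairs. */
    static String hexHead(InputStream in, int count) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            int b = in.read();
            if (b == -1) {
                break; // stream is shorter than requested
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Sample data: UTF-8 BOM followed by "ID" (quotes included).
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '"', 'I', 'D', '"'};
        try (InputStream in = args.length > 0
                ? Files.newInputStream(Paths.get(args[0]))
                : new ByteArrayInputStream(sample)) {
            System.out.println(hexHead(in, 8)); // sample prints: EF BB BF 22 49 44 22
        }
    }
}
```

A file that starts with `EF BB BF` carries the standard UTF-8 BOM; anything else in that position points to a different encoding or to broken content.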

Thanks again for reporting the problem.

andrealexandre · 2014-07-08T20:15:21Z

I can only share an example file, but I tested it and found the same problem with it.

I found that you can reproduce this by editing a .csv file with Notepad++ and converting its encoding to UTF-8.

I hope this is helpful.

cowtowncoder · 2014-07-09T19:52:08Z

If you could just list the first couple of bytes of the file -- the BOM, and a couple of bytes of the CSV itself. I just want to make 100% sure I use the exact same setup, and it is quite easy to get different files, as different tools have different capabilities wrt detection and handling of BOMs.

andrealexandre · 2014-07-09T21:13:07Z

Very well, I understand.
I sent the first couple of bytes from the file I used.

cowtowncoder · 2014-07-17T05:01:34Z

Looks like I can reproduce this easily: that one char is prepended. It can be anywhere from 1 to 3 bytes, as the resulting char, 0xFEFF, is the "illegal character" marker.

cowtowncoder · 2014-07-17T05:11:54Z

Interesting. So, CsvParserBootstrapper seems like it should work. But I hadn't connected that to CsvFactory... which is why BOM is simply ignored, it seems.

andrealexandre · 2014-07-29T16:29:45Z

I know the issue is already closed, but I found quite a neat solution for this problem: the Apache Commons IO library has a decorator class named BOMInputStream that wraps an InputStream and skips the BOM bytes. I tested it, and it works fine with the CsvMapper.
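
For anyone who cannot add Commons IO, the same decorator idea can be sketched with the JDK alone. This is not BOMInputStream itself, just an illustration of skipping the three UTF-8 BOM bytes at the stream level; `BomSkipper` and `skipUtf8Bom` are hypothetical names. With Commons IO on the classpath, wrapping the stream as `new BOMInputStream(in)` should play the same role.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

final class BomSkipper {
    /** Returns a stream positioned just past a leading UTF-8 BOM (EF BB BF), if present. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than asked for, so loop until 3 or EOF.
        while (n < 3) {
            int r = pushback.read(head, n, 3 - n);
            if (r == -1) {
                break;
            }
            n += r;
        }
        boolean isBom = n == 3
                && head[0] == (byte) 0xEF
                && head[1] == (byte) 0xBB
                && head[2] == (byte) 0xBF;
        if (!isBom && n > 0) {
            pushback.unread(head, 0, n); // not a BOM: give the bytes back
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '"', 'I', 'D', '"'};
        try (InputStream in = skipUtf8Bom(new ByteArrayInputStream(raw))) {
            System.out.println((char) in.read()); // prints the opening quote: "
        }
    }
}
```

The wrapped stream can then be given to the mapper instead of the File, so the parser never sees the BOM.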

cowtowncoder · 2014-07-29T16:37:10Z

Thanks, that should be useful for general problem, and good to know of.

tom999 · 2018-01-16T13:22:16Z

Thanks, this helped me out; my first column was always null.

cowtowncoder added a commit that referenced this issue Jul 17, 2014

Add a test for #47

44cac10

cowtowncoder closed this as completed in 4e2b89f Jul 17, 2014

cowtowncoder added a commit that referenced this issue Jul 17, 2014

Backport #47 fix for 2.3.4

bdd56bc

cowtowncoder added this to the 2.3.4 milestone Jul 17, 2014

cowtowncoder mentioned this issue Jul 18, 2014

Use CsvParserBootstrapper for automatic encoding detection #43

Closed
