MappingIterator does not clear current token after exception #52
Ah sorry, the defective line must also be consumed... please see the corrected code:

```java
MappingIterator<LineDTO> values = csvReader.readValues(importStream);
List<LineDTO> lines = new ArrayList<>();
while (values.hasNextValue()) {
    try {
        LineDTO line = values.nextValue();
        lines.add(line);
    } catch (Exception e) {
        e.printStackTrace();
        // Bug in Jackson: the current token is not cleared in the error case,
        // so we clear it ourselves and also consume the defective line.
        values.getParser().clearCurrentToken();
        LineDTO defectLine = values.nextValue();
        System.out.println(defectLine);
    }
}
```
This area (error handling) is something I think does need improving, and I have tried to think of proper ways to deal with that. So far I am not yet sure what to do, partly because there are problems that can occur both at the low streaming API level and at the data-binding level. One possibility could be to have a new method that would try to "skip tokens until END_OBJECT" (or possibly a CSV-specific one to skip until the end of the input line). That might be something that could be linked to ...
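(There is no such skip method in the Jackson versions discussed here; purely as a rough sketch of the idea, a caller-side helper could look something like the following. The method name, placement, and exact semantics are assumptions, not an existing API.)

```java
import java.io.IOException;

import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

// Hypothetical sketch only: after a failed read, skip the remaining tokens of the
// value that caused the problem, so the next read starts on a fresh value.
public final class ParserRecovery {

    static void skipToEndOfValue(JsonParser p) throws IOException {
        int depth = 0;
        JsonToken t = p.getCurrentToken();
        while (t != null) {
            if (t == JsonToken.START_OBJECT || t == JsonToken.START_ARRAY) {
                depth++;
            } else if (t == JsonToken.END_OBJECT || t == JsonToken.END_ARRAY) {
                depth--;
                if (depth <= 0) {
                    p.clearCurrentToken(); // leave the parser ready for the next value
                    return;
                }
            }
            t = p.nextToken();
        }
    }
}
```

For the CSV backend, where each input line is exposed as a flat object, reaching the row's END_OBJECT this way would effectively skip the rest of the defective line.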
Thanks for the reply ... I hope I can use my workaround for my use case. It seems to work, so I'm more or less satisfied with the code above ;-) But please note that the try/catch block can cause an endless loop if written as in the following code example. This should be fixed, I think, because a developer would not expect an endless loop from this perfectly valid try/catch block. Maybe you could let "hasNextValue" return false after an exception occurs in "nextValue", or something like that ;-)

```java
MappingIterator<LineDTO> values = csvReader.readValues(importStream);
List<LineDTO> lines = new ArrayList<>();
while (values.hasNextValue()) {
    try {
        LineDTO line = values.nextValue();
        lines.add(line);
    } catch (Exception e) {
        // nextValue() throws again on every iteration because the current token
        // is never cleared, so hasNextValue() keeps returning true: endless loop.
        e.printStackTrace();
    }
}
```
@flappingeagle Good point -- although it is not guaranteed that the state is valid after an exception, it would be good if it were, and I think we should try to work towards that as much as possible. Do you happen to have a small snippet of code (a POJO declaration, maybe?) and matching JSON, to reproduce the specific problem you have? That would be helpful in verifying that at least that case is handled in a safer manner.
Well yes, here I have a JUnit test case which tests whether all correct lines in a CSV file have been mapped. Incorrect lines should not abort the read, but should simply be ignored. Note: at the moment this test runs into the "endless loop" problem explained above, which should be fixed ;-)

ProductPojo.java

```java
import com.fasterxml.jackson.annotation.JsonPropertyOrder;

@JsonPropertyOrder({
"id",
"name",
"amount",
"price"
})
public class ProductPojo {
private Integer id;
private String name;
private Integer amount;
private Double price;
public Integer getId() {
return id;
}
public void setId(Integer id) {
this.id = id;
}
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
public Integer getAmount() {
return amount;
}
public void setAmount(Integer amount) {
this.amount = amount;
}
public Double getPrice() {
return price;
}
public void setPrice(Double price) {
this.price = price;
}
}
```

ParseTest.java

```java
import static org.junit.Assert.assertTrue;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;

import org.junit.After;
import org.junit.Before;
import org.junit.Test;

import com.fasterxml.jackson.databind.MappingIterator;
import com.fasterxml.jackson.databind.ObjectReader;
import com.fasterxml.jackson.dataformat.csv.CsvMapper;
import com.fasterxml.jackson.dataformat.csv.CsvSchema;

public class ParseTest {

    private final String testPath = new File("src/test/resources").getAbsolutePath();
    @Before
    public void setUp() throws Exception {
    }

    @After
    public void tearDown() throws Exception {
    }

    @Test
    public void testParse() throws FileNotFoundException {
        CsvMapper mapper = new CsvMapper();
        CsvSchema schema = mapper.schemaFor(ProductPojo.class);
        schema = schema.withColumnSeparator(';');
        schema = schema.withQuoteChar('"');
        schema = schema.withUseHeader(true);
        ObjectReader csvReader = mapper.reader(ProductPojo.class).with(schema);
        FileInputStream fis = new FileInputStream(testPath
                + File.separator + "test.csv");
        List<ProductPojo> lines = new ArrayList<>();
        MappingIterator<ProductPojo> values = null;
        try {
            values = csvReader.readValues(fis);
            while (values.hasNextValue()) {
                try {
                    ProductPojo line = values.nextValue();
                    lines.add(line);
                } catch (Exception e) {
                    int line = values.getParser().getCurrentLocation().getLineNr();
                    System.out.println("error at line: " + line + " --> ignoring this line, "
                            + "exception: " + e.getMessage());
                }
            }
        } catch (Exception e) {
            // hard exception while going to next value.
            String line = null;
            if (values != null) {
                line = String.valueOf(values.getParser().getCurrentLocation().getLineNr());
            }
            System.out.println("unrecoverable exception at line: " + line + ", exception: "
                    + e.getMessage());
        }
        // there are 3 correct lines in the file, which should be read.
        assertTrue(lines.size() == 3);
    }
}
```

test.csv (CSV file)
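The file content itself is not preserved in this extract. A minimal illustrative file consistent with the schema above (semicolon separator, quote character ", header row, three mappable rows plus one row whose amount cannot be bound to an Integer) could look like the following; this is hypothetical data, not the original attachment:

```
id;name;amount;price
1;"Apple";10;1.50
2;"Banana";5;0.99
3;"Cherry";1.280;2.49
4;"Date";7;3.10
```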
Come to think of this, it probably requires a CSV-specific ...
My personal opinion would be that a format parser (XML, CSV, JSON) should only abort parsing the given data if the data does not correspond to the format (XML, CSV, JSON). In XML it is easy to find data that does not correspond to XML. The following example is XML, but as soon as the parser reaches the closing </id> tag, which was never opened, it has to abort:

```xml
<list>
  <item>
  </id>
</list>
```

But all mapping problems that occur while mapping incoming data to the given Java POJOs should (in my personal opinion) never abort the reading of the given data; instead, fields that could not be mapped should simply be ignored and maybe even logged as "warnings". The Java Bean Validation framework has such a way of recording warnings, for example. Just some info to think about ;-)
Conceptually I agree with you: ideally the format-specific part should focus on decoding the low-level format, and many/most issues should be solved at a higher level, where the discrepancy is noticed. But the problem is that I do not think this can be solved in a general manner: there is not enough information to resolve problems that can occur at any state during processing. So, for example, a problem in a JSON value during data binding is something that I don't think I could solve at the databinding level at all (except by having options that define some sort of explicit resolution of an inconvertible scalar type into a legal one, or something). But the CSV format itself is, in some ways, more robust because it is very simple: it only has 2 dimensions and no nesting. This means that skipping a whole line is very likely to resolve any issue there might be with respect to state. Still -- perhaps the general idea of resetting state could be generalized, so that any ...
We encounter similar problems in Wikidata Toolkit. We are parsing some 17 million JSON documents from an input stream, some (very few) of which contain serialization errors that are hard to tolerate in an object model (such as confusion of empty-list and empty-map syntax). Every JSON document is in exactly one line of the input. We would like to tolerate (skip and ignore) the odd malformed document in this stream, but it is not clear how to do this. The code I am talking about is here (this version has the catch block outside of the loop, since we cannot recover anyway). We are using a MappingIterator constructed from an ObjectReader for a specific Java class structure. Since the errors in the JSON are not foreseeable, it is not feasible to tolerate all of them in the object model (we tried, but each new dump we process has new unexpected issues). I can catch the JsonMappingException, but I failed to advance the MappingIterator past the offending line in any way. One option would be to read lines from the file and to map the strings directly (no MappingIterator), but this would probably affect performance. Ideas are welcome. P.S. I note now that this bug is specifically about the CSV parser, but I think the underlying questions are the same for all uses of MappingIterator.
In the case of one JSON document per line, the most reliable method really is to split the lines first and bind them separately. This does incur some performance overhead, and is a bit less convenient. Although it is not optimal, the most practical incremental improvement is to add recoverability on a case-by-case basis: specifically, for low-level invalidity, try to make the parser keep as much valid state as possible. To that end, it really would make sense to start enumerating the specific, most commonly encountered problems -- like, say, "String value with embedded unquoted linefeed" or "truncated document that ends in a String value".
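(Purely for reference, the "split lines first, bind them separately" approach can be done with plain ObjectMapper calls. This is only a sketch under the assumption of one JSON document per line; the class name and error handling are illustrative, not taken from Wikidata Toolkit or Jackson.)

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import com.fasterxml.jackson.databind.ObjectMapper;

// Sketch: bind a "one JSON document per line" stream line by line,
// skipping malformed documents instead of aborting the whole read.
public class LineByLineBinding {

    public static <T> List<T> readAll(InputStream in, Class<T> type) throws IOException {
        ObjectMapper mapper = new ObjectMapper();
        List<T> result = new ArrayList<>();
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue;
                }
                try {
                    result.add(mapper.readValue(line, type));
                } catch (IOException e) {
                    // Malformed document: report and skip this line.
                    System.err.println("Skipping malformed line: " + e.getMessage());
                }
            }
        }
        return result;
    }
}
```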
Thanks for the quick reply. It turns out that the file we process is a JSON list of entities, but with one per line nonetheless, so that a slightly more complex pedestrian processing works. I now use the MappingIterator until I hit the first exception, at which time I recover by switching to a line-by-line extraction of JSON snippets (with some manual intervention to remove the separating commas and the closing "]"). The code is here. I spent some time digging into the Jackson code, but I came to the conclusion that there is no easy way for me to recover the parser to continue with the MappingIterator. If the file did not contain a list (and thus no separator commas), one could manually advance the underlying stream to skip one line, and then close and recreate the MappingIterator (without closing the stream, by setting the Parser.Feature accordingly). Alas, this does not work, since a fresh parser cannot start in the middle of a list without seeing the initial "[" to go into list-parsing mode.
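(As a variation on the sketch above, the per-line cleanup described here, dropping the separating commas and the closing bracket before binding each line, might look roughly like this. The exact rules depend on how the file is laid out, so this is an assumption, not the actual Wikidata Toolkit code.)

```java
// Rough sketch: turn one line of a "JSON list with one entity per line" file
// into a standalone JSON document, or null if the line carries no entity.
static String extractJsonSnippet(String line) {
    String s = line.trim();
    if (s.isEmpty() || s.equals("[") || s.equals("]")) {
        return null; // opening/closing bracket of the surrounding list
    }
    if (s.endsWith(",")) {
        s = s.substring(0, s.length() - 1); // drop the separating comma
    }
    return s;
}
```

Each non-null result could then be bound with ObjectMapper.readValue as in the earlier sketch.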
Ah yes. Lists are a bit more complex than "raw" sequences. One thing that could help is use of ...
For what it is worth, I think this is superseded by FasterXML/jackson-databind#734, and 2.5.2 specifically has a fix to ensure that the current token, at least, is cleared even if an exception is thrown (not doing that was causing a potential infinite loop, where the iterator was not advancing). Going forward, if 734 for databind gets implemented, it should be possible to have safer recovery available.
I want to read a whole CSV file and have it mapped to a list of POJOs.
But if there are any unmappable fields (example: the CSV file contains the column value "1.280" but the field in the POJO is of type Integer), the reading should not abort but should simply ignore this line.
The MappingIterator can be used, but it does not handle the exception very well (in my opinion), which even leads to an endless loop. I can handle this problem if I clear the current token from the outside:
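(The code that originally followed here is not included in this extract. Based on the first comment above, it was presumably something along these lines; this is a sketch, not the original snippet.)

```java
try {
    LineDTO line = values.nextValue();
    lines.add(line);
} catch (Exception e) {
    // Jackson leaves the current token in place after the exception,
    // so clear it manually before the next iteration.
    values.getParser().clearCurrentToken();
}
```

(The first comment in this thread corrects this to also consume the defective line after clearing the token.)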
But in my opinion it would be better if a call to "nextValue" always cleared the current token?
regards and greetings,
bernd