
MappingIterator does not clear current token after exception #52

Closed
flappingeagle opened this issue Oct 17, 2014 · 13 comments

@flappingeagle

I want to read a whole CSV file and have it mapped to a list of POJOs.
But if there are any unmappable fields (example: the CSV file contains the column value "1.280" but the field in the POJO is of type Integer), the reading should not abort but should simply ignore that line.
The MappingIterator can be used, but in my opinion it does not handle the exception very well, which even leads to an endless loop. I can work around the problem by clearing the current token myself:

        MappingIterator<LineDTO> values = csvReader.readValues(importStream);
        List<LineDTO> lines = new ArrayList<>();
        while (values.hasNextValue()) {
            try {
                LineDTO line = values.nextValue();
                lines.add(line);
            } catch (Exception e) {
                // Bug in Jackson: the current token is not cleared in the error case,
                // so we clear it here to keep the iterator from getting stuck.
                values.getParser().clearCurrentToken();
            }
        }

But in my opinion it would be better if a call to "nextValue" always cleared the current token.

regards and greetings,
bernd

@flappingeagle
Author

Ah sorry, the defective line must also be consumed. Please see the corrected code:

        MappingIterator<LineDTO> values = csvReader.readValues(importStream);
        List<LineDTO> lines = new ArrayList<>();
        while (values.hasNextValue()) {
            try {
                LineDTO line = values.nextValue();
                lines.add(line);
            } catch (Exception e) {
                e.printStackTrace();
                // Bug in Jackson: the current token is not cleared in the error case.
                // We also need to consume the defective line.
                values.getParser().clearCurrentToken();
                LineDTO defectiveLine = values.nextValue();
                System.out.println(defectiveLine);
            }
        }

@cowtowncoder
Member

This area (error handling) is something I think does need improving, and I have tried to think of proper ways to deal with it. So far I am not sure what to do, partly because problems can occur both at the low streaming-API level (JsonParser) and above. The lower-level ones could be easier to resolve, as the parser has more context and could possibly skip the line (there are errors this would not work for, but it would for some). Higher-level ones like the one you mention are trickier, as the databinder has a bit less context, and the error can occur at various points.

One possibility could be to have a new method that would try to "skip tokens until END_OBJECT" (or possibly a CSV-specific one to skip until the end of the input line). That might be something that could be linked to MappingIterator as well (to call in case of certain types of errors).
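
For the object case, such a helper might look roughly like the sketch below; skipToEndObject is hypothetical (not an existing Jackson API) and just illustrates the idea with the regular streaming calls:

    static void skipToEndObject(JsonParser p) throws IOException {
        // Hypothetical recovery helper: advance the parser until the object
        // (or array) that was being read when the error occurred is closed.
        int depth = 1; // assumes we are currently somewhere inside an object
        JsonToken t;
        while ((t = p.nextToken()) != null) {
            if (t == JsonToken.START_OBJECT || t == JsonToken.START_ARRAY) {
                depth++;
            } else if (t == JsonToken.END_OBJECT || t == JsonToken.END_ARRAY) {
                if (--depth == 0) {
                    return;
                }
            }
        }
    }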

@flappingeagle
Author

Thanks for the reply. I hope I can use my workaround for my use case; it seems to work, so I'm more or less satisfied with the code above ;-)

But please note that the try/catch block can cause an endless loop if written as in the following code example. I think this should be fixed, because a developer would not expect an endless loop from this valid try/catch block. Maybe you could let "hasNextValue" return false after an exception occurs in "nextValue", or something like that ;-)

        MappingIterator<LineDTO> values = csvReader.readValues(importStream);
        List<LineDTO> lines = new ArrayList<>();
        while (values.hasNextValue()) {
            try {
                LineDTO line = values.nextValue();
                lines.add(line);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

@cowtowncoder
Member

@flappingeagle Good point -- although it is not guaranteed that the state is valid after an exception, it would be good if it were, and I think we should try to work towards that as much as possible.
But I'll have to see where and how this could be incorporated.

Do you happen to have a small snippet of code (a POJO declaration maybe?) and matching input, to reproduce the specific problem you have? That would be helpful in verifying that at least that case is handled in a safer manner.

@flappingeagle
Author

Well yes, here I have a JUnit test case which tests that all correct lines in a CSV file have been mapped. Incorrect lines should not abort the read, but should simply be ignored:

Note: at the moment this test runs into the "endless loop" problem explained above, which should be fixed ;-)

ProductPojo.java

@JsonPropertyOrder({
        "id",
        "name",
        "amount",
        "price"
})
public class ProductPojo {

    private Integer id;

    private String name;

    private Integer amount;

    private Double price;

    public Integer getId() {
        return id;
    }

    public void setId(Integer id) {
        this.id = id;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public Integer getAmount() {
        return amount;
    }

    public void setAmount(Integer amount) {
        this.amount = amount;
    }

    public Double getPrice() {
        return price;
    }

    public void setPrice(Double price) {
        this.price = price;
    }
}

ParseTest.java

    private final String testPath = new File("src/test/resources").getAbsolutePath();

    @Test
    public void testParse() throws FileNotFoundException {
        CsvMapper mapper = new CsvMapper();
        CsvSchema schema = mapper.schemaFor(ProductPojo.class);
        schema = schema.withColumnSeparator(';');
        schema = schema.withQuoteChar('"');
        schema = schema.withUseHeader(true);
        ObjectReader csvReader = mapper.reader(ProductPojo.class).with(schema);

        FileInputStream fis = new FileInputStream(testPath
                + File.separator + "test.csv");

        List<ProductPojo> lines = new ArrayList<>();
        MappingIterator<ProductPojo> values = null;
        try {
            values = csvReader.readValues(fis);
            while (values.hasNextValue()) {
                try {
                    ProductPojo line = values.nextValue();
                    lines.add(line);

                } catch (Exception e) {
                    int line = values.getParser().getCurrentLocation().getLineNr();
                    System.out.println("error at line: " + line + " --> ignoring this line, "
                            + "exception: " + e.getMessage());
                }
            }

        } catch (Exception e) {
            // hard exception while going to next value.
            String line = null;
            if (values != null) {
                line = String.valueOf(values.getParser().getCurrentLocation().getLineNr());
            }
            System.out.println("unrecoverable exception at line: " + line + ", exception: "
                    + e.getMessage());
        }

        // there are 3 correct lines in the file, which should be read.
        assertEquals(3, lines.size());
    }

test.csv (CSV-File)

"id"  ;"name"                        ;"amount"    ;"price"  ;
"1"   ;"Penny-Blossom Dream (Large)" ;"5"         ;"17.50"  ;
"2"   ;"Penny-Blossom Dream (Small)" ;"3.1"       ;"12.50"  ;
"3"   ;"Penny-Blossom Dream (XL)"    ;"8"         ;"22.50"  ;
"4"   ;"Penny-Blossom Dream (Large)" ;"10"        ;"error"  ;
"5"   ;"Penny-Blossom Dream (Large)" ;"12"        ;"17.50"  ;

@cowtowncoder
Member

Come to think of it, this probably requires a CSV-specific MappingIterator; there's no reliable way to make this work for all types. On the plus side, the CSV reader is probably able to skip the "rest of the line" (assuming input is recoverable), which might allow relatively nice recovery.

@flappingeagle
Author

My personal opinion is that a format parser (XML, CSV, JSON) should only abort parsing the given data if the data does not conform to the format itself.

In XML it's easy to find data that does not conform. The following example looks like XML, but as soon as the parser reaches the "</id>" it can decide that this is not well-formed XML and abort parsing completely. Another example would be disallowed characters in XML (a bare ampersand, &, is not allowed).

<list>
  <item>
  </id>
</list>

But all mapping problems that occur while mapping incoming data to the given Java POJOs should (in my personal opinion) never abort the reading of the data; the unmappable fields should simply be ignored, and maybe even recorded as "warnings". The Java Bean Validation framework has such a mechanism for collecting warnings, for example. Just some ideas to think about ;-)

@cowtowncoder
Member

Conceptually I agree with you: ideally the format-specific part should focus on decoding the low-level format, and many/most issues should be solved at a higher level, where the discrepancy is noticed.

But the problem is that I do not think this can be solved in a general manner: there is not enough information to resolve problems that can occur at any state during processing. So, for example, a problem in a JSON value during data binding is something I don't think I could solve at the databinding level at all (except by having options that define some sort of explicit resolution of an inconvertible scalar type into a legal one, or something).
There is no good way to "untangle" the mess at the point where the problem is known; Jackson has no mechanisms for unwinding, such as skipping the remaining properties of an object. The ability to do that could be useful, but would also be a big undertaking. And it would not necessarily be the right thing for all processing -- skipping through the following content is extra effort (especially when content is read over the network) -- and in some cases it would still be better to simply give up.

But the CSV format itself is, in some ways, more robust because it is very simple: it has only two dimensions, and no nesting. This means that skipping a whole line is very likely to resolve any issue there might be with respect to state.

Still -- perhaps the general idea of resetting state could be generalized, so that any JsonParser would have a method ("skipUntilRootValue()" or whatever) that could be called to try to salvage the situation, at least with respect to readValues(). And/or an indicator method (parser.mayResetToValidState() or such) to signal the formats for which this could work.
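
On the caller side, the intended usage might look roughly like this sketch; both skipUntilRootValue() and mayResetToValidState() are the hypothetical names from above (not existing API), and `it` is assumed to be a MappingIterator<Value> with `result` a List<Value>:

    while (it.hasNextValue()) {
        try {
            result.add(it.nextValue());
        } catch (JsonMappingException e) {
            JsonParser p = it.getParser();
            if (p.mayResetToValidState()) { // hypothetical capability check
                p.skipUntilRootValue();     // hypothetical: re-sync to the next root value
            } else {
                throw e;                    // format cannot recover; give up
            }
        }
    }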

@mkroetzsch

We encounter similar problems in Wikidata Toolkit. We are parsing some 17 million JSON documents from an input stream, some (very few) of which contain serialization errors that are hard to tolerate in an object model (such as confusion of empty list and empty map syntax). Every JSON document is in exactly one line of the input.

We would like to tolerate (skip and ignore) the odd malformed document in this stream, but it is not clear how to do this. The code I am talking about is here (this version has the catch block outside of the loop, since we cannot recover anyway). We are using a MappingIterator constructed from an ObjectReader for a specific Java class structure. Since the errors in the JSON are not foreseeable, it is not feasible to tolerate all of them in the object model (we tried, but each new dump we process has new unexpected issues).

I can catch the JsonMappingException, but I have failed to advance the MappingIterator past the offending line in any way. One option would be to read lines from the file and map the strings directly (no MappingIterator), but this would probably affect performance. Ideas are welcome.

P.S. I note now that this bug is specifically about the CSV parser, but I think the underlying questions are the same for all uses of MappingIterator.

@cowtowncoder
Member

In the case of one JSON document per line, the most reliable method really is to split the input into lines first and bind them separately. This does incur some performance overhead, and is a bit less convenient, as MappingIterator cannot be used as-is.
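
A minimal sketch of that approach, assuming UTF-8 input, an InputStream named `in`, and a target POJO class Entity (all names illustrative, not from this thread):

    ObjectMapper mapper = new ObjectMapper();
    List<Entity> entities = new ArrayList<>();
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(in, StandardCharsets.UTF_8))) {
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty()) {
                continue;
            }
            try {
                // Each line is a complete JSON document and is bound on its own,
                // so a malformed line cannot corrupt state for the following lines.
                entities.add(mapper.readValue(line, Entity.class));
            } catch (IOException e) {
                System.err.println("skipping malformed line: " + e.getMessage());
            }
        }
    }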

A JsonMappingException may be easier to recover from, whereas lower-level invalid JSON (as opposed to merely "unmappable") is much more difficult, since in the latter case the parser itself has to be part of the recovery.
So in a way there are two levels to recovery: at the low level, recovering tokenization so that after an invalid token the parser tries to re-sync with the input to find the likely next valid token; and at the higher level, similarly trying to skip over tokens until a likely matching level is found. At the mapping level, knowing the expected next token and/or nesting would help.

Although it is not optimal, the most practical incremental improvement is to add recoverability on a case-by-case basis: specifically, for low-level invalidity, try to make the parser keep as much valid state as possible.

To that end, it really would make sense to start enumerating the most commonly encountered specific problems -- like, say, "String value with embedded unquoted linefeed" or "truncated document that ends in a String value".

@mkroetzsch

Thanks for the quick reply. It turns out that the file we process is a JSON list of entities, but with one entity per line nonetheless, so that a slightly more complex pedestrian approach works. I now use the MappingIterator until I hit the first exception, at which point I recover by switching to line-by-line extraction of JSON snippets (with some manual intervention to remove the separating commas and the closing "]"). The code is here.

I spent some time digging into the Jackson code, but I came to the conclusion that there is no easy way for me to recover the parser so as to continue with the MappingIterator. If the file did not contain a list (and thus no separating commas), one could manually advance the underlying stream to skip one line, and then close and recreate the MappingIterator (without closing the stream, by setting the Parser.Feature accordingly). Alas, this does not work, since a fresh parser cannot start in the middle of a list without seeing the initial "[" that puts it into list-parsing mode.

@cowtowncoder
Member

Ah yes. Lists are a bit more complex than "raw" sequences.

One thing that could help is JsonParser.releaseBuffered(), which may be used to access content the parser has buffered but not yet tokenized -- with that, plus the original InputStream or Reader, it is at least possible to recover the unused content. And also to inject a separator if needed. Not simple, but technically doable.
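
A rough sketch of that recovery idea, assuming a byte-based parser and the original InputStream in a variable named `original` (names illustrative only):

    // Collect input the parser has read ahead but not yet tokenized.
    ByteArrayOutputStream leftover = new ByteArrayOutputStream();
    parser.releaseBuffered(leftover); // returns -1 if the parser is not byte-based

    // Stitch the leftover bytes back in front of the untouched remainder.
    InputStream rest = new SequenceInputStream(
            new ByteArrayInputStream(leftover.toByteArray()), original);

    // From here one would skip 'rest' forward past the broken line and, for list
    // input, inject a "[" before handing it to a fresh parser; that part is
    // format-specific and left out of this sketch.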

@cowtowncoder
Member

For what it is worth, I think this is superseded by:

FasterXML/jackson-databind#734

and 2.5.2 specifically has a fix to ensure that at least the current token is cleared even if an exception is thrown (not doing that was causing the potential infinite loop, where the iterator was not advancing).
Clearing the token prevents stuck processing, although without other changes it may expose partial or duplicate lines, so that the caller may see one line "split" into more than one piece.

Going forward, if 734 for databind gets implemented, it should be possible to have safer recovery available.
