[add]: Added try/catches to primitive parsing methods #127

Open
sztanko wants to merge 1 commit into develop

Conversation

@sztanko commented Jan 26, 2016

Handle invalid primitives in JSON.

A problem we experience quite often:

dfs -rm -f -r /tmp/err_test;
dfs -mkdir /tmp/err_test;
drop table if exists err_test;
drop table if exists err_string_test;
-- add jar /tmp/json-serde-1.3.7-SNAPSHOT-jar-with-dependencies.jar;
create external table err_test(a int) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' location '/tmp/err_test/';
create external table err_string_test(a string) location '/tmp/err_test/';
insert into err_string_test VALUES('{"a" :"not_an_int"}'),('{"a" :"1"}');
select * from err_string_test order by a LIMIT 10; -- this will go through
select * from err_test order by a LIMIT 10;        -- this will fail

Looking at the code and the exception stack trace:

Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row [Error getting row data with exception java.lang.NumberFormatException: For input string: "not_an_int"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:580)
    at java.lang.Integer.parseInt(Integer.java:615)
    at org.openx.data.jsonserde.objectinspector.primitive.ParsePrimitiveUtils.parseInt(ParsePrimitiveUtils.java:33)
    at org.openx.data.jsonserde.objectinspector.primitive.JavaStringIntObjectInspector.get(JavaStringIntObjectInspector.java:45)
    at org.apache.hadoop.hive.serde2.SerDeUtils.buildJSONString(SerDeUtils.java:226)
    at org.apache.hadoop.hive.serde2.SerDeUtils.buildJSONString(SerDeUtils.java:354)
    at org.apache.hadoop.hive.serde2.SerDeUtils.getJSONString(SerDeUtils.java:198)
    at org.apache.hadoop.hive.serde2.SerDeUtils.getJSONString(SerDeUtils.java:184)
    at org.apache.hadoop.hive.ql.exec.MapOperator.toErrorMessage(MapOperator.java:544)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:513)
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:163)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
 ]
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:172)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row [Error getting row data with exception java.lang.NumberFormatException: For input string: "not_an_int"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:580)
    at java.lang.Integer.parseInt(Integer.java:615)
    at org.openx.data.jsonserde.objectinspector.primitive.ParsePrimitiveUtils.parseInt(ParsePrimitiveUtils.java:33)
    at org.openx.data.jsonserde.objectinspector.primitive.JavaStringIntObjectInspector.get(JavaStringIntObjectInspector.java:45)
    at org.apache.hadoop.hive.serde2.SerDeUtils.buildJSONString(SerDeUtils.java:226)
    at org.apache.hadoop.hive.serde2.SerDeUtils.buildJSONString(SerDeUtils.java:354)
    at org.apache.hadoop.hive.serde2.SerDeUtils.getJSONString(SerDeUtils.java:198)
    at org.apache.hadoop.hive.serde2.SerDeUtils.getJSONString(SerDeUtils.java:184)
    at org.apache.hadoop.hive.ql.exec.MapOperator.toErrorMessage(MapOperator.java:544)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:513)
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:163)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
 ]
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:518)
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:163)
    ... 8 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NumberFormatException: For input string: "not_an_int"
    at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:403)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:97)
    at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:162)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:508)
    ... 9 more
Caused by: java.lang.NumberFormatException: For input string: "not_an_int"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:580)
    at java.lang.Integer.parseInt(Integer.java:615)
    at org.openx.data.jsonserde.objectinspector.primitive.ParsePrimitiveUtils.parseInt(ParsePrimitiveUtils.java:33)
    at org.openx.data.jsonserde.objectinspector.primitive.JavaStringIntObjectInspector.get(JavaStringIntObjectInspector.java:45)
    at org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe.serialize(BinarySortableSerDe.java:682)
    at org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe.serialize(BinarySortableSerDe.java:631)
    at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.toHiveKey(ReduceSinkOperator.java:507)
    at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:355)
    ... 15 more

and the relevant code, from Hive's MapOperator.process:

for (MapOpCtx current : currentCtxs) {
  Object row = null;
  try {
    row = current.readRow(value, context);  // deserialization happens here
    if (!current.forward(row)) {
      childrenDone++;
    }
  } catch (Exception e) {
    // TODO: policy on deserialization errors
    // Any exception thrown while reading or forwarding a row is rethrown
    // as a HiveException, so the whole task fails.
    String message = toErrorMessage(value, row, current.rowObjectInspector);
    if (row == null) {
      deserialize_error_count.set(deserialize_error_count.get() + 1);
      throw new HiveException("Hive Runtime Error while processing writable " + message, e);
    }
    throw new HiveException("Hive Runtime Error while processing row " + message, e);
  }
}

My understanding is that there is currently no way to suppress these exceptions via configuration options (e.g. mapreduce.job.skiprecords).
The only viable approach seems to be catching these kinds of problems with try/catch inside the ParsePrimitiveUtils methods.
I understand this is probably the ugliest solution, and a better pattern might be to pre-validate the JSON, but unfortunately that isn't feasible in our case.
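For illustration, the guarded parse is roughly the following (a minimal sketch, not the exact patch; the fallback value of 0 is an assumption, and its drawbacks come up in the discussion below):

// Sketch of a guarded primitive parse in the spirit of this PR; the
// real patch may differ in its details.
public static int parseInt(String s) {
    try {
        return Integer.parseInt(s);
    } catch (NumberFormatException e) {
        // Assumption: swallow the error and return a default value.
        // Note this silently turns "not_an_int" into 0 (see the
        // discussion below about 0 vs. NULL).
        return 0;
    }
}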

I would be glad if you merged this PR, but I understand this is not the ideal solution and you may not accept it.
In any case, I am open to your suggestions.

@sztanko changed the title from "[add]: Added try/catches" to "[add]: Added try/catches to primitive parsing methods" on Jan 26, 2016
@rcongiu (Owner) commented Jan 26, 2016

I am looking at the patch and I think it's a good idea, but my main concern is that in case of an error it returns a default value of 0; I think it should be NULL.
Say you have a field like salary, and the parsing of "$100,000" fails because of the dollar sign: you'd have a record silently converted from 100000 to 0, so anybody taking AVG or MIN of that field will be affected.
Let me think about it...

@sztanko (Author) commented Jan 26, 2016

I absolutely agree with you, and that was my initial intention; however, these are primitives, not objects, so returning null is not possible...
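For context, the constraint is visible in the inspector contract itself: Hive's primitive object inspector interfaces return Java primitives, so get() has no way to express "no value". A simplified sketch mirroring Hive's IntObjectInspector:

// Simplified: the contract declares a primitive return type, so get()
// cannot return null; it can only throw or return a sentinel int.
public interface IntObjectInspector {
    int get(Object o);
}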

@rcongiu (Owner) commented Jan 26, 2016

Yeah, I am looking at the code and it's a cascade of changes... to make it an option I'd have to change the object inspector caching, plus all the object inspectors...
In Scala it would be easy to just pass a function, but it's not that easy in Java.
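For illustration only, the kind of hook being alluded to might be a small error-policy interface threaded through every inspector, which is exactly the cascade of changes described above (hypothetical code, not in the codebase):

// Hypothetical error-policy hook; each object inspector would need to
// accept and consult one of these.
interface ParseErrorPolicy {
    int onBadInt(String raw, NumberFormatException e);
}

// Strict policy: preserve today's behavior and rethrow.
class StrictPolicy implements ParseErrorPolicy {
    public int onBadInt(String raw, NumberFormatException e) { throw e; }
}

// Lenient policy: swallow the error and return a sentinel value.
class LenientPolicy implements ParseErrorPolicy {
    public int onBadInt(String raw, NumberFormatException e) { return 0; }
}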

@sztanko (Author) commented Jan 28, 2016

I will try to do it myself, time permitting.

@rcongiu (Owner) commented Jan 28, 2016

I am actually working on it too :)

@sztanko (Author) commented Jan 28, 2016

Oh cool, thank you :)

@rcongiu (Owner) commented Jan 31, 2016

So I spent a good amount of time looking at this, and I don't think it can be done easily.
Here is why: right now the JSON parser does not convert strings to numbers, because it doesn't know ahead of time which numeric type to parse them into - int, long, etc. - so it carries the string around, and the JavaStringXxxObjectInspector does the conversion. The problem arises with NULLs: because of parsing, a NULL could be an actual NULL, or it could be an unparseable string. Hive checks whether the object it was handed is NULL and, seeing a non-null string, calls the get() method on the object inspector thinking the call is safe - but the object inspector fails.
The right way to do it is to pass the table schema down to the JSON parser and have it do the conversion there, so we don't have to worry about it later.
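A rough sketch of that direction, assuming a hypothetical helper (convertWithSchema is not existing SerDe code) that receives the declared Hive column type at parse time:

import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo;

// Hypothetical: convert a raw JSON string using the declared column type,
// so downstream code sees either a boxed value or a genuine null, and the
// object inspectors never have to parse strings themselves.
static Object convertWithSchema(String raw, PrimitiveTypeInfo type) {
    if (raw == null) {
        return null;                   // an actual JSON null
    }
    try {
        switch (type.getTypeName()) {
            case "int":    return Integer.valueOf(raw);
            case "bigint": return Long.valueOf(raw);
            case "double": return Double.valueOf(raw);
            default:       return raw; // leave other types as strings
        }
    } catch (NumberFormatException e) {
        return null;                   // unparseable value surfaces as NULL
    }
}

With something like this in place, a failed parse would surface as NULL at deserialization time, so Hive's null check would work and get() would never be called on an unparseable string.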
