
Schema mismatch when running INSERT OVERWRITE from one Avro-backed table to another #14

Open
elibingham opened this issue Feb 21, 2012 · 27 comments


@elibingham

Hi there,

I have two tables defined roughly as:

CREATE EXTERNAL TABLE a
ROW FORMAT SERDE
'com.linkedin.haivvreo.AvroSerDe'
WITH SERDEPROPERTIES (
'schema.url'='hdfs://schema/a.avsc')
STORED AS INPUTFORMAT
'com.linkedin.haivvreo.AvroContainerInputFormat'
OUTPUTFORMAT
'com.linkedin.haivvreo.AvroContainerOutputFormat'
LOCATION
'/foo/a';

CREATE EXTERNAL TABLE b
ROW FORMAT SERDE
'com.linkedin.haivvreo.AvroSerDe'
WITH SERDEPROPERTIES (
'schema.literal'='{ ... }')
STORED AS INPUTFORMAT
'com.linkedin.haivvreo.AvroContainerInputFormat'
OUTPUTFORMAT
'com.linkedin.haivvreo.AvroContainerOutputFormat'
LOCATION
'/foo/b';

When I do:

INSERT OVERWRITE TABLE b
SELECT a, b, c FROM a;

(where b's schema matches the selected columns a, b, and c)

This fails with an AvroTypeException like this:

2012-02-21 11:31:58,587 INFO org.apache.hadoop.mapred.TaskStatus: task-diagnostic-info for task attempt_201202091933_0059_m_000000_1 : java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable com.linkedin.haivvreo.AvroGenericRecordWritable@92696c2
at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:161)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable com.linkedin.haivvreo.AvroGenericRecordWritable@92696c2
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:520)
at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:143)
... 8 more
Caused by: org.apache.avro.AvroTypeException: Found {
"type" : "record",
"name" : "b",
"fields" : [ { … /* schema of b / … } ]
}, expecting {
"type" : "record",
"name" : "a",
"fields" : [ { … /
schema of a */ … } ],
...
}
at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:231)
at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
at org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:127)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:162)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:138)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:129)
at com.linkedin.haivvreo.AvroDeserializer$SchemaReEncoder.reencode(AvroDeserializer.java:79)
at com.linkedin.haivvreo.AvroDeserializer.deserialize(AvroDeserializer.java:121)
at com.linkedin.haivvreo.AvroSerDe.deserialize(AvroSerDe.java:81)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:504)
... 9 more

Any help would be appreciated.

@jghoman
Owner

jghoman commented Feb 22, 2012

Which version of haivvreo are you running? Is it against the Avro 1.4 or 1.6 branch?

@elibingham
Author

Jakob,

Hive 0.7.1, Haivvreo 1.0.7, Avro 1.5.4 (CDH3)

Thanks for any assistance.

-Eli


@jghoman
Owner

jghoman commented Feb 24, 2012

You're most likely running into an issue that was fixed on the Avro 1.4 branch: eca88ef. I'm planning to merge these changes to trunk this weekend.

@elibingham
Author

Jakob,

Ah, I see that branch is 7 changes ahead of master. Thanks for the heads up - can you let me know when the changes are merged so I can try out the new code?

Also, will master work with Avro 1.5.4 or will I need to move to 1.6.x?

-Eli


@jghoman
Owner

jghoman commented Feb 24, 2012

It should work with 1.6. That's one of the things I'm checking. The other commits are minor. The plan is to merge this code to the Hive source, since I don't have enough time to really care for this code right now.

@elibingham
Author

Any idea when this fix will be pushed? I'm getting bitten by this bug in another context (can't join together two tables).

@elibingham
Author

FYI, I cloned the repo, merged the 1.4 branch into master and built it myself to test out this patch. The problem still occurs on seemingly simple joins from a to b with non-matching schemas.

@thirstycrow

I ran into the same problem and solved it by commenting out the following line in AvroGenericRecordReader.java, which sets the target table's schema as the GenericDatumReader's expected (reader) schema:

if(latest != null) gdr.setExpected(latest);
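For illustration, here is a minimal standalone sketch (not haivvreo code; the two schemas are hypothetical) of what forcing the reader's expected schema does: once gdr.setExpected(latest) pins the GenericDatumReader to one table's schema, any data written with a different schema fails to resolve, producing the same AvroTypeException shown above.

import java.io.ByteArrayOutputStream;
import org.apache.avro.AvroTypeException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class ExpectedSchemaMismatch {
  public static void main(String[] args) throws Exception {
    Schema writer = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"a\",\"fields\":[{\"name\":\"x\",\"type\":\"int\"}]}");
    Schema target = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"b\",\"fields\":[{\"name\":\"y\",\"type\":\"int\"}]}");

    // Encode a record with the writer schema (stands in for a row read from the source table).
    GenericRecord rec = new GenericData.Record(writer);
    rec.put("x", 1);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(writer).write(rec, enc);
    enc.flush();

    // Decode it with a reader whose expected schema has been forced to the other table's
    // schema, which is what gdr.setExpected(latest) does in AvroGenericRecordReader.
    GenericDatumReader<GenericRecord> gdr = new GenericDatumReader<GenericRecord>(writer);
    gdr.setExpected(target);
    Decoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    try {
      gdr.read(null, dec);
    } catch (AvroTypeException e) {
      // The schemas don't resolve, so decoding fails just like the map tasks above.
      System.out.println("Resolution failed: " + e.getMessage());
    }
  }
}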

@elibingham
Author

@thirstycrow Nice one - just tried this fix in my own fork at elibingham@134c689 - worked like a charm.

@cwsteinbach

@jghoman: is this patch ready to go in or are there some remaining issues?

@elibingham
Author

@cwsteinbach If you'd like, you can clone my fork and build from there - it includes both @jghoman's and @thirstycrow's patches in master, and I've been using it in my development environment for a while without problems.

@jghoman
Owner

jghoman commented Apr 6, 2012

The remaining issue is that I've not seen this in our clusters, so I want to look further, but haven't had time. I've been promised time next week to catch up on Haivvreo issues (including the port to ASF), so will likely take a look then.

@jghoman
Owner

jghoman commented Apr 11, 2012

OK, I finally have some time to focus on Haivvreo. The issue at hand: that line was added as part of 'be smart about reading in the latest version of the datum so we don't have to re-encode later.' Removing it means that if there are any schema evolution changes, we'll have to (try to) re-encode. But you're right that it's getting picked up during writing as well, which is a regression. Now the trick is to figure out how to avoid re-encoding without screwing up writing. Just commenting out this line isn't enough, because the benefits of avoiding re-encoding are significant. Will take a look.
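For context, a rough sketch of what that re-encoding step amounts to (assumed from the SchemaReEncoder frames in the first stack trace, not the actual haivvreo code; the schema strings are hypothetical): the datum is written back out with the schema it was created with and read again with reader = latest, so Avro's schema resolution fills in evolved fields. Skipping this when the data is already at the latest schema is the optimization; having it kick in on the write path is the regression.

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class ReEncodeSketch {
  public static void main(String[] args) throws Exception {
    Schema oldSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"r\",\"fields\":[{\"name\":\"x\",\"type\":\"int\"}]}");
    Schema latest = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"r\",\"fields\":["
        + "{\"name\":\"x\",\"type\":\"int\"},"
        + "{\"name\":\"y\",\"type\":\"int\",\"default\":0}]}");

    GenericRecord oldRecord = new GenericData.Record(oldSchema);
    oldRecord.put("x", 42);

    // Write the datum out with the schema it was created with...
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(oldSchema).write(oldRecord, enc);
    enc.flush();

    // ...and read it back with (writer = old, reader = latest): schema resolution
    // fills in the new field's default, so downstream code only sees the latest schema.
    Decoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericRecord reencoded =
        new GenericDatumReader<GenericRecord>(oldSchema, latest).read(null, dec);
    System.out.println(reencoded);  // prints {"x": 42, "y": 0}
  }
}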

@jghoman
Owner

jghoman commented Apr 17, 2012

@elibingham I'm not sure why you were seeing this on Hive 0.7. We just went to Hive 0.8 and saw it. In 0.8 at least, Hive has changed what it passes around in the properties; the code thought it was in non-MR jobs all the time and loaded the schema from that, which was what was causing the failure. I've committed a change (5407836) that should fix this. I've verified it against our 0.8.1 and 0.7.1 clusters. I'll try it against the INSERT OVERWRITE scenario you originally described today as well and report back.

I've also turned up the logging info when we re-encode to warning because at this point there really is no reason we should be doing so. If we do, I'd like to know about it so we can fix it.

Take it for a spin and see if it works for you. And again, sorry for the delay in being able to look at this.

@elibingham
Author

Thanks for looking into this @jghoman. I'll give it a whirl, but I'm fairly certain it's not going to work on our cluster, which is hive 0.7.1-cdh3u2 based. Possibly the CDH3 version of 0.7.1 has some unexpected patches in it? Anyway, the commit only appears to be in the hive8 branch - do you expect that this jar will work with an 0.7.1 cluster? Thanks!

@elibingham
Author

Oh, also, when building the HEAD of avro8 branch, I got a unit test failure:

Results :

Tests in error:
determineSchemaCanReadSchemaFromHDFS(com.linkedin.haivvreo.TestHaivvreoUtils): java.io.IOException: java.lang.NoClassDefFoundError: javax/ws/rs/core/StreamingOutput

Tests run: 56, Failures: 0, Errors: 1, Skipped: 0

@jghoman
Owner

jghoman commented Apr 18, 2012

It's also in the avro14 branch: https://github.com/jghoman/haivvreo/commits/avro14. I merged it to the hive8 branch as well, but tested it against our Hive 0.7.1 cluster today too. I'll check on that unit test failure.

@jghoman
Owner

jghoman commented Apr 18, 2012

I fixed the test failure on the hive8 branch. For some reason Hadoop 1.0.2 has a dependency on Jersey that's not included in its published POM.

@jghoman
Owner

jghoman commented Apr 18, 2012

@elibingham With the latest commits (bringing the avro14 branch up to 1.0.10), I've got both our 0.7.1 and 0.8.1 clusters correctly running an INSERT OVERWRITE equivalent to your query. Definitely check it out and see if it works for you. Tomorrow I'll be merging the master and 1.4 branches (or starting to...) so we don't have this extra layer of annoyance going on.

@elibingham
Author

@jghoman Great. Our cluster is on Avro 1.6, so I think I need to wait until you merge to master, which is Avro 1.5/1.6 based?

@jghoman
Owner

jghoman commented Apr 24, 2012

@elibingham I've pushed the changes to Master, but don't have a cluster with Avro 1.5 to test it on, so you'll have to be the guinea pig...

@elibingham
Author

@jghoman I built and tested with the HEAD of master and I'm still seeing the problem (unlike with my patched jar). I've sanitized my CLI input and the Hadoop logs, and hopefully this will give you a better idea of what's going on. The interesting part to me in the Hadoop logs is that it looks like it's deserializing data from BAR but expecting the schema of FOO in the actual job. I've obscured some of the underlying data details and left out some irrelevant parts of the logs; let me know if you need anything else.

Sanitized hive CLI input:

hive> select a.i, a.j, a.k, a.l, a.m, b.n
> from foo a
> join bar b on a.k = b.k
> limit 100;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 2
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201202091933_0242, Tracking URL = http://namenode:50030/jobdetails.jsp?jobid=job_201202091933_0242
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=namenode:8021 -kill job_201202091933_0242
2012-04-24 15:23:29,258 Stage-1 map = 0%, reduce = 0%
2012-04-24 15:23:39,306 Stage-1 map = 1%, reduce = 0%
2012-04-24 15:23:45,336 Stage-1 map = 4%, reduce = 0%
2012-04-24 15:23:48,353 Stage-1 map = 6%, reduce = 0%
2012-04-24 15:23:50,364 Stage-1 map = 7%, reduce = 0%
2012-04-24 15:23:51,369 Stage-1 map = 8%, reduce = 0%
2012-04-24 15:23:53,380 Stage-1 map = 9%, reduce = 0%
2012-04-24 15:23:54,385 Stage-1 map = 12%, reduce = 0%
2012-04-24 15:23:57,398 Stage-1 map = 14%, reduce = 0%
2012-04-24 15:23:59,409 Stage-1 map = 15%, reduce = 0%
2012-04-24 15:24:00,418 Stage-1 map = 17%, reduce = 0%
2012-04-24 15:24:03,431 Stage-1 map = 19%, reduce = 0%
2012-04-24 15:24:05,441 Stage-1 map = 20%, reduce = 0%
2012-04-24 15:24:06,447 Stage-1 map = 21%, reduce = 0%
2012-04-24 15:24:09,463 Stage-1 map = 24%, reduce = 0%
2012-04-24 15:24:11,473 Stage-1 map = 25%, reduce = 0%
2012-04-24 15:24:12,478 Stage-1 map = 27%, reduce = 0%
2012-04-24 15:24:14,489 Stage-1 map = 28%, reduce = 0%
2012-04-24 15:24:15,494 Stage-1 map = 29%, reduce = 0%
2012-04-24 15:24:18,507 Stage-1 map = 30%, reduce = 0%
2012-04-24 15:24:20,516 Stage-1 map = 31%, reduce = 0%
2012-04-24 15:24:21,521 Stage-1 map = 32%, reduce = 2%
2012-04-24 15:24:23,530 Stage-1 map = 33%, reduce = 2%
2012-04-24 15:24:24,535 Stage-1 map = 34%, reduce = 4%
2012-04-24 15:24:27,550 Stage-1 map = 35%, reduce = 4%
2012-04-24 15:24:29,560 Stage-1 map = 36%, reduce = 4%
2012-04-24 15:24:32,575 Stage-1 map = 37%, reduce = 4%
2012-04-24 15:24:33,581 Stage-1 map = 38%, reduce = 4%
2012-04-24 15:24:35,590 Stage-1 map = 39%, reduce = 4%
2012-04-24 15:24:36,595 Stage-1 map = 40%, reduce = 4%
2012-04-24 15:24:38,603 Stage-1 map = 41%, reduce = 4%
2012-04-24 15:24:41,617 Stage-1 map = 43%, reduce = 4%
2012-04-24 15:24:44,631 Stage-1 map = 44%, reduce = 4%
2012-04-24 15:24:45,635 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201202091933_0242 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
hive>

Sanitized Hadoop log from job:

2012-04-24 16:08:09,962 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[..]
2012-04-24 16:08:10,085 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2012-04-24 16:08:10,477 INFO com.linkedin.haivvreo.AvroGenericRecordReader: Unable to match filesplit /data/bar/20120221.avro.snappy:67108864+67108864 with a partition.
2012-04-24 16:08:10,706 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 2
2012-04-24 16:08:10,712 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2012-04-24 16:08:10,760 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
2012-04-24 16:08:10,760 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680
2012-04-24 16:08:10,766 INFO ExecMapper: maximum memory = 200998912
[..]
2012-04-24 16:08:10,881 INFO ExecMapper:
<MAP>Id =8
<Children>
<TS>Id =2
<Children>
<RS>Id =3
<Parent>Id = 2 null<\Parent>
<\RS>
<\Children>
<Parent>Id = 8 null<\Parent>
<\TS>
<\Children>
<\MAP>
2012-04-24 16:08:10,881 INFO org.apache.hadoop.hive.ql.exec.MapOperator: Initializing Self 8 MAP
2012-04-24 16:08:10,881 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: Initializing Self 0 TS
2012-04-24 16:08:10,881 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: Operator 0 TS initialized
2012-04-24 16:08:10,881 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: Initializing children of 0 TS
2012-04-24 16:08:10,881 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: Initializing child 1 RS
2012-04-24 16:08:10,881 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: Initializing Self 1 RS
2012-04-24 16:08:10,884 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: Using tag = 0
2012-04-24 16:08:10,897 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: Operator 1 RS initialized
2012-04-24 16:08:10,897 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: Initialization Done 1 RS
2012-04-24 16:08:10,897 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: Initialization Done 0 TS
2012-04-24 16:08:10,897 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: Initializing Self 2 TS
2012-04-24 16:08:10,898 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: Operator 2 TS initialized
2012-04-24 16:08:10,898 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: Initializing children of 2 TS
2012-04-24 16:08:10,898 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: Initializing child 3 RS
2012-04-24 16:08:10,898 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: Initializing Self 3 RS
2012-04-24 16:08:10,898 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: Using tag = 1
2012-04-24 16:08:10,898 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: Operator 3 RS initialized
2012-04-24 16:08:10,898 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: Initialization Done 3 RS
2012-04-24 16:08:10,898 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: Initialization Done 2 TS
2012-04-24 16:08:10,898 INFO org.apache.hadoop.hive.ql.exec.MapOperator: Initialization Done 8 MAP
2012-04-24 16:08:10,936 INFO org.apache.hadoop.hive.ql.exec.MapOperator: Processing path /data/bar/bar_20120221.avro.snappy
2012-04-24 16:08:10,936 INFO org.apache.hadoop.hive.ql.exec.MapOperator: Processing alias b for file hdfs://namenode/data/bar
2012-04-24 16:08:10,940 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 8 forwarding 1 rows
2012-04-24 16:08:10,940 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 2 forwarding 1 rows
2012-04-24 16:08:10,942 INFO ExecMapper: ExecMapper: processing 1 rows: used memory = 118554320
2012-04-24 16:08:10,958 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 8 forwarding 10 rows
2012-04-24 16:08:10,958 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 2 forwarding 10 rows
2012-04-24 16:08:10,958 INFO ExecMapper: ExecMapper: processing 10 rows: used memory = 119482744
2012-04-24 16:08:11,103 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 8 forwarding 100 rows
2012-04-24 16:08:11,103 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 2 forwarding 100 rows
2012-04-24 16:08:11,103 INFO ExecMapper: ExecMapper: processing 100 rows: used memory = 122267936
2012-04-24 16:08:11,490 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 8 forwarding 1000 rows
2012-04-24 16:08:11,490 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 2 forwarding 1000 rows
2012-04-24 16:08:11,491 INFO ExecMapper: ExecMapper: processing 1000 rows: used memory = 153848656
2012-04-24 16:08:12,369 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 8 forwarding 10000 rows
2012-04-24 16:08:12,369 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 2 forwarding 10000 rows
2012-04-24 16:08:12,369 INFO ExecMapper: ExecMapper: processing 10000 rows: used memory = 155906600
2012-04-24 16:08:17,641 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 8 forwarding 100000 rows
2012-04-24 16:08:17,641 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 2 forwarding 100000 rows
2012-04-24 16:08:17,641 INFO ExecMapper: ExecMapper: processing 100000 rows: used memory = 174357384
2012-04-24 16:08:25,784 INFO com.linkedin.haivvreo.AvroGenericRecordReader: Unable to match filesplit /data/bar/bar_20120221.avro.snappy:201326592+67108864 with a partition.
2012-04-24 16:08:25,784 INFO com.linkedin.haivvreo.AvroGenericRecordReader: Found the avro schema in the job: { SCHEMA OF FOO }
2012-04-24 16:08:25,802 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 8 finished. closing...
2012-04-24 16:08:25,802 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 8 forwarded 237917 rows
2012-04-24 16:08:25,802 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 0 finished. closing...
2012-04-24 16:08:25,802 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 0 forwarded 0 rows
2012-04-24 16:08:25,802 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: 1 finished. closing...
2012-04-24 16:08:25,802 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: 1 forwarded 0 rows
2012-04-24 16:08:25,802 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 0 Close done
2012-04-24 16:08:25,802 INFO org.apache.hadoop.hive.ql.exec.MapOperator: DESERIALIZE_ERRORS:0
2012-04-24 16:08:25,802 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 2 finished. closing...
2012-04-24 16:08:25,802 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 2 forwarded 237917 rows
2012-04-24 16:08:25,802 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: 3 finished. closing...
2012-04-24 16:08:25,802 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: 3 forwarded 0 rows
2012-04-24 16:08:25,802 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 2 Close done
2012-04-24 16:08:25,802 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 8 Close done
2012-04-24 16:08:25,802 INFO ExecMapper: ExecMapper: processed 237917 rows: used memory = 139110360
2012-04-24 16:08:25,807 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2012-04-24 16:08:25,811 WARN org.apache.hadoop.mapred.Child: Error running child
org.apache.avro.AvroTypeException: Found { SCHEMA OF FOO }, expecting { SCHEMA OF BAR }
at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:231)
at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
at org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:127)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:162)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:138)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:129)
at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220)
at com.linkedin.haivvreo.AvroGenericRecordReader.next(AvroGenericRecordReader.java:119)
at com.linkedin.haivvreo.AvroGenericRecordReader.next(AvroGenericRecordReader.java:41)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:98)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:42)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:67)
at org.apache.hadoop.hive.shims.Hadoop20SShims$CombineFileRecordReader.next(Hadoop20SShims.java:208)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:208)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:193)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
2012-04-24 16:08:25,814 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task

@jghoman
Owner

jghoman commented Apr 25, 2012

You're getting hit with the Avro re-encoding stuff that's being discussed in issue #8. Try with the latest pushes. Beyond that, it's difficult to debug this. The code can't find a partition to match the schema to (which happens in issue #8) and then proceeds to find the stashed schema in the jobconf, as if it were being run from a select * statement. The correct thing would be for it to match the partition. You're running CDH Hive, right? I have no idea what changes they've made, and this bit of code (how to figure out if we're in an MR job or not) changes between versions.
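As background on why that fallback hurts (a minimal sketch with a made-up local file name, not the haivvreo reader): every Avro container file carries its writer schema in its header, and a GenericDatumReader that isn't forced to an expected schema simply uses it, so the "Found FOO, expecting BAR" mismatch only appears once another table's stashed schema is imposed on the file.

import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class ReadWithEmbeddedSchema {
  public static void main(String[] args) throws Exception {
    // No expected schema is set, so the file's own writer schema is used for decoding.
    GenericDatumReader<GenericRecord> gdr = new GenericDatumReader<GenericRecord>();
    DataFileReader<GenericRecord> reader =
        new DataFileReader<GenericRecord>(new File("bar_20120221.avro"), gdr);
    try {
      System.out.println("writer schema: " + reader.getSchema());
      while (reader.hasNext()) {
        // Records conform to the file's own schema, not whatever is stashed in the jobconf.
        System.out.println(reader.next());
      }
    } finally {
      reader.close();
    }
  }
}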

@IllyaYalovyy

Hello,

It looks like this issue is not resolved yet. Our environment is as follows:
CDH3U3, avro-1.5.4 and avro-mapred-1.5.4, haivvreo-1.0.7

We are trying to execute the following query:
insert overwrite table LZ.TABLE1 select * from ETLDATA.TABLE2;

It consistently fails with the following exception:
Caused by: org.apache.avro.AvroRuntimeException: Not a record: ["null",{"type":"record","name":"_struct_d1e13","namespace":"com.expedia.edw.gco.navigator","fields":[{"name":"FirstName","type":["null","string"]},{"name":"FullName","type":["null","string"]},{"name":"TPID","type":["null","int"]},{"name":"TUID","type":["null","int"]}]}]
at org.apache.avro.Schema.getFields(Schema.java:227)
at com.linkedin.haivvreo.AvroSerializer.serializeStruct(AvroSerializer.java:104)
at com.linkedin.haivvreo.AvroSerializer.serialize(AvroSerializer.java:97)
at com.linkedin.haivvreo.AvroSerializer.serialize(AvroSerializer.java:65)
at com.linkedin.haivvreo.AvroSerDe.serialize(AvroSerDe.java:75)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:553)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:744)
at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:744)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:78)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:744)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:527)
... 9 more

Both tables point to the same Avro schema.

Is there any way we can fix it? It is a show stopper for our project.

Thanks,
Illya

@jghoman
Owner

jghoman commented May 19, 2012

Sorry to hear that. I'm out of ideas on my end, since I can't replicate it. The way to avoid it is to turn off the re-encoding optimization that was mentioned before; it means reading individual evolved records will be slower, though. I'm finishing up the Haivvreo->Hive patch as I type, so hopefully someone in the Hive community will be able to reproduce it and we can fix it there (although that may be unnecessary, because the newer version of Hive has tableproperties, which obviates the need for all the dancing we're doing).

@jghoman
Owner

jghoman commented May 19, 2012

@IllyaYalovyy Actually, yours is different. Looks like what it's encountering isn't even an Avro record. I'd recommend opening a new issue...
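For anyone hitting the same thing: Schema.getFields() only works on RECORD schemas, and the "Not a record" message above shows it being called on a ["null", record] union, i.e. a nullable struct. A minimal sketch of unwrapping such a union before asking for its fields (hypothetical code, not the haivvreo fix; the schema string is trimmed from the exception above):

import java.util.List;
import org.apache.avro.Schema;

public class UnwrapNullableStruct {
  public static void main(String[] args) {
    Schema nullableStruct = new Schema.Parser().parse(
        "[\"null\",{\"type\":\"record\",\"name\":\"s\",\"fields\":["
        + "{\"name\":\"FirstName\",\"type\":[\"null\",\"string\"]}]}]");

    // Calling nullableStruct.getFields() directly would throw
    // AvroRuntimeException("Not a record: ...") because the top-level schema is a UNION,
    // exactly as in the stack trace above. Unwrap the union first.
    Schema record = unwrapNullable(nullableStruct);
    for (Schema.Field f : record.getFields()) {
      System.out.println(f.name());
    }
  }

  // Return the non-null branch of a ["null", X] union, or the schema itself otherwise.
  static Schema unwrapNullable(Schema s) {
    if (s.getType() != Schema.Type.UNION) {
      return s;
    }
    List<Schema> branches = s.getTypes();
    for (Schema branch : branches) {
      if (branch.getType() != Schema.Type.NULL) {
        return branch;
      }
    }
    return s;
  }
}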

@IllyaYalovyy

Created a new ticket with a detailed description and steps to reproduce.
