Schema mismatch when running INSERT OVERWRITE from one Avro-backed table to another #14
Which version of haivvreo are you running? Is it against the Avro 1.4 or 1.6 branch?
Jakob: Hive 0.7.1, Haivvreo 1.0.7, Avro 1.5.4 (CDH3). Thanks for any assistance. -Eli
You're most likely running into an issue that was fixed on the Avro 1.4 branch: eca88ef. I'm planning on merging these changes to trunk this weekend.
Jakob, ah, I see that branch is 7 changes ahead of master. Thanks for the heads up - can you let me know when the changes are merged so I can try out the new code? Also, will master work with Avro 1.5.4 or will I need to move to 1.6.x? -Eli
It should work with 1.6. That's one of the things I'm checking. The other commits are minor. The plan is to merge this code into the Hive source, since I don't have enough time to really care for this code right now.
Any idea when this fix will be pushed? I'm getting bitten by this bug in another context (can't join two tables together).
FYI, I cloned the repo, merged the 1.4 branch into master and built it myself to test out this patch. The problem still seems to exist on seemingly simple joins from table a to table b with non-matching schemas.
I got the same problem and solved it by commenting out the following line in AvroGenericRecordReader.java, which sets the target table's schema as the reader schema of the GenericDatumReader.
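(The exact line isn't preserved in this archived thread; the pattern being described looks roughly like the sketch below in Avro's own API. This is not haivvreo's actual source, and schemaOfTargetTable is a hypothetical placeholder.)

// Sketch only: a GenericDatumReader resolves each record against an
// "expected" reader schema. Forcing the target table's schema in the
// record reader makes every read resolve against the output schema,
// which is wrong when reading table a in order to write into table b.
GenericDatumReader<GenericRecord> gdr = new GenericDatumReader<GenericRecord>();
gdr.setExpected(schemaOfTargetTable); // the kind of line the workaround comments out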
@thirstycrow Nice one - just tried this fix in my own fork at elibingham@134c689 - worked like a charm.
@jghoman: is this patch ready to go in or are there some remaining issues?
@cwsteinbach If you'd like you can clone my fork and build from there - it includes both @jghoman's and @thirstycrow's patches in master, and I've been using it in my development environment for a while without problems.
The remaining issue is that I've not seen this in our clusters, so I want to look further, but haven't had time. I've been promised time next week to catch up on Haivvreo issues (including the port to ASF), so will likely take a look then.
OK, finally have some time to focus on Haivvreo. The issue at hand: that line was added as part of 'be smart about reading in the latest version of the datum so we don't have to re-encode later.' Removing it means that if there are any schema evolution changes, we'll have to (try to) re-encode. But you're right that it's getting picked up during writing as well, which is a regression. Now the trick is to figure out how to avoid re-encoding without screwing up writing. Just commenting out this line isn't enough because the benefits of avoiding re-encoding are significant. Will take a look.
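For context, "re-encoding" here means round-tripping a record through Avro's schema resolution so an old record comes back conforming to the latest schema. A minimal sketch against the plain Avro 1.5+ API (this is not haivvreo's actual SchemaReEncoder):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

static GenericRecord reencode(GenericRecord record, Schema writerSchema,
                              Schema readerSchema) throws IOException {
  // Serialize the record with the schema it was originally written with...
  ByteArrayOutputStream out = new ByteArrayOutputStream();
  GenericDatumWriter<GenericRecord> writer =
      new GenericDatumWriter<GenericRecord>(writerSchema);
  BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
  writer.write(record, encoder);
  encoder.flush();
  // ...then read it back resolved against the evolved (reader) schema.
  GenericDatumReader<GenericRecord> reader =
      new GenericDatumReader<GenericRecord>(writerSchema, readerSchema);
  BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
  return reader.read(null, decoder);
}

Skipping this round trip on every row is the optimization being discussed, and its cost is why just commenting out the line isn't a complete fix.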
@elibingham I'm not sure why you were seeing this on Hive 0.7. We just went to Hive 0.8 and saw it. In 0.8 at least, Hive has changed what it's passing around in the properties; the code thought it was in non-MR jobs all the time and loaded the schema from that, which was what was causing the failure. I've committed a change (5407836) that should fix this. I've verified it against our 0.8.1 and 0.7.1 clusters. I'll try it against the insert overwrite scenario you originally described today as well and report back. I've also turned the logging when we re-encode up from info to warning, because at this point there really is no reason we should be doing so. If we do, I'd like to know about it so we can fix it. Take it for a spin and see if it works for you. And again, sorry for the delay in being able to look at this.
Thanks for looking into this @jghoman. I'll give it a whirl, but I'm fairly certain it's not going to work on our cluster, which is hive 0.7.1-cdh3u2 based. Possibly the CDH3 version of 0.7.1 has some unexpected patches in it? Anyway, the commit only appears to be in the hive8 branch - do you expect that this jar will work with an 0.7.1 cluster? Thanks! |
Oh, also, when building the HEAD of the hive8 branch, I got a unit test failure:
Results :
Tests in error:
Tests run: 56, Failures: 0, Errors: 1, Skipped: 0
It's also in the avro14 branch: https://github.com/jghoman/haivvreo/commits/avro14. I merged it to the hive8 branch as well, and tested it against our hive 0.7.1 cluster today too. I'll check on that unit test failure.
I fixed the test failure on the hive8 branch. For some reason hadoop 1.0.2 has a dependency on jersey that's not included in its published pom.
@elibingham With the latest commits (bringing the avro14 branch up to 1.0.10), I've got both our 0.7.1 and 0.8.1 clusters correctly running an INSERT OVERWRITE equivalent to yours. Definitely check it out and see if it works for you. Tomorrow I'll be merging the master and 1.4 branches (or starting to...) so we don't have this extra layer of annoyance going on.
@jghoman Great. Our cluster is on Avro 1.6, so I think I need to wait until you merge to master, which is Avro 1.5/1.6 based? |
@elibingham I've pushed the changes to master, but don't have a cluster with Avro 1.5 to test it on, so you'll have to be the guinea pig...
@jghoman I built and tested with HEAD of master and I'm still seeing the problem (vs. with my patched jar). I've sanitized my CLI input and the Hadoop logs and hopefully this will give you a better idea of what's going on. The interesting part to me in the Hadoop logs is that it looks like it's deserializing data from BAR but expecting the schema of FOO in the actual job. I've obscured some of the underlying data details and left out some irrelevant parts of the logs; let me know if you need anything else.
Sanitized hive CLI input:
hive> select a.i, a.j, a.k, a.l, a.m, b.n
Sanitized Hadoop log from job:
2012-04-24 16:08:09,962 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
You're getting hit with the Avro re-encoding stuff that's being discussed in issue #8. Try with the latest pushes. Beyond that, it's difficult to debug this. The code can't find a partition to match the schema to (which happens in issue #8) and then proceeds to find the stashed schema in the jobconf, as if it were being run from a select * statement. The correct thing would be for it to match the partition. You're running CDH Hive, right? I have no idea what changes they've made, and this bit of code (how we figure out whether we're in an MR job or not) changes between versions.
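The decision being described, in rough pseudologic (the helper is hypothetical and the property name is borrowed from the SERDEPROPERTIES shown below; this is not haivvreo's exact source):

// Illustrative sketch of the fallback path, not the actual code.
Schema schema = matchSplitToPartitionSchema(split, jobConf); // hypothetical helper
if (schema == null) {
  // Fallback meant for non-MR contexts such as "select *". On Hive 0.8
  // the properties changed, so this branch was taken inside real MR jobs
  // too, handing the reader the stashed (target) schema.
  schema = Schema.parse(jobConf.get("schema.literal"));
}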
Hello,
It looks like this issue is not resolved yet. Our environment is as follows:
We are trying to execute the following query:
It constantly fails with the following exception:
Both tables point to the same Avro schema. Is there any way we can fix it? It is a show stopper for our project.
Thanks,
Sorry to hear that. I'm out of ideas on my end, since I can't replicate it. The way to avoid it is to turn off the re-encoding optimization that was mentioned before. It means individual reads of evolved records will be slower, though. I'm finishing up the Haivvreo->Hive patch as I type, so hopefully someone in the Hive community may be able to reproduce it and we can fix it there (although that may be unnecessary, because the newer version of Hive has tableproperties, which obviates the need for all the dancing we're doing).
@IllyaYalovyy Actually, yours is different. Looks like what it's encountering isn't even an Avro record. I'd recommend opening a new issue... |
Created a new ticket with a detailed description and steps to reproduce.
Hi there,
I have two tables defined roughly as:
CREATE EXTERNAL TABLE a
ROW FORMAT SERDE
'com.linkedin.haivvreo.AvroSerDe'
WITH SERDEPROPERTIES (
'schema.url'='hdfs://schema/a.avsc')
STORED AS INPUTFORMAT
'com.linkedin.haivvreo.AvroContainerInputFormat'
OUTPUTFORMAT
'com.linkedin.haivvreo.AvroContainerOutputFormat'
LOCATION
'/foo/a';
CREATE EXTERNAL TABLE b
ROW FORMAT SERDE
'com.linkedin.haivvreo.AvroSerDe'
WITH SERDEPROPERTIES (
'schema.literal'='{ ... }')
STORED AS INPUTFORMAT
'com.linkedin.haivvreo.AvroContainerInputFormat'
OUTPUTFORMAT
'com.linkedin.haivvreo.AvroContainerOutputFormat'
LOCATION
'/foo/b';
When I do:
INSERT OVERWRITE TABLE b
SELECT a, b, c FROM a;
(where b's schema is equivalent to a, b, c)
This fails with an AvroTypeException like this:
2012-02-21 11:31:58,587 INFO org.apache.hadoop.mapred.TaskStatus: task-diagnostic-info for task attempt_201202091933_0059_m_000000_1 : java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable com.linkedin.haivvreo.AvroGenericRecordWritable@92696c2
at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:161)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable com.linkedin.haivvreo.AvroGenericRecordWritable@92696c2
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:520)
at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:143)
... 8 more
Caused by: org.apache.avro.AvroTypeException: Found {
"type" : "record",
"name" : "b",
"fields" : [ { … /* schema of b / … } ]
}, expecting {
"type" : "record",
"name" : "a",
"fields" : [ { … / schema of a */ … } ],
...
}
at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:231)
at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
at org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:127)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:162)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:138)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:129)
at com.linkedin.haivvreo.AvroDeserializer$SchemaReEncoder.reencode(AvroDeserializer.java:79)
at com.linkedin.haivvreo.AvroDeserializer.deserialize(AvroDeserializer.java:121)
at com.linkedin.haivvreo.AvroSerDe.deserialize(AvroSerDe.java:81)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:504)
... 9 more
Any help would be appreciated.
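For readers hitting a similar "Found {...}, expecting {...}" error: later Avro releases (1.7+) ship a compatibility checker that can confirm whether a reader schema resolves a writer schema, which helps rule schema evolution in or out as the cause. A minimal sketch (the file paths are illustrative):

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class CheckSchemas {
  public static void main(String[] args) throws Exception {
    // Can records written with a's schema be read with b's schema?
    Schema writerSchema = new Schema.Parser().parse(new File("a.avsc"));
    Schema readerSchema = new Schema.Parser().parse(new File("b.avsc"));
    SchemaCompatibility.SchemaPairCompatibility result =
        SchemaCompatibility.checkReaderWriterCompatibility(readerSchema, writerSchema);
    System.out.println(result.getType()); // COMPATIBLE or INCOMPATIBLE
  }
}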