Schema mismatch when running INSERT OVERWRITE from one Avro-backed table to another #14
Which version of haivvreo are you running? Is it against the Avro 1.4 or 1.6 branch?
Jakob: Hive 0.7.1, Haivvreo 1.0.7, Avro 1.5.4 (CDH3). Thanks for any assistance. -Eli
You're most likely running into an issue that was fixed on the Avro 1.4 branch: eca88ef. I'm planning on merging these changes to trunk this weekend.
Jakob, ah, I see that branch is 7 changes ahead of master. Thanks for the heads up - can you let me know when the changes are merged so I can try out the new code? Also, will master work with Avro 1.5.4 or will I need to move to 1.6.x? -Eli
It should work with 1.6. That's one of the things I'm checking. The other commits are minor. The plan is to merge this code into the Hive source, since I don't have enough time to really care for this code right now.
Any idea when this fix will be pushed? I'm getting bitten by this bug in another context (can't join two tables together).
FYI, I cloned the repo, merged the 1.4 branch into master and built it myself to test out this patch. The problem still seems to exist on seemingly simple joins from table a to table b with non-matching schemas.
I got the same problem and solved it by commenting out the following line in AvroGenericRecordReader.java, which sets the target table's schema as the reader schema of the GenericDatumReader.
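(The exact line isn't preserved in this archived thread; the pattern being described looks roughly like the sketch below in Avro's own API. This is not haivvreo's actual source, and schemaOfTargetTable is a hypothetical placeholder.)

// Sketch only: a GenericDatumReader resolves each record against an
// "expected" reader schema. Forcing the target table's schema in the
// record reader makes every read resolve against the output schema,
// which is wrong when reading table a in order to write into table b.
GenericDatumReader<GenericRecord> gdr = new GenericDatumReader<GenericRecord>();
gdr.setExpected(schemaOfTargetTable); // the kind of line the workaround comments out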
@thirstycrow Nice one - just tried this fix in my own fork at elibingham@134c689 - worked like a charm.
@jghoman: is this patch ready to go in or are there some remaining issues?
@cwsteinbach If you'd like you can clone my fork and build from there - it includes both @jghoman's and @thirstycrow's patches in master, and I've been using it in my development environment for a while without problems.
The remaining issue is that I've not seen this in our clusters, so I want to look further, but haven't had time. I've been promised time next week to catch up on Haivvreo issues (including the port to ASF), so will likely take a look then.
OK, finally have some time to focus on Haivvreo. The issue at hand: that line was added as part of 'be smart about reading in the latest version of the datum so we don't have to re-encode later.' Removing it means that if there are any schema evolution changes, we'll have to (try to) re-encode. But you're right that it's getting picked up during writing as well, which is a regression. Now the trick is to figure out how to avoid re-encoding without screwing up writing. Just commenting out this line isn't enough because the benefits of avoiding re-encoding are significant. Will take a look.
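For context, "re-encoding" here means round-tripping a record through Avro's schema resolution so an old record comes back conforming to the latest schema. A minimal sketch against the plain Avro 1.5+ API (this is not haivvreo's actual SchemaReEncoder):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

static GenericRecord reencode(GenericRecord record, Schema writerSchema,
                              Schema readerSchema) throws IOException {
  // Serialize the record with the schema it was originally written with...
  ByteArrayOutputStream out = new ByteArrayOutputStream();
  GenericDatumWriter<GenericRecord> writer =
      new GenericDatumWriter<GenericRecord>(writerSchema);
  BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
  writer.write(record, encoder);
  encoder.flush();
  // ...then read it back resolved against the evolved (reader) schema.
  GenericDatumReader<GenericRecord> reader =
      new GenericDatumReader<GenericRecord>(writerSchema, readerSchema);
  BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
  return reader.read(null, decoder);
}

Skipping this round trip on every row is the optimization being discussed, and its cost is why just commenting out the line isn't a complete fix.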
@elibingham I'm not sure why you were seeing this on Hive 0.7. We just went to Hive 0.8 and saw it. In 0.8 at least, Hive has changed what it's passing around in the properties; the code thought it was in non-MR jobs all the time and loaded the schema from that, which was what was causing the failure. I've committed a change (5407836) that should fix this. I've verified it against our 0.8.1 and 0.7.1 clusters. I'll try it against the insert overwrite scenario you originally described today as well and report back. I've also turned the logging when we re-encode up from info to warning, because at this point there really is no reason we should be doing so. If we do, I'd like to know about it so we can fix it. Take it for a spin and see if it works for you. And again, sorry for the delay in being able to look at this.
Thanks for looking into this @jghoman. I'll give it a whirl, but I'm fairly certain it's not going to work on our cluster, which is hive 0.7.1-cdh3u2 based. Possibly the CDH3 version of 0.7.1 has some unexpected patches in it? Anyway, the commit only appears to be in the hive8 branch - do you expect that this jar will work with an 0.7.1 cluster? Thanks! |
Oh, also, when building the HEAD of the hive8 branch, I got a unit test failure:
Results :
Tests in error:
Tests run: 56, Failures: 0, Errors: 1, Skipped: 0
It's also in the avro14 branch: https://github.com/jghoman/haivvreo/commits/avro14. I merged it to the hive8 branch as well, and tested it against our hive 0.7.1 cluster today too. I'll check on that unit test failure.
I fixed the test failure on the hive8 branch. For some reason hadoop 1.0.2 has a dependency on jersey that's not included in its published pom.
@elibingham With the latest commits (bringing the avro14 branch up to 1.0.10), I've got both our 0.7.1 and 0.8.1 clusters correctly running an INSERT OVERWRITE equivalent to yours. Definitely check it out and see if it works for you. Tomorrow I'll be merging the master and 1.4 branches (or starting to...) so we don't have this extra layer of annoyance going on.
@jghoman Great. Our cluster is on Avro 1.6, so I think I need to wait until you merge to master, which is Avro 1.5/1.6 based? |
@elibingham I've pushed the changes to master, but don't have a cluster with Avro 1.5 to test it on, so you'll have to be the guinea pig...
@jghoman I built and tested with HEAD of master and I'm still seeing the problem (vs. with my patched jar). I've sanitized my CLI input and the Hadoop logs and hopefully this will give you a better idea of what's going on. The interesting part to me in the Hadoop logs is that it looks like it's deserializing data from BAR but expecting the schema of FOO in the actual job. I've obscured some of the underlying data details and left out some irrelevant parts of the logs; let me know if you need anything else.
Sanitized hive CLI input:
hive> select a.i, a.j, a.k, a.l, a.m, b.n
Sanitized Hadoop log from job:
2012-04-24 16:08:09,962 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
You're getting hit with the Avro re-encoding stuff that's being discussed in issue #8. Try with the latest pushes. Beyond that, it's difficult to debug this. The code can't find a partition to match the schema to (which happens in issue #8) and then proceeds to find the stashed schema in the jobconf, as if it were being run from a select * statement. The correct thing would be for it to match the partition. You're running CDH Hive, right? I have no idea what changes they've made, and this bit of code (how we figure out whether we're in an MR job or not) changes between versions.
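The decision being described, in rough pseudologic (the helper is hypothetical and the property name is borrowed from the SERDEPROPERTIES shown below; this is not haivvreo's exact source):

// Illustrative sketch of the fallback path, not the actual code.
Schema schema = matchSplitToPartitionSchema(split, jobConf); // hypothetical helper
if (schema == null) {
  // Fallback meant for non-MR contexts such as "select *". On Hive 0.8
  // the properties changed, so this branch was taken inside real MR jobs
  // too, handing the reader the stashed (target) schema.
  schema = Schema.parse(jobConf.get("schema.literal"));
}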
Hello,
It looks like this issue is not resolved yet. Our environment is as follows:
We are trying to execute the following query:
It constantly fails with the following exception:
Both tables point to the same Avro schema. Is there any way we can fix it? It is a show stopper for our project.
Thanks,
Sorry to hear that. I'm out of ideas on my end, since I can't replicate it. The way to avoid it is to turn off the re-encoding optimization that was mentioned before. It means individual reads of evolved records will be slower, though. I'm finishing up the Haivvreo->Hive patch as I type, so hopefully someone in the Hive community may be able to reproduce it and we can fix it there (although that may be unnecessary, because the newer version of Hive has tableproperties, which obviates the need for all the dancing we're doing).
@IllyaYalovyy Actually, yours is different. Looks like what it's encountering isn't even an Avro record. I'd recommend opening a new issue... |
Created a new ticket with a detailed description and steps to reproduce.
Hi there,
I have two tables defined roughly as:
CREATE EXTERNAL TABLE a
ROW FORMAT SERDE
'com.linkedin.haivvreo.AvroSerDe'
WITH SERDEPROPERTIES (
'schema.url'='hdfs://schema/a.avsc')
STORED AS INPUTFORMAT
'com.linkedin.haivvreo.AvroContainerInputFormat'
OUTPUTFORMAT
'com.linkedin.haivvreo.AvroContainerOutputFormat'
LOCATION
'/foo/a';
CREATE EXTERNAL TABLE b
ROW FORMAT SERDE
'com.linkedin.haivvreo.AvroSerDe'
WITH SERDEPROPERTIES (
'schema.literal'='{ ... }')
STORED AS INPUTFORMAT
'com.linkedin.haivvreo.AvroContainerInputFormat'
OUTPUTFORMAT
'com.linkedin.haivvreo.AvroContainerOutputFormat'
LOCATION
'/foo/b';
When I do:
INSERT OVERWRITE TABLE b
SELECT a, b, c FROM a;
(where b's schema is equivalent to a, b, c)
This fails with an AvroTypeException like this:
2012-02-21 11:31:58,587 INFO org.apache.hadoop.mapred.TaskStatus: task-diagnostic-info for task attempt_201202091933_0059_m_000000_1 : java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable com.linkedin.haivvreo.AvroGenericRecordWritable@92696c2
at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:161)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable com.linkedin.haivvreo.AvroGenericRecordWritable@92696c2
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:520)
at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:143)
... 8 more
Caused by: org.apache.avro.AvroTypeException: Found {
"type" : "record",
"name" : "b",
"fields" : [ { … /* schema of b / … } ]
}, expecting {
"type" : "record",
"name" : "a",
"fields" : [ { … / schema of a */ … } ],
...
}
at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:231)
at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
at org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:127)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:162)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:138)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:129)
at com.linkedin.haivvreo.AvroDeserializer$SchemaReEncoder.reencode(AvroDeserializer.java:79)
at com.linkedin.haivvreo.AvroDeserializer.deserialize(AvroDeserializer.java:121)
at com.linkedin.haivvreo.AvroSerDe.deserialize(AvroSerDe.java:81)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:504)
... 9 more
Any help would be appreciated.
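For readers hitting a similar "Found {...}, expecting {...}" error: later Avro releases (1.7+) ship a compatibility checker that can confirm whether a reader schema resolves a writer schema, which helps rule schema evolution in or out as the cause. A minimal sketch (the file paths are illustrative):

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class CheckSchemas {
  public static void main(String[] args) throws Exception {
    // Can records written with a's schema be read with b's schema?
    Schema writerSchema = new Schema.Parser().parse(new File("a.avsc"));
    Schema readerSchema = new Schema.Parser().parse(new File("b.avsc"));
    SchemaCompatibility.SchemaPairCompatibility result =
        SchemaCompatibility.checkReaderWriterCompatibility(readerSchema, writerSchema);
    System.out.println(result.getType()); // COMPATIBLE or INCOMPATIBLE
  }
}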