Tutorial not working #25

Open
tdrottner opened this issue May 3, 2015 · 20 comments

@tdrottner

Hi guys,

I am trying to run your tutorials, but I don't know what's wrong. First of all, I am new, a real beginner with things like Hadoop.

I followed this guide exactly as written: https://github.com/Esri/gis-tools-for-hadoop/wiki/GIS-Tools-for-Hadoop-for-Beginners

Now the samples: I tried https://github.com/Esri/gis-tools-for-hadoop/tree/master/samples/point-in-polygon-aggregation-hive but when I type the last query (select, join, group, order ...) I get this:

Warning: Map Join MAPJOIN[18][bigTable=earthquakes] in task 'Stage-2:MAPRED' is a cross product
15/05/03 00:48:43 WARN conf.Configuration: file:/tmp/root/hive_2015-05-03_00-48-28_169_4942355857717156539-1/-local-10008/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
15/05/03 00:48:43 WARN conf.Configuration: file:/tmp/root/hive_2015-05-03_00-48-28_169_4942355857717156539-1/-local-10008/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.

so I don't get any results ...

Another sample is Feature Class to HDFS (https://github.com/Esri/gis-tools-for-hadoop/wiki/Getting-a-Feature-Class-into-HDFS). I am at steps 6-7. I should type that DROP TABLE in Cygwin, right? I did, and the result was OK, but then I typed describe formatted input_ex and nothing happened. What's wrong with that? I am going step by step like a child. It could be a problem between keyboard and chair (me), but I do everything based on your tutorial ...

EDIT: Feature Class to HDFS is working now. I just forgot to write ; at the end of describe formatted input_ex. I didn't know that it should be there, and in the tutorial this ; is missing.
Thanks for the advice :)

@smambrose
Contributor

Hi @TikoS,

You will always get the warning:

Warning: Map Join MAPJOIN[18][bigTable=earthquakes] in task 'Stage-2:MAPRED' is a cross product
15/05/03 00:48:43 WARN conf.Configuration: file:/tmp/root/hive_2015-05-03_00-48-28_169_4942355857717156539-1/-local-10008/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
15/05/03 00:48:43 WARN conf.Configuration: file:/tmp/root/hive_2015-05-03_00-48-28_169_4942355857717156539-1/-local-10008/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.

when you run:

SELECT counties.name, count(*) cnt FROM counties
JOIN earthquakes
WHERE ST_Contains(counties.boundaryshape, ST_Point(earthquakes.longitude, earthquakes.latitude))
GROUP BY counties.name
ORDER BY cnt desc;

It's OK that the warning occurs; you just need to wait a bit longer to see the results. You'll know the process is done when you see the hive> prompt again, like the following:

image

It may be a bit easier to tell that the process is still running if, on the first command from the sample, you use hive instead of hive -S.
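To illustrate why the query above is reported as a cross product: the join has no equality key, so every county is paired with every earthquake and the ST_Contains test filters the pairs. Below is a minimal Python sketch of that idea with made-up square "counties" and points. The function names and sample data are hypothetical, and the ray-casting test is a stand-in for illustration, not the Esri implementation of ST_Contains.

```python
from collections import Counter


def point_in_polygon(x, y, ring):
    """Ray-casting test: True if (x, y) lies inside the ring of (x, y) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def count_points_per_polygon(polygons, points):
    """Test every polygon against every point (the cross product), then count."""
    counts = Counter()
    for name, ring in polygons.items():
        for x, y in points:
            if point_in_polygon(x, y, ring):
                counts[name] += 1
    # Like GROUP BY name ORDER BY cnt DESC.
    return counts.most_common()


# Hypothetical data: two unit-less squares and four points.
polygons = {
    "A": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "B": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
points = [(1, 1), (0.5, 1.5), (3, 1), (5, 5)]
result = count_points_per_polygon(polygons, points)
```

With many counties and many earthquakes the number of tested pairs grows as counties times earthquakes, which is one reason the query can take a while on a small sandbox.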

Thank you for the correction to the Feature Class to HDFS - the tutorial has been updated.

@tdrottner
Author

Hi @smambrose,

I didn't know that, and I didn't see hive> ... I will try it again without using the hive -S command.

Anyway, I was trying the Aggregation sample (taxi demo), and when I ran the aggregation (step 9) it ran for a looooong time and the progress bar (map and %) looked like this:

0%, after 30 minutes 0%, after another 30 minutes 89%, and the same 3 times over (only Map); Reduce stayed at 0%. Is this OK? I had to turn off the laptop I was using for it, so I don't know ...

But then I tried it again, skipped step 9, ran steps 10 and 11, and got this:

Could not find job application_1430678691120_0002. The job might not be running yet.

Job job_1430678691120_0002 could not be found: {"RemoteException":{"exception":"NotFoundException","message":"java.lang.Exception: job, job_1430678691120_0002, is not found","javaClassName":"org.apache.hadoop.yarn.webapp.NotFoundException"}} (error 404)

Thanks (sorry if this is so easy for you guys, I am new). BTW: can I use netCDF, HDF, or GRIB formats in these tools, or are they just for CSV, JSON ...?

@smambrose
Contributor

@TikoS,

I haven't tried running the taxi aggregation sample on a Sandbox yet - but will try to in the next day or so. On a 16 node cluster I was able to run step 9 in ~17 seconds.
image

You might want to try changing the bin size from 0.001 to .1, which might run a bit faster, so that step 9 would be:

FROM (SELECT ST_Bin(.1, ST_Point(dropoff_longitude,dropoff_latitude)) bin_id, * FROM taxi_demo) bins
SELECT ST_BinEnvelope(.1, bin_id) shape,
COUNT(*) count
GROUP BY bin_id
limit 1;
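As a rough illustration of why a larger bin size can mean less work: the same points fall into far fewer grid cells, so there are fewer groups to aggregate. The Python sketch below uses a plain floor-based grid. The names bin_key and count_by_bin and the sample coordinates are hypothetical, and the cell key is an assumption for illustration; it does not reproduce the bin IDs that ST_Bin actually returns.

```python
import math
from collections import Counter


def bin_key(lon, lat, bin_size):
    """Grid cell containing the point, as (column, row) indices."""
    return (math.floor(lon / bin_size), math.floor(lat / bin_size))


def count_by_bin(points, bin_size):
    """Count points per grid cell, like GROUP BY bin_id with COUNT(*)."""
    return Counter(bin_key(lon, lat, bin_size) for lon, lat in points)


# Hypothetical drop-off locations (longitude, latitude).
points = [
    (-73.9853, 40.7484),
    (-73.9812, 40.7527),
    (-73.9626, 40.7794),
    (-73.7781, 40.6413),
]

fine = count_by_bin(points, 0.001)   # each point lands in its own cell
coarse = count_by_bin(points, 0.1)   # three of the four points share a cell
```

Note that this only reduces the number of output groups; every input row is still read, so on a slow sandbox the map phase may remain the bottleneck.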

I would make sure the sample runs before doing the aggregation though. I wasn't able to reproduce the error you received after step 10/11. What was returned after step 10?

Currently, gis-tools-for-hadoop is for vector data (points, lines and polygons), not raster.

@tdrottner
Author

Hi,

A 16-node cluster in 17 seconds, nice ...

I changed those values and got an error; nothing was returned ...

About that vector/raster thing ...
Are you considering making it available for rasters as well? If I am not wrong,
Hadoop (or some extension) can work with netCDF, HDF, or raster data in general?

bc. Tomáš DROTTNER

Katedra geoinformatiky | Department of Geoinformatics
Univerzita Palackého v Olomouci | Palacký University in Olomouc
17. listopadu 50 | Olomouc, 771 46 | Czech Republic

@smambrose
Contributor

@TikoS

Were/are you able to confirm that the previous steps 1-8 worked (the jars were added without errors, you were able to describe the taxi_demo table, the taxi data loaded)? What error did you receive when running step 9?

You are correct that those formats can be read by Hadoop (and many are splittable). We aren't currently working on a raster solution for gis-tools-for-hadoop, although we have not ruled it out.
Searching for something like 'HDF format hadoop input format' should help you find what you are looking for.

@jmirmelstein
Member

@smambrose I'm seeing the same issue as @TikoS. When I run the final 'Select' I see the same warnings that you mentioned earlier, but then I see the hive> prompt and no results.

@tdrottner
Author

Steps 1-8 worked well ;) I don't know yet whether step 9 produced an error message, because I had to turn it off ...
But I will try it again (it's a really long process: it was running for more than 1 hour, still at 89% Map and then back to 30%). I will try it and post the results ...

Hadoop, HDF and so on ... it would be great to have that in these tools ...

@smambrose
Contributor

@jmirmelstein and @TikoS are you both using the Hortonworks sandbox when you see the original error without results for the sample? What version are you on?

Warning: Map Join MAPJOIN[18][bigTable=earthquakes] in task 'Stage-2:MAPRED' is a cross product
15/05/03 00:48:43 WARN conf.Configuration: file:/tmp/root/hive_2015-05-03_00-48-28_169_4942355857717156539-1/-local-10008/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
15/05/03 00:48:43 WARN conf.Configuration: file:/tmp/root/hive_2015-05-03_00-48-28_169_4942355857717156539-1/-local-10008/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.

I'm not currently able to reproduce it. Are there any other details if you start hive without including the -S?

@tdrottner
Author

I am using version 2.1, since the tutorial says you had some issues with version 2.2. I am keeping everything the same as in the tutorials, just in case ...
I will try it again as soon as my VM starts the Hortonworks Sandbox. It's really slow (I have a 5-year-old laptop) and I am using it right now, so ...

@tdrottner
Author

So ... I tried it (the earthquake aggregation sample, not taxi) and it finished; I get the hive> prompt at the end, but no results (name and count of earthquakes). I am attaching the whole console output here, from beginning to end:

[root@sandbox ~]# hive

Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
hive> add jar
    >   ${env:HOME}/esri-git/gis-tools-for-hadoop/samples/lib/esri-geometry-api.jar
    >   ${env:HOME}/esri-git/gis-tools-for-hadoop/samples/lib/spatial-sdk-hadoop.jar;
Added /root/esri-git/gis-tools-for-hadoop/samples/lib/esri-geometry-api.jar to class path
Added resource: /root/esri-git/gis-tools-for-hadoop/samples/lib/esri-geometry-api.jar
Added /root/esri-git/gis-tools-for-hadoop/samples/lib/spatial-sdk-hadoop.jar to class path
Added resource: /root/esri-git/gis-tools-for-hadoop/samples/lib/spatial-sdk-hadoop.jar
hive> create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point';
OK
Time taken: 4.993 seconds
hive> create temporary function ST_Contains as 'com.esri.hadoop.hive.ST_Contains';
OK
Time taken: 0.892 seconds
hive> CREATE EXTERNAL TABLE IF NOT EXISTS earthquakes (earthquake_date STRING, latitude DOUBLE, longitude DOUBLE, depth DOUBLE, magnitude DOUBLE,
    >     magtype string, mbstations string, gap string, distance string, rms string, source string, eventid string)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > LOCATION '${env:HOME}/esri-git/gis-tools-for-hadoop/samples/data/earthquake-data';
OK
Time taken: 2.926 seconds
hive> CREATE EXTERNAL TABLE IF NOT EXISTS counties (Area string, Perimeter string, State string, County string, Name string, BoundaryShape binary)
    > ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde'
    > STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedJsonInputFormat'
    > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    > LOCATION '${env:HOME}/esri-git/gis-tools-for-hadoop/samples/data/counties-data';
OK
Time taken: 0.865 seconds
hive> SELECT counties.name, count(*) cnt FROM counties
    > JOIN earthquakes
    > WHERE ST_Contains(counties.boundaryshape, ST_Point(earthquakes.longitude, earthquakes.latitude))
    > GROUP BY counties.name
    > ORDER BY cnt desc;
Warning: Map Join MAPJOIN[18][bigTable=earthquakes] in task 'Stage-2:MAPRED' is a cross product
Query ID = root_20150504141212_c9bdd531-5033-4b02-8447-69e26ed76b3a
Total jobs = 2
15/05/04 14:12:56 WARN conf.Configuration: file:/tmp/root/hive_2015-05-04_14-12-17_001_4646085809202048677-1/-local-10008/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
15/05/04 14:12:56 WARN conf.Configuration: file:/tmp/root/hive_2015-05-04_14-12-17_001_4646085809202048677-1/-local-10008/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
Execution log at: /tmp/root/root_20150504141212_c9bdd531-5033-4b02-8447-69e26ed76b3a.log
2015-05-04 02:13:01     Starting to launch local task to process map join;      maximum memory = 260177920
2015-05-04 02:13:05     Dump the side-table into file: file:/tmp/root/hive_2015-05-04_14-12-17_001_4646085809202048677-1/-local-10005/HashTable-Stage-2/MapJoin-mapfile00--.hashtable
2015-05-04 02:13:05     Uploaded 1 File to: file:/tmp/root/hive_2015-05-04_14-12-17_001_4646085809202048677-1/-local-10005/HashTable-Stage-2/MapJoin-mapfile00--.hashtable (260 bytes)
2015-05-04 02:13:05     End of local task; Time Taken: 4.91 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1430772586818_0001, Tracking URL = http://sandbox.hortonworks.com:8088/proxy/application_1430772586818_0001/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1430772586818_0001
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2015-05-04 14:14:06,510 Stage-2 map = 0%,  reduce = 0%
2015-05-04 14:15:06,904 Stage-2 map = 0%,  reduce = 0%, Cumulative CPU 2.69 sec
2015-05-04 14:15:10,564 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 3.64 sec
2015-05-04 14:16:10,793 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 3.64 sec
2015-05-04 14:16:52,976 Stage-2 map = 100%,  reduce = 67%, Cumulative CPU 4.9 sec
2015-05-04 14:16:58,387 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 7.74 sec
MapReduce Total cumulative CPU time: 7 seconds 740 msec
Ended Job = job_1430772586818_0001
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1430772586818_0002, Tracking URL = http://sandbox.hortonworks.com:8088/proxy/application_1430772586818_0002/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1430772586818_0002
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2015-05-04 14:18:22,943 Stage-3 map = 0%,  reduce = 0%
2015-05-04 14:19:20,051 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 2.37 sec
2015-05-04 14:20:20,055 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 2.37 sec
2015-05-04 14:21:13,236 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 4.62 sec
MapReduce Total cumulative CPU time: 4 seconds 620 msec
Ended Job = job_1430772586818_0002
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 7.74 sec   HDFS Read: 276 HDFS Write: 96 SUCCESS
Job 1: Map: 1  Reduce: 1   Cumulative CPU: 4.62 sec   HDFS Read: 473 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 12 seconds 360 msec
OK
Time taken: 538.985 seconds
hive>

@smambrose
Contributor

Thanks @TikoS.

What do you get when you type
select * from counties limit 1;
and
select * from earthquakes limit 1;

@tdrottner
Author

I get this

hive> select * from counties limit 1;
OK
Time taken: 0.826 seconds
hive> select * from earthquakes limit1;
OK
Time taken: 0.818 seconds
hive>

It's nice to know that it is OK ... but I want to see some results :D

@smambrose
Contributor

Hi @TikoS,

Looks like the tables are empty. You can try doing
drop table earthquakes;
drop table counties;
exit;

When you exit hive, make sure you are in the esri-git directory.

You can then re-run the sample (without using -S on hive). We'll continue to troubleshoot, but will probably not have anything today. Please keep us updated on your progress, and let us know if you get any informative error messages.

@tdrottner
Author

Hi @smambrose ,

I tried it, and I have the same problem ... the only result I get is OK.

@jmirmelstein
Member

@smambrose - this is what I see:

hive> drop table counties;
FAILED: RuntimeException MetaException(message:java.lang.ClassNotFoundException Class com.esri.hadoop.hive.serde.JsonSerde not found)
hive> [root@sandbox esri-git]#

@smambrose
Contributor

@TikoS @jmirmelstein Thanks for all your help; we think we have it figured out. We will update the sample, but for now this should get it working.

In Cygwin, you'll want to be in your esri-git directory ([root@sandbox esri-git]#) and run the following:

# make an earthquake demo directory in hadoop
hadoop fs -mkdir earthquake-demo

# hadoop fs -put /path/on/localsystem /path/to/hdfs
hadoop fs -put gis-tools-for-hadoop/samples/data/counties-data earthquake-demo
hadoop fs -put gis-tools-for-hadoop/samples/data/earthquake-data earthquake-demo

#check that it worked:
hadoop fs -ls earthquake-demo

Start up Hive and add the jars and functions:

hive

add jar
  ${env:HOME}/esri-git/gis-tools-for-hadoop/samples/lib/esri-geometry-api.jar
  ${env:HOME}/esri-git/gis-tools-for-hadoop/samples/lib/spatial-sdk-hadoop.jar;

create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point';
create temporary function ST_Contains as 'com.esri.hadoop.hive.ST_Contains';

Drop existing tables and create new empty ones:

drop table earthquakes;

CREATE TABLE earthquakes (earthquake_date STRING, latitude DOUBLE, longitude DOUBLE, depth DOUBLE, magnitude DOUBLE,
    magtype string, mbstations string, gap string, distance string, rms string, source string, eventid string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

drop table counties;

CREATE TABLE counties (Area string, Perimeter string, State string, County string, Name string, BoundaryShape binary)                                         
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde'              
STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

Load data into the tables:

LOAD DATA INPATH 'earthquake-demo/earthquake-data/earthquakes.csv' OVERWRITE INTO TABLE earthquakes;

LOAD DATA INPATH 'earthquake-demo/counties-data/california-counties.json' OVERWRITE INTO TABLE counties;

You should now be able to complete the analysis in the sample.

@tdrottner
Author

@smambrose

Hi ... I have a problem again ... in the last step (DATA INPATH) I got this:

hive> DATA INPATH 'earthquake-demo/earthquake-data/earthquakes.csv' OVERWRITE INTO TABLE earthquakes;
NoViableAltException(69@[])
        at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:999)
        at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:199)
        at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:408)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:322)
        at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:976)
        at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1041)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:912)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:902)
        at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:268)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:220)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
        at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:793)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:686)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
FAILED: ParseException line 1:0 cannot recognize input near 'DATA' 'INPATH' ''earthquake-demo/earthquake-data/earthquakes.csv''
hive>
    > DATA INPATH 'earthquake-demo/california-counties.json' OVERWRITE INTO TABLE counties;
NoViableAltException(69@[])
        at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:999)
        at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:199)
        at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:408)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:322)
        at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:976)
        at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1041)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:912)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:902)
        at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:268)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:220)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
        at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:793)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:686)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
FAILED: ParseException line 1:0 cannot recognize input near 'DATA' 'INPATH' ''earthquake-demo/california-counties.json''
hive>

I had to change the paths to the .csv and .json files because they were different.
The steps before this are OK.
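
For what it's worth, the ParseException above reports the problem at line 1:0 near 'DATA' 'INPATH', i.e. the statement as typed begins with DATA rather than LOAD DATA. A minimal Python sketch of a pre-paste check for that, and for the missing ; mentioned earlier in this thread, is below. The function name check_hive_statement is hypothetical and not part of Hive or gis-tools-for-hadoop; it only inspects the text and never talks to Hive.

```python
def check_hive_statement(stmt):
    """Return a list of problems found in one pasted Hive statement."""
    problems = []
    text = stmt.strip()
    # The Hive CLI keeps waiting for more input until it sees a semicolon.
    if not text.endswith(";"):
        problems.append("missing trailing ';'")
    # A load statement starts with LOAD DATA, not DATA.
    words = text.upper().split()
    if words[:2] == ["DATA", "INPATH"]:
        problems.append("starts with DATA INPATH; expected LOAD DATA INPATH")
    return problems


typed = "DATA INPATH 'earthquake-demo/earthquake-data/earthquakes.csv' OVERWRITE INTO TABLE earthquakes;"
fixed = "LOAD " + typed

typed_problems = check_hive_statement(typed)
fixed_problems = check_hive_statement(fixed)
```

Here typed_problems flags the missing LOAD keyword and fixed_problems is empty, so comparing the typed statement against the one in the instructions is a cheap first check before looking at paths.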

@smambrose
Contributor

Can you run this after exiting hive and post the output?

hadoop fs -ls earthquake-demo

Did you mean to get rid of /counties-data/ in the command LOAD DATA INPATH 'earthquake-demo/counties-data/california-counties.json' OVERWRITE INTO TABLE counties;? Is that what you meant by changing the path?

@tdrottner
Author

Yes, I can ... I did it before, so here is the result:

[root@sandbox esri-git]# hadoop fs -ls earthquake-demo
Found 2 items
-rw-r--r--   1 root root    1028330 2015-05-07 10:19 earthquake-demo/california-counties.json
drwxr-xr-x   - root root          0 2015-05-07 10:20 earthquake-demo/earthquake-data

And yes, that is the change I made.

BTW: Taxi demo sample ... I am at step 9 again. I changed the value from 0.01 to 1 to make it faster, BUT it is still slow ... or is that OK? I am just asking because I don't know. Isn't it weird?

hive> FROM (SELECT ST_Bin(1, ST_Point(dropoff_longitude,dropoff_latitude)) bin_id, *FROM taxi_demo) bins
    > SELECT ST_BinEnvelope(1, bin_id) shape,
    > COUNT(*) count
    > GROUP BY bin_id;
Query ID = root_20150507120909_e000001c-4259-48dd-8e98-3684d0e94566
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 3
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1431013082714_0001, Tracking URL = http://sandbox.hortonworks.com:8088/proxy/application_1431013082714_0001/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1431013082714_0001
Hadoop job information for Stage-1: number of mappers: 9; number of reducers: 3
2015-05-07 12:11:05,397 Stage-1 map = 0%,  reduce = 0%
2015-05-07 12:12:09,782 Stage-1 map = 0%,  reduce = 0%
2015-05-07 12:13:25,554 Stage-1 map = 0%,  reduce = 0%
2015-05-07 12:14:37,769 Stage-1 map = 0%,  reduce = 0%
2015-05-07 12:15:38,096 Stage-1 map = 0%,  reduce = 0%
2015-05-07 12:16:42,504 Stage-1 map = 0%,  reduce = 0%
2015-05-07 12:17:25,371 Stage-1 map = 89%,  reduce = 0%
2015-05-07 12:18:11,890 Stage-1 map = 0%,  reduce = 0%
2015-05-07 12:19:12,073 Stage-1 map = 0%,  reduce = 0%
2015-05-07 12:20:12,697 Stage-1 map = 0%,  reduce = 0%
2015-05-07 12:21:26,323 Stage-1 map = 0%,  reduce = 0%
2015-05-07 12:22:26,650 Stage-1 map = 0%,  reduce = 0%
2015-05-07 12:23:33,421 Stage-1 map = 11%,  reduce = 0%
2015-05-07 12:23:35,272 Stage-1 map = 56%,  reduce = 0%
2015-05-07 12:24:09,535 Stage-1 map = 89%,  reduce = 0%
2015-05-07 12:24:41,054 Stage-1 map = 67%,  reduce = 0%
2015-05-07 12:25:28,244 Stage-1 map = 44%,  reduce = 0%
2015-05-07 12:26:22,278 Stage-1 map = 0%,  reduce = 0%
2015-05-07 12:28:36,400 Stage-1 map = 0%,  reduce = 0%
2015-05-07 12:29:46,988 Stage-1 map = 0%,  reduce = 0%
2015-05-07 12:30:47,851 Stage-1 map = 0%,  reduce = 0%
2015-05-07 12:31:48,892 Stage-1 map = 0%,  reduce = 0%
2015-05-07 12:33:13,617 Stage-1 map = 0%,  reduce = 0%
2015-05-07 12:34:14,299 Stage-1 map = 0%,  reduce = 0%
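
The map percentage in the log above goes back down several times (for example from 89% to 0%). Below is a minimal Python sketch for spotting such drops in a saved console log, assuming the progress lines keep the "map = N%,  reduce = M%" shape shown above. The function name map_progress_drops is hypothetical. A falling map percentage may mean that task attempts are being restarted, which is worth checking at the job's tracking URL.

```python
import re

# Matches the progress part of lines such as:
# 2015-05-07 12:17:25,371 Stage-1 map = 89%,  reduce = 0%
PROGRESS = re.compile(r"map = (\d+)%,\s+reduce = (\d+)%")


def map_progress_drops(log_lines):
    """Return (previous, current) pairs where the map percentage went down."""
    drops = []
    previous = None
    for line in log_lines:
        match = PROGRESS.search(line)
        if not match:
            continue  # not a progress line
        current = int(match.group(1))
        if previous is not None and current < previous:
            drops.append((previous, current))
        previous = current
    return drops


sample_log = [
    "2015-05-07 12:16:42,504 Stage-1 map = 0%,  reduce = 0%",
    "2015-05-07 12:17:25,371 Stage-1 map = 89%,  reduce = 0%",
    "2015-05-07 12:18:11,890 Stage-1 map = 0%,  reduce = 0%",
    "2015-05-07 12:23:33,421 Stage-1 map = 11%,  reduce = 0%",
]
drops = map_progress_drops(sample_log)
```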

EDIT:
I guess my last query doesn't work for me, because it has now ended with this:

2015-05-07 12:56:44,267 Stage-1 map = 0%,  reduce = 0%
2015-05-07 13:07:46,126 Stage-1 map = 89%,  reduce = 0%
java.io.IOException: Job status not available
        at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:322)
        at org.apache.hadoop.mapreduce.Job.getStatus(Job.java:329)
        at org.apache.hadoop.mapred.JobClient.getJob(JobClient.java:598)
        at org.apache.hadoop.hive.ql.exec.mr.HadoopJobExecHelper.progress(HadoopJobExecHelper.java:288)
        at org.apache.hadoop.hive.ql.exec.mr.HadoopJobExecHelper.progress(HadoopJobExecHelper.java:547)
        at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:426)
        at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:136)
        at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
        at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
        at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1504)
        at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1271)
        at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1089)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:912)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:902)
        at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:268)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:220)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
        at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:793)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:686)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Ended Job = job_1431013082714_0001 with exception 'java.io.IOException(Job status not available )'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
hive>

@smambrose Please, could you help me a little bit with this?

@smambrose
Contributor

@TikoS

I'm going to create a new issue for the aggregation problems.

As for the data inpath error - I have been able to reproduce it, and am still looking into the cause.

I was able to run the above code again using something other than "earthquake-demo", and it worked. So you might want to try it again if you haven't. Here is what I had:

Loading them into "try1"

[root@sandbox esri-git]# hadoop fs -mkdir try1
[root@sandbox esri-git]# hadoop fs -put gis-tools-for-hadoop/samples/data/counties-data try1
[root@sandbox esri-git]# hadoop fs -put gis-tools-for-hadoop/samples/data/earthquake-data try1
[root@sandbox esri-git]# hadoop fs -ls try1
Found 2 items
drwxr-xr-x   - root root          0 2015-05-12 08:47 try1/counties-data
drwxr-xr-x   - root root          0 2015-05-12 08:47 try1/earthquake-data
[root@sandbox esri-git]# hadoop fs -ls try1/counties-data
Found 1 items
-rw-r--r--   1 root root    1028330 2015-05-12 08:47 try1/counties-data/california-counties.json
[root@sandbox esri-git]# hadoop fs -ls try1/earthquake-data
Found 1 items
-rw-r--r--   1 root root    5742716 2015-05-12 08:47 try1/earthquake-data/earthquakes.csv

Use Hive and load jars/functions

[root@sandbox esri-git]# hive
add jar
  ${env:HOME}/esri-git/gis-tools-for-hadoop/samples/lib/esri-geometry-api.jar
  ${env:HOME}/esri-git/gis-tools-for-hadoop/samples/lib/spatial-sdk-hadoop.jar;

Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
hive>
    > add jar
    >   ${env:HOME}/esri-git/gis-tools-for-hadoop/samples/lib/esri-geometry-api.jar
    >   ${env:HOME}/esri-git/gis-tools-for-hadoop/samples/lib/spatial-sdk-hadoop.jar;
Added /root/esri-git/gis-tools-for-hadoop/samples/lib/esri-geometry-api.jar to class path
Added resource: /root/esri-git/gis-tools-for-hadoop/samples/lib/esri-geometry-api.jar
Added /root/esri-git/gis-tools-for-hadoop/samples/lib/spatial-sdk-hadoop.jar to class path
Added resource: /root/esri-git/gis-tools-for-hadoop/samples/lib/spatial-sdk-hadoop.jar
hive> create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point';
OK
Time taken: 1.814 seconds
hive> create temporary function ST_Contains as 'com.esri.hadoop.hive.ST_Contains';
OK
Time taken: 0.545 seconds

View tables/drop old ones/create new ones

hive> show tables;
OK
agg_samp
counties
earthquakes
input_ex
sample_07
sample_08
Time taken: 0.687 seconds, Fetched: 6 row(s)
hive> drop table counties;
OK
Time taken: 0.999 seconds
hive> drop table earthquakes;
OK
Time taken: 0.6 seconds
hive> CREATE TABLE earthquakes (earthquake_date STRING, latitude DOUBLE, longitude DOUBLE, depth DOUBLE, magnitude DOUBLE,
    >     magtype string, mbstations string, gap string, distance string, rms string, source string, eventid string)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE;
OK
Time taken: 0.53 seconds
hive> show tables
    > ;
OK
agg_samp
earthquakes
input_ex
sample_07
sample_08
Time taken: 0.337 seconds, Fetched: 5 row(s)

Tried loading from "earthquake-demo"

hive> DATA INPATH 'earthquake-demo/earthquake-data/earthquakes.csv' OVERWRITE INTO TABLE earthquakes;
NoViableAltException(69@[])
        at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:999)
        at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:199)
        at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:408)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:322)
        at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:976)
        at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1041)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:912)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:902)
        at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:268)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:220)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
        at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:793)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:686)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
FAILED: ParseException line 1:0 cannot recognize input near 'DATA' 'INPATH' ''earthquake-demo/earthquake-data/earthquakes.csv''

Loaded from new location (try1)

hive> LOAD DATA INPATH 'try1/earthquake-data/earthquakes.csv' OVERWRITE INTO TABLE earthquakes;
Loading data to table default.earthquakes
rmr: DEPRECATED: Please use 'rm -r' instead.
Moved: 'hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/earthquakes' to trash at: hdfs://sandbox.hortonworks.com:8020/user/root/.Trash/Current
Table default.earthquakes stats: [numFiles=1, numRows=0, totalSize=5742716, rawDataSize=0]
OK
Time taken: 1.327 seconds

Created counties table and loaded data

hive> CREATE TABLE counties (Area string, Perimeter string, State string, County string, Name string, BoundaryShape binary)
    > ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde'
    > STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedJsonInputFormat'
    > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
OK
Time taken: 1.875 seconds
hive> LOAD DATA INPATH 'earthquake-demo/counties-data/california-counties.json' OVERWRITE INTO TABLE counties;
FAILED: SemanticException Line 1:17 Invalid path ''earthquake-demo/counties-data/california-counties.json'': No files matching path hdfs://sandbox.hortonworks.com:8020/user/root/earthquake-demo/counties-data/california-counties.json
hive> LOAD DATA INPATH 'try1/counties-data/california-counties.json' OVERWRITE INTO TABLE counties;
Loading data to table default.counties
rmr: DEPRECATED: Please use 'rm -r' instead.
Moved: 'hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/counties' to trash at: hdfs://sandbox.hortonworks.com:8020/user/root/.Trash/Current
Table default.counties stats: [numFiles=1, numRows=0, totalSize=1028330, rawDataSize=0]
OK
Time taken: 0.609 seconds

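An important thing to note: once you load data, you have to run the put command again (like hadoop fs -put gis-tools-for-hadoop/samples/data/counties-data try1) before you can load it a second time. LOAD DATA INPATH moves the file out of its HDFS source directory and into the Hive warehouse, so the source file is no longer there afterwards. That is why the load from earthquake-demo failed with "No files matching path". The "Moved: 'hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/counties' to trash" message is OVERWRITE discarding the table's previous contents.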
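If it helps, the re-staging step can be scripted so it is easy to repeat before each reload. This is only a rough sketch, not part of the tutorial: it assumes you run it from the directory that contains gis-tools-for-hadoop, that earthquake-data sits next to counties-data under samples/data, and that try1 is the HDFS directory you load from. The SAMPLES, TARGET and DRY_RUN variables and the run/restage helpers are names made up for this example. By default it only prints the hadoop commands; set DRY_RUN=0 to actually execute them.

```shell
#!/bin/sh
# Sketch: re-stage the sample data in HDFS before re-running LOAD DATA INPATH.
# Assumed layout (adjust to your setup):
#   local samples : gis-tools-for-hadoop/samples/data/{earthquake-data,counties-data}
#   HDFS target   : try1 (relative to your HDFS home directory)
SAMPLES=${SAMPLES:-gis-tools-for-hadoop/samples/data}
TARGET=${TARGET:-try1}
DRY_RUN=${DRY_RUN:-1}   # 1 = print the commands only, 0 = execute them

# Print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

restage() {
    # Make sure the target directory exists.
    run hadoop fs -mkdir -p "$TARGET"
    for d in earthquake-data counties-data; do
        # Clear whatever is left of the previous copy so the put starts clean.
        run hadoop fs -rm -r -f "$TARGET/$d"
        # Copy the local sample directory back into HDFS.
        run hadoop fs -put "$SAMPLES/$d" "$TARGET"
    done
}

restage
```

After running it with DRY_RUN=0, the LOAD DATA INPATH 'try1/...' statements from the session above should find their files again.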

Please let me know if a second run through works.
