Copy from HDFS not working #22

Open
mayur7789 opened this issue Jan 10, 2015 · 54 comments

Comments

@mayur7789

Here are the error details:
[screenshot: GP tool error]

I have Windows 7 as the host machine and a Cloudera VM (Linux) as the guest machine. ArcMap is installed on the host machine; please find below the screenshot taken while accessing
http://quickstart.cloudera:50070/dfshealth.html#tab-overview (192.168.126.128:50070/) through a browser on the same machine as ArcMap.
[screenshot: Cloudera VM namenode web UI]

I also tried solving the problem with the help of "Copy to HDFS not working #16", but I am still facing the same problem.
[screenshot: browser dialog in the VM]

Please point me in the right direction.

@climbage
Member

Can you make the tool work by using the IP address of the namenode rather than the hostname?

@mayur7789
Author

@climbage: No, I tried using the tool with the IP address but couldn't make it work. Any other solutions?

@randallwhitman
Contributor

The third image in the issue description of 2015/01/10 does not appear to be the "same problem" as the first image; it is very different. The third image shows a web browser [Firefox] dialog asking which application should be used to open the CSV file. In fact, such a dialog indicates a successful download of the requested file. This may not be entirely relevant information, though, as the screenshot appears to show a web browser running in the guest OS in the VM, and no screenshot is provided of a browser on the host OS trying to access the file through that URL.

@GISDev01

@mayur7789 If you have not figured this out yet, it looks like you are using "quickstart" as an input to the GP tool in your first screenshot. Your host machine likely has no idea where "quickstart" is located unless you actually put quickstart in your Windows hosts file. As Randall mentions, you should try accessing the same URL as in your 3rd screenshot from a web browser on your host machine instead of from within the Cloudera VM. This test is to make sure webhdfs is accessible from the host machine (the same one running the GP tool from within ArcMap) to the Cloudera VM.

If the IP URL shown in your third screenshot works in your host machine's web browser, then try switching the input parameter in your GP tool in ArcMap from "quickstart" to just the IPv4 address you used in the webhdfs URL test.
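
As a concrete version of that test, something like the following can be run on the host machine. This is a minimal sketch in Python 2 (the interpreter the GP tools themselves run under); the IP address is the one from the screenshots in this thread and is an assumption about your setup.

# Minimal webhdfs reachability probe, run from the host machine (Python 2).
# Substitute the address of your own namenode VM for the IP below.
import urllib2

url = 'http://192.168.126.128:50070/webhdfs/v1/?op=LISTSTATUS'
try:
    print urllib2.urlopen(url, timeout=10).read()
except Exception as e:
    # If this fails, the GP tool will fail the same way.
    print 'webhdfs not reachable from this host:', e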

@JamesLMilner

I am having the same issue with the out-of-the-box VM provided via https://github.com/Esri/gis-tools-for-hadoop/wiki/GIS-Tools-for-Hadoop-for-Beginners . I have followed all the instructions, but I hit the same error whilst trying to run the Copy to HDFS script. I have a feeling that, as per @GISDev01, the host machine has no way of deciphering what sandbox or sandbox.hortonworks.com actually is, but I'm not sure how to make that happen. If any of @mayur7789, @randallwhitman or @climbage have any ideas, that would be greatly appreciated, as I'm supposed to be demoing it soon.

[screenshot]

[screenshot]

@climbage
Member

climbage commented Apr 8, 2015

@JamesMilnerUK Can you change sandbox to localhost and try again?

@JamesMBallard

Wrong James...

@JamesLMilner

@climbage thanks! That worked (whoops, what a simple mistake). However, now I'm faced with a new error... "Unexpected error : No JSON object could be decoded". Perhaps something funky is going on with the actual created file?

[screenshot]

@climbage
Member

climbage commented Apr 8, 2015

@JamesMilnerUK Could be. Can you browse to the file in HDFS? http://localhost:50070

@JamesLMilner

@climbage for sure; this is what I see.
[screenshot]

@climbage
Member

climbage commented Apr 8, 2015

@JamesMilnerUK is there a link on that page to browse the file system?

@JamesLMilner

@climbage yeah, so I can browse to the file; however, when I try to open agg_samp I get "file not found" in my web browser.

[screenshot]
[screenshot]

I managed to download the file using HDP 2.2, which is strange:

[screenshot]

No obvious malformation in the JSON, but there's so much of it that it's kind of hard to tell.

@climbage
Member

climbage commented Apr 8, 2015

Hmm, and you created the dataset through the Hive sample?

@JamesLMilner

@climbage Yeah, I'm following the workflow as described on https://github.com/Esri/gis-tools-for-hadoop/wiki/Getting-the-results-of-a-Hive-query-into-ArcGIS . Everything appears to run correctly:
[screenshot]

But I get the JSON error now whilst running the Copy from HDFS tool

Edit: Hang on, I do get this error "FAILED: SemanticException [Error 10004]: Line 2:7 Invalid table alias or column reference 'event_date': (possible column names are: earthquake_date, latitude, longitude, depth, mag "

which I have a feeling relates to issue #24? If I change it to earthquake_date it runs error-free in Hive, but I feel the problem may run deeper.

@climbage
Member

climbage commented Apr 8, 2015

How about select count(*) from agg_samp?

@JamesLMilner

@climbage if I replace event_date with earthquake_date, it comes back with "OK 12948". It also seems it's not producing the finished file as JSON.

@climbage
Member

climbage commented Apr 8, 2015

When are you replacing it?

@JamesLMilner

Replacing event_date (as described in the tutorial: https://github.com/Esri/gis-tools-for-hadoop/wiki/Getting-the-results-of-a-Hive-query-into-ArcGIS) with earthquake_date, because otherwise it fails.

CREATE TABLE IF NOT EXISTS earthquakes (earthquake_date STRING, latitude DOUBLE, longitude DOUBLE, magnitude DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH './earthquakes.csv' OVERWRITE INTO TABLE earthquakes;

CREATE TABLE earthquakes_new(earthquake_date STRING, latitude DOUBLE, longitude DOUBLE, magnitude DOUBLE);
INSERT OVERWRITE TABLE earthquakes_new
SELECT earthquake_date, latitude, longitude, magnitude
FROM earthquakes
WHERE latitude is not null;

Thank you for your help by the way.

@climbage
Member

climbage commented Apr 8, 2015

Interesting. So what if you re-run the GP tool now?

@JamesLMilner

The Hive commands pass, per the screenshot provided above. The GP tool gives: Unexpected error : No JSON object could be decoded

Interestingly, if I copy the JSON to my desktop and run the second GP task (JSON to Feature Layer) I get this:

[screenshot]

I'm not sure if this is how it's supposed to look as no example is provided, but I'm assuming that it is correct (feature table looks OK; visualises OK).

@climbage
Member

climbage commented Apr 8, 2015

Well, that is how it's supposed to look so you're at least getting the right results.

Actually, I think I might know why the Copy From HDFS tool is not working. Instead of,

apps/hive/warehouse/...

use

/apps/hive/warehouse/...

If you don't root the path, it is assumed to be relative to your home directory, /user/[username].
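
In other words, using the paths from this thread (the /user/root home directory is the usual HDFS default for the root user, and an assumption about this cluster):

apps/hive/warehouse/agg_samp    ->  /user/root/apps/hive/warehouse/agg_samp  (relative, resolved against the home directory)
/apps/hive/warehouse/agg_samp   ->  /apps/hive/warehouse/agg_samp            (absolute, used as-is)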

@JamesLMilner

Thanks for the suggestion. I have attempted it and unfortunately I'm still getting the same error: (Unexpected error : [Errno 11004] getaddrinfo failed). Changing it back to no slash gives me the JSON error, which is more promising (so to speak!).

Also, just out of interest, what is this query doing? Aggregating the number of earthquakes in a year into square polygons?

@climbage
Member

climbage commented Apr 8, 2015

Can you post the traceback for the error this time?

@JamesLMilner

Executing: CopyFromHDFS localhost 50070 root     apps/hive/warehouse/agg_samp     C:\Users\jmilner\Desktop\earthquakes3.json
Start Time: Wed Apr 08 20:35:45 2015
Running script CopyFromHDFS...
Unexpected error : No JSON object could be decoded
Traceback (most recent call last):
  File "<string>", line 265, in execute
  File "C:\Users\jmilner\Documents\Developer Evangelist\GeoDev Meetups\GeoDev2\geoprocessing-tools-for-hadoop\webhdfs\webhdfs.py", line 181, in getFileStatus
    data_dict = json.loads(response.read())
  File "C:\Python27\Lib\json\__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "C:\Python27\Lib\json\decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Python27\Lib\json\decoder.py", line 384, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded

I also tried changing

DROP TABLE agg_samp;
CREATE TABLE agg_samp(area binary, count double)
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde'              
STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

to

DROP TABLE agg_samp;
CREATE TABLE agg_samp(area binary, count double)
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde'              
STORED AS INPUTFORMAT 'com.esri.json.hadoop.UnenclosedJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

I still got the same error (Hive didn't fail, however).

@climbage
Member

climbage commented Apr 8, 2015

It still looks like you don't have a / at the beginning of your path

@JamesLMilner

@climbage if I put a / I get the old error:

Executing: CopyFromHDFS localhost 50070 root /apps/hive/warehouse/agg_samp  
Unexpected error : [Errno 11004] getaddrinfo failed
Traceback (most recent call last):
  File "<string>", line 277, in execute
  File "C:\Users\jmilner\Documents\Developer Evangelist\GeoDev Meetups\GeoDev2\geoprocessing-tools-for-hadoop\webhdfs\webhdfs.py", line 152, in copyFromHDFS
    fileDownloadClient.request('GET', redirect_path, headers={})
  File "C:\Python27\Lib\httplib.py", line 995, in request
    self._send_request(method, url, body, headers)
  File "C:\Python27\Lib\httplib.py", line 1029, in _send_request
    self.endheaders(body)
  File "C:\Python27\Lib\httplib.py", line 991, in endheaders
    self._send_output(message_body)
  File "C:\Python27\Lib\httplib.py", line 844, in _send_output
    self.send(msg)
  File "C:\Python27\Lib\httplib.py", line 806, in send
    self.connect()
  File "C:\Python27\Lib\httplib.py", line 787, in connect
    self.timeout, self.source_address)
  File "C:\Python27\Lib\socket.py", line 553, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
gaierror: [Errno 11004] getaddrinfo failed

@climbage
Member

climbage commented Apr 8, 2015

That's what I'm looking for. This is slightly different in that now you're getting an error during redirection to a datanode.

Try this in the browser and see what you get.

http://localhost:50070/webhdfs/v1/apps/hive/warehouse/agg_samp/000000_0?op=OPEN
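
For background, op=OPEN on webhdfs is a two-step exchange: the namenode answers with an HTTP 307 redirect whose Location header names a datanode, and the actual bytes come from that datanode. A rough Python 2 sketch of what the CopyFromHDFS tool does internally (hostname and path are the ones used in this thread):

import httplib

# Step 1: ask the namenode where the file lives.
conn = httplib.HTTPConnection('localhost', 50070)
conn.request('GET', '/webhdfs/v1/apps/hive/warehouse/agg_samp/000000_0?op=OPEN')
resp = conn.getresponse()

# Step 2: the namenode replies 307 with a datanode URL. If that URL names a
# host this machine cannot resolve (e.g. sandbox.hortonworks.com), the
# follow-up request fails in getaddrinfo, exactly as in the traceback above.
print resp.status, resp.getheader('Location')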

@JamesLMilner

Thanks again. Unfortunately I get "This web page is not available". I'm redirected to:

 http://sandbox.hortonworks.com:50075/webhdfs/v1/apps/hive/warehouse/agg_samp/000000_0?op=OPEN&namenoderpcaddress=sandbox.hortonworks.com:8020&offset=0

@GISDev01

GISDev01 commented Apr 8, 2015

@JamesMilnerUK Is sandbox.hortonworks.com in your Windows hosts file?

@climbage
Member

climbage commented Apr 8, 2015

Now we're getting somewhere. @GISDev01 has the solution. @smambrose can we put this in the tutorials?

@JamesLMilner

I added
sandbox.hortonworks.com localhost
to my hosts file, but that didn't work. Am I doing something wrong?

Interestingly changing sandbox.hortonworks.com to localhost in the URL:

 localhost:50075/webhdfs/v1/apps/hive/warehouse/agg_samp/000000_0?op=OPEN&namenoderpcaddress=sandbox.hortonworks.com:8020&offset=0

downloads the JSON

@climbage
Member

climbage commented Apr 8, 2015

I believe it should be...

127.0.0.1     sandbox.hortonworks.com

@GISDev01

GISDev01 commented Apr 8, 2015

The IP in your Windows OS hosts file is going to be the IP of your Hortonworks VM, which is highly dependent on your VM settings. In most cases, with either VMware Player or VirtualBox on a Windows host using the default options for the virtual NIC, Mike's answer above is correct, since the VM shares the IP of the host by default.
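
For example, typical Windows hosts file entries (C:\Windows\System32\drivers\etc\hosts) for the two common networking modes; pick whichever matches your setup, and note the second address is purely illustrative and must be replaced by your VM's actual IP:

# NAT with port forwarding: the VM is reachable on the host's own address
127.0.0.1        sandbox.hortonworks.com

# Host-only or bridged networking: use the VM's own address instead
192.168.233.130  sandbox.hortonworks.com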

@JamesLMilner

Awesome! Got it working :) Thank you guys.

The setup is:

Working Copy From HDFS setup:

[screenshot]

Working hosts file entry:

127.0.0.1     sandbox.hortonworks.com

And the change in the Hive SQL from event_date to earthquake_date.

@GISDev01

GISDev01 commented Apr 8, 2015

Excellent!

@JamesLMilner

Also, feel free to use any of the screenshots in the documentation if they are useful. It might be good to show what the finished data should look like when mapped :)

@climbage
Member

climbage commented Apr 8, 2015

Awesome! We'll definitely want to work some of this into the documentation.

@GISDev01

GISDev01 commented Apr 8, 2015

@JamesMilnerUK Just wondering, are you using VMWare Player or VirtualBox?

@JamesLMilner

@GISDev01 I was using VirtualBox. I was following the GIS Tools for Hadoop for Beginners guide.

@smambrose
Contributor

The above has been added to the documentation (the wikis). Thanks everyone!

@ymzzx

ymzzx commented Aug 29, 2015

I met the same problem as you, @JamesMilnerUK, and I followed your steps, but got the same error:
Executing: CopyFromHDFS 192.168.152.113 50070 root /user/hive/warehouse/agg_samp d:\134.json
Start Time: Sat Aug 29 21:37:05 2015
Running script CopyFromHDFS...
Unexpected error : [Errno 11004] getaddrinfo failed
Traceback (most recent call last):
  File "<string>", line 277, in execute
  File "E:\hadoop_learning\geoprocessing-tools-for-hadoop-master\geoprocessing-tools-for-hadoop-master\webhdfs\webhdfs.py", line 152, in copyFromHDFS
    fileDownloadClient.request('GET', redirect_path, headers={})
  File "C:\Python27\ArcGIS10.2\Lib\httplib.py", line 958, in request
    self._send_request(method, url, body, headers)
  File "C:\Python27\ArcGIS10.2\Lib\httplib.py", line 992, in _send_request
    self.endheaders(body)
  File "C:\Python27\ArcGIS10.2\Lib\httplib.py", line 954, in endheaders
    self._send_output(message_body)
  File "C:\Python27\ArcGIS10.2\Lib\httplib.py", line 814, in _send_output
    self.send(msg)
  File "C:\Python27\ArcGIS10.2\Lib\httplib.py", line 776, in send
    self.connect()
  File "C:\Python27\ArcGIS10.2\Lib\httplib.py", line 757, in connect
    self.timeout, self.source_address)
  File "C:\Python27\ArcGIS10.2\Lib\socket.py", line 553, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
gaierror: [Errno 11004] getaddrinfo failed

Could you help me? Thanks a lot!

@ymzzx

ymzzx commented Aug 29, 2015

Please help me, @JamesMilnerUK, thanks a lot!

@ymzzx

ymzzx commented Aug 29, 2015

[screenshot]
Please help me, thanks a lot! @mayur7789 @climbage @randallwhitman @GISDev01 @JamesMilnerUK

@climbage
Member

@ymzzx Can you get to the namenode web UI through your browser?

@ymzzx

ymzzx commented Aug 30, 2015

@climbage Here are the details. I use VMware Workstation; there are three nodes in my VM setup: one namenode and two datanodes. I added the namenode and its IP to my host machine's hosts file (192.168.152.113 hadoop1); my host machine runs Windows 7. I can get to the namenode through http://192.168.152.113:50070, but I cannot browse the filesystem in my host machine's browser. As the image shows, when I click "Browse the filesystem", there is an error.
[screenshot]
[screenshot]

@ymzzx

ymzzx commented Aug 30, 2015

And another question: why is numRows 0, as the image shows? @climbage
[screenshot]

@ymzzx

ymzzx commented Aug 30, 2015

Maybe there are many problems. Thanks a lot! @climbage

@amitabh74

I too am facing the same problem. I am trying to copy from HDFS where the HDFS server is on a remote system. All the steps in the tutorial https://github.com/Esri/gis-tools-for-hadoop/wiki/Getting-the-results-of-a-Hive-query-into-ArcGIS have been followed as described.
However, running the "Copy from HDFS" tool shows the following error:
Executing: CopyFromHDFS RHDCN05.rmsi.com 50070 hdfs /apps/hive/warehouse/taxi_agg E:\test.json
Start Time: Tue Jan 05 10:24:34 2016
Running script CopyFromHDFS...
Unexpected error : [Errno 10061] No connection could be made because the target machine actively refused it
Traceback (most recent call last):
  File "<string>", line 277, in execute
  File "E:\ESRI Hadoop Libraries\geoprocessing-tools-for-hadoop-master\webhdfs\webhdfs.py", line 152, in copyFromHDFS
    fileDownloadClient.request('GET', redirect_path, headers={})
  File "C:\Python27\ArcGIS10.3\Lib\httplib.py", line 995, in request
    self._send_request(method, url, body, headers)
  File "C:\Python27\ArcGIS10.3\Lib\httplib.py", line 1029, in _send_request
    self.endheaders(body)
  File "C:\Python27\ArcGIS10.3\Lib\httplib.py", line 991, in endheaders
    self._send_output(message_body)
  File "C:\Python27\ArcGIS10.3\Lib\httplib.py", line 844, in _send_output
    self.send(msg)
  File "C:\Python27\ArcGIS10.3\Lib\httplib.py", line 806, in send
    self.connect()
  File "C:\Python27\ArcGIS10.3\Lib\httplib.py", line 787, in connect
    self.timeout, self.source_address)
  File "C:\Python27\ArcGIS10.3\Lib\socket.py", line 571, in create_connection
    raise err
error: [Errno 10061] No connection could be made because the target machine actively refused it

Completed script CopyFromHDFS...
Failed to execute (CopyFromHDFS).
Failed at Tue Jan 05 10:24:37 2016 (Elapsed Time: 2.74 seconds)
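
Errno 10061 means the connection was refused: the hostname resolved, but nothing accepted the connection on the port the tool tried (here, the datanode named in the redirect). A quick probe from the client machine can confirm this. A Python 2 sketch; the host is the one from the log above, and 50075 is the stock HDFS datanode HTTP port, an assumption about this cluster:

import socket

# Probe the datanode HTTP port; adjust the port if your cluster differs.
try:
    s = socket.create_connection(('RHDCN05.rmsi.com', 50075), timeout=5)
    print 'datanode port reachable'
    s.close()
except socket.error as e:
    print 'cannot reach datanode port:', e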

@ymzzx

ymzzx commented Jan 5, 2016

OK, just copy your result the usual way, without using the tools, and you will find a nice end.
From: "amitabh74"[email protected]
Date: Tue, Jan 5, 2016 05:49 PM
To: "Esri/gis-tools-for-hadoop"[email protected];
Cc: "ymzzx"[email protected];
Subject: Re: [gis-tools-for-hadoop] Copy from HDFS not working (#22)

I am too facing the same problem. I am trying to copy from HDFS where HDFS server is on remote system. All the steps in turorial https://github.com/Esri/gis-tools-for-hadoop/wiki/Getting-the-results-of-a-Hive-query-into-ArcGIS has been followed as mentioned.
However running the "Copy from HDFS tool" shows following error:
Executing: CopyFromHDFS RHDCN05.rmsi.com 50070 hdfs /apps/hive/warehouse/taxi_agg E:\test.json
Start Time: Tue Jan 05 10:24:34 2016
Running script CopyFromHDFS...
Unexpected error : [Errno 10061] No connection could be made because the target machine actively refused it
Traceback (most recent call last):
File "", line 277, in execute
File "E:\ESRI Hadoop Libraries\geoprocessing-tools-for-hadoop-master\webhdfs\webhdfs.py", line 152, in copyFromHDFS
fileDownloadClient.request('GET', redirect_path, headers={})
File "C:\Python27\ArcGIS10.3\Lib\httplib.py", line 995, in request
self._send_request(method, url, body, headers)
File "C:\Python27\ArcGIS10.3\Lib\httplib.py", line 1029, in _send_request
self.endheaders(body)
File "C:\Python27\ArcGIS10.3\Lib\httplib.py", line 991, in endheaders
self._send_output(message_body)
File "C:\Python27\ArcGIS10.3\Lib\httplib.py", line 844, in _send_output
self.send(msg)
File "C:\Python27\ArcGIS10.3\Lib\httplib.py", line 806, in send
self.connect()
File "C:\Python27\ArcGIS10.3\Lib\httplib.py", line 787, in connect
self.timeout, self.source_address)
File "C:\Python27\ArcGIS10.3\Lib\socket.py", line 571, in create_connection
raise err
error: [Errno 10061] No connection could be made because the target machine actively refused it

Completed script CopyFromHDFS...
Failed to execute (CopyFromHDFS).
Failed at Tue Jan 05 10:24:37 2016 (Elapsed Time: 2.74 seconds)


Reply to this email directly or view it on GitHub.

@amitabh74

I did try what you are suggesting, using the following URL:
rhdcn05.rmsi.com:50075/webhdfs/v1/apps/hive/warehouse/taxi_agg/000000_0?op=OPEN&namenoderpcaddress=RHDCN05.rmsi.com:8020&offset=0
and saving the output as a JSON file, then opening it by converting it to a feature class, successfully.
However, the basic problem remains: I am not able to copy the table from HDFS.
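
Incidentally, that manual fetch can be scripted, sidestepping the redirect entirely. A Python 2 sketch using the URL above (host, path, and output file are the ones from this thread; adjust as needed):

import urllib2

url = ('http://rhdcn05.rmsi.com:50075/webhdfs/v1/apps/hive/warehouse/'
       'taxi_agg/000000_0?op=OPEN&namenoderpcaddress=RHDCN05.rmsi.com:8020&offset=0')
# Fetch directly from the datanode and save the result locally.
with open(r'E:\test.json', 'wb') as f:
    f.write(urllib2.urlopen(url).read())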

@GISDev01

GISDev01 commented Mar 8, 2016

@amitabh74 that is actually a different error than the original issue here. Please open a new issue, as this one can be closed. We want to avoid the common problem of GitHub issues turning into forum threads. I can help you troubleshoot in a new issue.

@Cess-Zhu

@ymzzx @amitabh74 I met the same problem just as you did, but finally I figured out a solution. Hope it may help you.

First of all, the IP of my VM is 192.168.233.130:
[screenshot: VM console showing the IP]

Change my OS hosts file to:
192.168.233.130 sandbox.hortonworks.com
[screenshot]

Then, fill in the "HDFS host name" box with "sandbox.hortonworks.com" in the "Copy From HDFS" tool:
[screenshot]

and success:
[screenshot]

This solution is quite different; however, it works. I was inspired by @GISDev01, @JamesMilnerUK and @climbage. When I change
http://sandbox.hortonworks.com:50075/webhdfs/v1/apps/hive/warehouse/taxi_agg/000000_0?op=OPEN&namenoderpcaddress=sandbox.hortonworks.com:8020&offset=0
to
http://192.168.233.130:50075/webhdfs/v1/apps/hive/warehouse/taxi_agg/000000_0?op=OPEN&namenoderpcaddress=sandbox.hortonworks.com:8020&offset=0
I can get the json file.

But I am still wondering why the error appears, and why I cannot use "localhost" as the "HDFS server name" when my hosts file has already been edited with 127.0.0.1 sandbox.hortonworks.com.

BTW, I'm using VMware, my physical machine's OS is Windows 10, and the VM shares the IP of the host.

@MjHow912

MjHow912 commented Dec 1, 2017

Has anyone tried connecting to a Kerberized cluster from ArcMap to make this work? If so, what was your workaround? I would like to just copy the table to my local desktop and then import it from there to turn it into a feature, but my trouble is exporting it.

@wangbingzhang123

@ymzzx @amitabh74 @Cess-Zhu I met the same problem, Error 10061. Have you solved it? Please help me. Thank you very much.

[screenshot]
