Home
Getting data into and out of HDFS via the hadoop fs command-line tool is annoying, plain and simple. Wouldn't it be great if you could just mount an HDFS store as a local drive? That way, you could use the same set of tools you always use when working with files (GUI file browsers, your favorite text editor, grep).
For most UNIX-based systems (Linux, *BSD, OS X), there's a general solution to this problem called FUSE (Filesystem in Userspace). FUSE is a library and an API that allows you to write userland drivers for filesystems. That makes developing, debugging, and distributing your driver much, much easier. And because the driver runs outside the kernel, a bug or two won't crash your machine.
That's the filesystem side of things. Getting the data into and out of HDFS is the next trick. There's a Java API for it that comes with Hadoop, but using it means you need Hadoop installed and configured to talk to your cluster. Ideally, this should be a driver that can be easily pointed at any cluster, and can be run from a machine that doesn't have Hadoop installed at all.
Thankfully, HDFS also provides a RESTful HTTP API called WebHDFS. Each file in an HDFS store is addressed by a URI, and operating on those files becomes a matter of issuing HTTP calls (e.g. GET to retrieve the contents of a file, DELETE to delete it). Since it's just HTTP, you can run the client from any machine that can reach the cluster, and there's no Hadoop installation requirement.
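To make the "files are URIs, operations are HTTP calls" idea concrete, here is a minimal sketch of how such requests could be built with nothing but the Python standard library. It is an illustration, not this project's actual code: the NameNode address and the helper name webhdfs_request are assumptions made up for the example, and the port in particular depends on your Hadoop version and configuration.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

# Assumption: a hypothetical NameNode address; replace with your cluster's.
NAMENODE = "http://namenode.example.com:50070"


def webhdfs_request(method, path, op, **params):
    """Build (but do not send) a WebHDFS request for an absolute HDFS path.

    This helper is hypothetical. WebHDFS selects the operation with the
    `op` query parameter, and files live under the /webhdfs/v1 prefix.
    """
    query = urlencode({"op": op, **params})
    url = f"{NAMENODE}/webhdfs/v1{quote(path)}?{query}"
    return Request(url, method=method)


# Reading a file is a GET with op=OPEN.
read_req = webhdfs_request("GET", "/user/alice/data.txt", "OPEN")

# Deleting a file is a DELETE with op=DELETE.
delete_req = webhdfs_request(
    "DELETE", "/user/alice/old.txt", "DELETE", recursive="false"
)
```

Passing one of these objects to urllib.request.urlopen would actually send it. Note that a read is normally answered with a redirect to a datanode, so the machine running the client needs to be able to reach the datanodes as well as the NameNode.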
That's the HDFS side of things settled. However, FUSE is natively a C API, and talking to a RESTful web API in C is... probably the nicest word I can find for it is “awkward”. Since I'm most comfortable programming in Python these days, and there's a good Python FUSE library out there called fusepy, this project is written in Python.
Naturally, there wasn't a clean mapping from the FUSE API to the WebHDFS API. Some things, like symlinks, are only partially supported at best on HDFS. Other things, like file locking, aren't supported at all. The trick there was to support as much as I possibly could, and do it in a way that was consistent with the way you'd expect a filesystem to behave.
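As a rough illustration of what that mapping involves, the sketch below translates the FileStatus JSON that WebHDFS returns into the stat-style dictionary a fusepy getattr handler is expected to return, and shows one plausible way to reject an operation HDFS cannot support. The function names are hypothetical and this is not necessarily how the project does it. The sketch raises a plain OSError so it runs without fusepy installed; with fusepy you would typically raise its FuseOSError instead, and ENOSYS is just one reasonable errno choice.

```python
import errno
import stat

# WebHDFS reports the entry type as a string; FUSE wants it in st_mode.
_TYPE_BITS = {
    "FILE": stat.S_IFREG,
    "DIRECTORY": stat.S_IFDIR,
    "SYMLINK": stat.S_IFLNK,
}


def filestatus_to_stat(fs):
    """Translate a WebHDFS FileStatus dict into stat-style fields."""
    # "permission" is an octal string such as "644".
    mode = _TYPE_BITS[fs["type"]] | int(fs["permission"], 8)
    return {
        "st_mode": mode,
        "st_size": fs["length"],
        # WebHDFS timestamps are in milliseconds; stat uses seconds.
        "st_mtime": fs["modificationTime"] / 1000.0,
        "st_atime": fs["accessTime"] / 1000.0,
        "st_nlink": 2 if fs["type"] == "DIRECTORY" else 1,
    }


def lock(path):
    """HDFS has no file locking, so tell the caller it is unavailable."""
    raise OSError(errno.ENOSYS, "file locking is not supported on HDFS", path)
```

Returning a clear error for unsupported operations, rather than silently pretending they succeeded, is what keeps the mounted filesystem behaving the way ordinary tools expect.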
This was done as a final project for CSCI E-185: Big Data Analytics (Spring 2013).
Here's how to install it.