
How GraphChi For Pig Works

oleksii iepishkin edited this page Oct 30, 2013 · 2 revisions

The GraphChi for Pig (GraphChiForPig) integration is implemented as a special loader function for Pig (a subclass of org.apache.pig.LoadFunc).

The basic idea is this: a single mapper runs a GraphChi Java program and pulls all the data from HDFS to that mapper's local disk. Thus, it is exactly like running GraphChi on your local machine, but with automatic data import from HDFS.
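The loader lifecycle described above can be sketched with a minimal, dependency-free mock (prepareToRead and getNext are real LoadFunc method names; everything else here, including the class name and the sample output lines, is a hypothetical stand-in for illustration):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Minimal sketch (simplified, no Hadoop/Pig dependencies) of the loader
// lifecycle: prepareToRead() stands in for the whole "pull data from HDFS,
// shard it locally, run the GraphChi program" step, after which getNext()
// hands back the computed results one at a time.
public class GraphChiLoaderSketch {
    private final Queue<String> results = new ArrayDeque<>();

    // In the real LoadFunc, Pig calls prepareToRead(reader, split) before
    // any getNext(); here it just fills the result queue with placeholder
    // output, standing in for the full GraphChi computation.
    public void prepareToRead() {
        results.add("vertex1\t0.15");   // hypothetical computed result
        results.add("vertex2\t0.27");   // hypothetical computed result
    }

    // getNext() returns one result per call, or null when exhausted,
    // mirroring LoadFunc.getNext()'s Tuple-or-null contract.
    public String getNext() {
        return results.poll();
    }
}
```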

Pig stores the input files partitioned across HDFS and instantiates a loader object for each partition separately (and may, although I am not sure, even split files into smaller parts). However, GraphChi has been designed to run on a single machine only, and its performance rests on utilizing the disk well, thereby avoiding the networking bottlenecks that slow distributed graph computation.

Thus, when the GraphChi application classes (extending the PigGraphChiBase class) are instantiated by Pig, only one of them will actually do any work. There will be only one active mapper, and the others finish immediately. (Note: sometimes Hadoop or Pig launches two attempts for the same work; this can be disabled by setting mapred.map.tasks.speculative.execution to false.) Only the instance with index = 0 continues running. It loads all the partitions of the input to the local disk by accessing HDFS directly.
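The "only one instance runs" decision boils down to a check on the partition index. A sketch of that check (simplified; the real code would obtain the index from the Hadoop task context, which is not shown here):

```java
// Sketch of the "only partition 0 does the work" gate: the instance with
// split index 0 runs the GraphChi program, every other instance finishes
// immediately without doing anything.
public class ActiveMapperCheck {
    public static boolean isActiveInstance(int splitIndex) {
        return splitIndex == 0;
    }
}
```

Speculative execution must still be disabled (mapred.map.tasks.speculative.execution = false, as noted above), since a speculative duplicate of task 0 would also pass this check.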

Since GraphChi programs work very differently from normal Pig loaders, several hacks are needed to make this work:

  • all the work is done in the prepareToRead method of the LoadFunc class. This first starts an HDFSGraphLoader object to load the file partitions, and then uses the FastSharder class to create GraphChi shards on the local disk.
  • however, the Hadoop runtime will kill a process that does not make "progress". To fix this, we spawn a separate thread that reports "progress" to the Hadoop runtime, ensuring the mapper is not killed.
  • Pig also calls getNext() immediately, but we cannot return any results until the GraphChi computation has finished. Thus, the program loops in getNext() until the results are ready.
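The last two hacks can be sketched together (assumptions: Progressable here is a local stand-in for Hadoop's org.apache.hadoop.util.Progressable, and the reporting interval is shortened for the example). A daemon thread pings progress() while the long-running GraphChi computation executes, and the getNext() analogue blocks until the computation is marked finished:

```java
import java.util.concurrent.CountDownLatch;

// Sketch of the keep-alive and wait-for-results hacks: a background thread
// reports progress so the Hadoop runtime does not kill the seemingly idle
// mapper, while getNext() waits until the GraphChi run has completed.
public class ProgressKeepAlive {
    interface Progressable { void progress(); }   // stand-in for Hadoop's interface

    private final CountDownLatch done = new CountDownLatch(1);

    // Start a daemon thread that reports progress periodically until the
    // computation is marked finished.
    public Thread startReporter(Progressable reporter, long intervalMillis) {
        Thread t = new Thread(() -> {
            while (done.getCount() > 0) {
                reporter.progress();
                try { Thread.sleep(intervalMillis); }
                catch (InterruptedException e) { return; }
            }
        });
        t.setDaemon(true);
        t.start();
        return t;
    }

    // Called when the GraphChi computation has completed.
    public void markFinished() { done.countDown(); }

    // getNext() analogue: block until the computation is done, then return
    // the result instead of handing Pig an incomplete answer.
    public String getNext(String result) throws InterruptedException {
        done.await();
        return result;
    }
}
```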

Notes

  • The Hadoop runner should have sufficient memory to run the GraphChi program. If the GraphChi program does not allocate significant additional memory itself, the basic Hadoop mapper settings will be fine.
  • Each map-reduce node should have sufficient disk space to store the whole graph input and the shards. On very large graphs, you will need tens of gigabytes of disk space. This is hardly a problem on large-scale Hadoop/HDFS installations.