Version 0.2
- Performance has been improved by parallelizing shard loading better (Oct 22, 2013)
- GraphChi's Java version has a new cool random walk simulation engine: https://github.com/GraphChi/graphchi-java/wiki/Personalized-Pagerank-with-DrunkardMob
If you use GraphChi, please fill the form to tell about your experience (it is short!): https://docs.google.com/forms/d/1E4jjAqQiW76hKAjXpjL1SbkBauC3_V5BfCXQU6yz-I4/edit
Project for developing the Java version of GraphChi ( http://www.graphchi.org ), the disk-based graph computation engine. To learn more about GraphChi, visit the C++ version's project page: https://github.com/GraphChi/graphchi-cpp
NEW: GraphChi can be used in Hadoop/Pig scripts: GraphChi for Pig.
Read the README.txt for information on how to build and run the example applications. You are going to need Maven for building.
It is a very good idea to study the example applications carefully. There are currently three example applications in the package edu.cmu.graphchi.apps:
- PageRank for computing the famous PageRank ranking
- Connected Components for computing the weakly connected components
- Alternative Least Squares Matrix Factorization
GraphChi-java supports edge-list and: adjacency list formats.
To preprocess the graph, you need to call "sharder" in the beginning of your program. For example, if you graph has floating-point input values for each edge, you can call it as follows:
protected static FastSharder createSharder(String graphName, int numShards) throws IOException {
return new FastSharder<Float, Float>(graphName, numShards, new VertexProcessor<Float>() {
public Float receiveVertexValue(int vertexId, String token) {
return (token == null ? 0.0f : Float.parseFloat(token));
}
}, new EdgeProcessor<Float>() {
public Float receiveEdge(int from, int to, String token) {
return (token == null ? 0.0f : Float.parseFloat(token));
}
}, new FloatConverter(), new FloatConverter());
}
public static void main(String[] args) throws Exception {
String baseFilename = args[0];
int nShards = Integer.parseInt(args[1]);
String filetype= args[2]; // "edgelist" or "adjacency"
/* Create shards */
FastSharder sharder = createSharder(baseFilename, nShards);
if (new File(ChiFilenames.getFilenameIntervals(baseFilename, nShards)).exists()) {
sharder.shard(new FileInputStream(new File(baseFilename)), filetype);
} else {
logger.info("Found shards -- no need to preprocess");
}
....
Note: (For edge-lists only) initial values for vertices are also now supported. If you define a self-edge (i.e edge with same source and destination), the value of the 'edge' is interpreted as value for the vertex. Thus you need to provide both a vertex-value and edge-value parser to the FastSharder.
Since version 0.2 (January 2013), GraphChi Java programs can be directly invoked from Pig. More information will on page GraphChiForPig.
Example PIG script:
REGISTER graphchi-java-0.2-jar-with-dependencies.jar;
pagerank = LOAD '/user/akyrola/graphs/soc-LiveJournal1.txt' USING edu.cmu.graphchi.apps.pig.PigPagerank as (vertex:int, rank:float);
STORE pagerank INTO '/user/akyrola/pagerank';
An early Scala-wrapper is also provided. See PagerankScala for example.
Following features are not implemented in the Java-version:
- dynamic graphs
- dynamic edge data
GraphChi implements a preprocessing step called "sharding", which reads an input graph and stores it in efficient binary format on the disk (see Introduction to GraphChi for more information). Java-version now includes its own sharding code (called "FastSharder"), which differs from the C++ version: FastSharder shuffles the order of vertices to guarantee (with high probability) an even distribution of edges over shards. Thus, internally GraphChi's Java-version uses different vertex IDs than in the original graph.
Translating between the internal ids and original ids is easy using the VertexIdTranslate class:
VertexIdTranslate trans = engine.getVertexIdTranslate();
for(int i=0; i < engine.numVertices(); i++) {
System.out.println("Internal id " + i + " = original id " + trans.backward(i));
}
On a new MacBook Pro (2012), I can run Pagerank on a Twitter-graph with 1.5 B edges (available from http://an.kaist.ac.kr/traces/WWW2010.html/) in about 10 minutes / iteration. It is 2-3x slower than the C++ version on an SSD disk. I expect this performance to be sufficient for many users, especially researchers and analysts who do need real-time performance but favor a convenient programming environment.
I want to thank Twitter and particularly the Personalization and Recommendations Team for the my internship during Fall 2012. M any of the improvements to mature the GraphChi's Java-version were done during the course of the internship.
YourKit is kindly supporting GraphChi project with its full-featured Java Profiler. It is incredibly useful for debugging performance bottlenecks and analyzing memory usage. YourKit, LLC is the creator of innovative and intelligent tools for profiling Java and .NET applications. Take a look at YourKit's leading software products: YourKit Java Profiler and YourKit .NET Profiler.
-- Aapo Kyrola, [email protected]