GraphChi-java

Version 0.2

News

Performance has been improved by parallelizing shard loading better (Oct 22, 2013)
GraphChi's Java version has a new cool random walk simulation engine: https://github.com/GraphChi/graphchi-java/wiki/Personalized-Pagerank-with-DrunkardMob

Survey

If you use GraphChi, please fill the form to tell about your experience (it is short!): https://docs.google.com/forms/d/1E4jjAqQiW76hKAjXpjL1SbkBauC3_V5BfCXQU6yz-I4/edit

Introduction

Project for developing the Java version of GraphChi ( http://www.graphchi.org ), the disk-based graph computation engine. To learn more about GraphChi, visit the C++ version's project page: https://github.com/GraphChi/graphchi-cpp

NEW: GraphChi can be used in Hadoop/Pig scripts: GraphChi for Pig.

How to use

Read the README.txt for information on how to build and run the example applications. You are going to need Maven for building.

It is a very good idea to study the example applications carefully. There are currently three example applications in the package edu.cmu.graphchi.apps:

PageRank for computing the famous PageRank ranking
Connected Components for computing the weakly connected components
Alternative Least Squares Matrix Factorization

Input data

GraphChi-java supports edge-list and: adjacency list formats.

To preprocess the graph, you need to call "sharder" in the beginning of your program. For example, if you graph has floating-point input values for each edge, you can call it as follows:

    protected static FastSharder createSharder(String graphName, int numShards) throws IOException {
            return new FastSharder<Float, Float>(graphName, numShards, new VertexProcessor<Float>() {
                public Float receiveVertexValue(int vertexId, String token) {
                    return (token == null ? 0.0f : Float.parseFloat(token));
                }
            }, new EdgeProcessor<Float>() {
                public Float receiveEdge(int from, int to, String token) {
                    return (token == null ? 0.0f : Float.parseFloat(token));
                }
            }, new FloatConverter(), new FloatConverter());
        }
    
        public static void main(String[] args) throws  Exception {
            String baseFilename = args[0];
            int nShards = Integer.parseInt(args[1]);
            String filetype= args[2]; // "edgelist" or "adjacency"
    
            /* Create shards */
            FastSharder sharder = createSharder(baseFilename, nShards);
            if (new File(ChiFilenames.getFilenameIntervals(baseFilename, nShards)).exists()) {
                    sharder.shard(new FileInputStream(new File(baseFilename)), filetype);
                } else {
                    logger.info("Found shards -- no need to preprocess");
                }
        ....

Note: (For edge-lists only) initial values for vertices are also now supported. If you define a self-edge (i.e edge with same source and destination), the value of the 'edge' is interpreted as value for the vertex. Thus you need to provide both a vertex-value and edge-value parser to the FastSharder.

Hadoop / PIG

Since version 0.2 (January 2013), GraphChi Java programs can be directly invoked from Pig. More information will on page GraphChiForPig.

Example PIG script:

    REGISTER graphchi-java-0.2-jar-with-dependencies.jar;
    
    pagerank = LOAD '/user/akyrola/graphs/soc-LiveJournal1.txt' USING edu.cmu.graphchi.apps.pig.PigPagerank as (vertex:int, rank:float);
    
    STORE pagerank INTO '/user/akyrola/pagerank';

Scala

An early Scala-wrapper is also provided. See PagerankScala for example.

Differences to the C++ version

Following features are not implemented in the Java-version:

dynamic graphs
dynamic edge data

GraphChi implements a preprocessing step called "sharding", which reads an input graph and stores it in efficient binary format on the disk (see Introduction to GraphChi for more information). Java-version now includes its own sharding code (called "FastSharder"), which differs from the C++ version: FastSharder shuffles the order of vertices to guarantee (with high probability) an even distribution of edges over shards. Thus, internally GraphChi's Java-version uses different vertex IDs than in the original graph.

Translating between the internal ids and original ids is easy using the VertexIdTranslate class:

         VertexIdTranslate trans = engine.getVertexIdTranslate();
         for(int i=0; i < engine.numVertices(); i++) {
              System.out.println("Internal id " + i + " = original id " + trans.backward(i));
          }

Performance

On a new MacBook Pro (2012), I can run Pagerank on a Twitter-graph with 1.5 B edges (available from http://an.kaist.ac.kr/traces/WWW2010.html/) in about 10 minutes / iteration. It is 2-3x slower than the C++ version on an SSD disk. I expect this performance to be sufficient for many users, especially researchers and analysts who do need real-time performance but favor a convenient programming environment.

Acknowledgements

I want to thank Twitter and particularly the Personalization and Recommendations Team for the my internship during Fall 2012. M any of the improvements to mature the GraphChi's Java-version were done during the course of the internship.

YourKit sponsorship

YourKit is kindly supporting GraphChi project with its full-featured Java Profiler. It is incredibly useful for debugging performance bottlenecks and analyzing memory usage. YourKit, LLC is the creator of innovative and intelligent tools for profiling Java and .NET applications. Take a look at YourKit's leading software products: YourKit Java Profiler and YourKit .NET Profiler.

-- Aapo Kyrola, [email protected]

Name		Name	Last commit message	Last commit date
Latest commit History 381 Commits
conf		conf
example_pig		example_pig
src		src
test/edu/cmu/graphchi		test/edu/cmu/graphchi
.gitignore		.gitignore
NOTES.txt		NOTES.txt
README.md		README.md
README.txt		README.txt
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GraphChi-java

News

Survey

Introduction

How to use

Input data

Hadoop / PIG

Scala

Differences to the C++ version

Performance

Acknowledgements

YourKit sponsorship

About

Releases

Packages

bevanking/graphchi-java

Folders and files

Latest commit

History

Repository files navigation

GraphChi-java

News

Survey

Introduction

How to use

Input data

Hadoop / PIG

Scala

Differences to the C++ version

Performance

Acknowledgements

YourKit sponsorship

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages