Skip to content

Tutorial Statistical Graph Analysis

Aapo Kyrola edited this page Apr 17, 2014 · 6 revisions

GraphChi-DB can be used for efficiently querying induced subgraphs from very large networks. Thus, you can for example, easily sample a vertex, and retrive induced neighborhood graph of the vertex. Or you can choose a random set of vertices and compute their induced subgraph.

Assuming you have the data loaded in database, and the GraphChiDatabase object in variable DB, here is how you request edges for the induced subgraph of a set of vertices:

   implicit val DB = new GraphChiDB(...)
   val mySet = Set(...)   // Set of vertices. Can be obtained by sampling, for instance.
   val inducedEdges = Queries.inducedSubgraph(mySet, 0)

This will return an array of ResultEdge objects. Note, that all the vertex IDs are in the internal ID space of GraphChi-DB. See the README.md for explanation of ID mapping.

  case class ResultEdge(src: Long, dst: Long, dataPtr: Long)

Fields src and dst are the source and destination vertex IDs for an edge. Field dataPtr can be used to retrieve data for an edge. Let's ignore it for now.

What if you want to find the induced subgraph of all neighbors of a given vertex. This is easy: first query the neighbors of a vertex and then pass this set to the inducedSubgraph query. Here is an example that finds the induced subgraph of a user's neighbors, excluding the user (ego) itself.

    val friends = DB.queryOut(queryVertexInternalId, 0)
    val inducedGraph = Queries.inducedSubgraph(friends.getInternalIds.toSet - queryVertexInternalId, 0)

Note, that the edges returned by inducedSubgraph() are directed. What if you want to have only distinct edges, ignoring the direction. Following code snippet shows how to do that:

  val undirected =  inducedEdges(e => ResultEdge(math.min(e.src, e.dst), math.max(e.src, e.dst), 0) ).filterNot(e => e.src == e.dst).toSet

The snippet above first maps all edges so that src field has the smaller of src and dst ID, and dst the larger ID. It also sets the dataPtr field to zero, because we do not use it. This guarantees that edge (src=4, dst=9) and (src=9, dst=4) are counted as equals. Then, it converts the list of edges to a set so that all duplicate edges are removed. In addition, self-edges are filtered away.

Given this subgraph, you can analyze it as you wish. In the example application SubGraphFrequencies.scala we sample 3-node subgraphs from the induced neighborhood graph. GraphChi-DB does not provide any internal tools for analyzing these small graphs, but you can use any Java/Scala library for in-memory graph analysis.

Clone this wiki locally