-
Notifications
You must be signed in to change notification settings - Fork 13
Frameworks Review
#Frameworks Review
We see several systems developed to cater the need of graphs related problems. Apache Giraph is one such graph processing system that is used at Facebook to analyse the social graph formed by users and their connections. It is an open-source implementation of Google's Pregel graph processing architecture. Apache Spark and its graph abstraction package GraphX has gained lot of attention in the developer community for their graph based computational needs recently. GraphX also uses Pregel, but its own optimized variant of Pregel, in its underneath implementation. And these systems uses different graph partitioning techniques. These techniques are base on various assumptions and have some limitations in some situations. There have been several researches done very recently and several papers published on the research problem on how to achieve better performance and how to avoid these limitations.
##Pregel Pregel[1] graph framework was introduced for graph-parallel computations, where the computation is expressed as a sequence of steps, called supersteps. In each step, a vertex receives a message from the previous step, changes its state and pass the message to the other vertices. Since Pregel is a synchronous system, it updates all parameters of vertices in parallel using values from the previous step as input. Partitioning of the graph is done by simply using a hash function on vertex ID.
Name | Pregel |
---|---|
Year | 2010 |
Bounded/Unbounded | Bounded |
Dynamic state changes? | Yes, Edges and topology changes |
vertex-cut/edge-cut | edge-cut |
Considering power-law distribution? | No |
Considering other natural graph models? | No |
Stream partitioning? | No |
##Distributed GraphLab
Then GraphLab[2][3] was introduced for dynamic and asynchronous graph-parallel computation. GraphLab consists of a Data Graph which holds the program state, an Update Function, which is a stateless procedure, that updates the Data Graph in small chunks called Scopes and a Sync Function which concurrently maintains global aggregates. Different hash functions can be used to partition the graph.
Name | GraphLab |
---|---|
Year | 2011 |
Bounded/Unbounded | Bounded |
Dynamic state changes? | Yes |
vertex-cut/edge-cut | edge-cut |
Considering power-law distribution? | Tests with natural graphs |
Considering other natural graph models? | No |
Stream partitioning? | No |
##PowerGraph PowerGraph[4] accommodates a sequential greedy heuristic which places the next edge on the machine that minimizes the conditional expected replication factor. The steps are simple. If A(u) and A(v) are adjacency of u and v respectively,
Case 1: If A(u) and A(v) intersect, then the edge should be assigned to a machine in the intersection.
Case 2: If A(u) and A(v) are not empty and do not intersect, then the edge should be assigned to one of the machines from the vertex with the most unassigned edges. Case 3: If only one of the two vertices has been assigned, then choose a machine from the assigned vertex.
Case 4: If neither vertex has been assigned, then assign the edge to the least loaded machine.
Name | PowerGraph |
---|---|
Year | 2012 |
Bounded/Unbounded | Bounded |
Dynamic state changes? | Yes |
vertex-cut/edge-cut | vertex-cut |
Considering power-law distribution? | Yes |
Considering other natural graph models? | No |
Stream partitioning? | No |
##S-PowerGraph
S-PowerGraph[5] discuss several methods as ’Degree’, a greedy procedure
which makes use of the in-degree distribution and ’DegreeIO’ which tries to
meet the challenge that both indegree and outdegree of natural graphs are
skewed.
S-PowerGraph firstly suggests the Grid-based Constrained Random Vertex-cut algorithm of GraphBuilder could be altered for streaming graph par-
titioning. In this method, a vertex is mapped into a shard which accommodates a constrained set of partitions. Then, the partition a vertex to be
assigned will be chosen from the constrained set randomly. The edge e is assigned to P idx where idx is decided by idx = GridHash(e)
, the hash function implemented in GraphBuilder[6].
Then the S-PowerGraph discuss about an enhanced version of the greedy
algorithm proposed in PowerGraph[3]. As the greedy algorithm in PowerGraph could lead to some unbalanced partitions in some cases, they introduce Balance(PowerGraph) where Balance() is a constraint to avoid the imbalance.
Name | S-PowerGraph |
---|---|
Year | 2015 |
Bounded/Unbounded | Unbounded |
Dynamic state changes? | Yes |
vertex-cut/edge-cut | vertex-cut |
Considering power-law distribution? | Yes |
Considering other natural graph models? | No |
Stream partitioning? | Yes |
##X-stream X-stream[7] is proposing streaming partitioning for bounded graphs, that is the vertices and edges remains fixed during the entire computation. The number of streaming partitions stays fixed throughout the computation. During initialization, the vertex set of the entire graph is partitioned into vertex sets for the different partitions, and the edge list of each partition is computed.
Name | X-stream |
---|---|
Year | 2013 |
Bounded/Unbounded | Bounded |
Dynamic state changes? | Yes |
vertex-cut/edge-cut | vertex-cut |
Considering power-law distribution? | Yes |
Considering other natural graph models? | No |
Stream partitioning? | Yes |
##Fennel Fennel[8] is a one pass streaming graph partitioning technique which outperforms the de-facto standard offline software METIS on numerous real-world graphs, like Twitter graph with more than 1.4 billion of edges, which outputs balanced partitions.
Name | Fennel |
---|---|
Year | 2014 |
Bounded/Unbounded | Unbounded |
Dynamic state changes? | Yes |
vertex-cut/edge-cut | edge-cut |
Considering power-law distribution? | Tests with natural graphs |
Considering other natural graph models? | No |
Stream partitioning? | Yes |
##Restreaming Partitioning Restreaming Partitioning[9] consider the special case of same or approximately same graph being stream partitioned again and again. This use Linear Deterministic Greedy (LDG)[9] and FENNEL partitioning algorithm, but slightly change them for restreaming. LDG uses multiplicative weights to guarantee balance while FENNEL needs to be regularized towards balance.
Name | Restreaming Partitioning |
---|---|
Year | 2013 |
Bounded/Unbounded | Bounded |
Dynamic state changes? | Yes |
vertex-cut/edge-cut | edge-cut |
Considering power-law distribution? | Yes( Social graphs ) |
Considering other natural graph models? | Yes( web graphs ) |
Stream partitioning? | Yes |
##Linear Embedding ”Distributed Balanced Partitioning via Linear Embedding”[10] discusses a method researchers at Google have tried and tested with Google services. ”Linear Embedding” is a balanced partitioning problem where the goal is to partition the vertices of a given graph into k parts so as to minimize the total cut size. Linear Embedding method first embeds nodes of the graph onto a line, then attempt to improve the ordering mainly by swapping vertices in a semi local manner and finally accommodate a post-processing method to improve the cut-size. They discuss several different methods for each different step and compare the outcome performances for each combination. For embedding nodes onto a line, they accommodate random mapping, Hilbert curve mapping when geographic/geometric information is available, and Affinity-based mapping which takes into account the affinity of vertices by grouping vertices that are closely connected, hence building a tree of these connections. Once the linear embedding is done, the next step is to improve ordering using semi local moves. One method is using Minimum Linear Arrangement (MinLA)[11] and the other is Rank Swap which depends on the pre chosen cut boundaries, the number of final partitions k. Finally they accommodate a post-processing method like dynamic programming to adjust the cut sizes.
Name | Linear Embedding |
---|---|
Year | 2016 |
Bounded/Unbounded | Unbounded |
Dynamic state changes? | Yes |
vertex-cut/edge-cut | edge-cut |
Considering power-law distribution? | Yes |
Considering other natural graph models? | No |
Stream partitioning? | Yes |
#References
[1] Malewicz, G., Austern, M. H., Bik, A. J. ., Dehnert, J. C., Horn, I., Leiser, N., & Czajkowski, G. (2010). Pregel: a system for large-scale graph processing. Proceedings of the 2010 International Conference on Management of Data - SIGMOD ’10, 135–146. http://doi.org/10.1145/1807167.1807184
[2] Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., & Guestrin, C. (2011). Distributed GraphLab: A Distributed Framework for Machine Learning in the Cloud, 716–727. http://doi.org/10.14778/2212351.2212354
[3] Low, Yucheng et al. ”Graphlab: A new framework for parallel machine learning.” arXiv preprint arXiv:1408.2041 (2014).
[4] Gonzalez, J., Low, Y., & Gu, H. (2012). Powergraph: Distributed graph-parallel computation on natural graphs. OSDI’12 Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, 17–30. Retrieved from https://www.usenix.org/system/files/conference/osdi12/osdi12-final-167.pdf
[5] Xie, C., Li, W.-J., & Zhang, Z. (2015). S-PowerGraph: Streaming Graph Partitioning for Natural Graphs by Vertex-Cut. Retrieved from http://arxiv.org/abs/1511.02586
[6] Jain, Nilesh, Guangdeng Liao, and Theodore L Willke. "Graphbuilder: scalable graph etl framework." First International Workshop on Graph Data Management Experiences and Systems 23 Jun. 2013: 4
[7] Roy, A., Mihailovic, I., & Zwaenepoel, W. (2013). X-stream: edge-centric graph processing using streaming partitions. The Twenty-Fourth ACM Symposium on Operating Systems Principles, 472 – 488. http://doi.org/10.1145/2517349.2522740
[8] Tsourakakis, C., Gkantsidis, C., Radunovic, B., Vojnovic, M. (2014). Fennel: Streaming graph partitioning for massive scale graphs. Proceedings of the 7th ACM International Conference on Web Search and Data Mining, 333–342. http://doi.org/10.1145/2556195.2556213
[9] Nishimura, J., & Ugander, J. (2013). Restreaming Graph Partitioning : Simple Versatile Algorithms for Advanced Balancing. Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1106–1114. http://doi.org/10.1145/2487575.2487696
[10] Aydin, K., Bateni, M., & Mirrokni, V. (2016). Distributed Balanced Partitioning via Linear Embedding, 387–396. http://doi.org/10.1145/2835776.2835829
[11] Goldschmidt, Olivier, and Dorit S Hochbaum. "Polynomial algorithm for the k-cut problem." (1988): 444-451.