Skip to content

MapReduce Pattern

okram edited this page Feb 17, 2012 · 3 revisions

MapReduce is a data processing pattern that groups objects under a key (the map and shuffle) and then applies some function to those groupings (the reduce). Gremlin supports this pattern with the groupBy step. We will demonstrate its use over the provided Grateful Dead song/artist graph.

gremlin> g = new TinkerGraph()
==>tinkergraph[vertices:0 edges:0]
gremlin> g.loadGraphML('data/graph-example-2.xml')
==>null

The groupBy step can take 2 or 3 function parameters. These functions are:

  1. key function: for the incoming object, generate a key to group it by.
  2. value function: before storing the object under the key, process it to yield the value to be stored.
  3. reduce function: when the stream is empty, apply this function to all the values stored at each key.

Map and Shuffle in Gremlin

The traversal below will group all the song vertices by whether they are an original or cover song. NOTE: the hasNot('song_type',null) is a filter that ensures only song vertices are analyzed. A more efficient expression would be to use an index as opposed to a linear scan and filter.

gremlin> g.V.hasNot('song_type',null).groupBy{it.song_type}{it}.cap.next()
==>cover=[v[332], v[330], v[331], v[336], v[337], v[334], v[335], v[531], v[747], v[746], v[535], v[740], v[538], v[744], v[742], v[3], ...]
==>original=[v[349], v[341], v[343], v[346], v[348], v[4], v[9], v[327], v[390], v[393], v[392], v[395], v[396], v[397], v[398], v[399], ...]

It is possible to provide a “richer” value function so that the grouped songs are represented by their name as opposed to their vertex.

gremlin> g.V.hasNot('song_type',null).groupBy{it.song_type}{it.name}.cap.next()
==>cover=[HIDEAWAY, TANGLED UP IN BLUE, SHELTER FROM THE STORM, GOODNIGHT IRENE, CHINESE BONES, NEIGHBORHOOD GIRLS, FOREVER YOUNG, ...]
==>original=[BLACK MUDDY RIVER, ALICE D MILLIONAIRE, ALLIGATOR, AT A SIDING, BARBED WIRE WHIPPING PARTY, BERTHA, HERE COMES SUNSHINE, ...]

Map, Shuffle, and Reduce in Gremlin

The reduce function is applied to all the values generated for a key. A simple example is provided which simply counts the number of songs per song type. Before showing a full reduce, see the intermediate results of the map and shuffle.

gremlin> g.V.hasNot('song_type',null).groupBy{it.song_type}{1}.cap.next()      
==>cover=[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
==>original=[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Now, a reduce function is added where the 1s are counted.

gremlin> g.V.hasNot('song_type',null).groupBy{it.song_type}{1}{it.size()}.cap.next()
==>cover=313
==>original=184

The previous example was for demonstration purposes as the more efficient way to get such count distributions is via groupCount.

gremlin> g.V.hasNot('song_type',null).groupCount{it.song_type}.cap.next()           
==>cover=313
==>original=184

However, here is a reduce-based groupBy traversal that is difficult to achieve any other way. The traversal below provides a nested distribution. It answers the question: “How many songs was a particular artist the primary singer for, for the different song types?”

gremlin> m = g.V.hasNot('song_type',null).groupBy{it.song_type}{it.out('sung_by').next()}{it._().name.groupCount.cap.next()}.cap.next()
==>cover={Unknown=6, Weir=66, John_Fogerty=5, Boz_Scaggs=1, Pigpen_Garcia=1, All=9, Grateful_Dead=1, Garcia=74, Pigpen_Mydland=3, Etta_James=1, None=6, Lesh_Mydland=1, Neal_Cassady=1, Pigpen_Weir_Mydland=2, Weir_Garcia=1, Donna=4, Elvin_Bishop=4, Joey_Covington=2, Hall_and_Oates=2, Bob_Dylan=22, Pigpen?=1, Pigpen=29, Welnick=9, Garcia_Weir=3, Jorma_Kaukonen=4, Allman_Brothers=1, Neville_Brothers=2, Mydland=6, Stephen_Stills=2, Hornsby=2, Bo_Diddley=6, Spencer_Davis=2, Joan_Baez=10, Jon_Hendricks=2, Beach_Boys=3, Suzanne_Vega=2, Weir_Mydland=2, Pigpen_Weir=8, Neil_Young=1, Lesh=6}
==>original={Weir_Wasserman=1, Mydland_Lesh=1, Lesh_Hart_Kreutzmann=1, Weir=33, Weir_Hart_Welnick=1, Welnick_Bralove=1, Grateful_Dead=15, Garcia_Dawson=1, Garcia=71, Weir_Hart=3, Garcia_Lesh=3, Garcia_Lesh_Weir=1, Hunter=3, Weir_Bralove=1, Pigpen=7, Welnick=1, Donna_Godchaux=2, Hart=2, Mydland=12, Lesh_Pigpen=1, Hornsby=1, Keith_Godchaux=1, instrumental=1, Garcia_Kreutzmann=2, Hart_Kreutzmann=2, Weir_Mydland=1, Garcia_Weir_Lesh=1, Lesh=13, Weir_Kreutzmann=1}
``` 

The resultant "nested distribution" map can then be analyzed by walking the nested map. In the example below, we see that Jerry Garcia sung 71 of the original Grateful Dead songs.

```text
gremlin> m.original
==>Weir_Wasserman=1
==>Mydland_Lesh=1
==>Lesh_Hart_Kreutzmann=1
==>Weir=33
==>Weir_Hart_Welnick=1
==>Welnick_Bralove=1
==>Grateful_Dead=15
==>Garcia_Dawson=1
==>Garcia=71
...
gremlin> m.original.Garcia
==>71

Moreover, to demonstrate the flexibility of this pattern, these nested maps can be sorted.

gremlin> m.original.sort{a,b -> b.value <=> a.value}[0..4]
==>Garcia=71
==>Weir=33
==>Grateful_Dead=15
==>Lesh=13
==>Mydland=12