Representing time dependent graphs in Neo4j

Background

Large-scale data collection efforts using wearable sensors to mine for proximity of individuals (for example, the SocioPatterns project) produce time-varying social graphs, where nodes are individuals, edges represent proximity/contact relations of individuals, and the proximity graph changes over time. Both nodes and edges can have rich attributes.

Data formats for exchanging the time-dependent graphs are available, see for instance the GEXF format. Efficiently mining large time-dependent graphs, however, requires a database and a representation that can support complex topological queries, temporal queries, multi-scale temporal indexing and aggregation, and more. A growing research community working on temporal networks may benefit from sound and efficient techniques to represent, store and query dynamic graphs.

Here we want to start a discussion on best practices for modeling time-varying graphs in Neo4j.

Data model

In the following we will assume that the social network is measured over adjacent discrete intervals of time that we will call frames. A frame is the finest unit of temporal aggregation that we use, and is associated with a time interval defined by a starting time and an ending time. A frame allows to retrieve the status of the social network during the corresponding time interval, and it is thus associated with a social graph at a given point in time. The nodes of such social graphs represent individuals. Edges represent, for instance, the proximity relations of individuals during the time interval of the frame. Edges are undirected and weighted, but the model described here generalizes to directed weighted edges and to multi-relational networks.

A Neo4j pattern for dynamic graphs

The Neo4j property graph representation is appropriate for time-varying graphs, however it requires a more complex model than the obvious choice of representing nodes as Neo4j nodes and edges as Neo4j relations. The time-varying nature of the graph we want to represent requires that we associate nodes and edges with time intervals (frames) in arbitrary ways. In Neo4j, the natural way to represent this association of nodes and edges with a given interval of time (frame) is to draw relations from nodes and edges to the frame(s) they refer to. This suggests to represent both nodes and edges as Neo4j nodes. To avoid ambiguity, in the following we will refer to the nodes of the social graph as actors and to the edges of the social graph as interactions.

In summary:

nodes of the social graph are nodes in Neo4j
edges of the social graph are nodes in Neo4j
intervals of time are nodes in Neo4j

We use the following representation for a dynamic social graph in Neo4j (see figure below):

A time-dependent graph as a whole is accessed as a RUN node, corresponding to a time-resolved record of social network evolution. A RUN node is connected to the reference node by means of a HAS_RUN relation.
RUN nodes have RUN_FRAME relations to all the FRAME nodes. They also have a RUN_FRAME_FIRST relation to the first frame of the graph history, and a RUN_FRAME_LAST to the last one.
Each FRAME node points to the successive frame by means of a FRAME_NEXT relation.
Each FRAME node points to the status of the social graph during the corresponding time interval: it has FRAME_ACTOR relations to ACTOR nodes (representing tagged individuals) and FRAME_INTERACTION relations to INTERACTION nodes (representing a proximity/contact/social relation). Timestamps and interval metadata are represented as attributes of the FRAME node.
An INTERACTION node has two INTERACTION_ACTOR relations to the ACTOR nodes that the edge connects to one another.
Time-dependent attributes for an edge of the social graph (for example, time-dependent edge weights) are represented as attributes of the FRAME_INTERACTION relations that associate that edge with the corresponding frame(s).
Similarly, time-varying attributed of a nodes (tag) of the social graph are represented as attributes of the FRAME_ACTOR relations that associate that node with the corresponding frame(s).
All ACTOR and INTERACTION nodes are connected to the main RUN node with RUN_ACTOR and RUN_INTERACTION relations.
[ QUESTION 1: are the (many) RUN_FRAME and RUN_INTERACTION relations really useful? The RUN_INTERACTION relations can be used to hold the weight of edges in the cumulative network, but the same information can be placed on INTERACTION nodes.]

Temporal indexing

The evolution of the graph can be tracked by walking from one frame to the next. Efficient random access to the frame timeline can be supported better by suitably indexing the time-stamped sequence of FRAME nodes. This can be achieved in different ways, such as using binary trees (like in the Neo4j Timeline class), or mirroring the natural temporal hierarchy of the data (year/month/day/hour/...) with hierarchical temporal relations between nodes.

Here we choose the latter technique, because it preserves the natural units of temporal aggregation of the data (days, hours, etc.). The temporal indexing structure we use is shown in the figure below, and is similar to the multilevel indexing structure of the Cypher cookbook. Of course, it can co-exist with other temporal indexes of the FRAME nodes, such as the native one.

We build a tree that explicitly represents the temporal hierarchy of the dataset. The nodes of the tree are TIMELINE nodes. The top-level node, which is the entry point for the temporal index, is reachable from the RUN node through a HAS_TIMELINE relation.
Nodes at each level of the tree have NEXT_LEVEL relations to nodes at the level below.
The nodes at each level of the tree represent different scales of temporal aggregation, according to the time units that are most appropriate for the dataset at hand. For the same of simplicity, the tree in figure has a "day" and a "hour" levels only, but in other domains it may be useful to have a deeper hierarchy such as "year" / "month" / "day" / "hour" / "minute".
At each level, time attributes are associated with the NEXT_LEVEL relations [ QUESTION 2: is it more efficient to have them on nodes? Should we have both?]
The nodes at the bottom level of the tree correspond to the finest scale of temporal aggregation (hours, in figure) and are connected to the indexed FRAME nodes via TIMELINE_INSTANCE relations. The timestamp of each FRAME node is replicated as a relation attribute of the corresponding TIMELINE_INSTANCE relation.
[ QUESTION 3: would it be convenient to have relations connecting horizontally nodes at each level of the tree, in order to walk sequentially, for example, through all the "hour" nodes? Breadth-first traversal and suitable naming of the TIMELINE nodes probably takes care of this in an efficient way.]
[ QUESTION 4: is it best to record the time attributes at different scales (day, hour, etc.) as types of the relations connecting nodes in the tree, as described in the Cypher cookbook's path tree ? ] Fine-grained, 'strongly named' relationships (e.g. YEAR, MONTH, DAY) can be twice as fast as more generally named relationships qualified by an attribute (such as NEXT_LEVEL {type:'year'}). Strongly named relationships avoid property lookups with each traversal, and therefore reduce IO.

Loading a dynamic GEXF graph into Neo4j

To simplify testing with the proposed data structures, we provide a simple Python script that uses the Neo4j Python REST bindings to load a dynamic GEXF document into a Neo4j store.

For testing, we will use an empirical dynamic social network measured at the ACM Hypertext 2009 conference by the SocioPatterns collaboration. The network of proximity relations among the conference attendees was recorded every 20 seconds for about 3 days and is available for download in GEXF format. In this dataset, nodes are individuals and edges represent face-to-face proximity relations of individuals.

Example: The following command loads the GEXF document into a local Neo4j REST server. The RUN node is named "HT2009" and data are loaded starting at POSIX time 1246191120 (2.12pm on Jun 28th 2009) with time frames of 20 seconds:

./load_gexf_to_neo4j.py ht2009_20sec.gexf HT2009 1246191120 20 http://localhost:7474/db/data/

Test dataset stats

The dataset we use for testing is a relatively small one: 113 persons ( RUN nodes), 2163 proximity relations ( INTERACTION nodes), 13956 frames of 20 seconds each ( FRAME nodes), for a total duration of slightly more than 3 days. Production datasets are typically much larger than this, involving several hundreds persons, ~50-100,000 proximity relations, 70-100,000 frames.

Querying a dynamic graph with Cypher

In the following we provide a few sample Cypher queries performed on a real-world dynamic social graph from the SocioPatterns project, loaded into Neo4j as described above.

We will work with Neo4j 1.9. We will assume that the top-level RUN node of the dynamic graph is node 1234. The reported timings for each query are medians over 10 runs of the query.

QUERY 1: get all time frames of run "HT2009" recorded between 9am and 1pm of July 1st 2009, ordered by timestamp [130 ms]

START root = node(0)
MATCH root-[:HAS_RUN]->run-[:HAS_TIMELINE]->()-[y:NEXT_LEVEL]->()-[m:NEXT_LEVEL]->()-[d:NEXT_LEVEL]->()-[h:NEXT_LEVEL]->()-[:TIMELINE_INSTANCE]->frame
WHERE run.name="HT2009" and y.year=2009 and m.month=7 and d.day=1 and h.hour>=9 and h.hour<13
RETURN frame ORDER BY frame.timestamp

QUERY 2: get the names of all persons present in frame 1234 [4 ms for 80-100 returned rows]

START frame = node(1234)
MATCH frame-[:FRAME_ACTOR]-actor
RETURN actor.name

QUERY 3: get the weighted proximity graph during frame 1234, filtering out the weak contacts [3 ms for 40-50 returned rows]

START frame=node(1234)
MATCH frame-[r:FRAME_INTERACTION]-interaction
WHERE r.weight > 20
RETURN interaction.actor1, interaction.actor2, r.weight;

QUERY 4 (slow): get a list of all the persons, and for each person get the number of frames it was present in [3900 ms , ~110 returned rows] [ QUESTION 5: should we model the number of frames a tag is connected to as a property of ACTOR nodes?]

START run = node(2345)
MATCH run-[:RUN_ACTOR]->actor<-[r:FRAME_ACTOR]-()
RETURN actor.name, count(r)

QUERY 5 (slow): get the names of all persons that were present in more than 1000 frames, ranked by time of presence [3900 ms, ~110 returned rows] [see QUESTION 5 above].

START run = node(2345)
MATCH run-[:RUN_ACTOR]->actor<-[r:FRAME_ACTOR]-()
WITH actor.name as name, COUNT(r) as freq
WHERE freq > 1000
RETURN name, freq ORDER BY freq DESC

Different way of writing it, similar performance:

START run = node(2345)
MATCH run-[:RUN_ACTOR]-actor
WITH actor
MATCH ()-[r:FRAME_ACTOR]-actor
WITH actor.name as name, COUNT(r) as freq
WHERE freq > 1000
RETURN name, freq ORDER BY freq DESC;

QUERY 6 (slow): list all (distinct) days on which tag 1111 was present [140 ms on the sample dataset, but it scales badly with the density of the graphs associated to frames]

START actor = node(1111)
MATCH ()-[d:NEXT_LEVEL]->()-[:NEXT_LEVEL]->()-[:TIMELINE_INSTANCE]-()-[:FRAME_ACTOR]-actor
RETURN DISTINCT(d.day)

This runs surprisingly slow on larger datasets, considering that from each FRAME node there is a single path to the corresponding "day" node of the timeline index. Does this happen because of the high number of relations of the FRAME nodes (the inappropriately named "super-node problem") ? Adding more temporal metadata on the FRAME nodes improves performance: the following query runs in about 60 ms.

START actor = node(1111)
MATCH frame-[:FRAME_ACTOR]-actor
RETURN DISTINCT(frame.day)

QUERY 7: return the names of all persons that were in proximity of user 8888, sorted alphabetically [5 ms]

START actor1 = node(8888)
MATCH actor1<-[:INTERACTION_ACTOR]-()-[:INTERACTION_ACTOR]->actor2
RETURN actor2.name ORDER BY actor2.name

QUERY 8: return the names of all persons that were in proximity of user 8888 on the 7th [90 ms]

START actor1 = node(8888)
MATCH actor1<-[:INTERACTION_ACTOR]-interaction-[:INTERACTION_ACTOR]->actor2
WITH interaction, actor2
MATCH ()-[d:NEXT_LEVEL]->()-[:NEXT_LEVEL]->()-[:TIMELINE_INSTANCE]-()-[:FRAME_INTERACTION]-interaction
WHERE d.day = 7
RETURN DISTINCT(actor2.name)

QUERY 9: find all the common neighbors of user 1111 and 2222 [15 ms]

START actor1 = node(1111), actor2 = node(2222)
MATCH actor1<-[:INTERACTION_ACTOR]-()-[:INTERACTION_ACTOR]->actor
WITH COLLECT(actor) as neighs1, actor2
MATCH actor2<-[:INTERACTION_ACTOR]-()-[:INTERACTION_ACTOR]->actor
WHERE actor IN neighs1
RETURN actor

NOTE: need to use this form, as the straightforward query to find common neighbors, reported below, runs very slowly for the larger datasets mentioned above for QUERY 6.

START actor1 = node(1111), actor2 = node(2222)
MATCH actor1<-[:INTERACTION_ACTOR]-()-[:INTERACTION_ACTOR]->actor<-[:INTERACTION_ACTOR]-()-[:INTERACTION_ACTOR]->actor2
RETURN actor

QUERY 10: compute the degree of all persons in the contact graph [62 ms]

START run = node(2345)
MATCH run-[:RUN_ACTOR]-actor-[r:INTERACTION_ACTOR]-()
RETURN actor.name, COUNT(r) ORDER BY COUNT(r) DESC

To facilitate reproducible testing, we provide a script that runs all of the above queries on the test data described above, and computes median execution times over several runs. The script is launched as follows (the Neo4j URL and the RUN node's name must match those used above):

./cypher_query_timing.py http://localhost:7474/db/data/ HT2009

Remarks and call for comments

Intrinsically heterogeneous topology of real-world datasets

Most graph datasets from real-world systems such as social networks, on-line and off-line infrastructural networks, records of human-driven activity and a host of natural phenomena exhibit fat-tailed statistics for the node connectivity (see for example the review paper on Power laws, Pareto distributions and Zipf's law). This means that graph databases have to deal with the inevitable high heterogeneity of node connectivity, i.e., with the fact that a small fraction of nodes have connections to a large fraction of other nodes.

In our data there are two main sources of heterogeneity: the first is the topological heterogeneity of the social networks we measure, the second is the temporal heterogeneity in activity due to the bursty nature of human dynamics. These heterogeneities are reflected in the properties of the dynamic social graphs we want to model in Neo4j.

Performance observations

A word of warning: we do not claim that the simple experiments reported here reflect the general performance of Neo4j/Cypher in any specific application domain except our own.

Overall, Neo4j and Cypher achieve a very satisfactory performance in querying the sample dataset we used here, suitable for exploratory data analysis, for interactive Web applications build on top of the Neo4j REST server, and for research-oriented data mining.
The bad performance of commonly used queries such as QUERY 4 and QUERY 5 is a reason for concern, especially considering that the dataset we used for testing is very small. However, in our case it is possible to solve this problem by pre-computing additional node attributes (number of frames a tag is connected to, see QUESTION 5 above).
The bad performance of QUERY 6 on scaling up the size/density of the dataset is definitely a problem, especially considering the simplicity of the query.

Request for feedback

As stated at the beginning of this report, we want to start a discussion on best practices for modeling time-dependent graphs in Neo4j. We will be grateful for any comments, further testing by using our data, and in general any insights that the Neo4j team and community are willing to share.

We welcome comments and modifications to this wiki page.

Contributors

Ciro Cattuto, ISI Foundation
André Panisson, ISI Foundation
Marco Quaggiotto, Politecnico di Milano and ISI Foundation

Acknowledgements

We thank Alex Averbuch of Neo Technology for insightful discussions and comments about the data model and the Cypher queries.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly