forked from rich-iannone/DiagrammeR-docs
-
Notifications
You must be signed in to change notification settings - Fork 0
/
01-intro.Rmd
executable file
·29 lines (15 loc) · 5.25 KB
/
01-intro.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
# Introduction {#intro}
## Why Model Data as a Graph?
Network graphs allow for relationship-based data models. Positioning the relationships between entities as first class lets us more easily model complex, real-world information. A distinguishing feature of network graph models is that the relationships are treated with as much value as the node entity data records themselves. The connections within data are often not considered but we really need to work closely with relationship data to extract valuable insights.
Using a network graph usually means you have fast navigation (i.e., traversals) between nodes (performed in constant time). Modeling data with a relational, tabular approach will require working with data as a set of tables and columns, and this necessitates carrying out complex joins and self-joins when the dataset becomes more interrelated. Such queries are technically complicated to construct and can be expensive to run. Furthermore, making them work synchronously is not easy, with performance dropping precipitously as the total dataset size increases. The tabular data model is also typically hard to reason about since it doesn't readily correspond to mental models for a given application.
**DiagrammeR** provides a framework and collection of functions to model network graphs as property graphs. A property graph is one that has labeled nodes (for informational entities) that are connected via directed, typed relationships. Both nodes and relationships hold arbitrary properties (attributes, as key-value pairs). There is no rigid schema, but with node and edge labeling we can have as much metadata as we’d like. Taking the example of a typical software repository with multiple users, entities can be users, comments, issues, repositories, organizations, etc., and they are related to each other in specific ways. A user can create or delete a repository, and, the same user could also push a commit (a separate entity, having its own metadata) to a repository. The following schematic provides a simplified graph model of how entities can be related to each other in the context of an multiuser software repository.
<img src="diagrams/property_graph_chp_01.png">
## The DiagrammeR Graph Object
A graph object in **DiagrammeR** maintains a series of interrelated elements. Most importantly, the set of nodes (vertices) and edges (links) resides here, along with elements for graph metadata such as styling attributes, active node/edge selections, data frames linked to nodes or edges, and other user-defined caches of graph features.
<img src="diagrams/diagrammer_graph_chp_01.png">
The user does not directly modify the contents of the graph object. Rather, the package provides access to a wide range of graph functions for modifying the internal components of the graph object. This both provides ease-of-use and assurances that all graph elements are synchronized, avoiding unintentional corruption of the graph.
Nodes and edges contain properties (called *attributes* throughout) as key-value pairs. Formally, the graph object contains an **R** data frame for all graph nodes and their attributes (`nodes_df`) and an analogous data frame for edges and their attributes (`edges_df`). In the context of **DiagrammeR** these are typically called the node data frame (*ndf*) and the edge data frame (*edf*).
There are several additional data frames used to support the modification and display of graphs. The Global Graph Attributes (`global_attrs`) data frame stores data for the default appearance of a graph when displayed. One entry may include using a `fontsize` of `10` for each `node`. Another may affect the whole graph, where the `layout` is the `circo` type. These defaults, set for nodes and edges, can be subject to overrides on a per-node or per-edge basis by including values for the same attributes in the node or edge data frames.
Two data frames are reserved for node and edge selection values (`node_selection` and `edge_selection`). If there is no active selection, then these data frames are empty (i.e., have 0 rows). If there is a selection (and there can only be one type of selection at any given time), then rows containing node IDs or edge IDs will be populated. This will generally persist (even as other graph functions are used) until the selection is formally cleared. These types of selections are useful because we can target specific parts of the graph and apply specific functions that recognize selections to do useful graph work.
The graph stores useful metadata about itself such as identifiers and control options (in `graph_info`) and a log of which graph modification functions were used (in `graph_log`). This is valuable for the system to understand the graph history and state and this allows for some graph functions to work with less user-provided information.
Finally, we have the option to store some other **R** objects in the graph. One is a cache space for **R** vectors. This cached vector could consist of a series of labels extracted from one part of the graph that may be useful in a function that does something to another part of the graph. The other reserved space for data is for entire data frames that could belong to any node or edge in the graph. This is great if you have extended metadata that doesn't easily correspond to a key-value attribute pair.