Skip to content

arjuntherajeev/neo4j_issuu_data_analysis

Repository files navigation

Getting Started with Data Analysis using Neo4j

  • Author: Arjun Rajeev Nedungadi (@arjuntherajeev)
  • Field: Data Analysis/Graph Databases
  • Topic: Importing & Performing Queries on a JSON Data Set from issuu.com using Neo4j

Why are we doing this?

Data Analysis is the phenomenon of dissecting, structuring and understanding data. In a nutshell - we want to find meaning from our data. In this tutorial, we aim to analyze a data set from issuu.com. The goal is to find answers to a variety of simple and complex questions.

There are a plethora of tools, techniques and methods available to pursue Data Analysis. We will use Neo4j - a Graph Database, to represent and visualize the data. It uses a query language called Cypher which allows us to build queries and find all the answers that we seek. By the end of this tutorial, we will be able to import a JSON data set to Neo4j and comfortably perform queries on our data.

Recommended Reading

Readers are encouraged to read and possess a basic understanding about Neo4j and Graph Databases prior to starting this tutorial. I previously published a beginner's guide to Neo4j - Getting Started with Neo4j which introduces the concepts of Graph Databases and elaborates on how to perform basic CRUD (Create-Read-Update-Delete) operations using Neo4j.

Getting Set Up

The first thing we need to do is to download and install Neo4j. We will use Neo4j Desktop which provides a user-friendly UI to visualise the Graph and run queries.

The next step is to acquire and understand the data set!

1. The Data Set

The data set that we will analyze comes from issuu.com - an online repository for magazines, catalogs, newspapers, and other publications. They published the Issuu Research Dataset with a treasure of data about documents and visitors. The data set is completely anonymised and provides an insight into the usage of the website.

The data is available in the JSON format. It can be downloaded/accessed from this GitHub repository. There are two flavors of this file:

  1. A small version - issuu_sample.json (This version of the data set has 4 entries).
  2. A large version - issuu_cw2.json (This version of the data set has 10,000 entries).

Both these data sets have been slightly modified for the purpose of this tutorial. To summarize the modification - All JSON entries are now stored in an array and referenced by a key called items.

The data set is vast and the detailed specification is available here. However, we are interested in the following attributes:

Attribute Purpose
env_doc_id Uniquely identify each Document
visitor_uuid Uniquely identify each Visitor
visitor_country Two-letter code to identify Visitor's country
event_type Type of action accomplished by Visitor on the Document

2. Understanding the Graph

Now that we have selected our data set and cherry-picked the necessary attributes, the next step is to formulate the data as a graph. To create the graph in Neo4j, we need to identify the following elements:

  • Nodes
  • Relationships
  • Properties

From our data set, we can identify 2 Nodes with the following properties:

Node Properties
Document doc_uuid
Visitor visitor_uuid, country

uuid stands for Universally Unique IDentifier.

Tip: A Node can also be thought as a Class in Object-Oriented Programming!

What about the Relationships? Yes! We can create 1 Relationship between the Document and Visitor Nodes.

Relationship Properties
Visitor viewed Document type

The relationship viewed is generic in nature. The type property indicates the specific type of event that was accomplished.

As an example, if we consider a Visitor Thomas and a Document A Diary of Jane, then the Relationship can be illustrated as: Thomas viewed A Diary of Jane. However, the type of viewership could be any one the following:

  • impression
  • click
  • read
  • download
  • share
  • pageread
  • pagereadtime
  • continuation_load

Hence, for the purpose of simplicity of this exercise, we will use a single Property on the Relationship called as type. The Relationship can now be illustrated as: Visitor Thomas viewed (and specifically downloaded) Document A Diary of Jane.

3. Creating Constraints & Indexes

A Constraint is a mechanism to control and ensure data integrity in Neo4j. Constraints can be created on either Nodes or Relationships. There are basically three types of Constraints in Neo4j:

  1. Unique Node Property Constraints - to ensure that the Graph contains only a single Node with a specific Label and Property value.
  2. Node Property Existence Constraints - to ensure that a certain Property of a specific Label exists within all Nodes in the Graph.
  3. Relationship Property Existence Constraints - to ensure that a certain Property exists within all Relationships of a specific structure.
  4. Node Keys - to ensure that, for a given Label and set of Properties, there exists all those Properties for that Label and that the combination of Property values is unique. Essentially, a combination of existence and uniqueness.

We will create Unique Node Property Constraints for our Graph as follows:

On the Document Label for the Property doc_uuid:

CREATE CONSTRAINT ON (d:Document) ASSERT d.doc_uuid IS UNIQUE

On the Visitor Label for the Property visitor_uuid:

CREATE CONSTRAINT ON (v:Visitor) ASSERT v.visitor_uuid IS UNIQUE

The main goal of this exercise is to query the Graph to derive insights about the data set. One way to improve the efficiency of retrieval of data is by using the concept of Indexes. The idea behind an Index here is the same as in Relational and NoSQL databases.

In Neo4j, an Index can be created on a single property of a Label, these are known as Single-Property Indexes. An Index can also be created on multiple properties of a Label, these are known as Composite Indexes.

It is important to understand that, by creating a Unique Node Property Constraint on a Property, Neo4j will alo create a Single-Property Index on that Property. Hence, in our situation, the Indexes will be created on the doc_uuid Property for the Document Label and on the visitor_uuid Property for the Visitor Label.

Now that we created the Constraints, we can view existing Indexes in Neo4j using the query:

CALL db.indexes

It should return:

╒═══════════════════════════════╤══════╤════════════════════╕
│description                    │state │type                │
╞═══════════════════════════════╪══════╪════════════════════╡
│INDEX ON :Document(doc_uuid)   │online│node_unique_property│
├───────────────────────────────┼──────┼────────────────────┤
│INDEX ON :Visitor(visitor_uuid)│online│node_unique_property│
└───────────────────────────────┴──────┴────────────────────┘

Tip: More information about Constraints and their effect on Indexes is elaborated here.

4. Importing the JSON Data Set

Let's now get our hands dirty!

Assuming that Neo4j is started (with an appropriate Database Location selected), we should first see an empty Graph. This means that there are no Nodes (and Relationships). Our goal is to populate the Graph with the data from the JSON data set by defining its skeleton (Nodes and Relationships).

To accomplish this, we will use the concept of user defined procedures in Neo4j. These procedures are specific functionalities which can be re-used to manipulate the Graph. We will specifically use a procedure from the APOC library which is a collection of 200+ commonly used procedures.

On Neo4j Desktop, APOC can be installed with the click of a single button!

To do this, open your Project and click on the Manage button. Next, click on the Plugins tab. Under this tab, you will see the heading APOC accompanied with its version and a description. Click on the Install and Restart button. Once APOC is successully installed, you should see a label with the message ✓ Installed.

Here is a screenshot of a successful APOC installation:

Tip: More information about user defined procedures and the APOC library (installation, usage, examples) is elaborated on a Neo4j blog article titled - APOC: An Introduction to User-Defined Procedures and APOC.

We are interested in a specific procedure - apoc.load.json which will allow us to load data from a JSON document. This procedure will return a singular map if the result is a JSON object or a stream of maps if the result is an array.

The following code snippet illustrates the use of the apoc.load.json procedure with our large data set - issuu_cw2.json:

WITH "/path/to/issuu_cw2.json" AS url
CALL apoc.load.json(url) 
YIELD value

Now, value contains our data which we need to utilize to create the Nodes and Relationships. However, we first need to parse this bulk JSON data.

To do this, we can use the UNWIND keyword in Neo4j. In a nutshell, UNWIND will expand a list into a sequence of rows (More information about UNWIND is available here). At this stage, we are required to possess an understanding of the structure of the JSON data. As mentioned in Chapter 1, we can access the entries using the key items.

This is illustrated as:

UNWIND value.items AS item

At this stage, we have access to a single row via item. Now, we can use the API to access the values!

In Neo4j, the WITH clause allows using the output of a sub-query with following sub-query parts (More information about WITH is available here).

From our data, we are concerned about Documents from the issuu.com "Reader" software. We uniquely identify these Documents using the env_doc_id attribute. However, it is to be noted that not all Documents have the env_doc_id attribute. Thus, we are required to explicity select those entries which possess the attribute. The following code snippet illustrates this:

WITH item
WHERE NOT item.env_doc_id IS NULL

Notice how we access the value using item.env_doc_id. This style of retrieving the value makes working with JSON data on Neo4j a smooth experience!

Now that we have access to the values of each entry, it is time to create the Nodes and Relationships. This is accomplished using the MERGE keyword in Neo4j. It is crucial to know the difference between CREATE and MERGE. As an example:

If we execute the following statements sequentially (provided that the constraints were NOT created):

CREATE (d:Document {doc_uuid:1}) RETURN (d) 
CREATE (d:Document {doc_uuid:1}) RETURN (d)

This will result in 2 Nodes being created! A way to control this is by using MERGE which will create the Node only if it does not exist. This can illustrated as follows:

MERGE (d:Document {doc_uuid:1}) RETURN (d)

This same principle can be applied for Relationships as well!

Hence, the final Cypher query consisting of all the above components will look like:

WITH "https://raw.githubusercontent.com/arjuntherajeev/neo4j_issuu_data_analysis/master/issuu_cw2.json" AS url
CALL apoc.load.json(url) 
YIELD value
UNWIND value.items AS item
WITH item
WHERE NOT item.env_doc_id IS NULL
MERGE (document:Document {doc_uuid:item.env_doc_id})
MERGE (visitor:Visitor {visitor_uuid:item.visitor_uuid}) ON CREATE SET visitor.visitor_country = item.visitor_country
MERGE (visitor)-[:VIEWED{type:item.event_type}]->(document)

If we run this query verbatim on Neo4j, the output should be (similar to):

Added 2293 labels, created 2293 nodes, set 5749 properties, created 2170 relationships, statement executed in 15523 ms.

To check whether the Graph was populated successfully, we can run the Cypher query: MATCH (n) RETURN (n) LIMIT 200 which will only display the top 200 results.

The output can be visualized as follows:

5. Let's Query

Finally, we can derive insights from our data!

We need to ask our Graph questions. These questions need to be translated to Cypher queries which will return the appropriate results. Let us answer some basic and advanced questions about the data set:

Note: For the purpose of this tutorial, we will only display the top 10 results for queries with a large number of rows. This is achieved by using the LIMIT 10 constraint.

Query 1. Find the number of visitors from each country and display them in the descending order of count.

MATCH (v:Visitor) 
RETURN v.visitor_country AS Country, count(v) AS Count 
ORDER BY count(v) DESC 
LIMIT 10

Result:

╒═══════╤═════╕
│Country│Count│
╞═══════╪═════╡
│US     │312  │
├───────┼─────┤
│BR     │143  │
├───────┼─────┤
│MX     │135  │
├───────┼─────┤
│PE     │47   │
├───────┼─────┤
│CA     │46   │
├───────┼─────┤
│ES     │43   │
├───────┼─────┤
│GB     │36   │
├───────┼─────┤
│AR     │35   │
├───────┼─────┤
│FR     │34   │
├───────┼─────┤
│CO     │32   │
└───────┴─────┘

Discussion:

This query simply performs an internal group by operation where Visitor Nodes are grouped based on the visitor_country property. The count is computed using the count() aggregate function. We sort the results in the descending order using the ORDER BY <column> DESC clause in Neo4j.

Query 2. For a given document, find the number of visitors from each country. (Example Document UUID = 140228101942-d4c9bd33cc299cc53d584ca1a4bf15d9)

MATCH (d:Document)<-[:VIEWED]-(v:Visitor)
WHERE d.doc_uuid='140228101942-d4c9bd33cc299cc53d584ca1a4bf15d9'
RETURN v.visitor_country AS Country, count(v.visitor_country) AS Count 
ORDER BY count(v.visitor_country) DESC

Result:

╒═══════╤═════╕
│Country│Count│
╞═══════╪═════╡
│GY     │15   │
├───────┼─────┤
│CA     │12   │
├───────┼─────┤
│US     │11   │
├───────┼─────┤
│CW     │1    │
├───────┼─────┤
│BB     │1    │
└───────┴─────┘

Discussion:

This query is very similar to the one above. Here, we perform an internal group by operation to group the Visitor Nodes based on the visitor_country property. However, this query differs from the previous one in the sense that we want to filter the counts for a particular Document UUID.

In order to achieve this filteration, we need to utilise the Relationship within the Graph. Hence, we first MATCH, filter using the WHERE clause and then return the desired values.

Tip: The relationship given here: MATCH (d:Document)<-[:VIEWED]-(v:Visitor) can also be written as MATCH (v:Visitor)-[:VIEWED]->(d:Document).

Query 3. Find the number of occurrences for each type of viewership activity.

MATCH (d:Document)<-[r:VIEWED]-(v:Visitor)
RETURN r.type AS Type, count(d.doc_uuid) AS Count
ORDER BY Count ASC

Result:

╒════════════╤═════╕
│Type        │Count│
╞════════════╪═════╡
│click       │1    │
├────────────┼─────┤
│read        │62   │
├────────────┼─────┤
│pageread    │369  │
├────────────┼─────┤
│pagereadtime│779  │
├────────────┼─────┤
│impression  │959  │
└────────────┴─────┘

Discussion:

This query also performs an internal group by operation on the Relationship property type. An interesting aspect of this query is the ORDER BY Count ASC. Previously, we followed the style of using ORDER BY count(d.doc_uuid) ASC. However, once we add a column name such as Count, we can use that in subsequent parts of the query.

Hence, ORDER BY count(d.doc_uuid) ASC can also be written as ORDER BY Count ASC.

Query 4. Find the visitors for each document and display the top 3 in the descending order of number of visitors.

MATCH (d:Document)<-[r:VIEWED]-(v:Visitor)
RETURN DISTINCT d.doc_uuid AS DocUUID, collect(DISTINCT v.visitor_uuid) AS Visitors, count(DISTINCT v.visitor_uuid) AS Count
ORDER BY Count DESC
LIMIT 3

Result:

╒══════════════════════════════╤══════════════════════════════╤═════╕
│DocUUID                       │Visitors                      │Count│
╞══════════════════════════════╪══════════════════════════════╪═════╡
│140224101516-e5c074c3404177518│[4f4bd7a35b20bd1f, 78e1a8af51d│26   │
│bab9d7a65fb578e               │44194, 4d49271019c7ed96, 6b2d3│     │
│                              │cca6c1f8595, 19f5285fef7c1f00,│     │
│                              │ 3819fc022d225057, f102d9d4fc4│     │
│                              │bacdc, e5d957682bc8273b, abefb│     │
│                              │3fe7784f8d3, 6170372b90397fb3,│     │
│                              │ 797846998c5624ca, 43dd7a8b2fa│     │
│                              │fe059, 3ec465aa8f36302b, d6b90│     │
│                              │f07f29781e0, 7bd813cddec2f1b7,│     │
│                              │ 3db0cb8f357dcc71, e1bfcb29e0f│     │
│                              │3664a, 6d87bcdc5fa5865a, b0ba1│     │
│                              │42cdbf01b11, 0930437b533a0031,│     │
│                              │ e3392e4a18d3370e, ee14da6b126│     │
│                              │3a51e, 502ddaaa898e57c4, 6fd04│     │
│                              │0328d2ad46f, 23f8a503291a948d,│     │
│                              │ 923f25aa749f67f6]            │     │
├──────────────────────────────┼──────────────────────────────┼─────┤
│140228202800-6ef39a241f35301a9│[2f21ee71e0c6a2ce, 55ac6c3ce63│25   │
│a42cd0ed21e5fb0               │25228, e8fa4a9e63248deb, b2a24│     │
│                              │f14bb5c9ea3, d2d6e7d1a25ee0b0,│     │
│                              │ 6229cca3564cb1d1, 13ca53a93b1│     │
│                              │594bf, 47d2608ec1f9127b, 4e2a7│     │
│                              │5f30e6b4ce7, 43b59d36985d8223,│     │
│                              │ 355361a351094143, 51fd872df55│     │
│                              │686a5, 2f63e0cca690da91, febc7│     │
│                              │86c33113a8e, 52873ed85700e41f,│     │
│                              │ ca8079a4aaff28cb, 17db86d2605│     │
│                              │43ddd, b3ded380cc8fdd24, b6169│     │
│                              │f1bebbbe3ad, 458999cbf4307f34,│     │
│                              │ 280bd96790ade2d4, 32563acf872│     │
│                              │f5449, fabc9339a406616d, 36a12│     │
│                              │501ee94d15c, 6d3b99b2041af286]│     │
├──────────────────────────────┼──────────────────────────────┼─────┤
│140228101942-d4c9bd33cc299cc53│[b1cdbeca3a556b72, a8cf3c4f144│24   │
│d584ca1a4bf15d9               │9cc5d, 1435542d699350d9, 06d46│     │
│                              │5bfb51b0736, 2d41536695cc4814,│     │
│                              │ a96854d21780c1f9, d1c98b02398│     │
│                              │e9677, 78deb8ffdb03d406, 6c661│     │
│                              │964d1d13c61, 0d47795fb1ddba9d,│     │
│                              │ 667283570b5cedfe, 5b2baf03296│     │
│                              │63564, 08c069dc405cad2e, 6823c│     │
│                              │573efad29f6, 9b2cb60327cb7736,│     │
│                              │ 0e8ddc2d2a60e14f, f5986e1cb02│     │
│                              │378e4, fa3810e505f4f792, d5ed3│     │
│                              │cfc4a454fe9, ba76461cdd66d337,│     │
│                              │ ee42ba15ed8618eb, 688eb0dcd6a│     │
│                              │d8c86, 67c698e88b4fbdcc, c97c3│     │
│                              │83d774deae0]                  │     │
└──────────────────────────────┴──────────────────────────────┴─────┘

Discussion:

This query utilizes the collect() aggregate function which groups multiple records into a list. An important consideration made here is the use of the DISTINCT operator to ensure that duplicate values are omitted from the output. Finally, we display the top 3 using the LIMIT 3 constraint.

Query 5. For a given document, find recommendations of other documents like it.

Example 1: Document UUID = 130902223509-8fed6b88ae0937c1c43fb30cb9f87ad8

MATCH (d:Document)<-[r:VIEWED]-(v:Visitor)-[r1:VIEWED]->(d1:Document) 
WHERE d1<>d AND d.doc_uuid='130902223509-8fed6b88ae0937c1c43fb30cb9f87ad8'
RETURN d1 AS Recommendations, count(*) AS Views, 
sum(
CASE r1.type 
  WHEN "impression" THEN 1 
  WHEN "pageread" THEN 1.5 
  WHEN "pagereadtime" THEN 1.5 
  WHEN "read" THEN 2 
  WHEN "click" THEN 0.5 
  ELSE 0 
END
) as Score
ORDER BY Score DESC

Result:

╒══════════════════════════════╤═════╤═════╕
│Recommendations               │Views│Score│
╞══════════════════════════════╪═════╪═════╡
│{doc_uuid: 130810070956-4f21f4│12   │16   │
│22b9c8a4ffd5f62fdadf1dbee8}   │     │     │
└──────────────────────────────┴─────┴─────┘

Example 2: Document UUID = 120831070849-697c56ab376445eaadd13dbb8b6d34d0

MATCH (d:Document)<-[r:VIEWED]-(v:Visitor)-[r1:VIEWED]->(d1:Document) 
WHERE d1<>d AND d.doc_uuid='120831070849-697c56ab376445eaadd13dbb8b6d34d0'
RETURN d1 AS Recommendations, count(*) AS Views, 
sum(
CASE r1.type 
  WHEN "impression" THEN 1 
  WHEN "pageread" THEN 1.5 
  WHEN "pagereadtime" THEN 1.5 
  WHEN "read" THEN 2 
  WHEN "click" THEN 0.5 
  ELSE 0 
END
) as Score
ORDER BY Score DESC

Result:

╒══════════════════════════════╤═════╤═════╕
│Recommendations               │Views│Score│
╞══════════════════════════════╪═════╪═════╡
│{doc_uuid: 130701025930-558b15│6    │6    │
│0c485fc8928ff65b88a6f4503d}   │     │     │
├──────────────────────────────┼─────┼─────┤
│{doc_uuid: 120507012613-7006da│6    │6    │
│2bc335425b93d347d2063dc373}   │     │     │
└──────────────────────────────┴─────┴─────┘

Discussion:

This query aims to find documents similar to a given document by assigning a score based on the type of viewership activity.

Activity Score
impression 1
pageread 1.5
pagereadtime 1.5
read 2
click 0.5

First, we perform a MATCH operation to capture the 1st degree and 2nd degree viewership of a Visitor Node along with the Document Nodes and Relationships. We ensure that 2 Documents Nodes are not the same by using the <> operator and also specify the initial Document UUID for which we would like to find related Documents.

Next, we simply return the recommended Document UUIDs, their overall viewership counts and their score. To calculate the score, we utilize the CASE expression which is then tallied using the sum() aggregate function.

The CASE expression has the following syntax:

CASE <expression>
 WHEN <value> THEN <result> //if value matches, then return result
 [WHEN ...] //repeat until all values are handled
 [ELSE <default>] //else return a default result
END

Finally, we sort the results in the descending order of score using th ORDER BY Score DESC clause!

Summary

There you go, ladies & gentlemen!

In this tutorial, we saw an example of performing Data Analysis using Neo4j. We examined the Issuu Research Dataset and elaborated on its structure, format and fields. Next, we formulated the model/schema of our desired Graph by choosing appropriate Nodes, Relationships and Properties. Further, we created Constraints and Indexes in Neo4j to ensure uniqueness and improve the performance of querying. After this, we discussed how to import the raw JSON data set, parse it and populate our Graph by following the previously determined schema. Lastly, we saw some sample Cypher queries which helped us derive insight from our vast data set!

References

What's Next?

The possibilities are endless! If you enjoyed this tutorial, then you can try to derive insights using another data set of your choice! While we chose to construct a rather simple Graph, you can make it much more complex and detailed. Further to this, you can also explore and experiment with the various APOC user defined procedures on the Graph!

Note

If you have any other interesting Cypher queries that might be useful to readers, please feel free to create a new Issue and include a short description and the query itself.

Releases

No releases published

Packages

No packages published