KnowledgeBase Neo4J schema

Cypher statements to create the indexes and constraints:

CREATE CONSTRAINT ON (arc:archive) ASSERT arc.uid  IS UNIQUE
CREATE CONSTRAINT ON (news:newspaper) ASSERT news.uid  IS UNIQUE
CREATE CONSTRAINT ON (iss:issue) ASSERT iss.uid  IS UNIQUE
CREATE CONSTRAINT ON (pag:page) ASSERT pag.uid  IS UNIQUE
CREATE CONSTRAINT ON (art:article) ASSERT art.uid  IS UNIQUE
CREATE INDEX ON :entity(Project)
CREATE INDEX ON :article(Project)
CREATE INDEX ON :entity(df)

Running the :schema command afterwards should report the following:

Indexes
   ON :article(Project) ONLINE 
   ON :entity(df) ONLINE 
   ON :archive(uid) ONLINE  (for uniqueness constraint)
   ON :article(uid) ONLINE  (for uniqueness constraint)
   ON :issue(uid) ONLINE  (for uniqueness constraint)
   ON :newspaper(uid) ONLINE  (for uniqueness constraint)
   ON :page(uid) ONLINE  (for uniqueness constraint)

Constraints
   ON ( archive:archive ) ASSERT archive.uid IS UNIQUE
   ON ( article:article ) ASSERT article.uid IS UNIQUE
   ON ( issue:issue ) ASSERT issue.uid IS UNIQUE
   ON ( newspaper:newspaper ) ASSERT newspaper.uid IS UNIQUE
   ON ( page:page ) ASSERT page.uid IS UNIQUE

TF-IDF implementation

The entity term frequency tf of our model is the raw count of an entity (ent:entity) in a document (art:article), i.e. the number of times that entity occurs in (art). It is stored while performing NER in the (ent)-[r:appears_in]->(art) relationship.

Once NER is done, we count a) the total number of different entities collected for (art) to obtain the (art) property dl, or document length:

CALL apoc.periodic.iterate(
  "MATCH (art:article) WHERE (art)<-[:appears_in]-() RETURN art",
  "MATCH (art:article)<-[r:appears_in]-() WITH art, count(r) as dl SET art.dl = dl", 
  {batchSize:5000, iterateList:true, parallel:true})

and b) the number of articles in which (ent) occurs across the overall corpus, known as df, or document frequency:

CALL apoc.periodic.iterate(
  "MATCH (ent:entity) WHERE (ent)-[:appears_in]->() RETURN ent",
  "MATCH (ent:entity)-[r:appears_in]->() WITH ent, count(r) as df SET ent.df = df", 
  {batchSize:5000, iterateList:true, parallel:true})

The term frequency tf of the relationship -[r:appears_in]-> is then normalized by dl, resulting in ntf, a float intended to lie between 0.0 and 1.0 (this holds only as long as tf does not exceed dl).

CALL apoc.periodic.iterate(
  "MATCH (art:article) WHERE (art)<-[:appears_in]-() RETURN art",
  "MATCH (art:article)<-[r:appears_in]-() SET r.ntf = toFloat(r.tf) / art.dl",
  {batchSize:100, iterateList:true, parallel:false})

Note that both tf and ntf are stored in the (ent)-[r:appears_in]->(art) relationship, while dl is stored in (art).
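To make the three passes concrete, here is a minimal in-memory sketch in Python; it is not the Cypher used on the graph. It assumes a toy list of (entity, article, tf) triples standing in for the appears_in relationships. The uids and the names appearances, document_lengths, document_frequencies and normalized_tf are hypothetical and exist only for this illustration.

```python
from collections import Counter

# Toy stand-in for (ent)-[r:appears_in {tf}]->(art): (entity uid, article uid, tf).
# All uids are made up for this example.
appearances = [
    ("ent-paris", "art-1", 1),
    ("ent-rome", "art-1", 1),
    ("ent-paris", "art-2", 1),
]

def document_lengths(rows):
    """dl: number of appears_in relationships per article (mirrors count(r))."""
    return Counter(art for _ent, art, _tf in rows)

def document_frequencies(rows):
    """df: number of appears_in relationships per entity (mirrors count(r))."""
    return Counter(ent for ent, _art, _tf in rows)

def normalized_tf(rows):
    """ntf: tf divided by the dl of the article, always as a float."""
    dl = document_lengths(rows)
    return {(ent, art): tf / dl[art] for ent, art, tf in rows}

dl = document_lengths(appearances)
df = document_frequencies(appearances)
ntf = normalized_tf(appearances)
```

In this toy corpus art-1 links to two entities, so each of its relationships gets ntf 0.5, while ent-paris has df 2 because it appears in both articles.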

The entity document frequency, i.e. the number of articles in which the entity appears, is represented by the integer property df of the (ent) node.
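This page does not show how ntf and df are combined into a final score. One common formulation, assumed here and not confirmed by the source, weights ntf by the natural logarithm of N / df, where N is the total number of articles. The function name tf_idf and the parameter n_articles are hypothetical.

```python
import math

def tf_idf(ntf, df, n_articles):
    """Assumed weighting ntf * ln(N / df); the source does not state its formula."""
    if df <= 0 or n_articles <= 0:
        raise ValueError("df and n_articles must be positive")
    return ntf * math.log(n_articles / df)

# An entity found in a single article out of 100 scores higher than
# one found in every article, for the same ntf.
score_rare = tf_idf(ntf=0.5, df=1, n_articles=100)
score_common = tf_idf(ntf=0.5, df=100, n_articles=100)
```

Under this assumption an entity that occurs in every article gets a score of zero, which is why df is stored on the (ent) node and can be read without traversing the corpus again.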
