KnowledgeBase Neo4J schema
Cypher statements to create indexes and constraints
CREATE CONSTRAINT ON (arc:archive) ASSERT arc.uid IS UNIQUE
CREATE CONSTRAINT ON (news:newspaper) ASSERT news.uid IS UNIQUE
CREATE CONSTRAINT ON (iss:issue) ASSERT iss.uid IS UNIQUE
CREATE CONSTRAINT ON (pag:page) ASSERT pag.uid IS UNIQUE
CREATE CONSTRAINT ON (art:article) ASSERT art.uid IS UNIQUE
CREATE INDEX ON :entity(Project)
CREATE INDEX ON :article(Project)
CREATE INDEX ON :entity(df)
Running :schema afterwards should list the following:
Indexes
ON :article(Project) ONLINE
ON :entity(df) ONLINE
ON :archive(uid) ONLINE (for uniqueness constraint)
ON :article(uid) ONLINE (for uniqueness constraint)
ON :issue(uid) ONLINE (for uniqueness constraint)
ON :newspaper(uid) ONLINE (for uniqueness constraint)
ON :page(uid) ONLINE (for uniqueness constraint)
Constraints
ON ( archive:archive ) ASSERT archive.uid IS UNIQUE
ON ( article:article ) ASSERT article.uid IS UNIQUE
ON ( issue:issue ) ASSERT issue.uid IS UNIQUE
ON ( newspaper:newspaper ) ASSERT newspaper.uid IS UNIQUE
ON ( page:page ) ASSERT page.uid IS UNIQUE
The entity term frequency tf of our model is the raw count of an entity (ent:entity) in a document (art:article), i.e. the number of times that entity occurs in (art). It is stored while performing NER in the (ent)-[r:appears_in]->(art) relationship.
Once NER is done, we count a) the total number of different entities collected for (art) to obtain the (art) property dl, or document length:
CALL apoc.periodic.iterate(
"MATCH (art:article) WHERE (art)<-[:appears_in]-() RETURN art",
"MATCH (art:article)<-[r:appears_in]-() WITH art, count(r) as dl SET art.dl = dl",
{batchSize:5000, iterateList:true, parallel:true})
and b) the number of articles in which (ent) appears across the overall corpus, known as df, or document frequency:
CALL apoc.periodic.iterate(
"MATCH (ent:entity) WHERE (ent)-[:appears_in]->() RETURN ent",
"MATCH (ent:entity)-[r:appears_in]->() WITH ent, count(r) as df SET ent.df = df",
{batchSize:5000, iterateList:true, parallel:true})
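The two counts above can be illustrated on a toy dataset. This is a minimal plain-Python sketch, not part of the actual pipeline: the list `appears_in` and its identifiers (`ent-1`, `art-1`, ...) are hypothetical, and it assumes one appears_in relationship per entity/article pair.

```python
from collections import Counter

# Hypothetical toy data: one (entity, article, tf) triple per appears_in relationship.
appears_in = [
    ("ent-1", "art-1", 3),
    ("ent-2", "art-1", 1),
    ("ent-1", "art-2", 2),
    ("ent-3", "art-2", 1),
    ("ent-1", "art-3", 1),
]

# dl: number of appears_in relationships arriving at each article,
# i.e. the number of distinct entities found in it.
dl = Counter(art for _ent, art, _tf in appears_in)

# df: number of appears_in relationships leaving each entity,
# i.e. the number of articles that mention it.
df = Counter(ent for ent, _art, _tf in appears_in)

print(dict(dl))  # {'art-1': 2, 'art-2': 2, 'art-3': 1}
print(dict(df))  # {'ent-1': 3, 'ent-2': 1, 'ent-3': 1}
```

Both values count relationships rather than summing tf, which is why ent-1 has a df of 3 even though it is mentioned six times in total.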
The tf term frequency of the -[r:appears_in]-> relationship is then normalized by dl, resulting in ntf, a float number. It falls between 0.0 and 1.0 as long as tf does not exceed dl.
CALL apoc.periodic.iterate(
"MATCH (art:article) WHERE (art)<-[:appears_in]-() RETURN art",
"MATCH (art:article)<-[r:appears_in]-() SET r.ntf = toFloat(r.tf)/art.dl",
{batchSize:100, iterateList:true, parallel:false})
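A point worth checking in the normalization: in Cypher, dividing one integer by another yields an integer, so at least one operand has to be a float (for instance via toFloat) for ntf to keep its fractional part. The plain-Python sketch below mirrors both behaviours on hypothetical toy values; the dictionaries `tf` and `dl` and their keys are assumptions made for illustration only.

```python
# Hypothetical toy data: tf per (entity, article) relationship and dl per article.
tf = {("ent-1", "art-1"): 1, ("ent-2", "art-1"): 1, ("ent-1", "art-2"): 3}
dl = {"art-1": 2, "art-2": 1}

# Float division, the equivalent of toFloat(r.tf) / art.dl in Cypher.
ntf = {(ent, art): count / dl[art] for (ent, art), count in tf.items()}

# Integer division, which is what tf / dl gives in Cypher when both are integers.
truncated = {(ent, art): count // dl[art] for (ent, art), count in tf.items()}

print(ntf[("ent-1", "art-1")])        # 0.5
print(truncated[("ent-1", "art-1")])  # 0
print(ntf[("ent-1", "art-2")])        # 3.0, above 1.0 because tf exceeds dl
```

The last line shows that, with dl defined as the number of distinct entities, an entity mentioned more often than there are entities in the article gets an ntf above 1.0.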
Note that both tf and ntf are stored in the (ent)-[r:appears_in]->(art) relationship, while dl is stored in (art).
The entity document frequency, i.e. the number of articles in which the entity appears, is represented by the integer property df of the (ent) node.