Alignment
Alignment is the process of merging a new subgraph generated by rt with the full knowledge graph.
There are two broad categories of alignment:
- Merging new nodes that have canonical names / unique IDs (e.g. CVE #):
- If a matching canonical name is not found in the knowledge graph, add the node.
- If a matching canonical name is found in the knowledge graph, merge properties and merge edges.
- Merging new nodes that do not have canonical names / unique IDs (e.g. malware). Some of these nodes may not have a canonical name; others may have a canonical name that is not available:
- Identify equivalent nodes and score the confidence that the two nodes refer to the same domain concept.
- If a suitable match is found, merge properties and merge edges as above.
- If a suitable match is not found, add the new node, and merge edges as above if needed.
This section documents the state of the alignment code and what future activities are planned.
(5/15/2014) The alignment code receives all content from a single source as a single graph. For example, when STUCCO loads content from a CVE source file, all of the content is transformed into a GraphSON structure and passed to alignment. The alignment code assumes the string it receives is a subgraph in GraphSON form. Alignment handles the merging of new content as individual vertices and edges, without taking into account any topology/connectivity.
Of the two broad categories of alignment, our first implementation covers only the first category. Here are the steps:
- Using only the vertices first:
- Each vertex’s unique ID is searched for within the knowledge graph (i.e., Titan).
- If no vertex is found, then this vertex is created within the knowledge graph.
- If a vertex is found, then its properties are merged with the vertex in the knowledge graph. Properties that were not present are added; existing properties are appended to, overridden, or retained (e.g., if they are newer than those of the vertex being merged). The decision for aligning each vertex property is tied to the property type, which is encoded into the GraphSON schema associated with the ontology definition. There are currently four merge methods: keepNew, appendList, keepUpdates, and keepConfidence (see the sketch after this list).
- Once all of the vertices have been added, the edges can be added.
- Note: vertices must be added first, or the new edges won't find their vertices within the knowledge graph.
- Using the edge definition (i.e., which vertex IDs define an edge), we look up the incoming and outgoing vertices in the knowledge graph.
- If all of an edge's vertices cannot be found, an error is logged and the process moves on to the next edge.
- When the respective vertices are found, the process creates a property map for the edge, adds the edge properties to that map, and finally commits the edge to the knowledge graph. NOTE: if an edge already exists, we are not performing alignment with it, which may/will create duplicate edges.
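A minimal Python sketch of how the four merge methods might be applied when two vertices with the same ID are merged. The function name and schema layout are illustrative assumptions; only the four merge-method names come from the GraphSON schema described above.

```python
# Illustrative sketch only: the schema layout and helper name are assumptions,
# not the actual alignment implementation; only the four merge-method names
# (keepNew, appendList, keepUpdates, keepConfidence) come from the schema above.

def merge_vertex_properties(existing, new, merge_methods):
    """Merge the properties of 'new' into 'existing' using per-property merge methods.

    merge_methods maps a property name to one of:
    "keepNew", "appendList", "keepUpdates", or "keepConfidence".
    """
    for prop, new_value in new.items():
        method = merge_methods.get(prop, "keepNew")
        old_value = existing.get(prop)
        if old_value is None:
            existing[prop] = new_value            # property not present yet: just add it
        elif method == "keepNew":
            existing[prop] = new_value            # always take the incoming value
        elif method == "appendList":
            merged = old_value if isinstance(old_value, list) else [old_value]
            for item in (new_value if isinstance(new_value, list) else [new_value]):
                if item not in merged:
                    merged.append(item)           # union of both lists, no duplicates
            existing[prop] = merged
        elif method == "keepUpdates":
            # keep the incoming value only if its vertex was modified more recently
            # (assumes ISO 8601 modifiedDate strings with a consistent UTC offset)
            if new.get("modifiedDate", "") > existing.get("modifiedDate", ""):
                existing[prop] = new_value
        elif method == "keepConfidence":
            # keep the incoming value only if its vertex has a higher confidence score
            if new.get("score", 0.0) > existing.get("score", 0.0):
                existing[prop] = new_value
    return existing
```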
KNOWNS:
- To perform alignment with Titan, we go through the Rexster web interface and use Gremlin to acquire and submit content to Titan.
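As a rough illustration of that interaction, the sketch below posts a Gremlin script to Rexster over HTTP. The endpoint path, graph name, and property key are assumptions and will vary by deployment; this is not the project's actual query code.

```python
import requests

# Sketch of querying Titan through Rexster's Gremlin extension over HTTP.
# The endpoint path, graph name ("graph"), and property key ("name") are
# assumptions for illustration; adjust them to the actual Rexster deployment.
REXSTER_GREMLIN = "http://localhost:8182/graphs/graph/tp/gremlin"

def find_vertices(property_key, value):
    # Gremlin 2.x (Groovy) lookup by property; parameters are bound server-side.
    payload = {"script": "g.V.has(key, value)",
               "params": {"key": property_key, "value": value}}
    resp = requests.post(REXSTER_GREMLIN, json=payload)
    resp.raise_for_status()
    return resp.json().get("results", [])

print(find_vertices("name", "CVE-1999-0002"))
```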
This section highlights issues that the current alignment process still needs to resolve.
- The alignment rule set will need to be based on the ontology properties which are reflected in the JSON schema. As such, any new rules developed should be verified against this schema (similar to XSD in XML).
- Rule construction may want to leverage a DSL to make construction and verification of the rules easier to manage.
- As rules are constructed, will they be maintained in a DB or loaded from a file?
- Manual Correction Tool
- Ability to revert/override modifications to the GraphDB if there are incorrect insertions
- Ability to add content without having to go through the pipeline
- See the provenance on a node/edge and know what entries made that contribution or had the same contribution
- Consider providing a holding queue (Parking Lot) for entries that have enough conflicting evidence that manual intervention is needed.
- Log provenance information on changes/updates on edges and nodes.
- When updates occur on either a node or edge the result is:
- Overwrite content
- Append content (simple merge)
- Merge Content (identify what portions should be combined)
- Need to determine, for different nodes/edges, which comparison measure should be used, and what kinds of comparison measures are needed. How much deviation results in the creation of a new node/edge instance, an update, or a trip to the parking lot?
- For canonical names or IDs, the comparison function should be an equality measure.
- For dates, we need to consider timestamps whose precision varies from year-only down to the second (i.e., general to precise). How will we deal with this broad range (unless we provide range values)?
- For unstructured text there are several approaches, but the choice will depend upon the property in question.
- Meta-rules will be needed to make sure that updates are smart. For example, new sources of information may provide old content that shouldn't overwrite current content; checking timestamps tells us which content is most recent (see the sketch below).
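A minimal sketch of such a timestamp check, assuming each vertex carries a modifiedDate property as in the examples later on this page; the helper name is hypothetical.

```python
from datetime import datetime

# Hypothetical guard (not the current implementation): only let an incoming
# value overwrite the stored one if the incoming vertex was modified more
# recently. Assumes ISO 8601 modifiedDate values like those in the examples below.
def should_overwrite(existing_vertex, new_vertex):
    existing_ts = datetime.fromisoformat(existing_vertex["modifiedDate"])
    new_ts = datetime.fromisoformat(new_vertex["modifiedDate"])
    return new_ts > existing_ts
```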
There are several avenues for dealing with the alignment problem in other domains. In the database domain this is called the merge/purge problem of combining different databases. The theory is similar, but the underlying structures are different, because we are using a graph database whereas a standard relational DB is row-column oriented. Part of this task will be exploring what functional pieces can be leveraged from the DB community and what pieces can be leveraged from the graph community. Things to research:
- Approximate subgraph matching with graph edit distance. This will help identify which subgraphs are most likely a match. However, it won't be conclusive, as additional functions need to be applied at the individual level to determine the update/insertion action.
When merging two nodes or edges where the new and existing values of a property differ, the updated value will be determined by some function that is specified for that property. The updated value may be (a) one of the two conflicting values, (b) a new value derived from both input values, or (c) an array-like object with both values. These functions may make use of any node properties, such as the new or existing node's confidence score, source(s), or published date(s).
General process when merging nodes (properties that had null for either the existing or new value can be handled in the same way):
- Resolve value: for each conflicting property, identify the updated value to insert into the knowledge graph, e.g.:
existing_node["conflicting_property"] = resolve_property_with_strategy(conflicting_property, existing_node, new_node)
- Update graph: update existing_node.conflicting_property in the knowledge graph. new_node will not be added to the graph. Note: edges to/from new_node in the subgraph will be created in the knowledge graph against existing_node instead. (This assumes all nodes from the subgraph are added to the knowledge graph before edges.)
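Put together, a minimal Python sketch of this resolve-then-update loop; the strategy table and the update_vertex_property helper are illustrative assumptions, and only the two-step flow comes from the description above.

```python
# Illustrative sketch of the resolve-then-update flow described above; the
# strategy table and the update_vertex_property helper are assumptions.
def merge_into_existing(existing_node, new_node, strategies, graph):
    for prop, new_value in new_node.items():
        if existing_node.get(prop) == new_value:
            continue                                   # nothing to resolve
        # resolve value: pick the updated value via the property's strategy
        resolve = strategies.get(prop, lambda p, e, n: n[p])   # default: keep new
        existing_node[prop] = resolve(prop, existing_node, new_node)
        # update graph: write the resolved value back to the knowledge graph;
        # new_node itself is never added as a vertex
        graph.update_vertex_property(existing_node["_id"], prop, existing_node[prop])
```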
Example resolution functions:
// publishedDate is an integer unix timestamp
resolve_property_with_newest(property_name, existing_node, new_node) {
    // keep the value from whichever node was published more recently
    if (existing_node["publishedDate"] < new_node["publishedDate"])
        return new_node[property_name]
    else
        return existing_node[property_name]
}

// confidence ("score") is a float between 0 and 1
resolve_property_by_confidence(property_name, existing_node, new_node) {
    // keep the value from whichever node has the higher confidence score
    if (existing_node["score"] < new_node["score"])
        return new_node[property_name]
    else
        return existing_node[property_name]
}
Other examples could include a weighted average by confidence scores, or functions that may be unique to a specific property, e.g. an account's lastLogin property might always take the newest value, or a vulnerability's patchAvailable property might never change to false once a true value has been seen.
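As an illustration, here are hedged Python sketches of those two property-specific functions; the signatures mirror the pseudocode above, and the behavior is only what the sentence above describes.

```python
# Hedged sketches of property-specific resolution functions; the signatures
# mirror the pseudocode above, and the behavior is only what the preceding
# sentence describes.
def resolve_last_login(property_name, existing_node, new_node):
    # lastLogin always takes the newest (largest) timestamp seen
    return max(existing_node[property_name], new_node[property_name])

def resolve_patch_available(property_name, existing_node, new_node):
    # patchAvailable is "sticky": once any source reports true,
    # the merged value never reverts to false
    return existing_node[property_name] or new_node[property_name]
```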
Merging node confidence score ("score") properties will always use the same function across all node types. Different properties may also share the same resolution function.
Nodes can be added to or merged into the knowledge graph, and the edges associated with those nodes need to be added or merged as well. Once both endpoint nodes have been added or merged into the knowledge graph, the strategy for each edge depends on whether a matching edge already exists (see the sketch after this list).
- If no matching edge exists, add the new edge.
- If a matching edge exists, merge the properties of the two edges, as described above.
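A minimal sketch of that decision, assuming hypothetical graph helpers find_edge, add_edge, and update_edge, and the same kind of property-resolution strategies sketched for nodes.

```python
# Sketch of the add-or-merge decision for an edge whose two endpoint nodes are
# already in the knowledge graph. find_edge, add_edge, and update_edge are
# hypothetical graph helpers; the property merge mirrors the node case above.
def add_or_merge_edge(graph, out_id, in_id, label, new_props, strategies):
    existing_edge = graph.find_edge(out_id, in_id, label)
    if existing_edge is None:
        graph.add_edge(out_id, in_id, label, new_props)     # no matching edge: add it
        return
    for prop, value in new_props.items():                   # matching edge: merge properties
        if existing_edge.get(prop) != value:
            resolve = strategies.get(prop, lambda p, e, n: n[p])
            existing_edge[prop] = resolve(prop, existing_edge, new_props)
    graph.update_edge(existing_edge)
```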
This process starts with a new node for which no matching ID was found in the database. The database is searched for existing nodes which may match; if a match is found, the node's properties and edges are merged as above. If a matching node is not found, the node is added, and its edges are merged or added as needed.
Some node types, such as IP addresses, should always have matching IDs, and should not search for approximate matches. However other node types, such as malware, will very rarely have matching IDs even when there is a matching node present. (The nodes match if they represent the same real-world entity, even if they do not have the same ID.)
When searching for a matching node, the first step is to build a restricted set of potential matches.
The purpose of this step is to reduce the number of expensive in-depth comparisons that are needed, by replacing most of them with a much cheaper comparison that eliminates most nodes. To start, only nodes of the same node type should be considered (e.g. malware can only possibly match malware). Next, a "canopy" is found, which contains all potentially matching nodes. The specifics depend on the comparison techniques chosen for the fields and node pairs below, but as an example, assume that nodes are matched based on distance, and that the node distance is a weighted sum of property distances. If one pair of properties has a large enough distance, that alone can make a match impossible, so comparing the remaining fields is not needed.
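A rough Python sketch of that canopy-building step, assuming a list of candidate nodes, a cheap per-property distance function, and a "name" property; the threshold and helper names are illustrative.

```python
# Illustrative canopy builder: cheaply discard candidates that cannot possibly
# match before running the expensive in-depth comparison. The "name" property,
# the cheap_distance function, and the threshold are assumptions.
def build_canopy(new_node, candidates, cheap_distance, max_distance=0.5):
    canopy = []
    for candidate in candidates:
        # only nodes of the same type can match (malware vs. malware, etc.)
        if candidate.get("vertexType") != new_node.get("vertexType"):
            continue
        # a single cheap comparison; a large distance here rules the pair out
        if cheap_distance(new_node.get("name", ""), candidate.get("name", "")) <= max_distance:
            canopy.append(candidate)
    return canopy
```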
For each potentially matching node that remains, calculate the distance for each of their properties. There are many approaches to finding these distances, and the distance metric used may vary based on the data type and the field's meaning (a sketch of two such metrics follows the list below).
- Token distance - Token distance compares two multi-word strings by breaking them into individual words and comparing the counts of words in each string. (This is sometimes described as a "bag of words.") This can be expanded to consider word frequency and misspellings in the final distance. It is best suited to reasonably long sections of text, such as a description field.
- Character distance - Character distance, in the simplest case, is the "edit distance" or "Levenshtein distance" between two strings - the total number of insert, delete, or replace operations needed to transform one string into another. This can be expensive, but some optimizations are possible. There are numerous variations on this basic approach, such as giving different weights to the different operations, reducing the cost of adjacent insertions, or varying the cost based on position within the string. This can also include varying the cost based on the specific substitution performed, to account for misspellings and phonetic similarity. One interesting approach is to break the strings down into "q-grams" (overlapping substrings of some fixed length) and then find the token distance between the sets of q-grams using one of the techniques from the token-distance item above.
- Numeric distance - The techniques for finding the distance between numeric fields are generally much simpler than the above categories. In most cases, this is simply the difference between the values. However, in the literature numeric fields are often simply treated as strings, and one of the above methods is used.
- Domain-specific distance - This involves finding a distance based on some domain specific rules. For example, if a field contained a log level (Emergency, Alert, Critical, Error, Warning, Notice, Info, Debug) then "Debug" may have a distance of 1 from "Info", and a distance of 4 from "Error".
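For concreteness, here is a small Python sketch of the first two metrics in their simplest form (a plain bag-of-words token distance and an unweighted Levenshtein distance); this is a textbook illustration rather than the project's implementation.

```python
from collections import Counter

# Bag-of-words token distance: count the words the two strings do not share.
def token_distance(a, b):
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    return sum(((ca - cb) + (cb - ca)).values())

# Unweighted Levenshtein (edit) distance via dynamic programming.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        curr = [i]
        for j, ch_b in enumerate(b, 1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("mountd", "mounted"))                                           # 1
print(token_distance("buffer overflow in mountd", "buffer overflow in NFS mountd"))  # 1
```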
Choosing a suitable distance metric depends on the data type and the meaning of the field, but it also depends on the types of errors anticipated. Most of the literature focuses on human error, such as typos, misspellings, and inconsistent representation (e.g. "Avenue" vs. "Ave."). In our case, we anticipate that most errors will originate in the text extraction process, and handling these types of errors has been little studied.
After all property distances have been found, they should be combined to determine which nodes match overall. Again, there are many techniques available to achieve this. Most or all of them can be extended with a "reject region" for nodes that are too uncertain to be automatically labeled a match or a non-match; these are instead added to a queue for further (generally manual) review.
- Probabilistic approaches - There are many approaches that find the probability of a node matching based on the probability of the pairs of fields matching. This requires either learning or estimating these probabilities for each field. Some approaches add an adjustable cost factor, which is useful in cases where false positives and false negatives have different impacts on the use of the data.
- Supervised and semi-supervised approaches - If labeled training data is available, a variety of supervised and semi-supervised machine learning techniques can be applied, using the list of distances and/or the node properties as the input vector. Examples include Support Vector Machines (SVM), clustering approaches, and graph partitioning approaches. Note that some of these are intended to find groups of matching entries, rather than matching pairs as in our case.
- Unsupervised approaches - These generally rely on clustering to find groups of similar nodes. In some cases there is an additional step to review and label these clusters, and the labeled clusters are then sometimes used to "bootstrap" a different approach.
- Active-learning approaches - These are similar to the approaches above, but they make use of the fact that most cases are either obvious matches or obvious non-matches. They find the relatively few ambiguous cases, prompt for human labeling, and then adjust their parameters as needed. These approaches seem promising, but somewhat less studied than the previous two categories.
- Distance-based approaches - These approaches also make use of the fact that most non-matching nodes are very distant (a "sparse neighborhood") and that matching nodes tend to be few and close (a "compact set"). In the simplest case, this involves computing a node distance as a weighted sum of the field distances and comparing that distance against some threshold (see the sketch after this list). However, the problem then becomes finding suitable weights for each field and an appropriate threshold for a match, which tends to lead back to the approaches above.
- Rule-based approaches - These approaches are based on constructing domain-specific rules that must be satisfied for a match, often expressed in some domain-specific language. They tend to be highly accurate, but they require a large amount of manual effort from a domain expert to create and troubleshoot the rules. One interesting approach uses labeled training data to generate lists of candidate rules, which are then reviewed and adjusted by a domain expert.
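A minimal Python sketch of the distance-based variant with a reject region; the weights, thresholds, and return labels are illustrative assumptions, not tuned values.

```python
# Illustrative distance-based matcher with a reject region. Weights and
# thresholds are placeholders, not tuned values from the project.
def classify_pair(property_distances, weights,
                  match_threshold=0.2, nonmatch_threshold=0.6):
    total_weight = sum(weights[p] for p in property_distances)
    node_distance = sum(weights[p] * d for p, d in property_distances.items()) / total_weight
    if node_distance <= match_threshold:
        return "match"
    if node_distance >= nonmatch_threshold:
        return "non-match"
    return "reject-region"   # too uncertain: queue for manual review

# Example usage with made-up distances in [0, 1]:
print(classify_pair({"name": 0.1, "description": 0.3},
                    weights={"name": 0.7, "description": 0.3}))   # "match"
```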
All of these approaches are adopted from record matching in conventional databases, which is a well-studied problem. Unfortunately, there is still no overall best approach for that problem; instead, the choice is highly dependent on the domain, on the data, and on what (if any) training data or domain expertise is available.
Another consideration is that these approaches vary greatly in speed, so a suitable choice will depend on the fraction of nodes that must be matched with this process, the number of potential matches in the canopy for each node, and the overall rate of incoming data vs. available resources.
Trivial cases: adding a new node, adding a new group of nodes/edges, adding the same new node multiple times.
Old State:
{
"vertices": [
{
"_id": "CVE-1999-0002",
"_type": "vertex",
"source": "CVE",
"description": "Buffer overflow in NFS mountd gives root access to remote attackers, mostly in Linux systems.",
"references": [
"CERT:CA-98.12.mountd",
"http://www.ciac.org/ciac/bulletins/j-006.shtml",
"http://www.securityfocus.com/bid/121",
"XF:linux-mountd-bo"
],
"status": "Entry",
"score": 1.0
}
]
}
Update:
{
"vertices": [
{
"_id": "CVE-1999-0002",
"_type": "vertex",
"source": "CVE",
"description": "",
"references": [
"ftp://patches.sgi.com/support/free/security/advisories/19981006-01-I"
],
"status": "Entry",
"score": 1.0
}
]
}
New State: combine list of references
{
"vertices": [
{
"_id": "CVE-1999-0002",
"_type": "vertex",
"source": "CVE",
"description": "Buffer overflow in NFS mountd gives root access to remote attackers, mostly in Linux systems.",
"references": [
"ftp://patches.sgi.com/support/free/security/advisories/19981006-01-I",
"CERT:CA-98.12.mountd",
"http://www.ciac.org/ciac/bulletins/j-006.shtml",
"http://www.securityfocus.com/bid/121",
"XF:linux-mountd-bo"
],
"status": "Entry",
"score": 1.0
}
]
}
Old State:
{
"vertices": [
{
"_id": "CVE-2013-4878",
"_type": "vertex",
"vertexType": "vulnerability",
"source": "NVD",
"description": "The default configuration of Parallels Plesk Panel 9.0.x and 9.2.x on UNIX, and Small Business Panel 10.x on UNIX, has an improper ScriptAlias directive for phppath, which makes it easier for remote attackers to execute arbitrary code via a crafted request.",
"publishedDate": "2013-07-18T12:51:56.227-04:00",
"modifiedDate": "2013-07-18T12:51:56.227-04:00",
"score": 1.0,
"cweNumber": "CWE-264",
"cvssScore": 6.8,
"accessVector": "NETWORK",
"accessComplexity": "MEDIUM",
"accessAuthentication": "NONE",
"confidentialityImpact": "PARTIAL",
"integrityImpact": "PARTIAL",
"availabilityImpact": "PARTIAL",
"cvssDate": "2013-07-19T16:37:00.000-04:00"
}
]
}
Update:
{
"vertices": [
{
"_id": "CVE-2013-4878",
"_type": "vertex",
"vertexType": "vulnerability",
"source": "NVD",
"description": "The default configuration of Parallels Plesk Panel 9.0.x and 9.2.x on UNIX, and Small Business Panel 10.x on UNIX, has an improper ScriptAlias directive for phppath, which makes it easier for remote attackers to execute arbitrary code via a crafted request, a different vulnerability than CVE-2012-1823.",
"publishedDate": "2013-07-18T12:51:56.227-04:00",
"modifiedDate": "2013-07-19T16:51:21.577-04:00",
"score": 1.0,
"cweNumber": "CWE-264",
"cvssScore": 6.8,
"accessVector": "NETWORK",
"accessComplexity": "MEDIUM",
"accessAuthentication": "NONE",
"confidentialityImpact": "PARTIAL",
"integrityImpact": "PARTIAL",
"availabilityImpact": "PARTIAL",
"cvssDate": "2013-07-19T16:37:00.000-04:00"
}
]
}
New State: overwrite the old description with the new one, like the resolve_property_with_newest example above.
{
"vertices": [
{
"_id": "CVE-2013-4878",
"_type": "vertex",
"vertexType": "vulnerability",
"source": "NVD",
"description": "The default configuration of Parallels Plesk Panel 9.0.x and 9.2.x on UNIX, and Small Business Panel 10.x on UNIX, has an improper ScriptAlias directive for phppath, which makes it easier for remote attackers to execute arbitrary code via a crafted request, a different vulnerability than CVE-2012-1823.",
"publishedDate": "2013-07-18T12:51:56.227-04:00",
"modifiedDate": "2013-07-19T16:51:21.577-04:00",
"score": 1.0,
"cweNumber": "CWE-264",
"cvssScore": 6.8,
"accessVector": "NETWORK",
"accessComplexity": "MEDIUM",
"accessAuthentication": "NONE",
"confidentialityImpact": "PARTIAL",
"integrityImpact": "PARTIAL",
"availabilityImpact": "PARTIAL",
"cvssDate": "2013-07-19T16:37:00.000-04:00"
}
]
}
Matching IDs, one node, data from two sources: update a property based on confidence scores, and maintain timestamps as appropriate.
Old State:
{
"vertices": [
{
"_id": "CVE-2013-5217",
"_type": "vertex",
"vertexType": "vulnerability",
"source": "Something that isn't NVD",
"description": "HP System Management Homepage (SMH) before 7.2.1 allows remote attackers to bypass intended access restrictions and obtain sensitive information via unspecified vectors, a different vulnerability than CVE-2013-2355.",
"publishedDate": "2013-07-22T07:19:33.783-04:00",
"modifiedDate": "2013-07-26T00:00:00.000-04:00",
"score": 0.6
}
]
}
Update:
{
"vertices": [
{
"_id": "CVE-2013-5217",
"_type": "vertex",
"vertexType": "vulnerability",
"source": "NVD",
"description": "** REJECT ** DO NOT USE THIS CANDIDATE NUMBER. ConsultIDs: CVE-2012-5217. Reason: This candidate is a duplicate of CVE-2012-5217. A typo caused the wrong ID to be used. Notes: All CVE users should reference CVE-2012-5217 instead of this candidate. All references and descriptions in this candidate have been removed to prevent accidental usage.",
"publishedDate": "2013-07-22T07:20:46.637-04:00",
"modifiedDate": "2013-07-22T07:20:47.053-04:00",
"score": 1.0
}
]
}
New State: overwrite the less reliable description with the more reliable one, like the resolve_property_by_confidence example above.
{
"vertices": [
{
"_id": "CVE-2013-5217",
"_type": "vertex",
"vertexType": "vulnerability",
"source": "NVD",
"description": "** REJECT ** DO NOT USE THIS CANDIDATE NUMBER. ConsultIDs: CVE-2012-5217. Reason: This candidate is a duplicate of CVE-2012-5217. A typo caused the wrong ID to be used. Notes: All CVE users should reference CVE-2012-5217 instead of this candidate. All references and descriptions in this candidate have been removed to prevent accidental usage.",
"publishedDate": "2013-07-22T07:19:33.783-04:00",
"modifiedDate": "2013-07-26T00:00:00.000-04:00",
"score": 1.0
}
]
}