The String Similarity on Flink Project from the Big Data Praktikum @ UNI Leipzig, SS2016
Required parameters:
--process default
--inputCsv path/to/concept_attribute.csv
Description: imports concept_attribute.csv, keeps only the attributes named 'label', maps each one to {id, value} and prints the result
Required parameters:
--process createCompareCsv
--inputCsv path/to/concept_attribute.csv
--outputCsv path/to/output.csv
--removeBrackets [true|false]
Description:
- imports the concept_attribute.csv
- filters only attributes of name 'label'
- maps id and value, e.g. {1,'label',Leipzig,string} is mapped to {1,Leipzig}
- cleans up the string: if removeBrackets is set to true, all brackets and their content are deleted
- builds the cartesian product with itself
- builds a strict upper triangular matrix by keeping only the tuples where id1 < id2
- writes the result as CSV
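The steps above can be sketched in plain Python. This is only an illustration of the described data flow, not the project's Flink code: the sample rows, the function names `clean` and `create_compare_pairs`, the bracket types that are removed, and the output tuple layout (id1, label1, id2, label2) are all assumptions.

```python
import re
from itertools import product

# Hypothetical rows in the concept_attribute.csv column order:
# (entity id, property name, property value, property type)
rows = [
    (1, "label", "Leipzig", "string"),
    (2, "label", "Leipzig (Sachsen)", "string"),
    (2, "population", "560000", "int"),
    (3, "label", "Halle [Saale]", "string"),
]

def clean(value, remove_brackets):
    """Optionally delete bracketed segments, then normalise whitespace."""
    if remove_brackets:
        # Assumption: "brackets" means (), [] and {} without nesting.
        value = re.sub(r"\([^)]*\)|\[[^\]]*\]|\{[^}]*\}", "", value)
    return " ".join(value.split())

def create_compare_pairs(rows, remove_brackets=True):
    # Keep only 'label' attributes and map them to (id, value).
    labels = [(rid, clean(val, remove_brackets))
              for rid, name, val, _type in rows if name == "label"]
    # Cartesian product with itself, keeping only pairs with id1 < id2.
    return [(a_id, a, b_id, b)
            for (a_id, a), (b_id, b) in product(labels, labels)
            if a_id < b_id]

pairs = create_compare_pairs(rows)
```

With n labels the filter leaves n*(n-1)/2 pairs, so every unordered pair of distinct ids is compared exactly once.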
Required parameters:
--process calculateSimilarity
--inputCsv path/to/crossMerged.csv
--outputDir path/to/output/directory
--algorithms stringCompare,stringCompareNgram,flinkSortMerge,sortMerge,simmetrics
String similarity can be calculated with several algorithms/techniques, listed above. This parameter is a comma-separated list that controls which of them are used. By default, all are executed.
Optional parameters:
--threshold X.XX
Only tuples with a Dice similarity >= X.XX are collected in the result dataset.
--tokenizeDigits Y
Size of an n-gram. Y = 3 by default.
Description:
- imports the crossMerged.csv
- calculates the Dice similarity using the selected algorithms, the threshold and the n-gram size
- outputs one *.csv file per algorithm with tuple format {(int) id_a, (string)label_a, (int)id_b, (string)label_b, (float)similarity_value}
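As a rough illustration of the calculation, the following Python sketch computes a Dice coefficient on character n-grams and applies a threshold. It is not taken from the project: the helper names `ngrams`, `dice` and `similar_pairs`, the lower-casing, the use of n-gram sets without padding, and the handling of empty strings are assumptions, and the individual algorithms named under --algorithms may tokenize differently.

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of text (lower-cased)."""
    text = text.lower()
    if len(text) < n:
        # Assumption: a string shorter than n is treated as a single token.
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def dice(a, b, n=3):
    """Dice coefficient 2*|A & B| / (|A| + |B|) on n-gram sets."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 0.0  # assumption: two empty strings score 0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))

def similar_pairs(pairs, threshold=0.0, n=3):
    """Keep (id_a, label_a, id_b, label_b, score) with score >= threshold."""
    result = []
    for id_a, label_a, id_b, label_b in pairs:
        score = dice(label_a, label_b, n)
        if score >= threshold:
            result.append((id_a, label_a, id_b, label_b, score))
    return result
```

Here `n` plays the role of --tokenizeDigits and `threshold` the role of --threshold. For example, "Leipzig" has 5 trigrams and "Leipziger" has 7, of which 5 are shared, so the score is 2*5/(5+7).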
concept.csv columns: entity id, uri, source
concept_attribute.csv columns: entity id, property name, property value, property type
linksWithIDs.csv columns: source entity id, target entity id