-
Notifications
You must be signed in to change notification settings - Fork 15
Algorithm of dividing text separators into groups
Main goal of this algorithm is to divide all text separators into 2 groups. One of the groups is Group1 and other is Group2. But the specific of this algorithm is that it's not known: which of groups is Group1 and which is Group2.
The algorithm contains 2 steps:
For each text separator graph of connections with other characters should be build. Weight of connections with other characters equals to the times when other character was observer as a first character that start the token after token that ends with the text splitter character.
This could be illustrated with example.
Input text: "Hello! My name is: Max. I live in Kiev. My age is 18 years. This summer i'll go to Paris! And what about you?" Input text separators: !:.?
Connections that would be extracted from this text:
! connected with:
- M - 1
- A - 1
: connected with:
- M - 1
. connected with:
- I - 1
- M - 1
- T - 1
This graphs should be converted from absolute value to probability:
! connected with:
- M - 1/2 = 0.5
- A - 1/2 = 0.5
: connected with:
- M - 1/1 = 1.0
. connected with:
- I - 1/3 = 0.(3)
- M - 1/3 = 0.(3)
- T - 1/3 = 0.(3)
After creating this graphs for all of the text separators they can be merged into 2 graphs
For merging graphs algorithm of two graphs comparison should be defined.
Let's illustrate algorithm of two graphs comparison on the next example. In input next 2 graph:
! connected with:
- M - 0.5
- A - 0.5
. connected with:
- I - 0.3
- M - 0.3
- T - 0.3
! && . connected with:
- M
- A
- I
- T
connection value should be equal to the module of the difference between the two values, divided to the maximum value
! && . connected with:
- M - |0.3 - 0.5| / 0.5 = 0.2 / 0.5 = 0.4
- A - |0.0 - 0.5| / 0.5 = 0.5 / 0.5 = 1.0
- I - |0.0 - 0.3| / 0.3 = 0.3 / 0.3 = 1.0
- T - |0.0 - 0.3| / 0.3 = 0.3 / 0.3 = 1.0
For calculating graphs distances average value of merged graph connections should be calculated: (0.4 + 1.0 + 1.0 + 1.0) / 4.0 = 3.4 / 4.0 = 0.85
Graphs distances would be one minus previous value:
1.0 - 0.85 = 0.15
With distances calculation algorithm all graphs can be decided onto 2 clusters. For this 2 graphs that have lowest distance between each other should be obtained. This two graphs would create 2 base cluster. Each next graph should be compared with each of this two clusters and merged two the most closest cluster.
Let's illustrate this ob previous example:
dist('.', '!') = 0.15 dist('.', ':') = 1.0 - 0.43 = 0.57 dist('!', ':') = 1.0 - 0.75 = 0.25
two base cluster would be: . and ! because this two graphs have lowest distance between each other: 0.15
dist('.', ':') = 0.57 dist('!', ':') = 0.25
: should be merged with .
As a result we have 2 clusters:
- cluster1: .:
- cluster2: !
As can be seen, algorithm need big text in the output or the quality of result would be low. Also this algorithm works much better with languages that are case sensitive. Algorithm shows some quality degradation on case-unsensative texts.