Algorithm for dividing text separators into groups
The main goal of this algorithm is to divide all text separators into 2 groups, Group1 and Group2. A peculiarity of the algorithm is that it does not determine which of the resulting groups is Group1 and which is Group2.
The algorithm consists of 2 steps:
- Create a graph of connections for each text separator;
- Merge all graphs into 2 graphs.
For each text separator, a graph of its connections with other characters should be built. The weight of a connection to a character equals the number of times that character was observed as the first character of the token that follows a token ending with the text separator.
This can be illustrated with an example.
Input text: "Hello! My name is: Max. I live in Kiev. My age is 18 years. This summer i'll go to Paris! And what about you?"
Input text separators: !:.?
Connections that would be extracted from this text (the final ? is not followed by any token, so it gets no connections):
! connected with:
- M - 1
- A - 1
: connected with:
- M - 1
. connected with:
- I - 1
- M - 1
- T - 1
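The counting step above can be sketched in Python. This is only a minimal illustration, not the project's implementation: the function name `build_connections` is hypothetical, and it assumes that tokens are delimited by whitespace and that a token "ends with" a separator when its last character is one.

```python
from collections import Counter, defaultdict

def build_connections(text, separators):
    """For each separator, count the first characters of the tokens that follow it."""
    tokens = text.split()  # assumption: tokens are whitespace-delimited
    connections = defaultdict(Counter)
    # Walk over pairs of neighbouring tokens (current token, following token).
    for current, following in zip(tokens, tokens[1:]):
        last_char = current[-1]
        if last_char in separators:
            connections[last_char][following[0]] += 1
    return {sep: dict(counts) for sep, counts in connections.items()}

text = ("Hello! My name is: Max. I live in Kiev. My age is 18 years. "
        "This summer i'll go to Paris! And what about you?")
connections = build_connections(text, "!:.?")
print(connections)
```

Under these assumptions the result matches the lists above; the ? separator is absent because no token follows it.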
These graphs should then be converted from absolute counts to probabilities:
! connected with:
- M - 1/2 = 0.5
- A - 1/2 = 0.5
: connected with:
- M - 1/1 = 1.0
. connected with:
- I - 1/3 = 0.(3)
- M - 1/3 = 0.(3)
- T - 1/3 = 0.(3)
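The conversion to probabilities amounts to dividing each weight by the total weight of its separator's graph. A minimal sketch follows, assuming each graph is stored as a dictionary of counts; the function name `to_probabilities` and this data layout are illustrative assumptions, not part of the original description.

```python
def to_probabilities(connections):
    """Normalize each separator's connection counts so that they sum to 1."""
    probabilities = {}
    for sep, counts in connections.items():
        total = sum(counts.values())
        probabilities[sep] = {char: count / total for char, count in counts.items()}
    return probabilities

# Counts taken from the example above.
counts = {
    "!": {"M": 1, "A": 1},
    ":": {"M": 1},
    ".": {"I": 1, "M": 1, "T": 1},
}
probabilities = to_probabilities(counts)
print(probabilities)
```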
After these graphs have been created for all of the text separators, they can be merged into 2 graphs.