Skip to content

Algorithm of dividing text separators into groups

b0noI edited this page Oct 19, 2014 · 12 revisions

Definitions

Preamble

Main goal of this algorithm is to divide all text separators into 2 groups. One of the groups is Group1 and other is Group2. But the specific of this algorithm is that it's not known: which of groups is Group1 and which is Group2.

Algorithm

The algorithm contains 2 steps:

  1. Create graph of text separators connections;
  2. Merge all graphs into 2 graphs;

Create graph of text separators connections

For each text separator graph of connections with other characters should be build. Weight of connections with other characters equals to the times when other character was observer as a first character that start the token after token that ends with the text splitter character.

This could be illustrated with example.

Input text: "Hello! My name is: Max. I live in Kiev. My age is 18 years. This summer i'll go to Paris! And what about you?" Input text separators: !:.?

Connections that would be extracted from this text:

! connected with:

  • M - 1
  • A - 1

: connected with:

  • M - 1

. connected with:

  • I - 1
  • M - 1
  • T - 1

This graphs should be converted from absolute value to probability:

! connected with:

  • M - 1/2 = 0.5
  • A - 1/2 = 0.5

: connected with:

  • M - 1/1 = 1.0

. connected with:

  • I - 1/3 = 0.(3)
  • M - 1/3 = 0.(3)
  • T - 1/3 = 0.(3)

After creating this graphs for all of the text separators they can be merged into 2 graphs

Merging all of graphs into 2

For merging graphs algorithm of two graphs comparison should be defined.

Let's illustrate algorithm of two graphs comparison on the next example. In input next 2 graph:

! connected with:

  • M - 0.5
  • A - 0.5

. connected with:

  • I - 0.3
  • M - 0.3
  • T - 0.3

Step 1 creating merged graph

! && . connected with:

  • M
  • A
  • I
  • T

Step 2 calculating connections values for merged graph

connection value should be equal to the module of the difference between the two values, divided to the maximum value

! && . connected with:

  • M - |0.3 - 0.5| / 0.5 = 0.2 / 0.5 = 0.4
  • A - |0.0 - 0.5| / 0.5 = 0.5 / 0.5 = 1.0
  • I - |0.0 - 0.3| / 0.3 = 0.3 / 0.3 = 1.0
  • T - |0.0 - 0.3| / 0.3 = 0.3 / 0.3 = 1.0

Step 3 calculating graphs distances with merged graphs values

For calculating graphs distances average value of merged graph connections should be calculated: (0.4 + 1.0 + 1.0 + 1.0) / 4.0 = 3.4 / 4.0 = 0.85

Graphs distances would be one minus previous value:

1.0 - 0.85 = 0.15


With distances calculation algorithm all graphs can be decided onto 2 clusters. For this 2 graphs that have lowest distance between each other should be obtained. This two graphs would create 2 base cluster. Each next graph should be compared with each of this two clusters and merged two the most closest cluster.

Let's illustrate this ob previous example:

calculating all distances

dist('.', '!') = 0.15 dist('.', ':') = 1.0 - 0.43 = 0.57 dist('!', ':') = 1.0 - 0.75 = 0.25

creating two base cluster

two base cluster would be: . and ! because this two graphs have lowest distance between each other: 0.15

choosing cluster for :

dist('.', ':') = 0.57 dist('!', ':') = 0.25

: should be merged with .

result

As a result we have 2 clusters:

  • cluster1: .:
  • cluster2: !

Algorithm limitation

As can be seen, algorithm need big text in the output or the quality of result would be low. Also this algorithm works much better with languages that are case sensitive. Algorithm shows some quality degradation on case-unsensative texts.