
Algorithm of dividing text separators into groups

b0noI edited this page Oct 19, 2014 · 12 revisions

Definitions

Preamble

The main goal of this algorithm is to divide all text separators into 2 groups, Group1 and Group2. A peculiarity of the algorithm is that it does not determine which of the resulting groups is Group1 and which is Group2.

Algorithm

The algorithm contains 2 steps:

  1. Create a graph of connections for each text separator;
  2. Merge all graphs into 2 graphs.

Create graph of text separators connections

For each text separator, a graph of connections with other characters should be built. The weight of a connection to a character equals the number of times that character was observed as the first character of the token that follows a token ending with the text separator.

This can be illustrated with an example.

Input text: "Hello! My name is: Max. I live in Kiev. My age is 18 years. This summer i'll go to Paris! And what about you?"

Input text separators: !:.?

Connections that would be extracted from this text (? has no connections, because no token follows it):

! connected with:

  • M - 1
  • A - 1

: connected with:

  • M - 1

. connected with:

  • I - 1
  • M - 1
  • T - 1
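The counting step can be sketched in Python as follows. This is only a minimal illustration, not the page author's implementation: the function name `build_connections` is hypothetical, and the sketch assumes that tokens are obtained by splitting the text on whitespace and that a separator counts only when it is the last character of a token.

```python
from collections import Counter, defaultdict


def build_connections(text, separators):
    """Count, per separator, the first character of the token that follows it.

    Assumption: tokens are whitespace-delimited, and a separator is only
    considered when it is the last character of a token.
    """
    connections = defaultdict(Counter)
    tokens = text.split()
    # Walk over each pair of adjacent tokens (current, following).
    for current, following in zip(tokens, tokens[1:]):
        last_char = current[-1]
        if last_char in separators:
            connections[last_char][following[0]] += 1
    # Convert to plain dicts for easier inspection.
    return {sep: dict(counts) for sep, counts in connections.items()}


text = ("Hello! My name is: Max. I live in Kiev. My age is 18 years. "
        "This summer i'll go to Paris! And what about you?")
connections = build_connections(text, "!:.?")
print(connections)
```

With the example text, the resulting counts match the lists above, and `?` does not appear in the result because it ends the last token.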

These graphs should then be converted from absolute counts to probabilities:

! connected with:

  • M - 1/2 = 0.5
  • A - 1/2 = 0.5

: connected with:

  • M - 1/1 = 1.0

. connected with:

  • I - 1/3 = 0.(3)
  • M - 1/3 = 0.(3)
  • T - 1/3 = 0.(3)
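The conversion to probabilities can be sketched as dividing each weight by the sum of the weights of that separator. The snippet below is an illustrative sketch only; the helper name `to_probabilities` and the dictionary layout are assumptions, and the counts are copied from the example.

```python
def to_probabilities(counts):
    """Normalize absolute connection counts so that they sum to 1."""
    total = sum(counts.values())
    return {char: n / total for char, n in counts.items()}


# Absolute counts taken from the example text.
connections = {
    "!": {"M": 1, "A": 1},
    ":": {"M": 1},
    ".": {"I": 1, "M": 1, "T": 1},
}

probabilities = {sep: to_probabilities(counts)
                 for sep, counts in connections.items()}
print(probabilities)
```

Each per-separator graph now sums to 1, which makes graphs built from different numbers of observations comparable.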

After these graphs have been created for all of the text separators, they can be merged into 2 graphs.

Merging all of the graphs into 2