-
Notifications
You must be signed in to change notification settings - Fork 15
Algorithm of tokens separator extraction
b0noI edited this page Dec 20, 2014
·
5 revisions
WARNING this wiki is deprecated, new wiki is here and on our [new site](http://aif.io/
- [tokens separator](https://github.com/b0noI/AIF2/wiki/Main-definitions#tokens separator)
The process of tokens separator extraction counts each unique character in the text. One rule is to be applied during calculation: if a unique character occurs more than once in a row, this occurrence is counted as one.
This approach does not work under the following conditions:
- input text is small;
- language of input text doesn't have a token separator (some languages don't use token separators at all or in some cases).
Example of input: Token1 token2 eeee4 .
Counting result:
'T' - 1
't' - 1
'o' - 2
'k' - 2
'e' - 3
'n' - 2
'1' - 1
'2' - 1
'4' - 1
' ' - 3
'.' - 1