v.1.0.0 Work with sentences
WARNING: this wiki is deprecated. The new wiki is here and on our [new site](http://aif.io/).
This page describes the functions that AIF2 can perform with sentences.
On the sentence level AIF provides the following functions:
- extract splitter characters from tokens list (like: ;./?'"!()[]);
- divide splitter characters into groups (like: GROUP1 [()!?.], GROUP2 [,:;]);
- define splitter characters group that contains sentence splitter characters (like: GROUP1 [()!?.]);
- split tokens into sentences.
This function extracts the list of splitter characters from the input tokens list. Use it through the ISeparatorExtractor interface (package: com.aif.language.sentence.separators.extractors). To create an instance, select an ISeparatorExtractor type and call getInstance() on it:
ISeparatorExtractor.Type.PROBABILITY.getInstance()
The currently supported types are:
- PROBABILITY
- PREDEFINED
For the difference between separator types see the section below
PREDEFINED: this extractor ignores the input tokens and returns a predefined array of characters.
List of predefined characters: '.', '!', '?', '(', ')', '[', ']', '{', '}', ';', ''', '"'
PROBABILITY: this is the default extractor type. It parses the input tokens and extracts the splitter characters from them.
Example of extracting separators:
final List<String> inputTokens = ...;
final ISeparatorExtractor separatorsExtractor = ISeparatorExtractor.Type.PROBABILITY.getInstance();
final Optional<List<Character>> optionalSeparators = separatorsExtractor.extract(inputTokens);
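The page does not describe how the PROBABILITY extractor decides which characters are separators. As a rough illustration of the general idea of deriving separators from the tokens themselves, here is a minimal sketch that counts non-alphanumeric characters at the end of tokens and keeps the frequent ones. The class name `TrailingCharacterExtractorSketch`, the `minShare` parameter, and the counting rule are all assumptions made for this example; this is not AIF's algorithm and does not use the AIF API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in, NOT part of AIF: a naive frequency-based extractor.
final class TrailingCharacterExtractorSketch {

    // Returns every non-alphanumeric character that ends at least
    // minShare (0.0 to 1.0) of all tokens, in character order.
    static List<Character> extract(final List<String> tokens, final double minShare) {
        final Map<Character, Integer> counts = new TreeMap<>();
        for (final String token : tokens) {
            if (token.isEmpty()) {
                continue;
            }
            final char last = token.charAt(token.length() - 1);
            if (!Character.isLetterOrDigit(last)) {
                counts.merge(last, 1, Integer::sum);
            }
        }
        final List<Character> separators = new ArrayList<>();
        for (final Map.Entry<Character, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= minShare * tokens.size()) {
                separators.add(entry.getKey());
            }
        }
        return separators;
    }
}
```

With the tokens `"Hello,", "world.", "It", "works.", "Really?", "Yes.", "a#b", "ok"` and `minShare = 0.25`, only `'.'` ends enough tokens to be kept; lowering `minShare` to `0.1` also keeps `','` and `'?'`.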
This function divides all separators into groups. Use it through the ISeparatorsGrouper interface (package: com.aif.language.sentence.separators.groupers). To create an instance, select an ISeparatorsGrouper type and call getInstance() on it:
ISeparatorsGrouper.Type.PROBABILITY.getInstance()
The currently supported types are:
- PROBABILITY
- PREDEFINED
For the difference between types see the section below
PREDEFINED: this grouper puts all separators that appear in the predefined list into one group and all other separators into a second group. The input tokens list is ignored.
The list of predefined characters: '.', '!', '?'
For example, if the input separators are: '.', '!', ',', ';', '(', ')'
then the output groups will be the following:
- '.', '!'
- ',', ';', '(', ')'
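The predefined grouping rule above can be sketched in a few lines of plain Java. This is only an illustration of the described behaviour, not AIF's implementation: the class name `PredefinedGrouperSketch` is hypothetical, and unlike ISeparatorsGrouper.group it takes no tokens argument, since the PREDEFINED type ignores the tokens anyway.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in, NOT part of AIF: mimics the PREDEFINED grouping rule.
final class PredefinedGrouperSketch {

    // The predefined characters listed on this page.
    private static final Set<Character> PREDEFINED =
            new HashSet<>(Arrays.asList('.', '!', '?'));

    // Returns two groups: separators found in PREDEFINED, then all the others.
    static List<Set<Character>> group(final List<Character> separators) {
        final Set<Character> predefinedGroup = new LinkedHashSet<>();
        final Set<Character> otherGroup = new LinkedHashSet<>();
        for (final Character separator : separators) {
            if (PREDEFINED.contains(separator)) {
                predefinedGroup.add(separator);
            } else {
                otherGroup.add(separator);
            }
        }
        final List<Set<Character>> groups = new ArrayList<>();
        groups.add(predefinedGroup);
        groups.add(otherGroup);
        return groups;
    }
}
```

Calling `PredefinedGrouperSketch.group(Arrays.asList('.', '!', ',', ';', '(', ')'))` reproduces the example: the first group holds '.' and '!', the second holds ',', ';', '(' and ')'.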
PROBABILITY: this is the default grouper type. It parses the input tokens and groups the separators according to statistical analysis.
Example of grouping separators:
final List<String> inputTokens = ...;
final List<Character> separators = ...;
final ISeparatorsGrouper separatorsGrouper = ISeparatorsGrouper.Type.PROBABILITY.getInstance();
final List<Set<Character>> separatorsGroups = separatorsGrouper.group(inputTokens, separators);
The group names that are used in the classifier:
- GROUP1 - group of separators that are used for splitting tokens into sentences (like: .?!);
- GROUP2 - group of separators that are used for splitting sentences into parts (like: ,;:).
This function classifies the separator groups. Use it through the ISeparatorGroupsClassificatory interface (package: com.aif.language.sentence.separators.classificators). To create an instance, select an ISeparatorGroupsClassificatory type and call getInstance() on it:
ISeparatorGroupsClassificatory.Type.PROBABILITY.getInstance()
The currently supported types are:
- PROBABILITY
- PREDEFINED
For the difference between types see the section below
PREDEFINED: this classificatory labels the group that contains the predefined character as GROUP1 and the other group as GROUP2. The input tokens list is ignored.
The predefined character: '.'
For example, if the input groups are: ['.', '!'], [',', ';', '(', ')']
then the output groups will be the following:
- GROUP1: ['.', '!']
- GROUP2: [',', ';', '(', ')']
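The predefined classification rule can be illustrated the same way. The sketch below is an assumption-laden stand-in, not AIF code: `PredefinedClassifierSketch` and its nested `Group` enum are hypothetical names (the enum only imitates ISeparatorGroupsClassificatory.Group), and merging every group without '.' into GROUP2 is a choice made for this example.

```java
import java.util.EnumMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in, NOT part of AIF: mimics the PREDEFINED classification rule.
final class PredefinedClassifierSketch {

    // Imitates ISeparatorGroupsClassificatory.Group for this example only.
    enum Group { GROUP1, GROUP2 }

    // The predefined character listed on this page.
    private static final char PREDEFINED = '.';

    // The group containing '.' becomes GROUP1; all other groups land in GROUP2.
    static Map<Group, Set<Character>> classify(final List<Set<Character>> groups) {
        final Map<Group, Set<Character>> result = new EnumMap<>(Group.class);
        for (final Set<Character> group : groups) {
            final Group key = group.contains(PREDEFINED) ? Group.GROUP1 : Group.GROUP2;
            result.computeIfAbsent(key, k -> new LinkedHashSet<>()).addAll(group);
        }
        return result;
    }
}
```

For the input groups ['.', '!'] and [',', ';', '(', ')'], the returned map holds ['.', '!'] under GROUP1 and [',', ';', '(', ')'] under GROUP2, matching the example.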
PROBABILITY: this is the default classificatory type. It parses the input tokens and classifies the separator groups according to statistical analysis.
An example of classifying separators groups:
final List<String> inputTokens = ...;
final List<Set<Character>> separatorsGroups = ...;
final ISeparatorGroupsClassificatory sentenceSeparatorGroupsClassificatory = ISeparatorGroupsClassificatory.Type.PROBABILITY.getInstance();
final Map<ISeparatorGroupsClassificatory.Group, Set<Character>> result = sentenceSeparatorGroupsClassificatory.classify(inputTokens, separatorsGroups);
To split sentences you need to:
- create AbstractSentenceSplitter instance (from package: com.aif.language.sentence.splitters);
- call a split method.
final AbstractSentenceSplitter sentenceSplitter = AbstractSentenceSplitter.Type.HEURISTIC.getInstance();
Supported types:
- HEURISTIC (default)
- SIMPLE
SIMPLE: this splitter extracts the GROUP1 separators from the text and splits the text at every GROUP1 separator. It is not smart, so a token like "Mr." is treated as the end of a sentence.
HEURISTIC: this splitter extracts the GROUP1 separators from the text and performs a smart split on them. It analyzes all occurrences of GROUP1 characters to distinguish cases like "Mr." from real sentence ends.
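To make the "Mr." limitation of a naive split concrete, here is a minimal sketch of splitting at every GROUP1 separator. It is not AIF's SIMPLE splitter: the class name `SimpleSplitterSketch` is hypothetical, the GROUP1 set is passed in by the caller instead of being extracted from the text, and the sketch assumes separators stay attached to the end of their tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in, NOT part of AIF: a naive split on GROUP1 separators.
final class SimpleSplitterSketch {

    // Closes the current sentence after every token that ends with a GROUP1 character.
    static List<List<String>> split(final List<String> tokens, final Set<Character> group1) {
        final List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (final String token : tokens) {
            if (token.isEmpty()) {
                continue;
            }
            current.add(token);
            if (group1.contains(token.charAt(token.length() - 1))) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        // Keep trailing tokens that were not closed by a separator.
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }
}
```

Given the tokens `"Mr.", "Smith", "left.", "He", "came", "back!"` and GROUP1 = {'.', '!', '?'}, this sketch yields three sentences, the first of which is just `"Mr."`. Avoiding that kind of false sentence end is what the heuristic approach is for.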
Once you have an AbstractSentenceSplitter instance, you can split tokens by calling the split method like this:
final ISplitter<List<String>, List<String>> sentenceSplitter = ...;
final List<String> tokens = ...;
final List<List<String>> sentences = sentenceSplitter.split(tokens);
An example of real usage can be found here