A research project implimentation for automatic text summarization. AllSummarizer is considered as an extractive method; Each sentence is scored based on some creteria, reorder the most scored ones then extract the first relevant ones.
You can find more about the method in the paper:
| Using clustering and a modified classification algorithm for automatic text summarization
------------ | -------------
Authors: | Abdelkrime Aries (A), Houda Oufaida (A) and Omar Nouali (B)
Affiliation: | A: Ecole Nationale Supérieue d'Informatique (ESI), Algeria;
B: Ctr. de recherche sur l'Information Scientifique et Technique (CERIST), Algeria
Booktitle: | Document Recognition and Retrieval XX. Proceedings of the SPIE, Volume 8658.
Date: | February 4, 2013
Address: | Burlingame, California, USA
Publisher: | SPIE
Link: | http://dx.doi.org/10.1117/12.2004001
Also, the participation of the system at MultiLing 2015 workshop:
| AllSummarizer system at MultiLing 2015: Multilingual single and multi-document summarization ------------ | ------------- Authors: | Abdelkrime Aries, Djamel Eddine Zegour and Khaled Walid Hidouci Affiliation: | Ecole Nationale Supérieue d'Informatique (ESI), Algeria Booktitle: | Proceedings of the SIGDIAL 2015 Conference, pages 237–244, , . 2015 Date: | September 2-4, 2015 Address: | Prague, Czech Republic Publisher: | Association for Computational Linguistics Link: | http://www.aclweb.org/anthology/W15-4634
TODO: add a brief description since there is a link to the paper
This project is dependent to other projects:
- KToolJa: for file management and plugins
- LangPi: for text preprocessing; which depends on other libraries
Preprocessing plugins are in the folder: "preProcess". For Hebrew and Tai preprocessing tools, check LangPi releases. Those two plugins are not Apache2 licensed.
To execute from command line:
- Jar file: java -jar <jar_name> options
- Class: java kariminf.as.ui.MonoDoc options
- -i <input_file>: it must be a file or a folder if it is multidocument or variant inputs
- -o <output_file>: it must be a file or a folder if it is multidocument or there is multiple output lengths, feature combinations or thresholds
- -v: variant inputs; a folder that contains files or folders to be summarized.
sumary unit:
- -b: we use Bytes to specify the summary size.
- -c: we use characters to specify the summary size.
- -w: we use words to specify the summary size.
- -s: we use sentences to specify the summary size.
sumary length:
- -n : defines the number of units to be extracted.
- -r : ratio from 1 to 100% defines the percentage of units to be extracted. you can specify more than one length, by separating the lengths with semicolons
- -f : the features used to score the sentences. the features are separated by commas; for example: tfu,pos for multiple combinations, we use semicolons; for example: tfu,pos;tfb,len
- -t : a number from 0 to 100 to specify the threshold of clustering. for multiple thresholds, we use semicolons; for example: 5;50
To get help, use -h
Suppose we have a folder for inputs called "exp":
exp
├── multi
│ ├── M001
│ │ ├── M0010.english
│ │ ├── M0011.english
│ │ └── M0012.english
│ └── M002
│ ├── M0020.english
│ ├── M0021.english
│ └── M0022.english
└── single
├── doc1.txt
└── doc2.txt
the command:
-i "exp/single" -o "exp/output" -l en -t "5-15:5" -n "100;200" -c -f "tfu,pos;tfb,rleng" -v
gives these files:
doc1.txt_0.05_Pos-TFU_100c.txt doc1.txt_0.1_Pos-TFU_100c.txt doc2.txt_0.15_Pos-TFU_100c.txt
doc1.txt_0.05_Pos-TFU_200c.txt doc1.txt_0.1_Pos-TFU_200c.txt doc2.txt_0.15_Pos-TFU_200c.txt
doc1.txt_0.05_RLeng-TFB_100c.txt doc1.txt_0.1_RLeng-TFB_100c.txt doc2.txt_0.15_RLeng-TFB_100c.txt
doc1.txt_0.05_RLeng-TFB_200c.txt doc1.txt_0.1_RLeng-TFB_200c.txt doc2.txt_0.15_RLeng-TFB_200c.txt
doc1.txt_0.15_Pos-TFU_100c.txt doc2.txt_0.05_Pos-TFU_100c.txt doc2.txt_0.1_Pos-TFU_100c.txt
doc1.txt_0.15_Pos-TFU_200c.txt doc2.txt_0.05_Pos-TFU_200c.txt doc2.txt_0.1_Pos-TFU_200c.txt
doc1.txt_0.15_RLeng-TFB_100c.txt doc2.txt_0.05_RLeng-TFB_100c.txt doc2.txt_0.1_RLeng-TFB_100c.txt
doc1.txt_0.15_RLeng-TFB_200c.txt doc2.txt_0.05_RLeng-TFB_200c.txt doc2.txt_0.1_RLeng-TFB_200c.txt
the command:
-i "exp/single/doc1.txt" -o "exp/output" -l en -t 5 -r "5;10" -c -f "tfu,pos"
gives these files:
doc1.txt_0.05_Pos-TFU_10%c.txt doc1.txt_0.05_Pos-TFU_5%c.txt
the command:
-i "exp/multi" -o "exp/output" -l en -t 5 -r "5;10" -c -f "tfu,pos" -v -m
gives these files:
M001_0.05_Pos-TFU_10%c.txt M001_0.05_Pos-TFU_5%c.txt
M002_0.05_Pos-TFU_10%c.txt M002_0.05_Pos-TFU_5%c.txt
The code is released under Apache 2.0 license. For more details about this license, check LICENSE file