- You know basic Python 2.7
- You know basic SQL syntax
- You know Chinese
- Python 2.7
- Python 2.7 libraries:
- jieba==0.38
- MySQL-python==1.2.5
- xmldict==0.4.1
- mysql client Ver 14.14 (tested on 14.14; nearby higher or lower versions should also work)
- Tested on Mac OS; it may also work on Linux and Windows.
- Wikipedia data: Wikipedia provides complete content dumps for anyone interested in reusing its content. The dump used here is the 20141009 snapshot. We only need the article text in its latest revision, not the full edit history, so from the download page we pick zhwiki-20141009-pages-articles-multistream.xml, which is about 4.9 GB; the compressed version can be downloaded [here](http://download.wikipedia.com/zhwiki/20141009/zhwiki-20141009-pages-articles-multistream.xml.bz2). If this snapshot is no longer available, you can use a dump from another date instead, e.g. zhwiki-20160501-pages-articles-multistream.xml.bz2 from 20160501.
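A minimal Python 2.7 sketch of fetching and unpacking the dump; it assumes the mirror URL above is still reachable and that you have disk space for both the archive and the decompressed XML:

```python
# -*- coding: utf-8 -*-
# Hedged sketch (Python 2.7): download and decompress the zhwiki dump.
# The URL is the mirror given above; swap in another date/mirror if needed.
import bz2
import urllib

DUMP_URL = ("http://download.wikipedia.com/zhwiki/20141009/"
            "zhwiki-20141009-pages-articles-multistream.xml.bz2")
ARCHIVE = "zhwiki-20141009-pages-articles-multistream.xml.bz2"
XML_FILE = "zhwiki-20141009-pages-articles-multistream.xml"

urllib.urlretrieve(DUMP_URL, ARCHIVE)        # download the compressed archive

# Stream-decompress so the ~4.9 GB XML never has to fit in memory at once.
reader = bz2.BZ2File(ARCHIVE)
with open(XML_FILE, "wb") as out:
    for chunk in iter(lambda: reader.read(1024 * 1024), ""):
        out.write(chunk)
reader.close()
```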
- Question set data: the question set used here comes from the Chinese intelligent question answering assignment of 万小军's Semantic Computing and Knowledge Mining course [5]. It contains 100 questions with answers, located at data/question_with_answer_100.xml, and 10000 questions without answers, located at data/question_without_answer_10000.xml.
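The exact schema of these XML files is not documented here, so before writing any parsing code it can help to peek at the structure. A minimal Python 2.7 sketch that assumes nothing about the tag names:

```python
# -*- coding: utf-8 -*-
# Hedged sketch: inspect the question set XML structure without assuming
# any particular tag names.
import xml.etree.ElementTree as ET

tree = ET.parse("data/question_with_answer_100.xml")
root = tree.getroot()
print root.tag, "with", len(root), "children"
for child in list(root)[:3]:                 # look at the first few entries
    print " ", child.tag, "->", [c.tag for c in child]
```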
- Get the raw data from Wikipedia; see the data section above for details. This is also a good time to look at the raw data format.
- Unzip the raw data to get one big XML file, then convert the XML into SQL scripts using the method described here (mwdum.py is recommended); this takes almost 30 minutes. Create the wikipedia database with the script ExportMysqlDataToFile/sql/wikipedia_create_table.sql (see here for the Wikipedia database table structure), then run the SQL scripts generated from the XML, which takes almost another 30 minutes.
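A hedged sanity check after the import. It assumes the generated SQL follows the standard MediaWiki layout (i.e. a page table exists) and that the database is reachable as local root with an empty password, as in the config.py settings shown later in this README; adjust to your setup:

```python
# -*- coding: utf-8 -*-
# Hedged sketch: verify the wikipedia database import, assuming the standard
# MediaWiki layout (a `page` table) and local root access with no password.
import MySQLdb

conn = MySQLdb.connect(host="127.0.0.1", user="root", passwd="",
                       db="wikipedia", charset="utf8")
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM page")     # should be non-zero after import
print "pages imported:", cur.fetchone()[0]
cur.close()
conn.close()
```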
- Clean the data and export the useful parts to the file system using the ExportMysqlDataToFile module.
- Build the retrieval system using the DocsRetrieveSystem module.
- The system is now ready to answer questions. Change directory to QuestionAnalysis, then run
python main.py
You can see the question answering system's performance on the 100 sample questions in data/question_with_answer_100.xml. The performance is really bad, but you can still learn a lot from the project code.
- Update the ExportMysqlDataToFile/config.py file:
- Update the DB connection info:
class DB:
    host = "127.0.0.1"
    username = "root"
    password = ""
    database = "wikipedia"
- Update project_base_path (where the ExportMysqlDataToFile module is located) and data_base_path (where you want the files exported from the MySQL database to be written); a hedged example of the edited config.py is sketched below.
- Delete the log.txt files.
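A minimal sketch of what the edited ExportMysqlDataToFile/config.py might look like. Whether project_base_path and data_base_path are plain module-level strings is an assumption, and both path values are placeholders:

```python
# -*- coding: utf-8 -*-
# Hedged sketch of an edited ExportMysqlDataToFile/config.py; both path
# values below are placeholders, use your own locations.
class DB:
    host = "127.0.0.1"
    username = "root"
    password = ""
    database = "wikipedia"

# where the ExportMysqlDataToFile module is located (placeholder path)
project_base_path = "/path/to/project/ExportMysqlDataToFile"

# where the files exported from MySQL should be written (placeholder path)
data_base_path = "/path/to/wiki_data"
```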
- Change directory to ExportMysqlDataToFile, then extract the data to the file system by running
python main.py extract_to_file_system
After almost 5 minutes you can check the data_base_path directory; it now contains a temp directory with the wiki page content. Then split the pages into fragments by running
python main.py split_to_fragment
You can stop this script at any time; the progress is stored in the log.txt file, so the next time you run python main.py split_to_fragment it will continue from the previous progress. This takes nearly 1 hour, after which you will see a fragment directory in your data_base_path (a rough check is sketched below).
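A rough, hedged check of the split step. It assumes the fragments end up as entries directly under data_base_path/fragment; the module's actual output layout may differ:

```python
# -*- coding: utf-8 -*-
# Hedged sketch: rough check of the export; assumes fragments are stored as
# entries directly under <data_base_path>/fragment, which the actual module
# may organize differently.
import os

data_base_path = "/path/to/wiki_data"        # placeholder, match your config
fragment_dir = os.path.join(data_base_path, "fragment")
print "entries under fragment/:", len(os.listdir(fragment_dir))
```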
-
then you can delete the
temp
folder in yourdata_base_path
path, you also can remove thewikipedia
database -
done
- Run the SQL script DocsRetrieveSystem/db/create_table.sql to create the tables for the docs retrieve system. Make sure the wiki_search database does not exist beforehand; after running the script you will see an empty database named wiki_search.
- Make sure the database is empty by running the SQL (a sketch for doing the same from Python follows below):
truncate table wiki_doc_term;
truncate table wiki_term;
truncate table wiki_doc;
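If you prefer to reset the tables from Python rather than a MySQL shell, a minimal sketch doing the same truncation (same local-root connection assumption as before):

```python
# -*- coding: utf-8 -*-
# Hedged sketch: reset the retrieval tables from Python instead of a MySQL
# shell; the connection settings are assumed to be the same local root setup.
import MySQLdb

conn = MySQLdb.connect(host="127.0.0.1", user="root", passwd="", db="wiki_search")
cur = conn.cursor()
for table in ("wiki_doc_term", "wiki_term", "wiki_doc"):
    cur.execute("TRUNCATE TABLE " + table)
cur.close()
conn.close()
```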
- Change directory to DocsRetrieveSystem and build the index by running
python main.py do_index_to_database
This may take more than 10 hours; it is pretty slow.
- You will then see a lot of data in the wiki_search database. Keep both the fragment directory in your data_base_path and the data in the database; do not remove them. If you want to search for fragments about 天安门, you can run
python main.py search 天安门
to check that the DocsRetrieveSystem works (a segmentation sketch follows below).
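Retrieval over Chinese text needs word segmentation, and jieba is in the requirements list, so presumably it is what segments documents and queries. A minimal sketch of segmenting a query before term lookup; how DocsRetrieveSystem actually tokenizes may differ:

```python
# -*- coding: utf-8 -*-
# Hedged sketch: segment a query with jieba before looking terms up in the
# wiki_term / wiki_doc_term tables. This only illustrates the segmentation
# step; the project's own tokenization call may differ.
import jieba

query = u"天安门广场"
terms = list(jieba.cut_for_search(query))    # search-oriented segmentation
print "/".join(terms).encode("utf-8")
```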
Of the 100 questions in data/question_with_answer_100.xml, the document retrieval system successfully retrieved a document containing the answer for 33. From those 33 document lists, the candidate sentence selection step extracted a candidate sentence containing the correct answer for 11, a success rate of 33.3% (11/33). The final answer extraction step then pulled the answer out of the candidate sentences; 5 of the extracted results matched the gold answer, an extraction success rate of 45.5% (5/11).
Final end-to-end answer accuracy: 5% (5/100).
Corpus size: almost 200k pages, 720k fragments.
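The reported numbers chain together as follows; a quick Python 2.7 check of the arithmetic:

```python
# The pipeline numbers reported above, chained together.
questions = 100       # questions in data/question_with_answer_100.xml
docs_hit = 33         # questions whose answer-bearing document was retrieved
sentence_hit = 11     # of those, candidate sentences containing the answer
answer_hit = 5        # final extracted answers matching the gold answer

print "sentence selection: %.1f%%" % (100.0 * sentence_hit / docs_hit)    # 33.3%
print "answer extraction:  %.1f%%" % (100.0 * answer_hit / sentence_hit)  # 45.5%
print "end to end:         %.1f%%" % (100.0 * answer_hit / questions)     # 5.0%
```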
- Sproat, R. and Emerson, T. The First International Chinese Word Segmentation Bakeoff[A]. In: Proceedings of the Second SIGHAN Workshop on Chinese Language Processing[C]. Sapporo, Japan: July 11-12, 2003, 133-143.
- 黄昌宁,赵海. 中文分词十年回顾[J]. 中文信息学报,2007,03:8-19.
- 结巴中文分词[EB/OL]https://github.com/fxsjy/jieba,2015.05.09/2015.05.09
- The Integration of Lexical Knowledge and External Resources for Question Answering
- Web Data Mining 2014 Fall – PKU » 互联网数据挖掘[EB/OL]http://www.icst.pku.edu.cn/lcwm/course/WebDataMining2014/
- 黄翼彪. 开源中文分词器的比较研究[D].郑州大学,2013.
- Help:Formatting – MediaWiki[EB/OL]http://www.mediawiki.org/wiki/Help:Formatting,2015.04.12/2015.05.03
- Manual:Database layout[EB/OL]http://www.mediawiki.org/wiki/Manual:Database_layout,2015.03.12/2015.05.09
- Data dumps/xml2sql[EB/OL]http://meta.wikimedia.org/wiki/Data_dumps/xml2sql, 2013.3.5/2015.05.09
- Robertson, Stephen, and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond. [M] Now Publishers Inc, 2009
- Singhal, Amit, Chris Buckley, and Mandar Mitra. "Pivoted document length normalization." [C] Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 1996.
- 维基百科:数据库下载[EB/OL]http://zh.wikipedia.org/wiki/Wikipedia:数据库下载, 2014.10.15/2015.05.09
- 下载页面[EB/OL] http://download.wikipedia.com/zhwiki/20141009/zhwiki-20141009-pages-articles-multistream.xml.bz2,2015.05.09/2015.05.09
- 张宇,刘挺,文勖. 基于改进贝叶斯模型的问题分类[J]. 中文信息学报,2005,02:100-105.
- mwdum.py[EB/OL] https://github.com/nutztherookie/mwdum.py.git/