Home
Welcome to the InformationExtraction wiki!
- Overview
- Architecture
- System and Process Flow
- DataSet
- Model Training
- Evaluation
- Getting Started
- More Info
The project extracts organizational hierarchy, names, titles, and business units from unstructured documents, such as web pages crawled from specific URLs, using Information Extraction techniques. For example:
- Ternary relation: T = (people, job, company)
- Input: Web page saved in HDFS, e.g. "John Smith is the CEO at Inc. Corp"
- Output: Structured data in a data store and accessible via UI, e.g. (John Smith, CEO, Inc. Corp)
More specifically, it is a Relation Extraction problem (https://en.wikipedia.org/wiki/Relationship_extraction). We formulate it as a classification problem in a discriminative framework.
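To make the classification framing concrete, here is a minimal Python sketch (not the project's Scala/Java code) of how one sentence can be turned into candidate instances for a classifier. The entity type names, the hard-coded mentions, and the single toy feature are all assumptions made for illustration; the project's real feature set is not described on this page.

```python
from itertools import product

# Hypothetical entity mentions for one sentence, as (text, type) pairs.
# In a real pipeline these would come from an NER step.
sentence = "John Smith is the CEO at Inc. Corp"
mentions = [
    ("John Smith", "PERSON"),
    ("CEO", "TITLE"),
    ("Inc. Corp", "ORGANIZATION"),
]

def candidate_triples(mentions):
    """Enumerate every (person, title, organization) combination."""
    people = [text for text, kind in mentions if kind == "PERSON"]
    titles = [text for text, kind in mentions if kind == "TITLE"]
    orgs = [text for text, kind in mentions if kind == "ORGANIZATION"]
    return list(product(people, titles, orgs))

def features(sentence, triple):
    """Toy feature: the text between the person and the title.

    Assumes the person appears before the title in the sentence.
    """
    person, title, _org = triple
    start = sentence.index(person) + len(person)
    end = sentence.index(title)
    return {"between_person_title": sentence[start:end].strip()}

candidates = candidate_triples(mentions)
feature_dicts = [features(sentence, c) for c in candidates]
```

A discriminative classifier would then receive each feature dictionary and decide whether the corresponding candidate triple expresses a true relation.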
System Flow:
Process Flow:
This dataset is built for model training and evaluation.
The leadership board pages of more than 300 companies are crawled using Jsoup.
The crawler script reads a list of company URLs and returns the cleaned page content, with each person's profile on a single line. For example:
Tim Cook, CEO
Angela Ahrendts, Senior Vice President Retail and Online Stores
Eddy Cue, Senior Vice President Internet Software and Services
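As an illustration of how such one-profile-per-line output can be consumed, the following Python sketch splits each line into a name and a title string. It assumes the name never contains a comma, which holds for the examples above but is not guaranteed by the crawler; `parse_profile` is a hypothetical helper, not part of the project.

```python
def parse_profile(line):
    """Split a 'Name, Title' line at the first comma."""
    name, _, title = line.partition(",")
    return name.strip(), title.strip()

lines = [
    "Tim Cook, CEO",
    "Angela Ahrendts, Senior Vice President Retail and Online Stores",
    "Eddy Cue, Senior Vice President Internet Software and Services",
]

profiles = [parse_profile(line) for line in lines]
```

Note that the title string still mixes the title and the department; separating them is what the labeling step is for.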
The raw crawled pages are stored in the folder data/evaluation/web; each subfolder contains one company.
To evaluate NER performance, we need a set of ground truth labels. Since no labeled data is available, some manual work is required. To speed up labeling, you can use LabelHelper.scala to produce automatically labeled results and then review them manually.
The entities Person, Title, and Department are labeled separately:
The Person is labeled as \t1[PersonName]\t
The Title is labeled as \t2[TitleName]\t
The Department is labeled as \t3[DepartmentName]\t
For example, the above data is labeled as follows:
\t1Tim Cook\t, \t2CEO\t
\t1Angela Ahrendts\t, \t2Senior Vice President\t \t3Retail and Online Stores\t
\t1Eddy Cue\t, \t2Senior Vice President\t \t3Internet Software and Services\t
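The labeling scheme can be read back mechanically. The Python sketch below assumes that `\t` in the examples stands for a literal tab character and that every entity is wrapped as tab, digit, text, tab; if the files store a literal backslash followed by `t`, the pattern would need adjusting. The function name is hypothetical.

```python
import re

# Digit-to-entity mapping as described above.
LABELS = {"1": "Person", "2": "Title", "3": "Department"}

# One labeled entity: TAB, a digit 1-3, the entity text, TAB.
ENTITY = re.compile(r"\t([123])([^\t]+)\t")

def parse_labeled_line(line):
    """Return the (entity type, text) pairs found in one labeled line."""
    return [(LABELS[digit], text.strip()) for digit, text in ENTITY.findall(line)]

line = "\t1Angela Ahrendts\t, \t2Senior Vice President\t \t3Retail and Online Stores\t"
entities = parse_labeled_line(line)
```

Text outside the tab-delimited spans, such as the comma after the name, is simply ignored.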
Manually labeled pages are placed in the folder data/evaluation/maunal; each subfolder contains one company.
Once the data is labeled, TabConverter.scala parses the labeled files and extracts a list of relations for each page. The results are stored in data/evaluation/extraction.
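The step from labeled entities to relations can be sketched as follows. This is not TabConverter.scala itself, whose output format is not shown here; it is a minimal Python illustration that assumes one relation per line and represents it as a hypothetical (person, title, department) tuple, with `None` for a missing department.

```python
def to_relation(entities):
    """Collapse one line's (type, text) pairs into (person, title, department).

    Keeps the first mention of each type; absent types become None.
    """
    found = {}
    for kind, text in entities:
        found.setdefault(kind, text)
    return (found.get("Person"), found.get("Title"), found.get("Department"))

# Entities for one page, one list per labeled line (values from the examples above).
page = [
    [("Person", "Tim Cook"), ("Title", "CEO")],
    [("Person", "Eddy Cue"), ("Title", "Senior Vice President"),
     ("Department", "Internet Software and Services")],
]

relations = [to_relation(entities) for entities in page]
```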
The training data is in the folder data/NERDepartment. It includes the data files and a properties file that specifies the meaning of the columns and which features to generate. The data files are in tab-separated columns, with at minimum the word tokens in one column and the class labels in another. [TrainModel.java](../blob/master/ie/src/main/java/com/intel/ie/TrainModel.java) parses the files and creates a new classifier. Training takes several minutes, and the new classifier is stored in model.
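For readers unfamiliar with this two-column layout, here is a small Python sketch that writes labeled segments as token/label rows. The label names (`PERSON`, `TITLE`), the `O` label for unlabeled text, and whitespace tokenization are assumptions for illustration; the actual files in data/NERDepartment may use different labels and a different tokenizer.

```python
def to_rows(segments):
    """Turn (text, label) segments into 'token<TAB>label' rows, one per token."""
    rows = []
    for text, label in segments:
        for token in text.split():
            rows.append(f"{token}\t{label}")
    return rows

# 'O' marks text that belongs to no entity (assumed convention).
segments = [("Tim Cook", "PERSON"), (",", "O"), ("CEO", "TITLE")]
rows = to_rows(segments)
```

Each row holds the word token in the first column and its class label in the second, which is the minimal shape described above.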
The training data is generated from our labeled data: RelationCorpExtractor.scala reads the labeled data and generates the training corpus in CoNLL format.
You can train a new model by using IntelKBPStatisticalExtractor.java.
The evaluation system consists of two parts: NER evaluation and relation extraction evaluation.
The evaluation metrics are precision and recall.
NerEvaluation.scala evaluates named entity recognition results. Before evaluating the NER model, you need to label your test data in a format similar to this.
You can follow the prompts of Label.scala to label your test file: type 1 for PERSON, 2 for TITLE, 3 for EMPLOYEE_OF, and c if you want to cancel the last label and re-type.
RelationEvaluation.scala evaluates relation extraction results. The dataset is described in the DataSet section.
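The precision and recall metrics used by both evaluations can be sketched in Python as follows. This is an illustration only, not the logic of NerEvaluation.scala or RelationEvaluation.scala; it assumes exact-match comparison between predicted and ground-truth items, and the example tuples are made up from the names above.

```python
def precision_recall(predicted, gold):
    """Exact-match precision and recall over two collections of items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

gold = {
    ("Tim Cook", "CEO"),
    ("Angela Ahrendts", "Senior Vice President"),
    ("Eddy Cue", "Senior Vice President"),
}
# Hypothetical system output: one correct pair, one with a wrong title.
predicted = {
    ("Tim Cook", "CEO"),
    ("Eddy Cue", "Vice President"),
}

precision, recall = precision_recall(predicted, gold)
```

Here one of the two predictions is correct (precision 1/2) and one of the three ground-truth pairs is recovered (recall 1/3).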
Required environment (either of the following):
- Scala 2.10.4 + Spark 1.6
- Scala 2.11.8 + Spark 2.0 (if using IntelliJ, add the lib folder, the Spark assembly jar file, and the Scala SDK to the project library)
Relation extraction has been studied for over two decades, and many toolsets are available:
- Stanford Relation Extraction: http://nlp.stanford.edu/software/relationExtractor.html
- mit-nlp/MITIE: https://github.com/mit-nlp/MITIE
- Alchemyapi: http://www.alchemyapi.com/products/alchemylanguage/relation-extraction
- GATE: https://gate.ac.uk/ie/