arxiv_miner/scraping_engine.py contains the classes that tap into the ArXiv feed and create ArxivRecord entries
in Elasticsearch. Storing records this way ensures that papers are re-mined only when necessary. Instructions to scrape data into Elasticsearch are provided here.
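The "re-mine only when necessary" idea can be illustrated with a small in-memory sketch. Everything in it is hypothetical: the `Record` and `RecordStore` classes, the `version` field, and the `needs_mining` method are stand-ins invented for illustration and are not part of arxiv_miner, which persists its `ArxivRecord` objects in Elasticsearch.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Record:
    # Hypothetical stand-in for an ArxivRecord; the field names are assumptions.
    arxiv_id: str
    version: int
    mined: bool = False


class RecordStore:
    """In-memory stand-in for the Elasticsearch index of scraped records."""

    def __init__(self) -> None:
        self._records: Dict[str, Record] = {}

    def upsert(self, arxiv_id: str, version: int) -> Record:
        existing = self._records.get(arxiv_id)
        # Keep the stored record (and its mined flag) when nothing changed.
        if existing is not None and existing.version >= version:
            return existing
        record = Record(arxiv_id=arxiv_id, version=version)
        self._records[arxiv_id] = record
        return record

    def mark_mined(self, arxiv_id: str) -> None:
        self._records[arxiv_id].mined = True

    def needs_mining(self, arxiv_id: str) -> bool:
        record = self._records.get(arxiv_id)
        return record is not None and not record.mined


store = RecordStore()
store.upsert("1706.03762", version=1)
first_pass = store.needs_mining("1706.03762")   # new record, not mined yet
store.mark_mined("1706.03762")
store.upsert("1706.03762", version=1)           # same version seen again in the feed
second_pass = store.needs_mining("1706.03762")
store.upsert("1706.03762", version=2)           # a newer version replaces the record
third_pass = store.needs_mining("1706.03762")
```

In this sketch a paper is queued for mining when it is first scraped or when a newer version appears, and skipped otherwise.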
arxiv_miner/mining_engine.py contains a process that mines the papers that have been scraped. paper.py contains the ArxivPaper class, which extracts a paper's LaTeX source repository from the remote source. Each LaTeX source repository is parsed with tex2py to create a "Structure Tree" of the research document. The Structure Tree captures how the LaTeX document is organized.
The Structure Tree is then used to create a Section object. More information about the Section object can be found in Core Structures. The text within each Section is populated using the opendetex library, which filters plain text out of individual tex files. A hacky algorithm based on the number of tex files correlates the text with the Structure Tree to create a single Section.
Instructions to mine papers after scraping and index them into Elasticsearch are provided here.
from arxiv_miner import ArxivPaper, ResearchPaperFactory

ROOT_DIRECTORY_TO_STORE_LATEX = './papers_root'
# This will extract the LaTeX source from ArXiv and parse the data into a `Section` object
paper = ArxivPaper.from_arxiv_id('1706.03762', ROOT_DIRECTORY_TO_STORE_LATEX, detex_path='<PATH_TO_DETEX_BINARY>')
# This will create a `ResearchPaper`
paperdoc = ResearchPaperFactory.from_arxiv_record(paper)
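To make the idea of a section tree more concrete, here is a minimal sketch of a nested section structure and a traversal over it. The `SectionNode` class and its fields (`name`, `text`, `subsections`) are assumptions made for illustration; they are not the library's `Section` class, whose actual definition is described in Core Structures.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class SectionNode:
    # Hypothetical stand-in for arxiv_miner's `Section`; the fields are assumptions.
    name: str
    text: str = ""
    subsections: List["SectionNode"] = field(default_factory=list)


def walk(node: SectionNode, depth: int = 0) -> Iterator[Tuple[int, SectionNode]]:
    """Yield (depth, node) pairs in document order."""
    yield depth, node
    for child in node.subsections:
        yield from walk(child, depth + 1)


paper_tree = SectionNode(
    name="Example Paper",
    subsections=[
        SectionNode("Introduction", text="Plain text with LaTeX markup removed."),
        SectionNode(
            "Method",
            subsections=[SectionNode("Architecture", text="Plain text of the subsection.")],
        ),
    ],
)

# Build an indented outline, two spaces per nesting level.
outline = [("  " * depth) + node.name for depth, node in walk(paper_tree)]
print("\n".join(outline))
```

The tree supplies the hierarchy, and the detexed text fills the `text` slot of each node; how the real library matches text to nodes is not shown here.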
arxiv_miner/database/elasticsearch.py contains the core methods for searching and aggregating data in Elasticsearch. Search and aggregation require two kinds of classes:
- A wrapper class over Elasticsearch that executes the search and aggregation queries: KeywordsTextSearch or ArxivElasticTextSearch
  - These classes contain methods that help retrieve information from the index containing the mined documents.
from arxiv_miner import KeywordsTextSearch

ELASTICARGS = dict(
    index_name=None,
    host='localhost',
    port=9200,
    auth=None
)
database = KeywordsTextSearch(**ELASTICARGS)
- A wrapper class that creates the search and aggregation queries from input: TextSearchFilter, DateAggregation and TermsAggregation
  - Search and aggregation wrappers are created using elasticsearch_dsl.
TextSearchFilter is the core wrapper for what to search, i.e. the core filters applied over the indexed data after mining.
from arxiv_miner import TextSearchFilter, DATE_FIELD_NAME, CATEGORY_FIELD_NAME, TEXT_HIGHLIGHT, CategoryFilterItem

text_filter = TextSearchFilter(
    id_vals=[],  # Explicitly filtering arxiv_id's in search/aggregation
    no_date=False,  # no_date : bool for not using date filter
    string_match_query="",  # Text filter opt
    text_filter_fields=[],  # Specific fields in search index to filter.
    start_date_key=None,  # start date filter opt
    end_date_key=None,  # end date filter opt
    date_filter_field=DATE_FIELD_NAME,  # date field upon which date filter will be applied
    category_filter=[],  # [CategoryFilterItem] Use `category_filter` or `multi_category_filter`
    category_filter_values=[],  # if len(category_filter) > 0 then category_filter_values is required
    category_field=CATEGORY_FIELD_NAME,
    category_match_type='AND',
    multi_category_filter=[],  # multi_category_filter : [[CategoryFilterItem]]
    sort_key=DATE_FIELD_NAME,  # Sort key upon which search results will be ordered
    sort_order='descending',
    highlights=TEXT_HIGHLIGHT,  # Search keys to highlight result fragments from
    highlight_fragments=60,
    source_fields=[],  # Particular fields to restrict search on
    # Page settings
    page_size=10,
    page_number=1,
    scan=False,  # Full dataset traversal key
    # If scan==True, then no pagination, else paginate
)
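As a rough illustration of what a filter like this has to express, the sketch below builds a plain Elasticsearch query body from a handful of the same options. It is not the library's implementation (which uses elasticsearch_dsl): `build_search_body` is a hypothetical helper, the `created_on` field name is an assumption, and the ISO date strings in the example are chosen for illustration only.

```python
from typing import Any, Dict, List, Optional


def build_search_body(
    string_match_query: str = "",
    text_filter_fields: Optional[List[str]] = None,
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
    date_field: str = "created_on",  # assumed field name, not taken from arxiv_miner
    sort_order: str = "descending",
    page_size: int = 10,
    page_number: int = 1,
) -> Dict[str, Any]:
    """Translate a few filter options into an Elasticsearch request body (hypothetical helper)."""
    must: List[Dict[str, Any]] = []
    filters: List[Dict[str, Any]] = []

    # Full-text part: scored, optionally restricted to specific fields.
    if string_match_query:
        match: Dict[str, Any] = {"query": string_match_query}
        if text_filter_fields:
            match["fields"] = text_filter_fields
        must.append({"multi_match": match})

    # Date part: a non-scoring range filter, added only if a bound is given.
    date_range: Dict[str, str] = {}
    if start_date:
        date_range["gte"] = start_date
    if end_date:
        date_range["lte"] = end_date
    if date_range:
        filters.append({"range": {date_field: date_range}})

    return {
        "query": {"bool": {"must": must, "filter": filters}},
        "sort": [{date_field: {"order": "desc" if sort_order == "descending" else "asc"}}],
        # Pages are 1-indexed, so page 1 starts at offset 0.
        "from": (page_number - 1) * page_size,
        "size": page_size,
    }


body = build_search_body(
    string_match_query="out of distribution generalization",
    start_date="2015-04-04",
    end_date="2021-04-04",
    page_size=100,
    page_number=3,
)
```

Putting the date range under `filter` rather than `must` keeps it out of relevance scoring, which is the usual choice for yes/no conditions.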
DateAggregation and TermsAggregation inherit from TextSearchFilter to create date and keyword aggregations for a particular search query.
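For orientation, a date aggregation and a terms aggregation over the results of a search can be written as a plain Elasticsearch request body like the one sketched below. `build_aggregation_body` is a hypothetical helper, the field names `created_on` and `category` are assumptions, and `calendar_interval` is the parameter name used by Elasticsearch 7 and later; the library's own classes build their requests through elasticsearch_dsl instead.

```python
from typing import Any, Dict


def build_aggregation_body(
    query: Dict[str, Any],
    date_field: str = "created_on",  # assumed field name
    term_field: str = "category",    # assumed field name
    interval: str = "month",
    top_n: int = 10,
) -> Dict[str, Any]:
    """Attach a date histogram and a terms aggregation to a query (hypothetical helper)."""
    return {
        "query": query,
        "size": 0,  # return only aggregation buckets, no individual hits
        "aggs": {
            "papers_over_time": {
                "date_histogram": {"field": date_field, "calendar_interval": interval}
            },
            "top_terms": {"terms": {"field": term_field, "size": top_n}},
        },
    }


agg_body = build_aggregation_body(
    {"match": {"text": "out of distribution generalization"}}
)
```

Because the aggregation reuses the same `query`, the buckets describe only the documents that match the search.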
Paginated style results or iterator style retrieval of saved ArXiv Papers From Elasticsearch
from arxiv_miner import KeywordsTextSearch, TextSearchFilter

ELASTICARGS = dict(
    index_name=None,
    host='localhost',
    port=9200,
    auth=None
)
database = KeywordsTextSearch(**ELASTICARGS)

# Pagination-style retrieval
text_filter = TextSearchFilter(
    string_match_query="out of distribution generalization",
    start_date_key='04/04/2015',
    end_date_key='04/04/2021',
    page_size=100,
)
paginated_search_results = database.text_search(text_filter)

# Iterator-style retrieval
scan_text_filter = TextSearchFilter(
    string_match_query="out of distribution generalization",
    start_date_key='04/04/2015',
    end_date_key='04/04/2021',
    scan=True
)
for doc in database.text_search_scan(scan_text_filter):
    handlestuff(doc)  # handlestuff is a placeholder for your own processing function
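The difference between the two retrieval styles can be sketched without a running cluster. The helper below turns page-by-page fetching into a single iterator; `iterate_pages`, `fake_fetch`, and `FAKE_INDEX` are hypothetical names used only for this illustration, and the fake backend stands in for a real paginated search call.

```python
from typing import Callable, Dict, Iterator, List, Sequence


def iterate_pages(
    fetch_page: Callable[[int, int], Sequence[Dict]],
    page_size: int = 100,
) -> Iterator[Dict]:
    """Yield documents one by one by requesting successive 1-indexed pages."""
    page_number = 1
    while True:
        page = fetch_page(page_number, page_size)
        if not page:
            return
        yield from page
        # A short page means the result set is exhausted.
        if len(page) < page_size:
            return
        page_number += 1


# Fake backend standing in for a paginated search call (assumption for the demo).
FAKE_INDEX: List[Dict] = [{"id": i} for i in range(250)]


def fake_fetch(page_number: int, page_size: int) -> Sequence[Dict]:
    start = (page_number - 1) * page_size
    return FAKE_INDEX[start:start + page_size]


docs = list(iterate_pages(fake_fetch, page_size=100))
print(len(docs))
```

In Elasticsearch itself, deep pagination with `from`/`size` is capped by the `index.max_result_window` setting (10,000 by default), which is one reason a scan-style traversal is the better fit for walking a full dataset.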