Core Components

Scraping

arxiv_miner/scraping_engine.py contains the classes that tap into the ArXiv feed and create an ArxivRecord in Elasticsearch for each paper. Storing the record first means that papers are only re-mined when necessary. Instructions to scrape data into Elasticsearch are provided here.
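The "store once, mine only when needed" idea can be illustrated with a small sketch. This is not the library's implementation: `RecordStore`, `add_if_new` and `pending` are hypothetical names, and a plain dictionary stands in for the Elasticsearch index.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RecordStore:
    """Hypothetical in-memory stand-in for the Elasticsearch index of records."""
    records: Dict[str, dict] = field(default_factory=dict)

    def add_if_new(self, arxiv_id: str, metadata: dict) -> bool:
        """Store a record only if the id is unseen; return True if it was added."""
        if arxiv_id in self.records:
            return False
        self.records[arxiv_id] = {"metadata": metadata, "mined": False}
        return True

    def pending(self) -> List[str]:
        """Ids that have been scraped but not yet mined."""
        return [i for i, rec in self.records.items() if not rec["mined"]]


store = RecordStore()
first = store.add_if_new("1706.03762", {"title": "Attention Is All You Need"})
second = store.add_if_new("1706.03762", {"title": "Attention Is All You Need"})  # duplicate, ignored
```

A miner built on such a store would only need to walk `store.pending()` and flip the `mined` flag after each paper is processed.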

Mining and Parsing

arxiv_miner/mining_engine.py contains the process that mines papers once they have been scraped. paper.py contains the ArxivPaper class, which downloads and extracts a paper's LaTeX source from the remote source. Each LaTeX source repository is parsed into a "Structure Tree" of the research document, built with tex2py. The Structure Tree captures the hierarchy of sections in the LaTeX document.
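To give a feel for what a structure tree holds, here is a minimal sketch that builds one from section headings. It deliberately uses a regular expression instead of tex2py so that it runs without extra dependencies; `Node` and `build_tree` are hypothetical names, and the real tree produced by the library is likely richer.

```python
import re
from dataclasses import dataclass, field
from typing import List

LEVELS = {"section": 1, "subsection": 2, "subsubsection": 3}
HEADING = re.compile(r"\\(section|subsection|subsubsection)\*?\{([^}]*)\}")


@dataclass
class Node:
    title: str
    level: int
    children: List["Node"] = field(default_factory=list)


def build_tree(latex: str) -> Node:
    """Nest headings by level using a stack of open ancestors."""
    root = Node("root", 0)
    stack = [root]
    for match in HEADING.finditer(latex):
        node = Node(match.group(2), LEVELS[match.group(1)])
        # Close every open heading at the same or a deeper level.
        while stack[-1].level >= node.level:
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
    return root


SAMPLE = r"""
\section{Introduction}
\section{Model}
\subsection{Attention}
\subsection{Feed-Forward}
\section{Results}
"""
tree = build_tree(SAMPLE)
top_level = [child.title for child in tree.children]
model_children = [child.title for child in tree.children[1].children]
```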

The Structure Tree is then used to create a Section object. More information about the Section object can be found in Core Structures. The text within each Section is populated using the opendetex library, which extracts plain text from the individual tex files. A rough heuristic based on the number of tex files matches this text to the Structure Tree to produce a single Section.
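The pairing of plain text with sections can be sketched for the simplest case of a single tex file. This is an illustration under assumptions: a crude regular expression stands in for opendetex, `split_sections` is a hypothetical helper, and the library's heuristic for multiple tex files is not reproduced here.

```python
import re
from typing import Dict

HEADING = re.compile(r"\\section\*?\{([^}]*)\}")
# Matches a command name and an optional [..] argument, e.g. \emph or \cite[p. 3]
COMMAND = re.compile(r"\\[a-zA-Z]+\*?(\[[^\]]*\])?")


def split_sections(latex: str) -> Dict[str, str]:
    """Map each \\section title to the roughly de-TeXed text that follows it."""
    sections = {}
    matches = list(HEADING.finditer(latex))
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(latex)
        body = latex[match.end():end]
        # Drop command names, then the braces that wrapped their arguments.
        plain = COMMAND.sub("", body).replace("{", "").replace("}", "")
        sections[match.group(1)] = " ".join(plain.split())
    return sections


SAMPLE = r"""
\section{Introduction}
We study \emph{attention} models.
\section{Results}
Scores improve \textbf{a lot}.
"""
sections = split_sections(SAMPLE)
```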

Instructions to mine papers after scraping and index them into Elasticsearch are provided here.

Standalone Paper Parsing

from arxiv_miner import ArxivPaper, ResearchPaperFactory

ROOT_DIRECTORY_TO_STORE_LATEX = './papers_root'
# This will extract the LaTeX source from ArXiv and parse it into a `Section` object
paper = ArxivPaper.from_arxiv_id(
    '1706.03762',
    ROOT_DIRECTORY_TO_STORE_LATEX,
    detex_path='<PATH_TO_DETEX_BINARY>',
)
# This will create a `ResearchPaper`
paperdoc = ResearchPaperFactory.from_arxiv_record(paper)

Storage And Search

arxiv_miner/database/elasticsearch.py contains the core methods for searching and aggregating data in Elasticsearch. Search and aggregation require two kinds of classes:

A wrapper class over Elasticsearch that executes the search and aggregation queries: KeywordsTextSearch or ArxivElasticTextSearch
- These classes contain methods that retrieve information from the index containing the mined documents.
```
from arxiv_miner import KeywordsTextSearch

ELASTICARGS = dict(
    index_name=None,
    host='localhost',
    port=9200,
    auth=None
)
database = KeywordsTextSearch(**ELASTICARGS)
```

A wrapper class that builds the search and aggregation queries from input: TextSearchFilter, DateAggregation, and TermsAggregation

Search and aggregation wrappers are built with elasticsearch_dsl.
TextSearchFilter is the core wrapper that defines what to search, i.e. the filters applied to the indexed data after mining.

from arxiv_miner import (
    TextSearchFilter,
    DATE_FIELD_NAME,
    CATEGORY_FIELD_NAME,
    TEXT_HIGHLIGHT,
    CategoryFilterItem,
)

text_filter = TextSearchFilter(
    id_vals=[],  # Explicit arxiv_ids to filter on in search/aggregation
    no_date=False,  # Set to True to skip the date filter
    string_match_query="",  # Optional text query
    text_filter_fields=[],  # Specific fields in the search index to match against
    start_date_key=None,  # Optional start date of the date filter
    end_date_key=None,  # Optional end date of the date filter
    date_filter_field=DATE_FIELD_NAME,  # Date field the date filter is applied to
    category_filter=[],  # [CategoryFilterItem]; use `category_filter` or `multi_category_filter`
    category_filter_values=[],  # Required when len(category_filter) > 0
    category_field=CATEGORY_FIELD_NAME,
    category_match_type='AND',
    multi_category_filter=[],  # [[CategoryFilterItem]]
    sort_key=DATE_FIELD_NAME,  # Key by which search results are ordered
    sort_order='descending',
    highlights=TEXT_HIGHLIGHT,  # Search keys to highlight result fragments from
    highlight_fragments=60,
    source_fields=[],  # Particular fields to restrict the search to
    # Page settings
    page_size=10,
    page_number=1,
    scan=False,  # Full dataset traversal: if True, results are not paginated
)
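To make the filter parameters more concrete, the sketch below shows the kind of Elasticsearch request body that a text query, a date range, a sort key and page settings could translate into. It is a plain-dictionary illustration, not the output of TextSearchFilter: `build_search_body` is a hypothetical helper, the field names `title`, `abstract` and `created_on` are made up, and the ISO date strings are an assumption about the index's date format.

```python
from typing import List, Optional


def build_search_body(query: str, fields: List[str], date_field: str,
                      start_date: Optional[str] = None,
                      end_date: Optional[str] = None,
                      page_size: int = 10, page_number: int = 1,
                      descending: bool = True) -> dict:
    """Assemble a bool query with a text match, a date range, sorting and paging."""
    if query:
        must = [{"multi_match": {"query": query, "fields": fields}}]
    else:
        must = [{"match_all": {}}]

    date_range = {}
    if start_date:
        date_range["gte"] = start_date
    if end_date:
        date_range["lte"] = end_date
    filters = [{"range": {date_field: date_range}}] if date_range else []

    return {
        "query": {"bool": {"must": must, "filter": filters}},
        "sort": [{date_field: {"order": "desc" if descending else "asc"}}],
        # Pages are 1-indexed, so page 1 starts at offset 0.
        "from": (page_number - 1) * page_size,
        "size": page_size,
    }


body = build_search_body(
    query="out of distribution generalization",
    fields=["title", "abstract"],   # hypothetical field names
    date_field="created_on",        # hypothetical field name
    start_date="2015-04-04",
    end_date="2021-04-04",
    page_size=10,
    page_number=3,
)
```

Keeping the text match in `must` and the date range in `filter` means only the text match contributes to relevance scoring, which is the usual split in a bool query.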

DateAggregation and TermsAggregation inherit from TextSearchFilter and create date and keyword aggregations for a particular search query.
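As a rough picture of what a date or terms aggregation adds on top of a search query, the sketch below attaches both to a request body. It is an assumption-laden illustration rather than the behaviour of DateAggregation or TermsAggregation: `with_aggregations` is a hypothetical helper, and the field names `created_on`, `abstract` and `keywords` are invented.

```python
def with_aggregations(query: dict, date_field: str, terms_field: str,
                      interval: str = "month", top_n: int = 10) -> dict:
    """Wrap a query clause in a request that returns only aggregation buckets."""
    return {
        "query": query,
        "size": 0,  # no hits needed, only the buckets
        "aggs": {
            # Document counts per calendar interval.
            # `calendar_interval` is the Elasticsearch 7+ spelling; older clusters use `interval`.
            "docs_over_time": {
                "date_histogram": {"field": date_field, "calendar_interval": interval}
            },
            # Most frequent values of a keyword field among the matching documents.
            "top_terms": {"terms": {"field": terms_field, "size": top_n}},
        },
    }


base_query = {"match": {"abstract": "out of distribution generalization"}}  # hypothetical field
agg_body = with_aggregations(base_query, date_field="created_on", terms_field="keywords")
```

Because the aggregations share the `query` clause, the buckets describe exactly the documents that the corresponding search would return.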

Standalone Usage

Saved ArXiv papers can be retrieved from Elasticsearch either as paginated results or through an iterator:

from arxiv_miner import KeywordsTextSearch, TextSearchFilter

ELASTICARGS = dict(
    index_name=None,
    host='localhost',
    port=9200,
    auth=None
)
database = KeywordsTextSearch(**ELASTICARGS)

# Pagination-style retrieval
text_filter = TextSearchFilter(
    string_match_query="out of distribution generalization",
    start_date_key='04/04/2015',
    end_date_key='04/04/2021',
    page_size=100,
)
paginated_search_results = database.text_search(text_filter)

# Iterator-style retrieval
scan_text_filter = TextSearchFilter(
    string_match_query="out of distribution generalization",
    start_date_key='04/04/2015',
    end_date_key='04/04/2021',
    scan=True
)
for doc in database.text_search_scan(scan_text_filter):
    handlestuff(doc)  # placeholder for your own processing function
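The difference between the two retrieval styles can be sketched without a running cluster. In this illustration a list stands in for the index, `fake_fetch` plays the role of a paginated search call, and `iterate_pages` turns repeated page requests into a single iterator. All three names are hypothetical, and this is not how `text_search_scan` is implemented.

```python
from typing import Callable, Iterator, List

DOCS = [{"id": i} for i in range(250)]  # stand-in for indexed papers


def fake_fetch(page_number: int, page_size: int) -> List[dict]:
    """Return one 1-indexed page of results, like a paginated search would."""
    start = (page_number - 1) * page_size
    return DOCS[start:start + page_size]


def iterate_pages(fetch_page: Callable[[int, int], List[dict]],
                  page_size: int = 100) -> Iterator[dict]:
    """Yield every document by requesting pages until one comes back short."""
    page_number = 1
    while True:
        page = fetch_page(page_number, page_size)
        if not page:
            return
        yield from page
        if len(page) < page_size:
            return
        page_number += 1


second_page = fake_fetch(2, 100)                 # pagination style: one page at a time
all_docs = list(iterate_pages(fake_fetch, 100))  # iterator style: full traversal
```

Pagination suits interactive views where only one page is shown at a time, while the iterator style suits batch jobs that need to visit every matching document.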
