diff --git a/LICENSE.txt b/LICENSE.txt new file mode 100644 index 0000000..301f1d6 --- /dev/null +++ b/LICENSE.txt @@ -0,0 +1,17 @@ +MIT License +Copyright (c) 2021 Valay Dave +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. \ No newline at end of file diff --git a/Readme.md b/Readme.md index 849e3b7..cfd2d2d 100644 --- a/Readme.md +++ b/Readme.md @@ -1,110 +1,44 @@ -# Arxiv Miner. +# ArXiv-Miner -Repository Helps Mine Arxiv Papers to quickly Scrape through new Papers and Mine data for Faster Readings. +> ArXiv Miner is a toolkit for mining research papers on CS ArXiv. -# BROADER GOAL -1. The goal of this project is to annotate and build faster search around research papers so that I can be quickly aware of what is happening in the domain. -2. It is also ment to structure research papers in searialisable JSON so that I can start annotating research and fixing things around the same. +## What is ArXiv-Miner -# How can One get there ? 
+`arxiv-miner` is a quick handy library that helps power [Sci-Genie](https://sci-genie.com). Sci-Genie is a search engine for quickly searching through the full text of papers on CS ArXiv. `arxiv-miner` helps extract and parse LaTeX documents from CS ArXiv. It also supports storage and search of those parsed documents using **Elasticsearch**. The library can also be applied to other domains such as Math, Physics, Biology, etc. -## ARXIV PAPER MINING +## Documentation +All documentation on how to install and use `arxiv-miner` is provided on the documentation website or inside the [docs folder](docs). Contribution guidelines are also provided there. -### GOAL OF PAPER MINING -Parse the Arxiv Latex/PDF into A research Paper Object which can be serialised so that It is in readable format for some form of Machine learning/Annoation methods. But it all starts from cleaning the Dirt from Arxiv. +## Why was ArXiv-Miner created? +ArXiv Miner was created for easily scraping, parsing and searching research content on ArXiv. This library was created after stitching together solutions from the code of various tools like [arxiv-sanity](https://github.com/karpathy/arxiv-sanity-preserver), [arxiv-vanity/engrafo](https://github.com/arxiv-vanity/engrafo), [arxivscraper](https://github.com/Mahdisadjadi/arxivscraper), [tex2py](https://github.com/alvinwan/tex2py), [cso-classifier](https://github.com/angelosalatino/cso-classifier/) and [axcell](https://github.com/paperswithcode/axcell). The parsed structure of the content can be useful in search or any scientific research mining/AI applications as a heuristic baseline. -### WAY TO DO IT -1. Extract Papers from `Arxiv` using `scrape_papers.py` script. The `ArxivDatabase` will hold the `ArxivRecord`s. -2. `mine_papers.py` will download the Latex version of the Papers for Arxiv and create and `ArxivRecord` object. -3. The `ArxivRecord` can is a base class to `ArxivPaper`. -4. 
The `ArxivPaper` Object helps extract the Latex source from the Arxiv and parses it. - - Three things will help solve the Information mining Problem. - 1. Extraction of Document Structure/hierarchy via Python-Latex Libraries like `tex2py`. - 2. Extraction of Text from Latex Document Using `detex` : https://github.com/pkubowicz/opendetex - 3. Collate with the Tree with the text based on hierachical traversal of tree and text-splittig based search to collate the information. - - These things are Managed using the child classes of `LatexInformationParser`. These child classes will help for the Structured `Section` objects which contains the stored parsed structure of the Research paper. -5. The Scaraped/Mined Papers are stored in a `fs` or `elasticsearch` based search engines. +## Core Components of ArXiv-Miner +- Scraping +- Parsing +- Indexing/Storage +## Family Of Projects With ArXiv-Miner +- `arxiv-table-miner` : Coming Soon. +- `arxiv-table-ml-models` : Coming Soon. +- `semantic-scholar-data-pipeline` : https://github.com/valayDave/semantic-scholar-data-pipeline -## Setup +## Disclaimer +This project was developed in a [cowboy coding](https://en.wikipedia.org/wiki/Cowboy_coding) style during the [COVID-19 pandemic](https://en.wikipedia.org/wiki/COVID-19_pandemic). Hence, it **may have bugs and may not be the most optimized code**. The primary reason for development was to aid CS and Machine Learning/AI research, but this tool can be extended to all 3M+ documents on ArXiv. -```sh -sh setup.sh -``` +## Call For Contributors +Contributions that improve the project or fix bugs are very welcome. Please read the contribution guide in the documentation. -### To Setup Ontology Miner: +## Credits and Appreciation +This project, like all others, has been built on the shoulders of giants. 
A big thanks to the creators of the following libraries/open source projects that aided the development of `arxiv-miner` and its family of projects: +- [arxiv-sanity](https://github.com/karpathy/arxiv-sanity-preserver) +- [arxiv-vanity/engrafo](https://github.com/arxiv-vanity/engrafo) +- [arxivscraper](https://github.com/Mahdisadjadi/arxivscraper) +- [tex2py](https://github.com/alvinwan/tex2py) +- [cso-classifier](https://github.com/angelosalatino/cso-classifier/) +- [axcell](https://github.com/paperswithcode/axcell) +- [elasticsearch](https://github.com/elastic/elasticsearch) +- [Semantic Scholar Open Research corpus](https://github.com/allenai/s2orc) +- [metaflow](https://metaflow.org) -```sh -sh cso_setup.sh -``` - -## What is Done Yet : - -1. Arxiv PDF and LateX Extraction Pipeline -2. Arxiv Paper Parsing to JSON Objects using Latex and Python. --> Latex Based Symantically parsed Data Extraction :: READY -3. Local Database Setup and Data Exploration. - -## What Needs to Be Done ? - -1. Data Extraction And Pasing System Are pretty Well set from Database. - 1. The Database Generation needs to move from Andrej's script to using the `arxivscraper` which uses the mass Metadata extraction. - -2. Final System : - - Scraping Crons - - Parsing Idempotent processes. - - TODO : Further parse - - ArxivRecord Database with `fs` | `elasticsearch` - - Search Interface - - Daily Update of New Research - - Search indexing for - - -# How Does it Work ? - -## Overview -- Parts of Current System : - - `ArxivDatabase` : Core class to expose base methods for interfacing with DB. It is an adapter that can work with an `filesystem` based database or `elasticsearch`. The purpose of the adapter is ment create an interopratable data layer that can switched according to requirement and need. - - Filesystem based DB uses `ArxivDatabaseService(rpyc.Service,ArxivFSDatabase)`. The `database_server.py` file helps create and FS based database server. 
- - `HarvestingProcess` : This uses a `ScrapingEngine` to extract `ArxivIdentity` from ArXiv API(`http://export.arxiv.org/oai2?verb=ListRecords`). - - The Data extracted is stored to the database as an `ArxivRecord`. - - `DailyHarvestationProcess` helps retrieve data daily papers. - - `MassHarvestationProcess` gets data based on DateRange. - - `MiningProcess`: Helps mine the papers for `LaTeX` information. The mined `ArxivRecord` is stored in the Database - -- The Database provides a Way to Create/Update `ArxivRecord`. The `ArxivRecord` contains an `ArxivIdentity` which is extracted using the `arxiv_miner.scraping_engine.ScrapingEngine`. `ArxivRecord` is the Fundamental Datastructure use to identify a research paper. `ArxivPaper` is a processing Object which can use a `ArxivRecord` to start the mining process. - -## Running the Damn Thing. -- The `config.py` file contains the `Config` Object which is Singleton used for configuration across the project. -- Start FS based Database Server with Below Command . The Database Server is responsible For Managing the data. Elasticsearch is also supported as a backend database. - ```sh - python database_server.py - ``` -- Start the Data Harvester according to your requirements. Can perform a `daily-harvest` or a `date-range` harvest. - ```sh - python scrape_papers.py --help - ``` - - DB adapters can be switched. The `--use_defaults` will load the defaults of `--datastore` from `Config`. - ```sh - python scrape_papers.py --datastore elasticsearch --host localhost --port 18861 daily-harvest - ``` -- Start the Miner To parallely start mining the Extracted data. - ```sh - python mine_papers.py --help - ``` - - The Miner has the same database cli adapter as Scraper. 
- ```sh - python mine_papers.py --datastore fs --use_defaults start-miner - ``` -- Source Harvest and Store to S3: - ```sh - nohup /home/ubuntu/arxiv-miner/.env/bin/python /home/ubuntu/arxiv-miner/mass_source_harvest.py --max-chunks 200 > /home/ubuntu/arxiv-miner/mass_harvet.log & - ``` - -- Extract EC2 instance List from AWS - ``` - aws ec2 describe-instances --region=us-east-1 --query 'Reservations[*].Instances[*].[InstanceId,Tags[?Key==`Name`].Value|[0],State.Name,PrivateIpAddress,PublicIpAddress]' --output table > instance_list.md - ``` -# TODO / VISION -1. Create a search interface for looking for research. -2. Get daily analytics of the new research coming out -3. Create reports and analytics for the new research +## Licence +MIT \ No newline at end of file diff --git a/arxiv_miner/__init__.py b/arxiv_miner/__init__.py index 28a2536..de3d473 100644 --- a/arxiv_miner/__init__.py +++ b/arxiv_miner/__init__.py @@ -8,11 +8,6 @@ ResearchPaper,\ ResearchPaperSematicParser -from .loader import \ - ArxivLoader,\ - ArxivLoaderFilter,\ - FSArxivLoadingFactory - from .record import \ ArxivIdentity,\ ArxivLatexParsingResult,\ @@ -22,8 +17,6 @@ ArxivSematicParsedResearch from .database import \ - ArxivFSDatabaseService,\ - ArxivDatabaseServiceClient,\ ArxivElasticSeachDatabaseClient,\ KeywordsTextSearch,\ TextSearchFilter,\ diff --git a/arxiv_miner/cli.py b/arxiv_miner/cli.py new file mode 100644 index 0000000..558a384 --- /dev/null +++ b/arxiv_miner/cli.py @@ -0,0 +1,62 @@ +''' +This is the Generalised CLI origin of the Project. +this will be used for the Extracting the Important CLI information such as Database +Selection etc. 
Can be used as a gateway to integrate all the submodules into one cli invocation +''' + +import click +from functools import wraps +import configparser +from .config import Config +from .database import SUPPORTED_DBS,get_database_client +import json + +DEFAULT_APP_NAME= 'ArXiv-Miner' + +def common_run_options(func): + db_defaults = Config.get_db_defaults() + @click.option('--host', default=db_defaults['host'], help='ArxivDatabase Host') + @click.option('--port', default=db_defaults['port'], help='ArxivDatabase Port') + @wraps(func) + def wrapper(*args, **kwargs): + return func(*args, **kwargs) + return wrapper + + +@click.group(invoke_without_command=True) +@click.option('--use_defaults',is_flag=True,help='Use Default Database Configurations For Chosen Datastore.') +@click.option('--with-config',default=None,help='Path to configuration ini file to use. Uses a configuration file for the instantiation of the database') +@common_run_options +@click.pass_context +def db_cli(ctx,use_defaults,with_config,host,port,app_name=DEFAULT_APP_NAME): + ctx.obj = {} + args , client_class = database_choice(use_defaults,with_config,host,port) + print_str = '\n %s Process Using %s Datastore'%(app_name,'elasticsearch') + args_str = ''.join(['\n\t'+ i + ' : ' + str(args[i]) for i in args]) + click.secho(print_str,fg='green',bold=True) + click.secho(args_str+'\n\n',fg='magenta') + ctx.obj['db_class'] = client_class + ctx.obj['db_args'] = args + + +def database_choice(use_defaults,with_config,host,port): + client_class = get_database_client('elasticsearch') + if with_config is not None: + config = configparser.ConfigParser() + config.read(with_config) + args = dict(index_name=config['elasticsearch']['index'], + host=config['elasticsearch']['host'] + ) + if 'port' in config['elasticsearch']: + args['port'] = config['elasticsearch']['port'] + if 'auth' in config['elasticsearch']: + args['auth'] = config['elasticsearch']['auth'].split(' ') + # get_database_client will raise error if 
some-one feeds BS DB + elif use_defaults: + args = Config.get_defaults('elasticsearch') + else: + args = dict(index_name=Config.elasticsearch_index,host=host,port=port) + return args, client_class + +if __name__ == '__main__': + db_cli() \ No newline at end of file diff --git a/arxiv_miner/config.py b/arxiv_miner/config.py new file mode 100644 index 0000000..64e4f07 --- /dev/null +++ b/arxiv_miner/config.py @@ -0,0 +1,32 @@ +# TODO : Move this to configuration format where the entire thing comes from a YML file +import os +# global settings +# ----------------------------------------------------------------------------- +class Config(object): + default_database = 'elasticsearch' + elasticsearch_port = 9200 + elasticsearch_host = 'localhost' + elasticsearch_index = 'arxiv_papers' + es_auth = None # should be a tuple + + # Object Store + bucket_name = 'arxiv-papers-source-bucket' + + @classmethod + def get_defaults(cls,db_str): + if db_str == 'elasticsearch': + return_dict = dict(\ + host=cls.elasticsearch_host,\ + port=cls.elasticsearch_port,\ + index_name = cls.elasticsearch_index) + + if cls.es_auth is not None: + return_dict['auth']=cls.es_auth + + return return_dict + else: + return None + + @classmethod + def get_db_defaults(cls): + return cls.get_defaults(cls.default_database) diff --git a/arxiv_miner/database/__init__.py b/arxiv_miner/database/__init__.py index 605b89b..3d3c084 100644 --- a/arxiv_miner/database/__init__.py +++ b/arxiv_miner/database/__init__.py @@ -11,12 +11,7 @@ FIELD_MAPPING,\ DATE_FIELD_NAME -from .filesystem import ArxivFSDatabase -from .proxy_service import \ - ArxivFSDatabaseService,\ - ArxivDatabaseServiceClient - -SUPPORTED_DBS = ['fs','elasticsearch'] +SUPPORTED_DBS = ['elasticsearch'] class DatabaseNotSupported(Exception): headline = 'DB_CLIENT_NOT_FOUND' @@ -29,7 +24,5 @@ def __init__(self,given_client): def get_database_client(client_name): if client_name not in SUPPORTED_DBS: raise DatabaseNotSupported(client_name) - if 
client_name == 'fs': - return ArxivDatabaseServiceClient elif client_name == 'elasticsearch': return KeywordsTextSearch diff --git a/arxiv_miner/database/elasticsearch.py b/arxiv_miner/database/elasticsearch.py index 5997e6a..596ad9d 100644 --- a/arxiv_miner/database/elasticsearch.py +++ b/arxiv_miner/database/elasticsearch.py @@ -56,7 +56,6 @@ def __init__(self,index_name=None,host='localhost',port=9200,auth=None): src_str = f'{host}' else: src_str = f'{host}:{port}' - if auth is None: self.es = elasticsearch.Elasticsearch(src_str,timeout=30, max_retries=10) else: @@ -811,18 +810,6 @@ def text_aggregation(self,agg_obj:Aggregation): return_buckets = agg_obj.transform_resp(aggregation_buckets) return return_buckets - # @async_wrap - # def async_text_search_scan(self,filter_obj:TextSearchFilter): - # return self.text_search_scan(filter_obj) - - # @async_wrap - # def async_text_aggregation(self,agg_obj:Aggregation): - # return self.text_aggregation(agg_obj) - - # @async_wrap - # def async_text_search(self,filter_obj:TextSearchFilter): - # return self.text_search(filter_obj) - class KeywordsTextSearch(ArxivElasticTextSearch): def __init__(self, **kwargs): super().__init__(**kwargs) diff --git a/arxiv_miner/database/filesystem.py b/arxiv_miner/database/filesystem.py deleted file mode 100644 index 448df3a..0000000 --- a/arxiv_miner/database/filesystem.py +++ /dev/null @@ -1,162 +0,0 @@ -""" -This Module is responsible for Working as a Generalised FS based Database for Scraping/Mining etc. -This uses the `ArxivDatabase` Adapter to create an FS driven DB. -""" -import os -from ..record import ArxivRecord,ArxivIdentity,ArxivPaperStatus -from ..utils import dir_exists,save_json_to_file,load_json_from_file -from ..paper import ArxivPaper -from ..logger import create_logger -from .core import ArxivDatabase - -class PaperMap: - """ - This Datastructure is responsible for Storing the Metadata - About the Scraping/Mining For the FS Database. 
- - `ArxivDatabase` is an adapter class so that one can Quickly switch from an FS based database to - """ - filename = 'paper_map.json' - paper_map = {} - unmined_set = set() - - def __init__(self,data_root_path,build_new=False): - self.papers_path = os.path.join(data_root_path,'papers') - self.root_path = os.path.join(data_root_path,'map') - self._init_paper_map(build_new=build_new) - - def save_map(self): - if not dir_exists(self.root_path): - os.makedirs(self.root_path) - save_json_to_file(self.to_json(),os.path.join(self.root_path,self.filename)) - - def _init_paper_map(self,build_new=False): - """_init_paper_map - Load Map from a Directory Or Build one from The papers Path - :param build_new [bool] : if True, Will initiate Proceess of Building from FS with paper_path Else it will use the map/paper_map.json - """ - if build_new: - self._load_map_from_fs() - return - map_path = os.path.join(self.root_path,self.filename) - if dir_exists(map_path): - json_map = load_json_from_file(map_path) - self.paper_map,self.unmined_set = self._from_json(json_map) - return - # if PaperMap Json Doesn't Exist on Path and There Are a few Folders on the Papers path then - self._load_map_from_fs() - return - - def _load_map_from_fs(self): - """_load_map_from_fs - Method to Build the paper_map : this Helps with Mining and Scraping Process. - paper_map = { - '0704.3931' : `ArxivPaperStatus` - } - """ - list_of_subfolders_with_paths = [f.path for f in os.scandir(self.papers_path) if f.is_dir()] - for path in list_of_subfolders_with_paths: - paper_id = path.split('/')[-1] - try: - paper = ArxivPaper.from_fs(paper_id,self.papers_path) - # print("Adding Paper : ",paper_id) - mined = True if paper.paper_processing_meta is not None else False - # Create Paper Status in PaperMap. - self.paper_map[paper_id] = ArxivPaperStatus(mined=mined,scraped=True) - # Add to unmined_set if paper is Not Mined. 
- if not mined: - self.unmined_set.add(paper_id) - except Exception as e: - continue - - def _from_json(self,json_object): - unmined_set = set() - for key in json_object: - json_object[key] = ArxivPaperStatus.from_json(json_object[key]) - if not json_object[key].mined: - unmined_set.add(key) - - return json_object,unmined_set - - def __len__(self): - return len(self.paper_map.keys()) - - def __getitem__(self,paper_id): - if paper_id not in self.paper_map: - return None - return self.paper_map[paper_id] - - def to_json(self): - return dict((paper_id,self.paper_map[paper_id].to_json()) for paper_id in self.paper_map) - - def add(self,paper_id): - if paper_id not in self.paper_map: - self.paper_map[paper_id] = ArxivPaperStatus(scraped=True) # Create When An Identity is Scraped. - self.unmined_set.add(paper_id) - - def get_unmined_paper(self) -> str: - if len(self.unmined_set) == 0: - return None - return self.unmined_set.pop() - - def add_unmined_id(self,paper_id): - self.unmined_set.add(paper_id) - self.update() - - -class ArxivFSDatabase(ArxivDatabase): - # 4 M IDS IN MEMORY FOR 600 MB MEMORY : MAAAAX - """ArxivFSDatabase - - Works Similar to the `ArxivLoader`. - It will Be a centralised Database between Scraping Engine and Mining Engine. - `ArxivFSDatabase` uses `PaperMap` to Help with Querying. - - This Database Can Later GeT replaced with a more formal search Engine. 
- """ - def __init__(self,data_root_path,build_new_map=False): - paper_path = os.path.join(data_root_path,'papers') - if not dir_exists(paper_path): - os.makedirs(paper_path) - self.papers_path = paper_path - self.paper_map = PaperMap(data_root_path,build_new=build_new_map) - self.logger = create_logger(self.__class__.__name__) - self.logger.info("Database Has Started Currently With Papers : %d"%len(self.paper_map)) - - - def query(self, paper_id) -> ArxivRecord: - # paper_path = os.path.join(self.papers_path,paper_id) - # Ideally if It is not in Paper Map then there is no chance - # That the paper is present in the - if self.paper_map[paper_id] is None: - return None - paper = ArxivPaper.from_fs(paper_id,self.papers_path) - record = paper.to_arxiv_record() - return record - - def save_identity(self,identity:ArxivIdentity): - paper_path = os.path.join(self.papers_path,identity.identity) - paper_meta_path = os.path.join(paper_path,ArxivRecord.identity_file_name) - # Update The Map if there is no Identity in the Map. - if self.paper_map[identity.identity] is None: - self.paper_map.add(identity.identity) # Add paper to the map(It also sets scraped=True) - # Save paper identity. - if not dir_exists(paper_path): - os.makedirs(paper_path) - save_json_to_file(identity.to_json(),paper_meta_path) - - def save_record(self,record:ArxivRecord): - paper = ArxivPaper.from_arxiv_record(self.papers_path,record) - paper.to_fs() - - def set_mined(self,identity:ArxivIdentity,mined_status:bool): - self.paper_map[identity.identity].mined = mined_status - if not mined_status: # If setting it as unmined then Re-add it to set. 
- self.paper_map.add_unmined_id(identity.identity) - - def get_unmined_paper(self) -> ArxivRecord: - paper_id = self.paper_map.get_unmined_paper() - if paper_id is None: - return None - return self.query(paper_id) - diff --git a/arxiv_miner/database/proxy_service.py b/arxiv_miner/database/proxy_service.py deleted file mode 100644 index d3f5e8a..0000000 --- a/arxiv_miner/database/proxy_service.py +++ /dev/null @@ -1,79 +0,0 @@ -""" -This module helps exposing the FS based database as "Service" based module via `rpyc` -`rpyc` helps create direct remote callable python objects. This helps expose db as a service. -""" -import rpyc -from ..record import ArxivRecord,ArxivIdentity,ArxivPaperStatus -from ..exception import ArxivDatabaseConnectionException -from .filesystem import ArxivFSDatabase -from .core import ArxivDatabase - -class ArxivFSDatabaseService(rpyc.Service,ArxivFSDatabase): - """ArxivFSDatabaseService - This service will help expose an FS based DB as a server for clients to start calling. - This is useful if one doesn't-want/cant use Elasticsearch and still wants to mine data. - """ - def __init__(self,*args,**kwargs): - super().__init__(*args,**kwargs) - self.logger.info('Database Sever Started On %s'%self.papers_path) - - def on_connect(self, conn): - # code that runs when a connection is created - # (to init the service, if needed) - self.logger.info("[CONN OPEN]: DB Currently Has %d Papers "%len(self.paper_map)) - pass - - def shutdown(self): - self.paper_map.save_map() - self.logger.info("Shutting Down The Server. 
Total Papers Records(Mined/Scraped) %d"%len(self.paper_map)) - - def on_disconnect(self, conn): - # code that runs after the connection has already closed - # (to finalize the service, if needed) - self.logger.info("[CONN CLOSE]: DB Currently Has %d Papers "%len(self.paper_map)) - pass - - def exposed_query(self,paper_id): # this is an exposed method - return self.query(paper_id) - - def exposed_save_identity(self,identity:ArxivIdentity): # while this method is not exposed - return self.save_identity(identity) - - def exposed_get_unmined_paper(self): - return self.get_unmined_paper() - - def exposed_set_mined(self,identity:ArxivIdentity,mined_status:bool): - return self.set_mined(identity,mined_status) - - def exposed_save_record(self,record:ArxivRecord): - return self.save_record(record) - - -class ArxivDatabaseServiceClient(ArxivDatabase): - """ - This is in case any Data-layer Adapter needs to be exposed as a remote service - with `rypc`. This - """ - - def __init__(self,host='localhost',port=18861): - try: - self.conn = rpyc.connect(host, port,config={'allow_public_attrs': True, 'sync_request_timeout': 10}) - self.client = self.conn.root - except Exception as e: - raise ArxivDatabaseConnectionException(host,port,str(e)) - - - def query(self,paper_id): - return self.client.query(paper_id) - - def save_identity(self,identity:ArxivIdentity): - return self.client.save_identity(identity) - - def get_unmined_paper(self): - return self.client.get_unmined_paper() - - def set_mined(self,identity:ArxivIdentity,mined_status:bool): - return self.client.set_mined(identity,mined_status) - - def save_record(self,record:ArxivRecord): - return self.client.save_record(record) diff --git a/arxiv_miner/exception.py b/arxiv_miner/exception.py index 7d4a388..a221dcf 100644 --- a/arxiv_miner/exception.py +++ b/arxiv_miner/exception.py @@ -32,7 +32,7 @@ def __init__(self,paper_id,message): class SectionSerialisationException(LatexParserException): def __init__(self,ms): - msg = 
"Serialisation of Section Object Requeses %s"%ms + msg = "Serialisation of Section Object Requires %s"%ms super(SectionSerialisationException, self).__init__(msg) diff --git a/arxiv_miner/loader.py b/arxiv_miner/loader.py deleted file mode 100644 index 854cc2c..0000000 --- a/arxiv_miner/loader.py +++ /dev/null @@ -1,241 +0,0 @@ -import pandas -import random -import os -import tarfile -from typing import List -from .paper import ArxivPaper -from collections import Counter - -class ArxivLoaderFilter: - pdf_only : bool - parsing_errors :bool - min_latex_pages: int - max_latex_pages: int - sample_size:int - - def __init__(self, - pdf_only = None,\ - parsing_errors = None,\ - min_latex_pages = None,\ - max_latex_pages = None,\ - sample_size = None,\ - scraped_only = None, - ): - self.pdf_only = pdf_only - self.parsing_errors = parsing_errors - self.min_latex_pages = min_latex_pages - self.max_latex_pages = max_latex_pages - self.sample_size = sample_size - - # For articles that have only been scraped. - self.scraped_only = scraped_only - - - @property - def is_active(self): # Any of Its properties are set. 
- for k in self.__dict__: - if getattr(self,k) is not None: - return True - return False - - @property - def requires_mined_record(self): - if self.pdf_only is not None: - return True - if self.parsing_errors is not None: - return True - if self.min_latex_pages is not None: - return True - if self.max_latex_pages is not None: - return True - return False - -class FSArxivLoadingFactory: - - @staticmethod - def only_scraped_loader(papers_root_path): - ax_filter = ArxivLoaderFilter(scraped_only=True) - loader_obj = ArxivLoader(papers_root_path,filter_object = ax_filter) - return loader_obj - - @staticmethod - def latex_failed_loader(papers_root_path): - ax_filter = ArxivLoaderFilter(parsing_errors=True,pdf_only=False) - loader_obj = ArxivLoader(papers_root_path,filter_object = ax_filter) - return loader_obj - - @staticmethod - def sampled_loader(papers_root_path,num_samples): - ax_filter = ArxivLoaderFilter(sample_size = num_samples) - loader_obj = ArxivLoader(papers_root_path,filter_object = ax_filter) - return loader_obj - - @staticmethod - def latex_parsed_loader(papers_root_path): - ax_filter = ArxivLoaderFilter(pdf_only=False,parsing_errors=False) - loader_obj = ArxivLoader(papers_root_path,filter_object = ax_filter) - return loader_obj - - @staticmethod - def latex_page_range_loader(papers_root_path,\ - min_latex_pages, - max_latex_pages): - ax_filter = ArxivLoaderFilter(min_latex_pages=min_latex_pages,max_latex_pages=max_latex_pages) - loader_obj = ArxivLoader(papers_root_path,filter_object = ax_filter) - return loader_obj - - -class ArxivLoader(): - """ArxivLoader : - THIS CLASS's PURPOSE IS FOR FAST DATA MOVEMENT IIF NEEDED BETWEEEN SERVERS. - IT HELPS LOAD AND UNLOAD AN ENTIRE DATABASE WITH FILTERS. - - ONE CAN USE THIS AS A BACKUP TOOL FOR DATA EXTRACTION - Loads Arxiv Paper objects from Folder Root. - - Loader Features : - 1. Filter Papers By : --> Via ArxivLoaderFilter - 1. pdf_only - 2. parsing_errors - 3. Latex Pages - 2. Create Sampled Loader. 
--> Via ArxivLoaderFilter - 3. Act like an Indexable Array - 4. Get Papers using the axiv_id - - :param papers_root_path Folder where all Arxiv Papers are Stored by the `ArxivPaper` object. - """ - papers = [] - loader_archieve_file_name = 'papers_loaded.tar.gz' - - def __init__(self,papers_root_path,filter_object=ArxivLoaderFilter(),detex_path=None): - self.papers_root_path = papers_root_path - self.papers = [] - list_of_subfolders_with_paths = [f.path for f in os.scandir(papers_root_path) if f.is_dir()] - arxiv_ids = list(map(lambda x:x.split('/')[-1],list_of_subfolders_with_paths)) - - self._build_papers_from_fs(arxiv_ids,filter_object,detex_path=detex_path) - - def __getitem__(self,index): - return self.papers[index] - - def __len__(self): - return len(self.papers) - - def get_sample(self): - return self.papers[random.randint(0,len(self.papers)-1)] - - def to_metadata_dataframe(self): - return pandas.DataFrame(self.get_meta_data_array()) - - def _build_papers_from_fs(self,arxiv_ids:List[str],filter_object:ArxivLoaderFilter,detex_path=None): - """_build_papers_from_fs - Build papers according to `ArxivLoaderFilter` filter object - `ArxivLoaderFilter` : - - Filters Via Sampling - - Filters via Latex page counts - - Filters via pdf_only papers - - Filters via latex_errored papers - :param arxiv_ids: List[str] - :param filter_object: ArxivLoaderFilter - """ - use_filter = False - if filter_object.is_active: - use_filter = True - if filter_object.sample_size is not None: - random.shuffle(arxiv_ids) # Ids are already sufflled so samples can be created. - - for paper_id in arxiv_ids: - if len(self.papers) == filter_object.sample_size: - break # post generating samples. - try: - paper = ArxivPaper.from_fs(paper_id,self.papers_root_path) - except Exception as e:# Ingnore Papers which are not Parsable. 
- print(e) - continue - if use_filter: - if not self.paper_filter(paper,filter_object): - continue - self.papers.append(paper) - - @staticmethod - def paper_filter(paper_obj:ArxivPaper,filter_obj:ArxivLoaderFilter): - compiled_bool = True - - # paper_processing_meta is set only if Latex Information is Parsed. - # if paper_processing_meta none and ArxivLoaderFilter is active then ignore this record because we need parsed records. - if paper_obj.paper_processing_meta is None: - if filter_obj.requires_mined_record: - return False - - if filter_obj.scraped_only: - return True - - # As paper_processing_meta is not None and u only need unprocessed records - if filter_obj.scraped_only: - return False - - if filter_obj.pdf_only is not None: # If Looking for PDFs_only and no paper_processing_meta then ignore - cond_result = paper_obj.paper_processing_meta.pdf_only == filter_obj.pdf_only - compiled_bool = cond_result and compiled_bool - - - if filter_obj.parsing_errors is not None: - cond_result = paper_obj.latex_parsing_result.parsing_error == filter_obj.parsing_errors - compiled_bool = cond_result and compiled_bool - - - if filter_obj.min_latex_pages is not None: - cond_result = paper_obj.paper_processing_meta.latex_files >= filter_obj.min_latex_pages - compiled_bool = cond_result and compiled_bool - - - if filter_obj.max_latex_pages is not None: - cond_result = paper_obj.paper_processing_meta.latex_files < filter_obj.max_latex_pages - compiled_bool = cond_result and compiled_bool - - return compiled_bool - - def get_meta_data_array(self): - object_array = [] - for paper in self.papers:# Use Identiy For any place needing Metadata. 
- object_array.append(paper.identity.to_json()) - return object_array - - def __getitem__(self, index): - return self.papers[index] - - def parsing_statistics(self): - num_pdfs = sum([ 0 if paper.paper_processing_meta.pdf_only else 1 for paper in self.papers]) - latex_files_counts = dict(Counter([paper.paper_processing_meta.latex_files for paper in self.papers])) - num_errored = sum([1 if paper.latex_processing_meta.parsing_error else 0 for paper in self.papers]) - fully_parsed = sum([ 1 if not paper.paper_processing_meta.pdf_only and not paper.latex_processing_meta.parsing_error else 0 for paper in self.papers]) - return { - 'num_pdfs':num_pdfs, - 'latex_files_counts':latex_files_counts, - 'num_errored':num_errored, - 'fully_parsed':fully_parsed - } - - def from_archive(self): - """from_archive - todo : Create a method that creates a loader from archive - """ - pass - - def make_archive(self,archive_path='./',with_latex=False): - """make_archive - create a Tar file with all the papers within the loader. - :param archive_path: [str], defaults to './' - :param with_latex: [bool], defaults to False : if False then it doesn't archieve the Latex Folder with raw latex source. 
- """ - archive_path = os.path.join(archive_path,self.loader_archieve_file_name) - with tarfile.open(archive_path, "w:gz") as tar: - for paper in self.papers: - if with_latex: - tar.add(paper.paper_root_path, arcname=os.path.basename(paper.paper_root_path)) - else: - tar.add(paper.arxiv_meta_file_path,arcname=paper.arxiv_meta_file_path) - tar.add(paper.paper_meta_file_path,arcname=paper.paper_meta_file_path) - if paper.latex_parsed_document is not None: - tar.add(paper.tex_processing_file_path,arcname=paper.tex_processing_file_path) - print("Finished Creating Tar File At Path : %s"%archive_path) diff --git a/arxiv_miner/paper.py b/arxiv_miner/paper.py index 7efea70..e4dc32d 100644 --- a/arxiv_miner/paper.py +++ b/arxiv_miner/paper.py @@ -84,7 +84,7 @@ def core_meta(self): final_dict = {**final_dict,**self.paper_processing_meta.to_json()} if self.latex_parsing_result: final_dict['parsing_error'] = self.latex_parsing_result.parsing_error - return core_meta + return final_dict @property def identity_meta(self): @@ -276,6 +276,11 @@ def from_fs(cls,paper_id,root_papers_path,detex_path=None): def to_fs(self): self._save_metadata_to_fs() self._save_parsed_document_to_fs() + + @classmethod + def from_arxiv_id(cls,axid,root_papers_path,detex_path=None): + axobj = cls(axid,root_papers_path,build_paper=True,detex_path=detex_path) + return axobj ############ ############ ######################## ############ ############ ############ Portability Methods To Make `ArxivPaper` a Processing Object that can reside anywhere. 
############ diff --git a/arxiv_miner/record.py b/arxiv_miner/record.py index 4281508..bedc5b6 100644 --- a/arxiv_miner/record.py +++ b/arxiv_miner/record.py @@ -9,7 +9,6 @@ from dataclasses import asdict as D2D from typing import List -from .scraper import Record from .semantic_parsing import \ ArxivDocument,\ ResearchPaper diff --git a/arxiv_miner/scraper.py b/arxiv_miner/scraper.py index 49245b6..44252cb 100644 --- a/arxiv_miner/scraper.py +++ b/arxiv_miner/scraper.py @@ -115,7 +115,6 @@ class Scraper(object): Returning all eprints from ``` - import arxivscraper.arxivscraper as ax scraper = ax.Scraper(category='stat',date_from='2017-12-23',date_until='2017-12-25',t=10, filters={'affiliation':['facebook'],'abstract':['learning']}) output = scraper.scrape() diff --git a/arxiv_query.py b/arxiv_query.py deleted file mode 100644 index b46c178..0000000 --- a/arxiv_query.py +++ /dev/null @@ -1,155 +0,0 @@ -import arxiv -import dateparser -import pandas -import pickle -from utils import dir_exists -from constants import * - -SORT_BY = ["relevance", "lastUpdatedDate", "submittedDate"] -SORT_ORDER =['descending','ascending'] - -def get_cs_labels(): - return list(COMPUTER_SCIENCE_TOPICS.values()) - -def get_cs_topics(): - return list(COMPUTER_SCIENCE_TOPICS.keys()) - -def wrap_brackets(string): - RIGHT_BRACKET=")" - LEFT_BRACKET="(" - return LEFT_BRACKET+string+RIGHT_BRACKET - -def wrap_quotes(string): - QUOTE='"' - SPACE=' ' - return QUOTE+string.replace(' ',SPACE)+QUOTE - - -class ArxivRemoteObject: - def __init__(self,arxiv_object): - self.url = arxiv_object['id'] - self.title = arxiv_object['title'] - self.abstract = arxiv_object['summary'] - self.tags = ', '.join(list(map(lambda x : x['term'] if x['term'] not in ALL_TOPICS else ALL_TOPICS[x['term']],arxiv_object['tags']))) - self.primary_category = arxiv_object['arxiv_primary_category'] - self.authors = ', '.join(arxiv_object['authors']) - self.published = arxiv_object['published'] - - # These are for search 
purposess. - self.unfiltered_tags = list(map(lambda x : x['term'],arxiv_object['tags'])) - - def print_markdown_with_streamlit(self,st,only_meta=False): - st.markdown('# %s'%self.title) - st.markdown('## %s'%'Meta') - human_readable_date = dateparser.parse(self.published).strftime("%d, %b %Y") - st.markdown(''' - URL : {url}\n - TAGS : {tags}\n - Authors : {authors}\n - Published : {human_readable_date}\n - '''.format(**self.__dict__,human_readable_date=human_readable_date) - ) - if not only_meta: - st.markdown('## Abstract') - st.markdown('%s'%self.abstract) - - - def to_json(self): - return dict(self.__dict__) - - - def __str__(self): - return """ - # {title} - - ## Meta - URL : {url} - TAGS : {tags} - Authors : {authors} - Published : {published} - - ## ABSTRACT - {abstract} - --- - """.format(**self.__dict__) - - -class ArxivLocalDatabase: - def __init__(self,db_path): - if not dir_exists(db_path): - raise Exception("No Database File At : %s"%db_path) - db = pickle.load(open(db_path, 'rb')) - object_arr = db.values() - for ob in object_arr: - ob['authors']= list(map(lambda x : x['name'],ob['authors'])) - self.local_objects = [ArxivRemoteObject(i) for i in object_arr] - - def __getitem__(self,index): - return self.local_objects[index] - - def to_dataframe(self): - return pandas.DataFrame([i.to_json() for i in self.local_objects]) - - def lookup_indices(self,indices): - for i in indices: # Assumes indexes will be correct. 
- yield self.local_objects[i] - - -def query_arxiv(categories,\ - search_text,\ - cat_concat_flag='AND',\ - sort_by=SORT_BY[0],\ - sort_order=SORT_ORDER[0],\ - max_chunk_results=10,\ - max_results=20\ - ): - - if len(categories) == 0 and search_text == '': - return [] - - search_query = build_arxiv_query(categories,search_text,cat_concat_flag=cat_concat_flag) - query_result = arxiv.query( - query=search_query,\ - max_chunk_results=max_chunk_results,\ - max_results=max_results,\ - iterative=True, - sort_by=sort_by, - sort_order=sort_order - - ) - return list(ArxivRemoteObject(paper) for paper in query_result()),search_query - - -def build_arxiv_query(categories,search_text,cat_concat_flag='AND'): - SPACE=' ' - OR = SPACE+'OR'+SPACE - AND = SPACE+'AND'+SPACE - query = [] - if len(categories) > 0: - categories = ['cat:'+cat for cat in categories] - cat_concat_op = OR if cat_concat_flag is 'OR' else AND - query.append(cat_concat_op.join(categories)) - - if search_text is not '': - search_str = ["ti:"+wrap_quotes(search_text),"abs:"+wrap_quotes(search_text)] - query.append(OR.join(search_str)) - - if len(query) > 1: - query = [wrap_brackets(q) for q in query] - return AND.join(query) - - elif len(query) == 1: - return query[0] - - return '' - - -# Todo Build Database From the already tooled methods that Are build here. - -class ArxivQueryInterface: - """ - TODO: Scipt out queries which contain stuff the usually used. - """ - def __init__(self): - pass - # "cat:cs.CV+OR+cat:cs.AI+OR+cat:cs.LG+OR+cat:cs.CL+OR+cat:cs.NE+OR+cat:stat.ML" \ No newline at end of file diff --git a/cli.py b/cli.py deleted file mode 100644 index b786380..0000000 --- a/cli.py +++ /dev/null @@ -1,59 +0,0 @@ -''' -This is the Generalised CLI origin of the Project. -this will be used for the Extracting the Important CLI information such as Database -Selection etc. 
Can be used as a gateway to integrate all the submodules into one cli invocation -''' - -import click -from functools import wraps -from config import Config -from arxiv_miner import SUPPORTED_DBS,get_database_client -import json - -DB_HELP = 'The Chosen Backend Store. Select from : '+','.join(SUPPORTED_DBS) -DEFAULT_APP_NAME= 'ArXiv-Miner' - -def common_run_options(func): - db_defaults = Config.get_db_defaults() - @click.option('--host', default=db_defaults['host'], help='ArxivDatabase Host') - @click.option('--port', default=db_defaults['port'], help='ArxivDatabase Port') - @wraps(func) - def wrapper(*args, **kwargs): - return func(*args, **kwargs) - return wrapper - - -@click.group(invoke_without_command=True) -@click.option('--datastore', type=click.Choice(SUPPORTED_DBS),default=Config.default_database, help=DB_HELP) -@click.option('--use_defaults',is_flag=True,help='Use Default Database Configurations For Chosen Datastore. Config currently comes from utils.py') -@common_run_options -@click.pass_context -def db_cli(ctx,datastore,use_defaults,host,port,app_name=DEFAULT_APP_NAME): - ctx.obj = {} - args , client_class = database_choice(datastore,use_defaults,host,port) - print_str = '\n %s Process Using %s Datastore'%(app_name,datastore) - args_str = ''.join(['\n\t'+ i + ' : ' + str(args[i]) for i in args]) - click.secho(print_str,fg='green',bold=True) - click.secho(args_str+'\n\n',fg='magenta') - arxiv_database = client_class(**args) - ctx.obj['db_class'] = client_class - ctx.obj['db_args'] = args - - -def database_choice(datastore,use_defaults,host,port): - # get_database_client will raise error if some-one feeds BS DB - client_class = get_database_client(datastore) - if datastore == 'fs': - if use_defaults: - args = Config.get_defaults('fs') - else: - args = dict(host=host,port=port) - elif datastore == 'elasticsearch': - if use_defaults: - args = Config.get_defaults('elasticsearch') - else: - args = 
dict(index_name=Config.elasticsearch_index,host=host,port=port) - return args, client_class - -if __name__ == '__main__': - db_cli() \ No newline at end of file diff --git a/config.py b/config.py deleted file mode 100644 index 95e9794..0000000 --- a/config.py +++ /dev/null @@ -1,47 +0,0 @@ -import os -# global settings -# ----------------------------------------------------------------------------- -class Config(object): - # Based on `arxiv_miner.database.SUPPORTED_DBS` - default_database = 'elasticsearch' - - #FS Database Related Configuration - data_path = os.path.abspath('./data') - fs_database_port = 18861 - fs_database_host = 'localhost' - fs_database_config = { - 'allow_public_attrs': True,\ - 'sync_request_timeout': 10\ - } - - elasticsearch_port = 9200 - elasticsearch_host = 'localhost' - elasticsearch_index = 'arxiv_papers' - # Mining Related Configuration - detex_path = os.path.abspath('./detex') - mining_data_path = os.path.abspath('./mining_data/papers') - - # Object Store - bucket_name = 'arxiv-papers-source-bucket' - - - @classmethod - def get_defaults(cls,db_str): - if db_str == 'elasticsearch': - return dict(\ - host=cls.elasticsearch_host,\ - port=cls.elasticsearch_port,\ - index_name = cls.elasticsearch_index - ) - elif db_str == 'fs': - return dict(\ - # data_path= cls.data_path, - host=cls.fs_database_host,\ - port=cls.fs_database_port,\ - ) - else: - return None - - @classmethod - def get_db_defaults(cls): - return cls.get_defaults(cls.default_database) diff --git a/constants.py b/constants.py deleted file mode 100644 index 2a419ad..0000000 --- a/constants.py +++ /dev/null @@ -1,198 +0,0 @@ -COMPUTER_SCIENCE_TOPICS = { - "cs.AI": "Artificial Intelligence", - "cs.AR": "Hardware Architecture", - "cs.CC": "Computational Complexity", - "cs.CE": "Computational Engineering, Finance, and Science", - "cs.CG": "Computational Geometry", - "cs.CL": "Computation and Language", - "cs.CR": "Cryptography and Security", - "cs.CV": "Computer Vision and Pattern 
Recognition", - "cs.CY": "Computers and Society", - "cs.DB": "Databases", - "cs.DC": "Distributed, Parallel, and Cluster Computing", - "cs.DL": "Digital Libraries", - "cs.DM": "Discrete Mathematics", - "cs.DS": "Data Structures and Algorithms", - "cs.ET": "Emerging Technologies", - "cs.FL": "Formal Languages and Automata Theory", - "cs.GL": "General Literature", - "cs.GR": "Graphics", - "cs.GT": "Computer Science and Game Theory", - "cs.HC": "Human-Computer Interaction", - "cs.IR": "Information Retrieval", - "cs.IT": "Information Theory", - "cs.LG": "Learning", - "cs.LO": "Logic in Computer Science", - "cs.MA": "Multiagent Systems", - "cs.MM": "Multimedia", - "cs.MS": "Mathematical Software", - "cs.NA": "Numerical Analysis", - "cs.NE": "Neural and Evolutionary Computing", - "cs.NI": "Networking and Internet Architecture", - "cs.OH": "Other Computer Science", - "cs.OS": "Operating Systems", - "cs.PF": "Performance", - "cs.PL": "Programming Languages", - "cs.RO": "Robotics", - "cs.SC": "Symbolic Computation", - "cs.SD": "Sound", - "cs.SE": "Software Engineering", - "cs.SI": "Social and Information Networks", - "cs.SY": "Systems and Control" -} - -ALL_TOPICS = { - 'astro-ph': 'Astrophysics', - 'astro-ph.CO': 'Cosmology and Nongalactic Astrophysics', - 'astro-ph.EP': 'Earth and Planetary Astrophysics', - 'astro-ph.GA': 'Astrophysics of Galaxies', - 'astro-ph.HE': 'High Energy Astrophysical Phenomena', - 'astro-ph.IM': 'Instrumentation and Methods for Astrophysics', - 'astro-ph.SR': 'Solar and Stellar Astrophysics', - 'cond-mat.dis-nn': 'Disordered Systems and Neural Networks', - 'cond-mat.mes-hall': 'Mesoscale and Nanoscale Physics', - 'cond-mat.mtrl-sci': 'Materials Science', - 'cond-mat.other': 'Other Condensed Matter', - 'cond-mat.quant-gas': 'Quantum Gases', - 'cond-mat.soft': 'Soft Condensed Matter', - 'cond-mat.stat-mech': 'Statistical Mechanics', - 'cond-mat.str-el': 'Strongly Correlated Electrons', - 'cond-mat.supr-con': 'Superconductivity', - 'cs.AI': 
'Artificial Intelligence', - 'cs.AR': 'Hardware Architecture', - 'cs.CC': 'Computational Complexity', - 'cs.CE': 'Computational Engineering, Finance, and Science', - 'cs.CG': 'Computational Geometry', - 'cs.CL': 'Computation and Language', - 'cs.CR': 'Cryptography and Security', - 'cs.CV': 'Computer Vision and Pattern Recognition', - 'cs.CY': 'Computers and Society', - 'cs.DB': 'Databases', - 'cs.DC': 'Distributed, Parallel, and Cluster Computing', - 'cs.DL': 'Digital Libraries', - 'cs.DM': 'Discrete Mathematics', - 'cs.DS': 'Data Structures and Algorithms', - 'cs.ET': 'Emerging Technologies', - 'cs.FL': 'Formal Languages and Automata Theory', - 'cs.GL': 'General Literature', - 'cs.GR': 'Graphics', - 'cs.GT': 'Computer Science and Game Theory', - 'cs.HC': 'Human-Computer Interaction', - 'cs.IR': 'Information Retrieval', - 'cs.IT': 'Information Theory', - 'cs.LG': 'Learning', - 'cs.LO': 'Logic in Computer Science', - 'cs.MA': 'Multiagent Systems', - 'cs.MM': 'Multimedia', - 'cs.MS': 'Mathematical Software', - 'cs.NA': 'Numerical Analysis', - 'cs.NE': 'Neural and Evolutionary Computing', - 'cs.NI': 'Networking and Internet Architecture', - 'cs.OH': 'Other Computer Science', - 'cs.OS': 'Operating Systems', - 'cs.PF': 'Performance', - 'cs.PL': 'Programming Languages', - 'cs.RO': 'Robotics', - 'cs.SC': 'Symbolic Computation', - 'cs.SD': 'Sound', - 'cs.SE': 'Software Engineering', - 'cs.SI': 'Social and Information Networks', - 'cs.SY': 'Systems and Control', - 'econ.EM': 'Econometrics', - 'eess.AS': 'Audio and Speech Processing', - 'eess.IV': 'Image and Video Processing', - 'eess.SP': 'Signal Processing', - 'gr-qc': 'General Relativity and Quantum Cosmology', - 'hep-ex': 'High Energy Physics - Experiment', - 'hep-lat': 'High Energy Physics - Lattice', - 'hep-ph': 'High Energy Physics - Phenomenology', - 'hep-th': 'High Energy Physics - Theory', - 'math.AC': 'Commutative Algebra', - 'math.AG': 'Algebraic Geometry', - 'math.AP': 'Analysis of PDEs', - 'math.AT': 'Algebraic 
Topology', - 'math.CA': 'Classical Analysis and ODEs', - 'math.CO': 'Combinatorics', - 'math.CT': 'Category Theory', - 'math.CV': 'Complex Variables', - 'math.DG': 'Differential Geometry', - 'math.DS': 'Dynamical Systems', - 'math.FA': 'Functional Analysis', - 'math.GM': 'General Mathematics', - 'math.GN': 'General Topology', - 'math.GR': 'Group Theory', - 'math.GT': 'Geometric Topology', - 'math.HO': 'History and Overview', - 'math.IT': 'Information Theory', - 'math.KT': 'K-Theory and Homology', - 'math.LO': 'Logic', - 'math.MG': 'Metric Geometry', - 'math.MP': 'Mathematical Physics', - 'math.NA': 'Numerical Analysis', - 'math.NT': 'Number Theory', - 'math.OA': 'Operator Algebras', - 'math.OC': 'Optimization and Control', - 'math.PR': 'Probability', - 'math.QA': 'Quantum Algebra', - 'math.RA': 'Rings and Algebras', - 'math.RT': 'Representation Theory', - 'math.SG': 'Symplectic Geometry', - 'math.SP': 'Spectral Theory', - 'math.ST': 'Statistics Theory', - 'math-ph': 'Mathematical Physics', - 'nlin.AO': 'Adaptation and Self-Organizing Systems', - 'nlin.CD': 'Chaotic Dynamics', - 'nlin.CG': 'Cellular Automata and Lattice Gases', - 'nlin.PS': 'Pattern Formation and Solitons', - 'nlin.SI': 'Exactly Solvable and Integrable Systems', - 'nucl-ex': 'Nuclear Experiment', - 'nucl-th': 'Nuclear Theory', - 'physics.acc-ph': 'Accelerator Physics', - 'physics.ao-ph': 'Atmospheric and Oceanic Physics', - 'physics.app-ph': 'Applied Physics', - 'physics.atm-clus': 'Atomic and Molecular Clusters', - 'physics.atom-ph': 'Atomic Physics', - 'physics.bio-ph': 'Biological Physics', - 'physics.chem-ph': 'Chemical Physics', - 'physics.class-ph': 'Classical Physics', - 'physics.comp-ph': 'Computational Physics', - 'physics.data-an': 'Data Analysis, Statistics and Probability', - 'physics.ed-ph': 'Physics Education', - 'physics.flu-dyn': 'Fluid Dynamics', - 'physics.gen-ph': 'General Physics', - 'physics.geo-ph': 'Geophysics', - 'physics.hist-ph': 'History and Philosophy of Physics', - 
'physics.ins-det': 'Instrumentation and Detectors', - 'physics.med-ph': 'Medical Physics', - 'physics.optics': 'Optics', - 'physics.plasm-ph': 'Plasma Physics', - 'physics.pop-ph': 'Popular Physics', - 'physics.soc-ph': 'Physics and Society', - 'physics.space-ph': 'Space Physics', - 'q-bio.BM': 'Biomolecules', - 'q-bio.CB': 'Cell Behavior', - 'q-bio.GN': 'Genomics', - 'q-bio.MN': 'Molecular Networks', - 'q-bio.NC': 'Neurons and Cognition', - 'q-bio.OT': 'Other Quantitative Biology', - 'q-bio.PE': 'Populations and Evolution', - 'q-bio.QM': 'Quantitative Methods', - 'q-bio.SC': 'Subcellular Processes', - 'q-bio.TO': 'Tissues and Organs', - 'q-fin.CP': 'Computational Finance', - 'q-fin.EC': 'Economics', - 'q-fin.GN': 'General Finance', - 'q-fin.MF': 'Mathematical Finance', - 'q-fin.PM': 'Portfolio Management', - 'q-fin.PR': 'Pricing of Securities', - 'q-fin.RM': 'Risk Management', - 'q-fin.ST': 'Statistical Finance', - 'q-fin.TR': 'Trading and Market Microstructure', - 'quant-ph': 'Quantum Physics', - 'stat.AP': 'Applications', - 'stat.CO': 'Computation', - 'stat.ME': 'Methodology', - 'stat.ML': 'Machine Learning', - 'stat.OT': 'Other Statistics', - 'stat.TH': 'Statistics Theory' -} diff --git a/cso-req.txt b/cso-req.txt new file mode 100644 index 0000000..867dcf8 --- /dev/null +++ b/cso-req.txt @@ -0,0 +1 @@ +cso-classifier \ No newline at end of file diff --git a/data_exploration_dashboard.py b/data_exploration_dashboard.py deleted file mode 100644 index deee720..0000000 --- a/data_exploration_dashboard.py +++ /dev/null @@ -1,154 +0,0 @@ -import streamlit as st -from config import Config -from arxiv_miner import FSArxivLoadingFactory,ArxivLoader -import pickle -import pandas -import pandas -import dateparser -import arxiv_query - -import numpy as np -import os -DETEX_PATH = os.path.abspath('./detex') -MAX_ARTICLES_PER_PAGE = 20 -APP_MODES = [ - "Database Exploration", - # "Instant URL Parsing", - # "Dataset Exploration", - "Source Code Lookup", - "Arxiv Query 
Builder" -] -LOADER_FACTORY = FSArxivLoadingFactory - -@st.cache(show_spinner=True,hash_funcs={ArxivLoader: id}) -def get_paper_data(latex_papers = False): - storage_path = os.path.join(Config.data_path,'papers') - if not latex_papers : - loader = ArxivLoader(storage_path) - else: - loader = LOADER_FACTORY.latex_parsed_loader(storage_path) - return loader - -@st.cache(show_spinner=True,hash_funcs={arxiv_query.ArxivLocalDatabase: id}) -def get_local_paper_database(): - local_db = arxiv_query.ArxivLocalDatabase('./db.p') - return local_db - - - -def dataset_exploration(): - """source_lookup - This will hold all the needed functions for making the view. - """ - ltx_papers = st.checkbox('Show Latex Parsed Papers') - loader = get_paper_data(latex_papers=ltx_papers) - ids = list(range(len(loader))) - id_value = st.selectbox("Select A Paper", ids, format_func=lambda x: loader[x].identity_meta['title']) - paper = loader[id_value] - - if paper.latex_parsed_document is None: - st.markdown("# No Content Found") - return - - print_paper(paper.latex_parsed_document) - -def print_paper(latex_parsed_document): - - content = latex_parsed_document.to_markdown() - human_readable_date = dateparser.parse(latex_parsed_document.published).strftime("%d, %b %Y") - title_str = ''' - # {title}\n - '''.format(title=latex_parsed_document.title) - st.title('%s'%latex_parsed_document.title) - - url_str = ''' - **URL : <{url}>**\n - '''.format(url =latex_parsed_document.url) - st.markdown('%s'%url_str) - - published_str = ''' - **Published ON : {published}**\n - '''.format(published =human_readable_date) - st.markdown(published_str) - - data_str = ''' - ## Latex Parsing Result - ''' - st.markdown(data_str) - st.markdown(content) - - -def arxiv_query_builder(): - topics = arxiv_query.get_cs_topics() - search_query = st.text_input("Key Words You Are Looking For ? 
Use | for OR Queries and & for AND Queries" ) - selected_topics = st.multiselect("Select Topics To Look For :",topics,None,lambda x:arxiv_query.COMPUTER_SCIENCE_TOPICS[x]) - topic_query_selection = st.radio("Topic AND OR Query ?",["AND","OR"]) - - sort_by = st.radio("Sort By Options",arxiv_query.SORT_BY) - sort_order= st.radio("Sort Order",arxiv_query.SORT_ORDER) - btn_result = st.button('Run Query') - if btn_result: - parsed_objs,sq = arxiv_query.query_arxiv(selected_topics,search_query,topic_query_selection,sort_by=sort_by,sort_order=sort_order) - # st.write("Found Reports %d"%len(parsed_objs)) - # st.write('%s'%sq) - for obj in parsed_objs: - obj.print_markdown_with_streamlit(st) - - -def set_match(search,query,match_all=False): - search_set =set(search) - query_set = set(query) - - if match_all: - if len(search_set - query_set) == 0: - return True - else: - return False - - else: - if len(search_set - query_set) < len(search_set): - return True # Because the query contains something we are searching for. - - return False - -def database_exploration(): - paper_db = get_local_paper_database() - df = paper_db.to_dataframe() - topics = arxiv_query.get_cs_topics() - df['published'] = pandas.to_datetime(df['published']) - - st.markdown("%s"%"# Local Database Exploration") - selected_topics = st.multiselect("Select Topics To Look For :",topics,None,lambda x:arxiv_query.COMPUTER_SCIENCE_TOPICS[x]) - match_all = st.checkbox("Match All Topics ? ") - search_mask = df['unfiltered_tags'].apply(lambda x : set_match(selected_topics,x,match_all=match_all)) - - # todo : more search filters can come here. - search_df = df[search_mask] - search_result = '## Results Found : %d'%len(search_df) - st.markdown(search_result) - if len(search_df) == 0: - return - - # todo : Create Date histograms of publishing. 
- search_df = search_df.sample(min(len(search_df),MAX_ARTICLES_PER_PAGE)) - filtered_papers = paper_db.lookup_indices(search_df.index) - for obj in list(filtered_papers): - obj.print_markdown_with_streamlit(st,only_meta=True) - - -def init_app(): - # st.sidebar.title("What to do") - app_mode = st.sidebar.selectbox("Choose the app mode",APP_MODES) - - if app_mode == "Dataset Exploration": - dataset_exploration() - elif app_mode == "Source Code Lookup": - st.code(open('data_exploration_dashboard.py').read()) - elif app_mode == "Arxiv Query Builder": - arxiv_query_builder() - elif app_mode == 'Database Exploration': - database_exploration() - -if __name__=='__main__': - init_app() - diff --git a/database_server.py b/database_server.py deleted file mode 100644 index cf5c5a2..0000000 --- a/database_server.py +++ /dev/null @@ -1,44 +0,0 @@ -from arxiv_miner import ArxivFSDatabaseService as ArxivDatabaseService -from rpyc.utils.server import ThreadedServer -import os -from config import Config -import datetime -import click -from signal import signal,SIGINT - - -DEFAULT_PATH = Config.data_path -DEFAULT_PORT = Config.fs_database_port -DEFAULT_HOST = Config.fs_database_host - -DATABASE_HELP = ''' -ArXiv Database - -This will start an FS oriented Database for the ArxivRecords and uses -`rpyc` based Services to interact. 
-''' - -@click.command(help=DATABASE_HELP) -@click.option('--host',default=DEFAULT_HOST,help='ArxivDatabase HostName') -@click.option('--port',default=DEFAULT_PORT,help='ArxivDatabase Port') -@click.option('--data_path',default=DEFAULT_PATH,type=click.Path()) -def start_server(data_path,\ - port = DEFAULT_PORT, - host = DEFAULT_HOST): - db_service = ArxivDatabaseService(data_path) - - def stop_server(signal_received, frame): - db_service.shutdown() - t.close() - - signal(SIGINT, stop_server) - t = ThreadedServer(\ - db_service,\ - port=port,\ - hostname=host,\ - protocol_config=Config.fs_database_config) - t.start() - - -if __name__ == "__main__": - start_server() \ No newline at end of file diff --git a/default_config.ini b/default_config.ini new file mode 100644 index 0000000..1ce2612 --- /dev/null +++ b/default_config.ini @@ -0,0 +1,5 @@ +[elasticsearch] +host = localhost +index = arxiv_papers +port = 9200 +# auth = your_user_name your_super_secure_password \ No newline at end of file diff --git a/docs/.nojekyll b/docs/.nojekyll new file mode 100644 index 0000000..e69de29 diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 0000000..be8f869 --- /dev/null +++ b/docs/README.md @@ -0,0 +1,36 @@ +# ArXiv-Miner + +> ArXiv Miner is a toolkit for mining research papers on CS ArXiv. + +## What is ArXiv-Miner + +`arxiv-miner` is a small, handy library that helps power [Sci-Genie](https://sci-genie.com). Sci-Genie is a search engine for quickly searching through the full text of papers on CS ArXiv. `arxiv-miner` helps extract and parse LaTeX documents from CS ArXiv. It also supports storage and search of those parsed documents using **Elasticsearch**. The library can also be applied to other domains such as Math, Physics and Biology. + +## Why was ArXiv-Miner created? +ArXiv Miner was created for easily scraping, parsing and searching research content on ArXiv.
This library was created after stitching together solutions from the code of various tools like [arxiv-sanity](https://github.com/karpathy/arxiv-sanity-preserver), [arxiv-vanity/engrafo](https://github.com/arxiv-vanity/engrafo), [arxivscraper](https://github.com/Mahdisadjadi/arxivscraper), [tex2py](https://github.com/alvinwan/tex2py), [cso-classifier](https://github.com/angelosalatino/cso-classifier/) and [axcell](https://github.com/paperswithcode/axcell). The parsed structure of the content can be useful in search or in any scientific research mining/AI application as a heuristic baseline. + +## Core Components of ArXiv-Miner +- Scraping +- Parsing +- Indexing/Storage + +## Family Of Projects With ArXiv-Miner +- `arxiv-table-miner` : Coming Soon. +- `arxiv-table-ml-models` : Coming Soon. +- `semantic-scholar-data-pipeline` : https://github.com/valayDave/semantic-scholar-data-pipeline + + +## Credits and Appreciation +This project, like all others, has been built on the shoulders of giants. A big thanks to the creators of the following libraries/open source projects that aided the development of `arxiv-miner` and its family of projects: +- [arxiv-sanity](https://github.com/karpathy/arxiv-sanity-preserver) +- [arxiv-vanity/engrafo](https://github.com/arxiv-vanity/engrafo) +- [arxivscraper](https://github.com/Mahdisadjadi/arxivscraper) +- [tex2py](https://github.com/alvinwan/tex2py) +- [cso-classifier](https://github.com/angelosalatino/cso-classifier/) +- [axcell](https://github.com/paperswithcode/axcell) +- [elasticsearch](https://github.com/elastic/elasticsearch) +- [Semantic Scholar Open Research corpus](https://github.com/allenai/s2orc) +- [metaflow](https://metaflow.org) + +## Licence +MIT \ No newline at end of file diff --git a/docs/_sidebar.md b/docs/_sidebar.md new file mode 100644 index 0000000..c6b524f --- /dev/null +++ b/docs/_sidebar.md @@ -0,0 +1,8 @@ +- [ArXiv Miner](README.md) +- [Quick start](quickstart.md) +- [Core Components](core_components.md) +- [Core
Data Structures](structures.md) +- [Running the Damn Thing](deployment_scripts.md) +- [Roadmap](roadmap.md) +- [Contributing](contributing.md) +- [Changelog](changelog.md) diff --git a/docs/changelog.md b/docs/changelog.md new file mode 100644 index 0000000..e80a919 --- /dev/null +++ b/docs/changelog.md @@ -0,0 +1,11 @@ +# Change Log +## V2.0 +Cleaned up the repository for OSS release. Created documentation and refactored CLI into the main module. +## V1.40 +- Added the Author Index +- Added the Ontology Keyword Index +- Added the Ontology Miner to extract ontology using `cso-classifier` in Mining Process +- Re-indexed `arxiv_papers_parsed_research` with ontology +- Created Id feed extraction API +- CSO Classifier install decoupled. +- Added Auth to elasticsearch wrapper. \ No newline at end of file diff --git a/docs/contributing.md b/docs/contributing.md new file mode 100644 index 0000000..f57d786 --- /dev/null +++ b/docs/contributing.md @@ -0,0 +1,5 @@ +# Contributing +- Fork this repo +- Create pull requests against the master branch +- Ensure that the PR description clearly describes the behavior of the change or what the new feature does. +- Add test cases for new features or requested changes. diff --git a/docs/core_components.md b/docs/core_components.md new file mode 100644 index 0000000..fd6655e --- /dev/null +++ b/docs/core_components.md @@ -0,0 +1,101 @@ +# Core Components + +## Scraping +[arxiv_miner/scraping_engine.py](https://github.com/valayDave/arxiv-miner/blob/master/arxiv_miner/scraping_engine.py) contains the classes that tap into the ArXiv feed and create an [`ArxivRecord`](structures.md#ArxivRecord) in Elasticsearch. This is done so that records are only re-mined if necessary. Instructions to scrape data into Elasticsearch are provided [here](deployment_scripts.md#data-extraction).
+ + +## Mining and Parsing +[arxiv_miner/mining_engine.py](https://github.com/valayDave/arxiv-miner/blob/master/arxiv_miner/mining_engine.py) contains the process that mines papers once they have been scraped. [paper.py](https://github.com/valayDave/arxiv-miner/blob/master/arxiv_miner/paper.py) contains the `ArxivPaper` class. This class extracts the LaTeX source repository from the remote source. Each LaTeX source repository is parsed to create a "Structure Tree" of the research document. The Structure Tree is created using [tex2py](https://github.com/alvinwan/tex2py) and captures the structure of the LaTeX document. + +The Structure Tree is then used to create a `Section` object. More information about the `Section` object can be found in [Core Structures](structures.md#Section). The `text` within each `Section` is populated using the [opendetex library](https://github.com/pkubowicz/opendetex). The opendetex library helps filter text information from individual tex files. A hacky algorithm based on the number of tex files correlates the text with the Structure Tree to create a single `Section`. + +Instructions to mine papers after scraping and index them into Elasticsearch are provided [here](deployment_scripts.md#data-mining-and-storage). + +### Standalone Paper Parsing + +```python +from arxiv_miner import ArxivPaper,ResearchPaperFactory +ROOT_DIRECTORY_TO_STORE_LATEX = './papers_root' +# This will extract the LaTeX source from ArXiv and parse the data into a `Section` object +paper = ArxivPaper.from_arxiv_id('1706.03762',ROOT_DIRECTORY_TO_STORE_LATEX,detex_path='') +# This will create a `ResearchPaper` +paperdoc = ResearchPaperFactory.from_arxiv_record(paper) +``` + + +## Storage And Search +[arxiv_miner/database/elasticsearch.py](https://github.com/valayDave/arxiv-miner/blob/master/arxiv_miner/database/elasticsearch.py) contains the core methods over **Elasticsearch** to search and aggregate data. Search and aggregation require two classes : +1.
*A wrapper class over Elasticsearch to execute the search and aggregation queries* : `KeywordsTextSearch` or `ArxivElasticTextSearch` + - These classes contain methods that help retrieve information from the index containing the mined documents. + ```python + from arxiv_miner import KeywordsTextSearch + ELASTICARGS= dict( + index_name=None, + host='localhost', + port=9200, + auth=None + ) + database = KeywordsTextSearch(**ELASTICARGS) + + ``` +2. *A wrapper class to create the search and aggregation queries from input* : `TextSearchFilter`, `DateAggregation` and `TermsAggregation` + - Search and aggregation wrappers are created using [elasticsearch_dsl](https://elasticsearch-dsl.readthedocs.io/en/latest/). + - `TextSearchFilter` is the core wrapper describing what to search, i.e. the core filters over the indexed data after mining + ```python + from arxiv_miner import TextSearchFilter, DATE_FIELD_NAME, CATEGORY_FIELD_NAME, TEXT_HIGHLIGHT,CategoryFilterItem + text_filter = TextSearchFilter( + id_vals=[], # Explicitly filtering arxiv_id's in search/aggregation + no_date = False, # no_date : bool for not using date filter + string_match_query="", # Text filter opt + text_filter_fields = [], # Specific fields in search index to filter. 
+ start_date_key=None, # start date filter opt + end_date_key = None, # end date filter opt + date_filter_field = DATE_FIELD_NAME, # date field upon which date filter will be applied + category_filter = [],# [CategoryFilterItem] Use `category_filter` or `multi_category_filter` + category_filter_values =[], # if len(category_filter) > 0 then category_filter_values is required + category_field = CATEGORY_FIELD_NAME, + category_match_type= 'AND', + multi_category_filter=[], # multi_category_filter : [[CategoryFilterItem]] + sort_key=DATE_FIELD_NAME, # Sort Key upon which search results will be ordered + sort_order='descending', + highlights = TEXT_HIGHLIGHT,# Search keys to highlight result fragments from + highlight_fragments=60, + source_fields=[],# Particular fields to restrict search on + # Page settings + page_size=10, + page_number=1, + scan=False# Full Dataset Traversal key # If scan==True, then no Pagination else paginate + ) + ``` + - `DateAggregation` and `TermsAggregation` inherit `TextSearchFilter` to create aggregations for date and keywords for a particular search query. + +### Standalone Usage + +Paginated or iterator-style retrieval of saved ArXiv papers from Elasticsearch +```python +from arxiv_miner import KeywordsTextSearch,TextSearchFilter +ELASTICARGS= dict( + index_name=None, + host='localhost', + port=9200, + auth=None +) +database = KeywordsTextSearch(**ELASTICARGS) +# Pagination Style retrieval +text_filter = TextSearchFilter( + string_match_query="out of distribution generalization", + start_date_key='04/04/2015', + end_date_key = '04/04/2021', + page_size=100, +) +paginated_search_results = database.text_search(text_filter) +# Iterator style retrieval. 
+scan_text_filter = TextSearchFilter( + string_match_query="out of distribution generalization", + start_date_key='04/04/2015', + end_date_key = '04/04/2021', + scan=True +) +for doc in database.text_search_scan(scan_text_filter): + handlestuff(doc) +``` diff --git a/docs/deployment_scripts.md b/docs/deployment_scripts.md new file mode 100644 index 0000000..5e7a7be --- /dev/null +++ b/docs/deployment_scripts.md @@ -0,0 +1,42 @@ +# Running the Damn Thing. + +The [scripts folder](https://github.com/valayDave/arxiv-miner/tree/master/scripts) contains the scripts needed to scrape and parse content for storage in Elasticsearch. The `default_config.ini` file contains the `elasticsearch` configuration needed to run most scripts. The general structure of the ini file is as follows: +```ini +[elasticsearch] +host = localhost +index = arxiv_papers +port = 9200 +# auth = your_user_name your_super_secure_password +``` + +## Scraping / Data Extraction + +`scripts/scrape_papers.py` will tap into the feed provided by ArXiv from [this URL](http://export.arxiv.org/oai2?verb=ListRecords) to store records for further mining. It will start the data extraction according to the arguments it is given. This step is done to only scrape for new content or content which has changed. + +`scripts/scrape_papers.py` provides two options: +- Extract new records which were published on the feed in the last 24 hours and store them in the DB. +```sh +python scripts/scrape_papers.py --with-config default_config.ini daily-harvest +``` +- Extract records published in a date range and store them in the DB. +```sh +python scripts/scrape_papers.py --with-config default_config.ini date-range --start_date '2020-05-29' --end_date '2020-06-30' +``` + +## Data Mining and Storage +`scripts/mine_papers.py` takes the papers stored after scraping, extracts their LaTeX source and parses the data. 
+```sh +python scripts/mine_papers.py --with-config default_config.ini start-miner +``` +## Quick Streamlit Search Dashboard Over Stored Data +`scripts/arxiv_search_dash.py` runs a quick Streamlit-based dashboard to search and visualize results stored after scraping and mining. +```sh +streamlit run scripts/arxiv_search_dash.py -- --config default_config.ini +``` +## Save LaTeX Source To S3 bucket +> This script needs some tweaks to make it more customizable + +`scripts/mass_source_harvest.py` extracts the LaTeX sources from ArXiv and stores them in S3. +```sh +python scripts/mass_source_harvest.py --max-chunks 200 > /home/ubuntu/arxiv-miner/mass_harvet.log & +``` diff --git a/docs/index.html b/docs/index.html new file mode 100644 index 0000000..92cb6db --- /dev/null +++ b/docs/index.html @@ -0,0 +1,26 @@ + + + + + Document + + + + + + +
+ + + + + + + diff --git a/docs/quickstart.md b/docs/quickstart.md new file mode 100644 index 0000000..f2cdd0b --- /dev/null +++ b/docs/quickstart.md @@ -0,0 +1,22 @@ + +# Setup +This library can be used in many ways. It can be used as a standalone library to quickly mine content on ArXiv or act as a layer of access to information stored in Elasticsearch after mining and scraping. A `pip` install of the core library is needed. Additional dependencies can be installed according to which of the [core components](core_components.md) your use case needs: +- Scraping : None +- Mining : [Latex mining utilities installation](quickstart.md#latex-mining-utils-installation) , [Ontology classifier installation](quickstart.md#setup-ontology-classifier) +- Storage and Search: [Elasticsearch 7.8.0](https://www.elastic.co/guide/en/elasticsearch/reference/7.8/deb.html). + +## Core Library Install + +```sh +pip install git+https://github.com/valayDave/arxiv-miner +``` +## Latex Mining Utils installation +Main dependencies : `texlive-full` (Ubuntu) or [`texshop`](https://pages.uoregon.edu/koch/texshop/) (OSX) and [`opendetex`](https://github.com/pkubowicz/opendetex). The [setup_latex_parsing.sh](https://github.com/valayDave/arxiv-miner/blob/master/setup_latex_parsing.sh) script will install `texlive-full` and other dependencies for Ubuntu and also create the binary for `opendetex` in the current working directory. +```sh +sh setup_latex_parsing.sh +``` +## Setup Ontology Classifier: +The [`cso-classifier`](https://github.com/angelosalatino/cso-classifier/) needs to be installed to include Ontology mining when starting the mining process. +```sh +sh cso_setup.sh +``` diff --git a/docs/roadmap.md b/docs/roadmap.md new file mode 100644 index 0000000..71935bb --- /dev/null +++ b/docs/roadmap.md @@ -0,0 +1,13 @@ +# Roadmap/Vision + +## Add Test cases for different parts of the system. 
+Think of diverse and powerful test cases for the current system that can validate the functioning of isolated individual components. +## Equation Extraction Parsing And Explicit Storage +Parse equations from the LaTeX source and store them explicitly along with the parsed paper. +## Integrate Twitter and Reddit Into Datamodel. +Extract comments about a paper from Reddit and Twitter, along with their engagement (upvotes/likes), to include in search ranking and to store along with the parsed research content. +## Move Family Of Projects Under ArXiv Miner +Move projects such as `arxiv-table-miner`, `arxiv-table-miner-ml-models`, and `semantic-scholar-data-pipeline` into one project under `arxiv-miner` + +## Integrate More Powerful Semantic Search Tools +Explore integration of vector search engines like [Vald](https://vald.vdaas.org/) \ No newline at end of file diff --git a/docs/structures.md b/docs/structures.md new file mode 100644 index 0000000..27409ec --- /dev/null +++ b/docs/structures.md @@ -0,0 +1,51 @@ +# Core Data Structures +[arxiv_miner/record.py](https://github.com/valayDave/arxiv-miner/blob/master/arxiv_miner/record.py) contains all the core data structures needed by scraping, mining and data storage serialization/deserialization. +## `ArxivRecord` +This is the core base class which is used by `ArxivPaper` and holds all relevant data/metadata after the LaTeX sources are parsed. +```python + +class ArxivPaperProcessingMeta: + pdf_only:bool = False + latex_files:int = 0 + mined:bool = False + latex_parsed: bool=False + +class ArxivRecord(object): + # Core Identity : Information about arxivid, authors,categories etc. 
+ identity:ArxivIdentity = None + # Processing Metadata + # meta about processing results + paper_processing_meta : ArxivPaperProcessingMeta = None + # Intermediate representation latex parsing results + latex_parsing_result : ArxivLatexParsingResult = None + + # Single `Section` Created after parsing LaTeX which can be converted to `ResearchPaper` + latex_parsed_document : ArxivDocument = None + +``` + +## `ResearchPaper` + +The `ResearchPaper` is the data structure that holds the parsed text from the research document. The purpose of this object is to fit the research document and its hierarchy into predefined sections which commonly occur in research documents. The general pre-identified sections are given below: + +- Introduction +- Related Works +- Methodology +- Experiments +- Dataset +- Conclusion +- Limitations +Any section that doesn't fit the predefined sections will be categorized as *Unknown*. + +The `ResearchPaper` consists of key-value pairs which relate to the given predefined sections. The key is the name of the section, e.g. Introduction, Related Works etc., and the value is a `Section`. +## `Section` + +`Section` is a tree-like data structure that holds hierarchical information. `Section` is given by: +```python +class Section: + title:str = "Introduction" + text:str = "Text relating to the introduction of a paper" + children:List[Section] +``` +The *children* in the `Section` hold the information about the child nodes of that `Section`. The `Section` object helps capture a research paper’s hierarchy before it gets parsed into a key-value-based `ResearchPaper`. The `Section` object can also be serialized to JSON, making it indexable in the Lucene search index. 
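The tree shape and JSON serialization described for `Section` can be illustrated with a minimal sketch. It assumes only the three fields shown above (`title`, `text`, `children`); `SectionSketch`, `section_to_json`, `section_from_dict` and `iter_titles` are hypothetical names invented for this illustration and are not part of `arxiv_miner`, whose real `Section` class may serialize differently.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class SectionSketch:
    """Hypothetical stand-in for `Section`; not the class shipped in arxiv_miner."""
    title: str
    text: str = ""
    children: List["SectionSketch"] = field(default_factory=list)


def section_to_json(section: SectionSketch) -> str:
    # asdict() recurses into nested dataclasses, so the whole tree is converted.
    return json.dumps(asdict(section))


def section_from_dict(data: dict) -> SectionSketch:
    # Rebuild the tree depth-first from plain dictionaries.
    return SectionSketch(
        title=data["title"],
        text=data.get("text", ""),
        children=[section_from_dict(child) for child in data.get("children", [])],
    )


def iter_titles(section: SectionSketch, depth: int = 0) -> Iterator[Tuple[int, str]]:
    # Walk the hierarchy in document order, yielding (depth, title) pairs.
    yield depth, section.title
    for child in section.children:
        yield from iter_titles(child, depth + 1)


paper = SectionSketch(
    "Paper",
    "",
    [
        SectionSketch("Introduction", "Intro text", [SectionSketch("Motivation", "Why")]),
        SectionSketch("Experiments", "Results"),
    ],
)
restored = section_from_dict(json.loads(section_to_json(paper)))
```

A nested dictionary of this kind is the sort of document a JSON-based index can ingest, and the round trip through `section_from_dict` shows that the hierarchy survives serialization.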
+ diff --git a/requirements.txt b/requirements.txt index c71450c..1db09c3 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,6 +1,5 @@ # For Arxiv Scraping arxiv -arxivscraper # For Arxiv Processing tex2py @@ -16,7 +15,9 @@ dateparser # This is For Arxiv Mining expiringdict -rpyc elasticsearch elasticsearch_dsl luqum + +# For dashboard +streamlit diff --git a/arxiv_search_dash.py b/scripts/arxiv_search_dash.py similarity index 90% rename from arxiv_search_dash.py rename to scripts/arxiv_search_dash.py index bec6307..6f113ec 100644 --- a/arxiv_search_dash.py +++ b/scripts/arxiv_search_dash.py @@ -1,13 +1,13 @@ from typing import List import streamlit as st -from config import Config +from arxiv_miner.config import Config import dateparser from streamlit.cli import main import datetime from functools import wraps import arxiv_miner from arxiv_miner import \ - ArxivElasticTextSearch,\ + KeywordsTextSearch,\ TermsAggregation,\ DateAggregation,\ TextSearchFilter,\ @@ -22,8 +22,27 @@ import click import dateparser -from cli import database_choice,common_run_options +from arxiv_miner.cli import database_choice,common_run_options DEFAULT_APP_NAME = 'ArXiv-Search-Dashboard' +import argparse +import streamlit as st +import os + +parser = argparse.ArgumentParser(description=f'{DEFAULT_APP_NAME} : Quick streamlist dashboard to search over mined documents') + +parser.add_argument('--config', default='default_config.ini', + help="Path to configuration. If not provided default will be used.") +try: + args = parser.parse_args() +except SystemExit as e: + # This exception will be raised if --help or invalid command line arguments + # are used. Currently streamlit prevents the program from exiting normally + # so we have to do a hard exit. + os._exit(e.code) + +config = args.config + + APP_HELP_STR = '''

Product Help

@@ -70,9 +89,9 @@ def get_bookmarker(): return BookMarkMap() class DataView(): - def __init__(self,database:ArxivElasticTextSearch): + def __init__(self,database:KeywordsTextSearch): super().__init__() - if type(database) != ArxivElasticTextSearch: + if type(database) != KeywordsTextSearch: raise("Elastic Search Required For Text Based DB Search") sort_keys = [ dict(name='Recency',value=DATE_FIELD_NAME), @@ -336,7 +355,7 @@ def print_block(block:SearchResults,bookmarked=False,allow_bookmarking=True,surv return (bookmark_button_res,block.identity.identity) # --> Returns button value and identity. -@st.cache(hash_funcs={ArxivElasticTextSearch:id,TextSearchFilter:hash},persist=True,allow_output_mutation=True) +@st.cache(hash_funcs={KeywordsTextSearch:id,TextSearchFilter:hash},persist=True,allow_output_mutation=True) def get_db_data(\ db_conn,\ tsf @@ -344,12 +363,11 @@ def get_db_data(\ return db_conn.text_search(tsf) -@st.cache(hash_funcs={ArxivElasticTextSearch:id},allow_output_mutation=True) -def get_db_obj(use_defaults,host,port,app_name=DEFAULT_APP_NAME): +@st.cache(hash_funcs={KeywordsTextSearch:id},allow_output_mutation=True) +def get_db_obj(use_defaults,config_path,host,port,app_name=DEFAULT_APP_NAME): db_arg_obj = {} - datastore = 'elasticsearch' - args , client_class = database_choice(datastore,use_defaults,host,port) - print_str = '\n %s Process Using %s Datastore'%(app_name,datastore) + args , client_class = database_choice(use_defaults,config_path,host,port) + print_str = '\n %s Process Using %s Datastore'%('ArXiv-Search-Dash','elasticsearch') args_str = ''.join(['\n\t'+ i + ' : ' + str(args[i]) for i in args]) click.secho(print_str,fg='green',bold=True) click.secho(args_str+'\n\n',fg='magenta') @@ -360,10 +378,10 @@ def get_db_obj(use_defaults,host,port,app_name=DEFAULT_APP_NAME): return database_client -def text_search_dashboard(use_defaults,host,port,app_name=DEFAULT_APP_NAME): - database_client = get_db_obj(use_defaults,host,port) +def 
text_search_dashboard(use_defaults,config_path,host,port,app_name=DEFAULT_APP_NAME): + database_client = get_db_obj(use_defaults,config_path,host,port) DataView(database_client) if __name__=="__main__": # text_search_dashboard = wrap_db(text_search_dashboard,main) - text_search_dashboard(True,None,None) \ No newline at end of file + text_search_dashboard(False,config,None,None) \ No newline at end of file diff --git a/author_ontology_index_setup.py b/scripts/author_ontology_index_setup.py similarity index 97% rename from author_ontology_index_setup.py rename to scripts/author_ontology_index_setup.py index 25e0a9f..9bcf1b0 100644 --- a/author_ontology_index_setup.py +++ b/scripts/author_ontology_index_setup.py @@ -5,7 +5,7 @@ from arxiv_miner.utils import load_json_from_file,save_json_to_file,dir_exists import os from arxiv_miner import ArxivElasticSeachDatabaseClient -from config import Config +from arxiv_miner.config import Config from arxiv_miner.logger import create_logger from arxiv_miner.ontology_miner import OntologyMiner,ONTOLOGY_MINABLE from arxiv_miner.record import ArxivSematicParsedResearch, Author,Ontology diff --git a/backup.py b/scripts/backup.py similarity index 98% rename from backup.py rename to scripts/backup.py index e7a8a67..27ed25c 100644 --- a/backup.py +++ b/scripts/backup.py @@ -3,7 +3,7 @@ from arxiv_miner.utils import load_json_from_file,save_json_to_file,dir_exists import os from arxiv_miner import ArxivElasticSeachDatabaseClient -from config import Config +from arxiv_miner.config import Config from arxiv_miner.logger import create_logger from arxiv_miner.record import ArxivSematicParsedResearch import time diff --git a/id_feed_data.py b/scripts/id_feed_data.py similarity index 96% rename from id_feed_data.py rename to scripts/id_feed_data.py index 87c0d60..765cb6f 100644 --- a/id_feed_data.py +++ b/scripts/id_feed_data.py @@ -3,16 +3,15 @@ Mine Records and Send them back to The DB. 
""" from arxiv_miner import \ - MiningProcess,\ - ArxivDatabaseServiceClient + MiningProcess from arxiv_miner import TextSearchFilter import click from arxiv_miner.logger import create_logger from arxiv_miner.utils import load_json_from_file,save_json_to_file,dir_exists import os -from config import Config -from cli import db_cli +from arxiv_miner.config import Config +from arxiv_miner.cli import db_cli import random import time import json diff --git a/mass_source_harvest.py b/scripts/mass_source_harvest.py similarity index 98% rename from mass_source_harvest.py rename to scripts/mass_source_harvest.py index 4dca5ab..0dd1f99 100644 --- a/mass_source_harvest.py +++ b/scripts/mass_source_harvest.py @@ -9,7 +9,7 @@ import metaflow import os import click -from config import Config +from arxiv_miner.config import Config import json from arxiv_miner.logger import create_logger from typing import List,Tuple diff --git a/mine_papers.py b/scripts/mine_papers.py similarity index 91% rename from mine_papers.py rename to scripts/mine_papers.py index 5d6b6ad..7c13124 100644 --- a/mine_papers.py +++ b/scripts/mine_papers.py @@ -3,18 +3,16 @@ Mine Records and Send them back to The DB. 
""" from arxiv_miner import \ - MiningProcess,\ - ArxivDatabaseServiceClient -from arxiv_miner import ArxivElasticSeachDatabaseClient + MiningProcess import click import os -from config import Config -from cli import db_cli +from arxiv_miner.config import Config +from arxiv_miner.cli import db_cli import time -DEFAULT_PATH = Config.mining_data_path -DEFAULT_DETEX_PATH = Config.detex_path +DEFAULT_PATH = os.path.abspath('./mining_data/papers') +DEFAULT_DETEX_PATH = os.path.abspath('./detex') DEFAULT_MINING_INTERVAL=5 SLEEP_BETWEEEN_PORCS = 5 diff --git a/ontology_migration.py b/scripts/ontology_migration.py similarity index 98% rename from ontology_migration.py rename to scripts/ontology_migration.py index 1c1f773..a6ce391 100644 --- a/ontology_migration.py +++ b/scripts/ontology_migration.py @@ -3,7 +3,7 @@ from arxiv_miner.utils import load_json_from_file,save_json_to_file,dir_exists import os from arxiv_miner import ArxivElasticSeachDatabaseClient,KeywordsTextSearch -from config import Config +from arxiv_miner.config import Config from arxiv_miner.logger import create_logger from arxiv_miner.ontology_miner import OntologyMiner,ONTOLOGY_MINABLE from arxiv_miner.record import ArxivSematicParsedResearch,Ontology,Author diff --git a/scripts/remine_data.py b/scripts/remine_data.py new file mode 100644 index 0000000..662d560 --- /dev/null +++ b/scripts/remine_data.py @@ -0,0 +1,68 @@ +""" +The Purpose Of this Script is To Connect to ArxivDatabase +Mine Records and Send them back to The DB. 
+""" +from arxiv_miner import \ + MiningProcess +from arxiv_miner import TextSearchFilter + +import click +from arxiv_miner.logger import create_logger +from arxiv_miner.utils import load_json_from_file,save_json_to_file,dir_exists +import os +from arxiv_miner.config import Config +from arxiv_miner.cli import db_cli +import random +import time +import json +import pandas + + +DEFAULT_PATH = Config.mining_data_path +DEFAULT_DETEX_PATH = Config.detex_path + +DEFAULT_MINING_INTERVAL=5 +SLEEP_BETWEEEN_PORCS = 5 +DEFAULT_MINING_LIMIT=30 +DEFAULT_EMPTY_WAIT_TIME= 600 +DEFAULT_SLEEP_INTERVAL_COUNT = 50 +DEFAULT_FILE = 'id_list.json' +APP_NAME = 'ArXiv-Miner' +MINER_HELP = f''' + +The Purpose Of this Script is To Connect to ArxivDatabase, +and Set mined to be for the records in {DEFAULT_FILE} +''' + +STORE_ROOT_PATH = 'extracted_id_data' + +@db_cli.command(help=MINER_HELP) +@click.option('--id_dict_path',default=DEFAULT_FILE,help='File of Arxiv Id list to extract data Feed. ') +@click.option('--sample',default=None,type=int,help='Sample Of Ids from the selected `id_dict_path`') +@click.pass_context +def remine_data(ctx, # click context object: populated from db_cli + id_dict_path = DEFAULT_FILE, + sample=None + ): + backup_time = str(int(time.time())) + + logger = create_logger('Data Reminer') + + num_stored = 0 + logger.info("Starting Database Stream " ) + + database_client = ctx.obj['db_class'](**ctx.obj['db_args']) # Create Database + with open(id_dict_path,'r') as f : + id_json = json.load(f) + + id_list = id_json['id_list'] + if sample is not None: + id_list = random.sample(id_list,sample) + + for datatup in database_client.id_stream(id_list): + arxiv_id,rec_obj,_ = datatup + database_client.set_mined(rec_obj.identity,False) + logger.info(f"Set {arxiv_id} to Unmined") + +if __name__ == "__main__": + db_cli() diff --git a/scrape_papers.py b/scripts/scrape_papers.py similarity index 98% rename from scrape_papers.py rename to scripts/scrape_papers.py index 
6478020..0393563 100644 --- a/scrape_papers.py +++ b/scripts/scrape_papers.py @@ -1,5 +1,4 @@ from arxiv_miner import \ - ArxivDatabaseServiceClient,\ MassDataHarvestingEngine,\ DailyScrapingEngine,\ ScrapingEngine,\ @@ -19,8 +18,9 @@ import click import datetime from functools import wraps,partial -from config import Config -from cli import db_cli +from arxiv_miner.config import Config +from arxiv_miner.cli import db_cli + DEFAULT_SELECTED_CLASS = 'cs' DEFAULT_START_DATE = datetime.datetime.now().strftime(ScrapingEngine.date_format) @@ -64,7 +64,6 @@ def date_range(ctx, # click context start_date=DEFAULT_START_DATE,\ end_date=DEFAULT_END_DATE,\ timeout_per_scrape=DEFAULT_TIMEOUT_PER_DATE_RANGE_SCRAPE): - database_client = ctx.obj['db_class'](**ctx.obj['db_args']) harvester = MassDataHarvestingEngine.from_string_dates( database_client,\ @@ -92,7 +91,6 @@ def daily_harvest(ctx, # click context selected_class=DEFAULT_SELECTED_CLASS,\ thread_mode=DEFAULT_THREADMODE,\ timeout_per_scrape=DEFAUL_TIMEOUT_PER_DAILY_SCRAPE): - database_client = ctx.obj['db_class'](**ctx.obj['db_args']) # Create Database harvester = DailyScrapingEngine(database_client,selected_class=selected_class) hp = None diff --git a/setup.py b/setup.py index ba1e7de..9e49dbb 100644 --- a/setup.py +++ b/setup.py @@ -1,6 +1,6 @@ from setuptools import setup, find_packages -version = '1.4.0' +version = '2.0.0' setup(name='arxiv_miner', version=version, @@ -13,7 +13,6 @@ include_package_data=True, install_requires = [ 'arxiv', - 'arxivscraper', 'tex2py', 'matplotlib', 'pandas', @@ -21,7 +20,6 @@ 'numpy', 'dateparser', 'expiringdict', - 'rpyc', 'elasticsearch', 'elasticsearch_dsl', 'luqum', diff --git a/setup_latex_parsing.sh b/setup_latex_parsing.sh new file mode 100644 index 0000000..8d69698 --- /dev/null +++ b/setup_latex_parsing.sh @@ -0,0 +1,11 @@ +sudo apt-get update + +sudo apt-get install -y make build-essential libssl-dev zlib1g-dev \ +libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm 
libncurses5-dev \ +libncursesw5-dev xz-utils tk-dev libffi-dev liblzma-dev python-openssl + +sudo apt-get install -y make flex texlive-full + +git clone https://github.com/pkubowicz/opendetex +(cd opendetex && make) +cp opendetex/detex ./detex \ No newline at end of file diff --git a/utils.py b/utils.py deleted file mode 100644 index 05dd0e5..0000000 --- a/utils.py +++ /dev/null @@ -1,95 +0,0 @@ -from contextlib import contextmanager - -import os -import re -import pickle -import tempfile -import json - -def dir_exists(dir_path): - try: - os.stat(dir_path) - return True - except: - return False - -def load_json_from_file(file_path): - with open(file_path,'r') as f: - json_file = json.load(f) - return json_file - -def save_json_to_file(json_dict,file_path): - with open(file_path,'w') as f: - json.dump(json_dict,f) - -# Context managers for atomic writes courtesy of -# http://stackoverflow.com/questions/2333872/atomic-writing-to-file-with-python -@contextmanager -def _tempfile(*args, **kws): - """ Context for temporary file. - - Will find a free temporary filename upon entering - and will try to delete the file on leaving - - Parameters - ---------- - suffix : string - optional file suffix - """ - - fd, name = tempfile.mkstemp(*args, **kws) - os.close(fd) - try: - yield name - finally: - try: - os.remove(name) - except OSError as e: - if e.errno == 2: - pass - else: - raise e - - -@contextmanager -def open_atomic(filepath, *args, **kwargs): - """ Open temporary file object that atomically moves to destination upon - exiting. - - Allows reading and writing to and from the same filename. 
- - Parameters - ---------- - filepath : string - the file path to be opened - fsync : bool - whether to force write the file to disk - kwargs : mixed - Any valid keyword arguments for :code:`open` - """ - fsync = kwargs.pop('fsync', False) - - with _tempfile(dir=os.path.dirname(filepath)) as tmppath: - with open(tmppath, *args, **kwargs) as f: - yield f - if fsync: - f.flush() - os.fsync(f.fileno()) - os.rename(tmppath, filepath) - -def safe_pickle_dump(obj, fname): - with open_atomic(fname, 'wb') as f: - pickle.dump(obj, f, -1) - - -# arxiv utils -# ----------------------------------------------------------------------------- - -def strip_version(idstr): - """ identity function if arxiv id has no version, otherwise strips it. """ - parts = idstr.split('v') - return parts[0] - -# "1511.08198v1" is an example of a valid arxiv id that we accept -def isvalidid(pid): - return re.match('^\d+\.\d+(v\d+)?$', pid)