OSS Release (#3)

* OSS Cleanup
  - Refactored CLI into the main module
  - Removed all outside scripts and put them in one folder
  - Removed FS database
  - Removed outside config
  - Set up ini file based configuration
  - Created documentation
  - Added changelog and contribution guide
  - Added shell script for opendetex etc.
  - Fixed the Streamlit dashboard
  - Added license
  - Version bump and final cleanup pre merge
valayDave · May 28, 2021 · c7caf0b
1 parent c60d143
Showing 46 changed files with 603 additions and 1,404 deletions.
diff --git a/LICENSE.txt b/LICENSE.txt
@@ -0,0 +1,17 @@
+MIT License
+Copyright (c) 2021 Valay Dave
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/Readme.md b/Readme.md
@@ -1,110 +1,44 @@
-# Arxiv Miner. 
+# ArXiv-Miner
 
-Repository Helps Mine Arxiv Papers to quickly Scrape through new Papers and Mine data for Faster Readings. 
+> ArXiv Miner is a toolkit for mining research papers on CS ArXiv. 
 
-# BROADER GOAL
-1. The goal of this project is to annotate and build faster search around research papers so that I can be quickly aware of what is happening in the domain. 
-2. It is also meant to structure research papers as serialisable JSON so that I can start annotating research and fixing things around the same. 
+## What is ArXiv-Miner
 
-# How can One get there ?
+`arxiv-miner` is a small, handy library that helps power [Sci-Genie](https://sci-genie.com). Sci-Genie is a search engine for quickly searching through the full text of papers on CS ArXiv. `arxiv-miner` helps extract and parse LaTeX documents from CS ArXiv. It also supports storage and search of those parsed documents using **Elasticsearch**. The library can also be applied to other ArXiv domains such as Math, Physics, and Biology. 
 
-## ARXIV PAPER MINING
+## Documentation 
+All documentation on how to install and use `arxiv-miner` is provided in the documentation website or inside the [docs folder](docs). Contribution guidelines are also provided there. 
 
-### GOAL OF PAPER MINING 
-Parse the Arxiv Latex/PDF into A research Paper Object which can be serialised so that It is in readable format for some form of Machine learning/Annoation methods. But it all starts from cleaning the Dirt from Arxiv. 
+## Why was ArXiv-Miner created?
+ArXiv Miner was created to make it easy to scrape, parse and search research content on ArXiv. This library was created by stitching together solutions from the code of various tools like [arxiv-sanity](https://github.com/karpathy/arxiv-sanity-preserver), [arxiv-vanity/engrafo](https://github.com/arxiv-vanity/engrafo), [arxivscraper](https://github.com/Mahdisadjadi/arxivscraper), [tex2py](https://github.com/alvinwan/tex2py), [cso-classifier](https://github.com/angelosalatino/cso-classifier/) and [axcell](https://github.com/paperswithcode/axcell). The parsed structure of the content can be useful in search or in any scientific research mining/AI application as a heuristic baseline.
 
-### WAY TO DO IT 
-1. Extract Papers from `Arxiv` using `scrape_papers.py` script. The `ArxivDatabase` will hold the `ArxivRecord`s.
-2. `mine_papers.py` will download the Latex version of the Papers from Arxiv and create an `ArxivRecord` object.  
-3. The `ArxivRecord` is a base class to `ArxivPaper`. 
-4. The `ArxivPaper` Object helps extract the Latex source from the Arxiv and parses it. 
-    - Three things will help solve the Information mining Problem. 
-        1. Extraction of Document Structure/hierarchy via Python-Latex Libraries like `tex2py`. 
-        2. Extraction of Text from Latex Document Using `detex` : https://github.com/pkubowicz/opendetex
-        3. Collate the Tree with the text, based on hierarchical traversal of the tree and text-splitting based search. 
-    - These things are Managed using the child classes of `LatexInformationParser`. These child classes help form the Structured `Section` objects which contain the stored parsed structure of the Research paper. 
-5. The Scraped/Mined Papers are stored in an `fs` or `elasticsearch` based search engine. 
+## Core Components of ArXiv-Miner
+- Scraping 
+- Parsing
+- Indexing/Storage 
 
+## Family Of Projects With ArXiv-Miner
+- `arxiv-table-miner` : Coming Soon.
+- `arxiv-table-ml-models` : Coming Soon.
+- `semantic-scholar-data-pipeline` : https://github.com/valayDave/semantic-scholar-data-pipeline
 
-## Setup
+## Disclaimer 
+This project was developed like a [Cowboy coder](https://en.wikipedia.org/wiki/Cowboy_coding) over the [COVID-19 pandemic](https://en.wikipedia.org/wiki/COVID-19_pandemic). Hence, it **may have bugs and the code may not be well optimized**. The primary reason for development was to aid CS and Machine Learning/AI research, but this tool can be extended to all 3M+ documents on ArXiv. 
 
-```sh
-sh setup.sh
-```
+## Call For Contributors
+Contributions that improve the project or fix bugs are very welcome. Please read the contribution guide in the documentation.  
 
-### To Setup Ontology Miner: 
+## Credits and Appreciation
+This project, like all others, has been built on the shoulders of giants. A big thanks to the creators of the following libraries/open source projects that aided the development of `arxiv-miner` and its family of projects:
+- [arxiv-sanity](https://github.com/karpathy/arxiv-sanity-preserver)
+- [arxiv-vanity/engrafo](https://github.com/arxiv-vanity/engrafo) 
+- [arxivscraper](https://github.com/Mahdisadjadi/arxivscraper)
+- [tex2py](https://github.com/alvinwan/tex2py)
+- [cso-classifier](https://github.com/angelosalatino/cso-classifier/) 
+- [axcell](https://github.com/paperswithcode/axcell)
+- [elasticsearch](https://github.com/elastic/elasticsearch)
+- [Semantic Scholar Open Research corpus](https://github.com/allenai/s2orc)
+- [metaflow](https://metaflow.org)
 
-```sh
-sh cso_setup.sh
-```
-
-## What is Done Yet : 
-
-1. Arxiv PDF and LateX Extraction Pipeline
-2. Arxiv Paper Parsing to JSON Objects using Latex and Python. --> Latex Based Symantically parsed Data Extraction :: READY 
-3. Local Database Setup and Data Exploration. 
-
-## What Needs to Be Done ?
-
-1. Data Extraction And Pasing System Are pretty Well set from Database. 
-    1. The Database Generation needs to move from Andrej's script to using the `arxivscraper` which uses the mass Metadata extraction.
-
-2. Final System : 
-    - Scraping Crons
-    - Parsing Idempotent processes. 
-        - TODO : Further parse
-    - ArxivRecord Database with `fs` | `elasticsearch`
-    - Search Interface
-        - Daily Update of New Research
-        - Search indexing for 
-
-
-# How Does it Work ? 
-
-## Overview 
-- Parts of Current System : 
-    - `ArxivDatabase` : Core class to expose base methods for interfacing with DB. It is an adapter that can work with a `filesystem` based database or `elasticsearch`. The purpose of the adapter is to create an interoperable data layer that can be switched according to requirement and need. 
-    - Filesystem based DB uses `ArxivDatabaseService(rpyc.Service,ArxivFSDatabase)`. The `database_server.py` file helps create and FS based database server. 
-    - `HarvestingProcess` : This uses a `ScrapingEngine` to extract `ArxivIdentity` from ArXiv API(`http://export.arxiv.org/oai2?verb=ListRecords`). 
-        - The Data extracted is stored to the database as an `ArxivRecord`. 
-        - `DailyHarvestationProcess` helps retrieve data daily papers. 
-        - `MassHarvestationProcess` gets data based on DateRange. 
-    - `MiningProcess`: Helps mine the papers for `LaTeX` information. The mined `ArxivRecord` is stored in the Database 
-
-- The Database provides a Way to Create/Update `ArxivRecord`. The `ArxivRecord` contains an `ArxivIdentity` which is extracted using the `arxiv_miner.scraping_engine.ScrapingEngine`. `ArxivRecord` is the Fundamental Datastructure use to identify a research paper. `ArxivPaper` is a processing Object which can use a `ArxivRecord` to start the mining process. 
-
-## Running the Damn Thing. 
-- The `config.py` file contains the `Config` Object which is Singleton used for configuration across the project. 
-- Start FS based Database Server with Below Command . The Database Server is responsible For Managing the data. Elasticsearch is also supported as a backend database. 
-    ```sh
-    python database_server.py
-    ```
-- Start the Data Harvester according to your requirements. Can perform a `daily-harvest` or a `date-range` harvest. 
-    ```sh
-    python scrape_papers.py --help
-    ```
-    - DB adapters can be switched. The `--use_defaults` will load the defaults of `--datastore` from `Config`.
-        ```sh
-        python scrape_papers.py --datastore elasticsearch --host localhost --port 18861 daily-harvest
-        ```
-- Start the Miner To parallely start mining the Extracted data. 
-    ```sh
-    python mine_papers.py --help
-    ```
-    - The Miner has the same database cli adapter as Scraper. 
-    ```sh
-    python mine_papers.py --datastore fs --use_defaults start-miner
-    ```
-- Source Harvest and Store to S3: 
-    ```sh
-    nohup /home/ubuntu/arxiv-miner/.env/bin/python /home/ubuntu/arxiv-miner/mass_source_harvest.py --max-chunks 200 > /home/ubuntu/arxiv-miner/mass_harvet.log &
-    ```
-
--  Extract EC2 instance List from AWS
-    ```
-    aws ec2 describe-instances --region=us-east-1 --query 'Reservations[*].Instances[*].[InstanceId,Tags[?Key==`Name`].Value|[0],State.Name,PrivateIpAddress,PublicIpAddress]' --output table > instance_list.md
-    ```
-# TODO / VISION
-1. Create a search interface for looking for research. 
-2. Get daily analytics of the new research coming out 
-3. Create reports and analytics for the new research
+## Licence
+MIT
diff --git a/arxiv_miner/__init__.py b/arxiv_miner/__init__.py
@@ -8,11 +8,6 @@
         ResearchPaper,\
         ResearchPaperSematicParser
 
-from .loader import \
-    ArxivLoader,\
-    ArxivLoaderFilter,\
-    FSArxivLoadingFactory
-
 from .record import \
     ArxivIdentity,\
     ArxivLatexParsingResult,\
@@ -22,8 +17,6 @@
     ArxivSematicParsedResearch
 
 from .database import \
-        ArxivFSDatabaseService,\
-        ArxivDatabaseServiceClient,\
         ArxivElasticSeachDatabaseClient,\
         KeywordsTextSearch,\
         TextSearchFilter,\

diff --git a/arxiv_miner/cli.py b/arxiv_miner/cli.py
@@ -0,0 +1,62 @@
+'''
+Generalised CLI entry point of the project.
+It extracts the common CLI information, such as the database selection,
+and can be used as a gateway to integrate all the submodules into one CLI invocation.
+'''
+
+import click
+from functools import wraps
+import configparser
+from .config import Config
+from .database import SUPPORTED_DBS,get_database_client
+import json
+
+DEFAULT_APP_NAME= 'ArXiv-Miner'
+
+def common_run_options(func):
+    db_defaults = Config.get_db_defaults()
+    @click.option('--host', default=db_defaults['host'], help='ArxivDatabase Host')
+    @click.option('--port', default=db_defaults['port'], help='ArxivDatabase Port')
+    @wraps(func)
+    def wrapper(*args, **kwargs):
+        return func(*args, **kwargs)
+    return wrapper
+
+
+@click.group(invoke_without_command=True)
+@click.option('--use_defaults',is_flag=True,help='Use Default Database Configurations For Chosen Datastore.')
+@click.option('--with-config',default=None,help='Path to configuration ini file to use. Uses a configuration file for the instantiation of the database')
+@common_run_options
+@click.pass_context
+def db_cli(ctx,use_defaults,with_config,host,port,app_name=DEFAULT_APP_NAME):
+    ctx.obj = {}    
+    args , client_class = database_choice(use_defaults,with_config,host,port)
+    print_str = '\n %s Process Using %s Datastore'%(app_name,'elasticsearch')
+    args_str = ''.join(['\n\t'+ i + ' : ' + str(args[i]) for i in args])
+    click.secho(print_str,fg='green',bold=True)
+    click.secho(args_str+'\n\n',fg='magenta')
+    ctx.obj['db_class'] = client_class
+    ctx.obj['db_args'] = args
+
+
+def database_choice(use_defaults,with_config,host,port):
+    client_class = get_database_client('elasticsearch')
+    if with_config is not None:
+        config = configparser.ConfigParser()
+        config.read(with_config)
+        args = dict(index_name=config['elasticsearch']['index'],
+                    host=config['elasticsearch']['host']
+                    )
+        if 'port' in config['elasticsearch']:
+            args['port'] = config['elasticsearch']['port']
+        if 'auth' in config['elasticsearch']:
+            args['auth'] = config['elasticsearch']['auth'].split(' ')
+    # get_database_client raises DatabaseNotSupported if an unsupported datastore name is given
+    elif use_defaults:
+        args = Config.get_defaults('elasticsearch')
+    else:
+        args = dict(index_name=Config.elasticsearch_index,host=host,port=port)
+    return args, client_class
+
+if __name__ == '__main__':
+    db_cli()
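
The ini-based branch of `database_choice` can be sketched in isolation as follows. This is a minimal stand-alone sketch, not the project's code: the helper name `args_from_ini` and the sample values are hypothetical, while the section and key names (`elasticsearch`, `index`, `host`, `port`, `auth`) are the ones read in the diff above.

```python
import configparser

# Hypothetical ini content. The section and key names mirror what
# database_choice() reads; the values are made up for illustration.
EXAMPLE_INI = """
[elasticsearch]
index = arxiv_papers
host = localhost
port = 9200
auth = some_user some_password
"""


def args_from_ini(ini_text):
    """Build database client keyword arguments from ini text (hypothetical helper)."""
    config = configparser.ConfigParser()
    config.read_string(ini_text)
    section = config['elasticsearch']
    # 'index' and 'host' are required; a missing key raises KeyError.
    args = dict(index_name=section['index'], host=section['host'])
    # 'port' and 'auth' are optional.
    if 'port' in section:
        args['port'] = section['port']
    if 'auth' in section:
        # A space-separated pair is split into a two-item list.
        args['auth'] = section['auth'].split(' ')
    return args


args = args_from_ini(EXAMPLE_INI)
print(args)
```

Note that `configparser` returns every value as a string, so `port` arrives as `'9200'` rather than `9200`; a caller that needs an integer would have to convert it.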
diff --git a/arxiv_miner/config.py b/arxiv_miner/config.py
@@ -0,0 +1,32 @@
+# TODO : Move this to configuration format where the entire thing comes from a YML file
+import os
+# global settings
+# -----------------------------------------------------------------------------
+class Config(object):
+    default_database = 'elasticsearch' 
+    elasticsearch_port = 9200
+    elasticsearch_host = 'localhost'
+    elasticsearch_index = 'arxiv_papers'
+    es_auth = None # should be a tuple
+
+    # Object Store 
+    bucket_name = 'arxiv-papers-source-bucket'
+
+    @classmethod
+    def get_defaults(cls,db_str):
+        if db_str == 'elasticsearch':
+            return_dict = dict(\
+                host=cls.elasticsearch_host,\
+                port=cls.elasticsearch_port,\
+                index_name = cls.elasticsearch_index)
+
+            if cls.es_auth is not None:
+                return_dict['auth']=cls.es_auth
+
+            return return_dict
+        else:
+            return None
+
+    @classmethod
+    def get_db_defaults(cls):
+        return cls.get_defaults(cls.default_database)
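
To illustrate how the class-level defaults behave, here is a minimal sketch that uses a condensed stand-in for the `Config` class shown above (same attribute names, object store setting omitted). The credentials are hypothetical placeholders.

```python
class Config(object):
    # Condensed stand-in for the Config class in the diff above.
    default_database = 'elasticsearch'
    elasticsearch_port = 9200
    elasticsearch_host = 'localhost'
    elasticsearch_index = 'arxiv_papers'
    es_auth = None  # should be a tuple

    @classmethod
    def get_defaults(cls, db_str):
        if db_str != 'elasticsearch':
            return None
        defaults = dict(host=cls.elasticsearch_host,
                        port=cls.elasticsearch_port,
                        index_name=cls.elasticsearch_index)
        # 'auth' is only included when credentials have been set.
        if cls.es_auth is not None:
            defaults['auth'] = cls.es_auth
        return defaults


without_auth = Config.get_defaults('elasticsearch')

# Hypothetical credentials, set at class level before building a client.
Config.es_auth = ('some_user', 'some_password')
with_auth = Config.get_defaults('elasticsearch')

# Any other datastore name yields None rather than raising.
unknown = Config.get_defaults('fs')
```

Because `get_defaults` returns `None` for an unknown datastore, code that indexes the result directly (as `common_run_options` does with `db_defaults['host']`) relies on `default_database` staying set to `'elasticsearch'`.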
diff --git a/arxiv_miner/database/__init__.py b/arxiv_miner/database/__init__.py
@@ -11,12 +11,7 @@
     FIELD_MAPPING,\
     DATE_FIELD_NAME
 
-from .filesystem import ArxivFSDatabase
-from .proxy_service import \
-            ArxivFSDatabaseService,\
-            ArxivDatabaseServiceClient
-
-SUPPORTED_DBS = ['fs','elasticsearch']
+SUPPORTED_DBS = ['elasticsearch']
 
 class DatabaseNotSupported(Exception):
     headline = 'DB_CLIENT_NOT_FOUND'
@@ -29,7 +24,5 @@ def __init__(self,given_client):
 def get_database_client(client_name):
     if client_name not in SUPPORTED_DBS:
         raise DatabaseNotSupported(client_name)
-    if client_name == 'fs':
-        return ArxivDatabaseServiceClient
     elif client_name == 'elasticsearch':
         return KeywordsTextSearch
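
The effect of dropping `'fs'` from `SUPPORTED_DBS` can be sketched as below. This is a self-contained approximation: `KeywordsTextSearch` is replaced by an empty stand-in class, and the exception message is an assumption (the real `DatabaseNotSupported` body is not shown in this diff).

```python
SUPPORTED_DBS = ['elasticsearch']


class DatabaseNotSupported(Exception):
    headline = 'DB_CLIENT_NOT_FOUND'

    def __init__(self, given_client):
        # Message format is an assumption made for this sketch.
        super().__init__('%s : %s' % (self.headline, given_client))


class KeywordsTextSearch(object):
    """Empty stand-in for the real Elasticsearch-backed client class."""


def get_database_client(client_name):
    if client_name not in SUPPORTED_DBS:
        raise DatabaseNotSupported(client_name)
    return KeywordsTextSearch


client_class = get_database_client('elasticsearch')

# After this commit, asking for the filesystem backend is rejected.
try:
    get_database_client('fs')
    fs_supported = True
except DatabaseNotSupported:
    fs_supported = False
```

The function returns a class rather than an instance, which matches how `db_cli` stores `db_class` and `db_args` separately on the click context so that the client can be constructed later.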
diff --git a/arxiv_miner/database/elasticsearch.py b/arxiv_miner/database/elasticsearch.py
@@ -56,7 +56,6 @@ def __init__(self,index_name=None,host='localhost',port=9200,auth=None):
             src_str = f'{host}'
         else:
             src_str = f'{host}:{port}'
-
         if auth is None:
             self.es = elasticsearch.Elasticsearch(src_str,timeout=30, max_retries=10)
         else:
@@ -811,18 +810,6 @@ def text_aggregation(self,agg_obj:Aggregation):
         return_buckets = agg_obj.transform_resp(aggregation_buckets)
         return return_buckets
 
-    # @async_wrap
-    # def async_text_search_scan(self,filter_obj:TextSearchFilter):
-    #     return self.text_search_scan(filter_obj)
-
-    # @async_wrap
-    # def async_text_aggregation(self,agg_obj:Aggregation):
-    #     return self.text_aggregation(agg_obj)
-
-    # @async_wrap
-    # def async_text_search(self,filter_obj:TextSearchFilter):
-    #     return self.text_search(filter_obj)
-
 class KeywordsTextSearch(ArxivElasticTextSearch):
     def __init__(self, **kwargs):
         super().__init__(**kwargs)