OSS Release (#3)
* OSS Cleanup
- Refactored CLI into the main module
- removed all outside scripts and put them in one folder
- removed FS database.
- Removed outside config.
- setup ini file based configuration.
- Create documentation
- Added changelog and contribution guide.
- Added shell script for open detex etc.
- Fixed the streamlit dashboard.
- Added license
- Version bump and final cleanup pre merge.
valayDave authored May 28, 2021
1 parent c60d143 commit c7caf0b
Showing 46 changed files with 603 additions and 1,404 deletions.
17 changes: 17 additions & 0 deletions LICENSE.txt
@@ -0,0 +1,17 @@
MIT License
Copyright (c) 2021 Valay Dave
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
132 changes: 33 additions & 99 deletions Readme.md
@@ -1,110 +1,44 @@
# Arxiv Miner.
# ArXiv-Miner

Repository Helps Mine Arxiv Papers to quickly Scrape through new Papers and Mine data for Faster Readings.
> ArXiv Miner is a toolkit for mining research papers on CS ArXiv.
# BROADER GOAL
1. The goal of this project is to annotate and build faster search around research papers so that I can quickly stay aware of what is happening in the domain.
2. It is also meant to structure research papers in serialisable JSON so that I can start annotating research and fixing things around the same.
## What is ArXiv-Miner

# How can One get there ?
`arxiv-miner` is a quick, handy library that helps power [Sci-Genie](https://sci-genie.com). Sci-Genie is a search engine for quickly searching through the full text of papers on CS ArXiv. `arxiv-miner` helps extract and parse LaTeX documents from CS ArXiv. It also supports storage and search of those parsed documents using **Elasticsearch**. The library can also be applied to other domains like Math, Physics, Biology etc.

## ARXIV PAPER MINING
## Documentation
All documentation on how to install and use `arxiv-miner` is provided in the documentation website or inside the [docs folder](docs). Contribution guidelines are also provided there.

### GOAL OF PAPER MINING
Parse the Arxiv LaTeX/PDF into a research paper object that can be serialised, so that it is in a readable format for machine learning/annotation methods. But it all starts with cleaning the dirt from Arxiv.
## Why was ArXiv-Miner created ?
ArXiv Miner was created for easily scraping, parsing and searching research content on ArXiv. This library was created after stitching together solutions from the code of various tools like [arxiv-sanity](https://github.com/karpathy/arxiv-sanity-preserver), [arxiv-vanity/engrafo](https://github.com/arxiv-vanity/engrafo), [arxivscraper](https://github.com/Mahdisadjadi/arxivscraper), [tex2py](https://github.com/alvinwan/tex2py), [cso-classifier](https://github.com/angelosalatino/cso-classifier/) and [axcell](https://github.com/paperswithcode/axcell). Parsed structure of the content can be useful in search or any scientific research mining/AI applications as a heuristic baseline.

### WAY TO DO IT
1. Extract papers from `Arxiv` using the `scrape_papers.py` script. The `ArxivDatabase` will hold the `ArxivRecord`s.
2. `mine_papers.py` will download the LaTeX version of the papers from Arxiv and create an `ArxivRecord` object.
3. The `ArxivRecord` is a base class of `ArxivPaper`.
4. The `ArxivPaper` object helps extract the LaTeX source from Arxiv and parses it.
    - Three things will help solve the information mining problem.
        1. Extraction of document structure/hierarchy via Python-LaTeX libraries like `tex2py`.
        2. Extraction of text from the LaTeX document using `detex` : https://github.com/pkubowicz/opendetex
        3. Collate the tree with the text, based on hierarchical traversal of the tree and text-splitting based search.
    - These things are managed using the child classes of `LatexInformationParser`. These child classes help form the structured `Section` objects, which contain the stored parsed structure of the research paper.
5. The scraped/mined papers are stored in `fs` or `elasticsearch` based search engines.
## Core Components of ArXiv-Miner
- Scraping
- Parsing
- Indexing/Storage

## Family Of Projects With ArXiv-Miner
- `arxiv-table-miner` : Coming Soon.
- `arxiv-table-ml-models` : Coming Soon.
- `semantic-scholar-data-pipeline` : https://github.com/valayDave/semantic-scholar-data-pipeline

## Setup
## Disclaimer
This project was developed in a [Cowboy coding](https://en.wikipedia.org/wiki/Cowboy_coding) style during the [COVID-19 pandemic](https://en.wikipedia.org/wiki/COVID-19_pandemic). Hence, it **may have bugs, and the code may not be well optimized**. The primary reason for development was to aid CS and Machine Learning/AI research, but this tool can be extended to all 3M+ documents on ArXiv.

```sh
sh setup.sh
```
## Call For Contributors
Any help with contributions to improve the project or fix bugs is completely welcome. Please read the contribution guide in the documentation.

### To Setup Ontology Miner:
## Credits and Appreciation
This project, like all others, has been built on the shoulders of giants. A big thanks to the creators of the following libraries/open source projects that aided the development of `arxiv-miner` and its family of projects:
- [arxiv-sanity](https://github.com/karpathy/arxiv-sanity-preserver)
- [arxiv-vanity/engrafo](https://github.com/arxiv-vanity/engrafo)
- [arxivscraper](https://github.com/Mahdisadjadi/arxivscraper)
- [tex2py](https://github.com/alvinwan/tex2py)
- [cso-classifier](https://github.com/angelosalatino/cso-classifier/)
- [axcell](https://github.com/paperswithcode/axcell)
- [elasticsearch](https://github.com/elastic/elasticsearch)
- [Semantic Scholar Open Research corpus](https://github.com/allenai/s2orc)
- [metaflow](https://metaflow.org)

```sh
sh cso_setup.sh
```

## What is Done Yet :

1. Arxiv PDF and LaTeX Extraction Pipeline
2. Arxiv Paper Parsing to JSON Objects using LaTeX and Python. --> LaTeX Based Semantically Parsed Data Extraction :: READY
3. Local Database Setup and Data Exploration.

## What Needs to Be Done ?

1. Data Extraction and Parsing Systems are pretty well set from the Database.
1. The Database Generation needs to move from Andrej's script to using the `arxivscraper` which uses the mass Metadata extraction.

2. Final System :
- Scraping Crons
- Parsing Idempotent processes.
- TODO : Further parse
- ArxivRecord Database with `fs` | `elasticsearch`
- Search Interface
- Daily Update of New Research
- Search indexing for


# How Does it Work ?

## Overview
- Parts of Current System :
- `ArxivDatabase` : Core class to expose base methods for interfacing with the DB. It is an adapter that can work with a `filesystem` based database or `elasticsearch`. The adapter is meant to create an interoperable data layer that can be switched according to requirement and need.
- Filesystem based DB uses `ArxivDatabaseService(rpyc.Service,ArxivFSDatabase)`. The `database_server.py` file helps create an FS based database server.
- `HarvestingProcess` : This uses a `ScrapingEngine` to extract `ArxivIdentity` from ArXiv API(`http://export.arxiv.org/oai2?verb=ListRecords`).
- The Data extracted is stored to the database as an `ArxivRecord`.
- `DailyHarvestationProcess` helps retrieve daily papers.
- `MassHarvestationProcess` gets data based on DateRange.
- `MiningProcess`: Helps mine the papers for `LaTeX` information. The mined `ArxivRecord` is stored in the Database

- The Database provides a way to create/update an `ArxivRecord`. The `ArxivRecord` contains an `ArxivIdentity`, which is extracted using the `arxiv_miner.scraping_engine.ScrapingEngine`. `ArxivRecord` is the fundamental data structure used to identify a research paper. `ArxivPaper` is a processing object which can use an `ArxivRecord` to start the mining process.

## Running the Damn Thing.
- The `config.py` file contains the `Config` Object which is Singleton used for configuration across the project.
- Start FS based Database Server with Below Command . The Database Server is responsible For Managing the data. Elasticsearch is also supported as a backend database.
```sh
python database_server.py
```
- Start the Data Harvester according to your requirements. Can perform a `daily-harvest` or a `date-range` harvest.
```sh
python scrape_papers.py --help
```
- DB adapters can be switched. The `--use_defaults` will load the defaults of `--datastore` from `Config`.
```sh
python scrape_papers.py --datastore elasticsearch --host localhost --port 18861 daily-harvest
```
- Start the Miner to begin mining the extracted data in parallel.
```sh
python mine_papers.py --help
```
- The Miner has the same database cli adapter as Scraper.
```sh
python mine_papers.py --datastore fs --use_defaults start-miner
```
- Source Harvest and Store to S3:
```sh
nohup /home/ubuntu/arxiv-miner/.env/bin/python /home/ubuntu/arxiv-miner/mass_source_harvest.py --max-chunks 200 > /home/ubuntu/arxiv-miner/mass_harvet.log &
```

- Extract EC2 instance List from AWS
```
aws ec2 describe-instances --region=us-east-1 --query 'Reservations[*].Instances[*].[InstanceId,Tags[?Key==`Name`].Value|[0],State.Name,PrivateIpAddress,PublicIpAddress]' --output table > instance_list.md
```
# TODO / VISION
1. Create a search interface for looking for research.
2. Get daily analytics of the new research coming out
3. Create reports and analytics for the new research
## Licence
MIT
7 changes: 0 additions & 7 deletions arxiv_miner/__init__.py
@@ -8,11 +8,6 @@
    ResearchPaper,\
    ResearchPaperSematicParser

from .loader import \
    ArxivLoader,\
    ArxivLoaderFilter,\
    FSArxivLoadingFactory

from .record import \
    ArxivIdentity,\
    ArxivLatexParsingResult,\
@@ -22,8 +17,6 @@
    ArxivSematicParsedResearch

from .database import \
    ArxivFSDatabaseService,\
    ArxivDatabaseServiceClient,\
    ArxivElasticSeachDatabaseClient,\
    KeywordsTextSearch,\
    TextSearchFilter,\
62 changes: 62 additions & 0 deletions arxiv_miner/cli.py
@@ -0,0 +1,62 @@
'''
This is the Generalised CLI origin of the Project.
this will be used for the Extracting the Important CLI information such as Database
Selection etc. Can be used as a gateway to integrate all the submodules into one cli invocation
'''

import click
from functools import wraps
import configparser
from .config import Config
from .database import SUPPORTED_DBS,get_database_client
import json

DEFAULT_APP_NAME= 'ArXiv-Miner'

def common_run_options(func):
    db_defaults = Config.get_db_defaults()
    @click.option('--host', default=db_defaults['host'], help='ArxivDatabase Host')
    @click.option('--port', default=db_defaults['port'], help='ArxivDatabase Port')
    @wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


@click.group(invoke_without_command=True)
@click.option('--use_defaults',is_flag=True,help='Use Default Database Configurations For Chosen Datastore.')
@click.option('--with-config',default=None,help='Path to configuration ini file to use. Uses a configuration file for the instantiation of the database')
@common_run_options
@click.pass_context
def db_cli(ctx,use_defaults,with_config,host,port,app_name=DEFAULT_APP_NAME):
    ctx.obj = {}
    args , client_class = database_choice(use_defaults,with_config,host,port)
    print_str = '\n %s Process Using %s Datastore'%(app_name,'elasticsearch')
    args_str = ''.join(['\n\t'+ i + ' : ' + str(args[i]) for i in args])
    click.secho(print_str,fg='green',bold=True)
    click.secho(args_str+'\n\n',fg='magenta')
    ctx.obj['db_class'] = client_class
    ctx.obj['db_args'] = args


def database_choice(use_defaults,with_config,host,port):
    client_class = get_database_client('elasticsearch')
    if with_config is not None:
        config = configparser.ConfigParser()
        config.read(with_config)
        args = dict(index_name=config['elasticsearch']['index'],
                    host=config['elasticsearch']['host']
                    )
        if 'port' in config['elasticsearch']:
            args['port'] = config['elasticsearch']['port']
        if 'auth' in config['elasticsearch']:
            args['auth'] = config['elasticsearch']['auth'].split(' ')
    # get_database_client will raise error if some-one feeds BS DB
    elif use_defaults:
        args = Config.get_defaults('elasticsearch')
    else:
        args = dict(index_name=Config.elasticsearch_index,host=host,port=port)
    return args, client_class

if __name__ == '__main__':
    db_cli()
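For readers wondering what file `--with-config` expects, here is a minimal sketch inferred from the keys that `database_choice` reads: an `[elasticsearch]` section with `index` and `host`, plus optional `port` and `auth`. The ini contents, the values in it, and the helper name `load_es_args` are assumptions made for illustration and are not part of this commit. The sketch also converts the port to an integer, which the committed code does not do.

```python
import configparser

# Hypothetical ini contents. The section and key names mirror what
# database_choice reads; the values are made up for this example.
EXAMPLE_INI = """
[elasticsearch]
index = arxiv_papers
host = localhost
port = 9200
auth = user secret
"""


def load_es_args(ini_text):
    """Build keyword arguments for the Elasticsearch client from ini text."""
    config = configparser.ConfigParser()
    config.read_string(ini_text)
    section = config['elasticsearch']
    args = dict(index_name=section['index'], host=section['host'])
    if 'port' in section:
        # configparser returns strings, so convert the port explicitly.
        # (Assumption: the committed code passes the string through as-is.)
        args['port'] = int(section['port'])
    if 'auth' in section:
        # "user password" separated by a single space, as in database_choice.
        args['auth'] = section['auth'].split(' ')
    return args


es_args = load_es_args(EXAMPLE_INI)
print(es_args)
```

Keeping `port` and `auth` optional matches the committed function, which only adds those keys when they are present in the file.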
32 changes: 32 additions & 0 deletions arxiv_miner/config.py
@@ -0,0 +1,32 @@
# TODO : Move this to configuration format where the entire thing comes from a YML file
import os
# global settings
# -----------------------------------------------------------------------------
class Config(object):
    default_database = 'elasticsearch'
    elasticsearch_port = 9200
    elasticsearch_host = 'localhost'
    elasticsearch_index = 'arxiv_papers'
    es_auth = None # should be a tuple

    # Object Store
    bucket_name = 'arxiv-papers-source-bucket'

    @classmethod
    def get_defaults(cls,db_str):
        if db_str == 'elasticsearch':
            return_dict = dict(\
                host=cls.elasticsearch_host,\
                port=cls.elasticsearch_port,\
                index_name = cls.elasticsearch_index)

            if cls.es_auth is not None:
                return_dict['auth']=cls.es_auth

            return return_dict
        else:
            return None

    @classmethod
    def get_db_defaults(cls):
        return cls.get_defaults(cls.default_database)
9 changes: 1 addition & 8 deletions arxiv_miner/database/__init__.py
@@ -11,12 +11,7 @@
    FIELD_MAPPING,\
    DATE_FIELD_NAME

from .filesystem import ArxivFSDatabase
from .proxy_service import \
    ArxivFSDatabaseService,\
    ArxivDatabaseServiceClient

SUPPORTED_DBS = ['fs','elasticsearch']
SUPPORTED_DBS = ['elasticsearch']

class DatabaseNotSupported(Exception):
    headline = 'DB_CLIENT_NOT_FOUND'
@@ -29,7 +24,5 @@ def __init__(self,given_client):
def get_database_client(client_name):
    if client_name not in SUPPORTED_DBS:
        raise DatabaseNotSupported(client_name)
    if client_name == 'fs':
        return ArxivDatabaseServiceClient
    elif client_name == 'elasticsearch':
        return KeywordsTextSearch
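The lookup that remains after this change can be sketched in isolation as follows. This is a self-contained approximation: `KeywordsTextSearch` here is a stand-in class that only records its arguments, the exception message is invented, and instantiating the returned class with the `db_args` dictionary is an assumption about how downstream code uses `ctx.obj['db_class']` and `ctx.obj['db_args']`.

```python
class DatabaseNotSupported(Exception):
    """Raised when an unknown datastore name is requested."""

    def __init__(self, given_client):
        # The message text is illustrative, not the one used in the repository.
        super().__init__(f'Database client not supported: {given_client}')


class KeywordsTextSearch:
    """Stand-in for the Elasticsearch-backed client; it only stores its arguments."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


SUPPORTED_DBS = ['elasticsearch']


def get_database_client(client_name):
    # Reject anything outside the supported list before choosing a class.
    if client_name not in SUPPORTED_DBS:
        raise DatabaseNotSupported(client_name)
    return KeywordsTextSearch


# The function returns a class, so the caller instantiates it with its own arguments.
client_class = get_database_client('elasticsearch')
client = client_class(index_name='arxiv_papers', host='localhost', port=9200)

# With the filesystem backend removed, asking for 'fs' now raises.
try:
    get_database_client('fs')
    fs_rejected = False
except DatabaseNotSupported:
    fs_rejected = True
```

Returning the class rather than an instance lets the CLI defer construction until the host, port and index are known.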
13 changes: 0 additions & 13 deletions arxiv_miner/database/elasticsearch.py
@@ -56,7 +56,6 @@ def __init__(self,index_name=None,host='localhost',port=9200,auth=None):
            src_str = f'{host}'
        else:
            src_str = f'{host}:{port}'

        if auth is None:
            self.es = elasticsearch.Elasticsearch(src_str,timeout=30, max_retries=10)
        else:
@@ -811,18 +810,6 @@ def text_aggregation(self,agg_obj:Aggregation):
        return_buckets = agg_obj.transform_resp(aggregation_buckets)
        return return_buckets

    # @async_wrap
    # def async_text_search_scan(self,filter_obj:TextSearchFilter):
    # return self.text_search_scan(filter_obj)

    # @async_wrap
    # def async_text_aggregation(self,agg_obj:Aggregation):
    # return self.text_aggregation(agg_obj)

    # @async_wrap
    # def async_text_search(self,filter_obj:TextSearchFilter):
    # return self.text_search(filter_obj)

class KeywordsTextSearch(ArxivElasticTextSearch):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)