Commit
* OSS Cleanup - Refactored CLI into the main module - Removed all outside scripts and put them in one folder - Removed FS database - Removed outside config - Set up ini file based configuration - Created documentation - Added changelog and contribution guide - Added shell script for opendetex etc. - Fixed the streamlit dashboard - Added license - Version bump and final cleanup pre merge.
Showing 46 changed files with 603 additions and 1,404 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
MIT License | ||
Copyright (c) 2021 Valay Dave | ||
Permission is hereby granted, free of charge, to any person obtaining a copy | ||
of this software and associated documentation files (the "Software"), to deal | ||
in the Software without restriction, including without limitation the rights | ||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | ||
copies of the Software, and to permit persons to whom the Software is | ||
furnished to do so, subject to the following conditions: | ||
The above copyright notice and this permission notice shall be included in all | ||
copies or substantial portions of the Software. | ||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE | ||
SOFTWARE. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,110 +1,44 @@ | ||
# Arxiv Miner. | ||
# ArXiv-Miner | ||
|
||
Repository Helps Mine Arxiv Papers to quickly Scrape through new Papers and Mine data for Faster Readings. | ||
> ArXiv Miner is a toolkit for mining research papers on CS ArXiv. | ||
# BROADER GOAL | ||
1. The goal of this project is to annotate and build faster search around research papers so that I can be quickly aware of what is happening in the domain. | ||
2. It is also meant to structure research papers in serialisable JSON so that I can start annotating research and fixing things around the same. | ||
## What is ArXiv-Miner | ||
|
||
# How can One get there ? | ||
`arxiv-miner` is a handy library that helps power [Sci-Genie](https://sci-genie.com). Sci-Genie is a search engine for quickly searching through the full text of papers on CS ArXiv. `arxiv-miner` helps extract and parse LaTeX documents from CS ArXiv. It also supports storage and search of those parsed documents using **Elasticsearch**. The library can also be applied to other domains like Math, Physics, Biology etc. | ||
|
||
## ARXIV PAPER MINING | ||
## Documentation | ||
All documentation on how to install and use `arxiv-miner` is provided in the documentation website or inside the [docs folder](docs). Contribution guidelines are also provided there. | ||
|
||
### GOAL OF PAPER MINING | ||
Parse the Arxiv Latex/PDF into A research Paper Object which can be serialised so that It is in readable format for some form of Machine learning/Annoation methods. But it all starts from cleaning the Dirt from Arxiv. | ||
## Why was ArXiv-Miner created ? | ||
ArXiv Miner was created for easily scraping, parsing and searching research content on ArXiv. This library was created after stitching together solutions from the code of various tools like [arxiv-sanity](https://github.com/karpathy/arxiv-sanity-preserver), [arxiv-vanity/engrafo](https://github.com/arxiv-vanity/engrafo), [arxivscraper](https://github.com/Mahdisadjadi/arxivscraper), [tex2py](https://github.com/alvinwan/tex2py), [cso-classifier](https://github.com/angelosalatino/cso-classifier/) and [axcell](https://github.com/paperswithcode/axcell). Parsed structure of the content can be useful in search or any scientific research mining/AI applications as a heuristic baseline. | ||
|
||
### WAY TO DO IT | ||
1. Extract Papers from `Arxiv` using `scrape_papers.py` script. The `ArxivDatabase` will hold the `ArxivRecord`s. | ||
2. `mine_papers.py` will download the Latex version of the Papers from Arxiv and create an `ArxivRecord` object. | ||
3. The `ArxivRecord` is a base class for `ArxivPaper`. | ||
4. The `ArxivPaper` Object helps extract the Latex source from the Arxiv and parses it. | ||
- Three things will help solve the Information mining Problem. | ||
1. Extraction of Document Structure/hierarchy via Python-Latex Libraries like `tex2py`. | ||
2. Extraction of Text from Latex Document Using `detex` : https://github.com/pkubowicz/opendetex | ||
3. Collate the tree with the text using hierarchical traversal of the tree and text-splitting based search. | ||
- These things are managed using the child classes of `LatexInformationParser`. These child classes help form the structured `Section` objects, which contain the parsed structure of the research paper. | ||
5. The scraped/mined papers are stored in an `fs` or `elasticsearch` based search engine. | ||
## Core Components of ArXiv-Miner | ||
- Scraping | ||
- Parsing | ||
- Indexing/Storage | ||
|
||
## Family Of Projects With ArXiv-Miner | ||
- `arxiv-table-miner` : Coming Soon. | ||
- `arxiv-table-ml-models` : Coming Soon. | ||
- `semantic-scholar-data-pipeline` : https://github.com/valayDave/semantic-scholar-data-pipeline | ||
|
||
## Setup | ||
## Disclaimer | ||
This project was developed in a [cowboy coding](https://en.wikipedia.org/wiki/Cowboy_coding) style during the [COVID-19 pandemic](https://en.wikipedia.org/wiki/COVID-19_pandemic). Hence, it **may have bugs, and the code is not the most well optimized**. The primary reason for development was to aid CS and Machine Learning/AI research, but this tool can be extended to all 3M+ documents on ArXiv. | ||
|
||
```sh | ||
sh setup.sh | ||
``` | ||
## Call For Contributors | ||
Any contributions that improve the project or fix bugs are welcome. Please read the contribution guide in the documentation. | ||
|
||
### To Setup Ontology Miner: | ||
## Credits and Appreciation | ||
This project, like all others, has been built on the shoulders of giants. A big thanks to the creators of the following libraries/open source projects that aided the development of `arxiv-miner` and its family of projects: | ||
- [arxiv-sanity](https://github.com/karpathy/arxiv-sanity-preserver) | ||
- [arxiv-vanity/engrafo](https://github.com/arxiv-vanity/engrafo) | ||
- [arxivscraper](https://github.com/Mahdisadjadi/arxivscraper) | ||
- [tex2py](https://github.com/alvinwan/tex2py) | ||
- [cso-classifier](https://github.com/angelosalatino/cso-classifier/) | ||
- [axcell](https://github.com/paperswithcode/axcell) | ||
- [elasticsearch](https://github.com/elastic/elasticsearch) | ||
- [Semantic Scholar Open Research corpus](https://github.com/allenai/s2orc) | ||
- [metaflow](https://metaflow.org) | ||
|
||
```sh | ||
sh cso_setup.sh | ||
``` | ||
|
||
## What is Done Yet : | ||
|
||
1. Arxiv PDF and LateX Extraction Pipeline | ||
2. Arxiv Paper Parsing to JSON Objects using Latex and Python. --> Latex Based Semantically Parsed Data Extraction :: READY | ||
3. Local Database Setup and Data Exploration. | ||
|
||
## What Needs to Be Done ? | ||
|
||
1. Data Extraction And Parsing Systems Are pretty Well set from Database. | ||
1. The Database Generation needs to move from Andrej's script to using the `arxivscraper` which uses the mass Metadata extraction. | ||
|
||
2. Final System : | ||
- Scraping Crons | ||
- Parsing Idempotent processes. | ||
- TODO : Further parse | ||
- ArxivRecord Database with `fs` | `elasticsearch` | ||
- Search Interface | ||
- Daily Update of New Research | ||
- Search indexing for | ||
|
||
|
||
# How Does it Work ? | ||
|
||
## Overview | ||
- Parts of Current System : | ||
- `ArxivDatabase` : Core class to expose base methods for interfacing with DB. It is an adapter that can work with a `filesystem` based database or `elasticsearch`. The adapter is meant to create an interoperable data layer that can be switched according to requirement and need. | ||
- Filesystem based DB uses `ArxivDatabaseService(rpyc.Service,ArxivFSDatabase)`. The `database_server.py` file helps create an FS based database server. | ||
- `HarvestingProcess` : This uses a `ScrapingEngine` to extract `ArxivIdentity` from ArXiv API(`http://export.arxiv.org/oai2?verb=ListRecords`). | ||
- The Data extracted is stored to the database as an `ArxivRecord`. | ||
- `DailyHarvestationProcess` helps retrieve daily papers. | ||
- `MassHarvestationProcess` gets data based on DateRange. | ||
- `MiningProcess`: Helps mine the papers for `LaTeX` information. The mined `ArxivRecord` is stored in the Database | ||
|
||
- The Database provides a Way to Create/Update `ArxivRecord`. The `ArxivRecord` contains an `ArxivIdentity` which is extracted using the `arxiv_miner.scraping_engine.ScrapingEngine`. `ArxivRecord` is the Fundamental Datastructure used to identify a research paper. `ArxivPaper` is a processing Object which can use an `ArxivRecord` to start the mining process. | ||
|
||
## Running the Damn Thing. | ||
- The `config.py` file contains the `Config` Object which is Singleton used for configuration across the project. | ||
- Start FS based Database Server with Below Command . The Database Server is responsible For Managing the data. Elasticsearch is also supported as a backend database. | ||
```sh | ||
python database_server.py | ||
``` | ||
- Start the Data Harvester according to your requirements. Can perform a `daily-harvest` or a `date-range` harvest. | ||
```sh | ||
python scrape_papers.py --help | ||
``` | ||
- DB adapters can be switched. The `--use_defaults` will load the defaults of `--datastore` from `Config`. | ||
```sh | ||
python scrape_papers.py --datastore elasticsearch --host localhost --port 18861 daily-harvest | ||
``` | ||
- Start the Miner To parallely start mining the Extracted data. | ||
```sh | ||
python mine_papers.py --help | ||
``` | ||
- The Miner has the same database cli adapter as Scraper. | ||
```sh | ||
python mine_papers.py --datastore fs --use_defaults start-miner | ||
``` | ||
- Source Harvest and Store to S3: | ||
```sh | ||
nohup /home/ubuntu/arxiv-miner/.env/bin/python /home/ubuntu/arxiv-miner/mass_source_harvest.py --max-chunks 200 > /home/ubuntu/arxiv-miner/mass_harvet.log & | ||
``` | ||
|
||
- Extract EC2 instance List from AWS | ||
``` | ||
aws ec2 describe-instances --region=us-east-1 --query 'Reservations[*].Instances[*].[InstanceId,Tags[?Key==`Name`].Value|[0],State.Name,PrivateIpAddress,PublicIpAddress]' --output table > instance_list.md | ||
``` | ||
# TODO / VISION | ||
1. Create a search interface for looking for research. | ||
2. Get daily analytics of the new research coming out | ||
3. Create reports and analytics for the new research | ||
## Licence | ||
MIT |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,62 @@ | ||
''' | ||
This is the generalised CLI origin of the project. | ||
It is used for extracting the important CLI information such as database | ||
selection etc. It can be used as a gateway to integrate all the submodules into one CLI invocation. | ||
''' | ||
|
||
import click | ||
from functools import wraps | ||
import configparser | ||
from .config import Config | ||
from .database import SUPPORTED_DBS,get_database_client | ||
import json | ||
|
||
DEFAULT_APP_NAME= 'ArXiv-Miner' | ||
|
||
def common_run_options(func): | ||
    db_defaults = Config.get_db_defaults() | ||
    @click.option('--host', default=db_defaults['host'], help='ArxivDatabase Host') | ||
    @click.option('--port', default=db_defaults['port'], help='ArxivDatabase Port') | ||
    @wraps(func) | ||
    def wrapper(*args, **kwargs): | ||
        return func(*args, **kwargs) | ||
    return wrapper | ||
|
||
|
||
@click.group(invoke_without_command=True) | ||
@click.option('--use_defaults',is_flag=True,help='Use Default Database Configurations For Chosen Datastore.') | ||
@click.option('--with-config',default=None,help='Path to configuration ini file to use. Uses a configuration file for the instantiation of the database') | ||
@common_run_options | ||
@click.pass_context | ||
def db_cli(ctx,use_defaults,with_config,host,port,app_name=DEFAULT_APP_NAME): | ||
    ctx.obj = {} | ||
    args , client_class = database_choice(use_defaults,with_config,host,port) | ||
    print_str = '\n %s Process Using %s Datastore'%(app_name,'elasticsearch') | ||
    args_str = ''.join(['\n\t'+ i + ' : ' + str(args[i]) for i in args]) | ||
    click.secho(print_str,fg='green',bold=True) | ||
    click.secho(args_str+'\n\n',fg='magenta') | ||
    ctx.obj['db_class'] = client_class | ||
    ctx.obj['db_args'] = args | ||
|
||
|
||
def database_choice(use_defaults,with_config,host,port): | ||
    client_class = get_database_client('elasticsearch') | ||
    if with_config is not None: | ||
        config = configparser.ConfigParser() | ||
        config.read(with_config) | ||
        args = dict(index_name=config['elasticsearch']['index'], | ||
                    host=config['elasticsearch']['host'] | ||
                    ) | ||
        if 'port' in config['elasticsearch']: | ||
            args['port'] = config['elasticsearch']['port'] | ||
        if 'auth' in config['elasticsearch']: | ||
            args['auth'] = config['elasticsearch']['auth'].split(' ') | ||
    # get_database_client will raise error if some-one feeds BS DB | ||
    elif use_defaults: | ||
        args = Config.get_defaults('elasticsearch') | ||
    else: | ||
        args = dict(index_name=Config.elasticsearch_index,host=host,port=port) | ||
    return args, client_class | ||
|
||
if __name__ == '__main__': | ||
    db_cli() |
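The `--with-config` branch of `database_choice` above reads an `[elasticsearch]` section with `index` and `host` keys, plus optional `port` and `auth` keys. A minimal sketch of such a file and of the same parsing logic is given below. The sample values (index name, host, credentials) and the helper name `parse_es_config` are illustrative assumptions, not taken from the repository.

```python
import configparser

# Hypothetical ini content; only the key names mirror those read in database_choice.
SAMPLE_INI = """
[elasticsearch]
index = arxiv_papers
host = localhost
port = 9200
auth = my_user my_password
"""


def parse_es_config(ini_text):
    """Build client keyword arguments from ini text, mirroring database_choice."""
    config = configparser.ConfigParser()
    config.read_string(ini_text)
    section = config['elasticsearch']
    args = dict(index_name=section['index'], host=section['host'])
    if 'port' in section:
        # configparser returns strings, so the port stays a str unless converted.
        args['port'] = section['port']
    if 'auth' in section:
        # "user password" separated by a single space becomes a two-item list.
        args['auth'] = section['auth'].split(' ')
    return args


es_args = parse_es_config(SAMPLE_INI)
print(es_args)
```

Values read this way are strings, so `port` arrives as `'9200'` rather than the integer default held in `Config`. Whether the database client accepts a string port is not visible in this diff.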
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,32 @@ | ||
# TODO : Move this to configuration format where the entire thing comes from a YML file | ||
import os | ||
# global settings | ||
# ----------------------------------------------------------------------------- | ||
class Config(object): | ||
    default_database = 'elasticsearch' | ||
    elasticsearch_port = 9200 | ||
    elasticsearch_host = 'localhost' | ||
    elasticsearch_index = 'arxiv_papers' | ||
    es_auth = None # should be a tuple | ||
|
||
    # Object Store | ||
    bucket_name = 'arxiv-papers-source-bucket' | ||
|
||
    @classmethod | ||
    def get_defaults(cls,db_str): | ||
        if db_str == 'elasticsearch': | ||
            return_dict = dict(\ | ||
                host=cls.elasticsearch_host,\ | ||
                port=cls.elasticsearch_port,\ | ||
                index_name = cls.elasticsearch_index) | ||
|
||
            if cls.es_auth is not None: | ||
                return_dict['auth']=cls.es_auth | ||
|
||
            return return_dict | ||
        else: | ||
            return None | ||
|
||
    @classmethod | ||
    def get_db_defaults(cls): | ||
        return cls.get_defaults(cls.default_database) |
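As a rough illustration of how the class-level defaults above behave, the sketch below uses a condensed stand-in for `Config` (same attribute names and values as in the diff, with the object-store setting omitted) and shows that `auth` only appears in the returned arguments once `es_auth` is set. The credentials and the `'fs'` lookup are hypothetical values chosen for the example.

```python
class Config(object):
    # Condensed stand-in for the Config class in the diff above.
    default_database = 'elasticsearch'
    elasticsearch_port = 9200
    elasticsearch_host = 'localhost'
    elasticsearch_index = 'arxiv_papers'
    es_auth = None  # should be a tuple

    @classmethod
    def get_defaults(cls, db_str):
        if db_str != 'elasticsearch':
            return None
        defaults = dict(
            host=cls.elasticsearch_host,
            port=cls.elasticsearch_port,
            index_name=cls.elasticsearch_index,
        )
        if cls.es_auth is not None:
            defaults['auth'] = cls.es_auth
        return defaults

    @classmethod
    def get_db_defaults(cls):
        return cls.get_defaults(cls.default_database)


# Without credentials, only host, port and index_name are returned.
without_auth = Config.get_db_defaults()

# Hypothetical credentials; setting the class attribute affects later calls.
Config.es_auth = ('my_user', 'my_password')
with_auth = Config.get_db_defaults()

# Any datastore name other than 'elasticsearch' yields None.
unsupported = Config.get_defaults('fs')

print(without_auth)
print(with_auth)
print(unsupported)
```

Because the settings live on the class itself, changing an attribute such as `es_auth` is process-wide. That matches the "global settings" comment in the file, but it also means callers should set overrides before the CLI decorators read the defaults.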