
Scraping and Chunking experiments

  1. Scraped the entire Notion help center knowledge base

  2. Extracted core text content from each article

  3. Split articles into smaller chunks:

    • Explored four different chunking strategies for RAG-type applications:
      1. RecursiveTokenChunker: Recursively splits text based on token count (see the sketch after this list)
      2. ClusterSemanticChunker: Uses semantic similarity to cluster text
      3. KamradtModifiedChunker: A modified version of Kamradt's chunking algorithm
      4. NaiveChunker: Simple character-based chunking
    • Implemented an LLMChunker (reported as LLMSemanticChunker in the results) as a fifth strategy
    • Evaluated each strategy on precision, recall, and IOU (intersection over union) metrics
    • Produced arrays of chunks as the final output for each strategy
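
To make the splitting idea concrete, here is a minimal sketch of recursive token-based splitting in the spirit of RecursiveTokenChunker. It assumes tiktoken for token counting; the separator list, chunk size, and overall structure are illustrative rather than the repository's exact implementation.

    # Minimal sketch of recursive token-based splitting (assumptions: tiktoken
    # for token counts; separators and max_tokens are illustrative defaults).
    import tiktoken

    ENC = tiktoken.get_encoding("cl100k_base")
    SEPARATORS = ("\n\n", "\n", ". ", " ")  # coarse-to-fine split points

    def num_tokens(text: str) -> int:
        return len(ENC.encode(text))

    def recursive_split(text: str, max_tokens: int = 200, separators=SEPARATORS):
        """Split on the coarsest separator that keeps every piece under max_tokens."""
        if num_tokens(text) <= max_tokens or not separators:
            return [text]
        sep, rest = separators[0], separators[1:]
        chunks, current = [], ""
        for piece in text.split(sep):
            candidate = f"{current}{sep}{piece}" if current else piece
            if num_tokens(candidate) <= max_tokens:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                if num_tokens(piece) <= max_tokens:
                    current = piece  # start a fresh chunk with this piece
                else:
                    # The piece alone is too large: recurse with finer separators.
                    chunks.extend(recursive_split(piece, max_tokens, rest))
                    current = ""
        if current:
            chunks.append(current)
        return chunks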

Results

Evaluation Results Summary:

Chunker                  Precision Mean   Recall Mean   IOU Mean
RecursiveTokenChunker    0.0135           0.7358        0.0135
ClusterSemanticChunker   0.0327           0.6862        0.0325
KamradtModifiedChunker   0.0230           0.7293        0.0230
NaiveChunker             0.0367           0.6450        0.0363
LLMSemanticChunker       0.0357           0.7043        0.0357
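
The metrics follow the span-overlap style of Chroma's evaluation: precision, recall, and IOU are computed from the overlap between retrieved chunk spans and labeled relevant excerpts. The sketch below assumes both are given as (start, end) character ranges over the same corpus; the repository's eval.py may differ in detail.

    # Span-overlap metrics (assumption: retrieved chunks and labeled relevant
    # excerpts are (start, end) character ranges over the same corpus text).
    def to_char_set(ranges):
        chars = set()
        for start, end in ranges:
            chars.update(range(start, end))
        return chars

    def chunk_metrics(retrieved_ranges, relevant_ranges):
        retrieved = to_char_set(retrieved_ranges)
        relevant = to_char_set(relevant_ranges)
        overlap = len(retrieved & relevant)
        precision = overlap / len(retrieved) if retrieved else 0.0
        recall = overlap / len(relevant) if relevant else 0.0
        union = len(retrieved | relevant)
        iou = overlap / union if union else 0.0
        return precision, recall, iou

    # A large retrieved chunk that barely covers the labeled excerpt: recall is
    # high while precision and IOU stay low, the pattern seen in the table above.
    print(chunk_metrics([(0, 400)], [(350, 400)]))  # -> (0.125, 1.0, 0.125)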

Usage

To use this project, follow these steps:

  1. Setup:

    • Clone the repository:
      git clone https://github.com/chirag9127/chunking_expt.git
      cd chunking_expt
      
    • Install the required dependencies:
      pip install -r requirements.txt
      
  2. Scraping:

    • Run the scraper to collect articles from the Notion help center:
      python scraper.py
      
    • This will save the scraped articles in a checkpoint file.
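    • For reference, a minimal sketch of what this step does (assumptions: requests and BeautifulSoup, a placeholder article URL, and a hypothetical checkpoint filename; scraper.py may differ):
      # Fetch one help-center article and extract its main text
      # (placeholder URL and checkpoint filename; not scraper.py's exact logic).
      import json
      import requests
      from bs4 import BeautifulSoup

      def fetch_article_text(url: str) -> str:
          resp = requests.get(url, timeout=30)
          resp.raise_for_status()
          soup = BeautifulSoup(resp.text, "html.parser")
          main = soup.find("article") or soup.body
          return main.get_text(separator="\n", strip=True)

      urls = ["https://www.notion.so/help/some-article"]  # placeholder URL
      articles = {url: fetch_article_text(url) for url in urls}
      with open("articles_checkpoint.json", "w") as f:  # hypothetical filename
          json.dump(articles, f)
      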
  3. Chunking:

    • To chunk the articles using the LLMChunker:
      python chunker.py
      
    • This will process the articles from the checkpoint file and create chunks.
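    • Conceptually, the LLMChunker asks a language model to propose topic boundaries and splits the text there. A minimal sketch using the OpenAI Python client follows; the model name, prompt wording, and split-marker convention are assumptions, not necessarily what chunker.py does:
      # LLM-guided chunking sketch: ask the model to mark topic boundaries,
      # then split on the marker (illustrative prompt and model name).
      from openai import OpenAI

      client = OpenAI()  # reads OPENAI_API_KEY from the environment

      def llm_chunk(article: str, model: str = "gpt-4o-mini") -> list[str]:
          prompt = (
              "Insert the marker <<<SPLIT>>> between passages of the following "
              "text wherever the topic changes. Return the full text with the "
              "markers added and nothing else.\n\n" + article
          )
          resp = client.chat.completions.create(
              model=model,
              messages=[{"role": "user", "content": prompt}],
              temperature=0,
          )
          marked = resp.choices[0].message.content
          return [part.strip() for part in marked.split("<<<SPLIT>>>") if part.strip()]
      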
  4. Evaluation:

    • To evaluate different chunking strategies:
      python eval.py
      
    • This will run the evaluation on different chunking methods and output the results.
  5. Viewing Results:

    • The evaluation results will be saved in chunker_evaluation_results.json.
    • You can also find a summary of the results at the end of this README.

Note: Make sure to set up your OpenAI API key in a .env file or as an environment variable before running the LLMChunker.
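
For reference, a typical key setup looks like this (assuming the project loads the key with python-dotenv; exporting OPENAI_API_KEY in your shell works just as well):

    # contents of .env
    OPENAI_API_KEY=sk-your-key-here

    # wherever the key is needed (assumption: python-dotenv is installed)
    from dotenv import load_dotenv
    load_dotenv()  # makes OPENAI_API_KEY visible to the OpenAI client via os.environ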


Credits

This project's evaluation methodology is inspired by and builds upon the work presented in Chroma's technical report:

"Evaluating Chunking Strategies for Retrieval" by Chroma Research https://research.trychroma.com/evaluating-chunking
