TypeEvalPy

A Micro-benchmarking Framework for Python Type Inference Tools

📌 Features:

  • 📜 Contains 154 code snippets to test and benchmark.
  • 🏷 Offers 845 type annotations across a diverse set of Python functionalities.
  • 📂 Organized into 18 distinct categories targeting various Python features.
  • 🚢 Seamlessly manages the execution of containerized tools.
  • 🔄 Efficiently transforms inferred types into a standardized format (see the example after this list).
  • 📊 Automatically produces meaningful metrics for in-depth assessment and comparison.
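
For reference, the standardized format is a list of JSON-style records, one per inferred type. The snippet below is an illustrative sketch only: the field names and values are assumptions, and the authoritative schema is defined by the ground-truth files and translators in this repository.

[
  {
    "file": "main.py",
    "line_number": 1,
    "function": "id_func",
    "type": ["int"]
  },
  {
    "file": "main.py",
    "line_number": 4,
    "variable": "result",
    "type": ["int"]
  }
]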

[New] TypeEvalPy Autogen

  • 🤖 Autogenerates code snippets and ground truth to scale the benchmark based on the original TypeEvalPy benchmark.
  • 📈 The autogen benchmark now contains:
    • Python files: 7121
    • Type annotations: 78373

🛠️ Supported Tools

| Supported ✅ | In-progress 🔧 | Planned 💡 |
| --- | --- | --- |
| HeaderGen | Intellij PSI | MonkeyType |
| Jedi | Pyre | Pyannotate |
| Pyright | PySonar2 | |
| HiTyper | Pytype | |
| Scalpel | TypeT5 | |
| Type4Py | | |
| GPT | | |
| Ollama | | |

🏆 TypeEvalPy Leaderboard

Below is a comparison showcasing exact matches across different tools and LLMs on the Autogen benchmark.

| Rank | 🛠️ Tool | Function Return Type | Function Parameter Type | Local Variable Type | Total |
| --- | --- | --- | --- | --- | --- |
| 1 | mistral-large-it-2407-123b | 16701 | 728 | 57550 | 74979 |
| 2 | qwen2-it-72b | 16488 | 629 | 55160 | 72277 |
| 3 | llama3.1-it-70b | 16648 | 580 | 54445 | 71673 |
| 4 | gemma2-it-27b | 16342 | 599 | 49772 | 66713 |
| 5 | codestral-v0.1-22b | 16456 | 706 | 49379 | 66541 |
| 6 | codellama-it-34b | 15960 | 473 | 48957 | 65390 |
| 7 | mistral-nemo-it-2407-12.2b | 16221 | 526 | 48439 | 65186 |
| 8 | mistral-v0.3-it-7b | 16686 | 472 | 47935 | 65093 |
| 9 | phi3-medium-it-14b | 16802 | 467 | 45121 | 62390 |
| 10 | llama3.1-it-8b | 16125 | 492 | 44313 | 60930 |
| 11 | codellama-it-13b | 16214 | 479 | 43021 | 59714 |
| 12 | phi3-small-it-7.3b | 16155 | 422 | 38093 | 54670 |
| 13 | qwen2-it-7b | 15684 | 313 | 38109 | 54106 |
| 14 | HeaderGen | 14086 | 346 | 36370 | 50802 |
| 15 | phi3-mini-it-3.8b | 15908 | 320 | 30341 | 46569 |
| 16 | phi3.5-mini-it-3.8b | 15763 | 362 | 28694 | 44819 |
| 17 | codellama-it-7b | 13779 | 318 | 29346 | 43443 |
| 18 | Jedi | 13160 | 0 | 15403 | 28563 |
| 19 | Scalpel | 15383 | 171 | 18 | 15572 |
| 20 | gemma2-it-9b | 1611 | 66 | 5464 | 7141 |
| 21 | Type4Py | 3143 | 38 | 2243 | 5424 |
| 22 | tinyllama-1.1b | 1514 | 28 | 2699 | 4241 |
| 23 | mixtral-v0.1-it-8x7b | 3235 | 33 | 377 | 3645 |
| 24 | phi3.5-moe-it-41.9b | 3090 | 25 | 273 | 3388 |
| 25 | gemma2-it-2b | 1497 | 41 | 1848 | 3386 |

(Auto-generated based on the analysis run on 30 Aug 2024)


🐳 Running with Docker

1️⃣ Clone the repo

git clone https://github.com/secure-software-engineering/TypeEvalPy.git

2️⃣ Build Docker image

docker build -t typeevalpy .

3️⃣ Run TypeEvalPy

🕒 The first run takes about 30 minutes to build the Docker containers.

📂 Results will be generated in the results folder within the root directory of the repository. Each results folder will have a timestamp, allowing you to easily track and compare different runs.

Correlation of CSV Files Generated to Tables in ICSE Paper

Here is how the auto-generated CSV tables relate to the paper's tables:
  • Table 1 in the paper is derived from three auto-generated CSV tables:

    • paper_table_1.csv - details Exact matches by type category.
    • paper_table_2.csv - lists Exact matches for 18 micro-benchmark categories.
    • paper_table_3.csv - provides Sound and Complete values for tools.
  • Table 2 in the paper is based on the following CSV table:

    • paper_table_5.csv - shows Exact matches with top_n values for machine learning tools.

Additionally, there are CSV tables that are not included in the paper:

  • paper_table_4.csv - containing Sound and Complete values for 18 micro-benchmark categories.
  • paper_table_6.csv - featuring Sensitivity analysis.

To start the benchmark run on all tools:

docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      typeevalpy

🔧 Optionally, run analysis on specific tools:

docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      typeevalpy --runners headergen scalpel

📊 Run analysis on custom benchmarks:

For example, the following runs HeaderGen on the autogen benchmark:

docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      typeevalpy \
      --runners headergen \
      --custom_benchmark_dir /app/autogen_typeevalpy_benchmark

🛠️ Available --runners options: headergen, pyright, scalpel, jedi, hityper, type4py, hityperdl

🤖 Running TypeEvalPy with LLMs

TypeEvalPy integrates with LLMs through Ollama, streamlining their management. Begin by setting up your environment:

  • Create Configuration File: Copy the config_template.yaml from the src directory and rename it to config.yaml.

In config.yaml, set the following fields (an example sketch follows this list):

  • openai_key: your key for accessing OpenAI's models.
  • ollama_url: the URL for your Ollama instance. For simplicity, we recommend deploying Ollama using their Docker container; see the Ollama documentation to get started.
  • prompt_id: set this to questions_based_2 for optimal performance, based on our tests.
  • ollama_models: a list of model tags from the Ollama library. For smooth operation, ensure each model is pre-downloaded with the ollama pull command.
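
For reference, a minimal config.yaml might look like the sketch below. All values are placeholders and the model tags are examples only; the authoritative set of fields is given in config_template.yaml in the src directory.

# Illustrative config.yaml sketch; replace the placeholder values with your own.
openai_key: "sk-..."                  # key for accessing OpenAI's models
ollama_url: "http://localhost:11434"  # URL of your Ollama instance
prompt_id: "questions_based_2"        # recommended prompt based on the authors' tests
ollama_models:                        # example tags from the Ollama library; pre-pull each with `ollama pull <tag>`
  - llama3.1:8b
  - qwen2:7b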

With the config.yaml configured, run the following command:

docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      typeevalpy --runners ollama

Running From Source...

1. 📥 Installation

  1. Clone the repo

    git clone https://github.com/secure-software-engineering/TypeEvalPy.git
  2. Install Dependencies and Set Up Virtual Environment

    Run the following commands to create and activate a virtual environment and install the dependencies.

    python3 -m venv .env
    source .env/bin/activate
    pip install -r requirements.txt

2. 🚀 Usage: Running the Analysis

  1. Navigate to the src Directory

    cd src
  2. Execute the Analyzer

    Run the following command to start the benchmarking process on all tools:

    python main_runner.py

    or

    Run analysis on specific tools

    python main_runner.py --runners headergen scalpel
    

Running TypeEvalPy Autogen

To generate an extended version of the original TypeEvalPy benchmark that includes many more Python types, run the following commands:

  1. Navigate to the autogen Directory

    cd autogen
  2. Execute the Generation Script

    Run the following command to start the generation process:

    python generate_typeevalpy_dataset.py

This will generate a folder in the repository root containing the autogen benchmark, named with the current date.


🤝 Contributing

Thank you for your interest in contributing! To add support for a new tool, please use the Docker templates provided in our repository. After implementing and testing your tool, submit a pull request (PR) with a descriptive message. Our maintainers will review your submission and merge it.

To get started with integrating your tool, please follow the guide here: docs/Tool_Integration_Guide.md


⭐️ Show Your Support

Give a ⭐️ if this project helped you!