Add info about deduplicated datasets
cedricrupb committed Nov 29, 2023
1 parent a05ca67 commit a891a13
Showing 1 changed file with 9 additions and 0 deletions.
9 changes: 9 additions & 0 deletions index.md
@@ -5,6 +5,15 @@ program repair. With the recent advances in data-driven
bug detection and repair, single statement bug fixes at the scale of millions of examples are becoming more important than ever. For this reason, we are releasing three new
datasets consisting of single statement changes and bug fixes from over 500K Python Git projects.

## :warning: Deduplicated Datasets
We noticed that our datasets contain a significant number of duplicate patches that our deduplication procedure missed. To mitigate this, we are releasing cleaned versions of **TSSB-3M** and **SSB-9M**:

* [**CTSSB-1M**](https://tssb3m.s3.eu-west-1.amazonaws.com/ctssb_data_1M.zip) A cleaned version of TSSB-3M containing nearly a million isolated single statement bug fixes.

* [**CSSB-2.6M**](https://tssb3m.s3.eu-west-1.amazonaws.com/cssb_data_2_6M.zip) A cleaned version of SSB-9M containing over 2.6 million single statement bug fixes.

The cleaned datasets are also available on [Zenodo](https://doi.org/10.5281/zenodo.10217373).
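
As a minimal sketch, the cleaned archives could also be fetched from a script. The URLs below are copied from the list above; the helper names (`local_archive_path`, `download_dataset`, `list_archive_members`) and the `data` target directory are assumptions made for illustration and are not part of the released tooling.

```python
import urllib.request
import zipfile
from pathlib import Path
from urllib.parse import urlparse

# Download URLs as given in the list above.
CLEANED_DATASETS = {
    "CTSSB-1M": "https://tssb3m.s3.eu-west-1.amazonaws.com/ctssb_data_1M.zip",
    "CSSB-2.6M": "https://tssb3m.s3.eu-west-1.amazonaws.com/cssb_data_2_6M.zip",
}


def local_archive_path(name, target_dir="data"):
    """Return where the archive for `name` would be stored locally (hypothetical layout)."""
    url = CLEANED_DATASETS[name]
    return Path(target_dir) / Path(urlparse(url).path).name


def download_dataset(name, target_dir="data"):
    """Download the archive for `name` unless it is already present; return its path."""
    archive = local_archive_path(name, target_dir)
    archive.parent.mkdir(parents=True, exist_ok=True)
    if not archive.exists():
        urllib.request.urlretrieve(CLEANED_DATASETS[name], str(archive))
    return archive


def list_archive_members(archive):
    """List the file names inside a downloaded zip archive without extracting it."""
    with zipfile.ZipFile(archive) as zf:
        return zf.namelist()


# Example (not executed here, since the archives are large):
#   archive = download_dataset("CTSSB-1M")
#   print(list_archive_members(archive))
```

The sketch only lists the archive members rather than assuming a particular file layout inside the zip files.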

### Datasets
To download our datasets, use:
