(Check out other code resources in our group at: https://github.com/sunlab-osu)
StaQC (Stack Overflow Question-Code pairs) is the largest dataset to date of question-code pairs, with around 148K pairs in the Python domain and 120K in the SQL domain. The pairs are automatically mined from Stack Overflow using a Bi-View Hierarchical Neural Network, as described in the paper "StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow" (WWW'18).
StaQC is collected from three sources: multi-code answer posts, single-code answer posts, and manual annotations on multi-code answer posts:
| Source | # of question-code pairs (Python) | # of question-code pairs (SQL) |
| --- | --- | --- |
| Multi-Code Answer Posts | 60,083 | 41,826 |
| Single-Code Answer Posts | 85,294 | 75,637 |
| Manual Annotation | 2,169 | 2,056 |
| Sum | 147,546 | 119,519 |
A Multi-code answer post is an (accepted) answer post that contains multiple code snippets, some of which may not be a standalone code solution to the question (see Section 1 in paper). For example, in this multi-code answer post, the third code snippet is not a code solution to the question "How to limit a number to be within a specified range? (Python)".
The ids of question-code pairs automatically mined or manually annotated from multi-code answer posts can be found here: Python and SQL.
Format: Each line corresponds to one code snippet, which can be paired with its question. The code snippet is identified by (question id, code snippet index), where the code snippet index refers to the index (starting from 0) of the code snippet in the accepted answer post of this question. For example, (5996881, 0) refers to the first code snippet in the accepted answer post of the question with id "5996881", which can be paired with its question "How to limit a number to be within a specified range? (Python)".
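As a minimal sketch of working with these identifiers, the snippet below parses lines of the form (question id, code snippet index) into tuples. The on-disk layout of the id files is an assumption here (one Python-style tuple literal per line), and `parse_pair_ids` is a hypothetical helper name, not part of the StaQC code.

```python
import ast


def parse_pair_ids(lines):
    """Parse lines such as "(5996881, 0)" into (question_id, code_index) tuples.

    Assumption: each non-blank line holds exactly one tuple literal.
    """
    pair_ids = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        question_id, code_index = ast.literal_eval(line)
        pair_ids.append((question_id, code_index))
    return pair_ids


# The first id comes from the example above; the second is made up for illustration.
sample_lines = ["(5996881, 0)\n", "\n", "(1234567, 2)\n"]
pair_ids = parse_pair_ids(sample_lines)
print(pair_ids)
```

If the real files use a different layout, only the parsing line inside the loop should need to change.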
We also provide the complete source data. Note that the source data contains all available resources (not only the mined question-code pairs). Given the source data, you can retrieve the mined code solutions using the provided question-code ids (see above).
Source data: Python 2.7 Pickle files. Please open with pickle.load(open(filename)).
- Code snippets for Python and SQL: A dict of {(question id, code index): code snippet}.
- Question titles for Python and SQL: A dict of {question id: question title}.
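The retrieval step described above might look like the following sketch. It assumes you are on Python 3, where a Python 2.7 pickle generally needs to be opened in binary mode and loaded with `encoding="latin1"`. The helper names `load_py2_pickle` and `collect_pairs`, the toy dictionaries, and the file name are all illustrative stand-ins, not the released data or code.

```python
import os
import pickle
import tempfile


def load_py2_pickle(path):
    """Load a pickle written by Python 2.7 from Python 3."""
    with open(path, "rb") as f:
        # encoding="latin1" lets Python 3 decode Python 2 byte strings.
        return pickle.load(f, encoding="latin1")


def collect_pairs(pair_ids, code_snippets, question_titles):
    """Look up (question title, code snippet) for each (question id, code index)."""
    pairs = []
    for question_id, code_index in pair_ids:
        title = question_titles[question_id]
        code = code_snippets[(question_id, code_index)]
        pairs.append((title, code))
    return pairs


# Toy stand-ins shaped like the two dicts described above (placeholder code strings).
toy_code = {(5996881, 0): "placeholder snippet 0", (5996881, 2): "placeholder snippet 2"}
toy_titles = {5996881: "How to limit a number to be within a specified range? (Python)"}

with tempfile.TemporaryDirectory() as tmp_dir:
    code_path = os.path.join(tmp_dir, "toy_code.pickle")  # hypothetical file name
    with open(code_path, "wb") as f:
        pickle.dump(toy_code, f, protocol=2)  # protocol 2 is readable by Python 2.7
    loaded_code = load_py2_pickle(code_path)

pairs = collect_pairs([(5996881, 0)], loaded_code, toy_titles)
print(pairs)
```

Only the ids listed in the mined or annotated id files should be passed to `collect_pairs`; the source dicts also contain snippets that were not selected as solutions.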
A single-code answer post is an (accepted) answer post that contains only one code snippet. We pair this code snippet with the question title to form a question-code pair.
Source data: Python 2.7 Pickle files. Please open with pickle.load(open(filename)).
- Code snippets for Python and SQL: A dict of {question id: accepted code snippet}.
- Question titles for Python and SQL: A dict of {question id: question title}.
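For single-code answer posts, the pairing is a direct join of the two dicts on the question id. The sketch below illustrates this with made-up ids and text; `pair_single_code` is a hypothetical helper, and skipping questions that lack a title is a design choice of this example rather than documented behavior.

```python
def pair_single_code(code_by_question, title_by_question):
    """Join {question id: code} with {question id: title} on the question id."""
    pairs = []
    for question_id in sorted(code_by_question):
        title = title_by_question.get(question_id)
        if title is None:
            continue  # no title available for this question; skip it
        pairs.append((question_id, title, code_by_question[question_id]))
    return pairs


# Made-up ids and contents, shaped like the dicts described above.
toy_code = {101: "SELECT 1;", 102: "SELECT 2;", 103: "SELECT 3;"}
toy_titles = {101: "toy title A", 103: "toy title C"}

single_pairs = pair_single_code(toy_code, toy_titles)
print(single_pairs)
```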
[Update 05/27/2019] If you are using our processed data, vocabularies (text_word_vocab.pickle for text, code_token_vocab.pickle for code) can be found in the following folders:
- Python: text vocab, code vocab.
- SQL: text vocab, code vocab.
Human annotations can be found here: Python and SQL. Both are pickle files.
The script that extracts features for constructing a "how-to-do-it" question type classifier can be found here. The 250 manually annotated posts for Python and SQL can be found here (label '1' denotes "how-to-do-it"). For details, please refer to Section 2.2.1 in our paper.
The script for processing code snippets can be found here. For details, please read Section 5.1 in our paper. The implementation of the SQL parser is adapted from https://github.com/sriniiyer/codenn.
- Installing package
cd data_processing/codenn/src/sqlparse/
python setup.py install
- Processing code snippets (tokenization, normalizing variable names, etc.)
cd data_processing
The tokenize_code_corpus function receives a dictionary of code snippets and returns the parsing results. Please run python code_processing.py for testing.
We provide the processed training/validation/testing files used in our experiments here.
- Before running, please unzip the word embedding files for Python (code_word_embedding.gz*) as follows:
cd data/data_hnn/python/train/
cat code_word_embedding.gza* | zcat > rnn_partialcontext_word_embedding_code_150.pickle
rm code_word_embedding.gza*
Then go back to the code directory:
cd ../../../../BiV_HNN/
No further steps are needed for the SQL data.
- Train:
For Python data:
python run.py --train --train_setting=1 --text_model=1 --code_model=1 --query_model=1 --text_model_setting="64-150-24379-0-1-0-1" --code_model_setting="64-150-218900-0-1-0-1" --query_model_setting="64-150-24379-0-1-0-1" --keep_prob=0.5
For SQL data:
python run.py --train --train_setting=2 --text_model=1 --code_model=1 --query_model=1 --text_model_setting="64-150-13698-0-1-0-1" --code_model_setting="64-150-33192-0-1-0-1" --query_model_setting="64-150-13698-0-1-0-1" --keep_prob=0.7
The above command trains the BiV-HNN model. It prints the model's learning progress on the training set and its performance on the validation and testing sets.
To train Text-HNN, set:
--code_model=0 --query_model=0 --code_model_setting=None --query_model_setting=None
to disable the code and query modeling.
To train Code-HNN, set:
--text_model=0 --text_model_setting=None
to disable the text modeling.
- Test:
You may revise the test function in run.py to test on other datasets, then run the above command with --train replaced by --test.
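If `cat` and `zcat` are not available (for example on Windows), the unzip step for the Python word embeddings can be approximated in Python. This is a sketch under the assumption that the `code_word_embedding.gza*` files are consecutive pieces of a single gzip stream, which is what the `cat ... | zcat` command implies; `join_and_gunzip` is a hypothetical helper, and the demo runs on a small synthetic payload rather than the real embedding files.

```python
import glob
import gzip
import os
import tempfile


def join_and_gunzip(part_pattern, output_path):
    """Concatenate split gzip parts in sorted order and write the decompressed bytes."""
    part_paths = sorted(glob.glob(part_pattern))
    if not part_paths:
        raise FileNotFoundError("no files match " + part_pattern)
    compressed = b""
    for part_path in part_paths:
        with open(part_path, "rb") as part:
            compressed += part.read()
    # Reads everything into memory; fine for a sketch, stream for very large files.
    with open(output_path, "wb") as out:
        out.write(gzip.decompress(compressed))
    return part_paths


# Demo on synthetic data: compress a payload, split it in two, then restore it.
payload = b"toy embedding bytes " * 100
blob = gzip.compress(payload)
with tempfile.TemporaryDirectory() as tmp_dir:
    half = len(blob) // 2
    for suffix, chunk in (("aa", blob[:half]), ("ab", blob[half:])):
        with open(os.path.join(tmp_dir, "code_word_embedding.gz" + suffix), "wb") as f:
            f.write(chunk)
    output_path = os.path.join(tmp_dir, "restored.pickle")
    used_parts = join_and_gunzip(os.path.join(tmp_dir, "code_word_embedding.gza*"), output_path)
    with open(output_path, "rb") as f:
        restored = f.read()

print(len(used_parts), restored == payload)
```

For the real data, the pattern would point at data/data_hnn/python/train/code_word_embedding.gza* and the output at rnn_partialcontext_word_embedding_code_150.pickle in the same folder.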
If you use the dataset or the code in your research, please cite the following paper:
@inproceedings{yao2018staqc,
title={StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow},
author={Yao, Ziyu and Weld, Daniel S and Chen, Wei-Peng and Sun, Huan},
booktitle={Proceedings of the 2018 World Wide Web Conference on World Wide Web},
pages={1693--1703},
year={2018},
organization={International World Wide Web Conferences Steering Committee}
}
This work is licensed under a Creative Commons Attribution 4.0 International License.