data

Jun 2, 2020

8dbf958 · Jun 2, 2020

Name	Name	Last commit message	Last commit date
parent directory ..
README.md	README.md	Initial commit	Jun 2, 2020
af (1).txt.gz	af (1).txt.gz	Add files via upload	Jun 2, 2020

README.md

In general, I like caching a local copy of the data up in releases (see below). Sometimes it makes more sense to refer to the original source and leave it at that. Choose what works best for your project. One of the nice things about working in GitHub is that you can also refer to other projects. It might be that you are using another researcher's data set, or even a prior data set that you made. In those cases, just link to their data sections and them use the rest of this document to describe what new efforts you bring to the party.

The original package can be downloaded by hand from Kaggle. The original bill text can be downloaded by hand from congress.gov. For convenience, we keep a 2nd copy of all the data gzip'ed in releases. We need to source the data locally for everyone because GitHub has storage limits that we don't want to cross.

Steps

GitHub lets you keep up to a total of 1 GB of files in your repository with a single file being no bigger than 100MB. However, it allows you to keep 100GB of data in your releases section. This artificial limit helps you in terms of data quality when you consider GitHub to be your research journal. As you are making your observations, they become fixed. You certainly don't want to commit data fraud by going back and updating the data.

Because of this, I strongly recommend keeping all your data compressed and in the Releases section of your Repository. Each step below should tell you what file to bring down and place in this folder. Using this process, you get around most of the slow bloated repository issues, while still being able to write your code with relative paths. Trust me, others will thank you for not referring to a particular directory on your personal computer.

In general nothing (other than this file) should be stored in the ~/data directory when viewed in GitHub.

Retrieve the dataset by hand. Click on the download link, saving the file to ~/data/raw
Extract the data in-place
1. right click the file, select '7-zip', select 'Extract Here'

Shortcuts

GitHub has some issues when dealing with large files. The recommended method for dealing with large files is to store them in releases. You can find the gzip'ed versions of the below there. Any of the steps can be skipped by downloading the correct file from releases and proceding from that point forward.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Files

data

data

README.md

Steps

Shortcuts

Files

data

Directory actions

More options

Directory actions

More options

Latest commit

History

data

Folders and files

parent directory

README.md

Steps

Shortcuts