You may already have data that you're interested in working with. You may have an idea for scraping web data from some source, or for using data from an API such as Twitter's. There are lots of sources of data!
- There are a variety of open data catalogs from various governments and NGOs:
  - NYC Open Data
  - DC Data Catalog / OpenDataDC
  - data.gov
  - data.gov.uk
  - The US Census
  - Data from the World Bank
  - The Sunlight Foundation
  - ProPublica Data Store
  - Humanitarian Data Exchange of the United Nations Office for the Coordination of Humanitarian Affairs
- Academic institutions host a variety of data set collections:
  - The UC Irvine Machine Learning Repository makes a variety of data sets available.
  - Stanford Large Network Dataset Collection
  - Inter-university Consortium for Political and Social Research
  - The Pittsburgh Science of Learning Center’s DataShop
  - Academic Torrents: a distributed network for sharing large research data sets
- Quandl: over 9,000,000 financial, economic, and social datasets
- Infochimps Marketplace: more than 11,000 searchable data sets
- Kaggle provides data sets with its challenges. You probably won't be able to get the private test sets, but you can get the scores reported on the leaderboards.
- Donors Choose makes quite a lot of potentially interesting data available.
- The Echo Nest has some interesting music data, available through an API.
- If you're interested in working with a large data set, Amazon hosts a variety of public data sets on its infrastructure, including Common Crawl and 1000 Genomes.
- There are also various APIs out there.
- More lists of data sets:
- Datasets subreddit: you can ask for help finding a specific data set, or post your own.
- mldata.org: "machine learning data set repository"
This is just the tip of the iceberg; there's a lot of data out there!