- Ad relevance: given an individual's purchase history and demographics, return a set of advertisements ordered by relevance.
- Yield prediction: given crop type, soil type, soil treatment, temperature, and moisture, predict the weight of harvested grain.
- HyperLEDA: a database and a collection of tools to study the physics of galaxies and cosmology.
- Given certain properties of a gene, how can one predict its expression?
- Impute the missing values in gene expression data.
- Human epidermal growth factor receptor 2-positive breast cancer brain metastases: Analysis of HER2+ breast cancer brain metastasis specimens and HER2+ nonmetastatic primary breast tumors. Samples were matched for patient age upon primary tumor detection and ER status of primary tumor. Results provide insight into the molecular basis of HER2+ breast cancer outgrowth in the brain. Given this information, how can one recognize HER2-type genes in the dataset? Also, how can one determine whether a predictive classifier adds predictive value to standard prognostic factors?
- Topic summarization: given a set of messages, identify major topics.
- Enron emails: a set of about 500K emails from Enron executives. Enron was a US energy company that was the subject of a federal investigation.
- Request triage: given request text and the customer's history with the company (purchases and previous communications), categorize the request by topic.
- Privacy Preservation
- Privacy Assurance
- Intrusion Detection
- Phishing Detection
- Malware Target Prediction
- Malware Classification
- KDD CUP 99: The task is to develop software that detects network intrusions and so protects a computer network from unauthorized users, including perhaps insiders. The intrusion detector learning task is to build a predictive model (i.e. a classifier) capable of distinguishing between bad connections, called intrusions or attacks, and good normal connections. The 1998 DARPA Intrusion Detection Evaluation Program was prepared and managed by MIT Lincoln Labs. The objective was to survey and evaluate research in intrusion detection. A standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment, was provided. The 1999 KDD intrusion detection contest uses a version of this dataset.
- NSL-KDD: a data set suggested to solve some of the inherent problems of the KDD'99 data set. Although this new version of the KDD data set still suffers from some of the problems discussed by McHugh and may not be a perfect representative of existing real networks, the authors believe that, given the lack of public data sets for network-based IDSs, it can still serve as an effective benchmark to help researchers compare different intrusion detection methods. Furthermore, the number of records in the NSL-KDD train and test sets is reasonable, which makes it affordable to run experiments on the complete set without randomly selecting a small portion. Consequently, evaluation results of different research works will be consistent and comparable.
- UNSW-NB15: an advancement over the two data sets above, KDD CUP 99 and NSL-KDD. It has more realistic features and far more instances than either.
- Phishing Websites: In this dataset, the authors shed light on the important features that have proved to be sound and effective in predicting phishing websites. In addition, they proposed some new features.
- Malware Target Prediction: This Kaggle dataset challenges users to predict if a machine will soon be hit with malware.
- Malware Classification: Static features (extracted without executing the file), designed by domain experts, are derived from malicious, benign, and unlabeled files in order to classify the files in a later test set.
- Unified Host and Network Dataset: This dataset contains a subset of (anonymized) network and computer events collected from the Los Alamos National Laboratory enterprise network over the course of approximately 90 days. This dataset is useful because the computer host and network data are co-occurring.
- Loan repayment: given an individual's financial history, predict the likelihood that they will successfully repay a loan.
- Loan approval: given an individual's information (such as Self_Employed, Loan_Amount_Term, Credit_History, etc.), predict whether a loan application will be approved.
- Credit card approval: given an individual's information, predict whether a credit card application will be approved.
- Search: given a set of search terms, return a set of documents ordered by relevance.
- Disaster modeling: given the history of disaster occurrences, predict the likelihood that a similar disaster will occur again within a given time window.
- Device health: given a stream of device data, determine whether it is working as intended.
- Predictive maintenance: given a stream of device data, anticipate when repair or maintenance will be necessary.
- Device comparisons: given a collection of device data streams from two different hardware/software/environment conditions, determine which is working better.
- Industrial Internet of Things Data: Industrial demand/response IoT data for IoT analytics.
- Topic retrieval: given a topic, find documents and selections focused on it.
- Hierarchical topic modeling: given a set of documents, find major topics discussed and the subtopics under each.
- Caselaw Access Project: includes all official, book-published United States case law — every volume designated as an official report of decisions by a court within the United States.
- Question-answering: given a user query and a set of candidate passages, mark the most relevant passage that contains the answer to the query.
- The Stanford Question Answering Dataset (SQuAD2.0): a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
- Content recommendation: given a set of recently viewed documents, return a set of documents ordered by relevance.
- MovieLens 20M Dataset: This dataset describes ratings and free-text tagging activity from MovieLens, a movie recommendation service. It contains 20,000,263 ratings and 465,564 tag applications across 27,278 movies. These data were created by 138,493 users between January 09, 1995 and March 31, 2015. This dataset was generated on October 17, 2016. Some ideas worth exploring: 1. Which genres receive the highest ratings? How does this change over time? 2. Determine the temporal trends in the genres/tagging activity of the movies released
- Detecting medical conditions in radiographs: given a set of X-rays, determine whether each is indicative of the condition of interest.
- Detecting malaria from blood smear: given a patient's blood smear, predict whether it is infected with malaria.
- Predicting likelihood of medical condition: given values for a set of patient risk factors, predict the likelihood that a given medical condition will manifest.
- Breast Cancer Data Set: includes information about the breast cancer screening of several female patients. Hosted on UCI.
- Diabetic retinopathy images: a collection of retinal images, each labeled with its retinopathy scale score.
- The Human Microbiome Project: genetic sequences of microbes from hundreds of healthy individuals, across several different sites on the human body: nasal passages, oral cavity, skin, gastrointestinal tract, and urogenital tract.
- MIMIC-III Hospital admissions: 58,000 hospital admissions for 38,645 adults and 7,875 neonates. Access instructions (2012)
- Pima diabetes database: includes BMI, insulin level, age, number of pregnancies, and diabetes diagnosis from female patients of Pima Indian heritage. Hosted on Kaggle.
- Malaria data set
- Chronic_Kidney_Disease Data Set: given an individual's features such as age, blood_pressure, and rbc_count, predict whether the individual has chronic kidney disease.
- Voter turnout: given an individual's voting history, predict whether they will vote in the next election.
- Product recommendation: given a set of past purchases, return a set of products ordered by relevance.
- Demand forecasting: given a history of purchases, make predictions for future purchase volumes over time.
- Instacart grocery purchases: 3 Million Instacart Orders. (2017)
- UK retailer transactions: "This is a transactional data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers." From UCI ML lab, hosted on Kaggle. (2011)
- Sales by store: historical sales data for 45 stores located in different regions. Hosted on Kaggle. (2013)
- Black Friday sales: 550,000 items, 100,000 customers. Hosted on Kaggle. (2018)
- Connection recommendation: given an individual's current connections, return a set of potential connections ordered by relevance.
- Facebook: 'Circles' (or 'friends lists') from Facebook, collected from survey participants using the Facebook app. (2012)
- Twitter: 'Circles' (or 'lists') from Twitter, crawled from public sources. (2012)
- Predictive maintenance: given a vehicle's usage and maintenance history, predict when a given failure is likely to next occur.
- On-Time Performance: identify the parts of a transportation network that cause delays, determine the root causes, and recommend improvements to on-time performance.
- SBB Swiss Federal Railways On-Time Performance: CrowdAI challenge and data set to optimize punctuality of train schedules
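
As a rough illustration of the on-time performance task above, the sketch below ranks network segments by their average delay, which is one simple way to surface "problematic parts" of a network. The segment names, the delay values, and the `rank_segments_by_delay` helper are all hypothetical and are not taken from the SBB data set; a real analysis would also need to account for schedules, causes, and knock-on effects.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical observations: (network segment, delay in minutes).
# These values are made up for illustration only.
observations = [
    ("A-B", 2.0), ("A-B", 4.0),
    ("B-C", 12.0), ("B-C", 8.0),
    ("C-D", 0.0), ("C-D", 1.0),
]


def rank_segments_by_delay(records):
    """Return (segment, mean delay) pairs, worst segment first."""
    delays = defaultdict(list)
    for segment, delay_minutes in records:
        delays[segment].append(delay_minutes)
    averages = {segment: mean(values) for segment, values in delays.items()}
    # Sort by mean delay, largest first, so the most delayed segment leads.
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


ranking = rank_segments_by_delay(observations)
for segment, average_delay in ranking:
    print(f"{segment}: {average_delay:.1f} min average delay")
```

Mean delay is only one possible ranking criterion; the share of trips delayed beyond a threshold, or a high percentile of delay, may be more robust to outliers.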