use_cases.md

Data science use cases, by topic

Advertising

  • Ad relevance: given an individual's purchase history and demographics, return a set of advertisements ordered by relevance.

Agriculture

  • Yield prediction: given crop type, soil type, soil treatment, temperature, and moisture, predict the weight in harvested grain.

Astronomy

Relevant data sets

  • HyperLEDA: a database and a collection of tools to study the physics of galaxies and cosmology.

Bioinformatics

  • Expression prediction: given certain properties of a gene, predict its expression.
  • Imputation: given a gene expression profile with missing characteristics, impute the missing values.

Relevant data sets

  • Human epidermal growth factor receptor 2-positive breast cancer brain metastases: Analysis of HER2+ breast cancer brain metastasis specimens and HER2+ nonmetastatic primary breast tumors. Samples were matched for patient age upon primary tumor detection and ER status of primary tumor. Results provide insight into the molecular basis of HER2+ breast cancer outgrowth in the brain. Given this information, how can one recognize HER2-type genes in the data set? And how can one determine whether a predictive classifier adds predictive value beyond standard prognostic factors?

Correspondence

  • Topic summarization: given a set of messages, identify major topics.

Relevant data sets

  • Enron emails: a set of about 500K emails from Enron executives. Enron was a US energy company that was the subject of a federal investigation.

Customer service

  • Request triage: given request text and the customer's history with the company (purchases and previous communications), categorize the request by topic.

Cyber Security

  • Privacy Preservation
  • Privacy Assurance
  • Intrusion Detection
  • Phishing Detection
  • Malware Target Prediction
  • Malware Classification

Relevant data sets

  • KDD CUP 99: The task is to build a network intrusion detector that protects a computer network from unauthorized users, including perhaps insiders. The learning task is to build a predictive model (i.e., a classifier) capable of distinguishing between bad connections, called intrusions or attacks, and good, normal connections. The 1998 DARPA Intrusion Detection Evaluation Program, prepared and managed by MIT Lincoln Labs, set out to survey and evaluate research in intrusion detection. It provided a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment. The 1999 KDD intrusion detection contest uses a version of this data set.
  • NSL-KDD: A data set suggested to solve some of the inherent problems of the KDD'99 data set. This new version still suffers from some of the problems discussed by McHugh and may not be a perfect representative of existing real networks. However, because public data sets for network-based IDSs are scarce, its authors believe it can still serve as an effective benchmark for comparing intrusion detection methods. Furthermore, the number of records in the NSL-KDD train and test sets is reasonable, which makes it affordable to run experiments on the complete set without randomly selecting a small portion. Consequently, evaluation results from different research efforts will be consistent and comparable.
  • UNSW-NB15: An advancement over the KDD CUP 99 and NSL-KDD data sets. It captures more realistic features and many more instances than either of them.
  • Phishing Websites: In this dataset, the authors shed light on the important features that have proved to be sound and effective in predicting phishing websites. In addition, they proposed some new features.
  • Malware Target Prediction: This Kaggle dataset challenges users to predict if a machine will soon be hit with malware.
  • Malware Classification: Static features (obtained without executing the file) designed by domain experts are extracted from malicious, benign, and unlabeled files, with the goal of classifying files in a future test set.
  • Unified Host and Network Dataset: This data set contains a subset of (anonymized) network and computer events collected from the Los Alamos National Laboratory enterprise network over the course of approximately 90 days. It is useful because the computer host and network data are co-occurring.

Economic development

Finance

  • Loan repayment: given an individual's financial history, predict the likelihood that they will successfully repay a loan.
  • Loan approval: given an individual's information (such as Self_Employed, Loan_Amount_Term, Credit_History, etc.), predict whether a loan application will be approved.
  • Credit card approval: given an individual's information, predict whether a credit card application will be approved.

Relevant data sets

Humanitarian

Information

  • Search: given a set of search terms, return a set of documents ordered by relevance.
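
As a rough illustration of the search task above, the following sketch ranks a handful of documents against a query using a simple TF-IDF score. The function names (`tokenize`, `rank_documents`), the smoothing in the IDF term, and the toy documents are all assumptions made for this example; a production search system would use a far more careful tokenizer and scoring function.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately crude tokenizer."""
    return text.lower().split()


def rank_documents(query, documents):
    """Return document indices ordered from most to least relevant.

    Relevance is the sum of TF-IDF weights of the query terms in each document.
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in term_counts:
                tf = term_counts[term] / len(tokens)
                # Smoothed IDF (an assumption of this sketch).
                idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
                score += tf * idf
        scores.append(score)

    # Sort by descending score; ties keep their original order.
    return sorted(range(n_docs), key=lambda i: -scores[i])


# Toy documents, invented for illustration only.
docs = [
    "loan repayment history for retail customers",
    "galaxy survey data and cosmology tools",
    "customer purchase history and product recommendation",
]
ranking = rank_documents("galaxy cosmology", docs)
```

Here `ranking` lists the index of the galaxy document first, since it is the only one containing either query term.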

Insurance

  • Disaster modeling: given the history of disaster occurrences, predict the likelihood that a similar disaster will occur again within a given time window.

Internet of Things

  • Device health: given a stream of device data, determine whether it is working as intended.
  • Predictive maintenance: given a stream of device data, anticipate when repair or maintenance will be necessary.
  • Device comparisons: given a collection of device data streams from two different hardware/software/environment conditions, determine which is working better.
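
One simple way to approach the device health task is to flag readings that deviate sharply from a rolling window of recent values. The sketch below does this with a z-score; the function name `flag_anomalies`, the window size, the threshold, and the temperature readings are hypothetical choices for illustration, not a recommended configuration.

```python
import statistics
from collections import deque


def flag_anomalies(readings, window=5, threshold=3.0):
    """Yield (index, value) for readings far from the recent rolling mean."""
    recent = deque(maxlen=window)
    for index, value in enumerate(readings):
        # Only score a reading once a full window of history is available.
        if len(recent) == window:
            mean = statistics.mean(recent)
            stdev = statistics.pstdev(recent)
            # Skip scoring when recent readings are identical (zero spread).
            if stdev > 0 and abs(value - mean) / stdev > threshold:
                yield index, value
        recent.append(value)


# Made-up temperature stream with one obvious spike.
temperatures = [20.1, 20.3, 19.9, 20.0, 20.2, 20.1, 35.0, 20.0]
anomalies = list(flag_anomalies(temperatures))
```

Note that in this sketch an anomalous reading still enters the window afterwards, which inflates the spread for the next few readings; whether to exclude flagged values is a design choice that depends on the device.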

Relevant data sets

Legal

  • Topic retrieval: given a topic, find documents and selections focused on it.
  • Hierarchical topic modeling: given a set of documents, find major topics discussed and the subtopics under each.

Relevant data sets

  • Caselaw Access Project: includes all official, book-published United States case law — every volume designated as an official report of decisions by a court within the United States.

Machine reading comprehension

  • Question-answering: given a user query and a set of candidate passages, mark the most relevant passage that contains the answer to the query.

Relevant data sets

  • The Stanford Question Answering Dataset (SQuAD2.0): Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

Media

  • Content recommendation: given a set of recently viewed documents, return a set of documents ordered by relevance.

Relevant data sets

  • MovieLens 20M Dataset: This data set describes ratings and free-text tagging activities from MovieLens, a movie recommendation service. It contains 20000263 ratings and 465564 tag applications across 27278 movies. These data were created by 138493 users between January 09, 1995 and March 31, 2015. This data set was generated on October 17, 2016. Some ideas worth exploring: 1. Which genres receive the highest ratings? How does this change over time? 2. Determine the temporal trends in the genres/tagging activity of the movies released.

Medicine

  • Detecting medical conditions in radiographs: given a set of X-rays, determine whether each is indicative of the condition of interest.
  • Detecting malaria from blood smear: given a patient’s blood smear, predict whether it is infected with malaria.
  • Predicting likelihood of medical condition: given values for a set of patient risk factors, predict the likelihood that a given medical condition will manifest.

Relevant data sets

Politics

  • Voter turnout: given an individual's voting history, predict whether they will vote in the next election.

Retail

  • Product recommendation: given a set of past purchases, return a set of products ordered by relevance.
  • Demand forecasting: given a history of purchases, make predictions for future purchase volumes over time.
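
A very basic baseline for the demand forecasting task is a moving average over the most recent purchase volumes. The sketch below assumes a plain list of periodic sales counts; the function name `moving_average_forecast`, its parameters, and the sample numbers are invented for illustration, and real forecasting would usually need to account for trend and seasonality.

```python
def moving_average_forecast(history, window=3, horizon=2):
    """Forecast future volumes as the mean of the last `window` values.

    Each forecast is appended to the series so later steps can build on it.
    """
    if window <= 0 or len(history) < window:
        raise ValueError("need at least `window` observations")
    series = list(history)
    forecasts = []
    for _ in range(horizon):
        next_value = sum(series[-window:]) / window
        forecasts.append(next_value)
        series.append(next_value)
    return forecasts


# Made-up weekly sales volumes.
weekly_units = [120, 130, 125, 140, 150, 145]
forecast = moving_average_forecast(weekly_units, window=3, horizon=2)
```

The first forecast is the mean of the last three observations; the second reuses that forecast as if it were an observation, so uncertainty grows with the horizon.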

Relevant data sets

  • Instacart grocery purchases: 3 Million Instacart Orders. (2017)
  • UK retailer transactions: "This is a transactional data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers." From UCI ML lab, hosted on Kaggle. (2011)
  • Sales by store: historical sales data for 45 stores located in different regions. Hosted on Kaggle. (2013)
  • Black Friday sales: 550,000 items, 100,000 customers. Hosted on Kaggle. (2018)

Social networks

  • Connection recommendation: given an individual's current connections, return a set of potential connections ordered by relevance.
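
One common starting point for connection recommendation is to rank people by how many connections they share with the individual. The sketch below assumes the network is stored as a dictionary mapping each person to a set of their connections; the function name `recommend_connections` and the small graph are hypothetical and exist only to illustrate the idea.

```python
from collections import Counter


def recommend_connections(graph, person, top_n=3):
    """Rank non-connections of `person` by number of mutual connections."""
    current = graph.get(person, set())
    mutual_counts = Counter()
    for friend in current:
        for candidate in graph.get(friend, set()):
            # Skip the person themselves and people they already know.
            if candidate != person and candidate not in current:
                mutual_counts[candidate] += 1
    # Sort by descending mutual count, then by name for a deterministic order.
    ranked = sorted(mutual_counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:top_n]]


# Invented undirected network, stored as adjacency sets.
graph = {
    "ana": {"bo", "cy"},
    "bo": {"ana", "cy", "dee"},
    "cy": {"ana", "bo", "dee", "eli"},
    "dee": {"bo", "cy"},
    "eli": {"cy"},
}
suggestions = recommend_connections(graph, "ana")
```

In this toy graph, "dee" shares two connections with "ana" and "eli" shares one, so they are suggested in that order.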

Relevant data sets

  • Facebook: 'Circles' (or 'friends lists') from Facebook, collected from survey participants using the Facebook app. (2012)
  • Twitter: 'Circles' (or 'lists') from Twitter, crawled from public sources. (2012)

Sports

Relevant data sets

Transportation

  • Predictive maintenance: given a vehicle's usage and maintenance history, predict when a given failure is likely to occur next.
  • On-time performance: identify the parts of a transportation network that cause delays, determine the root causes, and recommend ways to improve on-time performance.

Relevant data sets