
Classical Statistics, Linear Regression, k-means, Classifications, k-Nearest Neighbours, Hierarchical, Naive Bayes

Machine Learning Models

Data Cleaning

  • Reduce Noise
  • Easier to Interpret
  • Focus on Patterns

Degrees of Freedom (DOF): number of fields - 1. E.g. if your data stores 2 variables (columns), DOF = 2 - 1 = 1.

Step 1: Clustering. Creation of buckets (groups) to group data.
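A minimal sketch of how buckets can be formed, using k-means as one possible clustering method. The function name `k_means`, the sample points, and the choice of the first k points as starting centroids are assumptions made for illustration only.

```python
from math import dist

def k_means(points, k, iterations=10):
    """Group points into k buckets around k centroids."""
    # Assumption: use the first k points as the starting centroids.
    centroids = [tuple(p) for p in points[:k]]
    buckets = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assign every point to the bucket of its nearest centroid.
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            buckets[nearest].append(p)
        # Move each centroid to the mean of its bucket (keep it if the bucket is empty).
        centroids = [
            tuple(sum(coord) / len(b) for coord in zip(*b)) if b else centroids[i]
            for i, b in enumerate(buckets)
        ]
    return centroids, buckets

# Hypothetical 2-variable data with two visible groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5), (8.5, 9.0)]
centroids, buckets = k_means(points, k=2)
```

Note that the grouping is driven purely by similarity (distance); no labels or accuracy measure are involved.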

Step 2: Classification Algorithms

  • Unsupervised Learning: classified by similarity, not accuracy.
  • Supervised Learning: accuracy is the key criterion.

Examples: spam filtering, fraud detection, genetic testing (patterns in variables), psychological diagnosis.

The validity of a model depends on the data available.

k-Nearest Neighbours (kNN)
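As a rough illustration, kNN labels a new case by majority vote among the k closest labelled cases. The function name `knn_predict` and the sample data below are hypothetical.

```python
from collections import Counter
from math import dist

def knn_predict(training, query, k=3):
    """training is a list of (features, label) pairs; returns the majority label of the k nearest."""
    nearest = sorted(training, key=lambda row: dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labelled cases with two numeric variables each.
training = [
    ((1.0, 1.0), "ham"), ((1.2, 0.8), "ham"), ((0.9, 1.1), "ham"),
    ((5.0, 5.0), "spam"), ((5.2, 4.8), "spam"), ((4.9, 5.1), "spam"),
]
label = knn_predict(training, (4.5, 4.5))
```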

Naive Bayes - Uses prior and posterior probabilities calculated by Bayes' theorem. Ignores relationships between variables (hence called "naive"). Human readable.
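A small sketch of the prior-to-posterior update for a two-class (spam/ham) case. Multiplying the per-word likelihoods together is exactly the "naive" independence assumption. The function name and all probabilities are made-up illustration values.

```python
def posterior_spam(prior_spam, likelihoods):
    """likelihoods: one (P(word | spam), P(word | ham)) pair per observed word."""
    spam_score = prior_spam
    ham_score = 1.0 - prior_spam
    for p_word_spam, p_word_ham in likelihoods:
        # Naive step: treat words as independent, so likelihoods simply multiply.
        spam_score *= p_word_spam
        ham_score *= p_word_ham
    # Bayes' theorem: normalise by the total probability of the evidence.
    return spam_score / (spam_score + ham_score)

# Hypothetical numbers: 20% of mail is spam; the word appears in 50% of spam and 5% of ham.
one_word = posterior_spam(0.2, [(0.5, 0.05)])
two_words = posterior_spam(0.2, [(0.5, 0.05), (0.4, 0.1)])
```

With these numbers, each additional spam-leaning word pushes the posterior further above the 20% prior.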

Decision Trees - Splits cases into decisions and aggregates them into branches. Leaves are the end outcomes (classification). Human readable.
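To make the idea of a split concrete, here is a simplified sketch that finds a single decision (one threshold on one variable), i.e. a tree with one branch point. Real decision tree learners usually score splits with an impurity measure such as Gini or entropy; counting misclassified cases is a simplification chosen here, and the function name and data are hypothetical.

```python
def best_split(values, labels):
    """Return (threshold, errors) for the rule 'value > threshold means class 1'."""
    best = None
    for t in sorted(set(values)):
        # Count cases where the rule disagrees with the known label.
        errors = sum((v > t) != bool(y) for v, y in zip(values, labels))
        if best is None or errors < best[1]:
            best = (t, errors)
    return best

# Hypothetical measurements with a known 0/1 outcome per case.
values = [110, 120, 130, 150, 160, 170]
labels = [0, 0, 0, 1, 1, 1]
threshold, errors = best_split(values, labels)
```

A full tree repeats this search inside each resulting branch until the leaves are pure enough.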

Random Forest - A bunch of decision trees, each built from randomly selected cases and variables (features), which is what makes it a "random" forest. Less prone to overfitting than a single decision tree, generally reliable, and very flexible.

Support Vector Machines (SVM) - Black box model.

Artificial Neural Networks (ANN) - Modelled after neurons. Black box model.

Convolutional Neural Networks (CNN) - Black box model.

UCI data set

k-means, Logistic Regression

Business Analytics

Uses past performance to predict future results.

Data Engineering: Backend. Integrate data sources, build data pipelines. ETL - Extract, Transform, Load.

BI (Business Intelligence): Dashboards.

BA (Business Analytics): Statistical modelling, exploratory analytics, business recommendations (prescriptive, exploratory), machine learning.

Data Science = Data Engineering + Business Analytics. It covers everything, all in one.

BA:

  • Descriptive
  • Exploratory (WHAT)
  • Explanatory (WHY)
  • Predictive (FUTURE)
  • Prescriptive (STEPS with FUTURE PREDICTIONS)
  • Experimental (how well the prescriptive steps will work)

Databus - Transportation pipeline that moves the data and stores it securely. Elasticsearch runs on top of this data, and analytics is then run on top of that.

Case Study - 1:

Heart disease measurements: BP, cholesterol, sugar.

Case Study - 2:

Email marketing campaigns.

  • Collect and prepare data (capture).
  • Analyze data and identify trends (analyze).
  • Build models with machine learning (try to FIT something to the data and explain it mathematically).
  • Test models to ensure accuracy of predictions (use new data sets to check that the model's predictions are accurate).

Indicator variables (Male-Female : 1/0) or (Defect:Non-defect -> 1:0)
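A short sketch of turning a two-valued text column into a 1/0 indicator variable. The helper name `to_indicator` and the sample column are assumptions for illustration.

```python
def to_indicator(values, positive):
    """Map each value to 1 if it equals the chosen 'positive' category, else 0."""
    return [1 if v == positive else 0 for v in values]

# Hypothetical column of inspection results.
results = ["Defect", "Non-defect", "Defect", "Non-defect"]
encoded = to_indicator(results, positive="Defect")
```

Which category maps to 1 is arbitrary; what matters is that the same mapping is used for training and testing data.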

PREDICTIVE ANALYTICS (PA)

Supervised Learning:

Uses Regression (Linear or Logistic)

Classification (Decision Trees, Random Forests, Naive Bayes, Neural Networks, Support Vector)

Unsupervised: Clustering (k-means; k-nearest neighbours, by contrast, is a supervised classifier). Association techniques (market basket analysis: item recommendations for users).

Step 1 :

  • Convert all the text variables into numeric values to derive predictor variables.

Step 2:

  • Correlation Analysis : How much a predictor variable influences the predicted variable.
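One common way to quantify this is the Pearson correlation coefficient, which ranges from -1 to 1. The sketch below computes it by hand. The function name and the sample values are hypothetical, and Pearson's r only captures linear association, not causal influence.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between a predictor (xs) and the predicted variable (ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: a predictor that moves perfectly with the outcome, and one that moves against it.
r_positive = pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
r_negative = pearson_r([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])
```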

Training set -> Builds (trains) the algorithm

Testing set -> Tests the algorithm
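A minimal sketch of splitting one data set into training and testing portions. The 75/25 ratio, the fixed seed, and the function name are assumptions chosen for illustration.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows, then cut it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical data set of 20 cases, represented here by their row ids.
rows = list(range(20))
train, test = train_test_split(rows)
```

Shuffling before cutting avoids a split that follows the original row order, which can bias either portion.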

Try multiple algorithms on the data set and compare their predictions to find the model with the best fit.

I want to know what tools were used to resolve the problem. I don’t just need abstract lines like : "looked at the jmap / jstack dump and analysed the same". Something on the lines of:

  • tool was used

PRESCRIPTIVE ANALYTICS

Linear Programming

  • Used to maximise the output variable (z), given the equation relating x, y and z, and the range limits of x and y.
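A sketch of the simplest case described above: a linear equation z = a*x + b*y where x and y are each limited to a range. A linear function over such a box reaches its maximum at one of the four corners, so checking the corners is enough. The coefficients, ranges, and function name are hypothetical. Problems with extra constraints linking x and y generally need a proper solver (for example `scipy.optimize.linprog`).

```python
from itertools import product

def maximise_linear(a, b, x_range, y_range):
    """Maximise z = a*x + b*y with x in x_range and y in y_range; returns (z, x, y)."""
    corners = product(x_range, y_range)  # the four (x, y) corner combinations
    return max((a * x + b * y, x, y) for x, y in corners)

# Hypothetical problem: z = 3x + 2y with 0 <= x <= 4 and 0 <= y <= 6.
z, x, y = maximise_linear(3, 2, (0, 4), (0, 6))
```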

Testing it in the lab, on a test subject

EXPERIMENTAL ANALYTICS

  • Live testing in the field
  • Actually testing in production
  • Hypothesis
  • Design an Experiment
  • Execute the experiment
  • Analyse the results
  • Choose the best option.

ITERATE if needed

Sample Testing: Choose one subset. Test. Review Results.

A/B Testing: Choose 2 subsets and give them different flavours. See whether the output is influenced by the different inputs, using mathematical modelling techniques.
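One common mathematical technique for comparing the two subsets is a two-proportion z-test on their conversion rates. The sketch below is an assumption about how such a check might look, not a method taken from these notes. The counts are made up.

```python
from math import erf, sqrt

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rate between B and A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Standard normal CDF via erf: Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical campaign: flavour A converts 100 of 1000 users, flavour B converts 130 of 1000.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
```

A small p-value suggests the difference in output is unlikely to be chance alone; the cut-off (often 0.05) is a convention to be chosen before the experiment.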

Multivariate Testing: Take one subset and vary several input variables to see what each change does to the output.