Classical Statistics: Linear Regression, k-means, classifications, k-Nearest Neighbours, Hierarchical, Naive Bayes
Machine Learning Models
Data Cleaning
- Reduce Noise
- Easier to Interpret
- Focus on Patterns
Degrees of Freedom (DOF) - Number of fields - 1. E.g. if your data stores 2 variables (columns), DOF = 2 - 1 = 1.
Step 1: Clustering - creation of buckets (groups) to group the data.
Step 2: Classification Algorithms
Unsupervised Learning - cases are classified by similarity, not accuracy.
Supervised Learning - accuracy is the key criterion.
Examples:
- Spam filtering
- Fraud detection
- Genetic testing - patterns in variables
- Psychological diagnosis
Validity depends on the data available.
k-Nearest Neighbours (kNN)
Naive Bayes - Uses prior and posterior probabilities calculated by Bayes' theorem. Ignores relationships between variables (hence called naive). Human readable.
Decision Trees - Splits cases into decisions and aggregates them into branches. Leaves are the end outcomes (classification). Human readable.
Random Forest - A bunch of decision trees, each built from randomly selected cases and variables (features), which is what makes it a "random" forest. Less prone to overfitting than a single decision tree, and often among the more reliable methods. Very flexible.
Support Vector Machines(SVM) Black Box Model.
Artificial Neural Networks (ANN) Modelled after neurons. Black Box Model.
Convolutional Neural Networks (CNN) Black Box Model.
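For the Naive Bayes entry above, here is a minimal sketch of how a prior becomes a posterior through Bayes' theorem. The spam/ham labels, the words and every probability are made-up assumptions for illustration, not values from any data set.

```python
# Hypothetical numbers for illustration only.
PRIOR = {"spam": 0.4, "ham": 0.6}
# P(word present | class). Words are treated as independent given the class,
# which is the "naive" assumption.
LIKELIHOOD = {
    "spam": {"offer": 0.7, "meeting": 0.1},
    "ham": {"offer": 0.2, "meeting": 0.5},
}


def posterior(words):
    """Return P(class | words) using Bayes' theorem with the naive assumption."""
    scores = {}
    for label, prior in PRIOR.items():
        score = prior
        for word in words:
            score *= LIKELIHOOD[label][word]  # multiply per-word likelihoods
        scores[label] = score
    total = sum(scores.values())  # the evidence term P(words)
    return {label: score / total for label, score in scores.items()}


result = posterior(["offer"])
print(result)
```

Because each step is just a multiplication of stated probabilities, the result can be checked by hand, which is why the model counts as human readable.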
UCI data set
k-means Logistic Regression
Business Analytics
Past performance for future results.
Data Engineering: Backend. Integrate data sources, build data pipelines. ETL - Extract, Transform, Load.
BI (Business Intelligence): Dashboards
BA (Business Analytics): Statistical modelling, exploratory analytics, business recommendations (prescriptive, exploratory), machine learning.
Data Science = Data Engineering + Business Analytics in one.
Data Science covers everything, all in one.
BA:
- Descriptive
- Exploratory (WHAT)
- Explanatory (WHY)
- Predictive (FUTURE)
- Prescriptive (STEPS with FUTURE PREDICTIONS)
- Experimental (how well the prescriptive steps will work)
Databus - Transportation pipeline that moves the data and stores it securely. Elasticsearch runs on top of this data. Analytics is then run on top of that.
Case Study - 1:
Heart disease measurements. BP, cholesterol, Sugar
Case Study - 2:
Email marketing campaigns.
-> Collect & prepare data (capture)
-> Analyze data and identify trends (analyze)
-> Build models with machine learning (try to FIT something to the data and explain it mathematically)
-> Test models to ensure accuracy of predictions (use new data sets to check that the model / prediction is accurate)
Indicator variables: (Male:Female -> 1:0) or (Defect:Non-defect -> 1:0)
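A minimal sketch of building indicator variables, assuming the column is a plain Python list of text labels. The helper names `to_indicator` and `one_hot` are hypothetical, chosen for this example only.

```python
def to_indicator(values, positive):
    """Map each value to 1 if it equals `positive`, otherwise 0."""
    return [1 if v == positive else 0 for v in values]


def one_hot(values):
    """Expand a multi-category column into one 0/1 column per category."""
    categories = sorted(set(values))
    return {c: to_indicator(values, c) for c in categories}


gender = ["Male", "Female", "Female", "Male"]
status = ["Defect", "Non-defect", "Defect"]
print(to_indicator(gender, "Male"))    # [1, 0, 0, 1]
print(to_indicator(status, "Defect"))  # [1, 0, 1]
print(one_hot(["red", "blue", "red"]))
```

A two-valued column needs only one indicator. A column with more than two categories gets one indicator per category, which is what `one_hot` does.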
PREDICTIVE ANALYTICS (PA)
Supervised Learning:
- Regression (Linear or Logistic)
- Classification (Decision Trees, Random Forests, Naive Bayes, Neural Networks, Support Vector Machines)
Unsupervised Learning:
- Clustering (k-means. Note that k-nearest neighbours, despite the similar name, is a supervised classifier, not a clustering method.)
- Associative techniques (market basket analysis: item recommendations for users)
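The k-means clustering named above can be sketched for one-dimensional data as follows. This is a simplified illustration, assuming the caller supplies the starting centroids and a fixed number of iterations. The function name `k_means_1d` and the sample values are hypothetical.

```python
def k_means_1d(values, centroids, iterations=10):
    """Group 1-D values around k centroids; returns (centroids, clusters)."""
    for _ in range(iterations):
        # Assignment step: put each value in the bucket of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its bucket
        # (an empty bucket keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical data: two obvious groups, with rough starting guesses for the centres.
centres, buckets = k_means_1d([1, 2, 3, 10, 11, 12], centroids=[0, 5])
print(centres, buckets)
```

No labels are involved anywhere: the buckets come purely from similarity (distance), which is what makes this unsupervised.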
Step 1 :
- Convert all the text variables into numeric ones to derive predictor variables.
Step 2:
- Correlation Analysis: How strongly each predictor variable is associated with the predicted variable.
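A minimal sketch of the correlation step, using the Pearson coefficient as one common choice (the notes do not say which measure is meant). The column names and values are hypothetical. Note that a high coefficient shows linear association, not proof of influence.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical columns: two candidate predictors and the predicted variable.
cholesterol = [180, 200, 220, 240, 260]
shoe_size = [9, 7, 10, 8, 9]
risk_score = [1.0, 1.5, 2.1, 2.4, 3.0]

print(pearson(cholesterol, risk_score))  # close to 1: strong candidate predictor
print(pearson(shoe_size, risk_score))    # much closer to 0: weak candidate
```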
Training set -> used to build (fit) the algorithm
Testing set -> used to test the algorithm
Try multiple algorithms on the data set and compare their predictions to find the best fit.
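The train/test idea can be sketched as below: fit each candidate on the training set only, then compare accuracy on the held-out testing set. The two candidate "algorithms" (a threshold rule and a majority-class baseline), the 25% test fraction and the data are all assumptions made up for this illustration.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(model, rows):
    """Fraction of (x, label) rows that the model labels correctly."""
    return sum(model(x) == label for x, label in rows) / len(rows)


def fit_threshold(train):
    """Candidate 1: predict 1 when x exceeds the midpoint of the class means."""
    ones = [x for x, label in train if label == 1]
    zeros = [x for x, label in train if label == 0]
    cut = (sum(ones) / len(ones) + sum(zeros) / len(zeros)) / 2
    return lambda x: 1 if x > cut else 0


def fit_majority(train):
    """Candidate 2 (baseline): always predict the most common training label."""
    labels = [label for _, label in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority


# Hypothetical data: x is a single numeric predictor, label is 0 or 1.
data = [(x, 1 if x >= 50 else 0) for x in range(0, 100, 5)]
train, test = train_test_split(data)
for name, fit in [("threshold", fit_threshold), ("majority", fit_majority)]:
    print(name, accuracy(fit(train), test))
```

Scoring on rows the model never saw during fitting is the point of the split: accuracy measured on the training set itself would be too optimistic.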
I want to know what tools were used to resolve the problem. I don’t just need abstract lines like "looked at the jmap / jstack dump and analysed the same". Something along the lines of:
- tool was used
PRESCRIPTIVE ANALYTICS
Linear Programming
- Used to maximise the output variable (z), given the equation relating x, y and z, and the range limits (constraints) on x and y.
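A minimal sketch of a two-variable linear programme, assuming a made-up objective z = 3x + 2y and made-up limits on x and y. It relies on the fact that a linear objective over a bounded feasible region reaches its maximum at a corner point, so it simply checks every corner. Real problems would normally go to a dedicated solver.

```python
from itertools import combinations

# Hypothetical problem: maximise z = 3x + 2y
# subject to 0 <= x <= 4, 0 <= y <= 6 and x + y <= 8.
# Each constraint is (a, b, c), meaning a*x + b*y <= c.
CONSTRAINTS = [
    (-1, 0, 0),  # x >= 0
    (1, 0, 4),   # x <= 4
    (0, -1, 0),  # y >= 0
    (0, 1, 6),   # y <= 6
    (1, 1, 8),   # x + y <= 8
]


def z(x, y):
    """Objective function to maximise."""
    return 3 * x + 2 * y


def feasible(x, y, eps=1e-9):
    """True when (x, y) satisfies every constraint (with a small tolerance)."""
    return all(a * x + b * y <= c + eps for a, b, c in CONSTRAINTS)


def corner_points():
    """Intersect every pair of constraint boundaries and keep the feasible points."""
    points = []
    for (a1, b1, c1), (a2, b2, c2) in combinations(CONSTRAINTS, 2):
        det = a1 * b2 - a2 * b1
        if det == 0:  # parallel boundaries never intersect
            continue
        x = (c1 * b2 - c2 * b1) / det  # Cramer's rule
        y = (a1 * c2 - a2 * c1) / det
        if feasible(x, y):
            points.append((x, y))
    return points


best = max(corner_points(), key=lambda p: z(*p))
print(best, z(*best))  # (4.0, 4.0) 20.0
```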
Testing it in the lab, on a test subject
EXPERIMENTAL ANALYTICS
- Live testing in the field
- Actually testing in production
- Hypothesis
- Design an Experiment
- Execute the experiment
- Analyse the results
- Choose the best option.
ITERATE if needed
Sample Testing: Choose one subset. Test. Review Results.
A/B Testing: Choose 2 subsets and give them different flavours. Check whether the output is influenced by the different inputs, using mathematical modelling techniques.
Multivariate Testing: Take one subset and vary several input variables to see what each change does to the output.
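For the A/B testing case above, one common way to check whether the two flavours really differ is a two-proportion z-test. The notes do not name a specific technique, so treat this as one possible choice. The campaign numbers are hypothetical.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for the difference between two rates."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled rate under the hypothesis that both variants perform the same.
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical email campaign: 1000 recipients per variant, 100 vs 130 click-throughs.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(round(z, 2), round(p_value, 3))
```

A small p-value (0.05 is a conventional cut-off) suggests the difference between the subsets is unlikely to be chance alone. A large one means the experiment has not shown a difference.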