Notes on Data Science Workflow
From my experience analyzing various datasets,
I have written down a few notes on the data science workflow that I recommend.
Understand and Prepare your Data
- Understand your data at a high level:
– Look at summary statistics.
– Box plots can identify outliers.
– Density plots and histograms show the spread of the data.
– Scatter plots help spot relationships between any pair of features.
- Deal with missing data.
– Should you fill in with the nearest neighbor or the average value? ** To Do **
– Should you choose a model that is more tolerant of missing values?
– Should you drop the entire observation if a feature is missing?
- Decide what to do with outliers.
– Could they be due to bad data collection, or are they legitimate extreme values?
– Would dropping the extreme 5% of values help or hurt prediction?
- Augment/refactor your data.
- Normalize/standardize your data by rescaling.
- Reduce dimensionality (PCA).
- Capture more complex relationships (e.g. x*y as an additional feature).
- Transform the data so that the model is easier to interpret.
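A minimal sketch of these preparation steps with pandas and scikit-learn (the toy columns and the mean-fill strategy are illustrative assumptions, not the only choice):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame with a missing value and a suspicious outlier (illustrative data).
df = pd.DataFrame({
    "age":    [23, 31, np.nan, 45, 120],   # 120 may be bad data collection
    "income": [40.0, 52.0, 61.0, 80.0, 75.0],
})

print(df.describe())  # summary statistics at a glance

# Fill the missing value with the column average (one of the options above).
df["age"] = df["age"].fillna(df["age"].mean())

# Standardize: rescale each column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(df)
```

Whether to mean-fill, drop rows, or pick a missing-value-tolerant model depends on how much data you can afford to lose and why the values are missing.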
Define the problem clearly, along with your limitations and constraints.
- If you have labelled data, it is likely a supervised learning problem.
- Want to find structure within unlabelled data? That is an unsupervised learning problem.
The desired output is usually a set of clusters.
- If you want to optimize your objective through continuous experiments or by
interacting with an environment, it is a reinforcement learning problem.
- If the desired output is a number, it is a regression problem.
If the desired output is a yes/no binary classification, it is a
"logistic regression" problem.
Otherwise, if the output is to identify the class from a finite set of groups,
it is a general classification problem.
- If the goal is to detect anomalies, it is an anomaly detection problem.
- Define your limitations in terms of storage, power, and computation.
Is a real-time response a requirement? Should the learning be fast?
Should the prediction response be fast?
Autonomous driving requires the prediction response time to be fast,
though training the model could take a long time.
Identify potential Algorithms
- Even though linear and logistic regression differ in their goal
(predicting a number vs. a yes/no classification), the algorithms are quite similar.
Logistic regression passes the linear output through a non-linear sigmoid function,
and the final number is converted to 1/0 using a threshold.
- Logistic regression is fairly stable and can easily take more input features.
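The similarity can be seen in a minimal sketch: both compute the weighted sum w·x + b; logistic regression just squashes it with the sigmoid and thresholds (the weights below are made-up numbers, not trained values):

```python
import numpy as np

def sigmoid(z):
    # Maps any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.array([2.0, -1.0]), 0.5   # illustrative "learned" weights and bias
x = np.array([1.0, 0.3])            # one input observation

linear_out = w @ x + b              # linear regression: this IS the prediction
prob = sigmoid(linear_out)          # logistic regression: squash to (0, 1)
label = 1 if prob >= 0.5 else 0     # threshold converts the number to a 1/0 class
```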
- Decision trees are typically used in conjunction with other techniques such as
Random Forest or Gradient Tree Boosting. They easily handle feature interactions.
The disadvantages of decision trees are: 1) They do not support on-the-fly learning;
you need to rebuild your trees when you add more training data or features.
2) They take a lot of memory (the more features, the larger the tree).
3) They easily overfit, though random forests mitigate this issue.
A lot of game-playing AI uses tree pruning. Decision trees have a strong
theoretical foundation and are easy to comprehend.
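A short sketch of the overfitting point above using scikit-learn (the dataset is synthetic, and `max_depth` stands in for more sophisticated pruning):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data, purely for illustration.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Unconstrained tree: it keeps splitting until it memorizes the training set.
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
train_acc_deep = deep.score(X_tr, y_tr)  # perfect on training data = overfitting

# Limiting the depth is a simple pruning knob that mitigates overfitting.
pruned = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
```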
- K-Means clustering of the input set.
If you want to cluster the inputs into groups based on all features without
even understanding the original structure, this is the way to go.
The disadvantage is that you have to guess the best K, or find it by trial and error.
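The trial-and-error search for K can be sketched with the common "elbow" heuristic: inertia (within-cluster variance) drops sharply until K matches the true number of clusters, then flattens. The three well-separated blobs here are synthetic:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated 2-D blobs centered at 0, 5, and 10 (illustrative data).
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0.0, 5.0, 10.0)])

# Try several values of K and record the inertia for each.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in (1, 2, 3, 4, 5)}

# The drop from K=2 to K=3 is large; from K=3 onward it is tiny: pick K=3.
```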
- Principal component analysis (PCA) can help discard unnecessary
and redundant features, keeping the model simpler, faster, and more stable.
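A short PCA sketch where one feature is redundant by construction, so two components recover essentially all the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
# Third column is redundant: an exact linear combination of the first two.
X = np.column_stack([base, base[:, 0] + base[:, 1]])

# Keep two components; the redundant dimension is discarded.
pca = PCA(n_components=2).fit(X)
X_reduced = pca.transform(X)
explained = pca.explained_variance_ratio_.sum()
```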
- SVMs can handle high dimensions well and provide high accuracy.
The cons: they are difficult to tune and memory intensive.
Example domains: character recognition, stock market price prediction, text categorization, etc.
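A minimal SVM sketch on a classic character-recognition task, scikit-learn's bundled 8x8 digit images (the RBF kernel and C value are illustrative defaults, not tuned):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# 8x8 grayscale digit images flattened into 64 features.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Scale first: SVMs are sensitive to feature magnitudes.
clf = make_pipeline(StandardScaler(), SVC(C=1.0, kernel="rbf")).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```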
- Naive Bayes focuses on the joint probability of inputs and outputs.
It cannot learn interactions between features.
For problems where the naive independence assumption holds and the inputs are
fairly independent of one another, it performs really well. It can be used as a
'generative' model that can even generate possible inputs.
This can be applied to:
– sentiment analysis and text classification
– recommendation systems like Netflix and Amazon
– marking an email as spam or not spam
– face recognition
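A toy spam filter along these lines (the six-message corpus is invented purely for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus: 1 = spam, 0 = not spam.
texts = ["win money now", "free prize win", "lunch meeting today",
         "project status meeting", "free money prize", "meeting notes today"]
labels = [1, 1, 0, 0, 1, 0]

# Word counts as features; Naive Bayes treats each word independently.
clf = make_pipeline(CountVectorizer(), MultinomialNB()).fit(texts, labels)
pred = clf.predict(["free money", "status meeting"])
```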
- Random Forest is an ensemble of decision trees. It can solve both regression
and classification problems with large data sets. It also helps identify the most
significant variables among thousands of input variables. Random Forest is
highly scalable to any number of dimensions and generally has quite acceptable
performance. (Then there are also genetic algorithms, which scale admirably
well to any dimension and any data with minimal knowledge of the data itself;
the most minimal and simplest implementation is the microbial genetic
algorithm.) With Random Forest, however, learning may be slow (depending on the
parameterization) and it is not possible to iteratively improve the generated model.
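A sketch of using Random Forest to surface the most significant variables (the data is synthetic, with only one truly informative feature among ten):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
# Only feature 0 actually drives the label; the other nine are pure noise.
y = (X[:, 0] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ ranks the input variables; the informative one dominates.
top_feature = int(np.argmax(forest.feature_importances_))
```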
- Neural networks learn the weights of the connections between neurons.
Once all the weights are trained, the neural network can be used to predict a
class or a quantity.
For example, object recognition has recently been enormously improved using
deep neural networks.
Applied to unsupervised learning tasks such as feature extraction,
deep learning also extracts features from raw images or
speech with much less human intervention. (Here the outputs may be labeled,
but the input features may not be labeled at all.)
On the other hand, neural networks are very hard to interpret, and their
parameterization is extremely complex.
They are also very resource and memory intensive.
Discriminative vs generative models
Supervised learning models are categorized as discriminative or generative.
Logistic regression, SVM, etc. are discriminative:
they focus on the conditional probability P(y|x), y being the output and x the input.
Generative models, on the other hand, focus on the joint probability P(x, y).
Typical generative approaches include Naive Bayes,
Gaussian Mixture Models, etc.
This makes it easy for these models to 'generate' potential inputs from the learned distribution.
Generative models need more training data, since they represent the whole "universe" of the data.
A combination of generative and discriminative models is highly recommended and
has been found to be useful.
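The distinction can be made concrete with a tiny joint-probability table over binary x and y (the numbers are made up): conditioning the joint gives the discriminative P(y|x), while the joint itself lets you sample whole (x, y) pairs, i.e. generate inputs.

```python
import numpy as np

# Joint distribution P(x, y); rows index x in {0, 1}, columns index y in {0, 1}.
joint = np.array([[0.30, 0.10],
                  [0.20, 0.40]])

# Discriminative view: normalize each row to get P(y | x).
p_y_given_x = joint / joint.sum(axis=1, keepdims=True)

# Generative view: sample (x, y) pairs directly from the joint distribution.
rng = np.random.default_rng(0)
flat = rng.choice(4, size=1000, p=joint.ravel())
samples = np.column_stack(np.unravel_index(flat, joint.shape))
```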
There are a couple of resources which I highly recommend as useful references:
- Scikit-learn cheat sheet for picking the right algorithm:
- Machine Learning Mind Map: https://github.com/dformoso/machine-learning-mindmap
- Data science workbook for classification: