Getting Started with Random Forest
A collection of my notes for getting started with random forest for genomics research.
- Chris Albon has clear, easy-to-follow examples for running RF in Python:
  - Classifier example: Random forest classifier example
  - Feature selection example: Feature Selection using random forest
- In R, the ‘randomForest’ package is fine to get started.
- For variable selection, I used the varSelRF package.
- A tutorial from r-bloggers: How to implement random forests in R
Things to consider when using random forest:
NA’s
You need to get rid of NA’s in your data. There are methods to use RF to impute missing data (rfImpute in R’s randomForest package). In most cases, you might just have to set missing values to 0 (unless 0 already has a biological meaning in your data), or drop rows with NA’s.
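Here’s a minimal sketch of the two simple options using pandas (the toy table and gene names are made up for illustration):

```python
import pandas as pd

# Hypothetical gene-expression table with some missing values
df = pd.DataFrame({
    "geneA": [1.2, None, 0.8],
    "geneB": [0.0, 3.1, None],
    "label": ["disease", "healthy", "healthy"],
})

# Option 1: drop any row containing an NA
df_dropped = df.dropna()

# Option 2: set NAs to 0 (only if 0 doesn't already mean something biologically)
df_zeroed = df.fillna(0)
```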
Regression or Classification
Is your thing-to-be-predicted categorical or continuous? Classification: Disease/Healthy. Regression: blood glucose level, ranging from 0-100 (or something like that).
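In scikit-learn the two cases map to two estimators; a quick sketch (settings are illustrative, not tuned):

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Categorical outcome (e.g., Disease/Healthy) -> classification
clf = RandomForestClassifier(n_estimators=500, random_state=42)

# Continuous outcome (e.g., blood glucose, 0-100) -> regression
reg = RandomForestRegressor(n_estimators=500, random_state=42)
```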
Feature Importance
How much does a feature contribute to accurate prediction across all trees in the forest? This is especially useful in genomics datasets where your columns are genes (expression, abundance, or presence/absence). When you have a high-dimensional dataset, it’s helpful to be able to rerun the model with only the most important variables.
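A sketch of that rerun idea in scikit-learn, using simulated data in place of a real expression matrix (the sizes and the top-20 cutoff are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Simulated stand-in for a samples x genes matrix
X, y = make_classification(n_samples=100, n_features=500, n_informative=10,
                           random_state=42)

forest = RandomForestClassifier(n_estimators=500, random_state=42).fit(X, y)

# Rank features by (Gini-based) importance and refit on the top 20
top = np.argsort(forest.feature_importances_)[::-1][:20]
forest_small = RandomForestClassifier(n_estimators=500, random_state=42)
forest_small.fit(X[:, top], y)
```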
Do you have enough data?
The number of samples in genomics datasets will no doubt be smaller than the number of features (e.g., genes). It is important to at least have enough data to create training and testing datasets. You will find recommendations for different ways to do this, but something like a 70/30 train/test split is fine.
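A sketch of that split with scikit-learn (toy data; `stratify` is optional but keeps class proportions similar in both sets):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=250, n_features=500, random_state=42)

# 70/30 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
```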
Class Imbalance
For example, say I have 250 people in my dataset and I’m interested in predicting T1D using gene expression. Unfortunately, only 20/250 have T1D: Class1 = 20, Class2 = 230. Any training set I create from this distribution risks overtraining on the few disease samples, or there will simply not be enough information about the disease class to ever accurately label them as disease.
When you have class imbalance, you can use weights or stratified sampling.
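A sketch of both fixes in scikit-learn (toy data simulating the 20-vs-230 imbalance; `class_weight="balanced"` is one of several options):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy data mimicking the 20-vs-230 T1D example
X, y = make_classification(n_samples=250, n_features=100, weights=[0.92],
                           random_state=42)

# Stratified sampling keeps the class ratio intact in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# class_weight="balanced" upweights the rare class inversely to its frequency
clf = RandomForestClassifier(n_estimators=500, class_weight="balanced",
                             random_state=42).fit(X_train, y_train)
```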
How does it work?
- OOB error: See below. How well does the model predict the OOB samples after training?
- Gini importance: Every time you use a variable, m, to split a node, the Gini impurity of the child nodes decreases (i.e., the groups get more pure with every subsequent split of the tree). If you sum the decreases in Gini impurity of m across all trees in the forest, you get a quick estimate of the variable importance of m.
- Variable importance: also known as VIMP, this is a permutation test of importance for a variable/feature. Basically, it permutes the values of variable m and checks whether the model’s prediction accuracy is higher with the correct m or with the permuted m, across all trees (a sketch follows this list).
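scikit-learn exposes the permutation version directly; a sketch on simulated data (n_repeats and the dataset sizes are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=50, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(n_estimators=500, random_state=42)
forest.fit(X_train, y_train)

# Shuffle one feature at a time on held-out data; the drop in accuracy
# relative to the intact data estimates that feature's importance (VIMP)
result = permutation_importance(forest, X_test, y_test, n_repeats=10,
                                random_state=42)
print(result.importances_mean)
```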
Bagging, boosting, etc
- Bagging: bootstrap aggregating. Bagging decreases the variance of your prediction: it draws random samples from the given dataset with replacement, trains a learner on each, and aggregates the predictions at the end, which increases the accuracy of the model. Bagging helps unstable learners (decision trees) but can hurt stable learners (Naive Bayes, KNN). This is how RF avoids overfitting.
- Boosting: focus new learners on examples that previous ones got wrong. Learners are trained sequentially, converting weak learners into a strong classifier. Boosting helps reduce bias and, often, variance, and it uses all of the data to train. Instances that were misclassified by previous learners are given more weight so that subsequent learners focus on them during training. AdaBoost (adaptive boosting) is known to have high variance because it can over-fit. (A sketch of both follows this list.)
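scikit-learn has off-the-shelf versions of both; a sketch (100 estimators is arbitrary, and the default base learner for each is a decision tree):

```python
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier

# Bagging: learners trained independently on bootstrap samples,
# predictions aggregated (majority vote for classification)
bagged = BaggingClassifier(n_estimators=100, random_state=42)

# Boosting: learners trained sequentially, with misclassified instances
# reweighted so that later learners focus on them
boosted = AdaBoostClassifier(n_estimators=100, random_state=42)
```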
Hyperparameters
- Number of trees (ntree): the number of trees to grow. This should not be set to too small a number, to ensure that every input row gets predicted at least a few times. Best practice: try the default to begin with, and try to make it an odd number (tie-breaking).
- Max features (mtry in R): the number of variables randomly sampled as candidates at each split. The default values are different for regression (p/3) and classification (sqrt(p)), where p is the number of features.
- RF is nice because its defaults usually produce decent results, and in most cases you don’t have to spend a year tuning hyperparameters. Best practice: try the defaults to begin with. (A sketch of the scikit-learn equivalents follows this list.)
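For reference, the scikit-learn names for these two knobs (the values here are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=501,     # ntree: an odd number can help break voting ties
    max_features="sqrt",  # mtry: sqrt(p) is the classification default
    random_state=42,
)
```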
Out-of-bag (OOB) sampling
Random forest’s own built-in cross-validation method. Each tree is trained on a bootstrap sample, which leaves out about 1/3 of the data; those left-out samples are known as the “OOB” samples and are used to validate the model’s performance. This can be done instead of creating separate testing/training datasets, and it is similar in spirit to leave-one-out cross-validation.
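In scikit-learn this is a single flag; a sketch on simulated data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=50, random_state=42)

# Each sample is scored using only the trees whose bootstrap sample missed it
clf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=42)
clf.fit(X, y)
print(clf.oob_score_)  # OOB accuracy estimate, no separate test set needed
```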
Overfitting
Overfitting (which is a common problem in ML) can usually be avoided in RF by adding more trees to the forest. Check the confusion matrix (the output predictions from the model) at the end to see if you are overfitting: for example, if your validation set is predicted much worse than your training set, or if one of your classes is always predicted correctly but the others are not.
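A sketch of that train-vs-test check in scikit-learn (simulated data again):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=250, n_features=100, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = RandomForestClassifier(n_estimators=500, random_state=42)
clf.fit(X_train, y_train)

# A large gap between training and test performance suggests overfitting
print(confusion_matrix(y_train, clf.predict(X_train)))
print(confusion_matrix(y_test, clf.predict(X_test)))
```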