Testing out decision trees, AdaBoosted trees, and random forest
Recently I experimented with decision trees for classification, to get a better idea of how they work. First I created some two-dimensional training data with two categories, using scikit-learn:
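The original data-generation code isn't reproduced here, but a minimal sketch of how such data can be built with scikit-learn (two Gaussian blobs and the 'moons' shape, 1000 points each; the variable names are mine) might look like this:

```python
import numpy as np
from sklearn.datasets import make_blobs, make_moons

# Data set 1: 1000 points drawn from two 2-D Gaussians (one per class)
X_gauss, y_gauss = make_blobs(n_samples=1000, centers=2, n_features=2,
                              cluster_std=2.0, random_state=0)

# Data set 2: 1000 points from the interleaving 'moons' shape, with noise
X_moons, y_moons = make_moons(n_samples=1000, noise=0.3, random_state=0)
```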
Next I wrote some code to test a decision tree with different numbers of splits for the two data sets:
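Roughly, the experiment can be reproduced with something like the sketch below. I'm capping the number of splits via max_leaf_nodes (a tree with k leaves has performed k − 1 splits); the original code may have done this differently, and the helper name and split counts are my own.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def tree_accuracy(X, y, n_splits):
    """Fit a decision tree limited to roughly n_splits splits and
    return (train accuracy, test accuracy)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=0)
    # A tree with (n_splits + 1) leaves performs n_splits splits
    clf = DecisionTreeClassifier(max_leaf_nodes=n_splits + 1, random_state=0)
    clf.fit(X_train, y_train)
    return clf.score(X_train, y_train), clf.score(X_test, y_test)

for name, (X, y) in [('gaussians', (X_gauss, y_gauss)),
                     ('moons', (X_moons, y_moons))]:
    for n_splits in (2, 10, 100):
        print(name, n_splits, tree_accuracy(X, y, n_splits))
```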
Here decision trees are exhibiting their classic weakness: so-called “high variance”, i.e. overfitting. The tree was able to fit the first set of data (1000 points, 2 Gaussians) perfectly after 10 splits. In the case of the second data set (1000 points, ‘moons’), the test accuracy decreases going from 10 to 100 splits, while the training accuracy improves to 1, which is a sign of overfitting.
That is why people use boosted ensembles of trees. An ensemble is a weighted sum of models: many simple models can be combined to create a more complex one. The real utility of ensembles comes from how they are trained, though, using boosting methods (wikipedia), which are a set of techniques for training ensembles while preventing (or, more precisely, delaying) overfitting.
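To make the “weighted sum” idea concrete, the ensemble prediction can be written (in notation I'm adding here, not taken from a particular reference) as

$$F(x) = \sum_{m=1}^{M} \alpha_m h_m(x),$$

where the $h_m$ are the individual base models (e.g. shallow trees) and the $\alpha_m$ are weights chosen by the training procedure; for classification the prediction is the sign (or argmax) of $F(x)$.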
AdaBoost
(max depth of base estimators = 2)
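The accuracy plots aren't reproduced here, but the setup, as I understand it, is scikit-learn's AdaBoostClassifier with depth-2 decision trees as base estimators; a rough sketch (the number of estimators and the train/test split are my own choices) looks like this:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X_moons, y_moons, test_size=0.5, random_state=0)

# AdaBoost ensemble whose base estimators are depth-2 decision trees
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2),
                         n_estimators=100, random_state=0)
ada.fit(X_train, y_train)
print(ada.score(X_train, y_train), ada.score(X_test, y_test))
```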
Random Forest
(max depth of base estimators = 2)
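The corresponding random forest run, again only a sketch (it reuses the train/test split from the AdaBoost snippet above, and the number of trees is my choice):

```python
from sklearn.ensemble import RandomForestClassifier

# Random forest of depth-2 trees, for comparison with the AdaBoost run above
rf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_train, y_train), rf.score(X_test, y_test))
```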
Random forest is currently one of the most widely used classification techniques in business. Trees have the nice property that it is possible to explain, in human-understandable terms, how the model reached a particular decision/output. Here random forest outperforms AdaBoost, but its ‘random’ nature seems to be becoming apparent.
More commentary will follow.
Talks on boosted trees
Peter Prettenhofer – Gradient Boosted Regression Trees in scikit-learn
Trevor Hastie (a more detailed talk on Boosting / Ensembles)
MIT: boosting (a mishmash of general concepts)