structure chart of random forest
MAIN:
for each i in [1..N]:
1. create an independent bootstrap resample B of the training data
2. fit a TREE for the resample B
3. add the tree into the ensemble
TREE:
1. if the outcome can be predicted exactly RETURN the outcome;
otherwise:
2. create a random sample S of size m from the set of the input
variables
2. pick the best split-point among these S variables (using gini
index, entropy, etc.)
3. fit a TREE for each branch
4. return the statement "if condition then return T1(input) else
T2(input)"
CLASSIFICATION:
1. classify with each tree in the ensemble
2. output the majority vote (classification) or average (regression)
The idea is that the trees are unbiased and uncorrelated. The
unbiasedness is achieved by not pruning and by growing the trees to
the end. The uncorrelatedness is achieved by random resampling of data
and random resampling of subsets of variables.
Gini index roughly corresponds to entropy. If you have a set
of k outcomes with probabilities [p_1,p_2,...,p_k], the Gini
index is:
G(X) := 1 - Sum[p_i^2, {i,1,k}]
For tree-splitting applications, you're interested in conditional
probabilities of the outcome O given the splitting of the tree
S: [s_1,...,s_l]:
G(O|S) = Sum[ P(s_i) * G(O|s_i), {i,1,l}]
But there are several definitions. The one I give is sometimes
referred to as "weighted".
--
mag. Aleks Jakulin
http://www.ailab.si/aleks/
Artificial Intelligence Laboratory,
Faculty of Computer and Information Science,
University of Ljubljana, Slovenia.
|