Random Forest for Grouped Variables. Only implemented for binary classification. The function builds a large number of random decision trees based on a variant of the CARTGV method.
rfgv(data, group, groupImp, ntree = 200, mtry_group = floor(sqrt(length(unique(group[!is.na(group)])))), maxdepth = 1, replace = T, sampsize = ifelse(replace == T, nrow(data), floor(0.632 * nrow(data))), case_min = 1, grp.importance = TRUE, test = NULL, keep_forest = F, crit = 1, penalty = "No", sampvar = FALSE, mtry_var)
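The defaults for `mtry_group` and `sampsize` in the usage line above are computed from the other arguments. The following minimal base-R sketch evaluates those same expressions by hand; the `group` vector, `n_obs`, and the helper `sampsize_default` are hypothetical and are not part of the package.

```r
# Hypothetical group assignment: 7 variables in 4 groups, one variable unassigned (NA).
group <- c(1, 1, 2, 2, 3, 4, NA)
n_obs <- 150  # hypothetical number of rows in `data`

# Default mtry_group: floor of the square root of the number of distinct,
# non-missing group labels (same expression as in the usage line).
n_groups <- length(unique(group[!is.na(group)]))
mtry_group_default <- floor(sqrt(n_groups))

# Default sampsize: nrow(data) when sampling with replacement,
# floor(0.632 * nrow(data)) otherwise.
sampsize_default <- function(replace, n) {
  if (isTRUE(replace)) n else floor(0.632 * n)
}

mtry_group_default              # 2
sampsize_default(TRUE, n_obs)   # 150
sampsize_default(FALSE, n_obs)  # 94
```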
data | a data frame containing the response (as the first variable) and the predictors, used to grow the trees. The response variable must be named "Y", must be the first variable of the data frame, and must be coded with the two levels "0" and "1". |
---|---|
group | a vector with the group number of each variable.
(WARNING : if there are " |
groupImp | a vector which indicates the group number of each variable (for the groups used to compute the group importance). |
ntree | an integer indicating the number of trees to grow |
mtry_group | an integer indicating the number of variables randomly sampled as candidates at each split. |
maxdepth | an integer indicating the maximal depth of a split-tree. The default value is 1. |
replace | a boolean indicating whether cases are sampled with replacement (TRUE) or without replacement (FALSE). |
sampsize | an integer indicating the size of the bootstrap samples. |
case_min | an integer indicating the minimum number of cases/non-cases in a terminal node. The default is 1. |
grp.importance | a boolean indicating if the importance of each group needs to be computed |
test | an independent data frame containing the same variables that " |
keep_forest | a boolean indicating if the forest will be retained in the output object |
crit | an integer indicating the impurity function used (1 = Gini index / 2 = Entropy / 3 = Misclassification rate) |
penalty | a character string indicating whether the decrease in node impurity must take the group size into account. Four penalties are available: "No", "Size", "Root.size" or "Log". |
sampvar | a boolean indicating if within each splitting tree, a subset of variables is drawn for each group |
mtry_var | a vector whose length is the number of groups. It indicates the number of variables drawn for each group. Useful only if sampvar=TRUE |
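The three impurity functions selected by `crit` can be illustrated with their textbook definitions for a binary response. This is only a sketch: the helper `impurity` is hypothetical, and the exact scaling and logarithm base used inside the package are assumptions (base 2 is used here).

```r
# Textbook node impurity for a binary response coded "0"/"1".
# crit: 1 = Gini index, 2 = entropy, 3 = misclassification rate.
impurity <- function(y, crit = 1) {
  p <- mean(y == "1")   # proportion of class "1" in the node
  probs <- c(1 - p, p)  # class proportions for "0" and "1"
  switch(as.character(crit),
    "1" = 1 - sum(probs^2),
    "2" = -sum(ifelse(probs > 0, probs * log2(probs), 0)),  # treat 0 * log(0) as 0
    "3" = 1 - max(probs),
    stop("crit must be 1, 2 or 3")
  )
}

y_mixed <- factor(c("0", "0", "1", "1"))  # perfectly mixed node
y_pure  <- factor(c("0", "0", "0", "0"))  # pure node

sapply(1:3, function(k) impurity(y_mixed, k))  # 0.5 1.0 0.5
sapply(1:3, function(k) impurity(y_pure, k))   # 0 0 0
```

All three measures are zero for a pure node and maximal for a perfectly mixed one, which is why any of them can drive the choice of split.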
a list with elements:
predicted
: the predicted values of the observations in the training set named
"data". The i-th element is the prediction from the i-th tree, based
on the i-th out-of-bag sample. The i-th element is missing if the i-th
observation is not part of the i-th out-of-bag sample.
importance
: a data frame with two columns. The first column provides the value
of the permutation importance of each group,
and the second one gives the value of the permutation importance of
each group normalized by the size of the group
err.rate
: a vector of error rates of the prediction on the training set named
"data", the i-th element being the (OOB) error rate
for all trees up to the i-th.
vote
: a data frame with one row for each input data point and one column for
each class ("0" and "1", in this order), giving the fraction
or number of (OOB) 'votes' from the random forest.
pred
: the predicted values of the observations in the training set named
"data". It corresponds to the majority vote computed by using the
matrix of predictions "predicted".
confusion
: the object returned by the function "xtab_function".
It contains the confusion matrix of the prediction (based on OOB data) and
the associated statistics. For more details, see the function
"xtab_function".
err.rate.test
: (Only if test!=NULL) a vector of error rates of the prediction
on the test set named "test", the i-th element being the error rate
for all trees up to the i-th.
vote.test
: (Only if test!=NULL) a data frame with one row for each observation
in "test" and one column for each class ("0" and "1", in this order),
giving the number of 'votes' from the random forest.
pred.test
: (Only if test!=NULL) the predicted values of the observations
in "test".
confusion.test
: (Only if test!=NULL) the object returned by the function
"xtab_function". It contains the confusion matrix of the prediction (based on
"test") and the associated statistics. For more details,
see the function "xtab_function".
oob.times
: number of times that an observation in the training set named
"data" is 'out-of-bag' (and thus used in computing the OOB error estimate)
keep_forest
: a boolean indicating if the forest will be retained in the
output object
sampvar
: a boolean indicating if within each splitting tree, a subset of
variables is drawn for each group
maxdepth
: an integer indicating the maximal depth of a split-tree.
The default value is 1
mtry_group
: an integer indicating the number of variables randomly sampled as candidates
at each split
mtry_var
: a vector whose length is the number of groups. It indicates the number
of variables drawn for each group. Useful only if sampvar=TRUE
ntree
: an integer indicating the number of trees to grow.
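The relationship between "predicted", "oob.times", "vote", "pred" and "err.rate" described above can be sketched in base R. The layout assumed here (one row per observation, one column per tree, NA when the observation was in-bag for that tree), the toy values, and the tie-breaking rule are all assumptions made for illustration, not the package's internals.

```r
# Hypothetical OOB prediction matrix: 4 observations (rows) x 3 trees (columns).
# NA means the observation was in-bag for that tree, so no OOB prediction exists.
predicted <- matrix(c( 1, NA,  1,
                       0,  0, NA,
                      NA,  1,  0,
                       1,  1,  1),
                    nrow = 4, byrow = TRUE)
y <- c(1, 0, 1, 1)  # hypothetical true labels

# Number of trees for which each observation was out-of-bag.
oob_times <- rowSums(!is.na(predicted))

# Fraction of OOB votes for each class, in the order "0", "1".
vote_1 <- rowSums(predicted == 1, na.rm = TRUE) / oob_times
vote <- data.frame("0" = 1 - vote_1, "1" = vote_1, check.names = FALSE)

# Majority vote; ties go to class 0 in this sketch.
pred <- ifelse(vote_1 > 0.5, 1, 0)

# Cumulative OOB error rate: element k uses trees 1..k only.
err_rate <- sapply(seq_len(ncol(predicted)), function(k) {
  sub <- predicted[, 1:k, drop = FALSE]
  frac_1 <- rowSums(sub == 1, na.rm = TRUE) / rowSums(!is.na(sub))
  pred_k <- ifelse(frac_1 > 0.5, 1, 0)  # NA if never out-of-bag so far
  mean(pred_k != y, na.rm = TRUE)
})

oob_times  # 2 2 2 3
pred       # 1 0 0 1
err_rate   # 0.00 0.00 0.25
```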
    data(rfgv_dataset)
    data(group)
    data <- rfgv_dataset

    # The negative index into the `data` object selects all columns
    # except the first column.
    train <- data[which(data[, 1] == "train"), -1]
    test <- data[which(data[, 1] == "test"), -1]
    validation <- data[which(data[, 1] == "validation"), -1]

    forest <- rfgv(train,
                   group = group,
                   groupImp = group,
                   ntree = 1,
                   mtry_group = 3,
                   sampvar = TRUE,
                   maxdepth = 2,
                   replace = TRUE,
                   case_min = 1,
                   sampsize = nrow(train),
                   mtry_var = rep(2, 5),
                   grp.importance = TRUE,
                   test = test,
                   keep_forest = FALSE,
                   crit = 1,
                   penalty = "No")
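The subsetting step in the example keeps the rows of one split and drops the first column, which holds the split label. A small self-contained illustration follows; the `toy` data frame and its column names `set` and `x1` are hypothetical stand-ins, not the contents of `rfgv_dataset`.

```r
# Hypothetical data frame whose first column labels the split of each row.
toy <- data.frame(set = c("train", "test", "train", "validation"),
                  Y   = factor(c("0", "1", "1", "0")),
                  x1  = c(0.2, 1.5, -0.3, 0.8))

# Keep the "train" rows and drop the first column with a negative index.
toy_train <- toy[which(toy[, 1] == "train"), -1]

names(toy_train)  # "Y" "x1"
nrow(toy_train)   # 2
```

After the split label is dropped, the response "Y" is the first variable, which is the layout the `data` argument requires.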