Random Forest for Grouped Variables. Only implemented for binary classification. The function builds a large number of random decision trees based on a variant of the CARTGV method.

rfgv(data, group, groupImp, ntree = 200,
  mtry_group = floor(sqrt(length(unique(group[!is.na(group)])))),
  maxdepth = 1, replace = T, sampsize = ifelse(replace == T,
  nrow(data), floor(0.632 * nrow(data))), case_min = 1,
  grp.importance = TRUE, test = NULL, keep_forest = F, crit = 1,
  penalty = "No", sampvar = FALSE, mtry_var)

Arguments

data

a data frame containing the response (as the first variable) and the predictors, used to grow the trees. The response variable must be named "Y", must be the first variable of the data frame, and must be coded with the two levels "0" and "1".

group

a vector with the group number of each variable. (WARNING: if there are "p" groups, the groups must be numbered from "1" to "p" in increasing order. The group label of the response variable is missing, i.e. NA.)
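As an illustration of these conventions, a group vector for a made-up data frame with a response and five predictors in three groups could be built and checked as sketched below. The data frame `toy` and the helper `check_group` are hypothetical and not part of the package; the check reads "from 1 to p" as meaning that the labels are exactly the integers 1..p.

```r
# Hypothetical toy data: response "Y" first, then five predictors.
toy <- data.frame(
  Y  = factor(c("0", "1", "0", "1")),
  x1 = c(0.1, 0.4, 0.3, 0.9),
  x2 = c(1.2, 0.7, 0.5, 1.1),
  x3 = c(5, 3, 4, 6),
  x4 = c(2, 2, 1, 0),
  x5 = c(0.5, 0.6, 0.1, 0.2)
)

# One entry per column of `toy`: NA for the response, then group labels 1..p.
group <- c(NA, 1, 1, 2, 2, 3)

# Hypothetical helper (not part of the package): checks the stated conventions.
check_group <- function(data, group) {
  labels <- group[!is.na(group)]
  p <- length(unique(labels))
  length(group) == ncol(data) &&            # one label per column
    is.na(group[1]) &&                      # the response has no group
    !anyNA(group[-1]) &&                    # every predictor has a group
    all(sort(unique(labels)) == seq_len(p)) # labels are exactly 1..p
}

check_group(toy, group)
```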

groupImp

a vector which indicates the group number of each variable (for the groups used to compute the group importance).

ntree

an integer indicating the number of trees to grow

mtry_group

an integer indicating the number of groups randomly sampled as candidates at each split.

maxdepth

an integer indicating the maximal depth for a split-tree. The default value is 1.

replace

a boolean indicating if sampling of cases is done with replacement (TRUE) or without replacement (FALSE).

sampsize

an integer indicating the size of the bootstrap samples.

case_min

an integer indicating the minimum number of cases/non-cases in a terminal node. The default is 1.

grp.importance

a boolean indicating if the importance of each group needs to be computed

test

an independent data frame containing the same variables as "data".

keep_forest

a boolean indicating if the forest will be retained in the output object

crit

an integer indicating the impurity function used (1=Gini index / 2=Entropy / 3=Misclassification rate)
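For reference, the three impurity measures named here have the following textbook forms for a binary node. This is a minimal sketch with illustrative function names, not the package's internal code; in particular, the base-2 logarithm in the entropy is an assumption.

```r
# Textbook impurity measures for a binary node, where p1 is the proportion
# of class "1" in the node. Function names are illustrative only.
gini <- function(p1) 2 * p1 * (1 - p1)

entropy <- function(p1) {
  p <- c(p1, 1 - p1)
  p <- p[p > 0]        # convention: 0 * log(0) = 0
  -sum(p * log2(p))    # base 2 is an assumption of this sketch
}

misclass <- function(p1) min(p1, 1 - p1)

# All three are maximal for a perfectly mixed node and zero for a pure node.
gini(0.5)
entropy(0.5)
misclass(0.5)
```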

penalty

a character string indicating whether, and how, the decrease in node impurity takes the group size into account. Four penalties are available: "No", "Size", "Root.size" or "Log".

sampvar

a boolean indicating if within each splitting tree, a subset of variables is drawn for each group

mtry_var

a vector whose length is the number of groups. It indicates the number of variables drawn for each group. Useful only if sampvar=TRUE
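The default expressions for mtry_group and sampsize in the usage signature can be evaluated by hand to see what they produce. The sketch below does this on made-up inputs (the `group` vector and the row count `n` are hypothetical) and also builds an mtry_var vector with one entry per group.

```r
# Hypothetical inputs: 8 predictors in 5 groups, 100 rows in `data`.
group <- c(NA, 1, 1, 2, 2, 3, 3, 4, 5)
n <- 100

# Number of groups, ignoring the NA label of the response.
n_groups <- length(unique(group[!is.na(group)]))

# Default mtry_group: floor of the square root of the number of groups.
mtry_group <- floor(sqrt(n_groups))

# Default sampsize: n when sampling with replacement,
# floor(0.632 * n) when sampling without replacement.
sampsize_with    <- n
sampsize_without <- floor(0.632 * n)

# mtry_var needs one entry per group; here one variable is drawn per group.
mtry_var <- rep(1, n_groups)
```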

Value

a list with elements:

  • predicted: the predicted values of the observations in the training set named "data", the i-th element being the prediction from the i-th tree based on the i-th out-of-bag sample. The i-th element is missing if the i-th observation is not part of the i-th out-of-bag sample.

  • importance: a data frame with two columns. The first column provides the permutation importance of each group and the second one gives the permutation importance of each group normalized by the size of the group

  • err.rate: a vector of error rates of the prediction on the training set named "data", the i-th element being the (OOB) error rate for all trees up to the i-th.

  • vote: a data frame with one row for each input data point and one column for each class ("0" and "1", in this order), giving the fraction of (OOB) ‘votes’ from the random forest.

  • pred: the predicted values of the observations in the training set named "data". It corresponds to the majority vote computed from the matrix of predictions "predicted".

  • confusion: the object returned by the function "xtab_function". It contains the confusion matrix of the prediction (based on OOB data) and the associated statistics. For more details, see the function "xtab_function".

  • err.rate.test: (Only if test!=NULL) a vector of error rates of the prediction on the test set named "test", the i-th element being the error rate for all trees up to the i-th.

  • vote.test: (Only if test!=NULL) a data frame with one row for each observation in "test" and one column for each class ("0" and "1", in this order), giving the number of ‘votes’ from the random forest.

  • pred.test: (Only if test!=NULL) the predicted values of the observations in "test".

  • confusion.test: (Only if test!=NULL) the object returned by the function "xtab_function". It contains the confusion matrix of the prediction (based on "test") and the associated statistics. For more details, see the function "xtab_function".

  • oob.times: number of times that an observation in the training set named "data" is ‘out-of-bag’ (and thus used in computing OOB error estimate)

  • keep_forest: a boolean indicating if the forest will be retained in the output object

  • sampvar: a boolean indicating if within each splitting tree, a subset of variables is drawn for each group

  • maxdepth: an integer indicating the maximal depth for a split-tree. The default value is 1

  • mtry_group: an integer indicating the number of groups randomly sampled as candidates at each split

  • mtry_var: a vector whose length is the number of groups. It indicates the number of variables drawn for each group. Useful only if sampvar=TRUE

  • ntree: an integer indicating the number of trees to grow.
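The relationship between predicted, oob.times, vote and pred described above can be sketched on toy values. The layout used here (one row per observation, one column per tree, NA when the observation was in-bag for that tree) is an assumption of this sketch, and the numbers are made up rather than output from rfgv().

```r
# Assumed layout: rows = observations, columns = trees,
# NA = observation was in-bag for that tree. Toy values only.
predicted <- matrix(
  c( 1, NA,  1,
    NA,  0,  0,
     1,  1, NA,
     0,  1,  1),
  nrow = 4, byrow = TRUE
)
y <- c(1, 0, 0, 1)  # hypothetical true labels

# Number of trees for which each observation was out-of-bag.
oob_times <- rowSums(!is.na(predicted))

# Fraction of OOB votes for each class, columns in the order "0", "1".
vote1 <- rowMeans(predicted, na.rm = TRUE)
vote <- data.frame("0" = 1 - vote1, "1" = vote1, check.names = FALSE)

# Majority vote over the OOB predictions.
pred <- as.integer(vote1 > 0.5)

# OOB error rate of the aggregated prediction.
err <- mean(pred != y)
```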

Examples

data(rfgv_dataset)
data(group)
data <- rfgv_dataset
train <- data[which(data[,1]=="train"),-1]           # negative index into the `data`
test <- data[which(data[,1]=="test"),-1]             # object specifying all rows and all columns
validation <- data[which(data[,1]=="validation"),-1] # except the first column.
forest <- rfgv(train, group=group, groupImp=group, ntree=1, mtry_group=3,
               sampvar=TRUE, maxdepth=2, replace=TRUE, case_min=1,
               sampsize=nrow(train), mtry_var=rep(2,5), grp.importance=TRUE,
               test=test, keep_forest=FALSE, crit=1, penalty="No")