Compute the impurity of a sequence of subtrees derived from an rpart object.

impurete_rpart(validation, tree_seq)

Arguments

validation

a new data set. It must be a data frame containing the same variables as the learning data set used to build the rpart object.

tree_seq

a sequence of subtrees. It must be a list in which each element is an object returned by the function 'rpart::snip.rpart'.
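As an illustration of what such a list might look like, here is a minimal sketch that builds a nested sequence of subtrees with 'rpart::snip.rpart'. The kyphosis data set, the model formula and the snipping order (deepest internal nodes first) are assumptions chosen for the example; any list of snipped rpart objects should be acceptable.

```r
library(rpart)

# Fit a classification tree on the kyphosis data shipped with rpart
# (illustrative choice, not prescribed by this function).
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class")

# Node numbers of the internal (non-leaf) nodes, largest number first.
# Children of node n are numbered 2n and 2n + 1, so descendants always
# come before their ancestors in this ordering.
is_internal <- fit$frame$var != "<leaf>"
internal <- sort(as.numeric(rownames(fit$frame)[is_internal]),
                 decreasing = TRUE)

# Snip cumulatively: the k-th subtree has the first k internal nodes
# collapsed into leaves, so each subtree is nested in the previous one.
tree_seq <- lapply(seq_along(internal), function(k) {
  snip.rpart(fit, toss = internal[seq_len(k)])
})
```

The last element of the list is the tree reduced to its root, since node 1 is snipped last.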

Value

a list with elements

  • impurete: a matrix containing the impurity values (Gini, entropy and misclassification rate, in that order) of each subtree evaluated on the data set 'validation'.

  • pred: a list containing the prediction vector of each subtree

  • summary_noeuds: a list providing information about each subtree:

    • nom_noeuds: number of the node

    • N: number of observations in the node

    • N[Y=1]: number of observations with "Y=1" in the node

    • P[Y=1]: empirical probability that an observation in the node has the label "Y=1"

    • P[Y=0]: empirical probability that an observation in the node has the label "Y=0"

    • P[hat.Y!=Y]: misclassification rate in the node

Details

The function takes as input an independent test sample and a sequence of nested trees. For each tree, it predicts the class of each observation in the test set; from these predictions it then computes the impurity of the tree (Gini index, entropy and misclassification rate).
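The computation described above can be sketched as follows. This is not the package's implementation: 'tree_impurity' is a hypothetical helper, and the choices made here (binary 0/1 labels, natural logarithm for the entropy, each leaf weighted by its share of the validation observations) are assumptions.

```r
# Hypothetical helper: impurity of one tree on a validation sample.
# y    : observed 0/1 labels
# pred : 0/1 labels predicted by the tree
# node : leaf (terminal node) reached by each observation
tree_impurity <- function(y, pred, node) {
  n  <- tapply(y, node, length)   # N: observations per leaf
  p1 <- tapply(y, node, mean)     # P[Y=1] per leaf
  p0 <- 1 - p1                    # P[Y=0] per leaf

  # Convention: 0 * log(0) = 0, so pure leaves have zero entropy.
  plogp <- function(p) ifelse(p > 0, p * log(p), 0)

  gini     <- 2 * p0 * p1                    # equals 1 - p0^2 - p1^2
  entropy  <- -(plogp(p0) + plogp(p1))
  misclass <- tapply(pred != y, node, mean)  # P[hat.Y != Y] per leaf

  # Weight each leaf by its share of the validation observations.
  w <- n / sum(n)
  c(gini = sum(w * gini),
    entropy = sum(w * entropy),
    misclass = sum(w * misclass))
}

# Toy validation sample spread over two leaves (nodes 2 and 3).
y    <- c(1, 1, 0, 0, 1, 0)
pred <- c(1, 1, 1, 0, 0, 0)
node <- c(2, 2, 2, 3, 3, 3)
res  <- tree_impurity(y, pred, node)
```

For a subtree from 'tree_seq', the predicted classes could be obtained with predict(tree, validation, type = "class"); the leaf reached by each observation has to be recovered separately.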