Title: | Estimate Permutation p-Values for Random Forest Importance Metrics |
---|---|
Description: | Estimate significance of importance metrics for a Random Forest model by permuting the response variable. Produces null distribution of importance metrics for each predictor variable and p-value of observed. Provides summary and visualization functions for 'randomForest' results. |
Authors: | Eric Archer [aut, cre] |
Maintainer: | Eric Archer <[email protected]> |
License: | GPL (>= 2) |
Version: | 2.5.2 |
Built: | 2024-11-15 03:21:07 UTC |
Source: | https://github.com/ericarcher/rfpermute |
Create a vector of balanced (equal) sample sizes for use in the `sampsize` argument of `rfPermute` or `randomForest` for a classification model. The values are derived from a percentage of the smallest class sample size.
balancedSampsize(y, pct = 0.5)
y |
character, numeric, or factor vector containing classes of response variable. Values will be treated as unique for computing class frequencies. |
pct |
percent of smallest class frequency for |
a named vector of sample sizes as long as the number of classes.
Eric Archer [email protected]

    data(mtcars)

    # A balanced model with default half of smallest class size
    sampsize_0.5 <- balancedSampsize(mtcars$am)
    sampsize_0.5

    rfPermute(factor(am) ~ ., mtcars, replace = FALSE, sampsize = sampsize_0.5)

    # A balanced model with one quarter of smallest class size
    sampsize_0.25 <- balancedSampsize(mtcars$am, pct = 0.25)
    sampsize_0.25

    rfPermute(factor(am) ~ ., mtcars, replace = FALSE, sampsize = sampsize_0.25)

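To make the idea of "a percentage of the smallest class sample size" concrete, the following base R sketch computes one equal sample size per class. The helper name `balanced_sampsize_sketch` is hypothetical, and the rounding rule (`floor`, with a minimum of 1) is an assumption; the package's own `balancedSampsize` may round differently.

```r
# Hypothetical helper, not the package's implementation.
balanced_sampsize_sketch <- function(y, pct = 0.5) {
  freq <- table(y)                        # class frequencies
  n <- max(1, floor(min(freq) * pct))     # rounding rule is an assumption
  stats::setNames(rep(n, length(freq)), names(freq))
}

# mtcars$am has 19 cases of class "0" and 13 of class "1"
sizes <- balanced_sampsize_sketch(mtcars$am)
sizes
```

The result is a named vector with one entry per class, which is the shape the `sampsize` argument expects for a stratified classification model.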
Construct a data frame of case predictions for training data along with vote distributions.
casePredictions(x)
x |
a |
A data frame containing columns of original and predicted cases, whether they were correctly classified, and vote distributions among cases.
Eric Archer [email protected]

    library(randomForest)
    data(mtcars)

    rf <- randomForest(factor(am) ~ ., mtcars)
    cp <- casePredictions(rf)
    cp

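As a rough illustration of the kind of table described above, this base R sketch builds a data frame of original class, predicted class, correctness, and vote shares from a toy vote matrix. The vote values and the column names are assumptions chosen for illustration; the columns actually returned by `casePredictions` may be named differently.

```r
# Toy vote proportions (rows = cases, columns = classes); values are illustrative.
votes <- matrix(
  c(0.9, 0.1,
    0.3, 0.7,
    0.6, 0.4),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("case1", "case2", "case3"), c("0", "1"))
)
original <- factor(c("0", "0", "1"), levels = colnames(votes))

# Predicted class = class with the largest vote share.
predicted <- factor(
  colnames(votes)[max.col(votes, ties.method = "first")],
  levels = colnames(votes)
)

cp_sketch <- data.frame(
  original, predicted,
  is.correct = original == predicted,
  votes,
  check.names = FALSE
)
cp_sketch
```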
Compute the classification priors for each class and the class-specific model binomial p-values that use these priors as null hypotheses.
classPriors(x, sampsize)
x |
a |
sampsize |
the vector of sample sizes used to construct the model. If
provided, must have length equal to number of classes. If set to
|
Eric Archer [email protected]
See also: `balancedSampsize`, `confusionMatrix`

    library(randomForest)
    data(mtcars)

    # random sampling with replacement
    rf <- randomForest(factor(am) ~ ., mtcars)
    confusionMatrix(rf)
    classPriors(rf, NULL)

    # balanced design
    sampsize <- balancedSampsize(mtcars$am)
    rf <- randomForest(factor(am) ~ ., mtcars, replace = FALSE, sampsize = sampsize)
    confusionMatrix(rf)
    classPriors(rf, sampsize)

Removes cases with missing data and predictors that are constant, for use in a Random Forest classification model.
cleanRFdata(x, y, data, max.levels = 30)
x |
columns used as predictor variables as character or numeric vector. |
y |
column used as response variable as character or numeric. |
data |
data.frame containing |
max.levels |
maximum number of levels in response variable |
a data.frame containing cleaned data.
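The two cleaning steps described above (dropping incomplete cases and dropping constant predictors) can be sketched in base R as follows. The helper `clean_rf_sketch` and the toy data are hypothetical, the `max.levels` check is omitted, and the order of operations is an assumption, so this does not reproduce the exact behavior of `cleanRFdata`.

```r
# Hypothetical helper; does not reproduce cleanRFdata()'s exact behavior.
clean_rf_sketch <- function(x, y, data) {
  df <- data[, c(y, x), drop = FALSE]
  # drop cases with any missing value
  df <- df[stats::complete.cases(df), , drop = FALSE]
  # flag predictors that take a single value after removing incomplete cases
  is_const <- vapply(df[x], function(col) length(unique(col)) < 2, logical(1))
  df[, c(y, x[!is_const]), drop = FALSE]
}

toy <- data.frame(
  cls = c("a", "b", "a", "b"),
  p1 = c(1, 2, NA, 4),      # one missing value
  p2 = c(5, 5, 5, 5),       # constant predictor
  p3 = c(0.1, 0.4, 0.2, 0.9)
)
cleaned <- clean_rf_sketch(c("p1", "p2", "p3"), "cls", toy)
cleaned
```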
Eric Archer [email protected]
Combines two or more ensembles of `rfPermute` objects into one, combining `randomForest` results, null distributions, and re-calculating p-values.
combineRP(...)
... |
two or more objects of class |
Eric Archer [email protected]

    data(iris)

    # num.rep is the replicate argument named in the rfPermute() usage
    rp1 <- rfPermute(
      Species ~ ., iris, ntree = 50, norm.votes = FALSE, num.rep = 100, num.cores = 1
    )
    rp2 <- rfPermute(
      Species ~ ., iris, ntree = 50, norm.votes = FALSE, num.rep = 100, num.cores = 1
    )
    rp3 <- rfPermute(
      Species ~ ., iris, ntree = 50, norm.votes = FALSE, num.rep = 100, num.cores = 1
    )

    rp.all <- combineRP(rp1, rp2, rp3)
    rp.all

    plotNull(rp.all)

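The pooling step can be pictured with a small base R sketch: null draws from several runs are bound together and the p-values are recomputed against the larger null distribution. Everything here is a simplification under stated assumptions: the scores are toy numbers, a single set of observed scores is used for all runs, and the p-value convention (counting the observed value in numerator and denominator) is an assumption rather than the package's documented formula.

```r
set.seed(42)
obs <- c(a = 2.5, b = 0.3)   # toy observed importance scores

# Toy null draws from one run: predictors in rows, replicates in columns.
make_null <- function(n_rep) {
  matrix(
    rnorm(length(obs) * n_rep),
    nrow = length(obs),
    dimnames = list(names(obs), NULL)
  )
}

null_list <- list(make_null(100), make_null(100), make_null(100))

# Pool the null distributions and recompute p-values on the combined set.
pooled <- do.call(cbind, null_list)
p_val <- (rowSums(pooled >= obs) + 1) / (ncol(pooled) + 1)
p_val
```

Pooling three runs of 100 replicates gives a null distribution of 300 draws per predictor, which lowers the smallest p-value that can be resolved.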
Generate a confusion matrix for Random Forest classification models with error rates translated into percent correctly classified, and columns for confidence intervals added.

    confusionMatrix(x, conf.level = 0.95, threshold = NULL)

    plotConfMat(x, title = NULL, plot = TRUE)

x |
a |
conf.level |
confidence level for the |
threshold |
threshold to test observed classification
probability against. Should be a number between 0 and 1.
If not |
title |
a title for the plot. |
plot |
display the plot? |
Eric Archer [email protected]

    library(randomForest)
    data(mtcars)

    rf <- randomForest(factor(am) ~ ., mtcars)
    confusionMatrix(rf)

    confusionMatrix(rf, conf.level = 0.75)

    confusionMatrix(rf, threshold = 0.7)
    confusionMatrix(rf, threshold = 0.8)
    confusionMatrix(rf, threshold = 0.95)

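One way to picture the added columns is to start from a plain confusion matrix and attach percent correct, binomial confidence limits, and a test against a threshold. The sketch below uses a toy matrix and exact binomial intervals from `stats::binom.test`; the interval method, the one-sided test, and the column names `LCI`, `UCI`, and `p.threshold` are assumptions, not a description of what `confusionMatrix` does internally.

```r
# Toy confusion matrix: rows = observed class, columns = predicted class.
conf <- matrix(
  c(17, 2,
     3, 10),
  nrow = 2, byrow = TRUE,
  dimnames = list(observed = c("0", "1"), predicted = c("0", "1"))
)

conf_level <- 0.95
threshold <- 0.7
correct <- diag(conf)     # correctly classified cases per class
total <- rowSums(conf)    # cases per observed class

# Exact binomial confidence interval per class (one row per class).
ci <- t(mapply(
  function(k, n) stats::binom.test(k, n, conf.level = conf_level)$conf.int,
  correct, total
))

# One-sided test: is the classification rate greater than the threshold?
p_thr <- mapply(
  function(k, n) stats::binom.test(k, n, p = threshold, alternative = "greater")$p.value,
  correct, total
)

result <- cbind(
  conf,
  pct.correct = 100 * correct / total,
  LCI = 100 * ci[, 1],
  UCI = 100 * ci[, 2],
  p.threshold = p_thr
)
result
```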
The `importance` function extracts a matrix of the observed importance scores and p-values from the object produced by a call to `rfPermute`. `plotImportance` produces a visualization of importance scores as either a barchart or heatmap.

    ## S3 method for class 'rfPermute'
    importance(x, scale = TRUE, sort.by = NULL, decreasing = TRUE, ...)

    plotImportance(
      x,
      plot.type = c("bar", "heatmap"),
      imp.type = NULL,
      scale = TRUE,
      sig.only = FALSE,
      alpha = 0.05,
      n = NULL,
      ranks = TRUE,
      xlab = NULL,
      ylab = NULL,
      main = NULL,
      size = 3,
      plot = TRUE
    )

x |
for |
scale |
for permutation-based measures, should the measures be divided by their "standard errors"? |
sort.by |
character vector giving the importance metric(s) or p-values
to sort by. If |
decreasing |
logical. Should the sort order be increasing or decreasing? |
... |
arguments to be passed to and from other methods. |
plot.type |
plot importances as a |
imp.type |
character vector listing which importance measures to plot. Can be class names (for classification models) or names of overall importance measures (e.g., "MeanDecreaseAccuracy"). |
sig.only |
Plot only the significant (<= |
alpha |
a number specifying the critical alpha for identifying
predictors with importance scores significantly different from random.
This parameter is only relevant if |
n |
plot |
ranks |
plot ranks instead of actual importance scores? |
xlab , ylab
|
labels for the x and y axes. |
main |
main title for plot. |
size |
a value specifying the size of the significance diamond in the
heatmap if the p-value <= |
plot |
display the plot? |
Eric Archer [email protected]

    data(mtcars)

    # A classification model classifying cars to manual or automatic transmission
    # (num.rep is the replicate argument named in the rfPermute() usage)
    am.rp <- rfPermute(factor(am) ~ ., mtcars, ntree = 100, num.rep = 50)

    imp.scaled <- importance(am.rp, scale = TRUE)
    imp.scaled

    # plot scaled importance scores
    plotImportance(am.rp, scale = TRUE)

    # plot unscaled and only significant scores
    plotImportance(am.rp, scale = FALSE, sig.only = TRUE)

For classification models, calculate the percent of individuals correctly classified in a specified percent of trees in the forest.
pctCorrect(x, pct = c(seq(0.8, 0.95, 0.05), 0.99))
x |
a |
pct |
vector of minimum percent of trees voting for each class. Can be
|
a matrix giving the percent of individuals correctly classified in each class and overall for each threshold value specified in `pct`.
Eric Archer [email protected]

    library(randomForest)
    data(mtcars)

    rf <- randomForest(factor(am) ~ ., mtcars, importance = TRUE)
    pctCorrect(rf)

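The threshold logic can be illustrated without a fitted model. In this base R sketch, a case counts as correctly classified at a given threshold when the share of trees voting for its observed class is at least that threshold. The toy votes, the helper name `pct_correct_sketch`, and the use of `>=` rather than `>` are assumptions for illustration.

```r
# Toy vote proportions (rows = cases, columns = classes); values are illustrative.
votes <- matrix(
  c(0.95, 0.05,
    0.85, 0.15,
    0.40, 0.60,
    0.10, 0.90,
    0.30, 0.70),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("0", "1"))
)
observed <- factor(c("0", "0", "0", "1", "1"))

# Vote share each case received for its own observed class.
own_share <- votes[cbind(seq_len(nrow(votes)), as.integer(observed))]

# Percent of cases per class, and overall, that meet the threshold.
pct_correct_sketch <- function(threshold) {
  hit <- own_share >= threshold
  c(tapply(hit, observed, mean), Overall = mean(hit)) * 100
}

pct <- c(0.8, 0.9)
res <- t(sapply(pct, pct_correct_sketch))
rownames(res) <- paste0("pct.correct_", pct)
res
```

Raising the threshold can only lower or hold the percent correct, which is why the rows are non-increasing from top to bottom.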
For classification models, plot the distribution of predictor variables by class, with predictors sorted by order of importance in the model.

    plotImpPreds(
      x,
      df,
      class.col,
      imp.type = NULL,
      max.vars = 16,
      scale = TRUE,
      size = 1,
      point.alpha = 0.2,
      violin.alpha = 0.5,
      plot = TRUE
    )

x |
a |
df |
data.frame with predictors in |
class.col |
response column name in |
imp.type |
character string representing importance type to use for sorting predictors. |
max.vars |
number of variables to plot (from most important to least). |
scale |
For permutation-based importance measures, should they be divided by their "standard errors"? |
size , point.alpha , violin.alpha
|
controls size of points and alpha values (transparency) for points and violin plots. |
plot |
display the plot? |
the `ggplot2` object is invisibly returned.
If the model in `x` is from `randomForest` and was run with `importance = TRUE`, then 'MeanDecreaseAccuracy' is used as the default importance measure for sorting. Otherwise, 'MeanDecreaseGini' is used.
Eric Archer [email protected]

    library(randomForest)
    data(mtcars)

    df <- mtcars
    df$am <- factor(df$am)

    rf <- randomForest(am ~ ., df, importance = TRUE)
    plotImpPreds(rf, df, "am")

Plot distribution of the fraction of trees that samples were inbag in the Random Forest model.
plotInbag(x, bins = 10, replace = TRUE, sampsize = NULL, plot = TRUE)
x |
a |
bins |
number of bins in histogram. |
replace |
was sampling done with or without replacement? |
sampsize |
sizes of samples drawn. Either a single value or vector of sample sizes as long as the number of classes. |
plot |
display the plot? |
the `ggplot2` object is invisibly returned.
Red vertical lines on the plot denote the expected inbag rate(s). These rates are based on the values of `replace` and `sampsize` supplied. If not specified, they are set to the `randomForest` defaults. If these values differ from the arguments used to run the model, the indicator lines will not line up with the inbag frequency distribution.
Eric Archer [email protected]

    library(randomForest)
    data(mtcars)

    sampsize = c(5, 5)

    rf <- randomForest(factor(am) ~ ., data = mtcars, ntree = 10)
    plotInbag(rf)

    rf <- randomForest(factor(am) ~ ., data = mtcars, ntree = 1000)
    plotInbag(rf)

    rf <- randomForest(factor(am) ~ ., data = mtcars, ntree = 10000)
    plotInbag(rf)

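For intuition about where the expected inbag rate comes from, the sketch below works it out from first principles. The helper `expected_inbag` is hypothetical; its default `sampsize` mirrors the documented `randomForest` defaults (`n` when sampling with replacement, `ceiling(0.632 * n)` otherwise), but whether `plotInbag` uses exactly these formulas is an assumption.

```r
# Hypothetical helper: probability that a given case is inbag for one tree.
# n        = number of cases available (overall, or within a class)
# sampsize = number of cases drawn per tree
expected_inbag <- function(n,
                           sampsize = if (replace) n else ceiling(0.632 * n),
                           replace = TRUE) {
  if (replace) {
    1 - (1 - 1 / n)^sampsize   # chance of being drawn at least once
  } else {
    sampsize / n               # simple fraction of cases drawn
  }
}

n <- nrow(mtcars)  # 32 cases

# Bootstrap sampling with the default sample size
rate_replace <- expected_inbag(n)

# Sampling without replacement with the default sample size
rate_no_replace <- expected_inbag(n, replace = FALSE)

# Balanced design: 5 cases drawn per class from classes of size 19 and 13
rate_balanced <- expected_inbag(c(19, 13), sampsize = c(5, 5), replace = FALSE)

c(rate_replace = rate_replace, rate_no_replace = rate_no_replace)
rate_balanced
```

With stratified sample sizes each class has its own expected rate, which is consistent with the plot showing more than one indicator line.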
Plot the Random Forest null distributions of importance metrics, observed values, and p-values for each predictor variable from the object produced by a call to `rfPermute`.

    plotNull(
      x,
      preds = NULL,
      imp.type = NULL,
      scale = TRUE,
      plot.type = c("density", "hist"),
      plot = TRUE
    )

x |
An object produced by a call to |
preds |
a character vector of predictors to plot. If |
imp.type |
A character vector giving the importance metric(s) to plot. |
scale |
Plot importance measures scaled (divided by) standard errors? |
plot.type |
type of plot to produce: |
plot |
display the plot? |
The function will generate a plot for each predictor, with faceted importance metrics. The vertical red line shows the observed importance score and the _p_-value is given in the facet label.
A named list of the `ggplot` figures produced is invisibly returned.
Eric Archer [email protected]

    # A regression model using the ozone example
    # (num.rep is the replicate argument named in the rfPermute() usage)
    data(airquality)
    ozone.rp <- rfPermute(
      Ozone ~ ., data = airquality, ntree = 100,
      na.action = na.omit, num.rep = 50, num.cores = 1
    )

    # Plot the null distributions and observed values.
    plotNull(ozone.rp)

Plot histogram of assignment probabilities to predicted class. This is used for determining if the model differentiates between correctly and incorrectly classified samples in terms of how strongly they are classified.
plotPredictedProbs(x, bins = 30, plot = TRUE)
x |
a |
bins |
number of bins in histogram. Defaults to number of samples / 5. |
plot |
display the plot? |
the `ggplot2` object is invisibly returned.
Eric Archer [email protected]

    library(randomForest)
    data(mtcars)

    rf <- randomForest(factor(am) ~ ., mtcars)
    plotPredictedProbs(rf, bins = 20)

Create a plot of Random Forest proximity scores using multi-dimensional scaling.

    plotProximity(
      x,
      dim.x = 1,
      dim.y = 2,
      class.cols = NULL,
      legend.type = c("legend", "label", "none"),
      legend.loc = c("top", "bottom", "left", "right"),
      point.size = 2,
      circle.size = 8,
      circle.border = 1,
      group.type = c("ellipse", "hull", "contour", "none"),
      group.alpha = 0.3,
      ellipse.level = 0.95,
      n.contour.grid = 100,
      label.size = 4,
      label.alpha = 0.7,
      plot = TRUE
    )

x |
a |
dim.x , dim.y
|
numeric values giving x and y dimensions to plot from multidimensional scaling of proximity scores. |
class.cols |
vector of colors to use for each class. |
legend.type |
type of legend to use to label classes. |
legend.loc |
character keyword specifying location of legend.
Can be |
point.size |
size of central points. Set to |
circle.size |
size of circles around points indicating classification. Set to NULL for no circles. |
circle.border |
width of circle border. |
group.type |
type of grouping to display. Ignored for regression models. |
group.alpha |
value giving alpha transparency level for group shading.
Setting to |
ellipse.level |
the confidence level at which to draw the ellipse. |
n.contour.grid |
number of grid points for contour lines. |
label.size |
size of label if |
label.alpha |
transparency of label background. |
plot |
logical determining whether or not to show plot. |
Produces a scatter plot of proximity scores for `dim.x` and `dim.y` dimensions from a multidimensional scale (MDS) conversion of proximity scores from a `randomForest` object. For classification models, points are colored according to original (inner) and predicted (outer) class.
a list with:
prox.mds |
the MDS scores of the selected dimensions |
g |
|
Eric Archer [email protected]

    library(randomForest)
    data(symb.metab)

    rf <- randomForest(type ~ ., symb.metab, proximity = TRUE)

    # With confidence ellipses
    plotProximity(rf)

    # With convex hulls
    plotProximity(rf, group.type = "hull")

    # With contours
    plotProximity(rf, group.type = "contour")

    # Remove the points and just show ellipses
    plotProximity(rf, point.size = NULL, circle.size = NULL, group.alpha = 0.5)

    # Labels instead of a legend
    plotProximity(
      rf, legend.type = "label", point.size = NULL,
      circle.size = NULL, group.alpha = 0.5
    )

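The MDS conversion itself can be sketched with base R's `stats::cmdscale`. The proximity matrix below is a toy example, and treating `1 - proximity` as the distance is an assumption about how the scaling is set up; the column names `dim.1` and `dim.2` are likewise illustrative.

```r
# Toy proximity matrix (symmetric, ones on the diagonal); values are illustrative.
# Cases 1-2 are close to each other, as are cases 3-4.
prox <- matrix(
  c(1.0, 0.8, 0.1, 0.2,
    0.8, 1.0, 0.2, 0.1,
    0.1, 0.2, 1.0, 0.7,
    0.2, 0.1, 0.7, 1.0),
  nrow = 4, byrow = TRUE
)

# Treat 1 - proximity as a distance and project onto two dimensions.
mds <- stats::cmdscale(1 - prox, k = 2)
colnames(mds) <- c("dim.1", "dim.2")
mds
```

In this toy case the first dimension separates the two pairs of cases, which is the kind of grouping the scatter plot is meant to reveal.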
Plot trace of cumulative OOB (classification) or MSE (regression) error rate by number of trees.
plotTrace(x, pct.correct = TRUE, plot = TRUE)
x |
a |
pct.correct |
display y-axis as percent correctly classified
( |
plot |
display the plot? |
the `ggplot2` object is invisibly returned.
Eric Archer [email protected]

    library(randomForest)
    data(mtcars)

    rf <- randomForest(factor(am) ~ ., mtcars)
    plotTrace(rf)

For classification models, plot distribution of votes for each sample in each class.
plotVotes(x, type = NULL, freq.sep.line = TRUE, plot = TRUE)
x |
a |
type |
either |
freq.sep.line |
put frequency of original group on second line in facet
label? If |
plot |
display the plot? |
the `ggplot2` object is invisibly returned.
Eric Archer [email protected]

    library(randomForest)
    data(mtcars)

    rf <- randomForest(factor(am) ~ ., mtcars)
    plotVotes(rf)

Estimate significance of importance metrics for a Random Forest model by permuting the response variable. Produces null distribution of importance metrics for each predictor variable and p-value of observed.

    rfPermute(x, ...)

    ## Default S3 method:
    rfPermute(x, y = NULL, ..., num.rep = 100, num.cores = 1)

    ## S3 method for class 'formula'
    rfPermute(
      formula,
      data = NULL,
      ...,
      subset,
      na.action = na.fail,
      num.rep = 100,
      num.cores = 1
    )

    as.randomForest(x)

    ## S3 method for class 'rfPermute'
    print(x, ...)

    ## S3 method for class 'rfPermute'
    predict(object, ...)

x , y , formula , data , subset , na.action , ...
|
See |
num.rep |
Number of permutation replicates to run to construct null distribution and calculate p-values (default = 100). |
num.cores |
Number of CPUs to distribute permutation results over.
Defaults to |
object |
an |
All other parameters are as defined in `randomForest.formula`.
A Random Forest model is first created as normal to calculate the observed values of variable importance. The response variable is then permuted `num.rep` times, with a new Random Forest model built for each permutation step.
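The permutation scheme can be sketched in base R without fitting any forests. To keep the example self-contained, absolute correlation with the response stands in for a Random Forest importance metric; that substitution, the simulated data, the helper name `importance_sketch`, and the p-value convention (counting the observed value once) are all assumptions for illustration, not the package's implementation.

```r
set.seed(1)
n <- 60
x <- data.frame(signal = rnorm(n), noise = rnorm(n))
y <- 2 * x$signal + rnorm(n)   # only 'signal' is related to the response

# Stand-in importance metric (an assumption): absolute correlation with y.
importance_sketch <- function(x, y) {
  abs(vapply(x, stats::cor, numeric(1), y = y))
}

# Step 1: observed importance from the unpermuted response.
obs <- importance_sketch(x, y)

# Step 2: permute the response num_rep times and recompute importance each time.
num_rep <- 200
null <- replicate(num_rep, importance_sketch(x, sample(y)))  # predictors x replicates

# Step 3: p-value = share of null values at least as large as the observed value.
p_val <- (rowSums(null >= obs) + 1) / (num_rep + 1)
p_val
```

Permuting the response breaks any association with the predictors while leaving the predictors' own structure intact, so the null distribution reflects the importance a predictor would receive by chance alone.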
An `rfPermute` object.
Eric Archer [email protected]

    # A regression model predicting ozone levels
    data(airquality)
    ozone.rp <- rfPermute(
      Ozone ~ ., data = airquality, na.action = na.omit, ntree = 100, num.rep = 50
    )
    ozone.rp

    # Plot the scaled importance distributions
    # Significant (p <= 0.05) predictors are in red
    plotImportance(ozone.rp, scale = TRUE)

    # Plot the importance null distributions and observed values for two of the predictors
    plotNull(ozone.rp, preds = c("Solar.R", "Month"))

    # A classification model classifying cars to manual or automatic transmission
    data(mtcars)
    am.rp <- rfPermute(factor(am) ~ ., mtcars, ntree = 100, num.rep = 50)
    summary(am.rp)

    plotImportance(am.rp, scale = TRUE, sig.only = TRUE)

`rfPermute` package

Random Forest Predictor Importance Significance and Model Diagnostics.
rfPermuteTutorial()
`rfPermute` or `randomForest` models.

Combine plots of error traces and inbag rates.

    ## S3 method for class 'randomForest'
    summary(object, ...)

    ## S3 method for class 'rfPermute'
    summary(object, ...)

object |
a |
... |
arguments passed to |
A combination of plots from `plotTrace` and `plotInbag` as well as summary confusion matrices (classification) or error rates (regression) from the model.
Eric Archer [email protected]

    # A regression model using the ozone example
    # (num.rep is the replicate argument named in the rfPermute() usage)
    data(airquality)
    ozone.rp <- rfPermute(
      Ozone ~ ., data = airquality, na.action = na.omit,
      ntree = 100, num.rep = 50, num.cores = 1
    )

    summary(ozone.rp)

A data.frame of 155 metabolite relative concentrations for 64 samples of four Symbiodinium clade types.
data(symb.metab)
data.frame
Klueter, A.; Crandall, J.B.; Archer, F.I.; Teece, M.A.; Coffroth, M.A. Taxonomic and Environmental Variation of Metabolite Profiles in Marine Dinoflagellates of the Genus Symbiodinium. Metabolites 2015, 5, 74-99.