All the source code of the aggregation methods is available here. To run the code, you can clone the repository directly or simply load the R script source file from the repository using the devtools package in RStudio as follows:
Install the devtools package using the command:
install.packages("devtools")
Load the source code from the GitHub repository using the source_url function:
devtools::source_url("https://raw.githubusercontent.com/hassothea/AggregationMethods/main/GradientCOBRARegressor.R")
✎ Note: All the code in this Rmarkdown is built with recent versions of R (version \(>\) 4.1, available here) and RStudio (version > 2022.02.2+485, available here). Note also that the code chunks are hidden by default. To see the code, you can either click the Code button at the top of the page and choose Show All Code to show all the code, or click the Code button of each section to show the code of that specific section.
This Rmarkdown provides the implementation of the consensual aggregation method by Has
(2021). Let \(\mathcal{D}_n=\{(x_1,y_1),...,(x_n,y_n)\}\)
be a training dataset of size \(n\), where
the input-output couples \((x_i,y_i)\in\mathbb{R}^d\times\mathbb{R}\)
for all \(i=1,...,n\). \(\mathcal{D}_{n}\) is first randomly
partitioned into \(\mathcal{D}_{k}\)
and \(\mathcal{D}_{\ell}\) of size
\(k\) and \(\ell\) respectively such that \(k+\ell=n\). We construct \(M\) regression estimators (machines) \(r_1,...,r_M\) using only \(\mathcal{D}_{k}\). Let \({\bf
r}(x)=(r_1(x),...,r_M(x))^T\in\mathbb{R}^M\) be the vector of
predictions of \(x\in\mathbb{R}^d\),
the kernel-based consensual aggregation method evaluated at point \(x\) is defined by
\[\begin{equation} g_n({\bf r}(x))=\frac{\sum_{i=1}^{\ell}y_iK_h({\bf r}(x)-{\bf r}(x_i))}{\sum_{j=1}^{\ell}K_h({\bf r}(x)-{\bf r}(x_j))} \end{equation}\] where \(K:\mathbb{R}^M\to\mathbb{R}_+\) is a non-increasing kernel function with \(K_h(x)=K(x/h)\) for some smoothing parameter \(h>0\) to be tuned, with the convention \(0/0=0\). Even though the formula of \(g_n\) is defined using only data points in \(\mathcal{D}_{\ell}\), it does depend on the whole training data \(\mathcal{D}_{n}\) as the basic machines \((r_i)_{i=1}^M\) are built using \(\mathcal{D}_{k}\).
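To make the formula concrete, the lines below give a minimal sketch of \(g_n\) with a Gaussian kernel on made-up toy predictions, responses and a fixed bandwidth \(h\); all numbers are arbitrary and this is only an illustration of the formula, not the implementation used later (which works with precomputed distance matrices and a tuned parameter).
# Toy illustration of g_n with the Gaussian kernel K(u) = exp(-||u||^2 / 2),
# so that K_h(u) = exp(-||u||^2 / (2 h^2)). All numbers are arbitrary.
set.seed(1)
M <- 3; ell <- 5
R_ell <- matrix(rnorm(ell * M), nrow = ell)  # rows are r(x_i) for x_i in D_ell
y_ell <- rnorm(ell)                          # responses y_i of D_ell
r_x   <- rnorm(M)                            # r(x) for a new query point x
h     <- 0.5                                 # hypothetical smoothing parameter
w   <- exp(-rowSums(sweep(R_ell, 2, r_x)^2) / (2 * h^2))  # K_h(r(x) - r(x_i))
g_n <- if (sum(w) == 0) 0 else sum(w * y_ell) / sum(w)    # convention 0/0 = 0
g_n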
We prepare all the necessary tools for this Rmarkdown. The pacman package allows us to load (if already installed) or install (if not yet installed) any available package from the Comprehensive R Archive Network (CRAN) of R.
# Check if package "pacman" is already installed
lookup_packages <- installed.packages()[,1]
if(!("pacman" %in% lookup_packages))
install.packages("pacman")
# To be installed or loaded
pacman::p_load(magrittr)
pacman::p_load(tidyverse)
## package for "generateMachines"
pacman::p_load(tree)
pacman::p_load(glmnet)
pacman::p_load(randomForest)
pacman::p_load(FNN)
pacman::p_load(xgboost)
pacman::p_load(keras)
pacman::p_load(pracma)
pacman::p_load(latex2exp)
pacman::p_load(plotly)
rm(lookup_packages)
This section provides functions to generate basic machines (regressors) to be aggregated.
setBasicParameter
This function allows us to set the values of some key parameters of the basic machines.
Argument:

- lambda: the penalty parameter \(\lambda\) used in penalized linear models: ridge or lasso.
- k: the parameter \(k\) of the \(k\)NN (knn) regression model. The default value is \(k=10\).
- ntree: the number of trees in random forest (rf). By default, ntree = 300.
- mtry: the number of random features chosen at each split of the random forest procedure. By default, mtry = NULL and the default value of mtry of the randomForest function from the randomForest library is used.
- eta_xgb: the learning rate \(\eta>0\) in the gradient step of the extreme gradient boosting method (xgb) of the xgboost library.
- nrounds_xgb: the parameter nrounds indicating the maximum number of boosting iterations. By default, nrounds_xgb = 100.
- early_stop_xgb: the early stopping round criterion of the xgboost function. By default, early_stop_xgb = NULL and early stopping is not triggered.
- max_depth_xgb: the maximum depth of the trees constructed in xgboost.

Value:

This function returns a list of all the parameters given in its arguments, to be fed to the basicMachineParam argument of the function generateMachines defined in the next section.

Remark.2: lambda, k and ntree can each be a single value or a vector. In other words, each type of model can be constructed several times according to the values of its hyperparameter (a single value or a vector).
setBasicParameter <- function(lambda = NULL,
k = 10,
ntree = 300,
mtry = NULL,
eta_xgb = 1,
nrounds_xgb = 100,
early_stop_xgb = NULL,
max_depth_xgb = 3){
return(list(
lambda = lambda,
k = k,
ntree = ntree,
mtry = mtry,
eta_xgb = eta_xgb,
nrounds_xgb = nrounds_xgb,
early_stop_xgb = early_stop_xgb,
max_depth_xgb = max_depth_xgb)
)
}
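For instance (the values below are arbitrary and only illustrate Remark.2), passing vectors produces several machines of the same type:
# Hypothetical call: three kNN machines, two random forests and five lasso models
params <- setBasicParameter(k = c(5, 10, 15),
                            ntree = c(100, 300),
                            lambda = 10^seq(-2, 1, length.out = 5))
str(params[c("k", "ntree", "lambda")])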
generateMachines
This function generates all the basic machines to be aggregated.
Argument:

- train_input: a matrix or data frame of the training input data.
- train_response: a vector of the training response variable corresponding to train_input.
- scale_input: a logical value specifying whether to scale the input data (to be between \(0\) and \(1\)) or not. By default, scale_input = FALSE.
- machines: the types of basic machines to be constructed. It is a subset of {“lasso”, “ridge”, “knn”, “tree”, “rf”, “xgb”}. By default, machines = NULL and all six types of basic machines are built.
- splits: a real number between \(0\) and \(1\) specifying the proportion of training data used to train the basic machines (\(\mathcal{D}_k\)). The remaining proportion (\(1-\) splits) is used for the aggregation (\(\mathcal{D}_{\ell}\)). By default, splits = 0.5.
- basicMachineParam: the option used to set up the values of the parameters of each machine. One should feed the function setBasicParameter() defined above to this argument.

Value:

This function returns a list of the following objects.

- predict2: the predictions of the remaining part (\(\mathcal{D}_{\ell}\)) of the training data used for the aggregation.
- models: all the constructed basic machines (for knn, it contains only the values of the parameter \(k\)).
- id2: a logical vector of length equal to the number of rows of the training data, indicating the location of the points used to build the basic machines (FALSE) and the remaining ones (TRUE).
- scale_max, scale_min: if the argument scale_input = TRUE, the maximum and minimum values of all columns are contained in these vectors.

✎ Note: You may need to modify the function accordingly if you want to build different types of basic machines (a minimal sketch of the required convention is given after the function definition below).
generateMachines <- function(train_input,
train_response,
scale_input = FALSE,
machines = NULL,
splits = 0.5,
basicMachineParam = setBasicParameter()){
lambda = basicMachineParam$lambda
k <- basicMachineParam$k
ntree <- basicMachineParam$ntree
mtry <- basicMachineParam$mtry
eta_xgb <- basicMachineParam$eta_xgb
nrounds_xgb <- basicMachineParam$nrounds_xgb
early_stop_xgb <- basicMachineParam$early_stop_xgb
max_depth_xgb <- basicMachineParam$max_depth_xgb
# Packages
pacman::p_load(tree)
pacman::p_load(glmnet)
pacman::p_load(randomForest)
pacman::p_load(FNN)
pacman::p_load(xgboost)
# pacman::p_load(keras)
# Preparing data
input_names <- colnames(train_input)
input_size <- dim(train_input)
df_input <- train_input_scale <- train_input
if(scale_input){
maxs <- map_dbl(.x = df_input, .f = max)
mins <- map_dbl(.x = df_input, .f = min)
train_input_scale <- scale(train_input, center = mins, scale = maxs - mins)
}
if(is.matrix(train_input_scale)){
df_input <- as_tibble(train_input_scale)
matrix_input <- train_input_scale
} else{
df_input <- train_input_scale
matrix_input <- as.matrix(train_input_scale)
}
# Machines
lasso_machine <- function(x, lambda0){
if(is.null(lambda)){
cv <- cv.glmnet(matrix_train_x1, train_y1, alpha = 1, lambda = 10^(seq(-3,2,length.out = 50)))
mod <- glmnet(matrix_train_x1, train_y1, alpha = 1, lambda = cv$lambda.min)
} else{
mod <- glmnet(matrix_train_x1, train_y1, alpha = 1, lambda = lambda0)
}
res <- predict.glmnet(mod, newx = x)
return(list(pred = res,
model = mod))
}
ridge_machine <- function(x, lambda0){
if(is.null(lambda)){
cv <- cv.glmnet(matrix_train_x1, train_y1, alpha = 0, lambda = 10^(seq(-3,2,length.out = 50)))
mod <- glmnet(matrix_train_x1, train_y1, alpha = 0, lambda = cv$lambda.min)
} else{
mod <- glmnet(matrix_train_x1, train_y1, alpha = 0, lambda = lambda0)
}
res <- predict.glmnet(mod, newx = x)
return(list(pred = res,
model = mod))
}
tree_machine <- function(x, pa = NULL) {
mod <- tree(as.formula(paste("train_y1~",
paste(input_names, sep = "", collapse = "+"),
collapse = "",
sep = "")),
data = df_train_x1)
res <- as.vector(predict(mod, x))
return(list(pred = res,
model = mod))
}
knn_machine <- function(x, k0) {
mod <- knn.reg(train = matrix_train_x1, test = x, y = train_y1, k = k0)
res = mod$pred
return(list(pred = res,
model = k0))
}
RF_machine <- function(x, ntree0) {
if(is.null(mtry)){
mod <- randomForest(x = df_train_x1, y = train_y1, ntree = ntree0)
}else{
mod <- randomForest(x = df_train_x1, y = train_y1, ntree = ntree0, mtry = mtry)
}
res <- as.vector(predict(mod, x))
return(list(pred = res,
model = mod))
}
xgb_machine = function(x, nrounds_xgb0){
mod <- xgboost(data = matrix_train_x1,
label = train_y1,
eta = eta_xgb,
nrounds = nrounds_xgb0,
objective = "reg:squarederror",
early_stopping_rounds = early_stop_xgb,
max_depth = max_depth_xgb,
verbose = 0)
res <- predict(mod, x)
return(list(pred = res,
model = mod))
}
# All machines
all_machines <- list(lasso = lasso_machine,
ridge = ridge_machine,
knn = knn_machine,
tree = tree_machine,
rf = RF_machine,
xgb = xgb_machine)
# All parameters
all_parameters <- list(lasso = lambda,
ridge = lambda,
knn = k,
tree = 1,
rf = ntree,
xgb = nrounds_xgb)
if(is.null(machines)){
mach <- c("lasso", "ridge", "knn", "tree", "rf", "xgb")
}else{
mach <- machines
}
# Extracting data
M <- length(mach)
size_D1 <- floor(splits*input_size[1])
id_D1 <- logical(input_size[1])
id_D1[sample(input_size[1], size_D1)] <- TRUE
df_train_x1 <- df_input[id_D1,]
matrix_train_x1 <- matrix_input[id_D1,]
train_y1 <- train_response[id_D1]
df_train_x2 <- df_input[!id_D1,]
matrix_train_x2 <- matrix_input[!id_D1,]
# Function to extract df and model from 'map' function
extr_df <- function(x, id){
return(tibble("r_{{id}}":= as.vector(pred_m[[x]]$pred)))
}
extr_mod <- function(x, id){
return(pred_m[[x]]$model)
}
pred_D2 <- c()
all_mod <- c()
cat("\n* Building basic machines ...\n")
cat("\t~ Progress:")
for(m in 1:M){
if(mach[m] %in% c("tree", "rf")){
x0_test <- df_train_x2
} else {
x0_test <- matrix_train_x2
}
if(is.null(all_parameters[[mach[m]]])){
para_ <- 1
}else{
para_ <- all_parameters[[mach[m]]]
}
pred_m <- map(para_,
.f = ~ all_machines[[mach[m]]](x0_test, .x))
tem0 <- imap_dfc(.x = 1:length(para_),
.f = extr_df)
tem1 <- imap(.x = 1:length(para_),
.f = extr_mod)
names(tem0) <- names(tem1) <- paste0(mach[m], 1:length(para_))
pred_D2 <- bind_cols(pred_D2, as_tibble(tem0))
all_mod[[mach[m]]] <- tem1
cat(" ... ", round(m/M, 2)*100L,"%", sep = "")
}
if(scale_input){
return(list(predict2 = pred_D2,
models = all_mod,
id2 = !id_D1,
train_data = list(train_input = train_input_scale,
train_response = train_response),
scale_max = maxs,
scale_min = mins))
} else{
return(list(predict2 = pred_D2,
models = all_mod,
id2 = !id_D1,
train_data = list(train_input = train_input_scale,
train_response = train_response)))
}
}
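As noted above, adding a different type of basic machine amounts to following the same convention as the internal machines of generateMachines: a function of the form (new data, hyperparameter) returning list(pred = ..., model = ...), registered in all_machines and all_parameters. The standalone sketch below only illustrates this convention with an ordinary linear model on made-up toy data; toy_x, toy_y and lm_machine are hypothetical names, and inside generateMachines one would use df_train_x1 and train_y1 instead.
# A minimal sketch of the (x, parameter) -> list(pred, model) convention,
# using a plain linear model on arbitrary toy data.
set.seed(2)
toy_x <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
toy_y <- rnorm(50)
lm_machine <- function(x, pa = NULL){
  mod <- lm(toy_y ~ ., data = toy_x)           # df_train_x1 / train_y1 inside generateMachines
  res <- as.vector(predict(mod, newdata = x))
  return(list(pred = res, model = mod))
}
lm_machine(data.frame(x1 = rnorm(5), x2 = rnorm(5)), pa = NULL)$pred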
Example.1: In this example, the method is implemented on the Boston data of the MASS library. The basic machines “rf”, “knn” and “xgb” are built on the first part of the training data (\(\mathcal{D}_{k}\)), and the Root Mean Square Errors (RMSE), evaluated on the second part of the training data (\(\mathcal{D}_{\ell}\)) used for the aggregation, are reported.
pacman::p_load(MASS)
df <- Boston
basic_machines <- generateMachines(train_input = df[,1:13],
train_response = df[,14],
scale_input = TRUE,
machines = c("rf", "knn", "xgb"),
basicMachineParam = setBasicParameter(lambda = 1:10/10,
ntree = 10:20 * 25,
k = c(2:10)))
##
## * Building basic machines ...
## ~ Progress: ... 33% ... 67% ... 100%
basic_machines$predict2 %>%
sweep(1, df[basic_machines$id2, "medv"]) %>%
.^2 %>%
colMeans %>%
t %>%
sqrt %>%
as_tibble
This part provides functions to approximate the smoothing parameter \(h>0\) of the aggregation. Two important optimization methods are implemented: the gradient descent algorithm (grad) and grid search (grid).
setGradParameter
This function allows us to set the values of the parameters needed to run the gradient descent algorithm that approximates the hyperparameter of the aggregation method.

Argument:

- val_init: the initial value of the gradient descent iteration. By default, val_init = NULL and the algorithm selects the best value (with the smallest cost) among n_tries candidate values of the parameter.
- n_tries: the number of candidate parameters used to evaluate the cost function (only when val_init = NULL); the best one among them is chosen as the initial value for the gradient descent algorithm.
- rate: the learning rate of the gradient descent algorithm. By default, rate = NULL (or “auto”) and the value coef_auto = 0.005 is used. It can also be a functional rate, given as a string element of {“logarithm”, “sqrtroot”, “linear”, “polynomial”, “exponential”}. Each rate is defined according to the coef_-type arguments below.
- min_val: the minimum value of the parameter to be considered as the initial value of the gradient step. By default, min_val = 1e-4.
- max_val: the maximum value of the parameter to be considered as the initial value of the gradient step. By default, max_val = 0.1.
- max_iter: the maximum number of iterations of the gradient descent algorithm. By default, max_iter = 300.
- print_step: a logical value controlling whether to print the result of each gradient step in order to keep track of the algorithm. By default, print_step = TRUE.
- print_result: a logical value controlling whether to print the final result of the algorithm. By default, print_result = TRUE.
- figure: a logical value controlling whether to plot a graphic of the result. By default, figure = TRUE.
- coef_auto: the constant learning rate used when rate = NULL. By default, coef_auto = 0.005.
- coef_log: the coefficient multiplying the logarithmic increment of the learning rate, i.e., the rate is rate \(=\) coef_log \(\times \log(1+t)\), where \(t\) is the number of iterations. By default, coef_log = 1.
- coef_sqrt: the coefficient multiplying the square-root increment of the learning rate, i.e., the rate is rate \(=\) coef_sqrt \(\times \sqrt{t}\). By default, coef_sqrt = 1.
- coef_lm: the coefficient multiplying the linear increment of the learning rate, i.e., the rate is rate \(=\) coef_lm \(\times t\). By default, coef_lm = 1.
- deg_poly: the degree of the polynomial increment of the learning rate, i.e., the rate is rate \(=t^{\texttt{deg_poly}}\). By default, deg_poly = 2.
- base_exp: the base of the exponential increment of the learning rate, i.e., the rate is rate \(=\) base_exp \(^t\). By default, base_exp = 1.5.
- threshold: the threshold to stop the algorithm when the relative change is smaller than this value. By default, threshold = 1e-10.

Value:
This function returns a list of all the parameters given in its arguments.
setGradParameter <- function(val_init = NULL,
n_tries = 10,
rate = NULL,
min_val = 1e-4,
max_val = 0.1,
max_iter = 300,
print_step = TRUE,
print_result = TRUE,
figure = TRUE,
coef_auto = 0.005,
coef_log = 1,
coef_sqrt = 1,
coef_lm = 1,
deg_poly = 2,
base_exp = 1.5,
threshold = 1e-10) {
return(
list(val_init = val_init,
n_tries = n_tries,
rate = rate,
min_val = min_val,
max_val = max_val,
max_iter = max_iter,
print_step = print_step,
print_result = print_result,
figure = figure,
coef_auto = coef_auto,
coef_log = coef_log,
coef_sqrt = coef_sqrt,
coef_lm = coef_lm,
deg_poly = deg_poly,
base_exp = base_exp,
threshold = threshold
)
)
}
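For example (hypothetical values), one can ask for a square-root increasing learning rate together with a narrower search range for the starting value and silent steps:
# Hypothetical setup: square-root rate with coefficient 0.5
my_grad_param <- setGradParameter(rate = "sqrtroot",
                                  coef_sqrt = 0.5,
                                  min_val = 1e-3,
                                  max_val = 0.5,
                                  print_step = FALSE)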
gradOptimizer
This function performs the gradient descent algorithm to approximate the minimizer of a given function (convex or locally convex around its minimizer).

Argument:

- obj_fun: the objective function whose minimizer is to be estimated. It should be a univariate function of a real positive variable.
- setParameter: the control of the gradient descent parameters, which should be the function setGradParameter() defined above.

Value:

This function returns a list of the following objects:

- opt_param: the observed value of the minimizer.
- opt_error: the value of the optimal risk.
- all_grad: the vector of all the gradients collected during the walk of the algorithm.
- all_param: the vector of all the parameters collected during the walk of the algorithm.
- run_time: the running time of the algorithm.

gradOptimizer <- function(obj_fun,
setParameter = setGradParameter()) {
start.time <- Sys.time()
# Optimization step:
# ==================
spec_print <- function(x) return(ifelse(x > 1e-6,
format(x, digit = 6, nsmall = 6),
format(x, scientific = TRUE, digit = 6, nsmall = 6)))
collect_val <- c()
gradients <- c()
if (is.null(setParameter$val_init)){
val_params <- seq(setParameter$min_val,
setParameter$max_val,
length.out = setParameter$n_tries)
tem <- map_dbl(.x = val_params, .f = obj_fun)
val <- val0 <- val_params[which.min(tem)]
grad_ <- pracma::grad(
f = obj_fun,
x0 = val0,
heps = .Machine$double.eps ^ (1 / 3))
} else{
val <- val0 <- setParameter$val_init
grad_ <- pracma::grad(
f = obj_fun,
x0 = val0,
heps = .Machine$double.eps ^ (1 / 3))
}
if(setParameter$print_step){
cat("\n* Gradient descent algorithm ...")
cat("\n Step\t| Parameter\t| Gradient\t| Threshold \n")
cat(" ", rep("-", 51), sep = "")
cat("\n 0 \t| ", spec_print(val0),
"\t| ", spec_print(grad_),
" \t| ", setParameter$threshold, "\n")
cat(" ", rep("-",51), sep = "")
}
if (is.numeric(setParameter$rate)){
lambda0 <- setParameter$rate / abs(grad_)
rate_GD <- "auto"
} else{
r0 <- setParameter$coef_auto / abs(grad_)
# Rate functions
rate_func <- list(auto = r0,
logarithm = function(i) setParameter$coef_log * log(2 + i) * r0,
sqrtroot = function(i) setParameter$coef_sqrt * sqrt(i) * r0,
linear = function(i) setParameter$coef_lm * (i) * r0,
polynomial = function(i) i ^ setParameter$deg_poly * r0,
exponential = function(i) setParameter$base_exp ^ i * r0)
rate_GD <- match.arg(setParameter$rate,
c("auto",
"logarithm",
"sqrtroot",
"linear",
"polynomial",
"exponential"))
lambda0 <- rate_func[[rate_GD]]
}
i <- 0
if (is.numeric(setParameter$rate) | rate_GD == "auto") {
while (i < setParameter$max_iter) {
if(is.na(grad_)){
val0 <- runif(1, val0*0.99, val0*1.01)
grad_ = pracma::grad(
f = obj_fun,
x0 = val0,
heps = .Machine$double.eps ^ (1 / 3)
)
}
val <- val0 - lambda0 * grad_
if (val < 0){
val <- val0/2
lambda0 <- lambda0/2
}
if(i > 5){
if(sign(grad_) * sign(grad0) < 0){
lambda0 = lambda0 / 2
}
}
relative <- abs((val - val0) / val0)
test_threshold <- max(relative, abs(grad_))
if (test_threshold > setParameter$threshold){
val0 <- val
grad0 <- grad_
} else{
break
}
grad_ = pracma::grad(
f = obj_fun,
x0 = val0,
heps = .Machine$double.eps ^ (1 / 3)
)
i <- i + 1
if(setParameter$print_step){
cat("\n ", i, "\t| ", spec_print(val),
"\t| ", spec_print(grad_),
"\t| ", test_threshold, "\r")
}
collect_val <- c(collect_val, val)
gradients <- c(gradients, grad_)
}
}
else{
while (i < setParameter$max_iter) {
if(is.na(grad_)){
val0 <- runif(1, val0*0.9, val0*1.1)
grad_ = pracma::grad(
f = obj_fun,
x0 = val0,
heps = .Machine$double.eps ^ (1 / 3)
)
}
val <- val0 - lambda0(i) * grad_
if (val < 0){
val <- val0/2
r0 <- r0 / 2
}
if(i > 5){
if(sign(grad_)*sign(grad0) < 0)
r0 <- r0 / 2
}
relative <- abs((val - val0) / val0)
test_threshold <- max(relative, abs(grad_))
if (test_threshold > setParameter$threshold){
val0 <- val
grad0 <- grad_
}
else{
break
}
grad_ <- pracma::grad(
f = obj_fun,
x0 = val0,
heps = .Machine$double.eps ^ (1 / 3)
)
i <- i + 1
if(setParameter$print_step){
cat("\n ", i, "\t| ", spec_print(val),
"\t| ", spec_print(grad_),
"\t| ", test_threshold, "\r")
}
collect_val <- c(collect_val, val)
gradients <- c(gradients, grad_)
}
}
opt_ep <- val
opt_risk <- obj_fun(opt_ep)
if(setParameter$print_step){
cat(rep("-", 55), sep = "")
if(grad_ == 0){
cat("\n Stopped| ", spec_print(val),
"\t| ", 0,
"\t\t| ", test_threshold)
}else{
cat("\n Stopped| ", spec_print(val),
"\t| ", spec_print(grad_),
"\t| ", test_threshold)
}
}
if(setParameter$print_result){
cat("\n ~ Observed parameter:", opt_ep, " in",i, "iterations.")
}
if (setParameter$figure) {
siz <- length(collect_val)
tibble(x = 1:siz,
y = collect_val,
gradient = gradients) %>%
ggplot(aes(x = x, y = y)) +
geom_line(mapping = aes(color = gradient), size = 1) +
geom_point(aes(x = length(x), y = opt_ep), color = "red") +
geom_hline(yintercept = opt_ep, color = "red", linetype = "longdash") +
labs(title = "Gradient steps",
x = "Iteration",
y = "Parameter")-> p
print(p)
}
end.time = Sys.time()
return(list(
opt_param = opt_ep,
opt_error = opt_risk,
all_grad = gradients,
all_param = collect_val,
run_time = difftime(end.time,
start.time,
units = "secs")[[1]]
))
}
Example.2: Approximate \[x^*=\text{arg}\min_{x\in\mathbb{R}}f(x), \text{ where }f(x)=(x-1)^2(1+\sin^2(2.5(x-1))).\] Note that the argument val_init is crucial since \(f\) is not convex.
object_func <- function(x) (x-1)^2*(1+sin(2.5*(x-1))^2)
tibble::tibble(x = seq(-4,6, length.out = 200),
y = object_func(seq(-4,6, length.out = 200))) %>%
ggplot(aes(x = x, y = y)) +
geom_line(color = "skyblue", size = 0.75) +
labs(title = latex2exp::TeX("Cost function $f(x)=(x-1)^2(1+\\sin^2(2.5(x-1)))$")) -> p
print(p)
gd <- gradOptimizer(obj_fun = object_func,
setParameter = setGradParameter(val_init = 2.4,
rate = 1,
print_step = FALSE,
figure = TRUE))
##
## ~ Observed parameter: 1 in 59 iterations.
setGridParameter
This function allows us to set the values of the parameters of the grid search algorithm.

Argument:

- min_val: the minimum value of the parameter grid.
- max_val: the maximum value of the parameter grid.
- n_val: the number of points in the grid.
- parameters: the vector of parameters in case a non-uniform grid is considered.
- print_result: a logical value specifying whether or not to print the observed result.
- figure: a logical value specifying whether or not to plot the graphic of the error.

Value:
This function returns a list of all the parameters given in its arguments.
setGridParameter <- function(min_val = 1e-4,
max_val = 0.1,
n_val = 300,
parameters = NULL,
print_result = TRUE,
figure = TRUE){
return(list(min_val = min_val,
max_val = max_val,
n_val = n_val,
parameters = parameters,
print_result = print_result,
figure = figure))
}
gridOptimizer
This function performs the grid search algorithm to approximate the minimizer of a given function over a grid of parameter values.

Argument:

- obj_fun: the objective function whose minimizer is to be estimated. It should be a univariate function of a real positive variable.
- setParameter: the control of the grid search parameters, which should be the function setGridParameter() defined above.

Value:

This function returns a list of the following objects:

- opt_param: the observed value of the minimizer.
- opt_error: the value of the optimal risk.
- all_risk: the vector of all the errors evaluated at all the values of the considered parameters.
- run.time: the running time of the algorithm.

gridOptimizer <- function(obj_fun,
setParameter = setGridParameter()){
t0 <- Sys.time()
if(is.null(setParameter$parameters)){
param <- seq(setParameter$min_val,
setParameter$max_val,
length.out = setParameter$n_val)
} else{
param <- setParameter$parameters
}
risk <- param %>%
map_dbl(.f = obj_fun)
id_opt <- which.min(risk)
opt_ep <- param[id_opt]
opt_risk <- risk[id_opt]
if(setParameter$print_result){
cat("\n* Grid search algorithm...", "\n ~ Observed parameter :", opt_ep)
}
if(setParameter$figure){
tibble(x = param,
y = risk) %>%
ggplot(aes(x = x, y = y)) +
geom_line(color = "skyblue", size = 0.75) +
geom_point(aes(x = opt_ep, y = opt_risk), color = "red") +
geom_vline(xintercept = opt_ep, color = "red", linetype = "longdash") +
labs(title = "Error as function of parameter",
x = "Parameter",
y = "Error")-> p
print(p)
}
t1 <- Sys.time()
return(list(
opt_param = opt_ep,
opt_error = opt_risk,
all_risk = risk,
run.time = difftime(t1,
t0,
units = "secs")[[1]])
)
}
Example.2 (continued): The same minimizer is approximated again, this time with grid search.
grid <- gridOptimizer(obj_fun = object_func,
setParameter = setGridParameter(min_val = -3,
max_val = 5,
n_val = 500,
figure = TRUE))
##
## * Grid search algorithm...
## ~ Observed parameter : 1.008016
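A non-uniform grid can also be supplied through the parameters argument instead of (min_val, max_val, n_val); the grid below is an arbitrary illustration on the same cost function, denser around the minimizer observed above.
# Hypothetical non-uniform grid: finer spacing on [0, 2]
grid_nu <- gridOptimizer(obj_fun = object_func,
                         setParameter = setGridParameter(parameters = c(seq(-3, 0, by = 0.1),
                                                                        seq(0, 2, by = 0.01),
                                                                        seq(2, 5, by = 0.1)),
                                                         figure = FALSE))
grid_nu$opt_param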
Constructing the aggregation method is equivalent to approximating the optimal value of the smoothing parameter \(h>0\) introduced in section 1.1 by minimizing some loss function. In this study, we propose the \(\kappa\)-fold cross-validation loss function defined by
\[\begin{equation} \label{eq:kappa} \varphi^{\kappa}(h)=\frac{1}{\kappa}\sum_{k=1}^{\kappa}\sum_{(x_j,y_j)\in F_k}(g_n({\bf r}(x_j))-y_j)^2 \end{equation}\] where \(F_k\) denotes the \(k\)th validation fold and, for each \(k\), the aggregation \(g_n\) is computed using the data points of the remaining folds \(\mathcal{D}_{\ell}-F_k\).
dist_matrix
This function computes different distances between the data points of each training fold (\(\mathcal{D}_{\ell}-F_k\)) and the corresponding validation fold \(F_k\) for any \(k=1,\dots,\kappa\). The distance matrices are \(D_k=(d[{\bf r}(x_i),{\bf r}(x_j)])_{i,j}\) for \(k=1,\dots,\kappa\), where the distance \(d\) is defined according to the type of kernel function.

Argument:

- basicMachines: the basic machine object, which is an output of the generateMachines function.
- n_cv: the number \(\kappa\) of cross-validation folds. By default, n_cv = 5.
- kernel: the kernel function used for the aggregation, which is an element of {“gaussian”, “epanechnikov”, “biweight”, “triweight”, “triangular”, “naive”, “c.expo”, “expo”}. By default, kernel = "gaussian".
- id_shuffle: an integer vector of length equal to the size of the remaining part of the training data (\({\cal D}_{\ell}\)). Its elements are from {\(1,...,\kappa\)}, indicating the fold membership of the data points. By default, id_shuffle = NULL and the data points are shuffled randomly.

Value:

This function returns a list of the following objects:

- dist: a list of n_cv distance matrices (tibble) \(D_k=(d[{\bf r}(x_i),{\bf r}(x_j)])_{i,j}\), for \(k=1,\dots,\kappa\), containing the distances between the data points of the \(\kappa-1\) remaining training folds (by row) and the corresponding validation fold (by column). The type of distance depends on the kernel used:
  - if kernel = naive, the distance matrices contain the max-norm distance between data points, i.e., \[D_k=(\|{\bf r}(x_i)-{\bf r}(x_j)\|_{\max})_{i,j}\text{ for }k=1,\dots,\kappa;\]
  - if kernel = triangular, the distance matrices contain the \(L_1\) distance between data points, i.e., \[D_k=(\|{\bf r}(x_i)-{\bf r}(x_j)\|_1)_{i,j}\text{ for }k=1,\dots,\kappa;\]
  - for all the other kernels, the squared Euclidean distance is used.
- id_shuffle: the shuffled indices used in cross-validation.
- n_cv: the number \(\kappa\) of cross-validation folds.

dist_matrix <- function(basicMachines,
n_cv = 5,
kernel = "gaussian",
id_shuffle = NULL){
n <- nrow(basicMachines$predict2)
n_each_fold <- floor(n/n_cv)
# shuffled indices
if(is.null(id_shuffle)){
shuffle <- 1:(n_cv-1) %>%
rep(n_each_fold) %>%
c(., rep(n_cv, n - n_each_fold * (n_cv - 1))) %>%
sample
}else{
shuffle <- id_shuffle
}
# the prediction matrix D_l
df_ <- as.matrix(basicMachines$predict2)
if(! (kernel %in% c("naive", "triangular"))){
pair_dist <- function(M, N){
n_N <- dim(N)
n_M <- dim(M)
res_ <- 1:nrow(N) %>%
map_dfc(.f = (\(id) tibble('{{id}}' := as.vector(rowSums((M - matrix(rep(N[id,], n_M[1]), ncol = n_M[2], byrow = TRUE))^2)))))
return(res_)
}
}
if(kernel == "triangular"){
pair_dist <- function(M, N){
n_N <- dim(N)
n_M <- dim(M)
res_ <- 1:nrow(N) %>%
map_dfc(.f = (\(id) tibble('{{id}}' := as.vector(rowSums(abs(M - matrix(rep(N[id,], n_M[1]), ncol = n_M[2], byrow = TRUE)))))))
return(res_)
}
}
if(kernel == "naive"){
pair_dist <- function(M, N){
n_N <- dim(N)
n_M <- dim(M)
res_ <- 1:nrow(N) %>%
map_dfc(.f = (\(id) tibble('{{id}}' := as.vector(apply(abs(M - matrix(rep(N[id,], n_M[1]), ncol = n_M[2], byrow = TRUE)), 1, max)))))
return(res_)
}
}
L <- 1:n_cv %>%
map(.f = (\(x) pair_dist(df_[shuffle != x,],
df_[shuffle == x,])))
return(list(dist = L,
id_shuffle = shuffle,
n_cv = n_cv))
}
Example.3: The dist_matrix function is applied to the basic machines built in Example.1, with the Gaussian kernel function.
dis <- dist_matrix(basicMachines = basic_machines,
n_cv = 3,
kernel = "gaussian")
Example.4: From the distance matrices, we can compute the cross-validation error corresponding to the Gaussian kernel function, then use both optimization methods to approximate the smoothing parameter in this case.
# Gaussian kernel
gaussian_kern <- function(.ep = .05,
.dist_matrix,
.train_response2){
kern_fun <- function(x, id, D){
tem0 <- as.matrix(exp(-(x*D)))
y_hat <- .train_response2[.dist_matrix$id_shuffle != id] %*% tem0/colSums(tem0)
return(sum((y_hat - .train_response2[.dist_matrix$id_shuffle == id])^2))
}
temp <- map2(.x = 1:.dist_matrix$n_cv,
.y = .dist_matrix$dist,
.f = ~ kern_fun(x = .ep, id = .x, D = .y))
return(Reduce("+", temp))
}
# Kappa cross-validation error
cost_fun <- function(.ep = .05,
.dist_matrix,
.kernel_func = NULL,
.train_response2){
return(.kernel_func(.ep = .ep,
.dist_matrix = .dist_matrix,
.train_response2 = .train_response2))
}
# cross_validation
err_cv <- function(x,
.dist_matrix = dis,
.kernel_func = gaussian_kern,
.train_response2 = df[basic_machines$id2, 14]) {
res <- cost_fun(.ep = x,
.dist_matrix = .dist_matrix,
.kernel_func = .kernel_func,
.train_response2 = .train_response2)
return(res)
}
# Optimization
opt_param_gd <- gradOptimizer(obj_fun = err_cv,
setParameter = setGradParameter(min_val = 0.0001,
max_val = 1,
rate = 0.007,
n_tries = 10,
print_step = FALSE,
print_result = TRUE,
figure = TRUE))
##
## ~ Observed parameter: 0.009389028 in 240 iterations.
# Optimization
opt_param_grid <- gridOptimizer(obj_fun = err_cv,
setParameter = setGridParameter(min_val = 0.0001,
max_val = 0.1,
n_val = 500,
figure = TRUE))
##
## * Grid search algorithm...
## ~ Observed parameter : 0.009309218
cat('* Observed parameter:\n\t - Gradient descent\t:',
opt_param_gd$opt_param,
'\t with error:', err_cv(opt_param_gd$opt_param),
'\n\t - Grid search\t\t:',opt_param_grid$opt_param,
'\t with error:', err_cv(opt_param_grid$opt_param))
## * Observed parameter:
## - Gradient descent : 0.009389028 with error: 2379.61
## - Grid search : 0.009309218 with error: 2379.624
fit_parameter
This function gathers the constructed machines and performs an optimization algorithm to approximate the smoothing parameter of the aggregation.
Argument:

- train_design: a matrix or data frame of the training input data, or of the predictions given by some predictors:
  - if it is the input data, build_machine must be TRUE and all the chosen machines will be constructed using a proportion of the training input (splits);
  - if it is a matrix of predictions, build_machine must be FALSE, which indicates that no basic machine is constructed and the parameter is tuned using these predicted features directly.
- train_response: a vector of the response variable corresponding to the train_design.
- scale_input: a logical value specifying whether or not to scale the input data before building the basic regression predictors. By default, scale_input = FALSE.
- scale_machine: a logical value specifying whether or not to scale the predicted features given by all the basic regression predictors before the aggregation. By default, scale_machine = FALSE.
- build_machine: a logical value specifying whether or not the basic machines should be constructed. It should be TRUE if train_design is the input data, and FALSE if train_design contains the predicted features. By default, build_machine = TRUE.
- machines: a vector of basic machines to be constructed. It must be a subset of {“lasso”, “ridge”, “knn”, “tree”, “rf”, “xgb”}. By default, machines = NULL and all six types of basic machines are built.
- splits: the proportion of training data used to build the basic machines. By default, splits = .5.
- n_cv: the number of cross-validation folds used to tune the smoothing parameter.
- sig, alp: the support constant \(\sigma>0\) and the exponent \(\alpha>0\) of the compactly supported exponential kernel: \(K(x)=e^{-\|x\|^{\alpha}}\mathbb{1}_{\{\|x\|\leq\sigma\}}\) for any \(x\in\mathbb{R}^d\). By default, sig = 3 and alp = 2.
- kernels: the kernel function or vector of kernel functions used for the aggregation.
- optimizeMethod: the optimization methods used to learn the smoothing parameter. By default, optimizeMethod = "grad", which stands for the gradient descent algorithm; if optimizeMethod = "grid", the grid search algorithm is used. Note that this should be of the same length as the kernels argument, otherwise the "grid" method will be used for all the kernels.
- setBasicMachineParam: an option used to set the values of the parameters of the basic machines. The setBasicParameter function should be fed to this argument.
- setGradParam: an option used to set the values of the parameters of the gradient descent algorithm. The setGradParameter function should be fed to it.
- setGridParam: an option used to set the values of the parameters of the grid search algorithm. The setGridParameter function should be fed to it.

Value:

This function returns a list of the following objects:

- opt_parameters: the observed optimal parameters.
- add_parameters: other additional parameters such as the scaling options, the parameters of the kernel functions and the optimization methods used.
- basic_machines: the basic machine object.
- .scale: the maximum and minimum scaling values of the input data.

fit_parameter <- function(train_design,
train_response,
scale_input = FALSE,
scale_machine = FALSE,
build_machine = TRUE,
machines = NULL,
splits = 0.5,
n_cv = 5,
sig = 3,
alp = 2,
kernels = "gaussian",
optimizeMethod = "grad",
setBasicMachineParam = setBasicParameter(),
setGradParam = setGradParameter(),
setGridParam = setGridParameter()){
kernels_lookup <- c("gaussian", "epanechnikov", "biweight", "triweight", "triangular", "naive", "c.expo", "expo")
kernel_real <- kernels %>%
sapply(FUN = function(x) return(match.arg(x, kernels_lookup)))
if(build_machine){
mach2 <- generateMachines(train_input = train_design,
train_response = train_response,
scale_input = scale_input,
machines = machines,
splits = splits,
basicMachineParam = setBasicMachineParam)
} else{
mach2 <- list(predict2 = train_design,
models = colnames(train_design),
id2 = rep(TRUE, nrow(train_design)),
train_data = list(train_response = train_response))
}
maxs_ <- mins_ <- NULL
if(scale_machine){
maxs_ <- map_dbl(.x = mach2$predict2, .f = max)
mins_ <- map_dbl(.x = mach2$predict2, .f = min)
mach2$predict2 <- scale(mach2$predict2,
center = mins_,
scale = maxs_ - mins_)
}
# distance matrix to compute loss function
if_euclid <- FALSE
id_euclid <- NULL
n_ker <- length(kernels)
dist_all <- list()
id_shuf <- NULL
for (k_ in 1:n_ker) {
ker <- kernel_real[k_]
if(ker == "naive"){
dist_all[["naive"]] <- dist_matrix(basicMachines = mach2,
n_cv = n_cv,
kernel = "naive",
id_shuffle = id_shuf)
} else{
if(ker == "triangular"){
dist_all[["triangular"]] <- dist_matrix(basicMachines = mach2,
n_cv = n_cv,
kernel = "triangular",
id_shuffle = id_shuf)
} else{
if(if_euclid){
dist_all[[ker]] <- dist_all[[id_euclid]]
} else{
dist_all[[ker]] <- dist_matrix(basicMachines = mach2,
n_cv = n_cv,
kernel = ker,
id_shuffle = id_shuf)
id_euclid <- ker
if_euclid <- TRUE
}
}
}
id_shuf <- dist_all[[1]]$id_shuffle
}
# Kernel functions
# ================
# expo
expo_kernel <- function(.ep = .05,
.dist_matrix,
.train_response2,
.alpha = alp){
kern_fun <- function(x, id, D){
tem0 <- as.matrix(exp(- (x*D)^.alpha))
y_hat <- .train_response2[.dist_matrix$id_shuffle != id] %*% tem0/colSums(tem0)
return(sum((y_hat - .train_response2[.dist_matrix$id_shuffle == id])^2))
}
temp <- map2(.x = 1:.dist_matrix$n_cv,
.y = .dist_matrix$dist,
.f = ~ kern_fun(x = .ep, id = .x, D = .y))
return(Reduce("+", temp))
}
# C_expo
c.expo_kernel <- function(.ep = .05,
.dist_matrix,
.train_response2,
.sigma = sig,
.alpha = alp){
kern_fun <- function(x, id, D){
tem0 <- x*D
tem0[tem0 < .sigma] <- 0
tem1 <- as.matrix(exp(- tem0^.alpha))
y_hat <- .train_response2[.dist_matrix$id_shuffle != id] %*% tem1/colSums(tem1)
y_hat[is.na(y_hat)] <- 0
return(sum((y_hat - .train_response2[.dist_matrix$id_shuffle == id])^2))
}
temp <- map2(.x = 1:.dist_matrix$n_cv,
.y = .dist_matrix$dist,
.f = ~ kern_fun(x = .ep, id = .x, D = .y))
return(Reduce("+", temp))
}
# Gaussian
gaussian_kernel <- function(.ep = .05,
.dist_matrix,
.train_response2){
kern_fun <- function(x, id, D){
tem0 <- as.matrix(exp(- (x*D)/2))
y_hat <- .train_response2[.dist_matrix$id_shuffle != id] %*% tem0/colSums(tem0)
return(sum((y_hat - .train_response2[.dist_matrix$id_shuffle == id])^2))
}
temp <- map2(.x = 1:.dist_matrix$n_cv,
.y = .dist_matrix$dist,
.f = ~ kern_fun(x = .ep, id = .x, D = .y))
return(Reduce("+", temp))
}
# Epanechnikov
epanechnikov_kernel <- function(.ep = .05,
.dist_matrix,
.train_response2){
kern_fun <- function(x, id, D){
tem0 <- as.matrix(1- x*D)
tem0[tem0 < 0] = 0
y_hat <- .train_response2[.dist_matrix$id_shuffle != id] %*% tem0/colSums(tem0)
return(sum((y_hat - .train_response2[.dist_matrix$id_shuffle == id])^2))
}
temp <- map2(.x = 1:.dist_matrix$n_cv,
.y = .dist_matrix$dist,
.f = ~ kern_fun(x = .ep, id = .x, D = .y))
return(Reduce("+", temp))
}
# Biweight
biweight_kernel <- function(.ep = .05,
.dist_matrix,
.train_response2){
kern_fun <- function(x, id, D){
tem0 <- as.matrix(1- x*D)
tem0[tem0 < 0] = 0
y_hat <- .train_response2[.dist_matrix$id_shuffle != id] %*% tem0^2/colSums(tem0^2)
y_hat[is.na(y_hat)] <- 0
return(sum((y_hat - .train_response2[.dist_matrix$id_shuffle == id])^2))
}
temp <- map2(.x = 1:.dist_matrix$n_cv,
.y = .dist_matrix$dist,
.f = ~ kern_fun(x = .ep, id = .x, D = .y))
return(Reduce("+", temp))
}
# Triweight
triweight_kernel <- function(.ep = .05,
.dist_matrix,
.train_response2){
kern_fun <- function(x, id, D){
tem0 <- as.matrix(1- x*D)
tem0[tem0 < 0] = 0
y_hat <- .train_response2[.dist_matrix$id_shuffle != id] %*% tem0^3/colSums(tem0^3)
y_hat[is.na(y_hat)] <- 0
return(sum((y_hat - .train_response2[.dist_matrix$id_shuffle == id])^2))
}
temp <- map2(.x = 1:.dist_matrix$n_cv,
.y = .dist_matrix$dist,
.f = ~ kern_fun(x = .ep, id = .x, D = .y))
return(Reduce("+", temp))
}
# Triangular
triangular_kernel <- function(.ep = .05,
.dist_matrix,
.train_response2){
kern_fun <- function(x, id, D){
tem0 <- as.matrix(1- x*D)
tem0[tem0 < 0] <- 0
y_hat <- .train_response2[.dist_matrix$id_shuffle != id] %*% tem0/colSums(tem0)
y_hat[is.na(y_hat)] = 0
return(sum((y_hat - .train_response2[.dist_matrix$id_shuffle == id])^2))
}
temp <- map2(.x = 1:.dist_matrix$n_cv,
.y = .dist_matrix$dist,
.f = ~ kern_fun(x = .ep, id = .x, D = .y))
return(Reduce("+", temp))
}
# Naive
naive_kernel <- function(.ep = .05,
.dist_matrix,
.train_response2){
kern_fun <- function(x, id, D){
tem0 <- (as.matrix(x*D) < 1)
y_hat <- .train_response2[.dist_matrix$id_shuffle != id] %*% tem0/colSums(tem0)
y_hat[is.na(y_hat)] = 0
return(sum((y_hat - .train_response2[.dist_matrix$id_shuffle == id])^2))
}
temp <- map2(.x = 1:.dist_matrix$n_cv,
.y = .dist_matrix$dist,
.f = ~ kern_fun(x = .ep, id = .x, D = .y))
return(Reduce("+", temp))
}
# error function
error_cv <- function(x,
.dist_matrix = NULL,
.kernel_func = NULL,
.train_response2 = NULL){
res <- .kernel_func(.ep = x,
.dist_matrix = .dist_matrix,
.train_response2 = .train_response2)
return(res/n_cv)
}
# list of kernel functions
list_funs <- list(gaussian = gaussian_kernel,
epanechnikov = epanechnikov_kernel,
biweight = biweight_kernel,
triweight = triweight_kernel,
triangular = triangular_kernel,
naive = naive_kernel,
expo = expo_kernel,
c.expo = c.expo_kernel)
# error for all kernel functions
error_func <- kernel_real %>%
map(.f = ~ (\(x) error_cv(x,
.dist_matrix = dist_all[[.x]],
.kernel_func = list_funs[[.x]],
.train_response2 = train_response[mach2$id2])))
names(error_func) <- kernels
# list of prameter setup
list_param <- list(grad = setGradParam,
GD = setGradParam,
grid = setGridParam)
# list of optimizer
list_optimizer <- list(grad = gradOptimizer,
GD = gradOptimizer,
grid = gridOptimizer)
optMethods <- optimizeMethod
if(length(kernels) != length(optMethods)){
warning("* kernels and optimization methods differ in sides! Grid search will be used!")
optMethods = rep("grid", length(kernels))
}
# Optimization
parameters <- map2(.x = kernels,
.y = optMethods,
.f = ~ list_optimizer[[.y]](obj_fun = error_func[[.x]],
setParameter = list_param[[.y]]))
names(parameters) <- paste0(kernel_real, "_", optMethods)
return(list(opt_parameters = parameters,
add_parameters = list(scale_input = scale_input,
scale_machine = scale_machine,
sig = sig,
alp = alp,
opt_methods = optimizeMethod),
basic_machines = mach2,
.scale = list(min = mins_,
max = maxs_)))
}
Example.5: We approximate the smoothing parameter on the Boston data.
df <- MASS::Boston
train <- logical(nrow(df))
train[sample(length(train), floor(0.75*nrow(df)))] <- TRUE
param <- fit_parameter(train_design = df[train, 1:13],
train_response = df[train, 14],
machines = c("knn", "rf", "xgb"),
splits = .65,
n_cv = 3,
kernels = c("gaussian", "gaussian", "biweight", "triangular"),
optimizeMethod = c("grad", "grid", "grid", "grid"),
setBasicMachineParam = setBasicParameter(k = 2:5,
ntree = c(1,3,5)*100,
nrounds_xgb = c(1,3,5)*100),
setGradParam = setGradParameter(rate = 0.001))
##
## * Building basic machines ...
## ~ Progress: ... 33% ... 67% ... 100%
## * Gradient descent algorithm ...
## Step | Parameter | Gradient | Threshold
## ---------------------------------------------------
## 0 | 0.033400 | 64.605213 | 1e-10
## ---------------------------------------------------
## 1 | 0.032400 | -4.13601e-01 | 64.60521
## 2 | 0.0324064 | 0.0201676 | 0.4136014
## 3 | 0.0324061 | -9.77896e-04 | 0.0201676
## 4 | 0.0324061 | 4.74145e-05 | 0.0009778963
## 5 | 0.0324061 | -2.28108e-06 | 4.741446e-05
## 6 | 0.0324061 | 1.07952e-07 | 2.281076e-06
## 7 | 0.0324061 | -9.38714e-09 | 1.079521e-07
## 8 | 0.0324061 | -1.87743e-08 | 9.387143e-09
## 9 | 0.0324061 | 9.38714e-09 | 1.877429e-08
## 10 | 0.0324061 | 0e+00 | 9.387143e-09
## -------------------------------------------------------
## Stopped| 0.0324061 | 0 | 0
## ~ Observed parameter: 0.0324061 in 10 iterations.
##
## * Grid search algorithm...
## ~ Observed parameter : 0.03250903
##
## * Grid search algorithm...
## ~ Observed parameter : 0.002104682
##
## * Grid search algorithm...
## ~ Observed parameter : 0.01780803
param$opt_parameters %>%
map_dfc(.f = ~ .x$opt_param) %>%
print
## # A tibble: 1 × 4
## gaussian_grad gaussian_grid biweight_grid triangular_grid
## <dbl> <dbl> <dbl> <dbl>
## 1 0.0324 0.0325 0.00210 0.0178
The smoothing parameter obtained from the previous section can be used to make the final predictions.
kernel_pred
Several types of kernel functions used for the aggregation are defined in this section.

Argument:

- epsilon: the value of the smoothing parameter used.
- .y2: the vector of the response variable of the second part \(\mathcal{D}_{\ell}\) of the training data, which is used for the aggregation.
- .distance: the distance matrix object obtained from the dist_matrix function.
- .kern: the string specifying the kernel function. By default, .kern = "gaussian".
- .sig, .alp: the parameters of the exponential kernel function.
- .meth: the string of the optimization method, appended to the name of the kernel function if one kernel is used more than once.

Value:

This function returns the predictions of the aggregation method evaluated with the given parameter.
kernel_pred <- function(epsilon,
.y2,
.distance,
.kern = "gaussian",
.sig = sqrt(.5),
.alp = 2,
.meth = NA){
dis <- as.matrix(.distance)
# Kernel functions
# ================
expo_kernel <- function(.ep,
.alpha = .alp){
tem0 <- exp(- (.ep*dis)^.alpha)
y_hat <- .y2 %*% tem0/colSums(tem0)
return(t(y_hat))
}
c.expo_kernel <- function(.ep,
.sigma = .sig,
.alpha = .alp){
tem0 <- .ep*dis
tem0[tem0 < .sigma] <- 0
tem1 <- exp(- tem0^.alpha)
y_hat <- .y2 %*% tem1/colSums(tem1)
y_hat[is.na(y_hat)] <- 0
return(t(y_hat))
}
gaussian_kernel <- function(.ep){
tem0 <- exp(- (.ep*dis)/2)
y_hat <- .y2 %*% tem0/colSums(tem0)
return(t(y_hat))
}
# Epanechnikov
epanechnikov_kernel <- function(.ep){
tem0 <- 1- .ep*dis
tem0[tem0 < 0] = 0
y_hat <- .y2 %*% tem0/colSums(tem0)
return(t(y_hat))
}
# Biweight
biweight_kernel <- function(.ep){
tem0 <- 1- .ep*dis
tem0[tem0 < 0] = 0
y_hat <- .y2 %*% tem0^2/colSums(tem0^2)
y_hat[is.na(y_hat)] <- 0
return(t(y_hat))
}
# Triweight
triweight_kernel <- function(.ep){
tem0 <- 1- .ep*dis
tem0[tem0 < 0] = 0
y_hat <- .y2 %*% tem0^3/colSums(tem0^3)
y_hat[is.na(y_hat)] <- 0
return(t(y_hat))
}
# Triangular
triangular_kernel <- function(.ep){
tem0 <- 1- .ep*dis
tem0[tem0 < 0] <- 0
y_hat <- .y2 %*% tem0/colSums(tem0)
y_hat[is.na(y_hat)] = 0
return(t(y_hat))
}
# Naive
naive_kernel <- function(.ep){
tem0 <- (.ep*dis < 1)
y_hat <- .y2 %*% tem0/colSums(tem0)
y_hat[is.na(y_hat)] = 0
return(t(y_hat))
}
# Prediction
kernel_list <- list(gaussian = gaussian_kernel,
epanechnikov = epanechnikov_kernel,
biweight = biweight_kernel,
triweight = triweight_kernel,
triangular = triangular_kernel,
naive = naive_kernel,
expo = expo_kernel,
c.expo = c.expo_kernel)
res <- tibble(as.vector(kernel_list[[.kern]](.ep = epsilon)))
names(res) <- ifelse(is.na(.meth),
.kern,
paste0(.kern, '_', .meth))
return(res)
}
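A quick sanity check of kernel_pred on made-up numbers (y2_toy and dist_toy are hypothetical toy objects): rows of the distance matrix index the points of \(\mathcal{D}_{\ell}\) and columns index the test points, with squared Euclidean distances as expected by the Gaussian kernel.
# Toy check: 4 points in D_ell, 2 test points, Gaussian kernel
set.seed(3)
y2_toy   <- c(1, 2, 3, 4)
dist_toy <- matrix(runif(8), nrow = 4, ncol = 2)   # squared Euclidean distances
kernel_pred(epsilon = 0.5,
            .y2 = y2_toy,
            .distance = dist_toy,
            .kern = "gaussian")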
predict_agg
Argument:

- fitted_models: the object obtained from the fit_parameter function.
- new_data: the new testing data to be predicted.
- test_response: the testing response variable; it is optional. If it is given, the mean square error on the testing data is also computed. By default, test_response = NULL.

Value:

This function returns a list of the following objects:

- fitted_aggregate: the predictions by the aggregation methods.
- fitted_machine: the predictions given by all the basic machines.
- mse: the mean square error, computed only if the test_response argument is not NULL.

# Prediction
predict_agg <- function(fitted_models,
new_data,
test_response = NULL){
opt_param <- fitted_models$opt_parameters
add_param <- fitted_models$add_parameters
basic_mach <- fitted_models$basic_machines
kern0 <- names(opt_param)
kernel0 <- stringr::str_split(kern0, "_") %>%
imap_dfc(.f = ~ tibble("{.y}" := .x))
kerns <- kernel0[1,] %>%
as.character
opt_meths <- kernel0[2,] %>%
as.character
new_data_ <- new_data
# if basic machines are built
if(is.list(basic_mach$models)){
mat_input <- as.matrix(basic_mach$train_data$train_input)
if(add_param$scale_input){
new_data_ <- scale(new_data,
center = basic_mach$scale_min,
scale = basic_mach$scale_max - basic_mach$scale_min)
}
if(is.matrix(new_data_)){
mat_test <- new_data_
df_test <- as_tibble(new_data_)
} else {
mat_test <- as.matrix(new_data_)
df_test <- new_data_
}
# Prediction test by basic machines
built_models <- basic_mach$models
pred_test <- function(meth){
if(meth == "knn"){
pre <- 1:length(built_models[[meth]]) %>%
map_dfc(.f = (\(k) tibble('{{k}}' := FNN::knn.reg(train = mat_input[!basic_mach$id2,],
test = mat_test,
y = basic_mach$train_data$train_response[!basic_mach$id2],
k = built_models[[meth]][[k]])$pred)))
}
if(meth %in% c("tree", "rf")){
pre <- 1:length(built_models[[meth]]) %>%
map_dfc(.f = (\(k) tibble('{{k}}' := as.vector(predict(built_models[[meth]][[k]], df_test)))))
}
if(meth %in% c("lasso", "ridge")){
pre <- 1:length(built_models[[meth]]) %>%
map_dfc(.f = (\(k) tibble('{{k}}' := as.vector(predict.glmnet(built_models[[meth]][[k]], mat_test)))))
}
if(meth == "xgb"){
pre <- 1:length(built_models[[meth]]) %>%
map_dfc(.f = (\(k) tibble('{{k}}' := as.vector(predict(built_models[[meth]][[k]], mat_test)))))
}
colnames(pre) <- names(built_models[[meth]])
return(pre)
}
pred_test_all <- names(built_models) %>%
map_dfc(.f = pred_test)
} else{
pred_test_all <- new_data_
}
pred_test0 <- pred_test_all
if(add_param$scale_machine){
pred_test_all <- scale(pred_test_all,
center = fitted_models$.scale$min,
scale = fitted_models$.scale$max - fitted_models$.scale$min)
}
# Prediction train2
pred_train_all <- basic_mach$predict2
colnames(pred_test_all) <- colnames(pred_train_all)
d_train <- dim(pred_train_all)
d_test <- dim(pred_test_all)
pred_test_mat <- as.matrix(pred_test_all)
pred_train_mat <- as.matrix(pred_train_all)
# Distance matrix
dist_mat <- function(kernel = "gaussian"){
if(!(kernel %in% c("naive", "triangular"))){
res_ <- 1:d_test[1] %>%
map_dfc(.f = (\(id) tibble('{{id}}' := as.vector(rowSums((pred_train_mat - matrix(rep(pred_test_mat[id,],
d_train[1]),
ncol = d_train[2],
byrow = TRUE))^2)))))
}
if(kernel == "triangular"){
res_ <- 1:d_test[1] %>%
map_dfc(.f = (\(id) tibble('{{id}}' := as.vector(rowSums(abs(pred_train_mat - matrix(rep(pred_test_mat[id,],
d_train[1]),
ncol = d_train[2],
byrow = TRUE)))))))
}
if(kernel == "naive"){
res_ <- 1:d_test[1] %>%
map_dfc(.f = (\(id) tibble('{{id}}' := as.vector(apply(abs(pred_train_mat - matrix(rep(pred_test_mat[id,], d_train[1]),
ncol = d_train[2],
byrow = TRUE)), 1, max)))))
}
return(res_)
}
dists <- 1:length(kerns) %>%
map(.f = ~ dist_mat(kerns[.x]))
tab_nam <- table(kerns)
nam <- names(tab_nam[tab_nam > 1])
vec <- rep(NA, length(kerns))
for(id in nam){
id_ <- kerns == id
if(!is.null(id_)){
vec[id_] = add_param$opt_methods[id_]
}
}
prediction <- 1:length(kerns) %>%
map_dfc(.f = ~ kernel_pred(epsilon = opt_param[[kern0[.x]]]$opt_param,
.y2 = basic_mach$train_data$train_response[basic_mach$id2],
.distance = dists[[.x]],
.kern = kerns[.x],
.sig = add_param$sig,
.alp = add_param$alp,
.meth = vec[.x]))
if(is.null(test_response)){
return(list(fitted_aggregate = prediction,
fitted_machine = pred_test_all))
} else{
error <- cbind(pred_test_all, prediction) %>%
dplyr::mutate(y_test = test_response) %>%
dplyr::summarise_all(.funs = ~ (. - y_test)) %>%
dplyr::select(-y_test) %>%
dplyr::summarise_all(.funs = ~ mean(.^2))
return(list(fitted_aggregate = prediction,
fitted_machine = pred_test0,
mse = error))
}
}
Example.6: Aggregation on the Boston dataset.
pred <- predict_agg(param,
new_data = df[!train, -14],
test_response = df[!train, 14])
sqrt(pred$mse)
GradientCOBRARegressor
This function puts together all the functions above and provides the desired result of the kernel-based consensual aggregation method.
Argument:

- train_design: a matrix or data frame of the training input data, or of the predictions given by some predictors:
  - if it is the input data, build_machine must be TRUE and all the chosen machines will be constructed using a proportion of the training input (splits);
  - if it is a matrix of predictions, build_machine must be FALSE, which indicates that no basic machine is constructed and the parameter is tuned using these predicted features directly.
- train_response: a vector of the response variable corresponding to the train_design.
- test_design: the matrix or data frame of the testing input data or testing predictions, according to train_design.
- test_response: the testing response variable; it is optional. If it is given, the mean square error on the testing data is also computed. By default, test_response = NULL.
- scale_input: a logical value specifying whether or not to scale the input data before building the basic regression predictors. By default, scale_input = FALSE.
- scale_machine: a logical value specifying whether or not to scale the predicted features given by all the basic regression predictors before the aggregation. By default, scale_machine = FALSE.
- build_machine: a logical value specifying whether or not the basic machines should be constructed. It should be TRUE if train_design is the input data, and FALSE if train_design contains the predicted features. By default, build_machine = TRUE.
- machines: a vector of basic machines to be constructed. It must be a subset of {“lasso”, “ridge”, “knn”, “tree”, “rf”, “xgb”}. By default, machines = NULL and all six types of basic machines are built.
- splits: the proportion of training data used to build the basic machines. By default, splits = .5.
- n_cv: the number of cross-validation folds used to tune the smoothing parameter.
- sig, alp: the support constant \(\sigma>0\) and the exponent \(\alpha>0\) of the compactly supported exponential kernel: \(K(x)=e^{-\|x\|^{\alpha}}\mathbb{1}_{\{\|x\|\leq\sigma\}}\) for any \(x\in\mathbb{R}^d\). By default, sig = 3 and alp = 2.
- kernels: the kernel function or vector of kernel functions used for the aggregation.
- optimizeMethod: the optimization methods used to learn the smoothing parameter. By default, optimizeMethod = "grad", which stands for the gradient descent algorithm; if optimizeMethod = "grid", the grid search algorithm is used. Note that this should be of the same length as the kernels argument, otherwise the "grid" method will be used for all the kernels.
- setBasicMachineParam: an option used to set the values of the parameters of the basic machines. The setBasicParameter function should be fed to this argument.
- setGradParam: an option used to set the values of the parameters of the gradient descent algorithm. The setGradParameter function should be fed to it.
- setGridParam: an option used to set the values of the parameters of the grid search algorithm. The setGridParameter function should be fed to it.

Value:

This function returns a list of the following objects:

- fitted_aggregate: the predictions of the testing data given by the aggregation methods corresponding to all the kernel functions.
- fitted_machine: the predictions of the testing data given by the basic machines.
- pred_train2: the predictions of the second part of the training data \(\mathcal{D}_{\ell}\) by all the basic machines.
- opt_parameter: the observed optimal smoothing parameters.
- mse: the mean square error evaluated on the testing data if the testing response variable is given.
- kernels: the kernel functions used.
- ind_D2: the indices of the second part of the training data used for the aggregation.

GradientCOBRARegressor <- function(train_design,
train_response,
test_design,
test_response = NULL,
scale_input = FALSE,
scale_machine = FALSE,
build_machine = TRUE,
machines = NULL,
splits = 0.5,
n_cv = 5,
sig = 3,
alp = 2,
kernels = "gaussian",
optimizeMethod = "grad",
setBasicMachineParam = setBasicParameter(),
setGradParam = setGradParameter(),
setGridParam = setGridParameter()){
# build machines + tune parameter
fit_mod <- fit_parameter(train_design = train_design,
train_response = train_response,
scale_input = scale_input,
scale_machine = scale_machine,
build_machine = build_machine,
machines = machines,
splits = splits,
n_cv = n_cv,
sig = sig,
alp = alp,
kernels = kernels,
optimizeMethod = optimizeMethod,
setBasicMachineParam = setBasicMachineParam,
setGradParam = setGradParam,
setGridParam = setGridParam)
# prediction
pred <- predict_agg(fitted_models = fit_mod,
new_data = test_design,
test_response = test_response)
return(list(fitted_aggregate = pred$fitted_aggregate,
fitted_machine = pred$fitted_machine,
pred_train2 = fit_mod$basic_machines$predict2,
opt_parameter = fit_mod$opt_parameters,
mse = pred$mse,
kernels = kernels,
ind_D2 = fit_mod$basic_machines$id2))
}
Example.7: A complete aggregation is implemented on the Abalone dataset. Three types of basic machines and two different kernel functions are used.
pacman::p_load(readr)
colname <- c("Type", "LongestShell", "Diameter", "Height", "WholeWeight", "ShuckedWeight", "VisceraWeight", "ShellWeight", "Rings")
df <- readr::read_delim("https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data", col_names = colname, delim = ",", show_col_types = FALSE)
train <- logical(nrow(df))
train[sample(length(train), floor(0.75*nrow(df)))] <- TRUE
agg <- GradientCOBRARegressor(train_design = df[train, 2:8],
train_response = df$Rings[train],
test_design = df[!train, 2:8],
test_response = df$Rings[!train],
machines = c("knn", "rf", "xgb"),
splits = .5,
n_cv = 3,
kernels = c("gaussian","triangular"),
optimizeMethod = c("grad", "grid"),
setBasicMachineParam = setBasicParameter(k = 2:10,
ntree = 1:5*100,
nrounds_xgb = 1:5*100),
setGradParam = setGradParameter(rate = 0.07,
print_step = TRUE),
setGridParam = setGridParameter(n_val = 100))
##
## * Building basic machines ...
## ~ Progress: ... 33% ... 67% ... 100%
## * Gradient descent algorithm ...
## Step | Parameter | Gradient | Threshold
## ---------------------------------------------------
## 0 | 0.100000 | -6.42721e+02 | 1e-10
## ---------------------------------------------------
## 1 | 0.170000 | 44.878464 | 642.7214
## 2 | 0.165112 | 17.640025 | 44.87846
## 3 | 0.163191 | 6.480987 | 17.64002
## 4 | 0.162485 | 2.313328 | 6.480987
## 5 | 0.162233 | 0.816705 | 2.313328
## 6 | 0.162144 | 0.287190 | 0.8167048
## 7 | 0.162113 | 0.100847 | 0.2871902
## 8 | 0.162102 | 0.0353948 | 0.1008471
## 9 | 0.162098 | 0.0124206 | 0.03539479
## 10 | 0.162097 | 0.00435826 | 0.01242058
## 11 | 0.162096 | 0.0015292 | 0.004358263
## 12 | 0.162096 | 0.000536757 | 0.001529203
## 13 | 0.162096 | 0.000188231 | 0.0005367569
## 14 | 0.162096 | 6.6123e-05 | 0.000188231
## 15 | 0.162096 | 2.3205e-05 | 6.612304e-05
## 16 | 0.162096 | 8.11049e-06 | 2.320502e-05
## 17 | 0.162096 | 2.92879e-06 | 8.110492e-06
## 18 | 0.162096 | 9.01166e-07 | 2.928789e-06
## 19 | 0.162096 | 3.00389e-07 | 9.011658e-07
## 20 | 0.162096 | 1.12646e-07 | 3.003886e-07
## 21 | 0.162096 | 3.75486e-08 | 1.126457e-07
## 22 | 0.162096 | 3.75486e-08 | 3.754857e-08
## 23 | 0.162096 | 7.50971e-08 | 3.754857e-08
## 24 | 0.162096 | -7.50971e-08 | 7.509715e-08
## 25 | 0.162096 | 7.50971e-08 | 7.509715e-08
## 26 | 0.162096 | 0e+00 | 7.509715e-08
## -------------------------------------------------------
## Stopped| 0.162096 | 0 | 0
## ~ Observed parameter: 0.162096 in 26 iterations.
##
## * Grid search algorithm...
## ~ Observed parameter : 0.03239091
sqrt(agg$mse)
📖 Read also the MixCobra method by Fischer and Mougeot (2019).