🔎 How to download & run the codes?

All the source codes of the aggregation methods are available here. To run the codes, you can clone the repository directly or simply load the R script source file from the repository using the devtools package in RStudio as follows:

  1. Install the devtools package using the command:

    install.packages("devtools")

  2. Load the source codes from the GitHub repository using the source_url function:

    devtools::source_url("https://raw.githubusercontent.com/hassothea/AggregationMethods/main/GradientCOBRARegressor.R")


✎ Note: All codes contained in this Rmarkdown are built with a recent version of R (version \(>\) 4.1, available here) and RStudio (version \(>\) 2022.02.2+485, available here). Note also that the code chunks are hidden by default.

To see the codes, click on the Code button of the corresponding hidden chunk.


1 Consensual aggregation & important packages

1.1 Consensual aggregation

This Rmarkdown provides the implementation of the consensual aggregation method of Has (2021). Let \(\mathcal{D}_n=\{(x_1,y_1),...,(x_n,y_n)\}\) be a training sample of size \(n\), where the input-output couples \((x_i,y_i)\in\mathbb{R}^d\times\mathbb{R}\) for all \(i=1,...,n\). \(\mathcal{D}_{n}\) is first randomly partitioned into \(\mathcal{D}_{k}\) and \(\mathcal{D}_{\ell}\) of sizes \(k\) and \(\ell\) respectively, such that \(k+\ell=n\). We construct \(M\) regression estimators (machines) \(r_1,...,r_M\) using only \(\mathcal{D}_{k}\). Let \({\bf r}(x)=(r_1(x),...,r_M(x))^T\in\mathbb{R}^M\) be the vector of predictions of \(x\in\mathbb{R}^d\). The kernel-based consensual aggregation method evaluated at point \(x\) is defined by

\[\begin{equation} g_n({\bf r}(x))=\frac{\sum_{i=1}^{\ell}y_iK_h({\bf r}(x)-{\bf r}(x_i))}{\sum_{j=1}^{\ell}K_h({\bf r}(x)-{\bf r}(x_j))} \end{equation}\] where \(K:\mathbb{R}^M\to\mathbb{R}_+\) is a non-increasing kernel function with \(K_h(x)=K(x/h)\) for some smoothing parameter \(h>0\) to be tuned, with the convention \(0/0=0\). Even though the formula of \(g_n\) is defined using only data points in \(\mathcal{D}_{\ell}\), it does depend on the whole training data \(\mathcal{D}_{n}\) as the basic machines \((r_i)_{i=1}^M\) are built using \(\mathcal{D}_{k}\).
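To fix ideas, the following toy sketch (not part of the source file; the toy_* objects are made up for illustration) evaluates \(g_n\) at a single point with a Gaussian kernel \(K(u)=e^{-\|u\|^2/2}\):

# Toy sketch of g_n with a Gaussian kernel (hypothetical data, M = 2 machines)
toy_r_Dl <- matrix(c(1.0, 1.2,   # predictions r(x_i) on 4 points of D_ell
                     2.1, 1.9,
                     3.0, 3.2,
                     4.2, 3.9), ncol = 2, byrow = TRUE)
toy_y_Dl <- c(1, 2, 3, 4)        # responses y_i of D_ell
toy_r_x  <- c(2.0, 2.1)          # r(x) for a new point x
h <- 0.5                         # smoothing parameter
w <- exp(-rowSums(sweep(toy_r_Dl, 2, toy_r_x)^2) / (2 * h^2))  # K_h(r(x) - r(x_i))
sum(toy_y_Dl * w) / sum(w)       # g_n(r(x))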

1.2 Important packages

We prepare all the necessary tools for this Rmarkdown. The pacman package allows us to load (if already installed) or install (if not yet installed) any available packages from the Comprehensive R Archive Network (CRAN) of R.

# Check if package "pacman" is already installed 

lookup_packages <- installed.packages()[,1]
if(!("pacman" %in% lookup_packages))
  install.packages("pacman")


# To be installed or loaded
pacman::p_load(magrittr)
pacman::p_load(tidyverse)

## package for "generateMachines"
pacman::p_load(tree)
pacman::p_load(glmnet)
pacman::p_load(randomForest)
pacman::p_load(FNN)
pacman::p_load(xgboost)
pacman::p_load(keras)
pacman::p_load(pracma)
pacman::p_load(latex2exp)
pacman::p_load(plotly)
rm(lookup_packages)

2 Basic Machine generator

This section provides functions to generate basic machines (regressors) to be aggregated.

2.1 Function : setBasicParameter

This function allows us to set the values of some key parameters of the basic machines.

  • Argument:

    • lambda : the penalty parameter \(\lambda\) used in penalized linear models: ridge or lasso.
    • k : the parameter \(k\) of the \(k\)NN (knn) regression model. By default, k = 10.
    • ntree : the number of trees in random forest (rf). By default, ntree = 300.
    • mtry : the number of random features chosen in each split of random forest procedure. By default, mtry = NULL and the default value of mtry of randomForest function from randomForest library is used.
    • eta_xgb : the learning rate \(\eta>0\) in gradient step of extreme gradient boosting method (xgb) of xgboost library.
    • nrounds_xgb : the parameter nrounds indicating the max number of boosting iterations. By default, nrounds_xgb = 100.
    • early_stop_xgb : the early stopping round criterion of the xgboost function. By default, early_stop_xgb = NULL and the early stopping function is not triggered.
    • max_depth_xgb : maximum depth of trees constructed in xgboost.
  • Value:

    This function returns a list of all the parameters given in its arguments, to be fed to the basicMachineParam argument of function generateMachines defined in the next section.


Remark.2: lambda, k and ntree can be a single value or a vector. In other words, each type of model can be constructed several times according to the values of its hyperparameter (a single value or a vector); see the short sketch after the function definition below.


setBasicParameter <- function(lambda = NULL,
                              k = 10, 
                              ntree = 300, 
                              mtry = NULL, 
                              eta_xgb = 1, 
                              nrounds_xgb = 100, 
                              early_stop_xgb = NULL,
                              max_depth_xgb = 3){
  return(list(
    lambda = lambda,
    k = k,
    ntree = ntree, 
    mtry = mtry, 
    eta_xgb = eta_xgb, 
    nrounds_xgb = nrounds_xgb, 
    early_stop_xgb = early_stop_xgb,
    max_depth_xgb = max_depth_xgb)
  )
}
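For instance, a minimal usage sketch (not used elsewhere in this Rmarkdown) illustrating Remark.2, where vectors of hyperparameters yield several machines of the same type:

# Three kNN machines (k = 5, 10, 15) and two random forests (100 and 500 trees)
my_params <- setBasicParameter(k = c(5, 10, 15),
                               ntree = c(100, 500))
str(my_params[c("k", "ntree")])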

2.2 Function : generateMachines

This function generates all the basic machines to be aggregated.

  • Argument:

    • train_input : a matrix or data frame of the training input data.
    • train_response : a vector of training response variable corresponding to the train_input.
    • scale_input : logical value specifying whether to scale the input data (to be between \(0\) and \(1\)) or not. By default, scale_input = FALSE.
    • machines : types of basic machines to be constructed. It is a subset of {“lasso”, “ridge”, “knn”, “tree”, “rf”, “xgb”}. By default, machines = NULL and all the six types of basic machines are built.
    • splits : real number between \(0\) and \(1\) specifying the proportion of training data used to train the basic machines (\(\mathcal{D}_k\)). The remaining proportion of (\(1-\) splits) is used for the aggregation (\(\mathcal{D}_{\ell}\)). By default, splits = 0.5.
    • basicMachineParam : the option used to set up the values of the parameters of each machine. One should feed the function setBasicParameter() defined above to this argument.
  • Value:

    This function returns a list of the following objects.

    • predict2 : the predictions, given by the basic machines, of the remaining part (\(\mathcal{D}_{\ell}\)) of the training data used for the aggregation.
    • models : all the constructed basic machines (for knn, it contains only the values of the parameter \(k\)).
    • id2 : a logical vector of length equal to the number of rows of the training data, indicating the points used to build the basic machines (FALSE) and the remaining ones (TRUE).
    • scale_max, scale_min : if the argument scale_input = TRUE, these vectors contain the maximum and minimum values of all columns.

✎ Note: You may need to modify the function accordingly if you want to build other types of basic machines; a short sketch follows the function definition below.


generateMachines <- function(train_input, 
                             train_response,
                             scale_input = FALSE,
                             machines = NULL, 
                             splits = 0.5, 
                             basicMachineParam = setBasicParameter()){
  lambda = basicMachineParam$lambda
  k <- basicMachineParam$k 
  ntree <- basicMachineParam$ntree 
  mtry <- basicMachineParam$mtry
  eta_xgb <- basicMachineParam$eta_xgb 
  nrounds_xgb <- basicMachineParam$nrounds_xgb
  early_stop_xgb <- basicMachineParam$early_stop_xgb
  max_depth_xgb <- basicMachineParam$max_depth_xgb
  
  # Packages
  pacman::p_load(tree)
  pacman::p_load(glmnet)
  pacman::p_load(randomForest)
  pacman::p_load(FNN)
  pacman::p_load(xgboost)
  # pacman::p_load(keras)
  
  # Preparing data
  input_names <- colnames(train_input)
  input_size <- dim(train_input)
  df_input <- train_input_scale <- train_input
  if(scale_input){
    maxs <- map_dbl(.x = df_input, .f = max)
    mins <- map_dbl(.x = df_input, .f = min)
    train_input_scale <- scale(train_input, center = mins, scale = maxs - mins)
  }
  if(is.matrix(train_input_scale)){
    df_input <- as_tibble(train_input_scale)
    matrix_input <- train_input_scale
  } else{
    df_input <- train_input_scale
    matrix_input <- as.matrix(train_input_scale)
  }
  
  # Machines
  lasso_machine <- function(x, lambda0){
    if(is.null(lambda)){
      cv <- cv.glmnet(matrix_train_x1, train_y1, alpha = 1, lambda = 10^(seq(-3,2,length.out = 50)))
      mod <- glmnet(matrix_train_x1, train_y1, alpha = 1, lambda = cv$lambda.min)
    } else{
      mod <- glmnet(matrix_train_x1, train_y1, alpha = 1, lambda = lambda0)
    }
    res <- predict.glmnet(mod, newx = x)
    return(list(pred = res,
                model = mod))
  }
  ridge_machine <- function(x, lambda0){
    if(is.null(lambda)){
      cv <- cv.glmnet(matrix_train_x1, train_y1, alpha = 0, lambda = 10^(seq(-3,2,length.out = 50)))
      mod <- glmnet(matrix_train_x1, train_y1, alpha = 0, lambda = cv$lambda.min)
    } else{
      mod <- glmnet(matrix_train_x1, train_y1, alpha = 0, lambda = lambda0)
    }
    res <- predict.glmnet(mod, newx = x)
    return(list(pred = res,
                model = mod))
  }
  tree_machine <- function(x, pa = NULL) {
    mod <- tree(as.formula(paste("train_y1~", 
                                 paste(input_names, sep = "", collapse = "+"), 
                                 collapse = "", 
                                 sep = "")), 
                data = df_train_x1)
    res <- as.vector(predict(mod, x))
    return(list(pred = res,
                model = mod))
  }
  knn_machine <- function(x, k0) {
    mod <- knn.reg(train = matrix_train_x1, test = x, y = train_y1, k = k0)
    res = mod$pred
    return(list(pred = res,
                model = k0))
  }
  RF_machine <- function(x, ntree0) {
    if(is.null(mtry)){
      mod <- randomForest(x = df_train_x1, y = train_y1, ntree = ntree0)
    }else{
      mod <- randomForest(x = df_train_x1, y = train_y1, ntree = ntree0, mtry = mtry)
    }
    res <- as.vector(predict(mod, x))
    return(list(pred = res,
                model = mod))
  }
  xgb_machine = function(x, nrounds_xgb0){
    mod <- xgboost(data = matrix_train_x1, 
                   label = train_y1, 
                   eta = eta_xgb,
                   nrounds = nrounds_xgb0,
                   objective = "reg:squarederror",
                   early_stopping_rounds = early_stop_xgb,
                   max_depth = max_depth_xgb,
                   verbose = 0)
    res <- predict(mod, x)
    return(list(pred = res,
                model = mod))
  }
  
  # All machines
  all_machines <- list(lasso = lasso_machine, 
                       ridge = ridge_machine, 
                       knn = knn_machine, 
                       tree = tree_machine, 
                       rf = RF_machine,
                       xgb = xgb_machine)
  # All parameters
  all_parameters <- list(lasso = lambda, 
                         ridge = lambda, 
                         knn = k, 
                         tree = 1, 
                         rf = ntree,
                         xgb = nrounds_xgb)
  if(is.null(machines)){
    mach <- c("lasso", "ridge", "knn", "tree", "rf", "xgb")
  }else{
    mach <- machines
  }
  # Extracting data
  M <- length(mach)
  size_D1 <- floor(splits*input_size[1])
  id_D1 <- logical(input_size[1])
  id_D1[sample(input_size[1], size_D1)] <- TRUE
  
  df_train_x1 <- df_input[id_D1,]
  matrix_train_x1 <- matrix_input[id_D1,]
  train_y1 <- train_response[id_D1]
  df_train_x2 <- df_input[!id_D1,]
  matrix_train_x2 <- matrix_input[!id_D1,]
  
  # Function to extract df and model from 'map' function
  extr_df <- function(x, id){
    return(tibble("r_{{id}}":= as.vector(pred_m[[x]]$pred)))
  }
  extr_mod <- function(x, id){
    return(pred_m[[x]]$model)
  }
  
  pred_D2 <- c()
  all_mod <- c()
  cat("\n* Building basic machines ...\n")
  cat("\t~ Progress:")
  for(m in 1:M){
    if(mach[m] %in% c("tree", "rf")){
      x0_test <- df_train_x2
    } else {
      x0_test <- matrix_train_x2
    }
    if(is.null(all_parameters[[mach[m]]])){
      para_ <- 1
    }else{
      para_ <- all_parameters[[mach[m]]]
    }
    pred_m <-  map(para_, 
                   .f = ~ all_machines[[mach[m]]](x0_test, .x))
    tem0 <- imap_dfc(.x = 1:length(para_), 
                     .f = extr_df)
    tem1 <- imap(.x = 1:length(para_), 
                 .f = extr_mod)
    names(tem0) <- names(tem1) <- paste0(mach[m], 1:length(para_))
    pred_D2 <- bind_cols(pred_D2, as_tibble(tem0))
    all_mod[[mach[m]]] <- tem1
    cat(" ... ", round(m/M, 2)*100L,"%", sep = "")
  }
  if(scale_input){
    return(list(predict2 = pred_D2,
                models = all_mod,
                id2 = !id_D1,
                train_data = list(train_input = train_input_scale, 
                                  train_response = train_response),
                scale_max = maxs,
                scale_min = mins))
  } else{
    return(list(predict2 = pred_D2,
                models = all_mod,
                id2 = !id_D1,
                train_data = list(train_input = train_input_scale, 
                                  train_response = train_response)))
  }
}
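As mentioned in the note above, adding another type of basic machine only requires defining one more *_machine function inside generateMachines and registering it in the all_machines and all_parameters lists. Here is a hedged sketch (assuming the e1071 package, which is not used elsewhere in this Rmarkdown; df_train_x1 and train_y1 are the objects created inside the function):

# Sketch only: a support vector regression machine following the same pattern
# as tree_machine or RF_machine; it would be placed inside generateMachines.
# pacman::p_load(e1071)
svm_machine <- function(x, cost0 = 1) {
  mod <- e1071::svm(x = df_train_x1, y = train_y1, cost = cost0)
  res <- as.vector(predict(mod, x))
  return(list(pred = res,
              model = mod))
}
# ... then register it, for example:
# all_machines$svm   <- svm_machine
# all_parameters$svm <- 1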

Example.1: In this example, the method is implemented on the Boston data of the MASS library. The basic machines “rf”, “knn” and “xgb” are built on the first part of the training data (\(\mathcal{D}_{k}\)), and the Root Mean Square Errors (RMSE), evaluated on the second part of the training data (\(\mathcal{D}_{\ell}\)) used for the aggregation, are reported.


pacman::p_load(MASS)
df <- Boston
basic_machines <- generateMachines(train_input = df[,1:13],
                 train_response = df[,14],
                 scale_input = TRUE,
                 machines = c("rf", "knn", "xgb"),
                 basicMachineParam = setBasicParameter(lambda = 1:10/10, 
                                                ntree = 10:20 * 25,
                                                k = c(2:10)))
## 
## * Building basic machines ...
##  ~ Progress: ... 33% ... 67% ... 100%
basic_machines$predict2 %>%
  sweep(1, df[basic_machines$id2, "medv"]) %>%
  .^2 %>%
  colMeans %>%
  t %>%
  sqrt %>%
  as_tibble

3 Optimization algorithm

This part provides functions to approximate the smoothing parameter \(h>0\) of the aggregation. Two important optimization methods are implemented: gradient descent algorithm (grad) and grid search (grid).

3.1 Gradient descent algorithm

3.1.1 Function : setGradParameter

This function allows us to set the values of the parameters needed to run the gradient descent algorithm that approximates the hyperparameter of the aggregation method.

  • Argument:

    • val_init : initial value of gradient descent iteration. By default, val_init = NULL and the algorithm will select the best value (with smallest cost function) among n_tries values of parameters.
    • n_tries : the number of parameters used to evaluate the cost function (only when val_init = NULL), then the best one among them will be chosen as initial value for gradient descent algorithm.
    • rate : the learning rate of the gradient descent algorithm. By default, rate = NULL (or “auto”) and the constant value coef_auto = 0.005 will be used. It can also be a functional rate, given as a string element of {“logarithm”, “sqrtroot”, “linear”, “polynomial”, “exponential”}. Each rate is defined according to the corresponding coef_* arguments below.
    • min_val : the minimum value of the parameter considered when searching for the initial value of the gradient step. By default, min_val = 1e-4.
    • max_val : the maximum value of the parameter considered when searching for the initial value of the gradient step. By default, max_val = 0.1.
    • max_iter : the maximum number of iterations of the gradient descent algorithm. By default, max_iter = 300.
    • print_step : a logical value controlling whether to print the result of each gradient step or not in order to keep track of the algorithm. By default, print_step = TRUE.
    • print_result : a logical value controlling whether or not to print the final result of the algorithm. By default, print_result = TRUE.
    • figure : a logical value controlling whether to plot a graphic of the result or not. By default, figure = TRUE.
    • coef_auto : the constant learning rate when rate = NULL. By default, coef_auto = 0.005.
    • coef_log : the coefficient multiplying the logarithmic increment of the learning rate, i.e., the rate is rate \(=\) coef_log\(\times \log(1+t)\) where \(t\) is the iteration number. By default, coef_log = 1.
    • coef_sqrt : the coefficient multiplying the square root increment of the learning rate, i.e., the rate is rate \(=\) coef_sqrt\(\times \sqrt{t}\). By default, coef_sqrt = 1.
    • coef_lm : the coefficient multiplying the linear increment of the learning rate, i.e., the rate is rate \(=\) coef_lm\(\times t\). By default, coef_lm = 1.
    • deg_poly : the degree of the polynomial increment of the learning rate, i.e., the rate is rate \(=t^{\texttt{deg_poly}}\). By default, deg_poly = 2.
    • base_exp : the base of the exponential increment of the learning rate, i.e., the rate is rate \(=\) base_exp\(^t\). By default, base_exp = 1.5.
    • threshold : the stopping threshold of the algorithm; the iterations stop when the relative change is smaller than this value. By default, threshold = 1e-10.
  • Value:

    This function returns a list of all the parameters given in its arguments.

setGradParameter <- function(val_init = NULL, 
                             n_tries = 10, 
                             rate = NULL, 
                             min_val = 1e-4,
                             max_val = 0.1,
                             max_iter = 300, 
                             print_step = TRUE, 
                             print_result = TRUE,
                             figure = TRUE, 
                             coef_auto = 0.005,
                             coef_log = 1,
                             coef_sqrt = 1,
                             coef_lm = 1,
                             deg_poly = 2,
                             base_exp = 1.5,
                             threshold = 1e-10) {
  return(
    list(val_init = val_init,
      n_tries = n_tries,
      rate = rate,
      min_val = min_val,
      max_val = max_val,
      max_iter = max_iter,
      print_step = print_step,
      print_result = print_result,
      figure = figure,
      coef_auto = coef_auto,
      coef_log = coef_log,
      coef_sqrt = coef_sqrt,
      coef_lm = coef_lm,
      deg_poly = deg_poly,
      base_exp = base_exp,
      threshold = threshold
    )
  )
}

3.1.2 Function : gradOptimizer

This function performs the gradient descent algorithm to approximate the minimizer of a given function (convex or locally convex around its optimizer).

  • Argument:

    • obj_fun : the objective function whose minimizer is to be estimated. It should be a univariate function of a positive real variable.
    • setParameter : the control of gradient descent parameters which should be the function setGradParameter() defined above.
  • Value:

    This function returns a list of the following objects:

    • opt_param : the observed value of the minimizer.
    • opt_error : the value of the optimal risk.
    • all_grad : the vector of all the gradients collected during the walk of the algorithm.
    • all_param : the vector of all parameters collected during the walk of the algorithm.
    • run_time : the running time of the algorithm.
gradOptimizer <- function(obj_fun,
                          setParameter = setGradParameter()) {
  start.time <- Sys.time()
  # Optimization step:
  # ==================
  spec_print <- function(x) return(ifelse(x > 1e-6, 
                                          format(x, digit = 6, nsmall = 6), 
                                          format(x, scientific = TRUE, digit = 6, nsmall = 6)))
  collect_val <- c()
  gradients <- c()
  if (is.null(setParameter$val_init)){
    val_params <- seq(setParameter$min_val, 
                      setParameter$max_val, 
                      length.out = setParameter$n_tries)
    tem <- map_dbl(.x = val_params, .f = obj_fun)
    val <- val0 <- val_params[which.min(tem)]
    grad_ <- pracma::grad(
      f = obj_fun, 
      x0 = val0, 
      heps = .Machine$double.eps ^ (1 / 3))
  } else{
    val <- val0 <- setParameter$val_init
    grad_ <- pracma::grad(
      f = obj_fun, 
      x0 = val0, 
      heps = .Machine$double.eps ^ (1 / 3))
  }
  if(setParameter$print_step){
    cat("\n* Gradient descent algorithm ...")
    cat("\n  Step\t|  Parameter\t|  Gradient\t|  Threshold \n")
    cat(" ", rep("-", 51), sep = "")
    cat("\n   0 \t| ", spec_print(val0),
        "\t| ", spec_print(grad_), 
        " \t| ", setParameter$threshold, "\n")
    cat(" ", rep("-",51), sep = "")
  }
  if (is.numeric(setParameter$rate)){
    lambda0 <- setParameter$rate / abs(grad_)
    rate_GD <- "auto"
  } else{
    r0 <- setParameter$coef_auto / abs(grad_)
    # Rate functions
    rate_func <- list(auto = r0, 
                      logarithm = function(i)  setParameter$coef_log * log(2 + i) * r0,
                      sqrtroot = function(i) setParameter$coef_sqrt * sqrt(i) * r0,
                      linear = function(i) setParameter$coef_lm * (i) * r0,
                      polynomial = function(i) i ^ setParameter$deg_poly * r0,
                      exponential = function(i) setParameter$base_exp ^ i * r0)
    rate_GD <- match.arg(setParameter$rate, 
                         c("auto", 
                          "logarithm", 
                          "sqrtroot", 
                          "linear", 
                          "polynomial", 
                          "exponential"))
    lambda0 <- rate_func[[rate_GD]]
  }
  i <- 0
  if (is.numeric(setParameter$rate) | rate_GD == "auto") {
    while (i < setParameter$max_iter) {
      if(is.na(grad_)){
        val0 <- runif(1, val0*0.99, val0*1.01) 
        grad_ = pracma::grad(
          f = obj_fun, 
          x0 = val0, 
          heps = .Machine$double.eps ^ (1 / 3)
       )
      }
      val <- val0 - lambda0 * grad_
      if (val < 0){
        val <- val0/2
        lambda0 <- lambda0/2
      }
      if(i > 5){
        if(sign(grad_) * sign(grad0) < 0){
          lambda0 = lambda0 / 2
        }
      }
      relative <- abs((val - val0) / val0)
      test_threshold <- max(relative, abs(grad_))
      if (test_threshold > setParameter$threshold){
        val0 <- val
        grad0 <- grad_
      } else{
        break
      }
      grad_ = pracma::grad(
        f = obj_fun, 
        x0 = val0, 
        heps = .Machine$double.eps ^ (1 / 3)
      )
      i <- i + 1
      if(setParameter$print_step){
        cat("\n  ", i, "\t| ", spec_print(val), 
            "\t| ", spec_print(grad_), 
            "\t| ", test_threshold, "\r")
      }
      collect_val <- c(collect_val, val)
      gradients <- c(gradients, grad_)
    }
  }
  else{
    while (i < setParameter$max_iter) {
      if(is.na(grad_)){
        val0 <- runif(1, val0*0.9, val0*1.1) 
        grad_ = pracma::grad(
          f = obj_fun, 
          x0 = val0, 
          heps = .Machine$double.eps ^ (1 / 3)
       )
      }
      val <- val0 - lambda0(i) * grad_
      if (val < 0){
        val <- val0/2
        r0 <- r0 / 2
      }
      if(i > 5){
        if(sign(grad_)*sign(grad0) < 0)
          r0 <- r0 / 2
      }
      relative <- abs((val - val0) / val0)
      test_threshold <- max(relative, abs(grad_))
      if (test_threshold > setParameter$threshold){
        val0 <- val
        grad0 <- grad_
      }
      else{
        break
      }
      grad_ <- pracma::grad(
        f = obj_fun, 
        x0 = val0, 
        heps = .Machine$double.eps ^ (1 / 3)
      )
      i <- i + 1
      if(setParameter$print_step){
        cat("\n  ", i, "\t| ", spec_print(val), 
            "\t| ", spec_print(grad_), 
            "\t| ", test_threshold, "\r")
      }
      collect_val <- c(collect_val, val)
      gradients <- c(gradients, grad_)
    }
  }
  opt_ep <- val
  opt_risk <- obj_fun(opt_ep)
  if(setParameter$print_step){
    cat(rep("-", 55), sep = "")
    if(grad_ == 0){
      cat("\n Stopped| ", spec_print(val), 
        "\t| ", 0, 
        "\t\t| ", test_threshold)
    }else{
      cat("\n Stopped| ", spec_print(val), 
        "\t| ", spec_print(grad_), 
        "\t| ", test_threshold)
    } 
  }
  if(setParameter$print_result){
    cat("\n ~ Observed parameter:", opt_ep, " in",i, "iterations.")
  }
  if (setParameter$figure) {
    siz <- length(collect_val)
    tibble(x = 1:siz,
           y = collect_val,
           gradient = gradients) %>%
      ggplot(aes(x = x, y = y)) +
      geom_line(mapping = aes(color = gradient), size = 1) +
      geom_point(aes(x = length(x), y = opt_ep), color = "red") +
      geom_hline(yintercept = opt_ep, color = "red", linetype = "longdash") +
      labs(title = "Gradient steps",
           x = "Iteration",
           y = "Parameter")-> p
    print(p)
  }
  end.time = Sys.time()
  return(list(
    opt_param = opt_ep,
    opt_error = opt_risk,
    all_grad = gradients,
    all_param = collect_val,
    run_time = difftime(end.time, 
                        start.time, 
                        units = "secs")[[1]]
  ))
}

Example.2: Approximate \[x^*=\text{arg}\min_{x\in\mathbb{R}}f(x), \text{where }f(x)=(x-1)^2(1+\sin^2(2.5(x-1)))\] Note that the argument val_init is crucial since \(f\) is not convex.


object_func <- function(x) (x-1)^2*(1+sin(2.5*(x-1))^2)
tibble::tibble(x = seq(-4,6, length.out = 200), 
               y = object_func(seq(-4,6, length.out = 200))) %>%
  ggplot(aes(x = x, y = y)) +
  geom_line(color = "skyblue", size = 0.75) + 
  labs(title = latex2exp::TeX("Cost function $f(x)=(x-1)^2(1+\\sin^2(2.5(x-1)))$")) -> p
print(p)

gd <- gradOptimizer(obj_fun = object_func,
                  setParameter = setGradParameter(val_init = 2.4,
                                                rate = 1,
                                                print_step = FALSE,
                                                figure = TRUE))
## 
##  ~ Observed parameter: 1  in 59 iterations.
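The functional learning rates described in setGradParameter can be plugged in the same way. Below is a minimal sketch (same objective function, assuming the “sqrtroot” rate with its default coefficient; the output is not shown):

# Same objective, but with a square-root increasing learning rate
gd_sqrt <- gradOptimizer(obj_fun = object_func,
                         setParameter = setGradParameter(val_init = 2.4,
                                                         rate = "sqrtroot",
                                                         print_step = FALSE,
                                                         figure = FALSE))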

3.2 Grid search algorithm

3.2.1 Function : setGridParameter

This function allows us to set the parameter values of the grid search algorithm.

  • Argument:

    • min_val : the minimum value of parameter grid.
    • max_val : the maximum value of parameter grid.
    • n_val : the number of points in the grid.
    • parameters : the vector of parameters in case a non-uniform grid is considered. By default, parameters = NULL and a uniform grid from min_val to max_val is used.
    • print_result : a logical value specifying whether or not to print the observed result.
    • figure : a logical value specifying whether or not to plot the graphic of error.
  • Value:

    This function returns a list of all the parameters given in its arguments.

setGridParameter <- function(min_val = 1e-4, 
                             max_val = 0.1, 
                             n_val = 300, 
                             parameters = NULL,
                             print_result = TRUE,
                             figure = TRUE){
  return(list(min_val = min_val,
              max_val = max_val,
              n_val = n_val,
              parameters = parameters,
              print_result = print_result,
              figure = figure))
}

3.2.2 Function : gridOptimizer

This function performs the grid search algorithm to approximate the minimizer of a given function over the parameter grid defined by setGridParameter().

  • Argument:

    • obj_fun : the objective function whose minimizer is to be estimated. It should be a univariate function of a positive real variable.
    • setParameter : the control of grid search algorithm parameters which should be the function setGridParameter() defined above.
  • Value:

    This function returns a list of the following objects:

    • opt_param : the observed value of the minimizer.
    • opt_error : the value of optimal risk.
    • all_risk : the vector of all the errors evaluated at all the values of considered parameters.
    • run.time : the running time of the algorithm.
gridOptimizer <- function(obj_fun,
                         setParameter = setGridParameter()){
  t0 <- Sys.time()
  if(is.null(setParameter$parameters)){
    param <- seq(setParameter$min_val, 
                 setParameter$max_val,
                 length.out = setParameter$n_val)
  } else{
    param <- setParameter$parameters
  }
  risk <- param %>%
    map_dbl(.f = obj_fun)
  id_opt <- which.min(risk)
  opt_ep <- param[id_opt]
  opt_risk <- risk[id_opt]
  if(setParameter$print_result){
    cat("\n* Grid search algorithm...", "\n ~ Observed parameter :", opt_ep)
  }
  if(setParameter$figure){
    tibble(x = param, 
           y = risk) %>%
      ggplot(aes(x = x, y = y)) +
      geom_line(color = "skyblue", size = 0.75) +
      geom_point(aes(x = opt_ep, y = opt_risk), color = "red") +
      geom_vline(xintercept = opt_ep, color = "red", linetype = "longdash") +
      labs(title = "Error as function of parameter", 
           x = "Parameter",
           y = "Error")-> p
    print(p)
  }
  t1 <- Sys.time()
  return(list(
    opt_param = opt_ep,
    opt_error = opt_risk,
    all_risk = risk,
    run.time = difftime(t1, 
                        t0, 
                        units = "secs")[[1]])
  )
}

Example.2 (continued): We approximate the same minimizer using the grid search algorithm.


grid <- gridOptimizer(obj_fun = object_func,
                     setParameter = setGridParameter(min_val = -3,
                                                     max_val = 5,
                                                     n_val = 500,
                                                     figure = TRUE))
## 
## * Grid search algorithm... 
##  ~ Observed parameter : 1.008016

3.3 \(\kappa\)-fold cross-validation loss function

Constructing the aggregation method is equivalent to approximating the optimal value of the smoothing parameter \(h>0\) introduced in section 1.1 by minimizing some loss function. In this study, we propose the \(\kappa\)-fold cross-validation loss function defined by

\[\begin{equation} \label{eq:kappa} \varphi^{\kappa}(h)=\frac{1}{\kappa}\sum_{k=1}^{\kappa}\sum_{(x_j,y_j)\in F_k}(g_n({\bf r}(x_j))-y_j)^2 \end{equation}\] where

  • for any \(k=1,...,\kappa\), \(F_k\) denotes the \(k\)th validation fold.
  • \(g_n({\bf r}(x_j))\) is the prediction of \(x_j\in F_k\), computed using the data points from the remaining part \({\cal D}_{\ell}-F_k\) by \[g_n({\bf r}(x_j))=\frac{\sum_{(x_i,y_i)\in{\cal D}_{\ell}-F_k}y_iK_h({\bf r}(x_j)- {\bf r}(x_i))}{\sum_{(x_i,y_i)\in{\cal D}_{\ell}-F_k}K_h({\bf r}(x_j)- {\bf r}(x_i))}\]

3.4 Function: dist_matrix

This function computes the distances between the data points of each training fold (\(\mathcal{D}_{\ell}-F_k\)) and the corresponding validation fold \(F_k\), for \(k=1,\dots,\kappa\). It returns the distance matrices \(D_k=(d[{\bf r}(x_i),{\bf r}(x_j)])_{i,j}\) for \(k=1,\dots,\kappa\), where the distance \(d\) is defined according to the type of kernel function used.

  • Argument:

    • basicMachines : the basic machine object, which is an output of generateMachines function.
    • n_cv : the number \(\kappa\) of cross-validation folds. By default, n_cv = 5.
    • kernel : the kernel function used for the aggregation, which is an element of {“gaussian”, “epanechnikov”, “biweight”, “triweight”, “triangular”, “naive”, “c.expo”, “expo”}. By default, kernel = "gaussian".
    • id_shuffle : an integer vector of length equal to the size of the remaining part of the training data (\({\cal D}_{\ell}\)). Its elements are from {\(1,...,\kappa\)}, indicating the fold membership of the data points. By default, id_shuffle = NULL, and the data points will be shuffled randomly.
  • Value:

    This function returns a list of the following objects:

    • dist : a list of n_cv data frames (tibbles), one per cross-validation fold. The \(k\)th element is the distance matrix \(D_k=(d[{\bf r}(x_i),{\bf r}(x_j)])_{i,j}\) between the data points of the \(\kappa-1\) remaining training folds (by row) and the data points of the validation fold \(F_k\) (by column). The type of distance depends on the kernel used:
      • If kernel = naive, the distance matrices contain the maximum distance between data points, i.e., \[D_k=(\|{\bf r}(x_i)-{\bf r}(x_j)\|_{\max})_{i,j}\text{ for }k=1,\dots,\kappa.\]
      • If kernel = triangular, the distance matrices contain the \(L_1\) distance between data points, i.e., \[D_k=(\|{\bf r}(x_i)-{\bf r}(x_j)\|_1)_{i,j}\text{ for }k=1,\dots,\kappa.\]
      • Otherwise, the distance matrices contain the squared \(L_2\) distance between data points, i.e., \[D_k=(\|{\bf r}(x_i)-{\bf r}(x_j)\|_2^2)_{i,j}\text{ for }k=1,\dots,\kappa.\]
    • id_shuffle : the shuffled indices in cross-validation.
    • n_cv : the number \(\kappa\) of cross-validation folds.
dist_matrix <- function(basicMachines,
                        n_cv = 5,
                        kernel = "gaussian",
                        id_shuffle = NULL){
  n <- nrow(basicMachines$predict2)
  n_each_fold <- floor(n/n_cv)
  # shuffled indices
  if(is.null(id_shuffle)){
    shuffle <- 1:(n_cv-1) %>%
    rep(n_each_fold) %>%
    c(., rep(n_cv, n - n_each_fold * (n_cv - 1))) %>%
    sample
  }else{
    shuffle <- id_shuffle
  }
  # the prediction matrix D_l
  df_ <- as.matrix(basicMachines$predict2)
  if(! (kernel %in% c("naive", "triangular"))){
    pair_dist <- function(M, N){
      n_N <- dim(N)
      n_M <- dim(M)
      res_ <- 1:nrow(N) %>%
        map_dfc(.f = (\(id) tibble('{{id}}' := as.vector(rowSums((M - matrix(rep(N[id,], n_M[1]), ncol = n_M[2], byrow = TRUE))^2)))))
      return(res_)
    }
  }
  if(kernel == "triangular"){
    pair_dist <- function(M, N){
      n_N <- dim(N)
      n_M <- dim(M)
      res_ <- 1:nrow(N) %>%
        map_dfc(.f = (\(id) tibble('{{id}}' := as.vector(rowSums(abs(M - matrix(rep(N[id,], n_M[1]), ncol = n_M[2], byrow = TRUE)))))))
      return(res_)
    }
  }
  if(kernel == "naive"){
    pair_dist <- function(M, N){
      n_N <- dim(N)
      n_M <- dim(M)
      res_ <- 1:nrow(N) %>%
        map_dfc(.f = (\(id) tibble('{{id}}' := as.vector(apply(abs(M - matrix(rep(N[id,], n_M[1]), ncol = n_M[2], byrow = TRUE)), 1, max)))))
      return(res_)
    }
  }
  L <- 1:n_cv %>%
      map(.f = (\(x) pair_dist(df_[shuffle != x,],
                                df_[shuffle == x,])))
  return(list(dist = L, 
              id_shuffle = shuffle,
              n_cv = n_cv))
}

Example.3: The function dist_matrix is applied to the basic machines built in Example.1, with the Gaussian kernel function.


dis <- dist_matrix(basicMachines = basic_machines,
            n_cv = 3,
            kernel = "gaussian")

Example.4: From the distance matrices, we can compute the cross-validation error corresponding to the Gaussian kernel function, then use both optimization methods to approximate the smoothing parameter.


# Gaussian kernel
gaussian_kern <- function(.ep = .05,
                            .dist_matrix,
                            .train_response2){
  kern_fun <- function(x, id, D){
    tem0 <- as.matrix(exp(-(x*D)))
    y_hat <- .train_response2[.dist_matrix$id_shuffle != id] %*% tem0/colSums(tem0)
    return(sum((y_hat - .train_response2[.dist_matrix$id_shuffle == id])^2))
  }
  temp <- map2(.x = 1:.dist_matrix$n_cv, 
                   .y = .dist_matrix$dist, 
                   .f = ~ kern_fun(x = .ep, id = .x, D = .y))
  return(Reduce("+", temp))
}

# Kappa cross-validation error
cost_fun <- function(.ep = .05,
                      .dist_matrix,
                      .kernel_func = NULL,
                      .train_response2){
  return(.kernel_func(.ep = .ep,
                     .dist_matrix = .dist_matrix,
                     .train_response2 = .train_response2))
}

# cross_validation
err_cv <- function(x, 
                     .dist_matrix = dis,
                     .kernel_func = gaussian_kern,
                     .train_response2 = df[basic_machines$id2, 14]) {
  res <- cost_fun(.ep = x,
              .dist_matrix = .dist_matrix,
              .kernel_func = .kernel_func,
              .train_response2 = .train_response2)
  return(res)
}

# Optimization
opt_param_gd <- gradOptimizer(obj_fun = err_cv,
                              setParameter = setGradParameter(min_val = 0.0001,
                                                              max_val = 1,
                                                              rate = 0.007,
                                                              n_tries = 10,
                                                              print_step = FALSE,
                                                              print_result = TRUE,
                                                              figure = TRUE))
## 
##  ~ Observed parameter: 0.009389028  in 240 iterations.

# Optimization
opt_param_grid <- gridOptimizer(obj_fun = err_cv,
                                setParameter = setGridParameter(min_val = 0.0001,
                                                                max_val = 0.1,
                                                                n_val = 500,
                                                                figure = TRUE))
## 
## * Grid search algorithm... 
##  ~ Observed parameter : 0.009309218

cat('* Observed parameter:\n\t - Gradient descent\t:', 
    opt_param_gd$opt_param,
    '\t with error:', err_cv(opt_param_gd$opt_param),
    '\n\t - Grid search\t\t:',opt_param_grid$opt_param,
    '\t with error:', err_cv(opt_param_grid$opt_param))
## * Observed parameter:
##   - Gradient descent : 0.009389028    with error: 2379.61 
##   - Grid search      : 0.009309218    with error: 2379.624

3.5 Fitting parameter

This function gathers the constructed machines and performs an optimization algorithm to approximate the smoothing parameter for the aggregation.

  • Argument:

    • train_design : a matrix or data frame of the training input data or the predictions given by some predictors.
      • If the input data is given, the option build_machine must be TRUE and all the chosen machines will be constructed using a proportion of the training input (splits).
      • If the predictions are given, the option build_machine must be FALSE, which indicates that no basic machine is constructed and the parameter is tuned using these predicted features directly.
    • train_response : a vector of corresponding response variable of the train_design.
    • scale_input : a logical value specifying whether or not to scale the input data before building the basic regression predictors. By default, scale_input = FALSE.
    • scale_machine : a logical value specifying whether or not to scale the predicted features given by all the basic regression predictors, for aggregation. By default, scale_machine = FALSE.
    • build_machine : a logical value specifying whether or not the basic machines should be constructed. It should be TRUE if train_design is the input data, and FALSE if train_design is the matrix of predicted features. By default, build_machine = TRUE.
    • machines : a vector of basic machines to be constructed. It must be a subset of {“lasso”, “ridge”, “knn”, “tree”, “rf”, “xgb”}. By default, machines = NULL and all the six types of basic machines are built.
    • splits : the proportion of training data used to build the basic machines. By default, splits = .5.
    • n_cv : the number of cross-validation folds used to tune the smoothing parameter.
    • sig, alp : the support constant \(\sigma>0\) and the exponent \(\alpha>0\) of the compactly supported exponential kernel: \(K(x)=e^{-\|x\|^{\alpha}}\mathbb{1}_{\{\|x\|\leq\sigma\}}\) for any \(x\in\mathbb{R}^d\). By default, sig = 3 and alp = 2.
    • kernels : the kernel function or vector of kernel functions used for the aggregation.
    • optimizeMethod : the optimization methods used to learn the smoothing parameter. By default, optimizeMethod = “grad”, which stands for the gradient descent algorithm; if optimizeMethod = “grid”, the grid search algorithm is used. Note that this argument should have the same length as the kernels argument, otherwise the “grid” method will be used for all the kernels.
    • setBasicMachineParam : an option used to set the values of the parameters of the basic machines. setBasicParameter function should be fed to this argument.
    • setGradParam : an option used to set the values of the parameters of the gradient descent algorithm. setGradParameter function should be fed to it.
    • setGridParam : an option used to set the values of the parameters of the grid search algorithm. setGridParameter function should be fed to it.
  • Value:

    This function returns a list of the following objects:

    • opt_parameters : the results of the optimization, one element per kernel and optimization method.
    • add_parameters : other additional parameters such as scaling options, parameters of the kernel functions and the optimization methods used.
    • basic_machines : the basic machine object (the output of generateMachines).
    • .scale : the maximum and minimum values used to scale the predicted features (only when scale_machine = TRUE).
fit_parameter <- function(train_design, 
                          train_response,
                          scale_input = FALSE,
                          scale_machine = FALSE,
                          build_machine = TRUE,
                          machines = NULL, 
                          splits = 0.5, 
                          n_cv = 5,
                          sig = 3,
                          alp = 2,
                          kernels = "gaussian",
                          optimizeMethod = "grad",
                          setBasicMachineParam = setBasicParameter(),
                          setGradParam = setGradParameter(),
                          setGridParam = setGridParameter()){
  kernels_lookup <- c("gaussian", "epanechnikov", "biweight", "triweight", "triangular", "naive", "c.expo", "expo")
  kernel_real <- kernels %>%
    sapply(FUN = function(x) return(match.arg(x, kernels_lookup)))
  if(build_machine){
     mach2 <- generateMachines(train_input = train_design,
                               train_response = train_response,
                               scale_input = scale_input,
                               machines = machines,
                               splits = splits,
                               basicMachineParam = setBasicMachineParam)
  } else{
    mach2 <- list(predict2 = train_design,
                  models = colnames(train_design),
                  id2 = rep(TRUE, nrow(train_design)),
                  train_data = list(train_response = train_response))                  
  }
  maxs_ <- mins_ <- NULL
  if(scale_machine){
    maxs_ <- map_dbl(.x = mach2$predict2, .f = max)
    mins_ <- map_dbl(.x = mach2$predict2, .f = min)
    mach2$predict2 <- scale(mach2$predict2, 
                            center = mins_, 
                            scale = maxs_ - mins_)
  }
  # distance matrix to compute loss function
  if_euclid <- FALSE
  id_euclid <- NULL
  n_ker <- length(kernels)
  dist_all <- list()
  id_shuf <- NULL
  for (k_ in 1:n_ker) {
    ker <- kernel_real[k_]
    if(ker == "naive"){
      dist_all[["naive"]] <- dist_matrix(basicMachines = mach2,
                                         n_cv = n_cv,
                                         kernel = "naive",
                                         id_shuffle = id_shuf)
    } else{
      if(ker == "triangular"){
        dist_all[["triangular"]] <- dist_matrix(basicMachines = mach2,
                                                n_cv = n_cv,
                                                kernel = "triangular",
                                                id_shuffle = id_shuf)
      } else{
        if(if_euclid){
          dist_all[[ker]] <- dist_all[[id_euclid]]
        } else{
          dist_all[[ker]] <- dist_matrix(basicMachines = mach2,
                                         n_cv = n_cv,
                                         kernel = ker,
                                         id_shuffle = id_shuf)
          id_euclid <- ker
          if_euclid <- TRUE
        }
      }
    }
    id_shuf <- dist_all[[1]]$id_shuffle
  }

  # Kernel functions 
  # ================
  # expo
  expo_kernel <- function(.ep = .05,
                          .dist_matrix,
                          .train_response2,
                          .alpha = alp){
  kern_fun <- function(x, id, D){
    tem0 <- as.matrix(exp(- (x*D)^.alpha))
    y_hat <- .train_response2[.dist_matrix$id_shuffle != id] %*% tem0/colSums(tem0)
    return(sum((y_hat - .train_response2[.dist_matrix$id_shuffle == id])^2))
  }
  temp <- map2(.x = 1:.dist_matrix$n_cv,
               .y = .dist_matrix$dist, 
               .f = ~ kern_fun(x = .ep, id = .x, D = .y))
  return(Reduce("+", temp))
  }
  # C_expo
  c.expo_kernel <- function(.ep = .05,
                              .dist_matrix,
                              .train_response2,
                              .sigma = sig,
                              .alpha = alp){
  kern_fun <- function(x, id, D){
    tem0 <- x*D
    tem0[tem0 < .sigma] <- 0
    tem1 <- as.matrix(exp(- tem0^.alpha))
    y_hat <- .train_response2[.dist_matrix$id_shuffle != id] %*% tem1/colSums(tem1)
    y_hat[is.na(y_hat)] <- 0
    return(sum((y_hat - .train_response2[.dist_matrix$id_shuffle == id])^2))
  }
  temp <- map2(.x = 1:.dist_matrix$n_cv,
               .y = .dist_matrix$dist, 
               .f = ~ kern_fun(x = .ep, id = .x, D = .y))
  return(Reduce("+", temp))
}
  
  # Gaussian
  gaussian_kernel <- function(.ep = .05,
                              .dist_matrix,
                              .train_response2){
  kern_fun <- function(x, id, D){
    tem0 <- as.matrix(exp(- (x*D)/2))
    y_hat <- .train_response2[.dist_matrix$id_shuffle != id] %*% tem0/colSums(tem0)
    return(sum((y_hat - .train_response2[.dist_matrix$id_shuffle == id])^2))
  }
  temp <- map2(.x = 1:.dist_matrix$n_cv,
               .y = .dist_matrix$dist, 
               .f = ~ kern_fun(x = .ep, id = .x, D = .y))
  return(Reduce("+", temp))
}

# Epanechnikov
  epanechnikov_kernel <- function(.ep = .05,
                                  .dist_matrix,
                                  .train_response2){
  kern_fun <- function(x, id, D){
    tem0 <- as.matrix(1- x*D)
    tem0[tem0 < 0] = 0
    y_hat <- .train_response2[.dist_matrix$id_shuffle != id] %*% tem0/colSums(tem0)
    return(sum((y_hat - .train_response2[.dist_matrix$id_shuffle == id])^2))
  }
  temp <- map2(.x = 1:.dist_matrix$n_cv, 
               .y = .dist_matrix$dist, 
               .f = ~ kern_fun(x = .ep, id = .x, D = .y))
  return(Reduce("+", temp))
}

# Biweight
  biweight_kernel <- function(.ep = .05,
                              .dist_matrix,
                              .train_response2){
  kern_fun <- function(x, id, D){
    tem0 <- as.matrix(1- x*D)
    tem0[tem0 < 0] = 0
    y_hat <- .train_response2[.dist_matrix$id_shuffle != id] %*% tem0^2/colSums(tem0^2)
    y_hat[is.na(y_hat)] <- 0
    return(sum((y_hat - .train_response2[.dist_matrix$id_shuffle == id])^2))
  }
  temp <- map2(.x = 1:.dist_matrix$n_cv, 
                   .y = .dist_matrix$dist, 
                   .f = ~ kern_fun(x = .ep, id = .x, D = .y))
  return(Reduce("+", temp))
}

# Triweight
  triweight_kernel <- function(.ep = .05,
                               .dist_matrix,
                               .train_response2){
  kern_fun <- function(x, id, D){
    tem0 <- as.matrix(1- x*D)
    tem0[tem0 < 0] = 0
    y_hat <- .train_response2[.dist_matrix$id_shuffle != id] %*% tem0^3/colSums(tem0^3)
    y_hat[is.na(y_hat)] <- 0
    return(sum((y_hat - .train_response2[.dist_matrix$id_shuffle == id])^2))
  }
  temp <- map2(.x = 1:.dist_matrix$n_cv, 
                   .y = .dist_matrix$dist, 
                   .f = ~ kern_fun(x = .ep, id = .x, D = .y))
  return(Reduce("+", temp))
}

# Triangular
  triangular_kernel <- function(.ep = .05,
                                .dist_matrix,
                                .train_response2){
  kern_fun <- function(x, id, D){
    tem0 <- as.matrix(1- x*D)
    tem0[tem0 < 0] <- 0
    y_hat <- .train_response2[.dist_matrix$id_shuffle != id] %*% tem0/colSums(tem0)
    y_hat[is.na(y_hat)] = 0
    return(sum((y_hat - .train_response2[.dist_matrix$id_shuffle == id])^2))
  }
  temp <- map2(.x = 1:.dist_matrix$n_cv, 
                   .y = .dist_matrix$dist, 
                   .f = ~ kern_fun(x = .ep, id = .x, D = .y))
  return(Reduce("+", temp))
  }

  # Naive
  naive_kernel <- function(.ep = .05,
                                .dist_matrix,
                                .train_response2){
    kern_fun <- function(x, id, D){
      tem0 <- (as.matrix(x*D) < 1)
      y_hat <- .train_response2[.dist_matrix$id_shuffle != id] %*% tem0/colSums(tem0)
      y_hat[is.na(y_hat)] = 0
      return(sum((y_hat - .train_response2[.dist_matrix$id_shuffle == id])^2))
    }
    temp <- map2(.x = 1:.dist_matrix$n_cv, 
                 .y = .dist_matrix$dist, 
                 .f = ~ kern_fun(x = .ep, id = .x, D = .y))
    return(Reduce("+", temp))
  }
  
  # error function
  error_cv <- function(x, 
                       .dist_matrix = NULL,
                       .kernel_func = NULL,
                       .train_response2 = NULL){
    res <- .kernel_func(.ep = x,
                        .dist_matrix = .dist_matrix,
                        .train_response2 = .train_response2)
    return(res/n_cv)
  }
  
  # list of kernel functions
  list_funs <- list(gaussian = gaussian_kernel,
                    epanechnikov = epanechnikov_kernel,
                    biweight = biweight_kernel,
                    triweight = triweight_kernel,
                    triangular = triangular_kernel,
                    naive = naive_kernel,
                    expo = expo_kernel,
                    c.expo = c.expo_kernel)
  
  # error for all kernel functions
  error_func <- kernel_real %>%
    map(.f = ~ (\(x) error_cv(x, 
                              .dist_matrix = dist_all[[.x]],
                              .kernel_func = list_funs[[.x]],
                              .train_response2 = train_response[mach2$id2])))
  names(error_func) <- kernels
  # list of prameter setup
  list_param <- list(grad = setGradParam,
                     GD = setGradParam,
                     grid = setGridParam)
  # list of optimizer
  list_optimizer <- list(grad = gradOptimizer,
                         GD = gradOptimizer,
                         grid = gridOptimizer)
  optMethods <- optimizeMethod
  if(length(kernels) != length(optMethods)){
    warning("* kernels and optimization methods differ in sides! Grid search will be used!")
    optMethods = rep("grid", length(kernels))
  }

  # Optimization
  parameters <- map2(.x = kernels,
                     .y = optMethods,
                     .f = ~ list_optimizer[[.y]](obj_fun = error_func[[.x]],
                                                 setParameter = list_param[[.y]]))
  names(parameters) <- paste0(kernel_real, "_", optMethods)
  return(list(opt_parameters = parameters,
              add_parameters = list(scale_input = scale_input,
                                    scale_machine = scale_machine,
                                    sig = sig,
                                    alp = alp,
                                    opt_methods = optimizeMethod),
              basic_machines = mach2,
              .scale = list(min = mins_,
                            max = maxs_)))
}

Example.5: We approximate the smoothing parameter on the Boston data.


df <- MASS::Boston
train <- logical(nrow(df))
train[sample(length(train), floor(0.75*nrow(df)))] <- TRUE

param <- fit_parameter(train_design = df[train, 1:13],
                       train_response = df[train, 14],
                       machines = c("knn", "rf", "xgb"),
                       splits = .65,
                       n_cv = 3,
                       kernels = c("gaussian", "gaussian", "biweight", "triangular"),
                       optimizeMethod = c("grad", "grid", "grid", "grid"),
                       setBasicMachineParam = setBasicParameter(k = 2:5,
                                                                ntree = c(1,3,5)*100,
                                                                nrounds_xgb = c(1,3,5)*100),
                       setGradParam = setGradParameter(rate = 0.001))
## 
## * Building basic machines ...
##  ~ Progress: ... 33% ... 67% ... 100%
## * Gradient descent algorithm ...
##   Step   |  Parameter    |  Gradient |  Threshold 
##  ---------------------------------------------------
##    0     |  0.033400     |  64.605213    |  1e-10 
##  ---------------------------------------------------
##    1     |  0.032400     |  -4.13601e-01     |  64.60521 
##    2     |  0.0324064    |  0.0201676    |  0.4136014 
##    3     |  0.0324061    |  -9.77896e-04     |  0.0201676 
##    4     |  0.0324061    |  4.74145e-05  |  0.0009778963 
##    5     |  0.0324061    |  -2.28108e-06     |  4.741446e-05 
##    6     |  0.0324061    |  1.07952e-07  |  2.281076e-06 
##    7     |  0.0324061    |  -9.38714e-09     |  1.079521e-07 
##    8     |  0.0324061    |  -1.87743e-08     |  9.387143e-09 
##    9     |  0.0324061    |  9.38714e-09  |  1.877429e-08 
##    10    |  0.0324061    |  0e+00    |  9.387143e-09 
## -------------------------------------------------------
##  Stopped|  0.0324061     |  0        |  0
##  ~ Observed parameter: 0.0324061  in 10 iterations.

## 
## * Grid search algorithm... 
##  ~ Observed parameter : 0.03250903

## 
## * Grid search algorithm... 
##  ~ Observed parameter : 0.002104682

## 
## * Grid search algorithm... 
##  ~ Observed parameter : 0.01780803

param$opt_parameters %>%
  map_dfc(.f = ~ .x$opt_param) %>%
  print
## # A tibble: 1 × 4
##   gaussian_grad gaussian_grid biweight_grid triangular_grid
##           <dbl>         <dbl>         <dbl>           <dbl>
## 1        0.0324        0.0325       0.00210          0.0178

4 Prediction

The smoothing parameter obtained from the previous section can be used to make the final predictions.

4.1 Kernel functions

Several types of kernel functions used for the aggregation are defined in this section.

  • Argument:

    • epsilon : the value of the smoothing parameter at which the aggregation is evaluated.
    • .y2 : the vector of response variable of the second part \(\mathcal{D}_{\ell}\) of the training data, which is used for the aggregation.
    • .distance : the distance matrix object obtained from dist_matrix function.
    • .kern : the string specifying the kernel function. By default, .kern = "gaussian".
    • .sig, .alp : the parameters of exponential kernel function.
    • .meth : the string of optimization methods to be linked to the name of the kernel functions if one kernel is used more than once.
  • Value:

    This function returns the predictions of the aggregation method evaluated with the given parameter.

kernel_pred <- function(epsilon,
                        .y2, 
                        .distance, 
                        .kern = "gaussian",
                        .sig = sqrt(.5), 
                        .alp = 2,
                        .meth = NA){
  dis <- as.matrix(.distance)
  # Kernel functions 
  # ================
  expo_kernel <- function(.ep,
                          .alpha = .alp){
    tem0 <- exp(- (.ep*dis)^.alpha)
    y_hat <- .y2 %*% tem0/colSums(tem0)
    return(t(y_hat))
  }
  c.expo_kernel <- function(.ep,
                            .sigma = .sig,
                            .alpha = .alp){
    tem0 <- .ep*dis
    tem0[tem0 < .sigma] <- 0
    tem1 <- exp(- tem0^.alpha)
    y_hat <- .y2 %*% tem1/colSums(tem1)
    y_hat[is.na(y_hat)] <- 0
    return(t(y_hat))
  }
  gaussian_kernel <- function(.ep){
    tem0 <- exp(- (.ep*dis)/2)
    y_hat <- .y2 %*% tem0/colSums(tem0)
    return(t(y_hat))
  }

  # Epanechnikov
  epanechnikov_kernel <- function(.ep){
    tem0 <- 1- .ep*dis
    tem0[tem0 < 0] = 0
    y_hat <- .y2 %*% tem0/colSums(tem0)
    return(t(y_hat))
  }
  # Biweight
  biweight_kernel <- function(.ep){
    tem0 <- 1- .ep*dis
    tem0[tem0 < 0] = 0
    y_hat <- .y2 %*% tem0^2/colSums(tem0^2)
    y_hat[is.na(y_hat)] <- 0
    return(t(y_hat))
  }

  # Triweight
  triweight_kernel <- function(.ep){
    tem0 <- 1- .ep*dis
    tem0[tem0 < 0] = 0
    y_hat <- .y2 %*% tem0^3/colSums(tem0^3)
    y_hat[is.na(y_hat)] <- 0
    return(t(y_hat))
  }

  # Triangular
  triangular_kernel <- function(.ep){
    tem0 <- 1- .ep*dis
    tem0[tem0 < 0] <- 0
    y_hat <- .y2 %*% tem0/colSums(tem0)
    y_hat[is.na(y_hat)] = 0
    return(t(y_hat))
 }
  # Naive
  naive_kernel <- function(.ep){
      tem0 <- (.ep*dis < 1)
      y_hat <- .y2 %*% tem0/colSums(tem0)
      y_hat[is.na(y_hat)] = 0
      return(t(y_hat))
  }
  # Prediction
  kernel_list <- list(gaussian = gaussian_kernel,
                      epanechnikov = epanechnikov_kernel,
                      biweight = biweight_kernel,
                      triweight = triweight_kernel,
                      triangular = triangular_kernel,
                      naive = naive_kernel,
                      expo = expo_kernel,
                      c.expo = c.expo_kernel)
  res <- tibble(as.vector(kernel_list[[.kern]](.ep = epsilon)))
  names(res) <- ifelse(is.na(.meth), 
                       .kern, 
                       paste0(.kern, '_', .meth))
  return(res)
}
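To illustrate the interface of kernel_pred on its own, here is a minimal sketch on made-up inputs (the objects y2_toy and dist_toy below are hypothetical; in practice the distance matrix is built internally by predict_agg, with the training points of \(\mathcal{D}_{\ell}\) in rows and the test points in columns):

# Toy illustration (hypothetical data): 5 training points of D_ell and 3 test points
set.seed(123)
y2_toy <- rnorm(5)                          # responses of D_ell
dist_toy <- matrix(runif(5 * 3), nrow = 5)  # squared distances: rows = D_ell, columns = test points
kernel_pred(epsilon = 0.5,
            .y2 = y2_toy,
            .distance = dist_toy,
            .kern = "gaussian")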

4.2 Function : predict_agg

  • Argument:

    • fitted_models : the object obtained from fit_parameter function.
    • new_data : the new testing data to be predicted.
    • test_response : the testing response variable (optional). If it is given, the mean square error on the testing data is also computed. By default, test_response = NULL.
  • Value:

    This function returns a list of the following objects:

    • fitted_aggregate : the predictions by the aggregation methods.
    • fitted_machine : the predictions given by all the basic machines.
    • mse : the mean square error, computed only if the test_response argument is not NULL.
# Prediction
predict_agg <- function(fitted_models,
                        new_data,
                        test_response = NULL){
  opt_param <- fitted_models$opt_parameters
  add_param <- fitted_models$add_parameters
  basic_mach <- fitted_models$basic_machines
  kern0 <- names(opt_param)
  kernel0 <- stringr::str_split(kern0, "_") %>%
    imap_dfc(.f = ~ tibble("{.y}" := .x)) 
  kerns <- kernel0[1,] %>%
    as.character
  opt_meths <- kernel0[2,] %>%
    as.character
  new_data_ <- new_data
  # if basic machines are built
  if(is.list(basic_mach$models)){
    mat_input <- as.matrix(basic_mach$train_data$train_input)
    if(add_param$scale_input){
      new_data_ <- scale(new_data, 
                         center = basic_mach$scale_min, 
                         scale = basic_mach$scale_max - basic_mach$scale_min)
    }
    if(is.matrix(new_data_)){
      mat_test <- new_data_
      df_test <- as_tibble(new_data_)
    } else {
      mat_test <- as.matrix(new_data_)
      df_test <- new_data_
    }
    
    # Prediction test by basic machines
    built_models <- basic_mach$models
    pred_test <- function(meth){
      if(meth == "knn"){
        pre <- 1:length(built_models[[meth]]) %>%
          map_dfc(.f = (\(k) tibble('{{k}}' := FNN::knn.reg(train = mat_input[!basic_mach$id2,], 
                                                            test = mat_test, 
                                                            y = basic_mach$train_data$train_response[!basic_mach$id2],
                                                            k = built_models[[meth]][[k]])$pred)))
      }
      if(meth %in% c("tree", "rf")){
        pre <- 1:length(built_models[[meth]]) %>%
          map_dfc(.f = (\(k) tibble('{{k}}' := as.vector(predict(built_models[[meth]][[k]], df_test)))))
      }
      if(meth %in% c("lasso", "ridge")){
        pre <- 1:length(built_models[[meth]]) %>%
          map_dfc(.f = (\(k) tibble('{{k}}' := as.vector(predict.glmnet(built_models[[meth]][[k]], mat_test)))))
      }
      if(meth == "xgb"){
        pre <- 1:length(built_models[[meth]]) %>%
          map_dfc(.f = (\(k) tibble('{{k}}' := as.vector(predict(built_models[[meth]][[k]], mat_test)))))
      }
      colnames(pre) <- names(built_models[[meth]])
      return(pre)
    }
    pred_test_all <- names(built_models) %>%
      map_dfc(.f = pred_test)
  } else{
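    # When the basic machines were not built (e.g. build_machine = FALSE),
    # new_data is treated directly as the matrix of basic machine predictions.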
    pred_test_all <- new_data_
  }
  pred_test0 <- pred_test_all
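  # Keep an unscaled copy of the test predictions for reporting, then (if requested)
  # min-max rescale them with the scaling stored in the fitted object so that the
  # distances below are computed on the same scale as during training.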
  if(add_param$scale_machine){
      pred_test_all <- scale(pred_test_all, 
                             center = fitted_models$.scale$min,
                             scale = fitted_models$.scale$max - fitted_models$.scale$min)
  }
  # Prediction train2
  pred_train_all <- basic_mach$predict2
  colnames(pred_test_all) <- colnames(pred_train_all)
  d_train <- dim(pred_train_all)
  d_test <- dim(pred_test_all)
  pred_test_mat <- as.matrix(pred_test_all)
  pred_train_mat <- as.matrix(pred_train_all)
  # Distance matrix
  dist_mat <- function(kernel = "gaussian"){
    if(!(kernel %in% c("naive", "triangular"))){
      res_ <- 1:d_test[1] %>%
        map_dfc(.f = (\(id) tibble('{{id}}' := as.vector(rowSums((pred_train_mat - matrix(rep(pred_test_mat[id,], 
                                                                                              d_train[1]), 
                                                                                          ncol = d_train[2], 
                                                                                          byrow = TRUE))^2)))))
    }
    if(kernel == "triangular"){
      res_ <- 1:d_test[1] %>%
        map_dfc(.f = (\(id) tibble('{{id}}' := as.vector(rowSums(abs(pred_train_mat - matrix(rep(pred_test_mat[id,], 
                                                                                                 d_train[1]), 
                                                                                             ncol = d_train[2], 
                                                                                             byrow = TRUE)))))))
    }
    if(kernel == "naive"){
      res_ <- 1:d_test[1] %>%
        map_dfc(.f = (\(id) tibble('{{id}}' := as.vector(apply(abs(pred_train_mat - matrix(rep(pred_test_mat[id,], d_train[1]), 
                                                                                           ncol = d_train[2], 
                                                                                           byrow = TRUE)), 1, max)))))
    }
    return(res_)
  }
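  # One distance matrix per requested kernel (the norm depends on the kernel), then
  # attach the optimization method name to any kernel used more than once so that
  # kernel_pred can label the corresponding prediction columns unambiguously.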
  dists <- 1:length(kerns) %>%
      map(.f = ~ dist_mat(kerns[.x]))
  tab_nam <- table(kerns)
  nam <- names(tab_nam[tab_nam > 1])
  vec <- rep(NA, length(kerns))
  for(id in nam){
    id_ <- kerns == id
    if(any(id_)){
      vec[id_] <- add_param$opt_methods[id_]
    }
  }
  prediction <- 1:length(kerns) %>% 
    map_dfc(.f = ~ kernel_pred(epsilon = opt_param[[kern0[.x]]]$opt_param,
                               .y2 = basic_mach$train_data$train_response[basic_mach$id2],
                               .distance = dists[[.x]],
                               .kern = kerns[.x], 
                               .sig = add_param$sig, 
                               .alp = add_param$alp,
                               .meth = vec[.x]))
  if(is.null(test_response)){
    return(list(fitted_aggregate = prediction,
           fitted_machine = pred_test_all))
  } else{
    error <- cbind(pred_test_all, prediction) %>%
      dplyr::mutate(y_test = test_response) %>%
      dplyr::summarise_all(.funs = ~ (. - y_test)) %>%
      dplyr::select(-y_test) %>%
      dplyr::summarise_all(.funs = ~ mean(.^2))
    return(list(fitted_aggregate = prediction,
                fitted_machine = pred_test0,
                mse = error))
  }
}

Example.6 Aggregation on the Boston dataset.


pred <- predict_agg(fitted_models = param,
                    new_data = df[!train, -14],
                    test_response = df[!train, 14])
sqrt(pred$mse)
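Since the error is computed column-wise on both the basic machines and the aggregation, sqrt(pred$mse) gives the test RMSE of every basic machine together with that of each kernel/optimizer pair.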

5 Function : GradientCOBRARegressor

This function puts together all the functions above and returns the final result of the kernel-based consensual aggregation method.

GradientCOBRARegressor <- function(train_design,
                                   train_response,
                                   test_design,
                                   test_response = NULL,
                                   scale_input = FALSE,
                                   scale_machine = FALSE,
                                   build_machine = TRUE,
                                   machines = NULL,
                                   splits = 0.5,
                                   n_cv = 5,
                                   sig = 3,
                                   alp = 2,
                                   kernels = "gaussian",
                                   optimizeMethod = "grad",
                                   setBasicMachineParam = setBasicParameter(),
                                   setGradParam = setGradParameter(),
                                   setGridParam = setGridParameter()){
  # build machines + tune parameter
  fit_mod <- fit_parameter(train_design = train_design, 
                           train_response = train_response,
                           scale_input = scale_input,
                           scale_machine = scale_machine,
                           build_machine = build_machine,
                           machines = machines, 
                           splits = splits, 
                           n_cv = n_cv,
                           sig = sig,
                           alp = alp,
                           kernels = kernels,
                           optimizeMethod = optimizeMethod,
                           setBasicMachineParam = setBasicMachineParam,
                           setGradParam = setGradParam,
                           setGridParam = setGridParam)
  # prediction
  pred <- predict_agg(fitted_models = fit_mod,
                      new_data = test_design,
                      test_response = test_response)
  return(list(fitted_aggregate = pred$fitted_aggregate,
              fitted_machine = pred$fitted_machine,
              pred_train2 = fit_mod$basic_machines$predict2,
              opt_parameter = fit_mod$opt_parameters,
              mse = pred$mse,
              kernels = kernels,
              ind_D2 = fit_mod$basic_machines$id2))
}

Example.7 A complete aggregation is implemented on the Abalone dataset. Three types of basic machines and two different kernel functions are used.


pacman::p_load(readr)
colname <- c("Type", "LongestShell", "Diameter", "Height", "WholeWeight", "ShuckedWeight", "VisceraWeight", "ShellWeight", "Rings")
df <- readr::read_delim("https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data", col_names = colname, delim = ",", show_col_types = FALSE)

train <- logical(nrow(df))
train[sample(length(train), floor(0.75*nrow(df)))] <- TRUE

agg <- GradientCOBRARegressor(train_design = df[train, 2:8],
                    train_response = df$Rings[train],
                    test_design = df[!train, 2:8],
                    test_response = df$Rings[!train],
                    machines = c("knn", "rf", "xgb"),
                    splits = .5,
                    n_cv = 3,
                    kernels = c("gaussian","triangular"),
                    optimizeMethod = c("grad", "grid"),
                    setBasicMachineParam = setBasicParameter(k = 2:10,
                                                             ntree = 1:5*100,
                                                             nrounds_xgb = 1:5*100),
                    setGradParam = setGradParameter(rate = 0.07,
                                                    print_step = TRUE),
                    setGridParam = setGridParameter(n_val = 100))
## 
## * Building basic machines ...
##  ~ Progress: ... 33% ... 67% ... 100%
## * Gradient descent algorithm ...
##   Step   |  Parameter    |  Gradient |  Threshold 
##  ---------------------------------------------------
##    0     |  0.100000     |  -6.42721e+02     |  1e-10 
##  ---------------------------------------------------
##    1     |  0.170000     |  44.878464    |  642.7214 
##    2     |  0.165112     |  17.640025    |  44.87846 
##    3     |  0.163191     |  6.480987     |  17.64002 
##    4     |  0.162485     |  2.313328     |  6.480987 
##    5     |  0.162233     |  0.816705     |  2.313328 
##    6     |  0.162144     |  0.287190     |  0.8167048 
##    7     |  0.162113     |  0.100847     |  0.2871902 
##    8     |  0.162102     |  0.0353948    |  0.1008471 
##    9     |  0.162098     |  0.0124206    |  0.03539479 
##    10    |  0.162097     |  0.00435826   |  0.01242058 
##    11    |  0.162096     |  0.0015292    |  0.004358263 
##    12    |  0.162096     |  0.000536757  |  0.001529203 
##    13    |  0.162096     |  0.000188231  |  0.0005367569 
##    14    |  0.162096     |  6.6123e-05   |  0.000188231 
##    15    |  0.162096     |  2.3205e-05   |  6.612304e-05 
##    16    |  0.162096     |  8.11049e-06  |  2.320502e-05 
##    17    |  0.162096     |  2.92879e-06  |  8.110492e-06 
##    18    |  0.162096     |  9.01166e-07  |  2.928789e-06 
##    19    |  0.162096     |  3.00389e-07  |  9.011658e-07 
##    20    |  0.162096     |  1.12646e-07  |  3.003886e-07 
##    21    |  0.162096     |  3.75486e-08  |  1.126457e-07 
##    22    |  0.162096     |  3.75486e-08  |  3.754857e-08 
##    23    |  0.162096     |  7.50971e-08  |  3.754857e-08 
##    24    |  0.162096     |  -7.50971e-08     |  7.509715e-08 
##    25    |  0.162096     |  7.50971e-08  |  7.509715e-08 
##    26    |  0.162096     |  0e+00    |  7.509715e-08 
##  -------------------------------------------------------
##  Stopped|  0.162096  |  0        |  0
##  ~ Observed parameter: 0.162096  in 26 iterations.

## 
## * Grid search algorithm... 
##  ~ Observed parameter : 0.03239091

sqrt(agg$mse)

📖 Read also the MixCobra method by Fischer and Mougeot (2019)


References