Bregman divergences (BD)
Definition Let
\(\phi:\mathcal{C}\rightarrow\mathbb{R}\) be
a strictly convex and continuously differentiable function defined on a
measurable convex subset \(\mathcal{C}\subset\mathbb{R}^d\) . Let \(int(\mathcal{C})\) denote its relative
interior. A Bregman divergence indexed by \(\phi\) is a dissimilarity measure \(d_{\phi}:\mathcal{C}\times
int(\mathcal{C})\rightarrow\mathbb{R}\) defined for any pair
\((x,y)\in \mathcal{C}\times
int(\mathcal{C})\) by, \[\begin{equation}
\label{eq:1.10}
d_{\phi}(x,y)=\phi(x)-\phi(y)-\langle x-y,\nabla\phi(y)\rangle
\end{equation}\] where \(\nabla\phi(y)\) denotes the gradient of
\(\phi\) computed at a point \(y\in int(\mathcal{C})\) . A Bregman
divergence is not necessarily a metric as it may not be symmetric and
the triangular inequality might not be satisfied.
This section defines all the Bregman divergences used. The list of
all the Bregman divergences is given in the table below:
Euclidean
\({\|x\|_2^2}=\sum_{i=1}^dx_i^2\)
\(\|x-y\|_2^2\)
\(\mathbb{R}^d\)
General Kullback-Leibler
\(\sum_{i=1}^d x_i\ln(
x_i)\)
\(\sum_{i=1}^d( x_i\ln(\frac{
x_i}{y_i})-(x_i-y_i))\)
\((0,+\infty)^d\)
Logistic
\(\sum_{i=1}^d(x_i\ln(
x_i)+(1- x_i)\ln(1- x_i))\)
\(\sum_{i=1}^d\Big(
x_i\ln(\frac{x_i}{y_i})+(1- x_i)\ln(\frac{1-
x_i}{1-y_i})\Big)\)
\((0,1)^d\)
Itakura-Saito
\(-\sum_{i=1}^d\ln(
x_i)\)
\(\sum_{i=1}^d\Big(\frac{
x_i}{y_i}-\ln(\frac{ x_i}{y_i})-1\Big)\)
\((0,+\infty)^d\)
Exponential
\(\sum_{i=1}^de^{x_i}\)
\(\sum_{i=1}^d(e^{x_i}-e^{y_i}-e^{y_i}(x_i-y_i))\)
\(\mathbb{R}^d\)
Polynomial
\(\sum_{i=1}^d|x|^p,p>2\)
\(\sum_{k=1}^d(|x_k|^p-|y_k|^p-\text{sign}(y_k)^pp(x_k-y_k)y_k^{p-1})\)
\(\mathbb{R}^d\)
Look-up list of Bregman
divergences
The codes below provide a look-up list of all the BDs defined in the
table above.
euclidDiv <- function(X., y., deg = NULL){
res <- sweep(X., 2, y.)
return(rowSums(res^2))
}
gklDiv <- function(X., y., deg = NULL){
res <- c("/", "-") %>%
map(.f = ~ sweep(X., 2, y., FUN = .x))
return(rowSums(X.*log(res[[1]]) - res[[2]]))
}
logDiv <- function(X., y., deg = NULL){
res <- map2(.x = list(X., 1-X.),
.y = list(y., 1-y.),
.f = ~ sweep(.x, 2, .y, FUN = "/"))
return(rowSums(X.*log(res[[1]])+(1-X.)*log(res[[2]])))
}
itaDiv <- function(X., y., deg = NULL){
res <- sweep(X., 2, y., FUN = "/")
return(rowSums(res-log(res) - 1))
}
expDiv <- function(X., y., deg = NULL){
exp_y <- exp(y.)
res <- sweep(1+X., 2, y.) %>%
sweep(2, exp_y, FUN = "*")
return(rowSums(exp(X.)-res))
}
polyDiv <- function(X., y., deg = 3){
S <- map2(.x = list(X., X.^deg),
.y = list(y., y.^deg),
.f = ~ sweep(.x,
MARGIN = 2,
STATS = .y,
FUN = "-"))
if(deg %% 2 == 0){
Tem <- sweep(S[[1]], 2, y.^(deg-1), FUN = "*")
res <- rowSums(S[[2]] - deg * Tem)
}
else{
Tem <- sweep(S[[1]], 2, sign(y.) * y.^(deg-1), FUN = "*")
res <- rowSums(S[[2]] - deg * Tem)
}
return(res)
}
lookup_div <- list(
euclidean = euclidDiv,
gkl = gklDiv,
logistic = logDiv,
itakura = itaDiv,
exponential = expDiv,
polynomial = polyDiv
)
Function :
BregmanDiv
This function computes Bregman divergence matrix between two sets of
data points. Each set of data points should be represented by a matrix,
data frame, or tibble object where each row corresponds to each
individual data point.
BregmanDiv <- function(X.,
C.,
div = c("euclidean",
"gkl",
"logistic",
"itakura",
"exponential",
"polynomial"),
deg = 3){
div <- match.arg(div)
d_c <- dim(C.)
if(is.null(d_c)){
C <- matrix(C., nrow = 1, byrow = TRUE)
} else{
C <- as.matrix(C.)
}
if(is.null(dim(X.))){
X <- matrix(X., nrow = 1, byrow = TRUE)
} else{
X <- as.matrix(X.)
}
dis <- map_dfc(.x = 1:dim(C)[1],
.f = ~ tibble('{{.x}}' := lookup_div[[div]](X, C[.x,], deg = deg)))
return(dis)
}
🧾 Remark.2 : Note that “logistic” Bregman divergence
can handle only data points with domain \({\cal C}=(0,1)^d\) , therefore, it should be
used only in suitable cases.
Step \(K\) :
\(K\) -means with Bregman
divergences
This section implements \(K\) -means
algorithm using Bregman divergences which corresponds to the step \(K\) of KFC procedure.
Function :
findClosestCentroid
and newCentroids
These two functions perform the main steps of \(K\) -means algorithm. Function
findClosestCentroid
assigns any data points to some cluster
according to the smallest divergence between the data point and the
centroid. It provides a vector of clusters of all the data points. From
there, function newCentroids
computes new centroids given
the cluster labels of all data points.
findClosestCentroid <- function(x., centroids., div, deg = 3){
dist <- BregmanDiv(x., centroids., div, deg)
clust <- 1:nrow(x.) %>%
map_int(.f = ~ which.min(dist[.x,]))
return(clust)
}
newCentroids <- function(x., clusters.){
centroids <- unique(clusters.) %>%
map_dfr(.f = ~ colMeans(x.[clusters. == .x, ]))
return(centroids)
}
Function :
kmeansBD
This function performs \(K\) -means
algorithm with BDs.
kmeansBD <- function(train_input,
K,
n_start = 5,
maxIter = 500,
deg = 3,
scale_input = FALSE,
div = "euclidean",
splits = 1,
epsilon = 1e-10,
center_ = NULL,
scale_ = NULL,
id_shuffle = NULL){
start_time <- Sys.time()
# Distortion function
X <- as.matrix(train_input)
N <- dim(X)
if(scale_input){
if(!(is.null(center_) & is.null(scale_))){
if(length(center_) == 1){
center_ <- rep(center_, N[2])
}
if(length(scale_) == 1){
scale_ <- rep(scale_, N[2])
}
} else{
min_ <- apply(X, 2, FUN = min)
c_ <- abs(colMeans(X)/5)
center_ <- min_ - c_
scale_ <- apply(X, 2, FUN = max) - center_ + 1
}
X <- scale(X, center = center_, scale = scale_)
}
if(is.null(id_shuffle)){
train_id <- rep(TRUE, N[1])
if(splits < 1){
train_id[sample(N[1], floor(N[1]*(1-splits)))] <- FALSE
}
} else{
train_id <- id_shuffle
}
X_train1 <- X[train_id,]
X_train2 <- X[!train_id,]
mu <- as.matrix(colMeans(X_train1))
distortion <- function(clus){
cent <- newCentroids(X_train1, clus)
var_within <- 1:K %>%
map(.f = ~ BregmanDiv(X_train1[clus == .x,],
cent[.x,],
div,
deg)) %>%
map(.f = sum) %>%
Reduce("+", .)
return(var_within)
}
# Kmeans algorithm
kmeansWithBD <- function(x., k., maxiter., eps.) {
n. <- nrow(x.)
# initialization
init <- sample(n., k.)
centroids_old <- x.[init,]
i <- 0
while(i < maxIter){
# Assignment step
clusters <- findClosestCentroid(x., centroids_old, div, deg)
# Recompute centroids
centroids_new <- newCentroids(x., clusters)
if ((sum(is.na(centroids_new)) > 0) |
(nrow(centroids_new) != k.)) {
init <- sample(n., k.)
centroids_old <- x.[init,]
warning("NA produced -> reinitialize centroids...!")
}
else{
if(sum(abs(centroids_new - centroids_old)) > eps.){
centroids_old <- centroids_new
} else{
break
}
}
i <- i + 1
}
return(clusters)
}
results <- 1:n_start %>%
map_dfc(.f = ~ tibble("{{.x}}" := kmeansWithBD(X_train1,
K,
maxIter,
epsilon)))
opt_id <- 1:n_start %>%
map_dbl(.f = ~ distortion(results[[.x]])) %>%
which.min
cluster <- clusters <- results[[opt_id]]
j <- 1
ID <- unique(cluster)
for (i in ID) {
clusters[cluster == i] = j
j = j + 1
}
centroids = newCentroids(X_train1, clusters)
time_taken <- Sys.time() - start_time
return(
list(
centroids = centroids,
clusters = clusters,
train_data = list(X_train = X_train1,
X_remain = X_train2,
id_remain = !train_id),
parameters = list(div = div,
deg = deg,
center_ = center_,
scale_ = scale_),
running_time = time_taken
)
)
}
Example.1 : We perform \(K\) -means algorithm with "gkl"
BD on Abalone
dataset.
pacman::p_load(readr)
colname <- c("Type", "LongestShell", "Diameter", "Height", "WholeWeight", "ShuckedWeight", "VisceraWeight", "ShellWeight", "Rings")
df <- readr::read_delim("https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data", col_names = colname, delim = ",", show_col_types = FALSE)
n <- nrow(df)
train <- logical(n)
train[sample(n, floor(n*0.8))] <- TRUE
cl <- df[train,2:(ncol(df)-1)] %>%
kmeansBD(K = 3, div = "gkl", splits = 0.5, scale_input = TRUE)
table(cl$clusters)
1 2 3
655 466 550
Step \(F\) :
Fitting predictive models
This section builds global models by fitting local model on each
given cluster of the obtained partition. This corresponds to the step
\(F\) of the procedure.
Function :
fitLocalModels
This function fits local models \(({\cal
M_{j,k}})_{j,k}\) on all clusters \(k=1,...,K\) of the obtained partition,
given by Bregman divergence \({\cal
B}_j\) , for \(j=1,...,M\) .
fitLocalModels <- function(kmeans_BD,
train_response,
model = "lm",
formula = NULL){
start_time <- Sys.time()
X_train <- kmeans_BD$train_data$X_train
y_train <- train_response[!(kmeans_BD$train_data$id_remain)]
X_remain <- kmeans_BD$train_data$X_remain
y_remain <- NULL
if(!is.null(X_remain)){
y_remain <- train_response[kmeans_BD$train_data$id_remain]
}
pacman::p_load(tree)
pacman::p_load(randomForest)
model_ <- ifelse(model == "tree", tree::tree, model)
K <- nrow(kmeans_BD$centroids)
if (is.null(formula)){
form <- formula(target ~ .)
}
else{
form <- update(formula, target ~ .)
}
data_ <- bind_cols(X_train, "target":= y_train)
fit_lookup <- list(lm = "fitted.values",
rf = "predicted")
if(is.character(model_)){
model_lookup <- list(lm = lm,
rf = randomForest::randomForest)
mod <- map(.x = 1:K,
.f = ~ model_lookup[[model_]](formula = form,
data = data_[kmeans_BD$clusters == .x, ]))
} else{
mod <- map(.x = 1:K,
.f = ~ model_(formula = form,
data = data_[kmeans_BD$clusters == .x,]))
}
pred0 <- NULL
if(!is.null(X_remain)){
pred0 <- vector(mode = "numeric",
length = length(y_remain))
clus <- findClosestCentroid(x. = X_remain,
centroids. = kmeans_BD$centroids,
div = kmeans_BD$parameters$div,
deg = kmeans_BD$parameters$deg)
for(i_ in 1:K){
pred0[clus == i_] <- predict(mod[[i_]],
as.data.frame(X_remain[clus == i_,]))
}
}
time_taken <- Sys.time() - start_time
return(list(
local_models = mod,
kmeans_BD = kmeans_BD,
data_remain = list(fit = pred0,
response = y_remain),
running_time = time_taken
))
}
Example.2 : From Example.1 above,
multiple linear regression models are built on all the obtained
clusters. The mean square error of this model, evaluated on the
remaining \(50\%\) of the training data
is computed.
fit <- fitLocalModels(train_response = df$Rings[train],
kmeans_BD = cl,
model = "lm")
mean((fit$data_remain$response- fit$data_remain$fit)^2)
[1] 4.680465
Function :
localPredict
This function allows us to predict any new observations using the
candidate model \({\cal M}_j=({\cal
M}_{j,k})_{k=1}^M\) corresponding to Bregman divergence \({\cal B}_j\) , for some \(j\in J\subset\{1,...,M\}\) .
localPredict <- function(localModels,
newData){
kmean_BD <- localModels$kmeans_BD
K <- nrow(kmean_BD$centroids)
newData_ <- newData
if(!(is.null(kmean_BD$parameters$center_))){
newData_ <- scale(newData,
center = kmean_BD$parameters$center_,
scale = kmean_BD$parameters$scale_)
id0 <- (newData_ <= 0)
if(sum(id0) > 0){
min_ <- min(newData_[id0])
newData_[id0] <- runif(sum(id0), min(1e-3, min_/10), min_)
}
}
clus <- findClosestCentroid(x. = newData_,
centroids. = kmean_BD$centroids,
div = kmean_BD$parameters$div,
deg = kmean_BD$parameters$deg)
pred0 <- vector(mode = "numeric", length = nrow(newData_))
for(i_ in 1:K){
pred0[clus == i_] <- predict(localModels$local_models[[i_]],
as.data.frame(newData_[clus == i_,]))
}
pred0 <- as_tibble(pred0)
names(pred0) <- ifelse(kmean_BD$parameters$div == "polynomial",
paste0("polynomial", kmean_BD$parameters$deg),
kmean_BD$parameters$div)
return(pred0)
}
Example.3 : The performance of the candidate model
corresponding to "gkl"
divergence is compared to random
forest regression on a \(20\%\) testing
data.
y_hat <- localPredict(fit,
df[!train, 2:(ncol(df)-1),])
rf <- randomForest(Rings ~ ., data = df[train,2:ncol(df)])
mean((predict(rf, newdata = df[!train,2:ncol(df)])- df$Rings[!train])^2)
[1] 4.345826
mean((y_hat$gkl-df$Rings[!train])^2)
[1] 4.524662
Step \(C\) :
Combining methods
The source codes and information of the aggregation methods are
available here
.
The codes below imports the aggregation methods into Rstudio
environment.
pacman::p_load(devtools)
### Kernel based consensual aggregation
source_url("https://raw.githubusercontent.com/hassothea/AggregationMethods/main/KernelAggReg.R")
ℹ SHA-1 hash of file is 813bced0dd2d9aa2431d07b64de08db7a9887bb1
### MixCobra
source_url("https://raw.githubusercontent.com/hassothea/AggregationMethods/main/MixCobraReg.R")
ℹ SHA-1 hash of file is c874cc5e16f484866d07476d002ec77f244989ee
Function : stepK
,
stepF
and stepC
These functions allow to set values of the parameters in the three
steps of the KFC procedure. Each function returns a list of all the
parameters given in its arguments.
stepK = function(K,
n_start = 5,
maxIter = 300,
deg = 3,
scale_input = FALSE,
div = NULL,
splits = 0.75,
epsilon = 1e-10,
center_ = NULL,
scale_ = NULL){
return(list(K = K,
n_start = n_start,
maxIter = maxIter,
deg = deg,
scale_input = scale_input,
div = div,
splits = splits,
epsilon = epsilon,
center_ = center_ ,
scale_ = scale_))
}
stepF = function(formula = NULL,
model = "lm"){
return(list(formula = formula,
model = model))
}
stepC = function(n_cv = 5,
method = c("cobra", "mixcobra"),
opt_methods = c("grad", "grid"),
kernels = "gaussian",
scale_features = FALSE){
return(list(n_cv = n_cv,
method = method,
opt_methods = opt_methods,
kernels = kernels,
scale_features = scale_features))
}
Function : KFCRegressor
This function is the complete implementation of KFC procedure.
🧾 Remark.3 : The parallel
argument
above requires internet connection to load the source codes of \(K\) -means algorithm with BDs from GitHub
.
It is performed on the maximum number of clusters of your machine, and
the speed is at least two times faster than without parallelism,
however, it is not so stable depending on your internet connection or
machine.
For the aggregation methods in step \(C\) :
KFCRegressor = function(train_input,
train_response,
test_input,
test_response = NULL,
n_cv = 5,
parallel = TRUE,
inv_sigma = sqrt(.5),
alp = 2,
K_step = stepK(splits = .5),
F_step = stepF(),
C_step = stepC(),
setGradParamAgg = setGradParameter(),
setGridParamAgg = setGridParameter(),
setGradParamMix = setGradParameter_Mix(),
setGridParamMix = setGridParameter_Mix(),
silent = FALSE){
start_time <- Sys.time()
lookup_div_names <- c("euclidean",
"gkl",
"logistic",
"itakura",
"polynomial")
div_ <- K_step$div
### K step: Kmeans clustering with BDs
if (is.null(K_step$div)){
divergences <- lookup_div_names
warning("No divergence provided! All of them are used!")
}
else{
divergences <- K_step$div %>%
map_chr(.f = ~ match.arg(arg = .x,
choices = lookup_div_names))
}
div_list <- divergences %>%
map(.f = (\(x) if(x != "polynomial") return(x) else return(rep("polynomial", length(K_step$deg))))) %>%
unlist
deg_list <- rep(NA, length(div_))
deg_list[div_list == "polynomial"] <- K_step$deg
div_names <- map2_chr(.x = div_list,
.y = deg_list,
.f = (\(x, y) if(is.na(y)) return(x) else return(paste0(x,y))))
### Step K: Kmeans clustering with Bregman divergences
dm <- dim(train_input)
id_shuffle <- vector(length = dm[1])
n_train <- floor(K_step$splits * dm[1])
id_shuffle[sample(dm[1], n_train)] <- TRUE
if(parallel){
numCores <- parallel::detectCores()
doParallel::registerDoParallel(numCores) # use multicore, set to the number of our cores
kmean_ <- foreach(i=1:length(div_names)) %dopar% {
devtools::source_url("https://raw.githubusercontent.com/hassothea/KFC-Procedure/master/kmeanBD.R")
kmeansBD(train_input = train_input,
K = K_step$K,
div = div_list[i],
n_start = K_step$n_start,
maxIter = K_step$maxIter,
deg = deg_list[i],
scale_input = K_step$scale_input,
splits = K_step$splits,
epsilon = K_step$epsilon,
center_ = K_step$center_,
scale_ = K_step$scale_,
id_shuffle = id_shuffle)
}
doParallel::stopImplicitCluster()
} else{
kmean_ <- map2(.x = div_list,
.y = deg_list,
.f = ~ kmeansBD(train_input = train_input,
K = K_step$K,
div = .x,
n_start = K_step$n_start,
maxIter = K_step$maxIter,
deg = .y,
scale_input = K_step$scale_input,
splits = K_step$splits,
epsilon = K_step$epsilon,
center_ = K_step$center_,
scale_ = K_step$scale_,
id_shuffle = id_shuffle))
}
names(kmean_) <- div_names
### F step: Fitting the corresponding model on each observed cluster
model_ <- div_names %>%
map(.f = ~ fitLocalModels(kmean_[[.x]],
train_response = train_response,
model = F_step$model,
formula = F_step$formula))
names(model_) <- div_names
pred_combine <- model_ %>%
map_dfc(.f = ~ .x$data_remain$fit)
y_remain <- train_response[!id_shuffle]
pred_test <- div_names %>%
map_dfc(.f = ~ localPredict(model_[[.x]],
test_input))
names(pred_test) <- names(pred_combine) <- div_names
# C step: Consensual regression aggregation method with kernel-based COBRA
list_method_agg <- list(mixcobra = function(pred){MixCobraReg(train_input = train_input[!id_shuffle,],
train_response = y_remain,
test_input = test_input,
train_predictions = pred,
test_predictions = pred_test,
test_response = test_response,
scale_input = K_step$scale_input,
scale_machine = C_step$scale_features,
n_cv = C_step$n_cv,
inv_sigma = inv_sigma,
alp = alp,
kernels = C_step$kernels,
optimizeMethod = C_step$opt_methods,
setGradParam = setGradParamMix,
setGridParam = setGridParamMix,
silent = silent)},
cobra = function(pred){kernelAggReg(train_design = pred,
train_response = y_remain,
test_design = pred_test,
test_response = test_response,
scale_input = K_step$scale_input,
scale_machine = C_step$scale_features,
build_machine = FALSE,
machines = NULL,
n_cv = C_step$n_cv,
inv_sigma = sqrt(.5),
alp = 2,
kernels = C_step$kernels,
optimizeMethod = C_step$opt_methods,
setGradParam = setGradParamAgg,
setGridParam = setGridParamAgg,
silent = silent)})
res <- map(.x = C_step$method,
.f = ~ list_method_agg[[.x]](pred_combine))
list_agg_methods <- list(cobra = "cob",
mixcobra = "mix")
names(res) <- C_step$method
ext_fun <- function(L, nam){
tab <- L$fitted_aggregate
names(tab) <- paste0(names(tab), "_", nam)
return(tab)
}
pred_fin <- C_step$method %>%
map_dfc(.f = ~ ext_fun(res[[.x]], list_agg_methods[[.x]]))
time.taken <- Sys.time() - start_time
### To finish
if(is.null(test_response)){
return(list(
predict_final = pred_fin,
predict_local = pred_test,
agg_method = res,
running_time = time.taken
))
} else{
error <- cbind(pred_test, pred_fin) %>%
dplyr::mutate(y_test = test_response) %>%
dplyr::summarise_all(.funs = ~ (. - y_test)) %>%
dplyr::select(-y_test) %>%
dplyr::summarise_all(.funs = ~ mean(.^2))
return(list(
predict_final = pred_fin,
predict_local = pred_test,
agg_method = res,
mse = error,
running_time = time.taken
))
}
}
Example.4 : A complete KFC procedure is implemented
on the same Abalone data, using \(5\)
BDs "euclidean"
, "itakura"
, "gkl"
and "polynomial"
(of degree \(3\) and \(6\) ). Both aggregation methods are used in
the step \(C\) . Two kernel functions
are used for each aggregation method: "gaussian"
(with
gradient descent algorithm) and "epanechnikov"
(with grid
search algorithm).
train1 <- logical(n)
train1[sample(n, floor(n*0.8))] <- TRUE
kfc1 <- KFCRegressor(train_input = df[train1,2:ncol(df)],
train_response = df$Rings[train1],
test_input = df[!train1,2:ncol(df)],
K_step = stepK(K = 3,
scale_input = TRUE,
div = c("eucl", "ita", "gkl", "poly"),
deg = c(3, 6),
splits = .5),
C_step = stepC(method = c("cobra", "mixcobra"),
opt_methods = c("grad", "grid"),
kernels = c("gaussian", "gaussian"),
scale_features = FALSE),
setGradParamAgg = setGradParameter(rate = 0.2),
#coef_lm = 2),
setGridParamAgg = setGridParameter(min_val = .00001,
max_val = 10,
n_val = 100),
setGradParamMix = setGradParameter_Mix(rate = "linear",
coef_auto = c(.5,.5)),
setGridParamMix = setGridParameter_Mix(min_alpha = 1e-10,
max_alpha = 0.5,
min_beta = 1e-10,
max_beta = 1,
n_alpha = 20,
n_beta = 20))
* Gradient descent algorithm ...
Step | Parameter | Gradient | Threshold
---------------------------------------------------
0 | 7.777800 | -4.40022e-10 | 1e-10
---------------------------------------------------
1 | 7.977800 | 2.5668e-10 | 0.02571421
2 | 7.777800 | -4.40022e-10 | 0.02506957
3 | 7.977800 | 2.5668e-10 | 0.02571421
4 | 7.777800 | -4.40022e-10 | 0.02506957
5 | 7.977800 | 2.5668e-10 | 0.02571421
6 | 7.777800 | -4.40022e-10 | 0.02506957
7 | 7.977800 | 2.5668e-10 | 0.02571421
8 | 7.911133 | 0e+00 | 0.008356523
-------------------------------------------------------
Stopped| 7.911133 | 0 | 0
~ Observed parameter: 7.911133 in 8 iterations.
* Grid search algorithm...
~ Observed parameter : 8.08081
MixCobra for regression
-----------------------
* Gradient descent algorithm ...
Step | alpha ; beta | Gradient (alpha ; beta) | Threshold
--------------------------------------------------------------------------------
0 | 5.00005 ; 25.05000 | -4.68285e-04 ; 7.3337e-11 | 1e-10
--------------------------------------------------------------------------------
0 | 5.0000 ; 25.0500 | -4.6828e-04 ; 7.3337e-11 | 0.004214561
1 | 5.5000 ; 24.5500 | -3.6509e-04 ; -1.1001e-10 | 0.03327781
2 | 6.2797 ; 26.0500 | -2.2769e-04 ; 2.2001e-10 | 0.07586129
3 | 7.0090 ; 21.5500 | -1.2144e-04 ; -3.6669e-11 | 0.1617499
4 | 7.5277 ; 22.5500 | -5.7064e-05 ; 3.6669e-10 | 0.05317693
5 | 7.8323 ; 10.0500 | -2.3067e-05 ; -1.1001e-10 | 0.425719
6 | 7.9801 ; 14.5500 | -7.5142e-06 ; 1.4667e-10 | 0.2599088
7 | 8.0363 ; 11.0500 | -1.7573e-06 ; -1.1001e-10 | 0.1578404
8 | 8.0513 ; 12.5500 | -2.3244e-07 ; 2.5668e-10 | 0.07937698
9 | 8.0535 ; 10.5813 | -6.3437e-09 ; -4.4002e-10 | 0.09567286
10 | 8.0536 ; 12.4563 | -7.3337e-11 ; 0e+00 | 0.100622
11 | 8.0536 ; 12.4563 | 8.4338e-10 ; 0e+00 | 4.199661e-08
12 | 8.0536 ; 12.4563 | -3.6669e-10 ; -1.4667e-10 | 5.268665e-07
13 | 8.0536 ; 12.6594 | -3.3002e-10 ; 1.1001e-10 | 0.009903915
14 | 8.0536 ; 12.5773 | -5.1336e-10 ; 3.6669e-11 | 0.003960444
15 | 8.0536 ; 12.5627 | -2.5668e-10 ; 0e+00 | 0.0007101231
16 | 8.0536 ; 12.5627 | 7.3337e-11 ; 3.3002e-10 | 5.317427e-08
17 | 8.0536 ; 12.4880 | -3.6669e-11 ; -1.1001e-10 | 0.003623708
18 | 8.0536 ; 12.5012 | 7.3337e-11 ; 3.6669e-10 | 0.000641805
19 | 8.0536 ; 12.4780 | 7.3337e-11 ; -7.3337e-11 | 0.001128374
20 | 8.0536 ; 12.4804 | 1.8334e-10 ; -1.1001e-10 | 0.0001189123
21 | 8.0536 ; 12.4823 | -3.6669e-11 ; 3.3002e-10 | 9.363669e-05
22 | 8.0536 ; 12.4763 | 3.6669e-11 ; 3.6669e-11 | 0.0002942408
23 | 8.0536 ; 12.4759 | 1.4667e-10 ; 2.2001e-10 | 1.709539e-05
24 | 8.0536 ; 12.4738 | 0e+00 ; 0e+00 | 0.0001070309
25 | 8.0536 ; 12.4738 | 0e+00 ; 0e+00 | 3.666853e-10
--------------------------------------------------------------------------------
Stopped| 8.0536 ; 12.4738 | 0 | 0
~ Observed parameter: (alpha, beta) = ( 8.05358 , 12.47375 ) in 26 itertaions.
* Grid search algorithm...
~ Observed parameter: (alpha, beta) = (0.5, 1)
The mean square errors evaluated on \(20\%\) -testing data of the above
computation are reported below.
rf1 <- randomForest::randomForest(Rings ~ ., data = df[train1,2:ncol(df)])
kfc1$predict_final %>%
mutate(rf = predict(rf1, newdata = df[!train1,2:ncol(df)])) %>%
sweep(MARGIN = 1, STATS = df$Rings[!train1], FUN = "-") %>%
.^2 %>%
colMeans
gaussian_grad_cob gaussian_grid_cob gaussian_grad_mix gaussian_grid_mix rf
0.005980861 0.005980861 4.701859057 9.531297947 5.096848245
We can see that KFC procedure performs really well on this real-life
dataset.
📖 Read also KernelAggReg
and MixCobraReg methods.
---
title: "<span style='color: #1C81AA;'>**KFC procedure for regression -</span> [Has et al. (2021)](https://www.tandfonline.com/eprint/YKGS8GTKDBKYFXEGFWSB/full?target=10.1080/00949655.2021.1891539)**"
author: "<span style='color: #D4A51C;'>***Sothea Has***</span>"
date: "5/20/2022"
output:
  html_document:
    css: hideOutput.css
    includes:
      in_header: hideOutput.script
    df_print: paged
    code_folding: hide
    number_sections: yes
    toc: yes
    toc_depth: '2'
    tocdepth: 2
  html_notebook:
    css: hideOutput.css
    includes:
      in_header: hideOutput.script
    code_folding: hide
    number_sections: yes
    toc: yes
    toc_depth: 2
    tocdepth: 2
  pdf_document:
    toc: yes
    toc_depth: '2'
---

<style>
  .btn {
    border-width: 0 0px 0px 0px;
    font-weight: normal;
    text-transform: ;
  }
.btn-default {
  color: #2ecc71;
    background-color: #ffffff;
    border-color: #ffffff;
}
</style>

<!-- Colors
blue : #1FAAE3
yellow : #F0AE14
green : #54D319 
red : #E6180A
-->


```{r, echo=FALSE}
# Check if package "fontawesome" is already installed 

lookup_packages <- installed.packages()[,1]
if(!("fontawesome" %in% lookup_packages))
  install.packages("fontawesome")
```


<span style="color: #1FAAE3;">&#128270;<u> How to download & run the codes?</u></span>{-}
===

All the source codes of the aggregation methods are available [here <span style="color: #097BC1"> `r fontawesome::fa("github")`</span>](https://github.com/hassothea/AggregationMethods). To run the codes, you can <span style="color: #097BC1">`clone`</span> the repository directly or simply load the <span style="color: #097BC1">`R script`</span> source file from the repository using [devtools](https://cran.r-project.org/web/packages/devtools/index.html) package in <span style="color: #0287D8;"> **Rstudio** </span> as follow:

1. Install [devtools](https://cran.r-project.org/web/packages/devtools/index.html) package using command: 

    `install.packages("devtools")`

2. Loading the source codes from <span style="color: #097BC1">GitHub `r fontawesome::fa("github")`</span> repository using `source_url` function by: 

    `devtools::source_url("https://raw.githubusercontent.com/hassothea/KFC-Procedure/master/KFCReg.R")`

---

> **&#9998; Note**: All codes contained in this `Rmarkdown` are built with recent version of <span style="color: #097BC1">`r fontawesome::fa("r-project")`</span> (version $>$ 4.1, available [here](https://cran.r-project.org/bin/windows/base/)) and <span style="color: #0287D8;"> **Rstudio** </span> (version > `2022.02.2+485`, available [here](https://www.rstudio.com/products/rstudio/download/#download)). Note also that the code chucks are <span style="color: #E6180A;">hidden</span> by default.

<span style="color: #F0AE14"> **To see the codes, you can:** </span>

- click on the top-right <span style="color: #54D319 ;">Code</span> button of the page, then choose **Show All Code** to show all the codes, or 
- simply click on the right-corner <span style="color: #54D319 ;">Code</span> button at each section to show the codes of that specific section.

---

<span style="color: #1FAAE3;"><u>KFC procedure & important packages </u></span>
===

<span style="color: #F0AE14;"><u>KFC procedure</u></span>
---

KFC procedure is a three-step methodology which puts together clustering and consensual aggregation methods for building predictions in supervised learning problems. The procedure is inspired by many real-life prediction problems when the in The three steps of the procedure are:

- **Step *K* **: $K$-means clustering algorithm is implemented on the input data using several options of Bregman divergences $({\cal B}_j)_{j=1}^M$ ($M$ is the number of total divergences used), therefore, the input data is partitioned into many different structures, according to the property of each Bregman divergence.
- **Step *F* **: For a partition structure given by a divergence ${\cal B}_j$, we fit simple models (linear, for example) on all the clusters of the obtained partition. Then, the collection ${\cal M}_j=\{{\cal M}_{j,k}\}_{k=1}^K$ of these local models is called *candidate* model, corresponding to the Bregman divergence ${\cal B}_j$. At the end of this step, several candidate models are constructed. 
- **Step *C* **: This step aggregates the obtained candidate models using consensual aggregation methods studied in [Has (2021)](https://hal.archives-ouvertes.fr/hal-02884333v5) or [Fischer and Mougeot (2019)](https://www.sciencedirect.com/science/article/pii/S0378375818302349).

---

![The figure above represents the summary of KFC procedure](./kfc.png)

---

> 🧾 **Remark.1**: 
The prediction of any observation $x$ given by a candidate model ${\cal M}_j$ is done in two simple steps:

1. $x$ is classified into one of the obtained clusters using the corresponding divergence ${\cal B}_j$, i.e.,
  $$x\in{\cal C}_{k^*} \Leftrightarrow {\cal B}_j(c_{k^*},x)=\inf_{1\leq k\leq K}{\cal B}_j(c_k,x)$$
where $\{c_1,...,c_K\}_{k=1}^K$ are the centroids of the corresponding clusters $\{C_1,...,C_K\}$.

2. The prediction of $x$ is given by the corresponding local model ${\cal M}_{j,k^*}$ defined on cluster $k^*$, i.e., ${\cal M}_j(x)={\cal M}_{j,k^*}(x)$.

---

<span style="color: #F0AE14;"><u>Important packages</u></span>
---

We prepare all the necessary tools for this `Rmarkdown`. The `pacman` package allows us to load (if exists) or install (if does not exist) any available packages from [The Comprehensive R Archive Network (CRAN)](https://cran.r-project.org/) of <span style="color: #097BC1">`r fontawesome::fa("r-project")`</span>. 


```{r}
# Check if package "pacman" is already installed 

lookup_packages <- installed.packages()[,1]
if(!("pacman" %in% lookup_packages))
  install.packages("pacman")


# To be installed or loaded
pacman::p_load(magrittr)
pacman::p_load(tidyverse)

## package for "generateMachines"
pacman::p_load(tree)
pacman::p_load(glmnet)
pacman::p_load(randomForest)
pacman::p_load(FNN)
pacman::p_load(xgboost)
pacman::p_load(keras)
pacman::p_load(pracma)
pacman::p_load(latex2exp)
pacman::p_load(plotly)
pacman::p_load(parallel)
pacman::p_load(foreach)
pacman::p_load(doParallel)
rm(lookup_packages)
```


<span style="color: #1FAAE3;"><u>Bregman divergences (BD)</u></span>
===

<span style="color: #1FAAE3;">**Definition**</span> Let $\phi:\mathcal{C}\rightarrow\mathbb{R}$ be a strictly convex and continuously differentiable function defined on a measurable convex subset $\mathcal{C}\subset\mathbb{R}^d$. Let $int(\mathcal{C})$ denote its relative interior. A Bregman divergence indexed by $\phi$ is a dissimilarity measure $d_{\phi}:\mathcal{C}\times int(\mathcal{C})\rightarrow\mathbb{R}$ defined for any pair $(x,y)\in \mathcal{C}\times int(\mathcal{C})$ by,
\begin{equation}
\label{eq:1.10}
d_{\phi}(x,y)=\phi(x)-\phi(y)-\langle x-y,\nabla\phi(y)\rangle 
\end{equation}
where $\nabla\phi(y)$ denotes the gradient of $\phi$ computed at a point $y\in int(\mathcal{C})$. A Bregman divergence is not necessarily a metric as it may not be symmetric and the triangular inequality might not be satisfied.

This section defines all the Bregman divergences used. The list of all the Bregman divergences is given in the table below:

----

Name                        $\phi$                                        $d_{\phi}$                                                                                                    $\cal C$
------------------------- ----------------------------------------------- ----------------------------------------------------------------------------------- -----------
Euclidean                 ${\|x\|_2^2}=\sum_{i=1}^dx_i^2$                   $\|x-y\|_2^2$                                                                     $\mathbb{R}^d$
General Kullback-Leibler  $\sum_{i=1}^d x_i\ln( x_i)$                       $\sum_{i=1}^d( x_i\ln(\frac{ x_i}{y_i})-(x_i-y_i))$                               $(0,+\infty)^d$                                  
Logistic                  $\sum_{i=1}^d(x_i\ln( x_i)+(1- x_i)\ln(1- x_i))$  $\sum_{i=1}^d\Big( x_i\ln(\frac{x_i}{y_i})+(1- x_i)\ln(\frac{1- x_i}{1-y_i})\Big)$   $(0,1)^d$
Itakura-Saito             $-\sum_{i=1}^d\ln( x_i)$                          $\sum_{i=1}^d\Big(\frac{ x_i}{y_i}-\ln(\frac{ x_i}{y_i})-1\Big)$                    $(0,+\infty)^d$
Exponential               $\sum_{i=1}^de^{x_i}$                             $\sum_{i=1}^d(e^{x_i}-e^{y_i}-e^{y_i}(x_i-y_i))$                                    $\mathbb{R}^d$
Polynomial                $\sum_{i=1}^d|x|^p,p>2$                           $\sum_{k=1}^d(|x_k|^p-|y_k|^p-\text{sign}(y_k)^pp(x_k-y_k)y_k^{p-1})$             $\mathbb{R}^d$

----

<span style="color: #F0AE14;"><u>Look-up list of Bregman divergences</u></span>
----

The codes below provide a look-up list of all the BDs defined in the table above.

```{r}
euclidDiv <- function(X., y., deg = NULL){
    res <- sweep(X., 2, y.)
    return(rowSums(res^2))
}
gklDiv <- function(X., y., deg = NULL){
  res <- c("/", "-") %>%
    map(.f = ~ sweep(X., 2, y., FUN = .x))
  return(rowSums(X.*log(res[[1]]) - res[[2]]))
}
logDiv <- function(X., y., deg = NULL){
    res <-  map2(.x = list(X., 1-X.),
                 .y = list(y., 1-y.),
                 .f = ~ sweep(.x, 2, .y, FUN = "/"))
    return(rowSums(X.*log(res[[1]])+(1-X.)*log(res[[2]])))
}
itaDiv <- function(X., y., deg = NULL){
    res <- sweep(X., 2, y., FUN = "/")
    return(rowSums(res-log(res) - 1))
}
expDiv <- function(X., y., deg = NULL){
    exp_y <- exp(y.)
    res <- sweep(1+X., 2, y.) %>%
      sweep(2, exp_y, FUN = "*")
    return(rowSums(exp(X.)-res))
}
polyDiv <- function(X., y., deg = 3){
    S <- map2(.x = list(X., X.^deg),
              .y = list(y., y.^deg),
              .f = ~ sweep(.x, 
                           MARGIN = 2, 
                           STATS = .y,
                           FUN = "-"))
    if(deg %% 2 == 0){
      Tem <- sweep(S[[1]], 2, y.^(deg-1), FUN = "*")
      res <- rowSums(S[[2]] - deg * Tem)
    }
    else{
      Tem <- sweep(S[[1]], 2, sign(y.) * y.^(deg-1), FUN = "*")
      res <- rowSums(S[[2]] - deg * Tem)
    }
    return(res)
}
lookup_div <- list(
  euclidean = euclidDiv,
  gkl = gklDiv,
  logistic = logDiv,
  itakura = itaDiv,
  exponential = expDiv,
  polynomial = polyDiv
)
```


<span style="color: #F0AE14;"><u>Function</u></span> : `BregmanDiv`
----

This function computes Bregman divergence matrix between two sets of data points. Each set of data points should be represented by a matrix, data frame, or tibble object where each row corresponds to each individual data point.

- **Argument**:

    - `X.`, `C.` : data matrices, tibbles and data frames, containing the data points (by row) for which the Bregman divergences between them are to be computed.
    - `div` : the divergence type to be used. It should be a subset of `{"euclidean", "gkl", "logistic", "itakura", "exponential", "polynomial"}`.
    - `deg` : the degree of polynomial BD (if one is used).
- **Value**: 
    
    This function returns a *tibble* object $D=(d_{i,j})$ where $d_{i,j}$ is the Bregman divergence between row $i$ of `X.` and row $j$ of `C.`.


```{r}
BregmanDiv <- function(X., 
                       C., 
                       div = c("euclidean",
                                "gkl",
                                "logistic",
                                "itakura",
                                "exponential",
                                "polynomial"),
                       deg = 3){
  div <- match.arg(div)
  d_c <- dim(C.)
  if(is.null(d_c)){
    C <- matrix(C., nrow = 1, byrow = TRUE)
  } else{
    C <- as.matrix(C.)
  }
  if(is.null(dim(X.))){
    X <- matrix(X., nrow = 1, byrow = TRUE)
  } else{
    X <- as.matrix(X.)
  }
  dis <-  map_dfc(.x = 1:dim(C)[1],
                  .f = ~ tibble('{{.x}}' := lookup_div[[div]](X, C[.x,], deg = deg)))
  return(dis)
}
```


---

> 🧾 **Remark.2**: 
Note that "logistic" Bregman divergence can handle only data points with domain ${\cal C}=(0,1)^d$, therefore, it should be used only in suitable cases.

---

<span style="color: #1FAAE3;"><u>Step $K$: $K$-means with Bregman divergences</u></span>
===

This section implements $K$-means algorithm using Bregman divergences which corresponds to the step $K$ of KFC procedure.

<span style="color: #F0AE14;"><u>Function</u></span> : `findClosestCentroid` and `newCentroids`
----

These two functions perform the main steps of $K$-means algorithm. Function `findClosestCentroid` assigns any data points to some cluster according to the smallest divergence between the data point and the centroid. It provides a vector of clusters of all the data points. From there, function `newCentroids` computes new centroids given the cluster labels of all data points.

- **Argument**:

    - `x.` : the data matrices, tibbles and data frames, containing the data points to be assigned to some cluster.
    - `centroids` : the matrix or data frame of centroids (by row).
    - `div` : the divergence type to be used.
    - `deg` : the degree of polynomial BD (if one is used).
    
- **Value**: 

    The each of the two functions returns arguments for one another:
    
    - `findClosestCentroid` returns a vector of size equals to the number of rows of data matrix x., containing the cluster labels of the data points.
    - `newCentroids` returns new matrix of centroids.

```{r}
findClosestCentroid <- function(x., centroids., div, deg = 3){
  dist <- BregmanDiv(x., centroids., div, deg)
  clust <- 1:nrow(x.) %>%
    map_int(.f = ~ which.min(dist[.x,]))
  return(clust)
}
newCentroids <- function(x., clusters.){
  centroids <- unique(clusters.) %>%
    map_dfr(.f = ~ colMeans(x.[clusters. == .x, ]))
  return(centroids)
}
```

<span style="color: #F0AE14;"><u>Function</u></span> : `kmeansBD`
----

This function performs $K$-means algorithm with BDs. 

- **Argument**:

    - `train_input` : the data matrices, tibbles and data frames, containing the data points.
    - `K` : the number of clusters.
    - `n_start` : the number of times to perform the algorithm, and the best one among them is chosen to be the final result. This is done to avoid local optimal solutions. By default, `n_start = 5`.
    - `maxIter` : the maximum number of iterations in case the algorithm does not converge. By default, `maxIter = 500`.
    - `deg` : the degree of polynomial BD (if one is used).
    - `scale_input` : a logical value controlling whether to scale the input to be in $(0,1)$ or not. By default, `scale_input = FALSE`.
    - `div` : the type of divergence to be used. By default, `div = "euclidean"` and the usual $K$-means algorithm is performed.
    - `splits` : a real number between $0$ and $1$ specifying the proportion of training data to be used to perform $K$-means algorithm. The remaining part with be used for the aggregation. By default, `splits = 1` and all the input data are used.
    - `epsilon` : the stopping criterion of the algorithm. By default, `epsilon = 1e-10`.
    - `center_`, `scale_` : the center and scale to be used to scale the input data. By default, they are `NULL`.
    - `id_shuffle` : a logical vector specifying which part of the training data will be selected to perform the algorithm. This is important when we want to perform the algorithm on the same set of data points but with different BDs. 
    
- **Value**: 

    This function returns a **list** of the following objects:
    - `centroids` : the matrix of the centroids obtained by the algorithm.
    - `clusters` : a vector of cluster labels of the data points.
    - `train_data` : a list of the following objects:
        - `X_train` : the training data used for the algorithm.
        - `X_remain` : the remaining part of the input data used for the aggregation.
        - `id_remain` : a logical vector specifying the remaining part (`X_remain`) of the input data.
    - `parameters` : a list of the following objects:
        - `div` : divergence used.
        - `deg` : the degree of polynomial BD (if one is used).
        - `center_`, `scale_` : the center and scale used to scale the input data.
    - `running_time`: the computational time of the algorithm.

```{r}
kmeansBD <- function(train_input,
                     K,
                     n_start = 5,
                     maxIter = 500,
                     deg = 3,
                     scale_input = FALSE,
                     div = "euclidean",
                     splits = 1,
                     epsilon = 1e-10,
                     center_ = NULL,
                     scale_ = NULL,
                     id_shuffle = NULL){
  start_time <- Sys.time()
  # Distortion function
  X <- as.matrix(train_input)
  N <- dim(X)
  if(scale_input){
    if(!(is.null(center_) & is.null(scale_))){
      if(length(center_) == 1){
        center_ <- rep(center_, N[2])
      }
      if(length(scale_) == 1){
        scale_ <- rep(scale_, N[2])
      }
    } else{
      min_ <- apply(X, 2, FUN = min)
      c_ <- abs(colMeans(X)/5)
      center_ <- min_ - c_
      scale_ <- apply(X, 2, FUN = max) - center_ + 1
    }
    X <- scale(X, center = center_, scale = scale_)
  }
  if(is.null(id_shuffle)){
    train_id <- rep(TRUE, N[1])
    if(splits < 1){
      train_id[sample(N[1], floor(N[1]*(1-splits)))] <- FALSE
    }
  } else{
    train_id <- id_shuffle
  }
  X_train1 <- X[train_id,]
  X_train2 <- X[!train_id,]
  mu <- as.matrix(colMeans(X_train1))
  distortion <- function(clus){
    cent <- newCentroids(X_train1, clus)
    var_within <- 1:K %>%
      map(.f = ~ BregmanDiv(X_train1[clus == .x,], 
                            cent[.x,], 
                            div, 
                            deg)) %>%
      map(.f = sum) %>%
      Reduce("+", .)
    return(var_within)
  }
  # Kmeans algorithm
  kmeansWithBD <- function(x., k., maxiter., eps.) {
    n. <- nrow(x.)
    # initialization
    init <- sample(n., k.)
    centroids_old <- x.[init,]
    i <- 0
    while(i < maxIter){
      # Assignment step
      clusters <- findClosestCentroid(x., centroids_old, div, deg)
      # Recompute centroids
      centroids_new <- newCentroids(x., clusters)
      if ((sum(is.na(centroids_new)) > 0) |
          (nrow(centroids_new) != k.)) {
        init <- sample(n., k.)
        centroids_old <- x.[init,]
        warning("NA produced -> reinitialize centroids...!")
      }
      else{
        if(sum(abs(centroids_new - centroids_old)) > eps.){
          centroids_old <- centroids_new
        } else{
          break
        }
      }
      i <- i + 1
    }
    return(clusters)
  }
  results <- 1:n_start %>% 
    map_dfc(.f = ~ tibble("{{.x}}" := kmeansWithBD(X_train1, 
                                                   K,
                                                   maxIter, 
                                                   epsilon)))
  opt_id <- 1:n_start %>%
    map_dbl(.f = ~ distortion(results[[.x]])) %>%
    which.min
  cluster <- clusters <- results[[opt_id]]
  j <- 1
  ID <- unique(cluster)
  for (i in ID) {
    clusters[cluster == i] = j
    j =  j + 1
  }
  centroids = newCentroids(X_train1, clusters)
  time_taken <- Sys.time() - start_time
  return(
    list(
      centroids = centroids,
      clusters = clusters,
      train_data = list(X_train = X_train1,
                        X_remain = X_train2,
                        id_remain = !train_id),
      parameters = list(div = div,
                        deg = deg,
                        center_ = center_,
                        scale_ = scale_),
      running_time = time_taken
    )
  )
}
```

----

> **Example.1**: We perform $K$-means algorithm with `"gkl"` BD on [Abalone](https://archive.ics.uci.edu/ml/machine-learning-databases/abalone) dataset.

----

```{r}
pacman::p_load(readr)
colname <- c("Type", "LongestShell", "Diameter", "Height", "WholeWeight", "ShuckedWeight", "VisceraWeight", "ShellWeight", "Rings")
df <- readr::read_delim("https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data", col_names = colname, delim = ",", show_col_types = FALSE)
n <- nrow(df)
train <- logical(n)
train[sample(n,  floor(n*0.8))] <- TRUE
cl <- df[train,2:(ncol(df)-1)] %>%
  kmeansBD(K = 3, div = "gkl", splits = 0.5, scale_input = TRUE)
table(cl$clusters)
```


<span style="color: #1FAAE3;"><u>Step $F$: Fitting predictive models</u></span>
===

This section builds global models by fitting local model on each given cluster of the obtained partition. This corresponds to the step $F$ of the procedure.

<span style="color: #F0AE14;"><u>Function</u></span> : `fitLocalModels`
----

This function fits local models $({\cal M_{j,k}})_{j,k}$ on all clusters $k=1,...,K$ of the obtained partition, given by Bregman divergence ${\cal B}_j$, for $j=1,...,M$.

- **Argument**:

    - `kmeans_BD` : an object obtained from `kmeansBD` function.
    - `train_response` : the vector of response variable corresponding to the full `input_data` given to `kmeanBD` function.
    - `model` :a local model type to fit on all the clusters of the given partition. It should be either a model object (works with function `predict`), or a string element of {"lm", "tree", "rf"} which corresponds to linear regression, tree and random forest respectively. By default, `model = "lm"`.
    - `formula` : the degree of polynomial BD (if one is used).

- **Value**:

    This function returns a list of the following objects:
    - `local_models` : all the local models fitted on all clusters of the given partition.
    - `kmeans_BD` : the `kmeansBD` object.
    - `data_remain` : a list of the following objects:
        - `fit` : the fitted values of the remaining part of the input data.
        - `response` : the actual response values corresponding to the remaining input data.
    - `running_time` : the computational time of the algorithm.

```{r}
fitLocalModels <- function(kmeans_BD,
                           train_response,
                           model = "lm",
                           formula = NULL){
  start_time <- Sys.time()
  X_train <- kmeans_BD$train_data$X_train
  y_train <- train_response[!(kmeans_BD$train_data$id_remain)]
  X_remain <- kmeans_BD$train_data$X_remain
  y_remain <- NULL
  if(!is.null(X_remain)){
    y_remain <- train_response[kmeans_BD$train_data$id_remain]
  }
  pacman::p_load(tree)
  pacman::p_load(randomForest)
  model_ <- ifelse(model == "tree", tree::tree, model)
  K <- nrow(kmeans_BD$centroids)
  if (is.null(formula)){
    form <- formula(target ~ .)
  }
  else{
    form <- update(formula, target ~ .)
  }
  data_ <- bind_cols(X_train, "target":= y_train)
  fit_lookup <- list(lm = "fitted.values",
                     rf = "predicted")
  if(is.character(model_)){
    model_lookup <- list(lm = lm,
                         rf = randomForest::randomForest)
    mod <- map(.x = 1:K, 
               .f = ~ model_lookup[[model_]](formula = form, 
                                             data = data_[kmeans_BD$clusters == .x, ]))
  } else{
    mod <- map(.x = 1:K, 
               .f = ~ model_(formula = form, 
                             data = data_[kmeans_BD$clusters == .x,]))
  }
  pred0 <- NULL
  if(!is.null(X_remain)){
    pred0 <- vector(mode = "numeric", 
                    length = length(y_remain))
    clus <- findClosestCentroid(x. = X_remain,
                                centroids. = kmeans_BD$centroids,
                                div = kmeans_BD$parameters$div,
                                deg = kmeans_BD$parameters$deg)
    for(i_ in 1:K){
      pred0[clus == i_] <- predict(mod[[i_]],
                                   as.data.frame(X_remain[clus == i_,]))
    }
  }
  time_taken <- Sys.time() - start_time
  return(list(
    local_models = mod,
    kmeans_BD = kmeans_BD,
    data_remain = list(fit = pred0,
                       response = y_remain),
    running_time = time_taken
  ))
}
```

---- 

> **Example.2**: From **Example.1** above, multiple linear regression models are built on all the obtained clusters. The mean square error of this model, evaluated on the remaining $50\%$ of the training data is computed.

----

```{r}
fit <- fitLocalModels(train_response = df$Rings[train],
                      kmeans_BD = cl,
                      model = "lm")

mean((fit$data_remain$response- fit$data_remain$fit)^2)
```

<span style="color: #F0AE14;"><u>Function</u></span> : `localPredict`
----

This function allows us to predict any new observations using the candidate model ${\cal M}_j=({\cal M}_{j,k})_{k=1}^M$ corresponding to Bregman divergence ${\cal B}_j$, for some $j\in J\subset\{1,...,M\}$. 

- **Argument**:

    - `localModels` : a local model object obtained from `fitLocalModels` function.
    - `newData` : new input data to be predicted using the candidate models given in `localModels` argument.
    
- **Value**:

    This function returns a predicted vector of the `newData`.

```{r}
localPredict <- function(localModels,
                         newData){
  kmean_BD <- localModels$kmeans_BD
  K <- nrow(kmean_BD$centroids)
  newData_ <- newData
  if(!(is.null(kmean_BD$parameters$center_))){
    newData_ <- scale(newData,
                      center = kmean_BD$parameters$center_,
                      scale = kmean_BD$parameters$scale_)
    id0 <- (newData_ <= 0)
    if(sum(id0) > 0){
      min_ <- min(newData_[id0])
      newData_[id0] <- runif(sum(id0), min(1e-3, min_/10), min_)
    }
  }
  clus <- findClosestCentroid(x. = newData_,
                              centroids. = kmean_BD$centroids,
                              div = kmean_BD$parameters$div,
                              deg = kmean_BD$parameters$deg)
  pred0 <- vector(mode = "numeric", length = nrow(newData_))
  for(i_ in 1:K){
    pred0[clus == i_] <- predict(localModels$local_models[[i_]],
                                 as.data.frame(newData_[clus == i_,]))
  }
  pred0 <- as_tibble(pred0)
  names(pred0) <- ifelse(kmean_BD$parameters$div == "polynomial",
                         paste0("polynomial", kmean_BD$parameters$deg),
                         kmean_BD$parameters$div)
  return(pred0)
}
```

----

> **Example.3**: The performance of the candidate model corresponding to `"gkl"` divergence is compared to random forest regression on a $20\%$ testing data.

----

```{r}
y_hat <- localPredict(fit,
                      df[!train, 2:(ncol(df)-1),])
rf <- randomForest(Rings ~ ., data = df[train,2:ncol(df)])
mean((predict(rf, newdata = df[!train,2:ncol(df)])- df$Rings[!train])^2)
mean((y_hat$gkl-df$Rings[!train])^2)
```


<span style="color: #1FAAE3;"><u>Step $C$: Combining methods</u></span>
===

The source codes and information of the aggregation methods are available [here <span style="color: #097BC1"> `r fontawesome::fa("github")`</span>](https://github.com/hassothea/AggregationMethods). The codes below imports the aggregation methods into <span style="color: #0287D8;"> **Rstudio** </span> environment.

```{r, warning=FALSE}
pacman::p_load(devtools)
### Kernel based consensual aggregation
source_url("https://raw.githubusercontent.com/hassothea/AggregationMethods/main/KernelAggReg.R")
### MixCobra
source_url("https://raw.githubusercontent.com/hassothea/AggregationMethods/main/MixCobraReg.R")
```

<span style="color: #F0AE14;"><u>Function</u></span> : `stepK`, `stepF` and `stepC`
----

These functions allow to set values of the parameters in the three steps of the KFC procedure. Each function returns a list of all the parameters given in its arguments.


```{r}
stepK = function(K,
                 n_start = 5,
                 maxIter = 300,
                 deg = 3,
                 scale_input = FALSE,
                 div = NULL,
                 splits = 0.75,
                 epsilon = 1e-10,
                 center_ = NULL,
                 scale_ = NULL){
  return(list(K = K,
              n_start = n_start,
              maxIter = maxIter,
              deg = deg,
              scale_input = scale_input,
              div = div,
              splits = splits,
              epsilon = epsilon,
              center_ = center_ ,
              scale_ = scale_))
}

stepF = function(formula = NULL, 
                 model = "lm"){
  return(list(formula = formula, 
              model = model))
}

stepC = function(n_cv = 5,
                 method = c("cobra", "mixcobra"),
                 opt_methods = c("grad", "grid"),
                 kernels = "gaussian",
                 scale_features = FALSE){
  return(list(n_cv = n_cv,
              method = method,
              opt_methods = opt_methods,
              kernels = kernels,
              scale_features = scale_features))
}
```


<span style="color: #1FAAE3;"><u>Function</u></span>: `KFCreg`
===

This function is the complete implementation of KFC procedure.

- **Argument**:

    - `train_input` : the matrix or data frame of training input data.
    - `train_response` : the training response variable.
    - `test_input`: the testing input data.
    - `test_response` : the response variable of the testing data. It is optional. If it is not `NULL`, the mean square error (mse) is computed.
    - `n_cv` : the number of folds in cross-validation.
    - `parallel` : a logical value specifying whether or not to perform $K$-means algorithm (step $K$) in parallel. By default, `parallel = TRUE`. Note that **internet connection** is required in this case.
    - `inv_sigma`, `alp` : the inverse normalized constant $\sigma^{-1}>0$ and the exponent $\alpha>0$ of exponential kernel: $K(x)=e^{-\|x/\sigma\|^{\alpha}}$ for any $x\in\mathbb{R}^d$. By default, `inv_sigma = `$\sqrt{1/2}$ and `alpha = 2` which corresponds to the Gaussian kernel.
    - `K_step` : an object obtained from function `stepK`, allowing to set the values of the paramters in step $K$ of the procedure.
    - `F_step` : an object obtained from function `stepF`, allowing to set the values of the paramters in step $F$ of the procedure.
    - `C_step` : an object obtained from function `stepC`, allowing to set the values of the paramters in step $C$ of the procedure.
    - `setGradParamAgg` : an object from the `setGradParameter` function, allowing to set the values of the parameters of **gradient descent** algorithm for the <span style="color: #E6180A;">**1st aggregation method$^1$**</span>.
    - `setGridParamAgg` : an object from the `setGridParameter` function, allowing to set the values of the parameters of the **grid search** algorithm for the <span style="color: #E6180A;">**1st aggregation method$^1$**</span>.
    - `setGradParamMix` : an object from the `setGradParameter_Mix` function, allowing to set the values of the parameters of the **gradient descent** algorithm for the <span style="color: #0832CD">**2nd aggregation method$^2$**</span>.
    - `setGridParamMix` : an object from the `setGridParameter_Mix` function, allowing to set the values of the parameters of the **grid search** algorithm for the <span style="color: #0832CD">**2nd aggregation method$^2$**</span>.
    - `silent` : a logical value to silent all the messages during the algorithm.
    
- **Value**:

    This function returns a list of the following objects:
    
    - `predict_final` : the final predictions given by the aggregation of all the candidate models.
    - `predict_local` : the predictions given by all the individual candidate models.
    - `agg_method` : the list of aggregation methods obtained in the step $C$ of the procedure.
    - `running_time` : the computational time of the algorithm.

----

  > 🧾 **Remark.2**: The `parallel` argument above requires internet connection to load the source codes of $K$-means algorithm with BDs from [GitHub <span style="color: #097BC1"> `r fontawesome::fa("github")`</span>](https://github.com/hassothea/KFC-Procedure/blob/master/kmeanBD.R). It is performed on the maximum number of clusters of your machine, and the speed is at least two times faster than without parallelism, however, it is not so stable depending on your internet connection or machine. 

> For the aggregation methods in step $C$:


- <span style="color: #E6180A;">$^1$</span> is the kernel-based consensual aggregation for regression by [Has (2021)](https://hal.archives-ouvertes.fr/hal-02884333v5). The source codes of these functions are available in [AggregationMethods](https://github.com/hassothea/AggregationMethods) github repository <span style="color: #097BC1"> `r fontawesome::fa("github")`</span> (file name: [KernelAggReg.R](https://github.com/hassothea/AggregationMethods/blob/main/KernelAggReg.R)), and the documentation is available [here](https://hassothea.github.io/files/KernelAggReg/KernelAggReg.html).
- <span style="color: #0832CD">$^2$</span> is the aggregation using input-output trade-off by [Fischer and Mougeot (2019)](https://www.sciencedirect.com/science/article/pii/S0378375818302349). The source codes of these functions are available in [AggregationMethods](https://github.com/hassothea/AggregationMethods) github repository <span style="color: #097BC1"> `r fontawesome::fa("github")`</span> (file name: [MixCobraReg.R](https://github.com/hassothea/AggregationMethods/blob/main/MixCobraReg.R)), and the documentation is available on [here](https://hassothea.github.io/files/KernelAggReg/MixCobraReg.html).

----


```{r}
KFCreg = function(train_input,
                  train_response,
                  test_input,
                  test_response = NULL,
                  n_cv = 5,
                  parallel = TRUE,
                  inv_sigma = sqrt(.5),
                  alp = 2,
                  K_step = stepK(splits = .5),
                  F_step = stepF(),
                  C_step = stepC(),
                  setGradParamAgg = setGradParameter(),
                  setGridParamAgg = setGridParameter(),
                  setGradParamMix = setGradParameter_Mix(),
                  setGridParamMix = setGridParameter_Mix(),
                  silent = FALSE){
  start_time <- Sys.time()
  lookup_div_names <- c("euclidean",
                         "gkl",
                         "logistic",
                         "itakura",
                         "polynomial")
  div_ <- K_step$div
  ### K step: Kmeans clustering with BDs
  if (is.null(K_step$div)){
    divergences <- lookup_div_names
    warning("No divergence provided! All of them are used!")
  }
  else{
    divergences <- K_step$div %>% 
      map_chr(.f = ~ match.arg(arg = .x, 
                               choices = lookup_div_names))
  }
  div_list <- divergences %>% 
    map(.f = (\(x) if(x != "polynomial") return(x) else return(rep("polynomial", length(K_step$deg))))) %>%
    unlist
  deg_list <- rep(NA, length(div_))
  deg_list[div_list == "polynomial"] <- K_step$deg
  div_names <- map2_chr(.x = div_list,
                        .y = deg_list,
                        .f = (\(x, y) if(is.na(y)) return(x) else return(paste0(x,y))))
  ### Step K: Kmeans clustering with Bregman divergences
  dm <- dim(train_input)
  id_shuffle <- vector(length = dm[1])
  n_train <- floor(K_step$splits * dm[1])
  id_shuffle[sample(dm[1], n_train)] <- TRUE
  if(parallel){
    numCores <- parallel::detectCores()
    doParallel::registerDoParallel(numCores) # use multicore, set to the number of our cores
    kmean_ <- foreach(i=1:length(div_names)) %dopar% {
      devtools::source_url("https://raw.githubusercontent.com/hassothea/KFC-Procedure/master/kmeanBD.R")
      kmeansBD(train_input = train_input,
               K = K_step$K,
               div = div_list[i],
               n_start = K_step$n_start,
               maxIter = K_step$maxIter,
               deg = deg_list[i],
               scale_input = K_step$scale_input,
               splits = K_step$splits,
               epsilon = K_step$epsilon,
               center_ = K_step$center_,
               scale_ = K_step$scale_,
               id_shuffle = id_shuffle)
    }
    doParallel::stopImplicitCluster()
  } else{
    kmean_ <- map2(.x = div_list,
                   .y = deg_list,
                   .f = ~ kmeansBD(train_input = train_input,
                                   K = K_step$K,
                                   div = .x,
                                   n_start = K_step$n_start,
                                   maxIter = K_step$maxIter,
                                   deg = .y,
                                   scale_input = K_step$scale_input,
                                   splits = K_step$splits,
                                   epsilon = K_step$epsilon,
                                   center_ = K_step$center_,
                                   scale_ = K_step$scale_,
                                   id_shuffle = id_shuffle))
  }
  names(kmean_) <- div_names
  ### F step: Fitting the corresponding model on each observed cluster
  model_ <- div_names %>%
    map(.f = ~ fitLocalModels(kmean_[[.x]],
                              train_response = train_response,
                              model = F_step$model,
                              formula = F_step$formula))
  names(model_) <- div_names
  pred_combine <- model_ %>%
    map_dfc(.f = ~ .x$data_remain$fit)
  y_remain <- train_response[!id_shuffle]
  pred_test <- div_names %>%
    map_dfc(.f = ~ localPredict(model_[[.x]],
                                test_input))
  names(pred_test) <- names(pred_combine) <- div_names
  # C step: Consensual regression aggregation method with kernel-based COBRA
  list_method_agg <- list(mixcobra = function(pred){MixCobraReg(train_input = train_input[!id_shuffle,],
                                                                train_response = y_remain,
                                                                test_input = test_input,
                                                                train_predictions = pred,
                                                                test_predictions = pred_test,
                                                                test_response = test_response,
                                                                scale_input = K_step$scale_input,
                                                                scale_machine = C_step$scale_features,
                                                                n_cv = C_step$n_cv,
                                                                inv_sigma = inv_sigma,
                                                                alp = alp,
                                                                kernels = C_step$kernels,
                                                                optimizeMethod = C_step$opt_methods,
                                                                setGradParam = setGradParamMix,
                                                                setGridParam = setGridParamMix,
                                                                silent = silent)},
                          cobra = function(pred){kernelAggReg(train_design = pred,
                                                              train_response = y_remain,
                                                              test_design = pred_test,
                                                              test_response = test_response,
                                                              scale_input = K_step$scale_input,
                                                              scale_machine = C_step$scale_features,
                                                              build_machine = FALSE,
                                                              machines = NULL,
                                                              n_cv = C_step$n_cv,
                                                              inv_sigma = sqrt(.5),
                                                              alp = 2,
                                                              kernels = C_step$kernels,
                                                              optimizeMethod = C_step$opt_methods,
                                                              setGradParam = setGradParamAgg,
                                                              setGridParam = setGridParamAgg,
                                                              silent = silent)})
  res <- map(.x = C_step$method,
             .f = ~ list_method_agg[[.x]](pred_combine))
  list_agg_methods <- list(cobra = "cob",
                           mixcobra = "mix")
  names(res) <- C_step$method
  ext_fun <- function(L, nam){
    tab <- L$fitted_aggregate
    names(tab) <- paste0(names(tab), "_", nam)
    return(tab)
  }
  pred_fin <- C_step$method %>%
    map_dfc(.f = ~ ext_fun(res[[.x]], list_agg_methods[[.x]]))
  time.taken <- Sys.time() - start_time
  ### To finish
  if(is.null(test_response)){
    return(list(
    predict_final = pred_fin,
    predict_local = pred_test,
    agg_method = res,
    running_time = time.taken
  ))
  } else{
    error <- cbind(pred_test, pred_fin) %>%
      dplyr::mutate(y_test = test_response) %>%
      dplyr::summarise_all(.funs = ~ (. - y_test)) %>%
      dplyr::select(-y_test) %>%
      dplyr::summarise_all(.funs = ~ mean(.^2))
    return(list(
      predict_final = pred_fin,
      predict_local = pred_test,
      agg_method = res,
      mse = error,
      running_time = time.taken
  ))
  }
}
```

> **Example.4**: A complete KFC procedure is implemented on the same Abalone data, using $5$ BDs `"euclidean"`, `"itakura"`, `"gkl"` and `"polynomial"` (of degree $3$ and $6$). Both aggregation methods are used in the step $C$. Two kernel functions are used for each aggregation method: `"gaussian"` (with gradient descent algorithm) and `"epanechnikov"` (with grid search algorithm).

```{r}
train1 <- logical(n)
train1[sample(n,  floor(n*0.8))] <- TRUE
kfc1 <- KFCreg(train_input = df[train1,2:ncol(df)],
                train_response = df$Rings[train1],
                test_input = df[!train1,2:ncol(df)],
                K_step = stepK(K = 3,
                               scale_input = TRUE,
                               div = c("eucl", "ita", "gkl", "poly"),
                               deg = c(3, 6),
                               splits = .5),
                C_step = stepC(method = c("cobra", "mixcobra"),
                               opt_methods = c("grad", "grid"),
                               kernels = c("gaussian", "gaussian"),
                               scale_features = FALSE),
                setGradParamAgg = setGradParameter(rate = 0.2),
                                                   #coef_lm = 2),
                setGridParamAgg = setGridParameter(min_val = .00001,
                                                   max_val = 10,
                                                   n_val = 100),
                setGradParamMix = setGradParameter_Mix(rate = "linear",
                                                       coef_auto = c(.5,.5)),
                setGridParamMix = setGridParameter_Mix(min_alpha = 1e-10,
                                                       max_alpha = 0.5,
                                                       min_beta = 1e-10,
                                                       max_beta = 1,
                                                       n_alpha = 20,
                                                       n_beta = 20))
```

> The mean square errors evaluated on $20\%$-testing data of the above computation are reported below.

```{r}
rf1 <- randomForest::randomForest(Rings ~ ., data = df[train1,2:ncol(df)])
kfc1$predict_final %>%
  mutate(rf = predict(rf1, newdata = df[!train1,2:ncol(df)])) %>%
  sweep(MARGIN = 1, STATS = df$Rings[!train1], FUN = "-") %>%
  .^2  %>%
  colMeans
```

> We can see that KFC procedure performs really well on this real-life dataset.

- [Has et al. (2021)](https://www.tandfonline.com/eprint/YKGS8GTKDBKYFXEGFWSB/full?target=10.1080/00949655.2021.1891539)
- [Fischer and Mougeot (2019)](https://www.sciencedirect.com/science/article/pii/S0378375818302349)
- [Has (2021)](https://hal.archives-ouvertes.fr/hal-02884333v5)
- [Biau et al. (2016)](https://www.sciencedirect.com/science/article/pii/S0047259X15000950)
- ...
- [dplyr videos](https://www.youtube.com/hashtag/dplyr) `r fontawesome::fa("video")`
- [ggplot2 video tutorial](https://www.youtube.com/hashtag/ggplot2) `r fontawesome::fa("video")`
- [R for data science](https://r4ds.had.co.nz/)

---

> <span style="color: #1FAAE3;">&#128214; Read also [KernelAggReg](https://hassothea.github.io/files/CodesPhD/KernelAggReg.html) and [MixCobraReg](https://hassothea.github.io/files/CodesPhD/MixCobraReg.html)</span>.

---