name: layout-general
layout: true
class: left, top

<style>
.remark-slide-number {
  position: inherit;
}

.remark-slide-number .progress-bar-container {
  position: absolute;
  bottom: 0;
  height: 4px;
  display: block;
  left: 0;
  right: 0;
}

.remark-slide-number .progress-bar {
  height: 100%;
  background-color: rgba(2, 70, 79, 0.874);
}

.remark-slide-number .numeric {
  position: absolute;
  bottom: 4%;
  height: 4px;
  display: block;
  right: 2.5%;
  font-weight: bold;
}

.remark-slide-number .numeric-out{
  color: rgba(2, 70, 79, 0.874);
}
</style>

---
count: false
class: top, center, title-slide

<img src="img/X.png" align="right" height="110" width="100">
<img src="img/cnrs.png" align="middle" height="95">
<img src="img/UniversiteParisCite_logo.jpg" align="left" height="90">

<hr class="L1">

## `\(K\)`-means Clustering Algorithm<br> via Vector Quantization

<hr class="L2">

#### Graduate School of Science-RUPP

<img src="img/RUPP_logo.png" align="middle" height="100">

### Sothea .textsc[Has]

---
### In words<h0br>

.stress[Clustering] consists in _partitioning_ a set of points from some metric space in such a way that .stress[points within the same group are close enough] while .stress[points from different groups are distant]. It belongs to the .stress[unsupervised learning] branch of Machine Learning (ML).<hbr>

--

.pull-left[

```r
faithful[1:7,] %>% 
  knitr::kable(format = "markdown")
```

| eruptions| waiting|
|---------:|-------:|
|     3.600|      79|
|     1.800|      54|
|     3.333|      74|
|     2.283|      62|
|     4.533|      85|
|     2.883|      55|
|     4.700|      88|
]

.pull-right[
<img src="img/faithful.png" align="left" height="370" width='340'>
]

---
count:false
### In words<h0br>

.stress[Clustering] consists in _partitioning_ a set of points from some metric space in such a way that .stress[points within the same group are close enough] while .stress[points from different groups are distant]. It belongs to the .stress[unsupervised learning] branch of Machine Learning (ML).<hbr>

.pull-left[

```r
faithful[1:7,] %>% 
  knitr::kable(format = "markdown")
```

| eruptions| waiting|
|---------:|-------:|
|     3.600|      79|
|     1.800|      54|
|     3.333|      74|
|     2.283|      62|
|     4.533|      85|
|     2.883|      55|
|     4.700|      88|
]

.pull-right[
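A minimal sketch (added for illustration, not part of the original deck) of obtaining such a grouping in R with the built-in `kmeans()`:

```r
# Illustration only: partition the faithful data into
# K = 2 groups and plot them with their centers
set.seed(42)  # reproducible initialization
km <- kmeans(faithful, centers = 2, nstart = 10)
plot(faithful, col = km$cluster, pch = 19)
points(km$centers, pch = 8, cex = 2)
```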
]

---
count:false
### In words<h0br>

.stress[Clustering] consists in _partitioning_ a set of points from some metric space in such a way that .stress[points within the same group are close enough] while .stress[points from different groups are distant]. It belongs to the .stress[unsupervised learning] branch of Machine Learning (ML).<hbr>

.pull-left[

```r
df_iris[id,] %>% 
  select(S.L, S.W, P.L, P.W, Species) %>% 
  knitr::kable(format = "markdown")
```

|    | S.L| S.W| P.L| P.W|Species    |
|:---|---:|---:|---:|---:|:----------|
|1   | 5.1| 3.5| 1.4| 0.2|setosa     |
|2   | 4.9| 3.0| 1.4| 0.2|setosa     |
|51  | 7.0| 3.2| 4.7| 1.4|versicolor |
|52  | 6.4| 3.2| 4.5| 1.5|versicolor |
|101 | 6.3| 3.3| 6.0| 2.5|virginica  |
|102 | 5.8| 2.7| 5.1| 1.9|virginica  |
]

.pull-right[
<img src="img/iris.png" align="left" height="365" width='350'>
]

---
count:false
### In words<h0br>

.stress[Clustering] consists in _partitioning_ a set of points from some metric space in such a way that .stress[points within the same group are close enough] while .stress[points from different groups are distant]. It belongs to the .stress[unsupervised learning] branch of Machine Learning (ML).<hbr>

.pull-left[

```r
df_iris[id,] %>% 
  select(S.L, S.W, P.L, P.W, Species) %>% 
  knitr::kable(format = "markdown")
```

|    | S.L| S.W| P.L| P.W|Species    |
|:---|---:|---:|---:|---:|:----------|
|1   | 5.1| 3.5| 1.4| 0.2|setosa     |
|2   | 4.9| 3.0| 1.4| 0.2|setosa     |
|51  | 7.0| 3.2| 4.7| 1.4|versicolor |
|52  | 6.4| 3.2| 4.5| 1.5|versicolor |
|101 | 6.3| 3.3| 6.0| 2.5|virginica  |
|102 | 5.8| 2.7| 5.1| 1.9|virginica  |
]

.pull-right[
]

---
count:false
### In words<h0br>

.stress[Clustering] consists in _partitioning_ a set of points from some metric space in such a way that .stress[points within the same group are close enough] while .stress[points from different groups are distant]. It belongs to the .stress[unsupervised learning] branch of Machine Learning (ML).<hbr>

### Clustering in ML applications<h0br>

Clustering shows up in many Machine Learning applications, for example: <h0br>

.left-column[

- .stress[
__Marketing__]: finding groups of customers with similar behavior, given a large database of customer properties and past buying records

- .stress[
__Biology__]: classification of plants and animals given their features - .stress[
__Insurance__]: identifying groups of motor insurance policy holders with a high average claim cost; identifying fraud
]

.right-column[

- .stress[
__Bookshops__]: book ordering (recommendation) - .stress[
__City-planning__]: identifying groups of houses according to their type, value and geographical location - .stress[
__Internet__]: document classification; clustering weblog data to discover groups of similar access patterns; topic modeling ...
]

---
class: left, top

## What's "vector quantization"?<hbr>

- A mathematical model of _data compression_
- Compression (encoding): original .stress[data points] `\(\to\)` fewer/simpler/more compact .stress[objects]
- Reconstruction (decoding): compressed objects `\(\to\)` an approximation of the original data

.center[<img src="./img/reconstruct.png" width="700px"/>]

- Two types: variable-rate & .stress[fixed-rate] quantization; this talk focuses on the .stress[fixed-rate] problem.

---
template: inter-slide
class: left, middle
count: false

##
.bold-blue[Outline] <br> .hhead[I. Theory of fixed-rate vector quantization] <br> .hhead[II. Empirical setting & K-means algorithm] <br> .hhead[III. Questions & conclusion] --- template: inter-slide class: left, middle count: false ##
.bold-blue[Outline]
<br>
.section[I. Theory of fixed-rate vector quantization]
<br>
.hhead[II. Empirical setting & K-means algorithm]
<br>
.hhead[III. Questions & conclusion]

---
## Fixed-rate quantization problem<hbr>

### Notation<hbr>

- Data: `\(X_1,X_2,\dots\)` and `\(X\)` are `\(iid\)` `\(\mathbb{R}^p\)`-valued random variables with common distribution `\(\mu\)`.

--

- A .stress[fixed-rate] `\(K\)`.stress[-point quantizer] `\(q\)` associated with .stress[codebook] `\(\mathcal{C}=\{c_1, ..., c_K\}\subset\mathbb{R}^p\)`<br> ( `\(c_k\)` are called .stress[codevectors] ) is a Borel measurable mapping:
`$$q:\mathbb{R}^p\to \mathcal{C},\quad x\mapsto q(x)=c_k, \text{ for some }k.$$`

--

- Distortion measure: let `\(d\)` be a distance (Euclidean in this talk),
`$$\mathcal{D}_K(\mu, q)=\mathbb{E}_{\mu}[d(X,q(X))^2]=\int_{\mathbb{R}^p}d(x,q(x))^2\mu(dx)=\int_{\mathbb{R}^p}\|x-q(x)\|^2\mu(dx).$$`

--

- Optimal performance:
`$$\mathcal{D}_K^*(\mu)=\inf_{q\in\mathcal{Q}_K}\mathcal{D}_K(\mu,q), \text{ with }\mathcal{Q}_K=\{q:\mathcal{D}_K(\mu,q)<\infty\}.$$`

---
## Fixed-rate quantization problem<hbr>

### Objective<hbr>

- Find `\(q^*\)` such that
`$$\mathcal{D}_K(\mu,q^*)=\mathcal{D}_K^*(\mu) =\inf_{q\in\mathcal{Q}_K}\mathcal{D}_K(\mu,q)=\inf_{q\in\mathcal{Q}_K}\int_{\mathbb{R}^p}\|x-q(x)\|^2\mu(dx).$$`
<h0br>

--

### Remark<h0br>

.pull-left[
- `\(q\)` is completely defined by
  - its codebook `\(\mathcal{C}=\{c_k\}_{k=1}^K\)`
  - the associated .stress[cells] `\(S_k=\{x\in\mathbb{R}^p:q(x)=c_k\}\)`.
- `\(\mathcal{S}=\{S_k\}_{k=1}^K\)` is a partition of `\(\mathbb{R}^p\)`.
]

.pull-right[
<hbr>
.right[<img src="./img/voro.png" width="300px" height ="210px"/>]
]

---
##
Some key quantities<hbr>

- Let `\(X\sim\mu\)` be an `\(\mathbb{R}^p\)`-valued random variable and `\(S\subset\mathbb{R}^p\)` be such that `\(\mu(S)>0\)`, then
  - `\(\mathbb{E}(X)=\arg\min_{c\in\mathbb{R}^p}\mathbb{E}[\|X-c\|^2]\)`
  - `\(\mathbb{V}(X)=\min_{c\in\mathbb{R}^p}\mathbb{E}[\|X-c\|^2]\)`
  - `\(\mathbb{E}(X|X\in S)=\arg\min_{c\in\mathbb{R}^p}\mathbb{E}[\|X-c\|^2|X\in S]\)`
  - `\(\mathbb{V}(X|X\in S)=\min_{c\in\mathbb{R}^p}\mathbb{E}[\|X-c\|^2|X\in S]\)`.<h0br>

.pull-right[
.right[<img src="./img/voro.png" width="300px" height ="210px"/>]
]

---
count:false
##
Some key quantities<hbr>

- Let `\(X\sim\mu\)` be an `\(\mathbb{R}^p\)`-valued random variable and `\(S\subset\mathbb{R}^p\)` be such that `\(\mu(S)>0\)`, then
  - `\(\mathbb{E}(X)=\arg\min_{c\in\mathbb{R}^p}\mathbb{E}[\|X-c\|^2]\)`
  - `\(\mathbb{V}(X)=\min_{c\in\mathbb{R}^p}\mathbb{E}[\|X-c\|^2]\)`
  - `\(\mathbb{E}(X|X\in S)=\arg\min_{c\in\mathbb{R}^p}\mathbb{E}[\|X-c\|^2|X\in S]\)`
  - `\(\mathbb{V}(X|X\in S)=\min_{c\in\mathbb{R}^p}\mathbb{E}[\|X-c\|^2|X\in S]\)`.<h0br>

.pull-left[
- Conditional expectation and variance are defined by
`$$\mathbb{E}(X|X\in S)=\frac{\int_{S}x\mu(dx)}{\mu(S)}\\
\mathbb{V}(X|X\in S)=\frac{\int_{S}\|x-\mathbb{E}(X|X\in S)\|^2\mu(dx)}{\mu(S)}$$`
]

.pull-right[
.right[<img src="./img/voro.png" width="300px" height ="210px"/>]
]

---
## Theoretical result<hbr>

### Lemma 1 (.textsc[.stress[Nearest Neighbor Condition]])<h0br>

----
- Let `\(q\)` and `\(\hat{q}\)` be two `\(K\)`-point quantizers with the same codebook `\(\mathcal{C}=\{c_k\}_{k=1}^K\)` such that
`$$\hat{q}(x)=\arg\min_{c_k\in\mathcal{C}}\|x-c_k\|^2,\qquad\qquad (1)$$`
where the ties are broken arbitrarily. Then one has
`$$\mathcal{D}_K(\mu,\hat{q})\leq\mathcal{D}_K(\mu,q).$$`
<h0br>

----

</h0br>

<h3br>

.center[
![](kMeans_files/figure-html/nnb_cond-1.png)![](kMeans_files/figure-html/nnb_cond-2.png)
]

---
## Theoretical result<hbr>

### Lemma 2 (.textsc[.stress[Centroid Condition]])<h0br>

----
- Let `\(q\)` be any `\(K\)`-point quantizer with partition cells `\(\mathcal{S}=\{S_k\}_{k=1}^K\)`. If `\(\hat{q}\)` is defined on the same cells with codevectors given by
`$$\hat{c}_k=\arg\min_{c\in\mathbb{R}^p}\mathbb{E}[\|X-c\|^2|X\in S_k]=\mathbb{E}[X|X\in S_k], \text{ for }k=1,...,K,$$`
then
`$$\mathcal{D}_K(\mu,\hat{q})\leq \mathcal{D}_K(\mu,q).$$`
<h0br>

----

.center[
![](kMeans_files/figure-html/centroid_cond-1.png)![](kMeans_files/figure-html/centroid_cond-2.png)
]

---
## Lemmas 1 & 2 suggest that...<hbr>

- .textsc[.stress[Nearest Neighbor Condition]]: given a .stress[codebook] `\(\mathcal{C}=\{c_k\}_{k=1}^K\)`, we know an optimal way to assign `\(x \mapsto c_j\Leftrightarrow\)` a .stress[partition] `\(S=\{S_k\}_{k=1}^K\)`.
- .textsc[.stress[Centroid Condition]]: given a .stress[partition] `\(S=\{S_k\}_{k=1}^K\)`, we know a way to define an optimal .stress[codebook] `\(\mathcal{C}=\{c_k\}_{k=1}^K\)`.
<h0br>

.center[
<video width="600" height="375" controls>
  <source src="./img/quantizer.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>
]

---
count:false
## Lemmas 1 & 2 suggest that...<hbr>

- .textsc[.stress[Nearest Neighbor Condition]]: given a .stress[codebook] `\(\mathcal{C}=\{c_k\}_{k=1}^K\)`, we know an optimal way to assign `\(x \mapsto c_j\Leftrightarrow\)` a .stress[partition] `\(S=\{S_k\}_{k=1}^K\)`.
- .textsc[.stress[Centroid Condition]]: given a .stress[partition] `\(S=\{S_k\}_{k=1}^K\)`, we know a way to define an optimal .stress[codebook] `\(\mathcal{C}=\{c_k\}_{k=1}^K\)`.
<hbr>

.pull-left[
### Remark<hbr>

- Such a `\(\hat{q}\)` (.stress[Lemma 1]) with codebook `\(\hat{\mathcal{C}}=\{\hat{c}_k\}_{k=1}^K\)` is called a .stress[nearest neighbor] quantizer (.stress[NNQ]).
- The associated partition `\(\hat{S}=\{\hat{S}_k\}\)` is called the .stress[Voronoï] or .stress[nearest neighbor] partition of `\(\mathbb{R}^p\)` associated with `\(\hat{\mathcal{C}}\)`.
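A tiny R sketch (an added illustration; `nn_assign` is a hypothetical helper) of the .stress[NNQ] assignment rule from .stress[Lemma 1]:

```r
# Index of the nearest codevector (row of C)
# for each data point (row of X)
nn_assign <- function(X, C) {
  apply(X, 1, function(x)
    which.min(colSums((t(C) - x)^2)))
}
```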
]

.pull-right[
<img src="./img/voronoi.png" width="380px" height ="260px"/>
]

---
## Existence of optimal NNQ<hbr>

- From .stress[Lemma 1], it suffices to consider .stress[NNQs], i.e.,
`$$\mathcal{D}_K^*(\mu)=\inf_{q\in\mathcal{Q}_K}\int_{\mathbb{R}^p}\|x-q(x)\|^2\mu(dx) =\inf_{\hat{\mathcal{C}}:|\hat{\mathcal{C}}|=K}\int_{\mathbb{R}^p}\min_{\hat{c}_k\in\hat{\mathcal{C}}}\|x-\hat{c}_k\|^2\mu(dx).$$`
<h0br>

--

### Theorem 1 (.textsc[Existence of optimal] NNQ)<h0br>

----
There exists a nearest neighbor quantizer `\(q^*\in\mathcal{Q}_K\)` such that
`$$\mathcal{D}_K(\mu, q^*)=\mathcal{D}_K^*(\mu)=\inf_{(\hat{c}_1,...,\hat{c}_K)\in(\mathbb{R}^p)^K}\int_{\mathbb{R}^p}\min_{1\leq k\leq K}\|x-\hat{c}_k\|^2\mu(dx).$$`
<h0br>

----

--

- .stress[Sketch of proof]: let `\(g_K(y_1,...,y_K)=\int_{\mathbb{R}^p}\min_{1\leq k\leq K}\|x-y_k\|^2\mu(dx)\)`, then:
  - `\(g_K\)` is continuous on `\((\mathbb{R}^p)^K\)`.
  - Let `\(\overline{B}_{r}=\{x\in\mathbb{R}^p:\|x\|\leq r\}\)` be the closed ball of `\(\mathbb{R}^p\)` with radius `\(r>0\)`, centered at the origin, then
`$$\exists R>0: \mathcal{D}_K^*(\mu)=\inf_{(y_1,...,y_K)\in(\overline{B}_{R})^K}g_K(y_1,...,y_K).$$`

---
template: inter-slide
class: left, middle
count: false

##
.bold-blue[Outline]
<br>
.hhead[I. Theory of fixed-rate vector quantization]
<br>
.section[II. Empirical setting & K-means algorithm]
<br>
.hhead[III. Questions & conclusion]

---
## Empirical setting<hbr>

- In practice, all of these quantities are intractable unless `\(\mu\)` is known. Unfortunately, it is usually not!

--

- Observed training data of size `\(n\)`: `\(\mathcal{T}_n=\{X_1,...,X_n\}\)` with `\(X_i\overset{iid}{\sim}\mu\)`.

--

- Empirical distribution: if `\(A\)` is a Borel measurable subset of `\(\mathbb{R}^p\)`, let
`$$\mu_n(A)=\frac{1}{n}\sum_{i=1}^n\mathbb{1}_{\{X_i\in A\}}$$`
- Empirical distortion measure:
`$$\mathcal{D}_K(\mu_n, q)=\frac{1}{n}\sum_{i=1}^n\|X_i-q(X_i)\|^2$$`

--

- .stress[Empirically optimal NNQ]: `\(q_n^*(.)=f(.|X_1,...,X_n)\in\mathcal{Q}_K\)` satisfying
`$$\mathcal{D}_K(\mu_n,q_n^*)=\inf_{q\in\mathcal{Q}_K}\mathcal{D}_K(\mu_n,q)=\inf_{q\in\mathcal{Q}_K}\frac{1}{n}\sum_{i=1}^n\|X_i-q(X_i)\|^2$$`

---
## Consistency of empirical design<hbr>

Roughly speaking, we want to show that `\(q_n^*\leadsto q^*\)` as `\(n\to\infty\)` in some sense.

--

### Theorem 2 (.textsc[Consistency of empirical design])<h0br>

----
For any `\(K\geq 1\)`, the sequence of .stress[empirically optimal NNQs] `\((q_n^*)_{n\in\mathbb{N}}\)` satisfies
`$$\lim_{n\to\infty}\mathcal{D}_K(\mu,q_n^*)=\mathcal{D}_K^*(\mu)\quad \mu\text{-}a.s.$$`
<h0br>

----

--

.stress[Sketch of proof]: let `\(\mathcal{L}_2(\mathbb{R}^p)=\{\mu:\mathbb{E}_{\mu}(\|X\|^2)=\int_{\mathbb{R}^p}\|x\|^2\mu(dx)<\infty\}\)`.<hbr>

- `\(L_2\)` Wasserstein distance: `\(\rho(\mu,\nu)=\displaystyle\inf_{X\sim\mu,Y\sim\nu}[\mathbb{E}(\|X-Y\|^2)]^{1/2}, \mu,\nu\in\mathcal{L}_2(\mathbb{R}^p)\)`.
- If `\(q\)` is a .stress[NNQ]: `\(|(\mathcal{D}_K(\mu,q))^{1/2}-(\mathcal{D}_K(\nu,q))^{1/2}|\leq\rho(\mu, \nu),\forall \mu, \nu\in\mathcal{L}_2(\mathbb{R}^p).\)`<br><br> Consequently, `\(|(\mathcal{D}_K(\mu, q_n^*))^{1/2}-(\mathcal{D}_K^*(\mu))^{1/2}|\leq 2\rho(\mu, \mu_n),\forall \mu\in\mathcal{L}_2(\mathbb{R}^p).\)`
- One has: `\(\lim_{n\to\infty}\rho(\nu,\nu_n)=0\Leftrightarrow \nu_n\underset{n\to\infty}{\rightharpoonup}\nu\)` and `\(\mathbb{E}_{\nu_n}(\|X\|^2)\underset{n\to\infty}{\rightarrow}\mathbb{E}_{\nu}(\|X\|^2)\)`.

---
## Finite sample upper bounds

The previous result provides neither the rate of convergence of `\(\mathcal{D}_K(\mu,q_n^*)\to\mathcal{D}_K^*(\mu)\)` nor any bound for a finite sample size.

--

### Theorem 3 (.textsc[Rate of convergence])<h0br>

----
Let `\(\mathcal{P}(R)=\{\mu:\mathbb{P}_{\mu}(X\in \overline{B}_R)=\mu(\overline{B}_R)=1\}\)`, then for every `\(\mu\in\mathcal{P}(R)\)` and `\(n\)` large enough,
`$$\mathcal{D}_K(\mu,q_n^*)-\mathcal{D}_K^*(\mu)=O\Big(\sqrt{\frac{\log(n)}{n}}\Big)\quad \mu\text{-}a.s.$$`
<h0br>

----

--

.stress[Sketch of proof]:

---
count: false
## Finite sample upper bounds

The previous result provides neither the rate of convergence of `\(\mathcal{D}_K(\mu,q_n^*)\to\mathcal{D}_K^*(\mu)\)` nor any bound for a finite sample size.

### Theorem 3 (.textsc[Rate of convergence])<h0br>

----
Let `\(\mathcal{P}(R)=\{\mu:\mathbb{P}_{\mu}(X\in \overline{B}_R)=\mu(\overline{B}_R)=1\}\)`, then for every `\(\mu\in\mathcal{P}(R)\)` and `\(n\)` large enough,
`$$\mathcal{D}_K(\mu,q_n^*)-\mathcal{D}_K^*(\mu)=O\Big(\sqrt{\frac{\log(n)}{n}}\Big)\quad \mu\text{-}a.s.$$`
<h0br>

----

.stress[Sketch of proof]: .stress[let's not talk about it!] .stress[For a complete proof, see [Tamás Linder (2001)](https://mast.queensu.ca/~linder/pdf/cism.pdf)].
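---
## Empirical distortion in code<hbr>

A short illustrative sketch (added to the deck; `empirical_distortion` is a hypothetical helper) of computing `\(\mathcal{D}_K(\mu_n, q)\)` for a nearest neighbor quantizer whose codebook is given by the rows of a matrix `C`:

```r
# Mean squared distance from each data point (row of X)
# to its nearest codevector (row of C)
empirical_distortion <- function(X, C) {
  mean(apply(X, 1, function(x)
    min(colSums((t(C) - x)^2))))
}

# Example: distortion of a K-means codebook on faithful;
# compare with km$tot.withinss / nrow(faithful)
km <- kmeans(faithful, centers = 3, nstart = 10)
empirical_distortion(as.matrix(faithful), km$centers)
```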
---
## _K_-means algorithm<hbr>

Given `\(\mathcal{T}_n=\{X_1,...,X_n\}\)`, we want to group these data into `\(K\)` groups such that:<h0br>

- data points in the same group are similar enough
- data points of different groups tend to be distant.
<h0br>

--

### Algorithm<h0br>

----
1. .stress[Initialization]: codebook `\(\mathcal{C}=\{c_1,...,c_K\}\subset\mathcal{T}_n\)`
2. .stress[Assigning step]: `for i = 1,2,...,n:`<br> `assign` `\(X_i\)` `to group` `\(k\)` `if` `\(\|X_i-c_k\|^2=\min_{1\leq j\leq K}\|X_i-c_j\|^2\)`
3. .stress[Centroid recomputation]: let `\(S_k=\{X_i\in\mathcal{T}_n:\|X_i-c_k\|\leq\|X_i-c_j\|,\forall j\neq k\}\)`<br> for `\(k=1,2,...,K\)`, then the codevectors are updated by
`$$c_k\leftarrow \frac{1}{|S_k|}\sum_{X_i\in S_k}X_i.$$`
4. Alternately repeat steps 2. and 3. until a stopping criterion is met.

<hbr>

----

---
## _K_-means algorithm<h0br>

----
1. .stress[Initialization]: codebook `\(\mathcal{C}=\{c_1,...,c_K\}\subset\mathcal{T}_n\)`
2. .stress[Assigning step]: `\(\mathcal{C}=\{c_1,...,c_K\}\to \hat{\mathcal{S}}=\{\hat{S}_1,...,\hat{S}_K\}\)`.
3. .stress[Centroid recomputation]: `\(\hat{\mathcal{S}}=\{\hat{S}_1,...,\hat{S}_K\}\to\hat{\mathcal{C}}=\{\hat{c}_1,...,\hat{c}_K\}\)`.
4. Alternately repeat steps 2. and 3. until a stopping criterion is met (a code sketch is given on the left).

<hbr>

----

.pull-left[
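A compact R sketch of steps 1.-4. (an added illustration; `lloyd` is a hypothetical helper and empty cells are not handled):

```r
lloyd <- function(X, K, max_iter = 50) {
  X <- as.matrix(X)
  # 1. initialize the codebook from the data
  C <- X[sample(nrow(X), K), , drop = FALSE]
  for (it in seq_len(max_iter)) {
    # 2. assign each point to its nearest codevector
    g <- apply(X, 1, function(x)
      which.min(colSums((t(C) - x)^2)))
    # 3. recompute each codevector as its cell's mean
    C_new <- t(sapply(seq_len(K), function(k)
      colMeans(X[g == k, , drop = FALSE])))
    # 4. stop once the codebook no longer moves
    if (max(abs(C_new - C)) < 1e-8) break
    C <- C_new
  }
  list(codebook = C_new, cluster = g)
}
```

In practice, one would rather call the built-in `kmeans(X, centers = K, nstart = 25)`.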
] .pull-right[ <br> <br> .center[ .stress[K-means algorithm]<br><br> .stress[with] `\(K=3\)`<br><br> .stress[(Lloyd's method).] ] ] --- template: inter-slide class: left, middle count: false ##
.bold-blue[Outline] <br> .hhead[I. Theory of fixed-rate vector quantization] <br> .hhead[II. Empirical setting & K-means algorithm] <br> .section[III. Questions & conclusion] --- ##
Questions around the _K_-means algorithm<hbr>

- Choosing `\(K\)`?
- Assessing clustering quality?
- Scaling or not scaling?
- Choosing a distance? (See also [Banerjee et al. (2005)](https://jmlr.org/papers/volume6/banerjee05b/banerjee05b.pdf) and [Fischer (2010)](https://www.lpsm.paris/_media/users/fischer/quantization_and_clustering_with_bregman_divergences.pdf))
- Initialization methods?
- Fast updates for the assigning/centroid recomputation steps?
- Stopping rules? ...

--

.stress[
] If there are no restrictions on the dimension of the input space, on `\(K\)`, or on the sample size, computing an optimal codebook is an [NP-hard problem](https://en.wikipedia.org/wiki/NP-hardness).

--

.stress[
] It suffers in high-dimensional input spaces ( `\(p\)` large) due to the [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality).

--

.stress[
] The empirical distortion measure `\(\mathcal{D}_K(\mu_n,\hat{q})=\frac{1}{n}\sum_{i=1}^n\min_{1\leq k\leq K}\|X_i-\hat{c}_k\|^2\)` is not convex; therefore, the algorithm might get stuck in a locally optimal solution.

---
##
Conclusion<hbr>

- We went briefly from the theory to the algorithm of the `\(K\)`-means clustering method.
- The `\(K\)`-means algorithm adapts the theoretical properties of the .stress[NNQ] into an iterative method that constructs a sequence of .stress[Voronoï] partitions.
- The number of clusters `\(K\)` is assumed to be a known input parameter. In practice, .stress[cross-validation] can be used to approximate the optimal `\(K\)`.
- Convergence to a local minimum may produce counterintuitive ("wrong") results. In practice, several initializations are run, and the best solution is retained.
- Dimensionality reduction such as [PCA](https://en.wikipedia.org/wiki/Principal_component_analysis) or random projections based on the [Johnson-Lindenstrauss Lemma](https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma) can be used alongside `\(K\)`-means.
- `\(K\)`-means does not work well for clustering data points of mixed classes or of spiral shapes... In such cases, one might be interested in [spectral clustering](https://en.wikipedia.org/wiki/Spectral_clustering).

.center[<img src="./img/spiral.png" width="140px" height ="130px"/> <img src="./img/spiral2.png" width="150px" height ="132px"/>]

???
Squared Euclidean distance is very sensitive to outliers.

An inappropriate choice of k may yield poor results. That is why, when performing k-means, it is important to run diagnostic checks for determining the number of clusters in the data set.

---
count: false
template: inter-slide
class: left, middle

.center[# References]<h0br>

📚 [T. Linder. Learning-theoretic methods in vector quantization. In L. Györfi (ed.), Principles of Nonparametric Learning, International Centre for Mechanical Sciences, vol. 434, Springer, Vienna, 2002.](https://link.springer.com/chapter/10.1007/978-3-7091-2568-7_4)

📚 [A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705–1749, 2005.](https://jmlr.org/papers/volume6/banerjee05b/banerjee05b.pdf)

📚 [A. Fischer. Quantization and clustering with Bregman divergences. Journal of Multivariate Analysis, 101(10):2207–2221, 2010.](https://www.lpsm.paris/_media/users/fischer/quantization_and_clustering_with_bregman_divergences.pdf)

📚 [S. Has, A. Fischer, and M. Mougeot. KFC: a clusterwise supervised learning procedure based on the aggregation of distances. Journal of Statistical Computation and Simulation, 0(0):1–21, 2021. doi: 10.1080/00949655.2021.1891539.](https://www.tandfonline.com/doi/abs/10.1080/00949655.2021.1891539)

📚 [A. Singh. Spectral clustering. Machine Learning 10-701/15-781, lecture slides.](https://www.cs.cmu.edu/~aarti/Class/10701/slides/Lecture21_2.pdf)
[https://github.com/hassothea/KFC-Procedure](https://github.com/hassothea/KFC-Procedure) <h0br> .pull-right[<h0br> # Thank you 🤓 ]