BIRCH is a very cleaver clustering algorithm which is very fast in practise. At the end of its clustering, it guaranties that the average cluster radius is bounded. This is an attempt to convert this bound to a more useful one.

Cluster Radius of BIRCH

BIRCH algorithm maintains a 3 tuple feature for each of its clusters called a Characteristic Feature. If a cluster contains points,

CF &= (n,LS,SS) \\ LS &= \sum_{i=1}^n x_i \\ SS &= \sum_{i=1}^n \langle x_i,x_i\rangle As the definition implies, LS stands for Linear Sum and SS is Square Sum of the cluster points.

is very useful to quickly measure how close or far a pair of clusters are. If we decide to combine clusters into one, the combined can easily be computed from individual 's. Many other useful measures are explained in [R2]. Only the ones relevant to the problem at hand will be enumerated here.

The centroid of the cluster is given as x_0 = \frac{1}{n} \sum_{i=1}^n x_i

Radius of a cluster is defined as R_x = \left ( \frac{\sum_{i=1}^n \langle x_i-x_0, x_i-x_0 \rangle}{n}\right )^{\frac{1}{2}} This is not equivalent to average distance between all points and the centroid. That would be R_c = \frac{1}{n}\sum_{i=1}^n \langle x_i-x_0, x_i-x_0 \rangle^{\frac{1}{2}} The relationship between and is the subject of this article.

Some limits on BIRCH cluster radius

Consider the cluster in question contains points . BIRCH algorithm guaranties that the average cluster radius . The algorithms takes as one if its parameters.

Lets abstract out the basic building blocks of and by defining z_i = \langle x_i-x_0, x_i-x_0 \rangle^{\frac{1}{2}} Then and . This looks like and norm on . So lets try to connect all of them through the following inequality on norms.

\textbf{[P1]}\qquad \|z\|_p \leq \|z\|_r \leq n^{\left(\frac{1}{r}-\frac{1}{p}\right)}\|z\|_p A simple consequence of this inequality is . With this the bounds possible on the distances are \sqrt{nR_x^2} &\leq nR_c \leq \sqrt{n} \sqrt{nR_x^2} \\ \textbf{[B1]}\qquad \frac{R_x}{\sqrt{n}} &\leq R_c \leq R_x [P1] also gives us and . represents the distance of farthest point from the centroid. So what we have is \textbf{[B2]}\qquad \frac{\max{z_i}}{n} &\leq R_c \leq \max{z_i} \\ \textbf{[B3]}\qquad \frac{\max{z_i}}{\sqrt{n}} &\leq R_x \leq \max{z_i}

References

P1 This property can be proved from Holder's Inequality by assuming

R2 Zhang, T., Ramakrishnan, R. and Livny, M., 1997. BIRCH: A new data clustering algorithm and its applications. Data Mining and Knowledge Discovery, 1(2), pp.141-182.