K-means clustering
K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
K-means clustering is an NP-hard problem, but can be simply implemented using the iterative refinement technique outlined below.
- Pick k points called means. This is called initialization.
- Associate each input point with the mean that is closest to it. We obtain k clusters of points, and we refer to this process as classifying the points.
- Update each mean to have the average value of the corresponding cluster.
- If the k means have significantly changed, go back to step 2. If they did not, we say that the algorithm converged.
- The k means represent different clusters – every point is in the cluster corresponding to the closest mean.
Worked Example
Using the same one-dimensional dataset as the previous example, but instead requiring the clustering/k-means
namespace:
(require '[clustering.core.k-means :as k-means])
(require '[clj-time.core :refer [after? date-time interval in-days])
(require '[clj-time.format :refer [unparse formatters])
(require '[clj-time.coerce :refer [to-long from-long])
(def test-dataset
(hash-set
(date-time 2013 7 21)
(date-time 2013 7 25)
...)))
We still need a distance measure between two dates, but additionally, the K-Means algorithm also requires a function capable of averaging across a range of dates:
(defn distance [dt-a dt-b]
(if (after? dt-a dt-b)
(distance dt-b dt-a)
(in-days (interval dt-a dt-b))))
(defn average [dates]
(from-long
(/ (reduce + (map to-long dates)) (count dates))))
The algorithm is initialised with some mean values randomly selected from the dataset. Note that the desired number of clusters must be specified up-front (in this case, 3):
(def means (k-means/init-means 3 test-dataset)
(def groups (k-means/cluster distance average test-dataset means 0))
(count groups)
; => 3
(map fmt (sort (groups 0)))
; => ("2012-12-26" "2012-12-28" "2013-01-16")
(map fmt (sort (groups 1)))
; => ("2013-07-01" "2013-07-14" "2013-07-21" "2013-07-25" "2013-07-31" "2013-08-03")
(map fmt (sort (groups 2)))
; => ("2012-05-28" "2012-06-02" "2012-06-06" "2012-06-07" "2012-06-09")
(Note: Because of the random initialisation of the means, the cluster orderings will be different each time evaluation occurs.)