Quality Threshold (QT) clustering
From: https://sites.google.com/site/dataclusteringalgorithms/quality-threshold-clustering-algorithm-1
- Initialize the threshold distance allowed for clusters and the minimum cluster size.
- Build a candidate cluster for each data point by including the closest point, the next closest, and so on, until the distance of the cluster surpasses the threshold.
- Save the candidate cluster with the most points as the first true cluster, and remove all points in the cluster from further consideration.
- Repeat with the reduced set of points until no more cluster can be formed having the minimum cluster size.
NOTE: QT clustering is computationally intensive and time consuming - increasing the minimum cluster size or increasing the number of data points can greatly increase the computational time.
Worked example
We’ll start by trying to cluster a simple one-dimensional set of dates:
(require '[clustering.core.qt :as qt])
(require '[clj-time.core :refer [after? date-time interval in-days])
(require '[clj-time.format :refer [unparse formatters])
(def test-dataset
(hash-set
(date-time 2013 7 21)
(date-time 2013 7 25)
(date-time 2013 7 14)
(date-time 2013 7 31)
(date-time 2013 7 1)
(date-time 2013 8 3)
(date-time 2012 12 26)
(date-time 2012 12 28)
(date-time 2013 1 16)
(date-time 2012 6 2)
(date-time 2012 6 7)
(date-time 2012 6 6)
(date-time 2012 6 9)
(date-time 2012 5 28)))
In order to use the QT clustering algorithm, we need to first define some measure of distance between data-points; this is quite easy for dates:
(defn distance [dt-a dt-b]
(if (after? dt-a dt-b)
(distance dt-b dt-a)
(in-days (interval dt-a dt-b))))
(distance
(date-time 2012 12 26)
(date-time 2013 1 16))
; => 21
For convenience, let’s also define a date formatter:
(def fmt (partial unparse (formatters :date)))
(fmt (date-time 2019 2 19))
; => "2019-02-19"
To split these into clusters (with a minimum cluster size of 3), grouped roughly into months,
(def groups (qt/cluster distance test-dataset 31 3))
(count groups)
; => 3
(map fmt (sort (groups 0)))
; => ("2013-07-01" "2013-07-14" "2013-07-21" "2013-07-25" "2013-07-31" "2013-08-03")
(map fmt (sort (groups 1)))
; => ("2012-05-28" "2012-06-02" "2012-06-06" "2012-06-07" "2012-06-09")
(map fmt (sort (groups 2)))
; => ("2012-12-26" "2012-12-28" "2013-01-16")