K-Means¶

The K-Means algorithm partitions \(n\) feature vectors into \(k\) clusters minimizing some criterion. Each cluster is characterized by a representative point, called a centroid.

Given the training set \(X = \{ x_1, \ldots, x_n \}\) of \(p\)-dimensional feature vectors and a positive integer \(k\), the problem is to find a set \(C = \{ c_1, \ldots, c_k \}\) of \(p\)-dimensional centroids that minimize the objective function

\[\Phi_{X}(C) = \sum_{i = 1}^n d^2(x_i, C),\]

where \(d^2(x_i, C)\) is the squared Euclidean distance from \(x_i\) to the closest centroid in \(C\),

\[d^2(x_i, C) = \min_{1 \leq j \leq k} \| x_i - c_j \|^2, \quad 1 \leq i \leq n.\]

Expression \(\|\cdot\|\) denotes \(L_2\) norm.

Note

In the general case, \(d\) may be an arbitrary distance function. Current version of the oneDAL spec defines only Euclidean distance case.

Lloyd’s method¶

The Lloyd’s method [Lloyd82] consists in iterative updates of centroids by applying the alternating Assignment and Update steps, where \(t\) denotes a index of the current iteration, e.g., \(C^{(t)} = \{ c_1^{(t)}, \ldots, c_k^{(t)} \}\) is the set of centroids at the \(t\)-th iteration. The method requires the initial centroids \(C^{(1)}\) to be specified at the beginning of the algorithm (\(t = 1\)).

(1) Assignment step: Assign each feature vector \(x_i\) to the nearest centroid. \(y_i^{(t)}\) denotes the assigned label (cluster index) to the feature vector \(x_i\).

\[y_i^{(t)} = \mathrm{arg}\min_{1 \leq j \leq k} \| x_i - c_j^{(t)} \|^2, \quad 1 \leq i \leq n.\]

Each feature vector from the training set \(X\) is assigned to exactly one centroid so that \(X\) is partitioned to \(k\) disjoint sets (clusters)

\[S_j^{(t)} = \big\{ \; x_i \in X : \; y_i^{(t)} = j \; \big\}, \quad 1 \leq j \leq k.\]

(2) Update step: Recalculate centroids by averaging feature vectors assigned to each cluster.

\[c_j^{(t + 1)} = \frac{1}{|S_j^{(t)}|} \sum_{x \in S_j^{(t)}} x, \quad 1 \leq j \leq k.\]

The steps (1) and (2) are performed until the following stop condition,

\[\sum_{j=1}^k \big\| c_j^{(t)} - c_j^{(t+1)} \big\|^2 < \varepsilon,\]

is satisfied or number of iterations exceeds the maximal value \(T\) defined by the user.

Usage example¶

onedal::kmeans::model run_training(const onedal::table& data,
                                   const onedal::table& initial_centroids) {

   const auto kmeans_desc = onedal::kmeans::desc<float>{}
      .set_cluster_count(10)
      .set_max_iteration_count(50)
      .set_accuracy_threshold(1e-4);

   const auto result = onedal::train(kmeans_desc, data, initial_centroids);

   print_table("labels", result.get_labels());
   print_table("centroids", result.get_model().get_centroids());
   print_value("objective", result.get_objective_function_value());

   return result.get_model();
}

onedal::table run_inference(const onedal::kmeans::model& model,
                            const onedal::table& new_data) {

   const auto kmeans_desc = onedal::kmeans::desc<float>{}
      .set_cluster_count(model.get_cluster_count());

   const auto result = onedal::infer(kmeans_desc, model, new_data);

   print_table("labels", result.get_labels());
}

API¶

Methods¶

namespace method {
   struct lloyd {};
   using by_default = lloyd;
} // namespace method

struct lloyd¶: Tag-type that denotes Lloyd’s method.

using by_default = lloyd ¶: Alias tag-type for the Lloyd’s method.

Descriptor¶

template <typename Float = float,
          typename Method = method::by_default>
class desc {
public:
   desc();

   int64_t get_cluster_count() const;
   int64_t get_max_iteration_count() const;
   double get_accuracy_threshold() const;

   desc& set_cluster_count(int64_t);
   desc& set_max_iteration_count(int64_t);
   desc& set_accuracy_threshold(double);
};

template<typename Float = float, typename Method = method::by_default> class desc¶

Template Parameters

Float – The floating-point type that the algorithm uses for intermediate computations. Can be float or double.
Method – Tag-type that specifies an implementation of K-Means algorithm. Can be method::lloyd or method::by_default.

desc()¶: Creates new instance of descriptor with the default attribute values.

std::int64_t cluster_count = 2¶

The number of clusters \(k\).

Getter & Setter: std::int64_t get_cluster_count() const

desc& set_cluster_count(std::int64_t)
Invariants: cluster_count > 0

std::int64_t max_iteration_count = 100¶

The maximum number of iterations \(T\).

Getter & Setter: std::int64_t get_max_iteration_count() const

desc& set_max_iteration_count(std::int64_t)
Invariants: max_iteration_count >= 0

double accuracy_threshold = 0.0¶

The threshold \(\varepsilon\) for the stop condition.

Getter & Setter: double get_accuracy_threshold() const

desc& set_accuracy_threshold(double)
Invariants: accuracy_threshold >= 0.0

Model¶

class model {
public:
   model();

   const table& get_centroids() const;
   int64_t get_cluster_count() const;
};

class model¶

model()¶: Creates a model with the default attribute values.

table centroids = table()¶

\(k \times p\) table with the cluster centroids. Each row of the table stores one centroid.

Getter: const table& get_centroids() const

std::int64_t cluster_count = 0¶

Number of clusters \(k\) in the trained model.

Getter: std::int64_t get_cluster_count() const
Invariants: cluster_count == centroids.row_count

Training `onedal::train(...)`¶

Input¶

class train_input {
public:
   train_input();
   train_input(const table& data);
   train_input(const table& data, const table& initial_centroids);

   const table& get_data() const;
   const table& get_initial_centroids() const;

   train_input& set_data(const table&);
   train_input& set_initial_centroids(const table&);
};

class train_input¶

train_input()¶: Creates input for the training operation with the default attribute values.

train_input(const table &data)¶: Creates input for the training operation with the given data, the other attributes get default values.

train_input(const table &data, const table &initial_centroids)¶: Creates input for the training operation with the given data and initial_centroids.

table data = table()¶

\(n \times p\) table with the data to be clustered, where each row stores one feature vector.

Getter & Setter: const table& get_data() const

train_input& set_data(const table&)

table initial_centroids = table()¶

\(k \times p\) table with the initial centroids, where each row stores one centroid.

Getter & Setter: const table& get_initial_centroids() const

train_input& set_initial_centroids(const table&)

Result¶

class train_result {
public:
   train_result();

   const model& get_model() const;
   const table& get_labels() const;
   int64_t get_iteration_count() const;
   double get_objective_function_value() const;
};

class train_result¶

train_result()¶: Creates result of the training operation with the default attribute values.

kmeans::model model = kmeans::model()¶

The trained K-means model.

Getter: const model& get_model() const

table labels = table()¶

\(n \times 1\) table with the labels \(y_i\) assigned to the samples \(x_i\) in the input data, \(1 \leq 1 \leq n\).

Getter: const table& get_labels() const

std::int64_t iteration_count = 0¶

The number of iterations performed by the algorithm.

Invariants: iteration_count >= 0

double objective_function_value = 0.0¶

The value of the objective function \(\Phi_X(C)\), where \(C\) is model.centroids (see kmeans::model::centroids).

Invariants: objective_function_value >= 0.0

Operation semantics¶

template<typename Descriptor> kmeans::train_result train(const Descriptor &desc, const kmeans::train_input &input)¶

Template Parameters: Descriptor – K-Means algorithm descriptor kmeans::desc.

Preconditions: input.data.is_empty == false

input.initial_centroids.is_empty == false

input.initial_centroids.row_count == desc.cluster_count

input.initial_centroids.column_count == input.data.column_count
Postconditions: result.labels.is_empty == false

result.labels.row_count == input.data.row_count

result.model.centroids.is_empty == false

result.model.clusters.row_count == desc.cluster_count

result.model.clusters.column_count == input.data.column_count

result.iteration_count <= desc.max_iteration_count

Inference `onedal::infer(...)`¶

Input¶

class infer_input {
public:
   infer_input();
   infer_input(const model& m);
   infer_input(const model& m, const table& data);

   const model& get_model() const;
   const table& get_data() const;

   infer_input& set_model(const model&);
   infer_input& set_data(const table&);
};

class infer_input¶

infer_input()¶: Creates input for the inference operation with the default attribute values.

infer_input(const kmeans::model &model)¶: Creates input for the inference operation with the given model, the other attributes get default values.

infer_input(const kmeans::model &model, const table &data)¶: Creates input for the inference operation with the given model and data.

table data = table()¶

\(n \times p\) table with the data to be assigned to the clusters, where each row stores one feature vector.

Getter & Setter: const table& get_data() const

infer_input& set_data(const table&)

kmeans::model model = kmeans::model()¶

The trained K-Means model (see kmeans::model).

Getter & Setter: const kmeans::model& get_model() const

infer_input& set_model(const kmeans::model&)

Result¶

class infer_result {
public:
   infer_result();

   const table& get_labels() const;
   double get_objective_function_value() const;
};

class infer_result¶

infer_result()¶: Creates result of the inference operation with the default attribute values.

table labels = table()¶

\(n \times 1\) table with assignments labels to feature vectors in the input data.

Getter: const table& get_labels() const

double objective_function_value = 0.0¶

The value of the objective function \(\Phi_X(C)\), where \(C\) is defined by the corresponding infer_input::model::centroids.

Invariants: objective_function_value >= 0.0

Operation semantics¶

template<typename Descriptor> kmeans::infer_result infer(const Descriptor &desc, const kmeans::infer_input &input)¶

Template Parameters: Descriptor – K-Means algorithm descriptor kmeans::desc.

Preconditions: input.data.is_empty == false

input.model.centroids.is_empty == false

input.model.centroids.row_count == desc.cluster_count

input.model.centroids.column_count == input.data.column_count
Postconditions: result.labels.is_empty == false

result.labels.row_count == input.data.row_count

K-Means¶

Lloyd’s method¶

Usage example¶

API¶

Methods¶

Descriptor¶

Model¶

Training onedal::train(...)¶

Input¶

Result¶

Operation semantics¶

Inference onedal::infer(...)¶

Input¶

Result¶

Operation semantics¶

Training `onedal::train(...)`¶

Inference `onedal::infer(...)`¶