K-Means¶
The K-Means algorithm partitions \(n\) feature vectors into \(k\) clusters minimizing some criterion. Each cluster is characterized by a representative point, called a centroid.
Given the training set \(X = \{ x_1, \ldots, x_n \}\) of \(p\)-dimensional feature vectors and a positive integer \(k\), the problem is to find a set \(C = \{ c_1, \ldots, c_k \}\) of \(p\)-dimensional centroids that minimize the objective function
where \(d^2(x_i, C)\) is the squared Euclidean distance from \(x_i\) to the closest centroid in \(C\),
Expression \(\|\cdot\|\) denotes \(L_2\) norm.
Note
In the general case, \(d\) may be an arbitrary distance function. Current version of the oneDAL spec defines only Euclidean distance case.
Lloyd’s method¶
The Lloyd’s method [Lloyd82] consists in iterative updates of centroids by applying the alternating Assignment and Update steps, where \(t\) denotes a index of the current iteration, e.g., \(C^{(t)} = \{ c_1^{(t)}, \ldots, c_k^{(t)} \}\) is the set of centroids at the \(t\)-th iteration. The method requires the initial centroids \(C^{(1)}\) to be specified at the beginning of the algorithm (\(t = 1\)).
(1) Assignment step: Assign each feature vector \(x_i\) to the nearest centroid. \(y_i^{(t)}\) denotes the assigned label (cluster index) to the feature vector \(x_i\).
Each feature vector from the training set \(X\) is assigned to exactly one centroid so that \(X\) is partitioned to \(k\) disjoint sets (clusters)
(2) Update step: Recalculate centroids by averaging feature vectors assigned to each cluster.
The steps (1) and (2) are performed until the following stop condition,
is satisfied or number of iterations exceeds the maximal value \(T\) defined by the user.
Usage example¶
onedal::kmeans::model run_training(const onedal::table& data,
const onedal::table& initial_centroids) {
const auto kmeans_desc = onedal::kmeans::desc<float>{}
.set_cluster_count(10)
.set_max_iteration_count(50)
.set_accuracy_threshold(1e-4);
const auto result = onedal::train(kmeans_desc, data, initial_centroids);
print_table("labels", result.get_labels());
print_table("centroids", result.get_model().get_centroids());
print_value("objective", result.get_objective_function_value());
return result.get_model();
}
onedal::table run_inference(const onedal::kmeans::model& model,
const onedal::table& new_data) {
const auto kmeans_desc = onedal::kmeans::desc<float>{}
.set_cluster_count(model.get_cluster_count());
const auto result = onedal::infer(kmeans_desc, model, new_data);
print_table("labels", result.get_labels());
}
API¶
Methods¶
namespace method {
struct lloyd {};
using by_default = lloyd;
} // namespace method
-
struct
lloyd
¶ Tag-type that denotes Lloyd’s method.
-
using
by_default
= lloyd¶ Alias tag-type for the Lloyd’s method.
Descriptor¶
template <typename Float = float,
typename Method = method::by_default>
class desc {
public:
desc();
int64_t get_cluster_count() const;
int64_t get_max_iteration_count() const;
double get_accuracy_threshold() const;
desc& set_cluster_count(int64_t);
desc& set_max_iteration_count(int64_t);
desc& set_accuracy_threshold(double);
};
-
template<typename
Float
= float, typenameMethod
= method::by_default>
classdesc
¶ - Template Parameters
Float – The floating-point type that the algorithm uses for intermediate computations. Can be
float
ordouble
.Method – Tag-type that specifies an implementation of K-Means algorithm. Can be
method::lloyd
ormethod::by_default
.
-
desc
()¶ Creates new instance of descriptor with the default attribute values.
-
std::int64_t
cluster_count
= 2¶ The number of clusters \(k\).
- Getter & Setter
std::int64_t get_cluster_count() const
desc& set_cluster_count(std::int64_t)
- Invariants
cluster_count > 0
-
std::int64_t
max_iteration_count
= 100¶ The maximum number of iterations \(T\).
- Getter & Setter
std::int64_t get_max_iteration_count() const
desc& set_max_iteration_count(std::int64_t)
- Invariants
max_iteration_count >= 0
-
double
accuracy_threshold
= 0.0¶ The threshold \(\varepsilon\) for the stop condition.
- Getter & Setter
double get_accuracy_threshold() const
desc& set_accuracy_threshold(double)
- Invariants
accuracy_threshold >= 0.0
Model¶
class model {
public:
model();
const table& get_centroids() const;
int64_t get_cluster_count() const;
};
-
class
model
¶ -
model
()¶ Creates a model with the default attribute values.
-
table
centroids
= table()¶ \(k \times p\) table with the cluster centroids. Each row of the table stores one centroid.
- Getter
const table& get_centroids() const
-
std::int64_t
cluster_count
= 0¶ Number of clusters \(k\) in the trained model.
- Getter
std::int64_t get_cluster_count() const
- Invariants
cluster_count == centroids.row_count
-
Training onedal::train(...)
¶
Input¶
class train_input {
public:
train_input();
train_input(const table& data);
train_input(const table& data, const table& initial_centroids);
const table& get_data() const;
const table& get_initial_centroids() const;
train_input& set_data(const table&);
train_input& set_initial_centroids(const table&);
};
-
class
train_input
¶ -
train_input
()¶ Creates input for the training operation with the default attribute values.
-
train_input
(const table &data)¶ Creates input for the training operation with the given
data
, the other attributes get default values.
-
train_input
(const table &data, const table &initial_centroids)¶ Creates input for the training operation with the given data and
initial_centroids
.
-
Result¶
class train_result {
public:
train_result();
const model& get_model() const;
const table& get_labels() const;
int64_t get_iteration_count() const;
double get_objective_function_value() const;
};
-
class
train_result
¶ -
train_result
()¶ Creates result of the training operation with the default attribute values.
-
kmeans::model
model
= kmeans::model()¶ The trained K-means model.
- Getter
const model& get_model() const
-
table
labels
= table()¶ \(n \times 1\) table with the labels \(y_i\) assigned to the samples \(x_i\) in the input data, \(1 \leq 1 \leq n\).
- Getter
const table& get_labels() const
-
std::int64_t
iteration_count
= 0¶ The number of iterations performed by the algorithm.
- Invariants
iteration_count >= 0
-
double
objective_function_value
= 0.0¶ The value of the objective function \(\Phi_X(C)\), where \(C\) is
model.centroids
(seekmeans::model::centroids
).- Invariants
objective_function_value >= 0.0
-
Operation semantics¶
-
template<typename
Descriptor
>
kmeans::train_resulttrain
(const Descriptor &desc, const kmeans::train_input &input)¶ - Template Parameters
Descriptor – K-Means algorithm descriptor
kmeans::desc
.
- Preconditions
- Postconditions
Inference onedal::infer(...)
¶
Input¶
class infer_input {
public:
infer_input();
infer_input(const model& m);
infer_input(const model& m, const table& data);
const model& get_model() const;
const table& get_data() const;
infer_input& set_model(const model&);
infer_input& set_data(const table&);
};
-
class
infer_input
¶ -
infer_input
()¶ Creates input for the inference operation with the default attribute values.
-
infer_input
(const kmeans::model &model)¶ Creates input for the inference operation with the given
model
, the other attributes get default values.
-
infer_input
(const kmeans::model &model, const table &data)¶ Creates input for the inference operation with the given
model
anddata
.
-
Result¶
class infer_result {
public:
infer_result();
const table& get_labels() const;
double get_objective_function_value() const;
};
-
class
infer_result
¶ -
infer_result
()¶ Creates result of the inference operation with the default attribute values.
-
table
labels
= table()¶ \(n \times 1\) table with assignments labels to feature vectors in the input data.
- Getter
const table& get_labels() const
-
double
objective_function_value
= 0.0¶ The value of the objective function \(\Phi_X(C)\), where \(C\) is defined by the corresponding
infer_input::model::centroids
.- Invariants
objective_function_value >= 0.0
-
Operation semantics¶
-
template<typename
Descriptor
>
kmeans::infer_resultinfer
(const Descriptor &desc, const kmeans::infer_input &input)¶ - Template Parameters
Descriptor – K-Means algorithm descriptor
kmeans::desc
.
- Preconditions
- Postconditions
result.labels.is_empty == false
result.labels.row_count == input.data.row_count