k-Nearest Neighbors Classification (k-NN)¶

\(k\)-NN classification algorithm infers the class for the new feature vector by computing majority vote of the \(k\) nearest observations from the training set.

Operation	Computational methods		Programming Interface
Training	Brute-force	k-d tree	train(…)	train_input	train_result
Inference	Brute-force	k-d tree	infer(…)	infer_input	infer_result

Mathematical formulation¶

Training¶

Let \(X = \{ x_1, \ldots, x_n \}\) be the training set of \(p\)-dimensional feature vectors, let \(Y = \{ y_1, \ldots, y_n \}\) be the set of class labels, where \(y_i \in \{ 0, \ldots, c-1 \}\), \(1 \leq i \leq n\). Given \(X\), \(Y\) and the number of nearest neighbors \(k\), the problem is to build a model that allows distance computation between the feature vectors in training and inference sets at the inference stage.

Training method: brute-force¶

The training operation produces the model that stores all the feature vectors from the initial training set \(X\).

Training method: k-d tree¶

The training operation builds a \(k\)-\(d\) tree that partitions the training set \(X\) (for more details, see k-d Tree).

Inference¶

Let \(X' = \{ x_1', \ldots, x_m' \}\) be the inference set of \(p\)-dimensional feature vectors. Given \(X'\), the model produced at the training stage and the number of nearest neighbors \(k\), the problem is to predict the label \(y_j'\) for each \(x_j'\), \(1 \leq j \leq m\), by performing the following steps:

Identify the set \(N(x_j') \subseteq X\) of the \(k\) feature vectors in the training set that are nearest to \(x_j'\) with respect to the Euclidean distance.
Estimate the conditional probability for the \(l\)-th class as the fraction of vectors in \(N(x_j')\) whose labels \(y_j\) are equal to \(l\):

(1)¶\[P_{jl} = \frac{1}{| N(x_j') |} \Big| \big\{ x_r \in N(x_j') : y_r = l \big\} \Big|, \quad 1 \leq j \leq m, \; 0 \leq l < c.\]
Predict the class that has the highest probability for the feature vector \(x_j'\):

(2)¶\[y_j' = \mathrm{arg}\max_{0 \leq l < c} P_{jl}, \quad 1 \leq j \leq m.\]

Inference method: brute-force¶

The inference operation determines the set \(N(x_j')\) of the nearest feature vectors by iterating over all the pairs \((x_j', x_i)\) in the implementation defined order, \(1 \leq i \leq n\), \(1 \leq j \leq m\). The final prediction is computed according to the equations (1) and (2).

Inference method: k-d tree¶

The inference operation traverses the \(k\)-\(d\) tree to find feature vectors associated with a leaf node that are closest to \(x_j'\), \(1 \leq j \leq m\). The set \(\tilde{n}(x_j')\) of the currently-known nearest \(k\)-th neighbors is progressively updated during tree traversal. The search algorithm limits exploration of the nodes for which the distance between the \(x_j'\) and respective part of the feature space is not less than the distance between \(x_j'\) and the most distant feature vector from \(\tilde{n}(x_j')\). Once tree traversal is finished, \(\tilde{n}(x_j') \equiv N(x_j')\). The final prediction is computed according to the equations (1) and (2).

Programming Interface¶

Descriptor¶

template <typename Float = float,
          typename Method = method::by_default>
class descriptor {
public:
   explicit descriptor(std::int64_t class_count,
                       std::int64_t neighbor_count);

   std::int64_t get_class_count() const;
   descriptor& set_class_count(std::int64_t);

   std::int64_t get_neighbor_count() const;
   descriptor& set_neighbor_count(std::int64_t);
};

template<typename Float = float, typename Method = method::by_default> class descriptor¶

Template Parameters

Float – The floating-point type that the algorithm uses for intermediate computations. Can be float or double.
Method – Tag-type that specifies an implementation of algorithm. Can be method::bruteforce or method::kd_tree.

Constructors

descriptor(std::int64_t class_count, std::int64_t neighbor_count)¶: Creates a new instance of the class with the given class_count and neighbor_count property values.

Properties

std::int64_t class_count¶

The number of classes \(c\).

Getter & Setter: std::int64_t get_class_count() const

descriptor & set_class_count(std::int64_t)
Invariants: class_count > 1

std::int64_t neighbor_count¶

The number of neighbors \(k\).

Getter & Setter: std::int64_t get_neighbor_count() const

descriptor & set_neighbor_count(std::int64_t)
Invariants: neighbor_count > 0

Computational methods¶

namespace method {
   struct bruteforce {};
   struct kd_tree {};
   using by_default = bruteforce;
} // namespace method

struct bruteforce¶: Tag-type that denotes brute-force computational method.

struct kd_tree¶: Tag-type that denotes k-d tree computational method.

using by_default = bruteforce ¶: Alias tag-type for brute-force computational method.

Model¶

class model {
public:
   model();
};

class model¶

Constructors

model()¶: Creates a new instance of the class with the default property values.

Training `train(...)`¶

Input¶

class train_input {
public:
   train_input(const table& data = table{},
               const table& labels = table{});

   const table& get_data() const;
   train_input& set_data(const table&);

   const table& get_labels() const;
   train_input& set_labels(const table&);
};

class train_input¶

Constructors

train_input(const table &data = table{}, const table &labels = table{})¶

Properties

const table &data = table{}¶

The training set \(X\).

Getter & Setter: const table & get_data() const

train_input & set_data(const table &)

const table &labels = table{}¶

Vector of labels \(y\) for the training set \(X\).

Getter & Setter: const table & get_labels() const

train_input & set_labels(const table &)

Result¶

class train_result {
public:
   train_result();

   const model& get_model() const;
};

class train_result¶

Constructors

train_result()¶

Properties

const model &model = model{}¶

The trained \(k\)-NN model.

Getter & Setter: const model & get_model() const

Operation¶

template<typename Float, typename Method> train_result train(const descriptor<Float, Method> &desc, const train_input &input)¶

Runs the training operation for \(k\)-NN classifier. For more details see onedal::train.

Template Parameters

Float – The floating-point type that the algorithm uses for intermediate computations. Can be float or double.
Method – Tag-type that specifies an implementation of algorithm. Can be method::bruteforce or method::kd_tree.

Parameters

desc – Descriptor of the algorithm.
input – Input data for the training operation.

Preconditions: input.data.has_data == true

input.labels.has_data == true

input.data.rows == input.labels.rows

input.data.columns == 1

input.labels[i] >= 0

input.labels[i] < desc.class_count

Inference `infer(...)`¶

Input¶

class infer_input {
public:
   infer_input(const model& m = model{},
               const table& data = table{});

   const model& get_model() const;
   infer_input& set_model(const model&);

   const table& get_data() const;
   infer_input& set_data(const table&);
};

class infer_input¶

Constructors

infer_input(const model &m = model{}, const table &data = table{})¶

Properties

const model &model = model{}¶

The trained \(k\)-NN model.

Getter & Setter: const model & get_model() const

infer_input & set_model(const model &)

const table &data = table{}¶

The dataset for inference \(X'\).

Getter & Setter: const table & get_data() const

infer_input & set_data(const table &)

Result¶

class infer_result {
public:
   infer_result();

   const table& get_labels() const;
};

class infer_result¶

Constructors

infer_result()¶

Properties

const table &labels = model{}¶

The predicted labels.

Getter & Setter: const table & get_labels() const

Operation¶

template<typename Float, typename Method> infer_result infer(const descriptor<Float, Method> &desc, const infer_input &input)¶

Runs the inference operation for \(k\)-NN classifier. For more details see onedal::infer.

Template Parameters

Float – The floating-point type that the algorithm uses for intermediate computations. Can be float or double.
Method – Tag-type that specifies an implementation of algorithm. Can be method::bruteforce or method::kd_tree.

Parameters

desc – Descriptor of the algorithm.
input – Input data for the inference operation.

Preconditions: input.data.has_data == true
Preconditions: result.labels.rows == input.data.rows

result.labels.columns == 1

result.labels[i] >= 0

result.labels[i] < desc.class_count

k-Nearest Neighbors Classification (k-NN)¶

Mathematical formulation¶

Training¶

Training method: brute-force¶

Training method: k-d tree¶

Inference¶

Inference method: brute-force¶

Inference method: k-d tree¶

Programming Interface¶

Descriptor¶

Computational methods¶

Model¶

Training train(...)¶

Input¶

Result¶

Operation¶

Inference infer(...)¶

Input¶

Result¶

Operation¶

Training `train(...)`¶

Inference `infer(...)`¶