Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is an algorithm for exploratory data analysis and dimensionality reduction. PCA transforms a set of feature vectors with possibly correlated features into a new set of uncorrelated features, called principal components. Principal components are the directions of largest variance, that is, the directions along which the data is most spread out.

Given the training set \(X = \{ x_1, \ldots, x_n \}\) of \(p\)-dimensional feature vectors and the number of principal components \(r\), the problem is to compute \(r\) principal directions (\(p\)-dimensional eigenvectors) for the training set. The eigenvectors can be grouped into the \(r \times p\) matrix \(T\) that contains one eigenvector in each row.
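To make "directions of the largest variance" concrete, the following minimal sketch estimates the first principal direction of a small data set by forming the sample covariance matrix and running power iteration on it. It is an illustration only and not one of the computation methods specified by oneDAL. The names `matrix`, `covariance` and `first_principal_direction`, the start vector, and the fixed iteration count are all assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using matrix = std::vector<std::vector<double>>;

// Sample covariance (divides by n - 1) of an n x p data set, one feature vector per row.
matrix covariance(const matrix& x) {
    assert(x.size() > 1 && !x[0].empty());
    const std::size_t n = x.size();
    const std::size_t p = x[0].size();

    std::vector<double> mean(p, 0.0);
    for (const auto& row : x)
        for (std::size_t j = 0; j < p; ++j)
            mean[j] += row[j] / static_cast<double>(n);

    matrix cov(p, std::vector<double>(p, 0.0));
    for (const auto& row : x)
        for (std::size_t j = 0; j < p; ++j)
            for (std::size_t k = 0; k < p; ++k)
                cov[j][k] += (row[j] - mean[j]) * (row[k] - mean[k]) / static_cast<double>(n - 1);
    return cov;
}

// Power iteration: repeatedly apply the covariance matrix and renormalize.
// Limitation of this sketch: the start vector must not be orthogonal to the
// dominant eigenvector, otherwise the iteration cannot find it.
std::vector<double> first_principal_direction(const matrix& cov, int iterations = 200) {
    const std::size_t p = cov.size();
    std::vector<double> v(p);
    for (std::size_t j = 0; j < p; ++j)
        v[j] = 1.0 / static_cast<double>(j + 1); // arbitrary non-zero start vector

    for (int it = 0; it < iterations; ++it) {
        std::vector<double> w(p, 0.0);
        for (std::size_t j = 0; j < p; ++j)
            for (std::size_t k = 0; k < p; ++k)
                w[j] += cov[j][k] * v[k];

        double norm = 0.0;
        for (double value : w)
            norm += value * value;
        norm = std::sqrt(norm);
        if (norm == 0.0)
            break; // degenerate case: no variance along the current vector

        for (std::size_t j = 0; j < p; ++j)
            v[j] = w[j] / norm;
    }
    return v;
}
```

For points that lie exactly on the line \(y = 2x\), the returned vector is proportional to \((1, 2)\), which is the only direction along which that data varies.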

oneDAL specifies two methods for PCA computation:

  1. Covariance-based method

  2. SVD-based method

Covariance-based method

[TBD]

SVD-based method

[TBD]

Sign-flip technique

Eigenvectors computed by some eigenvalue solvers are not uniquely defined due to sign ambiguity. To get a deterministic result, a sign-flip technique should be applied. One of the sign-flip techniques proposed in [Bro07] requires the following modification of matrix \(T\):

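\[\hat{T}_i = T_i \cdot \mathrm{sgn}(T_{i j^*}), \quad j^* = \arg\max_{1 \leq j \leq p } |{T}_{ij}|, \quad 1 \leq i \leq r,\]

where \(T_i\) is the \(i\)-th row, \(T_{ij}\) is the element in the \(i\)-th row and \(j\)-th column, \(j^*\) is the index of the element with the largest absolute value in the \(i\)-th row, and \(\mathrm{sgn}(\cdot)\) is the signum function,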

\[\begin{split}\mathrm{sgn}(x) = \begin{cases} -1, & x < 0, \\ 0, & x = 0, \\ 1, & x > 0. \end{cases}\end{split}\]
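A minimal sketch of one such sign-flip, assuming the eigenvectors are stored as the rows of a plain `std::vector<std::vector<double>>`: each row is multiplied by the sign of its largest-magnitude element, so that this element becomes non-negative. The function name `sign_flip` and the tie-breaking rule (the first of several equally large elements wins) are assumptions of this example, not part of the specification.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Flips each row of the r x p matrix `t` (one eigenvector per row) so that
// its largest-magnitude element becomes non-negative.
void sign_flip(std::vector<std::vector<double>>& t) {
    for (auto& row : t) {
        assert(!row.empty());

        // Element with the largest absolute value; on ties the first one is kept.
        double pivot = 0.0;
        for (double value : row)
            if (std::fabs(value) > std::fabs(pivot))
                pivot = value;

        // Signum of the pivot: -1, 0 or 1. It is 0 only for an all-zero row.
        const double sign = static_cast<double>((pivot > 0.0) - (pivot < 0.0));
        for (double& value : row)
            value *= sign;
    }
}
```

Applied to the rows \((0.6, -0.8)\) and \((0.8, 0.6)\), the first row is negated because its largest-magnitude element is negative, and the second row is left unchanged.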

Note

The sign-flip technique described above is an example. The oneDAL specification does not require implementation of this particular technique. An implementer can choose an arbitrary technique that modifies the eigenvectors' signs.

Usage example

onedal::pca::model run_training(const onedal::table& data) {

   // Request 5 principal components with deterministic eigenvector signs
   const auto pca_desc = onedal::pca::desc<float>{}
      .set_component_count(5)
      .set_deterministic(true);

   const auto result = onedal::train(pca_desc, data);

   print_table("means", result.get_means());
   print_table("variances", result.get_variances());
   print_table("eigenvalues", result.get_eigenvalues());
   print_table("eigenvectors", result.get_model().get_eigenvectors());

   return result.get_model();
}

onedal::table run_inference(const onedal::pca::model& model,
                            const onedal::table& new_data) {

   // The component count must match the number of eigenvectors in the model
   const auto pca_desc = onedal::pca::desc<float>{}
      .set_component_count(model.get_component_count());

   const auto result = onedal::infer(pca_desc, model, new_data);

   print_table("transformed data", result.get_transformed_data());

   return result.get_transformed_data();
}

API

Methods

namespace method {
   struct cov {};
   struct svd {};
   using by_default = cov;
} // namespace method
struct cov

Tag-type that denotes Covariance-based method.

struct svd

Tag-type that denotes SVD-based method.

using by_default = cov

Alias tag-type for the Covariance-based method.
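Tag-types carry no data. They exist only so that an implementation can be selected at compile time. The sketch below shows one way this kind of dispatch can work, using a stand-in `sketch` namespace rather than the oneDAL API. The `method_t` member and the `method_name` and `describe` helpers are hypothetical and introduced only for illustration.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

namespace sketch {

namespace method {
struct cov {};
struct svd {};
using by_default = cov;
} // namespace method

// Hypothetical descriptor that only records its Method tag.
template <typename Float = float, typename Method = method::by_default>
struct desc {
    using method_t = Method;
};

// Hypothetical helpers: overload resolution picks one per tag-type.
inline std::string method_name(method::cov) { return "covariance"; }
inline std::string method_name(method::svd) { return "svd"; }

// Dispatches on the tag stored in the descriptor type; nothing is decided at run time.
template <typename Descriptor>
std::string describe(const Descriptor&) {
    return method_name(typename Descriptor::method_t{});
}

} // namespace sketch
```

Because `by_default` is an alias rather than a distinct type, a descriptor declared with the default template arguments dispatches to the same overload as one declared with `method::cov` explicitly.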

Descriptor

template <typename Float = float,
          typename Method = method::by_default>
class desc {
public:
   desc();

   std::int64_t get_component_count() const;
   bool get_deterministic() const;

   desc& set_component_count(std::int64_t);
   desc& set_deterministic(bool);
};
template<typename Float = float, typename Method = method::by_default>
class desc
Template Parameters
  • Float – The floating-point type that the algorithm uses for intermediate computations. Can be float or double.

  • Method – Tag-type that specifies an implementation of the PCA algorithm. Can be method::cov, method::svd or method::by_default.

desc()

Creates a new instance of the descriptor with the default attribute values.

std::int64_t component_count = 0

The number of principal components \(r\). If it is zero, the algorithm computes the eigenvectors for all features, \(r = p\).

Getter & Setter
std::int64_t get_component_count() const
desc& set_component_count(std::int64_t)
Invariants
bool deterministic = true

Specifies whether the directions of the eigenvectors must be deterministic. If it is true, the algorithm applies the Sign-flip technique or uses a deterministic eigenvalue solver.

Getter & Setter
bool get_deterministic() const
desc& set_deterministic(bool)

Model

class model {
public:
   model();

   const table& get_eigenvectors() const;
   std::int64_t get_component_count() const;
};
class model
model()

Creates a model with the default attribute values.

table eigenvectors = table()

\(r \times p\) table with the eigenvectors. Each row contains one eigenvector.

Getter
const table& get_eigenvectors() const
std::int64_t component_count = 0

The number of components \(r\) in the trained model.

Getter
std::int64_t get_component_count() const
Invariants

Training onedal::train(...)

Input

class train_input {
public:
   train_input();
   train_input(const table& data);

   const table& get_data() const;

   train_input& set_data(const table&);
};
class train_input
train_input()

Creates an input for the training operation with the default attribute values.

train_input(const table &data)

Creates an input for the training operation with the given data.

table data = table()

\(n \times p\) table with the training data, where each row stores one feature vector.

Getter & Setter
const table& get_data() const
train_input& set_data(const table&)

Result

class train_result {
public:
   train_result();

   const model& get_model() const;
   const table& get_means() const;
   const table& get_variances() const;
   const table& get_eigenvalues() const;
};
class train_result
train_result()

Creates a result of the training operation with the default attribute values.

pca::model model = pca::model()

The trained PCA model.

Getter
const model& get_model() const
table means = table()

\(1 \times r\) table that contains the mean values for the first \(r\) features.

Getter
const table& get_means() const
table variances = table()

\(1 \times r\) table that contains the variances for the first \(r\) features.

Getter
const table& get_variances() const
table eigenvalues = table()

\(1 \times r\) table that contains the eigenvalues for the first \(r\) features.

Getter
const table& get_eigenvalues() const

Operation semantics

template<typename Descriptor>
pca::train_result train(const Descriptor &desc, const pca::train_input &input)
Template Parameters

Descriptor – PCA algorithm descriptor pca::desc.

Preconditions
input.data.is_empty == false
input.data.column_count >= desc.component_count
Postconditions
result.means.row_count == 1
result.means.column_count == desc.component_count
result.variances.row_count == 1
result.variances.column_count == desc.component_count
result.variances >= 0.0
result.eigenvalues.row_count == 1
result.eigenvalues.column_count == desc.component_count
result.model.eigenvectors.row_count == desc.component_count
result.model.eigenvectors.column_count == input.data.column_count

Inference onedal::infer(...)

Input

class infer_input {
public:
   infer_input();
   infer_input(const model& m);
   infer_input(const model& m, const table& data);

   const model& get_model() const;
   const table& get_data() const;

   infer_input& set_model(const model&);
   infer_input& set_data(const table&);
};
class infer_input
infer_input()

Creates an input for the inference operation with the default attribute values.

infer_input(const pca::model &model)

Creates an input for the inference operation with the given model. The other attributes get default values.

infer_input(const pca::model &model, const table &data)

Creates an input for the inference operation with the given model and data.

table data = table()

\(n \times p\) table with the data to be projected to the \(r\) principal components previously extracted from a training set.

Getter & Setter
const table& get_data() const
infer_input& set_data(const table&)
pca::model model = pca::model()

The trained PCA model (see pca::model).

Getter & Setter
const pca::model& get_model() const
infer_input& set_model(const pca::model&)

Result

class infer_result {
public:
   infer_result();

   const table& get_transformed_data() const;
};
class infer_result
infer_result()

Creates a result of the inference operation with the default attribute values.

table transformed_data = table()

\(n \times r\) table that contains data projected to the \(r\) principal components.

Getter
const table& get_transformed_data() const

Operation semantics

template<typename Descriptor>
pca::infer_result infer(const Descriptor &desc, const pca::infer_input &input)
Template Parameters

Descriptor – PCA algorithm descriptor pca::desc.

Preconditions
input.data.is_empty == false
input.model.eigenvectors.row_count == desc.component_count
input.model.eigenvectors.column_count == input.data.column_count
Postconditions
result.transformed_data.row_count == input.data.row_count
result.transformed_data.column_count == desc.component_count
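
The shape constraints above correspond to multiplying the \(n \times p\) data by the transpose of the \(r \times p\) eigenvector matrix. The following sketch shows that projection on plain nested vectors rather than on oneDAL tables. The `matrix` alias and the `project` function are hypothetical names, and the sketch assumes the data is used as-is: whether an implementation centers or scales the input before projecting is not stated in this section.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using matrix = std::vector<std::vector<double>>;

// Projects an n x p data set onto r eigenvectors stored as the rows of an
// r x p matrix and returns the n x r result, Y = X * T^T.
// Assumption: no centering or scaling is applied to the data.
matrix project(const matrix& eigenvectors, const matrix& data) {
    assert(!data.empty()); // mirrors input.data.is_empty == false
    assert(!eigenvectors.empty());

    const std::size_t r = eigenvectors.size();
    const std::size_t p = data[0].size();

    matrix result(data.size(), std::vector<double>(r, 0.0));
    for (std::size_t i = 0; i < data.size(); ++i) {
        assert(data[i].size() == p);
        for (std::size_t k = 0; k < r; ++k) {
            // mirrors eigenvectors.column_count == data.column_count
            assert(eigenvectors[k].size() == p);
            for (std::size_t j = 0; j < p; ++j)
                result[i][k] += data[i][j] * eigenvectors[k][j];
        }
    }
    return result;
}
```

The result has as many rows as the input data and one column per eigenvector, which matches the two postconditions on `transformed_data`.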