Principal Components Analysis (PCA)¶

Principal Component Analysis (PCA) is an algorithm for exploratory data analysis and dimensionality reduction. PCA transforms a set of feature vectors with possibly correlated features into a new set of uncorrelated features, called principal components. Principal components are the directions of largest variance, that is, the directions along which the data is most spread out.

Given the training set \(X = \{ x_1, \ldots, x_n \}\) of \(p\)-dimensional feature vectors and the number of principal components \(r\), the problem is to compute \(r\) principal directions (\(p\)-dimensional eigenvectors) for the training set. The eigenvectors can be grouped into the \(r \times p\) matrix \(T\) that contains one eigenvector in each row.
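To make the problem statement concrete, the following is a minimal sketch of how the first principal direction of a small data set could be found. It builds the sample covariance matrix and runs power iteration on it. This is an illustration only, not the method that oneDAL specifies: the names `matrix`, `covariance`, and `first_principal_direction` are hypothetical, and the \(n - 1\) denominator and the fixed iteration count are assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Row-major dense matrix: one inner vector per row (hypothetical helper type).
using matrix = std::vector<std::vector<double>>;

// Sample covariance matrix (p x p) of an n x p data set.
// Assumption: the unbiased n - 1 denominator is used.
matrix covariance(const matrix& x) {
    assert(x.size() > 1);
    const std::size_t n = x.size();
    const std::size_t p = x[0].size();

    std::vector<double> mean(p, 0.0);
    for (const auto& row : x)
        for (std::size_t j = 0; j < p; ++j)
            mean[j] += row[j] / n;

    matrix cov(p, std::vector<double>(p, 0.0));
    for (const auto& row : x)
        for (std::size_t j = 0; j < p; ++j)
            for (std::size_t k = 0; k < p; ++k)
                cov[j][k] += (row[j] - mean[j]) * (row[k] - mean[k]) / (n - 1);
    return cov;
}

// Dominant eigenvector of a symmetric positive semi-definite matrix,
// approximated by power iteration with a fixed number of steps.
std::vector<double> first_principal_direction(const matrix& cov, int iterations = 200) {
    const std::size_t p = cov.size();
    std::vector<double> v(p, 1.0); // start vector; must not be orthogonal to the answer
    for (int it = 0; it < iterations; ++it) {
        std::vector<double> w(p, 0.0);
        for (std::size_t j = 0; j < p; ++j)
            for (std::size_t k = 0; k < p; ++k)
                w[j] += cov[j][k] * v[k];

        double norm = 0.0;
        for (double c : w) norm += c * c;
        norm = std::sqrt(norm);
        if (norm == 0.0) break; // zero-variance data: keep the current vector

        for (std::size_t j = 0; j < p; ++j) v[j] = w[j] / norm;
    }
    return v;
}
```

For points that lie on the line \(y = 2x\), the returned vector is proportional to \((1, 2)\). The remaining \(r - 1\) directions would need deflation or a full eigenvalue solver, which this sketch leaves out.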

oneDAL specifies two methods for PCA computation:

Covariance-based method
SVD-based method

Covariance-based method¶

[TBD]

SVD-based method¶

[TBD]

Sign-flip technique¶

Eigenvectors computed by some eigenvalue solvers are not uniquely defined because of sign ambiguity: if \(v\) is an eigenvector, so is \(-v\). To get a deterministic result, a sign-flip technique should be applied. One of the sign-flip techniques proposed in [Bro07] requires the following modification of matrix \(T\):

\[\hat{T}_i = T_i \cdot \mathrm{sgn}(T_{i j^*}), \quad j^* = \arg\max_{1 \leq j \leq p } |{T}_{ij}|, \quad 1 \leq i \leq r,\]

where \(T_i\) is the \(i\)-th row, \(T_{ij}\) is the element in the \(i\)-th row and \(j\)-th column, \(j^*\) is the column index of the largest-magnitude element in the \(i\)-th row, and \(\mathrm{sgn}(\cdot)\) is the signum function,

\[\begin{split}\mathrm{sgn}(x) = \begin{cases} -1, & x < 0, \\ 0, & x = 0, \\ 1, & x > 0. \end{cases}\end{split}\]
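The rule can be sketched in a few lines of C++. The sketch assumes one reading of the technique: each row is multiplied by the sign of its largest-magnitude element, so that this element becomes non-negative. The function names `sgn` and `sign_flip` and the nested-vector storage are hypothetical and are not part of the oneDAL API.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Signum function: returns -1, 0 or 1.
int sgn(double x) {
    return (x > 0.0) - (x < 0.0);
}

// Applies the sign flip in place to an r x p matrix stored row by row
// (one eigenvector per row). Assumption: the sign is taken from the
// element with the largest absolute value in each row.
void sign_flip(std::vector<std::vector<double>>& t) {
    for (auto& row : t) {
        assert(!row.empty());

        // Find the element with the largest absolute value, keeping its sign.
        double pivot = row[0];
        for (double value : row)
            if (std::fabs(value) > std::fabs(pivot)) pivot = value;

        // Multiply the whole row by the sign of that element.
        const int s = sgn(pivot);
        for (double& value : row) value *= s;
    }
}
```

With this rule, the row \((0.6, -0.8)\) becomes \((-0.6, 0.8)\), while \((0.8, 0.6)\) is left unchanged.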

Note

The sign-flip technique described above is an example. The oneDAL specification does not require implementing this particular technique. An implementer can choose any technique that modifies the signs of the eigenvectors so that the result is deterministic.

Usage example¶

onedal::pca::model run_training(const onedal::table& data) {

   const auto pca_desc = onedal::pca::desc<float>{}
      .set_component_count(5)
      .set_deterministic(true);

   const auto result = onedal::train(pca_desc, data);

   print_table("means", result.get_means());
   print_table("variances", result.get_variances());
   print_table("eigenvalues", result.get_eigenvalues());
   print_table("eigenvectors", result.get_model().get_eigenvectors());

   return result.get_model();
}

onedal::table run_inference(const onedal::pca::model& model,
                            const onedal::table& new_data) {

   // Reuse the component count stored in the trained model
   const auto pca_desc = onedal::pca::desc<float>{}
      .set_component_count(model.get_component_count());

   const auto result = onedal::infer(pca_desc, model, new_data);

   print_table("transformed data", result.get_transformed_data());

   return result.get_transformed_data();
}

API¶

Methods¶

namespace method {
   struct cov {};
   struct svd {};
   using by_default = cov;
} // namespace method

struct cov¶: Tag-type that denotes Covariance-based method.

struct svd¶: Tag-type that denotes SVD-based method.

using by_default = cov ¶: Alias tag-type for the Covariance-based method.
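The following standalone sketch shows how empty tag types like these can select a code path at compile time. It is a mock written for illustration and requires C++17. Everything inside the hypothetical `sketch` namespace, including the `method_name` function, is an assumption and does not describe how a oneDAL implementation dispatches internally.

```cpp
#include <string>
#include <type_traits>

namespace sketch {

namespace method {
struct cov {};
struct svd {};
using by_default = cov;
} // namespace method

// Hypothetical stand-in for a descriptor: the method is carried only in the type.
template <typename Float = float, typename Method = method::by_default>
struct desc {
    static_assert(std::is_same_v<Float, float> || std::is_same_v<Float, double>,
                  "Float must be float or double");
};

// Picks a branch at compile time based on the Method tag.
template <typename Float, typename Method>
std::string method_name(const desc<Float, Method>&) {
    if constexpr (std::is_same_v<Method, method::cov>)
        return "cov";
    else
        return "svd";
}

} // namespace sketch
```

Because `by_default` is an alias of `cov`, `sketch::desc<>` and `sketch::desc<float, sketch::method::cov>` are the same type, so both take the `cov` branch.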

Descriptor¶

template <typename Float = float,
          typename Method = method::by_default>
class desc {
public:
   desc();

   std::int64_t get_component_count() const;
   bool get_deterministic() const;

   desc& set_component_count(std::int64_t);
   desc& set_deterministic(bool);
};

template<typename Float = float, typename Method = method::by_default> class desc¶

Template Parameters

Float – The floating-point type that the algorithm uses for intermediate computations. Can be float or double.
Method – Tag-type that specifies an implementation of the PCA algorithm. Can be method::cov, method::svd or method::by_default.

desc()¶: Creates a new instance of the descriptor with the default attribute values.

std::int64_t component_count = 0¶

The number of principal components \(r\). If it is zero, the algorithm computes the eigenvectors for all features, \(r = p\).

Getter & Setter: std::int64_t get_component_count() const

desc& set_component_count(std::int64_t)
Invariants: component_count >= 0

bool deterministic = true¶

Specifies whether the algorithm applies the Sign-flip technique or uses a deterministic eigenvalue solver. If it is true, the directions of the eigenvectors must be deterministic.

Getter & Setter: bool get_deterministic() const

desc& set_deterministic(bool)

Model¶

class model {
public:
   model();

   const table& get_eigenvectors() const;
   std::int64_t get_component_count() const;
};

class model¶

model()¶: Creates a model with the default attribute values.

table eigenvectors = table()¶

\(r \times p\) table with the eigenvectors. Each row contains one eigenvector.

Getter: const table& get_eigenvectors() const

std::int64_t component_count = 0¶

The number of components \(r\) in the trained model.

Getter: std::int64_t get_component_count() const
Invariants: component_count == eigenvectors.row_count

Training `onedal::train(...)`¶

Input¶

class train_input {
public:
   train_input();
   train_input(const table& data);

   const table& get_data() const;

   train_input& set_data(const table&);
};

class train_input¶

train_input()¶: Creates an input for the training operation with the default attribute values.

train_input(const table &data)¶: Creates an input for the training operation with the given data.

table data = table()¶

\(n \times p\) table with the training data, where each row stores one feature vector.

Getter & Setter: const table& get_data() const

train_input& set_data(const table&)

Result¶

class train_result {
public:
   train_result();

   const model& get_model() const;
   const table& get_means() const;
   const table& get_variances() const;
   const table& get_eigenvalues() const;
};

class train_result¶

train_result()¶: Creates a result of the training operation with the default attribute values.

pca::model model = pca::model()¶

The trained PCA model.

Getter: const model& get_model() const

table means = table()¶

\(1 \times r\) table that contains the mean values for the first \(r\) features.

Getter: const table& get_means() const

table variances = table()¶

\(1 \times r\) table that contains the variances for the first \(r\) features.

Getter: const table& get_variances() const
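As an illustration of what the `means` and `variances` tables hold, the sketch below computes both for the first \(r\) columns of a row-major data set. The type `column_stats` and the function `first_columns_stats` are hypothetical, and the \(n - 1\) denominator for the variance is an assumption, since the text above does not state which estimator is used.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for two 1 x r result rows.
struct column_stats {
    std::vector<double> means;
    std::vector<double> variances;
};

// Mean and sample variance of the first r columns of an n x p data set.
// Assumption: the variance uses the unbiased n - 1 denominator.
column_stats first_columns_stats(const std::vector<std::vector<double>>& data,
                                 std::size_t r) {
    const std::size_t n = data.size();
    assert(n > 1);

    column_stats s{std::vector<double>(r, 0.0), std::vector<double>(r, 0.0)};

    // First pass: column means.
    for (const auto& row : data) {
        assert(row.size() >= r);
        for (std::size_t j = 0; j < r; ++j) s.means[j] += row[j] / n;
    }

    // Second pass: squared deviations from the mean.
    for (const auto& row : data) {
        for (std::size_t j = 0; j < r; ++j) {
            const double d = row[j] - s.means[j];
            s.variances[j] += d * d / (n - 1);
        }
    }
    return s;
}
```

Every variance computed this way is a sum of squares divided by a positive number, so it is never negative.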

table eigenvalues = table()¶

\(1 \times r\) table that contains the eigenvalues for the first \(r\) features.

Getter: const table& get_eigenvalues() const

Operation semantics¶

template<typename Descriptor> pca::train_result train(const Descriptor &desc, const pca::train_input &input)¶

Template Parameters: Descriptor – PCA algorithm descriptor pca::desc.

Preconditions: input.data.is_empty == false

input.data.column_count >= desc.component_count
Postconditions: result.means.row_count == 1

result.means.column_count == desc.component_count

result.variances.row_count == 1

result.variances.column_count == desc.component_count

result.variances >= 0.0

result.eigenvalues.row_count == 1

result.eigenvalues.column_count == desc.component_count

result.model.eigenvectors.row_count == desc.component_count

result.model.eigenvectors.column_count == input.data.column_count

Inference `onedal::infer(...)`¶

Input¶

class infer_input {
public:
   infer_input();
   infer_input(const model& m);
   infer_input(const model& m, const table& data);

   const model& get_model() const;
   const table& get_data() const;

   infer_input& set_model(const model&);
   infer_input& set_data(const table&);
};

class infer_input¶

infer_input()¶: Creates an input for the inference operation with the default attribute values.

infer_input(const pca::model &model)¶: Creates an input for the inference operation with the given model; the other attributes get default values.

infer_input(const pca::model &model, const table &data)¶: Creates an input for the inference operation with the given model and data.

table data = table()¶

\(n \times p\) table with the data to be projected to the \(r\) principal components previously extracted from a training set.

Getter & Setter: const table& get_data() const

infer_input& set_data(const table&)

pca::model model = pca::model()¶

The trained PCA model (see pca::model).

Getter & Setter: const pca::model& get_model() const

infer_input& set_model(const pca::model&)

Result¶

class infer_result {
public:
   infer_result();

   const table& get_transformed_data() const;
};

class infer_result¶

infer_result()¶: Creates a result of the inference operation with the default attribute values.

table transformed_data = table()¶

\(n \times r\) table that contains data projected to the \(r\) principal components.

Getter: const table& get_transformed_data() const

Operation semantics¶

template<typename Descriptor> pca::infer_result infer(const Descriptor &desc, const pca::infer_input &input)¶

Template Parameters: Descriptor – PCA algorithm descriptor pca::desc.

Preconditions: input.data.is_empty == false

input.model.eigenvectors.row_count == desc.component_count

input.model.eigenvectors.column_count == input.data.column_count
Postconditions: result.transformed_data.row_count == input.data.row_count

result.transformed_data.column_count == desc.component_count
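The shape conditions above match a plain matrix product of the \(n \times p\) data with the transposed \(r \times p\) eigenvector table. The sketch below shows that product on nested vectors. The names `matrix` and `project` are hypothetical. The sketch does not center or scale the data before projecting; whether an implementation does so is not stated here, so treat that as an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major dense matrix: one inner vector per row (hypothetical helper type).
using matrix = std::vector<std::vector<double>>;

// Projects n x p data onto r eigenvectors stored as the rows of an r x p matrix.
// Returns an n x r matrix whose entry (i, k) is the dot product of data row i
// with eigenvector k.
matrix project(const matrix& data, const matrix& eigenvectors) {
    matrix out(data.size(), std::vector<double>(eigenvectors.size(), 0.0));
    for (std::size_t i = 0; i < data.size(); ++i) {
        for (std::size_t k = 0; k < eigenvectors.size(); ++k) {
            // Mirrors the precondition on matching column counts.
            assert(eigenvectors[k].size() == data[i].size());
            for (std::size_t j = 0; j < data[i].size(); ++j)
                out[i][k] += data[i][j] * eigenvectors[k][j];
        }
    }
    return out;
}
```

For example, projecting two 3-dimensional rows onto the unit vectors \((1, 0, 0)\) and \((0, 0, 1)\) keeps the first and third coordinate of each row and yields a \(2 \times 2\) result.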
