Principal Components Analysis

Principal Component Analysis (PCA) is a statistical technique commonly used for dimensionality reduction and data compression. It is an unsupervised technique, and thus requires no labeled training data. It is particularly useful when dealing with high-dimensional data, such as multispectral and hyperspectral data. These types of data are often encountered in fields like remote sensing, image processing, and spectroscopy, where each pixel or data point contains information from multiple bands or wavelengths.

1. Dimensionality Reduction: Multispectral and hyperspectral data typically contain a large number of bands or wavelengths, leading to high-dimensional data. PCA works by transforming the original data into a new set of variables called principal components (PCs). These PCs are linear combinations of the original bands and are sorted in decreasing order of the variance they capture. The first PC captures the most variance in the data, the second PC captures the second most, and so on. By selecting a subset of these PCs, you can reduce the dimensionality of the data while retaining most of its important information.
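
As a compact reference, the standard covariance-based formulation of PCA can be written as follows. The notation is introduced here for illustration and does not come from HeavyML: $x_i \in \mathbb{R}^d$ is the vector of $d$ band values for sample $i$, $n$ is the number of samples, and $v_k$ is the $k$-th principal axis. This is the textbook construction; any additional scaling a particular implementation applies is not shown.

```latex
\begin{aligned}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
C = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^{\top} \\
C\,v_k &= \lambda_k v_k, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0 \\
z_{ik} &= v_k^{\top}(x_i - \bar{x}), \qquad \operatorname{Var}(z_{\cdot k}) = \lambda_k \\
\text{variance retained by the first } m \text{ PCs} &= \frac{\sum_{k=1}^{m}\lambda_k}{\sum_{k=1}^{d}\lambda_k}
\end{aligned}
```

Here $z_{ik}$ is the value of sample $i$ along the $k$-th PC, so keeping only the first $m$ columns of $z$ is the dimensionality reduction described above.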

2. Data Compression: PCA can also be used for data compression. By selecting a smaller number of principal components to represent the data, you reduce the storage or memory the data requires. This is especially useful when dealing with large datasets where storage and processing resources are limited.

3. Data Visualization: PCA can help visualize high-dimensional data in lower dimensions. While original data with numerous bands can be difficult to visualize, projecting the data onto the first few principal components allows you to create scatter plots or graphs that provide insights into data clusters, patterns, and anomalies.

4. Data Classification: When dealing with multispectral and hyperspectral data, PCA can aid in data classification tasks. High-dimensional data can lead to overfitting and computational challenges in classification algorithms. By applying PCA, you can reduce the dimensionality of the data while preserving the most relevant information. This often results in improved classification performance, reduced computational requirements, and better generalization.

Here's how PCA can be used for classification with multispectral and hyperspectral data:

Preprocessing: The first step involves preprocessing the data to remove noise, correct artifacts, and normalize the values across different bands. In particular, the PCA algorithm used in HeavyML does not support NULL data values, so if your dataset contains NULLs you will need to either delete those records or impute values to fill them.
PCA Transformation: The preprocessed data is then subjected to PCA transformation to obtain the principal components. The number of components chosen depends on how much variance you want to retain and the trade-off between dimensionality reduction and information preservation. We recommend in most cases that you start with 3 components, since these can be visualized using a two-dimensional scatter plot with the third component used as a color measure. However, if your downstream purpose is analytical, you should adjust the number of components to retain the variance required.
Training and Classification: The reduced-dimensional data, represented by the selected principal components, is used as input for classification algorithms. These algorithms (e.g., Support Vector Machines, Random Forests, Neural Networks) are trained on labeled data to learn the relationships between the principal components and the classes. For example, with hyperspectral data containing hundreds of bands, it is common practice to use PCA first, and then to apply techniques such as random forest regression to the resulting principal components. This can increase the likelihood of a model converging, and generally reduces run times and memory use substantially.
Testing and Prediction: Once the classifier is trained, it can be used to classify new, unseen data points based on their reduced-dimensional representations obtained from PCA. The PCA projection itself is conceptually similar to categorical classification, but it transforms new data into a continuous value along each PCA axis rather than into a class label.
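
The preprocessing and component-count decisions in the steps above can be sketched in plain Python. This is a minimal illustration that runs outside HeavyML, not HeavyML's implementation; in practice you would handle NULLs in SQL before creating the model. The helper names (`drop_null_rows`, `impute_column_means`, `components_for_variance`) are hypothetical, and the sample pixels and per-component variances are made-up values.

```python
def drop_null_rows(rows):
    """Keep only rows in which every band has a value."""
    return [row for row in rows if all(v is not None for v in row)]


def impute_column_means(rows):
    """Replace each None with the mean of the non-null values in its band."""
    n_bands = len(rows[0])
    means = []
    for j in range(n_bands):
        present = [row[j] for row in rows if row[j] is not None]
        if not present:
            raise ValueError(f"band {j} has no non-null values to average")
        means.append(sum(present) / len(present))
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]


def components_for_variance(variances, target):
    """Smallest number of leading components whose variance share reaches target."""
    total = sum(variances)
    if total <= 0:
        raise ValueError("total variance must be positive")
    running = 0.0
    for count, variance in enumerate(sorted(variances, reverse=True), start=1):
        running += variance
        if running / total >= target:
            return count
    return len(variances)


# Made-up two-band pixels; None stands in for a NULL band value.
pixels = [[1.0, 2.0], [None, 3.0], [3.0, 6.0]]
print(drop_null_rows(pixels))       # option 1: delete incomplete records
print(impute_column_means(pixels))  # option 2: fill NULLs with band means

# Made-up variances of four principal components.
component_variances = [6.0, 3.0, 0.8, 0.2]
print(components_for_variance(component_variances, 0.95))
```

Mean imputation is only one of several possible fill strategies. The 95% target is likewise an arbitrary example threshold, so choose one that suits your downstream analysis.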

By using PCA for dimensionality reduction and subsequent classification, you can effectively handle the challenges posed by high-dimensional multispectral and hyperspectral data, leading to improved classification accuracy and more efficient computation.

Method

There are two steps to using PCA within HeavyML. First you create a model, and then you run predictions using that model.

For example, imagine that we want to visualize hyperspectral data from the ENMAP satellite which contains up to 244 bands. For brevity, let's build a PCA model on the first four bands:

CREATE MODEL enmap_hyperspectral_pca OF TYPE PCA AS
SELECT band_1_1, band_1_2, band_1_3, band_1_4
FROM enmap_hyperspectral;

The syntax is identical to other HeavyML model creation steps, except that the type is given as PCA. Bands can be specified in any order, as long as the order is consistent between the model creation and prediction steps. Once the command above succeeds, you can verify that the model was created with the following command:

SHOW MODEL DETAILS 'enmap_hyperspectral_pca';

Running predictions requires three types of parameters: the model name, the bands, and the index of the desired PCA component (from 1 to the number of bands). For example, for the first four bands:

SELECT raster_lon, raster_lat,
PCA_PROJECT('enmap_hyperspectral_pca', band_1_1, band_1_2, band_1_3, band_1_4, 1),
PCA_PROJECT('enmap_hyperspectral_pca', band_1_1, band_1_2, band_1_3, band_1_4, 2),
PCA_PROJECT('enmap_hyperspectral_pca', band_1_1, band_1_2, band_1_3, band_1_4, 3)
FROM enmap_hyperspectral;

The query above computes the PCA values on the fly. A similar fragment can also supply the input data for training any other HeavyML model. You can also persist the PCA values by projecting them within a CREATE TABLE AS SELECT statement or by using the UPDATE command.
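
To make the component index concrete, here is a small Python sketch of what projecting one row of band values onto a single principal component means in textbook PCA. It assumes a centered projection onto orthonormal axes and is not a description of how HeavyML's PCA_PROJECT is implemented. The `pca_project` helper, the per-band means, and the axes are all hypothetical values chosen for illustration.

```python
def pca_project(row, means, components, component):
    """Project one row of band values onto a single principal axis.

    `component` is 1-based, mirroring the 1..num_bands convention above.
    The band order in `row` must match the order used for `means` and
    for each axis in `components`.
    """
    if not 1 <= component <= len(components):
        raise ValueError("component must be between 1 and the number of bands")
    axis = components[component - 1]
    return sum((value - mean) * weight
               for value, mean, weight in zip(row, means, axis))


# Hypothetical fitted model for two bands: band means and orthonormal axes.
means = [10.0, 20.0]
components = [[0.6, 0.8], [-0.8, 0.6]]

row = [13.0, 24.0]
scores = [pca_project(row, means, components, k) for k in (1, 2)]
print(scores)  # one continuous value per requested component
```

Each call returns a single continuous value, which is why the query above repeats the call once per component. Swapping the band order between fitting and projection would pair values with the wrong weights, which is the reason the order must stay consistent.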
