Jul 12, 2023 · 2 min read

Principal Component Analysis: A Technique for Exploring Higher-Dimensional Data

Updated: Jul 21, 2023

Real-world data rarely comes in one, two, or three dimensions. Principal Component Analysis (PCA), a method for reducing the number of dimensions in a quantitative dataset while preserving as much information as possible, was invented by Karl Pearson. A biostatistician often credited with founding mathematical statistics, Pearson also established the first university statistics department at University College London in 1911. Two additional statistical techniques grew out of PCA: Multiple Correspondence Analysis (MCA), which performs the same kind of dimensionality reduction on qualitative data, and Factor Analysis of Mixed Data (FAMD), for datasets with both numerical and qualitative dimensions.

PCA typically starts from a large numerical dataset whose many dimensions, or features, represent the raw data of an analysis. In the life sciences, such datasets can come from fields such as genomics, transcriptomics, proteomics, or metabolomics [2].

PCA is a sequence of linear algebra operations applied to a matrix of the original data. The matrix is first z-score normalized (each variable is centered on its mean and scaled by its standard deviation) so that all variables are treated equally; the covariance matrix is then computed, and its eigenvectors and eigenvalues are calculated. Each eigenvector defines the direction of a new Principal Component, and its eigenvalue gives the amount of variance that component captures, so together they determine how much of the data's variance can be preserved and with how many components [3]. It is common to set a threshold for the cumulative variance the retained components should capture, typically above 70% [3]. The retained components can be visualized in two- or three-dimensional plots, which can reveal patterns and groupings in the data, which features do or do not contribute to the variance, and how the variables correlate.
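The steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the method from the attached walk-through: the function name `pca_from_covariance`, the synthetic data, and the 0.70 default threshold are assumptions made for the example, and it assumes no feature is constant (a constant feature would have zero standard deviation).

```python
import numpy as np


def pca_from_covariance(X, variance_threshold=0.70):
    """Hypothetical helper: PCA via eigendecomposition of the covariance matrix."""
    X = np.asarray(X, dtype=float)

    # 1. Z-score normalize each feature (column): subtract the mean, divide by the std.
    #    Assumes no column is constant, otherwise the std would be zero.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # 2. Covariance matrix of the normalized data (features x features).
    C = np.cov(Z, rowvar=False)

    # 3. Eigenvalues and eigenvectors; eigh is meant for symmetric matrices
    #    and returns eigenvalues in ascending order, so we re-sort descending.
    eigenvalues, eigenvectors = np.linalg.eigh(C)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]

    # 4. Share of total variance captured by each component.
    explained_ratio = eigenvalues / eigenvalues.sum()
    cumulative = np.cumsum(explained_ratio)

    # 5. Smallest number of components whose cumulative variance reaches the threshold.
    k = int(np.searchsorted(cumulative, variance_threshold)) + 1
    k = min(k, len(eigenvalues))

    # 6. Project the normalized data onto the first k principal components.
    scores = Z @ eigenvectors[:, :k]
    return scores, explained_ratio, k


# Synthetic example data (an assumption): 200 samples, 6 correlated features.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 6))

scores, explained_ratio, k = pca_from_covariance(X)
print("components kept:", k)
print("explained variance ratio:", explained_ratio.round(3))
```

The first two columns of `scores` are what would typically be drawn in a two-dimensional scatter plot to look for groupings.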

PCA can be run in several ways; common options include commercial tools and Python scripts using libraries such as scikit-learn. Data scientists apply PCA in the life sciences and in other fields such as finance and physics, where it helps statisticians gain mathematically meaningful insights into their data, support decisions, and highlight key data elements.
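As a rough illustration of the scripted route, the sketch below uses scikit-learn's `StandardScaler` and `PCA` in a pipeline. The synthetic data and the 0.70 variance target are assumptions chosen for the example; passing a fraction as `n_components` asks scikit-learn to keep just enough components to explain that share of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic example data (an assumption): 200 samples, 6 correlated features.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 6))

# Z-score normalize, then keep enough components to explain at least 70% of the variance.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.70, svd_solver="full"),
)
components = pipeline.fit_transform(X)

# make_pipeline names each step after its lowercased class name.
pca = pipeline.named_steps["pca"]
print("components kept:", pca.n_components_)
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
```

Keeping the scaler and the PCA step in one pipeline means the same normalization is reapplied automatically when new samples are transformed.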

To see a mathematical walk-through, read the attached LaTeX PDF, which includes formulas and an example.

References.

[1]. Image by Freepik

[2]. Ghosh, T., Zhang, W., Ghosh, D., Kechris, K., Predictive Modeling for Metabolomics Data. Methods Mol Biol. (2020): 313 - 336.

[3]. Jolliffe, I. T., and Cadima, J., Principal component analysis: a review and recent developments. Philos Trans A Math Phys Eng Sci. (2016) 374.
