What Should We Learn? Rethinking PCA for Modern Data Sets

You are here

Inside Signal Processing Newsletter Home Page

Top Reasons to Join SPS Today!

1. IEEE Signal Processing Magazine
2. Signal Processing Digital Library*
3. Inside Signal Processing Newsletter
4. SPS Resource Center
5. Career advancement & recognition
6. Discounts on conferences and publications
7. Professional networking
8. Communities for students, young professionals, and women
9. Volunteer opportunities
10. Coming soon! PDH/CEU credits
Click here to learn more.

News and Resources for Members of the IEEE Signal Processing Society

What Should We Learn? Rethinking PCA for Modern Data Sets

By: 
Yang Li

In today’s big and messy data age, there is a lot of data generated everywhere around us. Examples include texts, tweets, network traffic, changing Facebook connections, or video surveillance feeds coming in from one or multiple cameras. Dimension reduction and noise/outlier removal are usually important preprocessing steps before any high-dimensional (big) data set can be used for inference. A common way to do this is via solving the principal component analysis (PCA) problemor its robust extensions. The basic PCA problem has been studied for over a century since the early work by Pearson in 1901 and Hotelling in 1933. The aim of PCA is to reduce the dimensionality of multivariate data while preserving as much of the relevant information as possible. It is often the first step in various types of exploratory data analysis, predictive modeling, and classification and clustering tasks, and finds applications in biomedical imaging, computer vision, process fault detection, recommendation systems’ design, and many more domains.

“PCA” refers to the following problem. Given a data set (a set of data vectors, or, more generally a set of data “tensors”) and a dimension k, find the k-dimensional subspace that “best” approximates the given data set. There are various notions of “best”; the traditional one used for classical PCA is either minimum Frobenius norm or minimum spectral norm of the approximation error of the data matrix. PCA, without constraints, and for clean data, is a solved problem. By the Eckart–Young–Mirsky theorem, computing the top k left singular vectors of the data matrix returns the PCA solution. On the other hand, robust PCA, which refers to the problem of PCA in the presence of outliers, is a much harder problem and one for which provably correct solutions have started appearing only recently. The same is true for dynamic PCA (subspace tracking or streaming PCA), dynamic or recursive robust PCA (robust subspace tracking), PCA and subspace tracking with missing data, and the related low-rank matrix completion problem, as well as for sparse PCA. Sparse PCA refers to the PCA problem when the principal components are assumed to be sparse. In fact, even the classical PCA problem with speed or memory constraints is not well understood.

The above issues have become particularly important for modern data sets because 1) the data matrix is often so large that it cannot be directly stored in the computer’s memory (need for streaming solutions); 2) a lot of data consist of missing entries and/or outliercorrupted entries (need for matrix completion and robust versions of PCA and subspace recovery); 3) a lot of data arrive sequentially, the data subspace itself may change over time, the entire data set cannot be stored but short batches can be, and decisions are often needed in real time or near real time (need for dynamic PCA and robust PCA and subspace tracking); 4) data are often distributed and stored over multiple locations and one needs to perform PCA without communicating all the data to a central location (need for distributed PCA); and 5) many types of data are better represented as a tensor data set rather than a vector data set or matrix (need for tensor PCA).

PCA and its many extensions are used everywhere as summarized above. Moreover, PCA is also a key intermediate or initialization step in solving many convex and nonconvex optimization problems. All areas of electrical engineering and computer science (EECS) have benefited hugely from, and have contributed significantly to, solutions of PCA and extensions. Most people in EECS know certain aspects of PCA, not all, and the special issue Rethinking PCA for Modern Data Sets: Theory, Algorithms, and Applications published by Proceedings of the IEEE in August 2018, helps bridge the gap in knowledge for EECS researchers from various backgrounds.

SPS Social Media

IEEE SPS Educational Resources

IEEE SPS Resource Center

IEEE SPS YouTube Channel