## Model Selection by Information-Theoretic Criteria

Let $X=\left[x_{1}, x_{2}, \ldots, x_{N}\right] \in \mathbb{R}^{D \times N}$ be the mean-subtracted data matrix. When the data points are noise-free, they lie exactly in a subspace of dimension $d$. Hence, we can estimate $d$ as the rank of $X$, i.e., $d=\operatorname{rank}(X)$. However, when the data are contaminated by noise, the matrix $X$ will be of full rank in general; hence we cannot use its rank to estimate $d$. Nonetheless, notice that the SVD of the noisy data matrix $X$ gives a solution to PCA not only for a particular dimension $d$ of the subspace, but also for all $d=1,2, \ldots, D$. This has an important side benefit: if the dimension of the subspace $S$ is not known or specified a priori, rather than optimizing for both $d$ and $S$ simultaneously, we can easily look at the entire spectrum of solutions for different values of $d$ to decide on the “best” estimate $\hat{d}$ for the dimension of the subspace $d$ given the data $X$.

One possible criterion is to choose $d$ as the dimension that minimizes the least-squares error between the given data $X$ and its projection $\widehat{X}^{d}=\left[\hat{\boldsymbol{x}}_{1}^{d}, \hat{\boldsymbol{x}}_{2}^{d}, \ldots, \hat{\boldsymbol{x}}_{N}^{d}\right]$ onto the subspace $S$ of dimension $d$. As shown in the proof of Theorem 2.3, the least-squares error is given by the sum of the squares of the remaining singular values of $X$, i.e., $$J(d) \doteq\left\|X-\widehat{X}^{d}\right\|_{F}^{2}=\sum_{j=1}^{N}\left\|\boldsymbol{x}_{j}-\hat{\boldsymbol{x}}_{j}^{d}\right\|^{2}=\sum_{i=d+1}^{D} \sigma_{i}^{2} .$$
However, this is not a good criterion, because $J(d)$ is a nonincreasing function of $d$. In fact, the minimum value $J(d)=0$ is attained at $d=\operatorname{rank}(X)$, which for noisy data is in general the full rank of $X$ rather than the dimension of the underlying subspace.
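The monotonicity of $J(d)$ can be checked numerically. The following is a minimal sketch, assuming NumPy; the function name `residual_errors` and the synthetic data (a 2-dimensional subspace of $\mathbb{R}^{5}$ with small Gaussian noise) are illustrative choices, not part of the original text.

```python
import numpy as np

def residual_errors(X):
    """Return J(d) = sum_{i>d} sigma_i^2 for d = 0, 1, ..., number of singular values."""
    sigma = np.linalg.svd(X, compute_uv=False)  # singular values in descending order
    sq = sigma ** 2
    # tail[d] = sum of sq[d:], so tail[0] is the total energy; append 0 for the full rank
    return np.append(np.cumsum(sq[::-1])[::-1], 0.0)

# Hypothetical synthetic data: points near a 2-dimensional subspace of R^5
rng = np.random.default_rng(0)
D, N, d_true = 5, 200, 2
basis = rng.standard_normal((D, d_true))
X = basis @ rng.standard_normal((d_true, N)) + 0.01 * rng.standard_normal((D, N))
X = X - X.mean(axis=1, keepdims=True)  # mean-subtracted data matrix

J = residual_errors(X)
print(J)  # nonincreasing in d, and exactly 0 at the last entry
```

Because the noise makes $X$ full rank, minimizing $J(d)$ alone always selects the largest possible $d$, even though the error already drops sharply at $d=2$ in this example.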

The problem of determining the optimal dimension $\hat{d}$ is in fact a “model selection” problem. As we discussed in the introduction of the book, the conventional wisdom is to strike a good balance between the complexity of the chosen model and the fidelity of the data to the model. The dimension $d$ of the subspace $S$ is a natural measure of model complexity, while the least-squares error $\left\|X-\widehat{X}^{d}\right\|_{F}^{2}=\sum_{i=d+1}^{D} \sigma_{i}^{2}$ or its leading term, $\sigma_{d+1}^{2}$, are natural measures of the data fidelity. Perhaps the simplest model selection criterion is to minimize the complexity subject to a bound on the fidelity. For example, we can choose $d$ as the smallest number such that the fidelity is less than a threshold $\tau>0$, i.e.,
$$\hat{d}=\min_{d}\left\{d: \sum_{i=d+1}^{D} \sigma_{i}^{2}<\tau\right\} \quad \text{or} \quad \hat{d}=\min_{d}\left\{d: \sigma_{d+1}^{2}<\tau\right\} .$$
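Both threshold rules can be sketched in a few lines. This is an illustration under stated assumptions: NumPy is assumed, the function name `select_dim_by_threshold` and the example singular values are hypothetical, and the convention $\sigma_{D+1}=0$ is adopted so that the minimum always exists for $\tau>0$.

```python
import numpy as np

def select_dim_by_threshold(sigma, tau, use_leading_term=False):
    """Smallest d whose fidelity measure falls below tau.

    sigma: singular values in descending order.
    use_leading_term=False uses sum_{i>d} sigma_i^2; True uses sigma_{d+1}^2.
    """
    sq = np.asarray(sigma, dtype=float) ** 2
    if use_leading_term:
        # fidelity[d] = sigma_{d+1}^2, with sigma_{D+1} taken to be 0 (assumption)
        fidelity = np.append(sq, 0.0)
    else:
        # fidelity[d] = sum of the squared singular values beyond the d-th
        fidelity = np.append(np.cumsum(sq[::-1])[::-1], 0.0)
    # The last entry is 0 < tau, so at least one index qualifies;
    # argmax on a boolean array returns the first True.
    return int(np.argmax(fidelity < tau))

sigma = [10.0, 5.0, 0.3, 0.2, 0.1]  # hypothetical singular values
print(select_dim_by_threshold(sigma, tau=1.0))                         # 2
print(select_dim_by_threshold(sigma, tau=0.1))                         # 3
print(select_dim_by_threshold(sigma, tau=0.1, use_leading_term=True))  # 2
```

The two rules agree when there is a clear gap in the spectrum relative to $\tau$, and can differ when $\tau$ is comparable to the sum of the trailing squared singular values, as in the last two calls.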

## Model Selection by Rank Minimization

In this section, we present an alternative view of model selection based on the rank minimization approach to PCA introduced in Section 2.1.3. In this approach, the PCA problem is posed as one of finding a rank-$d$ matrix $A$ that best approximates the mean-subtracted data matrix $X$, i.e.,
$$\min_{A}\|X-A\|_{F}^{2} \quad \text{s.t.} \quad \operatorname{rank}(A)=d . \tag{2.88}$$

Although this problem is nonconvex due to the rank constraint, as we showed in Section 2.1.3, its optimal solution can be computed in closed form as
$$A=U \mathcal{H}_{\sigma_{d+1}}(\Sigma) V^{\top},$$
where $X=U \Sigma V^{\top}$ is the SVD of $X$, $\sigma_{k}$ is the $k$th singular value of $X$, and $\mathcal{H}_{\varepsilon}(x)$ is the hard thresholding operator: $$\mathcal{H}_{\varepsilon}(x)= \begin{cases}x & |x|>\varepsilon \\ 0 & \text{else.}\end{cases}$$
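The closed-form solution can be sketched as follows, assuming NumPy. The function names `hard_threshold` and `rank_d_approximation` are illustrative, and the convention of thresholding at $0$ when $d$ equals the number of singular values is an assumption made here to handle that boundary case.

```python
import numpy as np

def hard_threshold(x, eps):
    """H_eps(x): keep entries with |x| > eps and set the rest to 0."""
    return np.where(np.abs(x) > eps, x, 0.0)

def rank_d_approximation(X, d):
    """Best rank-d approximation of X in Frobenius norm via a thresholded SVD."""
    U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
    # Threshold at sigma_{d+1}; sigma[d] is the (d+1)-th value with 0-based indexing.
    # If d equals the number of singular values, threshold at 0 (assumed convention).
    eps = sigma[d] if d < sigma.size else 0.0
    # Scaling the columns of U by the thresholded values equals U @ diag(.) @ Vt
    return (U * hard_threshold(sigma, eps)) @ Vt

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 6))  # hypothetical data matrix
A = rank_d_approximation(X, 2)
print(np.linalg.matrix_rank(A))
```

Note that if $\sigma_{d}=\sigma_{d+1}$, thresholding at $\sigma_{d+1}$ also removes $\sigma_{d}$ and the result has rank smaller than $d$; the sketch assumes a strict gap between the two.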
However, this closed-form solution requires $d$ to be known.
When $d$ is unknown, the problem of finding a low-rank approximation can be formulated as
$$\min_{A}\|X-A\|_{F}^{2}+\tau \operatorname{rank}(A), \tag{2.91}$$
where $\tau>0$ is a parameter. Since the optimal solution of (2.88) for a fixed rank $d=\operatorname{rank}(A)$ is $A=U \mathcal{H}_{\sigma_{d+1}}(\Sigma) V^{\top}$, the problem in (2.91) reduces to
$$\min_{d} \sum_{k>d} \sigma_{k}^{2}+\tau d .$$
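The reduced problem is a one-dimensional search over $d$, so it can be solved by direct enumeration. The sketch below assumes NumPy; the function name `select_dim_by_penalty` and the example singular values are hypothetical. As a sanity check, increasing $d-1$ to $d$ changes the cost by $\tau-\sigma_{d}^{2}$, so with singular values in descending order the minimizer should coincide with the number of singular values satisfying $\sigma_{k}^{2}>\tau$.

```python
import numpy as np

def select_dim_by_penalty(sigma, tau):
    """Minimize sum_{k>d} sigma_k^2 + tau * d over d = 0, ..., len(sigma)."""
    sq = np.asarray(sigma, dtype=float) ** 2
    # tail[d] = sum of squared singular values beyond the d-th
    tail = np.append(np.cumsum(sq[::-1])[::-1], 0.0)
    cost = tail + tau * np.arange(sq.size + 1)
    return int(np.argmin(cost)), cost

sigma = np.array([10.0, 5.0, 0.3, 0.2, 0.1])  # hypothetical singular values
tau = 1.0
d_hat, cost = select_dim_by_penalty(sigma, tau)
print(d_hat)                          # 2
print(int(np.sum(sigma ** 2 > tau)))  # 2, the count of sigma_k^2 above tau
```

Unlike the rank-constrained problem, $d$ is not an input here: the parameter $\tau$ sets the price of each additional dimension, and larger values of $\tau$ select smaller subspaces.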

