PPCA from Population Mean and Covariance

Observe that in general, we cannot uniquely recover the model parameters from $\mu_{x}$ and $\Sigma_{x}$ by solving the equations in (2.49). For instance, notice that $\mu$ and $\mu_{y}$ cannot be uniquely recovered from $\mu_{x}$. Similarly to what we did in the case of PCA, this issue can be easily resolved by assuming that $\mu_{y}=\mathbf{0}$. This leads to the following estimate of $\boldsymbol{\mu}$ :
$$\hat{\mu}=\mu_{x},$$
which is the same estimate as that of PCA (see Exercise 2.6). Another ambiguity that cannot be resolved in a straightforward manner is that $\Sigma_{y}$ and $\Sigma_{\varepsilon}$ cannot be uniquely recovered from $\Sigma_{x}$. For instance, $\Sigma_{y}=0$ and $\Sigma_{\varepsilon}=\Sigma_{x}$ is a valid solution. However, this solution is not meaningful, because it assigns all the information in $\Sigma_{x}$ to the error, rather than to the low-dimensional representation.
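As a small worked illustration of this ambiguity, consider the scalar case $D=d=1$ with $B=1$ (values chosen here only for illustration), and assume the covariance relation $\Sigma_{x}=B \Sigma_{y} B^{\top}+\Sigma_{\varepsilon}$ implied by the model. Then for every $\alpha \in[0,1]$,
$$\Sigma_{y}=\alpha \Sigma_{x}, \quad \Sigma_{\varepsilon}=(1-\alpha) \Sigma_{x} \quad \Longrightarrow \quad B \Sigma_{y} B^{\top}+\Sigma_{\varepsilon}=\Sigma_{x},$$
so a whole family of decompositions reproduces the same $\Sigma_{x}$. The choice $\alpha=0$ is the degenerate solution that assigns everything to the error.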

To resolve this ambiguity, we need to make some additional assumptions. Intuitively, we would like $B \Sigma_{y} B^{\top}$ to capture as much information about $\Sigma_{x}$ as possible. Thus it makes sense for $\Sigma_{y}$ to be of full rank and for $\Sigma_{\varepsilon}$ to be as close to zero as possible. More specifically, the assumptions made in PPCA are the following:

1. The low-dimensional representation has unit covariance, i.e., $\Sigma_{y}=I_{d} \in \mathbb{R}^{d \times d}$.
2. The noise covariance matrix $\Sigma_{\varepsilon} \in \mathbb{R}^{D \times D}$ is isotropic, i.e., $\Sigma_{\varepsilon}=\sigma^{2} I_{D}$.

Under these assumptions, the covariance of the observations must be of the form
$$\Sigma_{x}=B B^{\top}+\sigma^{2} I_{D} .$$
It follows from this relationship that the eigenvalues of $\Sigma_{x}$ must be equal to the eigenvalues of $B B^{\top}$ plus $\sigma^{2}$. Since $B B^{\top}$ has rank $d$ and is positive semidefinite, $D-d$ eigenvalues of $B B^{\top}$ must be equal to zero. Therefore, the smallest $D-d$ eigenvalues of $\Sigma_{x}$ must be equal to each other and equal to $\sigma^{2}$. In addition, the off-diagonal entries of $\Sigma_{x}$ are equal to the off-diagonal entries of $B B^{\top}$. As a consequence, even though both PPCA and PCA try to capture as much information as possible from $\Sigma_{x}$ into $\Sigma_{y}$, the information they attempt to capture is not the same. On the one hand, PPCA tries to find a matrix $B$ such that the covariances are preserved, i.e., the off-diagonal entries of $\Sigma_{x}$. On the other hand, PCA tries to preserve the variances, i.e., the diagonal entries of $\Sigma_{x}$.
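The eigenvalue structure described above can be checked on a toy example. The sketch below assumes $D=3$ and $d=1$, so that $B$ is a single column $b$; the numerical values of `b` and `sigma2` are hypothetical and chosen only so that the arithmetic is exact. It verifies that $b$ is an eigenvector of $\Sigma_{x}$ with eigenvalue $\|b\|^{2}+\sigma^{2}$, and that two independent vectors orthogonal to $b$ are eigenvectors with eigenvalue $\sigma^{2}$, i.e., the smallest $D-d=2$ eigenvalues coincide.

```python
import math

def matvec(M, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def is_eigenpair(M, v, lam, tol=1e-12):
    """Check M v == lam * v entrywise, up to a small tolerance."""
    return all(math.isclose(mv, lam * x, abs_tol=tol)
               for mv, x in zip(matvec(M, v), v))

# Hypothetical toy model: D = 3, d = 1, so B is the single column b.
b = [2.0, 1.0, 2.0]      # squared norm is 9
sigma2 = 0.5             # noise variance sigma^2
D = len(b)

# Sigma_x = b b^T + sigma^2 I_D
Sigma_x = [[b[i] * b[j] + (sigma2 if i == j else 0.0) for j in range(D)]
           for i in range(D)]

# Largest eigenvalue: ||b||^2 + sigma^2, with eigenvector b.
top_eigenvalue = sum(x * x for x in b) + sigma2

# Two linearly independent vectors orthogonal to b span the
# (D - d)-dimensional eigenspace associated with sigma^2.
v1 = [1.0, -2.0, 0.0]
v2 = [1.0, 0.0, -1.0]

print(is_eigenpair(Sigma_x, b, top_eigenvalue))
print(is_eigenpair(Sigma_x, v1, sigma2), is_eigenpair(Sigma_x, v2, sigma2))
```

The same construction also shows the statement about off-diagonal entries: `Sigma_x[i][j]` equals `b[i] * b[j]` whenever `i != j`, since the noise term only touches the diagonal.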

PPCA by Maximum Likelihood

In practice, we may not know the population mean and covariance, $\mu_{x}$ and $\Sigma_{x}$. Instead, we are given $N$ i.i.d. samples, $\left\{x_{j}\right\}_{j=1}^{N}$, from which we wish to estimate the PPCA model parameters $\mu, B$, and $\sigma$. In this section, we show that the ML estimates (see Appendix B.1.4) of these parameters can be computed in closed form from the ML estimates of the mean and covariance.

To that end, assume that $\boldsymbol{y}$ and $\varepsilon$ are zero-mean Gaussian random variables with covariances $I_{d}$ and $\sigma^{2} I_{D}$, respectively, i.e., $\boldsymbol{y} \sim \mathcal{N}\left(\mathbf{0}, I_{d}\right)$ and $\varepsilon \sim \mathcal{N}\left(\mathbf{0}, \sigma^{2} I_{D}\right)$. Then $\boldsymbol{x} \sim \mathcal{N}\left(\mu_{x}, \Sigma_{x}\right)$, where $\mu_{x}=\mu$ and $\Sigma_{x}=B B^{\top}+\sigma^{2} I_{D}$. Therefore, the log-likelihood of the samples $\left\{x_{j}\right\}_{j=1}^{N}$ is given by

$$\begin{aligned} \mathscr{L} &=\sum_{j=1}^{N} \log \left(\frac{1}{(2 \pi)^{D / 2} \operatorname{det}\left(\Sigma_{x}\right)^{1 / 2}} \exp \left(-\frac{\left(x_{j}-\mu\right)^{\top} \Sigma_{x}^{-1}\left(x_{j}-\mu\right)}{2}\right)\right) \\ &=-\frac{N D}{2} \log (2 \pi)-\frac{N}{2} \log \operatorname{det}\left(\Sigma_{x}\right)-\frac{1}{2} \sum_{j=1}^{N}\left(x_{j}-\mu\right)^{\top} \Sigma_{x}^{-1}\left(x_{j}-\mu\right) . \end{aligned}$$
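The Gaussian assumptions above can be simulated to see the covariance structure emerge from data. The following is a minimal sketch that assumes the generative form $x=\mu+B y+\varepsilon$ (the form consistent with $\Sigma_{x}=B B^{\top}+\sigma^{2} I_{D}$); the function names `sample_ppca` and `sample_mean_cov` and all parameter values are hypothetical. It draws samples with the standard library only and compares the sample covariance with the model covariance.

```python
import random

def sample_ppca(mu, B, sigma, N, seed=0):
    """Draw N samples x = mu + B y + eps, y ~ N(0, I_d), eps ~ N(0, sigma^2 I_D)."""
    rng = random.Random(seed)
    D, d = len(B), len(B[0])
    samples = []
    for _ in range(N):
        y = [rng.gauss(0.0, 1.0) for _ in range(d)]
        x = [mu[i] + sum(B[i][k] * y[k] for k in range(d)) + rng.gauss(0.0, sigma)
             for i in range(D)]
        samples.append(x)
    return samples

def sample_mean_cov(X):
    """Sample mean and sample covariance normalized by 1/N."""
    N, D = len(X), len(X[0])
    mean = [sum(x[i] for x in X) / N for i in range(D)]
    cov = [[sum((x[i] - mean[i]) * (x[j] - mean[j]) for x in X) / N
            for j in range(D)] for i in range(D)]
    return mean, cov

# Hypothetical toy parameters with D = 2 and d = 1.
mu = [1.0, -1.0]
B = [[1.0], [2.0]]
sigma = 0.7

X = sample_ppca(mu, B, sigma, N=20000)
mean_hat, cov_hat = sample_mean_cov(X)

# Model covariance B B^T + sigma^2 I_D for comparison.
cov_model = [[sum(B[i][k] * B[j][k] for k in range(len(B[0])))
              + (sigma ** 2 if i == j else 0.0)
              for j in range(2)] for i in range(2)]

print(mean_hat)
print(cov_hat)
print(cov_model)
```

With a large number of samples, the printed sample covariance should be close to `cov_model`, although the exact values depend on the random seed.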
We obtain the ML estimate for $\mu$ by setting the derivative of $\mathscr{L}$ with respect to $\mu$ to zero:
$$\frac{\partial \mathscr{L}}{\partial \mu}=\sum_{j=1}^{N} \Sigma_{x}^{-1}\left(x_{j}-\mu\right)=\mathbf{0} \Longrightarrow \hat{\mu}=\hat{\mu}_{N} \doteq \frac{1}{N} \sum_{j=1}^{N} x_{j} .$$
After replacing $\mu$ by $\hat{\mu}$ in the log-likelihood, we obtain
$$\mathscr{L}=-\frac{N D}{2} \log (2 \pi)-\frac{N}{2} \log \operatorname{det}\left(\Sigma_{x}\right)-\frac{N}{2} \operatorname{trace}\left(\Sigma_{x}^{-1} \hat{\Sigma}_{N}\right),$$
where
$$\hat{\Sigma}_{N} \doteq \frac{1}{N} \sum_{j=1}^{N}\left(x_{j}-\hat{\mu}_{N}\right)\left(x_{j}-\hat{\mu}_{N}\right)^{\top} .$$
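The step from the sum of quadratic forms to the trace uses the identity $r^{\top} A r=\operatorname{trace}\left(A r r^{\top}\right)$, applied to each residual $r=x_{j}-\hat{\mu}_{N}$ and summed over $j$. The sketch below checks this numerically for $D=2$; the data `X`, the covariance `Sigma_x`, and the helper names are hypothetical, and the $2 \times 2$ inverse is written out by hand to keep the example dependency-free. It also checks that moving $\mu$ away from the sample mean lowers the log-likelihood.

```python
import math

def inv_det_2x2(S):
    """Return the inverse and determinant of a 2x2 matrix [[a, b], [c, d]]."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]], det

def sample_mean(X):
    """Sample mean of a list of 2-dimensional points."""
    N = len(X)
    return [sum(x[i] for x in X) / N for i in range(2)]

def loglik_sum(X, mu, Sigma):
    """Log-likelihood as a sum of Gaussian log-densities (D = 2)."""
    P, det = inv_det_2x2(Sigma)
    N, D = len(X), 2
    quad = 0.0
    for x in X:
        r = [x[0] - mu[0], x[1] - mu[1]]
        quad += sum(r[i] * P[i][j] * r[j] for i in range(2) for j in range(2))
    return -N * D / 2 * math.log(2 * math.pi) - N / 2 * math.log(det) - quad / 2

def loglik_trace(X, Sigma):
    """Log-likelihood at mu = sample mean, using trace(Sigma^{-1} Sigma_hat_N)."""
    P, det = inv_det_2x2(Sigma)
    N, D = len(X), 2
    m = sample_mean(X)
    S_hat = [[sum((x[i] - m[i]) * (x[j] - m[j]) for x in X) / N
              for j in range(2)] for i in range(2)]
    trace = sum(P[i][j] * S_hat[j][i] for i in range(2) for j in range(2))
    return -N * D / 2 * math.log(2 * math.pi) - N / 2 * math.log(det) - N / 2 * trace

# Hypothetical toy data; Sigma_x = B B^T + sigma^2 I with B = [1, 2]^T, sigma^2 = 0.5.
X = [[0.0, 1.0], [2.0, 3.0], [1.0, -1.0], [3.0, 4.0], [-1.0, 0.5]]
Sigma_x = [[1.5, 2.0], [2.0, 4.5]]

m = sample_mean(X)
print(loglik_sum(X, m, Sigma_x), loglik_trace(X, Sigma_x))
```

The two printed values should agree up to floating-point rounding, which is the content of the substitution step.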


