## Some General Theory for Linear Smoothers

Some key parts of the theory you are familiar with for linear regression models carry over more generally to linear smoothers. They are not quite so important any more, but they do have their uses, and they can serve as security objects during the transition to non-parametric regression.

Throughout this sub-section, we will temporarily assume that $Y=\mu(X)+\epsilon$, with the noise terms $\epsilon$ having constant variance $\sigma^2$ and being uncorrelated with each other at different observations. Also, we will define the smoothing, influence or hat matrix $\hat{\mathbf{w}}$ by $\hat{w}_{i j}=\hat{w}\left(x_i, x_j\right)$. This records how much influence observation $y_j$ had on the smoother’s fitted value for $\mu\left(x_i\right)$, which (remember) is $\widehat{\mu}\left(x_i\right)$ or $\widehat{\mu}_i$ for short, hence the name “hat matrix” for $\hat{\mathbf{w}}$.

It is easy to get the standard error of any predicted mean value $\widehat{\mu}(x)$, by first working out its variance:
$$
\begin{aligned}
\mathbb{V}[\widehat{\mu}(x)] &=\mathbb{V}\left[\sum_{j=1}^n w\left(x_j, x\right) Y_j\right] \\
&=\sum_{j=1}^n \mathbb{V}\left[w\left(x_j, x\right) Y_j\right] \\
&=\sum_{j=1}^n w^2\left(x_j, x\right) \mathbb{V}\left[Y_j\right] \\
&=\sigma^2 \sum_{j=1}^n w^2\left(x_j, x\right)
\end{aligned}
$$
The second line uses the assumption that the noise is uncorrelated, and the last the assumption that the noise variance is constant. In particular, for a point $x_i$ which appeared in the training data, $\mathbb{V}\left[\widehat{\mu}\left(x_i\right)\right]=\sigma^2 \sum_j w_{i j}^2$.
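
As a rough numerical illustration of this variance formula, the sketch below builds the weights of a Gaussian-kernel (Nadaraya-Watson) smoother and evaluates $\sigma^2 \sum_j w_{i j}^2$ at the training points. The choice of kernel, the bandwidth of 0.1, the simulated predictor values, the value $\sigma^2=0.25$ and the function names are all assumptions made for the example; in practice $\sigma^2$ would have to be estimated.

```python
import numpy as np


def gaussian_kernel_weights(x_train, x_eval, bandwidth):
    """Weights of a Nadaraya-Watson smoother with a Gaussian kernel.

    Row i holds the weights placed on each training response when
    predicting at x_eval[i]; every row sums to 1.
    """
    scaled = (x_eval[:, None] - x_train[None, :]) / bandwidth
    kernel = np.exp(-0.5 * scaled**2)
    return kernel / kernel.sum(axis=1, keepdims=True)


def fitted_value_variance(weights, sigma2):
    """Variance of each fitted value: sigma^2 times the sum of squared weights."""
    return sigma2 * (weights**2).sum(axis=1)


rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, size=50))  # hypothetical predictor values
sigma2 = 0.25                                # assumed (known) noise variance

w = gaussian_kernel_weights(x, x, bandwidth=0.1)  # 50 x 50 smoothing matrix
var_fit = fitted_value_variance(w, sigma2)        # variance of each fitted value
se_fit = np.sqrt(var_fit)                         # corresponding standard errors
```

Because these particular weights are non-negative and sum to one in each row, every entry of `var_fit` is at most $\sigma^2$: averaging over neighbours can only reduce the variance of the fitted value relative to a single observation.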

Notice that this is the variance in the predicted mean value, $\widehat{\mu}(x)$. It is not an estimate of $\mathbb{V}[Y \mid X=x]$, though we will see how conditional variances can be estimated using nonparametric regression in Chapter 7.

Notice also that we have not had to assume that the noise is Gaussian. If we did add that assumption, this formula would also give us a confidence interval for the fitted value (though we would still have to worry about estimating $\sigma$ ).
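
To make the Gaussian case concrete, here is a minimal sketch of such an interval for a running-mean smoother. It takes $\sigma$ as known, which sidesteps the estimation problem just mentioned; the data, the value of $\sigma$, the 95% level and the function names are assumptions for illustration only.

```python
from statistics import NormalDist

import numpy as np


def moving_average_weights(n, half_width=1):
    """Smoothing matrix of a running mean over neighbours within half_width positions."""
    idx = np.arange(n)
    near = np.abs(idx[:, None] - idx[None, :]) <= half_width
    return near / near.sum(axis=1, keepdims=True)


def fitted_value_interval(weights, y, sigma, level=0.95):
    """Normal-theory interval around each fitted value, taking sigma as known."""
    fit = weights @ y
    se = sigma * np.sqrt((weights**2).sum(axis=1))  # sigma * sqrt(sum_j w_ij^2)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)     # standard normal quantile
    return fit - z * se, fit + z * se


y = np.array([1.0, 1.4, 0.9, 1.8, 2.1, 1.7])  # made-up responses
sigma = 0.3                                   # assumed known noise s.d.
w = moving_average_weights(len(y))
lower, upper = fitted_value_interval(w, y, sigma)
```

An interval built this way is centred on $\widehat{\mu}(x)$ and reflects only its variance, so it is an interval for the expected fitted value; it says nothing about any bias of the smoother.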

## (Effective) Degrees of Freedom

For linear regression models, you will recall that the number of “degrees of freedom” was just defined as the number of coefficients (including the intercept). While degrees of freedom are less important for other sorts of regression than for linear models, they’re still worth knowing about, so I’ll explain here how they are calculated.

The first thing to realize is that we can’t use the number of parameters to define degrees of freedom in general, since most linear smoothers don’t have parameters. Instead, we have to go back to the reasons why the number of parameters matters in ordinary linear models. We’ll start with an $n \times p$ data matrix of predictor variables $\mathbf{x}$ (possibly including an all-1 column for an intercept), and an $n \times 1$ column matrix of response values $\mathbf{y}$. The ordinary least squares estimate of the $p$-dimensional coefficient vector $\beta$ is
$$\hat{\beta}=\left(\mathbf{x}^T \mathbf{x}\right)^{-1} \mathbf{x}^T \mathbf{y}$$
This lets us write the fitted values in terms of $\mathbf{x}$ and $\mathbf{y}$ :
$$
\begin{aligned}
\widehat{\mu} &=\mathbf{x} \hat{\beta} \\
&=\left(\mathbf{x}\left(\mathbf{x}^T \mathbf{x}\right)^{-1} \mathbf{x}^T\right) \mathbf{y} \\
&=\mathbf{w} \mathbf{y}
\end{aligned}
$$
where $\mathbf{w}$ is an $n \times n$ matrix, with $w_{i j}$ saying how much of each observed $y_j$ contributes to each fitted $\widehat{\mu}_i$. This is what, a little while ago, I called the influence or hat matrix, in the special case of ordinary least squares.
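
The algebra above can be checked numerically. The following sketch, on simulated data (the sizes $n=20$, $p=3$ and the random seed are arbitrary assumptions), computes $\hat{\beta}$ and the hat matrix and confirms that the two routes to the fitted values agree.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 20, 3

# Data matrix with an all-1 intercept column plus p - 1 random predictors
x = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)

# OLS coefficients: solve the normal equations (x^T x) beta = x^T y
beta_hat = np.linalg.solve(x.T @ x, x.T @ y)

# Hat matrix w = x (x^T x)^{-1} x^T, built without forming an explicit inverse
w = x @ np.linalg.solve(x.T @ x, x.T)

fitted_from_beta = x @ beta_hat  # x beta-hat
fitted_from_w = w @ y            # w y

# w was computed without y, so the same matrix handles any other response
y_other = rng.normal(size=n)
beta_other = np.linalg.solve(x.T @ x, x.T @ y_other)
```

Using `np.linalg.solve` instead of inverting $\mathbf{x}^T \mathbf{x}$ is a numerical convenience, not part of the derivation; both give the same $\mathbf{w}$ when $\mathbf{x}$ has full column rank.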

Notice that $\mathbf{w}$ depends only on the predictor variables in $\mathbf{x}$; the observed response values in $\mathbf{y}$ don’t matter. If we change around $\mathbf{y}$, the fitted values $\widehat{\mu}$ will also change, but only within the limits allowed by $\mathbf{w}$. There are $n$ independent coordinates along which $\mathbf{y}$ can change, so we say the data have $n$ degrees of freedom. Once $\mathbf{x}$ (and thus $\mathbf{w}$) are fixed, however, $\widehat{\mu}$ has to lie in a $p$-dimensional linear subspace in this $n$-dimensional space, and the residuals have to lie in the $(n-p)$-dimensional space orthogonal to it.
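
The dimension counting in this paragraph can be illustrated with a small simulation. This is only a sketch: the sizes $n=30$ and $p=4$, the random data, and the rank tolerance are assumptions, and the name `residual_maker` for $\mathbf{I}-\mathbf{w}$ is introduced here purely for convenience.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 30, 4

# Full-column-rank data matrix with an intercept column
x = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

w = x @ np.linalg.solve(x.T @ x, x.T)  # OLS hat matrix
residual_maker = np.eye(n) - w         # maps y to its residuals

y = rng.normal(size=n)
fitted = w @ y
residuals = residual_maker @ y

# Dimensions of the spaces the fitted values and residuals can occupy
dim_fitted = np.linalg.matrix_rank(w, tol=1e-8)
dim_residuals = np.linalg.matrix_rank(residual_maker, tol=1e-8)

# Inner product of fitted values and residuals; zero up to rounding error
inner_product = fitted @ residuals
```

Here `dim_fitted` comes out as $p$ and `dim_residuals` as $n-p$ whatever $\mathbf{y}$ is drawn, since both ranks are properties of $\mathbf{x}$ alone.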

