## The Method of Maximum Likelihood

The estimation techniques we have discussed so far – least squares and instrumental variables – are applicable only to regression models. But not every model can be written so that the dependent variable is equal to a regression function plus an additive error term or so that a set of dependent variables, arranged as a vector, is equal to a vector of regression functions plus a vector of errors (see Chapter 9). If not, then least squares and instrumental variables are simply not appropriate. In this chapter, we therefore introduce a third estimation method, which is much more widely applicable than the techniques we have discussed so far, but also requires fairly strong assumptions. This is the method of maximum likelihood, or ML, estimation.

As an extreme example of how inappropriate least squares can be, consider the model
$$y_t^\gamma=\beta_0+\beta_1 x_t+u_t, \quad u_t \sim \operatorname{IID}\left(0, \sigma^2\right), \tag{8.01}$$
which looks almost like a regression model. This model makes sense so long as the right-hand side of (8.01) is always positive, and it may even be an attractive model in certain cases.${ }^1$ For example, suppose that the observations on $y_t$ are skewed to the right but those on $x_t$ are not. Then a conventional regression model could reconcile these two facts only if the error terms $u_t$ were right-skewed, which one would probably not want to assume and which would make the use of least squares dubious. On the other hand, the model (8.01) with $\gamma<1$ might well be able to reconcile these facts while allowing the error terms to be symmetrically distributed.
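The claim that $\gamma<1$ can produce right-skewed $y_t$ from symmetric ingredients is easy to check by simulation. The following Python sketch draws symmetric $x_t$ and $u_t$, solves (8.01) for $y_t$ with $\gamma=0.5$, and compares sample skewness coefficients. The parameter values, the uniform and normal distributions, and the helper names `sample_skewness` and `simulate` are illustrative assumptions, not part of the text.

```python
import random
import statistics


def sample_skewness(values):
    """Sample skewness: mean of the cubed standardized deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean(((v - mean) / sd) ** 3 for v in values)


def simulate(gamma, beta0=10.0, beta1=1.0, sigma=1.0, n=20000, seed=0):
    """Generate (x, u, y) satisfying y**gamma = beta0 + beta1*x + u.

    Assumed for illustration: x is uniform on [0, 10] and u is normal,
    so both are symmetric. beta0 is chosen much larger than sigma so that
    the right-hand side stays positive in practice.
    """
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    us = [rng.gauss(0.0, sigma) for _ in range(n)]
    # Invert the transformation: y = (beta0 + beta1*x + u)**(1/gamma).
    ys = [(beta0 + beta1 * x + u) ** (1.0 / gamma) for x, u in zip(xs, us)]
    return xs, us, ys


xs, us, ys = simulate(gamma=0.5)
skew_x = sample_skewness(xs)
skew_u = sample_skewness(us)
skew_y = sample_skewness(ys)
print(f"skewness of x: {skew_x:.3f}")
print(f"skewness of u: {skew_u:.3f}")
print(f"skewness of y: {skew_y:.3f}")
```

Under these assumptions the skewness of `ys` should come out clearly positive while that of `xs` and `us` stays close to zero, which is the pattern described above.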

If $\gamma$ were known, (8.01) would be a regression model. But if $\gamma$ is to be estimated, (8.01) is not a regression model, and it cannot sensibly be estimated by least squares. The sum-of-squares function is
$$\operatorname{SSR}(\boldsymbol{\beta}, \gamma)=\sum_{t=1}^n\left(y_t^\gamma-\beta_0-\beta_1 x_t\right)^2,$$
and if, for example, every $y_t$ is greater than 1, this function can be made arbitrarily close to zero by setting $\beta_0$ and $\beta_1$ to zero and letting $\gamma$ tend to minus infinity.

${ }^1$ Strictly speaking, of course, it is impossible to guarantee that the right-hand side of (8.01) will always be positive, but this model may be regarded as a very good approximation if $\beta_0+\beta_1 x_t$ is always much larger than $\sigma$.
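The difficulty with least squares here can be seen numerically. In the sketch below (the simulated data, the seed, and the function name `ssr` are illustrative assumptions), every $y_t$ exceeds 1, so with $\beta_0=\beta_1=0$ each term $y_t^{2\gamma}$ of the sum of squares shrinks toward zero as $\gamma$ becomes more negative, whatever the data look like.

```python
import random


def ssr(ys, xs, beta0, beta1, gamma):
    """Sum of squared residuals for y**gamma = beta0 + beta1*x + u."""
    return sum((y ** gamma - beta0 - beta1 * x) ** 2 for y, x in zip(ys, xs))


# Hypothetical data: any sample in which every y_t > 1 would do.
rng = random.Random(1)
n = 50
xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
ys = [1.5 + rng.uniform(0.0, 10.0) for _ in range(n)]

# Hold beta0 = beta1 = 0 and push gamma toward minus infinity.
gammas = [-1.0, -5.0, -20.0, -50.0]
ssr_values = [ssr(ys, xs, 0.0, 0.0, g) for g in gammas]

for g, s in zip(gammas, ssr_values):
    print(f"gamma = {g:6.1f}  SSR = {s:.3e}")
```

The printed values fall toward zero even though the "fit" carries no information about how $y_t$ depends on $x_t$, so minimizing the sum of squares over $\gamma$ as well as $\boldsymbol{\beta}$ does not yield a meaningful estimate.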

## Fundamental Concepts and Notation

Maximum likelihood estimation depends on the notion of the likelihood of a given set of observations relative to a model, or set of DGPs. A DGP, being a stochastic process, can be characterized in a number of ways. We now develop notation in which we can readily express one such characterization that is particularly useful for present purposes. We assume that each observation in any sample of size $n$ is a realization of a random variable $y_t, t=1, \ldots, n$, taking values in $\mathbb{R}^m$. Although the notation $y_t$ ignores the possibility that the observation is in general a vector, it is more convenient to let the vector notation $\boldsymbol{y}$ (or $\boldsymbol{y}^n$ if we wish to make the sample size explicit) denote the entire sample. Thus
$$\boldsymbol{y}^n=\left[\begin{array}{l:l:l:l} y_1 & y_2 & \cdots & y_n \end{array}\right].$$
If each observation is a scalar, $\boldsymbol{y}$ is an $n$-vector, while if each observation is an $m$-vector, $\boldsymbol{y}$ is an $n \times m$ matrix. The vector or matrix $\boldsymbol{y}$ may possess a probability density, namely, the joint density of its elements under the DGP. This density, if it exists, is a map to the real line from the set of possible realizations of $\boldsymbol{y}$, a set that we will denote by $\mathcal{Y}^n$ and that is in general an arbitrary subset of $\mathbb{R}^{n m}$. It will be necessary to exercise some care over the definition of the density in certain cases, but for the present it is enough to suppose that it is the ordinary density with respect to Lebesgue measure on $\mathbb{R}^{n m}$.${ }^3$ When other possibilities exist, it will turn out that the choice among them is irrelevant for our purposes.

We may now define formally the likelihood function associated with a given model for a given sample $\boldsymbol{y}$. This function is a function of both the parameters of the model and the given data set $\boldsymbol{y}$; its value is just the density associated with the DGP characterized by the parameter vector $\boldsymbol{\theta} \in \Theta$, evaluated at the sample point $\boldsymbol{y}$. Here $\Theta$ denotes the parameter space in which the parameter vector $\boldsymbol{\theta}$ lies; we will assume that it is a subset of $\mathbb{R}^k$. We will denote the likelihood function by $L: \mathcal{Y}^n \times \Theta \rightarrow \mathbb{R}$ and its value for $\boldsymbol{\theta}$ and $\boldsymbol{y}$ by $L(\boldsymbol{y}, \boldsymbol{\theta})$. In many practical cases, such as the one examined in the preceding section, the $y_t$'s are independent and each $y_t$ has probability density $L_t\left(y_t, \boldsymbol{\theta}\right)$. The likelihood function for this special case is then
$$L(\boldsymbol{y}, \boldsymbol{\theta})=\prod_{t=1}^n L_t\left(y_t, \boldsymbol{\theta}\right) .$$
The likelihood function (8.03) of the preceding section is evidently a special case of this special case. When each of the $y_t$'s is identically distributed with density $f\left(y_t, \boldsymbol{\theta}\right)$, as in that example, $L_t\left(y_t, \boldsymbol{\theta}\right)$ is equal to $f\left(y_t, \boldsymbol{\theta}\right)$ for all $t$.
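As a concrete instance of the product form, the sketch below assumes, purely for illustration, that the $y_t$ are independent $N(\mu, \sigma^2)$, so that $\boldsymbol{\theta}=(\mu, \sigma)$ and every $L_t$ is the same normal density. This is not the model behind (8.03), and the function names are hypothetical. The code evaluates the likelihood at two parameter vectors for the same sample, and also computes the sum of the log densities, which equals the log of the product and avoids numerical underflow when $n$ is large.

```python
import math
import random


def normal_density(y, mu, sigma):
    """Density f(y_t, theta) of N(mu, sigma^2), with theta = (mu, sigma)."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def likelihood(ys, mu, sigma):
    """L(y, theta): product over t of the individual densities."""
    return math.prod(normal_density(y, mu, sigma) for y in ys)


def loglikelihood(ys, mu, sigma):
    """Log of L(y, theta), computed as a sum of log densities."""
    return sum(math.log(normal_density(y, mu, sigma)) for y in ys)


# Hypothetical sample: 50 independent draws from N(3, 2^2).
rng = random.Random(2)
ys = [rng.gauss(3.0, 2.0) for _ in range(50)]

# Same data, two different parameter vectors.
L_true = likelihood(ys, 3.0, 2.0)
L_far = likelihood(ys, 0.0, 2.0)

print(f"L at (mu=3, sigma=2): {L_true:.3e}")
print(f"L at (mu=0, sigma=2): {L_far:.3e}")
print(f"log L at (mu=3, sigma=2): {loglikelihood(ys, 3.0, 2.0):.3f}")
```

Holding the sample fixed and varying $\boldsymbol{\theta}$ is what makes $L$ a function of the parameters; in this sketch the value at the parameters used to generate the data is far larger than at the distant alternative.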

