## Application to Missing Data in Surveys

When making inference with survey data, the researcher has data on a vector of characteristics for units belonging to a random subset $\mathcal{S}$ of a larger finite population $\mathcal{U}$. The law used to draw $\mathcal{S}$ can depend on variables available for the whole population, for example, from a census. We assume that the researcher is interested in a parameter $g$ which could be computed if the values of a variable $y_{i}$ were known for all units $i \in \mathcal{U}$. For example, $g$ can be an inequality index such as the Gini index, with $y_{i}$ the wealth of household $i$. In the absence of missing data, the statistician can produce a confidence interval for $g$ using the data for the units $i \in \mathcal{S}$ and the available knowledge of the law of $\mathcal{S}$. We assume that the cardinality of $\mathcal{S}$ is fixed and equal to $n$. When $g$ is a total, it is usual to rely on an unbiased estimator, an estimator of its variance, and a Gaussian approximation. For more complex parameters, linearization is often used to approximate moments. The estimators usually rely on the survey weights $\pi_{i}=1 / \mathbb{P}(i \in \mathcal{S})$. For example, an estimator of the Gini index is
$$\widehat{g}\left(\left(y_{i}\right)_{i \in \mathcal{S}}\right)=\frac{\sum_{i=1}^{n}(2 \hat{r}(i)-1) \pi_{i} y_{i}}{\sum_{i=1}^{n} \pi_{i} \sum_{i=1}^{n} \pi_{i} y_{i}}-1,$$
where $\hat{r}(i)=\sum_{j=1}^{n} \pi_{j} \mathbb{1}\left\{y_{j} \leq y_{i}\right\}$. Estimators of the variance of such estimators are more complex to obtain, and we assume that a numerical procedure is available to compute them. Inference is based on the approximation
$$\widehat{g}\left(\left(y_{i}\right)_{i \in \mathcal{S}}\right) \approx g+\sqrt{\widehat{\operatorname{var}}(\widehat{g})\left(\left(y_{i}\right)_{i \in \mathcal{S}}\right)}\, \epsilon,$$
where $\epsilon$ is a standard normal random variable and $\widehat{\operatorname{var}}(\widehat{g})\left(\left(y_{i}\right)_{i \in \mathcal{S}}\right)$ is an estimator of the variance of $\widehat{g}\left(\left(y_{i}\right)_{i \in \mathcal{S}}\right)$.
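As a rough illustration of the Gini estimator and the Gaussian approximation above, the following Python sketch computes the weighted estimate and a normal-approximation confidence interval. The function names `weighted_gini` and `normal_ci` and the argument `var_hat` are hypothetical. In particular, `var_hat` stands in for the output of the variance-estimation procedure that the text assumes to be available; the sketch does not compute it.

```python
import math
from statistics import NormalDist


def weighted_gini(y, w):
    """Weighted Gini estimate following the displayed formula.

    y: observed values for the sampled units.
    w: survey weights for the same units (written pi_i in the text).
    """
    if len(y) != len(w) or len(y) == 0:
        raise ValueError("y and w must be non-empty and of equal length")
    total_w = sum(w)
    total_wy = sum(wi * yi for wi, yi in zip(w, y))
    # r_hat(i) = sum_j w_j * 1{y_j <= y_i}: weighted count of units at or below y_i
    r_hat = [sum(wj for yj, wj in zip(y, w) if yj <= yi) for yi in y]
    numerator = sum((2.0 * r - 1.0) * wi * yi for r, wi, yi in zip(r_hat, w, y))
    return numerator / (total_w * total_wy) - 1.0


def normal_ci(g_hat, var_hat, level=0.95):
    """Interval g_hat +/- z * sqrt(var_hat) from the Gaussian approximation.

    var_hat is assumed to come from an external variance estimator.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * math.sqrt(var_hat)
    return g_hat - half_width, g_hat + half_width


# Usage with unit weights; var_hat below is a made-up value for illustration only.
g_hat = weighted_gini([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0])  # 0.25
lower, upper = normal_ci(g_hat, var_hat=0.01)
```

The rank term is computed with a quadratic double loop to stay close to the formula; for large samples one would presumably sort the data and use cumulative sums instead.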

In practice, this is not possible when some of the $y_{i}$ are missing. There is a distinction between total nonresponse, where the data for some units $i \in \mathcal{S}$ are unavailable or discarded by the researcher, and partial nonresponse. Let us ignore total nonresponse, which is usually dealt with using reweighting and calibration, and focus on partial nonresponse. We consider a case where $y_{i}$ can be missing for some units $i \in \mathcal{S}$, while all other variables are available for all units $i \in \mathcal{S}$. We rely on a classical formalism where the vector of surveyed variables and of those used to draw $\mathcal{S} \subsetneq \mathcal{U}$, for each unit $i \in \mathcal{U}$, are random draws from a superpopulation. In this formalism, the values $y_{i}$ for all indices $i$ of households in the population, as well as $g$, are random, and we shall now use capital letters for them. Let $S_{i}$ and $R_{i}$ be random variables, where $S_{i}=1$ if $i \in \mathcal{S}$ and $R_{i}=1$ if unit $i$ reveals the value of $Y_{i}$ given $S_{i}=1$, and let $\boldsymbol{X}_{i}$ and $\boldsymbol{Z}_{i}$ be random vectors which will play different roles.

## B-Spline Model-Assisted Estimator for Finite Population Totals

Consider the superpopulation model given in (2) with $f$ an unknown function and a univariate $x$-variable. Without loss of generality, we suppose that $x_{k} \in[0,1]$. We suppose also that $x_{k}$ is known for all $k \in U$.

To estimate the unknown regression function $f$, we use spline approximation. For a fixed $m>1$, the set $S_{K, m}$ of spline functions of order $m$ with $K$ equidistant interior knots $0=\xi_{0}<\xi_{1}<\ldots<\xi_{K}<\xi_{K+1}=1$ is the set of piecewise polynomials of degree $m-1$ that are smoothly connected at the knots:
$$S_{K, m}=\left\{t \in C^{m-2}[0,1]: t(z) \text { is a polynomial of degree } (m-1) \text { on each interval } \left[\xi_{i}, \xi_{i+1}\right]\right\}.$$
For $m=1$, $S_{K, m}$ is the set of step functions with jumps at the knots. For each fixed set of knots, $S_{K, m}$ is a linear space of functions of dimension $q=K+m$. A basis for this linear space is provided by the B-spline functions $\left\{B_{j}(\cdot)\right\}_{j=1}^{q}$ defined by
$$B_{j}(x)=\left(\xi_{j}-\xi_{j-m}\right) \sum_{l=0}^{m} \frac{\left(\xi_{j-l}-x\right)_{+}^{m-1}}{\prod_{r=0, r \neq l}^{m}\left(\xi_{j-l}-\xi_{j-r}\right)},$$
with $\left(\xi_{j-l}-x\right)_{+}^{m-1}=\left(\xi_{j-l}-x\right)^{m-1}$ if $\xi_{j-l} \geq x$ and zero otherwise (Schumaker 1981; Dierckx 1993). Each function $B_{j}(\cdot)$ has the knots $\xi_{j-m}, \ldots, \xi_{j}$ with $\xi_{r}=\xi_{\min (\max (r, 0), K+1)}$ for $r=j-m, \ldots, j$ (Zhou et al. 1998), which means that its support consists of a small, fixed, finite number of intervals between knots. Figure 1 exhibits the six $B$-spline basis functions for $K=3$ interior knots and $m=3$. Other important properties of $B$-splines are:
$$B_{j}(x) \geq 0 \text { for all } x \in[0,1]$$
and
$$\sum_{j=1}^{q} B_{j}(x)=1, \quad x \in[0,1] .$$
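The non-negativity and sum-to-one properties can be checked numerically. The sketch below evaluates the $q=K+m$ basis functions with the Cox-de Boor recursion on the clamped knot sequence ($m$ copies of 0, the $K$ equidistant interior knots, $m$ copies of 1). The recursion is a standard alternative to the truncated-power formula and is used here only for convenience, not because the source prescribes it. The function name `bspline_basis` is hypothetical.

```python
def bspline_basis(x, K, m):
    """Values of the q = K + m B-spline basis functions of order m at x in [0, 1].

    Uses the Cox-de Boor recursion (an assumption of this sketch, not the
    truncated-power formula of the text) on a clamped, equidistant knot sequence.
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    interior = [k / (K + 1) for k in range(1, K + 1)]
    t = [0.0] * m + interior + [1.0] * m  # extended knots, length K + 2m

    # Order 1: indicators of half-open knot intervals [t_i, t_{i+1});
    # the last non-empty interval is closed on the right so that x = 1 is covered.
    vals = []
    for i in range(len(t) - 1):
        inside = t[i] <= x < t[i + 1]
        at_right_end = x == t[-1] and t[i] < t[i + 1] and t[i + 1] == t[-1]
        vals.append(1.0 if (inside or at_right_end) else 0.0)

    # Raise the order step by step; terms with a zero-length knot span are dropped.
    for k in range(2, m + 1):
        new_vals = []
        for i in range(len(t) - k):
            left = right = 0.0
            span_left = t[i + k - 1] - t[i]
            if span_left > 0.0:
                left = (x - t[i]) / span_left * vals[i]
            span_right = t[i + k] - t[i + 1]
            if span_right > 0.0:
                right = (t[i + k] - x) / span_right * vals[i + 1]
            new_vals.append(left + right)
        vals = new_vals
    return vals


# Usage: the six basis functions for K = 3 interior knots and order m = 3.
values = bspline_basis(0.4, K=3, m=3)
assert len(values) == 6
assert all(v >= 0.0 for v in values)
assert abs(sum(values) - 1.0) < 1e-12
```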

## B-Spline Model-Assisted Estimation

In a survey sampling framework, the values $y_{k}$ are available only for the sampled individuals, so $\tilde{f}\left(x_{k}\right)$ given in (6) cannot be used in practice. We estimate it by
$$\hat{f}\left(x_{k}\right)=\mathbf{b}^{T}\left(x_{k}\right) \hat{\boldsymbol{\theta}}, \quad k \in U$$

where $\hat{\boldsymbol{\theta}}$ is the minimizer of the weighted least squares criterion
$$\begin{aligned} \hat{\boldsymbol{\theta}} &=\arg \min _{\boldsymbol{\theta} \in \mathbf{R}^{q}} \sum_{k \in s} d_{k}\left(y_{k}-\mathbf{b}^{T}\left(x_{k}\right) \boldsymbol{\theta}\right)^{2} \\ &=\left(\mathbf{B}_{s}^{T} \boldsymbol{\Pi}_{s}^{-1} \mathbf{B}_{s}\right)^{-1} \mathbf{B}_{s}^{T} \boldsymbol{\Pi}_{s}^{-1} \mathbf{y}_{s}=\left(\sum_{k \in s} d_{k} \mathbf{b}\left(x_{k}\right) \mathbf{b}^{T}\left(x_{k}\right)\right)^{-1} \sum_{k \in s} d_{k} \mathbf{b}\left(x_{k}\right) y_{k}, \end{aligned}$$
where $\mathbf{B}_{s}^{T}=\left(\mathbf{b}^{T}\left(x_{k}\right)\right)_{k \in s}$, $\mathbf{y}_{s}=\left(y_{k}\right)_{k \in s}$ and $\boldsymbol{\Pi}_{s}=\operatorname{diag}\left(\pi_{k}\right)_{k \in s}$ (so that $d_{k}=1 / \pi_{k}$), provided that the matrix $\mathbf{B}_{s}^{T} \boldsymbol{\Pi}_{s}^{-1} \mathbf{B}_{s}$ is invertible. The sample-based estimator $\hat{\boldsymbol{\theta}}$ can be viewed as a substitution estimator of $\tilde{\boldsymbol{\theta}}$ given in (7), since every finite population total in the expression of $\tilde{\boldsymbol{\theta}}$ is replaced by its Horvitz-Thompson (HT) estimator.

The $B$-spline model-assisted estimator of the total $t_{y}$ was suggested by Goga (2005) and is obtained by plugging $\hat{f}\left(x_{k}\right)$ into (3):
$$\begin{aligned} \hat{t}_{b s} &=\sum_{k \in s} d_{k}\left(y_{k}-\hat{f}\left(x_{k}\right)\right)+\sum_{k \in U} \hat{f}\left(x_{k}\right) \\ &=\sum_{k \in s} d_{k} y_{k}-\left(\sum_{k \in s} d_{k} \mathbf{b}\left(x_{k}\right)-\sum_{k \in U} \mathbf{b}\left(x_{k}\right)\right)^{T} \hat{\boldsymbol{\theta}}. \end{aligned}$$
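The weighted least squares fit and the resulting estimator of the total can be sketched numerically as follows. This is a minimal illustration under several assumptions that are not in the source: it uses NumPy, restricts to order $m=2$ (piecewise-linear "hat" functions on equidistant knots, so $q=K+2$), and takes $d_{k}=1/\pi_{k}$ with $\pi_{k}$ the inclusion probabilities. The names `hat_basis` and `bspline_model_assisted_total` are hypothetical.

```python
import numpy as np


def hat_basis(x, K):
    """Order-2 (piecewise-linear) B-spline basis on [0, 1] with K interior knots.

    Returns an array of shape (len(x), K + 2); row k is b(x_k)^T.
    """
    x = np.asarray(x, dtype=float)
    knots = np.linspace(0.0, 1.0, K + 2)  # xi_0, ..., xi_{K+1}
    h = 1.0 / (K + 1)                     # spacing between consecutive knots
    return np.maximum(0.0, 1.0 - np.abs(x[:, None] - knots[None, :]) / h)


def bspline_model_assisted_total(x_U, sample_idx, y_s, pi_s, K):
    """Return (t_hat_bs, theta_hat) for the model-assisted estimator of the total.

    x_U: auxiliary variable for the whole population (assumed known).
    sample_idx: positions of the sampled units within x_U.
    y_s, pi_s: study variable and inclusion probabilities of the sampled units.
    """
    B_U = hat_basis(x_U, K)
    B_s = B_U[sample_idx]
    y_s = np.asarray(y_s, dtype=float)
    d = 1.0 / np.asarray(pi_s, dtype=float)  # assumed design weights d_k = 1 / pi_k

    # Normal equations of the weighted least squares problem.
    A = B_s.T @ (d[:, None] * B_s)           # sum_s d_k b(x_k) b(x_k)^T
    c = B_s.T @ (d * y_s)                    # sum_s d_k b(x_k) y_k
    theta_hat = np.linalg.solve(A, c)        # raises LinAlgError if A is singular

    f_hat_U = B_U @ theta_hat                # fitted values for every population unit
    residual_part = np.sum(d * (y_s - f_hat_U[sample_idx]))
    return residual_part + f_hat_U.sum(), theta_hat


# Usage on a synthetic population (all values below are made up for illustration).
x_U = np.linspace(0.0, 1.0, 101)
y_U = 2.0 + 3.0 * x_U                        # lies exactly in the piecewise-linear span
idx = np.arange(0, 101, 5)                   # a systematic sample of 21 units
pi_s = np.full(idx.size, idx.size / 101)
t_hat, theta_hat = bspline_model_assisted_total(x_U, idx, y_U[idx], pi_s, K=3)
```

In this toy case the study variable is itself a piecewise-linear function of $x$, so the fit is exact, the weighted residual term vanishes, and `t_hat` coincides with the population total up to rounding. With real data the residual term is what corrects the model-based prediction.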

