## Binary and Categorical Features: One-Hot Encodings

So far we have dealt with regression problems where we have both real-valued inputs (features $X$ ) and real-valued outputs (labels $y$ ). What can we do in cases where features are binary or categorical?

As an example, let us consider whether the length of a user’s review can be predicted by (or more simply, is related to) their gender. To do so, we will look at a different dataset (of a few hundred beer reviews from McAuley et al. (2012)) that includes the gender of its users.
That is, we would like a model of the form:
$$\text { length }=\theta_0+\theta_1 \times \text { gender. }$$
Obviously, gender (represented in this dataset as a string) is not a numerical quantity, so we need some appropriate encoding of the gender variable.

For now, let us treat gender as a binary variable. We will relax this assumption shortly to allow for a non-binary gender variable (and for the possibility that the gender is missing, as it can be in this dataset), but for the moment let us encode the gender variable as:
$$\text { Male }=0 ; \quad \text { Female }=1 .$$
Equivalently, this is just a binary indicator specifying whether the user is female. This encoding, although only one of several we might have used, allows us to fit a linear model and estimate the values of $\theta_0$ and $\theta_1$. The model we fit (after removing users who did not specify a gender) is
$$\text{length (in words)} = 127.07 + 8.76 \times (\text{user is female}).$$
With a little thought, we can interpret the model parameters as indicating that, on average, females write slightly longer reviews (by $8.76$ words) compared to males. Note that $127.07$ is not the population average, but rather the average for males (whose gender feature is zero).
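The interpretation above can be checked on a small example. The following is a minimal sketch, assuming a tiny made-up dataset (the `genders` and `lengths` values are hypothetical, not taken from the beer review data) and a hand-written least-squares helper named `fit_simple_regression`. With a single binary feature, the fitted offset should match the mean length for males and the slope should match the difference between the female and male means.

```python
def fit_simple_regression(x, y):
    """Least-squares fit of y = theta0 + theta1 * x for a single feature."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    var = sum((xi - x_bar) ** 2 for xi in x)
    theta1 = cov / var
    theta0 = y_bar - theta1 * x_bar
    return theta0, theta1


# Hypothetical toy data: gender strings and review lengths in words.
genders = ["Male", "Female", "Male", "Female", "Male"]
lengths = [120, 140, 130, 130, 125]

# Encode Male = 0, Female = 1, as in the text.
x = [1 if g == "Female" else 0 for g in genders]
theta0, theta1 = fit_simple_regression(x, lengths)

# Group means, for comparison with the fitted parameters.
male_mean = sum(l for l, xi in zip(lengths, x) if xi == 0) / x.count(0)
female_mean = sum(l for l, xi in zip(lengths, x) if xi == 1) / x.count(1)

print(theta0, male_mean)                # offset vs. average for males
print(theta1, female_mean - male_mean)  # slope vs. difference in averages
```

Up to floating-point rounding, the two values printed on each line agree, which is the sense in which $\theta_0$ is the average for males rather than the population average.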

A scatter plot of the data (i.e., the encoded gender attribute and the review lengths), together with the line of best fit above, is depicted in Figure 2.9. Note that although we have fit the data with a line (Fig. 2.9, left), the actual feature values only occupy two points (0 and 1); thus the fit is perhaps better represented with a bar plot (Fig. 2.9, right).

## Missing Features

Datasets often have features that are missing. For example, the underlying data used for the example in Section 2.3.2 included a gender attribute that many users leave unspecified.

When dealing with binary or categorical features, we handled these missing values quite straightforwardly: we simply treated 'missing' as an additional category.
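As a rough illustration of treating 'missing' as its own category, the sketch below one-hot encodes a gender string. The category labels in `CATEGORIES` and the helper names `one_hot` and `encode_gender` are assumptions made for this example. It also assumes the common convention of dropping one reference category (here the first) so that the indicators are not redundant with the offset term.

```python
# Hypothetical category labels; "missing" is treated as just another category.
CATEGORIES = ["Male", "Female", "missing"]


def one_hot(value, categories):
    """Indicators for every category except the first (the reference category)."""
    return [1 if value == c else 0 for c in categories[1:]]


def encode_gender(raw):
    """Map a raw gender string (or None) to [offset, is_female, is_missing]."""
    value = raw if raw in CATEGORIES else "missing"
    return [1] + one_hot(value, CATEGORIES)  # leading 1 is the offset feature


print(encode_gender("Male"))    # [1, 0, 0]
print(encode_gender("Female"))  # [1, 1, 0]
print(encode_gender(None))      # [1, 0, 1]
```

Under this encoding, a linear model gets one parameter for the offset (the reference category) and one for each remaining category, so users with a missing gender receive their own prediction.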

But if a continuous feature, such as a user’s age or income, were missing, we must think harder about how to handle it. Trivially, we might simply discard instances with missing features, though this strategy will harm model performance if it means discarding a substantial fraction of our data.

Alternatively, we might replace the missing entries with the average (or mode) value for that feature; this strategy is known as feature imputation. This may be more effective than discarding the feature, but may also introduce some bias, since (for example) users who choose to leave a feature unspecified may be quite different from the average or mode.
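Mean imputation can be sketched in a few lines. The example below assumes missing values are represented as `None`; the `ages` list is made up, and `impute_mean` is a hypothetical helper name.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical ages, with two users leaving the field blank.
ages = [25.0, None, 40.0, 31.0, None]
print(impute_mean(ages))  # [25.0, 32.0, 40.0, 31.0, 32.0]
```

After imputation, the model can no longer distinguish a user whose age really is 32 from one who left the field blank, which is the source of the bias mentioned above.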

To avoid these issues, we would like a strategy that uses features when they are available, but makes separate predictions for those users when they are not. This can be achieved via the following strategy: for any feature $x$ which is sometimes missing, replace it by two features $x^{\prime}$ and $x^{\prime \prime}$ as follows:
$$x^{\prime}=\begin{cases}1 & \text{if feature is missing} \\ 0 & \text{otherwise}\end{cases}, \quad x^{\prime \prime}=\begin{cases}0 & \text{if feature is missing} \\ x & \text{otherwise}\end{cases}$$
Following this, parameters can be fit within a model as usual:
$$y=\theta_0+\theta_1 x^{\prime}+\theta_2 x^{\prime \prime} .$$
This representation may seem somewhat arbitrary, but it makes sense once we expand the expression for missing and non-missing features. For example, when a feature is available, predictions are made according to
$$y=\theta_0+\theta_2 x,$$
whereas when a feature is missing, predictions are made according to
$$y=\theta_0+\theta_1 .$$
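The two cases above can be checked with a short sketch. It assumes missing values are represented as `None`; the parameter values in `theta` are hypothetical and chosen only for illustration (in practice they would be fit from data), and `encode_missing` and `predict` are illustrative helper names.

```python
def encode_missing(x):
    """Return the feature vector [1, x', x''] for a possibly missing feature x."""
    if x is None:
        return [1.0, 1.0, 0.0]   # x' = 1, x'' = 0 when the feature is missing
    return [1.0, 0.0, float(x)]  # x' = 0, x'' = x otherwise


def predict(theta, x):
    """Evaluate y = theta0 + theta1 * x' + theta2 * x''."""
    return sum(t * f for t, f in zip(theta, encode_missing(x)))


# Hypothetical parameters [theta0, theta1, theta2].
theta = [30.0, 5.0, 0.5]

print(predict(theta, 40))    # theta0 + theta2 * 40 = 50.0
print(predict(theta, None))  # theta0 + theta1 = 35.0
```

Because $\theta_1$ is only active when the feature is missing, the model learns a separate constant prediction for those users, while $\theta_2$ is estimated from the users whose feature is observed.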

