## Question Wording

Although applications demonstrating the potential of machine learning for questionnaire design (or developing question wording in the more traditional sense) have been developed successfully, users may not know that many of these tools rely on MLMs.

The free online application Survey Quality Predictor (SQP) 2.0 (http://sqp.upf.edu/) is one such example. SQP allows researchers to assess the quality of their questions and offers suggestions for improvement. The assessment can be done before data collection, or after it, in which case it allows researchers to quantify (and correct for) the potentially biasing impact of measurement error (Oberski and DeCastellarnau 2019). For this assessment to work, researchers first have to code their question along several dimensions, including topic, wording, response scale, and administration mode (using the coding system developed by Saris and Gallhofer 2007, 2014). In the second step, the system predicts the quality of the question or, more specifically, two indicators of measurement error: the reliability and the validity of the survey question. This prediction task is where machine learning comes in: while SQP 1.0 relied on linear regression modeling, SQP 2.0 uses random forests consisting of 1500 individual regression trees to account for possible multicollinearity of question characteristics, nonlinear relationships, and interaction effects (Saris and Gallhofer 2007, 2014; Oberski, Gruner, and Saris 2011; Oberski and DeCastellarnau 2019). These random forests explain a much higher portion of the variance, namely 65% of the variance in reliability and 84% of the variance in validity across the questions in the test sample, compared to the linear regression used in the first version of SQP (reliability: 47%; validity: 61%) (Saris and Gallhofer 2014; Oberski 2016). SQP also provides researchers with an importance measure, that is, a prioritization of which question characteristics are the most influential, and shows how changing a particular question feature, say from a 5-point to an 11-point response scale, would alter the predicted reliability and validity.

Another tool with a slightly different approach is QUAID (question-understanding-aid) (Graesser et al. 2000; Graesser et al. 2006).
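
To make the SQP prediction step more concrete, the following is a minimal sketch of a random forest of regression trees used for a "what if" comparison between a 5-point and an 11-point response scale. It is not SQP's implementation: the three features, the synthetic data, the invented relationship between features and quality, and all function names (`grow_tree`, `fit_forest`, `predict_forest`) are assumptions made purely for illustration, and the forest uses 30 small trees rather than 1500 to keep the run short.

```python
import random

def avg(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared deviations from the mean; used as the split criterion."""
    m = avg(values)
    return sum((v - m) ** 2 for v in values)

def grow_tree(X, y, rng, depth=3, min_leaf=3, n_try=2):
    """Grow one regression tree; every split looks at n_try random features."""
    if depth == 0 or len(y) < 2 * min_leaf:
        return avg(y)  # leaf: mean target of the rows that ended up here
    best = None
    for f in rng.sample(range(len(X[0])), n_try):
        for t in sorted({row[f] for row in X})[:-1]:  # candidate thresholds
            left = [i for i, row in enumerate(X) if row[f] <= t]
            right = [i for i, row in enumerate(X) if row[f] > t]
            if min(len(left), len(right)) < min_leaf:
                continue
            cost = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or cost < best[0]:
                best = (cost, f, t, left, right)
    if best is None:
        return avg(y)

    _, f, t, left, right = best

    def child(idx):
        return grow_tree([X[i] for i in idx], [y[i] for i in idx],
                         rng, depth - 1, min_leaf, n_try)

    return (f, t, child(left), child(right))

def predict_tree(node, row):
    """Walk from the root to a leaf; inner nodes are tuples, leaves are floats."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def fit_forest(X, y, n_trees=30, seed=0):
    """Fit each tree on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], rng))
    return forest

def predict_forest(forest, row):
    """The forest prediction is the average of the individual tree predictions."""
    return avg([predict_tree(tree, row) for tree in forest])

# Synthetic, made-up "coded questions": [scale points, word count, web mode].
data_rng = random.Random(1)
X, y = [], []
for _ in range(150):
    row = [data_rng.choice([2, 3, 5, 7, 11]),
           data_rng.randint(5, 40),
           data_rng.randint(0, 1)]
    X.append(row)
    # Invented relationship, used only so the forest has something to learn.
    y.append(0.5 + 0.03 * row[0] - 0.004 * row[1] + data_rng.gauss(0, 0.02))

forest = fit_forest(X, y)
five_point = predict_forest(forest, [5, 20, 1])
eleven_point = predict_forest(forest, [11, 20, 1])
print(f"predicted quality, 5-point scale:  {five_point:.3f}")
print(f"predicted quality, 11-point scale: {eleven_point:.3f}")
```

Because each tree splits on thresholds rather than fitting a single slope, an ensemble like this can pick up nonlinear relationships and interactions between question characteristics, which is the motivation the text gives for moving from linear regression to random forests. The direction of the difference printed here only reflects the invented data, not any empirical finding.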

## Evaluation and Testing

In addition to the tools and applications aimed at improving question wording, researchers have used machine learning in question evaluation and testing. The insights from these studies, e.g. which questions contribute to interviewer misreading and which operations in reporting cause respondent reporting errors, can be used to redesign and improve existing questionnaires, to assess interviewer performance and target interviewers for retraining, and potentially to help identify the type of respondent to include for testing.

For example, Timbrook and Eck (2018, 2019) investigated the respondent-interviewer interaction or, more specifically, interviewer reading behaviors: whether a question was read with or without changes and whether those changes were minor or major. The traditional approach to measuring interviewer reading behaviors is behavior coding, which can be very time-consuming and expensive. Timbrook and Eck (2018, 2019) demonstrated that it is possible to partially automate the measurement of these interviewer question reading behaviors using machine learning. The authors compared string comparisons, recurrent neural networks (RNNs), and human coding. Unlike string comparisons, which only differentiate between reading with and without changes, RNNs, like human coders, should be able to reliably differentiate between readings without changes, with minor changes, and with major changes. Comparing RNNs to string matching (read with change vs. without), the authors showed that exact string comparisons based on preprocessed text were comparable to the other methods. Preprocessing text, however, is resource intensive. If the text was left unprocessed, the exact string comparisons were less reliable than any other method. In contrast, RNNs can differentiate between different degrees of deviation in question reading. The authors showed that RNNs trained on unprocessed text are comparable to manual, human coding if there is a high prevalence of deviations from exact reading, and that the RNNs perform much worse when there is a low prevalence of deviations. Regardless of this prevalence, RNNs performed slightly worse at identifying minor changes in question wording.

McCarthy and Earp (2009) retrospectively analyzed data from the 2002 Census of Agriculture using classification trees to identify respondents with higher rates of reporting errors in surveys (more specifically, when reporting land size and land use).
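
The exact string comparison discussed for interviewer reading behaviors can be illustrated with a short sketch. It is an assumption-laden toy, not the procedure of Timbrook and Eck: the preprocessing steps (lowercasing, removing punctuation, collapsing whitespace), the function names, and the example question and transcripts are all hypothetical.

```python
import re

def preprocess(text):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())

def read_without_change(scripted, spoken, clean=True):
    """Return True if the transcript matches the scripted question exactly.

    With clean=True both strings are preprocessed first, so differences in
    case and punctuation are not counted as deviations.
    """
    if clean:
        scripted, spoken = preprocess(scripted), preprocess(spoken)
    return scripted == spoken

# Hypothetical scripted question and two hypothetical transcripts.
scripted = "In general, how would you rate your health?"
spoken_same = "in general how would you rate your health"
spoken_changed = "How would you rate your health?"

print(read_without_change(scripted, spoken_same))               # True
print(read_without_change(scripted, spoken_same, clean=False))  # False
print(read_without_change(scripted, spoken_changed))            # False
```

The second call shows why unprocessed text is a problem for exact matching: a transcript that differs only in case and punctuation is flagged as a changed reading. The sketch also shows the limit of the method, since it returns only "changed" or "unchanged" and cannot separate minor from major deviations.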

