# Essentially Unbiased EAP Estimates in Computerized Adaptive Testing

Tianyou Wang

ACT, Inc.

Address correspondence to Tianyou Wang, ACT, P.O. Box 168, Iowa City, IA 52243, e-mail: wang@act.org

Abstract

In computerized adaptive testing (CAT), the scoring procedure is usually based on IRT-based ability ( θ ) estimates instead of number-correct scores because different examinees typically receive different sets of items. It is well known that maximum likelihood estimation (MLE) produces relatively unbiased estimates with a relatively high standard error (SE) in CAT. The Bayesian estimation methods, on the other hand, produce estimates with a relatively small SE but with large bias if a standard normal prior is imposed.

The purpose of this paper was to propose a new expected a posteriori (EAP) estimation method that uses a prior distribution flatter than the standard normal distribution in order to reduce the bias of the Bayesian methods. The simulation results demonstrated that EAP with a beta prior distribution can produce estimates with bias similar to or even smaller than that of MLE, without sacrificing much of the smaller SE and root mean square error (RMSE) of the standard EAP estimation with a normal prior. The results also showed that practical constraints such as content balancing and item exposure rate control do not affect the relative unbiasedness of the new EAP method.

Key Words: Computerized adaptive testing, Bayesian estimation, expected a posteriori, prior distribution, bias.

In computerized adaptive testing (CAT), different examinees commonly receive different sets of items from a given item pool. Because those sets of items differ in difficulty, it is inconvenient to derive reported scores from number-correct raw scores, as is often done in conventional paper-and-pencil testing. Therefore, IRT-based ability ( θ ) estimates are often used as the basis for deriving the reported scores. So far, four ability estimation methods have primarily been used in CAT: (1) maximum likelihood estimation (MLE) (Birnbaum, 1968), (2) Owen's Bayesian estimation (OWEN) (Owen, 1969, 1975), (3) expected a posteriori estimation (EAP) (Bock & Aitkin, 1981; Bock & Mislevy, 1982), and (4) maximum a posteriori estimation (MAP) (Samejima, 1969). A few studies (Bock & Mislevy, 1982; Weiss & McBride, 1984; de la Torre, 1991; Wang, 1995) have examined and compared these ability estimation methods under CAT settings. The general conclusions are that MLE is relatively unbiased with a well-designed item pool but has a relatively large standard error (SE), that the Bayesian methods are relatively biased toward the prior mean, and that among the Bayesian methods EAP has relatively small bias and SE. Bias in this context is defined as the mean of the θ estimates for an examinee taking the same CAT many times without practice effect, minus his or her true θ.

EAP has the advantage of being computationally simpler than MLE and MAP. Wang (1995) also found that if an item pool lacks items of extreme difficulty levels, which is usually the case with real-world item pools, MLE could also be biased, but in the opposite direction of the Bayesian methods.
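The definition of bias given above, together with the SE and RMSE used throughout the paper, can be illustrated with a minimal sketch. It assumes that SE is taken as the standard deviation of the replicated estimates around their own mean and that RMSE is taken around the true θ; the function name `summarize_estimates` and the sample values are hypothetical and not from the study.

```python
import math

def summarize_estimates(estimates, true_theta):
    """Return (bias, se, rmse) for replicated estimates of one true theta."""
    n = len(estimates)
    mean_est = sum(estimates) / n
    # Bias: mean of the estimates minus the true theta.
    bias = mean_est - true_theta
    # SE (assumed definition): spread of the estimates around their own mean.
    se = math.sqrt(sum((e - mean_est) ** 2 for e in estimates) / n)
    # RMSE: root mean squared deviation of the estimates from the true theta.
    rmse = math.sqrt(sum((e - true_theta) ** 2 for e in estimates) / n)
    return bias, se, rmse

# Hypothetical replicated estimates for an examinee whose true theta is 1.0.
bias, se, rmse = summarize_estimates([0.7, 0.9, 0.8, 1.0], true_theta=1.0)
```

Under these definitions RMSE² equals bias² plus SE², which is why an estimator can trade a small amount of bias for a smaller SE without increasing RMSE.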

In many standardized testing programs (e.g., GRE, see Eignor & Schaeffer, 1995), Bayesian methods are not used, despite their small standard error, solely because they are seriously biased. Bias can be problematic when the estimates are used to make inferences in relation to some absolute criterion. For instance, in computerized mastery testing, the estimates may be compared with certain cut scores to decide examinees’ pass/fail status, and bias in the estimation can cause serious false decisions. For some testing programs, the CAT form will coexist with its conventional paper-and-pencil form for a period of time, and the score scale will remain the same as for the conventional form. In these situations, the θ estimates need to be transformed into the equivalent number-correct score on some base conventional form (e.g., Eignor & Schaeffer, 1995). Any bias in the θ estimates will necessarily affect the transformed reported score adversely. To solve this bias problem, some CAT developers have resorted to traditional equating methods to eliminate the effect of the bias. For example, Segall (1995) and Segall & Carter (1995) used a random groups design to eliminate the inequivalence between the θ-based CAT scores and the conventional form test scores. The equating process is usually expensive and may introduce additional errors in the process of data collection and analysis.

Conceptually, the Bayesian methods are intrinsically biased because they incorporate prior information into the estimation process. Like regression methods in prediction problems, which regress the predicted values toward the mean, the Bayesian methods regress estimates toward the prior mean. The Bayesian methods use both the data and the prior for estimation, whereas MLE uses only the data. The Bayesian methods can be thought of, in a loose sense, as a combination of MLE and the prior distribution, which is usually the standard normal distribution. (For convenience, Bayesian methods with a standard normal prior will be called standard Bayesian methods in the remainder of this paper.) The large bias of the standard Bayesian methods in CAT is caused by the steep shape of the standard normal prior. Because MLE also has a relatively small bias in the direction opposite to that of the Bayesian methods, it was hypothesized that if a flatter prior distribution is specified, the Bayesian estimates can also be relatively unbiased. The purposes of this paper are to study the effect of different specifications of the prior distribution on the bias of the EAP estimates in CAT, and to search for an optimal prior for a given item pool so that the EAP estimates are essentially unbiased over a relatively wide range of the θ scale. The relationship between the characteristics of the item pool and the shape of the optimal prior will be investigated. Another purpose of the study is to examine the possible effects of practical constraints such as content balancing and item exposure rate control on the bias of the new EAP method.

EAP was chosen among the Bayesian methods because of its relatively small error and its computational simplicity compared with the other Bayesian methods, even though a similar idea can be applied to MAP; that is, a flatter prior distribution can be applied to MAP to reduce its bias. Because OWEN was specifically designed to have a standard normal prior, however, this idea does not apply to the OWEN method.

The ability estimation methods, in particular MLE and the standard EAP method, are described below.

Maximum likelihood estimation: MLE is widely used for parameter estimation in many statistical applications. In the context of item response theory (IRT) ability estimation, given a response vector u to a set of items with known parameters, the likelihood function is
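For dichotomously scored items under the usual local independence assumption, this likelihood is conventionally written in the form sketched below. The symbols $n$ (number of administered items), $u_i$ (the 0/1 response to item $i$), and $P_i(\theta)$ (the item response function of item $i$) are notation assumed here for illustration; the specific IRT model is not stated in this part of the text.

```latex
L(\mathbf{u} \mid \theta) = \prod_{i=1}^{n} P_i(\theta)^{u_i} \, \bigl[ 1 - P_i(\theta) \bigr]^{1 - u_i}
```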

Iterative numerical methods such as the Newton-Raphson method can be used to find the θ value that maximizes the likelihood function. Asymptotically, the variance of the ML estimate can be approximated by the inverse of the test information function.
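The Newton-Raphson iteration and the information-based variance approximation described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in the study: it assumes a two-parameter logistic (2PL) model with scaling constant D = 1.7, hypothetical item parameters, and a clamp of θ to [-4, 4] so that all-correct or all-incorrect patterns (whose MLE is not finite) still return a value.

```python
import math

D = 1.7  # logistic scaling constant (an assumption of this sketch)

def p_2pl(theta, a, b):
    """Probability of a correct response under the assumed 2PL model."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def mle_theta(responses, items, theta0=0.0, tol=1e-8, max_iter=50, bound=4.0):
    """Newton-Raphson MLE of theta; returns (theta_hat, approximate SE)."""
    theta = theta0
    for _ in range(max_iter):
        score = 0.0  # first derivative of the log-likelihood
        info = 0.0   # negative second derivative (test information for 2PL)
        for u, (a, b) in zip(responses, items):
            p = p_2pl(theta, a, b)
            score += D * a * (u - p)
            info += (D * a) ** 2 * p * (1.0 - p)
        step = score / info
        # Clamp to the assumed theta range to handle non-finite MLEs.
        theta = max(-bound, min(bound, theta + step))
        if abs(step) < tol:
            break
    # Asymptotic SE: inverse square root of the test information at theta_hat.
    info = sum((D * a) ** 2 * p_2pl(theta, a, b) * (1.0 - p_2pl(theta, a, b))
               for a, b in items)
    return theta, 1.0 / math.sqrt(info)

# Hypothetical (a, b) item parameters and a mixed response pattern.
items = [(1.0, -1.0), (1.2, 0.0), (0.8, 1.0), (1.5, 0.5)]
theta_hat, se_hat = mle_theta([1, 1, 0, 0], items)
```

For the 2PL model the observed and expected information coincide, so this Newton-Raphson update is identical to Fisher scoring; that would not hold for a model with a guessing parameter.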

In the context of CAT, the approximation may not be sufficiently accurate because the test length of a CAT test is supposed to be relatively short. Warm (1989) and Wang (1995) found with simulation

targeted at an examinee’s true ability level, the bias will be close to zero because the term in the parentheses will be close to zero. If the ability level is higher than the average item difficulty level, the bias will be positive; likewise, if the ability level is lower than the average item difficulty level, the bias will be negative.

The expected a posteriori estimation: In the context of ability estimation in IRT, we have

$$\hat{\theta} = \frac{\sum_{k=1}^{q} X_k \, L(X_k) \, W(X_k)}{\sum_{k=1}^{q} L(X_k) \, W(X_k)},$$

where X_k is one of q quadrature points, W(X_k) is the weight associated with that quadrature point, and L(X_k) is the likelihood function evaluated at that quadrature point. With this procedure, the EAP estimates reduce to summations and do not require iterative processing. Unlike the OWEN method, the EAP method evaluates the actual posterior distribution directly, so at least logically the EAP method is superior to the OWEN method. Bock & Mislevy (1982) pointed out that among all possible estimators, EAP has the smallest mean square error over the population for which the distribution of ability is specified by the prior. The biases of the Bayesian methods all point toward the middle of the θ scale if a standard normal prior is used, and the shape of the prior distribution affects the magnitude of the bias. In large-scale standardized testing, it is often not realistic to use actual prior information about the individual examinee to form the prior. Therefore it is common practice to use the standard normal distribution as the prior for every examinee. The bias of the Bayesian estimates represents a regression effect toward the group mean, which is undesirable in most standardized testing settings.
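The quadrature summation described above, a ratio of two weighted sums over the points X_k, can be sketched in a few lines. The sketch assumes a 2PL model with D = 1.7, 41 equally spaced quadrature points on [-4, 4], weights proportional to the standard normal density, and hypothetical item parameters; none of these specifics are taken from the study.

```python
import math

D = 1.7  # logistic scaling constant (an assumption of this sketch)

def p_2pl(theta, a, b):
    """Probability of a correct response under the assumed 2PL model."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def likelihood(theta, responses, items):
    """L(theta): product over items of P^u * (1 - P)^(1 - u)."""
    like = 1.0
    for u, (a, b) in zip(responses, items):
        p = p_2pl(theta, a, b)
        like *= p if u == 1 else 1.0 - p
    return like

def standard_normal_weights(points):
    """Weights W(X_k) proportional to the standard normal density."""
    dens = [math.exp(-0.5 * x * x) for x in points]
    total = sum(dens)
    return [d / total for d in dens]

def eap_theta(responses, items, points, weights):
    """EAP estimate: sum(X_k L(X_k) W(X_k)) / sum(L(X_k) W(X_k))."""
    num = 0.0
    den = 0.0
    for x, w in zip(points, weights):
        lw = likelihood(x, responses, items) * w
        num += x * lw
        den += lw
    return num / den

points = [-4.0 + 0.2 * k for k in range(41)]
weights = standard_normal_weights(points)
items = [(1.0, -1.0), (1.2, 0.0), (0.8, 1.0), (1.5, 0.5)]  # hypothetical (a, b)
theta_all_correct = eap_theta([1, 1, 1, 1], items, points, weights)
```

Unlike MLE, the estimate stays finite for an all-correct pattern because the prior weights pull the posterior back toward the prior mean, which is the same mechanism that produces the inward bias.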

In the new EAP method, a flatter prior distribution is used instead of the standard normal prior. The goal is to make the new EAP method as good as MLE in terms of bias while keeping its RMSE similar to that of the standard EAP. With this new EAP method, the prior distribution no longer aims at reflecting any prior information about the examinees' ability but only serves as a tool for achieving technical quality such as less bias. For this reason, such priors can be referred to as uninformative priors.

In choosing such flatter priors, there are many different options. One option is the normal distribution with variance greater than one. But because the magnitude of the bias of EAP and other Bayesian methods was found to be generally asymmetric around the middle of the θ scale (cf. Wang, 1995), the normal distribution is considered undesirable because of its symmetry. The family of beta distributions was considered the best option for this situation because of its flexibility in shape. Let this beta distribution be denoted as g(θ|α, β, l, u), where α, β, l, and u are the four parameters that characterize the distribution: the first two characterize the shape, and the last two are the lower and upper bounds of the distribution. The probability density function of this distribution can be expressed as (Johnson & Kotz, 1970; Hanson, 1991)
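The standard form of the four-parameter beta density, which is presumably the expression the cited references give, is sketched below. Here $B(\alpha, \beta)$ denotes the beta function, a symbol that is not defined in the surrounding text and is introduced only for this illustration.

```latex
g(\theta \mid \alpha, \beta, l, u) =
  \frac{(\theta - l)^{\alpha - 1} \, (u - \theta)^{\beta - 1}}
       {B(\alpha, \beta) \, (u - l)^{\alpha + \beta - 1}},
  \qquad l \le \theta \le u .
```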

The shape of this distribution is symmetric when α equals β and asymmetric otherwise. When α is greater than β, it is negatively skewed; when α is less than β, it is positively skewed. The smaller α and β are, the flatter the shape is. Hanson (1991) presented formulas for computing the mean, variance, skewness, and kurtosis of the distribution from the values of these four parameters. The main task of this study is to find a way of choosing the four parameters so that the resulting EAP estimates are essentially unbiased along a wide range of the θ scale.
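The shape properties stated above can be checked numerically with a small sketch. It assumes the standard four-parameter beta density with shape parameters of at least one, and it computes the moments by a simple midpoint rule rather than by Hanson's (1991) closed-form formulas; the function names and the parameter values are hypothetical.

```python
import math

def beta4_pdf(theta, alpha, beta, lower, upper):
    """Four-parameter beta density (assumes alpha >= 1 and beta >= 1)."""
    if theta < lower or theta > upper:
        return 0.0
    log_b = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    kernel = (theta - lower) ** (alpha - 1.0) * (upper - theta) ** (beta - 1.0)
    return kernel / (math.exp(log_b) * (upper - lower) ** (alpha + beta - 1.0))

def numeric_moments(alpha, beta, lower, upper, n=2000):
    """Mean, variance, and skewness by midpoint-rule integration."""
    h = (upper - lower) / n
    xs = [lower + (k + 0.5) * h for k in range(n)]
    ws = [beta4_pdf(x, alpha, beta, lower, upper) * h for x in xs]
    total = sum(ws)
    mean = sum(x * w for x, w in zip(xs, ws)) / total
    var = sum((x - mean) ** 2 * w for x, w in zip(xs, ws)) / total
    skew = sum((x - mean) ** 3 * w for x, w in zip(xs, ws)) / total / var ** 1.5
    return mean, var, skew

# alpha > beta: negatively skewed; alpha < beta: positively skewed.
mean_neg, var_neg, skew_neg = numeric_moments(3.0, 2.0, -4.0, 4.0)
mean_pos, var_pos, skew_pos = numeric_moments(2.0, 3.0, -4.0, 4.0)
# Smaller shape parameters give a flatter (higher-variance) prior.
_, var_flat, _ = numeric_moments(1.0, 1.0, -4.0, 4.0)
_, var_peaked, _ = numeric_moments(2.0, 2.0, -4.0, 4.0)
```

With α = β = 1 the density is uniform on [l, u], the flattest member of the family when both shape parameters are at least one.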

In CAT the bias of the Bayesian estimates is affected not only by the shape of the prior distribution but also by the characteristics of the item pool, such as the number of items and the discrimination values within different strata of difficulty levels (Wang, 1995). Therefore, a universally applicable prior distribution that produces the least biased estimates for all types of item pools cannot be found, and the search for such a prior distribution is specific to a particular item pool. Because many different aspects of an item pool are expected to influence the bias of the EAP estimates (Wang, 1995), it is not expected that the parameters of the beta prior can be determined quantitatively from a few indexes of the item pool characteristics. These aspects may include the pool size, the mean discrimination parameter value, the distribution of the difficulty parameters, and the number of items and the mean discrimination value within each stratum of the difficulty levels. For this reason, a trial-and-error approach with simulations will be used to find the parameter values of the beta prior that yield the estimates with the smallest bias for a particular pool. The parameter values thus found, however, will be examined in relation to the characteristics of the item pool. This process will be repeated across several item pools with different characteristics, with the goal of finding general relationships between the parameters of the beta prior distribution and the characteristics of the item pools.
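The trial-and-error search described above can be illustrated with a deliberately simplified sketch. It is not the simulation design of the study: it assumes a fixed six-item form instead of an adaptive test, a 2PL model with D = 1.7, hypothetical item parameters, fixed bounds of -4 and 4, and exact enumeration of all response patterns in place of Monte Carlo replications. All function names are hypothetical.

```python
import math
from itertools import product

D = 1.7  # logistic scaling constant (an assumption of this sketch)
# Hypothetical (a, b) parameters for a fixed six-item form (not adaptive).
ITEMS = [(1.0, -1.5), (1.2, -0.5), (1.0, 0.0), (1.2, 0.5), (1.0, 1.5), (0.8, 0.0)]
POINTS = [-4.0 + 0.2 * k for k in range(41)]
PATTERNS = list(product([0, 1], repeat=len(ITEMS)))

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def pattern_prob(theta, pattern):
    """Probability of a full response pattern given theta."""
    prob = 1.0
    for u, (a, b) in zip(pattern, ITEMS):
        p = p_2pl(theta, a, b)
        prob *= p if u == 1 else 1.0 - p
    return prob

def beta_weights(alpha, beta, lower=-4.0, upper=4.0):
    """Unnormalized beta-prior weights (EAP normalizes); assumes alpha, beta >= 1."""
    return [max(x - lower, 0.0) ** (alpha - 1) * max(upper - x, 0.0) ** (beta - 1)
            for x in POINTS]

def eap(pattern, weights):
    num = den = 0.0
    for x, w in zip(POINTS, weights):
        lw = pattern_prob(x, pattern) * w
        num += x * lw
        den += lw
    return num / den

def bias_at(theta, weights):
    """Exact bias of EAP at a true theta: E[theta_hat | theta] - theta."""
    expected = sum(pattern_prob(theta, p) * eap(p, weights) for p in PATTERNS)
    return expected - theta

def search_prior(candidates, thetas):
    """Return (mean absolute bias, (alpha, beta)) for the best candidate."""
    scored = []
    for alpha, beta in candidates:
        w = beta_weights(alpha, beta)
        mab = sum(abs(bias_at(t, w)) for t in thetas) / len(thetas)
        scored.append((mab, (alpha, beta)))
    return min(scored)

thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]
candidates = [(1.0, 1.0), (2.0, 2.0), (20.0, 20.0)]
best_bias, best_params = search_prior(candidates, thetas)
```

In a setting closer to the one described in the text, `bias_at` would be replaced by replicated CAT simulations on the actual item pool, and the candidate grid would also vary the bounds l and u.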