20090103

Education research: analysis of multiple-choice questions

Definitions (from Aubrecht & Aubrecht, 1983):
  • Difficulty level (i.e., correct response rate), where 0.0 means no correct responses and 1.0 means all responses correct. According to Aubrecht and Aubrecht, a four-part multiple-choice question (the format used in both the Astronomy 210 and Physics 205A courses at Cuesta College) should ideally have a difficulty level of 0.625, which is halfway between the chance score (0.25) and a perfect score (1.0).
  • Index of discrimination, measuring how much more likely the best students are to answer correctly than the worst students, using the following algorithm from Aubrecht and Aubrecht:
    Using the scores on the test as a whole, the testmaker first orders the students' papers from highest to lowest. The commonly agree-upon [sic] practice is then to define the top 27% of the papers as the "high" group and the bottom 27% as the "low" group. The index of discrimination is then calculated by subtracting the number correct in the low (C_L) from the number correct in the high group (C_H) and then dividing by the number of students in either group (N).
Note that due to tied scores, there can be different numbers of students in the "low" and "high" groups, N_L and N_H, respectively. In this case, the nearest equivalent calculation of the index of discrimination, used here, is (C_H/N_H) - (C_L/N_L).
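
The definitions above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the data layout, and the tie-handling rule (every paper tied with a cutoff score joins that group) are hypothetical choices, not taken from Aubrecht and Aubrecht.

```python
def ideal_difficulty(n_choices):
    """Midpoint between the chance score (1/n_choices) and a perfect score (1.0)."""
    return (1.0 / n_choices + 1.0) / 2.0


def difficulty_level(item_correct):
    """Fraction of students answering the question correctly (0.0 to 1.0)."""
    return sum(item_correct) / len(item_correct)


def discrimination_index(total_scores, item_correct, fraction=0.27):
    """Return (C_H / N_H) - (C_L / N_L) for one question.

    total_scores: whole-test score for each student.
    item_correct: True/False per student for the question being analyzed.
    Assumption: papers tied with a cutoff score are all kept in that group,
    so N_H and N_L may differ from each other.
    """
    n = len(total_scores)
    k = max(1, round(fraction * n))        # nominal group size (27% of papers)
    ranked = sorted(total_scores, reverse=True)
    high_cutoff = ranked[k - 1]            # lowest score still in the "high" group
    low_cutoff = ranked[n - k]             # highest score still in the "low" group

    high = [c for s, c in zip(total_scores, item_correct) if s >= high_cutoff]
    low = [c for s, c in zip(total_scores, item_correct) if s <= low_cutoff]
    return sum(high) / len(high) - sum(low) / len(low)


# Made-up example: ten students, one question.
scores = [95, 90, 85, 80, 75, 70, 65, 60, 55, 50]
correct = [True, True, True, False, True, False, True, False, True, False]

print(ideal_difficulty(4))                    # 0.625 for a four-part question
print(difficulty_level(correct))              # 6 of 10 students correct
print(discrimination_index(scores, correct))  # 3/3 in "high" minus 1/3 in "low"
```

With only ten students the 27% groups contain three papers each; with tied scores at a cutoff, the groups simply grow to include every tied paper.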

Aubrecht and Aubrecht cite a minimum index of discrimination of +0.3 for a multiple-choice question to be considered effective enough for future ("file") use. (A discrimination index of zero means that the "low" and "high" groups are equally likely to answer correctly, and a negative discrimination index means that the "low" group answered correctly more often than the "high" group!)

The difficulty level for Astronomy 210 questions reflects scoring with partial credit for multiple-choice, where students recover one-eighth of the full credit for at least identifying an incorrect response, but only if they were unable to successfully identify the correct response. Thus the difficulty levels of these questions are expected to be at most 0.125 higher than they would be without partial credit. The scoring of Physics 205A questions was more conventional, and did not have this partial credit scheme.
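
As a rough illustration of the 0.125 bound, the sketch below scores each response as 1 for the correct answer, 1/8 for identifying an incorrect response without finding the correct one, and 0 otherwise. The function name, the outcome labels, and the sample data are hypothetical; the actual Astronomy 210 grading procedure is not spelled out here.

```python
PARTIAL_CREDIT = 1 / 8  # one-eighth of full credit, as described above


def difficulty_with_partial_credit(outcomes):
    """Mean score per student; each outcome is 'correct', 'partial', or 'none'."""
    points = {"correct": 1.0, "partial": PARTIAL_CREDIT, "none": 0.0}
    return sum(points[o] for o in outcomes) / len(outcomes)


# Made-up class of ten students answering one question.
outcomes = ["correct"] * 5 + ["partial"] * 3 + ["none"] * 2

plain = outcomes.count("correct") / len(outcomes)    # no partial credit
boosted = difficulty_with_partial_credit(outcomes)   # with partial credit
increase = boosted - plain

# At most the (1 - plain) fraction who missed the question can earn 1/8 each.
assert increase <= PARTIAL_CREDIT * (1 - plain) + 1e-12
```

If a fraction p of students answer correctly, only the remaining 1 - p can earn partial credit, so the increase is at most 0.125(1 - p), which never exceeds 0.125.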

Hypotheses:
  1. There is a correlation between difficulty level and index of discrimination.
  2. Most multiple-choice questions have appropriate difficulty levels and effective discrimination indices.
  3. Questions with difficulty levels that are too low or too high do not effectively discriminate between "low" and "high" groups.
Data sets:
Multiple-choice questions from quizzes and exams, Astronomy 210 (SLO and NCC campuses) and Physics 205A, Cuesta College, San Luis Obispo, CA.

A total of 100 multiple-choice questions. Linear and quadratic fits to the data yield R^2 factors of 0.0121 and 0.2551, respectively.


A total of 100 multiple-choice questions (two questions with negative discrimination indices lie off of this graph, but were still included in the linear and the quadratic fit statistics). Linear and quadratic fits to the data yield R^2 factors of 0.069 and 0.3025, respectively.


A total of 79 multiple-choice questions. Linear and quadratic fits to the data yield R^2 factors of 0.0936 and 0.2583, respectively.
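
For readers who want to reproduce this kind of fit statistic, here is a minimal sketch of a least-squares polynomial fit and its R^2 value using only the Python standard library. The function name and the sample data are hypothetical, and this is not necessarily how the fits quoted above were computed.

```python
def polyfit_r_squared(x, y, degree):
    """Least-squares polynomial fit; returns (coefficients, R^2).

    Coefficients are ordered from the constant term upward.
    """
    m = degree + 1
    # Normal equations A c = b, with A[i][j] = sum(x^(i+j)), b[i] = sum(y * x^i).
    a = [[sum(xi ** (i + j) for xi in x) for j in range(m)] for i in range(m)]
    b = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in range(m)]

    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, m):
            factor = a[row][col] / a[col][col]
            for j in range(col, m):
                a[row][j] -= factor * a[col][j]
            b[row] -= factor * b[col]

    # Back substitution.
    coeffs = [0.0] * m
    for i in reversed(range(m)):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, m))
        coeffs[i] = (b[i] - tail) / a[i][i]

    # R^2 = 1 - (residual sum of squares) / (total sum of squares).
    mean_y = sum(y) / len(y)
    predicted = [sum(c * xi ** k for k, c in enumerate(coeffs)) for xi in x]
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return coeffs, 1.0 - ss_res / ss_tot


# Made-up, perfectly parabolic data peaking at a difficulty level of 0.5.
difficulty = [0.1, 0.3, 0.5, 0.7, 0.9]
discrimination = [4 * d * (1 - d) for d in difficulty]

_, r2_linear = polyfit_r_squared(difficulty, discrimination, 1)
_, r2_quadratic = polyfit_r_squared(difficulty, discrimination, 2)
```

For this symmetric, peaked sample the linear R^2 is essentially zero while the quadratic R^2 is essentially one, an exaggerated version of the pattern in the fit statistics quoted above.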


Discussion:
Graphs of all three data sets show similar peaks for discrimination indices at mid-level difficulty levels. No significant linear correlation exists for any of the three data sets, and the quadratic correlations are only weak. A Gaussian distribution could possibly be fit to the data, but this would probably also be a weak correlation, as the data points appear to be bounded by a Gaussian distribution rather than lying strictly along one.

The green boxes represent the ideal difficulty levels and discrimination indices for four-part multiple-choice questions; only a minority of questions lie above the 0.625 difficulty level, but most questions lie above the minimum 0.3 index of discrimination.

Conclusions:
  1. There is only a rough peaked correlation between difficulty level and index of discrimination.
  2. Only some multiple-choice questions have appropriate difficulty levels, but most questions have effective discrimination indices.
  3. Questions with mid-level difficulties most effectively discriminate between "low" and "high" groups.
Future goals:
  • Minimize the number of multiple-choice questions that have difficulty levels that are too low or too high (i.e., narrowing the horizontal distribution of data points) while maintaining a mean difficulty level of approximately 0.625 (midway between random guessing and perfect correct response rates).
  • Increase the discrimination indices of questions, especially those near the mean difficulty level (i.e., causing the spread of data points to lie along a Gaussian distribution, rather than be bounded by a Gaussian distribution).
  • Measure validity of selected multiple-choice questions in subsequent semesters, or between different sections/campuses.
  • Compare difficulty levels and discrimination indices of this instructor with those of other instructors at Cuesta College and other institutions for comparable courses.
  • Fit a Gaussian distribution to the data, to compare correlation statistics with quadratic function fits. (This should be done after the data begin to lie along a Gaussian distribution rather than being bounded by a Gaussian distribution.)
  • Identify how partial credit for multiple-choice affects the difficulty level of Astronomy 210 questions, and how this may be correlated with the index of discrimination.
Reference:
Gordon J. Aubrecht II and Judith D. Aubrecht, "Constructing objective tests," American Journal of Physics Volume 51, Issue 7 (July 1983), pp. 613-620.
