We report the agreement between faculty and GPT-4 scores and the error-pattern analysis first for the questions scored with the analytic rubric, followed by the same analysis for the questions scored with the holistic rubric. Examples of student responses and GPT-4 feedback for all four questions are presented in Table 1. Weighted kappa, percent agreement, and additional IRR metrics for all four questions and all three iterations are provided in Supplemental Appendix 3.
Analytic Rubric Results

Agreement between Human and AI

The IRR between the human content expert scores and GPT-4's scores is illustrated in Fig. 2. In the first iteration, IRR was substantial for questions 1A (kw=0.65, 95% CI: 0.49–0.81, 55% agreement) and 2A (kw=0.75, 95% CI: 0.63–0.87, 65% agreement). After the scoring rubrics were refined, IRR increased to almost perfect agreement for both questions in iteration 2 (question 1A: kw=0.88, 95% CI: 0.81–0.96, 83% agreement; question 2A: kw=0.85, 95% CI: 0.78–0.92, 68% agreement) and iteration 3 (question 1A: kw=0.94, 95% CI: 0.91–0.98, 86% agreement; question 2A: kw=0.88, 95% CI: 0.81–0.94, 74% agreement).
Fig. 2
Weighted kappa (black bar) and 95% confidence interval (grey box) depict the inter-rater reliability (IRR) between the content expert and GPT-4 for each of three iterations. IRR for the analytic rubric is presented on the left and for the holistic rubric on the right
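For readers who want to reproduce agreement metrics of this kind, a minimal sketch in Python, assuming two aligned vectors of integer scores and linear kappa weighting (the score vectors below are hypothetical placeholders, and the weighting scheme should be checked against the Methods):

    # Sketch: weighted kappa and percent agreement for paired grader scores.
    from sklearn.metrics import cohen_kappa_score

    faculty = [3, 4, 2, 4, 1, 3, 4, 2]  # hypothetical faculty scores
    gpt4 = [3, 3, 2, 4, 1, 2, 4, 3]     # hypothetical GPT-4 scores

    kw = cohen_kappa_score(faculty, gpt4, weights="linear")
    agreement = sum(f == g for f, g in zip(faculty, gpt4)) / len(faculty)
    print(f"weighted kappa = {kw:.2f}, percent agreement = {agreement:.0%}")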
To establish the degree of discrepancy between faculty and GPT-4 scoring, we quantified the difference in point value between the two graders (Fig. 3). Most discrepancies were one-point differences, and discrepancies became smaller with each iteration. After the third iteration, all discrepancies for question 1A were one point, and for question 2A only two student responses had a two-point discrepancy. Paired score data matrices and the distributions of scores assigned by faculty and GPT-4 are provided in Supplemental Appendices 4 and 5, respectively. Discrepancies occurred most often when faculty gave a score of 3 and GPT-4 gave a score of 2. For question 2A, discrepancies also occurred in both directions when one grader gave a score of 3 and the other gave a score of 4. The matrices also show how faculty scores evolved when, on review, faculty agreed with GPT-4. For example, for question 1A, GPT-4 gave a score of zero to four students, whereas faculty initially gave a score of zero only twice. After reviewing these discrepancies, however, faculty agreed that GPT-4 had scored those responses correctly and changed their scores to zero (as seen in iterations 2 and 3).
Fig. 3
The cumulative percentage of agreement, and of discrepancies by size, between faculty and GPT-4 scores during each of three iterations. Discrepancies for the analytic rubric questions are presented on the top and for the holistic rubric questions on the bottom
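The discrepancy sizes summarized in Fig. 3 amount to tallying absolute score differences between the two graders; a minimal sketch, again with hypothetical score vectors:

    # Sketch: tabulating the size of score discrepancies between graders.
    from collections import Counter

    faculty = [3, 4, 2, 4, 1, 3, 4, 2]  # hypothetical faculty scores
    gpt4 = [3, 3, 2, 4, 1, 2, 4, 3]     # hypothetical GPT-4 scores

    sizes = Counter(abs(f - g) for f, g in zip(faculty, gpt4))
    for size in sorted(sizes):
        label = "exact agreement" if size == 0 else f"{size}-point discrepancy"
        print(f"{label}: {sizes[size] / len(faculty):.0%}")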
Question 1A Discrepancies

Determinations of who was incorrect (i.e., GPT-4 or the faculty) and who assigned the higher vs. lower score are shown in Fig. 4. Question 1A presented the case of a male patient (Stephen) who presents to his physician with swelling in his neck and reports a family history of neoplasms. The question asks students to identify the syndrome and its mode of genetic transmission, as well as the probability that his sister and his sister's son (Alex) have the syndrome. The most common reasons for discrepancies are presented in Table 2. GPT-4 was unable to score accurately when students provided multiple possibilities. GPT-4 also often did not recognize when a student provided an incorrect transmission probability for Stephen's sister or for Alex.
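For orientation, the probability arithmetic at stake can be sketched as follows, under the simplifying assumption of an autosomal dominant syndrome with an affected parent common to Stephen and his sister (an illustration only; the study's rubric is in Supplemental Appendix 1):

    % Illustrative only, under the autosomal dominant assumption above
    \begin{align*}
      P(\text{sister has the syndrome}) &= \tfrac{1}{2}, \\
      P(\text{Alex has the syndrome}) &= P(\text{sister}) \times \tfrac{1}{2}
        = \tfrac{1}{2} \times \tfrac{1}{2} = \tfrac{1}{4}.
    \end{align*}

A student who instead conditions on the sister's genotype (e.g., writing 1/2 for Alex if the sister is assumed to be a carrier) produces exactly the kind of "multiple possibilities" response that GPT-4 initially struggled to score.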
Fig. 4
Figure showing how many discrepancies were due to errors in scoring by GPT-4 or faculty, as well as the direction in which the discrepancies occurred. Discrepancies for the analytic rubric are presented on the left and for the holistic rubric on the right
Table 2 Most common reasons for discrepancies in scoring using the analytic rubric and representative examples (in italics)

We found that faculty were overly generous graders, awarding credit for responses that were true but did not answer the question. Students sometimes used confusing language, which also made responses difficult for faculty to score.
There were a few occasions when both faculty and GPT-4 scored incorrectly (iteration 1: N = 5, iteration 2: N = 1, iteration 3: N = 0), and the errors usually reflected different types of oversights. For example, faculty withheld credit because a student did not quantify the sister's probability (writing only that "it is likely that his sister has the mutation") but failed to notice that the student also gave an incorrect probability for her child. GPT-4, in contrast, overlooked both issues and assigned the student a score of 4.
The evolution of the scoring rubric for question 1A is shown in Supplemental Appendix 1. By iteration 3, GPT-4 was able to grade correctly when a student presented multiple scenarios, specific to the sister's genotype, in determining the probability for the child. In the iteration 3 rubric, we added an instruction to award the point only if one of the multiple possibilities presented was correct, which reduced the occurrence of this type of discrepancy. All cases in which GPT-4 missed an incorrect probability for the sister, the child, or both were resolved. We also added "Stephen's sister" and "Alex, the son of the sister" to the rubric to clarify how students might refer to the sister and the nephew, respectively, which appears to have addressed this issue. While cases involving confusing language in iteration 1 were typically resolved, new discrepancies flagged for confusing language appeared in subsequent iterations.
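Rubric refinements of this kind are ultimately delivered to the model as prompt text. As an illustration only, a clause like the one described might be passed to GPT-4 as follows; the client usage and prompt wording here are assumptions, not the study's actual prompt, which is given in Supplemental Appendix 1:

    # Illustrative sketch: sending a refined rubric clause and a student
    # response to GPT-4 for scoring (openai Python client, v1 interface).
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    rubric = (
        "Award 1 point for the probability that Alex, the son of Stephen's "
        "sister, has the syndrome. If the student presents multiple "
        "possibilities, award the point only if one of them is correct."
    )
    student_response = "..."  # a de-identified student answer would go here

    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are an exam grader. Score the "
             "response using only the rubric provided and explain your score."},
            {"role": "user", "content": f"Rubric:\n{rubric}\n\n"
             f"Student response:\n{student_response}"},
        ],
    )
    print(completion.choices[0].message.content)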
Question 2A Discrepancies

Question 2A identified two pathogens (Mycobacterium tuberculosis [Mtb] and Rickettsia rickettsii [Rr]) and asked students to compare their mechanisms for avoiding destruction and to relate their intracellular growth to disease manifestations. The most common reasons for discrepancies are presented in Table 2. GPT-4 overestimated scores when it did not recognize that a response was missing a correct answer or contained a wrong one; the most common error concerned the mechanism by which Rr avoids destruction. GPT-4 also underestimated scores when a student's vocabulary was accurate but absent from the rubric.
Faculty most often scored question 2A incorrectly by mistakenly giving full credit when a student provided only one of the two critical answers (endothelial damage or vasculitis/rash). Faculty also missed minor errors.
When both faculty and GPT-4 erred (iteration 1: N = 6, iteration 2: N = 1, iteration 3: N = 0), the scorers missed incorrect information, but not in a consistent way. In one example, a student swapped the mechanisms by which Mtb and Rr avoid destruction; faculty did not recognize the swap, while GPT-4 missed only that the student gave the incorrect mechanism for Rr avoiding destruction.
The evolution of the scoring rubric for question 2A is shown in Supplemental Appendix 1. The number of times GPT-4 simply overlooked an incorrect or missing answer fluctuated between iterations; this type of error was difficult to address because its source was not always evident. For example, GPT-4 continued to report that a student described the correct mechanism of avoiding destruction even when the response was incorrect or missing. Discrepancies due to vocabulary were reduced in iteration 2, likely because alternative phrases were added to the rubric; however, when these additions were scaled back in iteration 3, this type of error returned. In iteration 2, a model answer and an alternative-language section were added to the rubric provided to GPT-4. However, GPT-4 then deducted one point if students did not explicitly state that "RMSF uses actin-mediated motility to move from one endothelial cell to another," language used in the model answer but not associated with a point value. This issue was not present in iteration 3 because this language was removed from the rubric. During iteration 3, six new discrepancies were identified, five of which were faculty scoring errors in which various incorrect or correct responses were missed.
Holistic Rubric Results

Agreement between Human and AI

Figure 2 illustrates the IRR between faculty and GPT-4 scoring. In iteration 1, IRR was slight for question 1H (kw=0.19, 95% CI: 0.05–0.33, 43% agreement) and substantial for question 2H (kw=0.78, 95% CI: 0.71–0.85, 65% agreement). After refinement of the scoring rubrics and prompts, the IRR for question 1H increased to substantial in iteration 2 (kw=0.64, 95% CI: 0.54–0.73, 63% agreement) but decreased to moderate in iteration 3 (kw=0.54, 95% CI: 0.43–0.66, 61% agreement). The IRR for question 2H increased to almost perfect agreement in iteration 2 (kw=0.88, 95% CI: 0.83–0.93, 80% agreement) and remained stable in iteration 3 (kw=0.89, 95% CI: 0.85–0.93, 73% agreement).
Figure 3 depicts the discrepancies between faculty and GPT-4 scoring. Most were one-point differences. For question 1H, there were few two-point differences and no three-point differences in iteration 2, but both increased in iteration 3. For question 2H, both one- and two-point differences decreased in iteration 2; two-point differences decreased further in iteration 3, but one-point differences increased. Paired score data matrices and the distributions of scores assigned by faculty and GPT-4 are provided in Supplemental Appendices 4 and 6, respectively. Discrepancies occurred most often between scores of 4 and 5 or 5 and 6 points, and faculty were generally more likely than GPT-4 to score a response one point higher.
Question 1H Discrepancies

Question 1H asked students to justify the diagnosis of drug poisoning by connecting the molecular actions of the drug on autonomic nervous system receptors to the observed findings. The most common reasons for discrepancies are presented in Table 3. GPT-4 overestimated scores when it did not recognize that responses were missing important information specified in the rubric, such as specific receptor locations or types, or a description of dominant tone. GPT-4 underestimated scores when it deducted additional points for acceptable errors, some of which the rubric had already specified as acceptable, particularly errors in ANS ganglia nomenclature that did not interfere with demonstrating overall understanding. GPT-4 also underestimated scores when it looked for specific rubric language that was absent from a student's response. GPT-4 was never able to identify a response that should have been scored a 2 (N = 3) and was only somewhat accurate at identifying a 3 (iteration 2: N = 4/6, iteration 3: N = 3/6).
Table 3 Most common reasons for discrepancies in scoring using the holistic rubric and representative examples (in italics)

When faculty scores were incorrect in iteration 1, the errors were often similar to GPT-4's: faculty overestimated scores when they did not recognize that student answers were missing important information or contained incorrect information.
When both faculty and GPT-4 were incorrect (iteration 1: N = 10, iteration 2: N = 0, iteration 3: N = 0), faculty typically missed an incorrect response while GPT-4 was too strict, or vice versa. On two occasions, both faculty and GPT-4 missed an incorrect response, giving higher scores than deserved.
The evolution of the scoring rubric for each iteration of question 1H is shown in Supplemental Appendix 2. Given the errors noted after iteration 1, it was apparent that the rubric needed more guidance to differentiate between scores of 5 and 6, so examples of responses acceptable for a 6 were provided. Distinctions between 4 and 5 were also emphasized, with wording in the rubric modified to prevent GPT-4 from assigning a 4 because of specific vocabulary (i.e., "prolonged" was removed from "activation of Nn receptors" and "all" from "SANS and PANS ganglia" in the criteria for a score of 4). Adding a model answer in iteration 3 proved counterproductive, resulting in lower reliability. Many of the persistent discrepancies involved feedback from GPT-4 about clarity of writing, which was difficult to address because the feedback was vague (e.g., "minor issues with clarity and terminology").
Question 2H Discrepancies

Question 2H consisted of three subparts: (A) calculating an IV maintenance dose, (B) converting the IV dose to an oral dose given every 6 h, and (C) an application in which students had to identify that reduced renal function leads to reduced clearance, requiring a dose adjustment. The most common reasons for discrepancies are presented in Table 3. GPT-4 had difficulty assigning the correct score even after properly identifying an incorrect response in its feedback. GPT-4 was occasionally too strict, deducting one point if extraneous information was included or if a student used language describing lowering the dose but did not explicitly state that the dose should be lowered.
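For context on the arithmetic in parts A and B, a worked sketch using the standard maintenance-dose relationships, with hypothetical parameter values (clearance CL, target steady-state concentration Css, dosing interval tau, and oral bioavailability F; the question's actual numbers are not reproduced here):

    % Hypothetical values: CL = 5 L/h, Css = 10 mg/L, tau = 6 h, F = 0.5
    \begin{align*}
      \text{IV maintenance rate: } R_0 &= CL \times C_{ss}
        = 5~\text{L/h} \times 10~\text{mg/L} = 50~\text{mg/h}, \\
      \text{Oral dose per interval: } D &= \frac{CL \times C_{ss} \times \tau}{F}
        = \frac{50~\text{mg/h} \times 6~\text{h}}{0.5} = 600~\text{mg every 6 h}.
    \end{align*}

Omitting the factor of tau (the 6-h interval) in the oral conversion reproduces the most common calculation error described below.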
The most frequent faculty error was not deducting enough points for an incorrect answer to part C when students did not mention renal dysfunction, an omission that warranted a score of 4. In contrast, faculty were too strict when students got multiple parts of the question wrong, including a calculation error (part B) and a clinical reasoning error (part C). The calculation error was the same in most cases: students did not multiply the oral dose by 6 h. Faculty were also sometimes overly generous, allowing minor calculation errors.
In all seven cases where both faculty and GPT-4 scored incorrectly during iteration 1, faculty gave a score of 5 and GPT-4 gave a score of 3. In each case, the only error identified was the failure to explicitly point out renal dysfunction, which meant that the highest score the response could achieve was a 4. Errors by faculty and GPT-4 diminished over subsequent iterations (iteration 2: N = 2, iteration 3: N = 0).
The evolution of the scoring rubric for question 2H is shown in Supplemental Appendix 2. Language was added to the rubric to clearly define the criteria for each score. In addition, an "important considerations" section was included after iteration 1, and three model answers were included in iterations 2 and 3. After iteration 1, GPT-4 matched the feedback to the score more accurately; however, it continued to struggle with this mismatch at low scores (i.e., 1–3). Discrepancies due to strict scoring were resolved by iteration 2 but expanded again in iteration 3. More than half of these discrepancies involved responses that had been mis-scored by faculty, or by both GPT-4 and faculty, in previous iterations, and they were concentrated on part C errors (e.g., giving a score of 3 when a student did not mention renal dysfunction). In iterations 2 and 3, GPT-4 missed minor incorrect responses, such as miscalculations in part A or failure to calculate the oral dose for a 6-h interval in part B. Student responses that did not mention renal dysfunction and were incorrectly scored by faculty in iteration 1 were correctly scored by GPT-4 in iteration 2 or 3.