A new University of Georgia study finds that artificial intelligence tools like large language models could speed up grading but aren’t yet reliable enough to replace human teachers. Researchers tested Mixtral, an open-source LLM, on middle school students’ written responses to science questions that asked them to model particle behavior during heat transfer.
The AI generated its own rubric and graded quickly, but it often relied on shortcuts, flagging keywords rather than evaluating the reasoning behind an answer. Without detailed, human-written rubrics, Mixtral scored responses accurately just 33.5% of the time. With clear grading guidelines, that figure rose to 50%.
Lead author Xiaoming Zhai, director of UGA’s AI4STEM Education Center, said teachers could benefit from AI if its accuracy improves. “Many teachers told me they had more time for meaningful work,” Zhai noted. The study, published in Technology, Knowledge and Learning, highlights AI’s potential in education but cautions against removing humans from the process.