Exploring the Potential: Can AI Effectively Mark Students' Work?

The advent of Artificial Intelligence (AI) has sparked a transformation across various sectors, and education is no exception. With the development of Large Language Models (LLMs) like OpenAI's GPT-4, there's growing interest in leveraging AI to enhance educational processes. One area that has garnered significant attention is the potential for AI to mark students' work. But can AI truly match or surpass human educators in assessing student performance? This blog explores this question, delving into recent research, practical examples, and the ethical considerations surrounding AI-assisted marking.

Automated grading isn't new: systems like ETS’s e-rater® have been used to score essays on tests like the GRE and TOEFL [1]. Now, EdTech providers like sAInaptic and Graide are using AI to mark beyond multiple-choice questions. If you haven’t checked them out yet, you should!

How AI Marking Works

AI marking systems typically employ Natural Language Processing (NLP) techniques to evaluate written responses. They analyse textual features such as syntax, semantics, and discourse structure. Machine learning algorithms are trained on large datasets of human-graded essays to learn scoring patterns.
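
Below is a minimal sketch of what such a pipeline can look like in practice, assuming scikit-learn and a tiny, purely illustrative set of human-graded answers; production systems use far larger datasets and richer features, but the idea of learning scoring patterns from graded examples is the same.

```python
# Minimal sketch of a classic automated-marking pipeline: turn human-graded
# responses into textual features and learn a scoring pattern from them.
# Assumes scikit-learn is installed; the tiny dataset below is purely illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Human-graded training data: (response text, mark awarded out of 5)
responses = [
    "Photosynthesis converts light energy into chemical energy in chloroplasts.",
    "Plants eat sunlight to grow big.",
    "Chloroplasts use light, water and carbon dioxide to make glucose and oxygen.",
]
marks = [5, 2, 5]

# Learn scoring patterns from surface features (words and phrases weighted by TF-IDF)
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), Ridge(alpha=1.0))
model.fit(responses, marks)

# Predict a mark for a new, unseen response
new_response = ["Light energy is turned into glucose inside the chloroplast."]
print(f"Predicted mark: {model.predict(new_response)[0]:.1f} / 5")
```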

Advantages

  • Speed and Efficiency: AI can grade a vast number of responses in a fraction of the time it would take a human.
  • Consistency: Machines are not subject to fatigue or bias, potentially offering more consistent evaluations.
  • Immediate Feedback: Students can receive instant feedback, which is crucial for learning.

Limitations

  • Understanding Nuance: AI may struggle with creativity, humour, or cultural references in student responses.
  • Bias in Algorithms: If trained on biased data, AI can perpetuate those biases in marking.
  • Lack of Transparency: AI decision-making processes can be opaque, making it hard to understand marking rationales.

Recent Research and Findings

A notable study titled "Can Large Language Models Make the Grade?" explored the effectiveness of LLMs in marking student work [2]. The research involved:

  • Methodology:
    • Selection of Questions: Twelve questions from history and science, covering Key Stages 2, 3, and 4 and varying in difficulty, were chosen.
    • Data Collection: 1,710 student responses that were not verbatim correct were collected, deliberately introducing ambiguity into the marking task.
    • Human Grading: Nearly 40 teachers participated in blind grading sessions, with two teachers marking each response.
    • AI Grading: The same responses were fed into several LLMs, including GPT-4, using minimal prompting and a temperature setting of zero to reduce randomness (a rough sketch of this kind of setup appears after this list).
  • Findings:
    • Accuracy: GPT-4 showed the strongest performance among the models tested.
    • Agreement with Teachers: Human teachers agreed with one another on grades 87% of the time, while GPT-4 agreed with teachers 85% of the time.
    • Efficiency: Teachers spent approximately 11 hours grading, whereas GPT-4 completed the task in about 2 hours.
    • Consistency Across Questions: GPT-4's performance was relatively stable across different subjects and difficulty levels.
  • Limitations:
    • Scope of Questions: Only short-answer questions were used; the AI's performance on longer, more complex responses remains uncertain.
    • Edge Cases: Discrepancies between AI and human grading often occurred in nuanced situations where multiple interpretations were valid.
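
To make the AI-grading step above concrete, here is a rough sketch of that kind of setup, assuming the OpenAI Python client and an OPENAI_API_KEY in the environment; the prompt wording and mark scheme below are illustrative placeholders, not the study's actual materials.

```python
# Rough sketch of LLM-based marking with minimal prompting and temperature 0.
# Assumes the openai package (v1+) and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

def mark_short_answer(question: str, mark_scheme: str, student_answer: str) -> str:
    """Ask the model to mark one short-answer response against a mark scheme."""
    prompt = (
        "You are marking a short-answer question.\n"
        f"Question: {question}\n"
        f"Mark scheme: {mark_scheme}\n"
        f"Student answer: {student_answer}\n"
        "Reply with only 'correct' or 'incorrect'."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduce randomness so repeated marking is as stable as possible
    )
    return response.choices[0].message.content.strip().lower()

print(mark_short_answer(
    question="Name the process by which plants make glucose.",
    mark_scheme="Award the mark for 'photosynthesis' (accept minor misspellings).",
    student_answer="fotosynthesis",
))
```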

Ethical Considerations

  • Bias and Fairness: AI systems can inherit biases present in their training data. For example, they might unfairly disadvantage non-native English speakers or students from different cultural backgrounds.
  • Transparency: There's a need for explainable AI that allows educators and students to understand how marking decisions are made.
  • Data Privacy: Utilising student data to train AI models raises concerns about privacy and data protection regulations like GDPR.

The Role of Teachers

While AI shows promise in marking efficiency and consistency, it cannot replace the nuanced understanding that human teachers bring. Educators consider factors like a student's effort, creativity, and personal circumstances—elements that AI might overlook.

AI as an Assistant, Not a Replacement

AI can alleviate administrative burdens, allowing teachers to focus more on instruction and student engagement. For instance, AI can handle initial grading and flag responses that require human attention, as in the simple triage sketch below.
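
As an illustration of that kind of triage, the sketch below routes AI-marked responses either to automatic acceptance or to a teacher's review queue; the confidence scores and threshold are hypothetical placeholders, not values from any particular system.

```python
# Illustrative triage sketch: the AI marks everything first, and anything it is
# unsure about is flagged for a teacher to review. Confidence values and the
# threshold below are hypothetical, not taken from a real marking system.
from dataclasses import dataclass

@dataclass
class MarkedResponse:
    student_id: str
    ai_mark: int          # mark suggested by the AI
    ai_confidence: float  # 0.0 - 1.0, how sure the AI is about that mark

REVIEW_THRESHOLD = 0.8  # below this, a human marker takes a look

def triage(batch: list[MarkedResponse]) -> tuple[list[MarkedResponse], list[MarkedResponse]]:
    """Split a batch into auto-accepted marks and responses needing human review."""
    auto_accepted = [r for r in batch if r.ai_confidence >= REVIEW_THRESHOLD]
    needs_review = [r for r in batch if r.ai_confidence < REVIEW_THRESHOLD]
    return auto_accepted, needs_review

batch = [
    MarkedResponse("s001", ai_mark=3, ai_confidence=0.95),
    MarkedResponse("s002", ai_mark=1, ai_confidence=0.55),  # ambiguous wording
]
accepted, flagged = triage(batch)
print(f"{len(accepted)} auto-marked, {len(flagged)} flagged for teacher review")
```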

Future Outlook

  • Enhanced AI Capabilities: Ongoing advancements in AI could lead to more sophisticated marking systems that better understand context and nuance.
  • Personalised Learning: AI marking can provide immediate, personalised feedback, helping students learn more effectively.
  • Hybrid Models: Combining AI efficiency with human oversight could offer the best of both worlds, ensuring accuracy and fairness.

Conclusion

AI has the potential to improve both efficiency and consistency in marking. Studies like "Can Large Language Models Make the Grade?" show that models like GPT-4 are nearing human-level performance in specific tasks. However, there are limitations and ethical concerns that require a thoughtful, informed approach. As with other areas of AI in education, it should be seen as a tool to support, not replace, the essential role of teachers.

Reference Links

  1. Educational Testing Service. (n.d.). Automated Scoring and Natural Language Processing: About the e-rater® Scoring Engine.
  2. Henkel, O., Hills, L., & Boxer, A. (2023). Can Large Language Models Make the Grade? An Empirical Study Evaluating LLMs Ability To Mark Short Answer Questions in K-12 Education. Proceedings of the Eleventh ACM Conference on Learning @ Scale. Carousel Learning.
