Using Machine Learning for Early Identification of At-Risk Students in Undergraduate Biology Courses

Authors

  • Danielle Graham Department of Biological and Forensic Sciences, Fayetteville State University, Fayetteville, NC 28301, USA
  • Justin Graham Department of Biological and Forensic Sciences, Fayetteville State University, Fayetteville, NC 28301, USA
  • Willietta Gibson Department of Biological and Forensic Sciences, Fayetteville State University, Fayetteville, NC 28301, USA
  • Khalid Lodhi Department of Biological and Forensic Sciences, Fayetteville State University, Fayetteville, NC 28301, USA
  • Jiazheng Yuan Department of Biological and Forensic Sciences, Fayetteville State University, Fayetteville, NC 28301, USA
  • Lieceng Zhu Department of Biological and Forensic Sciences, Fayetteville State University, Fayetteville, NC 28301, USA
  • My Abdelmajid Kassem Plant Genomics and Bioinformatics Lab, Department of Biological and Forensic Sciences, Fayetteville State University, Fayetteville, NC 28301, USA

DOI:

https://doi.org/10.5147/ajse.263

Keywords:

Machine learning in education, Educational data mining, Biology performance prediction, STEM analytics, Student outcome forecasting, SHAP interpretability, Academic early warning systems, Predictive modeling in higher education, Interpretable machine learning, Course-level analytics, STEM retention

Abstract

The increasing availability of educational data has created new opportunities to apply machine learning (ML) for predicting student outcomes, particularly in STEM disciplines where early identification of academic risk is essential for improving retention and performance. This study investigates the use of supervised ML algorithms to predict final exam performance in undergraduate biology courses, leveraging earlier assessment scores—Exam 1, Midterm, and Exam 3—as predictive features. The dataset comprises 500 student records drawn from five biology courses (BIOL150, BIOL210, BIOL310, BIOL330, and BIOL499), representing a spectrum of instructional levels from introductory to advanced capstone experiences. Four ML models were implemented and compared: Linear Regression, Random Forest, Support Vector Regressor (SVR), and XGBoost. These models were evaluated using standard regression metrics, including Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and R². Among baseline models, Linear Regression demonstrated the highest explanatory power (R² = 0.39), while tree-based models showed competitive performance and further improvement after hyperparameter tuning. Feature importance analysis using both tree-based measures and SHAP (SHapley Additive exPlanations) values revealed that Midterm and Exam 3 scores were consistently the strongest predictors of final exam performance, whereas Exam 1 had lower predictive influence. The findings suggest that mid-course assessments provide a valuable window for identifying students at risk of underperformance, allowing for timely, targeted interventions. The use of interpretable ML models further enables actionable feedback for educators, aligning predictive outcomes with pedagogical decisions. By focusing specifically on biology—a domain underrepresented in educational data mining—this study contributes a subject-specific framework for academic early warning systems. The results support broader adoption of data-driven approaches in higher education and provide a scalable model for integrating predictive analytics into biology instruction and curriculum planning.
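The modeling workflow described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes scikit-learn, XGBoost, and SHAP, and uses hypothetical column names (exam1, midterm, exam3, final) and a hypothetical data file for the 500 student records.

```python
# Illustrative sketch of the abstract's pipeline: train four regressors on
# earlier assessment scores, evaluate with MAE/RMSE/R2, and inspect feature
# importance with SHAP. File name and column names are assumptions.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from xgboost import XGBRegressor

# Hypothetical dataset: one row per student, earlier exam scores plus final exam score
df = pd.read_csv("biology_scores.csv")
X = df[["exam1", "midterm", "exam3"]]
y = df["final"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=42),
    "SVR": SVR(),
    "XGBoost": XGBRegressor(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, pred)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    r2 = r2_score(y_test, pred)
    print(f"{name}: MAE={mae:.2f}, RMSE={rmse:.2f}, R2={r2:.2f}")

# SHAP values for a tree-based model show how much each assessment
# contributes to the predicted final exam score.
explainer = shap.TreeExplainer(models["XGBoost"])
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
```

In a sketch like this, a SHAP summary plot ranking Midterm and Exam 3 above Exam 1 would correspond to the feature importance pattern reported in the abstract.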

Published

2025-08-02 — Updated on 2025-08-02

Section

ARTICLES