
What is the difference between machine scoring systems and chatbots?
The term Artificial Intelligence (AI) has become a buzzword in businesses, organizations, and the media in general. Many have warned about the dangers of using the phrase as a blanket term for technologies that are not truly AI, as it misleads the public about what AI is and what to expect from it (The AI Buzzword Trap, n.d.). In the same vein, there has been a trend to equate AI with chatbots like ChatGPT, which is not uncommon even among academics (Jordan, 2019). At a recent applied linguistics conference, a total of 40 presentations matched the search term “Artificial Intelligence,” and about 31 of them were about generative AI or chatbots such as ChatGPT.
AI is much bigger than chatbots. In fact, AI encompasses a variety of technologies that enable machines to perform tasks requiring human-like intelligence. Self-driving cars are a real-world example of AI beyond chatbots. Just as a human learns to drive safely, the machine is trained to recognize traffic signs, avoid obstacles, make decisions at intersections, and, overall, follow traffic regulations. With the help of (1) sensors that gather millions of data points on what is ahead, beside, or behind the car, (2) software that processes all the data points collected through those sensors, and (3) machine learning that recognizes patterns in those data points to improve its driving, a machine is able to perform the human-like task of driving a car in real traffic.
Likewise, ACTFL® and Language Testing International® (LTI) have leveraged state-of-the-art machine learning technologies to build a model that scores Spanish AAPPL PW (ACTFL Assessment of Performance toward Proficiency in Languages Presentational Writing) responses just as ACTFL-certified raters would. As with self-driving cars, the research team at ACTFL and LTI trained the machine to perform the task of a certified human rater by (1) compiling thousands of data points of actual test responses and rater scores, (2) using software to process these data, and (3) applying machine learning techniques to find patterns and optimize the machine's scoring performance.
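The three-step pipeline above can be sketched in miniature. This is purely illustrative: the actual ACTFL/LTI model, features, and training data are not described here, and the Spanish responses, scores, and simple nearest-neighbor "model" below are invented for demonstration only.

```python
# Toy sketch of the three steps: (1) compile (response, rater score) pairs,
# (2) process responses into feature vectors, (3) use a similarity-based
# rule to score new responses. All data below is invented.
from collections import Counter
import math

# (1) Data points: actual systems use thousands of rated responses.
training_data = [
    ("Me gusta la escuela y mis amigos", 2),
    ("Ayer fui al mercado con mi familia y compramos frutas frescas", 3),
    ("Si pudiera viajar exploraria la historia y cultura de cada region", 4),
]

# (2) Software processing: turn each response into a word-count vector.
def featurize(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# (3) "Learned" pattern: score a new response like its most similar
# training response (a stand-in for a real machine learning model).
def predict_score(response):
    features = featurize(response)
    best = max(training_data, key=lambda pair: cosine(features, featurize(pair[0])))
    return best[1]

print(predict_score("Me gusta mucho la escuela"))  # → 2
```

A production system would replace the nearest-neighbor rule with a trained model and far richer features, but the shape of the pipeline (data, processing, pattern learning) is the same.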
How are ACTFL and LTI leading innovation with machine scoring for Spanish? Why?
Research on automated scoring systems for languages other than English and for non-adult language learners is limited at best. Most, if not all, of the automated scoring systems in use focus on English as the target test language and on adult test-takers (e.g., Davis & Papageorgiou, 2021; Isbell, Crowther, & Nishizawa, 2023; Gao, Gales, & Xu, 2024). To address this gap in the research and in the field in general, LTI and ACTFL collaborated on an innovative research project to build an automated scoring system for the Spanish AAPPL PW, targeting the Spanish language and non-adult learners. Not only did the project make a much-needed contribution to the research field, but it also provided a way to generate consistent scores, double-rate all tests, strengthen the QA process, and rapidly detect alarming comments, among other affordances.
How ethical is the Automated Machine Scoring System?
In an AAPPL customer satisfaction survey that LTI launched in 2024, many respondents expressed concerns about automated scoring systems, particularly regarding the absence of human judgment, potential biases in AI models, and fairness across diverse demographics. The Machine Scoring research team at ACTFL and LTI has taken all of these concerns seriously since day one. Reflecting this commitment, the team's efforts to build the automated system have been grounded in the International Language Testing Association (ILTA) code of ethics (ILTA, revised, forthcoming). As such, the machine scoring projects have been guided by the code's four essential principles, particularly the Principle of Technological Responsibility, which calls on language testers to handle technological innovations “with diligence and foresight to uphold the integrity and fairness of language testing” (ILTA, forthcoming, p. 8).
More specifically, the team carefully compiled a dataset of responses representative of the full range Spanish AAPPL PW test-taker population along with ACTFL-certified rater scores to train the machine scoring system. Training the machine scoring system on this authentic and representative dataset ensures that automated scores align closely with human judgment, as well as their efforts to provide unbiased and fair scores across different demographic groups. Compiling such a representative dataset is just one aspect of the research team’s broader commitment to ethical responsibility. Additional details on our ethical practices will be shared at the upcoming East Coast Organization of Language Testers (ECOLT) Conference in September.
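One simple way to monitor fairness across groups is to compare machine-human agreement rates by demographic group. The sketch below is hypothetical: the groups, scores, and exact-agreement metric are invented for illustration and do not represent the checks ACTFL and LTI actually run.

```python
# Illustrative fairness check: exact-agreement rate between human rater
# scores and machine scores, computed per demographic group. All records
# below are invented for demonstration.
from collections import defaultdict

# Invented records: (demographic group, human rater score, machine score)
records = [
    ("group_a", 3, 3), ("group_a", 2, 2), ("group_a", 4, 3),
    ("group_b", 3, 3), ("group_b", 2, 2), ("group_b", 4, 4),
]

def agreement_by_group(rows):
    """Share of responses where machine and human scores match, per group."""
    hits, totals = defaultdict(int), defaultdict(int)
    for group, human, machine in rows:
        totals[group] += 1
        hits[group] += int(human == machine)
    return {g: hits[g] / totals[g] for g in totals}

rates = agreement_by_group(records)
print(rates)  # e.g. group_a ≈ 0.67, group_b = 1.0
```

Large gaps between groups on checks like this would flag potential bias for further investigation; real audits would also examine score distributions and adjacent-agreement measures.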
What is in store for the Machine Scoring System and other projects?
With the rapid advances in language assessment technology and AI in general, the machine scoring research team continues to improve our models to provide more accurate, reliable, and interpretable scores for the Spanish AAPPL PW. At the same time, we continue to disseminate our work through conference presentations, the next of which are ECOLT 2025 in Washington, DC, and AIRiAL 2025 in New York City. We have also recently published our work in an academic journal (Voss et al., 2025) so that our stakeholders can learn in more detail about the earlier stages of this project.
In addition, the team continues innovating and is now working on a machine scoring system for the Spanish AAPPL ILS (Interpersonal Listening and Speaking). This project poses its own challenges, as it involves speech rather than text. As mentioned earlier, little research has been done on young learners in languages other than English, and this includes research on speech recognition models for this population and target language. As such, the team is working diligently to tackle these and many other challenges while upholding the highest standards of ethics and quality.
References
Davis, L., & Papageorgiou, S. (2021). Complementary strengths? Evaluation of a hybrid human-machine scoring approach for a test of oral academic English. Assessment in Education: Principles, Policy & Practice, 28(4), 437–455.
Gao, S., Gales, M., & Xu, J. (2024). Detecting aberrant responses in automated L2 spoken English assessment. Exploring artificial intelligence in applied linguistics, 96–117.
International Language Testing Association. (Forthcoming). ILTA Code of Ethics in English.
Isbell, D. R., Crowther, D., & Nishizawa, H. (2023). Speaking performances, stakeholder perceptions, and test scores: Extrapolating from the Duolingo English test to the university. Language Testing, 0(0). https://doi.org/10.1177/02655322231165984
Jordan, M. I. (2019, July 1). Artificial intelligence—The revolution hasn’t happened yet. HDSR. https://hdsr.mitpress.mit.edu/pub/wot7mkc1/release/10
The AI buzzword trap: Why branding everything as AI is dangerous. (n.d.). SECO. https://www.seco.com/blog/details/the-ai-buzzword-trap-why-branding-everything-as-ai-is-dangerous
Voss, E., Sallee, K., Son, Y-A., Malone, M. E., Marshall, C., & Chomon-Zamora, C. (2025). An automated scoring system for the AAPPL Spanish presentational writing tasks. Foreign Language Annals, 1–19. https://onlinelibrary.wiley.com/doi/abs/10.1111/flan.70038