10. Artificial Intelligence-enabled adaptive assessments with Intelligent Tutors

Xiangen Hu
University of Memphis
Keith Shubeck
University of Memphis
John Sabatini
University of Memphis

The preceding chapters in this publication have argued that assessing what students know and can do requires a coherent chain linking: 1) a vision of what the relevant competencies in a given domain are; 2) the types of tasks in which proficiency in those competencies can be observed; and 3) appropriate methods to interpret and summarise the data resulting from these tasks. Such evidence-based reasoning is essential to make valid inferences about complex competencies, like communication and creative thinking, which people need for life and work in the 21st Century. It has been argued that eliciting evidence of 21st Century competencies requires immersive, realistic assessment tasks in which test takers exercise these competencies and in which evaluators pay as much attention to the process by which learners come to solve the given task as to the result of this process. Finally, it has also been argued that making sense of the complex data streaming from such interactive tasks is only possible at scale by leveraging modern digital technologies.

In this chapter, we elaborate on these points by introducing an adaptive assessment framework inspired by Intelligent Tutoring Systems, i.e. computer-based learning environments providing personalised instruction and feedback to learners through digital tutors and/or peers called avatars (Nwana, 1990[1]; Rus et al., 2013[2]). Intelligent Tutoring Systems (ITS) have been one of the most active areas of research and development in learning sciences and technologies in the past 35 years. Researchers in this area have always been at the forefront of technologies, particularly assessment technologies (Sottilare et al., 2017[3]). Most ITS applications are a unique combination of learning theories, domain-specific content and available technologies.

Unlike typical assessment environments, ITS provide interactive scenarios filled with dynamic tasks and constant feedback, where learning takes place as students progress through the activities. We contend that next-generation assessments have a lot to learn from ongoing developments in ITS, and in particular, from original applications of Artificial Intelligence (AI) that provide intelligent feedback, adapt content in response to the actions of test takers, and evaluate what they know and can do. We provide a few examples of ITS for assessment along these lines before turning to a summary of the chapter’s key points and concluding thoughts.

In traditional tests, the goal is to assess the knowledge students have acquired prior to the task. Usually no feedback is given, tasks are likely to be very distinct from one another, and responses are mostly limited to categorical responses (i.e. correct or incorrect answers) so as to minimise the “testing effect”, i.e. learning from the test (Avvisati and Borgonovi, 2020[4]; Butler, 2010[5]; Dempster, 1996[6]; Roediger and Karpicke, 2006[7]; Rowland, 2014[8]; Wheeler and Roediger, 1992[9]). In contrast, in ITS – as in learning environments more broadly – the goal is to maximise the testing effect to support student learning. To this end, tasks are not only related to one another, but carefully constructed feedback is also provided to students after each response. Additionally, attempting to emulate what human educators do, ITS adapt and make their judgements based on a wide range of student behaviours beyond categorical responses, including response latency, signs of confidence, emotions, body movement, etc.

These differences (presence or absence of feedback, relations between tasks and observed student behaviours) between the two environments result in fundamentally different assessment approaches. The fact that tasks in typical assessment environments have to be efficient and independent reduces the need for sophisticated types of responses and consequently limits the type of analytical tools that are needed to make sense of these responses. In ITS, tasks have to be ecologically valid and mimic real learning environments (for example, being conversational), and consecutive tasks are likely related. The response to the tasks can be multi-modal so advanced technologies have to be used to collect and interpret the data. The ITS experience can thus help realise the vision of change for next-generation assessments described in previous chapters by illustrating ways to deliver intelligent feedback as well as providing models and tools to process complex, multi-modal data.

We propose a general framework for the use of ITS for assessment. The framework describes how ITS work, what types of data can be generated by the interactions between human learners and automated tutors, and how these data can be converted into evidence of students’ knowledge and mastery of complex skills. The framework is an abstraction of the general elements characterising ITS that target different domains (e.g. from electricity to socio-emotional learning) and automatically process data emanating from different response processes (e.g. from multiple choice to open conversation and emotion detection). In particular, the framework shows how these very different sources of evidence can be organised according to the same data structure.

Figure 10.1 illustrates the main elements of the framework:

  1. the type and function of the agents that interact in the ITS.

  2. the way agents interact through sequences of stimuli and responses.

  3. the data structure corresponding to the number of responses multiplied by the number of measures that are calculated on each response.

In ITS, one or more human learners interact with one or more ITS applications called avatars. The avatars take either the role of a tutor, who explains concepts and asks questions, or of a peer learner, who participates in the tutoring just as the human learner does. Learners and avatars act similarly within ITS: both pose questions to each other and offer answers to the best of their knowledge. The difference is that avatars are controlled by back-end AI-enabled engines that constantly assess learners during the interaction, while the behaviours of human learners are controlled by human intelligence (broadly defined), with learners very likely making similar assessments of the avatars.

The interactions between learners and avatars occur in a mixed-initiative style, where humans and avatars respond to questions from other humans and avatars. We characterise this iterative process as a series of sequences, each involving a stimulus (S) and a response (R). Any action at time t becomes the stimulus for another action at time t+1. For example, a typical interaction might start with the avatar tutor explaining a concept and then asking learners a question to verify their understanding. This is the first stimulus at time t, which is followed by a response at time t+1. This response elicits feedback from another agent in the ITS at time t+2, which in turn constitutes a new stimulus. If the human agent provides an incorrect response, the avatar tutor might provide a clarification and invite the learner to give another response at time t+3. In this conversational environment, the number of responses collected from each human learner is thus not fixed but depends on how the interaction unfolds between the human and the avatars.
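To make this structure concrete, the sketch below (in Python) represents an interaction log as a simple sequence of timestamped stimulus/response events; the class and field names are illustrative assumptions rather than part of any particular ITS.

```python
# Minimal sketch (illustrative names): an ITS interaction as a sequence of
# stimulus/response events. Each action at time t can serve as the stimulus
# for the action at time t+1.
from dataclasses import dataclass
from typing import List

@dataclass
class Event:
    t: int        # discrete time index in the interaction
    agent: str    # "avatar_tutor", "avatar_peer" or "human_learner"
    kind: str     # "stimulus" (question, hint, feedback) or "response"
    content: str  # text, choice label, transcribed speech, etc.

interaction: List[Event] = [
    Event(0, "avatar_tutor", "stimulus", "What are the disadvantages of a bridge rectifier?"),
    Event(1, "human_learner", "response", "It needs four diodes."),
    Event(2, "avatar_tutor", "stimulus", "Right. And what does that imply for the voltage drop?"),
    Event(3, "human_learner", "response", "There are two diode drops per half cycle."),
]

# The number of responses is not fixed in advance; it depends on how the
# mixed-initiative exchange unfolds.
responses = [e for e in interaction if e.kind == "response"]
```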

In this framework, we distinguish two different aspects of the responses taking place in ITS: 1) the response type, or format of the response (e.g. text or voice input); and 2) the response measures, or types of evidence that are generated about student learning – including correctness as well as timing and magnitude. We discuss these two aspects in detail next.

Responses from learners in ITS can take one or a mixture of the following response types:

  • Categorical behaviours that are easily identified, such as multiple-choice selections.

  • Written (typed) text responses.

  • Natural language voice input.

  • Facial expressions and body gestures.

  • Biometric responses/behaviours captured through wearable devices, such as smart watches.

As shown in Table 10.1, collecting data for each response type requires different types of assessment hardware.

For each avatar action, the avatar has a set of “expected” correct responses and “speculated” wrong responses. Consider a simple and familiar case: if the action of the avatar is to present a multiple-choice question to the learner, the expected correct answer would be the correct choice and the speculated wrong responses would be the wrong choices. In this multiple-choice example, the avatar can always “score” the learner’s response as one or more of the following:

  • Hit (H): Correct choice is selected.

  • False Alarm (FA): Incorrect choice is selected.

  • Miss (M): Correct choice is not selected.

  • Correct Rejection (CR): Wrong choice is not selected.

Notice that when the response is correct, more than one score can be assigned: a Hit (for the correct selection) and Correct Rejections (for the wrong choices that are not selected). In cases where more than one choice needs to be selected (such as “check all that apply” items), all four scores may be present in a single response.
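As a concrete illustration, the following minimal sketch (in Python, with hypothetical function and variable names) scores a “check all that apply” response into the four categories above.

```python
# Minimal sketch: scoring a multiple-selection response into Hits (H),
# False Alarms (FA), Misses (M) and Correct Rejections (CR).
def score_selection(selected, correct_options, all_options):
    """Return the count of each scoring category for one response."""
    selected = set(selected)
    correct = set(correct_options)
    wrong = set(all_options) - correct
    return {
        "H": len(selected & correct),   # correct choices that were selected
        "FA": len(selected & wrong),    # wrong choices that were selected
        "M": len(correct - selected),   # correct choices that were not selected
        "CR": len(wrong - selected),    # wrong choices that were not selected
    }

# Example (anticipating Figure 10.2): A, C and D are correct; the learner selects A and B.
print(score_selection({"A", "B"}, {"A", "C", "D"}, {"A", "B", "C", "D"}))
# -> {'H': 1, 'FA': 1, 'M': 2, 'CR': 0}
```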

This simple categorical representation can be adapted to more complex responses from the human learner. No matter the type of response, within an ITS there is always a well-defined set of expected correct and incorrect responses. For each response type, the software processing the responses is responsible for classifying them into discrete categories:

  • For categorical responses, such as multiple-choice responses, correct/incorrect responses are explicitly defined according to the scheme presented above (H, FA, M, CR).

  • For non-categorical responses, such as text input, software is needed to evaluate the input text against typical correct and incorrect answers and classify it into letter grades or a more fine-grained evaluation.

  • For voice input, software (some type of AI application) is needed to transcribe the spoken input into natural language text, which can then be analysed in the same way as typed text input.

  • For facial or body gestures, software is needed to classify the input into discrete categories (such as emotions based on facial expressions).

  • For biometric responses (collected from wearable devices, for example), the data are larger and richer than for the previous types; specially designed software is needed to interpret student input.

Given that responses in ITS are recorded with timestamps and other physical information, it is relatively easy to measure the time course of each category and to obtain the following measures, which are not available in classical/traditional assessment instruments (a minimal sketch of how they can be computed follows this list):

  • Response latency: The time between the end of the previous stimulus and the start of the response. This is one of the most studied dependent variables in cognitive psychology. It is easy to measure and very informative.

  • Duration: The time between the start of the response action and the submission of the response. For example, if a learner is required to provide a drag-and-drop response, the duration would be the time from the beginning of the dragging action to the moment the object is dropped. When classifying facial expressions into emotional states, some of the emotions may last longer than others.

  • Inter-category interval and intra-category interval: When the processed result of a response includes multiple categories and the categories have different time courses, inter-category intervals are the differences between the timestamps of different categories, while intra-category intervals measure the time between two observations of the same category.
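These timing measures can be derived directly from timestamped observations. The sketch below is a minimal illustration under the assumption that each processed observation carries a category label and start/end timestamps; the data structures are hypothetical, not those of any specific ITS.

```python
# Minimal sketch: deriving latency, duration and inter-/intra-category
# intervals from timestamped, categorised observations.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Observation:
    category: str   # e.g. "H", "FA" or an emotion label
    start: float    # timestamp (seconds) when the behaviour starts
    end: float      # timestamp (seconds) when the behaviour ends (submission)

def response_latency(stimulus_end: float, first_obs: Observation) -> float:
    # Time between the end of the previous stimulus and the start of the response
    return first_obs.start - stimulus_end

def duration(obs: Observation) -> float:
    # Time between the start of the response action and its submission
    return obs.end - obs.start

def intervals(observations: List[Observation]) -> Tuple[List[float], List[float]]:
    # Inter-category: gap between consecutive observations of different categories;
    # intra-category: gap between consecutive observations of the same category.
    inter, intra = [], []
    for prev, curr in zip(observations, observations[1:]):
        gap = curr.start - prev.end
        (intra if curr.category == prev.category else inter).append(gap)
    return inter, intra
```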

The technologies applied in modern ITS can also be used to classify responses according to characteristics beyond their correctness and timing. One important additional measure is magnitude, which represents the intensity of the observed behaviour. A simple measure of magnitude can be obtained by asking learners to rate their level of confidence in their response: if a learner states that they have high confidence in their response, then the response is considered to have high magnitude. Judgements of confidence can also be obtained unobtrusively with advanced software that processes raw voice input to extract non-verbal information, such as hesitation, delays between sentences, intonation or sentiment. Measures of magnitude can have important applications in innovative assessments of social interaction skills and emotional regulation skills.

Figure 10.2 provides an illustration of the above measures for the simple scenario of a multiple-choice task. In this example, the avatar asks a question and the human learner chooses among four response options (A, B, C, D), where A, C and D are correct choices and B is the wrong choice. In the first response, the learner indicates options A and B as correct: there is thus one Hit (A), two Misses (C and D) and one False Alarm (B). The avatar reacts at t+1 by providing feedback and the learner then changes their response, indicating A and C as correct. This is an improvement from 1 out of 4 correct categories to 3 out of 4 with respect to the first response (the only issue in the second response is the missing D). After receiving feedback from the avatar again, the learner submits a third response, this time correctly selecting options A, C and D and rejecting B. In Figure 10.2, the height of a bar represents the magnitude while its width represents the length of time between the stimulus and the submission of the response (response latency). In this representation, an inter-category interval is given by the time the student takes to shift from an incorrect to a correct response (or vice versa), while an intra-category interval is given by the time the student keeps a response the same.

If the number of independent behavioural categorical measurements for each response is N, the data for adaptive assessments in ITS will form a K x N data matrix, where K is the total number of responses. Take, for example, the data in Figure 10.2. There are four behaviour categories: Hit (H), False Alarm (FA), Miss (M) and Correct Rejection (CR). If the tutor provides only two feedback interventions, then the data matrix would be 3 by 4. If additional measures are collected (e.g. confidence and the timing of the responses), there will be additional values in each cell (e.g. response latency, duration and magnitude). The other two measures (inter-category interval and intra-category interval) and second-order numerical properties of the data matrix can be computed easily.
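As a minimal illustration of this data structure, the sketch below assembles the K x N matrix for the Figure 10.2 scenario (K = 3 responses after two feedback interventions, N = 4 categories); the latency and magnitude values are invented for illustration only.

```python
# Minimal sketch: the K x N data matrix for the Figure 10.2 scenario.
import numpy as np

categories = ["H", "FA", "M", "CR"]

counts = np.array([
    [1, 1, 2, 0],   # attempt 1: learner selects A and B
    [2, 0, 1, 1],   # attempt 2: learner selects A and C
    [3, 0, 0, 1],   # attempt 3: learner selects A, C and D
])
print(counts.shape)  # (3, 4) -> K x N

# Additional measures (e.g. response latency in seconds, magnitude from a
# confidence rating) can be attached to each response as further layers.
latency = np.array([4.2, 2.1, 1.5])      # illustrative values only
magnitude = np.array([0.3, 0.6, 0.9])    # illustrative values only
```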

This general data structure is applicable to a variety of adaptive test environments. The next section covers a few examples demonstrating the utility of the framework.

In the previous section, we presented a framework for adaptive assessments in ITS, which was illustrated with clearly defined response categories (H, FA, M, CR). In the examples we present next, the assessment framework is the same, but response categories vary. The first example looks at conversation-based ITS by focusing on two applications of AutoTutor. The second example is an application of the framework in assessing competencies of team members in group interaction. In the third example, we present an application of the framework to an assessment of emotional responses.

In this first example, we look at two applications of AutoTutor. The first one, ElectronixTutor (Graesser et al., 2018[10]; Morgan et al., 2018[11]), is an ITS comprising 30 modules, each covering a key concept in one of five areas of electronics (semiconductor, PN junction, rectifiers, filters and power supplies). In ElectronixTutor, learners interact with the tutor by typing (or via voice input) in plain English. As shown in Figure 10.3, the tutor starts with a seed question about the concept (e.g. “What are the disadvantages of a bridge rectifier?”). In turn, using a combination of latent semantic analysis (LSA) and regular expressions (RegEx), the ITS evaluates learner responses by analysing their semantic similarity to typical expectations or typical misconceptions – that is, matching them against a set of previously defined correct and incorrect answers (see Figure 10.4). If the learner input is incorrect, the tutor provides a hint for the learner to respond to; if the response to the hint is incorrect, the tutor provides a narrower hint targeting a specific word or phrase in the form of a “prompt”. Finally, the tutor provides an assertion or summary of the expectation and moves on to providing hints, prompts and assertions for the next expectation.

It must be noted that, while we considered four indicators of correctness for learner responses in the previous section (Hit, Miss, False Alarm, Correct Rejection), ElectronixTutor considers two: Hits (i.e. responses that meet the semantic overlap threshold with the ideal answer) and Misses (i.e. those that do not meet the threshold). The tutor provides just-in-time feedback for misses and positive feedback for hits. All interaction data are stored in a learning record store and can be drawn upon to display learners’ performance, both to instructors and to learners themselves, through a dashboard user interface. The dashboard can be easily integrated into most learning management systems, meaning that the modules of the application can be used for learning and assessment in the classroom. ElectronixTutor is designed to offer two types of feedback (hints and prompts) for both typical expectations and misconceptions. The data structure in this example is hence a 2 by 2 matrix.
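A minimal sketch of this kind of expectation matching is shown below. It is not the actual ElectronixTutor pipeline (which combines LSA and RegEx); it simply illustrates classifying a learner's answer by its semantic similarity to pre-defined expectations and misconceptions, with the answer and reference texts assumed to be already represented as semantic vectors (e.g. via LSA).

```python
# Minimal sketch: threshold-based semantic matching of a learner's answer
# against expectations (ideal answers) and known misconceptions.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def classify_answer(answer_vec, expectation_vecs, misconception_vecs, threshold=0.7):
    """Return 'hit' if the answer meets the semantic-overlap threshold with an
    expectation, 'misconception' if it matches a known misconception,
    otherwise 'miss' (triggering a hint, then a prompt, then an assertion)."""
    if any(cosine(answer_vec, e) >= threshold for e in expectation_vecs):
        return "hit"
    if any(cosine(answer_vec, m) >= threshold for m in misconception_vecs):
        return "misconception"
    return "miss"
```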

A different, simpler version of AutoTutor, AutoTutor Lite (Hu et al., 2009[12]), provides a model where students are presented with a deep reasoning question about a concept that requires an explanation beyond a “yes” or “no” response (Hu et al., 2009[12]; Sullins, Craig and Hu, 2015[13]). In this variation, once the student has provided an answer, a language analyser semantically “decomposes” it (Hu et al., 2014[14]) and computes its relevance against the answer key (i.e. the correct answer). The numerical value (between 0 and 1) of the semantic similarity between the student’s answer and the answer key is used as the measure of relevance (R), and irrelevance (IR) is defined as 1 minus the value of relevance. If the student does not give a satisfactory answer, the tutor will invite the student to provide further input. The analyser will then decompose the new response to determine, first, whether it overlaps semantically with the previous one – that is, whether the student is providing new (N) or old (O) information – and, second, whether the information is relevant or not. With this process, a sequence of vectors can be produced, each vector containing six values: Relevant-New (N+R), Relevant-Old (O+R), Irrelevant-New (N+IR) and Irrelevant-Old (O+IR), in addition to Total Coverage (the accumulated relevant and new information (N+R) from the first answer to the last) and Current Score (the most recent (N+R) + (O+R)). All six scores, though not necessarily independent, provide enough information for the ITS to give feedback.
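The sketch below illustrates, in a heavily simplified form, how such a six-value vector could be updated after each student contribution. The multiplicative combination of the new/old and relevant/irrelevant dimensions is a simplifying assumption made here for illustration, not the published AutoTutor Lite algorithm, and the similarity inputs are assumed to come from any semantic analyser returning values in [0, 1].

```python
# Minimal sketch (simplifying assumptions, not the actual AutoTutor Lite analyser):
# updating the six LITE-style scores after a new student contribution.
def lite_scores(sim_to_answer_key, sim_to_previous_input, prev_total_coverage):
    """sim_to_answer_key: semantic similarity of the new input to the answer key (relevance, R).
    sim_to_previous_input: semantic overlap with what the student already said (old vs. new).
    prev_total_coverage: accumulated N+R from earlier turns."""
    relevance = sim_to_answer_key          # R
    irrelevance = 1.0 - relevance          # IR = 1 - R
    old = sim_to_previous_input            # proportion of old information (assumption)
    new = 1.0 - old                        # proportion of new information
    scores = {
        "N+R": new * relevance,
        "O+R": old * relevance,
        "N+IR": new * irrelevance,
        "O+IR": old * irrelevance,
    }
    scores["Total Coverage"] = prev_total_coverage + scores["N+R"]
    scores["Current Score"] = scores["N+R"] + scores["O+R"]
    return scores
```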

Computation of these values can be achieved with most current semantic analytical tools (Hu et al., 2014[14]). Each element of the vector serves as an assessment of the student’s understanding of the given concept. We call the sequence of values (O+R, O+IR, N+R and N+IR) a learner's characteristic curve (see Figure 10.5), which has a rather intuitive interpretation that can easily be used to derive feedback:

  • When the student does not know the answer or does not understand what was asked, the values of (N+IR) are likely to be high, indicating confusion.

  • When the student misunderstands the question or has a misconception, the values of (O+IR) are likely to be high.

  • High values of (O+R) indicate that the student may have already contributed all the relevant knowledge they have.

  • High values of (N+R) indicate that the student is likely to contribute more towards the answer.

The next example demonstrates that the framework presented here can be used to assess learners' competencies when interacting with others. To illustrate this possibility, imagine that learners and avatars in an ITS contribute to a discussion by responding to previous contributions from other learners or avatars. Assume the responses are finite and happen at discrete time points (t1,...,tn), where each action at time tk is a response to some or all of the actions prior to tk. There are no assigned turns, meaning that each learner/avatar decides when to act, and some may act more than others.

In such a context, different types of evidence are produced as learners/avatars respond to previous actions at any time t. As summarised in Table 10.2 below, such evidence can be used to calculate six indices of a group communication analysis (GCA) vector for each learner/avatar (Dowell, Nixon and Graesser, 2018[16]; Hu et al., 2018[17]).

To compute these indices, the behaviour of the participants needs to be recorded – for example, the details of each contribution, such as its time, content (language) and target (whom the participant is addressing). Some advanced linguistic analytical tools, such as semantic analysis (Hu et al., 2014[14]; Hu, Cai and Olney, 2019[18]), are needed for some of the indices.

With these indices computed after each round of contributions, a sequence of 6-dimensional GCA vectors can be obtained. The elements of each 6-dimensional vector are measures of response categories. With this sequence of vectors, timestamps (latency, duration) can be associated with some of the values. There are also “expected” correct actions and “speculated” wrong moves associated with the actions, so classifications similar to those described above can be performed.

In this example, the data structure is K x 6, where K is the number of contributions from an agent. Notice that even though each contribution is made by only one individual, the GCA vectors are updated for all participants. Additionally, although the data structure is the same as in the previous example, the nature of the behavioural categories is different. A potential application for this example is to assess the “teamness” of individuals taking part in artificially constructed discussion environments, where some or all of the other participants are avatars controlled by the ITS.
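A minimal sketch of this K x 6 structure is given below. The six index names follow the group communication analysis literature (Dowell, Nixon and Graesser, 2018[16]) and are used here only as placeholders for the definitions given in Table 10.2; how each index is actually computed is left to the linguistic tools mentioned above.

```python
# Minimal sketch: maintaining one K x 6 GCA matrix per participant,
# updated after each round of contributions.
import numpy as np

GCA_INDICES = ["participation", "internal_cohesion", "overall_responsivity",
               "social_impact", "newness", "communication_density"]

class GcaTrace:
    def __init__(self, participant_id: str):
        self.participant_id = participant_id
        self.rows = []                      # one 6-dimensional vector per round

    def update(self, index_values):
        assert len(index_values) == len(GCA_INDICES)
        self.rows.append(list(index_values))

    def matrix(self) -> np.ndarray:
        return np.array(self.rows)          # shape (K, 6)

# After each contribution, the traces of *all* participants are updated,
# even though only one agent actually contributed in that round.
```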

Emotions and cognitive affective states affect learning in different ways. For example, meta-analyses of emotions during learning suggest that emotions that frequently occur include boredom, frustration, confusion, happiness, anxiety and flow/engagement (D’Mello and Graesser, 2012[19]). Previous ITS, to varying degrees of success, have implemented affect-detection measures meant to enhance feedback and interventions during the tutoring process (D’Mello, Picard and Graesser, 2007[20]; D’Mello and Graesser, 2012[19]). But how can information about a learner’s affective state improve adaptive assessments?

One possibility is to use facial recognition software – for example, the sensory model of the Generalised Intelligent Framework for Tutoring (GIFT) (Sottilare et al., 2017[3]) – to capture learning-relevant affective states (Ahmed, Shubeck and Hu, 2021[21]). Updating the student model with information on learners’ affective states can potentially help assess a learner’s knowledge state in ways that go beyond measures of past performance or simple self-reported measures of confidence. For example, a learner who appears confused when responding to a question may have guessed the answer. Similarly, a learner who appears frustrated may have low confidence in a certain topic or knowledge component. A learner who appears happy or in a state of flow (i.e. a cognitive state where one is completely immersed in an activity) may have high confidence in and understanding of the material. However, if a learner appears confident when providing an incorrect response, this may suggest an existing misconception that should be addressed.

In a traditional non-adaptive assessment environment, it can be difficult to determine if an incorrect response was due to carelessness, genuine lack of understanding or simply not attending to the task. Within the context of the Hit, False Alarm, Miss and Correct Rejection framework, knowing a learner’s affective state at each interaction point can provide further insight into why a learner changed their response from a False Alarm to a Hit or from a Correct Rejection to a False Alarm. Consider, for example, a learner who appears confused when providing an incorrect response to a question and then later appears calm or happy when making a correct response for their second attempt. This could suggest that the confusion was resolved. Here, the affective data is used to enhance the system’s confidence in the learner’s current knowledge state. Mapping a learner’s affective state onto relevant knowledge components can strengthen the system’s student model. This improved student model can then be used to provide data-driven, “learner aware” (Ahmed et al., 2020[22]) and affect-sensitive interventions and feedback to potentially guide the user back into an affective state conducive to learning.

In a pilot implementation, the Amazon Rekognition API (AWS, 2023[23]) was integrated into an existing ITS (see Figure 10.6). Facial expressions of learners were processed every 500 milliseconds and mapped each time onto eight emotional states (Angry, Sad, Calm, Confused, Disgusted, Fear, Surprised and Happy) with varying degrees of confidence – for example, the Rekognition API classified the face as Angry with 45.52% confidence. It must be noted that the AI behind the Rekognition API is trained on a large number of human faces, but it may not be accurate when classifying emotions for individuals in certain contexts, such as learning. It may be, for instance, that the facial expression of the individual in Figure 10.6 is generally classified as Angry (with 45% confidence from the API) but, for this particular individual, it is a typical expression of confusion when working on difficult tasks (a z-score of 2.77 for the emotion type of confusion). Indeed, the eight emotion types from the Rekognition API may not be a perfect measure of the emotions experienced during learning. However, having a numerical vector that is sensitive to learners’ emotional changes as a function of the tasks during learning opens up potential applications for assessment.
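A minimal sketch of such a pipeline is shown below, using the Amazon Rekognition `detect_faces` call from the AWS SDK for Python (boto3). The per-learner z-score normalisation mirrors the idea described above; the variable names and the way the baseline history is stored are illustrative assumptions rather than the pilot's actual implementation.

```python
# Minimal sketch: sampling facial expressions via Amazon Rekognition and
# normalising each emotion's confidence against the learner's own baseline.
import boto3
import statistics

rekognition = boto3.client("rekognition")
EMOTIONS = ["ANGRY", "SAD", "CALM", "CONFUSED", "DISGUSTED", "FEAR", "SURPRISED", "HAPPY"]

def detect_emotions(frame_bytes: bytes) -> dict:
    """Return {emotion: confidence} for the first detected face (empty if none)."""
    result = rekognition.detect_faces(Image={"Bytes": frame_bytes}, Attributes=["ALL"])
    if not result["FaceDetails"]:
        return {}
    return {e["Type"]: e["Confidence"] for e in result["FaceDetails"][0]["Emotions"]}

def emotion_z_scores(current: dict, history: dict) -> dict:
    """history: {emotion: [this learner's past confidence values]}.
    A high z-score flags an expression that is unusual for this learner."""
    scores = {}
    for emotion in EMOTIONS:
        past = history.get(emotion, [])
        if emotion in current and len(past) > 1:
            mu, sigma = statistics.mean(past), statistics.pstdev(past)
            if sigma > 0:
                scores[emotion] = (current[emotion] - mu) / sigma
    return scores
```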

In this chapter, we introduced an adaptive assessment framework inspired by Intelligent Tutoring Systems. In this framework, regardless of the chosen domain and type of task, the data structure takes the form of a K x N matrix, where K is the number of responses and N is the number of assessment categories. The adaptive nature of assessments in this framework, however, requires models that analyse sequences of actions in interconnected tasks. The examples presented in this chapter have shown that the current state of AI makes it possible to interpret data and develop indicators for sequences of actions with multiple data types – although several limitations remain, both theoretical and technological. Theoretically, we still need to understand what exactly distinguishes learning environments from assessment environments. What we have presented is based on a typical assumption in a learning framework, namely that any exposure to learning resources (even in the form of assessment items) will have an explicit or implicit impact on future performance. Yet, because this process is adaptive and dynamic, it is necessarily different for each learner. In other words, the assessment of a learner’s ability will likely be process dependent. It would be a challenging task to design an adaptive assessment that measures the same competence for all learners.

There are a few technological issues as well. The technology that we use today may be outdated tomorrow. For example, there are new and improved semantic processing technologies today that are much better than those used five years ago. Different technologies will necessarily produce different adaptive processes, and there will be issues if longitudinal comparisons are made using different technologies at different times. In addition, there are large differences in the availability of technologies across countries and regions. For example, semantic analytic tools are more mature for some languages, such as English, than for others. Assessments of students’ ability based on technology-dependent process data will need to be re-validated when the available technologies change. The processing of natural language input, such as through syntactic parsing or semantic encoding, is more computationally expensive (both at the user terminal and on the server) than processing categorical inputs. Differences in the availability of technology and computational power might result in differences in access to innovative systems across countries.

Furthermore, while AI-powered learning environments offer new avenues to analyse the process and not only the product of learning (e.g. through process data), and to emulate real-life, open and dynamic contexts (e.g. by introducing complex conversations), their ability to enhance the authenticity of assessments remains limited in some respects (Swiecki et al., 2022[24]). For instance, despite the progress of natural language processing, developing AI systems capable of processing the full spectrum of human expression including elements such as humour or double meanings remains elusive. Limited “reasoning” abilities hold back AI’s capacity to categorise learner behaviour well in truly open environments where qualitative information is needed, compromising in turn the offering of “intelligent” tutoring feedback. Further limitations include increasingly recognised issues of discrimination stemming from the fact that AI systems build on models, both expert-based and data-driven, grounded on restricted ideas of learning and behaviour pertaining to particular groups of learners (e.g. white, male, neurotypical students). AI technologies must become more mature if we are to rely on non-human intelligences for the provision of the authentic and adaptive tutoring and assessment experiences that all learners deserve.

References

[22] Ahmed, F. et al. (2020), “Enable 3A in AIS”, in HCI International 2020 – Late Breaking Papers: Cognition, Learning and Games, Lecture Notes in Computer Science, Springer, Cham, https://doi.org/10.1007/978-3-030-60128-7_38.

[21] Ahmed, F., K. Shubeck and X. Hu (2021), “Enhancement of GIFT Enabled 3A Learning: New additions”, Proceedings of the 9th Annual Generalized Intelligent Framework for Tutoring (GIFT) Users Symposium (GIFTSym9).

[4] Avvisati, F. and F. Borgonovi (2020), “Learning mathematics problem solving through test practice: A randomized field experiment on a global scale”, Educational Psychology Review, Vol. 32/3, pp. 791-814, https://doi.org/10.1007/s10648-020-09520-6.

[23] AWS (2023), Amazon Rekognition: Developer Guide, Amazon Web Services.

[5] Butler, A. (2010), “Repeated testing produces superior transfer of learning relative to repeated studying”, Journal of Experimental Psychology: Learning, Memory, and Cognition, Vol. 36/5, pp. 1118-1133, https://doi.org/10.1037/a0019902.

[6] Dempster, F. (1996), “Distributing and managing the conditions of encoding and practice”, in Ligon Bjork, E. and R. Bjork (eds.), Memory, Elsevier, https://doi.org/10.1016/b978-012102570-0/50011-2.

[19] D’Mello, S. and A. Graesser (2012), “AutoTutor and affective autotutor”, ACM Transactions on Interactive Intelligent Systems, Vol. 2/4, pp. 1-39, https://doi.org/10.1145/2395123.2395128.

[20] D’Mello, S., R. Picard and A. Graesser (2007), “Toward an affect-sensitive AutoTutor”, IEEE Intelligent Systems, Vol. 22/4, pp. 53-61, https://doi.org/10.1109/MIS.2007.79.

[16] Dowell, N., T. Nixon and A. Graesser (2018), “Group communication analysis: A computational linguistics approach for detecting sociocognitive roles in multiparty interactions”, Behavior Research Methods, Vol. 51/3, pp. 1007-1041, https://doi.org/10.3758/s13428-018-1102-z.

[10] Graesser, A. et al. (2018), “ElectronixTutor: An intelligent tutoring system with multiple learning resources for electronics”, International Journal of STEM Education, Vol. 5/1, https://doi.org/10.1186/s40594-018-0110-y.

[12] Hu, X. et al. (2009), “AutoTutor lite”, Artificial Intelligence in Education.

[18] Hu, X., Z. Cai and A. Olney (2019), “Semantic representation and analysis (SRA) and its application in conversation-based Intelligent Tutoring Systems (CbITS)”, in Feldman, R. (ed.), Learning Science: Theory, Research, and Practice, McGraw-Hill Education.

[17] Hu, X. et al. (2018), “Constructing Individual Conversation Characteristics Curves (ICCC) for Interactive Intelligent Tutoring Environments (IITE)”, in Sottilare, R. et al. (eds.), Design Recommendations for Intelligent Tutoring Systems, US Army Research Laboratory, Orlando.

[15] Hu, X., D. Morrison and Z. Cai (2013), “Conversation-based intelligent tutoring system”, in Sottilare, R. et al. (eds.), Design Recommendations for Intelligent Tutoring Systems: Learner Modeling, U.S. Army Research Laboratory, Orlando.

[14] Hu, X. et al. (2014), “Semantic representation analysis: A general framework for individualized, domain-specific and context-sensitive semantic processing”, in Schmorrow, D. and C. Fidopiastis (eds.), Foundations of Augmented Cognition. Advancing Human Performance and Decision-Making through Adaptive Systems, Lecture Notes in Computer Science, Springer, Cham, https://doi.org/10.1007/978-3-319-07527-3_4.

[11] Morgan, B. et al. (2018), “ElectronixTutor integrates multiple learning resources to teach electronics on the web”, Proceedings of the Fifth Annual ACM Conference on Learning at Scale, https://doi.org/10.1145/3231644.3231691.

[1] Nwana, H. (1990), “Intelligent tutoring systems: An overview”, Artificial Intelligence Review, Vol. 4/4, pp. 251-277, https://doi.org/10.1007/bf00168958.

[7] Roediger, H. and J. Karpicke (2006), “The power of testing memory: Basic research and implications for educational practice”, Perspectives on Psychological Science, Vol. 1/3, pp. 181-210, https://doi.org/10.1111/j.1745-6916.2006.00012.x.

[8] Rowland, C. (2014), “The effect of testing versus restudy on retention: A meta-analytic review of the testing effect”, Psychological Bulletin, Vol. 140/6, pp. 1432-1463, https://doi.org/10.1037/a0037559.

[2] Rus, V. et al. (2013), “Recent advances in conversational Intelligent Tutoring Systems”, AI Magazine, Vol. 34/3, pp. 42-54, https://doi.org/10.1609/aimag.v34i3.2485.

[3] Sottilare, R. et al. (2017), “The Generalized Intelligent Framework for Tutoring (GIFT)”, in Galanis, G., C. Best and R. Sottilare (eds.), Fundamental Issues in Defense Training and Simulation, CRC Press, London, https://doi.org/10.1201/9781315583655-20.

[13] Sullins, J., S. Craig and X. Hu (2015), “Exploring the effectiveness of a novel feedback mechanism within an intelligent tutoring system”, International Journal of Learning Technology, Vol. 10/3, p. 220, https://doi.org/10.1504/ijlt.2015.072358.

[24] Swiecki, Z. et al. (2022), “Assessment in the age of artificial intelligence”, Computers and Education: Artificial Intelligence, Vol. 3, pp. 1-10, https://doi.org/10.1016/j.caeai.2022.100075.

[9] Wheeler, M. and H. Roediger (1992), “Disparate effects of repeated testing: Reconciling Ballard’s (1913) and Bartlett’s (1932) Results”, Psychological Science, Vol. 3/4, pp. 240-246, https://doi.org/10.1111/j.1467-9280.1992.tb00036.x.
