It sounds so futuristic… artificial intelligence and assessment! Maybe it’s just that so much of current assessment practice rests on theory developed in the early 1900s (classical test theory) or the 1960s (item response theory) that the idea of assessment being influenced and updated by AI is attractive, even exciting. Will AI solve all our assessment problems? Of course not, particularly since it’s hard to agree on what is and is NOT AI. In fact, the definition of AI is highly variable and, honestly, not really the most important part of this blog. It’s really the innovations that are happening in assessment that are of most interest to me, AI or not, and hopefully to you as well.
So, after a brief discussion about AI, I’d like to discuss the topics in assessment that I think are at least in orbit around the AI concept. Following the Wikipedia article on AI, I would suggest that the following applications of AI are most relevant for assessment: optimization, probabilistic reasoning, and classification.
OK, but what is AI anyway? Artificial Intelligence is generally thought of as computer-based technology that mimics some unique capacity of human intelligence. Applications of this technology include speech/language recognition and production, pattern recognition, perception, and complex learning. Some perspectives on AI insist that not only should the product of computation be human-like, but the mechanisms that produce intelligent action should also be biologically plausible. Artificial Neural Networks (ANNs) are the prime example, but few applications of ANNs to assessment exist. 🙂 Even under the broader definition, I would suggest that there are few applications of AI to assessment.
Undaunted, I want to discuss at a high level the 3 areas above that apply some pretty sophisticated computation to assessment problems. I should admit, I’m no mathematician so I’m going to keep this discussion conceptual.
Let’s start with Optimization. Optimization (often via mixed integer programming) has been around in assessment for perhaps 20 years. As a methodology, optimization is used to find the values of a set of variables that maximize or minimize some quantity while satisfying certain constraints. Scheduling (for airlines, for example) is a classic optimization task in which the maximum number of people should be moved subject to constraints on the number of airplanes, their capacity, the distance each plane has to travel, the number of stops each has to make, etc. In assessment, optimization has been employed to create examination forms from an item bank, ones that satisfy multiple constraints to the greatest degree possible. An example form assembly problem could be to create 4 exam forms from the ‘best’ items that all adhere to blueprint and statistical specifications while sharing a certain amount of content for equating purposes.
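For readers who like to see the idea in miniature, here is a rough sketch of form assembly as a constrained optimization. Everything in it is an assumption made for illustration: the six-item bank, the item parameters, the blueprint (two items per content area), and the statistical target (mean difficulty close to zero) are all hypothetical. It also builds a single form rather than four, and it simply tries every combination, which stands in for the mixed integer programming solver a real item bank would need.

```python
from itertools import combinations

# Hypothetical item bank: (item_id, content_area, difficulty, discrimination)
BANK = [
    ("i1", "algebra", -1.0, 0.8),
    ("i2", "algebra", 0.0, 1.2),
    ("i3", "algebra", 1.0, 1.0),
    ("i4", "geometry", -0.5, 0.9),
    ("i5", "geometry", 0.5, 1.4),
    ("i6", "geometry", 1.5, 0.7),
]


def assemble_form(bank, per_area, max_abs_mean_difficulty):
    """Return the feasible form with the highest total discrimination, or None.

    Constraints (both assumed for this sketch):
      * blueprint: exactly per_area[area] items from each content area
      * statistics: |mean difficulty| must not exceed max_abs_mean_difficulty
    """
    form_length = sum(per_area.values())
    best_form, best_score = None, float("-inf")
    # Brute-force enumeration: fine for six items, hopeless for a real bank.
    for form in combinations(bank, form_length):
        counts = {}
        for _, area, _, _ in form:
            counts[area] = counts.get(area, 0) + 1
        if counts != per_area:
            continue  # violates the blueprint
        mean_difficulty = sum(item[2] for item in form) / form_length
        if abs(mean_difficulty) > max_abs_mean_difficulty:
            continue  # violates the statistical specification
        score = sum(item[3] for item in form)  # objective: total discrimination
        if score > best_score:
            best_form, best_score = form, score
    return best_form


form = assemble_form(BANK, {"algebra": 2, "geometry": 2}, 0.3)
print([item[0] for item in form])
```

The point of the sketch is the shape of the problem: an objective to maximize (here, total discrimination as a stand-in for ‘best’ items) and constraints that rule out most candidate forms. Swapping the enumeration for a proper solver changes how the answer is found, not what is being asked.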
More recently, optimization has been used successfully as the item selection algorithm for computer adaptive testing. The truly exciting thing about optimization in this context is that, beyond selecting the next operational item on the basis of examinee ability, it can also be used to estimate and update item parameters in real time, detect cheating, and even select experimental items such that the smallest possible number of candidates would need to see them in order to get good stats. Though to my knowledge no operational version of this extended approach yet exists, it’s no doubt on the horizon very soon!
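To make the basic item selection step a little more concrete, here is a minimal sketch. It assumes a two-parameter logistic (2PL) IRT model and a simple per-content-area quota, and it picks the unused item with the most statistical information at the current ability estimate. The bank, the parameters, and the quota are hypothetical, and the greedy rule is a simplified stand-in for the full constrained optimization described above. It does none of the fancier things (parameter updating, cheating detection, experimental item placement).

```python
import math

# Hypothetical item bank: a = discrimination, b = difficulty
BANK = {
    "q1": {"a": 1.0, "b": -1.0, "area": "algebra"},
    "q2": {"a": 1.5, "b": 0.4, "area": "algebra"},
    "q3": {"a": 1.2, "b": 0.5, "area": "geometry"},
    "q4": {"a": 0.8, "b": 2.0, "area": "geometry"},
}
QUOTA = {"algebra": 1, "geometry": 1}  # assumed blueprint: one item per area


def information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def select_next_item(theta, bank, administered, area_quota):
    """Pick the most informative unused item whose content area has quota left.

    Raises ValueError if no item is eligible.
    """
    used_per_area = {}
    for item_id in administered:
        area = bank[item_id]["area"]
        used_per_area[area] = used_per_area.get(area, 0) + 1
    candidates = [
        item_id
        for item_id, item in bank.items()
        if item_id not in administered
        and used_per_area.get(item["area"], 0) < area_quota[item["area"]]
    ]
    return max(
        candidates,
        key=lambda i: information(theta, bank[i]["a"], bank[i]["b"]),
    )


first = select_next_item(0.5, BANK, [], QUOTA)
second = select_next_item(0.5, BANK, [first], QUOTA)
print(first, second)
```

Notice that the second pick is not simply the next most informative item overall: once the algebra quota is used up, only geometry items are eligible. That interplay between ‘most informative’ and ‘still allowed’ is exactly where a real optimization engine earns its keep.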
Probabilistic reasoning (PR) either sounds fancy or pretty generic, depending on your role and interest in the assessment industry. The truth is, all assessment is based on PR, formally or informally. If you think about it, each test question that a candidate answers is evidence of what they know, or how much. In a generic testing situation, we start by assuming little about a test taker’s ability, thus the need for the test. Let’s say a test taker answers the first, relatively easy question correctly. From that, you could reasonably infer that the candidate probably has some knowledge, but you’re not very confident about that. Let’s say the candidate keeps rolling off correct answers to the easy and even the medium-difficulty questions, but has mixed success on the harder ones. The evidence you are collecting is zeroing in on above-average but perhaps not exceptional ability for the candidate. At some point, you will have asked enough questions that candidate ability is known to a particular level of confidence, and you can conclude whatever the test is designed to conclude: readiness for licensure, mastery of grade 9 algebra, whatever.
What I’ve described above is conceptually how IRT works, powered by probabilistic reasoning. You start with a probability distribution of examinee ability, perhaps ‘flat’, perhaps normally distributed based on the cohort to which the examinee belongs. Then, based on candidate responses, you update the probability distribution of their ability. As you include data from more questions, the probability distribution concentrates in a smaller and smaller range of the ability scale, and eventually the probability is high enough that their score is in a particular small range, or that they’re ‘minimally competent’, and you’re done. Where probabilistic reasoning gets even more exciting (and maybe gets more ‘AI-like’) is when you’re trying to make more complex or nuanced inferences about the test taker, such as whether they understand the distributive property of multiplication, or when the tasks are complex and different aspects of performance selectively update different candidate ability variables.
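The updating story above can be sketched in a few lines. This is only an illustration under stated assumptions: a 2PL response model, a flat prior over a grid of ability values from -4 to 4, and six made-up items and responses (easy and medium items answered correctly, hard ones missed). None of the numbers come from a real test.

```python
import math


def p_correct(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def update_posterior(grid, prior, a, b, correct):
    """One step of Bayes' rule: multiply the prior by the likelihood, renormalize."""
    likelihood = [
        p_correct(t, a, b) if correct else 1.0 - p_correct(t, a, b)
        for t in grid
    ]
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def mean_and_sd(grid, dist):
    """Mean and standard deviation of a discrete distribution over the grid."""
    mean = sum(t * p for t, p in zip(grid, dist))
    variance = sum((t - mean) ** 2 * p for t, p in zip(grid, dist))
    return mean, math.sqrt(variance)


# Ability grid from -4.0 to 4.0 in steps of 0.1, with a flat prior (assumed).
grid = [i / 10.0 for i in range(-40, 41)]
posterior = [1.0 / len(grid)] * len(grid)
prior_mean, prior_sd = mean_and_sd(grid, posterior)

# Hypothetical responses: (discrimination, difficulty, answered correctly?)
responses = [
    (1.0, -1.0, True), (1.0, -0.5, True),
    (1.2, 0.0, True), (1.2, 0.5, True),
    (1.5, 1.0, False), (1.5, 1.5, False),
]
for a, b, correct in responses:
    posterior = update_posterior(grid, posterior, a, b, correct)

post_mean, post_sd = mean_and_sd(grid, posterior)
print(f"prior:     mean={prior_mean:.2f}, sd={prior_sd:.2f}")
print(f"posterior: mean={post_mean:.2f}, sd={post_sd:.2f}")
```

After only six responses the distribution is already much narrower than the flat starting point and has shifted toward above-average ability, which is the ‘zeroing in’ described earlier. A stopping rule could simply watch the standard deviation and end the test once it drops below a chosen threshold.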
Ok, so I’m going out on a limb a bit talking about AI, classification, and assessment. I know of no working application for AI-like tools in this context, but it seems like there could eventually be some, and I thought I might as well have a little fun by talking about some possibilities. Classification is a classic AI problem, and it has been used to do things like convert spoken words into text, determine whether a lesion on a radiograph shows cancer or not, or decide what kind of online ads to send you. AI, and particularly some artificial neural networks (ANNs), are good at taking complex data and figuring out what category that set of data belongs to, either because the data naturally cluster into those categories, or because the network is given feedback about which category the data belong to and derives optimal rules for determining the category.
What kind of assessment application might this AI-based technology be good for? One possibility concerns assessment tasks in which complex and varied data are collected about task performance. Let’s say you’re interested in testing would-be pilots in a flight simulator. You want to collect lots of data about what pilots do when they encounter a particular situation, including the actions they take and when they take them. You will also have data on whether or not the pilot successfully managed the challenging situation. From there, the ANN can take over, determining what actions, or sequences and timing of actions, predict a successful outcome. This information could then be used to score performance in this environment and even to help design training so that pilots could learn the most effective strategies to use in challenging situations. The information derived from the ANN could also be used to design a probabilistic model for task performance like the ones described above, one that is less exploratory than the approach described here.
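Here is a toy version of the flight simulator idea, offered strictly as a sketch. The ten simulator records are invented, the two features (seconds taken to respond, and whether the checklist was followed) are hypothetical, and the ‘network’ is a single logistic unit trained by gradient descent, which is about the simplest thing that could be called a neural network. A real ANN would have hidden layers, many more features, and far more data.

```python
import math

# Hypothetical simulator log:
# (seconds_to_respond, followed_checklist, managed_successfully)
FLIGHTS = [
    (1.0, 1, 1), (1.5, 1, 1), (2.0, 1, 1), (2.5, 1, 1),
    (4.0, 1, 0), (4.5, 1, 0),
    (1.0, 0, 0), (2.0, 0, 0), (4.0, 0, 0), (5.0, 0, 0),
]


def predict(weights, bias, features):
    """Predicted probability of a successful outcome."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def mean_log_loss(weights, bias, data):
    """Average negative log-likelihood of the observed outcomes."""
    total = 0.0
    for *features, outcome in data:
        p = predict(weights, bias, features)
        total -= math.log(p) if outcome else math.log(1.0 - p)
    return total / len(data)


def train(data, learning_rate=0.1, epochs=5000):
    """Fit the logistic unit with plain batch gradient descent."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for *features, outcome in data:
            # Feedback step: how far off was the prediction for this flight?
            error = predict(weights, bias, features) - outcome
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [
            w - learning_rate * g / len(data) for w, g in zip(weights, grad_w)
        ]
        bias -= learning_rate * grad_b / len(data)
    return weights, bias


weights, bias = train(FLIGHTS)
quick_and_careful = predict(weights, bias, [1.0, 1])
slow_and_careless = predict(weights, bias, [5.0, 0])
print(f"P(success | fast, checklist)    = {quick_and_careful:.2f}")
print(f"P(success | slow, no checklist) = {slow_and_careless:.2f}")
```

The learned weights are the interesting part for assessment purposes: they say which behaviors go with success and by how much, and those are the raw materials for a scoring rule or a training recommendation.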
The bottom line is that, whether or not you call them AI, there are some pretty amazing, innovative, envelope-pushing things going on in assessment these days.