Chapter 4. Assessment of computer capabilities to answer questions in the Survey of Adult Skills (PIAAC)

This chapter describes the results of the exploratory assessment of current computer capabilities to answer questions from the OECD Survey of Adult Skills (PIAAC). The expert ratings of computer performance are discussed separately for the three cognitive skill areas assessed by PIAAC: literacy, numeracy and problem solving with computers. The analysis explores several different ways of aggregating the ratings to take into account the perspectives of the different experts. A comparison is then made between human and computer capabilities to answer the PIAAC questions. The expert discussion of some individual test questions is summarised to illustrate the aspects of human performance that are difficult for computers to reproduce. Finally, ratings of projected computer capabilities in 2026, provided by three of the computer scientists, are analysed.

  

This chapter describes the results of the exploratory assessment of computer capabilities on the Survey of Adult Skills (PIAAC). The assessment was carried out by a group of computer scientists using the approach described in Chapter 3. Most of the attention in the assessment focused on the ability of current computer techniques to answer test questions in literacy and numeracy. In these two skill areas, all 11 participating computer scientists provided ratings for each question, using a similar approach. Each expert provided a rating of Yes, No or Maybe for the ability of current computer techniques to answer each test question after a one-year development period costing no more than USD 1 million, and using the same visual materials that are used by adults who take the test. In addition, six of the participants provided ratings for the third skill area of problem solving using computers,1 and three of the participants provided ratings for possible computer capabilities in 2026 for all three skill areas. The chapter discusses the results for the different skill areas in turn: literacy, numeracy and problem solving using computers.

In general, the experts projected a pattern of performance for computer capabilities in the middle of the adult proficiency distribution on PIAAC. In literacy, these preliminary results suggest that current computer techniques could perform roughly like adults at Level 2 and that Level 3 performance is close to being possible. In numeracy, the preliminary results suggest that computer performance is roughly at Level 2 and that Level 3 or 4 is close to being possible. In problem solving with computers, the preliminary results suggest that computer performance is roughly at Level 2 and that Level 3 is close to being possible.

Ratings of computer capabilities to answer the literacy questions

Literacy skill in PIAAC is defined as the “ability to understand, evaluate, use and engage with written texts to participate in society, to achieve one’s goals, and to develop one’s knowledge and potential” (OECD, 2012). The test covers the decoding of written words and sentences, as well as the comprehension, interpretation and evaluation of complex texts; it does not include writing. The questions use different types of texts, both print-based and digital, and both continuous prose and non-continuous document texts, and some questions mix several types of text or draw on multiple texts. The questions are drawn from several contexts that will be familiar to most adults in developed countries, including work, personal life, society and community, and education and training.

Literacy proficiency is described in terms of six proficiency levels, ranging from Below Level 1 to Level 5. The easier test items involve short texts on familiar topics and questions that can be matched directly to a passage of text. The harder test items involve longer and sometimes multiple texts on less familiar topics, questions that require some inference from the text, and distracting information in the text that can lead to a wrong answer. For example, one Below Level 1 item includes several brief paragraphs about a union election. It includes a simple table showing the votes for three candidates and asks which candidate received the fewest votes. An example Level 2 item shows a simple website about a sporting event. It asks for the phone number for the event organisers, which is not shown directly but can be found by following a link marked “Contact Us.” An example Level 4 item provides the result of a library search for books related to genetically modified foods. It asks which book argues that the claims made both for and against genetically modified foods are unreliable. This item requires the test-taker to interpret the information in the title and brief description for each book and to avoid many books that are superficially related to the question but not a correct response (OECD, 2013a).2

Computer literacy ratings by question difficulty

Figure 4.1 shows the average assessment ratings of computer capabilities on the questions at each literacy proficiency level.3 For each question, the answers of the different experts are averaged together, counting a Yes as 100%, a Maybe as 50%, and a No as 0%. These average expert ratings by question are then averaged together for all questions in each proficiency level. The average expected performance ranges from a high of 90% on the questions that are easiest for adults (Level 1 and below) to 41% on the questions that are most difficult for adults (Levels 4 and 5).
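
The aggregation just described is simple enough to state as a short calculation. The following Python sketch reproduces it with made-up ratings; the function and variable names are ours, and the actual expert ratings are those in Annex Table B4.2.

```python
# Illustrative sketch of the aggregation used in Figure 4.1.
# Ratings and level assignments below are hypothetical, not the actual
# expert data (see Annex Table B4.2 for the real ratings).

RATING_VALUES = {"Yes": 1.0, "Maybe": 0.5, "No": 0.0}

def question_average(ratings):
    """Average the expert ratings for one question, counting Maybe as 50%."""
    return sum(RATING_VALUES[r] for r in ratings) / len(ratings)

def level_averages(questions_by_level):
    """Average the per-question averages within each proficiency level."""
    return {
        level: sum(question_average(q) for q in questions) / len(questions)
        for level, questions in questions_by_level.items()
    }

# Hypothetical ratings for two Level 2 questions and one Level 3 question.
example = {
    "Level 2": [["Yes", "Yes", "Maybe", "No"], ["Yes", "Maybe", "Maybe", "No"]],
    "Level 3": [["No", "Maybe", "Yes", "No"]],
}
print(level_averages(example))  # {'Level 2': 0.5625, 'Level 3': 0.375}
```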

Figure 4.1. Expert ratings of computer capabilities to answer literacy questions, averaged with Maybe=50%, by level of PIAAC question difficulty

Source: Annex Table A4.1.

 https://doi.org/10.1787/888933610803

Although the average expected performance for computers by proficiency level decreases as the questions become more difficult for adults, there are large differences across the individual questions within each proficiency level. Overall, the correlation coefficient across the individual questions between the average expected rating for computers and the question difficulty score for adults is -0.61.

The participants discussed alternative meanings for their Maybe ratings, with some saying they used a Maybe rating to reflect genuine uncertainty about whether computers could answer a question and others saying they used Maybe when they believed computers could probably answer a question but were not completely sure. To reflect these two possible interpretations, Figure 4.2 provides two alternative averages, one that omits the Maybe ratings (to reflect genuine uncertainty) and one that groups them with the Yes ratings. The version with the Maybe ratings omitted from the averages produces little change in the overall results. The version with Maybe counted as 100%, like the Yes ratings, increases the expected performance on Levels 2-5 by about 10 percentage points each. It is not surprising that alternative codings produce relatively small differences since the Maybe rating was used in only 19% of the judgments.
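
The two alternative treatments of the Maybe ratings amount to small variations on the same per-question average. The sketch below, again with hypothetical ratings, shows both variants: dropping Maybe from the average entirely, or coding it as 100% alongside Yes.

```python
def question_average_variant(ratings, maybe_value=None):
    """Per-question average with alternative codings of Maybe.

    maybe_value=None drops Maybe ratings from the average entirely;
    maybe_value=1.0 counts them the same as Yes.
    """
    values = {"Yes": 1.0, "No": 0.0}
    if maybe_value is None:
        kept = [values[r] for r in ratings if r != "Maybe"]
    else:
        values["Maybe"] = maybe_value
        kept = [values[r] for r in ratings]
    return sum(kept) / len(kept)

ratings = ["Yes", "Maybe", "Maybe", "No"]   # hypothetical ratings for one question
print(question_average_variant(ratings))        # Maybe omitted -> 0.5
print(question_average_variant(ratings, 1.0))   # Maybe counted as Yes -> 0.75
```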

Figure 4.2. Expert ratings of computer capabilities to answer PIAAC literacy questions, averaged with alternative coding of Maybe ratings, by level of question difficulty

Source: Annex Table A4.2.

 https://doi.org/10.1787/888933610822

Accounting for differences in areas of expertise in the literacy ratings

The expected performance ratings shown in Figure 4.1 count the assessments of each of the computer scientists with equal weighting. As a result, a high score is possible on a particular question only if most of the computer scientists in the group know about a technique that could be used to successfully answer the question. This way of aggregating the results may be overly conservative in some cases, since it effectively prevents new techniques that only a few of the experts know about from leading to an aggregate Yes rating. Although most of the experts will know about well-established techniques, each one probably has specific knowledge or access to different results when it comes to newer techniques. For test questions that could potentially be answered by newer techniques but not older techniques, only a few of the experts in the group may know about relevant research and be in a position to offer a viewpoint on it.

An alternative way of aggregating the ratings across the group would be to assign an aggregate Yes rating for computers if some minimum number of experts rates that question as Yes.4 This approach takes into account the differences in techniques that the different experts in the group know about. If several experts know about a technique that could be used to answer a particular question, then it would be reasonable to count that as a question that computers are likely to be able to answer – even if the other experts do not know about that technique and believe that computers could not answer the question successfully.
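
Stated as a rule, a question counts as answerable under this aggregation if at least three of the 11 experts rated it Yes, whatever the remaining experts said. A minimal sketch with hypothetical ratings:

```python
def meets_minimum(ratings, minimum=3):
    """True if at least `minimum` experts rated the question Yes."""
    return sum(1 for r in ratings if r == "Yes") >= minimum

def share_answerable(questions, minimum=3):
    """Share of questions in a set that meet the minimum-Yes rule."""
    flags = [meets_minimum(q, minimum) for q in questions]
    return sum(flags) / len(flags)

# Hypothetical ratings for two questions at one proficiency level.
level_questions = [
    ["Yes", "Yes", "Yes", "No", "Maybe"],  # three Yes ratings: counted
    ["Yes", "Maybe", "No", "No", "No"],    # one Yes rating: not counted
]
print(share_answerable(level_questions))   # 0.5
```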

Figure 4.3 shows the results of an analysis using a 3-expert minimum where each question is counted as Yes if at least three of the 11 computer scientists rated it as a Yes. With this approach to aggregating the results, the proportion of questions expected by the experts to be answered successfully by computers ranges from 100% of the easiest questions (Level 1 and below) to 58% of the most difficult questions (Levels 4 and 5). These results suggest a substantially higher level of computer success on the questions than when the ratings are simply averaged across the group.

Figure 4.3. Expert ratings of computer capabilities to answer PIAAC literacy questions, comparing average using Maybe=50% and 3-expert minimum, by level of question difficulty

Source: Annex Table A4.3.

 https://doi.org/10.1787/888933610841

Computer literacy ratings by expert

In addition to different types of expertise related to different computer techniques, the computer scientists in the group had different overall levels of optimism about the general ability of computers to answer the literacy test questions. To compare the levels of optimism, Figure 4.4 shows the average rating across all literacy questions for each expert, counting a Yes as 100%, a Maybe as 50%, and a No as 0%.5 The scores span 56 percentage points across the experts, from 28% for Hobbs to 84% for Forbus. The average for the group is 56%. Changing the scoring for Maybe, either omitting the rating from the average or counting it as 100%, does not have an appreciable impact on the range of average ratings across the experts.

Figure 4.4. Expert ratings of computer capabilities to answer PIAAC literacy questions, by expert

Source: Annex Table A4.4.

 https://doi.org/10.1787/888933610860

The fact that some of the experts were much more “optimistic” than the others raises a question about the 3-expert minimum analysis in Figure 4.3 that counts a question as Yes if at least three experts give it a Yes. Rather than different types of expertise, this aggregation approach may simply reflect the judgments of the most optimistic experts in the group. This is because it would be possible for a question to receive a Yes with only the results of the three most optimistic experts (Forbus, Burstein and Saraswat). To account for this, one might add the additional requirement that at least one of the experts saying computers can answer the question is not in the group of the top three Optimists. Adding this extra requirement does not substantially change the results. It only modestly decreases the computer rating for Level 3 questions from 75% to 67%, and the rating for Level 4 and 5 questions from 58% to 50%.

The different levels of optimism across the group also raise the possibility of excluding the experts who are more extreme and focusing on those in the middle, whose view might be closer to a consensus of the field. However, averaging the ratings across the five experts in the middle (Vardi, Steedman, Passonneau, Rus and Spohrer) produces results that are very close to the simple average for the full group.

Comparing the computer literacy ratings to human scores

The scoring process for the Survey of Adult Skills uses item response theory6 to calculate difficulty scores for each question as well as proficiency scores for each adult, with the scores for both questions and people placed on the same 500-point scale (OECD, 2013c). Each adult who takes the test is placed at the level where they answer two-thirds of the questions successfully. As a result, an adult with a literacy proficiency of Level 2 can successfully answer Level 2 questions about two-thirds of the time. Generally, people will be more successful in answering questions easier than their level and less successful answering questions harder than their level. For example, an average adult at the mid-point of Level 2 can answer 92% of Level 1 questions and only 26% of Level 3 questions (OECD, 2013b, Table 4.6).
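
As a point of reference for footnote 6, a common item response theory formulation is the two-parameter logistic model sketched below, in which a respondent's proficiency and a question's difficulty sit on the same scale. This is a generic illustration rather than the exact specification used in the PIAAC scaling; the convention of placing an adult at the level where they answer about two-thirds of the questions correctly corresponds to using a response probability of roughly 0.67.

```latex
% Two-parameter logistic (2PL) IRT model -- generic illustration, not
% necessarily the exact PIAAC specification.
% \theta : proficiency of the respondent
% b_i    : difficulty of question i (same scale as \theta)
% a_i    : discrimination of question i
P(\text{correct on question } i \mid \theta)
  = \frac{1}{1 + \exp\bigl(-a_i(\theta - b_i)\bigr)}
```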

Figure 4.5 compares the expected computer ratings for literacy with the performance of adults at three different levels of literacy proficiency, using the average of the expert ratings and coding Maybe as 50%.7 Compared to Level 2 and 3 adults, the computer ratings show less change across the different levels of question difficulty, with lower expected performance on the easier questions than people show and relatively higher expected performance on the harder questions. The computer ratings are worse than Level 2 adults on the Level 1 questions, match Level 2 adults on the Level 2 questions, and are substantially better than Level 2 adults on the Level 3 and 4 questions. On the Level 4 questions, the computer ratings are also above the Level 3 adults.

Figure 4.5. Comparison of computer literacy ratings with adults of different proficiency, using average rating with Maybe=50%, by level of PIAAC question difficulty

Source: Annex Table A4.5 and OECD (2016), Survey of Adult Skills (PIAAC) (Database 2012, 2015), www.oecd.org/site/piaac/publicdataandanalysis.htm.

 https://doi.org/10.1787/888933610879

Figure 4.6 compares the expected computer ratings and adult performance using the 3-expert minimum analysis. With this alternative, the computer ratings are better than Level 2 adults for questions at all levels of difficulty. The computer ratings are better than Level 3 adults for questions at all levels of difficulty except for the questions at Level 2, where the computers are roughly comparable.

Figure 4.6. Comparison of computer literacy ratings with adults of different proficiency, using a 3-expert minimum, by level of PIAAC question difficulty

Source: Annex Table A4.6, and OECD (2016), Survey of Adult Skills (PIAAC) (Database 2012, 2015), www.oecd.org/site/piaac/publicdataandanalysis.htm.

 https://doi.org/10.1787/888933610898

While there are differences across the levels of question difficulty and possible ways of aggregating the ratings from the individual experts, the comparison suggests that the literacy capabilities of computers correspond roughly to the pattern of human performance seen in Level 2 or Level 3 adults.

Disagreement on the computer literacy ratings

To examine the range of disagreement across the different questions, a simple measure of disagreement was calculated by comparing the number of Yes and No ratings. A question was identified as showing disagreement if it received at least two Yes ratings and at least two No ratings. The Maybe ratings were ignored. Overall, 60% of the questions showed disagreement by this measure. To gauge the overall effect of disagreements on the aggregate ratings, Figure 4.7 compares the average ratings from Figure 4.1 with averages based only on the 40% of the questions where the experts showed “high agreement”, that is, the questions that did not show disagreement as defined above. The overall results using only the questions where the experts agree are quite similar to the results using all questions.
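
The disagreement measure is easy to restate precisely: a question shows disagreement when it attracts at least two Yes and at least two No ratings, with Maybe ratings ignored. A minimal sketch with hypothetical ratings:

```python
def shows_disagreement(ratings):
    """At least two Yes and at least two No ratings; Maybe is ignored."""
    yes = sum(1 for r in ratings if r == "Yes")
    no = sum(1 for r in ratings if r == "No")
    return yes >= 2 and no >= 2

print(shows_disagreement(["Yes", "Yes", "No", "No", "Maybe"]))  # True
print(shows_disagreement(["Yes", "No", "No", "No", "Maybe"]))   # False: only one Yes
```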

Figure 4.7. Expert ratings of computer capabilities to answer PIAAC literacy questions, comparing the average using all questions to the average using only questions showing high agreement, by level of question difficulty

Source: Annex Table A4.7.

 https://doi.org/10.1787/888933610917

Discussion of the literacy questions

Throughout the meeting, there was extensive discussion of different literacy questions and the challenges they pose for computers. It is worth describing some of the notable exchanges to outline the types of concerns and analysis that were the focus of the discussion.

The only literacy question in Level 1 and below that the computer scientists did not agree could be answered by computers (Literacy #5) was a question with a figure that was difficult to process visually. For this question, the group divided evenly between those who believed current techniques could answer the question and those who believed they could not. The figure shows a line of people, each holding a sign showing a number. Each person represents a specific country, indicated by a country name positioned underneath. The text indicates that the numbers on the signs represent the percentage of teachers in the country who are female. The information in the figure could have been shown in a simple table giving the statistic for each country. In that case, the computer scientists agreed, the question could easily have been answered using current computer techniques. The difficulty of the question for computers is entirely related to the problems computers would have in connecting the pieces of information in the picture.

The easiest literacy question in Level 2 (Literacy #8) raised a different kind of challenge. In this question, the test-taker sees an Internet poll related to using the Internet in cars. Instructions are given to vote in the poll on behalf of another person who believes that Internet use in cars is unsafe. When assessing the ability of computers to answer the question, the group of experts divided evenly between Yes, Maybe and No responses. The difficulty of the question in this case relates to understanding the common sense implications of the instructions – that voting on someone else’s behalf means voting according to their opinion, and that voting in this context means pressing the buttons on the Internet poll website.

Another question that received extensive discussion was one of the more difficult questions in Level 3 (Literacy #44), which asks about the distance between different cities and provides a triangular distance table to use in determining the answer. Such tables are commonly used on printed maps to provide distances between pairs of cities. However, with the increasing use of computers and GPS to provide directions, many people today would never use a triangular distance table when planning a trip and some people may never have seen this kind of table format. Again the group divided evenly between Yes, Maybe and No responses on the ability of computers to answer this question. However, the discussion showed a wide range of approaches to thinking about the problem. The group did not believe that it would be hard to understand the lines and numbers of the table from the picture. Instead, the issue was the ability of computers to interpret the meaning of this unusual table format. One of the computer scientists approached the question as a visual problem solving task, suggesting that the unusual format could be understood by applying standard rules for labelling more conventional tables. A number of the experts assumed that the ground rules for the test would need to specify the use of this type of table in advance. This would then make it possible to apply standard techniques during the development process to allow a computer to interpret tables of this type. One expert assumed that the information could be made available in a more standard table format. Several suggested that the easiest way of answering the question would be to ignore the table provided and instead use Google to provide information for the appropriate distances.

The discussions about these three different literacy questions illustrate the wide range of factors that the computer scientists considered in determining whether current computer techniques could answer the questions. Notably, in these three questions, the difficulties that potentially prevent computers from successfully providing an answer seem to relate largely to factors other than their literacy capabilities: interpreting a difficult picture, understanding common sense implications related to voting and having advance warning about an unusual table format. The different factors noted in these three examples are typical of much of the discussion that occurred around the literacy questions at the meeting. However, it is possible that the non-literacy factors were discussed not because they were so important, but because they were unusual and therefore worth noting.

Another view of the factors being considered by the computer scientists was provided by a discussion of ten questions showing high levels of disagreement. To identify these questions, the computer scientists were divided into three groups according to their overall average ratings in Figure 4.4, distinguishing the top three “Optimists” (Forbus, Burstein, and Saraswat), the bottom three “Pessimists” (Hobbs, Davis, and Graesser), and the five “Realists” in the middle (Vardi, Steedman, Passonneau, Rus, and Spohrer). The ten questions identified were those where the Optimists voted Yes (with at most one Maybe in the group) and the Pessimists voted No (with at most one Maybe in the group).8 The Realists generally leaned towards the Optimists on the easier questions and towards the Pessimists on the harder questions.
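
The criterion used to flag these questions can be made explicit: the three Optimists all rate Yes and the three Pessimists all rate No, allowing at most one Maybe within each group. The sketch below illustrates the rule with hypothetical ratings; the grouping of experts follows Figure 4.4.

```python
def group_votes(ratings, vote, max_maybe=1):
    """True if every non-Maybe rating in the group equals `vote`,
    with at most `max_maybe` Maybe ratings allowed."""
    maybes = sum(1 for r in ratings if r == "Maybe")
    others = [r for r in ratings if r != "Maybe"]
    return maybes <= max_maybe and all(r == vote for r in others)

def high_disagreement(optimist_ratings, pessimist_ratings):
    """Optimists effectively vote Yes while Pessimists effectively vote No."""
    return group_votes(optimist_ratings, "Yes") and group_votes(pessimist_ratings, "No")

print(high_disagreement(["Yes", "Yes", "Maybe"], ["No", "No", "No"]))  # True
print(high_disagreement(["Yes", "Yes", "No"], ["No", "No", "No"]))     # False
```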

The targeted discussion of these ten high-disagreement questions contrasted with the discussion of questions that came up spontaneously. This time, the disagreement between the Optimists and the Pessimists in the group centred on issues relating to language interpretation. In particular, much of the discussion concerned whether “shallow” language processing would be adequate to answer each question or whether “deep” language processing would be necessary. Shallow processing involves pattern matching of various types, as carried out in search routines. By contrast, deep processing involves full interpretation of the meaning of the language. In two cases, the discussion convinced the Pessimists that the question was easier than they had originally thought and could be answered successfully with pattern matching techniques.

Computer literacy ratings for 2026 by three experts

Three of the computer scientists also provided ratings for all of the individual questions for 2026. Although a complete analysis across all 11 experts is not possible, the partial analysis for these three provides an interesting additional perspective regarding the literacy test questions.

The three computer scientists who provided ratings for 2026 are Davis, Forbus and Graesser. As indicated in Figure 4.4, Davis and Graesser are overall less optimistic about current computer capabilities on the PIAAC literacy test, whereas Forbus is more optimistic. The average literacy rating for 2016 projected by these experts is 53%. This is only slightly below the average rating of 56% across all 11 computer scientists.

Figure 4.8 compares the average rating by proficiency level for 2016 and 2026 for these three experts, showing predicted ratings for 2026 that are substantially greater than their ratings for 2016.9 The predicted pattern of computer performance in 2026 is somewhat better than the pattern of performance shown by adults at Level 3 in literacy.

Figure 4.8. Comparison of computer literacy ratings for 2016 and 2026, by level of PIAAC question difficulty

Source: Annex Table A4.8 and OECD (2016), Survey of Adult Skills (PIAAC) (Database 2012, 2015), www.oecd.org/site/piaac/publicdataandanalysis.htm.

 https://doi.org/10.1787/888933610936

Summary of computer ratings on the literacy questions

Overall, the group expects computers to be more successful in literacy questions that are easier for people, and less successful in the questions that are harder for people. This pattern roughly corresponds to the increasing difficulty of the language processing required as the questions become more difficult for people. However, the change in expected performance for computers across the different levels of question difficulty is weaker than it is for humans. At the same time, certain questions at each proficiency level are expected by the group to be far more difficult for computers than for humans. In these cases, the extra difficulty for computers often relates to additional capabilities required for the questions, such as understanding visual information or using common sense reasoning.

Across the group of 11 computer scientists, the average rating of current computer capabilities in literacy roughly corresponds to the range of performance for adults who are rated at Level 2 or 3. Such adults can answer about two-thirds of the questions at Level 2 or 3 and almost all of the easier questions. When the Maybe responses are coded as 50%, the expected pattern of aggregate performance across the different levels looks more like that of Level 2 adults. However, for the 3-expert minimum, the overall assessment of current computer capabilities resembles more closely the range of performance for adults who are rated at Level 3. Three computer scientists who also projected the capabilities of computers for 2026 estimated that the performance would be somewhat better than adults who perform at Level 3 in literacy.

Ratings of computer capabilities to answer the numeracy questions

Numeracy in the Survey of Adult Skills is defined as the “ability to access, use, interpret and communicate mathematical information and ideas, in order to engage in and manage the mathematical demands of a range of situations in adult life” (OECD, 2012). The skill includes four areas of content: quantity and number; dimension and shape; pattern, relations and change; and data and chance. The mathematical information in the test can be represented in a variety of formats, including objects and pictures; numbers and symbols; visual displays, such as diagrams, maps, graphs or tables; texts; and technology-based displays. The questions are drawn from the same familiar contexts used for the literacy test: work, personal life, society and community, and education and training.

Numeracy proficiency is described in terms of six levels, ranging from Below Level 1 to Level 5. The easier test items involve single-step processes, such as using basic arithmetic in familiar contexts. The harder test items involve complex or abstract contexts and questions requiring multiple problem-solving steps related to quantitative or spatial data. For example, a Below Level 1 item has four supermarket price tags that include the packing date and asks which product was packed first. An example Level 2 item shows a logbook used by a salesman to record work-related miles of driving. It asks for the reimbursement the salesman will receive for one trip noted in the logbook, using a stated reimbursement rate per mile. An example Level 4 item provides two stacked-column bar graphs showing the distribution of the Mexican population by years of schooling in different years for men and women separately. It asks for one of the values shown on one of the bar graphs for one of the years and one of the categories of years of schooling (OECD, 2013a).10

Computer numeracy ratings by question difficulty

The average assessment ratings of computer capabilities for the numeracy questions are illustrated in Figure 4.9.11 As with the literacy analysis, the answers of the different experts are combined to produce an average rating for each question. The average ratings of all questions for each proficiency level are then averaged together.

Figure 4.9. Expert ratings of computer capabilities to answer PIAAC numeracy questions, averaged with Maybe=50%, by level of question difficulty

Source: Annex Table A4.9.

 https://doi.org/10.1787/888933610955

The results indicate a much weaker relationship between the expected performance of computers and the difficulty score for adults than that shown with the literacy questions. For numeracy, average expected performance of current computer techniques ranges from 69% for Level 2 questions to 52% for Level 4 and 5 questions. Unlike the results for literacy, the expected performance of computers in numeracy for the easiest questions for adults (Level 1 and below) is not close to 100%.

It is notable that the particularly low rating for computers for the questions at Level 1 and below is almost entirely due to two questions (#1 and #8, discussed below). These include images that would be difficult for a computer to interpret. The correlation coefficient across the individual questions between the average expected rating for computers and the question difficulty score for adults is only -0.22, much smaller than the corresponding correlation for literacy.

As Figure 4.10 illustrates, the expert ratings of computer capabilities to answer the PIAAC numeracy questions do not change substantially when averaged with different codings of the Maybe ratings. As with the analysis for literacy, the alternative that omits the Maybe ratings from the averages is almost indistinguishable from the version that counts Maybe as 50%. The version that counts Maybe ratings as 100% increases expected computer performance by about 10 percentage points at each numeracy proficiency level. Here, as with literacy, Maybe ratings account for a relatively small share (22%) of the ratings.

Figure 4.10. Expert ratings of computer capabilities to answer PIAAC numeracy questions, averaged with alternative coding of Maybe ratings, by level of question difficulty

Source: Annex Table A4.10.

 https://doi.org/10.1787/888933610974

Accounting for differences in areas of expertise in the numeracy ratings

Figure 4.11 offers the results of the 3-expert minimum analysis, in order to account for differences in the areas of expertise of the computer scientists. This comparison also allows the ratings to reflect computer capabilities from newer techniques that some experts may not yet know about. With this approach to aggregating the results, the proportion of questions expected to be answered successfully by computers ranges from 95% for the Level 2 questions to 83% for the Level 4 and 5 questions. As with literacy, the results from this approach suggest a substantially higher level of computer success on the numeracy questions than when the ratings are simply averaged across the group.

Figure 4.11. Expert ratings of computer capabilities to answer PIAAC numeracy questions, comparing average using Maybe=50% and 3-expert minimum, by level of question difficulty

Source: Annex Table A4.11.

 https://doi.org/10.1787/888933610993

Computer numeracy ratings by expert

Figure 4.12 illustrates the average rating across all numeracy questions for each expert, counting a Yes as 100%, a Maybe as 50%, and a No as 0%. The range is 69 percentage points, from 21% for Davis to 90% for Hobbs. This range is wider than that for literacy (56 percentage points). The average for the group is 64%. Changing the scoring for Maybe, by omitting the rating from the average or counting it as 100%, does not make an appreciable difference to the overall ratings across the group.

Figure 4.12. Expert ratings of computer capabilities to answer PIAAC numeracy questions, by expert

Source: Annex Table A4.12.

 https://doi.org/10.1787/888933611012

Although most of the experts appear in roughly the same position in the literacy and numeracy orderings, there is a striking change for two of the experts, Hobbs and Forbus. In literacy, Hobbs is the most pessimistic whereas in numeracy he becomes the most optimistic. By contrast, in literacy Forbus is the most optimistic while in numeracy he is the third most pessimistic.

As with literacy, an average for numeracy that focuses on the five experts in the middle – excluding the three most and least optimistic experts – produces results that are roughly similar to the simple average across the full group. Yet in the case of numeracy, the average across the five experts in the middle tends to be somewhat higher than the average across the full group, particularly for the questions that are more difficult for people.

Comparing the computer numeracy ratings to human scores

Because of the lower expected performance of computers on the easiest questions in numeracy and the flatter shape of performance at the different proficiency levels, the overall pattern of expected performance looks less like the shape of typical adult performance than is the case for literacy. In general, the expected performance for computers is about 20 percentage points lower for numeracy than for literacy on the Level 1 questions, but about 10 percentage points higher on the Level 2-4 questions (Figures 4.2 and 4.10).

Figures 4.13 and 4.14 compare the computer numeracy ratings with the performance of adults at three different levels of numeracy proficiency. Figure 4.13 uses the average ratings with Maybe coded as 50%.12 With this coding, the computer ratings are lower than Level 2 adults for the Level 1 questions, equal to Level 2 adults for the Level 2 questions, and higher than Level 2 adults on the Level 3 and 4 questions. Figure 4.14 uses the 3-expert minimum approach that requires a minimum of three Yes ratings. With this alternative coding, the computer ratings are still lower than Level 2 adults for the Level 1 questions, but they are almost as high as the Level 4 adults for the Level 2 and Level 3 questions, and they are higher than the Level 4 adults for the Level 4 questions.

Figure 4.13. Comparison of computer numeracy ratings with adults of different proficiency, using average rating with Maybe=50%, by level of PIAAC question difficulty

Source: Annex Table A4.13 and OECD (2016), Survey of Adult Skills (PIAAC) (Database 2012, 2015), www.oecd.org/site/piaac/publicdataandanalysis.htm.

 https://doi.org/10.1787/888933611031

Figure 4.14. Comparison of computer numeracy ratings with adults of different proficiency, using 3-expert minimum, by level of PIAAC question difficulty

Source: Annex Table A4.14 and OECD (2016), Survey of Adult Skills (PIAAC) (Database 2012, 2015), www.oecd.org/site/piaac/publicdataandanalysis.htm.

 https://doi.org/10.1787/888933611050

Except for the low performance on the Level 1 questions, the comparison with human performance suggests that the numeracy capabilities of current computers correspond roughly to the pattern of performance seen in Level 2 or 4 adults, depending on the method used to aggregate the individual responses from the experts.

Disagreement on the computer numeracy ratings

The group showed a somewhat higher level of disagreement on the numeracy questions than the literacy questions: 66% of the questions provoked disagreement, compared to 60% for literacy. This calculation is based on the measure where a question is identified as showing disagreement when at least two Yes ratings and at least two No ratings occur. Most of the numeracy questions (88%) in Levels 3-5 provoked disagreement. The experts only agreed upon a small number of questions in Level 3 and Levels 4-5. It is therefore not meaningful to compare the results by numeracy proficiency level using these questions alone.

Discussion of the numeracy questions

A number of the experts noted that successfully applying computer techniques to answer the numeracy questions would require the development of a large number of specialised systems. These systems would be needed to address particular types of questions and process particular types of figures, tables or pictures. In most cases, the development of any one of these systems would not necessarily be a problem. However, it was unclear to the group how many systems would be needed to answer the full set of possible questions on the test. Without a well-defined specification of the types of material that might be presented, the number of potential specialised systems could be quite large.

The necessity to develop a number of specialised systems for numeracy contrasts with the situation for literacy, where the experts believe that many of the questions could be addressed with a relatively small number of general language techniques.

Much of the discussion of individual numeracy questions ended up focusing on issues related to understanding the visual input for different types of questions. Several of the experts asserted they were less confident about their judgments for the numeracy questions because they felt they did not have sufficient expertise to evaluate the visual processing requirements for the different questions.

The numeracy question that is the easiest for adults (Numeracy #1) was mentioned repeatedly during the discussion, because of the striking contrast between the expected low performance for computers and the high performance for people. As noted in Chapter 3, this question received the lowest rating for computer capability across the group; it was the only numeracy question that did not receive any Yes votes. The experts uniformly judged the question to be hard for computers because of the challenge of interpreting a photograph of two packages of bottled water: the packaging material makes it difficult for a computer to identify many of the bottles. The mathematical reasoning required to determine how many bottles are in the packages was not what would make the question hard for computers.

Another numeracy question in Level 1 (Numeracy #8) received very low ratings for similar reasons. This question uses a photograph of a box of candles and asks how many layers of candles are in the box. As with the photograph of the packaged water bottles, the photograph of the packaged candles is hard to interpret because many of the candles are not directly visible and must be inferred. The difficulty of the question for computers therefore relates to the task of interpreting the photograph, not the mathematical reasoning required to determine how many layers of candles are in the box.

As with the literacy discussion, the group discussed a set of numeracy questions that showed disagreements between the three top Optimists (Hobbs, Burstein, Spohrer) and the three top Pessimists (Davis, Graesser, Forbus) in the group. The group discussed eight of the 16 questions identified where the Optimists voted Yes (with at most one Maybe in the group) and the Pessimists voted No (with at most one Maybe in the group).13 The five experts in the middle leaned towards the Optimists on most of these questions.

Half of the questions discussed raised issues related to visual materials that the experts believed would be difficult for computers to interpret. In addition, another question asked the test-taker to use a ruler to measure a line and the group believed they lacked the necessary robotics expertise to evaluate the relevant computer capabilities. Unlike the corresponding discussion on literacy questions, there were no cases where the Optimists and Pessimists finally agreed on a question rating after discussion.

Computer numeracy ratings for 2026 by three experts

As with literacy, the three computer scientists who offered ratings for the numeracy questions for 2026 are Davis, Forbus and Graesser.

Since Forbus moved from being optimistic about computer capabilities for literacy to being more pessimistic about their capabilities for numeracy, all three of the experts who provided ratings for 2026 were at the more pessimistic end of the ratings. The average numeracy rating for 2016 for these experts is 33%. This is substantially below the average rating of 64% across all 11 computer scientists.

Figure 4.15 compares the average rating by numeracy proficiency level for 2016 and 2026. It indicates a substantial expected increase in computer capabilities over the ten-year period.14 The projected increase is much larger for numeracy than it is for literacy. In terms of human skills, the predicted pattern of performance in 2026 is close to that shown by adults who perform at Level 3 in numeracy, with expected success on about two-thirds of the questions at Level 3 and almost all of the easier questions at Level 2 and below.

Figure 4.15. Comparison of computer numeracy ratings for 2016 and 2026, by level of PIAAC question difficulty

Source: Annex Table A4.15 and OECD (2016), Survey of Adult Skills (PIAAC) (Database 2012, 2015), www.oecd.org/site/piaac/publicdataandanalysis.htm.

 https://doi.org/10.1787/888933611069

Summary of computer ratings on the numeracy questions

Unlike the ratings for the literacy questions, the computer scientists expect that computer performance will show only a small difference between the numeracy questions that are easier for people and those that are more difficult.

In general, the nature of the mathematical reasoning required for different questions was seldom raised as a difficulty in the discussion. The group focused primarily on the difficulties presented by the different visual materials, and by particular problem types. In a few cases, the experts also mentioned challenges related to the use of language in understanding the question or the text.

The average rating of current computer capabilities in numeracy is somewhat difficult to compare to the performance for adults, because the predicted computer performance is relatively similar at the different levels. The group projects that current computers could be successful on about two-thirds of the numeracy questions at Levels 2, 3 or even 4, depending on the aggregation method used. However, they do not expect computers to be successful in most of the easiest questions at Level 1 and below. When it comes to the easiest questions for adults, the primary problem posed for computers is the interpretation of visual material.

Finally, three computer scientists who projected the capabilities of computers for 2026 estimated that the performance on numeracy would be similar to adults who are rated at Level 3. These three experts were the ones who gave the lowest overall ratings for computer performance in numeracy for 2016.

Ratings of computer capabilities to answer the problem solving questions

Skill in problem solving with computers15 in PIAAC is defined as “using digital technology, communication tools and networks to acquire and evaluate information, communicate with others and perform practical tasks” (OECD, 2012). The domain involves the ability to solve problems for personal, work or civic purposes by setting up goals and plans, and accessing and making use of information through computers. Although the skill area is intended to address a full range of digital devices, the current version of the test is limited to work on a laptop computer using generic versions of email, browser and spreadsheet software.

Problem solving proficiency is described in terms of four proficiency levels, ranging from Below Level 1 to Level 3. The easier test items involve well-defined problems using only a single function of one of the generic programs and without any inference required. The harder test items involve combining multiple steps across multiple programs to solve a problem where the goal may not be fully defined, and where unexpected outcomes may occur. For example, a Level 1 item asks the test-taker to sort email responses to a party invitation into two existing folders for those who can and cannot attend. An example Level 2 item asks the test-taker to respond to an email asking about club members who meet two conditions, using a spreadsheet containing 200 entries describing each of the members. An example Level 3 item involves multiple email requests to reserve meeting rooms using a web-based reservation system and resolving a conflict related to two of the requests (OECD, 2013a).16

The six experts who provided ratings for computers in the problem solving domain are Davis, Forbus, Graesser, Passonneau, Spohrer and Steedman. In literacy, these six experts gave an average rating of 56%, the same as the average for all 11 experts. In numeracy, these six experts gave an average rating of 55%, somewhat below the average of 64% for all 11 experts. The results for the other two skill areas suggest that these six experts are likely to give a set of average ratings for the problem solving domain that are roughly comparable to the average that would have resulted from the full group of 11 computer scientists.

Computer problem solving ratings by question difficulty and by expert

Figure 4.16 provides the average expert ratings of computer capabilities to answer the questions in problem solving with computers for each proficiency level.17 As for the other skill areas, the answers of the different experts are averaged together to produce an expected result for each question, and the average ratings for all the questions in each proficiency level are then averaged. The results reveal a relatively strong relationship between the expected performance of computers and the level of difficulty of the questions for adults in this domain. The correlation coefficient across the individual questions between the average expected rating for computers and the question difficulty score for adults is -0.74. The results in Figure 4.16 average the individual ratings coding Maybe as 50%. The versions with Maybe omitted, with Maybe coded as 100%, or requiring a minimum of three Yes ratings all produce similar results.

Figure 4.16. Expert ratings of computer capabilities to answer PIAAC problem solving questions, averaged with Maybe=50%, by level of question difficulty

Source: Annex Table A4.16.

 https://doi.org/10.1787/888933611088

Figure 4.17 compares the expected computer ratings with the performance of adults at two different levels of proficiency in problem solving using computers.18 The shape of the experts’ expectations of computer capabilities across the different proficiency levels relatively closely matches adults with a proficiency of Level 2 in problem solving with computers.

Figure 4.17. Comparison of computer problem solving ratings with adults of different proficiency, using average rating with Maybe=50%, by level of PIAAC question difficulty

Source: Annex Table A4.17 and OECD (2016), Survey of Adult Skills (PIAAC) (Database 2012, 2015), www.oecd.org/site/piaac/publicdataandanalysis.htm.

 https://doi.org/10.1787/888933611107

The ratings of the six experts across all problem solving questions span 93 percentage points, from 0% for Graesser to 93% for Passonneau. The average rating across all six experts and all questions is 53%, substantially lower than for numeracy and slightly lower than for literacy. The range of disagreement for the problem solving domain is wider than in either of the other two domains. However, given the smaller number of experts, further analyses of the level of disagreement were not conducted.

Discussion of the problem solving questions

The group did not have time at the meeting to discuss the questions in the domain of problem solving with computers. However, notes prepared by the participants in advance contain several points related to this domain. Some of the experts expected that the context of the different questions would be difficult to interpret. They believed this would cause the problem solving questions to be more difficult for computers than the literacy and numeracy questions. However, this belief was not reflected in the actual ratings. Many of the specific points raised in the advance notes related to issues of language understanding, rather than expected difficulties related to problem solving or the use of software applications.

Computer problem solving ratings for 2026 by three experts

As for the other two domains, three of the computer scientists also provided ratings for 2026 for the questions on problem solving with computers. Davis and Forbus were in the middle of the expert distribution for 2016, whereas Graesser provided the lowest rating. Overall, these three experts had an average rating for 2016 of 36%, below the average of 53% across all six computer scientists who provided ratings for problem solving. Figure 4.18 compares the average rating by proficiency level in problem solving for 2016 and 2026 for the three experts who provided both.19 As with the ratings for literacy and numeracy, the predicted capability ratings for 2026 for problem solving are substantially greater than the corresponding ratings for 2016. The predicted pattern of performance is better than that shown by adults who perform at Level 2 in problem solving with computers, and almost as good as that of Level 3 adults, the highest proficiency level on the test.

Figure 4.18. Comparison of computer problem solving ratings for 2016 and 2026, by level of PIAAC question difficulty

Source: Annex Table A4.18 and OECD (2016), Survey of Adult Skills (PIAAC) (Database 2012, 2015), www.oecd.org/site/piaac/publicdataandanalysis.htm.

 https://doi.org/10.1787/888933611126

Summary of computer ratings on the problem solving questions

Only half of the experts (six in total) provided ratings for the problem solving questions, and the group did not have a chance to discuss the domain at the meeting. However, the available ratings provide an initial sense of the capabilities of computers in this area. Overall, the experts predict that computer performance will be stronger on the questions that are easy for people, and weaker on the questions that are harder for people. These results are like the ratings for literacy and unlike those for numeracy.

Overall, the projected average rating of current computer capabilities in problem solving with computers roughly corresponds to the range of performance for adults at Level 2 in this skill area.

Three computer scientists who also projected the capabilities of computers for 2026 estimated that the performance in the problem solving domain at that time would be almost as good as the top adult performance level on the test.

References

National Research Council (2005), Measuring Literacy: Performance Levels for Adults, Committee on Performance Levels for Adult Literacy, R.M. Hauser, C.F. Edley, Jr., J.A. Koenig and S.W. Elliott (eds.), The National Academies Press, Washington, DC.

OECD (2016), Survey of Adult Skills (PIAAC) (Database 2012, 2015), www.oecd.org/site/piaac/publicdataandanalysis.htm.

OECD (2013a), OECD Skills Outlook 2013: First Results from the Survey of Adult Skills, OECD Publishing, Paris, https://doi.org/10.1787/9789264204256-en.

OECD (2013b), The Survey of Adult Skills: Reader’s Companion, OECD Publishing, Paris, https://doi.org/10.1787/9789264204027-en.

OECD (2013c), Technical Report of the Survey of Adult Skills (PIAAC), OECD, Paris.

OECD (2012), Literacy, Numeracy and Problem Solving in Technology-Rich Environments: Framework for the OECD Survey of Adult Skills, OECD Publishing, Paris, https://doi.org/10.1787/9789264128859-en.

Notes

← 1. The formal name used for the problem solving skill area in PIAAC is “problem solving in technology-rich environments.”

← 2. More information about the Survey of Adult Skills and examples of the literacy questions are provided in OECD (2013a, 2013b). Full descriptions of the literacy proficiency levels are provided in Annex Table B4.1.

← 3. Complete assessment ratings for current computer capabilities by literacy question and expert are provided in Annex Table B4.2.

← 4. Another approach to reflecting different levels of expertise with respect to the different questions would have been to allow the experts to give a rating on their confidence in their judgments for each of the questions. This approach was not discussed or used at the meeting, but was suggested by one of the reviewers and could be explored in future work.

← 5. For full names, affiliations and areas of expertise for each expert, see Chapter 3, Table 3.1.

← 6. Item response theory is an approach to analysing test results that uses separate parameters to describe a respondent’s ability level and to describe a test question’s difficulty level (National Research Council, 2005, pp. 76-83).

← 7. The results are somewhat different than shown in Figure 4.1 for the computer ratings at the top and bottom because the Below Level 1 and Level 5 questions are excluded.

← 8. The ten questions that meet this criterion that were identified during the meeting were 21, 23, 28, 29, 32, 39, 46, 50, 52 and 56. An additional question that meets the criterion – 35 – was identified after the meeting and so was not discussed.

← 9. Complete assessment ratings for computer capabilities in 2026 by literacy question and expert are provided in Annex Table B4.3.

← 10. More information about the Survey of Adult Skills and examples of the numeracy questions are provided in OECD (2013a, 2013b). Full descriptions of the numeracy proficiency levels are provided in Annex Table B4.4.

← 11. Complete assessment ratings for current computer capabilities by numeracy question and expert are provided in Annex Table B4.5.

← 12. The results are somewhat different than shown in Figure 4.9 for the computer ratings at the top and bottom, because the Below Level 1 and Level 5 questions are excluded.

← 13. The 16 questions that meet this criterion that were identified during the meeting were 5, 16, 17, 25, 28, 31, 34, 36, 37, 41, 42, 43, 49, 51, 52 and 54. During the discussion of disagreements, the group discussed questions 16, 17, 25, 28, 31, 34, 36 and 41.

← 14. Complete assessment ratings for computer capabilities in 2026 by numeracy question and expert are provided in Annex Table B4.6.

← 15. The formal term used for this domain in PIAAC is “problem solving in technology-rich environments.”

← 16. More information about the Survey of Adult Skills and examples of the problem solving questions are provided in OECD (2013a, 2013b). Full descriptions of the problem solving proficiency levels are provided in Annex Table B4.7.

← 17. Complete assessment ratings for current computer capabilities by problem solving question and expert are provided in Annex Table B4.8.

← 18. The problem solving skill area is scored using only three levels of difficulty because of the small number of test questions, rather than the five levels used in literacy and numeracy.

← 19. Complete assessment ratings for computer capabilities in 2026 by problem solving question and expert are provided in Annex Table B4.9.