Executive summary

As artificial intelligence (AI) and robotics technologies continue to expand their scope of applications across the economy, understanding their impact becomes increasingly critical.

The AI and the Future of Skills (AIFS) project at OECD's Centre for Education Research and Innovation (CERI) is developing a comprehensive framework for regularly measuring AI capabilities and comparing them to human skills. The capability measures will encompass a wide range of skills crucial in the workplace and cultivated within education systems. They will establish a common foundation for policy discussions about AI's potential effects on education and work.

The AIFS project has carried out two phases of methodological development of its assessment framework. The first phase focused on identifying relevant AI capabilities and existing tests to evaluate them. It drew on a wealth of skill taxonomies and assessments across various disciplines, including computer science, psychology and education.

The second phase, the focus of this report, delves deeper into methodological development. It comprises three distinct exploratory efforts:

Education tests offer a valuable means of comparing AI to human capabilities in domains relevant to education and work. The project carried out two studies to explore the use of education tests for collecting expert judgements on AI capabilities. The first study, conducted in 2021/22, followed up an earlier pilot study, asking experts to evaluate AI’s performance on the literacy and numeracy tests of the OECD's Survey of Adult Skills (PIAAC). The second study collected expert judgements of whether AI can solve science questions from the OECD's Programme for International Student Assessment (PISA).

The studies aimed to refine the assessment framework for eliciting expert knowledge on AI using education tests. They explored different test tasks, response formats and rating instructions, along with two distinct assessment approaches: a "behavioural approach" used in the PIAAC studies, drawing on smaller expert groups engaging in discussions, and a "mathematical approach" adopted in the PISA study, relying more heavily on quantitative data from a larger expert pool.

This work showed that there are limits to obtaining robust measures of AI capabilities by surveying experts. Consensus evaluations are particularly hard to reach in domains that are not at the centre of current research. In addition, recruiting and engaging experienced experts is costly.

Two exploratory studies extended the rating of AI capabilities to tests used to certify workers for occupations. These tests present complex practical tasks typical of occupations, such as a nurse moving a paralysed patient or a product designer creating a design for a new container lid. Such tasks can potentially provide insight into how AI techniques are applied in the workplace.

The inherent complexity of occupational tasks makes them different from the questions contained in education tests. Occupational tasks require various capabilities, take place in unstructured real-world environments and are often unfamiliar to computer scientists. Consequently, the project had to develop different methods for collecting expert ratings of AI performance on such tasks. The two studies explored the use of different survey instruments and instructions for collecting reliable and valid expert evaluations of these tasks.

Rating AI performance on occupational tasks proved challenging. The difficulty lay in predicting the contextual factors that can affect AI performance and in specifying the underlying capability requirements of each task. On the other hand, the studies suggested that occupational tasks could help anticipate how occupations might evolve as new AI capabilities emerge. As a result, occupational tasks will be used to understand the implications of AI for work and education rather than to gather expert judgements on AI capabilities.

Recognising the limitations of expert judgement, the project began exploring measures derived from direct evaluations of AI systems. These benchmark tests offer a diverse range of evaluations but vary in quality, complexity and target capabilities.

To navigate this landscape, the project needed ways to select good-quality benchmarks, categorise them according to AI capabilities and systematise them into single measures. It commissioned experts to work on each of these tasks. Anthony Cohn and José Hernández-Orallo developed a method for describing the characteristics of benchmark tests to guide the selection of existing measures for the project. Guillaume Avrin, Swen Ribeiro and Elena Messina presented evaluation campaigns in AI and robotics and proposed an approach for systematising them according to AI capabilities. Yvette Graham reviewed major benchmark tests in natural language processing and developed an integrated measure based on the reviewed tests.

Using direct measures to develop valid indicators of AI capabilities is a challenging but promising direction, given the large number and variety of direct measures available. The measures evolve as rapidly as the field itself, which requires an approach for synthesising them conceptually. In addition, they often omit a comparison to human capabilities, which must be added through additional steps. The preliminary work on direct measures suggests ways of addressing both challenges.

Both expert judgements and direct AI measures are necessary to develop indicators of AI capabilities that are understandable, comprehensive, repeatable and policy-relevant. The project’s third phase is working on a concrete approach for developing such indicators in different domains. This approach draws on experts to review, select and synthesise direct AI measures into a set of integrated AI indicators. These will be complemented with measures obtained from expert evaluations in areas where direct AI assessments are lacking. The resulting AI indicators will then be linked to measures of human competences and examples of occupational tasks to derive implications for education and work. They should aid decision-makers in determining necessary policy interventions as AI continues to advance.

Disclaimers

This work is published under the responsibility of the Secretary-General of the OECD. The opinions expressed and arguments employed herein do not necessarily reflect the official views of the Member countries of the OECD.

This document, as well as any data and map included herein, are without prejudice to the status of or sovereignty over any territory, to the delimitation of international frontiers and boundaries and to the name of any territory, city or area.

The statistical data for Israel are supplied by and under the responsibility of the relevant Israeli authorities. The use of such data by the OECD is without prejudice to the status of the Golan Heights, East Jerusalem and Israeli settlements in the West Bank under the terms of international law.

Note by the Republic of Türkiye
The information in this document with reference to “Cyprus” relates to the southern part of the Island. There is no single authority representing both Turkish and Greek Cypriot people on the Island. Türkiye recognises the Turkish Republic of Northern Cyprus (TRNC). Until a lasting and equitable solution is found within the context of the United Nations, Türkiye shall preserve its position concerning the “Cyprus issue”.

Note by all the European Union Member States of the OECD and the European Union
The Republic of Cyprus is recognised by all members of the United Nations with the exception of Türkiye. The information in this document relates to the area under the effective control of the Government of the Republic of Cyprus.

Photo credits: © Shutterstock/LightField Studios; © Shutterstock/metamorworks; © Shutterstock/Rido; © Shutterstock/KlingSup.

Corrigenda to OECD publications may be found on line at: www.oecd.org/about/publishing/corrigenda.htm.

© OECD 2023

The use of this work, whether digital or print, is governed by the Terms and Conditions to be found at https://www.oecd.org/termsandconditions.