Annex C. Using machine learning to analyse the information contained in online job postings

The information contained in online job postings is both rich and voluminous. The database used in this report spans several gigabytes and comprises millions of keywords collected from job postings in different countries and over time. Beyond its size, this information differs from most traditional labour market statistics (for instance, labour force surveys) in that it takes the form of text rather than numbers and figures. Unlike standard quantitative data, text carries “semantic meaning”, which can be multifaceted and ambiguous but can also convey far more information than numbers and figures alone.

Recent advances in machine learning have led to the development of so-called language models, which aim to capture the complex relationships between words (their semantics) by deriving and interpreting the context in which those words appear. Language models, and Natural Language Processing (NLP) models in particular, interpret text by feeding it to machine learning algorithms that learn the rules governing the semantic context in which words appear. NLP and language models are therefore better suited to the analysis of text than traditional statistical methods and, as such, are used to analyse online job postings in the remainder of this report.

In particular, the approach taken in this report leverages “Word2Vec”, an NLP algorithm developed by researchers at Google. The algorithm maps the meaning (i.e. the semantics) of words contained in text to mathematical vectors, so-called “word vectors”. Put differently, word vectors are the mathematical representation of the meaning of the words used in online job postings. These vectors are placed in a high-dimensional vector space (called “graph”), where words with similar meanings occupy nearby positions.

Since word vectors¹ occupy a specific place in the vector space, it is possible to measure how close any two of them are (using cosine similarity) and to rank skills from the closest to the farthest from any given occupation. In other words, by estimating semantic closeness, this approach makes it possible to rank every skill (word) vector by its similarity to any given occupation vector.²
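For reference, the standard definition of cosine similarity between a skill vector and an occupation vector can be written as follows. The symbols s (skill vector), o (occupation vector) and n (number of dimensions) are notation introduced here for illustration and do not appear in the report.

```latex
\cos(\mathbf{s}, \mathbf{o})
  = \frac{\mathbf{s} \cdot \mathbf{o}}{\lVert \mathbf{s} \rVert \, \lVert \mathbf{o} \rVert}
  = \frac{\sum_{i=1}^{n} s_i \, o_i}
         {\sqrt{\sum_{i=1}^{n} s_i^{2}} \; \sqrt{\sum_{i=1}^{n} o_i^{2}}}
```

The value lies between -1 and 1, and a higher value indicates that the two vectors point in more similar directions, which is read here as greater semantic closeness.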

Skills that are more similar to a certain occupation are interpreted in this report as being more “relevant” to that occupation. Using this approach, it is therefore possible to assess whether the skill “Excel” is more relevant to the occupation “Economist” or to “Painter”, based on the semantic closeness of these words’ meanings as extrapolated from millions of job postings.
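The comparison described above can be illustrated with a minimal Python sketch. It is not the code used for this report: the three-dimensional vectors are invented for illustration (real word vectors are learned from job postings and have many more dimensions), and the function names `cosine_similarity` and `rank_occupations_for_skill` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_occupations_for_skill(skill, skill_vectors, occupation_vectors):
    """Return (occupation, score) pairs, most similar occupation first."""
    s = skill_vectors[skill]
    scores = {
        occupation: cosine_similarity(s, vector)
        for occupation, vector in occupation_vectors.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy vectors, invented for illustration only (not taken from the report).
skill_vectors = {"Excel": [0.9, 0.1, 0.2]}
occupation_vectors = {
    "Economist": [0.8, 0.2, 0.3],
    "Painter": [0.1, 0.9, 0.2],
}

ranking = rank_occupations_for_skill("Excel", skill_vectors, occupation_vectors)
for occupation, score in ranking:
    print(f"{occupation}: {score:.3f}")
```

With these invented vectors, “Economist” ranks above “Painter” for the skill “Excel”, which is the kind of ordering the relevance interpretation relies on.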

In this report, the matrix of skills-to-occupations relevance scores (the Semantic Skill Bundle Matrix, SSBM) is used to identify the occupations for which digital skills are particularly relevant, to assess the relationship between digital skills and occupations, and to gauge how quickly the demand for digital technologies and skills diffuses across labour markets.
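As a rough sketch of how such a matrix could be assembled and queried, the Python snippet below computes one relevance score per occupation-skill pair and then averages the scores over a set of digital skills. The report does not describe its implementation at this level of detail, so everything here is an assumption: the two-dimensional toy vectors, the list of digital skills, the use of a simple mean, and the names `build_relevance_matrix` and `mean_relevance`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_relevance_matrix(skill_vectors, occupation_vectors):
    """Nested dict: matrix[occupation][skill] -> cosine similarity."""
    return {
        occupation: {
            skill: cosine_similarity(skill_vec, occ_vec)
            for skill, skill_vec in skill_vectors.items()
        }
        for occupation, occ_vec in occupation_vectors.items()
    }


def mean_relevance(matrix, skills):
    """Average relevance of the given skills for each occupation."""
    return {
        occupation: sum(row[skill] for skill in skills) / len(skills)
        for occupation, row in matrix.items()
    }


# Toy vectors, invented for illustration only (not taken from the report).
skill_vectors = {
    "Python": [1.0, 0.0],
    "Excel": [0.8, 0.6],
    "Oil painting": [0.0, 1.0],
}
occupation_vectors = {
    "Data analyst": [1.0, 0.0],
    "Painter": [0.0, 1.0],
}
digital_skills = ["Python", "Excel"]  # assumed classification

matrix = build_relevance_matrix(skill_vectors, occupation_vectors)
digital_relevance = mean_relevance(matrix, digital_skills)
most_digital = max(digital_relevance, key=digital_relevance.get)
print(most_digital, round(digital_relevance[most_digital], 3))
```

Repeating the same calculation on vectors estimated for different years would give one way to track how the relevance of digital skills changes over time, although the report's actual procedure may differ.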

Notes

1. One n-dimensional vector per skill.

2. Occupation vectors are also calculated using a slight modification of Word2Vec called Doc2Vec.

Metadata, Legal and Rights

This document, as well as any data and map included herein, are without prejudice to the status of or sovereignty over any territory, to the delimitation of international frontiers and boundaries and to the name of any territory, city or area. Extracts from publications may be subject to additional disclaimers, which are set out in the complete version of the publication, available at the link provided.

https://doi.org/10.1787/659ce346-en

The use of this work, whether digital or print, is governed by the Terms and Conditions to be found at https://www.oecd.org/termsandconditions.