5.8. Roadmap: The Internet as a source for statistical data

Why use the Internet as a source for statistical data?

The Internet has become an indispensable infrastructure for economies and societies. An ever-growing share of economic transactions, communication and information supply takes place online. Many of these online actions leave digital “footprints” that can be observed using tools that scan, gather, interpret, filter and organise information from across the Internet, providing a foundation for the use of the Internet as a statistical data source (IaSD). Online data may be of use in combination with, or as a substitute for, data collected by traditional instruments such as statistical surveys or off-line administrative sources. For example, online retailers’ websites can be a useful source of information about prices, while social media may provide information related to employment, population or societal wellbeing.

The relatively short history of Internet-based social and behavioural research (Hewson et al., 2016) shows that online data can support different elements of statistical activity within national statistical organisations (NSOs) at different steps of the statistical value chain:

  • Identifying and sampling the population of interest. Internet data can enable efficient updating of registers of statistical units based on Internet presence (e.g. businesses with their own websites or active in online marketplaces), thereby supporting the design of data collection processes.

  • Data collection. In many instances, web-reading techniques may enable the search for and retrieval of information online that may not otherwise be available with comparable levels of timeliness, detail and exhaustiveness (Bean, 2016). Such data can be timely, especially compared to data collected through traditional survey approaches; Internet search patterns can provide early warning signs about upcoming economic downturns or of health issues emerging in the population, for example. Use of the Internet has the potential to free up NSO resources and reduce response burdens so that surveys can be implemented where they are most effective.

  • Verification / imputation. Information from the Internet can be used to verify data from other sources, such as surveys. In addition, the use of online information to identify commonalities between respondents and non-respondents may be of use in making imputations to ensure statistics are representative of the target population.

  • Dissemination. By releasing their statistics online, NSOs also contribute to the enhancement of IaSD for use by expert and interested users, including other NSOs and international organisations.
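The data collection step described above can be illustrated with a minimal sketch of web reading applied to prices. It assumes a hypothetical retailer page in which product names and prices sit in `<span>` elements with the classes `name` and `price`; the page layout, the class names and the sample values are all invented for illustration, and real sites differ widely. Only the Python standard library is used.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collect (product, price) pairs from a hypothetical retailer page layout."""

    def __init__(self):
        super().__init__()
        self._field = None   # field the parser is currently inside, if any
        self._current = {}   # partially assembled record
        self.records = []    # completed (name, price) pairs

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field is None:
            return  # text outside the spans of interest is ignored
        self._current[self._field] = data.strip()
        self._field = None
        if {"name", "price"} <= self._current.keys():
            self.records.append(
                (self._current["name"], float(self._current["price"]))
            )
            self._current = {}


def extract_prices(html):
    """Return the list of (product, price) pairs found in an HTML string."""
    parser = PriceParser()
    parser.feed(html)
    return parser.records


# Invented sample page; in practice the HTML would be retrieved from a website.
SAMPLE_PAGE = """
<ul>
  <li><span class="name">Bread 500g</span> <span class="price">1.20</span></li>
  <li><span class="name">Milk 1l</span> <span class="price">0.95</span></li>
</ul>
"""

print(extract_prices(SAMPLE_PAGE))
```

Keeping the parsing logic separate from the retrieval of pages makes it possible to test the extraction rules on stored copies of a page before any live collection takes place.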

The use of IaSD is already a reality in many NSOs or is progressively being tested for production environments (e.g. Statistics Canada, US Census Bureau). This opens up avenues to implement subject, object, relationship and network-based measurements (CBS, 2012) that make the most of a vast array of data, including text, images, sound and video files. Of particular interest are data generated on transaction and social media platforms as they mediate content and services between users. One example is the “Billion Prices project”, an academic initiative aimed at comparing official and alternative, Internet retailer-based measures of inflation, drawing on transaction data. Such comparisons can in some cases challenge or confirm official data, and may point to possible leading indicators.

Comparing official CPI and Internet-based consumer price inflation estimates, 2008-15
Annual Consumer Price Index inflation rates, Argentina and United States, 2008-15
[Figure not reproduced]

Source: OECD calculations based on Cavallo and Rigobon (2016).

 StatLink https://doi.org/10.1787/888933930497
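The comparison shown in the figure rests on annual inflation rates derived from price indices. As a rough sketch, the annual rate for year t can be computed as 100 × (I_t / I_{t-1} − 1), where I_t is the index level in year t. The index values below are invented for illustration and are not taken from Cavallo and Rigobon (2016); the function names are likewise hypothetical.

```python
def annual_inflation(index_by_year):
    """Return {year: percentage change on the previous year} for a price index."""
    years = sorted(index_by_year)
    return {
        year: 100.0 * (index_by_year[year] / index_by_year[prev] - 1.0)
        for prev, year in zip(years, years[1:])
    }


def inflation_gap(online_index, official_index):
    """Return online minus official annual inflation, for years present in both."""
    online = annual_inflation(online_index)
    official = annual_inflation(official_index)
    return {
        year: online[year] - official[year]
        for year in sorted(online.keys() & official.keys())
    }


# Invented index levels (first year = 100), for illustration only.
OFFICIAL_INDEX = {2013: 100.0, 2014: 110.0, 2015: 121.0}
ONLINE_INDEX = {2013: 100.0, 2014: 125.0, 2015: 150.0}

for year, gap in inflation_gap(ONLINE_INDEX, OFFICIAL_INDEX).items():
    print(year, round(gap, 1))
```

A persistent gap of one sign between the two series is the kind of signal that could prompt closer examination of either source.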

Website metadata, hyperlinks to other sites, logs, cookies and website/subscriber analytics also represent key sources for understanding data flows and network effects. Behavioural data from devices such as smartphones or wearable technology carried by individuals, which record data such as location, physical activity and health status, offer additional opportunities to develop new statistics addressing previously unmeasurable phenomena, and the capacity to measure actual behaviour as an alternative to reported behaviour. IaSD can therefore help address response and reporting bias, especially around sensitive phenomena.

What are the challenges?

Internet data acquisition modalities can range from the use of robots/crawlers to delivery of data through Application Programming Interfaces (APIs). In addition to technical issues, including software and infrastructure requirements, IaSD requires that the data used are legally cleared for the intended statistical use. NSOs may lack the legal rights to make use of privately owned data available online, but a legislative basis for this can be put in place.
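Where robots/crawlers are used, one common technical courtesy is to consult a site's robots.txt file before retrieving pages. The sketch below uses Python's standard `urllib.robotparser` module on an invented robots.txt; the user-agent name and the URLs are hypothetical. Respecting robots.txt is a technical convention only and does not in itself establish that data are legally cleared for statistical use.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content; a real file would be fetched from the website.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Crawl-delay: 5
"""

AGENT = "nso-price-collector"  # hypothetical crawler name


def build_policy(robots_txt):
    """Parse robots.txt text into a RobotFileParser that can be queried."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


policy = build_policy(SAMPLE_ROBOTS_TXT)

# Whether the crawler may retrieve a product page and a checkout page.
print(policy.can_fetch(AGENT, "https://retailer.example/products/bread"))
print(policy.can_fetch(AGENT, "https://retailer.example/checkout/basket"))

# Requested pause between requests, in seconds, if the site specifies one.
print(policy.crawl_delay(AGENT))
```

When a platform offers an API, that route is often preferable to crawling, since the terms of access and the structure of the data are stated explicitly by the provider.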

By virtue of its nature, the Internet presents all the features of Big data (i.e. vast volume and high update frequency, together with issues of coherence, complexity and representativeness of the population of interest). Such data require non-conventional tools which NSO staff may not be fully trained or equipped to use. In addition, borders do not apply to the Internet, whereas the activities of NSOs are mostly confined to their own jurisdictions. Linking Internet information to real-world entities can thus be especially challenging. Most importantly, it may be difficult to assess the integrity and provenance of data retrieved from online sources.

Each use case needs to be assessed on its own merits. Government transparency requirements for administrative procedures may enable NSOs to reliably source governmental administrative data online (e.g. procurement or grant data; patent filings). Information disclosure (or suppression) online can be influenced by organisational objectives. For example, online job listings may not signal a willingness to hire for advertised posts, but rather provide a job market scanning mechanism, or a company may advertise its activity in some areas to boost its image while keeping other operations secret. In order to secure integrity, the IaSD agenda requires that primary information providers are able to trust that the information provided in online environments will not be used against them, while information users need to feel reassured that the information provider has nothing to gain by reporting false information or withholding information. To enable this, IaSD often relies on ensuring privacy and confidentiality (e.g. between platform owners, their users and NSOs).

Options for international action

The OECD Recommendation on Good Statistical Practice (OECD, 2015) advocates that NSOs, as a collective, explore Internet-based sources, and the combination of these with existing sources for official statistics. The United Nations Statistics Division (UNSD) provides an inventory of Internet-based Big data projects in NSOs (https://unstats.un.org/bigdata/inventory.cshtml). In order to ensure the quality of official statistics when such sources are used, the formulation of explicit policy towards the use of Big data (including the Internet and private data) has to consider access, legal, technical and methodological implications.

International action is particularly pertinent for the purposes of demonstration and mutual learning, especially around quality assurance. International action is also relevant for addressing the measurement of phenomena across jurisdictional boundaries, such as those relating to globalisation or cross-country analysis (Schreyer, 2015). Collective action can drive a move towards the development and adoption of standards that favour disambiguation and interoperability of the Internet footprint under conditions that are suitable for good statistical practice. NSOs may increasingly leverage and contribute to the development of global Internet information commons that could in the future be vital statistical infrastructure for examining cross-boundary phenomena. Examples include the work led by private not-for-profit international consortia to consolidate online registers of organisations, curating administrative data sources published in isolation by governments and public bodies, and rendering them accessible and usable online.

As a collective group, NSOs and international organisations, including the OECD, should work to develop a fruitful dialogue with the owners of Internet-based platforms that facilitate growing shares of online activity and have access to the associated digital footprint.

References

Bean, C. (2016), Independent Review of UK Economic Statistics: Final Report, UK Government, London, www.gov.uk/government/publications/independent-review-of-uk-economic-statistics-final-report.

Cavallo, A. and R. Rigobon (2016), “The Billion Prices project: Using online data for measurement and research”, Journal of Economic Perspectives, Vol. 30, No. 2, pp. 151-178.

CBS (2012), ICT, Knowledge and the Economy, Central Bureau of Statistics, the Netherlands, www.cbs.nl/en-gb/publication/2012/48/ict-knowledge-and-the-economy-2012.

Hewson, C., C. Vogel and D. Laurent (2016), Internet Research Methods, 2nd edition, Sage Publications Ltd., London, https://uk.sagepub.com/en-gb/eur/internet-research-methods/book237314.

OECD (2015), Recommendation of the OECD Council on Good Statistical Practice, OECD, Paris, www.oecd.org/statistics/good-practice-toolkit/Brochure-Good-Stat-Practices.pdf.

Schreyer, P. (2015), “Use of geospatial and web data for OECD statistics”, presentation for the CCSA Special session on Showcasing Big Data, 1st October 2015, Bangkok, https://unstats.un.org/unsd/accsub/2015docs-26th/Presentation-OECD.pdf.
