3. Looking ahead: A roadmap of datasets to enhance the fraud risk model of Spain’s Comptroller General

Abstract

This chapter explores additional datasets that the General Comptroller of the State Administration (Intervención General de la Administración del Estado, IGAE) of Spain can use to enhance the risk model described in Chapter 2. The chapter provides a roadmap and indicates which databases are most promising for improving the assessment of grant fraud risks using the model, based on the accessibility, relevance and quality of the datasets. The datasets are grouped into four categories: 1) organisational data on the parties of the granting process; 2) data on personal connections and conflicts of interest; 3) data on organisational reliability and violation of rules; and 4) data on other funds and contracts.

Introduction

This chapter offers a roadmap for complementing existing grants data of the General Comptroller of the State Administration (Intervención General de la Administración del Estado, IGAE) in order to improve risk assessment models. It outlines priority datasets that can be linked to existing IGAE grants data, enhancing analytical sophistication and improving the precision of risk assessment. As discussed in Chapter 2, machine learning models are limited by the scope and type of data included in the training sample. The model cannot precisely estimate risk probabilities based on incomplete information, because key drivers and mechanisms determining risks remain unaccounted for. Hence, the more comprehensive the initial dataset is, the more precise and accurate risk calculations become.

As the universe of potentially relevant datasets is vast, it is imperative to narrow down the list to the most relevant datasets before investing considerable resources in mapping, processing and linking the data and eventually incorporating them into the predictive models. Three factors should be considered when selecting suitable datasets: accessibility, relevance, and quality. Accessibility in this context encompasses the ease with which the dataset can be gathered from its original source, including whether the dataset is publicly downloadable or has to be requested. The format in which the data are available is also crucial, for example whether they come as a single downloadable dataset or as a series of HTML pages. Relevance refers to the potential of the data fields to improve analytical sophistication and precision. This has to be assessed before actually collecting the data. The ultimate test of this initial assessment is whether the data would improve the predictive accuracy of the model. When too many redundant variables are included, the final model may suffer from overfitting. Data quality in this context captures the rate of non-missing values and the reliability of information. Low-quality data with many missing values or inaccurately collected data are likely to bias the results. This chapter only covers datasets that are considered to be readily available to the IGAE, relevant for the risk model and of sufficiently high quality.

Roadmap for complementing IGAE grants data

The two previous chapters outlined the process by which machine learning can be deployed to enhance the IGAE’s approach to identifying risks in grants and subsidies provision. The process of drawing on external datasets in addition to the existing internal data follows the same logic. First, background and risk indicators should be defined for each dataset to identify factors that potentially influence fraud risks. The next step is to link datasets to the existing internal dataset. To do so, a few things should be taken into consideration: the unit of analysis in each dataset, variable relevance, the missing rate and the variance. As discussed in Chapter 2, the missing rate should be lower than 50%, with variance of at least 35%. Moreover, before merging, the new data should be aligned to the same unit of analysis and carry unique IDs, to avoid duplicate rows after matching. Variables that do not contain useful information (i.e. cannot be used as indicators) should be dropped.
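The screening and alignment checks described above can be sketched as follows. This is a minimal illustration in plain Python, assuming each dataset is held as a list of dictionaries. The function names, field names and sample rows are hypothetical; only the 50% missing-rate threshold comes from the text. The variance criterion is left out because its exact definition is given in Chapter 2 rather than here.

```python
from collections import Counter

MAX_MISSING_RATE = 0.5  # threshold cited in the text (from Chapter 2)


def missing_rate(rows, field):
    """Share of rows in which `field` is absent, None or an empty string."""
    if not rows:
        return 1.0
    missing = sum(1 for row in rows if row.get(field) in (None, ""))
    return missing / len(rows)


def usable_fields(rows, fields):
    """Keep only the fields whose missing rate is below the threshold."""
    return [f for f in fields if missing_rate(rows, f) < MAX_MISSING_RATE]


def duplicated_ids(rows, id_field):
    """Return IDs occurring more than once; these would duplicate rows on merge."""
    counts = Counter(row[id_field] for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical external dataset, intended to hold one row per organisation.
external = [
    {"nif": "A0000001", "legal_form": "SL", "employees": None},
    {"nif": "A0000002", "legal_form": "SA", "employees": None},
    {"nif": "A0000002", "legal_form": "SA", "employees": 12},
]

print(usable_fields(external, ["nif", "legal_form", "employees"]))
print(duplicated_ids(external, "nif"))
```

In this sketch the `employees` field would be dropped for exceeding the missing-rate threshold, and the repeated NIF would need to be resolved before the dataset is merged.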

For example, to add external datasets to the existing National Subsidies Database (Base de Datos Nacional de Subvenciones, BDNS), they should have identifiers that match those in the BDNS data. Such IDs include identifiers of grants, the Tax Identification Number (NIF) of beneficiaries and grantor names, such as municipality names. This implies some limitations. For example, it is currently impossible to match third parties by their names; they can be matched only by NIF. Additionally, matching by municipality will lead to a significant loss of data, because aligning data to the same unit of analysis with unique IDs means that risk scores have to be aggregated by municipality. Similar logic applies to matching by grantors’ names and beneficiaries’ NIFs, as there are many identical values throughout the BDNS data (i.e. the same beneficiary might receive multiple grants or subsidies).
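The aggregation step implied by NIF-based matching can be illustrated with a short sketch. It assumes grant-level rows with hypothetical `beneficiary_nif` and `amount` fields and an external dataset already indexed by NIF; none of these names come from the BDNS specification.

```python
from collections import defaultdict


def normalise_nif(nif):
    """Strip spaces and hyphens and upper-case, so equal NIFs compare equal."""
    return nif.replace(" ", "").replace("-", "").upper()


def aggregate_by_nif(grants):
    """Collapse grant-level rows into one row per beneficiary NIF."""
    totals = defaultdict(lambda: {"n_grants": 0, "total_amount": 0.0})
    for grant in grants:
        key = normalise_nif(grant["beneficiary_nif"])
        totals[key]["n_grants"] += 1
        totals[key]["total_amount"] += grant["amount"]
    return dict(totals)


def left_join(beneficiaries, external_by_nif):
    """Attach external fields to every beneficiary; unmatched ones get None."""
    return {
        nif: {**row, "external": external_by_nif.get(nif)}
        for nif, row in beneficiaries.items()
    }
```

Aggregating first keeps the join one-to-one, at the cost of losing grant-level detail, which is the data loss the paragraph above refers to.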

There are a few sources—some more reliable than others—that can be potentially used for adding data to the existing BDNS dataset. First, there are official sources such as the National Register of Associations (el Registro Nacional de Asociaciones) of the Ministry of the Interior, which lists accredited non-governmental organisations (NGOs), the tax database of the State Tax Administration Agency (Agencia Estatal de Administración Tributaria, AEAT) and the Spanish Association of Foundations (La Asociación Española de Fundaciones), which lists accredited foundations. Some of the data are publicly accessible, whereas others are restricted only to authorised agencies.

Beneficial ownership (BO) registries and public procurement data can also be considered as trusted official sources. The advantage of working with official data directly obtained from data holders is that there is no need to verify the information provided, beyond the standard data quality checks used as part of the outlined data pipeline. Official aid data from the European Union is another example of trustworthy data.

The next group of sources are independent NGOs and associations. This information is less reliable, since the process of data collection and verification is unclear. While official sources most likely include primary data and information, secondary sources are either parsed from different sources or collected manually, often without transparency concerning how the dataset is constructed. Therefore, these datasets should be used with greater care and their validity checked more thoroughly. In Spain, examples of such sources include independent NGO evaluators as well as FICESA, a database of Spanish senior positions and secretariats.

Overview of the most relevant dataset groups

There are four major groups of data that are relevant for matching with the main BDNS database in order to enhance the IGAE’s fraud risk assessments. Each group can provide insights on distinctly different dimensions and determinants of fraud risks. Some data create opportunities for alternative methods of analysis, such as network analysis, revealing connections between private companies and politically exposed persons, as well as between beneficial owners and associated companies. Bringing all of these datasets together offers the possibility of the most comprehensive risk assessment; however, matching only some, or even just one additional dataset, can be very useful for enhancing the IGAE’s risk model. The four groups of data are as follows:

i. Organisational data on the parties of the granting process. This group covers data on grantors and grantees, as well as third parties (i.e. project implementers). Potential sources of information for this group are:

Company registry and financial information: provides information on the organisational structure and history of the company (e.g. when it was founded) and also uncovers the financial situation such as profitability of the organisation.
Organisational data on accredited NGOs, foundations, associations: provides information on the registry features, reliability of the organisation, and financial records.

ii. Data on personal connections and conflicts of interest. This group can be helpful in identifying connections between officials in private organisations applying for grants and political officeholders overseeing grant giving. Connecting public and private office holders can be useful for further investigating possible conflicts of interest. Potential sources of information for this group are:

The BO registry: can help with identifying beneficial owners, associated companies and their records.
Politically exposed persons: helps in revealing people who are entrusted with power and are more susceptible to being involved in bribery or other corrupt practices.
Data on senior positions and secretariats: provides names of people potentially connected to private companies through legal or beneficial ownership.

iii. Data on organisational reliability and violation of rules. This group can aid in predicting fraud risks by offering insights on relevant, but only indirectly related violations, such as tax payment irregularities. This group can also provide information on softer measures of reliability, such as civil society accreditation. Potential sources of information are:

Data on bankruptcy or tax payments: shows the reliability of an organisation based on past financial records.
Accreditations of NGOs: identifies accredited NGOs or other associations as more reliable ones.

iv. Data on other funds and contracts. Information on other funding sources and public contracts can reveal additional factors that influence the likelihood of fraud, such as double funding for the same activity. Moreover, corruption risks in public procurement or other funding processes can point to systematic, organisation-level weaknesses and the propensity to commit fraud. The relevant datasets in this group include:

EU Funds: list of beneficiaries of EU aid can show if the organisation received double funding from different sources for the same project.
Public procurement: corruption risks in public contracts received by organisations or provided by the same grantor can influence the possibility of wrongdoing in grants and subsidies.

Table 3.1 presents the most promising datasets in Spain that are either publicly accessible or whose content and specifications are in the public domain. For each dataset belonging to one of the four dataset groups, the table contains information on the unit of measurement (i.e. what a single row refers to), the number of observations where available, the ID for matching to the BDNS,¹ and the priority for the IGAE’s follow-up work. The table lists the top priority datasets first, considering the three main dimensions of data assessment discussed above: accessibility, relevance, and quality. Only datasets that scored high on all three dimensions (readily accessible bulk data download, highly relevant data scope and content, and adequate quality) were considered high priorities for the IGAE.

Conversely, some datasets that scored high on only one or two dimensions were rated as medium or low priority. For instance, when data accessibility was limited, the priority was set to medium even for data that were otherwise seen to be highly relevant or of adequate quality. Ranking datasets in terms of overall priority sets the detailed roadmap for extending and enriching the current IGAE dataset and the risk model described in Chapter 2. The next sections discuss each of these datasets in detail, along with some fraud risk indicators, which can be calculated when data are matched.
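The prioritisation rule described above can be written down as a small function. This is only a sketch: the text states that a dataset is high priority when it scores high on all three dimensions, and that datasets scoring high on one or two dimensions are medium or low. The split used here (two high scores gives medium, fewer gives low) is an assumption made for illustration, as are the function and argument names.

```python
def priority(accessibility, relevance, quality):
    """Combine three 'high'/'low' assessments into an overall priority."""
    scores = (accessibility, relevance, quality)
    n_high = sum(1 for score in scores if score == "high")
    if n_high == 3:
        return "high"  # rule stated in the text
    # Assumption: two high scores -> medium, one or none -> low.
    return "medium" if n_high == 2 else "low"


# A highly relevant, good-quality dataset with limited accessibility.
print(priority("low", "high", "high"))
```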

Table 3.1. Short description of additional datasets
Dataset name	Dataset group	Unit of measurement	Number of observations	ID to match on to IGAE main dataset	Priority for the IGAE’s follow-up work
National Company Register (Registradores de España)	i, ii	Organisation	>5 000 000	NIF of beneficiaries, names of organisations	high
Beneficial ownership registry (LibreBOR)	i, ii	Organisation	>5 000 000	NIF of beneficiaries	high
Database of Spanish senior positions and secretariats (FICESA)	ii	Institutions and State Bodies	~100 000	Name of organisations	high
CINCO.net	iii	Organisations	should be accessed by official body	NIF of organisations	high
Public procurement data	iv	Tender	1 391 558	NIF of organisations	high
Public Bankruptcy Registry (El Registro Público Concursal)	iii	Organisations	website does not allow to search	NIF of organisations	medium
Spanish Association of Foundations (La Asociación Española de Fundaciones, AEF)	iv	Foundation	15 840	Location and type of beneficiary	medium
State Tax Administration Agency (Agencia Estatal de Administración Tributaria, AEAT)	iii	Organisations	not in public access	NIF of organisations	medium
European Union Aid	iv	Grant or contract	40 567	Name of beneficiary, vat number	medium
National Register of Associations of the Ministry of Interior (el Registro Nacional de Asociaciones)	i, iii	Accredited NGO	44	CIF of organisation	low
Loyalty Foundation (Fundación Lealtad)	i, ii, iii	Accredited NGO	191	Name of organisation	low
Source: Author

Matching organisational data: More precise organisational profiles and anomaly detection

Organisational data for the parties involved in grant making cover the grantors, grantees and third parties (i.e. project implementers). Matching data on organisations makes it possible to gain a more complete and detailed picture of organisational controls of fraud risks. It helps to identify additional organisational characteristics that might influence the probability of sanctions. For example, accounting information, size of the company and associated companies can all be useful characteristics for identifying fraud risks and improving the IGAE’s risk model in the future. This group includes the following databases: the National Company Register (Registradores de España), data from the Spanish Association of Foundations (la Asociación Española de Fundaciones, AEF), and the National Register of Associations (el Registro Nacional de Asociaciones) of the Ministry of Interior.

Company registry and financial data

One of the most relevant datasets for the IGAE’s purpose and for enhancing the risk model is the National Company Register. It contains data on companies' details, capital, representatives (e.g. directors and attorneys), registered acts and filing of annual accounts (i.e. financial performance). The list of variables is presented in Table 3.2.²

Table 3.2. List of variables (National Company Register)
Variables	Description	Type of the variable
Name	The name of the company	Text
NIF	The NIF number of the company	Text
Date of incorporation	The date the company was incorporated	Date
Company address	The address where the company is registered	Text
Sector of economic activity	In which economic sector the company operates (NACE)	Categorical
Legal form	Official legal form of the company (national forms)	Categorical
Company status	If the company is active and operational	Categorical
Company’s assets	Total value of items benefiting the company economically	Numeric
Company’s liabilities	Total value of the company's obligations	Numeric
Company's income	Total amount of income generated annually	Numeric
Company's expenditures	Total amount of expenditures per year	Numeric
Changes in equity	If there were any changes in equity for the past year	Binary + text
Cash flows	Increase or decrease in the amount of money	List
Members	Includes the name of all members of the current company representation	Text
Beneficial owners	List of names of final owners of the company	Text
Source: https://sede.registradores.org/site/home

The National Company Register can be matched to the main BDNS dataset by the company’s NIF number, or if that is erroneous, by the name of the organisation. Almost all data fields contained in the company dataset are relevant for the IGAE in terms of enhancing its risk model. These fields range from essential registry information, such as date of incorporation or location of headquarters, to balance sheets and income statements. Similarly, recent changes in equity and the full list of members of the company can provide additional insights on potential conflicts of interest when matched with other datasets.

With regard to essential registry information, some red flags have proven useful for predicting corruption and fraud risks. For example, companies that were set up, or whose registration data were modified, shortly before applying for a grant pose higher risks. Companies registered at so-called “company graveyard” addresses can also be high risk: these are addresses where a very large number of companies are registered with a high degree of fluctuation (e.g. thousands of companies created and closed at the same address each month). In addition, as discussed in Chapter 2, the type of organisation (i.e. the company's legal status), as well as its overall income and size, can influence the level of fraud risk. For example, due to legislation, certain types of organisations can be less transparent or more loosely regulated (e.g. trusts or companies whose ownership is represented by bearer shares).
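The two registry red flags described above could be computed along the following lines. The sketch assumes incorporation and application dates are available as `datetime.date` values and that each company row has an `address` field; the thresholds are hypothetical placeholders, not values taken from the IGAE model.

```python
from collections import Counter
from datetime import date

RECENT_DAYS = 365               # hypothetical: "shortly before" applying
GRAVEYARD_MIN_COMPANIES = 1000  # hypothetical: companies per address


def recently_incorporated(incorporated, applied, max_days=RECENT_DAYS):
    """True if the company was incorporated within `max_days` before applying."""
    return 0 <= (applied - incorporated).days <= max_days


def graveyard_addresses(companies, min_companies=GRAVEYARD_MIN_COMPANIES):
    """Addresses shared by at least `min_companies` registered companies."""
    counts = Counter(c["address"].strip().lower() for c in companies)
    return {address for address, n in counts.items() if n >= min_companies}


print(recently_incorporated(date(2021, 1, 1), date(2021, 3, 1)))
```

A fuller version would also count how many companies are created and closed at each address per month, since the text links graveyard addresses to high fluctuation and not only to volume.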

Regarding company financial data, the IGAE could consider a number of relevant indicators for risk prediction. First, the ratio between a company's expenditures and income indicates whether the company is profitable. Companies that are not profitable are riskier beneficiaries of grants and subsidies, since they may use funds to repay their debts rather than to finance their projects. Similarly, an unfavourable ratio between a company's liabilities and assets (i.e. liabilities exceeding assets) suggests greater risk in terms of the appropriate use of grants. Frequent changes in equity might be a signal of internal conflicts and instability within the company, increasing the level of risk associated with grants and subsidies for such organisations. A systematic decrease in cash flows reflects stagnation or reduction in the company's activities, which also brings its reliability into question. Combining the grants data with company financial data can also reveal the relative size of the grant compared to the company, with small companies receiving large grants potentially being risky.
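As an illustration, the ratio-based indicators mentioned above can be derived from the financial fields listed in Table 3.2. The dictionary keys below are hypothetical names for those fields, and the sketch makes no claim about which thresholds the IGAE would apply to the resulting values.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero or missing."""
    return numerator / denominator if denominator else None


def financial_indicators(company, grant_amount):
    """Ratios derived from one company's annual accounts and one grant."""
    return {
        # Above 1 means the company spends more than it earns.
        "expenditure_to_income": safe_ratio(company["expenditures"], company["income"]),
        # Above 1 means liabilities exceed assets.
        "liabilities_to_assets": safe_ratio(company["liabilities"], company["assets"]),
        # Large values point to a grant that is big relative to the company.
        "grant_to_income": safe_ratio(grant_amount, company["income"]),
    }


example = {"income": 100.0, "expenditures": 120.0, "assets": 200.0, "liabilities": 300.0}
print(financial_indicators(example, grant_amount=50.0))
```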

Register of Associations

Another organisational dataset that the IGAE could consider for its risk model, although a low priority, is the National Register of Associations (el Registro Nacional de Asociaciones), held by the Ministry of Interior. This is a list of organisations that have passed a review by the Spanish Agency for International Development Cooperation (Agencia Española de Cooperación Internacional para el Desarrollo, AECID) in which more than 70 qualitative and quantitative criteria were used, mostly related to experience, financial solvency, transparency and human resources. The main limitation of this dataset is the small number of accredited NGOs it covers, as it has only 44 observations. The data are stored in HTML format and can easily be converted to Excel or any other data format. The list of variables is described in Table 3.3.

Table 3.3. List of variables (National Register of Associations of the Ministry of Interior)
Variables	Description	Type of the variable
Name	What is the name of the NGO	Text
Sectors	Which sectors it’s qualified for	Categorical
CIF	What is the CIF number of the NGO	Text
Source: https://www.aecid.es/EN/aecid/our-partners/ngdo/accreditation

The dataset provides two potential IDs for matching: the name of the organisation and its tax identification code (Código de Identificación Fiscal, CIF). Both can be used to link the data to the IGAE’s grant data. The data consist of three variables, two of which are IDs, while the third specifies the sectors in which the NGO is qualified to operate. Based on this information, two binary variables can be created: 1) whether the NGO has been reviewed, and 2) whether the NGO receives grants in the area for which it was qualified (a mismatch would be, for example, an NGO that was qualified for the health sector but receives grants for the education sector). Due to the low number of observations, significant changes in predicted risk scores are unlikely. However, if the main BDNS dataset is filtered for NGOs only, this information might influence the outcomes for this sector.
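The two binary variables can be derived as in the sketch below. It assumes the accreditation list has been loaded into a dictionary mapping each CIF to the set of sectors the NGO is qualified for, and that each grant row carries a sector label using the same vocabulary; both structures and all field names are hypothetical.

```python
def accreditation_flags(grant, accredited_sectors):
    """Return the 'reviewed' and 'same_sector' indicators for one grant."""
    sectors = accredited_sectors.get(grant["beneficiary_cif"])
    reviewed = sectors is not None
    same_sector = reviewed and grant["sector"] in sectors
    return {"reviewed": int(reviewed), "same_sector": int(same_sector)}


# Hypothetical accreditation list: CIF -> sectors the NGO is qualified for.
accredited = {"G0000001": {"health"}}

print(accreditation_flags({"beneficiary_cif": "G0000001", "sector": "education"}, accredited))
```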

NGO evaluations

The third dataset worth considering is that of the Loyalty Foundation (Fundación Lealtad). This is an independent NGO evaluator, which analyses the management, governance, use of funds, economic situation, volunteering and transparency of NGOs. On the foundation’s website, there is a downloadable PDF file with the list of all positively evaluated NGOs. However, this list contains little information beyond the names of organisations. Therefore, a more effective approach would be to access the HTML page of each organisation and parse the data from it. It is also possible to parse information from the standardised PDFs, called “full reports”, that are available for each NGO. The list of variables is described in Table 3.4.

Table 3.4. List of variables (Loyalty Foundation)
Variables	Description	Type of variable
Name	The name of the NGO	Text
Sectors	Sectors of its operation	Categorical
CIF	The CIF number of the NGO	Text
Income	The annual income of the organisation + sources	Numeric + categorical
Expenses	The annual expenses of the organisation + types of expenses	Numeric + categorical
Year	Year of origin of organisation	Date
Beneficiaries	The overall number and type of beneficiaries of this NGO	Numeric
Partners	Number of partners the NGO has	Numeric
Employees	Number of employees the NGO has	Numeric
Volunteers	Number of volunteers the NGO has	Numeric
NIF	The NIF number of the organisation	Text
Management positions	Individual(s) who represent the management of this NGO	Text
Contacts	Email, telephone, address of organisation	Text
Geographic area	Where the NGO operates	Text
Source: https://www.fundacionlealtad.org/ong/a-toda-vela/

The main IDs by which organisations can be linked to the IGAE’s datasets are the name of the organisation and its NIF. While the name is available in both the HTML and PDF files, the NIF is stored only in the full report PDF. Data on income, expenses, sector of activities, year of origin, as well as number of beneficiaries, partners and employees can add to the background information for the analysis. As before, a binary variable can be created reflecting whether the given organisation is verified by the Fundación Lealtad. Besides the general background information, some additional indicators can be extracted from this dataset. For instance, the breakdown of expenses by type can be used to assess how much is spent on the administration of the NGO in comparison to its mission. High spending on administration might be a signal of higher risk, although on its own it would not be an indicator of fraud or wrongdoing. Data on management positions, when linked to other datasets (e.g. politically exposed persons), can provide information on potential conflicts of interest.

Matching personal data for tracking connections and conflict of interest

The second group of datasets that could enhance the IGAE’s risk model, described in Chapter 2, is data on personal connections and conflicts of interest. Matching data on personal connections between the public and private sectors opens up the possibility for tracking conflicts of interest. Such data can be analysed with the use of network analysis to identify if there are connections between politically exposed persons and owners of the companies receiving grants and subsidies. Some potential sources were already discussed in the previous group. The next sections will focus on the Beneficial Ownership Registry and FICESA, the database of Spanish senior positions and secretariats.

Beneficial Ownership (BO) Registry

The BO registry provides information on over 5 000 000 organisations registered since 2009. The short list of variables is provided in Table 3.5. There is no complete dataset in the public domain, but the source, LibreBOR (an online platform for consulting and analysing the Official Gazette of the Mercantile Registry, Boletín Oficial del Registro Mercantil), provides an API and a Python script to parse the data.³ It is possible to select only those organisations that appear in the IGAE datasets, without parsing the whole dataset, which reduces processing time.

Table 3.5. List of variables in the BO registry
Variables	Description	Type of the variable
Current and previous denomination	The name of the company, what are the previous names	Text
Registered office	Where the official office is registered	Text
Legal form	The legal form of the company	Categorical
Province	Province where the company operates	Text
Management positions	Names of the individual(s) in management positions	Text
Date of dissolution and reason	If the company was dissolved, when and why it happened	Date + text
Registry data	Additional information on company registry	Text
Links to the official sources	Official source from which the data comes	Text
Beneficial owners¹	List of names of final owners of the company	Text
Source: https://docs.librebor.me/

There are two ways for the IGAE to match the BDNS datasets to the BO registry: 1) by name of the organisation, or 2) by NIF of the beneficiary. Alternatively, it is possible to aggregate data per province and match aggregate numbers (e.g. average company size) by location. The BO dataset contains a lot of background information on organisations, but the most relevant fields are management positions, associated organisations and the final beneficial owners. The ownership data are best used when matched against other datasets, in particular lists of political office holders (see next section).

In addition, the IGAE can use some of the background information as risk predictors on their own. When the names of beneficial owners of grant recipients are matched against public office holders, it is possible to identify either direct conflicts of interest (i.e. when the official works for the granting body itself) or indirect forms of potential conflict (i.e. when the related political office holder works in a higher-level or supervisory body to the granting organisation). When looking at the ownership data on their own, the information on companies associated with the grantee can reveal risks if further matched to other datasets (e.g. complex forms of conflicts of interest and related risk factors).⁴
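The distinction between direct and indirect conflicts of interest can be sketched as a simple name-matching routine. The sketch assumes a list of (official name, public body) pairs and a set of bodies known to supervise the granting body; these inputs, and all function names, are hypothetical. Matching on names alone will produce false positives for common names, so in practice additional attributes would be needed to confirm a match.

```python
import unicodedata


def normalise_name(name):
    """Lower-case, strip accents and collapse whitespace for name comparison."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_accents.lower().split())


def classify_conflict(owner_names, officials, granting_body, supervising_bodies):
    """Return 'direct', 'indirect' or 'none' for one grant recipient."""
    owners = {normalise_name(name) for name in owner_names}
    found_indirect = False
    for official_name, body in officials:
        if normalise_name(official_name) not in owners:
            continue
        if body == granting_body:
            return "direct"  # the owner holds office in the granting body itself
        if body in supervising_bodies:
            found_indirect = True  # the owner holds office in a supervisory body
    return "indirect" if found_indirect else "none"


officials = [("Jose  Garcia", "Ministry A"), ("Ana Ruiz", "Agency B")]
print(classify_conflict(["José García"], officials, "Agency B", {"Ministry A"}))
```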

Senior bureaucrats’ database

The next source is a database of Spanish senior positions and secretariats called FICESA. This source contains data related to senior public officials in a wide range of public organisations: state secretariats, undersecretaries, general directorates and sub-directorates, budget offices, official offices, as well as different judicial bodies at state, regional and local levels. There are no data in the public domain, and data must be requested from the data holder by filling out a form. Therefore, the format of the data and the variables the dataset contains are unclear. There was no response to attempts to contact the source. It is assumed that the IGAE would be able to gain access to the full database as a bulk download.

The only IDs by which this dataset can be linked are names and, if available, additional personal features, such as date of birth. If the BDNS dataset contains data on beneficial owners, as described above, the data on official positions can be linked by persons’ names. Linking the IGAE’s datasets to the information on senior office holders creates the possibility to conduct network analysis and see if there are conflicts of interest between private organisations receiving grants and public bodies giving grants. It is particularly useful to use the BO registry in order to find all the associated organisations and analyse whether they are connected to politically exposed persons. For instance, the organisation receiving the grant may not be connected to anyone from official bodies, but one of its related organisations could be.
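The kind of network analysis described above can be illustrated with a breadth-first search over ownership and management links. This is a minimal sketch using only the standard library: the edge list, node labels and the `max_hops` cut-off are hypothetical, and a real analysis would need reliable identifiers for persons rather than names.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Undirected graph linking organisations to persons and related organisations."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def connected_officials(start_org, graph, officials, max_hops=3):
    """Office holders reachable from `start_org`, with the number of hops."""
    seen = {start_org}
    queue = deque([(start_org, 0)])
    found = {}
    while queue:
        node, hops = queue.popleft()
        if node in officials:
            found[node] = hops
            continue  # do not search beyond an office holder
        if hops == max_hops:
            continue
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return found


edges = [("Grantee", "Owner A"), ("Owner A", "Related Org"), ("Related Org", "Official X")]
graph = build_graph(edges)
print(connected_officials("Grantee", graph, {"Official X"}))
```

In this example the grantee has no direct link to an office holder, but one is found three hops away through a related organisation, which is the situation described in the last sentence above.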

Matching data on organisational reliability and violations to collate risks across different domains

Datasets with information about organisational reliability and violations of rules or laws are the third group of data that could help the IGAE to strengthen its risk model for assessing grant fraud risks. This group was covered partially above in the section about data on accredited NGOs. In addition, this group includes datasets on bankruptcy and taxation. Matching data on organisational reliability and violation of rules illuminates new dimensions of fraud risks relating to other domains. These datasets can help predict fraud risks in grants by exploiting correlations between accredited organisations’ trustworthiness, rule-following behaviours (tax debts, bankruptcy, etc.) and fraud in grants. Building on previous discussions, the next sections focus on the Public Bankruptcy Registry, the AEAT’s tax data and accounting data from CINCO.net.

Bankruptcy Registry

The first dataset in this group, identified previously as a medium priority for the IGAE, is the Public Bankruptcy Registry (El Registro Público Concursal). The source includes information on procedural resolutions, bankruptcy and out-of-court settlements. The data can be parsed from HTML after filtering by province or court. Unfortunately, for unknown reasons, filtering does not work on the site properly, leading to page errors. Yet, the approximate list of variables is presented in Table 3.6.

Table 3.6. List of variables (Public Bankruptcy Registry)
Variables	Description	Type of variable
Name	The name of the company	Text
Identifying document	The ID of the bankruptcy document	Text
Debtor	If the company is a debtor or not	Binary
Disabled	If the company is disabled or not	Binary
Administrator	If the company is an administrator of the bankruptcy or not	Binary
Source: https://www.publicidadconcursal.es/concursal-web/afectado/buscar

This dataset can be matched to the IGAE’s grants data by either the name of the organisation or its NIF/CIF number. The source does not allow browsing through all the cases and requires a filter to be set beforehand; the easiest filter to use is province. The most relevant information for fraud risk assessments is the detail on bankruptcy. The source provides the location, name of organisation, court, judge and NIF/CIF or other identifiers of organisations. Unfortunately, there is no information on the date of bankruptcy proceedings, which would be especially important for analysing past grants and subsidies. After matching, the most relevant risk indicator for the IGAE would be a binary variable (‘flag’) reflecting whether the grantee was or is currently in a state of bankruptcy. Such bankruptcy information on an organisation might signal that the awarded grant or subsidy will be misused by the beneficiary, or at the very least, inadequately administered due to other organisational pressures.

Tax data

The second dataset on rule violations is data from the State Tax Administration Agency (Agencia Estatal de Administración Tributaria, AEAT). This is a dataset with restricted access, and only aggregated statistics are available in the public domain. Once again, for the discussion below, it is assumed that the IGAE can obtain full access to the database in order to incorporate such data into its risk model. According to the notes the AEAT has published, it holds data in a disaggregated format which can be provided upon request. The aggregated data cover filing of tax returns, payment of taxes, debts and fees, tax certificates, consultation of tax returns, etc.

Due to restricted access to the datasets, it is uncertain whether the IDs are the same as in the BDNS dataset, but most likely organisations can be matched either by name or by NIF of the beneficiary. Information on timely payment of taxes, debts and fees is the most relevant for enriching predictive models on fraud risks. Late payment of taxes, as well as the presence of debt in a given organisation (or associated ones), could be a signal of higher risks.

Accounting information

The third dataset belonging to this group is accounting and budgeting data from CINCO.net, deemed a high priority for the IGAE and for improving the risk model. The data include expense operations and total expenditure amount in the current year, revenue amount in the current year, cash flows, non-budgetary operations, third-party expenses, general data of third parties, etc. Like the AEAT’s data, these data are not in the public domain; however, the Ministry of Finance and Civil Service (Ministerio de Hacienda y Función Pública) manages CINCO.net and the IGAE has direct access to it.

The organisations in this database can be matched to the BDNS by name or by NIF of the beneficiary. Yet, due to restricted access to the data, it is difficult to assess the quality and content of the matching variables. Besides general background information on revenues and expenditures, CINCO.net provides data on reimbursement of other grants provided by different organisations in Spain. This can be particularly useful in assessing potential risks in future subsidies and grants provision, such as double funding of operations or a large value of grants received compared to revenue.
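One way to operationalise the "value of grants received compared to revenue" idea is a simple ratio with a cut-off. The sketch below is illustrative: the function names are hypothetical, and the 0.5 threshold is an assumption chosen only for the example, not a value taken from CINCO.net or from the IGAE's model.

```python
def grants_to_revenue_ratio(total_grants, annual_revenue):
    """Return grants divided by revenue, or None when revenue is missing or not positive."""
    if annual_revenue is None or annual_revenue <= 0:
        return None
    return total_grants / annual_revenue


def grant_dependence_flag(total_grants, annual_revenue, threshold=0.5):
    """Return 1 when grants exceed `threshold` of revenue, 0 otherwise, None if unknown.

    The default threshold is an illustrative assumption.
    """
    ratio = grants_to_revenue_ratio(total_grants, annual_revenue)
    if ratio is None:
        return None
    return int(ratio > threshold)
```

Returning `None` rather than 0 for missing or zero revenue keeps "unknown" distinct from "low risk", which matters given the data quality concerns raised in the introduction.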

Matching data on public contracts and other grants enables tracing double funding and related risks

The final group of datasets encompasses a diverse group of data on public contracts and other grants and funding. Matching data on other funds and contracts would allow the IGAE to cross-reference spending as well as develop additional risk dimensions. For example, it can help identify cross-subsidisation for the same activities, which should be considered a risk factor. Public procurement contracts received by a company can be scored using corruption risk indicators and then related to grants risks. For example, a company or agency (third party, grantor, grantee) participating in high-risk tenders might also be risky when it comes to grants. This group includes datasets from the Spanish Association of Foundations (la Asociación Española de Fundaciones, AEF), European Union Funds, and public procurement data.

Data for foundations

The AEF’s data provide information on foundations giving grants, including their types of activity, geographical areas, type of beneficiaries, date of constitution and origin of their administrative bodies. The list of variables is presented in Table 3.7. The data are open access and can be easily downloaded in Excel or PDF format. In total, the directory covers 15 840 foundations.

Table 3.7. List of variables of the Spanish Association of Foundations (AEF)
Variables	Description	Type of variable
Name	What is the name of the foundation	Text
Protectorate	Under which ministry/agency protectorate this foundation is	Text
Year	Year of constitution	Date
Contacts	What are the contact details of the foundation (email, phone)	Text
Address	Where the foundation operates	Text
Source: http://www.fundaciones.es/es/buscador-fundaciones

Matching this dataset to the BDNS requires several steps. First, all the observations should be filtered by type of beneficiary, using the online filtering, since the type of beneficiary is not a data field in the downloadable file. Second, the particular location should be matched to the locations of grantors or grantees. This will not provide the exact information as to whether the beneficiary received another grant from a certain foundation, but it indicates the presence of the foundation in the same location with the same types of beneficiaries.

The most relevant information for the IGAE to assess risks would be whether any of the beneficiaries were double granted for the same activities. To precisely track such risks requires checking the exact beneficiaries by their IDs, yet this source does not provide such detailed information. Hence, only aggregate information, which is much more imprecise, can be used from this source. The presence of a foundation supporting similar activities in the same locality (province) as grantor or grantee increases the probability of being double funded.
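The aggregate indicator described above could be derived roughly as follows. The sketch assumes that each downloaded AEF extract has been tagged with the beneficiary type used in the online filter and with a province; the field names `province` and `beneficiary_type` are hypothetical.

```python
def foundation_presence_flags(grants, foundations):
    """Flag grants whose province and beneficiary type also appear among the foundations.

    The flag signals only that a foundation serving the same type of
    beneficiary is present in the same province, not that the grantee
    actually received funding from it.
    """
    covered = {
        (f["province"].strip().lower(), f["beneficiary_type"].strip().lower())
        for f in foundations
    }
    return [
        {
            **g,
            "foundation_nearby": int(
                (g["province"].strip().lower(), g["beneficiary_type"].strip().lower())
                in covered
            ),
        }
        for g in grants
    ]
```

Because the indicator is defined at province level, it would be the same for every grantee of a given type in a province, so it is likely to add value only in combination with grantee-level indicators.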

European Union (EU) Funds data

The next relevant dataset for the IGAE to consider matching to the BDNS data, as a medium priority, is data for EU Funds. The Spanish government and the European Commission provide the data, and they cover records from 2007 to 2020. The data are easily accessible and can be downloaded in Excel format. The list of relevant variables is presented in Table 3.8.

Table 3.8. List of variables (European Union aid)
Variables	Description	Type of variable
Budget references	The budget reference ID for this grant	Text
Subject of grant or contract	The purpose/subject of this grant	Text
Name of beneficiary	The name of beneficiary	Text
VAT number	The VAT number of beneficiary	Text
Contracted amount	The amount of money contracted to the beneficiary	Numeric
Number of budgetary commitments	The number of budgetary commitments the beneficiary has	Numeric
Programme name	The name of the programme under which the grant was allocated	Text
Responsible department	The department responsible for grant allocation	Text
Project start and end date	The start and end date of the project	Date
Source: https://ec.europa.eu/budget/financial-transparency-system/analysis.html

The data provide a VAT number as an ID for organisations, which can be transformed into a NIF number by removing the first two letters. Alternatively, names of organisations can be used for matching. The number of budgetary commitments, the subject of grants or contracts, and the project start and end dates are particularly relevant for identifying whether the grantee received funding from the EU for the same project as its Spanish grant. Double funding is a fraudulent practice in which the same project is funded more than once by different donors, without information being provided on the contributions made. The project might therefore be implemented, yet the extra public money disbursed is not used as intended.
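The identifier conversion and a first screen for possible double funding can be sketched as below. The record fields (`vat_number`, `nif`, `subject`, `start`, `end`) are hypothetical, and overlapping project dates for the same beneficiary are treated only as a trigger for manual review, not as evidence of double funding.

```python
from datetime import date


def vat_to_nif(vat_number):
    """Drop the two-letter country prefix (e.g. 'ES') from a VAT number."""
    cleaned = vat_number.replace(" ", "").replace("-", "").upper()
    if len(cleaned) > 2 and cleaned[:2].isalpha():
        return cleaned[2:]
    return cleaned


def periods_overlap(start_a, end_a, start_b, end_b):
    """True when two closed date intervals share at least one day."""
    return start_a <= end_b and start_b <= end_a


def possible_double_funding(eu_records, national_grants):
    """Return (NIF, EU subject, national subject) triples for review.

    A triple is produced when the same beneficiary holds an EU award and
    a national grant whose project periods overlap.
    """
    by_nif = {}
    for g in national_grants:
        by_nif.setdefault(g["nif"].upper(), []).append(g)
    hits = []
    for eu in eu_records:
        nif = vat_to_nif(eu["vat_number"])
        for g in by_nif.get(nif, []):
            if periods_overlap(eu["start"], eu["end"], g["start"], g["end"]):
                hits.append((nif, eu["subject"], g["subject"]))
    return hits
```

Comparing the subject text of the two awards (for instance by keyword overlap) would be a natural next filter, since many beneficiaries legitimately run several distinct projects at the same time.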

Public procurement data

The last data source the IGAE could consider matching with its datasets is national public procurement data. The opentender.eu portal contains this data collected from two official government sources (Ministerio de Hacienda y Función Pública and Plataforma de Contratación), as well as Tenders Electronic Daily (TED), a European online public procurement portal. The data contains all the publicly available information on tenders, contracts, bidders, buyers and suppliers necessary for calculating the Corruption Risk Indicator (see Box 3.1). The list of relevant variables is presented in Table 3.9.

Table 3.9. List of variables (Public procurement data)
Variables	Description	Type of variable
Supplier ID	Unique ID of supplier	Text
Buyer ID	Unique ID of buyer	Text
Name of supplier	Name of supplier winning the contract	Text
Name of buyer	Name of buyer providing tender call	Text
Number of bids	How many bids were made per tender	Numeric
Procedure type	Is the procedure type open or restricted	Categorical
Public call	Was the call for tender available to public	Categorical
Length of bid submission	The length between start and end date of bid submission	Numeric
Length of decision period	The length between end date of bid submission and decision	Numeric
Connections	Are there recorded connections between supplier and procurement authority	Categorical
Source: Plataforma de Contratación https://contrataciondelestado.es/; Portal Institucional del Ministerio de Hacienda y Función Pública: https://www.hacienda.gob.es; Tenders Electronic Daily: http://ted.europa.eu.

Supplier IDs are the same as the grantees’ NIFs; therefore, this ID can be used for linking data. Alternatively, the names of organisations, as well as grantors’ names, can be matched to the buyers or suppliers in the procurement dataset. To assess whether the procurement contracts won by bidding firms, or the tenders run by public sector grantors, are prone to corruption, information on corruption proxies can be used. Examples include single bidding on competitive markets, the procedure type used, publication of the call for tenders, the length of the bid advertisement and decision periods, and connections between the supplier and the procurement authority. Collating public procurement corruption risks in the procurement activities of grantees or grantors can shed additional light on grants fraud risks, as it is expected that organisations that are risky in one domain will also be risky in a related domain. This logic of analysis is empirically demonstrated in Box 3.1.
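To make the idea of combining corruption proxies concrete, the sketch below scores a single contract as the share of red flags raised. It is an illustration under stated assumptions: the equal weighting, the day thresholds and the field names are hypothetical and are not the calibrated definitions used for the indicator in Box 3.1.

```python
def contract_cri(contract, min_advert_days=30, min_decision_days=10):
    """Share of binary red flags raised by one contract (between 0 and 1).

    Thresholds and equal weights are illustrative assumptions.
    """
    flags = [
        contract["number_of_bids"] == 1,                    # single bidding
        contract["procedure_type"] != "open",               # non-open procedure
        not contract["public_call"],                        # call for tenders not published
        contract["bid_submission_days"] < min_advert_days,  # short advertisement period
        contract["decision_days"] < min_decision_days,      # unusually fast decision
        contract["connections"],                            # recorded supplier-buyer link
    ]
    return sum(bool(f) for f in flags) / len(flags)
```

In applied work, the thresholds for "short" advertisement or decision periods would normally be estimated from the data rather than fixed in advance.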

Box 3.1. Matching IGAE Grants data with Public procurement data (opentender.eu dataset)

The Corruption Risk Indicator (CRI) is a proxy for the deliberate restriction of competition in public procurement tenders for the benefit of a connected bidding firm. The CRI methodology utilises administrative data to calculate corruption risk scores for each contract. Based on the methodology developed by Fazekas and Kocsis (2017[1]), the criterion for selecting procurement risk indicators is their degree of association with unjustified restriction of competition, that is, single bidding on competitive markets. In addition to single bidding, the CRI includes several corruption proxies, such as use of a closed procedure type, lack of publicity of the call for tenders, supplier registration in a tax haven, procurement authority dependence on the supplier (i.e. agency capture), and the length of the bid advertisement and decision periods.

The supplier’s tax identification number (NIF) was used to match the grants dataset to the cleaned public procurement dataset. After nonsensical entries were removed from the NIF field, the grant fraud risk scores were aggregated for each supplier and matched directly to the procurement dataset. The match identified 103 872 contracts awarded to 6 408 suppliers that have received a grant. Figure 3.1 shows the aggregated CRI distribution for granted suppliers, excluding suppliers with fewer than three contracts. The average CRI score is 0.55, considerably higher than the national average.
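The supplier-level aggregation behind a distribution such as the one in Figure 3.1 can be sketched as follows. The field names (`supplier_nif`, `cri`) are hypothetical, and the sketch assumes that contract-level CRI scores and supplier-level grant risk scores have already been computed.

```python
from collections import defaultdict
from statistics import mean


def supplier_cri(contracts, min_contracts=3):
    """Mean CRI per supplier NIF, keeping suppliers with at least `min_contracts` contracts."""
    scores = defaultdict(list)
    for c in contracts:
        scores[c["supplier_nif"]].append(c["cri"])
    return {nif: mean(v) for nif, v in scores.items() if len(v) >= min_contracts}


def match_risks(grant_risk_by_nif, cri_by_nif):
    """Pair grant fraud risk with supplier CRI for NIFs present in both sources."""
    return {
        nif: (grant_risk_by_nif[nif], cri_by_nif[nif])
        for nif in grant_risk_by_nif.keys() & cri_by_nif.keys()
    }
```

Dropping suppliers with very few contracts reduces noise in the averages, at the cost of excluding occasional suppliers from the comparison.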

Figure 3.1. CRI distribution (Suppliers)

Matching the grant dataset to the public procurement dataset allows for deeper insights into the relationships between the risk scores. It is possible to run linear and non-linear regression analyses, including controls for buyer location, buyer type, type of market (CPV sectors), contract type and tender year. Both models in Table 3.10 show a positive correlation between the procurement corruption risk scores and the grant fraud risks. However, model 2 seems to fit better in capturing the non-linearity of this relationship. In Figure 3.2 we show the predictive margins from modelling the CRI in a quadratic relationship with the Grant Fraud Risk. These simple regression results assure us of the validity of both risk scores as they are aligned and convey a similar message, that higher corruption risk scores positively correlate with higher grant fraud risks. Moreover, the association is especially strong when public procurement corruption risks are above the sample average.

Figure 3.2. Correlation between CRI and Grant Fraud Risks (Predictive Margins)

Table 3.10. Correlation between CRI and Grant Fraud Risk
Dependent variable	Grant Fraud Risk
Model	(1)	(2)
Sample	Granted	Granted
CRI	0.036*** (0.002)	-0.014 (0.021)
CRI^2		0.054** (0.024)
Controls	✔	✔
Observations	103 151	103 151
R²	0.1719	0.1721
Notes: Regressions include controls for contract values, contract type, buyer type, buyer location, market and tender year. Robust standard errors in parentheses. *** p<0.01, ** p<0.05, * p<0.1.

Source: Fazekas, M. and G. Kocsis (2017[1]), “Uncovering High-Level Corruption: Cross-National Objective Corruption Risk Indicators Using Public Procurement Data”, British Journal of Political Science, Vol. 50/1, pp. 155-164, https://doi.org/10.1017/s0007123417000461.
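The non-linear pattern reported in Box 3.1 can be read directly from the model (2) coefficients in Table 3.10. The short calculation below assumes that the CRI enters the regression additively alongside the controls, so that the slope of predicted grant fraud risk with respect to the CRI is the derivative of the quadratic term; since the linear coefficient is not statistically significant, the turning point should be treated as indicative only.

```python
B_CRI = -0.014    # coefficient on CRI, model (2) in Table 3.10
B_CRI_SQ = 0.054  # coefficient on CRI^2, model (2) in Table 3.10


def marginal_effect(cri):
    """Slope of predicted grant fraud risk with respect to CRI, controls held fixed."""
    return B_CRI + 2 * B_CRI_SQ * cri


def turning_point():
    """CRI value at which the fitted quadratic stops decreasing and starts increasing."""
    return -B_CRI / (2 * B_CRI_SQ)
```

Under these assumptions, the fitted slope turns positive at a CRI of roughly 0.13 and rises to about 0.045 at the granted suppliers' average CRI of 0.55, which is consistent with the observation that the association strengthens at higher levels of procurement corruption risk.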

Benefits of drawing on multiple datasets

This chapter offered a detailed account of how and why different datasets can be linked to existing IGAE datasets with particular attention to promising fraud risk indicators enabled by the new data. These new indicators principally capture actor behaviour rather than simple background characteristics allowing for a far more precise risk assessment. However, data linking not only allows for calculating new indicators in one database and linking them to each other, but also for creating new indicators by drawing on multiple datasets. Such complex indicators offer additional insights on relevant risk dimensions. They also represent a more robust measure of actor behaviour, because multiple sources pointing at the same behaviour carry greater validity than a single dataset.

Drawing on multiple datasets is crucial for comprehensively mapping complex fraud behaviours, as well as for reducing the rate of false positives that are common in simple models (Fazekas, Ugale and Zhao, 2019[2]). Combining multiple indicators stemming from different datasets is considered good practice in risk measurement, as it allows for measurement triangulation. In other words, it increases convergent validity. False positives are pervasive in simple risk assessments, as many indicators merely point at potential wrongdoing rather than actual bad deeds. For instance, widely used indicators of conflicts of interest typically indicate the presence of a potential conflict rather than an actual conflict that represents abuse of a situation for undue personal gain. However, when conflict of interest information is combined with data on outcomes, such as double-counting grants or anomalous financial performance, the combination of indicators provides greater validity to the measurement approach.

Matching datasets representing multiple dimensions of relationships can also power the use of advanced, multi-layer network analytics. Such multi-layered relationships can encompass connections between private companies and public grant making organisations through a range of contractual relationships, or links between companies’ beneficial owners and politically exposed persons working in public sector bodies. Multiple network connections established through the use of large-scale, linked administrative datasets also allow for tracking temporal changes in connections across potentially risky entities and individuals, thereby increasing the analytical sophistication of risk modelling.

Conclusion

This chapter has reviewed a wide variety of potentially useful datasets to add to the existing IGAE dataset. In doing so, it set out a roadmap for data capture and matching that maximises analytical value for the IGAE. Of the reviewed datasets, company information on registration, ownership and financials has the highest potential for further refining the fraud risk assessment model. These datasets can be readily matched to the IGAE’s internal data using company registry IDs. Moreover, matching public procurement data to grants data, as demonstrated by analysing readily available datasets, can add great value, as two sets of risk factors can be triangulated against each other to produce a more reliable risk assessment. Once these high-priority datasets are brought into the IGAE data pipeline, further datasets, such as the bankruptcy register, can also be considered.

References

[2] Fazekas, M., G. Ugale and A. Zhao (2019), Analytics for Integrity: Data-Driven Decisions for Enhancing Corruption and Fraud Risk Assessments, OECD Publishing, Paris, https://www.oecd.org/gov/ethics/analytics-for-integrity.pdf.

[1] Fazekas, M. and G. Kocsis (2017), “Uncovering High-Level Corruption: Cross-National Objective Corruption Risk Indicators Using Public Procurement Data”, British Journal of Political Science, Vol. 50/1, pp. 155-164, https://doi.org/10.1017/s0007123417000461.

Notes

1. In some cases, certain information is presumed to be present in the IGAE’s datasets; however, confirmation of this was not possible because of anonymisation of most of the databases.

2. Access to the dataset is restricted and requires paying a fee for each organisation and receiving a digital certificate. Free access is only allowed to the aggregated data per sector, year or business sector. The only company-level information available without additional restrictions is company status (i.e. operational or not). For the IGAE to use this data, it would need to gain full access to the complete and current dataset, either by paying the bulk access fee or by setting up a special arrangement with the government data provider. Easily accessible public alternatives also exist, for example opencorporates.com, a private social enterprise aiming to make all company data easily accessible around the world.

3. See https://docs.librebor.me/python/.

4. Due to restricted access to the source, it is not clear whether the information on beneficial owners is included. Yet, it is present in the company register; therefore, it is reasonable to expect that LibreBOR also contains a corresponding variable. If it does not, the information can be obtained from the company register after receiving an electronic certificate.


Metadata, Legal and Rights

This document, as well as any data and map included herein, are without prejudice to the status of or sovereignty over any territory, to the delimitation of international frontiers and boundaries and to the name of any territory, city or area. Extracts from publications may be subject to additional disclaimers, which are set out in the complete version of the publication, available at the link provided.

https://doi.org/10.1787/0ea22484-en

The use of this work, whether digital or print, is governed by the Terms and Conditions to be found at http://www.oecd.org/termsandconditions.