Português
Português

Blended Integrated Open Data

Finding information over Open Data sources and combining them is a complex task, because it is necessary to find reliable information, to sanitize the data and to integrate it, prior to executing any query for data analysis.

Once the integration is done, it is necessary to know in details how the data relates to correclty combine the available data. National governments are typical Open Data producers that make available large amounts of unrelated open data. For instance, the Brazilian government publishes many data in two web portals, such as the Brazilian Open Data, or National Institute of Educational Researches (INEP)>. In addition, the integrated data demands maintenance efforts to keep data reliable and consistent. There are solutions that enable querying directly over the original data sources, but the data cleaning and combination problem remains, thus it continues to be a difficult task for data analysts.

The Blended Integrated Open Data (BIOD) is an integrated repository that can be accessed through an API to execute analytical queries. It is built using the framework called BlenDb. It has been developed by the Center of Computer Science and Free Software (C3SL) [Direne et al. 2016] located at the Federal University of Parana.

The repository central objective is to provide an analytical API, in which it is not necessary to know the relations between data, only the metrics (calculations over some measures), dimensions (degree of aggregation) and filters (selection of a subset of the data). The data combinations are found automatically by the BlenDb framework, when it is possible, based on a configuration file previously defined, but that is transparent from the API user.

The current version of BIOD is composed by the data sources and tables described below. So far, it contains more than 2 billions of records and more than 800 attributes. The degree of normalization depends on the original data source, thus some tables have more than hundreds of attributes. The original data is in Portuguese, but we provide English translation to ease the understanding.

Educational Open Data Laboratory - general (LDE)

Educational Open Data Laboratory - Under Graduation Educational Open Data Laboratory - primary studies

A query in BlenDb has the following format:

http://biod.c3sl.ufpr.br/v1/data?METRIC_1,METRIC_N &dimensions=DIMENSION_1,DIMENSION_N &filters=CLAUSE_1_FILTER_1,CLAUSE_1_FILTER_N; CLAUSE_N_FILTER_1,CLAUSE_N_FILTER_N

The framework builds the query, by finding the tables, its attributes and the relationships between them, transparently to the user. We call this query format as RFQ - relation-free query, because it is not allowed to specify relations, only metrics, dimensions and filters. As already stated, the goal is avoid explicitly knowing and specifying JOINs over the integrated data.

Conventions of the attributes names:
We have defined a set of conventions to name the parameters of the API, to ease its understanding, following the example below:

Name Aggregation Type of data Description
met:count:cidade:id count integer Number of cities
met:avg:docente:idade avg float Average Professor's age

Note that the name met: represents the metric, count: is the type of aggregation, followed by the kind of data and attribute name, respectively cidade:id. On the second line, it uses the avg aggregation function. BIOD uses this convention for the following functions: SUM, AVG, MAX, MIN, COUNT. When specifying the dimensions or the filters, the prefix is dim:.

Calling the BIOD API
Consider it is necessary to do the following analytic query: we need to return, by Brazilian region, the number of internet acess that are accessible in a set of cities, the GDB average, the population count, the number of higher education institutions and schools, filtered by active internet points, for the years 2014 to 2017, ordered by the GDP of each region.

This question can be answered with the API call below:

https://biod.c3sl.ufpr.br/api/v1/data?metrics=met:count:ponto:id,met:avg:ibge:pib,met:sum:ibge:populacao,met:count:es:instituicao:id,met:count:escola:id&dimensions=dim:regiao:nome&filters=dim:ponto:ativo==t;dim:ibge:censo:ano==2014;dim:es:instituicao:censo:ano==2017;dim:escola:censo:ano==2017&sort=met:avg:ibge:pib

The result is a JSON object, as shown in the Figure below. The result can also be obtained in a CSV file, by adding an additional parameter at the end of the URL: '&format=csv'
query response

Integration with Google Sheets
The BIOD data can be directly integrated into a Google Sheet. The integration is explained in this link.

References