Finding information across Open Data sources and combining it is a complex task: reliable information must be found, and the data must be sanitized and integrated, before any query for data analysis can be executed. Once the integration is done, it is necessary to know in detail how the data relate in order to combine the available data correctly. National governments are typical Open Data producers that make available large amounts of unrelated open data. For instance, the Brazilian government publishes many datasets in web portals such as the Brazilian Open Data portal and the National Institute of Educational Researches (INEP). In addition, integrated data demands maintenance effort to remain reliable and consistent. There are solutions that enable querying the original data sources directly, but the data cleaning and combination problem remains, so this continues to be a difficult task for data analysts.
The Blended Integrated Open Data (BIOD) is an integrated repository that can be accessed through an API to execute analytical queries. It is built using a framework called BlenDb, and it has been developed by the Center of Computer Science and Free Software (C3SL) [Direne et al. 2016], located at the Federal University of Parana. The repository's central objective is to provide an analytical API for which it is not necessary to know the relations between data, only the metrics (calculations over some measures), dimensions (degree of aggregation) and filters (selection of a subset of the data). When possible, the BlenDb framework finds the data combinations automatically, based on a previously defined configuration file that is transparent to the API user.
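To make the three query concepts concrete, the sketch below applies a metric, a dimension and a filter to a handful of in-memory records. It is only an illustration of the vocabulary: the records, the field names and the `run_query` helper are invented for this example and say nothing about how BlenDb is implemented or how the BIOD API is called.

```python
from collections import defaultdict

# Toy records invented for illustration only; this is not BIOD data.
rows = [
    {"region": "South", "year": 2014, "schools": 10},
    {"region": "South", "year": 2015, "schools": 12},
    {"region": "North", "year": 2014, "schools": 7},
    {"region": "North", "year": 2013, "schools": 5},
]


def run_query(rows, metric_field, dimension, row_filter):
    """Sum `metric_field` per value of `dimension`, over rows passing `row_filter`.

    Hypothetical helper: the metric here is a sum, the dimension sets the
    degree of aggregation, and the filter selects a subset of the data.
    """
    totals = defaultdict(int)
    for row in rows:
        if row_filter(row):
            totals[row[dimension]] += row[metric_field]
    return dict(totals)


# Metric: total schools. Dimension: region. Filter: year 2014 or later.
result = run_query(rows, "schools", "region", lambda r: r["year"] >= 2014)
print(result)
```

Note that the caller never states how the records relate to each other; only the metric, the dimension and the filter are given, which is the usage model described above.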
The current version of BIOD is composed of the data sources and tables described below. So far, it contains more than 2 billion records and more than 800 attributes. The degree of normalization depends on the original data source, so some tables have more than a hundred attributes. The original data is in Portuguese, but we provide English translations to ease understanding.
Educational Open Data Laboratory - general (LDE)
Attribute naming conventions: We have defined a set of conventions for naming the API parameters, to make them easier to understand, as in the examples below:
|Name||Aggregation||Type of data||Description|
|met:count:cidade:id||count||integer||Number of cities|
|met:avg:docente:idade||avg||float||Average Professor's age|
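The naming convention can be read mechanically. The sketch below splits a metric name into its colon-separated parts. It assumes, based only on the two examples in the table, that a metric name has exactly four parts: the `met` prefix, the aggregation, and two further components that are labeled `entity` and `field` here. Those two labels and the `parse_metric_name` helper are assumptions made for illustration and are not part of the BIOD API.

```python
from typing import NamedTuple


class MetricName(NamedTuple):
    aggregation: str  # e.g. "count" or "avg", matching the Aggregation column
    entity: str       # assumed label for the third component, e.g. "cidade"
    field: str        # assumed label for the fourth component, e.g. "id"


def parse_metric_name(name: str) -> MetricName:
    """Split a name such as 'met:count:cidade:id' into its components.

    Hypothetical helper: assumes the four-part 'met:<agg>:<entity>:<field>'
    shape seen in the table's examples.
    """
    parts = name.split(":")
    if len(parts) != 4 or parts[0] != "met":
        raise ValueError(f"not a four-part metric name: {name!r}")
    _, aggregation, entity, field = parts
    return MetricName(aggregation, entity, field)


print(parse_metric_name("met:count:cidade:id"))
print(parse_metric_name("met:avg:docente:idade"))
```

Names that do not follow this shape raise a `ValueError` rather than being guessed at, since the full convention may include forms not shown in the table.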
Calling the BIOD API
Consider the following analytical query: we need to return, by Brazilian region, the number of internet access points available in a set of cities, the average GDP, the population count, and the number of higher education institutions and schools, filtered by active internet points, for the years 2014 to 2017, ordered by the GDP of each region. This question can be answered with the API call below: