Journal:Data and metadata brokering – Theory and practice from the BCube Project

Full article title: Data and metadata brokering – Theory and practice from the BCube Project
Journal: Data Science Journal
Author(s): Khalsa, Siri Jodha Singh
Author affiliation(s): University of Colorado
Primary contact: Email: sjsk at nsidc dot org
Year published: 2017
Volume and issue: 16(1)
Page(s): 1
DOI: 10.5334/dsj-2017-001
ISSN: 1683-1470
Distribution license: Creative Commons Attribution 4.0 International
Website: http://datascience.codata.org/articles/10.5334/dsj-2017-001/
Download: http://datascience.codata.org/articles/10.5334/dsj-2017-001/galley/620/download/ (PDF)

Abstract

EarthCube is a U.S. National Science Foundation initiative that aims to create a cyberinfrastructure (CI) for all the geosciences. An initial set of "building blocks" was funded to develop potential components of that CI. The Brokering Building Block (BCube) created a brokering framework to demonstrate cross-disciplinary data access based on a set of use cases developed by scientists from the domains of hydrology, oceanography, polar science and climate/weather. While some successes were achieved, considerable challenges were encountered. We present a synopsis of the processes and outcomes of the BCube experiment.

Keywords: interoperability, brokering, middleware, EarthCube, cross-domain, socio-technical

Genesis and objectives of EarthCube

In 2011 the U.S. National Science Foundation initiated EarthCube, a joint effort of NSF’s Office of Cyberinfrastructure (OCI), whose interest was in computational and data-rich science and engineering, and the Geosciences Directorate (GEO), whose interest was in understanding and forecasting the behavior of a complex and evolving Earth system. The goal in creating EarthCube was to create a sustainable, community-based and open cyberinfrastructure for all researchers and educators across the geosciences.

The NSF recognized there was no infrastructure that could manage and provide access to all geosciences data in an open, transparent and inclusive manner, and that progress in geosciences would be increasingly reliant on interdisciplinary activities. Therefore, a system that enabled the sharing, interoperability and re-use of data needed to be created.

Similar efforts to provide the infrastructure needed to support scientific research and innovation are underway in other countries, most notably in the European Union, guided by the European Strategy Forum on Research Infrastructures (ESFRI), and in Australia under the National Collaborative Research Infrastructure Strategy (NCRIS). The goal of these efforts is to provide scientists, policy makers and the public with computing resources, analytic tools and educational material, all within an open, interconnected and collaborative environment.

The nature of infrastructure development

The building of infrastructure is as much a social endeavor as a technical one. Bowker et al.[1] emphasized that information infrastructures comprise more than the data, tools and networks that make up the technical elements; they also involve the people, practices, and institutions that lead to the creation, adoption and evolution of the underlying technology. The NSF realized that a cyberinfrastructure, to be successful, must have substantial involvement of the target community through all phases of its development, from inception to deployment. In fact, studies have shown that infrastructure evolves from independent and isolated efforts and that there is no clear point at which "deployment" is complete.[2] The fundamental challenge was the heterogeneity of the scientific disciplines and technologies that needed to work together to accomplish this goal, along with the necessity of getting all stakeholders to cooperate in its development. A compounding factor is that while technology evolves rapidly, people’s habits, work practices, cultural attitudes towards data sharing, and willingness to use others' data all evolve more slowly. How the relationship of people to the infrastructure evolves determines whether it succeeds or fails.

A significant element of NSF’s strategy for building EarthCube was to make it a collective effort of geoscientists and technologists from the start, in hopes of ensuring that what was developed did indeed serve the needs of geoscientists and would in fact find widespread uptake. A series of community events and end-user workshops spanning the geoscience disciplines were undertaken with the dual goals of gathering requirements for EarthCube and building a community of geoscientists willing to engage with and take ownership of the EarthCube process.

NSF began issuing small awards to explore concepts for EarthCube. These were followed by the funding of an initial set of "building blocks" meant to demonstrate potential components of EarthCube. The Brokering Building Block (BCube) was one of these awards. BCube sought both to solve real problems of interoperability that geoscientists face in carrying out research and to study the social aspects of technology adoption.

The challenge of cross-disciplinary interoperability

Interoperability has many facets and can be viewed from either the perspective of systems or people. Systems are interoperable when they can exchange information without having to know the details of each other's internal workings. Likewise, people view systems or data as interoperable when they don’t have to learn the intricacies of each in order to use them. When systems are interoperable, users of those systems should have uniform access and receive harmonized services and data from them. This is the vision of EarthCube. Delivering on that vision can be considered the "grand challenge" of information technology as applied to the geosciences.

Achieving interoperability across the geosciences is challenging because the many scientific fields that make up the geosciences each have their own methods, standards and conventions for managing and sharing data. The sophistication of the information technologies that have been adopted in each community, the degree of standardization on data exchange formats and vocabularies, the amount of centralization in data cataloguing, and the openness to sharing data all vary greatly.

The methods of achieving interoperability across distributed systems can be categorized as shown in Table 1.

Table 1. Methods for achieving interoperability
| Method | Requirements | Benefits |
|---|---|---|
| Adherence to common standards | Uniformity in system configuration | De facto interoperability |
| Gateways and translators | Installation and maintenance of custom or third-party software | Can adapt to new or changing protocols and standards |
| Brokers as infrastructure, third-party mediation | Creation and maintenance of brokering framework with custom adapters | Provides two-way translations between disparate systems and removes burdens of interoperability from data provider |

Since disciplines will always use different standards for encoding, accessing and describing data, the first option is not a realistic one for the geosciences. The second method is currently in wide use within the geosciences. GBIF[3], for example, harvests metadata from multiple external systems and then maps the metadata, which are served through different protocols and use different schemas, to a common standard. Systems such as ERDDAP[4][5] act as servers that access disparate datasets and serve them through a common interface. What BCube explored was the possibility that a broker, mediating the interactions between many systems serving data and many systems requesting data, could be established as a shared service, i.e., as infrastructure, without being tied to any particular repository or user portal.
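The harvest-and-map pattern described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the source names, the field names and the three-field "common schema" are hypothetical and are not taken from GBIF, ERDDAP or BCube.

```python
# Hypothetical field mappings: common field name -> source-specific field name.
# Neither the sources nor the field names come from any real system.
FIELD_MAPS = {
    "source_a": {"title": "dc:title", "description": "dc:description", "identifier": "dc:identifier"},
    "source_b": {"title": "name", "description": "abstract", "identifier": "id"},
}


def to_common(source: str, record: dict) -> dict:
    """Map one harvested record onto the common schema; missing fields become None."""
    mapping = FIELD_MAPS[source]
    common = {field: record.get(src_field) for field, src_field in mapping.items()}
    common["source"] = source  # keep provenance so users can trace a record back
    return common


# Two records as they might arrive from sources that use different schemas.
harvested = [
    ("source_a", {"dc:title": "Sea surface temperature", "dc:identifier": "a-001"}),
    ("source_b", {"name": "Stream gauge readings", "abstract": "Hourly discharge", "id": "b-17"}),
]

catalog = [to_common(src, rec) for src, rec in harvested]
```

In this sketch the translation lives entirely in the harvesting system, so each new source costs one more entry in the mapping table rather than a change on the provider's side.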

Edwards et al.[6] show that technical infrastructures such as electrical grids and railroads evolve in stages, and the final stage is "a process of consolidation characterized by gateways that allow dissimilar systems to be linked into networks." Brokering is such a gateway, applied in the context of information systems. While brokering technologies such as CORBA (Common Object Request Broker Architecture) have been in existence since the 1990s, their application typically requires participants in a network to install software packages that enable interfacing through a common protocol. Conformance to uniform standards is clearly a barrier in cross-disciplinary contexts since each community tends to develop its own conventions for storing, describing and accessing data.

The BCube brokering framework

The BCube project advanced a brokering framework by addressing the social, technical and organizational aspects of cyberinfrastructure development. It sought to identify best practices in both technical and cultural contexts by engaging scientists with the evolving cyberinfrastructure to achieve effective cross-disciplinary collaborations. The engagement included a number of different communities in guiding and testing the development, with the aim of involving geoscientists at a deep level in the entire process.

BCube adapted a brokering framework that had been developed for the EuroGEOSS project[7] and subsequently deployed in the Global Earth Observation System of Systems (GEOSS). Called the Discovery and Access Broker, or DAB[8], it has successfully brokered millions of data records from dozens of data sources. Guided by the recommendations laid out in the Brokering Roadmap[9], BCube sought to demonstrate how brokering could enhance cross-disciplinary data discovery and access by having scientists from different fields create real-world science scenarios that required the use of data from diverse sources.

The approach that BCube promoted was one in which the broker was taught to interact with each community’s conventions, allowing the participating systems to interact without adopting a common set of standards. BCube developers then set about configuring a cloud-based version of the DAB to access these sources. This required developing software components, called "accessors," that interacted with each data source. At the start of the project we believed the suite of accessors that had already been developed for GEOSS could in many cases be reused for brokering the datasets identified in BCube's science scenarios.
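The accessor idea can be sketched as a small adapter pattern. This is an illustrative sketch only and not the DAB's actual design or API: the class names (`Accessor`, `Broker`, `Record`), the two toy sources and their data layouts are all hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Record:
    """Common result shape the broker hands back to every client."""
    title: str
    source: str


class Accessor(ABC):
    """Adapter that speaks one data source's native convention."""

    @abstractmethod
    def search(self, keyword: str) -> List[Record]:
        ...


class TitleListAccessor(Accessor):
    """Hypothetical source that exposes a flat list of dataset titles."""

    def __init__(self, name: str, titles: List[str]) -> None:
        self.name = name
        self.titles = titles

    def search(self, keyword: str) -> List[Record]:
        return [Record(t, self.name) for t in self.titles if keyword.lower() in t.lower()]


class KeyedCatalogAccessor(Accessor):
    """Hypothetical source that keys entries by identifier and uses a 'label' field."""

    def __init__(self, name: str, entries: Dict[str, Dict[str, str]]) -> None:
        self.name = name
        self.entries = entries

    def search(self, keyword: str) -> List[Record]:
        return [
            Record(entry["label"], self.name)
            for entry in self.entries.values()
            if keyword.lower() in entry["label"].lower()
        ]


class Broker:
    """Fans one client query out to every registered accessor and merges the results."""

    def __init__(self) -> None:
        self._accessors: List[Accessor] = []

    def register(self, accessor: Accessor) -> None:
        self._accessors.append(accessor)

    def search(self, keyword: str) -> List[Record]:
        results: List[Record] = []
        for accessor in self._accessors:
            results.extend(accessor.search(keyword))
        return results


broker = Broker()
broker.register(TitleListAccessor("ocean_repo", ["Sea surface temperature", "Salinity profiles"]))
broker.register(KeyedCatalogAccessor("polar_repo", {
    "p1": {"label": "Air temperature at the ice sheet"},
    "p2": {"label": "Snow depth"},
}))
results = broker.search("temperature")
```

The point of the sketch is that neither toy source changes its own conventions; supporting a further source means writing one more accessor, which is where reuse of existing accessors would save effort.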

The brokering framework is depicted in Figure 1.

Figure 1. The BCube Broker, based on GI-cat and related software from CNR (the National Research Council of Italy), mediates two-way requests and responses between clients, depicted on the left, and data repositories, depicted on the right, for data query, access, and transform services.

Science scenarios

The project was guided by science scenarios developed by the geoscientists on the BCube team. These scenarios were used to define requirements for the broker development while engaging the geoscience community with EarthCube. They also provided the basis for evaluating the added value of brokering.

The term "science scenario" was used in place of what is more commonly known in software development as a "use case." This was in response to a concern that EarthCube should be solving real, rather than hypothetical, problems.

The science scenarios, coming from the fields of hydrology, oceanography, polar science and climate/weather, focused on the specific research needs of each scientist. For each scenario, a team composed of domain scientists and computer scientists was convened to investigate the ability of the BCube Brokering Framework to meet the identified needs of the scientists. These needs determined what new or modified mediation functions the broker needed to perform in order to fulfill the scenario.

Several different types of scenarios were defined. There were scenarios that described high-level science research or education goals without referencing specific data and services. Enacting these scenarios involved both discovery and access. The primary type of BCube scenario was the detailed science or education scenario in which the scientist identified specific data sources and services that they wished to have access to. Each scenario described the end-to-end activities required to achieve a science objective. By observing how the objective was accomplished first without brokering and then with brokering, we were able to evaluate how the broker saved time and effort. The flow for this type of scenario is depicted in Figure 2.

Figure 2. Flow in development and enactment of BCube science scenarios

The third type of scenario the project defined involved configuring the broker to access a major data repository, making that repository's resources discoverable and accessible and thereby supporting cross-discipline research.

The BCube Brokering Framework gives access to 17 different data repositories serving over five million datasets, as shown in Table 2.

Table 2. Resources brokered by the BCube Brokering Framework, along with the access protocol and number of records for each
| Repository/source | Protocol | Number of datasets |
|---|---|---|
| AVHRR SST | THREDDS | 62,777 |
| BCO DMO | SPARQL | 10,702 |
| Global Multi-Resolution Topography (GMRT) | OGC WMS | 13 |
| IRIS Event | Custom | 4,213,828 |
| IRIS Station | Custom | 544,991 |
| Integrated Marine Observing systems | OGC CSW | 601 |
| NASA ASTER | OPeNDAP | 22,684 |
| NERRS | SOAP | 329 |
| NSIDC | OpenSearch | 161 |
| One Geology | OGC CSW | 438 |
| PANGAEA | OAI-PMH | 356,943 |
| RTOF Models | GrADS | 46 |
| Rutgers ERDDAP service | OPeNDAP | 1,200 |
| SRTM NASA | OPeNDAP | 14,282 |
| UNAVCO GPS | Custom | 1,739 |
| UNAVCO SSARA | SOAP | 2,000 |
| US NODC | OGC CSW | 29,840 |
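As a quick arithmetic check on the figures quoted for Table 2, the short script below totals the dataset counts and groups them by access protocol. The numbers are copied from the table; the variable names are only illustrative.

```python
# (repository, protocol, number of datasets), copied from Table 2.
BROKERED = [
    ("AVHRR SST", "THREDDS", 62_777),
    ("BCO DMO", "SPARQL", 10_702),
    ("Global Multi-Resolution Topography (GMRT)", "OGC WMS", 13),
    ("IRIS Event", "Custom", 4_213_828),
    ("IRIS Station", "Custom", 544_991),
    ("Integrated Marine Observing systems", "OGC CSW", 601),
    ("NASA ASTER", "OPeNDAP", 22_684),
    ("NERRS", "SOAP", 329),
    ("NSIDC", "OpenSearch", 161),
    ("One Geology", "OGC CSW", 438),
    ("PANGAEA", "OAI-PMH", 356_943),
    ("RTOF Models", "GrADS", 46),
    ("Rutgers ERDDAP service", "OPeNDAP", 1_200),
    ("SRTM NASA", "OPeNDAP", 14_282),
    ("UNAVCO GPS", "Custom", 1_739),
    ("UNAVCO SSARA", "SOAP", 2_000),
    ("US NODC", "OGC CSW", 29_840),
]

repository_count = len(BROKERED)
total_datasets = sum(count for _, _, count in BROKERED)

# Sum dataset counts per access protocol.
by_protocol = {}
for _, protocol, count in BROKERED:
    by_protocol[protocol] = by_protocol.get(protocol, 0) + count

print(repository_count, total_datasets)
for protocol, count in sorted(by_protocol.items(), key=lambda item: -item[1]):
    print(f"{protocol}: {count:,}")
```

The table lists 17 sources whose counts sum to 5,262,574, consistent with "over five million datasets." Most of that total comes from the three sources reached through custom protocols (4,760,558 in all), dominated by IRIS Event.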

References

  1. Bowker, G.C.; Baker, K.; Millerand, F.; Ribes, D. (2010). "Toward Information Infrastructure Studies: Ways of Knowing in a Networked Environment". In Hunsinger, J.; Klastrup, L.; Allen, M. International Handbook of Internet Research. Springer Netherlands. pp. 97–117. ISBN 9781402097898. 
  2. Star, S.L.; Ruhleder, K. (1996). "Steps toward an ecology of infrastructure: Design and access for large information spaces". Information Systems Research 7 (1): 111–134. doi:10.1287/isre.7.1.111. 
  3. Edwards, J.L.; Lane, M.A.; Nielsen, E.S. (2000). "Interoperability of biodiversity databases: Biodiversity information on every desktop". Science 289 (5488): 2312–2314. doi:10.1126/science.289.5488.2312. PMID 11009409. 
  4. Simons, R.A.; Mendelssohn, R. (2012). "ERDDAP - A Brokering Data Server for Gridded and Tabular Datasets". American Geophysical Union, Fall Meeting 2012 2012: IN21B-1473. http://adsabs.harvard.edu/abs/2012AGUFMIN21B1473S. 
  5. Delaney, C.; Alessandrini, A.; Greidanus, H. (2016). "Using message brokering and data mediation on earth science data to enhance global maritime situational awareness". IOP Conference Series: Earth and Environmental Science 34: 012005. doi:10.1088/1755-1315/34/1/012005. 
  6. Edwards, P.N.; Jackson, S.J.; Bowker, G.C. et al. (2012). "Understanding infrastructure: Dynamics, tensions, and design". Deep Blue. http://hdl.handle.net/2027.42/49353. 
  7. Vaccari, L.; Craglia, M.; Fugazza, C. et al. (2012). "Integrative Research: The EuroGEOSS Experience". IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 5 (6): 1603–1611. doi:10.1109/JSTARS.2012.2190382. 
  8. Nativi, S.; Craglia, M.; Pearlman, J. (2013). "Earth Science Infrastructures Interoperability: The Brokering Approach". IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 6 (3): 1118–1129. doi:10.1109/JSTARS.2013.2243113. 
  9. Khalsa, S.J.; Pearlman, J.; Nativi, S. et al. (2013). "Brokering for EarthCube Communities: A Road Map" (PDF). National Snow and Ice Data Center. doi:10.7265/N59C6VBC. https://www.earthcube.org/sites/default/files/doc-repository/EarthCube%2520Brokering%2520Roadmap.pdf. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. The original article lists references alphabetically, but this version — by design — lists them in order of appearance. Footnotes have been changed from numbers to letters as citations are currently using numbers.