Difference between revisions of "Journal:Earth science data analytics: Definitions, techniques and skills"

From LIMSWiki
   | style="background-color:white; padding-left:10px; padding-right:10px;"|Multivariable calculus and linear algebra
   | style="background-color:white; padding-left:10px; padding-right:10px;"|Thinking like a data scientist
|-
|}
|}
The website ''Master's in Data Science''<ref name="MastersCompleteDir">{{cite web |url=http://www.mastersindatascience.org/schools/ |title=Complete Directory of Data Science Graduate Degrees |work=MastersInDataScience.org}}</ref> summarizes the technical skills and tools that a data scientist needs to perform data analytics, as shown in Table 2.
{|
| STYLE="vertical-align:top;"|
{| class="wikitable" border="1" cellpadding="5" cellspacing="0" width="80%"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|'''Table 2.''' Technical skills and tools of a data scientist<ref name="MastersCompleteDir" />
|-
  ! style="padding-left:10px; padding-right:10px;"|Technical skills and tools of a data scientist
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Math (e.g., linear algebra, calculus, and probability)
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Statistics (e.g., hypothesis testing and summary statistics)
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Machine learning tools and techniques (e.g., k-nearest neighbors, random forests, ensemble methods, etc.)
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Software engineering skills (e.g., distributed computing, algorithms, and data structures)
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Data mining
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Data cleaning and munging
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Data visualization (e.g., ggplot and d3.js) and reporting techniques
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Unstructured data techniques
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|R and/or SAS languages
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|SQL databases and database querying languages
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Python (most common), C/C++, Java, Perl
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Big data platforms like Hadoop, Hive, and Pig
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Cloud tools like Amazon S3
  |-
  |-
|}
|}

Revision as of 20:53, 10 July 2017

Full article title: Earth science data analytics: Definitions, techniques and skills
Journal: Data Science Journal
Author(s): Kempler, Steve; Mathews, Tiffany
Author affiliation(s): NASA Goddard Space Flight Center, NASA Langley Research Center
Primary contact: Email: gulliver2100 at verizon dot net
Year published: 2017
Volume and issue: 16
Page(s): 6
DOI: 10.5334/dsj-2017-006
ISSN: 1683-1470
Distribution license: Creative Commons Attribution 4.0 International
Website: http://datascience.codata.org/articles/10.5334/dsj-2017-006/
Download: http://datascience.codata.org/articles/10.5334/dsj-2017-006/galley/632/download/ (PDF)

Abstract

The continuous evolution of data management systems affords great opportunities for the enhancement of knowledge and advancement of science research. To capitalize on these opportunities, it is essential to understand and develop methods that enable data relationships to be examined and information to be manipulated. Earth science data analytics (ESDA) comprises the techniques and skills needed to holistically extract information and knowledge from all sources of available, often heterogeneous, data sets. This paper reports on the ground-breaking efforts of the Earth Science Information Partners' (ESIP) ESDA Cluster in defining ESDA and identifying ESDA methodologies. Given the absence of earth science data analytics in the literature, the ESIP ESDA definition and goals serve as an initial framework for a common understanding of the techniques and skills that are available, as well as those still needed, to support ESDA. ESDA techniques and skills have been assembled through the acquisition of earth science research use cases and the categorization of result-oriented ESDA research goals. The outcome provides the community with a definition of ESDA that is useful in articulating data management and research needs, as well as a working list of techniques and skills relevant to the different types of ESDA.

Keywords: data science, analytics, techniques, skills, information, knowledge

Introduction

The continuous evolution of data management systems affords great opportunities for the enhancement of knowledge and advancement of earth science research. With the growing need and desire to leverage information from various sources to better understand our environment, it becomes evident, through community experience and foresight, that this understanding can be maximized by accepting new ways to cross-examine this information. As excerpts from Hey, Tansley, and Tolle's The Fourth Paradigm explain[1]:

We have to do better at producing tools to support the whole research cycle—from data capture and data curation to data analysis and data visualization. (pg. xvii)

Clearly, data-intensive science … must move beyond data warehouses and closed systems, striving instead to allow access to data to those outside the main project teams, allow for greater integration of sources, and provide interfaces to those who are expert scientists but not experts in data administration and computation. (pg. 147)

We are already seeing some attempts to infer knowledge based on the world’s information. (pg. 167)

To capitalize on these opportunities, it is essential to develop a data analytics framework in which we scope the scientific, technical, and methodological components that contribute to advancing science research. Through this framework, we can categorize discussions among individuals of like component interests, instead of attempting to draw specific direction from a set of starting points that may greatly vary.

Data analytics is the process of examining large amounts of data of a variety of types to reveal hidden patterns, unknown correlations, and other useful information that is key to facilitating earth science research opportunities. Thus, the research presented here is motivated by the need to determine and categorize available data analytics techniques and skills and to identify gaps where they are still needed.

Today (and well into the foreseeable future) there is rapid growth in the amount of earth science data and value-added heterogeneous information that many earth science researchers have not yet holistically leveraged. It is important to realize that this rate of data growth is both new and challenging. It is new in that information technology is just beginning to provide the tools for advancing the analysis of heterogeneous datasets in a "big" way, creating opportunities to discover non-obvious scientific relationships previously invisible to the scientific eye. The challenge is that it takes individuals, or teams of individuals, with just the right combination of skills to understand the data and develop the methods to glean knowledge from data and information.

The ability to apply the information technology, tools, and services needed to advance earth science research is becoming more evident and more necessary at an accelerating rate. That is, if data manipulation (subsetting, data transformation, format conversion, etc.) extracts information from data, then data analytics techniques and skills glean knowledge from information.[2]
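
As an illustration only, the three data manipulation steps named above (subsetting, data transformation, and format conversion) can be sketched in a few lines of Python. The records, field names, and function names below are hypothetical and are not drawn from the article or from any ESIP tool; the sketch assumes temperatures reported in kelvin.

```python
import csv
import io

# Hypothetical observations: (latitude, longitude, temperature in kelvin).
observations = [
    (10.0, 20.0, 300.15),
    (45.0, -75.0, 273.15),
    (60.0, 100.0, 263.15),
]

def subset_by_latitude(records, lat_min, lat_max):
    """Subsetting: keep only records whose latitude lies in [lat_min, lat_max]."""
    return [r for r in records if lat_min <= r[0] <= lat_max]

def kelvin_to_celsius(records):
    """Data transformation: convert the temperature field from kelvin to degrees Celsius."""
    return [(lat, lon, round(temp_k - 273.15, 2)) for lat, lon, temp_k in records]

def to_csv(records):
    """Format conversion: serialize the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["lat", "lon", "temp_c"])
    writer.writerows(records)
    return buffer.getvalue()

mid_latitudes = subset_by_latitude(observations, 30.0, 70.0)
converted = kelvin_to_celsius(mid_latitudes)
csv_text = to_csv(converted)
```

Each step here only reshapes the data; interpreting what the resulting values mean is the part the text assigns to data analytics.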

The objective of the Earth Science Information Partners (ESIP)[3] Federation Earth Science Data Analytics (ESDA) Cluster is to understand, define, and facilitate the implementation of earth science data analytics. As a result of Cluster efforts, an ESIP-adopted definition of ESDA has been generated, along with 10 ESDA goals, to set the framework for a common understanding for advancing earth science research. In addition, the ESIP ESDA Cluster performed an exhaustive search identifying ESDA types, techniques, and skills used in performing data analytics.

In this paper, we present the ESIP-derived definition of ESDA and differentiate it from other published definitions of data analytics. We then describe different types of ESDA and their driving goals. This is followed by an exhaustive survey of current techniques and skills for performing ESDA, made available for the benefit of data scientists exploring new earth science data analytics methodologies and their potential use.

Literature review

The advancement of information use resulting from evolving technologies, newly developed techniques, and refined skills has become the purview of the data scientist performing data analytics. Data analytics got its start in the business world, which is why most literature and developed tools treat business as the primary application. In the literature, we find that data analytics comprises five types: descriptive, diagnostic, discoverative, predictive, and prescriptive. When the ESIP ESDA Cluster attempted to categorize earth science research use cases into these data analytics types, the use cases did not fit: categorization was ambiguous, or a use case fit more than one type. Where business data analytics types reflect looking for patterns and predicting (and prescribing) actions, earth science data analytics also includes assessing, validating, calibrating, and applying the techniques required to prepare raw datasets for co-use. In addition, the characteristics of earth science data introduce data analytics challenges such as dealing with differing formats, differing spatial and temporal data resolutions, inconsistent data acquisition techniques and units for the same measurement, noise, and biasing, to mention a few. This led to the need for a data analytics definition directed specifically at earth science research goals.
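
To make the co-use challenge concrete, the following minimal Python sketch harmonizes two hypothetical temperature records that differ in both temporal resolution and units before comparing them. The sensor names, dates, and values are invented for illustration, and daily averaging is only one of several possible ways to reconcile resolutions; it is not presented as the ESIP ESDA Cluster's method.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sensor A: sub-daily air temperature in degrees Fahrenheit.
sensor_a = [
    ("2017-07-01", 68.0),
    ("2017-07-01", 86.0),
    ("2017-07-02", 59.0),
    ("2017-07-02", 77.0),
]

# Hypothetical sensor B: daily mean air temperature in degrees Celsius.
sensor_b = {"2017-07-01": 24.0, "2017-07-02": 21.0}

def fahrenheit_to_celsius(temp_f):
    """Unit harmonization: convert degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0

def daily_means_celsius(readings):
    """Temporal harmonization: average sub-daily readings into one value per day."""
    by_day = defaultdict(list)
    for day, temp_f in readings:
        by_day[day].append(fahrenheit_to_celsius(temp_f))
    return {day: mean(values) for day, values in by_day.items()}

def paired_differences(a_daily, b_daily):
    """Compare the two records only on the days both of them cover."""
    shared_days = sorted(set(a_daily) & set(b_daily))
    return {day: round(a_daily[day] - b_daily[day], 2) for day in shared_days}

a_daily = daily_means_celsius(sensor_a)
differences = paired_differences(a_daily, sensor_b)
```

Only after both records share a unit and a time step do the per-day differences become meaningful, for example as a first look at a possible bias between instruments.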

In addition, insights by Hey et al., like "[r]esearchers in science must work with colleagues in computer science and informatics to develop field-specific requirement"[1] (pg. 151) and "[t]he process of combining information from existing scientific knowledge … including the specific methodologies that were followed to produce conclusions, should be automatic and implicitly supported"[1] (pp. 170-171), are the genesis for melding the expanding manipulation of information and associated technologies with physical science.

A significant aspect of ESDA is information literacy, the ability to "recognize when information is needed and have the ability to locate, evaluate, and use information effectively."[4] Now that we are well into the information age, ensuring the responsible and appropriate use of data and information has become extremely important and a key skill for ESDA. The Framework for Information Literacy for Higher Education[5] provides groundwork for learners to understand and fully appreciate the value of information used when performing ESDA.

As seen in the literature, there is no shortage of definitions of data analytics or of descriptions of the individual who performs it, the data scientist. The Booz Allen Hamilton (BAH) report The Field Guide to Data Science reminds us that "the term Data Science appeared in the computer science literature throughout the 1960s-1980. It was not until the late 1990s however, that the field, began to emerge from the statistics and data mining communities."[6] The BAH report further states that "[p]erforming Data Science requires the extraction of timely, actionable information from diverse data sources to drive data products."[6]

The National Institute of Standards and Technology provides the following definitions[7]:

  • Data science is the extraction of actionable knowledge directly from data through a process of discovery, or hypothesis formulation and hypothesis testing.
  • A data scientist is a practitioner who has sufficient knowledge in the overlapping regimes of business needs, domain knowledge, analytical skills, and software and systems engineering to manage the end-to-end data processes in the data life cycle.
  • The analytics process is the synthesis of knowledge from information.

The article "8 Skills You Need to Be a Data Scientist"[8] defines the skills as follows in Table 1.

Table 1. Skills of a data scientist[8]
Skills of a data scientist
Basic tools | Data munging
Basic statistics | Data visualization and communication
Machine learning | Software engineering
Multivariable calculus and linear algebra | Thinking like a data scientist

The website Master's in Data Science[9] summarizes the technical skills and tools that a data scientist needs to perform data analytics, as shown in Table 2.

Table 2. Technical skills and tools of a data scientist[9]
Technical skills and tools of a data scientist
Math (e.g., linear algebra, calculus, and probability)
Statistics (e.g., hypothesis testing and summary statistics)
Machine learning tools and techniques (e.g., k-nearest neighbors, random forests, ensemble methods, etc.)
Software engineering skills (e.g., distributed computing, algorithms, and data structures)
Data mining
Data cleaning and munging
Data visualization (e.g., ggplot and d3.js) and reporting techniques
Unstructured data techniques
R and/or SAS languages
SQL databases and database querying languages
Python (most common), C/C++, Java, Perl
Big data platforms like Hadoop, Hive, and Pig
Cloud tools like Amazon S3

References

  1. Hey, T.; Tansley, S.; Tolle, K. (2009). The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research. pp. 284. ISBN 9780982544204.
  2. Kempler, S. (12 March 2014). "How Do We Facilitate the Use of Large Amounts of Heterogeneous Data". Information Science and Technology Colloquium, Goddard Institutional Repository. NASA Goddard Space Flight Center. https://gsfcir.gsfc.nasa.gov/colloquia/5064/how-do-we-facilitate-the-use-of-large-amounts-of-heterogeneous-data-or-is-gleaning-knowledge-from-information-in-reach. 
  3. "Earth Science Data Analytics". Federation of Earth Science Information Partners. 14 December 2016. http://wiki.esipfed.org/index.php/Earth_Science_Data_Analytics. 
  4. American Library Association (10 January 1989). "Presidential Committee on Information Literacy: Final Report". Association of College & Research Libraries. http://www.ala.org/acrl/publications/whitepapers/presidential. 
  5. American Library Association (11 January 2016). "Framework for Information Literacy for Higher Education". Association of College & Research Libraries. http://www.ala.org/acrl/standards/ilframework. 
  6. Blackburn, F.; Sullivan, J.; Guerra, P. et al. (2015). "The Field Guide to Data Science". Booz Allen Hamilton, Inc. pp. 124. https://www.boozallen.com/content/dam/boozallen_site/sig/pdf/publications/2015-field-guide-to-data-science.pdf.
  7. NIST Big Data Public Working Group Definitions and Taxonomies Subgroup (September 2015). "NIST Big Data Interoperability Framework: Volume 1, Definitions" (PDF). National Institute of Standards and Technology. pp. 32. doi:10.6028/NIST.SP.1500-1. https://bigdatawg.nist.gov/_uploadfiles/NIST.SP.1500-1.pdf. 
  8. Holtz, D. (7 November 2014). "8 Skills You Need to Be a Data Scientist". Udacity. http://blog.udacity.com/2014/11/data-science-job-skills.html.
  9. "Complete Directory of Data Science Graduate Degrees". MastersInDataScience.org. http://www.mastersindatascience.org/schools/.

Notes

This presentation is faithful to the original, with only a few minor changes to presentation and grammar. In some cases important information was missing from the references, and that information was added. The original article had citations listed alphabetically; they are listed in the order they appear here due to the way the wiki works.