Journal:The development of data science: Implications for education, employment, research, and the data revolution for sustainable development

From LIMSWiki
Full article title: The development of data science: Implications for education, employment, research, and the data revolution for sustainable development
Journal: Big Data and Cognitive Computing
Author(s): Murtagh, Fionn; Devlin, Keith
Author affiliation(s): University of Huddersfield, Stanford University
Primary contact: Email: fmurtagh at acm dot org
Year published: 2018
Volume and issue: 2(2)
Page(s): 14
DOI: 10.3390/bdcc2020014
ISSN: 2504-2289
Distribution license: Creative Commons Attribution 4.0 International
Website: http://www.mdpi.com/2504-2289/2/2/14/htm
Download: http://www.mdpi.com/2504-2289/2/2/14/pdf (PDF)

Abstract

In data science, we are concerned with the integration of relevant sciences in observed and empirical contexts. This results in the unification of analytical methodologies, and of observed and empirical data contexts. Given the dynamic nature of convergence, the origins and many evolutions of the data science theme are described. The following are covered in this article: the rapidly growing post-graduate university course provisioning for data science; a preliminary study of employability requirements; and how past eminent work in the social sciences and other areas, certainly mathematics, can be of immediate and direct relevance and benefit for innovative methodology, and for facing and addressing the ethical aspect of big data analytics, relating to data aggregation and scale effects. The direct and indirect outcomes and consequences of data science also include decision support and policy making, with both qualitative and quantitative outcomes. For these reasons, we note the importance of how data science builds collaboratively on other domains, potentially with innovative methodologies and practice. Further sections point towards some of the major current research issues.

Keywords: big data training and learning, company and business requirements, ethics, impact, decision support, data engineering, open data, smart homes, smart cities, IoT

1. Introduction: Data science as the convergence and bridging of disciplines

The context of our problem solving and analytics will always be quite fundamental, very specific, and particularly oriented. (Section 4 of this article draws some interesting and relevant implications of this.) This article is oriented towards the commonality and mutual influence of methodologies, and of analytical processes and procedures. A good example of this parallelism is how "big data analytics" is often considered a synonym of "data science." Section 2.2 mentions how public transport may well use smartphone and mobile phone wireless connection data to observe the locations of individuals. This close association, or perhaps even identity, of big data analytics and data science will have growing importance with the internet of things (IoT), smart cities, smart homes, and so on (as noted in Section 8). The McKinsey Global Institute provided an outstanding perspective on this idea in their paper "The age of analytics: Competing in a data-driven world."[1]

In Section 8 and Section 9 of this article, very important developments are at issue, encompassing newly oriented and pursued methodologies, and the integration of research domains. Section 7 notes how important all of the content here is to sustainable development. The phrase "data revolution" is based here on ongoing work by the United Nations, by so many of us in this domain, and by national authorities in Africa and the Middle East, who discussed these issues at the most recent (2017) World Statistics Congress.

This converging and bridging of disciplines is increasingly important. For example, Mahabal et al.[2] discuss the parallels between astronomy and Earth science data, methodology transfer, and metadata and ontologies, which they characterize as crucial. They claim the convergence or bridging of disciplines must address “non-homogeneous observables, and varied spatial, temporal coverage at different resolutions.”[2] This requirement is very familiar to us, given how NoSQL databases are now widely used alongside traditional relational databases. Another example is how text mining, social media, and many other domains have become so important in many contexts. Then, given computational support, “it is the complexity more than the data volume that proves to be a bigger challenge.”[2] Further benefits of this data science convergence are termed here "tractability" and "reproducibility." Mahabal et al.[2] also discuss the complexity relating to resolution and distributions. In a separate work, Murtagh[3] characterized this in terms of data encoding. Plenty of work now emphasizes the importance of p-adic data encoding (binary or ternary when p = 2 or 3), compared with real-valued encoding (m-adic, especially when m = 10).
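To make the contrast between binary, ternary, and decimal encoding concrete, the following is a minimal sketch, assuming only non-negative integers. It expands one value into its base-p digits for p = 2 and p = 3, and into base 10 for comparison. The function name `to_base_digits` and the sample value are hypothetical illustrations and are not taken from the cited works, which treat p-adic encoding far more generally.

```python
def to_base_digits(n: int, base: int) -> list[int]:
    """Return the digits of a non-negative integer in the given base,
    least significant digit first (hypothetical helper for illustration)."""
    if base < 2:
        raise ValueError("base must be at least 2")
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return [0]
    digits = []
    while n > 0:
        # divmod yields the quotient and the remainder; the remainder
        # is the next digit of the expansion.
        n, remainder = divmod(n, base)
        digits.append(remainder)
    return digits


value = 11  # sample value, chosen arbitrarily
binary = to_base_digits(value, 2)    # 11 = 1*1 + 1*2 + 0*4 + 1*8
ternary = to_base_digits(value, 3)   # 11 = 2*1 + 0*3 + 1*9
decimal = to_base_digits(value, 10)  # 11 = 1*1 + 1*10
print(binary, ternary, decimal)
```

The same quantity is thus carried by three different digit sequences, which is the sense in which the choice of encoding, and not only the value itself, shapes the data being analyzed.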

The convergence and bridging of disciplines is fully emphasized by Mahabal et al. as such[2]:

Methodology transfer can almost never be unidirectional. Diverse fields grow by learning tricks employed by other disciplines. The important thing is to abstract data—described by meaningful metadata—and the metadata in turn connected by a good ontology.

Further description is at issue in regard to collaboration in data science[2]:

We have described here a few techniques from astroinformatics that are finding use in geoinformatics. There would be many from earth science that space science would do well to emulate. Even other disciplines like bioinformatics provide ample opportunities for methodology transfer and collaboration. With growing data volumes, and more importantly the increasing complexity, data science is our only refuge. Collaboration in data science will be beneficial to all sciences.

2. Historical development of data science and some contemporary examples of cross-disciplinarity

The short historical perspective that follows refers to such disciplines as computer and information sciences, mathematics and statistics, physics, and, implicitly, social sciences. In concluding this description, a key point will be how data science encompasses and embraces all of the following: cross-disciplinarity, interdisciplinarity, and multidisciplinarity.

2.1 Historical prominence of data science in recent times

The origins of data science are largely due to Chikio Hayashi and others. Hayashi[4] says “I will present 'data science' as a new concept,” followed by a relevant introduction to the science of data: “Data Science consists of three phases: design for data, collection of data and analysis on data.”[4] In Ohsumi[5], the abstract has this: “In 1992, the author argued the urgency of the need to grasp the concept 'data science'. Despite the emergence of concepts such as data mining, this issue has not been addressed.”

Escoufier et al.[6] note how data science arises from the convergence of computer science and statistics, which "gives birth to a new science at its core." They conclude that "[t]o take data as a starting point provides a complementary vision of theory and practice, and avoids creating an unfortunate gap between two steps, both of which are essential in any scientific process."[6]

Cao provides a comprehensive overview of data science[7], noting how the “first conference to adopt 'data science' as a topic” was the International Federation of Classification Societies (IFCS) 1996 conference, in Kobe, Japan. This was fully consistent with our work as participants, then and now (IFCS 2017, in Tokyo, Japan, also had "data science" in its title). Ueno[8] makes a similar point about IFCS 1996 as the first conference with "data science" in its title, and he also claims that the journal Behaviormetrika, which started in 1974, is "the oldest journal addressing the topic of data science." He describes data science as "an interdisciplinary field that includes the use of statistical methods to extract meaningful knowledge from data in various forms: either structured or unstructured."[8]

Cao[7] provides additional historical perspectives, under the section heading "The Data Science journey," relating largely to work in the 1960s and 1970s. This includes "information discovery" as a continuing key objective in data science. Englmeier and Murtagh[9] also make note of this objective, emphasizing the “semantic dimension of data science,” through the information discovery lifecycle, and the “discovery lifecycle in text mining.” While also emphasizing cooperation and cross-disciplinarity, they state: we see the data scientist’s responsibility...

  • in the design of an overarching semantic layer addressing data and analysis tools,
  • in identifying suitable data sources and data patterns that correspond to the appearance of structured and unstructured data, and
  • in the management of the information discovery lifecycle and discovery teams.

An ever more important issue arises from the data sources that are employed. In summary, data science is, firstly, the integration of data sources with analytical and related data processing methodologies and, secondly and quite fundamentally, a product of the convergence of disciplines. Convergence of disciplines can be quite beneficial in practice, particularly in regard to addressing and solving problems, and also in regard to the cooperation yielded by cross-disciplinarity. See Section 5, below, for some current discussion on how the problems and challenges to be addressed can and should arise, quite naturally, out of all aspects of data science.

The current era of data science can be considered as a culmination of previous epochs that gave rise to major digital technology advances, with implications in all social domains. Largely, the first epoch (in the 1980s) brought about laptop and desktop computers, and the second epoch (in the 1990s) gave rise to the internet and the World Wide Web.

2.2 Practical association of disciplines and sub-disciplines

Hayashi[4] also mentions that data science is centered on the following disciplines: statistics, informatics, sociology, and management science. Clearly there is emphasis on “synergy of several research disciplines” and how “interdisciplinary initiatives are necessary to bridge the gaps between the respective disciplines.”[4] This is exciting, not least because of the resulting convergence of disciplines or subdisciplines. We may consider, for example, how the digital humanities can incorporate relevant areas of a few disciplines, or how computational psychoanalysis can come to the fore.[10] With a major focus on psychometrics, Coombs[11] has chapters that proceed from “Basic Concepts” to “On Methods of Collecting Data,” and “Preferential Choice Data.”

Now, data is central to all of our sciences, and to all aspects of our engineering and technology. Murtagh[3] defines just what data is, which includes the concept of data coding, perhaps better termed data encoding. After all, data is measurement. This underscores the importance of the mathematical underpinnings of data science. It follows that these underpinnings are relevant and important both for pursuing new, innovative directions and for effective problem solving. The mathematical view of what measurement means is all-important, as it is in the discipline of physics. Murtagh[3] cites eminent physicist Paul Dirac on how mathematics underpins all of physics, and the work of eminent psychoanalyst Ignacio Matte Blanco on how mathematics is integral to psychoanalysis.

From a major study of big data and surveying by the American Association for Public Opinion Research comes the following[12]: “The classic statistical paradigm was one in which researchers formulated a hypothesis, identified a population frame, designed a survey and a sampling technique and then analyzed the results … The new paradigm means it is now possible to digitally capture, semantically reconcile, aggregate, and correlate data.”

References

  1. Henke, N.; Bughin, J.; Chui, M. et al. (December 2016). "The age of analytics: Competing in a data-driven world". McKinsey & Company. pp. 136. https://www.mckinsey.com/business-functions/mckinsey-analytics/our-insights/the-age-of-analytics-competing-in-a-data-driven-world. Retrieved 18 June 2018.
  2. Mahabal, A.A.; Crichton, D.; Djorgovski, S.G. et al. (2017). "From Sky to Earth: Data Science Methodology Transfer". Proceedings of the International Astronomical Union: 1–10. doi:10.1017/S1743921317000060.
  3. Murtagh, F. (2017). Data Science Foundations: Geometry and Topology of Complex Hierarchic Systems and Big Data Analytics. CRC Press. pp. 206. ISBN 9781498763936.
  4. Hayashi, C. (1998). "What is Data Science? Fundamental concepts and a heuristic example". In Hayashi, C.; Yajima, K.; Bock, H.H. et al. Data Science, Classification, and Related Methods. Springer. pp. 40–51. ISBN 9784431702085.
  5. Ohsumi, N. (2000). "From data analysis to data science". In Kiers, H.A.L.; Rasson, J.-P.; Groenen, P.J.F. et al. Data Science, Classification, and Related Methods. Springer. pp. 329–34. ISBN 9783540675211.
  6. Escoufier, Y.; Fichet, B.; Lebart, L. et al., ed. (1995). Data Science and Its Applications. Academic Press.
  7. Cao, L. (2017). "Data Science: A Comprehensive Overview". ACM Computing Surveys 50 (3): 43. doi:10.1145/3076253.
  8. Ueno, M. (2017). "As the oldest journal of data science". Behaviormetrika 44 (1): 1–2. doi:10.1007/s41237-016-0011-7.
  9. Englmeier, K.; Murtagh, F. (2017). "Data scientist - Manager of the discovery lifecycle". Proceedings of the 6th International Conference on Data Science, Technology and Applications: 133–140. doi:10.5220/0006393801330140.
  10. Murtagh, F. (2017). "Chapter 8: Geometry and Topology of Matte Blanco's Bi-Logic in Psychoanalytics". Data Science Foundations: Geometry and Topology of Complex Hierarchic Systems and Big Data Analytics. CRC Press. pp. 147–62. ISBN 9781498763936.
  11. Coombs, C.H. (1964). A Theory of Data. Wiley.
  12. Japec, L.; Kreuter, F.; Berg, M. et al. (12 February 2015). "AAPOR Report: Big Data". AAPOR. https://www.aapor.org/Education-Resources/Reports/Big-Data.aspx. Retrieved 18 June 2018.

Notes

This presentation is faithful to the original, with only a few minor changes to grammar, spelling, and presentation, including the addition of PMCID and DOI when they were missing from the original reference. The original inline citation method was unorthodox; these inline citations have been made clearer with the addition of the author of the citation.