Journal:The development of data science: Implications for education, employment, research, and the data revolution for sustainable development

From LIMSWiki
Revision as of 16:36, 17 July 2018 by Shawndouglas (talk | contribs) (Saving and adding more.)
Full article title: The development of data science: Implications for education, employment, research, and the data revolution for sustainable development
Journal: Big Data and Cognitive Computing
Author(s): Murtagh, Fionn; Devlin, Keith
Author affiliation(s): University of Huddersfield, Stanford University
Primary contact: Email: fmurtagh at acm dot org
Year published: 2018
Volume and issue: 2(2)
Page(s): 14
DOI: 10.3390/bdcc2020014
ISSN: 2504-2289
Distribution license: Creative Commons Attribution 4.0 International
Website: http://www.mdpi.com/2504-2289/2/2/14/htm
Download: http://www.mdpi.com/2504-2289/2/2/14/pdf (PDF)

Abstract

In data science, we are concerned with the integration of relevant sciences in observed and empirical contexts. This results in the unification of analytical methodologies, and of observed and empirical data contexts. Given the dynamic nature of convergence, the origins and many evolutions of the data science theme are described. The following are covered in this article: the rapidly growing post-graduate university course provisioning for data science; a preliminary study of employability requirements; and how past eminent work in the social sciences and other areas, certainly mathematics, can be of immediate and direct relevance and benefit for innovative methodology, and for facing and addressing the ethical aspect of big data analytics, relating to data aggregation and scale effects. The direct and indirect outcomes and consequences of data science also include decision support and policy making, and both qualitative and quantitative outcomes. For such reasons, we note the importance of how data science builds collaboratively on other domains, potentially with innovative methodologies and practice. Further sections point towards some of the major current research issues.

Keywords: big data training and learning, company and business requirements, ethics, impact, decision support, data engineering, open data, smart homes, smart cities, IoT

1. Introduction: Data science as the convergence and bridging of disciplines

The context of our problem solving and analytics will always be quite fundamental, very specific, and particularly oriented. (Section 4 of this article draws some interesting and relevant implications of this.) This article is oriented towards commonality and mutual influence of methodologies, and of analytical processes and procedures. A nice example of the parallel nature of such things is how "big data analytics" is often considered a synonym of "data science." In Section 2.2, it is mentioned how public transport may well use smartphone and mobile phone wireless connection data to observe locations of individuals. This close association or, perhaps even, identity of big data analytics and data science will have growing importance with the internet of things (IoT), and smart cities and smart homes, and so on (as noted in Section 8). The McKinsey Global Institute provided an outstanding perspective on this idea in their paper The age of analytics: Competing in a data-driven world.[1]

Section 8 and Section 9 of this article address very important developments, encompassing newly oriented and pursued methodologies, and the integration of research domains. Section 7 notes how important all of the content here is to sustainable development. The phrase "data revolution" is based here on ongoing work by the United Nations, by many of us in this domain, and by national authorities in Africa and the Middle East, who discussed these issues at the most recent (2017) World Statistics Congress.

This converging and bridging of disciplines is increasingly important. For example, Mahabal et al.[2] discuss the parallels between astronomy and Earth science data, and methodology transfer, with metadata and ontologies characterized as crucial. They claim the convergence or bridging of disciplines must address “non-homogeneous observables, and varied spatial, temporal coverage at different resolutions.”[2] This quotation is very familiar to us in regard to how NoSQL databases are now widely used alongside traditional relational databases. Another example is how text mining, social media, and many other domains have become important in many contexts. Then, given computational support, “it is the complexity more than the data volume that proves to be a bigger challenge.”[2] Further benefits of this data science convergence are termed here "tractability" and "reproducibility." Mahabal et al.[2] also discuss the complexity relating to resolution and distributions. In a separate work, Murtagh[3] characterized this in terms of data encoding. Plenty of work now emphasizes the importance of p-adic data encoding (binary or ternary when p = 2 or 3), compared with real-valued encoding (m-adic, especially when m = 10).
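As a small illustration of the encoding contrast just mentioned, the following Python sketch writes a non-negative integer as a sequence of base-p digits (p = 2 or 3) and, for comparison, as base-10 digits. For a non-negative integer, the p-adic expansion coincides with its ordinary base-p digits, listed from the least significant digit upwards. The function names to_base_digits and from_base_digits are hypothetical, and this is only a sketch of positional digit encoding, not the analysis methodology of Murtagh[3].

```python
def to_base_digits(n: int, base: int) -> list[int]:
    """Digits of a non-negative integer n in the given base, least significant first."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if base < 2:
        raise ValueError("base must be at least 2")
    if n == 0:
        return [0]
    digits = []
    while n > 0:
        n, remainder = divmod(n, base)
        digits.append(remainder)
    return digits


def from_base_digits(digits: list[int], base: int) -> int:
    """Inverse of to_base_digits: rebuild the integer from its digits."""
    return sum(d * base**i for i, d in enumerate(digits))


# Illustrative value only (an assumption, not taken from the article).
value = 2018
binary = to_base_digits(value, 2)    # p = 2
ternary = to_base_digits(value, 3)   # p = 3
decimal = to_base_digits(value, 10)  # m = 10

print("binary :", binary)
print("ternary:", ternary)
print("decimal:", decimal)
```

The same value needs more digits in a smaller base, but each digit then takes fewer possible values; the round trip through from_base_digits recovers the original integer in every base.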

The convergence and bridging of disciplines is fully emphasized by Mahabal et al. as such[2]:

Methodology transfer can almost never be unidirectional. Diverse fields grow by learning tricks employed by other disciplines. The important thing is to abstract data—described by meaningful metadata—and the metadata in turn connected by a good ontology.

Further description is at issue in regard to collaboration in data science[2]:

We have described here a few techniques from astroinformatics that are finding use in geoinformatics. There would be many from earth science that space science would do well to emulate. Even other disciplines like bioinformatics provide ample opportunities for methodology transfer and collaboration. With growing data volumes, and more importantly the increasing complexity, data science is our only refuge. Collaboration in data science will be beneficial to all sciences.

2. Historical development of data science and some contemporary examples of cross-disciplinarity

The short historical perspective that follows refers to such disciplines as computer and information sciences, mathematics and statistics, physics, and, implicitly, social sciences. In concluding this description, a key point will be how data science encompasses and embraces all of the following: cross-disciplinarity, interdisciplinarity, and multidisciplinarity.

2.1 Historical prominence of data science in recent times

The origins of data science are largely due to Chikio Hayashi and others. Hayashi[4] says “I will present 'data science' as a new concept,” followed by a relevant introduction to the science of data: “Data Science consists of three phases: design for data, collection of data and analysis on data.”[4] The abstract of Ohsumi[5] states: “In 1992, the author argued the urgency of the need to grasp the concept 'data science'. Despite the emergence of concepts such as data mining, this issue has not been addressed.”

Escoufier et al.[6] note how data science arises from the convergence of computer science and statistics, which "gives birth to a new science at its core." They conclude that "[t]o take data as a starting point provides a complementary vision of theory and practice, and avoids creating an unfortunate gap between two steps, both of which are essential in any scientific process."[6]

Cao provides a comprehensive overview of data science[7], noting how the “first conference to adopt 'data science' as a topic” was the International Federation of Classification Societies (IFCS) 1996 conference, in Kobe, Japan. This was fully consistent with our work as participants, then and now (IFCS 2017, in Tokyo, Japan, also had "data science" in its title). Ueno[8] makes a similar point about IFCS 1996 as the first conference with "data science" in its title, and he also claims that the journal Behaviormetrika is "the oldest journal addressing the topic of data science," when it started in 1974. He describes data science as "an interdisciplinary field that includes the use of statistical methods to extract meaningful knowledge from data in various forms: either structured or unstructured."[8]

Cao[7] provides additional historical perspectives, with the section heading "The Data Science journey," relating largely to work in the 1960s and 1970s. This includes "information discovery" as a continuing key objective in data science. Englmeier and Murtagh[9] also make note of this objective, emphasizing the “semantic dimension of data science,” through the information discovery lifecycle, and the “discovery lifecycle in text mining.” While also emphasizing cooperation and cross-disciplinarity, they state the following: we see the data scientist’s responsibility...

  • in the design of an overarching semantic layer addressing data and analysis tools,
  • in identifying suitable data sources and data patterns that correspond to the appearance of structured and unstructured data, and
  • in the management of the information discovery lifecycle and discovery teams.

An ever more important issue arises from the data sources that are employed. As a summary expression, data science is, firstly, the integration of data sources and analytical and related data processing methodologies, and, secondly and quite fundamentally, a result of the convergence of disciplines. Convergence of disciplines can be quite beneficial in practice, particularly in regard to addressing and solving problems, and also in regard to the cooperation yielded by cross-disciplinarity. See Section 5, below, for some current discussion on how the problems and challenges to be addressed can and should arise, quite naturally, out of all aspects of data science.

The current era of data science can be considered as a culmination of previous epochs that gave rise to major digital technology advances, with implications in all social domains. Largely, the first epoch (in the 1980s) brought about laptop and desktop computers, and the second epoch (in the 1990s) gave rise to the internet and the World Wide Web.

2.2 Practical association of disciplines and sub-disciplines

Cao[7] also describes data science as being centered on the following disciplines: statistics, informatics, sociology, and management science. Clearly there is emphasis on “synergy of several research disciplines” and how “interdisciplinary initiatives are necessary to bridge the gaps between the respective disciplines.”[7] This is exciting, not least because of the convergence of disciplines or subdisciplines. We may consider, for example, how the digital humanities can incorporate relevant areas of a few disciplines, or how computational psychoanalysis can come to the fore.[10] With a major focus on psychometrics, Coombs[11] has chapters that proceed from “Basic Concepts” to “On Methods of Collecting Data,” and “Preferential Choice Data.”

Data is now central to all of our sciences, and to all aspects of our engineering and technology. Murtagh[3] defines just what data is, which includes the concept of data coding, perhaps better termed data encoding. After all, data is measurement. This underscores the importance of the mathematical underpinnings of data science. The implications that follow include relevance and importance for new, innovative directions to be followed, and for effective problem solving. The mathematical view of what measurement means is all-important in data science, as it is in the discipline of physics. Murtagh[3] cites eminent physicist Paul Dirac on how mathematics underpins all of physics, and the work of eminent psychoanalyst Ignacio Matte Blanco, in which mathematics is integral to psychoanalysis.

From a major study of big data and surveying by the American Association for Public Opinion Research comes the following[12]: “The classic statistical paradigm was one in which researchers formulated a hypothesis, identified a population frame, designed a survey and a sampling technique and then analyzed the results … The new paradigm means it is now possible to digitally capture, semantically reconcile, aggregate, and correlate data.”

Abbany[13] notes that wireless connection data is forming a basis for public transport management. Such big data sources can be associated with, or even integrated with, personal and social behavioral patterns and activities. “Better living through data?” asks Abbany, followed by a very critical statement: “The other thing I need to declare is that I’m no fan of our contemporary belief that life can only get better the more data we have at our disposal.”[13] A response to this would be that data science, as the science of data, is everything relating to the path and trajectory connecting data, information, knowledge, and wisdom.

Darabi[14] reports that “The UK’s next census will be its last,” with administrative, governmental authorities’ data replacing the national census. This is acknowledged: “Collecting the data itself is only half the work. A great deal of effort must go into combining it with other sources, in order to answer real questions.” That can be understood as undertaking scientific investigation of such data, and other potentially relevant data. The cross-disciplinarity inherent in that also can, and perhaps must, lead to new interdisciplinary linkages. Arising out of the ending of the national census is the recognition that the way the “government counts its people is changing, and it could transform policy.”

One issue here has been how mathematics underpins so much, across disciplines, and also in the commercial and in most social domains. Many universities in the recent past shut down their mathematics departments and no longer provide teaching in mathematics. However, this is being reversed, with university courses again being provided in mathematics.

3. Open data, reproducibility, and the data curation challenge

Open data is generally recognized as important for innovation, both in application outcomes and in analytics and methodologies, and it plays a key role for data scientists. (Information and news about open data are well provided by the Open Data Institute: https://theodi.org.)

One major aspect of how big data analytics are quite central to data science is the increasing availability of open data. Cao[7] associates this with methodology, through “the open model rather than a closed one.” This concept was central to a May 2017 London presentation by Dr. Robert Hanisch, Director, Office of Data and Informatics, National Institute of Standards and Technology. Dr. Hanisch worked for 30 years on the Hubble Space Telescope (HST) project. Dr. Hanisch noted that, due to open access to data observed from our cosmos, three times as many people were working on HST data as were directly engaged in HST work. As such, there were three times the benefits drawn from HST data.

Dr. Hanisch noted how important the national metrology institutes were to their efforts. Arising from this was, and is, the importance of reproducibility and interoperability of all the analytics comprising data science. Underpinning these very important themes in data science work is data curation. Data curation is still a major challenge to be addressed. Dr. Hanisch’s presentation noted the contemporary “crisis” of reproducibility. At issue is support for data management from acquisition to publication, whether it occurs in business, medical, governmental, or other sectors. The computing expert will recognize this crucial theme of data curation as associated with metadata and evolving ontologies.

For the latter, i.e., evolving ontologies, Murtagh et al.[15] discuss, in a broad and general context, the very important and central role of evolving ontology, research publishing, and research funding. While challenges remain to be pursued and addressed, it is important to note that astronomy and astrophysics offer interesting paradigms for open data, and, in many ways, for data curation. Certainly, further research will be carried out on data curation, as well as evolving and interacting ontologies, all of which are core issues for metrology, hence the very basis of all manufacturing technology, and, as described in the latter citation, for research publications and research funding.

Cao[7] also discusses “the open model and open data.” From that discussion results the concept of multidisciplinarity—expressed here as the convergence of disciplines—which can be aided and facilitated by the openness of analytics, data management, and all data science methodologies. Keeping methodologies open allows domain experts to both link up with and perhaps even, if feasible, to integrate with all that is at issue in other relevant domains. As such, a plea for openness of data science as a discipline continues to grow, particularly when viewed as a convergence of disciplines.

4. Integration of data and analytics: Context of applications

The integration of data and analytics in data science has resulted in a strong need to acknowledge and address challenges and other issues with data and the underpinning or contextual reality of the data. Informally expressed, our data represents reality or the context from which the measurements arose, i.e., the data numeric values or qualitative representations. This requires data scientists to focus on quality and standards of work.

Hand[16] contributes numerous important points relevant to the discussion here, describing the problems of data quality, in the big data context, relating to administrative data. He notes that data curation is relevant for reproducibility of analytics. He describes the implications for analytics as follows[16]:

... the fact that data are often not of the highest quality has led to the development of relevant statistical methods and tools, such as detection methods based on integrity checks and on statistical properties ... However, this emphasis has often not been matched within the realm of machine learning, which places more emphasis on the final modelling stage of data analysis. This can be unfortunate: feed data into an algorithm and a number will emerge, whether or not it makes sense. However, even within the statistical community, most teaching implicitly assumes perfect data ... Challenge 1. Statistics teaching should cover data quality issues.
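The kind of "integrity checks" mentioned in this quotation can be sketched in a few lines of Python. The following is a minimal illustration only, under stated assumptions: the record fields (id, age, region), the plausible age range, and the function name integrity_report are all hypothetical and do not come from Hand[16].

```python
from collections import Counter


def integrity_report(records, required, ranges):
    """Count simple data quality problems in a list of dict records.

    required: field names that must be present and non-empty.
    ranges:   mapping of field name to an inclusive (low, high) plausible range.
    """
    report = Counter()
    seen_ids = set()
    for rec in records:
        # Duplicate identifiers often signal double entry or a faulty merge.
        rec_id = rec.get("id")
        if rec_id in seen_ids:
            report["duplicate_id"] += 1
        seen_ids.add(rec_id)

        # Missing or empty required fields.
        for field in required:
            if rec.get(field) in (None, ""):
                report[f"missing_{field}"] += 1

        # Values outside a plausible range (missing values are skipped here,
        # since they are already counted above).
        for field, (low, high) in ranges.items():
            value = rec.get(field)
            if value not in (None, "") and not (low <= value <= high):
                report[f"out_of_range_{field}"] += 1
    return dict(report)


# Hypothetical administrative records, invented for illustration.
records = [
    {"id": 1, "age": 34, "region": "north"},
    {"id": 2, "age": -5, "region": "south"},
    {"id": 2, "age": 51, "region": ""},
    {"id": 3, "age": None, "region": "east"},
]

report = integrity_report(records, required=["age", "region"], ranges={"age": (0, 120)})
print(report)
```

Running such checks before any modelling step makes the quality of the input visible, rather than leaving an algorithm to produce a number "whether or not it makes sense."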

Our analytics should not be a “black box,” a term that was informally used in regard to neural networks in earlier times. Rather, transparency should always be a key property of analytical methodologies.

Consider the view offered by Anderson[17], and discussed by Murtagh[18], quoting Peter Norvig, Google’s research director[18]:

Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.

However, this interesting view, inspired by contemporary search engine technology, is provocative. Anderson maintains that "[c]orrelation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all."

It cannot be accepted that correlation supersedes causation, i.e., that analytics can be automated fully, and thereby obfuscate, or make redundant, data science as well as health and well-being analytics. Englmeier and Murtagh[19] reflect similarly on the above, stating their case for comprehensive information governance, encompassing fully the contextualization of all the analytics that are being carried out. Murtagh and Farid[20] discuss at length the contextualization of analytics of health and well-being data. In the discussion accompanying the seminal work by Allin and Hand[21] on statistical perspectives on health and well-being, the authors responded to our comments: “We agree with Murtagh that ‘big data’ may offer insights, provided that there are appropriate analytics.”
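A short simulation can illustrate why correlation alone is not enough. In the Python sketch below, two variables x and y have no causal link with each other, yet they are strongly correlated because both depend on a common factor z; once z is accounted for, the correlation all but vanishes. The data are synthetic, and the variable names, noise levels, and sample size are assumptions chosen for illustration, not material from the cited works.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# z is a common cause; x and y each depend on z plus independent noise.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 0.5) for zi in z]
y = [zi + rng.gauss(0, 0.5) for zi in z]

# Raw correlation between x and y: high, although neither causes the other.
corr_xy = pearson(x, y)

# Remove the known contribution of z from both variables (here the true
# coefficient is 1 by construction) and correlate what is left.
x_resid = [xi - zi for xi, zi in zip(x, z)]
y_resid = [yi - zi for yi, zi in zip(y, z)]
corr_given_z = pearson(x_resid, y_resid)

print(f"correlation of x and y:            {corr_xy:.2f}")
print(f"correlation after accounting for z: {corr_given_z:.2f}")
```

Without a model of the context, here the knowledge that z drives both variables, the strong raw correlation would invite a mistaken causal reading.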

References

  1. Henke, N.; Bughin, J.; Chui, M. et al. (December 2016). "The age of analytics: Competing in a data-driven world". McKinsey & Company. pp. 136. https://www.mckinsey.com/business-functions/mckinsey-analytics/our-insights/the-age-of-analytics-competing-in-a-data-driven-world. Retrieved 18 June 2018. 
  2. Mahabal, A.A.; Crichton, D.; Djorgovski, S.G. et al. (2017). "From Sky to Earth: Data Science Methodology Transfer". Proceedings of the International Astronomical Union: 1–10. doi:10.1017/S1743921317000060.
  3. Murtagh, F. (2017). Data Science Foundations: Geometry and Topology of Complex Hierarchic Systems and Big Data Analytics. CRC Press. pp. 206. ISBN 9781498763936.
  4. Hayashi, C. (1998). "What is Data Science? Fundamental concepts and a heuristic example". In Hayashi, C.; Yajima, K.; Bock, H.H. et al. Data Science, Classification, and Related Methods. Springer. pp. 40–51. ISBN 9784431702085.
  5. Ohsumi, N. (2000). "From data analysis to data science". In Kiers, H.A.L.; Rasson, J.-P.; Groenen, P.J.F. et al. Data Analysis, Classification, and Related Methods. Springer. pp. 329–34. ISBN 9783540675211.
  6. Escoufier, Y.; Fichet, B.; Lebart, L. et al., ed. (1995). Data Science and Its Applications. Academic Press.
  7. Cao, L. (2017). "Data Science: A Comprehensive Overview". ACM Computing Surveys 50 (3): 43. doi:10.1145/3076253.
  8. Ueno, M. (2017). "As the oldest journal of data science". Behaviormetrika 44 (1): 1–2. doi:10.1007/s41237-016-0011-7.
  9. Englmeier, K.; Murtagh, F. (2017). "Data scientist - Manager of the discovery lifecycle". Proceedings of the 6th International Conference on Data Science, Technology and Applications: 133–140. doi:10.5220/0006393801330140. 
  10. Murtagh, F. (2017). "Chapter 8: Geometry and Topology of Matte Blanco's Bi-Logic in Psychoanalytics". Data Science Foundations: Geometry and Topology of Complex Hierarchic Systems and Big Data Analytics. CRC Press. pp. 147–62. ISBN 9781498763936. 
  11. Coombs, C.H. (1964). A Theory of Data. Wiley. 
  12. Japec, L.; Kreuter, F.; Berg, M. et al. (12 February 2015). "AAPOR Report: Big Data". AAPOR. https://www.aapor.org/Education-Resources/Reports/Big-Data.aspx. Retrieved 18 June 2018.
  13. Abbany, Z. (27 November 2017). "A public transport model built on open data". DW. Deutsche Welle. https://www.dw.com/en/a-public-transport-model-built-on-open-data/a-41546053. Retrieved 27 November 2017.
  14. Darabi, A. (5 December 2017). "The UK’s next census will be its last—here’s why". Apolitical. Apolitical Group Limited. https://apolitical.co/solution_article/uks-next-census-will-last-heres/. 
  15. Murtagh, F.; Orlov, M.; Mirkin, B. (2018). "Qualitative Judgement of Research Impact: Domain Taxonomy as a Fundamental Framework for Judgement of the Quality of Research". Journal of Classification 35 (1): 5–28. doi:10.1007/s00357-018-9247-0. 
  16. Hand, D.J. (2018). "Statistical challenges of administrative and transaction data". Statistics in Society Series A 181 (3): 555–605. doi:10.1111/rssa.12315.
  17. Anderson, C. (23 June 2008). "The End of Theory: The Data Deluge Makes The Scientific Method Obsolete". Wired. Condé Nast. https://www.wired.com/2008/06/pb-theory/. 
  18. Murtagh, F. (2008). "Origins of Modern Data Analysis Linked to the Beginnings and Early Development of Computer Science and Information Engineering". Electronic Journal for History of Probability and Statistics 4 (2): 1–26. https://arxiv.org/abs/0811.2519.
  19. Englmeier, K.; Murtagh, F. (2017). "Editorial: What Can We Expect from Data Scientists?". Journal of Theoretical and Applied Electronic Commerce Research 12 (1): 1–5. doi:10.4067/S0718-18762017000100001. 
  20. Murtagh, F.; Farid, M. (2017). "Contextualizing Geometric Data Analysis and Related Data Analytics: A Virtual Microscope for Big Data Analytics". Journal of Interdisciplinary Methodologies and Issues in Sciences 3 (Digital Contextualization): 1–19. doi:10.18713/JIMIS-010917-3-1. 
  21. Allin, P.; Hand, D.J. (2016). "New statistics for old?—Measuring the wellbeing of the UK". Statistics in Society Series A 180 (1): 3–43. doi:10.1111/rssa.12188. 

Notes

This presentation is faithful to the original, with only a few minor changes to grammar, spelling, and presentation, including the addition of PMCID and DOI when they were missing from the original reference. The original inline citation method was unorthodox; these inline citations have been made clearer with the addition of the author of the citation. This often required sentences containing inline citations to be reconstructed. Several vanity statements and irrelevant comments were removed for improved readability.