Journal:A view of programming scalable data analysis: From clouds to exascale

Full article title A view of programming scalable data analysis: From clouds to exascale
Journal Journal of Cloud Computing
Author(s) Talia, Domenico
Author affiliation(s) DIMES at Università della Calabria
Primary contact Email: talia at dimes dot unical dot it
Year published 2019
Volume and issue 8
Page(s) 4
DOI 10.1186/s13677-019-0127-x
ISSN 2192-113X
Distribution license Creative Commons Attribution 4.0 International
Website https://link.springer.com/article/10.1186/s13677-019-0127-x
Download https://link.springer.com/content/pdf/10.1186%2Fs13677-019-0127-x.pdf (PDF)

Abstract

Scalability is a key feature for big data analysis and machine learning frameworks and for applications that need to analyze very large and real-time data available from data repositories, social media, sensor networks, smartphones, and the internet. Scalable big data analysis today can be achieved by parallel implementations that are able to exploit the computing and storage facilities of high-performance computing (HPC) systems and cloud computing systems, whereas in the near future exascale systems will be used to implement extreme-scale data analysis. This article discusses how cloud computing currently supports the development of scalable data mining solutions and outlines the main challenges that must be addressed and solved to implement innovative data analysis applications on exascale systems.

Keywords: big data analysis, cloud computing, exascale computing, data mining, parallel programming, scalability

Introduction

Solving problems in science and engineering was the first motivation for inventing computers. Many decades later, computer science remains a main area in which innovative solutions and technologies are developed and applied. Due also to the extraordinary advancement of computer technology, data are now generated as never before, and the amount of structured and unstructured digital data is set to grow beyond any estimate. Databases, file systems, data streams, social media, and data repositories are increasingly pervasive and decentralized.

As the data scale increases, we must address new challenges and attack ever-larger problems. New discoveries will be achieved and more accurate investigations can be carried out due to the increasingly widespread availability of large amounts of data. Scientific sectors that fail to make full use of the volume of digital data available today risk losing out on the significant opportunities that big data can offer.

To benefit from big data availability, specialists and researchers need advanced data analysis tools and applications running on scalable architectures allowing for the extraction of useful knowledge from such huge data sources. High-performance computing (HPC) systems and cloud computing systems today are capable platforms for addressing both the computational and data storage needs of big data mining and parallel knowledge discovery applications. These computing architectures are needed to run data analysis because complex data mining tasks involve data- and compute-intensive algorithms that require large, reliable, and effective storage facilities together with high-performance processors to obtain results in a timely fashion.

Now that data sources have become pervasively huge, reliable and effective programming tools and applications for data analysis are needed to extract value and find useful insights in them. New ways to correctly and proficiently compose different distributed models and paradigms are required, and the interaction between hardware resources and programming levels must be addressed. Users, professionals, and scientists working in the area of big data need advanced data analysis programming models and tools coupled with scalable architectures to support the extraction of useful information from such massive repositories. The scalability of a parallel computing system is a measure of its capacity to reduce program execution time in proportion to the number of its processing elements. (The appendix of this article introduces and discusses scalability in parallel systems in detail.) According to this definition, scalable data analysis refers to the ability of a hardware/software parallel system to exploit increasing computing resources effectively in the analysis of (very) large datasets.
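
In standard terms, this notion can be made concrete with two measures: the speedup obtained on p processing elements is S(p) = T(1)/T(p), where T(p) is the execution time on p elements, and the parallel efficiency is E(p) = S(p)/p. A data analysis system scales well when E(p) remains close to 1 as p (and, for weak scaling, the dataset size) grows.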

Today, complex analysis of real-world massive data sources requires using high-performance computing systems such as massively parallel machines or clouds. However, in the coming years, as parallel technologies advance, exascale computing systems will be exploited for implementing scalable big data analysis in all areas of science and engineering.[1] To reach this goal, new design and programming challenges must be addressed and solved. As such, the focus of this paper is on discussing current cloud-based design and programming solutions for data analysis and suggesting new programming requirements and approaches to be conceived for meeting big data analysis challenges on future exascale platforms.

Current cloud computing platforms and parallel computing systems represent two different technological solutions for addressing the computational and data storage needs of big data mining and parallel knowledge discovery applications. Indeed, parallel machines offer high-end processors with the main goal of supporting HPC applications, whereas cloud systems implement a computing model in which dynamically scalable virtualized resources are provided to users and developers as a service over the internet. In fact, clouds do not mainly target HPC applications; they represent scalable computing and storage delivery platforms that can be adapted to the needs of different classes of people and organizations by exploiting a service-oriented architecture (SOA) approach. Clouds offer large-scale facilities to the many users who could not otherwise own their own parallel/distributed computing systems to run applications and services. In particular, big data analysis applications that require access to and manipulation of very large datasets with complex mining algorithms will significantly benefit from the use of cloud platforms.

Although not many cloud-based data analysis frameworks are available today for end users, within a few years they will become common.[2] Some current solutions are based on open-source systems, such as Apache Hadoop and Mahout, Spark, and SciDB, while others are proprietary solutions provided by companies such as Google, Microsoft, EMC, Amazon, BigML, Splunk Hunk, and InsightsOne. As more such platforms emerge, researchers and professionals will port increasingly powerful data mining programming tools and frameworks to the cloud to exploit complex and flexible software models such as the distributed workflow paradigm. The growing utilization of the service-oriented computing model could accelerate this trend.

From the definition of the term "big data," which refers to datasets so large and complex that traditional hardware and software data processing solutions are inadequate to manage and analyze them, we can infer that conventional computer systems are not powerful enough to process and mine big data[3], and they are not able to scale with the size of the problems to be solved. As mentioned before, to cope with the limits of sequential machines, advanced systems such as HPC, cloud computing, and even more scalable architectures are used today to analyze big data. Starting from this scenario, exascale computing systems will represent the next computing step.[4][5] "Exascale systems" refers to high-performance computing systems capable of at least one exaFLOPS, so their implementation represents a significant research and technology challenge. Their design and development is currently under investigation with the goal of building, by 2020, high-performance computers composed of a very large number of multi-core processors expected to deliver a performance of 10^18 operations per second. Cloud computing systems used today are able to store very large amounts of data; however, they do not provide the high performance expected from massively parallel exascale systems. This is the main motivation for developing exascale systems, which will represent the most advanced model of supercomputers. They have been conceived for single-site supercomputing centers, not for distributed infrastructures; such infrastructures could instead use multi-clouds or fog computing systems to decentralize computing and pervasive data management, and later be interconnected with exascale systems used as a backbone for very large scale data analysis.

The development of exascale systems spurs a need to address and solve issues and challenges at both the hardware and software level. Indeed, it requires the design and implementation of novel software tools and runtime systems able to manage a high degree of parallelism, reliability, and data locality in extreme-scale computers.[6] New programming constructs and runtime mechanisms are needed that can select the most appropriate parallelism degree and communication decomposition to make data analysis tasks scalable and reliable. Their dependence on parallelism grain size and on data analysis task decomposition must be studied in depth, because parallelism exploitation depends on several features such as parallel operations, communication overhead, input data size, I/O speed, problem size, and hardware configuration. Moreover, reliability and reproducibility are two additional key challenges to be addressed. At the programming level, constructs for handling and recovering from communication, data access, and computing failures must be designed. At the same time, reproducibility in scalable data analysis requires rich information that can help assure similar results in environments that may change dynamically. All these factors must be taken into account in designing data analysis applications and tools that will be scalable on exascale systems.

Moreover, reliable and effective methods for storing, accessing, and communicating data; intelligent techniques for massive data analysis; and software architectures enabling the scalable extraction of knowledge from data are needed.[3] To reach this goal, models and technologies enabling cloud computing systems and HPC architectures must be extended, adapted, or completely redesigned to be reliable and scalable on the very large number of processors/cores that compose extreme-scale platforms and to support the implementation of clever data analysis algorithms that are scalable and dynamic in resource usage. Exascale computing infrastructures will play the role of an extraordinary platform for addressing both the computational and data storage needs of big data analysis applications. However, as mentioned before, to complete this scenario, effort must also be devoted to implementing big data analytics algorithms, architectures, programming tools, and applications on exascale systems.[7]

If this objective is pursued, within a few years scalable data access and analysis systems will become the most used platforms for big data analytics on large-scale clouds. In the longer term, new exascale computing infrastructures will appear as viable platforms for big data analytics, and data mining algorithms, tools, and applications will be ported to such platforms to implement extreme data discovery solutions.

In this paper we first discuss cloud-based scalable data mining and machine learning solutions, and then examine the main research issues that must be addressed to implement massively parallel data mining applications on exascale computing systems. Data-related issues are discussed together with communication, multi-processing, and programming issues. We then introduce issues and systems for scalable data analysis on clouds, followed by design and programming issues for big data analysis in exascale systems. We close by outlining some open design challenges.

Data analysis on cloud computing platforms

Cloud computing platforms implement elastic services, scalable performance, and scalable data storage used by a large and ever-increasing number of users and applications.[8][9] In fact, cloud platforms have enlarged the arena of distributed computing systems by providing advanced internet services that complement and complete the functionalities of distributed computing provided by the internet, grid systems, and peer-to-peer networks. In particular, most cloud computing applications use big data repositories stored within the cloud itself, so in those cases large datasets are analyzed with low latency to effectively extract data analysis models.

"Big data" is a new and overused term that refers to massive, heterogeneous, and often unstructured digital content that is difficult to process using traditional data management tools and techniques. The term includes the complexity and variety of data and data types, real-time data collection and processing needs, and the value that can be obtained by smart analytics. However, we should recognize that data are not necessarily important per se but they become very important if we are able to extract value from them—that is if we can exploit them to make discoveries. The extraction of useful knowledge from big digital datasets requires smart and scalable analytics algorithms, services, programming tools, and applications. All these tools require insights into big data to make them more useful for people.

The growing use of service-oriented computing is accelerating the use of cloud-based systems for scalable big data analysis. Developers and researchers are adopting the three main cloud models—software as a service (SaaS), platform as a service (PaaS), and infrastructure as a service (IaaS)—to implement big data analytics solutions in the cloud.[10][11] According to a specialization of these three models, data analysis tasks and applications can be offered as services at the software, platform, or infrastructure level and made available at any time from anywhere. The methodology for implementing them defines a new model stack for delivering data analysis solutions, a specialization of the XaaS (everything as a service) stack called "data analysis as a service" (DAaaS). It adapts and specializes the three general service models (SaaS, PaaS, and IaaS) to support the structured development of big data analysis systems, tools, and applications according to a service-oriented approach. The DAaaS methodology is thus based on three basic models for delivering data analysis services at different levels, as described here (see also Fig. 1):

  • Data analysis infrastructure as a service (DAIaaS): This model provides a set of hardware/software virtualized resources that developers can assemble and use as an integrated infrastructure for storing large datasets, running data mining applications, and/or implementing data analytics systems from scratch;
  • Data analysis platform as a service (DAPaaS): This model defines a supporting software platform that developers can use for programming and running their data analytics applications, or for extending existing ones, without worrying about the underlying infrastructure or specific distributed architecture issues; and
  • Data analysis software as a service (DASaaS): This is a higher-level model that offers end users data mining algorithms, data analysis suites, or ready-to-use knowledge discovery applications as internet services that can be accessed and used directly through a web browser. According to this approach, all data analysis software is provided as a service, so that end users do not have to worry about implementation and execution details.


Fig1 Talia JOfCloudComp2019 8.png

Figure 1. The three models of the DAaaS software methodology. The DAaaS software methodology is based on three basic models for delivering data analysis services at different levels (application, platform, and infrastructure). The DAaaS methodology defines a new model stack to deliver data analysis solutions that is a specialization of the XaaS (everything as a service) stack and is called "data analysis as a service" (DAaaS). It adapts and specializes the three general service models (SaaS, PaaS, and IaaS) to support the structured development of big data analysis systems, tools, and applications according to a service-oriented approach.

Cloud-based data analysis tools

Using the DASaaS methodology, we designed a cloud-based system, the Data Mining Cloud Framework (DMCF),[12] which supports three main classes of data analysis and knowledge discovery applications:

  • Single-task applications, in which a single data mining task such as classification, clustering, or association rules discovery is performed on a given dataset;
  • Parameter-sweeping applications, in which a dataset is analyzed by multiple instances of the same data mining algorithm with different parameters; and
  • Workflow-based applications, in which knowledge discovery applications are specified as graphs linking together data sources, data mining tools, and data mining models.

DMCF includes a large variety of processing patterns to express knowledge discovery workflows as graphs whose nodes denote resources (datasets, data analysis tools, mining models) and whose edges denote dependencies among resources. A web-based user interface allows users to compose their applications and submit them for execution to the cloud platform, following the data analysis software as a service approach. Visual workflows can be programmed in DMCF through a language called VL4Cloud (Visual Language for Cloud), whereas script-based workflows can be programmed by JS4Cloud (JavaScript for Cloud), a JavaScript-based language for data analysis programming.

Figure 2 shows a sample data mining workflow composed of several sequential and parallel steps. It is just an example for presenting the main features of the VL4Cloud programming interface.[12] The example workflow analyzes a dataset by using n instances of a classification algorithm, which work on n portions of the training set and generate the same number of knowledge models. Using the n generated models and the test set, n classifiers produce in parallel n classified datasets (n classifications). In the final step of the workflow, a voter generates the final classification by assigning to each data item the class predicted by the majority of the models.
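
The final voting step can be sketched in a few lines of Python (an illustrative sketch only; the function name and data layout are hypothetical and are not part of DMCF or VL4Cloud):

  from collections import Counter

  def majority_vote(predictions):
      """Combine per-model predictions into a final classification.
      predictions[i][j] is the class label assigned to test item j by model i."""
      n_items = len(predictions[0])
      final = []
      for j in range(n_items):
          votes = [model_preds[j] for model_preds in predictions]
          final.append(Counter(votes).most_common(1)[0][0])  # majority class
      return final

  # Example: three models classifying four test items
  preds = [["a", "b", "a", "c"],
           ["a", "b", "b", "c"],
           ["b", "b", "a", "c"]]
  print(majority_vote(preds))  # ['a', 'b', 'a', 'c']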


Fig2 Talia JOfCloudComp2019 8.png

Figure 2. A parallel classification workflow designed with the VL4Cloud programming interface. The figure shows a workflow designed with the VL4Cloud programming interface during its execution. The workflow implements a parallel classification application. Tasks/services included in square brackets are executed in parallel. The results produced by the classifiers are selected by a voter task that produces the final classification.

Although DMCF has been designed mainly to coordinate coarse-grain data and task parallelism in big data analysis applications by exploiting the workflow paradigm, the DMCF script-based programming interface (JS4Cloud) also allows fine-grain operations in data mining algorithms to be parallelized, since it permits programming any data mining algorithm, such as classification, clustering, and others, in a JavaScript style. This is possible because loops and data-parallel methods are run in parallel on the virtual machines of a cloud.[13][14]

Like DMCF, other innovative cloud-based systems designed for programming data analysis applications include Apache Spark, Sphere, Swift, Mahout, and CloudFlows. Most of them are open source. Apache Spark is an open-source framework developed at the University of California, Berkeley for in-memory data analysis and machine learning.[5] Spark has been designed to run both batch processing and dynamic applications like streaming, interactive queries, and graph analysis. Spark provides developers with a programming interface centered on a data structure called the "resilient distributed dataset" (RDD), which represents a read-only multi-set of data items distributed over a cluster of machines and maintained in a fault-tolerant way. Unlike other systems, including Hadoop, Spark stores data in memory and queries it repeatedly so as to obtain better performance. This feature can be useful for a future implementation of Spark on exascale systems.
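
As a minimal illustration of the RDD abstraction (a sketch assuming a local PySpark installation; the dataset and operations are arbitrary examples), an RDD can be cached in memory, transformed lazily, and reduced by an action:

  from pyspark import SparkContext

  sc = SparkContext(appName="rdd-sketch")         # local Spark driver
  nums = sc.parallelize(range(1000000)).cache()   # RDD kept in memory for reuse
  squares = nums.map(lambda x: x * x)             # lazy transformation
  even_sum = squares.filter(lambda x: x % 2 == 0).reduce(lambda a, b: a + b)  # action
  print(even_sum)
  sc.stop()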

Swift is a workflow-based framework for implementing functional data-driven task parallelism in data-intensive applications. The Swift language provides a functional programming paradigm where workflows are designed as a set of calls with associated command-line arguments and input and output files. Swift uses implicit data-driven task parallelism.[15] In fact, it looks like a sequential language, but being a dataflow language, all variables are futures; thus execution is based on data availability. Parallelism can also be exploited through the use of the foreach statement. Swift/T is a new implementation of the Swift language for high-performance computing. In this implementation, a Swift program is translated into an MPI program that uses the Turbine and ADLB runtime libraries for scalable dataflow processing over MPI. Recently, a porting of Swift/T to large cloud systems for the execution of numerous tasks has been investigated.
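
This futures-based, data-driven style can be approximated in plain Python with the standard concurrent.futures module (an analogue sketch only, not Swift code; the task names and data are hypothetical): independent tasks run in parallel as soon as they are submitted, and downstream work proceeds when its input futures become available:

  from concurrent.futures import ThreadPoolExecutor

  def load(partition):       # stand-in for a data-loading task
      return list(range(10))

  def analyze(data):         # stand-in for an analysis task
      return sum(data)

  with ThreadPoolExecutor(max_workers=4) as pool:
      # Each submission returns a future; the three loads run in parallel.
      loads = [pool.submit(load, p) for p in ("part0", "part1", "part2")]
      # Each analysis starts as soon as its input data become available.
      results = [pool.submit(analyze, f.result()) for f in loads]
      print(sum(r.result() for r in results))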

DMCF, differently from the other frameworks discussed here, is the only system that offers both a visual and a script-based programming interface. Visual programming is a very convenient design approach for high-level users, such as domain-expert analysts with a limited understanding of programming. On the other hand, script-based workflows are a useful paradigm for expert programmers, who can code complex applications rapidly, in a more concise way, and with greater flexibility. Finally, the workflow-based model exploited in DMCF and Swift makes these frameworks more general than Spark, which offers a very restricted set of programming patterns (e.g., map, filter, and reduce), thus limiting the variety of data analysis applications that can be implemented with it.

These and other related systems are currently used for the development of big data analysis applications on HPC and cloud platforms. However, additional research in this field must be done and the development of new models, solutions, and tools is needed.[7][16] Just to mention a few, active and promising research topics are listed here, ordered by importance:

1. Programming models for big data analytics: New abstract programming models and constructs hiding the system complexity are needed for big data analytics tools. The MapReduce model and workflow models are often used on HPC and cloud implementations, but more research effort is needed to develop other scalable, adaptive, general-purpose higher-level models and tools. Research in this area is even more important for exascale systems; in the next section we will discuss some of these topics in exascale computing.

2. Reliability in scalable data analysis: As the number of processing elements increases, the reliability of systems and applications decreases, and therefore mechanisms for detecting and handling hardware and software faults are needed. Although Fekete et al.[17] have proven that no reliable communication protocol can tolerate crashes of the processors on which the protocol runs, there are ways in which systems can cope with this impossibility result. Among them, at the programming level it is necessary to design constructs for handling communication, data access, and computing failures and for recovering from them. Programming models, languages, and APIs must provide general and data-oriented mechanisms for failure detection and isolation, preventing an entire application from failing and assuring its completion. Reliability is a much more important issue in the exascale domain, where the number of processing elements is massive and fault occurrence increases, making detection and recovery vital.

3. Application reproducibility: Reproducibility is another open research issue for designers of complex applications running on parallel systems. Reproducibility in scalable data analysis must contend with, for example, data communication, parallel data manipulation, and dynamic computing environments. Reproducibility demands that current data analysis frameworks (such as those based on MapReduce and on workflows) and future ones, especially those implemented on exascale systems, provide additional information and knowledge about how data are managed, about algorithm characteristics, and about the configuration of software and execution environments.

4. Data and tool integration and openness: Code coordination and data integration are main issues in large-scale applications that use data and computing resources. Standard formats, data exchange models, and common application programming interfaces (APIs) are needed to support interoperability and ease cooperation among design teams using different data formats and tools.

5. Interoperability of big data analytics frameworks: The service-oriented paradigm allows running large-scale distributed applications on cloud heterogeneous platforms along with software components developed using different programming languages or tools. Cloud service paradigms must be designed to allow worldwide integration of multiple data analytics frameworks.

Exascale and big data analysis

As we discussed in the previous sections, data analysis has gained a primary role because of the very large availability of datasets and the continuous advancement of methods and algorithms for finding knowledge in them. Data analysis solutions advance by exploiting the power of data mining and machine learning techniques and are changing several scientific and industrial areas. For example, the amount of data that social media generate daily is impressive and continuous. Some hundreds of terabytes of data, including several hundred million photos, are uploaded daily to Facebook and Twitter.

Therefore, it is essential to design scalable solutions for processing and analyzing such massive datasets. As a general forecast, IDC experts estimate that data generated worldwide will reach about 45 zettabytes by 2020.[18] This impressive amount of digital data calls for scalable, high-performance data analysis solutions. However, today only about one-quarter of the available digital data would be a candidate for analysis, and only about five percent of that is actually analyzed. By 2020, the useful percentage could grow to about 35 percent, thanks to data mining technologies.

Extreme data sources and scientific computing

Scalability and performance requirements are challenging conventional data storage, file systems, and database management systems. The architectures of such systems have reached their limits in handling extremely large processing tasks involving petabytes of data because they were not built to scale beyond a given threshold. New architectures and analytics platforms able to process big data and extract complex predictive and descriptive models have become necessary.[19] Exascale systems, from both the hardware and the software side, can play a key role in supporting solutions to these problems.[1]

An IBM study reports that we are generating around 2.5 exabytes of data per day.[20] Because of this continuous and explosive growth of data, many applications require the use of scalable data analysis platforms. A well-known example is the ATLAS detector at the Large Hadron Collider at CERN in Geneva. The ATLAS infrastructure has a capacity of 200 PB of disk space and 300,000 processor cores, with more than 100 computing centers connected via 10 Gbps links. The data collection rate is massive, and only a portion of the data produced by the collider is stored. Several teams of scientists run complex applications to analyze subsets of those huge volumes of data. This analysis would be impossible without a high-performance infrastructure that supports data storage, communication, and processing. Computational astronomers, too, are collecting and producing ever-larger datasets each year that cannot be stored and processed without scalable infrastructures. Another significant case is the Energy Sciences Network (ESnet), the U.S. Department of Energy's high-performance network managed by Berkeley Lab, which in late 2012 rolled out a 100 gigabits-per-second national network to accommodate the growing scale of scientific data.

If we go from science to society, social data and eHealth are good examples to discuss. Social networks, such as Facebook and Twitter, have become very popular and are receiving increasing attention from the research community because of the huge amount of user-generated data, which provide valuable information concerning human behavior, habits, and travel. When the volume of data to be analyzed is of the order of terabytes or petabytes (billions of tweets or posts), scalable storage and computing solutions must be used, but no clear solutions today exist for the analysis of exascale datasets. The same occurs in the eHealth domain, where huge amounts of patient data are available and can be used for improving therapies, for forecasting and tracking of health data, and for the management of hospitals and health centers. Very complex data analysis in this area will need novel hardware/software solutions; however, exascale computing is still promising in other scientific fields where scalable storage and databases are not used/required. Examples of scientific disciplines where future exascale computing will be extensively used are quantum chromodynamics, materials simulation, molecular dynamics, materials design, earthquake simulations, subsurface geophysics, climate forecasting, nuclear energy, and combustion. All those applications require the use of sophisticated models and algorithms to solve complex equation systems that will benefit from the exploitation of exascale systems.

Programming model features for exascale data analysis

Implementing scalable data analysis applications in exascale computing systems is a complex job requiring high-level fine-grain parallel models, appropriate programming constructs, and skills in parallel and distributed programming. In particular, mechanisms and expertise are needed for expressing task dependencies and inter-task parallelism, for designing synchronization and load balancing mechanisms, handling failures, and properly managing distributed memory and concurrent communication among a very large number of tasks. Moreover, when the target computing infrastructures are heterogeneous and require different libraries and tools to program applications on them, the programming issues are even more complex. To cope with some of these issues in data-intensive applications, different scalable programming models have been proposed.[21]

Scalable programming models may be categorized by:

i. Their level of abstraction, expressing high-level and low-level programming mechanisms, and
ii. How they allow programmers to develop applications, using visual or script-based formalisms.

Using high-level scalable models, a programmer defines only the high-level logic of an application while hiding the low-level details that are not essential for application design, including infrastructure-dependent execution details. A programmer is assisted in application definition, and application performance depends on the compiler that analyzes the application code and optimizes its execution on the underlying infrastructure. On the other hand, low-level scalable models allow programmers to interact directly with computing and storage elements composing the underlying infrastructure and thus define the application's parallelism directly.

Data analysis applications implemented with some frameworks can be programmed through a visual interface, which is a convenient design approach for high-level users, for instance domain-expert analysts having a limited understanding of programming. In addition, a visual representation of workflows or components intrinsically captures parallelism at the task level, without the need to make parallelism explicit through control structures.[6] Visual-based data analysis is typically implemented by providing workflow-based languages or component-based paradigms (Fig. 3). Dataflow-based approaches, which share the same application structure as workflows, are also used; however, in dataflow models the grain of parallelism and the size of data items are generally smaller than in workflows. In general, visual programming tools are not very flexible because they often implement a limited set of visual patterns and provide restricted ways to configure them. To address this issue, some visual languages allow users to customize the behavior of patterns by adding code that specifies the operations a pattern executes when an event occurs.


Fig3 Talia JOfCloudComp2019 8.png

Figure 3. Main visual and script-based programming models used today for data analysis programming

On the other hand, code-based (or script-based) formalism allows users to program complex applications more rapidly, in a more concise way, and with higher flexibility.[13] Script-based applications can be designed in different ways (see Fig. 3):

  • Use a complete language or a language extension that allows parallelism to be expressed in applications, according to a general-purpose or a domain-specific approach. This approach requires the design and implementation of a new parallel programming language, or of a complete set of data types and parallel constructs to be fully inserted into an existing language.
  • Use annotations in the application code that allow the compiler to identify which instructions will be executed in parallel. According to this approach, parallel statements are separated from sequential constructs, and they are clearly identified in the program code because they are denoted by special symbols or keywords.
  • Use a library in the application code that adds parallelism to the data analysis application. Currently this is the most-used approach since it is orthogonal to host languages. MPI and MapReduce are two well-known examples of this approach.
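
As a small illustration of the library-based approach (a sketch assuming the mpi4py Python bindings for MPI are installed; the data partitioning is an arbitrary example), each process computes a partial result on its own data and a collective library call combines the partial results:

  from mpi4py import MPI

  comm = MPI.COMM_WORLD
  rank = comm.Get_rank()            # id of this process
  size = comm.Get_size()            # total number of processes

  # Each process works on its own partition of a synthetic dataset.
  local_data = range(rank * 1000, (rank + 1) * 1000)
  local_sum = sum(local_data)

  # Library call that combines partial results from all processes.
  total = comm.allreduce(local_sum, op=MPI.SUM)
  if rank == 0:
      print("global sum:", total)

Run, for example, with "mpiexec -n 4 python partial_sum.py"; the host language stays ordinary Python, and parallelism comes entirely from the library calls.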

Given the variety of data analysis applications and classes of users (from skilled programmers to end users) that can be envisioned for future exascale systems, there is a need for scalable programming models with different levels of abstractions (high-level and low-level) and different design formalisms (visual and script-based), according to the classification outlined above.

As we discussed, data-intensive applications are software programs that have a significant need to process large volumes of data.[22] Such applications devote most of their processing time to running I/O operations and exchanging and moving data among the processing elements of a parallel computing infrastructure. Parallel processing in data analysis applications typically involves accessing, pre-processing, partitioning, distributing, aggregating, querying, mining, and visualizing data that can be processed independently.

The main challenges for programming data analysis applications on exascale computing systems come from potential scalability; network latency and reliability; reproducibility of data analysis; and resilience of mechanisms and operations offered to developers for accessing, exchanging, and managing data. Indeed, processing extremely large data volumes requires operations and new algorithms able to scale in loading, storing, and processing massive amounts of data that generally must be partitioned in very small data grains, on which thousands to millions of simple parallel operations do analysis.

Exascale programming systems

Exascale systems force new requirements on programming systems to target platforms with hundreds of homogeneous and heterogeneous cores. Evolutionary models have recently been proposed for exascale programming that extend or adapt traditional parallel programming models like MPI (e.g., EPiGRAM[23], which uses a library-based approach, and Open MPI for exascale in the ECP initiative), OpenMP (e.g., OmpSs[24], which exploits an annotation-based approach, and the SOLLVE project), and MapReduce (e.g., Pig Latin[25], which implements a domain-specific complete language). These new frameworks limit the communication overhead in message-passing paradigms or limit the synchronization control if a shared-memory model is used.[26]

As exascale systems are likely to be based on large distributed-memory hardware, MPI is one of the most natural programming systems. MPI is currently used on over one million cores, and therefore it is reasonable to expect MPI to be one of the programming paradigms used on exascale systems. The same holds for MapReduce-based libraries, which today run on very large HPC and cloud systems. Both of these paradigms are widely used for implementing big data analysis applications. As expected, general MPI all-to-all communication does not scale well in exascale environments; thus, to address this issue, newer MPI releases introduced neighbor collectives to support sparse "all-to-some" communication patterns that confine data exchange to limited regions of processors.[26]
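
The idea behind neighbor collectives can be sketched with the mpi4py bindings (an illustrative sketch assuming mpi4py is installed and the program is run with at least two processes; the ring topology and payloads are arbitrary examples): each process declares its neighbors in a distributed graph topology and then exchanges data only with them, not with all processes:

  from mpi4py import MPI

  comm = MPI.COMM_WORLD
  rank = comm.Get_rank()
  size = comm.Get_size()

  # Ring topology: each process communicates only with its two neighbors.
  left = (rank - 1) % size
  right = (rank + 1) % size
  graph = comm.Create_dist_graph_adjacent(sources=[left, right],
                                          destinations=[left, right])

  # Sparse "all-to-some" exchange: one message per declared neighbor.
  received = graph.neighbor_alltoall(["msg to %d" % left, "msg to %d" % right])
  print(rank, received)
  graph.Free()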

Ensuring the reliability of exascale systems requires a holistic approach, including several hardware and software technologies for both predicting crashes and keeping systems stable despite failures. In the runtime of parallel APIs (like MPI and MapReduce-based libraries such as Hadoop), a reliable communication layer must be provided to mitigate incorrect behavior in case of processor failure. This is done by implementing, on top of the lower unreliable layer, a correct protocol that works safely with every implementation of that unreliable layer, even though no such protocol can tolerate crashes of the processors on which it runs. Concerning MapReduce frameworks, reference [27] reports on an adaptive MapReduce framework, called P2P-MapReduce, which has been developed to manage node churn, master node failures, and job recovery in a decentralized way, providing a more reliable MapReduce middleware that can be effectively exploited in dynamic large-scale infrastructures.

On the other hand, new complete languages such as X10[28], ECL[29], UPC[30], Legion[31], and Chapel[32] have been defined by exploiting a data-centric approach. Furthermore, new APIs based on a revolutionary approach, such as GA[33] and SHMEM[34], have been implemented according to a library-based model. These novel parallel paradigms are devised to address the requirements of data processing using massive parallelism. In particular, languages such as X10, UPC, and Chapel, and the GA library, are based on a partitioned global address space (PGAS) memory model that is suited to implementing data-intensive exascale applications because it uses private data structures and limits the amount of shared data among parallel threads.

Together with different approaches, such as Pig Latin and ECL, these programming models, languages, and APIs must be further investigated, designed, and adapted to provide data-centric scalable programming models useful for supporting the reliable and effective implementation of exascale data analysis applications composed of up to millions of computing units that process small data elements and exchange them with a very limited set of processing elements. PGAS-based models, dataflow and data-driven paradigms, and local-data approaches today represent promising solutions that could be used for exascale data analysis programming. The APGAS model is, for example, implemented in the X10 language, based on the notions of places and asynchrony. A place is an abstraction of shared, mutable data and worker threads operating on the data. A single APGAS computation can consist of hundreds or potentially tens of thousands of places. Asynchrony is implemented by a single block-structured control construct, async: given a statement ST, the construct async ST executes ST in a separate thread of control. Memory locations in one place can contain references to locations at other places. To compute upon data at another place p, the following statement must be used:
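
  at (p) ST   // X10's place-shifting construct: run statement ST at place p, then return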



This allows the task to change its place of execution to p, execute ST at p and return, leaving behind tasks that may have been spawned during the execution of ST.

Another interesting language based on the PGAS model is Chapel.[32] Its locality mechanisms can be effectively used for scalable data analysis, where light data mining (sub-)tasks are run on local processing elements and partial results must be exchanged. Chapel's data locality provides control over where data values are stored and where tasks execute, so that developers can ensure that parallel data analysis computations execute near the variables they access, or vice versa, to minimize communication and synchronization costs. For example, Chapel programmers can specify how domains and arrays are distributed among the system nodes. Another appealing feature of Chapel is the expression of synchronization in a data-centric style: by associating synchronization constructs with data (variables), locality is enforced and data-driven parallelism can easily be expressed, even at large scale. In Chapel, "locales" and "domains" are abstractions for referring to machine resources and for mapping tasks and data onto them. Locales are language abstractions for naming a portion of a target architecture (e.g., a GPU, a single core, or a multicore node) that has processing and storage capabilities. A locale specifies where (on which processing node) to execute tasks/statements/operations. For example, in a system composed of four locales:
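
  // Illustrative Chapel sketch: name the system's four locales via the
  // built-in Locales array (assumes the program runs on four locales).
  const L: [0..3] locale = Locales;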



we can use the following for executing the method Filter (D) on the first locale:
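
  // Illustrative sketch: run the Filter method on dataset D at the first locale.
  on L[0] do Filter(D);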




Abbreviations

APGAS: asynchronous partitioned global address space

BSP: bulk synchronous parallel

CAF: Co-Array Fortran

DAaaS: data analysis as a service

DAIaaS: data analysis infrastructure as a service

DAPaaS: data analysis platform as a service

DASaaS: data analysis software as a service

DMCF: Data Mining Cloud Framework

ECL: Enterprise Control Language

ESnet: Energy Sciences Network

GA: global array

HPC: high-performance computing

IaaS: infrastructure as a service

JS4Cloud: JavaScript for Cloud

PaaS: platform as a service

PGAS: partitioned global address space

RDD: resilient distributed dataset

SaaS: software as a service

SOA: service-oriented architecture

TBB: threading building blocks

VL4Cloud: Visual Language for Cloud

XaaS: everything as a service

References

  1. 1.0 1.1 Petcu, D.; Iuhasz, G.; Pop, D. et al. (2015). "On Processing Extreme Data". Scalable Computing: Practice and Experience 16 (4). doi:10.12694/scpe.v16i4.1134. 
  2. Tardieu, O.; Herta, B.; Cunningham, D. et al. (2016). "X10 and APGAS at Petascale". ACM Transactions on Parallel Computing (TOPC) 2 (4): 25. doi:10.1145/2894746. 
  3. 3.0 3.1 Talia, D. (2015). "Making knowledge discovery services scalable on clouds for big data mining". Proceedings from the Second IEEE International Conference on Spatial Data Mining and Geographical Knowledge Services (ICSDM): 1–4. doi:10.1109/ICSDM.2015.7298015. 
  4. Amarasinghe, S.; Campbell, D.; Carlson, W. et al. (14 September 2009). "ExaScale Software Study: Software Challenges in Extreme Scale Systems". DARPA IPTO. pp. 153. CiteSeerX 10.1.1.205.3944. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.205.3944. 
  5. 5.0 5.1 Zaharia, M.; Xin, R.S.; Wendell, P. et al. (2016). "Apache Spark: A unified engine for big data processing". Communications of the ACM 59 (11): 56–65. doi:10.1145/2934664. 
  6. 6.0 6.1 Maheshwari, K.; Montagnat, J. (2010). "Scientific Workflow Development Using Both Visual and Script-Based Representation". 6th World Congress on Services: 328–35. doi:10.1109/SERVICES.2010.14. 
  7. 7.0 7.1 Reed, D.A.; Dongarra, J. (2015). "Exascale computing and big data". Communications of the ACM 58 (7): 56–68. doi:10.1145/2699414. 
  8. Armbrust, M.; Fox, A.; Griffith, R. et al. (2010). "A view of cloud computing". Communications of the ACM 53 (4): 50–58. doi:10.1145/1721654.1721672. 
  9. Gu, Y.; Grossman, R.L. (2009). "Sector and Sphere: The design and implementation of a high-performance data cloud". Philosophical Transactions, Series A: Mathematical, Physical, and Engineering Sciences 367 (1897): 2429–45. doi:10.1098/rsta.2009.0053. PMC PMC3391065. PMID 19451100. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3391065. 
  10. Talia, D.; Trunfio, P.; Marozzo, F. (2015). Data Analysis in the Cloud. Elsevier. pp. 150. ISBN 9780128029145. 
  11. Hwang, K. (2017). Cloud Computing for Machine Learning and Cognitive Applications. MIT Press. pp. 624. ISBN 9780262036412. 
  12. 12.0 12.1 Marozzo, F.; Talia, D.; Trunfio, P. (2013). "A Cloud Framework for Big Data Analytics Workflows on Azure". In Catlett, C., Gentzsch, W., Grandinetti, L. et al.. Cloud Computing and Big Data. Advances in Parallel Computing. 23. pp. 182–91. doi:10.3233/978-1-61499-322-3-182. ISBN 9781614993223. 
  13. 13.0 13.1 Marozzo, F.; Talia, D.; Trunfio, P. (2015). "JS4Cloud: script‐based workflow programming for scalable data analysis on cloud platforms". Concurrency and Computation: Practice and Experience 27 (17): 5214–37. doi:10.1002/cpe.3563. 
  14. Talia, D. (2013). "Clouds for Scalable Big Data Analytics". Computer 46 (5): 98–101. doi:10.1109/MC.2013.162. 
  15. Wozniak, J.M.; Wilde, M.; Foster, I.T. (2014). "Language Features for Scalable Distributed-Memory Dataflow Computing". Fourth Workshop on Data-Flow Execution Models for Extreme Scale Computing: 50–53. doi:10.1109/DFM.2014.17. 
  16. Lucas, R.; Ang, J.; Bergman, K. et al. (10 February 2014). "Top Ten Exascale Research Challenges" (PDF). U.S. Department of Energy. pp. 80. https://science.energy.gov/~/media/ascr/ascac/pdf/meetings/20140210/Top10reportFEB14.pdf. 
  17. Fekete, A.; Lynch, N.; Mansour, Y.; Spinelli, J. (1993). "The impossibility of implementing reliable communication in the face of crashes". Journal of the ACM 40 (5): 1087–1107. doi:10.1145/174147.169676. 
  18. IDC (April 2014). "The Digital Universe of Opportunities: Rich Data and the Increasing Value of the Internet of Things". Dell EMC. https://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm. 
  19. Chen, J.; Choudhary, A.; Feldman, S. et al. (March 2013). "Synergistic Challenges in Data-Intensive Science and Exascale Computing: DOE ASCAC Data Subcommittee Report". Department of Energy, Office of Science. https://www.scholars.northwestern.edu/en/publications/synergistic-challenges-in-data-intensive-science-and-exascale-com. 
  20. "What will we make of this moment?" (PDF). IBM. 2013. pp. 151. https://www.ibm.com/annualreport/2013/bin/assets/2013_ibm_annual.pdf. 
  21. Diaz, J.; Muñoz-Caro, C.; Niño, A. (2012). "A Survey of Parallel Programming Models and Tools in the Multi and Many-Core Era". IEEE Transactions on Parallel and Distributed Systems 23 (8): 1369–86. doi:10.1109/TPDS.2011.308. 
  22. Gorton, I.; Greenfield, P.; Szalay, A.; Williams, R. (2008). "Data-Intensive Computing in the 21st Century". Computer 41 (4): 30–32. doi:10.1109/MC.2008.122. 
  23. Markidis, S.; Peng, I.B.; Larsson, J. et al. (2016). "The EPiGRAM Project: Preparing Parallel Programming Models for Exascale". High Performance Computing - ISC High Performance 2016: 56–68. doi:10.1007/978-3-319-46079-6_5. 
  24. Fernández, A.; Beltran, V.; Martorell, X. et al. (2014). "Task-Based Programming with OmpSs and Its Application". Euro-Par 2014: Parallel Processing Workshops: 601–12. doi:10.1007/978-3-319-14313-2_51. 
  25. Olston, C.; Reed, B.; Srivastava, U. et al. (2008). "Pig Latin: A not-so-foreign language for data processing". Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data: 1099–1110. doi:10.1145/1376616.1376726. 
  26. 26.0 26.1 Gropp, W.; Snir, M. (2013). "Programming for Exascale Computers". Computing in Science & Engineering 15 (6): 27–35. doi:10.1109/MCSE.2013.96. 
  27. Marozzo, F.; Talia, D.; Trunfio, P. (2012). "P2P-MapReduce: Parallel data processing in dynamic cloud environments". Journal of Computer and System Sciences 78 (5): 1382–1402. doi:10.1016/j.jcss.2011.12.021. 
  28. Tardieu, O.; Herta, B.; Cunningham, D. et al. (2014). "X10 and APGAS at Petascale". Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming: 53–66. doi:10.1145/2555243.2555245. 
  29. Yoo, A.; Kaplan, I. (2009). "Evaluating use of data flow systems for large graph analysis". Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers: 5. doi:10.1145/1646468.1646473. 
  30. Nishtala, R.; Zheng, Y.; Hargrove, P.H. et al. (2011). "Tuning collective communication for Partitioned Global Address Space programming models". Parallel Computing 37 (9): 576–91. doi:10.1016/j.parco.2011.05.006. 
  31. Bauer, M.; Treichler, S.; Slaughter, E.; Aiken, A. (2012). "Legion: Expressing locality and independence with logical regions". Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis: 66. https://dl.acm.org/citation.cfm?id=2389086. 
  32. 32.0 32.1 Chamberlain, B.L.; Callahan, D.; Zima, H.P. (2007). "Parallel Programmability and the Chapel Language". The International Journal of High Performance Computing Applications 21 (3): 291–312. doi:10.1177/1094342007078442. 
  33. Nieplocha, J.; Palmer, B.; Tipparaju, V. et al. (2006). "Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit". The International Journal of High Performance Computing Applications 20 (2): 203–31. doi:10.1177/1094342006064503. 
  34. Meswani, M.R.; Carrington, L.; Snavely, A.; Poole, S. (2012). "Tools for Benchmarking, Tracing, and Simulating SHMEM Applications". CUG2012 Final Proceedings: 1–6. https://cug.org/proceedings/attendee_program_cug2012/by_auth.html. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. Some grammar and punctuation were cleaned up to improve readability. In some cases important information was missing from the references, and that information was added. The original article lists references alphabetically, but this version—by design—lists them in order of appearance. The lone footnote was turned into an inline reference.