Big Data Analytics with Spark is a step-by-step guide for learning Spark, an open-source, fast, general-purpose cluster computing framework for large-scale data analysis. You will learn how to use Spark for different types of big data analytics projects, including batch, interactive, graph, and stream data analysis, as well as machine learning. In addition, this book will help you become a much sought-after Spark expert. Spark is one of the hottest big data technologies. The amount of data generated today by devices, applications, and users is exploding.
The Spark core engine itself has changed little since it was first released, but the libraries have grown to provide more and more types of functionality, turning it into a multifunctional data analytics tool.
Beyond these libraries, there are hundreds of open source external libraries ranging from connectors for various storage systems to machine learning algorithms.
The short answer is: it depends on the particular needs of your business, but based on my research, it seems that about 7 out of 10 times the answer will be Spark. Linear processing of huge datasets is the advantage of Hadoop MapReduce, while Spark delivers fast performance, iterative processing, real-time analytics, graph processing, machine learning, and more.
So, when the size of the data is too big for Spark to handle in memory, Hadoop can help overcome that hurdle via its HDFS functionality. Spark runs applications up to 100x faster in memory and 10x faster on disk than Hadoop by reducing the number of read-write cycles to disk and storing intermediate data in memory. Hadoop MapReduce, in contrast, reads and writes from disk, which slows down the processing speed and overall efficiency. In MapReduce, developers also need to hand-code every operation, which can make it more difficult to use for complex projects at scale.
Hadoop MapReduce allows parallel processing of huge amounts of data. It breaks a large chunk into smaller ones to be processed separately on different data nodes.
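As a minimal illustration of this split-and-combine model, here is a single-process Python sketch (not actual Hadoop code; the map_phase, shuffle, and reduce_phase names are mine) of the classic word-count job:

```python
from collections import defaultdict

def map_phase(document):
    # Emit (word, 1) pairs, as a Hadoop mapper would.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Group values by key, mimicking the shuffle/sort step between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word, as a reducer would.
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data analytics", "big data with spark"]
pairs = [p for doc in documents for p in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
# counts == {"big": 2, "data": 2, "analytics": 1, "with": 1, "spark": 1}
```

In a real cluster, the pairs emitted by the mappers would be partitioned and shipped across nodes during the shuffle; here everything runs in one process to show the shape of the computation.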
Functionality: Apache Spark is the uncontested winner in this category. If the task is to process data again and again, Spark defeats Hadoop MapReduce.

He has been an invited speaker at several national and international conferences, such as O'Reilly's Strata big data conference series. His recent publications have appeared in the Big Data journal of Liebertpub. He lives in Bangalore with his wife, son, and daughter, and enjoys researching ancient Indian, Egyptian, Babylonian, and Greek culture and philosophy.
Perhaps you are a video service provider and would like to optimize the end-user experience by choosing the appropriate content distribution network based on dynamic network conditions. Or you are a government regulatory body that needs to classify Internet pages into porn or non-porn in order to filter porn pages, which has to be achieved at high throughput and in real time.
How you wish you had known that the last customer who was on the phone with your call center had tweeted with negative sentiments about you a day before.
Or you are a retail store owner and you would love to have predictions about customers' buying patterns after they enter the store, so that you can run promotions on your products and expect an increase in sales. Or you are a healthcare insurance provider for whom it is imperative to compute the probability that a customer is likely to be hospitalized in the next year, so that you can fix appropriate premiums.
Or you work for an electronics manufacturing company and you would like to predict failures and identify root causes during test runs so that the subsequent real runs are effective. Welcome to the world of possibilities, thanks to big data analytics. The only difference between the terms analysis and analytics is that analytics is about analyzing data and converting it into actionable insights. The term Business Intelligence (BI) is also often used to refer to analysis in a business environment, possibly originating in an article by Peter Luhn (Luhn). Lots of BI applications were run over data warehouses, even quite recently.
The evolution of the term big data, in contrast to the term analytics, has been quite recent, as explained next. The term big data seems to have been used first by John R. Mashey, then chief scientist of Silicon Graphics Inc. The term was also used in a paper by Bryson et al. The report by Laney, from the META Group (now Gartner), was the first to identify the 3Vs (volume, variety, and velocity) perspective of big data.
Google's seminal paper on MapReduce (MR; Dean and Ghemawat) was the trigger that led to lots of developments in the big data space. Though the MR paradigm was known in the functional programming literature, the paper provided scalable implementations of the paradigm on a cluster of nodes. The paper, along with Apache Hadoop, the open source implementation of the MR paradigm, enabled end users to process large data sets on a cluster of nodes, a usability paradigm shift.
Hadoop Suitability: Hadoop is good for a number of use cases, including those in which the data can be partitioned into independent chunks, the embarrassingly parallel applications, as is widely known. What about Hadoop's lack of suitability for other types of applications? If data splits are interrelated, or computation needs to access data across splits, this might involve joins and might not run efficiently over Hadoop.
For example, imagine that you have a set of stocks and the values of those stocks at various time points, and you are required to compute correlations across stocks: when Apple falls, what is the probability of Samsung also falling the next day? The computation cannot be split into independent chunks; you may have to compute correlations between stocks in different chunks, if the chunks carry different stocks.
If the data is split along the timeline, you would still need to compute correlations between stock prices at different points of time, which may be in different chunks. For iterative computations, Hadoop MR is not well suited, for two reasons.
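To make the cross-split dependence concrete, here is a plain-Python Pearson correlation over two made-up price series (the numbers are illustrative only). Computing it requires both complete series side by side, which is exactly what independent splits prevent:

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient of two equal-length series.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical daily closing prices for two stocks.
apple   = [150.0, 148.2, 149.5, 146.9, 147.8]
samsung = [ 71.0,  70.1,  70.8,  69.4,  69.9]
r = pearson(apple, samsung)
```

With K stocks, every one of the K * (K - 1) / 2 pairs needs both full series in one place, so a partitioning of the data by stock (or by time) forces heavy data movement between splits.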
One is the overhead of fetching data from HDFS for each iteration (which can be amortized by a distributed caching layer), and the other is the lack of long-lived MR jobs in Hadoop. Typically, there is a termination condition check that must be executed outside of the MR job, so as to determine whether the computation is complete. This implies that new MR jobs need to be initialized for each iteration in Hadoop; the overhead of initialization could overwhelm the computation for the iteration and could cause significant performance hits.
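The driver pattern described above can be sketched in plain Python; run_mr_job here is a toy stand-in (a damped update toward the mean of the data) for launching a full MapReduce job, so in Hadoop the job-startup overhead would be paid once per loop iteration:

```python
def run_mr_job(data, estimate):
    # Stand-in for launching a full MapReduce job: one pass over the
    # data producing an updated estimate (a damped step toward the mean).
    return sum(data) / len(data) * 0.5 + estimate * 0.5

def iterate_until_converged(data, tolerance=1e-6, max_iterations=100):
    estimate = 0.0
    for i in range(max_iterations):
        new_estimate = run_mr_job(data, estimate)  # a fresh "job" each time
        # The termination check runs in the driver, outside the "job";
        # in Hadoop, this means paying job-initialization cost every iteration.
        if abs(new_estimate - estimate) < tolerance:
            return new_estimate, i + 1
        estimate = new_estimate
    return estimate, max_iterations

value, iterations = iterate_until_converged([1.0, 2.0, 3.0])
# converges to the mean (2.0) after about 20 "jobs"
```

Each loop iteration would correspond to a full MR job launch; a framework that keeps the data and the job alive across iterations avoids that repeated startup cost.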
The other perspective on Hadoop suitability can be understood by looking at the characterization of the computation paradigms required for analytics on massive data sets, from the National Academies Press (NRC). They term the seven categories the seven giants, in contrast with the dwarf terminology that was used to characterize fundamental computational tasks in the supercomputing literature (Asanovic et al.).
These are the seven giants:

1. Basic statistics: This category involves basic statistical operations such as computing the mean, median, and variance, as well as things like order statistics and counting. The operations are typically O(N) for N points and are typically embarrassingly parallel, so perfect for Hadoop.

2. Linear algebraic computations: These computations involve linear systems, eigenvalue problems, and inverses, arising from problems such as linear regression and Principal Component Analysis (PCA).
Moreover, a formulation of multivariate statistics in matrix form is difficult to realize over Hadoop. Examples of this type include kernel PCA and kernel regression.

3. Generalized N-body problems: These are problems that involve distances, kernels, or other kinds of similarity between points or sets of points (tuples). Computational complexity is typically O(N²) or even O(N³).
The typical problems include range searches, nearest neighbor search problems, and nonlinear dimension reduction methods. The simpler N-body problems, such as k-means clustering, are solvable over Hadoop, but not the complex ones, such as kernel PCA, kernel Support Vector Machines (SVMs), and kernel discriminant analysis.

4. Graph theoretic computations: Problems that involve graphs as the data, or that can be modeled graphically, fall into this category.
The computations on graph data include centrality, commute distances, and ranking. When the statistical model is a graph, graph search is important, as is computing probabilities, operations known as inference. Some graph theoretic computations that can be posed as linear algebra problems can be solved over Hadoop, within the limitations specified under giant 2. Euclidean graph problems are hard to realize over Hadoop, as they become generalized N-body problems.
Moreover, major computational challenges arise when you are dealing with large sparse graphs; partitioning them across a cluster is hard.

5. Optimizations: Optimization problems involve minimizing a convex function or maximizing a concave one, a function that can be referred to as an objective, a loss, a cost, or an energy function.
These problems can be solved in various ways. Stochastic approaches are amenable to implementation in Hadoop; Mahout has an implementation of stochastic gradient descent. Linear or quadratic programming approaches are harder to realize over Hadoop, because they involve complex iterations and operations on large matrices, especially at high dimensions.
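As a sketch of why stochastic approaches map well to this setting, here is a minimal stochastic gradient descent in plain Python (not Mahout's implementation; the function and parameter names are mine), fitting a toy linear model one point at a time:

```python
import random

def sgd_linear_regression(points, lr=0.01, epochs=200, seed=0):
    # Fit y = w * x + b by stochastic gradient descent on squared error.
    random.seed(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        random.shuffle(points)
        for x, y in points:
            error = (w * x + b) - y
            w -= lr * error * x    # gradient of 0.5 * error^2 w.r.t. w
            b -= lr * error        # gradient of 0.5 * error^2 w.r.t. b
    return w, b

# Noiseless toy data generated from y = 2x + 1.
data = [(float(x), 2.0 * x + 1.0) for x in range(10)]
w, b = sgd_linear_regression(data)
# w approaches 2.0 and b approaches 1.0
```

Because each update touches only one point, the data can be streamed through in passes, which is what makes this family of methods comparatively friendly to a batch-oriented platform.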
One class of optimization problems has been shown to be solvable on Hadoop by realizing a construct known as All-Reduce (Agarwal et al.). However, this approach might not be fault-tolerant and might not be generalizable. Conjugate gradient descent (CGD), due to its iterative nature, is also hard to realize over Hadoop.
The work of Stephen Boyd and his colleagues from Stanford, described in their paper (Boyd et al.), has precisely addressed this giant.

6. Integrations: The mathematical operation of integrating functions is important in big data analytics; such integrals arise in Bayesian inference as well as in random effects models. Quadrature approaches that are sufficient for low-dimensional integrals might be realizable on Hadoop, but not those for high-dimensional integration, which arise in the Bayesian inference approach to big data analytical problems.
Most recent applications of big data deal with high-dimensional data; this is corroborated, among others, by Boyd et al. Markov Chain Monte Carlo (MCMC) is iterative in nature, because the chain must converge to a stationary distribution, which might happen only after several iterations.
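The sequential nature of the chain can be seen in a minimal Metropolis sampler (plain Python, with a toy standard-normal target; the function and parameter names are mine). Each proposal depends on the current state, so iteration k+1 cannot start before iteration k finishes:

```python
import math
import random

def metropolis(log_density, start, steps, step_size=1.0, seed=42):
    # Metropolis sampler with a symmetric uniform proposal.
    random.seed(seed)
    x = start
    samples = []
    for _ in range(steps):
        proposal = x + random.uniform(-step_size, step_size)
        delta = log_density(proposal) - log_density(x)
        # Accept with probability min(1, exp(delta)).
        if delta >= 0 or random.random() < math.exp(delta):
            x = proposal
        samples.append(x)
    return samples

# Standard normal target, specified up to an additive constant.
samples = metropolis(lambda x: -0.5 * x * x, start=0.0, steps=20000)
mean = sum(samples) / len(samples)
# mean is close to 0, the mean of the target distribution
```

The chain's state dependence is exactly the kind of long-lived iterative loop that a launch-a-job-per-iteration model handles poorly.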
7. Alignment problems: These are problems that involve matching between data objects or sets of objects. They occur in various domains: image de-duplication, matching catalogs from different instruments in astronomy, multiple sequence alignment in computational biology, and so on. The simpler cases, in which the alignment problem can be posed as a linear algebra problem, can be realized over Hadoop.
The catalog cross-matching problem can be posed as a generalized N-body problem, and the discussion outlined earlier under giant 3 applies. The limitations of Hadoop and its lack of suitability for certain classes of applications have motivated some researchers to come up with alternatives.
Researchers at the University of California, Berkeley have proposed Spark as one such alternative; in other words, Spark could be seen as the next-generation data processing alternative to Hadoop in the big data space.
In the previous seven giants categorization, Spark would be efficient for complex linear algebraic problems (giant 2); generalized N-body problems (giant 3), such as kernel SVMs and kernel PCA; and certain optimization problems (giant 5), for example, approaches involving CGD. An effort has also been made to apply Spark to another giant, namely, graph theoretic computations, in GraphX (Xin et al.).
It would be an interesting area of further research to estimate the efficiency of Spark for other classes of problems, or other giants, such as integrations and alignment problems. Initial performance studies have shown that Spark can be many times faster than Hadoop for certain applications.
This book explores Spark as well as the other components of the Berkeley Data Analytics Stack (BDAS), a data processing alternative to Hadoop, especially in the realm of big data analytics that involves realizing machine learning (ML) algorithms. When using the term big data analytics, I refer to the capability to ask questions on large data sets and answer them appropriately, possibly by using ML techniques as the foundation.
I will also discuss the alternatives to Spark in this space: systems such as HaLoop and Twister. The other dimension for which the beyond-Hadoop thinking is required is real-time analytics.
It can be inferred that Hadoop is basically a batch processing system and is not well suited for real-time computations.
Consequently, if analytical algorithms are required to run in real time or near real time, Storm from Twitter has emerged as an interesting alternative in this space, although there are other promising contenders, including S4 from Yahoo and Akka from Typesafe. Storm has matured faster and has more production use cases than the others. Thus, I will discuss Storm in more detail in later chapters of this book, though I will also attempt a comparison with the other alternatives for real-time analytics.
The third dimension where beyond-Hadoop thinking is required is when there are specific complex data structures that need specialized processing; a graph is one such example. Twitter, Facebook, and LinkedIn, as well as a host of other social networking sites, have such graphs. They need to perform operations on the graphs, for example, searching for people you might know on LinkedIn, or a graph search in Facebook (Perry). There have been some efforts to use Hadoop for graph processing, such as Intel's GraphBuilder.
However, as outlined in the GraphBuilder paper (Jain et al.), that tool is focused on constructing graphs, while GraphLab (Low et al.) is focused on processing them. By processing, I mean running page ranking or other ML algorithms on the graph. GraphBuilder can be used for constructing the graph, which can then be fed into GraphLab for processing. GraphLab is focused on giant 4, graph theoretic computations. The use of GraphLab for any of the other giants is an interesting topic of further research.
The emerging focus of big data analytics is to make traditional techniques, such as market basket analysis, scale to work on large data sets. This is reflected in the approach of SAS and other traditional vendors to build Hadoop connectors. The other emerging approach to analytics focuses on new algorithms or techniques from ML and data mining for solving complex analytical problems, including those in video and real-time analytics.
My perspective is that Hadoop is just one such paradigm, with a whole new set of others that are emerging, including Bulk Synchronous Parallel (BSP) based paradigms and graph processing paradigms, which are better suited to realizing iterative ML algorithms. The following discussion should help clarify the big data analytics spectrum, especially from an ML realization perspective.
This should help put in perspective some of the key aspects of the book and establish the beyond-Hadoop thinking along the three dimensions of real-time analytics, graph computations, and batch analytics that involve complex problems (giants 2 through 7).
Big Data Analytics: Evolution of Machine Learning Realizations. I will explain the different paradigms available for implementing ML algorithms, both from the literature and from the open source community. First of all, here's a view of the three generations of ML tools available today:

1. The first-generation tools allow deep analysis on smaller data sets, data sets that can fit in the memory of the node on which the tool runs.

2. The second-generation tools allow what I call a shallow analysis of big data.

3. The third-generation tools facilitate deeper analysis of big data. Recent efforts by traditional vendors, such as SAS in-memory analytics, also fall into this category.

However, not all of these tools can work on large data sets (terabytes or petabytes of data), due to scalability limitations imposed by the nondistributed nature of the tool. In other words, they are vertically scalable (you can increase the processing power of the node on which the tool runs), but not horizontally scalable (not all of them can run on a cluster).
The first-generation tool vendors are addressing those limitations by building Hadoop connectors as well as providing clustering options, meaning that the vendors have made efforts to reengineer tools such as R and SAS to scale horizontally. The second-generation tools are maturing fast and are open source (especially Mahout). Mahout has a set of algorithms for clustering and classification, as well as a very good recommendation algorithm (Konstan and Riedl). Mahout can thus be said to work on big data, with a number of production use cases, mainly for recommendation systems.
I have also used Mahout in a production system for realizing recommendation algorithms in the financial domain and found it to be scalable, though not without issues; I had to tweak the source significantly. One observation about Mahout is that it implements only a small subset of ML algorithms over Hadoop: only 25 algorithms are of production quality, with only 8 or 9 usable over Hadoop, meaning scalable over large data sets.
These include linear regression, linear SVM, K-means clustering, and so forth. It does provide a fast sequential implementation of logistic regression, with parallelized training. However, as several others have also noted (see Quora, for example), it has its limitations.
Overall, this book is not intended as Mahout bashing. My point is that it is quite hard to implement certain ML algorithms, including kernel SVM and CGD (note that Mahout has an implementation of stochastic gradient descent), over Hadoop. This has been pointed out by several others as well; for instance, see the paper by Professor Srirama (Srirama et al.).
Power system state estimation is used to ensure the stability of the grid and prevent blackouts; likewise, Chen et al. have addressed this problem. When some part of the network is detached from the grid, islanding occurs, and such events can lead to stability issues in the grid. A Naive Bayes classifier has been used for islanding detection.
Real-time power quality assessment is much needed for power systems, in which power disturbances like harmonics, swells, and sags can affect performance. In power system grids, fault identification and failure cause identification are other major issues.
Mori made a survey of smart power grid oriented papers, which deal with various applications of data mining to power systems, like failure classification, examining transient faults, etc. Power consumption data: The distributed electricity will be consumed by consumers in various zones, like residential (individual houses and apartments) and commercial. Smart meters are equipped at customer end points; they sense and broadcast utilization data to the service providers at regular intervals.
These data are of two types: either disaggregated data (breakup data for every single component, or for groups of numerous components on single electrical circuits) or aggregated data (collective data of all appliances).
These data are first aggregated at a data concentrator and then transmitted to central servers. Advanced metering techniques and IP-based smart meters and appliances enable data to flow through the smart grid in a faster and more efficient way.
Load forecasting for large economic and industrial consumers plays a vital part in optimizing electricity consumption for their future development.
Edwards et al. have studied this problem, and customer profiling is helpful here. A list of consumer-based applications which perform analysis of utilization data has been described by Zeyar Aung.
Some of them are real-time pricing, load control, metering information and energy analysis via website, outage detection and notification, metering aggregation for multiple sites and facilities, unification of customer-owned generation, and theft control.
Various kinds of individual applications, which could be performed by utilizing the data generated from the power grid and applying various machine learning techniques, were discussed above. This demands a unified framework which can handle both the data and the processing capabilities required for smart grid data analysis.

3. Big Data Analytics on Smart Grid

The performance of Big Data analytics and machine learning on sensor and operational data from power grid generation and utilization systems should be precise.
These datasets will grow in size by hundreds of gigabytes per day. The main types of processing techniques employed in Big Data analysis are batch, stream, and iterative processing; thus we require a platform that can store this vast distributed data and perform all these types of analysis. Hadoop MapReduce is a batch processing programming model which is primarily used for the analysis of large pools of static, empirical data.
In this model, a very large dataset is divided into numerous small sets for processing, and computation is done in parallel on all these small units of data.
For example, a MapReduce job can be used to obtain the minimal readings among all smart meters. MapReduce can be applied to compute customer usage analysis, energy savings measure modeling, and other such analytics which are done over static data.
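A single-process Python sketch of that minimal-readings job (the meter IDs and values are hypothetical) shows the group-then-reduce shape a MapReduce job would follow over static meter data:

```python
from collections import defaultdict

# (meter_id, reading) pairs, as a mapper might emit them from raw records.
readings = [
    ("meter1", 5.2), ("meter2", 3.1), ("meter1", 4.8),
    ("meter2", 3.9), ("meter3", 6.0),
]

# Shuffle: group readings by meter id.
groups = defaultdict(list)
for meter_id, value in readings:
    groups[meter_id].append(value)

# Reduce: take the minimum reading per meter.
minima = {meter_id: min(values) for meter_id, values in groups.items()}
# minima == {"meter1": 4.8, "meter2": 3.1, "meter3": 6.0}
```

In a real deployment, the readings would be spread over HDFS blocks and the per-meter groups would be formed during the shuffle; MapReduce handles this well precisely because the data is static.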
It cannot be used for real-time sensor data and streaming data processing. Hence it is not suited for many smart grid real-time analytic processes, such as demand response, short-term load forecasting, AMI (Advanced Metering Infrastructure) operations, real-time usage and pricing analysis, real-time customer analytics, on-line grid control and monitoring, etc. Apart from being restricted to the traditional batch processing technique (MapReduce), the inability to perform on-line and streaming data analysis is a major drawback of Apache Hadoop.
The set of machine learning algorithms provided with Apache Hadoop is also not enough to meet the requirements of smart grid data analysis, due to which Apache Hadoop is not an apt choice for Big Data analytics on smart grid systems.
Other than the usual batch processing, the Apache Spark framework has the ability to perform iterative and streaming processing. Stream Processing: Stream processing involves invoking dependent logic on each new data instance, rather than waiting for the next batch of data and then reprocessing everything.
In Apache Spark, more analytics are carried out using stream processing than batch processing.
This avoids unnecessary reprocessing of the data. Stream processing provides timely and more accurate results when compared to batch processing, and this model enables Apache Spark to perform analytics on data with dynamic behavior. For example, in order to monitor and predict the stability of the entire power grid using PMU data, we need to calculate and keep track of a stability index of the grid.
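A toy Python sketch of such windowed tracking (the stability-index values, window size, and threshold are made up; a real stability index would be derived from PMU measurements) shows per-record processing rather than batch reprocessing:

```python
from collections import deque

class StabilityMonitor:
    """Track a sliding-window average of a hypothetical stability index
    and flag windows whose average drops below a threshold."""

    def __init__(self, window=3, threshold=0.5):
        self.values = deque(maxlen=window)  # keeps only the last `window` readings
        self.threshold = threshold

    def update(self, index):
        # Called once per arriving measurement, micro-batch style.
        self.values.append(index)
        average = sum(self.values) / len(self.values)
        return average, average < self.threshold

monitor = StabilityMonitor(window=3, threshold=0.5)
stream = [0.9, 0.8, 0.7, 0.3, 0.2]
alerts = [monitor.update(x)[1] for x in stream]
# the last reading pushes the window average (0.4) below the threshold
```

Each new reading updates the state incrementally, which is the essence of the streaming model: no earlier data is reprocessed when a new measurement arrives.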
This can protect the grid from potential threats like islanding, blackouts, and so on. Iterative Processing: Apart from batch and stream processing methods, there are many Big Data analytic problems which do not come under the scope of these techniques.
Iterative processing is another category of processing methodology, used to solve such problems. The main characteristic of iterative processing is the repeated processing of all varieties of data types.
Apache Spark is the current leading framework used for iterative processing. It has the power to process and hold data in memory across the cluster. Once the data is loaded into the framework, it is written back only after completion of the whole iterative process. This makes Apache Spark 10x to 100x faster than the MapReduce framework, which involves reading and writing data from disk during each iteration.