Spark: a data engineer’s best friend
Data engineering, as a separate category of expertise in the world of data science, did not happen in a vacuum. The role of the Data Engineer was born and evolved as the number of data sources and data products increased over the years. Therefore, the role and function of a data engineer is closely associated with various data processing platforms such as Apache Hadoop, Apache Spark and a large number of specialized tools. In this article, you’ll find out why Spark should be considered a data engineer’s best friend.
Spark is the ultimate toolkit
Data engineers often work in multiple and complex environments and perform the complex, difficult, and sometimes tedious work required to get data systems up and running. Their job is to put the data in a form where others in the data pipeline, like data scientists, can extract value from the data.
Spark has become the ultimate toolkit for data engineers because it simplifies the working environment by providing both a platform for organizing and running complex data pipelines and a powerful toolset for storing, retrieving, and transforming data.
Spark isn’t everything, and there are plenty of important tools outside of Spark that data engineers love. But what Spark does is perhaps the most important thing: It provides a unified environment that accepts data in many different forms and allows all tools to work together on the same data, passing a set of data from one stage to another. Getting it right means you can build data pipelines at scale.
With Spark, data engineers can:
- Connect to different data sources in different locations, including cloud sources such as Amazon S3, databases, Hadoop file systems, data feeds, web services, and flat files.
- Convert different types of data to a standard format. The Spark data processing API accepts several different types of input data. Spark then uses Resilient Distributed Datasets (RDDs) and DataFrames for simplified yet advanced data processing.
- Write programs that access, transform, and store data. Many popular programming languages have APIs for directly integrating Spark code, and Spark offers many powerful functions to perform complex ETL-style data cleansing and transformation functions. Spark also includes a high-level API that allows users to transparently write queries in SQL.
- Integrate with almost all of the important tools for data management, data profiling, data discovery, and data graphing.
What makes Spark unique
In order to understand why Spark is so special, it’s important to compare it to Hadoop infrastructure, which was previously crucial in the rise of big data and big data analytics.
It’s modular: Spark is essentially a modular toolkit originally designed to work with Hadoop through the YARN cluster manager interface. This coupling with Hadoop made sense because Hadoop provided both the compute and storage resources: Spark offered many tools for data processing, while Hadoop handled large volumes of affordable persistent storage and the scaling of compute and storage nodes. However, it quickly became clear that coupling compute and storage was not cost effective. Since then, many efficiencies have been introduced to support cloud-scale architectures, all with the goal of decoupling storage and compute. Spark can be used outside of Hadoop and allows either resource to scale independently. Ultimately, this means that regardless of an organization’s preferred storage and compute infrastructure, Spark allows users to interface with that infrastructure.
Accepts data of any size and shape: Spark emerged roughly 10 years after the inception of Hadoop and focused more on how data of any size could be combined to support the development of applications and analytical workloads. While Hadoop offered a variety of low-level functionality, Spark provides a much larger, tailored environment that takes raw data, turns it into reusable forms, and delivers it to analytical workloads. While Spark can work in batches, it can also work interactively. As a result, Spark has become the go-to platform for most data applications and is uniquely suited to solving data engineering problems. Essentially, Spark has overtaken Hadoop.
Supports multiple approaches and users: Hadoop infrastructure ushered in the era of big data by creating a platform that could creatively and affordably store and process data in quantities never before imagined, and then make that data usable through MapReduce. But Spark was also created to support processes higher up the stack. While Spark can access raw forms of data and interact with Hadoop file systems, it is not locked into a single paradigm for achieving these goals; rather, it was built from the ground up to provide multiple approaches for dealing with different architectures, all using the same underlying data formats.
Designed for the data engineer
Spark gives data engineers great elasticity and flexibility as they approach their work. For example, a TensorFlow job in Spark can access data in HDFS or in multiple other formats. Because Spark uses an API-based approach, data engineers have a wide variety of tools to use with Spark as an analysis engine. This modularity allows the use of open source tools and avoids vendor lock-in to proprietary, application-specific programming tools.
Since Spark now works with Kubernetes and containerization, engineers can spin up and tear down Spark clusters and effectively manage them as Kubernetes pods rather than relying on physical, stand-alone, or bare-metal clusters. Deploying a Spark cluster on top of a Kubernetes cluster leverages the hardware abstraction layer managed by Kubernetes. This further frees up a data engineer to do data engineering and avoids the complex and often time-consuming work of IT administration and cluster management.
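As a configuration sketch of what that looks like in practice: Spark ships native Kubernetes support in `spark-submit`, where the driver and executors run as pods. The API server URL, namespace, image name, and application path below are placeholders for your environment, not values from this article.

```shell
# Sketch: submitting a Spark application to a Kubernetes cluster.
# All URLs, names, and paths are illustrative placeholders.
spark-submit \
  --master k8s://https://kube-apiserver.example.com:6443 \
  --deploy-mode cluster \
  --name etl-job \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.namespace=data-eng \
  --conf spark.kubernetes.container.image=my-registry/spark:3.1.1 \
  local:///opt/spark/jobs/etl_job.py
```

Kubernetes schedules the driver and executor pods onto available nodes and reclaims the resources when the job finishes, which is exactly the cluster-management work the paragraph above says data engineers no longer have to do by hand.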
Spark is a tool that was created not only to solve the problem of data engineering, but also to be accessible and useful to people further down the data pipeline. So while Spark was designed for data engineers, it actually increases the number of people who can take advantage of data. By providing scalable compute with scalable tool sets, Spark empowers engineers to empower others to get the most from data. Maybe, then, not only is Spark a data engineer’s best friend, but everyone’s best friend?
To see how Spark enables a variety of users to deliver workloads, machine learning, analytics, and data engineering, check out my blog: Ready to Become a Superhero? Build an ML model with Spark on HPE Ezmeral now.
About Don Wake
Don has spent the past 20 years building, testing, marketing and selling enterprise storage, networking and compute solutions in the rapidly evolving information technology industry. Today, he focuses on HPE Ezmeral: the ultimate toolkit for managing, deploying, running, and monitoring data-centric applications across software and hardware architectures in the cloud, on-premises, and at the edge.
Copyright © 2021 IDG Communications, Inc.