Spark on Kubernetes(k8s)

Sandip Roy
6 min readMar 22, 2022

Co Authored by Joydeep Das

What is Spark?

Apache Spark is a framework used in cluster computing environments for analyzing big data. This platform became widely popular due to its ease of use and the improved data processing speeds over Hadoop.

Apart from earlier cluster managers (Standalone cluster manager, Hadoop Yarn, Apache Mesos), Spark has added support for a new cluster manager for Kubernetes from version 2.3 onwards and so Spark can now run on clusters managed by Kubernetes, using the native Kubernetes scheduler.

What is Kubernetes?

Kubernetes, often abbreviated as “K8s”, is an open-source container orchestration platform designed to automate the deployment, scaling, and management of containerized applications.

It distributes application workloads across a Kubernetes cluster and automates dynamic container networking needs. Kubernetes also allocates storage and persistent volumes to running containers, provides automatic scaling, and works continuously to maintain the desired state of applications, providing resiliency.

Kubernetes and containers allow for much better resource utilization than hypervisors and VMs do. Because containers are so lightweight, they require less CPU and memory resources to run, thus making the process more cost efficient.

Why Spark on Kubernetes?

Amid a slowdown of the Hadoop big data market and added the fact that people seem way more interested in Kubernetes than in older Hadoop specific technologies like YARN for resource management and orchestration, the fast adoption of cloud native technologies and containerization of applications brewing a perfect storm for the ageing Hadoop stack. Nonetheless projects like Apache Spark are fast adopting by introducing Kubernetes as an alternative to YARN.

spark-submit can be directly used to submit a Spark application to a Kubernetes cluster. The submission mechanism works as follows:

i) Spark creates a Spark driver running within a Kubernetes pod.

ii) The driver creates executors which are also running within Kubernetes pods and connects to them and executes application code.

iii) When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists logs and remains in “completed” state in the Kubernetes API until it’s eventually garbage collected or manually cleaned up.

Spark on Kubernetes — Execution Flow

Note that in the completed state, the driver pod does not use any computational or memory resources. The driver and executor pod scheduling and communication are handled by Kubernetes.

In current scope, we will focus on how to run a sample Spark program in an Azure Kubernetes Service (AKS) cluster. Similarly we can run on EKS (in AWS), GKE (in GCP), RedHat OpenShift or any other standard Kubernetes engine.

Prerequisites:

i) Need Docker installed in local development machine. Download the Docker installer for platform of choice from here. You also need Git, Maven and Scala IDE in your local development machine to build the Spark project.

ii) Need Azure CLI installed in local development machine. You can download and install Azure cli from here

iii) Existing setup for Azure Kubernetes Service (AKS) cluster with Azure Container Registry (ACR) attached to it. Refer here for setup of ACR, AKS and directive on how to attach both together.

iv) Existing Azure storage account.

v) To interact with AKS cluster we will be using kubectl cli. You can download and set up kubectl from here

Detailed Steps

i) Clone Apache Spark project from GitHub and build Docker image in local

You can clone the official Spark repository from here and build it like following:

git clone -b branch-3.0 https://github.com/apache/spark
cd spark
dev/make-distribution.sh -Pkubernetes
cd dist
docker build -t spark:latest -f kubernetes/dockerfiles/spark/Dockerfile .

Or else you can pull ready-made Apache Spark image from DockerHub.

ii) Rename Docker image in local and push it to Azure Container Registry (ACR)

# az acr login — name <ACR container registry name>
az acr login — name depcontreg
docker tag spark:latest depcontreg.azurecr.io/spark-on-aks:v3.0.1
docker push depcontreg.azurecr.io/spark-on-aks:v3.0.1

Spark image installed on ACR

** You can skip step #1 if you already have your Spark docker image available to you — via docker or any other public registry as you can literally import the same image to ACR.

iii) Build the Spark application Jar and upload to Azure storage container

Next, prepare a Spark job. A jar file is used to hold the Spark job and is needed when running the spark-submit command. The jar can be made accessible through a public URL or pre-packaged within a container image.

In this example, a sample jar is created to calculate the value of Pi. This jar is then uploaded to Azure blob storage. If you have an existing jar, feel free to substitute.

ACCOUNT_NAME=depbigdatapractice
CONTAINER_NAME=spark-aks-poc
BLOB_NAME=spark-examples_2.11–2.4.8.jar
FILE_TO_UPLOAD=target/scala-2.11/ spark-examples_2.11–2.4.8.jar
echo “Uploading the file…”
az storage blob upload — account-name $ACCOUNT_NAME — container-name $CONTAINER_NAME — file $FILE_TO_UPLOAD — name $BLOB_NAME
jarUrl=$( az storage blob url — account-name $ACCOUNT_NAME — container-name $CONTAINER_NAME — name $BLOB_NAME | tr -d ‘“‘)

iv) Submit a Spark job into AKS

Before we submit jobs to K8s cluster, we need to get the Kubernetes master URL, that to be used in spark-submit command or else same can be obtained from Azure portal AKS cluster main page.

#az aks get-credentials — resource-group <resource group name> — name <AKS cluster name>
az aks get-credentials — resource-group deppoc — name DEPAKSCluster
kubectl cluster-info

Now we have to create a service account and namespace in AKS cluster. Depending on the version and setup of Kubernetes deployed, this default service account may or may not have the role that allows driver pods to create pods and services under the default Kubernetes RBAC policies. Sometimes users may need to specify a custom service account that has the right role granted. Spark on Kubernetes supports specifying a custom service account to be used by the driver pod through the configuration property spark.kubernetes.authenticate.driver.serviceAccountName=<service account name>

kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role — clusterrole=edit — serviceaccount=default:spark — namespace=default

and then finally submit the Spark job in AKS cluster as below:

#spark-submit — master k8s://< kubernetes control plane server url > — deploy-mode cluster — name <job name> — class <class name> — conf spark.executor.instances=2 — conf spark.kubernetes.authenticate.driver.serviceAccountName=spark — conf spark.kubernetes.container.image=<ACR image path and name> <jarUrl>

spark-submit — master k8s://https://mydepakscluster-dns-7e8ac9ba.hcp.eastus.azmk8s.io:443 — deploy-mode cluster — name spark-pi — class org.apache.spark.examples.SparkPi — conf spark.executor.instances=2 — conf spark.kubernetes.authenticate.driver.serviceAccountName=spark — conf spark.kubernetes.container.image=depcontreg.azurecr.io/spark-on-aks:v3.0.1 https://depbigdatapractice.blob.core.windows.net/spark-aks-poc/spark-examples_2.11-2.4.8.jar

You can see that the Spark job successfully completed below:

v) Get application trace from Spark driver logs

Get the Spark driver log from console

kubectl logs spark-pi-175c287f897a537e-driver

or from Azure Log Analytics through KQL (Kusto Query Language)

Thanks for reading. In case you want to share your case studies or want to connect, please ping me via LinkedIn

--

--

Sandip Roy

Bigdata and Databricks Practice Lead at Wipro Ltd