Getting started with Apache Spark + GPU + RAPIDS (part-I)

Prerequisites: This article assumes that the reader is familiar with the architecture and setup of Spark 3.0 and has already worked with Spark 2.0.

The Spark 3.0 release added a new feature to run Spark workloads on GPUs using the RAPIDS library. It helps Data Scientists/Data Engineers create a single pipeline, in a single tool, spanning ETL (Extract Transform Load)/ELT (Extract Load Transform) through model training. The major advantage of this feature is that Spark seamlessly switches processing between CPU and GPU. The proof of concept (PoC) details that follow are aimed at helping you kick-start your journey with Apache Spark + GPU + RAPIDS. In my case, I performed the PoC on a GCP (Google Cloud Platform) based Spark 3.0 Dataproc cluster running Dataproc 2.0.7, CUDA 10.2, RAPIDS 0.18, and Spark RAPIDS 0.4.1; the same PoC can be tried on any other cloud platform. One callout about this article: it is written purely from a Data Engineering perspective on Spark 3.0.

The first step is to enable the RAPIDS library. Setting spark.rapids.sql.enabled to true instructs Spark to utilize the underlying GPU whenever an instruction can run there; instructions that are not supported on the GPU automatically fall back to the CPU. Setting spark.rapids.sql.enabled to false runs your instructions only on the CPU.
Later in the article I will explain, with an example, how to check which parts are executed on the GPU and which on the CPU.
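To make this concrete, here is a minimal sketch of toggling the switch from a spark-shell session at runtime (it assumes the RAPIDS Accelerator jars are already available on the cluster, as they are on a Dataproc image built for RAPIDS):

scala> spark.conf.set("spark.rapids.sql.enabled", "true")  // eligible operators run on GPU
scala> spark.conf.set("spark.rapids.sql.enabled", "false") // everything falls back to CPU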

Now that we are aware of how to enable Spark on GPU, let’s have a look at some important configuration parameters one can play with to optimize processing.

spark.task.resource.gpu.amount : Sets the amount of GPU resource allocated per task

spark.rapids.sql.concurrentGpuTasks : Sets the number of tasks that can run concurrently on each GPU

spark.executor.resource.gpu.amount : Sets the number of GPUs per executor

Ideally, parallelism on CPU should be equal to parallelism on GPU so that resources are not wasted:

GPU parallelism = spark.executor.resource.gpu.amount / spark.task.resource.gpu.amount
CPU parallelism = spark.executor.cores / spark.task.cpus
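Before looking at concrete cases, note that you can sanity-check these two ratios from a running spark-shell session; a minimal sketch (assuming unset values default to 1):

scala> val conf = sc.getConf
scala> val gpuParallelism = conf.getDouble("spark.executor.resource.gpu.amount", 1.0) / conf.getDouble("spark.task.resource.gpu.amount", 1.0)
scala> val cpuParallelism = conf.getDouble("spark.executor.cores", 1.0) / conf.getDouble("spark.task.cpus", 1.0)
scala> println(if (gpuParallelism == cpuParallelism) s"balanced: $gpuParallelism" else s"mismatch: GPU $gpuParallelism vs CPU $cpuParallelism")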

Let’s understand this with an example.

Assume we have a 4-core CPU attached to one GPU. There are three possible cases; let’s run each of them using spark-shell with runtime configuration (the same can be achieved via the configuration file as well):

GPU parallelism = CPU parallelism

spark-shell \
  --conf spark.executor.cores=4 \
  --conf spark.task.resource.gpu.amount=.25 \
  --conf spark.task.cpus=1 \
  --conf spark.executor.instances=4 \
  --conf spark.executor.resource.gpu.amount=1

This is the ideal case: Spark starts without any warning message, since 1/.25 = 4 (GPU parallelism) equals 4/1 = 4 (CPU parallelism).

GPU parallelism < CPU parallelism

spark-shell \
  --conf spark.executor.cores=4 \
  --conf spark.task.resource.gpu.amount=.5 \
  --conf spark.task.cpus=1 \
  --conf spark.executor.instances=4 \
  --conf spark.executor.resource.gpu.amount=1

1/.5 = 2 (GPU parallelism) < 4/1 = 4 (CPU parallelism). Only two tasks at a time can obtain a GPU share, so two of the four CPU task slots per executor sit idle, and Spark warns at startup that resources will be wasted.

GPU parallelism > CPU parallelism

spark-shell \
  --conf spark.executor.cores=4 \
  --conf spark.task.resource.gpu.amount=.25 \
  --conf spark.task.cpus=2 \
  --conf spark.executor.instances=4 \
  --conf spark.executor.resource.gpu.amount=1

1/.25 = 4 (GPU parallelism) > 4/2 = 2 (CPU parallelism). Only two tasks can run concurrently per executor, so the GPU is underutilized.

Please note, setting spark.rapids.sql.concurrentGpuTasks too high can lead to out-of-memory errors or poor performance; generally the value should be between 2 and 4.
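For example, a sketch that extends the balanced configuration from the first case with a conservative concurrency of 2 (tune this for your GPU’s memory):

spark-shell \
  --conf spark.executor.cores=4 \
  --conf spark.task.resource.gpu.amount=.25 \
  --conf spark.task.cpus=1 \
  --conf spark.executor.instances=4 \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.rapids.sql.concurrentGpuTasks=2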

Now that we are aware of the basic configurations required to run an application on Spark, let’s see how to check which parts of a SQL query prevent instructions from being executed on the GPU.

spark.rapids.sql.explain accepts one of ALL, NONE, or NOT_ON_GPU.

Setting spark.rapids.sql.explain = NOT_ON_GPU prints every instruction that cannot be executed on the GPU.

There are various operations/datatypes that are not yet supported on GPU. One such example is the Decimal data type: as of writing this article, Spark on CPU supports 128-bit precision, while on GPU it supports up to 64 bits (up to Decimal(18, x)). So even if your operation is between attributes of Decimal(9, 2) and Decimal(12, 6), Spark inserts PromotePrecision to cast the values so the computation does not overflow. As a result, the output may be Decimal(19, 8), and since that precision is not supported on GPU, the computation will be performed on the CPU.

Let’s take an example where the sales_yrly table contains yearly sales, with the attribute sales_amt defined as decimal(12, 2):

scala> spark.conf.set("spark.rapids.sql.explain", "NOT_ON_GPU")
scala> spark.sql("explain select sum(sales_amt) from sales_yrly").show(false)

With NOT_ON_GPU enabled, the explain output flags each operation that could not be placed on the GPU, along with the reason it could not run there.
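Conversely, setting the mode to ALL (a sketch reusing the same query) prints the placement decision for every operator, which helps confirm that the parts you care about really run on the GPU:

scala> spark.conf.set("spark.rapids.sql.explain", "ALL")
scala> spark.sql("explain select sum(sales_amt) from sales_yrly").show(false)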

Before starting your journey with Spark on RAPIDS, it is beneficial to check which operators are supported and which are not; all such operations are well described on the official documentation page.

This marks the end of part I of this article. In the next article, we will dig deeper into DAGs to understand how operations are performed in Spark 3.0.

References:

https://nvidia.github.io/spark-rapids/
