This Blog is Co-authored by Sivakumar S
Apache Spark is one of the most preferred open source compute engine for large-scale data processing and transformation. Effective utilisation of Compute Resources(Driver + Executor),mitigating the Task/Data Skew and addressing the High Garbage Collection issues are vital for the Spark Jobs to run at optimal performance for Time & Cost Savings.
There are various Data Profiling tools available which assist in collecting and analysing the Spark Job Metrics and add value by providing the detailed analysis for decision making & performance tuning.
What is a Profiler?
In software engineering, profiling is often referred to “program profiling”, “software profiling” , “Application Profiling” etc. It is a form of dynamic program analysis that measures, for example, the space (memory) or time complexity of a program, the usage of particular instructions, or the frequency and duration of function calls.
Most commonly, profiling information helps the development teams to identify opportunities for program optimisation, Cost Savings and Effective Utilisation of the available resources, and more specifically, performance engineering.
What are the Different Spark Profilers?
We have shared some of the Spark Profilers available.
• Sparklens (https://github.com/qubole/sparklens )
• sparkMeasure ( https://github.com/LucaCanali/sparkMeasure)
• Sparklint (https://github.com/groupon/sparklint )
• Dr. Elephant (https://github.com/linkedin/dr-elephant )
• SparkOscope (https://github.com/ibm-research-ireland/sparkoscope)
In this blog, we will be focusing on Sparklens.
What is Sparklens?
Sparklens is an Open Source Profiling tool with a built-in Spark scheduler simulator written in Scala. It can be used with “ANY” Spark Application. It has been developed and maintained at Qubole.
Sparklens primarily assists in deciphering the Spark Job metrics to understand the scalability limits of Spark Jobs and to identify the various opportunities to effectively tune our Spark Jobs.
It also assists in understanding how efficiently a given Spark application is using the compute resources provided to it. Maybe your application will run faster with more executors or with more Driver memory and may be it won’t. Sparklens can assist in answering this question by looking at a single run of your application.
How to Use Sparklens?
To enable Sparklens for your Spark Jobs/Applications we need to add the following additional Configuration parameters to spark-submit or spark-shell:
— packages qubole:sparklens:0.1.2-s_2.11
— conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener
Each Spark Job submitted with the Sparklens configuration will have the Sparklens report enabled by default. We can access them using the following methods.
How to view/access the Sparklens report?
There are 3 different ways to access the Sparklens Report in a Cloud environment.
• Yarn Resource Manager
Navigate to the Resource Manager Link and access the Spark application logs to view the Sparklens report which is printed with the application logs.
• PHS Server Logs
The PHS (Persistent History Server) logs stores the Sparklens report details in the application History logs. Ideally we will be using PHS logs to analyse the Sparklens reports.
Note: Persistent History Server is a Persistent Cluster which is available 24*7 in a GCP Data Proc environment. It is used to store the application logs of all the applications/jobs triggered in your Data Proc environment.
• GCS Bucket Location
The Sparklens Report for each application ID is stored as a Json file in a GCS bucket configured for Dev / Prod. It can be downloaded and analysed. The GCS bucket location for the Sparklens data is highlighted in Figure 1.
• Sparklens UI
Sparklens reports can be uploaded to the Sparklens report portal to view the reports as charts.
Sparklens Metrics :
Sparklens generates multiple metrics based on the Spark application log. We will now go through some of the Key Metrics of the Sparklens report.
Critical Path :
Critical Path is the Least amount of time our application will take to finish , irrespective of the number of executors/infinite number of executors added to our job.
Critical Path = All the time that is spent in the driver + the time spent by the largest task in each Stage.Figure 3: Critical Path Representation . Source: Qubole BlogFigure 4: Critical Path data from Sparklens Actual Report from a Spark Job
From the above metrics, we could notice that the Total WallClock(execution) Time for an actual Spark Job is 16 m 09 s. However the Critical Path for the Job is 10m 46s. So we have an opportunity to tune the Job to achieve 5 + minutes of time savings for this job.
Ideal Path is the amount of time the job takes to execute if we had perfect parallelism with Zero Skew. In a real world scenario, it would be difficult to meet the ideal path requirements based on the business and data scenarios.
Key Takeaways from Critical Path Metrics:
• In an ideal application, The Driver WallClock Time should be relatively less compared to Executor WallClock Time, If the Driver WallClock Time is higher then analyse and identify options to reduce the Driver computation tasks.
• We can’t make Driver execute faster with more number of executors.
• We can’t make the slowest task of any Stage run faster with more number of executors unless we change the data handled by task by re-partitioning the data.
• Spark application cannot run faster than its critical path, irrespective of the number of executors.
• Spark Application can be made efficient by
• Reducing Driver side computations
• Having enough tasks for given cores
• Mitigating Task Skews
• Reducing the number of executors wherever needed.
Model Simulator Metrics:Figure 5:Model Simulation Metrics
The Model simulator estimates the run time for the Spark application/job and the Model error is the deviation between Model Estimated Run Time and Actual Run time.
Model simulator also estimates the run time of the Job based on the varying Executor counts and Cluster utilisation. We can use this metrics if we plan to have a trade-off between reduced Executor count and a marginal increase in run time based on use case scenarios.
For Ex: In the above Sparklens report in Figure 4, We could notice that with reduced executor capacity (80 % : 160 Executors) the projected run time increases a little higher but if it suits our use case, then we can run the Spark Job with reduced executor capacity.
Per Stage Execution Metrics:Figure 6: Metrics Aggregated for each Stage in Spark Job
Based on the Stage Level aggregated metrics from Sparklens, we could notice that Stage ID: 54 contributed to 27 % of total execution time and it had highest task count 111110. We can evaluate the Stage for any opportunities for tuning.Figure 7: Skew Metrics & Parallelism Ratio aggregated at Stage level for a Spark Job
Per Stage Key Metrics Definitions:Per Stage Key Metrics Definitions
Sparklens provides detailed metrics for the development teams to analyse the application logs and identify opportunities for cost & time optimisations. We have highlighted only the Key metrics as part of this Blog. Please go through the reference sources for detailed information on other metrics generated by Sparklens reports.
Sparklens doesn’t highlight any Code issues which needs to be mitigated to achieve better performance. Instead it assists the development teams to avoid the multiple trial and error approaches that they follow in optimising the Spark Jobs by summarising and highlighting the Key Metrics for efficient analysis.
Sparklens generates the report using the single execution of the Spark Job and it also have a UI interface with Visualisations to analyse the metrics. Data Engineering Leaders & teams which process large scale data using Apache Spark can plan to utilise any Data Profiling tools like Sparklens for effective Performance Engineering. Happy Learning !!
• Qubole Spark Meetup: https://www.youtube.com/watch?v=0a2U4_6zsCc
• The Fifth Elephant Conference(Qubole Sparklens: understanding the scalability limits of Spark applications — Rohit Karlupia) https://www.youtube.com/watch?v=SOFztF-3GGk
Apache Spark Performance Engineering using Sparklens was originally published in Walmart Global Tech Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.