Posted on Apr 1, 2023 • Edited on Apr 7, 2023

Optimize Spark on Kubernetes

This is my second post about Spark on Kubernetes. I want to share my experience of reducing the cost of Spark computation in the cloud. It can be expensive, but the bill can be cut by 60-70%. I am using Spark version 3.3.1.

1. If you run your research in client mode from an IPython notebook, use dynamic allocation. With this configuration, executor pods are created only while there is work to compute and are removed once they go idle.

spark.dynamicAllocation.enabled                     true
spark.dynamicAllocation.shuffleTracking.enabled     true
spark.dynamicAllocation.shuffleTracking.timeout     120
spark.dynamicAllocation.minExecutors                0
spark.dynamicAllocation.maxExecutors                10
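When the session is started from a notebook, it can help to keep these settings in one place and turn them into `--conf` arguments. The sketch below is only an illustration: the names `DYNAMIC_ALLOCATION` and `to_conf_args` are made up for this example, and the values are copied from the settings above.

```python
# Dynamic allocation settings from this post, kept as plain strings.
DYNAMIC_ALLOCATION = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    # Copied as-is from the post. A bare number is read in the option's
    # default time unit, so add an explicit suffix (for example "120s")
    # if you mean seconds.
    "spark.dynamicAllocation.shuffleTracking.timeout": "120",
    "spark.dynamicAllocation.minExecutors": "0",
    "spark.dynamicAllocation.maxExecutors": "10",
}


def to_conf_args(settings):
    """Turn a {key: value} mapping into ['--conf', 'key=value', ...]."""
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Print the arguments as they would appear on a spark-submit command line.
    print(" ".join(to_conf_args(DYNAMIC_ALLOCATION)))
```

In PySpark the same pairs can instead be passed one by one to `SparkSession.builder.config(key, value)` before calling `getOrCreate()`.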

2. Using spot nodes for executors significantly reduces costs (spot nodes are 60-90% cheaper than on-demand nodes). Create a spot node group and label it, for example spark: spot. The driver should still run on on-demand nodes.

If you are running in client mode, set the following configuration:

# the node label is given as key and value; in my case k=spark, v=spot
spark.kubernetes.executor.node.selector.spark      spot
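The label key becomes part of the setting name, which is easy to get wrong by hand. A minimal sketch of building it from the label, assuming a helper name (`executor_node_selector`) invented for this example:

```python
def executor_node_selector(label_key, label_value):
    """Return the Spark setting that pins executors to nodes with this label."""
    if not label_key or not label_value:
        raise ValueError("both the label key and the label value are required")
    # The label key is the last part of the setting name, the value is the label value.
    return {f"spark.kubernetes.executor.node.selector.{label_key}": label_value}


if __name__ == "__main__":
    # The spot node group in this post is labelled spark=spot.
    for key, value in executor_node_selector("spark", "spot").items():
        print(key, value)
```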

If you are using Spark Operator, use the following configuration settings:

spec:
  driver:
    nodeSelector:
      key1: value1
      key2: value2
  executor:
    nodeSelector:
      key1: value1
      key2: value2

P.S. Use the volume mount from the next point to keep executors' temporary results in case of a spot node interruption.
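Because the driver stays on on-demand nodes, the overall saving is smaller than the spot discount itself. A rough back-of-the-envelope sketch follows; the function name and the 90% executor share are assumptions made up for illustration, not measured figures.

```python
def overall_saving(executor_share, spot_discount):
    """Fraction of the compute bill saved when only executors move to spot.

    executor_share: fraction of the on-demand bill spent on executors (0..1)
    spot_discount:  how much cheaper spot nodes are than on-demand (0..1)
    """
    for name, value in (("executor_share", executor_share),
                        ("spot_discount", spot_discount)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    # The driver's share of the bill is unchanged; only executors get cheaper.
    return executor_share * spot_discount


if __name__ == "__main__":
    # Hypothetical job: executors are 90% of the bill, spot is 60-90% cheaper.
    for discount in (0.6, 0.9):
        print(f"spot discount {discount:.0%}: "
              f"overall saving {overall_saving(0.9, discount):.0%}")
```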

3. Mount an SSD volume on the executors. As mentioned above, this keeps executor temporary results in case of a spot node interruption. An SSD volume is the best choice because it speeds up writing and reading the temporary files that Spark saves on disk. You can use the following configuration settings:

spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName    OnDemand
# your cloud SSD storage class
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.storageClass    gp
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.sizeLimit    100Gi
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path    /data
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly    false
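All five settings share one long prefix that contains the volume name, so they can be generated from a few parameters. The sketch below assumes a helper name (`executor_pvc_conf`) invented for this example and reuses the values from the settings above. One thing to check for your setup: the Spark documentation says a volume is used as local scratch storage when its name starts with `spark-local-dir-`, while this post uses the name `data`.

```python
def executor_pvc_conf(volume_name, mount_path, storage_class, size_limit,
                      read_only=False):
    """Build the executor persistentVolumeClaim settings for one volume."""
    prefix = ("spark.kubernetes.executor.volumes."
              f"persistentVolumeClaim.{volume_name}")
    return {
        # "OnDemand" asks Spark to create a claim per executor.
        f"{prefix}.options.claimName": "OnDemand",
        f"{prefix}.options.storageClass": storage_class,
        f"{prefix}.options.sizeLimit": size_limit,
        f"{prefix}.mount.path": mount_path,
        # Spark expects the lowercase strings "true" / "false".
        f"{prefix}.mount.readOnly": str(read_only).lower(),
    }


if __name__ == "__main__":
    # Same values as in the post: volume "data", class "gp", 100Gi at /data.
    for key, value in executor_pvc_conf("data", "/data", "gp", "100Gi").items():
        print(key, value)
```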

4. These are the recommended values from "Learning Spark":

spark.shuffle.file.buffer                           1m
spark.file.transferTo                               false
spark.shuffle.unsafe.file.output.buffer             1m
spark.io.compression.lz4.blockSize                  512k
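If you keep settings like these in a `spark-defaults.conf` file, each line holds a key and a value separated by whitespace. A small sketch that renders such lines from a dictionary; the names `IO_SETTINGS` and `render_spark_defaults` are hypothetical and exist only in this example.

```python
# The four values listed above, copied as strings.
IO_SETTINGS = {
    "spark.shuffle.file.buffer": "1m",
    "spark.file.transferTo": "false",
    "spark.shuffle.unsafe.file.output.buffer": "1m",
    "spark.io.compression.lz4.blockSize": "512k",
}


def render_spark_defaults(settings):
    """Render settings as 'key<spaces>value' lines, one per setting."""
    if not settings:
        return ""
    # Pad every key to the same width so the values line up in a column.
    width = max(len(key) for key in settings) + 4
    return "\n".join(f"{key.ljust(width)}{value}"
                     for key, value in settings.items())


if __name__ == "__main__":
    print(render_spark_defaults(IO_SETTINGS))
```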

In conclusion, the steps above can significantly reduce the cost of running Spark computations in the cloud. Dynamic allocation stops you paying for idle executors, spot nodes are 60-90% cheaper than on-demand nodes, and SSD volume mounts keep temporary results across spot interruptions while speeding up disk I/O. The values recommended in "Learning Spark" can further help performance. Test any configuration thoroughly on your own workloads before rolling it out, so that the savings do not come at the expense of the people using the cluster.

Resources:
https://spot.io/blog/how-to-run-spark-on-kubernetes-reliably-on-spot-instances/
https://aws.amazon.com/blogs/compute/running-cost-optimized-spark-workloads-on-kubernetes-using-ec2-spot-instances/
https://spark.apache.org/docs/latest/running-on-kubernetes.html
https://www.oreilly.com/library/view/learning-spark-2nd/9781492050032/

P.S. My first post about spark on k8s
How to run Spark on kubernetes in jupyterhub
https://dev.to/akoshel/spark-on-k8s-in-jupyterhub-1da2
