
This is my second post about Spark on Kubernetes. I want to share my experience of reducing the cost of Spark computation in the cloud. It can be expensive, but the steps below can cut the cost by 60-70%. I am using Spark version 3.3.1.
1. If you run your research in client mode from an IPython notebook, use dynamic allocation. With this configuration, executor pods are created only while a computation is running and are stopped once they are idle.
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.shuffleTracking.enabled true
spark.dynamicAllocation.shuffleTracking.timeout 120
spark.dynamicAllocation.minExecutors 0
spark.dynamicAllocation.maxExecutors 10
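For a notebook session, the same settings can be applied when the session is created. Below is a minimal sketch, not a definitive setup: the helper names `dynamic_allocation_conf` and `build_session` and the app name are hypothetical. It also writes the shuffle tracking timeout with an explicit seconds suffix (`120s`), on the assumption that seconds were intended. A bare number may be read in a different time unit, so check the Spark configuration docs for your version.

```python
def dynamic_allocation_conf(max_executors=10, shuffle_timeout="120s"):
    """Return the dynamic allocation settings from this post as a dict."""
    return {
        "spark.dynamicAllocation.enabled": "true",
        # Shuffle tracking lets dynamic allocation work without an
        # external shuffle service.
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        # Assumption: explicit seconds suffix; the post uses a bare 120.
        "spark.dynamicAllocation.shuffleTracking.timeout": shuffle_timeout,
        "spark.dynamicAllocation.minExecutors": "0",
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }


def build_session(conf, app_name="notebook-research"):
    """Apply the settings to a SparkSession builder (requires pyspark)."""
    # Imported lazily so the dict helper above works without pyspark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


conf = dynamic_allocation_conf()
for key, value in conf.items():
    print(key, value)
```

In a notebook you would then call `spark = build_session(conf)` together with your usual Kubernetes master and image settings.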
2. Using spot nodes for executors significantly reduces costs (spot nodes are 60-90% cheaper than on-demand nodes). To target a spot node group, label it, for example spark: spot. The driver should still run on on-demand nodes, because losing the driver ends the whole application, while lost executors can be replaced.
If you are running in client mode, set the following configuration:
spark.kubernetes.executor.node.selector.spark spot # the last part of the property name is the label key and the value is the label value; in my case k=spark, v=spot
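The property name is built from the node label key, which is easy to get wrong by hand. Here is a small sketch that derives the property from a label dict. The helper name `node_selector_conf` is hypothetical. The property prefixes are the node selector settings documented for Spark 3.3.

```python
def node_selector_conf(role, labels):
    """Map Kubernetes node labels to Spark node selector properties.

    role is "driver" or "executor"; labels maps label key -> label value.
    """
    if role not in ("driver", "executor"):
        raise ValueError("role must be 'driver' or 'executor'")
    prefix = f"spark.kubernetes.{role}.node.selector."
    return {prefix + key: value for key, value in labels.items()}


executor_conf = node_selector_conf("executor", {"spark": "spot"})
print(executor_conf)
# {'spark.kubernetes.executor.node.selector.spark': 'spot'}
```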
If you are using Spark Operator, use the following configuration settings:
spec:
  driver:
    nodeSelector:
      key1: value1
      key2: value2
  executor:
    nodeSelector:
      key1: value1
      key2: value2
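To make the placeholders concrete, here is a sketch that builds that part of a SparkApplication spec with the driver on on-demand nodes and the executors on spot nodes. It assumes nodeSelector takes a mapping of label keys to values, as in a Kubernetes pod spec. The `spark: on-demand` label and the helper name are hypothetical; only `spark: spot` comes from this post.

```python
import json


def spark_application_node_selectors(driver_labels, executor_labels):
    """Build the nodeSelector part of a SparkApplication spec as a dict."""
    return {
        "spec": {
            "driver": {"nodeSelector": dict(driver_labels)},
            "executor": {"nodeSelector": dict(executor_labels)},
        }
    }


fragment = spark_application_node_selectors(
    driver_labels={"spark": "on-demand"},  # hypothetical on-demand group label
    executor_labels={"spark": "spot"},     # spot group label from this post
)
# JSON is also valid YAML, so the output can be merged into a manifest.
print(json.dumps(fragment, indent=2))
```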
P.S. Use the volume mount from the next point to keep executors' temporary results in case of a spot node interruption.
3. Mount an SSD volume to the executors. As mentioned above, this keeps executor temporary results in case of a spot node interruption. An SSD volume also speeds up reading and writing the temporary files that Spark saves on disk. You can use the following configuration settings:
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName OnDemand
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.storageClass gp # your cloud SSD storage class
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.sizeLimit 100Gi
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path /data
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly false
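All five properties share the same prefix and volume name, so they can be generated from a few parameters. The sketch below uses a hypothetical helper, `executor_pvc_conf`, with this post's values as defaults. The volume name is a parameter because the Spark docs describe volumes whose name starts with `spark-local-dir-` as being used for executor local storage, so you may prefer such a name. That is a suggestion to verify against your own setup, not something I tested here.

```python
def executor_pvc_conf(volume_name="data", storage_class="gp",
                      size_limit="100Gi", mount_path="/data"):
    """Return executor PVC volume settings for one on-demand claim."""
    prefix = (
        "spark.kubernetes.executor.volumes.persistentVolumeClaim."
        f"{volume_name}."
    )
    return {
        # "OnDemand" asks Spark to create a claim per executor.
        prefix + "options.claimName": "OnDemand",
        prefix + "options.storageClass": storage_class,
        prefix + "options.sizeLimit": size_limit,
        prefix + "mount.path": mount_path,
        prefix + "mount.readOnly": "false",
    }


for key, value in executor_pvc_conf().items():
    print(key, value)
```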
4. Use the values recommended in "Learning Spark" for these settings:
spark.shuffle.file.buffer 1m
spark.file.transferTo false
spark.shuffle.unsafe.file.output.buffer 1m
spark.io.compression.lz4.blockSize 512k
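If you submit jobs with spark-submit instead of a notebook, the same key/value pairs can be passed as `--conf` flags. Here is a small sketch of that conversion. The `IO_CONF` name and the `to_submit_args` helper are hypothetical; the values are the ones listed above.

```python
IO_CONF = {
    "spark.shuffle.file.buffer": "1m",
    "spark.file.transferTo": "false",
    "spark.shuffle.unsafe.file.output.buffer": "1m",
    "spark.io.compression.lz4.blockSize": "512k",
}


def to_submit_args(conf):
    """Turn a conf dict into a flat list of spark-submit --conf arguments."""
    args = []
    for key in sorted(conf):  # sorted for a stable, readable order
        args += ["--conf", f"{key}={conf[key]}"]
    return args


print(" ".join(to_submit_args(IO_CONF)))
```

The resulting list can be appended to a `spark-submit` command line, for example through `subprocess.run`.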
In conclusion, following the steps above can significantly reduce the cost of running Spark computations in the cloud. Spot nodes for executors are 60-90% cheaper than on-demand nodes, dynamic allocation means you pay for executors only while they compute, and SSD volume mounts keep temporary results when a spot node is interrupted. The settings recommended in "Learning Spark" can additionally help performance. Test any configuration thoroughly before rolling it out to your users.
Resources:
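To sanity-check numbers like these for your own cluster, a back-of-the-envelope estimate is enough. The sketch below is purely illustrative. The prices, the 70% spot discount and the 50% busy fraction are made-up inputs. It also assumes one executor per node and that idle spot nodes are scaled down by your cluster autoscaler.

```python
def hourly_cost(on_demand_price, executors, spot_discount, busy_fraction):
    """Estimate hourly cost of one on-demand driver plus executor nodes.

    on_demand_price: price of one node per hour (hypothetical units)
    spot_discount:   fraction saved on spot nodes, e.g. 0.7 for 70% cheaper
    busy_fraction:   share of the hour executors exist (dynamic allocation)
    """
    driver = on_demand_price  # the driver stays on an on-demand node
    executor_nodes = (
        executors * on_demand_price * (1 - spot_discount) * busy_fraction
    )
    return driver + executor_nodes


# Baseline: 10 always-on executors on on-demand nodes.
baseline = hourly_cost(1.0, executors=10, spot_discount=0.0, busy_fraction=1.0)
# Optimized: spot executors that exist only half of the time.
optimized = hourly_cost(1.0, executors=10, spot_discount=0.7, busy_fraction=0.5)
saving = 1 - optimized / baseline
print(f"baseline={baseline:.2f} optimized={optimized:.2f} saving={saving:.0%}")
# baseline=11.00 optimized=2.50 saving=77%
```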
https://spot.io/blog/how-to-run-spark-on-kubernetes-reliably-on-spot-instances/
https://aws.amazon.com/blogs/compute/running-cost-optimized-spark-workloads-on-kubernetes-using-ec2-spot-instances/
https://spark.apache.org/docs/latest/running-on-kubernetes.html
https://www.oreilly.com/library/view/learning-spark-2nd/9781492050032/
P.S. My first post about Spark on k8s:
How to run Spark on kubernetes in jupyterhub
https://dev.to/akoshel/spark-on-k8s-in-jupyterhub-1da2