bbenzikry/spark-eks

Examples and custom spark images for working with the spark-on-k8s operator on AWS.
It allows using Spark 2 with IRSA, and Spark 3 with IRSA and AWS Glue as a metastore.
Note: Spark 3 images also include the relevant jars for working with the S3A committers.
If you're looking for the Spark 3 custom distributions, you can find them here.
Note: Spark 2 images will not be updated; please see the FAQ.
- Deploy the spark-on-k8s operator using the helm chart and the patched operator image
bbenzikry/spark-eks-operator:latest
Suggested values for the helm chart can be found in the flux example.
Note: Do not create the spark service account automatically as part of the chart installation.
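A minimal values sketch for the chart might look like the following; the key names are based on the upstream spark-operator helm chart and may differ between chart versions, and the namespace is a placeholder, so prefer the values in the flux example:
```yaml
# hypothetical values.yaml sketch -- key names follow the upstream
# spark-operator chart and may differ between chart versions
image:
  repository: bbenzikry/spark-eks-operator
  tag: latest
# the mutating webhook is what applies service accounts, security contexts and config maps to pods
webhook:
  enable: true
# namespace where SparkApplication objects will be submitted (placeholder)
sparkJobNamespace: SPARK_JOB_NAMESPACE
# per the note above, the chart should not create the spark service account
serviceAccounts:
  spark:
    create: false
```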
- Create an AWS role for the driver
- Create an AWS role for the executors (a sketch of one way to create both roles follows below)
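One possible way to create both roles is eksctl's IRSA support. The sketch below is purely illustrative and not part of this repo; the cluster name, region, namespace and policy ARNs are placeholders:
```yaml
# rough eksctl sketch; everything here is illustrative
# note: eksctl also creates and annotates the Kubernetes service accounts,
# which overlaps with the annotation steps described below
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster
  region: us-west-2
iam:
  withOIDC: true
  serviceAccounts:
    - metadata:
        name: spark                      # driver service account
        namespace: SPARK_JOB_NAMESPACE
      attachPolicyARNs:
        - "arn:aws:iam::ACCOUNT_ID:policy/spark-driver-policy"
    - metadata:
        name: default                    # executor service account
        namespace: SPARK_JOB_NAMESPACE
      attachPolicyARNs:
        - "arn:aws:iam::ACCOUNT_ID:policy/spark-executor-policy"
```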
- Add an EKS role annotation to the default service account used by executors in your spark job namespace (optional)
```yaml
# NOTE: Only required when not building spark from source or using a version of spark < 3.1.
# In 3.1, executor roles will rely on the driver definition. At the moment they execute with the default service account.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
  namespace: SPARK_JOB_NAMESPACE
  annotations:
    # can also be the driver role
    eks.amazonaws.com/role-arn: "arn:aws:iam::ACCOUNT_ID:role/executor-role"
```
- Make sure the spark service account (used by driver pods) is configured with an EKS role as well
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: SPARK_JOB_NAMESPACE
  annotations:
    eks.amazonaws.com/role-arn: "arn:aws:iam::ACCOUNT_ID:role/driver-role"
```
For spark < 3.0.0, see spark2.Dockerfile
For spark 3.0.0+, see spark3.Dockerfile
For pyspark, see pyspark.Dockerfile
Below are examples for the latest versions.
If you want to use pinned versions, all images are tagged by the commit SHA.
You can find a full list of tags here.
```dockerfile
# spark2
FROM bbenzikry/spark-eks:spark2-latest
# spark3
FROM bbenzikry/spark-eks:spark3-latest
# pyspark2
FROM bbenzikry/spark-eks:pyspark2-latest
# pyspark3
FROM bbenzikry/spark-eks:pyspark3-latest
```
```yaml
hadoopConf:
  # IRSA configuration
  "fs.s3a.aws.credentials.provider": "com.amazonaws.auth.WebIdentityTokenCredentialsProvider"
driver:
  .....
  labels:
    .....
  serviceAccount: SERVICE_ACCOUNT_NAME
  # See: https://github.com/kubernetes/kubernetes/issues/82573
  # Note: securityContext has changed in recent versions of the operator to podSecurityContext
  podSecurityContext:
    fsGroup: 65534
```
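For context, a fuller sketch of a SparkApplication that uses one of the prebuilt Spark 3 images, the IRSA credentials provider and the bundled S3A committers might look like the following. The application name, main class/file and committer settings are illustrative; the committer keys come from the generic Spark/Hadoop cloud committer docs, not from this repo:
```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: my-spark-app                     # placeholder
  namespace: SPARK_JOB_NAMESPACE
spec:
  type: Scala
  mode: cluster
  sparkVersion: "3.0.0"                  # match the image you use
  image: bbenzikry/spark-eks:spark3-latest
  mainClass: com.example.MyApp           # placeholder
  mainApplicationFile: "s3a://my-bucket/my-app.jar"   # placeholder
  hadoopConf:
    # IRSA configuration
    "fs.s3a.aws.credentials.provider": "com.amazonaws.auth.WebIdentityTokenCredentialsProvider"
    # optional: use the S3A magic committer jars shipped in the spark3 images
    "fs.s3a.committer.name": "magic"
    "fs.s3a.committer.magic.enabled": "true"
  sparkConf:
    # standard cloud-committer bindings from the Spark/Hadoop documentation
    "spark.sql.sources.commitProtocolClass": "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol"
    "spark.sql.parquet.output.committer.class": "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter"
  driver:
    serviceAccount: spark
    podSecurityContext:
      fsGroup: 65534
  executor:
    instances: 2
```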
- Make sure your driver and executor roles have the relevant glue permissions
```
{
  /* Example below depicts the IAM policy for accessing db1/table1.
     Modify this as you deem worthy for spark application access. */
  Effect: "Allow",
  Action: [
    "glue:*Database*",
    "glue:*Table*",
    "glue:*Partition*"
  ],
  Resource: [
    "arn:aws:glue:us-west-2:123456789012:catalog",
    "arn:aws:glue:us-west-2:123456789012:database/db1",
    "arn:aws:glue:us-west-2:123456789012:table/db1/table1",
    "arn:aws:glue:eu-west-1:123456789012:database/default",
    "arn:aws:glue:eu-west-1:123456789012:database/global_temp",
    "arn:aws:glue:eu-west-1:123456789012:database/parquet",
  ],
}
```
- Make sure you are using the patched operator image
- Add a config map to your spark job namespace as defined here
```yaml
apiVersion: v1
data:
  hive-site.xml: |-
    <configuration>
      <property>
        <name>hive.imetastoreclient.factory.class</name>
        <value>com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory</value>
      </property>
    </configuration>
kind: ConfigMap
metadata:
  namespace: SPARK_JOB_NAMESPACE
  name: spark-custom-config-map
```
In order to submit an application with glue support, you need to add a reference to the config map in your SparkApplication spec.
```yaml
kind: SparkApplication
metadata:
  name: "my-spark-app"
  namespace: SPARK_JOB_NAMESPACE
spec:
  sparkConfigMap: spark-custom-config-map
```
Where can I find a Spark 2 build with Glue support?
As Spark 2 becomes less and less relevant, I opted not to add glue support for it. You can take a look here for a reference build script which you can use to build a Spark 2 distribution to use with the Spark 2 dockerfile.
Why a patched operator image?
The patched image is a simple implementation for properly working with custom configuration files with the spark operator. It may be added as a PR in the future, or another implementation will take its place. For more information, see the related issue kubeflow/spark-operator#216.