Hi folks.
I'm trying to connect to a remote Kubernetes-based Spark cluster (AWS EMR on EKS) from R using sparklyr, but I keep running into connection errors. The equivalent PySpark code works as expected. I'm not sure whether this is a limitation of sparklyr with Kubernetes deployments or whether I'm missing something in my configuration.
## What I'm trying to do

### Python Code (Works)

This PySpark code successfully connects to my remote Kubernetes cluster:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("k8s://https://<myurl>:443")
    .appName("TestConnection")
    .config("spark.kubernetes.container.image", "<my-spark-image>")
    .config("spark.executor.instances", "2")
    .config("spark.executor.memory", "4g")
    .config("spark.executor.cores", "2")
    .getOrCreate()
)
sc = spark.sparkContext
```
This works perfectly. It connects to the Kubernetes API server, spawns executor pods, and I can run queries.
### R Code (Fails)

Unfortunately, I come unstuck with the following R equivalent:

```r
library(sparklyr)

sc <- spark_connect(
  master     = "k8s://https://<myurl>:443",
  spark_home = "/usr/lib/spark",
  config     = list(
    spark.kubernetes.container.image = "<my-spark-image>",
    spark.executor.instances         = 2,
    spark.executor.memory            = "4g",
    spark.executor.cores             = 2
  )
)
```
Error:

```
Error in spark_connect_gateway(gatewayAddress, gatewayPort, sessionId, ...) :
  Gateway in localhost:8880 did not respond.
Try running `options(sparklyr.log.console = TRUE)` followed by
`sc <- spark_connect(...)` for more debugging info.
```

Side note: running `options(sparklyr.log.console = TRUE)` didn't actually give me any more info.
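One way to narrow this down (a suggestion I'd add here, not part of the original report) is to check whether plain `spark-submit` from the same machine and `SPARK_HOME` can reach the cluster at all, using the stock SparkPi example from the Spark-on-Kubernetes docs. The master URL and image name are the same placeholders as above; the example-jar path inside the container is an assumption that depends on your Spark image:

```shell
# Smoke test: submit the bundled SparkPi example to the same k8s master.
# If this also fails, the problem is in the Spark/Kubernetes setup
# (credentials, API server URL, container image) rather than in sparklyr.
/usr/lib/spark/bin/spark-submit \
  --master k8s://https://<myurl>:443 \
  --deploy-mode cluster \
  --name sparkpi-smoke-test \
  --conf spark.kubernetes.container.image=<my-spark-image> \
  --conf spark.executor.instances=2 \
  --class org.apache.spark.examples.SparkPi \
  local:///opt/spark/examples/jars/spark-examples.jar 100
```

If SparkPi runs, the cluster side is healthy and the failure is specific to how sparklyr launches its driver.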
## My understanding (with help from Claude)

I must admit that I'm a bit out of my comfort zone here, so I tried to troubleshoot further with Claude 4.5 Sonnet. See the summary below, although I can't speak to its full accuracy.
Claude's Technical Summary:

The issue appears to be an architectural difference between sparklyr and PySpark:

- **sparklyr's gateway architecture:** sparklyr tries to start a local gateway process on `localhost:8880` that acts as a bridge between R and Spark. This architecture works for local, YARN, and Mesos deployments.
- **Kubernetes client mode:** when connecting to a Kubernetes cluster with a `k8s://` master URL, the driver needs to run locally and communicate directly with the Kubernetes API server to spawn executor pods. There's no intermediate gateway in this model.
- **The mismatch:** sparklyr attempts to start its gateway and wait for a response, but the gateway can't establish a connection to a remote Kubernetes cluster. The connection fails before any Spark application is created.

PySpark works because the Python process becomes the Spark driver directly and communicates with Kubernetes natively, without needing a gateway.
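As a small illustrative check of the gateway story above (my addition, so treat it as a sketch): base R can probe whether anything is actually listening on sparklyr's default gateway port while a connection attempt is in flight. Port 8880 is sparklyr's documented default (`sparklyr.gateway.port`); adjust if you've overridden it.

```r
# Probe sparklyr's default gateway port on localhost.
# TRUE  -> something is listening on 8880
# FALSE -> the gateway process never came up (or exited early)
gateway_up <- tryCatch({
  con <- socketConnection(host = "localhost", port = 8880,
                          blocking = TRUE, timeout = 2)
  close(con)
  TRUE
}, error = function(e) FALSE)
print(gateway_up)
```

If this returns `FALSE` even during `spark_connect()`, the gateway JVM is likely failing to start at all, which would point at the launch step rather than at Kubernetes networking.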
## Questions

- Is this a known limitation? Does sparklyr currently support connecting to remote Kubernetes Spark clusters?
- Am I doing something wrong? Is there a different way I should be configuring the connection?
- Are there workarounds?
- Should I rather be using pysparklyr / reticulate?
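On the pysparklyr question, here's a hedged sketch of what that route looks like. Note that pysparklyr connects over Spark Connect (`sc://` masters, Spark 3.4+) rather than the `k8s://` protocol, so it only applies if the cluster exposes a Spark Connect endpoint; the host below is a placeholder I've made up, and 15002 is Spark Connect's default port:

```r
# Sketch, assuming pysparklyr is installed and the cluster runs a
# Spark Connect server. "<connect-endpoint>" is a placeholder, not
# something from my actual setup.
library(sparklyr)
library(pysparklyr)

sc <- spark_connect(
  master  = "sc://<connect-endpoint>:15002",
  method  = "spark_connect",
  version = "3.5"
)
```

Whether EMR on EKS can expose a Spark Connect endpoint at all is part of what I'm asking.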
Any guidance would be much appreciated. Happy to provide more info regarding my setup as needed. Thanks in advance!