Hi Evan - thank you for opening this discussion!
The test code snippet is great - would you be willing to contribute that to the project? Also, I need to make sure that when you used #580, you also built notebook and jupyter_client with the corresponding PRs (#4479 and #428, respectively). I assume that’s the case.
I also need to make sure that the kernelspecs `python3` and `spark_python` are indeed using the `KubernetesProcessProxy` and not getting launched as kernels local to EG. I assume each gets launched into its respective pod, etc.
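For reference, here's a quick way to spot-check that (just a sketch; the kernels directory path is an assumption based on a typical install location):

```python
# Sketch: list each kernelspec and the process proxy it declares, so any kernelspec
# without a process_proxy stanza (i.e., a local launch) stands out. The glob path
# below is an assumption about where the kernelspecs live.
import glob
import json

for path in glob.glob("/usr/local/share/jupyter/kernels/*/kernel.json"):
    with open(path) as f:
        spec = json.load(f)
    proxy = spec.get("metadata", {}).get("process_proxy", {}).get("class_name", "<local>")
    print(f"{path}: {proxy}")
```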
Your post strikes me as being more about concurrency than scalability - although the two are closely related. Once kernels have started, EG really doesn’t need to do much from a scalability perspective: kernel management is essentially a hash map of kernel manager instances that relay kernel inputs/outputs between the front end and the kernel, and the web handlers tend to be async already, so the footprint is relatively small. However, as you have found, concurrent starts can be an issue - thus the need to introduce more EG instances, once we’ve optimized for the single-instance case, of course.
Yes, the current “HA/DR” story is Active/Passive and that shouldn’t be conflated with scalability. I totally agree we need to get to an Active/Active configuration that will alleviate both HA and scale issues.
Process Starvation:
I like the idea of introducing a *submitter* or *launcher* pod, although I’m not sure it doesn’t just move the problem into that pod. How did you conclude that this approach might be beneficial? Is k8s imposing process limits on a given pod? Are there areas we can address in lieu of this? That is, there may still be async-related changes to make, since those PRs haven’t been fully exercised - for example, I’m not sure shutdown has been fully “asynced” in them. In the meantime, it might be helpful to insert a delay prior to shutdown in the test, such that all kernels are running simultaneously before the first is shut down.
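Something along these lines is what I have in mind for the test tweak (a rough sketch, assuming the test drives EG’s REST API directly; the endpoint, kernel name, counts, and soak time below are placeholders):

```python
# Sketch: start kernels concurrently, then hold them all running for a while before
# tearing any down, so startup and shutdown paths don't interleave.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

EG_URL = "http://localhost:8888"   # assumption: local EG endpoint
KERNEL_NAME = "spark_python"
NUM_KERNELS = 20
SOAK_SECONDS = 60                  # keep all kernels alive before the first shutdown

def start_kernel(_):
    resp = requests.post(f"{EG_URL}/api/kernels", json={"name": KERNEL_NAME})
    resp.raise_for_status()
    return resp.json()["id"]

# Start kernels concurrently, as in the original test.
with ThreadPoolExecutor(max_workers=NUM_KERNELS) as pool:
    kernel_ids = list(pool.map(start_kernel, range(NUM_KERNELS)))

# New step: let every kernel run simultaneously before shutting any down.
time.sleep(SOAK_SECONDS)

for kernel_id in kernel_ids:
    requests.delete(f"{EG_URL}/api/kernels/{kernel_id}").raise_for_status()
```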
I’m assuming this pod (deployment or service?) would be a singleton and not created/deleted on each kernel start request. The overall kernel start time probably won’t improve since the same sequence must occur, whether from EG or not. What’s nice is that inserting this kind of mechanism doesn’t affect kernel discovery. Keep in mind that in Spark 3, the plan is to provide the ability for users to bring their own pod templates - although launch would still go through spark-submit, a pre-processing step to substitute parameters into the template would need to occur (see #559) - as a result, I think the same pod could handle both styles of launch.
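On the template-substitution piece, I’m picturing something as simple as this (purely illustrative; the placeholder names and the `${...}` convention are assumptions, not what #559 will necessarily settle on):

```python
# Sketch: substitute kernel-specific parameters into a user-supplied pod template
# before handing it to spark-submit or the Kubernetes API. Values are hypothetical.
from string import Template

pod_template = Template("""
apiVersion: v1
kind: Pod
metadata:
  name: ${kernel_pod_name}
  namespace: ${kernel_namespace}
spec:
  containers:
    - name: kernel
      image: ${kernel_image}
""")

rendered = pod_template.substitute(
    kernel_pod_name="kernel-abc123",        # hypothetical values
    kernel_namespace="enterprise-gateway",
    kernel_image="elyra/kernel-py:dev",
)
print(rendered)
```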
I’m also wondering, if this really comes down to handing the launch off to another entity with a queue in between, whether it could be implemented generically via threading. That would allow other platforms (on-prem YARN, Docker, etc.) to leverage the feature. Perhaps most of the initial effort could be done in this manner.
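Roughly what I mean by “generic via threading” (names like `KernelLaunchQueue` are made up for the sketch, not existing EG classes):

```python
# Sketch: a generic "launcher with a queue in between" done with threads, so any
# process proxy (k8s, YARN, docker, local) could sit behind it.
import queue
import threading

class KernelLaunchQueue:
    """Limit/serialize concurrent kernel launches without blocking the web handlers."""

    def __init__(self, launch_fn, num_workers=4):
        self._launch_fn = launch_fn          # e.g., the process proxy's launch routine
        self._requests = queue.Queue()
        self._workers = [
            threading.Thread(target=self._worker, daemon=True) for _ in range(num_workers)
        ]
        for worker in self._workers:
            worker.start()

    def submit(self, kernel_spec, **kwargs):
        """Called from the handler; returns immediately while the launch is queued."""
        self._requests.put((kernel_spec, kwargs))

    def _worker(self):
        while True:
            kernel_spec, kwargs = self._requests.get()
            try:
                self._launch_fn(kernel_spec, **kwargs)
            finally:
                self._requests.task_done()
```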
Session Persistence:
Yes - the ultimate target for this is a lightweight NoSQL DB like Redis, MongoDB (see #562), or Cloudant (CouchDB), and that is the driving force behind the recent refactoring of the KernelSessionManager. I think the big effort here is removing EG’s dependency on Kernel Gateway and subsuming the websocket personality handlers so we can add 404 handling that then hits the persisted session storage (whether that be Redis or NFS, etc.). Some of the issues are discussed in #562 and #594. Regardless, we’ll want to continue using sticky sessions so as not to thrash within a given kernel’s session.
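For a sense of scale, a Redis-backed flavor of the KernelSessionManager could be quite small (a sketch only; the class name, key scheme, and method names are assumptions, and the real refactoring discussed in #562 may look quite different):

```python
# Sketch: persist kernel session state to Redis instead of the local file system.
import json

import redis  # assumption: redis-py is available

class RedisKernelSessionManager:
    def __init__(self, url="redis://localhost:6379/0"):
        self._client = redis.Redis.from_url(url)

    def save_session(self, kernel_id, session_info):
        self._client.set(f"eg:kernel_session:{kernel_id}", json.dumps(session_info))

    def load_session(self, kernel_id):
        raw = self._client.get(f"eg:kernel_session:{kernel_id}")
        return json.loads(raw) if raw else None

    def delete_session(self, kernel_id):
        self._client.delete(f"eg:kernel_session:{kernel_id}")
```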
I’m not sure we need to worry about references throughout the server. Since sessions are sticky, it’s really a matter of establishing the in-memory state from the persisted storage within the handlers when they detect that the kernel_id can’t be found (i.e., 404 handling). Once established, those sessions should be usable. EG really just uses the kernel manager state; sessions are not sent from the client via NB2KG, although users issuing requests directly against the REST API can start a kernel via a POST to /api/sessions.
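Conceptually, the 404 handling inside the handlers would look something like this (a sketch; `hydrate_from_session` and the settings keys below are hypothetical hooks, not existing APIs):

```python
# Sketch: before treating an unknown kernel_id as a 404, consult the persisted
# session storage and re-establish the in-memory state if the kernel is known there.
from tornado import web

class KernelActionHandler(web.RequestHandler):  # simplified stand-in for the real handler
    def prepare(self):
        kernel_id = self.path_kwargs.get("kernel_id")
        km = self.settings["kernel_manager"]            # MappingKernelManager-like object
        if kernel_id and kernel_id not in km:
            session_manager = self.settings["kernel_session_manager"]
            persisted = session_manager.load_session(kernel_id)
            if persisted is None:
                raise web.HTTPError(404, f"Kernel {kernel_id} not found")
            km.hydrate_from_session(kernel_id, persisted)  # hypothetical hook
```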
One area affected as a side effect of this is idle kernel culling, since the kernel’s last activity is stored in the EG process. This is another reason sticky sessions are critical: a kernel’s management should only switch servers if the original server has gone down. If idle culling is enabled, this implies the idle period would reset when the kernel switches servers - which is probably correct, since the new server doesn’t load the kernel’s session state unless there’s activity against it. Where idle culling would misbehave is if the idle timeout were to occur after the server has died but before the kernel got loaded by another instance; in that case, the kernel would not be culled.
Thank you for looking into this! It’s an extremely important topic - one that larger installations can definitely benefit from.