feat(control-plane)!: add support for handling multiple events in a single invocation #4603
Conversation
```ts
// Add the context to all child loggers
childLoggers.forEach((childLogger) => {
  childLogger.addPersistentLogAttributes({
```
I was getting warnings about mixing this with `persistentKeys`. The function says

```ts
/**
 * @deprecated This method is deprecated and will be removed in the future major versions,
 * please use {@link appendPersistentKeys() `appendPersistentKeys()`} instead.
 */
addPersistentLogAttributes(attributes: LogKeys): void;
```

so I've switched it!
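For reference, a minimal sketch of the replacement call, assuming Powertools for AWS Lambda (TypeScript) v2; the service name and attribute key shown here are illustrative, not taken from the PR:

```ts
import { Logger } from '@aws-lambda-powertools/logger';

const logger = new Logger({ serviceName: 'scale-up' });
const childLoggers = [logger.createChild()];

// appendPersistentKeys() is the non-deprecated replacement for
// addPersistentLogAttributes(); the attribute here is illustrative.
childLoggers.forEach((childLogger) => {
  childLogger.appendPersistentKeys({ 'github-event': 'workflow_job' });
});
```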
feat(control-plane)!: add support for handling multiple events in a single invocation

Currently we restrict the `scale-up` Lambda to only handle a single event at a time. In very busy environments this can prove to be a bottleneck: there are calls to GitHub and AWS APIs that happen each time, and they can end up taking long enough that we can't process job queued events faster than they arrive.

In our environment we are also using a pool, and typically we have responded to the alerts generated by this (SQS queue length growing) by expanding the size of the pool. This helps because we will more frequently find that we don't need to scale up, which allows the lambdas to exit a bit earlier, so we can get through the queue faster. But it makes the environment much less responsive to changes in usage patterns.

At its core, this Lambda's task is to construct an EC2 `CreateFleet` call to create instances, after working out how many are needed. This is a job that can be batched. We can take any number of events, calculate the diff between our current state and the number of jobs we have, capping at the maximum, and then issue a single call.

The thing to be careful about is how to handle partial failures, if EC2 creates some of the instances we wanted but not all of them. Lambda has a configurable function response type which can be set to `ReportBatchItemFailures`. In this mode, we return a list of failed messages from our handler and those are retried. We can make use of this to give back as many events as we failed to process.

Now that we're potentially processing multiple events in a single Lambda invocation, one thing we should optimise for is not recreating GitHub API clients. We need one client for the app itself, which we use to find out installation IDs, and then one client for each installation which is relevant to the batch of events we are processing. This is done by creating a new client the first time we see an event for a given installation.

We also remove the same `batch_size = 1` constraint from the `job-retry` Lambda and make it configurable instead, using AWS's default of 10 for SQS if not configured. This Lambda is used to retry events that previously failed. However, instead of reporting failures to be retried, here we maintain the pre-existing fault-tolerant behaviour where errors are logged but explicitly do not cause message retries, avoiding infinite loops from persistent GitHub API issues or malformed events.

Tests are added for all of this.
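A minimal sketch of the partial-batch-failure shape this describes, assuming the standard `aws-lambda` type definitions; `scaleUpBatch` is a hypothetical stand-in for the real scale-up logic, not the PR's actual code:

```ts
import type { SQSBatchResponse, SQSEvent, SQSRecord } from 'aws-lambda';

// Hypothetical stand-in for the real scale-up logic: process the whole batch
// with one CreateFleet call and return the IDs of messages we failed to handle.
async function scaleUpBatch(records: SQSRecord[]): Promise<string[]> {
  return []; // success case: nothing needs to be retried
}

export async function handler(event: SQSEvent): Promise<SQSBatchResponse> {
  const failedMessageIds = await scaleUpBatch(event.Records);

  // With ReportBatchItemFailures enabled on the event source mapping, only the
  // messages listed here go back to the queue for retry; the rest are deleted.
  return {
    batchItemFailures: failedMessageIds.map((id) => ({ itemIdentifier: id })),
  };
}
```

And a sketch of the per-installation client reuse, where `GitHubClient` and `createInstallationClient` are illustrative placeholders for the Lambda's real Octokit setup:

```ts
// `GitHubClient` and `createInstallationClient` are illustrative placeholders.
type GitHubClient = { installationId: number };

async function createInstallationClient(installationId: number): Promise<GitHubClient> {
  // The real code would authenticate as the GitHub App installation here.
  return { installationId };
}

const clientCache = new Map<number, GitHubClient>();

// Create a client the first time we see an event for a given installation,
// then reuse it for every later event in the batch.
async function getInstallationClient(installationId: number): Promise<GitHubClient> {
  const cached = clientCache.get(installationId);
  if (cached !== undefined) return cached;

  const client = await createInstallationClient(installationId);
  clientCache.set(installationId, client);
  return client;
}
```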
Some small tweaks to the style of `ScaleError`, while we're working on it.

`message` doesn't need to be a public property: `Error` already has it. Likewise with `stack`; we can use the parent's implementation.

Instead of `getDetailedMessage` we can have a property getter, which is very slightly nicer to read at call sites (`e.detailedMessage` vs. `e.getDetailedMessage()`).
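As a sketch of the shape described (the `failedInstanceCount` field is mentioned elsewhere in this PR; the message format here is an assumption):

```ts
class ScaleError extends Error {
  // `message` and `stack` come from Error itself; we only add what's new.
  constructor(
    message: string,
    readonly failedInstanceCount: number,
  ) {
    super(message);
    this.name = 'ScaleError';
  }

  // A getter is slightly nicer at call sites:
  // `e.detailedMessage` rather than `e.getDetailedMessage()`.
  get detailedMessage(): string {
    return `${this.message} (${this.failedInstanceCount} instance(s) not created)`;
  }
}
```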
This function is getting a little complex to read now. Refactoring to use the guard clause pattern makes the flow clearer. Now it's more like a sequence of checks followed by a final `throw`.
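A generic illustration of the guard-clause pattern (the function and checks are hypothetical, not the PR's actual code):

```ts
// Each guard returns early when there is nothing to do; only a genuine
// partial failure falls through to the final throw.
function checkFleetResult(requested: number, created: number): void {
  if (requested === 0) return; // nothing was asked for
  if (created >= requested) return; // everything we wanted was created

  throw new Error(`EC2 created only ${created} of ${requested} requested instances`);
}
```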
… tests

This lets us move some logic from `processFleetResult`, simplifying it. We're additionally adding some bounds checking here, and testing it all.
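The bounds check could look something like this sketch (the names are assumptions): clamp the request between zero and the remaining capacity, treating `-1` as unlimited, as the diff later in this conversation does:

```ts
function boundedInstanceCount(wanted: number, currentRunners: number, maximumRunners: number): number {
  // -1 means "no maximum", so there is nothing to cap against.
  const capacity = maximumRunners === -1 ? Number.POSITIVE_INFINITY : maximumRunners - currentRunners;
  // Never negative, never above the remaining headroom.
  return Math.max(0, Math.min(wanted, capacity));
}
```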
We only construct this once in the production code - there's no need tohave it be variable.
We were grouping all events we were called for, and only looking at the non-`aws:sqs` events to warn about them. This can be done as we go, and then we only need to keep the SQS events in memory.
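A sketch of that streaming filter; the helper name and warning text are illustrative:

```ts
import type { SQSEvent, SQSRecord } from 'aws-lambda';

// Warn about unexpected sources as we iterate, keeping only the SQS records
// in memory rather than grouping every event up front.
function sqsRecords(event: SQSEvent, warn: (message: string) => void): SQSRecord[] {
  const records: SQSRecord[] = [];
  for (const record of event.Records) {
    if (record.eventSource !== 'aws:sqs') {
      warn(`ignoring event from unexpected source: ${record.eventSource}`);
      continue;
    }
    records.push(record);
  }
  return records;
}
```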
Pull Request Overview
This PR adds support for handling multiple events in a single Lambda invocation to improve throughput in busy environments. Previously, the scale-up Lambda was restricted to processing only one event at a time, which could create bottlenecks when GitHub and AWS API calls took too long.
- Event batching: Enables processing multiple SQS messages in a single Lambda invocation with configurable batch sizes
- Partial failure handling: Implements `ReportBatchItemFailures` to retry only failed messages rather than entire batches
- GitHub client optimization: Reuses GitHub API clients across events for the same installation to reduce API overhead
Reviewed Changes
Copilot reviewed 26 out of 26 changed files in this pull request and generated 5 comments.
Summary per file:
| File | Description |
|---|---|
| variables.tf | Adds new variables for configuring Lambda event source mapping batch size and batching window |
| modules/runners/scale-up.tf | Updates Lambda event source mapping to support batch processing with ReportBatchItemFailures |
| lambdas/functions/control-plane/src/scale-runners/scale-up.ts | Major refactor to handle batched events, optimize GitHub client usage, and return failed message IDs |
| lambdas/functions/control-plane/src/lambda.ts | Updates Lambda handler to process SQS batches and return batch item failures |
| lambdas/functions/control-plane/src/scale-runners/ScaleError.ts | Enhanced to support batch failure reporting with failed instance counts |
Copilot correctly pointed out that it could be confusing to log `failedInstanceCount` using the key `missingInstanceCount` when there is a variable with that name.
npalm commented on Sep 3, 2025

@ScottGuymer @stuartp44 wondering, do you also see a use case for your deployments?
npalm commented on Sep 3, 2025

@iainlane thanks for the updates, is the PR ready from your point of view?
iainlane commented on Sep 3, 2025

As far as I'm concerned, yeah 👍
```ts
// Don't call the EC2 API if we can create an unlimited number of runners.
const currentRunners =
  maximumRunners === -1 ? 0 : (await listEC2Runners({ environment, runnerType, runnerOwner: group })).length;
```
Nice, this will partly address issue #4710.
npalm commented on Sep 8, 2025

@stuartp44 @ScottGuymer kind reminder: would this PR also help you? Any time to run a test on your deployments?
npalm commented on Sep 8, 2025

@iainlane I ran a quick test; in general I saw no issues. But I want to go over some logs before approving the PR. Although it should not impact any of the current usages, the change is significant.
npalm commented on Oct 3, 2025

Note to myself: requires more testing, since the last test had some failures.
guicaulada commented on Oct 16, 2025

Hi @npalm, could you elaborate on what the failures seem to be? Our environment keeps increasing in size, and the ability to process events more quickly is becoming increasingly important to maintain responsiveness. Appreciate any details!
iainlane commented on Nov 2, 2025

That sounds like a good idea - I think people should definitely be aware that this is happening 👍. Might even merge it and make a beta / prerelease, so big orgs like our two can try it in production to see if there are any remaining issues that only show up at big scale?
npalm commented on Nov 2, 2025

I have just tested the PR at small scale; it all seems to be working as expected.
npalm commented on Nov 2, 2025

@iainlane @guicaulada since the SQS message event is changed, the PR can create a tiny disruption during an update. For that reason I think we should mark it as a breaking change. What do you think?
npalm left a comment
Tested with a multi-runner setup and pools. Looks all good.
iainlane commented on Nov 2, 2025

Yeah, not a bad idea. It's a big and still somewhat risky change. Could be an idea to do a beta / prerelease and let our orgs try it in prod for a bit before making a stable release.
npalm commented on Nov 3, 2025

I have no setup for a beta stream, so if you have tested on your side, I think we can merge it. In case of any problems we can set the message batch size to 1, and it should work as before.
guicaulada commented on Nov 3, 2025

I agree. I think merging it as a major release and marking it as a breaking change is the way to go if there's no beta stream. We should start running this at large scale soon, and if there are any new issues we will be quick to address them. Anyone updating to the new major release should be aware of this change and take the necessary measures to ensure minimal disruption.

I have tested with several parallel workflows of 200 jobs each; the only issue I found is when I hit

We don't have
Tests in a private repo (sorry) look good. This was running ephemeral runners with no pool, SSM high throughput enabled, the job queued check _dis_abled, batch size of 200, wait time of 10 seconds. The workflow runs are each a matrix with 250 jobs.