NB: I have opened this Pull Request as adraft since I intend to continue working on it by improving the logging output.
Since the builds are started concurrently, the build progress information is pretty much useless and flickers a lot.

Problem

The current implementation builds every single profile synchronously, including remote builds.
Since remote builds were introduced back in#175, remote builds could be pipelined to deploy
new configurations in a more time-efficient manner.

Solution

Build configurations that support remote builds concurrently.

Sidenote: I have decided to continue building local builds in a synchronous manner because I have run into hardware deadlocks previously when trying to evaluate and/or build multiple systems at the same time.

I have tested this code by deploying to my personal infrastructure (https://github.com/philtaken/dotfiles) and it works as intended.

PhilTaken requested a review fromrvem

April 18, 2024 21:39

rvem reviewed

Apr 22, 2024

View reviewed changes

src/cli.rs Outdated

		// await both the remote builds and the local builds to speed up deployment times
		try_join!(
		// remote builds can be run asynchronously since they do not affect the local machine
		try_join_all(remote_builds.into_iter().map(\|data\|async{

Copy link

Member

rvemApr 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Fromtry_join_all docs:

If any future returns an error then all other futures will be canceled and an error will be returned immediately

IMO, it might be better to wait for all futures to be completed/failed instead of canceling everything on the first failure, thus the non-failed builds will complete despite the error in an unrelated build. What do you think?

Copy link

Member

rvemApr 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Since the builds are started concurrently, the build progress information is pretty much useless and flickers a lot.

I'm not sure how viable it is to implement, but perhaps, it's possible to collectstdout andstderr for each future, if it is, we can collect and report detailed output synchronously once all remote builds are completed

Copy link

CollaboratorAuthor

PhilTakenApr 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

IMO, it might be better to wait for all futures to be completed/failed instead of canceling everything on the first failure, thus the non-failed builds will complete despite the error in an unrelated build. What do you think?

I'll see how easy it would be to implement that, if it is that might be worth checking out so that consecutive builds don't have to build as much.

perhaps, it's possible to collect stdout and stderr for each future, if it is, we can collect and report detailed output synchronously once all remote builds are completed

I was considering that maybe one line from the bottom per build would be neat e.g.:

[previous output]
host 1: [build progress]
host 2: [build progress]

This will, however, require some more fine grained control over the terminal output i.e. raw output instead of cooked.

Copy link

Member

rvemApr 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

would be neat

Agree, but indeed sounds quite non-trivial

Copy link

Member

rvemApr 22, 2024•
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Oh, another somewhat conceptual concern. Profiles can potentially depend on each other (seeprofilesOrder flake option), so perhaps it's worth doing the parallelization on per-host basis instead of per-profile

PhilTaken force-pushed thephil/async-build-and-push branch frombb4a111 to1cc6e35Compare

June 10, 2024 17:48

Copy link

CollaboratorAuthor

PhilTaken commentedJun 10, 2024

mentioning#46 for visibility

the current version works as expected, the only issue is the log flickering with multiple invocations of nix writing to stdout in parallel. I was considering utilising the raw-internal-json output similarly to how nix-output-monitor does it but that might be out of scope here 🤔

Copy link

CollaboratorAuthor

PhilTaken commentedJun 10, 2024

activation is fully synchronous but that is usually the part that takes the least amount of time

rvem reviewed

Jun 14, 2024

View reviewed changes

src/cli.rs Outdated

		async{
		// run local builds synchronously to prevent hardware deadlocks
		for data in&local_builds{
		deploy::push::build_profile(data).await.unwrap();

Copy link

Member

rvemJun 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I'm not a huge fan of using "partial functions", isn't unhandled panic fromunwrap going to kill the main thread?
Also, AFAICS, at the moment we completely ignore non-remote build/push results

Copy link

Member

rvem commentedJun 14, 2024

The new approach seems good 👍

PhilTaken force-pushed thephil/async-build-and-push branch from52f5c53 todd7ec8cCompare

June 19, 2024 09:39

PhilTaken requested a review fromrvem

June 20, 2024 18:54

PhilTakenand others added4 commits

September 11, 2024 13:24

build and push profiles asynchronously

91ce286

build remote builds (remote_build = true) asynchronously to speedup the deployment process.local builds should not be run asynchronously to prevent running intohardware deadlocks

group remote builds by host

a70a339

continue building when a remote build fails and report errors afterwards

c00037e

add indicatif progress indicators for remote and local builds

42ae995

PhilTaken force-pushed thephil/async-build-and-push branch fromdd7ec8c to42ae995Compare

September 13, 2024 05:59

Copy link

CollaboratorAuthor

PhilTaken commentedSep 13, 2024•
edited
Loading

there are a still lot of unwraps that need to be handled better, this pr is still far from finished but progress has been made:

I added some progress indicators using the indicatif crate for each concurrent build 🥳

This is what they look like (image the spinner spinning and the text changing as nix builds stuff:

For remote and builds each host gets its own progress spinner that is reused for all of that hosts profiles since they are built in order anyways. This could be adjusted for multi-profile hosts since indicatif also supports nested tree-like progress bars.

I changed some data structures around to make working with them in async contexts easier.
The nested references are somewhat ok until you have to make them work in a async context where you need send objects. deep cloning by hand really isn't a great solution and it really doesnt make any difference performance-wise to add some clones here and there.
Nobody is going to deploy to >10k hosts and if they would, cloning the deploy data a few times would be the least of their problems.
This also removes a bunch of generic lifetime specifiers which I am not sad about.

PhilTaken added2 commits

September 18, 2024 13:25

wrap the logger with a custom implementation

02b9681

the custom implementation handles indicatif's progressbars better so as to notleave orphaned progress bar fragments when logging while a progressbar is active

improved errer handling for async deployments

7fa8db8

Copy link

CollaboratorAuthor

PhilTaken commentedSep 18, 2024

With these two last commits I would declare this PR somewhat usable 🥳

If anyone wants to try it out with their own setups I would be very happy to hear how it works for you, I am always open for feedback of any kind

Copy link

ManoftheSea commentedSep 18, 2024

Hi all,
I tried this out, with my setup in ManoftheSea/SeaofDirac/trunk, on targets [littlecreek, crunchbits, technetium]. I saw that it ran the checks for all the systems, which meant building them locally, then proceeded with pushing profiles to the systems in parallel (which resulted in building them remotely, I believe). After all three were ready, they deployed in sequence, and as it was just an update, they all completed happily.

This works well for these three systems, which are a pair of VPSes and a local (to me) server; were I trying to deploy all the systems in my home network, I would prefer instead that my development machine (personal laptop) be able to push the profiles to a single server (technetium), so that it can then act as a substituter for e.g. aluminium and nickel. Maybe provide a "build-host" attribute per-host, with a default of the hostself? e.g. deploy.nickel.buildHost = "technetium"; deploy.littlecreek.buildHost = "littlecreek";

PhilTaken marked this pull request as ready for review

September 30, 2024 11:27

Copy link

ChecksumDev commentedNov 24, 2024

Hi there, our team atPyro utilized this PR and observed a massive speedup deploying to a our node cluster and for our initial deployments we see a huge speed increase of at minimum 2x, however there might be a problem.

I'm deploying from Ubuntu WSL, so that might be the culprit but it seems after every deployment in the same terminal session only half will be working in parallel, and then half again, until its back to single threaded (even though the spinners are still there). It might not be halfing but there is at minimum diminishing returns over multiple command invocations.

Here's a successful one if you're curious, I haven't been able to record the above yet though.
I figured with such a big cluster our network would likely be prime testing ground for this PR.

I'm really excited for this to get merged and I'll test anything necessary to speed things along.

https://asciinema.org/a/dirahL9YYWFaT0lORkU98nX8s

P.s. A huge potential speedup would be processing checks in a parallel fashion as well, as that takes up a huge chunk of time, though I'm not sure how viable this is.

Copy link

CollaboratorAuthor

PhilTaken commentedNov 26, 2024•
edited
Loading

Hi there, our team atPyro utilized this PR and observed a massive speedup deploying to a our node cluster and for our initial deployments we see a huge speed increase of at minimum 2x

oh that is great to hear! thanks for the feedback.

it seems after every deployment in the same terminal session only half will be working in parallel, and then half again, until its back to single threaded (even though the spinners are still there). It might not be halfing but there is at minimum diminishing returns over multiple command invocations.

hmm, I'll have to investigate how that might happen. I'll add some more debug output for each thread so it'll easier to debug on your side. This pr pretty much just spawns a bunch of threads using high-level abstractions so I'm curious how that might happen 🤔

P.s. A huge potential speedup would be processing checks in a parallel fashion as well, as that takes up a huge chunk of time, though I'm not sure how viable this is.

unfortunately, the checks are limited by nix's eval speed. there is some work going on on detsys's side (see e.g.https://determinate.systems/posts/parallel-nix-eval/) but afaiu nix flake checks are currently single-threaded in nature.

you could attempt to get around that by checking/building all systems in parallel batches usinggnu parallel although I am not sure how feasible that is. I do not really have access to hardware that could build even two systems in parallel...

ChecksumDev approved these changes

Dec 28, 2024

View reviewed changes

Copy link

ChecksumDev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Since my previous comment, we've continued using this pull request, now with approximately 30 nodes, and have demonstrated no further problems. To us, at least, it seems to have identical parity to its single-threaded counterpart onmain.

Performance might be improvable and the visual aspects could be worked on, but so far nothing has broken down and it's been smooth sailing, so I'd call this pretty stable! 👀

rvem reviewed

Jan 14, 2025

View reviewed changes

Copy link

Member

rvem left a comment•
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

It took me waaay too much time to get back to this PR, I'm sorry about that:(
I tested these changes, UX looks neat, code changes themselves look good too, though I have a few minor concerns/suggestions

src/push.rsShow resolvedHide resolved

src/cli.rs

		Err(ref e) =>{
		// TODO set style to red X
		pb.finish_with_message(format!("Error: {}", e.to_string()));
		}

Copy link

Member

rvemJan 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Is there a reason whyres is not returned similarly to how it's done with remote builds (in line 666 😈)?
AFAIU, currently the error from pushing will be displayed, but won't be reported further

Copy link

CollaboratorAuthor

PhilTakenJan 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

good point, just requires a faux future

src/push.rsShow resolvedHide resolved

Copy link

CollaboratorAuthor

PhilTaken commentedJan 22, 2025

just a small update for everybody interested in this: I am still working on it, didnt have all too much time recently. This weekend i've got some more time so I'll address all of rvem's points above 🫡

Copy link

ChecksumDev commentedJan 23, 2025

Thank you for the update! Our team was considering taking over this PR and aiding its development, but its good to know there's still activity. We use this branch religiously and are free to test any new functionality / bug fixes.

change: address pr feedback (squash this commit)

2483a1e

Copy link

CollaboratorAuthor

PhilTaken commentedJan 28, 2025

We use this branch religiously and are free to test any new functionality / bug fixes.

that is a huge number of hosts, thanks for a large scale test on this!

did you encounter the issue with the deployments not running in parallel again? I could not find anything in the code that could lead to such an issue

feat: checkmarks (success) / red cross (error) indicators for finishe…

b20db77

…d / aborted deployments

Copy link

cyplo commentedNov 7, 2025

This looks really good ! Any further help testing needed to land this ?

Labels

None yet

Movatterモバイル変換

add support for asynchronous building and pushing of profiles#271

Are you sure you want to change the base?

add support for asynchronous building and pushing of profiles#271

Uh oh!

Conversation

PhilTaken commentedApr 18, 2024

Problem

Solution

Uh oh!

rvemApr 22, 2024

Choose a reason for hiding this comment

Uh oh!

rvemApr 22, 2024

Choose a reason for hiding this comment

Uh oh!

PhilTakenApr 22, 2024

Choose a reason for hiding this comment

Uh oh!

rvemApr 22, 2024

Choose a reason for hiding this comment

Uh oh!

rvemApr 22, 2024• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

PhilTaken commentedJun 10, 2024

Uh oh!

PhilTaken commentedJun 10, 2024

Uh oh!

rvemJun 14, 2024

Choose a reason for hiding this comment

Uh oh!

rvem commentedJun 14, 2024

Uh oh!

PhilTaken commentedSep 13, 2024• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

PhilTaken commentedSep 18, 2024

Uh oh!

ManoftheSea commentedSep 18, 2024

Uh oh!

ChecksumDev commentedNov 24, 2024

Uh oh!

PhilTaken commentedNov 26, 2024• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

ChecksumDev left a comment

Choose a reason for hiding this comment

Uh oh!

rvem left a comment• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

rvemJan 10, 2025

Choose a reason for hiding this comment

Uh oh!

PhilTakenJan 28, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

PhilTaken commentedJan 22, 2025

Uh oh!

ChecksumDev commentedJan 23, 2025

Uh oh!

PhilTaken commentedJan 28, 2025

Uh oh!

cyplo commentedNov 7, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

rvemApr 22, 2024•
edited
Loading

PhilTaken commentedSep 13, 2024•
edited
Loading

PhilTaken commentedNov 26, 2024•
edited
Loading

rvem left a comment•
edited
Loading