feat(server): refactor concurrency model and enhance task management #3257

Draft

sachaarbonel wants to merge 1 commit into ggml-org:master from sachaarbonel:simple-task-queue-server

Conversation

sachaarbonel (Contributor) commented Jun 16, 2025 (edited)

This pull request makes significant enhancements to the server implementation in whisper.cpp, especially around concurrency and request handling. Here’s a summary of the key changes:

Major Changes

1. Multi-threaded Task Queue for Inference

  • Introduces a new WhisperTaskQueue class, which manages a queue of inference tasks and processes them using multiple worker threads (the number of workers is configurable via a new --workers server argument).
  • All incoming POST requests for transcription are now handled asynchronously via this queue, allowing multiple requests to be processed in parallel.
  • Each task encapsulates all request information and handles its own completion signaling and abort reasons (e.g., client disconnect, server shutdown).
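The queue-plus-workers pattern described above can be sketched as follows. This is a minimal illustration only, not the PR's actual WhisperTaskQueue: the class name `TaskQueue` and the `std::function` payload are simplifications, and the real implementation carries per-task request state and abort signaling.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal worker-pool queue: N threads drain a shared queue of tasks.
class TaskQueue {
public:
    explicit TaskQueue(size_t n_workers) {
        for (size_t i = 0; i < n_workers; ++i) {
            workers_.emplace_back([this] { run(); });
        }
    }

    ~TaskQueue() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_all();
        for (auto & w : workers_) {
            w.join(); // workers drain remaining tasks before exiting
        }
    }

    void push(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
                if (stop_ && tasks_.empty()) {
                    return; // shut down only once the queue is drained
                }
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task(); // run the (potentially long) inference outside the lock
        }
    }

    std::queue<std::function<void()>> tasks_;
    std::vector<std::thread>          workers_;
    std::mutex                        mutex_;
    std::condition_variable           cv_;
    bool                              stop_ = false;
};
```

The key design point mirrored from the description: the lock is held only while touching the queue, so worker threads execute inference concurrently with new requests being accepted.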

2. Graceful Shutdown and Robust Handling

  • The server now handles shutdowns more gracefully: ongoing and queued tasks are notified and can be aborted with meaningful error responses (503 for server shutdown, 499 for client disconnect).
  • The task queue is properly destroyed and re-created when a new model is loaded or on shutdown, ensuring clean resource management.
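Per-task completion signaling of this kind is commonly modeled with a promise/future pair. The sketch below is a hedged illustration of the idea, not the PR's code: `TaskResult`, `Task`, and `abort_for_shutdown` are hypothetical names, and the real task type carries the full request and response state.

```cpp
#include <cassert>
#include <future>
#include <string>

// Hypothetical result type; the PR maps abort reasons to HTTP statuses
// (server shutdown -> 503, client disconnect -> 499).
struct TaskResult {
    int         status;
    std::string body;
};

// A task owns a promise; the HTTP handler blocks on the matching future.
struct Task {
    std::promise<TaskResult> done;
};

// On graceful shutdown, pending tasks are completed with an error
// response instead of being silently dropped.
inline void abort_for_shutdown(Task & t) {
    t.done.set_value({503, "server is shutting down"});
}
```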

3. Refactoring of Main Server Logic

  • The request handler for audio transcription (POST /inference_path) is refactored to create a fresh copy of the parameter set for each request, improving thread safety.
  • The actual inference is now performed inside the worker threads, so the server remains responsive to new requests.
  • The code for response formatting (plain text, SRT, VTT, verbose JSON) is encapsulated within the task result logic.
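The "fresh copy of the parameter set per request" idea is simple but worth making concrete. In the sketch below, `whisper_params_stub` and `make_request_params` are hypothetical stand-ins for the server's real `whisper_params` struct and handler code:

```cpp
#include <cassert>
#include <string>

// Stand-in for the server's whisper_params (hypothetical fields).
struct whisper_params_stub {
    int         n_threads = 4;
    std::string language  = "en";
};

// Each request works on its own copy, so concurrent handlers never
// mutate shared default state.
whisper_params_stub make_request_params(const whisper_params_stub & defaults) {
    whisper_params_stub p = defaults; // fresh copy per request
    // per-request overrides from the multipart form would be applied here
    return p;
}
```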

4. Command-Line Interface and Usability Improvements

  • Adds a --workers N argument to control the number of concurrent worker threads for handling requests.
  • Usage help text and parameter parsing are updated accordingly.
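A minimal sketch of parsing such a flag, under the assumption of a simple argv scan (`parse_workers` is a hypothetical helper, not the server's actual parser, which handles many more options):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Scan the argument list for "--workers N"; fall back to a default of 1.
int32_t parse_workers(const std::vector<std::string> & args, int32_t def = 1) {
    for (size_t i = 0; i + 1 < args.size(); ++i) {
        if (args[i] == "--workers") {
            return static_cast<int32_t>(std::stoi(args[i + 1]));
        }
    }
    return def;
}
```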

5. Minor Fixes and Improvements

  • Swaps the use of std::atomic_flag for a more idiomatic std::atomic<bool> for the termination signal.
  • Reduces use of global and shared state, improving modularity and reliability.

Example: New Option

```shell
./server --workers 4
```

This will launch the server with 4 worker threads for handling requests concurrently.

danbev (Collaborator) left a comment:

I've done an initial pass over this and wanted to leave some early feedback (or feedback in chunks, at least, so there isn't too much in one go), and I'll go through this in more detail.

```diff
@@ -41,10 +43,10 @@ const std::string vjson_format = "verbose_json";
const std::string vtt_format = "vtt";

std::function<void(int)> shutdown_handler;
std::atomic_flag is_terminating = ATOMIC_FLAG_INIT;
```
danbev (Collaborator):

I think that std::atomic_flag might be more appropriate here, as it is guaranteed to be lock-free. And it is only used in a signal handler, where we don't want any potential blocking. So I'd prefer to keep this as is.
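For context on this point: std::atomic_flag is the only standard atomic type guaranteed to be lock-free, which is what makes it safe to set from a signal handler. A minimal sketch of the pattern being defended (`on_sigint` is a hypothetical handler name, not code from the PR):

```cpp
#include <atomic>
#include <cassert>
#include <csignal>
#include <cstdlib>

// Guaranteed lock-free, hence safe to touch from a signal handler.
std::atomic_flag is_terminating = ATOMIC_FLAG_INIT;

extern "C" void on_sigint(int /*sig*/) {
    // test_and_set returns the previous value: false on the first signal.
    if (is_terminating.test_and_set()) {
        // Second Ctrl-C while shutting down: exit immediately.
        // std::_Exit is async-signal-safe, unlike std::exit.
        std::_Exit(1);
    }
    // First signal: the flag is now set; the main loop polls it and
    // begins a graceful shutdown.
}
```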

```cpp
const httplib::Request* request_ptr; // For abort callback
std::atomic<AbortReason> abort_reason{AbortReason::NotAborted};
std::atomic<bool>* stop_flag_ptr{nullptr};
```

danbev (Collaborator):

Nit: remove empty space (not showing here, but it is visible using command-line git). There are a few more of these in the file, but I won't list them all, to avoid cluttering the review.

```cpp
WhisperTask() = default;

// Move constructor
WhisperTask(WhisperTask&& other) noexcept
```
danbev (Collaborator):

Nit: remove trailing whitespace (not showing here, but it is visible using command-line git). There are a few more of these in the file, but I won't list them all, to avoid cluttering the review.

```cpp
std::vector<std::vector<float>> pcmf32s;
whisper_params params;
std::string filename;
const httplib::Request* request_ptr; // For abort callback
```
danbev (Collaborator):

Nit: Pointers/references are preferred to be "non-leaning". There are a few more of these.


```cpp
class WhisperTaskQueue {
public:
WhisperTaskQueue(struct whisper_context* ctx, size_t n_workers = 2)
```
danbev (Collaborator):

Just wondering about the default of 2 for n_workers, as the default value is 1 elsewhere in the code; perhaps this should also be 1?

```cpp
wparams.token_timestamps = !task.params.no_timestamps && task.params.response_format == vjson_format;
wparams.no_context = task.params.no_context;
wparams.suppress_nst = task.params.suppress_nst;
```
danbev (Collaborator):

Perhaps also add VAD parameters here, similar to what is done in:

```cpp
wparams.vad = params.vad;
wparams.vad_model_path = params.vad_model.c_str();
wparams.vad_params.threshold = params.vad_threshold;
wparams.vad_params.min_speech_duration_ms = params.vad_min_speech_duration_ms;
wparams.vad_params.min_silence_duration_ms = params.vad_min_silence_duration_ms;
wparams.vad_params.max_speech_duration_s = params.vad_max_speech_duration_s;
wparams.vad_params.speech_pad_ms = params.vad_speech_pad_ms;
wparams.vad_params.samples_overlap = params.vad_samples_overlap;
```

```cpp
std::mutex queue_mutex_;
std::condition_variable queue_cv_;
std::mutex whisper_mutex_; // Protect whisper context access
std::atomic<bool> stop_flag_;
```
danbev (Collaborator):

I think stop_flag_ could just be a bool, as the accesses to it are all guarded by the queue_cv_ lock.
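The simplification being suggested can be sketched as follows: when every read and write of the flag happens under the mutex that the condition variable waits on, a plain bool is sufficient and atomicity buys nothing. The `Shutdown` struct below is an illustrative stand-in, not the PR's code:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

struct Shutdown {
    std::mutex              m;
    std::condition_variable cv;
    bool                    stop = false; // plain bool: always accessed under m

    void request() {
        {
            std::lock_guard<std::mutex> lock(m);
            stop = true;
        }
        // Notify after releasing the lock so woken waiters can proceed.
        cv.notify_all();
    }

    void wait() {
        std::unique_lock<std::mutex> lock(m);
        // The predicate is re-checked under the lock on every wakeup,
        // which is what makes the unsynchronized-looking read safe.
        cv.wait(lock, [this] { return stop; });
    }
};
```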

sachaarbonel (Contributor, Author):

Let's hold off on this one. I ran some benchmarks and tracing with NVIDIA Nsight, and the CPU is not the bottleneck.


sachaarbonel marked this pull request as draft on June 18, 2025 at 08:53.
Reviewers

danbev left review comments.

At least 1 approving review is required to merge this pull request.


2 participants: @sachaarbonel, @danbev
