feat(server): refactor concurrency model and enhance task management #3257

Draft

sachaarbonel wants to merge 1 commit into ggml-org:master from sachaarbonel:simple-task-queue-server

Conversation

sachaarbonel (Contributor) commented Jun 16, 2025 (edited)

This pull request makes significant enhancements to the server implementation in whisper.cpp, especially around concurrency and request handling. Here’s a summary of the key changes:

Major Changes

1. Multi-threaded Task Queue for Inference

  • Introduces a new WhisperTaskQueue class, which manages a queue of inference tasks and processes them using multiple worker threads (the number of workers is configurable via a new --workers server argument).
  • All incoming POST requests for transcription are now handled asynchronously via this queue, allowing multiple requests to be processed in parallel.
  • Each task encapsulates all request information and handles its own completion signaling and abort reasons (e.g., client disconnect, server shutdown).
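The queue-plus-workers pattern described above can be sketched as follows. This is a minimal illustration only, not the PR's actual WhisperTaskQueue: the class name `TaskQueue` and the `std::function` payload are simplifications, and the real implementation carries per-task request state and abort signaling.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal worker-pool queue: N threads drain a shared queue of tasks.
class TaskQueue {
public:
    explicit TaskQueue(size_t n_workers) {
        for (size_t i = 0; i < n_workers; ++i) {
            workers_.emplace_back([this] { run(); });
        }
    }

    ~TaskQueue() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_all();
        for (auto & w : workers_) {
            w.join(); // workers drain remaining tasks before exiting
        }
    }

    void push(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
                if (stop_ && tasks_.empty()) {
                    return; // shut down only once the queue is drained
                }
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task(); // run the (potentially long) inference outside the lock
        }
    }

    std::queue<std::function<void()>> tasks_;
    std::vector<std::thread>          workers_;
    std::mutex                        mutex_;
    std::condition_variable           cv_;
    bool                              stop_ = false;
};
```

The key design point mirrored from the description: the lock is held only while touching the queue, so worker threads execute inference concurrently with new requests being accepted.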

2. Graceful Shutdown and Robust Handling

  • The server now handles shutdowns more gracefully: ongoing and queued tasks are notified and can be aborted with meaningful error responses (503 for server shutdown, 499 for client disconnect).
  • The task queue is properly destroyed and re-created when a new model is loaded or on shutdown, ensuring clean resource management.
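Per-task completion signaling of this kind is commonly modeled with a promise/future pair. The sketch below is a hedged illustration of the idea, not the PR's code: `TaskResult`, `Task`, and `abort_for_shutdown` are hypothetical names, and the real task type carries the full request and response state.

```cpp
#include <cassert>
#include <future>
#include <string>

// Hypothetical result type; the PR maps abort reasons to HTTP statuses
// (server shutdown -> 503, client disconnect -> 499).
struct TaskResult {
    int         status;
    std::string body;
};

// A task owns a promise; the HTTP handler blocks on the matching future.
struct Task {
    std::promise<TaskResult> done;
};

// On graceful shutdown, pending tasks are completed with an error
// response instead of being silently dropped.
inline void abort_for_shutdown(Task & t) {
    t.done.set_value({503, "server is shutting down"});
}
```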

3. Refactoring of Main Server Logic

  • The request handler for audio transcription (POST /inference_path) is refactored to create a fresh copy of the parameter set for each request, improving thread safety.
  • The actual inference is now performed inside the worker threads, so the server remains responsive to new requests.
  • The code for response formatting (plain text, SRT, VTT, verbose JSON) is encapsulated within the task result logic.
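The "fresh copy of the parameter set per request" idea is simple but worth making concrete. In the sketch below, `whisper_params_stub` and `make_request_params` are hypothetical stand-ins for the server's real `whisper_params` struct and handler code:

```cpp
#include <cassert>
#include <string>

// Stand-in for the server's whisper_params (hypothetical fields).
struct whisper_params_stub {
    int         n_threads = 4;
    std::string language  = "en";
};

// Each request works on its own copy, so concurrent handlers never
// mutate shared default state.
whisper_params_stub make_request_params(const whisper_params_stub & defaults) {
    whisper_params_stub p = defaults; // fresh copy per request
    // per-request overrides from the multipart form would be applied here
    return p;
}
```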

4. Command-Line Interface and Usability Improvements

  • Adds a --workers N argument to control the number of concurrent worker threads for handling requests.
  • Usage help text and parameter parsing are updated accordingly.
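A minimal sketch of parsing such a flag, under the assumption of a simple argv scan (`parse_workers` is a hypothetical helper, not the server's actual parser, which handles many more options):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Scan the argument list for "--workers N"; fall back to a default of 1.
int32_t parse_workers(const std::vector<std::string> & args, int32_t def = 1) {
    for (size_t i = 0; i + 1 < args.size(); ++i) {
        if (args[i] == "--workers") {
            return static_cast<int32_t>(std::stoi(args[i + 1]));
        }
    }
    return def;
}
```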

5. Minor Fixes and Improvements

  • Swaps the use of std::atomic_flag for a more idiomatic std::atomic<bool> for the termination signal.
  • Reduces use of global and shared state, improving modularity and reliability.

Example: New Option

```shell
./server --workers 4
```

This will launch the server with 4 worker threads for handling requests concurrently.

danbev (Collaborator) left a comment:

I've done an initial pass over this and wanted to leave some early feedback (or feedback in chunks, at least, so there isn't too much in one go), and I'll go through this in more detail.

```diff
@@ -41,10 +43,10 @@ const std::string vjson_format = "verbose_json";
const std::string vtt_format = "vtt";

std::function<void(int)> shutdown_handler;
std::atomic_flag is_terminating = ATOMIC_FLAG_INIT;
```
danbev (Collaborator):

I think that std::atomic_flag might be more appropriate here, as it is guaranteed to be lock-free. And it is only used in a signal handler, where we don't want any potential blocking. So I'd prefer to keep this as is.
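For context on this point: std::atomic_flag is the only standard atomic type guaranteed to be lock-free, which is what makes it safe to set from a signal handler. A minimal sketch of the pattern being defended (`on_sigint` is a hypothetical handler name, not code from the PR):

```cpp
#include <atomic>
#include <cassert>
#include <csignal>
#include <cstdlib>

// Guaranteed lock-free, hence safe to touch from a signal handler.
std::atomic_flag is_terminating = ATOMIC_FLAG_INIT;

extern "C" void on_sigint(int /*sig*/) {
    // test_and_set returns the previous value: false on the first signal.
    if (is_terminating.test_and_set()) {
        // Second Ctrl-C while shutting down: exit immediately.
        // std::_Exit is async-signal-safe, unlike std::exit.
        std::_Exit(1);
    }
    // First signal: the flag is now set; the main loop polls it and
    // begins a graceful shutdown.
}
```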

```cpp
const httplib::Request* request_ptr; // For abort callback
std::atomic<AbortReason> abort_reason{AbortReason::NotAborted};
std::atomic<bool>* stop_flag_ptr{nullptr};
```

danbev (Collaborator):

Nit: remove empty space (not showing here, but it is visible using command-line git). There are a few more of these in the file, but I won't list them all, to avoid cluttering the review.

```cpp
WhisperTask() = default;

// Move constructor
WhisperTask(WhisperTask&& other) noexcept
```
danbev (Collaborator):

Nit: remove trailing whitespace (not showing here, but it is visible using command-line git). There are a few more of these in the file, but I won't list them all, to avoid cluttering the review.

```cpp
std::vector<std::vector<float>> pcmf32s;
whisper_params params;
std::string filename;
const httplib::Request* request_ptr; // For abort callback
```
danbev (Collaborator):

Nit: Pointers/references are preferred to be "non-leaning". There are a few more of these.


```cpp
class WhisperTaskQueue {
public:
WhisperTaskQueue(struct whisper_context* ctx, size_t n_workers = 2)
```
danbev (Collaborator):

Just wondering about the default of 2 for n_workers, as the default value is 1 elsewhere in the code; perhaps this should also be 1?

```cpp
wparams.token_timestamps = !task.params.no_timestamps && task.params.response_format == vjson_format;
wparams.no_context = task.params.no_context;
wparams.suppress_nst = task.params.suppress_nst;
```
danbev (Collaborator):

Perhaps also add VAD parameters here, similar to what is done in:

```cpp
wparams.vad = params.vad;
wparams.vad_model_path = params.vad_model.c_str();
wparams.vad_params.threshold = params.vad_threshold;
wparams.vad_params.min_speech_duration_ms = params.vad_min_speech_duration_ms;
wparams.vad_params.min_silence_duration_ms = params.vad_min_silence_duration_ms;
wparams.vad_params.max_speech_duration_s = params.vad_max_speech_duration_s;
wparams.vad_params.speech_pad_ms = params.vad_speech_pad_ms;
wparams.vad_params.samples_overlap = params.vad_samples_overlap;
```

```cpp
std::mutex queue_mutex_;
std::condition_variable queue_cv_;
std::mutex whisper_mutex_; // Protect whisper context access
std::atomic<bool> stop_flag_;
```
danbev (Collaborator):

I think stop_flag_ could just be a bool, as the accesses to it are all guarded by the queue_cv_ lock.
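The simplification being suggested can be sketched as follows: when every read and write of the flag happens under the mutex that the condition variable waits on, a plain bool is sufficient and atomicity buys nothing. The `Shutdown` struct below is an illustrative stand-in, not the PR's code:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

struct Shutdown {
    std::mutex              m;
    std::condition_variable cv;
    bool                    stop = false; // plain bool: always accessed under m

    void request() {
        {
            std::lock_guard<std::mutex> lock(m);
            stop = true;
        }
        // Notify after releasing the lock so woken waiters can proceed.
        cv.notify_all();
    }

    void wait() {
        std::unique_lock<std::mutex> lock(m);
        // The predicate is re-checked under the lock on every wakeup,
        // which is what makes the unsynchronized-looking read safe.
        cv.wait(lock, [this] { return stop; });
    }
};
```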

sachaarbonel (Contributor, Author):

Let's hold off on this one. I ran some benchmarks and tracing with NVIDIA Nsight, and the CPU is not the bottleneck.


sachaarbonel marked this pull request as draft on June 18, 2025 at 08:53.
Reviewers

danbev left review comments.

At least 1 approving review is required to merge this pull request.


2 participants: @sachaarbonel, @danbev
