Shannon Lal
Running multiple LLMs on a single GPU

In recent weeks, I have been working on projects that rely on GPUs, and I have been exploring ways to make better use of them. To understand where that capacity goes, I started by analyzing memory consumption and utilization patterns with the nvidia-smi tool, which gave me a detailed breakdown of GPU memory and usage for each application.
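As a quick illustration of that kind of check, here is a minimal sketch that shells out to nvidia-smi from Python. It assumes the NVIDIA driver utilities are installed on the machine, and the exact fields you query may vary.

import subprocess

# Overall per-GPU memory and utilization
print(subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,memory.used,memory.total,utilization.gpu",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout)

# Per-process ("per application") memory breakdown
print(subprocess.run(
    ["nvidia-smi",
     "--query-compute-apps=pid,process_name,used_memory",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout)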
One of the areas I have been focusing on is deploying our own LLMs. I noticed that a smaller LLM, such as a 7B-parameter model, was only consuming about 8 GB of memory on an A100 and utilizing around 20% of the GPU during inference. This observation led me to investigate running multiple LLM processes in parallel on a single GPU to make better use of the hardware.
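You can also sanity-check a single model's footprint from inside the Python process after loading it. This is a rough sketch assuming PyTorch and a model already placed on cuda:0:

import torch

# How much of GPU 0 the loaded model actually occupies
used = torch.cuda.memory_allocated(0) / 1024**3
total = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"GPU 0: {used:.1f} GiB allocated of {total:.1f} GiB")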
To achieve this, I used Python's multiprocessing module with the spawn start method to launch several worker processes, each of which loads its own copy of the model, so that multiple inference tasks can run in parallel on a single GPU. The following code demonstrates the approach I used to set up and execute multiple LLMs on one GPU.

import multiprocessing
import time

from transformers import AutoModelForCausalLM, AutoTokenizer

MAX_MODELS = 3


def load_model(model_name: str, device: str):
    # Load the model in 8-bit to keep the per-process memory footprint small
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        return_dict=True,
        load_in_8bit=True,
        device_map={"": device},
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    return model, tokenizer


def inference(model, tokenizer, prompt: str):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=1.0)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


def process_task(task_queue, result_queue):
    # Each worker process loads its own copy of the model onto the same GPU
    model, tokenizer = load_model("tiiuae/falcon-7b-instruct", device="cuda:0")
    while True:
        task = task_queue.get()
        if task is None:  # sentinel: no more work, shut the worker down
            break
        prompt = task
        start = time.time()
        summary = inference(model, tokenizer, prompt)
        print(f"Completed inference in {time.time() - start}")
        result_queue.put(summary)


def main():
    task_queue = multiprocessing.Queue()
    result_queue = multiprocessing.Queue()
    prompt = ""  # The prompt you want to execute

    processes = []
    for _ in range(MAX_MODELS):
        process = multiprocessing.Process(target=process_task, args=(task_queue, result_queue))
        process.start()
        processes.append(process)

    start = time.time()
    # Run the prompt 3 times for each of the models
    for _ in range(MAX_MODELS * 3):
        task_queue.put(prompt)

    results = []
    for _ in range(MAX_MODELS * 3):
        results.append(result_queue.get())
    print(f"Completed {MAX_MODELS * 3} inferences in {time.time() - start}")

    # Tell each worker to exit and wait for them to finish
    for _ in range(MAX_MODELS):
        task_queue.put(None)
    for process in processes:
        process.join()


if __name__ == "__main__":
    multiprocessing.set_start_method("spawn")
    main()
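One detail worth calling out is the spawn start method at the bottom. The CUDA runtime cannot be re-initialized in a forked child process once it has been initialized in the parent, so when you mix CUDA with multiprocessing, spawn (or forkserver) is the safe choice even on Linux, where fork is the default.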

The following is a quick summary of some of the tests that I ran.

GPU             | # of LLMs | GPU Memory | GPU Usage | Average Inference Time
A100 with 40GB  | 1         | 8 GB       | 20%       | 12.8 seconds
A100 with 40GB  | 2         | 16 GB      | 95%       | 16 seconds
A100 with 40GB  | 3         | 32 GB      | 100%      | 23.2 seconds

Running multiple LLM instances on a single GPU can significantly reduce costs and increase availability by making fuller use of the available resources. The trade-off is per-request latency: average inference time climbed from 12.8 seconds with one model to 23.2 seconds with three running concurrently, even though the GPU as a whole gets through more requests. If you have any other ways of optimizing GPU usage, or questions about how this works, feel free to reach out.
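To make that trade-off concrete, here is a rough back-of-the-envelope calculation from the table above, assuming the measured averages hold at each concurrency level:

# Approximate aggregate throughput at each concurrency level
latencies = {1: 12.8, 2: 16.0, 3: 23.2}  # number of LLM processes -> average seconds per inference
for n, seconds in latencies.items():
    print(f"{n} LLM(s): ~{n / seconds:.3f} inferences/second")
# ~0.078/s with one process vs. ~0.129/s with three: roughly 65% more throughput,
# at the cost of each individual request taking longer.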

Thanks
