The present application claims priority from Chinese Patent Application No. 202410137574.7, entitled "Method, Apparatus, and Other Devices for Data Processing," filed on January 31, 2024, which is incorporated herein by reference in its entirety.
Disclosure of Invention
The embodiments of the application provide a computing system, a multi-round session reasoning method, an apparatus, and a computing device cluster, which can avoid recalculation of the KV cache and improve the efficiency of multi-round session reasoning. The technical solution is as follows.
In a first aspect, a computing system is provided that includes a host, an accelerator for the host, and an external storage device;
wherein the external storage device is used for storing the history KV cache of completed sessions; the host is used for preloading the history KV cache of a session to be processed in a task queue from the external storage device into the memory of the host while the accelerator processes a session; and the accelerator is used for preloading the history KV cache required by the layer i+1 computation from the memory of the host into the accelerator while performing the layer i computation of a generative large model on a session.
In this system, the capacity of the external storage device is far greater than that of the HBM in the accelerator, so the external storage device can store a large amount of history KV cache. This increases the hit rate of the history KV cache and avoids recomputing the KV cache, which improves the efficiency of multi-round session reasoning and saves computing resources. In addition, the computation process and the data loading process proceed in parallel, so the computation does not have to wait for data loading to complete; the time cost of the accelerator accessing the external storage device is hidden, further improving the efficiency of multi-round session reasoning.
In some embodiments, the history KV cache of the input of the first session at layer i+1 is independent of the position encoding of the input of the first session.
Here, independent means that the position encoding of the input of the first session has no influence on the history KV cache of the first session.
In some embodiments, the input of the first session includes a number of tokens that is greater than or equal to the number of tokens that can be accommodated by a context window of the generative large model;
the accelerator is used for loading the history KV cache of a first sub-input at layer i+1 from the memory while performing the layer i computation on the first session, where the first sub-input is a sub-input of the first session and the number of tokens included in the first sub-input is smaller than the number of tokens that can be accommodated by the context window.
In this system, the stored KV cache is decoupled from the position encoding. When the context window overflows, a change in the position encoding of the input of the generative large model does not invalidate the stored KV cache, so the accelerator can reuse the history KV cache of the clipped input. This avoids KV cache recomputation caused by context window overflow, improves reasoning efficiency, and saves computing resources.
In some embodiments, the accelerator is further configured to:
embed the position encoding of the first sub-input into the history KV cache of the first sub-input at layer i+1, and perform the layer i+1 computation on the first session based on the input of the first session and the position-encoded history KV cache of the first sub-input at layer i+1.
Wherein the position code is a relative position code (relative positional encoding, RPE).
In some embodiments, the accelerator includes a first buffer for storing a history KV cache of the first session and a second buffer for storing a history KV cache of the input at layer 1 of a second session in the task queue, the second session being a first pending session in the task queue after the first session;
The accelerator is used for loading the history KV cache of the input of the first session at layer i+1 from the memory into the first buffer while performing the layer i computation on the first session, and loading the history KV cache of the input of the second session at layer 1 from the memory into the second buffer while performing the layer N computation on the first session.
In this system, the HBM of the accelerator includes a first buffer and a second buffer. The first buffer stores the KV cache of the session the accelerator is currently processing, and the second buffer stores the KV cache of the session the accelerator will process next. While processing the first session based on the KV cache in the first buffer, the accelerator loads the KV cache required by the layer 1 computation of the second session into the second buffer. After the accelerator finishes processing the first session, the second buffer becomes the execution buffer, and the accelerator releases the first buffer through an asynchronous thread while processing the second session. The compute thread of the accelerator can therefore start processing the second session immediately after finishing the first session without waiting for the first buffer to be released, which hides the time slot between two sessions caused by loading the data required by the next session and improves reasoning efficiency.
In some embodiments, the accelerator is further configured to write KV cache generated by performing the i-th layer computation on the first session into the memory when performing the i+1-th layer computation on the first session.
In this system, the accelerator writes the KV cache generated by the layer i computation back to the memory of the host while performing the layer i+1 computation. Compared with writing the KV cache generated by a round of session into the memory of the host only after that round of reasoning is complete, this hides the time slot between two adjacent rounds of sessions caused by writing back the data generated by the previous round, and improves reasoning efficiency.
In some embodiments, the accelerator includes a third buffer for storing a KV cache generated by the generative large model processing the first session;
The accelerator is used for: when the layer N computation on the first session is completed, if there is KV cache generated by the generative large model processing the first session that has not yet been written into the memory, writing that KV cache into the third buffer; and writing the KV cache in the third buffer into the memory while the generative large model processes the second session in the task queue.
In this system, the HBM of the accelerator further includes a third buffer. After the accelerator finishes processing the first session, the KV cache that was generated while processing the first session but has not yet been written back to the memory is copied from the first buffer into the third buffer. Because data copying between the first buffer and the third buffer is fast, the accelerator can quickly release the first buffer after finishing the first session, so the KV cache required by subsequent processing can be loaded into the first buffer without waiting for the KV cache generated by the first session to be written into the memory first. This hides the time slot between two sessions caused by waiting for the previous session's data to be written back, and improves reasoning efficiency.
In some embodiments, the host is configured to load the history KV cache of a first pending session in the task queue from the external storage device into the memory if that history KV cache does not exist in the memory, where the first pending session is a first number of pending sessions at the head of the task queue.
In this system, while the accelerator processes a session, the host can preload the history KV cache of a pending session in the task queue from the external storage device into the memory of the host. Because the computation process and the data loading process proceed in parallel, the computation does not have to wait for data loading to complete, the time cost of the accelerator accessing the external storage device is hidden, and the efficiency of multi-round session reasoning is improved.
In some embodiments, the number of session rounds included in the first pending session is determined based on the capacity of the memory.
In some embodiments, the host is configured to load the history KV cache of the first pending session from the external storage device into the memory if the available storage space of the memory is sufficient; and, if the available storage space of the memory is insufficient, to write the history KV cache of a second pending session in the task queue from the memory to the external storage device and then load the history KV cache of the first pending session from the external storage device into the memory, where the second pending session is a session in the task queue other than the first pending session.
In some embodiments, the host is further configured to delete the history KV cache of a third pending session in the task queue from the external storage device if the available storage space of the external storage device is insufficient, the third pending session being a second number of pending sessions at the tail of the task queue.
In a second aspect, a multi-round session reasoning method is provided, applied to a computing system, where the computing system includes a host, an accelerator of the host, and an external storage device, the external storage device is used for storing the history key-value cache (KV cache) of a session, and the host is used for loading the history KV cache of a session to be processed in a task queue from the external storage device into a memory of the host; the method includes:
Reasoning about a first session in the task queue through a generative large model based on the task queue, where the first session is one round of a multi-round session and the generative large model includes N layers; when the accelerator performs the layer i computation on the first session, the history KV cache of the input of the first session at layer i+1 is loaded from the memory into the accelerator, where i is an integer greater than or equal to 1 and less than N.
In some embodiments, the history KV cache of the input of the first session at layer i+1 is independent of the position encoding of the input of the first session.
In some embodiments, the input of the first session includes a number of tokens that is greater than or equal to the number of tokens that can be accommodated by a context window of the generative large model;
loading the history KV cache of the input of the first session at layer i+1 from the memory while performing the layer i computation on the first session includes: loading the history KV cache of a first sub-input at layer i+1 from the memory while performing the layer i computation on the first session, where the first sub-input is a sub-input of the first session and the number of tokens included in the first sub-input is smaller than the number of tokens that can be accommodated by the context window.
In some embodiments, the processing of the first session in the task queue by the generative large model includes:
embedding the position encoding of the first sub-input into the history KV cache of the first sub-input at layer i+1, and performing the layer i+1 computation on the first session based on the input of the first session and the position-encoded history KV cache of the first sub-input at layer i+1.
In some embodiments, the accelerator includes a first buffer for storing a history KV cache of the first session and a second buffer for storing a history KV cache of the input at layer 1 of a second session in the task queue, the second session being a first pending session in the task queue after the first session;
loading, when the accelerator performs the layer i computation on the first session, the history KV cache of the input of the first session at layer i+1 from the memory includes: loading, by the accelerator, the history KV cache of the input of the first session at layer i+1 from the memory into the first buffer.
In some embodiments, the method further comprises:
When the accelerator performs layer N calculation on the first session, the accelerator loads the historical KV cache of the input of the second session in the layer 1 from the memory to the second buffer zone.
In some embodiments, the method further comprises:
when the accelerator performs the layer i+1 computation on the first session, writing, by the accelerator, the KV cache generated by the layer i computation on the first session into the memory.
In some embodiments, the accelerator includes a third buffer for storing a KV cache generated by the generative large model processing the first session;
Writing the KV cache generated by the layer i computation on the first session into the memory while performing the layer i+1 computation includes: if, when the accelerator completes the layer N computation on the first session, there is KV cache generated by the generative large model processing the first session that has not yet been written into the memory, writing, by the accelerator, that KV cache into the third buffer; and, while the accelerator processes the second session in the task queue through the generative large model, writing, by the accelerator, the KV cache in the third buffer into the memory.
In some embodiments, the method further includes loading, by the host, the history KV cache of a first pending session in the task queue from the external storage device into the memory if that history KV cache does not exist in the memory, the first pending session being a first number of pending sessions at the head of the task queue.
In some embodiments, the number of session rounds included in the first pending session is determined based on the capacity of the memory.
In some embodiments, if the memory does not have the history KV cache of the first pending session in the task queue, loading, by the host, the history KV cache of the first pending session from the external storage device to the memory, including:
If the available storage space of the memory is insufficient, writing, by the host, the history KV cache of a second pending session in the task queue from the memory to the external storage device, and then loading, by the host, the history KV cache of the first pending session from the external storage device into the memory, where the second pending session is a session in the task queue other than the first pending session.
In some embodiments, the method further comprises:
If the available storage space of the external storage device is insufficient, deleting, by the host, the history KV cache of a third pending session in the task queue from the external storage device, where the third pending session is a second number of pending sessions at the tail of the task queue.
In a third aspect, a multi-round session reasoning method is provided, applied to an accelerator of a host in a computing system, where the computing system further includes the host and an external storage device, the external storage device is used for storing the history key-value cache (KV cache) of a session, and the host is used for loading the history KV cache of a session to be processed in a task queue from the external storage device into a memory of the host; the method includes:
Based on the task queue, reasoning about a first session in the task queue through a generative large model, where the first session is one round of a multi-round session and the generative large model includes N layers; while performing the layer i computation on the first session, the history KV cache of the input of the first session at layer i+1 is loaded from the memory, where i is an integer greater than or equal to 1 and less than N.
In some embodiments, the history KV cache of the input of the first session at layer i+1 is independent of the position encoding of the input of the first session.
In some embodiments, the input of the first session includes a number of tokens that is greater than or equal to the number of tokens that can be accommodated by a context window of the generative large model;
loading the history KV cache of the input of the first session at layer i+1 from the memory while performing the layer i computation on the first session includes: loading the history KV cache of a first sub-input at layer i+1 from the memory while performing the layer i computation on the first session, where the first sub-input is a sub-input of the first session and the number of tokens included in the first sub-input is smaller than the number of tokens that can be accommodated by the context window.
In some embodiments, the processing of the first session in the task queue by the generative large model includes:
embedding the position encoding of the first sub-input into the history KV cache of the first sub-input at layer i+1, and performing the layer i+1 computation on the first session based on the input of the first session and the position-encoded history KV cache of the first sub-input at layer i+1.
In some embodiments, the accelerator includes a first buffer for storing a history KV cache of the first session and a second buffer for storing a history KV cache of the input at layer 1 of a second session in the task queue, the second session being a first pending session in the task queue after the first session;
loading the history KV cache of the input of the first session at layer i+1 from the memory while performing the layer i computation on the first session includes: loading the history KV cache of the input of the first session at layer i+1 from the memory into the first buffer while performing the layer i computation on the first session.
In some embodiments, the method further comprises loading the history KV cache of the input of the second session at layer 1 from the memory into the second buffer while performing the layer N calculation on the first session.
In some embodiments, the method further includes writing KV cache generated by performing the i-th layer calculation on the first session into the memory when the i+1-th layer calculation is performed on the first session.
In some embodiments, the accelerator includes a third buffer for storing a KV cache generated by the generative large model processing the first session;
Writing the KV cache generated by the layer i computation on the first session into the memory while performing the layer i+1 computation includes: if, when the layer N computation on the first session is completed, there is KV cache generated by the generative large model processing the first session that has not yet been written into the memory, writing that KV cache into the third buffer; and writing the KV cache in the third buffer into the memory while processing the second session in the task queue through the generative large model.
In a fourth aspect, a multi-round session reasoning method is provided, applied to a host in a computing system, where the computing system further includes an accelerator of the host and an external storage device, the external storage device is configured to store the history key-value cache (KV cache) of a session, and the accelerator is configured to reason about a first session in the task queue through a generative large model based on the task queue, where the first session is one round of a multi-round session and the generative large model includes N layers; while performing the layer i computation on the first session, the accelerator loads the history KV cache of the input of the first session at layer i+1 from the memory of the host, where i is an integer greater than or equal to 1 and less than N;
The method comprises the following steps:
Loading the history KV cache of a session to be processed in the task queue from the external storage device into the memory of the host, where the memory of the host is used for storing the history KV cache of the session to be processed.
In some embodiments, loading the historical KV cache of the pending session in the task queue from the external storage device to the memory of the host includes:
if the memory does not contain the history KV cache of a first pending session in the task queue, loading the history KV cache of the first pending session from the external storage device into the memory, where the first pending session is a first number of pending sessions at the head of the task queue.
In some embodiments, the number of session rounds included in the first pending session is determined based on the capacity of the memory.
In some embodiments, if the memory does not have the history KV cache of the first pending session in the task queue, the host loads the history KV cache of the first pending session from the external storage device to the memory, including:
If the available storage space of the memory is insufficient, writing the history KV cache of a second pending session in the task queue from the memory to the external storage device, and then loading the history KV cache of the first pending session from the external storage device into the memory, where the second pending session is a session in the task queue other than the first pending session.
In some embodiments, the method further comprises:
If the available storage space of the external storage device is insufficient, deleting the history KV cache of a third pending session in the task queue from the external storage device, where the third pending session is a second number of pending sessions at the tail of the task queue.
In a fifth aspect, a multi-round session reasoning apparatus is provided for use with an accelerator of a host in a computing system, the apparatus comprising at least one functional module for performing a multi-round session reasoning method as provided by the foregoing third aspect or any one of the possible implementations of the third aspect.
In a sixth aspect, a multi-round session reasoning apparatus is provided for use with a host in a computing system, the apparatus comprising at least one functional module for performing the multi-round session reasoning method as provided by the fourth aspect or any one of the possible implementations thereof.
In a seventh aspect, an accelerator is provided, comprising a computational core and a memory, where the memory is used for storing computational data, and the computational core is used for performing computational operations on the computational data stored in the memory and for performing the multi-round session reasoning method as provided by the foregoing third aspect or any one of its possible implementations.
In an eighth aspect, a host is provided, the host comprising a processor and a memory, the host being adapted to perform the multi-round session reasoning method as provided by the foregoing fourth aspect or any one of the possible implementations of the fourth aspect.
In a ninth aspect, a computing device cluster is provided, where the computing device cluster includes at least one computing device, each computing device includes a host and an accelerator of the host, and an external storage device is located in the at least one computing device, where the external storage device is used for storing the history KV cache of a session and the host is used for loading the history KV cache of a session to be processed in a task queue from the external storage device into a memory of the host;
The host in the computing device cluster is configured to perform a multi-round session reasoning method as provided by the fourth aspect or any one of the possible implementations of the fourth aspect;
the accelerator in the cluster of computing devices is for performing a multi-round session reasoning method as provided by the foregoing third aspect or any of the possible implementations of the third aspect.
In a tenth aspect, there is provided a computer program product comprising instructions which, when executed by a cluster of computing devices, cause the cluster of computing devices to perform a multi-round session reasoning method as provided by the foregoing third aspect or any one of the possible implementations of the third aspect, or to perform a multi-round session reasoning method as provided by the foregoing fourth aspect or any one of the possible implementations of the fourth aspect.
In an eleventh aspect, a computer-readable storage medium is provided, comprising computer program instructions which, when executed by a cluster of computing devices, cause the cluster of computing devices to perform a multi-round session reasoning method as provided by the foregoing third aspect or any one of its possible implementations, or to perform a multi-round session reasoning method as provided by the foregoing fourth aspect or any one of its possible implementations.
Further combinations of the present application may be made to provide further implementations based on the implementations provided in the above aspects.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
First, the present application relates to the application of a generative large model based on the Transformer architecture to multi-round session reasoning. To facilitate understanding of the embodiments of the present application, several technical terms involved in the embodiments are explained below.
Token, the smallest semantic unit represented by a vector, where the smallest semantic unit is, for example, a character, word, or phrase. A token sequence composed of a plurality of tokens serves as the input of the generative large model. In a multi-round session reasoning scenario, the input token sequence of the generative large model is the question prompt of a session. The generative large model reasons over the input token sequence and obtains the subsequent tokens of the question prompt one by one through multiple iterations, until a complete subsequent token sequence is obtained or an inference terminator appears. The token sequence output by the generative large model is the answer to the question prompt. For example, if the input token sequence of the generative large model is X[1:s], the generative large model performs a first iteration based on X[1:s] to obtain the (s+1)-th token, token[s+1], performs a second iteration based on X[1:s] and token[s+1] to obtain the (s+2)-th token, token[s+2], and so on, until a complete token sequence is obtained or an inference terminator appears. The generative large model then outputs the token sequence obtained by reasoning, i.e., the output answer, which completes one round of dialogue reasoning, where s is an integer greater than 1.
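By way of illustration only (this is not part of the claimed subject matter), the following Python sketch shows the iterative generation loop described above; model_step and EOS are hypothetical stand-ins for one forward pass of the generative large model and for its inference terminator.

```python
EOS = -1  # hypothetical inference terminator token id

def model_step(tokens):
    # Placeholder for one forward pass over all N layers; returns the next token id.
    # It simply echoes a dummy token so that the loop is runnable.
    return EOS if len(tokens) >= 8 else len(tokens)

def infer_answer(prompt_tokens, max_new_tokens=16):
    tokens = list(prompt_tokens)          # X[1:s], the input token sequence
    answer = []
    for _ in range(max_new_tokens):       # iteration 1 yields token s+1, and so on
        next_token = model_step(tokens)   # iteration k uses X[1:s] plus tokens s+1 .. s+k-1
        if next_token == EOS:             # stop when the inference terminator appears
            break
        tokens.append(next_token)
        answer.append(next_token)
    return answer                         # the generated token sequence, i.e. the answer

print(infer_answer([101, 102, 103]))
```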
A large model based on the Transformer architecture includes N Transformer layers, where N is an integer greater than 1. Each Transformer layer includes a self-attention mechanism (self-attention) and a feed-forward neural network (feed forward network, FFN).
In each iteration, the N Transformer layers of the generative large model sequentially process the token sequence input in that iteration to obtain one token. The processing of the input token sequence by the i-th layer of the generative large model includes: projecting each token in the input of the iteration with the model parameters corresponding to the self-attention mechanism of the i-th layer to generate the intermediate data key and value corresponding to each token; performing a nonlinear conversion on the key and value, where the nonlinear conversion is, for example, a residual connection and normalization; passing the result of the nonlinear conversion to the feed-forward neural network of the i-th layer; processing that result with the feed-forward neural network of the i-th layer to obtain the processing result of the i-th layer; and passing the processing result of the i-th layer to the (i+1)-th layer for processing. Here i is an integer greater than or equal to 1 and less than N. Processing the input token sequence through the i-th layer of the generative large model to obtain the processing result of the i-th layer is referred to as the layer i computation.
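The following numerical sketch illustrates the layer computation just described; the toy dimensions, random weights, and the simple normalization are assumptions for illustration only, not the patented implementation.

```python
import numpy as np

d = 8                                            # assumed hidden size
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
W_ffn = rng.standard_normal((d, d))

def layer_norm(x):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

def transformer_layer(x):
    # 1) project every token onto the self-attention parameters to get Q, K, V
    q, k, v = x @ Wq, x @ Wk, x @ Wv             # k and v are the intermediate data kept as KV cache
    # 2) self-attention over the token sequence
    scores = q @ k.T / np.sqrt(d)
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)
    # 3) nonlinear conversion: residual connection and normalization
    h = layer_norm(x + attn @ v)
    # 4) feed-forward network; its output is the processing result passed to layer i+1
    return layer_norm(h + np.maximum(h @ W_ffn, 0)), (k, v)

x = rng.standard_normal((5, d))                  # 5 input tokens
out, (k, v) = transformer_layer(x)
print(out.shape, k.shape, v.shape)               # (5, 8) (5, 8) (5, 8)
```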
In order to maintain semantic consistency of the conversation context and ensure accurate understanding of the input of the current session, the generative large model refers to the inputs and outputs of the historical sessions with the same object when generating the answer of the current session. That is, the input of one round in a multi-round session consists of the token sequences input in the historical sessions, the token sequences output in the historical sessions, and the token sequence input by the object in the current session.
In one iteration of the generative large model (i.e., the process of generating one token), the model generates the corresponding intermediate data key and value when processing any token, and the generated intermediate data are used in subsequent iterations. Moreover, in a multi-round session, the token sequences input and output in the historical sessions are used in every round, so the intermediate data corresponding to those tokens are also used. Therefore, in order to save computing resources and improve reasoning efficiency, the intermediate data corresponding to the tokens input and output in the historical sessions are stored to form the history KV cache of the session. When reasoning about the session, the history KV cache can be reused directly without recomputation.
Pre-filling stage, namely the 1st iteration in a round of session reasoning. In the pre-filling stage, the generative large model performs the 1st iteration based on the token sequence X[1:s] input in this round of session to obtain token[s+1], and stores the key and value corresponding to each token in X[1:s] obtained in this iteration, forming KV cache[1:s].
Decoding stage, namely the 2nd iteration to the last iteration in a round of session reasoning. In any iteration of the decoding stage, the generative large model computes the key and value corresponding to the token obtained in the previous iteration, reads the stored history KV cache, obtains the token of this iteration based on the key and value computed in this iteration together with the history KV cache, and stores the key and value computed in this iteration, i.e., updates the history KV cache.
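The caching pattern of the pre-filling and decoding stages can be sketched as follows; project_kv and decode_step are illustrative stand-ins, and only the way the KV cache is built once and then appended to matters here.

```python
import numpy as np

d, rng = 4, np.random.default_rng(1)
Wk, Wv = rng.standard_normal((d, d)), rng.standard_normal((d, d))

def project_kv(token_vecs):
    return token_vecs @ Wk, token_vecs @ Wv

def decode_step(kv_cache, new_token_vec):
    # Compute key/value only for the token produced by the previous iteration,
    # append them to the stored history KV cache, then attend over the full cache.
    k_new, v_new = project_kv(new_token_vec[None, :])
    kv_cache = (np.vstack([kv_cache[0], k_new]), np.vstack([kv_cache[1], v_new]))
    scores = new_token_vec @ kv_cache[0].T / np.sqrt(d)
    weights = np.exp(scores - scores.max()); weights /= weights.sum()
    return kv_cache, weights @ kv_cache[1]       # attention output for this iteration

prompt = rng.standard_normal((6, d))             # token sequence X[1:s]
kv_cache = project_kv(prompt)                    # pre-filling: build KV cache[1:s] in one pass
for step in range(3):                            # decoding: iterations 2..M
    new_vec = rng.standard_normal(d)             # stands in for the token from the last iteration
    kv_cache, out = decode_step(kv_cache, new_vec)
print(kv_cache[0].shape)                         # (9, 4): 6 prompt keys + 3 decoded keys
```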
Context window (context window), namely the number of tokens that the generative large model can process at the same time.
Accelerator stream (stream) is an abstract unit of an accelerator that performs parallel computing tasks.
Execution buffer (execution buffer), namely an area in the high-bandwidth memory (HBM) of the accelerator for storing the KV cache that the inference computation stream of the accelerator can process.
Having described several terms of art to which embodiments of the present application relate, the environment in which embodiments of the present application are implemented is described below.
Fig. 1 is a schematic structural diagram of a computing system according to an embodiment of the present application, and as shown in fig. 1, the computing system includes a host 101, an accelerator 102 of the host, and an external storage device 103, where the host 101, the accelerator 102, and the external storage device 103 communicate through a wired network or a wireless network.
Wherein the computing system is capable of being deployed on a cluster of computing devices that includes at least one computing device, which may be a server, such as a central server, an edge server, or a local server in a local data center. In some embodiments, the computing device may also be a terminal device such as a desktop, notebook, or smart phone.
The external storage device 103 is, for example, a solid state disk (SSD) or a mechanical hard disk drive (HDD), or a cloud storage service such as an object storage service (OBS), a cloud disk service (elastic volume service, EVS), or a scalable file service (SFS); the embodiment of the present application does not limit the type of the external storage device 103. Illustratively, in the multi-level storage system, the external storage device 103 is used to store the history KV cache of sessions.
The host 101 is, for example, the host of a computing device, such as a host computing device for controlling and managing other computing devices in the computing device cluster. The memory of the host 101 is, for example, a dynamic random access memory (DRAM). Illustratively, a task scheduler runs on the host 101 and maintains a task queue indicating the pending sessions of the accelerator; the memory of the host 101 is used for storing the history KV cache of the pending sessions in the task queue, and the host 101 is used for loading, based on the task queue, the history KV cache of a pending session in the task queue from the external storage device 103 into the memory of the host 101.
The accelerator 102 is, for example, a general-purpose central processing unit (CPU), a graphics processing unit (GPU), a switching module processor unit (SMPU), a network processor unit (NPU), or a microprocessor, or one or more integrated circuits for implementing the solution of the present application, such as an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. The embodiment of the present application does not limit the type of the accelerator. Illustratively, the HBM in the accelerator 102 is used to store the history KV cache of the first session. The accelerator 102 can generate an answer to a first session in the task queue by reasoning about the first session, i.e., performing the layer computations of the generative large model. The first session is one round of a multi-round session; when performing the layer i computation, the accelerator 102 reuses the history KV cache of the input of the first session at layer i stored in its HBM, and while performing the layer i computation, it loads the history KV cache of the first session at layer i+1 from the memory into the HBM.
In some embodiments, each computing device in the computing device cluster includes a host 101, an accelerator 102, and an external storage device 103, i.e., the storage device configured in each computing device is used as the external storage device 103; in other embodiments, each computing device in the computing device cluster includes the host 101 and the accelerator 102, and at least one computing device in the cluster that has storage capability acts as the external storage device 103. The embodiment of the present application does not limit this.
Illustratively, Fig. 2 is a functional schematic diagram of a computing system according to an embodiment of the present application. As shown in Fig. 2, the computing system includes a control system and a multi-level storage system, where the multi-level storage system includes the memory of the host 101, the HBM in the accelerator 102, and the external storage device 103. The control system runs on the host 101 and the accelerator 102 and includes a task scheduler and a key-value cache management unit. The task scheduler maintains a task queue indicating the pending sessions of the accelerator, and the key-value cache management unit controls data migration between the memory of the host 101 and the external storage device 103. The key-value cache management unit includes a key-value cache pull subunit and a key-value cache put subunit: the key-value cache pull subunit pre-pulls the history KV cache of a pending session in the task queue into the memory of the host 101 and loads the KV cache in the memory of the host 101 into the HBM of the accelerator; the key-value cache put subunit evicts KV cache from the memory of the host 101 to the external storage device 103 when the available storage space of the memory of the host 101 is insufficient, and stores (writes back) the KV cache generated by the accelerator processing a session into the memory of the host. It should be noted that the division of the functional modules of the computing system shown in Fig. 2 is merely exemplary, and the embodiment of the present application does not limit the manner in which the functional modules are divided.
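For illustration, the control system of Fig. 2 can be sketched structurally as below; the class and attribute names are assumptions, and only the data movement between external storage and host memory is modelled.

```python
from collections import deque

class KVCacheManager:
    def __init__(self, host_memory, external_storage):
        self.host_memory = host_memory            # dict: session id -> history KV cache
        self.external_storage = external_storage  # dict: session id -> history KV cache

    def pull(self, session_id):
        # Pre-pull the history KV cache of a pending session into host memory.
        if session_id not in self.host_memory and session_id in self.external_storage:
            self.host_memory[session_id] = self.external_storage[session_id]

    def put(self, session_id, kv_cache):
        # Write back (or evict) a KV cache from host memory to external storage.
        self.external_storage[session_id] = kv_cache
        self.host_memory.pop(session_id, None)

class TaskScheduler:
    def __init__(self, manager):
        self.queue = deque()                      # task queue of pending sessions
        self.manager = manager

    def submit(self, session_id):
        self.queue.append(session_id)
        self.manager.pull(session_id)             # prefetch while earlier sessions run

    def next_session(self):
        return self.queue.popleft() if self.queue else None

mgr = KVCacheManager(host_memory={}, external_storage={"s1": "kv-of-s1"})
sched = TaskScheduler(mgr)
sched.submit("s1")
print(sched.next_session(), mgr.host_memory)      # s1 {'s1': 'kv-of-s1'}
```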
In some embodiments, the wireless or wired network described above uses standard communication techniques and/or protocols. The network is typically, without limitation, a transmission control protocol/internet protocol (TCP/IP) network in a data center network, or an RDMA network such as an RDMA over converged Ethernet (RoCE) network or an InfiniBand (IB) network. In other embodiments, custom and/or dedicated data communication techniques can also be used in place of, or in addition to, the data communication techniques described above.
The embodiment of the application provides a multi-round session reasoning method applied to the above computing system. In this method, the external storage device stores the history KV cache of completed sessions; because the capacity of the external storage device is far greater than that of the HBM in the accelerator, it can store a large amount of history KV cache, which increases the hit rate of the history KV cache, avoids recomputing the KV cache, improves the efficiency of multi-round session reasoning, and saves computing resources. At the same time, while the accelerator processes a session, the host can preload the history KV cache of a pending session in the task queue from the external storage device into the memory of the host, and while performing the layer i computation of the generative large model on a session, the accelerator can preload the history KV cache required by the layer i+1 computation from the memory of the host into the accelerator. Because the computation process and the data loading process proceed in parallel, the computation does not have to wait for data loading to complete, which hides the time cost of the accelerator accessing the external storage device and improves the efficiency of multi-round session reasoning.
The method relates to a data migration process between an accelerator and a memory of a host and a data migration process between the memory of the host and an external storage device in a multi-round session reasoning process. These two data migration processes will be described separately below.
First, a data migration process between the accelerator and the memory of the host will be described. Fig. 3 is a flowchart of a multi-round session reasoning method according to an embodiment of the present application, and as shown in fig. 3, the method is applied to a computing system, where the computing system includes a host, an accelerator of the host, and an external storage device, and the method includes the following steps 301 to 308.
Step 301, the accelerator loads the history KV cache of the input of the first session at layer 1 of the generative large model from the memory of the host into the first buffer of the accelerator, where the generative large model includes N layers and N is an integer greater than 1.
The first session is one round of a multi-round session initiated by a first object. The input of the first session includes the token sequences input in the historical sessions of the first session, the token sequences output in those historical sessions, and the token sequence input by the object in the first session. Multiple rounds of sessions initiated by the same object share the same session window identifier, so the historical sessions of the first session are the completed sessions having the same session window identifier as the first session. For example, the first object initiates three rounds of sessions, the first and second rounds are completed, and the first session is the third round: the token sequence input in the first round is Q1 and the output is A1, the token sequence input in the second round is Q2 and the output is A2, and the token sequence input in the third round is Q3, so the input of the first session is [Q1 A1 Q2 A2 Q3].
The history KV cache of the input of the first session at layer 1 is the KV cache associated with layer 1 among the history KV cache of the first session. The first buffer is the area in the HBM of the accelerator for storing the KV cache that the accelerator can currently process, i.e., the execution buffer of the accelerator.
Before executing the layer 1 computation of the 1st iteration of the first session, the accelerator starts a data-read thread and loads the history KV cache of the input of the first session at layer 1 from the memory of the host into the first buffer through that thread. In this way, the data required by the layer 1 computation is ready in the first buffer of the accelerator before the computation takes place, which avoids a time slot caused by loading data from the memory of the host and improves reasoning efficiency.
The history KV cache is a variable-length multidimensional vector and can be stored at different granularities. For example, in some embodiments the history KV cache is stored at the granularity of a session round: the intermediate data key and value obtained by the generative large model processing multiple sessions are stored as multiple KV caches, one per session, each with a session identifier. In other embodiments the history KV cache is stored at the granularity of a layer of the generative large model: the intermediate data key and value obtained at each layer are stored as multiple KV caches, one per layer, each with a session identifier and a layer identifier. In still other embodiments the history KV cache is stored at the granularity of a set of multi-round sessions: the intermediate data key and value obtained by the generative large model processing the multiple rounds of sessions initiated by the same object are stored as one KV cache. The foregoing description of the storage granularity of the history KV cache is merely exemplary, and the embodiment of the present application does not limit the storage granularity of the KV cache.
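As one possible illustration of the layer-granularity option above, a cache could be keyed by a session identifier plus a layer identifier; the key layout below is a hypothetical example, not prescribed by the text.

```python
from typing import Any, Dict, Optional, Tuple

KVStore = Dict[Tuple[str, int], Any]   # (session id, layer index) -> variable-length KV tensor

def put_layer_kv(store: KVStore, session_id: str, layer: int, kv: Any) -> None:
    store[(session_id, layer)] = kv

def get_layer_kv(store: KVStore, session_id: str, layer: int) -> Optional[Any]:
    # Returns None on a miss, in which case the KV cache must be recomputed.
    return store.get((session_id, layer))

store: KVStore = {}
put_layer_kv(store, "session-7", 1, "kv@layer1")
print(get_layer_kv(store, "session-7", 1), get_layer_kv(store, "session-7", 2))
```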
The history KV cache of the input of the first session at layer 1 is independent of the position encoding of the input of the first session, i.e., the position encoding of the input of the first session has no influence on the history KV cache of the first session.
In some embodiments, the input of the first session includes more tokens than the context window of the generative large model can accommodate, and the accelerator loads the history KV cache of a first sub-input at layer 1 from the memory of the host, where the first sub-input is a sub-input of the first session and includes fewer tokens than the context window can accommodate. This can be understood as follows: when the input of the first session causes the context window to overflow, the input is clipped so that the earliest tokens are discarded and the first sub-input of the first session is used as the input of the generative large model; correspondingly, the accelerator loads the history KV cache corresponding to the first sub-input from the memory of the host, i.e., the KV cache is clipped directly after the context window overflows. Fig. 4 illustrates this loading process. Fig. 4 is a schematic diagram of directly clipping the KV cache after the context window overflows. As shown in Fig. 4, the memory of the host stores the history KV cache of the input of the first session at layer 1, and the position encoding of the input of the first session is [0:2048]. When the input of the first session causes the context window to overflow, the accelerator loads the history KV cache of the first sub-input of the first session at layer 1 from the memory of the host, where the position encoding of the first sub-input is [0:1536]; the accelerator embeds the position encoding of the first sub-input into the history KV cache at layer 1 and performs subsequent reasoning based on the position-encoded history KV cache. Embedding the position encoding into the history KV cache means embedding the position encoding into the keys of the history KV cache.
In the above embodiment, the stored KV cache is decoupled from the position encoding. When the context window overflows, the change in the position encoding of the input of the generative large model does not invalidate the stored KV cache, so the accelerator can reuse the history KV cache of the clipped input. This avoids KV cache recomputation caused by context window overflow, improves reasoning efficiency, and saves computing resources.
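The clipping of Fig. 4 can be sketched as follows, assuming a KV cache stored without any position encoding; the way the relative position code is re-attached is a placeholder, since the concrete RPE is not specified here.

```python
import numpy as np

def clip_and_reposition(history_k, history_v, context_window):
    # history_k / history_v: (num_tokens, d) caches stored without position encoding
    keep = min(len(history_k), context_window)
    k, v = history_k[-keep:], history_v[-keep:]          # discard the oldest tokens' cache
    positions = np.arange(keep)                          # fresh positions 0..keep-1
    # Stand-in for embedding a relative position code into the clipped keys; a real
    # system might apply RoPE or another RPE here -- this is only illustrative.
    k = k * (1.0 + positions[:, None] * 1e-3)
    return k, v, positions

rng = np.random.default_rng(2)
k, v = rng.standard_normal((2048, 8)), rng.standard_normal((2048, 8))
k2, v2, pos = clip_and_reposition(k, v, context_window=1536)
print(k2.shape, pos[:3], pos[-1])                        # (1536, 8) [0 1 2] 1535
```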
It should be noted that the history KV cache of the first session is stored in the external storage device. Before the accelerator reasons about the first session through the generative large model, the host preloads the history KV cache of the first session from the external storage device into the memory of the host, and the accelerator then loads the history KV cache of the input of the first session at layer 1 from the memory of the host. The data migration process between the external storage device and the memory of the host is described in the following embodiments and is not repeated here.
Step 302, the accelerator performs the layer 1 computation of the 1st iteration on the first session through the generative large model based on the input of the first session, and at the same time loads the history KV cache of the input of the first session at layer 2 from the memory of the host into the first buffer of the accelerator.
The accelerator starts a compute thread, through which the layer 1 computation of the 1st iteration is performed on the first session.
The layer 1 computation includes: projecting the input token sequence of the first session with the model parameters corresponding to the self-attention mechanism of layer 1 to generate the intermediate data key and value corresponding to each token in the input token sequence of the first session; updating the history KV cache of the input of the first session at layer 1 based on the key and value of the input token sequence at layer 1; embedding the position encoding of the input of the first session into the updated history KV cache at layer 1, where the position encoding is a relative position encoding (relative positional encoding, RPE); performing a nonlinear conversion, including a residual connection and normalization, on the key and value in the position-encoded history KV cache; passing the result of the nonlinear conversion to the feed-forward neural network of layer 1; and obtaining the layer 1 processing result from the feed-forward neural network. Updating the history KV cache of the input of the first session at layer 1 based on the key and value of the input token sequence at layer 1 means that each time the compute thread of the accelerator computes an intermediate data key or value, that intermediate data is appended to the history KV cache of the input of the first session at layer 1 in the first buffer.
The layer 1 computation is illustrated by Fig. 5, which is a schematic diagram of reasoning based on a KV cache decoupled from the position encoding. As shown in Fig. 5, the accelerator projects the input token sequence (input) of the first session with the model parameters (Wk, Wv) corresponding to the self-attention mechanism of layer 1 to generate the intermediate data key and value corresponding to each token, updates the history KV cache, embeds the position encoding of the input of the first session into the keys of the updated history KV cache, and performs subsequent reasoning based on the position-encoded cache.
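The decoupling shown in Fig. 5 can be sketched as below: keys and values are produced and cached without any position information, and a position code is embedded into the cached keys only at computation time. A rotary-style embedding is used here purely as one possible relative position encoding; the actual RPE is not fixed by the text.

```python
import numpy as np

d = 4

def rotate_half(x):
    x1, x2 = x[..., : d // 2], x[..., d // 2 :]
    return np.concatenate([-x2, x1], axis=-1)

def embed_rpe(keys, positions):
    # Attach a rotary-style relative position code to position-free cached keys.
    inv_freq = 1.0 / (10000 ** (np.arange(0, d, 2) / d))
    angles = positions[:, None] * inv_freq[None, :]
    cos = np.concatenate([np.cos(angles), np.cos(angles)], axis=-1)
    sin = np.concatenate([np.sin(angles), np.sin(angles)], axis=-1)
    return keys * cos + rotate_half(keys) * sin

rng = np.random.default_rng(3)
tokens = rng.standard_normal((5, d))
Wk, Wv = rng.standard_normal((d, d)), rng.standard_normal((d, d))
cached_k, cached_v = tokens @ Wk, tokens @ Wv        # stored KV cache, decoupled from positions
k_with_pos = embed_rpe(cached_k, np.arange(5))       # positions are attached only when the cache is used
print(cached_k.shape, k_with_pos.shape)              # (5, 4) (5, 4)
```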
It should be noted that when the number of tokens included in the input of the first session is greater than or equal to the number of tokens that the context window of the generative large model can accommodate, i.e., when the context window overflows, the accelerator performs the layer 1 computation of the 1st iteration on the first session through the generative large model based on the first sub-input of the first session. The computation is the same as when the context window does not overflow, except that the input of the generative large model and the history KV cache loaded by the accelerator from the memory of the host are clipped; the rest is not repeated.
The process by which the accelerator loads the history KV cache of the input of the first session at layer 2 from the memory of the host into the first buffer is the same as the loading of the layer 1 history KV cache in step 301 and is not repeated.
Step 303, the accelerator performs the layer 2 computation of the 1st iteration on the first session, and at the same time loads the history KV cache of the input of the first session at layer 3 from the memory of the host into the first buffer and writes the KV cache obtained by the layer 1 computation of the 1st iteration into the memory of the host.
The accelerator performs the layer 2 computation of the 1st iteration on the first session through the compute thread, while loading the history KV cache of the input of the first session at layer 3 from the memory of the host through the data-read thread, and writing the KV cache obtained by the layer 1 computation of the 1st iteration from the first buffer into the memory of the host through the data write-back thread.
The accelerator performs, through the calculation thread, layer 2 calculation of the 1 st iteration on the first session based on the calculation result of layer 1 calculation of the 1 st iteration on the first session, and the calculation process of layer 2 is the same as the calculation process of layer 1, and is not described again.
The process of loading the historical KV cache of the first session in the layer 3 from the memory of the host to the first buffer by the accelerator through the data reading thread is the same as the process of loading the historical KV cache of the layer 1 in step 301, and will not be described again.
In step 303, while the compute thread of the accelerator performs the computation of a given layer, the data-read thread of the accelerator simultaneously loads the KV cache of the following layer into the first buffer. After the computation of the current layer is completed, the compute thread can immediately start the computation of the following layer based on the loaded data, which hides the time slot between two adjacent layers caused by loading the data required by the next layer and improves the efficiency of session reasoning.
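The layer-wise overlap of steps 301 to 303 can be sketched as follows; Python threads stand in for the accelerator's compute, data-read, and data write-back streams, and the buffer contents are dummy strings.

```python
import threading

N = 4
host_memory = {i: f"history-kv-layer-{i}" for i in range(1, N + 1)}
first_buffer, written_back = {}, {}

def load(layer):                      # data-read thread body
    first_buffer[layer] = host_memory[layer]

def write_back(layer):                # data write-back thread body
    written_back[layer] = f"new-kv-layer-{layer}"

def compute(layer):                   # compute-thread body (placeholder work)
    return f"output-of-layer-{layer} using {first_buffer[layer]}"

load(1)                               # layer-1 cache is ready before computation starts
for i in range(1, N + 1):
    workers = []
    if i < N:                         # prefetch the next layer's history KV cache
        workers.append(threading.Thread(target=load, args=(i + 1,)))
    if i > 1:                         # write back the KV cache produced by layer i-1
        workers.append(threading.Thread(target=write_back, args=(i - 1,)))
    for w in workers:
        w.start()
    result = compute(i)               # overlaps with the loads/write-backs above
    for w in workers:
        w.join()                      # layer i+1 starts only after its data is loaded
print(result, sorted(written_back))
```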
Step 304, the accelerator sequentially performs the layer 3 computation to the layer N computation of the 1st iteration on the first session in the same manner as in step 303, and infers the 1st token in the output of the first session.
Steps 301 to 304 describe the 1st iteration on the first session, i.e., the pre-filling stage of the first session reasoning process, which infers the 1st token in the output of the first session. The decoding stage of the first session reasoning process is described below through step 305; the decoding stage includes at least one iteration, and each iteration infers one token in the output of the first session.
Step 305, the accelerator sequentially performs the 2nd iteration to the M-th iteration on the first session in a manner similar to steps 301 to 304, sequentially inferring the 2nd to M-th tokens in the output of the first session, where M is equal to the total number of iterations of the first session and M is an integer greater than or equal to 2.
Any one of the 2nd to M-th iterations is the same as the 1st iteration, except that each layer computation in the 1st iteration generates the intermediate data key and value corresponding to each token in the token sequence input by the first session (the question prompt of the first session), whereas each layer computation in any of the 2nd to M-th iterations generates the intermediate data key and value corresponding to the token obtained in the previous iteration; the rest is not repeated.
Step 306, while the accelerator performs the layer N computation of the M-th iteration on the first session, the accelerator loads the history KV cache of the input of the second session in the task queue at layer 1 from the memory of the host into the second buffer of the accelerator.
Wherein the task queue is used for indicating the pending session. The second session is the first pending session in the task queue. The second buffer area is an area in the HBM of the accelerator for storing the KV cache to be processed by the accelerator, that is, the historical KV cache input in layer 1 for storing the second session.
The accelerator loads the history KV cache of the second session input in the layer 1 from the memory of the host through the data reading thread to the second buffer.
In this method, the HBM of the accelerator includes a first buffer and a second buffer. The first buffer stores the KV cache of the session the accelerator is currently processing, and the second buffer stores the KV cache of the session the accelerator will process next. While the accelerator processes the first session based on the KV cache in the first buffer, it loads the KV cache required by the layer 1 computation of the second session into the second buffer. When the accelerator finishes processing the first session, it processes the second session based on the KV cache in the second buffer, at which point the second buffer becomes the execution buffer, and the accelerator releases the first buffer through an asynchronous thread while processing the second session. The computation thread of the accelerator can therefore start processing the second session immediately after finishing the first session, without waiting for the first buffer to be released. The idle gap between two sessions that would otherwise be caused by loading the data required by the next session is hidden, which improves reasoning efficiency.
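A minimal sketch of this double-buffering arrangement, assuming Python dictionaries stand in for the first and second buffers in the HBM and a background thread stands in for the asynchronous release, could look as follows (the class and method names are invented for the example):

```python
import threading

class SessionBuffers:
    def __init__(self):
        self.active_buf = {}    # first buffer: KV of the session being processed
        self.standby_buf = {}   # second buffer: layer-1 KV of the next pending session

    def stage_next_session(self, session_id, layer1_kv):
        # runs while the N-th layer of the current session is still computing
        self.standby_buf[(session_id, 1)] = layer1_kv

    def switch_to_next_session(self):
        old = self.active_buf
        # the standby buffer becomes the execution buffer for the next session
        self.active_buf, self.standby_buf = self.standby_buf, {}
        # release the previous session's buffer without blocking the compute thread
        threading.Thread(target=old.clear, daemon=True).start()

bufs = SessionBuffers()
bufs.active_buf[("job1", 1)] = "KV of the first session"
bufs.stage_next_session("job2", "layer-1 KV of the second session")
bufs.switch_to_next_session()
print(list(bufs.active_buf))   # [('job2', 1)]
```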
It should be noted that step 306 is optional. In some embodiments the accelerator includes only the first buffer; after the N-th layer computation of the M-th iteration of the first session is completed, the accelerator releases the first buffer and then loads the historical KV cache of the second session's input at layer 1 into the first buffer. The embodiment of the present application is not limited in this respect. In the embodiment described above, using the HBM of the accelerator as the first buffer increases the capacity of the first buffer, so that the complete historical KV cache of a session can be accommodated even when its data size is large. This avoids later-loaded historical KV cache overwriting earlier-loaded historical KV cache due to insufficient capacity, which increases the hit rate of the historical KV cache in the first buffer and avoids repeatedly loading the historical KV cache.
Step 307: if, when the accelerator completes the N-th layer computation of the M-th iteration of the first session, part of the KV cache generated by the generative large model while processing the first session has not yet been written into the memory, the accelerator writes that KV cache from the first buffer into a third buffer of the accelerator.
While the accelerator performs the computation of the (i+1)-th layer through the computation thread, the data write-back thread of the accelerator writes the KV cache generated by the computation of the i-th layer back from the first buffer into the memory of the host, where i is an integer greater than or equal to 1 and less than N. The third buffer is an area in the HBM of the accelerator for storing KV cache that has not yet been written back.
In this method, the accelerator further includes a third buffer. After the accelerator finishes processing the first session, it moves the KV cache generated during the first session that has not yet been written back to the memory from the first buffer into the third buffer. Because data copying between the first buffer and the third buffer is fast, the accelerator can quickly drain the unwritten KV cache out of the first buffer after finishing the first session, and thus quickly release the first buffer so that the KV cache required by subsequent processing can be loaded into it, instead of waiting until all KV cache generated by the first session has been written into the memory. The idle gap between two sessions that would otherwise be caused by waiting for the previous session's data to be written back is therefore hidden, which improves reasoning efficiency.
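A possible sketch of the hand-off through the third buffer, under the assumption that dictionaries stand in for the buffers and a Python thread stands in for the data write-back thread, is:

```python
import threading

host_memory = {}     # stand-in for the KV store in the host's memory
third_buffer = {}    # KV generated by the finished session but not yet written back

def finish_session(first_buffer, already_written_back):
    """Fast HBM-to-HBM copy of unflushed KV, so the first buffer can be released at once."""
    for key, kv in first_buffer.items():
        if key not in already_written_back:
            third_buffer[key] = kv
    first_buffer.clear()   # the first buffer is now free for the next session's KV

def drain_third_buffer():
    """Runs concurrently with the next session's computation."""
    for key in list(third_buffer):
        host_memory[key] = third_buffer.pop(key)

first_buffer = {("job1", 1): "kv-L1", ("job1", 2): "kv-L2", ("job1", 3): "kv-L3"}
finish_session(first_buffer, already_written_back={("job1", 1), ("job1", 2)})
writer = threading.Thread(target=drain_third_buffer)
writer.start()        # overlaps with processing of the second session
writer.join()
print(host_memory)    # {('job1', 3): 'kv-L3'}
```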
It should be noted that step 307 is optional. In some embodiments the accelerator includes only the first buffer; after the accelerator completes the N-th layer computation of the M-th iteration of the first session, it releases the first buffer only after all KV cache generated while processing the first session has been written from the first buffer into the memory of the host. In the embodiment described above, using the HBM of the accelerator as the first buffer increases the capacity of the first buffer, so that the complete historical KV cache of a session can be accommodated even when its data size is large. This avoids later-loaded historical KV cache overwriting earlier-loaded historical KV cache due to insufficient capacity, which increases the hit rate of the historical KV cache in the first buffer and avoids repeatedly loading the historical KV cache.
Step 308, the accelerator processes the second session in the same manner as the above steps 301 to 307, and simultaneously writes the KV cache in the third buffer of the accelerator into the memory of the host.
In step 308, the process of writing the KV cache in the third buffer into the memory of the host by the accelerator is the same as the process of writing the KV cache obtained by the layer 1 calculation of the 1 st iteration into the memory of the host by the accelerator in step 303, and will not be described again.
With this method, the accelerator can preload the historical KV cache required by the (i+1)-th layer computation from the memory of the host into the accelerator while performing the i-th layer computation of the generative large model on a session. Because the computation and the data loading proceed in parallel, the computation does not need to wait for data loading to finish, the time overhead of the accelerator accessing the external storage device can be hidden, and the efficiency of multi-round session reasoning is improved. Furthermore, the KV cache generated by the i-th layer computation is written back into the memory of the host while the accelerator performs the (i+1)-th layer computation. Compared with writing the KV cache generated by the current round into the memory of the host only after the current round of reasoning is completed, this hides the idle gap between two adjacent rounds that would be caused by writing back the data generated by the current round, and thus improves reasoning efficiency.
The flow of steps 301 to 308 is illustrated below through fig. 6 and fig. 7. Fig. 6 is a schematic diagram of the computation process and the data loading process running in parallel in the multi-round session reasoning method provided by the embodiment of the present application. As shown in fig. 6, the generative large model includes 3 layers (L1, L2 and L3). After the previous session is processed, the accelerator processes the current session through the generative large model; before the computation of the L1 layer starts, the KV cache required by that layer is already ready in the first buffer in the HBM of the accelerator. The data-loading thread of the accelerator, i.e., the KV cache read flow, loads the KV cache required by the L1 layer computation into the first buffer; the computation thread of the accelerator, i.e., the execution flow, then starts the L1 layer computation; while the computation thread performs the L1 layer computation, the data-loading thread simultaneously loads the KV cache required by the L2 layer computation into the first buffer, and so on. The time the accelerator spends loading KV cache from the memory of the host therefore overlaps with the computation time, hiding the time overhead of loading the KV cache from the memory of the host and significantly improving reasoning efficiency.
Fig. 7 is a schematic diagram of the computation process and the data write-back process running in parallel in the multi-round session reasoning method provided by the embodiment of the application. The generative large model includes 3 layers (L1, L2 and L3). In the prefill stage, the computation thread of the accelerator, i.e., the execution flow, performs the computation of the L1 layer; after the computation thread completes the L1 layer, it performs the computation of the L2 layer while the data write-back thread of the accelerator, i.e., the KV cache write-back flow, writes the KV cache generated by the L1 layer computation back from the first buffer into the memory of the host; after the computation thread completes the L2 layer, it performs the computation of the L3 layer while the data write-back thread writes the KV cache generated by the L2 layer computation back from the first buffer into the memory of the host, and so on. In the decoding stage, the data write-back process is identical to that of the prefill stage and is not repeated. In the data write-back process shown in fig. 7, the time the accelerator spends writing the generated KV cache back to the memory of the host overlaps with the computation time, so the time overhead of writing the generated KV cache back to the memory of the host is hidden and reasoning efficiency is significantly improved.
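As a companion to the sketch given after step 303, the write-back overlap of fig. 7 can be pictured as follows; this is again an illustrative sketch with invented names (host_memory, write_back, compute_layer), not the claimed implementation.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

N_LAYERS = 3
host_memory = {}                        # stand-in for the per-layer KV store in the host's memory

def write_back(layer, kv):
    host_memory[layer] = kv             # stand-in for the HBM-to-DRAM copy

def compute_layer(hidden):
    new_kv = np.random.rand(16, 64)     # KV produced by this layer's computation
    return hidden + new_kv.mean(), new_kv

hidden = np.zeros(64)
with ThreadPoolExecutor(max_workers=1) as writer:
    pending = None
    for layer in range(N_LAYERS):
        hidden, kv = compute_layer(hidden)
        if pending is not None:
            pending.result()            # the previous layer's write-back finished during this compute
        pending = writer.submit(write_back, layer, kv)  # flush this layer while the next layer computes
    pending.result()                    # flush the last layer before the iteration ends
print(sorted(host_memory))              # [0, 1, 2]
```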
The above describes the data migration process between the accelerator and the memory of the host, and the following describes the data migration process between the memory of the host and the external storage device.
The data migration between the memory of the host and the external storage device proceeds as follows: if the historical KV cache of a first pending session in the task queue does not exist in the memory of the host, the host loads the historical KV cache of the first pending session from the external storage device into the memory of the host, where the first pending sessions are the first number of pending sessions at the head of the task queue.
The number of session rounds included in the first pending sessions is determined based on the capacity of the memory of the host; that is, the first number is determined based on the capacity of the memory of the host. For example, first number = available storage space of the memory of the host ÷ average data size of the historical KV cache of a session. Note that, for the historical KV cache of the first session currently being processed by the accelerator, the accelerator writes newly generated KV cache back to the memory of the host while performing computation, so as to update the historical KV cache of the first session in the memory of the host; the space occupied by the historical KV cache of the first session in the memory of the host is therefore system-occupied and cannot be released before the processing of the first session is completed. Accordingly, available storage space of the memory of the host = memory capacity of the host − capacity of the system-occupied space.
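A toy calculation of the first number under assumed capacities (the figures below are invented purely for illustration) is:

```python
host_memory_capacity_gb = 512
pinned_by_current_session_gb = 32      # KV of the session being processed; cannot be released yet
avg_kv_cache_per_session_gb = 40

available_gb = host_memory_capacity_gb - pinned_by_current_session_gb
first_number = available_gb // avg_kv_cache_per_session_gb
print(first_number)                    # 12: KV caches of 12 pending sessions can be prefetched
```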
In some embodiments, if the available storage space of the memory of the host is sufficient, the host loads the historical KV cache of the first pending session from the external storage device into the memory of the host. If the available storage space of the memory of the host is insufficient, the host writes the historical KV cache of a second pending session in the task queue from the memory of the host into the external storage device, and loads the historical KV cache of the first pending session from the external storage device into the memory of the host, where the second pending session is a session in the task queue other than the first pending session.
Those skilled in the art can decide according to actual needs how to determine whether the available storage space of the memory of the host is sufficient. For example, in some embodiments, if the capacity of the available storage space of the memory of the host is greater than or equal to a target threshold, the available storage space of the host is sufficient, and if it is less than the target threshold, the available storage space of the host is insufficient. In other embodiments, if the capacity of the available storage space of the memory of the host is greater than or equal to the data size of the historical KV cache of the first pending session, the available storage space of the host is sufficient, and if it is less than that data size, the available storage space of the host is insufficient. These determination manners are merely exemplary, and the embodiment of the present application does not limit how the determination is made.
In some embodiments, the memory of the host includes, in addition to its available storage space, a fourth buffer configured to store the historical KV cache loaded from the external storage device when the available storage space of the memory of the host is insufficient. Because the memory of the host includes the fourth buffer, sufficient staging space is guaranteed when KV cache is migrated from the external storage device to the memory of the host. When the available storage space of the host is insufficient, the host can therefore write the KV cache preloaded from the external storage device into the fourth buffer without waiting for KV cache in the memory of the host to be evicted to the external storage device, avoiding the stall that eviction would otherwise cause when the available storage space of the host is insufficient.
In some embodiments, if the available storage space of the external storage device is insufficient, the historical KV cache of a third pending session in the task queue is deleted from the external storage device, where the third pending session is a second number of pending sessions at the tail of the task queue. The second number may be determined according to practical situations, for example, the second number is 1 or 2, which is not limited in the embodiment of the present application.
The data migration between the memory of the host and the external storage device is illustrated through fig. 8. Fig. 8 is a schematic diagram of the data migration between the memory of the host and the external storage device in the multi-round session reasoning process according to an embodiment of the present application. As shown in fig. 8, the accelerator is processing Job1, i.e., the first session. The task queue includes 8 pending sessions, Job2 through Job9 in order from the head to the tail of the queue. The memory of the host includes a fourth buffer (the area marked "buf" in fig. 8) and stores the historical KV caches of Job1, Job2 and Job4, namely KV1, KV2 and KV4. The external storage device is a disk, which stores the historical KV caches of Job9, Job8, Job7 and Job3, namely KV9, KV8, KV7 and KV3. The task scheduler in the host maintains a prefetching window for determining the first number of pending sessions in the task queue, i.e., the first pending sessions; in fig. 8 the prefetching window has a size of 2, indicating that the first pending sessions include two rounds of sessions (the first number is 2), namely Job2 and Job3. The task scheduler also maintains an eviction-exemption window (eviction window), with a size of 6 in fig. 8, which indicates the sessions that must not be evicted from the external storage device when the available storage space of the external storage device is insufficient. As shown in fig. 8, among the first pending sessions (Job2 and Job3), the historical KV cache of Job2 (KV2) exists in the memory of the host, while the historical KV cache of Job3 (KV3) does not, and the available storage space of the memory of the host is insufficient. The host therefore loads KV3 from the disk into the fourth buffer in the memory of the host, and writes the historical KV cache of a second pending session in the task queue from the memory of the host to the disk, where the second pending session is a session whose historical KV cache exists in the memory of the host and which is not a first pending session; in fig. 8 the second pending session is Job4, so the host writes the historical KV cache of Job4 (KV4) from the memory of the host to the disk. Because the available storage space of the disk is insufficient, the host deletes the historical KV cache of a third pending session from the disk, where the third pending sessions are the second number of sessions at the tail of the queue outside the eviction-exemption window; in fig. 8 the second number is 1, so the third pending session is Job9 and the host deletes the historical KV cache of Job9 (KV9) from the disk. It should be noted that fig. 8 is merely an example of the data migration between the memory of the host and the external storage device and is not intended to limit the present application.
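The scheduling decisions of fig. 8 can be mirrored in a short, purely illustrative sketch. The window sizes, capacities and job names below follow the example in fig. 8, while the data structures and the helper prefetch() are assumptions of the sketch rather than the claimed scheduler.

```python
from collections import deque

task_queue = deque(["Job2", "Job3", "Job4", "Job5", "Job6", "Job7", "Job8", "Job9"])
host_mem = {"Job1": "KV1", "Job2": "KV2", "Job4": "KV4"}
disk = {"Job9": "KV9", "Job8": "KV8", "Job7": "KV7", "Job3": "KV3"}

PREFETCH_WINDOW = 2          # the first number
EVICTION_EXEMPT_WINDOW = 6   # sessions near the head whose KV must stay on disk
HOST_MEM_SLOTS = 3           # toy capacity: at most 3 KV caches in host memory
DISK_SLOTS = 4               # toy capacity of the external storage device

def prefetch():
    for job in list(task_queue)[:PREFETCH_WINDOW]:
        if job in host_mem:
            continue
        if len(host_mem) >= HOST_MEM_SLOTS:                       # host memory insufficient:
            victim = next(j for j in host_mem
                          if j not in ("Job1", *list(task_queue)[:PREFETCH_WINDOW]))
            disk[victim] = host_mem.pop(victim)                   # evict Job4's KV to disk
        host_mem[job] = disk[job]                                 # load KV3 via the fourth buffer
        while len(disk) > DISK_SLOTS:                             # disk insufficient:
            tail = list(task_queue)[EVICTION_EXEMPT_WINDOW:][-1]  # tail session outside the window
            disk.pop(tail, None)                                  # delete KV9

prefetch()
print(host_mem)  # {'Job1': 'KV1', 'Job2': 'KV2', 'Job3': 'KV3'}
print(disk)      # Job9's KV has been deleted; Job4's KV has been written back to the disk
```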
In this method, the external storage device stores the historical KV cache of completed sessions. Because the capacity of the external storage device is far greater than the capacity of the HBM in the accelerator, the external storage device can store a large amount of historical KV cache, which improves the hit rate of the historical KV cache, avoids recalculating KV cache, improves the efficiency of multi-round session reasoning and saves computing resources.
It should be noted that, to facilitate the description of the multi-round session reasoning method provided by the embodiment of the present application, the data migration between the accelerator and the memory of the host and the data migration between the memory of the host and the external storage device are introduced as two embodiments. The two are not independent of each other: in the multi-round session reasoning method provided by the embodiment of the present application, data migration occurs between the accelerator and the memory of the host, and data migration can also occur between the memory of the host and the external storage device.
Fig. 9 is a schematic structural diagram of a multi-round session reasoning apparatus provided in an embodiment of the present application, applied to an accelerator of a host in a computing system. The computing system further includes the host and an external storage device; the external storage device is used for storing the historical key-value cache (KV cache) of a session, and the host is used for loading the historical KV cache of a to-be-processed session in a task queue from the external storage device into a memory of the host. The apparatus includes a processing module 901 and a loading module 902.
The processing module 901 is configured to infer a first session in the task queue through a large generative model based on the task queue, where the first session is one of multiple sessions, and the large generative model includes N layers;
the loading module 902 is configured to load, from the memory, a history KV cache of the input of the first session at the i+1st layer when performing the i-th layer calculation on the first session, where i is an integer greater than or equal to 1 and less than N.
In some embodiments, the history KV cache of the input of the first session at layer i+1 is independent of the position encoding of the input of the first session.
In some embodiments, the input of the first session includes a number of tokens that is greater than or equal to a number of tokens that can be accommodated by a contextual window of the generative large model;
The loading module 902 is configured to:
When the i-th layer calculation is performed on the first session, load, from the memory, the historical KV cache of a first sub-input at layer i+1, where the first sub-input is a sub-input of the first session and the number of tokens included in the first sub-input is smaller than the number of tokens that the context window can accommodate.
In some embodiments, the processing module 901 includes:
The embedding unit is used for embedding the position code of the first sub-input into the historical KV cache of the first sub-input in the i+1th layer;
and the computing unit is used for computing the first session in the (i+1) -th layer based on the input of the first session and the historical KV cache of the first sub-input in the (i+1) -th layer after embedding the position codes.
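The decoupling of the stored KV cache from the position code, and the embedding performed by the embedding unit above, can be pictured with a small sketch. A rotary-style position code is used below purely as an illustration (the embodiments do not prescribe a particular encoding), and names such as apply_position_code and the shapes involved are assumptions of the example only.

```python
import numpy as np

def rotate_half(x):
    x1, x2 = np.split(x, 2, axis=-1)
    return np.concatenate([-x2, x1], axis=-1)

def apply_position_code(keys, positions, head_dim=64, base=10000.0):
    """Embed rotary-style position codes into position-free cached keys."""
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    angles = positions[:, None] * inv_freq[None, :]                    # (seq, head_dim/2)
    cos = np.concatenate([np.cos(angles), np.cos(angles)], axis=-1)    # (seq, head_dim)
    sin = np.concatenate([np.sin(angles), np.sin(angles)], axis=-1)
    return keys * cos + rotate_half(keys) * sin

cached_keys = np.random.rand(8, 64)     # position-free historical keys of the first sub-input
positions = np.arange(8)                # positions of the sub-input after clipping to the context window
keys_for_attention = apply_position_code(cached_keys, positions)
print(keys_for_attention.shape)         # (8, 64)
```

Because the cached keys carry no position information, the same cached entries remain valid when the context window overflows and the sub-input's positions shift; only the positions passed to apply_position_code change.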
In some embodiments, the accelerator includes a first buffer for storing a history KV cache of the first session and a second buffer for storing a history KV cache of the input at layer 1 of a second session in the task queue, the second session being a first pending session in the task queue after the first session;
The loading module 902 is configured to:
When the i-th layer calculation is performed on the first session, load the historical KV cache of the input of the first session at layer i+1 from the memory into the first buffer.
In some embodiments, the loading module 902 is further configured to:
When the N-th layer calculation is performed on the first session, load the historical KV cache of the input of the second session at layer 1 from the memory into the second buffer.
In some embodiments, the apparatus further comprises:
And the writing module is used for writing KV cache generated by performing the i-th layer calculation on the first session into the memory when the i+1th layer calculation is performed on the first session.
In some embodiments, the accelerator includes a third buffer for storing a KV cache generated by the generative large model processing the first session;
The writing module is used for:
If, when the N-th layer calculation of the first session is completed, part of the KV cache generated by the generative large model while processing the first session has not been written into the memory, write that KV cache into the third buffer;
And when the second session in the task queue is processed through the generative large model, writing the KV cache in the third buffer into the memory.
Fig. 10 is a schematic structural diagram of a multi-round session reasoning apparatus provided in an embodiment of the present application, applied to a host in a computing system. The computing system further includes an accelerator of the host and an external storage device; the external storage device is used for storing the historical key-value cache (KV cache) of a session. The accelerator is used for inferring, based on a task queue, a first session in the task queue through a generative large model, where the first session is one round of a multi-round session and the generative large model includes N layers, and for loading, from the memory while performing the i-th layer calculation on the first session, the historical KV cache of the input of the first session at layer i+1, where i is an integer greater than or equal to 1 and less than N. The apparatus includes a loading module 1001 and a storage module 1002.
The loading module 1001 is configured to load, from the external storage device, a history KV cache of a session to be processed in a task queue to a memory of the host;
the storage module 1002 is configured to store a history KV cache of the session to be processed.
In some embodiments, the loading module 1001 includes:
And the loading unit is used for loading the historical KV cache of the first to-be-processed session from the external storage device to the memory if the historical KV cache of the first to-be-processed session in the task queue does not exist in the memory, wherein the first to-be-processed sessions are the first number of to-be-processed sessions at the head of the task queue.
In some embodiments, the number of session rounds included in the first pending session is determined based on the capacity of the memory.
In some embodiments, the loading unit is to:
if the available storage space of the memory is enough, loading the historical KV cache of the first session to be processed from the external storage device to the memory;
If the available storage space of the memory is insufficient, writing the historical KV cache of the second to-be-processed session in the task queue from the memory into the external storage device, and loading the historical KV cache of the first to-be-processed session from the external storage device to the memory, wherein the second to-be-processed session is a session in the task queue other than the first to-be-processed session.
In some embodiments, the apparatus further comprises a deletion module for:
If the available storage space of the external storage device is insufficient, deleting the historical KV cache of a third to-be-processed session in the task queue from the external storage device, wherein the third to-be-processed session is a second number of to-be-processed sessions at the tail of the task queue.
The processing module 901, the loading module 902, the loading module 1001 and the storage module 1002 may each be implemented by software or by hardware. The implementation of the processing module 901 is described next as an example; the implementations of the loading module 902, the loading module 1001 and the storage module 1002 may refer to that of the processing module 901.
Taking a module as an example of a software functional unit, the processing module 901 may include code running on a computing instance. The computing instance may include at least one of a physical host (computing device), a virtual machine, and a container, and there may be one or more such computing instances. For example, the processing module 901 may include code running on multiple hosts/virtual machines/containers. It should be noted that the multiple hosts/virtual machines/containers used to run the code may be distributed in the same region or in different regions. Further, they may be distributed in the same availability zone (AZ) or in different AZs, each AZ comprising one data center or multiple geographically close data centers. A region typically comprises a plurality of AZs.
Also, multiple hosts/virtual machines/containers for running the code may be distributed in the same virtual private cloud (virtual private cloud, VPC) or may be distributed in multiple VPCs. In general, one VPC is disposed in one region, and a communication gateway is disposed in each VPC for implementing inter-connection between VPCs in the same region and between VPCs in different regions.
Taking a module as an example of a hardware functional unit, the processing module 901 may include at least one computing device, such as a server. Alternatively, the processing module 901 may be a device implemented using an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or the like. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.
The multiple computing devices included in the processing module 901 may be distributed in the same region or in different regions, in the same AZ or in different AZs, and in the same VPC or in multiple VPCs. The multiple computing devices may be any combination of computing devices such as servers, ASICs, PLDs, CPLDs, FPGAs, and GALs.
It should be noted that, in other embodiments, each of the processing module 901, the loading module 902, the loading module 1001 and the storage module 1002 may be used to perform any step in the multi-round session reasoning method, and the steps that these modules are responsible for implementing may be specified as needed. The processing module 901 and the loading module 902 respectively implement different steps performed by the accelerator in the multi-round session reasoning method, thereby realizing the overall functions of the multi-round session reasoning apparatus shown in fig. 9; the loading module 1001 and the storage module 1002 respectively implement different steps performed by the host in the multi-round session reasoning method, thereby realizing the overall functions of the multi-round session reasoning apparatus shown in fig. 10.
The present application also provides a computing device 1100. Fig. 11 is a schematic diagram of a computing device according to an embodiment of the present application, where, as shown in fig. 11, a computing device 1100 includes a bus 1101, a processor 1102, a memory 1103, and a communication interface 1104. The processor 1102, memory 1103 and communication interface 1104 communicate via the bus 1101. Computing device 1100 can be a computing device or a terminal device. It should be appreciated that the present application is not limited to the number of processors, memories in computing device 1100.
The bus 1101 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. Buses may be divided into address buses, data buses, control buses, and so on. For ease of illustration, only one line is shown in fig. 11, but this does not mean there is only one bus or only one type of bus. The bus 1101 may include a path for transferring information between the components of the computing device 1100 (e.g., the memory 1103, the processor 1102 and the communication interface 1104).
The processor 1102 may include any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).
The memory 1103 may include volatile memory, such as random access memory (RAM). The memory 1103 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
The memory 1103 stores executable program codes, and the processor 1102 executes the executable program codes to implement the functions of the foregoing processing module 901 and the loading module 902, respectively, so as to implement the steps executed by the accelerator in the multi-round session reasoning method, or implement the functions of the foregoing loading module 1001 and the storage module 1002, so as to implement the steps executed by the host in the multi-round session reasoning method. That is, the memory 1103 has instructions stored thereon for performing the multi-round session reasoning method. Fig. 11 shows, by way of example only, the memory 1103 storing program codes that implement the functions of the aforementioned processing module 901 and loading module 902.
Communication interface 1104 enables communication between computing device 1100 and other devices or communication networks using a transceiver module such as, but not limited to, a network interface card, transceiver, or the like.
The embodiment of the application also provides a computing device cluster. FIG. 12 is a schematic diagram of a computing device cluster including at least one computing device 1100, as shown in FIG. 12, provided by an embodiment of the application. The same instructions for performing the multi-round session reasoning method may be stored in the memory 1103 in one or more computing devices 1100 in the cluster of computing devices.
In some possible implementations, part of the instructions for performing the multi-round session reasoning method may also be stored separately in the memory 1103 of one or more computing devices 1100 in the cluster of computing devices. In other words, a combination of one or more computing devices 1100 may collectively execute instructions for performing the multi-round session reasoning method.
It should be noted that, the memory 1103 in different computing devices 1100 in the computing device cluster may store different instructions for performing part of the functions of the multiple session reasoning apparatus respectively. That is, instructions stored in the memory 1103 in different computing devices 1100 may implement the functions of one or more of the aforementioned processing module 901, loading module 902, loading module 1001, and storage module 1002.
It should be appreciated that the functionality of the computing device 1100 shown in fig. 12 may also be performed by multiple computing devices 1100.
In some possible implementations, one or more computing devices in a cluster of computing devices may be connected through a network. The network may be a wide area network, a local area network, or the like. Fig. 13 is a schematic diagram of one possible implementation of a computing device cluster according to an embodiment of the present application. As shown in fig. 13, a computing device 1100A and a computing device 1100B are connected through a network, specifically through the communication interfaces in the respective computing devices. In this type of implementation, instructions for performing the functions of the processing module 901 are stored in the memory 1103 in the computing device 1100A, and instructions for performing the functions of the loading module 902 are stored in the memory 1103 in the computing device 1100B.
The embodiment of the application also provides another computing device cluster. The connection between computing devices in the computing device cluster may be similar with reference to the connection of fig. 12 or fig. 13. In contrast, the memory 1103 in one or more computing devices 1100 in the computing device cluster may have stored therein the same instructions for performing the multi-round session reasoning method.
In some possible implementations, part of the instructions for performing the multi-round session reasoning method may also be stored separately in the memory 1103 of one or more computing devices 1100 in the cluster of computing devices. In other words, a combination of one or more computing devices 1100 may collectively execute instructions for performing the multi-round session reasoning method.
It should be noted that, the memory 1103 in different computing devices 1100 in the computing device cluster may store different instructions for performing part of the functions of the multiple session reasoning apparatus respectively. That is, instructions stored in the memory 1103 in different computing devices 1100 may implement the functions of one or more of the aforementioned processing module 901, loading module 902, loading module 1001, and storage module 1002.
An embodiment of the present application provides an accelerator comprising a computational core for performing steps performed by the accelerator in a multi-round session reasoning method as provided by any one of the possible implementations of the method embodiments described above, and a memory for storing computational data, the computational core for performing computational operations on the computational data stored in the memory.
Embodiments of the present application provide a host comprising a processor and a memory for performing the steps performed by the host in a multi-round session reasoning method as provided by any of the possible implementations of the method embodiments described above.
Embodiments of the present application also provide a computer program product comprising instructions. The computer program product may be software or a program product containing instructions capable of running on a computing device or stored in any useful medium. The computer program product, when run on at least one computing device, causes the at least one computing device to perform a multi-round session reasoning method.
The embodiment of the application also provides a computer readable storage medium. The computer readable storage medium may be any available medium that can be stored by a computing device or a data storage device such as a data center containing one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk), etc. The computer-readable storage medium includes instructions that instruct a computing device to perform a multi-round session reasoning method.
It should be noted that, the information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.), and signals related to the present application are all authorized by the user or are fully authorized by the parties, and the collection, use, and processing of the related data is required to comply with the relevant laws and regulations and standards of the relevant countries and regions. For example, sessions and storage space etc. involved in the present application are all acquired with sufficient authorization.
Those of ordinary skill in the art will appreciate that the various method steps and elements described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the steps and components of the various embodiments have been described generally in terms of functionality in the foregoing description to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Those of ordinary skill in the art may implement the described functionality using different approaches for each particular application, but such implementation is not considered to be beyond the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices, or elements, or may be an electrical, mechanical, or other form of connection.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment of the present application.
In addition, each unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computing device (which may be a personal computer, a server, a computing device, etc.) to execute all or part of the steps of the methods in the various embodiments of the present application. The storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
The terms "first," "second," and the like in this disclosure are used for distinguishing between similar elements or items having substantially the same function and function, and it should be understood that there is no logical or chronological dependency between the terms "first," "second," and "n," and that there is no limitation on the amount and order of execution. It will be further understood that, although the following description uses the terms first, second, etc. to describe various elements, these elements should not be limited by the terms. These terms are only used to distinguish one element from another element. For example, a first session may be referred to as a second session, and similarly, a second session may be referred to as a first session, without departing from the scope of the various examples. Both the first session and the second session may be node sessions, and in some cases may be separate and distinct sessions.
The terms "at least one" and "at least one" are used interchangeably herein to mean one or more, and the term "plurality" is used herein to mean two or more.
It should also be understood that the term "if" may be interpreted to mean "when" or "upon") or "in response to a determination" or "in response to detection. Similarly, the phrase "if determined" or "if detected [ stated condition or event ]" may be interpreted to mean "upon determination" or "in response to determination" or "upon detection of [ stated condition or event ]" or "in response to detection of [ stated condition or event", depending on the context.
The foregoing description is merely illustrative of the present application, and the scope of the present application is not limited thereto, and any equivalent modifications or substitutions will be apparent to those skilled in the art within the scope of the present application, and are intended to be included within the scope of the present application. Therefore, the protection scope of the application is subject to the protection scope of the claims.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer program instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus.
The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer program instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired or wireless means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a digital video disc (DVD)), or a semiconductor medium (e.g., a solid state drive), etc.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the above storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing embodiments are merely intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that modifications may still be made to the technical solutions described in the foregoing embodiments, or equivalent replacements may be made to some of the technical features; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.