Technical Field
The present disclosure relates to the field of computer technology, and in particular to a data processing method, a data processing apparatus, a computer storage medium, and an electronic device.
Background
Distributed reinforcement learning is one of the paradigms and methodologies of machine learning; it is used to describe and solve the problem of an agent learning a policy through interaction with an environment so as to maximize a reward or achieve a specific goal. A distributed reinforcement learning algorithm is typically composed of a learner and actors: an actor uses the artificial intelligence (AI) policy learned by the learner to interact with the environment and generate experience data, while the learner collects the actors' experience data to improve the AI policy.
Existing distributed reinforcement learning frameworks are implemented mainly on top of the Python ecosystem. Due to the limitation of the Global Interpreter Lock (GIL), even on a multi-core processor, only one thread at a time is allowed to receive and process the experience data sent by the actors.
However, the above existing technical solution tends to make the learner inefficient at receiving and processing data.
It should be noted that the information disclosed in the Background section above is intended only to enhance understanding of the background of the present disclosure, and therefore may include information that does not constitute prior art known to those of ordinary skill in the art.
Summary
The present disclosure provides a data processing method, a data processing apparatus, a computer storage medium, and an electronic device, thereby improving the efficiency of receiving and processing data.
In a first aspect, an embodiment of the present disclosure provides a data processing method, the method including: in response to receiving a data transmission instruction sent by an actor, starting a plurality of worker threads between a learner and the actor according to the data transmission instruction; receiving, through the plurality of worker threads, first target experience data sent by the actor, where the first target experience data is determined according to first network parameters of a reinforcement learning model; and updating the first network parameters of the reinforcement learning model to second network parameters corresponding to the first target experience data, so that the actor generates experience data corresponding to the second network parameters.
In a second aspect, an embodiment of the present disclosure provides a data processing apparatus, the apparatus including: a thread starting module configured to, in response to receiving a data transmission instruction sent by an actor, start a plurality of worker threads between a learner and the actor according to the data transmission instruction; a data receiving module configured to receive, through the plurality of worker threads, first target experience data sent by the actor, where the first target experience data is determined according to first network parameters of a reinforcement learning model; and a parameter updating module configured to update the first network parameters of the reinforcement learning model to second network parameters corresponding to the first target experience data, so that the actor generates experience data corresponding to the second network parameters.
In a third aspect, an embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the data processing method described above.
In a fourth aspect, an embodiment of the present disclosure provides an electronic device, including: a processor; and a memory for storing instructions executable by the processor, where the processor is configured to perform the data processing method described above by executing the executable instructions.
The technical solutions of the present disclosure have the following beneficial effects:
The above data processing method, in response to receiving a data transmission instruction sent by an actor, starts a plurality of worker threads between the learner and the actor according to the data transmission instruction; receives, through the plurality of worker threads, first target experience data sent by the actor, where the first target experience data is determined according to first network parameters of a reinforcement learning model; and updates the first network parameters of the reinforcement learning model to second network parameters corresponding to the first target experience data, so that the actor generates experience data corresponding to the second network parameters. On the one hand, the method introduces multithreading between the learner and the actors of a distributed reinforcement learning model, so that multiple CPU cores run the same program simultaneously and multiple threads receive and process the experience data sent by the actors in parallel. This avoids the technical problem that an existing reinforcement learning model, constrained by the GIL of the Python ecosystem, can receive and process experience data only on a single CPU, which lowers data processing efficiency and creates a computing bottleneck, and thereby achieves the technical effect of improving the efficiency of receiving and processing data. On the other hand, the first target experience data received through the plurality of worker threads all resides in a shared memory space and can be used directly by the learner for learning to update the network parameters of the reinforcement learning model; this avoids the limitation of the prior art, in which Python cannot receive the first target experience data with multiple threads, thereby broadening the applicability of the data processing process.
It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and do not limit the present disclosure.
Brief Description of the Drawings
The accompanying drawings, which are incorporated into and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the present disclosure. Apparently, the drawings in the following description are only some embodiments of the present disclosure, and those of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 schematically shows the basic principle of reinforcement learning in this exemplary embodiment;
FIG. 2 schematically shows a distributed reinforcement learning agent architecture in this exemplary embodiment;
FIG. 3 schematically shows a system architecture diagram of a data processing system in this exemplary embodiment;
FIG. 4 schematically shows a flowchart of a data processing method in this exemplary embodiment;
FIG. 5 schematically shows a data transmission process in this exemplary embodiment;
FIG. 6 schematically shows the structure of a data processing apparatus in this exemplary embodiment;
FIG. 7 schematically shows the structure of another data processing apparatus in this exemplary embodiment;
FIG. 8 schematically shows the structure of an electronic device in this exemplary embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the concepts of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of the embodiments of the present disclosure. However, those skilled in the art will appreciate that the technical solutions of the present disclosure may be practiced while omitting one or more of the specific details, or that other methods, components, apparatuses, steps, and the like may be adopted. In other instances, well-known technical solutions are not shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and repeated descriptions of them are therefore omitted. Some of the block diagrams shown in the drawings are functional entities that do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor apparatuses and/or microcontroller apparatuses.
The flowcharts shown in the drawings are only illustrative and need not include all steps. For example, some steps may be decomposed, while other steps may be combined or partially combined, so the actual execution order may change according to the actual situation.
To help those skilled in the art better understand the technical solutions of the present disclosure, the relevant concepts involved in these technical solutions are introduced below.
(1) Reinforcement learning (RL), also known as evaluative or trial-and-reward learning, is one of the paradigms and methodologies of machine learning; it is used to describe and solve the problem of an agent learning a policy through interaction with an environment so as to maximize a return or achieve a specific goal.
(2) Global Interpreter Lock (GIL): a mechanism used by some programming-language interpreters to synchronize threads, so that within the same process only one thread is executing at any moment. Even on a multi-core CPU platform, the presence of the GIL prevents the parallel execution of multiple threads (in the Python language, the GIL protects interpreter internals such as the reference counting used for garbage collection).
(3) Agent: a concept in the field of artificial intelligence. Built on the cloud with artificial intelligence (AI) at its core, an agent constitutes an intelligent system featuring comprehensive perception, global collaboration, accurate judgment, continuous evolution, and openness.
(4) Tensor: the basic data unit in a deep learning framework, which can be regarded as a high-dimensional array.
(5) Serialization and deserialization: serialization is the process of converting an object into a byte sequence; deserialization is the reverse process, that is, restoring a byte sequence back into an object.
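As an illustrative sketch (not part of the disclosure itself), the serialization/deserialization round trip defined above can be shown with Python's standard `pickle` module; the `Experience` record here is a hypothetical example structure, not one defined by the disclosure.

```python
import pickle
from dataclasses import dataclass

@dataclass
class Experience:
    # Hypothetical experience record: state, action, reward.
    state: tuple
    action: int
    reward: float

exp = Experience(state=(0.1, 0.2), action=3, reward=1.5)

# Serialization: object -> byte sequence.
payload = pickle.dumps(exp)
assert isinstance(payload, bytes)

# Deserialization: byte sequence -> object.
restored = pickle.loads(payload)
assert restored == exp
```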
(6) Thread: the smallest unit that an operating system can schedule for computation; its function is to execute a specific task in an application, such as a piece of a program or a function. A thread is a single sequential flow of control within a process; multiple threads can run concurrently within one process, each executing a different task in parallel.
The data processing method provided by the exemplary embodiments of the present disclosure can be applied to any application scenario that uses a reinforcement learning framework. Reinforcement learning has been widely applied in many fields, such as robot control, autonomous driving, 5G communication, games, and intelligent logistics.
FIG. 1 schematically shows the basic principle of reinforcement learning in this exemplary embodiment. Reinforcement learning has found extensive application in the field of game AI. It developed from theories such as animal learning and parameter-perturbation adaptive control, and its basic principle is shown in FIG. 1. In a game environment, the AI agent observes its current state and makes a decision (action); after acting on the environment, it enters a new state and obtains a reward signal, and this loop iterates continuously. The reinforcement algorithm gradually adjusts the AI's decisions in different states according to the reward signal: if a behavior policy of the agent leads to a positive reward (reinforcement signal) from the environment, the agent's tendency to adopt that behavior policy in the future is strengthened. The agent's goal is to find, in each discrete state, an optimal policy that maximizes the expected sum of discounted rewards.
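The observe-act-reward loop above can be sketched with a toy example; the two-action environment, the epsilon-greedy selection, and the update rule below are illustrative assumptions, not the disclosure's algorithm. The point is only that a positive reward strengthens the tendency to repeat the rewarded action.

```python
import random

random.seed(0)

# Hypothetical toy environment: action 1 yields reward 1.0, action 0 yields 0.0.
def environment_step(action):
    next_state = "s"  # single-state environment for simplicity
    return next_state, (1.0 if action == 1 else 0.0)

preferences = [0.0, 0.0]  # the agent's learned preference per action

def choose_action():
    # Epsilon-greedy: mostly exploit the current best action, sometimes explore.
    if random.random() < 0.1:
        return random.randrange(2)
    return max(range(2), key=lambda a: preferences[a])

for _ in range(500):
    a = choose_action()
    _, reward = environment_step(a)
    # Incremental update toward the observed reward (reinforcement signal).
    preferences[a] += 0.1 * (reward - preferences[a])

# The rewarded action ends up preferred.
assert preferences[1] > preferences[0]
```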
To support the training of complex AI, mainstream reinforcement learning frameworks usually adopt a distributed reinforcement learning training architecture, which typically contains learners (also called policy bodies) and actors (also called execution bodies). An actor usually runs on a central processing unit (CPU), iterating a loop of executing actions in the environment and running inference on the model to predict the next action. The actor frequently updates the parameters of its inference model; after collecting enough observation data, it sends the trajectories of observations and actions to the learner, and the learner then optimizes the model.
FIG. 2 schematically shows a distributed reinforcement learning framework architecture in this exemplary embodiment. As shown in FIG. 2, one learner drives a large number of actors to learn the AI policy: multiple actors respectively use the AI policies learned by the learner (policy 1, policy 2, ..., policy n) to interact with the game environment and generate experience data (experience 1, experience 2, ..., experience n), and the learner collects this experience data to improve the AI policy. However, the number of actors is usually very large, and the experience data they generate can be enormous, which poses a severe challenge to the learner's data receiving capability.
At present, existing distributed reinforcement learning frameworks are mainly based on the Python ecosystem: the learner uses a single CPU to asynchronously receive and process experience data, and the experience data is essentially of a byte data type. Moreover, after the learner server receives the experience data, it needs to perform a series of data processing steps, such as decompression and deserialization.
Here, the learner-side data processing is the inverse of the processing performed when the actor transmits data to the learner. For example, during the interaction between an actor and the learner, to save network bandwidth and facilitate transmission, the experience data sent by the actor to the learner is usually compressed, and a preset data structure is converted into a data stream through serialization (for example, converting objects into byte data). Accordingly, upon receiving the experience data sent by the actor, the learner performs a decompression operation and restores the decompressed experience data to the preset data structure through deserialization (taking the open-source Python machine learning library PyTorch as an example, the data usually consists of several tensors). When the experience data accumulated by the learner reaches a preset amount (the batch size, configured by developers according to the actual situation), the learner integrates these experience data for learning the AI policy.
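The compress/serialize pipeline on the actor side and its inverse on the learner side can be sketched with Python's standard `zlib` and `pickle` modules; the batch size of 4 and the dictionary-shaped experience record below are illustrative assumptions, not values from the disclosure.

```python
import pickle
import zlib

BATCH_SIZE = 4  # illustrative preset amount ("batch size")

def actor_encode(experience):
    # Actor side: serialize the preset data structure into a byte
    # stream, then compress it to save network bandwidth.
    return zlib.compress(pickle.dumps(experience))

def learner_decode(payload):
    # Learner side: the inverse pipeline - decompress, then
    # deserialize back into the preset data structure.
    return pickle.loads(zlib.decompress(payload))

buffer = []
for i in range(BATCH_SIZE):
    msg = actor_encode({"state": [i, i + 1], "action": i % 2, "reward": float(i)})
    buffer.append(learner_decode(msg))

# Once the accumulated experiences reach the preset amount, they are
# integrated into one batch for learning.
assert len(buffer) == BATCH_SIZE
assert buffer[2]["reward"] == 2.0
```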
However, because the related art runs in a Python environment and the Python interpreter uses the GIL, a Python program (the experience-data receiving and processing procedure) can run on only one CPU no matter how many CPU cores the system has, which causes thread congestion, lowers data processing efficiency, and creates a system performance bottleneck. In addition, after receiving experience data, the learner needs to promptly assemble it into batch data for learning, and this process must run in a shared memory space; the Python ecosystem cannot implement this process, which limits its applicability.
In view of the above problems, the exemplary embodiments of the present disclosure propose a data processing method that uses multiple threads to receive and process experience data in parallel. For example, the process of receiving experience data may be implemented with a C++ module: since C++ supports multithreading and C++ programs can run on multiple CPUs simultaneously, experience data can be received and processed in parallel by multiple threads, achieving the technical effect of improving data processing efficiency.
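The worker-thread structure described above can be sketched as follows. Note the hedge: this sketch uses Python's `threading` for readability, so under CPython's GIL these threads would not achieve true CPU parallelism; the disclosure's point is that the same structure, implemented in a C++ module, does. The in-process `queue.Queue` stands in for the network socket, and the worker count is an illustrative assumption.

```python
import pickle
import queue
import threading
import zlib

NUM_WORKERS = 4            # illustrative worker-thread count
inbox = queue.Queue()      # stands in for the network transport
shared_buffer = []         # shared memory space for decoded experiences
buffer_lock = threading.Lock()

def worker():
    # Each worker thread receives a compressed, serialized message,
    # decodes it, and appends the result to the shared buffer.
    while True:
        payload = inbox.get()
        if payload is None:  # shutdown signal
            return
        experience = pickle.loads(zlib.decompress(payload))
        with buffer_lock:
            shared_buffer.append(experience)

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

# Simulated actor messages.
for i in range(100):
    inbox.put(zlib.compress(pickle.dumps({"id": i, "reward": 1.0})))
for _ in threads:
    inbox.put(None)  # one shutdown signal per worker
for t in threads:
    t.join()

assert len(shared_buffer) == 100
```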
To solve the above problems, the present disclosure proposes a data processing method and apparatus, which can be applied to the system architecture of the exemplary application environment shown in FIG. 3.
As shown in FIG. 3, the system architecture 300 may include one or more of terminal devices 301, 302, and 303, a network 304, and a server 305. The network 304 is the medium that provides communication links between the terminal devices 301, 302, 303 and the server 305. The network 304 may include various connection types, such as wired or wireless communication links, or fiber-optic cables. The terminal devices 301, 302, 303 may be, for example, smartphones, personal digital assistants (PDAs), laptops, servers, desktop computers, or any other computing devices with networking capability, but are not limited thereto.
It should be understood that the numbers of terminal devices, networks, and servers in FIG. 3 are merely illustrative; there may be any number of terminal devices, networks, and servers according to implementation needs. For example, the server 305 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms.
The data processing method provided by the embodiments of the present disclosure may be executed on the server 305, and accordingly the data processing apparatus is generally disposed in the server 305. The data processing method provided by the embodiments of the present disclosure may also be executed in a terminal device, and accordingly the data processing apparatus may also be disposed in the terminal device. The data processing method provided by the embodiments of the present disclosure may further be executed partly in the server 305 and partly in a terminal device; accordingly, some modules of the data processing apparatus may be disposed in the server 305 and other modules in the terminal device.
For example, in an exemplary embodiment, taking a game scene as the exemplary application scenario, games run on the terminal devices 301, 302, and 303; players can log in to the game platform on the terminal devices 301, 302, 303 and participate in the game, and the server 305 can match the terminal devices 301, 302, 303 into the same game session. While players play on the terminal devices, the multiple terminal devices and the server 305 constitute a distributed reinforcement learning training framework system: multiple actors distributed across the terminal devices generate training data from the game process in parallel, and the learner server corresponds to the server 305. In response to receiving a data transmission instruction sent by an actor, the server 305 starts a plurality of worker threads between the learner and the actor according to the data transmission instruction; receives, through the plurality of worker threads, first target experience data sent by the actor, where the first target experience data is determined according to first network parameters of a reinforcement learning model; and updates the first network parameters of the reinforcement learning model to second network parameters corresponding to the first target experience data, so that the actor performs a target action corresponding to the second network parameters.
However, those skilled in the art will readily understand that the above application scenario is only an example, and this exemplary embodiment is not limited thereto.
In the following, the server 305 is taken as the execution subject, and the application of the data processing method to the server 305 is used as an example for description. Referring to FIG. 4, the data processing method provided by the exemplary embodiment of the present disclosure includes the following steps S401 to S403:
Step S401: in response to receiving a data transmission instruction sent by an actor, start a plurality of worker threads between the learner and the actor according to the data transmission instruction.
Step S402: receive, through the plurality of worker threads, first target experience data sent by the actor, where the first target experience data is determined according to first network parameters of a reinforcement learning model.
Step S403: update the first network parameters of the reinforcement learning model to second network parameters corresponding to the first target experience data, so that the actor performs a target action corresponding to the second network parameters.
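Steps S401 to S403 can be tied together in a minimal sketch. The class and method names below, the thread count, and the version-number stand-in for "network parameters" are all illustrative assumptions, not the disclosure's actual implementation; an in-process queue again stands in for the actor-to-learner transport.

```python
import pickle
import queue
import threading

class Learner:
    # Minimal sketch of steps S401-S403.
    def __init__(self, num_threads=2):
        self.num_threads = num_threads
        self.network_params = {"version": 1}  # first network parameters
        self.experiences = []                 # shared buffer of received data
        self.lock = threading.Lock()

    def on_data_transmission_instruction(self, inbox):
        # S401: start worker threads in response to the instruction.
        threads = [threading.Thread(target=self._receive, args=(inbox,))
                   for _ in range(self.num_threads)]
        for t in threads:
            t.start()
        return threads

    def _receive(self, inbox):
        # S402: receive first target experience data over worker threads.
        while True:
            payload = inbox.get()
            if payload is None:
                return
            with self.lock:
                self.experiences.append(pickle.loads(payload))

    def update_parameters(self):
        # S403: update the first network parameters to second network
        # parameters derived from the received experience data.
        self.network_params = {"version": self.network_params["version"] + 1}

learner = Learner()
inbox = queue.Queue()
threads = learner.on_data_transmission_instruction(inbox)
for i in range(10):
    inbox.put(pickle.dumps({"reward": float(i)}))
for _ in threads:
    inbox.put(None)
for t in threads:
    t.join()
learner.update_parameters()

assert len(learner.experiences) == 10
assert learner.network_params["version"] == 2
```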
In the technical solution provided in FIG. 4, in response to receiving a data transmission instruction sent by an actor, a plurality of worker threads between the learner and the actor are started according to the data transmission instruction; first target experience data sent by the actor is received through the plurality of worker threads, where the first target experience data is determined according to first network parameters of a reinforcement learning model; and the first network parameters of the reinforcement learning model are updated to second network parameters corresponding to the first target experience data, so that the actor generates experience data corresponding to the second network parameters. On the one hand, the method introduces multithreading between the learner and the actors of a distributed reinforcement learning model, so that multiple CPU cores run the same program simultaneously and multiple threads receive and process the experience data sent by the actors in parallel. This avoids the technical problem that an existing reinforcement learning model, constrained by the GIL of the Python ecosystem, can receive and process experience data only on a single CPU, which lowers data processing efficiency and creates a system computing bottleneck, and thereby achieves the technical effect of improving the efficiency of receiving and processing data. On the other hand, the experience data received through the plurality of worker threads all resides in a shared memory space and can be used directly by the learner for learning to update the network parameters of the reinforcement learning model; this avoids the limitation of the prior art, in which Python cannot receive experience data with multiple threads, thereby broadening the applicability of the data processing process.
The specific implementation of each step in the embodiment shown in FIG. 4 is described in detail below.
In step S401, in response to receiving a data transmission instruction sent by an actor, a plurality of worker threads between the learner and the actor are started according to the data transmission instruction.
The number of learners and of actors may each be one or more; for example, one learner may establish network connections with multiple actors, or multiple learners may form a cluster by establishing network connections with multiple actors. The embodiments of the present disclosure impose no special limitation on this.
Exemplarily, taking the distributed reinforcement learning framework shown in FIG. 2 as an example, when an actor generates experience data, it sends a data transmission instruction to the learner server to request that the generated experience data be sent to the learner server. At the same time, a plurality of worker threads between the learner and the actor are started, so that multiple CPU cores of a multi-core processor run the same program at the same time. The number of worker threads started between the learner and the actor can be pre-configured by the user, that is, adjusted according to the actual situation.
应该理解的是,该过程可以使用任意可实现同一时刻启动多个线程的编程语言,例如C++,本公开实施例对此不作任何特殊限制。It should be understood that this process may use any programming language capable of starting multiple threads at the same time, such as C++, which is not specifically limited in the embodiments of the present disclosure.
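As a rough illustration of the worker-thread logic described above (the disclosure implements this in C++; the Python version below is a sketch only, and the receive callable `recv_fn` is a hypothetical stand-in for an actual network read):

```python
import queue
import threading

# Shared buffer that all worker threads push received experience data into.
shared_buffer = queue.Queue()

def worker(recv_fn, stop_event):
    """Receive loop run by each worker thread.

    recv_fn is a hypothetical blocking callable that returns the next chunk
    of experience data from an actor, or None when the actor disconnects.
    """
    while not stop_event.is_set():
        data = recv_fn()
        if data is None:  # sentinel: actor closed the connection
            break
        shared_buffer.put(data)

def start_workers(recv_fn, num_threads):
    """Start the user-configured number of worker threads."""
    stop_event = threading.Event()
    threads = [threading.Thread(target=worker, args=(recv_fn, stop_event))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    return threads, stop_event
```

The number of threads passed to `start_workers` corresponds to the user-configured thread count mentioned above.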
In step S402, first target experience data transmitted by the actor is received through the multiple worker threads, where the first target experience data is determined according to first network parameters of the reinforcement learning model.
The first network parameters are the parameter information corresponding to the AI policy that the learner sends to the actor.
According to some embodiments of the present disclosure, after the multiple threads are started, a network connection between the learner and the actor can be established, so that the first target experience data is transmitted to the learner while the connection status is normal.
Exemplarily, the embeddable networking library ZMQ can be used to establish the network connection between the learner and the actor, since ZMQ is designed for building large-scale distributed systems under high concurrency and is well suited to message passing between multiple threads.
In some exemplary embodiments of the present disclosure, when step S402 of receiving the first target experience data transmitted by the actor through the multiple worker threads is performed, initial experience data transmitted by the actor may be received through the multiple worker threads, and the initial experience data may be deserialized to obtain the first target experience data.
Exemplarily, the initial experience data is usually raw byte data. When the multiple worker threads asynchronously receive initial experience data of the raw byte type, a deserialization operation is needed to convert the byte data into first target experience data of a preset data structure. For example, the preset data structure may be an object, so that the transfer of objects is realized through serialization and deserialization.
According to some embodiments of the present disclosure, when the learner sends the AI policy to the actor, objects are serialized into byte data so that they can be saved to a file or transmitted over the network; when the actor sends experience data to the learner, a deserialization operation converts the byte data (the experience data) back into the preset data structure (for example, an object) for subsequent processing.
For example, taking PyTorch, an open-source Python machine learning library, as an example, initial experience data of the raw byte type can be constructed into tensor objects.
It should be understood that the deserialization operation can be implemented with the serialization library MessagePack (often abbreviated msgpack), with JSON, or in any other way; the embodiments of the present disclosure impose no special limitation on this.
Serialization and deserialization not only guarantee the integrity and transferability of objects while they are being transmitted and saved, but also make long-term persistence of objects possible.
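The serialization round trip described above can be illustrated with Python's standard `pickle` module, used here only as a stand-in for MessagePack or another wire format:

```python
import pickle

# Actor side: convert an experience record (a Python object) into raw bytes
# so it can be sent over the network.
experience = {"state": [0.1, 0.2], "action": 3, "reward": 1.0}
raw_bytes = pickle.dumps(experience)   # serialization: object -> bytes

# Learner side: reconstruct the original object from the received bytes.
restored = pickle.loads(raw_bytes)     # deserialization: bytes -> object
assert restored == experience
```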
In some exemplary embodiments of the present disclosure, when the step of deserializing the initial experience data to obtain the first target experience data is performed, the initial experience data may first be decompressed to obtain first intermediate experience data, and the first intermediate experience data may then be deserialized to obtain the first target experience data.
That is, when the initial experience data transmitted by the actor is received through the multiple worker threads, the initial experience data is decompressed before it is deserialized.
Exemplarily, in order to save network bandwidth and improve transmission efficiency, the initial experience data sent by the actor is usually compressed before transmission. The learner therefore performs a decompression operation after receiving the initial experience data (for example using the LZ4 algorithm, which offers high compression and decompression throughput) to restore the data to its pre-compression form.
It can be understood that the above LZ4 compression and decompression algorithm is exemplary; other methods such as the ZIP algorithm or Snappy may also be used, and the embodiments of the present disclosure impose no special limitation on this.
During data transmission, compression and decompression save transmission time and network resources, thereby improving data transmission and processing efficiency.
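A minimal sketch of the decompress-then-deserialize order described above, using the standard-library `zlib` in place of LZ4 and `pickle` in place of msgpack (both substitutions are for illustration only):

```python
import pickle
import zlib

def pack(experience):
    """Actor side: serialize first, then compress before sending."""
    return zlib.compress(pickle.dumps(experience))

def unpack(payload):
    """Learner side: decompress first, then deserialize (the reverse order)."""
    return pickle.loads(zlib.decompress(payload))
```

Note that the learner must undo the actor's two steps in reverse: decompression before deserialization.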
In some exemplary embodiments of the present disclosure, when the initial experience data is deserialized to obtain the first target experience data, the initial experience data may be deserialized to obtain second intermediate experience data; the second intermediate experience data is stored in a cache queue; and when the amount of second intermediate experience data in the cache queue is greater than or equal to a data volume threshold, that amount of second intermediate experience data is merged to obtain the first target experience data.
Again taking PyTorch as an example, the deserialized tensor data is stored in the cache queue until the amount of tensor data in the queue is greater than or equal to a data volume threshold (the batch size), or until the available capacity of the cache queue falls below a preset capacity; the threshold amount of tensor data is then merged into one batch, so that the learner can train on that batch.
Merging the data only after the amount in the cache queue reaches the threshold is done for the sake of data processing efficiency. Placing all the experience data of the data processing process into shared memory and then loading it into GPU memory to update the network parameters of the model would demand more storage space than is practical, while having the learner directly load and train on the experience data of each single pass would be inefficient. Merging the data into batches once the threshold is reached strikes a balance between storage space and efficiency.
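The cache-queue batching described above can be sketched as follows; the threshold value is hypothetical, and plain lists stand in for tensors:

```python
import queue

BATCH_SIZE = 4  # hypothetical data-volume threshold (the batch size)

def push_and_maybe_batch(buffer, record, batch_size=BATCH_SIZE):
    """Store one deserialized record in the cache queue; once the queue
    holds `batch_size` records, drain and merge them into a single batch
    that the learner can train on. Returns the batch, or None if the
    threshold has not been reached yet."""
    buffer.put(record)
    if buffer.qsize() >= batch_size:
        return [buffer.get() for _ in range(batch_size)]
    return None
```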
Meanwhile, in an optional embodiment of the present disclosure, for application scenarios in which multiple actors are connected to the learner server and transmitting data simultaneously (for example, multiple players matched into the same game session in a game scenario), a first preset number of actors may be divided into a second preset number of actor sets, and the multiple worker threads may be divided into a third preset number of worker thread groups. For a target actor set among the second preset number of actor sets, a target worker thread group matching the target actor set is determined from the third preset number of worker thread groups according to a pre-configured thread mapping table, and the first target experience data transmitted by the target actor set is received through that target worker thread group.
The thread mapping table records the worker thread group corresponding to each actor set.
Exemplarily, when the actors transmit first target experience data to the learner server, the worker threads would ordinarily all receive and process the data through a single transmission interface. With a huge volume of data in transit, however, this easily causes congestion and puts pressure on the system's data transmission.
For example, suppose 20 actors are currently connected to one learner server and the learner server has started 10 worker threads. Without dividing the worker threads, the 10 worker threads receive the first target experience data sent by all 20 actors through a single transmission interface. When the data volume is large, data congestion easily occurs, increasing transmission pressure on the system; when the data volume is small, some threads may be under heavy transmission load while others transmit nothing, wasting system resources.
When the worker threads and the multiple actors are divided, however, the 20 actors are first divided into 5 groups of 4 actors each, and the 10 worker threads are likewise divided into 5 groups of 2 threads each, so that in every group 2 worker threads carry the experience data sent by 4 actors. The 5 worker thread groups can then be assigned to different transmission interfaces, each group receiving and processing data on its own interface.
Therefore, to solve the above problems, the actors and the multiple worker threads can each be divided, yielding multiple actor sets and worker thread groups; the third preset number of worker thread groups can then receive, through different transmission interfaces, the first target experience data transmitted by their corresponding actor sets, further improving data transmission and processing efficiency.
It should be understood that the worker threads and the first preset number of actors may be divided evenly or in any other manner, and the embodiments of the present disclosure impose no special limitation on this.
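An even division of actors and worker threads into matched groups, as in the 20-actor/10-thread example above, can be sketched as follows (the integer identifiers for actors and threads are illustrative):

```python
def build_thread_map(num_actors, num_threads, num_groups):
    """Evenly divide the actors into `num_groups` actor sets and the worker
    threads into `num_groups` thread groups, pairing set i with group i.
    Returns the thread mapping table: actor set -> worker thread group."""
    actors_per_group = num_actors // num_groups
    threads_per_group = num_threads // num_groups
    mapping = {}
    for g in range(num_groups):
        actor_set = tuple(range(g * actors_per_group,
                                (g + 1) * actors_per_group))
        thread_group = tuple(range(g * threads_per_group,
                                   (g + 1) * threads_per_group))
        mapping[actor_set] = thread_group
    return mapping

# 5 actor sets of 4 actors, each served by its own group of 2 worker threads.
table = build_thread_map(num_actors=20, num_threads=10, num_groups=5)
```

In a real deployment each entry of this table would also carry its own transmission interface, so that the groups do not contend for one endpoint.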
In an optional embodiment of the present disclosure, after receiving the first target experience data transmitted by the actor through the multiple worker threads, the learner server may monitor the network status information of its connection to the actor in real time; if the network status information does not satisfy preset status information, it sends a data retransmission request to the actor, the request asking the actor to retransmit the first target experience data once the network status information satisfies the preset status information.
The network status information comprises the various parameters of the network connection, for example the transmission bandwidth and the data transfer rate.
Exemplarily, when the learner server starts the multiple threads, a network connection is established between the learner server and each actor (for example, the connections are built via ZMQ), so that the first target experience data transmitted by each actor is received through the multiple worker threads.
During data transmission, the learner server monitors the network connection information between the learner and each actor, either in real time or at preset intervals. If, while an actor is transmitting first target experience data to the learner server, the learner server detects that the data transfer rate is below a preset transfer rate, the actor's current network condition is considered poor. To prevent loss of the first target experience data, the learner server can send a data retransmission request to that actor, so that the actor retransmits the first target experience data once the data transfer rate is greater than or equal to the preset transfer rate.
By monitoring the network status information of its connection to each actor and triggering retransmission whenever that information fails to satisfy the preset status information, the learner server ensures that the data sent by the actor is not lost and that the data transmission process is reliable.
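The retransmission check described above can be sketched as follows; the preset rate and the callback name are hypothetical:

```python
PRESET_RATE = 10.0  # hypothetical preset data-transfer rate (e.g. MB/s)

def check_and_request_retransmit(measured_rate, send_retransmit_request,
                                 preset_rate=PRESET_RATE):
    """Learner-side link check: when the measured transfer rate falls below
    the preset rate, invoke the callback that asks the actor to retransmit
    the first target experience data once the link recovers.
    Returns True when the link is healthy, False otherwise."""
    if measured_rate < preset_rate:
        send_retransmit_request()
        return False
    return True
```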
In step S403, the first network parameters of the reinforcement learning model are updated to second network parameters corresponding to the first target experience data, so that the actor generates experience data corresponding to the second network parameters.
Exemplarily, upon receiving the first target experience data, the learner trains the reinforcement learning model and updates its network parameters, so that the actor performs the behavior corresponding to the second network parameters; meanwhile, the actor continues to generate experience data from the current environment and the second network parameters, until a termination condition is reached and the training process of the reinforcement learning model ends.
After step S403 updates the first network parameters of the reinforcement learning model to the second network parameters corresponding to the first target experience data so that the actor performs the target action corresponding to the second network parameters, it may further be judged whether the termination condition of reinforcement learning model training has been reached. If the termination condition is reached, the actor stops sending experience data to the learner server; otherwise, the actor continues sending experience data to the learner server, so that the reinforcement learning model keeps updating the AI policy from the received experience data.
In an optional embodiment of the present disclosure, if the precision difference between a first behavior parameter corresponding to the target action and a second behavior parameter corresponding to a reference action is greater than a precision threshold, second target experience data generated by the actor under the second network parameters is received, and the second network parameters of the reinforcement learning model are updated to third network parameters corresponding to the second target experience data, so that the actor performs the action corresponding to the third network parameters.
Exemplarily, the learner server updating the network parameters (the AI policy) of the reinforcement learning model from the experience data sent by the actor is a cyclic process, continuously improving the AI policy from the current environment and experience data and raising the sophistication of the actor's target actions.
When the precision difference between the behavior parameters of the target action performed by the actor (under the AI policy and the current environment) and those of the reference action is less than or equal to the precision threshold, the actor is considered to be performing actions with high accuracy, and the training process of the reinforcement learning model can end; conversely, when that precision difference is greater than the precision threshold, the actor is considered to be performing actions with low accuracy, and model training must continue to improve the precision of the actor's actions.
In another optional embodiment of the present disclosure, if the number of iterations of training the reinforcement learning model has not yet exceeded the preset number of iterations, second target experience data generated by the actor under the second network parameters is received, and the second network parameters of the reinforcement learning model are updated to third network parameters corresponding to the second target experience data, so that the actor performs the action corresponding to the third network parameters.
The preset number of iterations is a maximum iteration count set in advance by the user.
To prevent the training process of the reinforcement learning model from entering an endless loop, a maximum number of iterations is usually pre-configured for it, so that model training stays within that bound, improving data and system safety.
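The two termination criteria discussed above (the precision difference and the maximum iteration count) can be combined in a sketch like this; the default threshold values are hypothetical:

```python
def should_stop_training(precision_diff, iteration,
                         precision_threshold=0.01, max_iterations=10000):
    """Stop training when the precision gap between the target action and
    the reference action is small enough, or when the preset maximum
    iteration count has been exceeded (guarding against an endless
    training loop)."""
    return precision_diff <= precision_threshold or iteration > max_iterations
```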
Further, taking a C++ implementation of the data processing method provided by the present disclosure as an example, in order to shorten developers' research cycle and improve development efficiency, the data processing process provided by the present disclosure can be packaged as a C++ module, so that the Python side can call the C++ module directly.
The entire data processing process described above is explained in detail below with reference to FIG. 5. As shown in FIG. 5, it involves the learner and the actor, where the learner comprises a Python side and a C++ module.
Taking one pass of the data processing process as an example: on the Python side, the user can pre-set the number of threads to start and the data volume threshold (batch size) of the cache queue. When the learner receives a data transmission instruction sent by an actor, the Python side is driven to call the C++ module, which starts the multiple worker threads between the learner and the actor and establishes the network connections.
Each worker thread asynchronously receives the initial experience data transmitted by the actor, where the initial experience data is of the byte type.
In an optional embodiment of the present disclosure, in order to save network bandwidth, the initial experience data sent by the actor is usually compressed, so the learner decompresses the initial experience data it receives to obtain the first intermediate experience data.
In an optional embodiment of the present disclosure, the decompressed first intermediate experience data is deserialized to obtain the second intermediate experience data, and the second intermediate experience data is stored in the cache queue.
Each worker thread repeats the above process of receiving and processing experience data and storing the second intermediate experience data in the cache queue, until the amount of second intermediate experience data stored in the cache queue is greater than or equal to the data volume threshold; that preset amount of second intermediate experience data in the cache queue is then merged to obtain the first target experience data (a batch), which is pushed up to the Python side.
Upon receiving the first target experience data, the Python side can train and update the network parameters of the reinforcement learning model. The learner and the actors repeat this data exchange to update the network parameters of the reinforcement learning model until the training termination condition is reached.
To implement the above data processing method, an embodiment of the present disclosure provides a data processing device. FIG. 6 schematically shows an architecture diagram of the data processing device.
The data processing device 600 includes a thread starting module 601, a data receiving module 602 and a parameter updating module 603.
The thread starting module 601 is configured to, in response to receiving a data transmission instruction sent by an actor, start multiple worker threads between the learner and the actor according to the data transmission instruction; the data receiving module 602 is configured to receive, through the multiple worker threads, first target experience data transmitted by the actor, where the first target experience data is determined according to first network parameters of the reinforcement learning model; the parameter updating module 603 is configured to update the first network parameters of the reinforcement learning model to second network parameters corresponding to the first target experience data, so that the actor generates experience data corresponding to the second network parameters.
The data processing device 600 provided by the embodiments of the present disclosure can carry out the technical solution of the data processing method of any of the above embodiments; its implementation principle and beneficial effects are similar to those of the data processing method and are not repeated here.
Further, to implement the above data processing method, another embodiment of the present disclosure provides another data processing device. FIG. 7 schematically shows an architecture diagram of that data processing device.
On the basis of the thread starting module 601, the data receiving module 602 and the parameter updating module 603, the data processing device 700 further includes a request sending module 604.
In an optional embodiment, the data receiving module 602 is specifically configured to receive, through the multiple worker threads, initial experience data transmitted by the actor, and to deserialize the initial experience data to obtain the first target experience data.
In an optional embodiment, the data receiving module 602 is specifically configured to decompress the initial experience data to obtain first intermediate experience data, and to deserialize the first intermediate experience data to obtain the first target experience data.
In an optional embodiment, the data receiving module 602 is specifically configured to deserialize the first intermediate experience data to obtain second intermediate experience data; to store the second intermediate experience data in a cache queue; and, when the amount of second intermediate experience data in the cache queue is greater than or equal to the data volume threshold, to merge that amount of second intermediate experience data to obtain the first target experience data.
In an optional embodiment, the data receiving module 602 is specifically configured to divide a first preset number of actors into a second preset number of actor sets; to divide the multiple worker threads into a third preset number of worker thread groups; for a target actor set among the second preset number of actor sets, to determine, according to a pre-configured thread mapping table, a target worker thread group matching the target actor set from the third preset number of worker thread groups, where the thread mapping table records the worker thread group corresponding to each actor set; and to receive, through the target worker thread group, the first target experience data transmitted by the target actor set.
In an optional embodiment, the data receiving module 602 is specifically configured to receive, when the precision difference between a first behavior parameter corresponding to the target action and a second behavior parameter corresponding to a reference action is greater than a precision threshold, second target experience data generated by the actor under the second network parameters; and the parameter updating module 603 is configured to update the second network parameters of the reinforcement learning model to third network parameters corresponding to the second target experience data.
In an optional embodiment, the data processing device 700 further includes a request sending module 604, where the request sending module 604 is configured to detect the network status information of the connection with the actor and, if the network status information does not satisfy preset status information, to send a data retransmission request to the actor, the request asking the actor to retransmit the first target experience data once the network status information satisfies the preset status information.
The data processing device 700 provided by the embodiments of the present disclosure can carry out the technical solution of the data processing method of any of the above embodiments; its implementation principle and beneficial effects are similar to those of the data processing method and are not repeated here.
在本公开的示例性实施例中,还提供了一种计算机可读存储介质,其上存储有能够实现本说明书上述方法的程序产品。在一些可能的实施方式中,本发明的各个方面还可以实现为一种程序产品的形式,其包括程序代码,当程序产品在终端设备上运行时,程序代码用于使终端设备执行本说明书上述“示例性方法”部分中描述的根据本发明各种示例性实施方式的步骤。In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium on which a program product capable of implementing the above-mentioned method in this specification is stored. In some possible implementations, various aspects of the present invention can also be implemented in the form of a program product, which includes program code. When the program product runs on the terminal device, the program code is used to make the terminal device execute the above-mentioned Steps according to various exemplary embodiments of the invention are described in the "Exemplary Methods" section.
根据本发明的实施方式的用于实现上述方法的程序产品,其可以采用便携式紧凑盘只读存储器(CD-ROM)并包括程序代码,并可以在终端设备,例如个人电脑上运行。然而,本发明的程序产品不限于此,在本文件中,可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。According to the program product for implementing the above method according to the embodiment of the present invention, it may adopt a portable compact disc read-only memory (CD-ROM) and include program codes, and may run on a terminal device such as a personal computer. However, the program product of the present invention is not limited thereto. In this document, a readable storage medium may be any tangible medium containing or storing a program, and the program may be used by or in combination with an instruction execution system, apparatus or device.
程序产品可以采用一个或多个可读介质的任意组合。可读介质可以是可读信号介质或者可读存储介质。可读存储介质例如可以为但不限于电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。可读存储介质的更具体的例子(非穷举的列表)包括:具有一个或多个导线的电连接、便携式盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。A program product may take the form of any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or device, or any combination thereof. More specific examples (non-exhaustive list) of readable storage media include: electrical connection with one or more conductors, portable disk, hard disk, random access memory (RAM), read only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
计算机可读信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了可读程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。可读信号介质还可以是可读存储介质以外的任何可读介质,该可读介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A readable signal medium may also be any readable medium other than a readable storage medium that can send, propagate, or transmit a program for use by, or in combination with, an instruction execution system, apparatus, or device.
可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于无线、有线、光缆、RF等等,或者上述的任意合适的组合。Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
可以以一种或多种程序设计语言的任意组合来编写用于执行本发明操作的程序代码,程序设计语言包括面向对象的程序设计语言—诸如Java、C++等,还包括常规的过程式程序设计语言—诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算设备上执行、部分地在用户设备上执行、作为一个独立的软件包执行、部分在用户计算设备上部分在远程计算设备上执行、或者完全在远程计算设备或服务器上执行。在涉及远程计算设备的情形中,远程计算设备可以通过任意种类的网络,包括局域网(LAN)或广域网(WAN),连接到用户计算设备,或者,可以连接到外部计算设备(例如利用因特网服务提供商来通过因特网连接)。Program code for carrying out the operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In cases involving a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
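As an illustration only (this is not code from the disclosure), the cooperation between program components executing on different units, as described above, can be sketched in Python using the actor/learner pattern from the background section. All names, the queue-based transport, and the per-actor step counts below are hypothetical choices for the sketch:

```python
import queue
import threading


def actor(actor_id, experience_queue, n_steps):
    # Each actor pushes hypothetical experience records to the learner.
    for step in range(n_steps):
        experience_queue.put((actor_id, step, f"obs-{step}"))


def learner(experience_queue, expected):
    # A single learner thread drains the queue. Under CPython's GIL only
    # one bytecode-executing thread runs at a time, which is the receive
    # bottleneck the background section describes.
    received = []
    while len(received) < expected:
        received.append(experience_queue.get())
    return received


def run_demo(n_actors=2, n_steps=3):
    q = queue.Queue()
    threads = [
        threading.Thread(target=actor, args=(i, q, n_steps))
        for i in range(n_actors)
    ]
    for t in threads:
        t.start()
    data = learner(q, n_actors * n_steps)
    for t in threads:
        t.join()
    return data
```

In a real deployment the queue would be replaced by a network transport between devices (for example over a LAN or WAN, as the paragraph above notes); the thread-based version only shows the data flow.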
在本公开的示例性实施例中,还提供了一种能够实现上述方法的电子设备。In an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.
所属技术领域的技术人员能够理解,本发明的各个方面可以实现为系统、方法或程序产品。因此,本发明的各个方面可以具体实现为以下形式,即:完全的硬件实施方式、完全的软件实施方式(包括固件、微代码等),或硬件和软件方面结合的实施方式,这里可以统称为“电路”、“模块”或“系统”。Those skilled in the art will understand that various aspects of the present invention may be implemented as a system, a method, or a program product. Therefore, various aspects of the present invention may be embodied in the following forms: an entirely hardware implementation, an entirely software implementation (including firmware, microcode, etc.), or an implementation combining hardware and software aspects, which may be collectively referred to herein as a "circuit", "module", or "system".
下面参照图8来描述根据本发明的这种实施方式的电子设备800。图8显示的电子设备800仅仅是一个示例,不应对本发明实施例的功能和使用范围带来任何限制。An electronic device 800 according to this embodiment of the present invention is described below with reference to FIG. 8. The electronic device 800 shown in FIG. 8 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present invention.
如图8所示,电子设备800以通用计算设备的形式表现。电子设备800的组件可以包括但不限于:上述至少一个处理单元810、上述至少一个存储单元820、连接不同系统组件(包括存储单元820和处理单元810)的总线830、显示单元840。As shown in FIG. 8, electronic device 800 takes the form of a general-purpose computing device. The components of the electronic device 800 may include, but are not limited to: at least one processing unit 810, at least one storage unit 820, a bus 830 connecting different system components (including the storage unit 820 and the processing unit 810), and a display unit 840.
其中,存储单元存储有程序代码,程序代码可以被处理单元810执行,使得处理单元810执行本说明书上述“示例性方法”部分中描述的根据本发明各种示例性实施方式的步骤。例如,处理单元810可以执行如图4中所示的步骤S401至步骤S403。The storage unit stores program code that can be executed by the processing unit 810, causing the processing unit 810 to perform the steps according to the various exemplary embodiments of the present invention described in the "Exemplary Methods" section of this specification. For example, the processing unit 810 may perform steps S401 to S403 shown in FIG. 4.
存储单元820可以包括易失性存储单元形式的可读介质,例如随机存取存储单元(RAM)8201和/或高速缓存存储单元8202,还可以进一步包括只读存储单元(ROM)8203。The storage unit 820 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 8201 and/or a cache storage unit 8202 , and may further include a read-only storage unit (ROM) 8203 .
存储单元820还可以包括具有一组(至少一个)程序模块8205的程序/实用工具8204,这样的程序模块8205包括但不限于:操作系统、一个或者多个应用程序、其它程序模块以及程序数据,这些示例中的每一个或某种组合中可能包括网络环境的实现。The storage unit 820 may also include a program/utility 8204 having a set of (at least one) program modules 8205; such program modules 8205 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data, and each of these examples, or some combination thereof, may include an implementation of a network environment.
总线830可以为表示几类总线结构中的一种或多种,包括存储单元总线或者存储单元控制器、外围总线、图形加速端口、处理单元或者使用多种总线结构中的任意总线结构的局域总线。The bus 830 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus structures.
电子设备800也可以与一个或多个外部设备900(例如键盘、指向设备、蓝牙设备等)通信,还可与一个或者多个使得用户能与该电子设备800交互的设备通信,和/或与使得该电子设备800能与一个或多个其它计算设备进行通信的任何设备(例如路由器、调制解调器等等)通信。这种通信可以通过输入/输出(I/O)接口850进行。并且,电子设备800还可以通过网络适配器860与一个或者多个网络(例如局域网(LAN),广域网(WAN)和/或公共网络,例如因特网)通信。如图所示,网络适配器860通过总线830与电子设备800的其它模块通信。应当明白,尽管图中未示出,可以结合电子设备800使用其它硬件和/或软件模块,包括但不限于:微代码、设备驱动器、冗余处理单元、外部磁盘驱动阵列、RAID系统、磁带驱动器以及数据备份存储系统等。The electronic device 800 may also communicate with one or more external devices 900 (such as a keyboard, a pointing device, a Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 800, and/or with any device (such as a router, a modem, etc.) that enables the electronic device 800 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 850. Moreover, the electronic device 800 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 860. As shown in the figure, the network adapter 860 communicates with the other modules of the electronic device 800 through the bus 830. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device 800, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
通过以上的实施方式的描述,本领域的技术人员易于理解,这里描述的示例实施方式可以通过软件实现,也可以通过软件结合必要的硬件的方式来实现。因此,根据本公开实施方式的技术方案可以以软件产品的形式体现出来,该软件产品可以存储在一个非易失性存储介质(可以是CD-ROM,U盘,移动硬盘等)中或网络上,包括若干指令以使得一台计算设备(可以是个人计算机、服务器、终端装置、或者网络设备等)执行根据本公开实施方式的方法。Through the description of the above implementations, those skilled in the art will readily understand that the example implementations described here may be implemented by software, or by software combined with the necessary hardware. Therefore, the technical solutions according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions to cause a computing device (which may be a personal computer, a server, a terminal device, a network device, etc.) to perform the method according to the embodiments of the present disclosure.
此外,上述附图仅是根据本发明示例性实施例的方法所包括的处理的示意性说明,而不是限制目的。易于理解,上述附图所示的处理并不表明或限制这些处理的时间顺序。另外,也易于理解,这些处理可以是例如在多个模块中同步或异步执行的。In addition, the above-mentioned figures are only schematic illustrations of the processes included in the method according to the exemplary embodiments of the present invention, and are not intended to be limiting. It is easy to understand that the processes shown in the above figures do not imply or limit the chronological order of these processes. In addition, it is also easy to understand that these processes may be executed synchronously or asynchronously in multiple modules, for example.
应当注意,尽管在上文详细描述中提及了用于动作执行的设备的若干模块或者单元,但是这种划分并非强制性的。实际上,根据本公开的实施方式,上文描述的两个或更多模块或者单元的特征和功能可以在一个模块或者单元中具体化。反之,上文描述的一个模块或者单元的特征和功能可以进一步划分为由多个模块或者单元来具体化。It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, this division is not mandatory. In fact, according to the embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided and embodied by a plurality of modules or units.
本领域技术人员在考虑说明书及实践这里公开的发明后,将容易想到本公开的其他实施例。本申请旨在涵盖本公开的任何变型、用途或者适应性变化,这些变型、用途或者适应性变化遵循本公开的一般性原理并包括本公开未公开的本技术领域中的公知常识或惯用技术手段。说明书和实施例仅被视为示例性的,本公开的真正范围和精神由权利要求指出。Other embodiments of the present disclosure will be readily apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the present disclosure that follow the general principles of the present disclosure and include common knowledge or customary technical means in the technical field that are not disclosed in the present disclosure. The specification and examples are to be considered exemplary only, with the true scope and spirit of the present disclosure being indicated by the claims.
应当理解的是,本公开并不局限于上面已经描述并在附图中示出的精确结构,并且可以在不脱离其范围进行各种修改和改变。本公开的范围仅由所附的权利要求来限定。It should be understood that the present disclosure is not limited to the precise constructions which have been described above and shown in the drawings, and various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310417978.7A (CN116451814A) | 2023-04-14 | 2023-04-14 | Data processing method, device, storage medium and electronic device |
| Publication Number | Publication Date |
|---|---|
| CN116451814A | 2023-07-18 |
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310417978.7A (Pending, CN116451814A) | Data processing method, device, storage medium and electronic device | 2023-04-14 | 2023-04-14 |
| Country | Link |
|---|---|
| CN (1) | CN116451814A (en) |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20200293883A1 (en)* | 2017-10-27 | 2020-09-17 | Deepmind Technologies Limited | Distributional reinforcement learning for continuous control tasks |
| CN111698327A (en)* | 2020-06-12 | 2020-09-22 | 中国人民解放军国防科技大学 | Distributed parallel reinforcement learning model training method and system based on chat room architecture |
| CN113570395A (en)* | 2021-01-22 | 2021-10-29 | 腾讯科技(深圳)有限公司 | Information processing method and device, computer readable medium and electronic equipment |
| CN114861826A (en)* | 2022-05-31 | 2022-08-05 | 中国科学技术大学 | Large-scale reinforcement learning training framework system based on distributed design |
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |