CN117850705B - Artificial intelligent chip and data synchronization method thereof - Google Patents

Artificial intelligent chip and data synchronization method thereof

Info

Publication number
CN117850705B
CN117850705B
CN202410194532.7A
Authority
CN
China
Prior art keywords
synchronization
circuit
computing
memory
lookup table
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410194532.7A
Other languages
Chinese (zh)
Other versions
CN117850705A
Inventor
Name withheld at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bi Ren Technology Co ltd
Beijing Bilin Technology Development Co ltd
Original Assignee
Shanghai Bi Ren Technology Co ltd
Beijing Bilin Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bi Ren Technology Co ltd, Beijing Bilin Technology Development Co ltd
Priority to CN202410194532.7A
Publication of CN117850705A
Application granted
Publication of CN117850705B
Legal status: Active (current)
Anticipated expiration

Abstract

The present disclosure provides an artificial intelligence chip and a data synchronization method thereof. The artificial intelligence chip includes a memory circuit and a plurality of computing circuits coupled to the memory circuit. At least one of the plurality of computing circuits is selectively organized into a computing circuit group to collectively perform an operation task. Based on the operation task, the computing circuit group sends an access request carrying synchronization information to the memory circuit. The memory circuit checks the synchronization information of the access request to determine whether to return the target data corresponding to the access request to the computing circuit group.

Description

Artificial intelligent chip and data synchronization method thereof
Technical Field
The present disclosure relates to an artificial intelligence chip and a data synchronization method thereof.
Background
Computing devices such as artificial intelligence (AI) chips can provide significant computing power. The tremendous computing power of AI chips comes from the large number of hardware execution units (EUs, or execution cores) inside. One AI chip typically contains multiple stream processor clusters (SPCs); each stream processor cluster typically contains multiple compute cores (CUs, or computation units), such as at least one of an integer (INT) compute core, a floating point (FP) compute core, a tensor core, and a vector core; and each compute core typically contains multiple execution cores. The stream processor clusters can support general-purpose computing, scientific computing, and neural network computing by programmatically organizing the various types of compute cores. In many computing task scenarios, different compute cores within the same stream processor cluster may perform the same computing task together, or different stream processor clusters may perform the same computing task together. Accordingly, data synchronization hazards such as read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW) are among the many technical issues in the art.
Disclosure of Invention
The present disclosure is directed to an artificial intelligence (AI) chip and a data synchronization method thereof for performing operation tasks.
In an embodiment according to the present disclosure, the artificial intelligence chip includes a memory circuit and a plurality of computing circuits. The plurality of computing circuits are coupled to the memory circuit. At least one of the plurality of computing circuits is selectively organized into a computing circuit group to collectively perform an operation task. Based on the operation task, the computing circuit group sends an access request carrying synchronization information to the memory circuit. The memory circuit checks the synchronization information to determine whether to return the target data block corresponding to the access request to the computing circuit group.
In an embodiment according to the present disclosure, the data synchronization method of the artificial intelligence chip includes: selectively organizing at least one of a plurality of computing circuits into a computing circuit group to collectively perform an operation task; sending, by the computing circuit group, an access request carrying synchronization information to a memory circuit based on the operation task; and checking, by the memory circuit, the synchronization information to determine whether to return the target data block corresponding to the access request to the computing circuit group.
Based on the above, the access request sent by the computing circuit group to the memory circuit carries synchronization information. For example, in the application scenario where "different compute cores within the same stream processor cluster" serve as the computing circuit group, the compute core group may issue an access request with synchronization information to a memory (e.g., at least one of a level one cache and an input buffer) within the same stream processor cluster. In the application scenario where different stream processor clusters serve as the computing circuit group, a stream processor cluster may issue an access request with synchronization information to a memory (e.g., a secondary cache or other shared memory) shared by the different stream processor clusters. The memory circuit may check the synchronization information carried by the access request to determine whether to return the target data block corresponding to the access request to the computing circuit group. Thus, data synchronization between different computing circuits in the computing circuit group can be ensured.
Drawings
FIG. 1 is a schematic block diagram of an artificial intelligence (AI) chip in accordance with at least one embodiment of the present disclosure.
FIG. 2 is a flow chart of a data synchronization method of an artificial intelligence chip in accordance with at least one embodiment of the present disclosure.
FIG. 3 is a diagram illustrating a data structure of an access request with synchronization information according to at least one embodiment of the present disclosure.
FIG. 4 is a circuit block diagram of an AI chip according to at least one embodiment of the present disclosure.
FIG. 5 is a circuit block diagram of an AI chip according to another embodiment of the present disclosure.
Description of the reference numerals
100, 400, 500: artificial intelligence (AI) chip
110_1, 110_n: computing circuit
120, 413: memory circuit
121: memory
122, 414, 422, SCU51, SCU52_3: synchronization check circuit
410, SPC0, SPC1, SPC2, SPC3: stream processor cluster
411: scheduler
412: compute core
414a, 422a: synchronization lookup table
414b, 422b: check circuit
420, 520: shared memory
421: secondary cache
D1_1: cache line
LR1_1: load request
R1_1: synchronization check result
S1_1: synchronization information
S210, S220, S230: steps
Detailed Description
Reference will now be made in detail to the exemplary embodiments of the present disclosure, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers are used in the drawings and the description to refer to the same or like parts.
The term "coupled" as used throughout this specification (including the claims) may refer to any direct or indirect connection. For example, if a first device is coupled (or connected) to a second device, the connection may be a direct connection, or an indirect connection through other devices and connections. The terms "first", "second", and the like in the description (including the claims) are used for naming components or distinguishing between different embodiments or ranges, and are not used to limit the number of components (as an upper or lower bound) or the order of the components. In addition, wherever possible, the same reference numbers are used throughout the drawings and the description to refer to the same or like parts. Components/elements/steps in different embodiments that use the same reference numerals or the same terminology may refer to the related descriptions of one another.
FIG. 1 is a schematic block diagram of an artificial intelligence (AI) chip 100 according to one embodiment of the present disclosure. The AI chip 100 shown in FIG. 1 includes a plurality of computing circuits (e.g., 110_1, …, 110_n shown in FIG. 1) and a memory circuit 120. The number n of computing circuits may be determined according to the actual design. The computing circuits 110_1 to 110_n can access the memory circuit 120. For example, in the application scenario where "different compute cores within the same stream processor cluster" serve as the computing circuits 110_1 to 110_n, the computing circuits 110_1 to 110_n may access the memory circuit 120, such as at least one of a level one cache (L1 cache) and an input buffer, within the same stream processor cluster. The different compute cores include, for example, an integer (INT) compute core, a floating point (FP) compute core, a tensor core, a vector core, or other compute cores. The stream processor clusters of the AI chip 100 can support general-purpose computing, scientific computing, and neural network computing by programmatically organizing the various types of compute cores. In the application scenario where different stream processor clusters serve as the computing circuits 110_1 to 110_n, the computing circuits 110_1 to 110_n can access the memory circuit 120 shared by the different stream processor clusters, such as a secondary cache (L2 cache) or other shared memory.
The computing circuits 110_1 to 110_n are coupled to the memory circuit 120. At least one of the computing circuits 110_1 to 110_n may be selectively organized into a computing circuit group to collectively perform an operation task. In some embodiments, according to various designs, the computing circuits 110_1 to 110_n and/or the memory circuit 120 may be implemented as hardware circuits. In other embodiments, the computing circuits 110_1 to 110_n and/or the memory circuit 120 may be implemented as a combination of hardware, firmware, or software (i.e., programs).
In hardware, the computing circuits 110_1 to 110_n and/or the memory circuit 120 may be implemented as logic circuits on an integrated circuit (IC). For example, the related functions of the computing circuits 110_1 to 110_n and/or the memory circuit 120 may be implemented in various logic blocks, modules, and circuits in one or more controllers, hardware controllers, microcontrollers, hardware processors, microprocessors, application-specific integrated circuits (ASICs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs), central processing units (CPUs), or other processing units. The related functions of the computing circuits 110_1 to 110_n and/or the memory circuit 120 may be implemented as hardware circuits, such as various logic blocks, modules, and circuits in an integrated circuit, using a hardware description language (e.g., Verilog HDL or VHDL) or another suitable programming language.
In software or firmware, the related functions of the computing circuits 110_1 to 110_n and/or the memory circuit 120 can be implemented as programming code. For example, the computing circuits 110_1 to 110_n and/or the memory circuit 120 are implemented using a general programming language (e.g., C, C++, or assembly language) or another suitable programming language. The programming code may be recorded or stored on a non-transitory machine-readable storage medium. In some embodiments, the non-transitory machine-readable storage medium includes, for example, a semiconductor memory and/or a storage device. The semiconductor memory includes a memory card, a read-only memory (ROM), a flash memory, a programmable logic circuit, or other semiconductor memory. The storage device includes a hard disk drive (HDD), a solid-state drive (SSD), or other storage device. An electronic device (e.g., a CPU, hardware controller, microcontroller, hardware processor, or microprocessor) may read and execute the programming code from the non-transitory machine-readable storage medium to implement the functions associated with the computing circuits 110_1 to 110_n and/or the memory circuit 120.
FIG. 2 is a flow chart of a data synchronization method of an artificial intelligence chip according to an embodiment of the disclosure. In some embodiments, the data synchronization method shown in FIG. 2 may be implemented in firmware or software (i.e., a program). For example, the operations associated with the data synchronization method shown in FIG. 2 may be implemented as non-transitory machine-readable instructions (programming code or a program) stored on a machine-readable storage medium. When the non-transitory machine-readable instructions are executed by a computer, the data synchronization method shown in FIG. 2 is realized. In other embodiments, the data synchronization method of FIG. 2 may be implemented in hardware, such as the artificial intelligence chip 100 of FIG. 1.
Referring to FIG. 1 and FIG. 2, in step S210, at least one of the computing circuits 110_1 to 110_n may be selectively organized into a computing circuit group to jointly execute an operation task. At each memory access, the computing circuit group issues an access request with synchronization information to the memory circuit 120 based on the operation task (step S220). In step S230, the memory circuit 120 checks the synchronization information of the access request to determine whether to return the target data block corresponding to the access request to the computing circuit group.
For example, assume that the computing circuits 110_1 to 110_n are selectively organized into a computing circuit group to collectively perform an operation task. At least one first computing circuit (e.g., computing circuit 110_n) in the computing circuit group issues a storage request to the memory circuit 120 to store data (the target data block) produced during the operation task into the memory circuit 120. The memory circuit 120 includes a synchronization lookup table (not shown in FIG. 1). The memory circuit 120 checks the synchronization information of the storage request sent by the computing circuit 110_n to update the count value corresponding to the target data block in the synchronization lookup table (e.g., increment the count value of the corresponding entry in the synchronization lookup table by 1). At least one second computing circuit (e.g., computing circuit 110_1) in the computing circuit group issues a load request to the memory circuit 120. The memory circuit 120 checks the count value in the synchronization lookup table based on the synchronization information carried by the load request to determine whether to return the target data block corresponding to the load request to the computing circuit 110_1.
In the application scenario where "different compute cores in the same stream processor cluster" serve as the computing circuits 110_1 to 110_n, the computing circuits 110_1 to 110_n shown in FIG. 1 comprise a plurality of compute cores in the same stream processor cluster, and the memory circuit 120 shown in FIG. 1 comprises the memory 121 and the synchronization check circuit 122 in the same stream processor cluster. According to the actual design, in some embodiments, the plurality of compute cores includes at least one tensor core and at least one vector core, and the memory 121 may include any memory (e.g., at least one of a level one cache and an input buffer) within the same stream processor cluster. The synchronization check circuit 122 of the memory circuit 120 includes a synchronization lookup table (not shown in FIG. 1). At least one first compute core (e.g., the computing circuit 110_n) of the plurality of compute cores issues a storage request to the memory 121 to store data (the target data block) produced during the operation task into the memory 121. The synchronization check circuit 122 checks the synchronization information of the storage request sent by the first compute core to update the count value corresponding to the target data block in the synchronization lookup table (e.g., increment the count value of the corresponding entry in the synchronization lookup table by 1). At least one second compute core (e.g., the computing circuit 110_1) of the plurality of compute cores issues a load request to the memory 121. The synchronization check circuit 122 checks the count value in the synchronization lookup table based on the synchronization information carried by the load request to determine whether to notify the memory 121 to return the target data block corresponding to the load request to the second compute core.
In the application scenario where "different stream processor clusters" serve as the computing circuits 110_1 to 110_n, the computing circuits 110_1 to 110_n shown in FIG. 1 comprise a plurality of stream processor clusters, and the memory circuit 120 shown in FIG. 1 comprises a shared memory (e.g., the memory 121 shown in FIG. 1) and the synchronization check circuit 122. For example, the memory 121 shown in FIG. 1 may be a memory shared by the different stream processor clusters (e.g., a secondary cache or other shared memory). At each memory access, the computing circuits 110_1 to 110_n issue an access request with synchronization information to the memory 121 based on the operation task. At least one first stream processor cluster (e.g., the computing circuit 110_n) of the plurality of stream processor clusters issues a storage request to the shared memory 121 to store data (the target data block) produced during the operation task into the shared memory 121. The synchronization check circuit 122 of the memory circuit 120 includes a synchronization lookup table (not shown in FIG. 1). The synchronization check circuit 122 checks the synchronization information of the storage request sent by the first stream processor cluster to update the count value corresponding to the target data block in the synchronization lookup table (e.g., increment the count value of the corresponding entry in the synchronization lookup table by 1). At least one second stream processor cluster (e.g., the computing circuit 110_1) of the plurality of stream processor clusters issues a load request to the shared memory 121. The synchronization check circuit 122 checks the count value in the synchronization lookup table based on the synchronization information carried by the load request to determine whether to notify the shared memory 121 to return the target data block corresponding to the load request to the second stream processor cluster.
For example, the computing circuit 110_1 issues a load request LR1_1 to the memory 121. The synchronization check circuit 122 checks the synchronization information S1_1 of the load request LR1_1 to determine whether to inform the memory 121 to return the target data block corresponding to the load request LR1_1 to the computing circuit 110_1. When the result of checking the synchronization information S1_1 indicates that the target data block in the memory 121 is ready, the synchronization check circuit 122 notifies the memory 121 to return the target data block corresponding to the load request LR1_1 to the computing circuit 110_1. When the result of checking the synchronization information S1_1 indicates that the target data block in the memory 121 is not ready, the synchronization check circuit 122 may report the synchronization check result R1_1 (information indicating that the target data block in the memory 121 is not ready) corresponding to the load request LR1_1 back to the computing circuit 110_1.
FIG. 3 is a diagram illustrating the data structure of an access request with synchronization information according to an embodiment of the present disclosure. Referring to FIG. 1 and FIG. 3, the computing circuits 110_1 to 110_n issue access requests carrying synchronization information to the memory 121. FIG. 3 uses the load request LR1_1 as an illustrative example of an access request; other types of access requests are analogous to the load request LR1_1. In the embodiment shown in FIG. 3, the load request LR1_1 includes a cache line D1_1 and synchronization information S1_1. The specific data structure of the synchronization information S1_1 may be determined according to the actual design, and FIG. 3 shows one specific example among many possible implementations. In the embodiment shown in FIG. 3, the synchronization information S1_1 includes identification information fields (e.g., a synchronization identification field and a group field). The group field records the identification number of the computing circuit group. The synchronization identification field records the identification number of the target data block. The synchronization information S1_1 further includes a type field, a count field, and a valid field. The type field records the synchronization type of the corresponding data (e.g., one-write-read-many, many-write-read-many, etc.). The count field of the synchronization information S1_1 records how many pieces of data are expected to arrive. The valid field records whether the current entry is valid.
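The fields above can be modeled as a simple record. The following Python sketch is illustrative only; the field names and types are assumptions, since the patent does not specify bit widths or an encoding for the synchronization information S1_1:

```python
from dataclasses import dataclass

@dataclass
class SyncInfo:
    """Illustrative model of synchronization information S1_1 (names assumed)."""
    group_id: int    # group field: identification number of the computing circuit group
    sync_id: int     # synchronization identification field: ID of the target data block
    sync_type: str   # type field: e.g. "one-write-read-many" or "many-write-read-many"
    count: int       # count field: in a load request, the target number of data arrivals
    valid: bool      # valid field: whether the current entry is valid

# Example: a load request expecting 2 pieces of target data for group 7, data block 3.
info = SyncInfo(group_id=7, sync_id=3, sync_type="many-write-read-many",
                count=2, valid=True)
```

An entry of the synchronization lookup table described below can reuse the same field layout, with the count field holding the number of pieces already ready rather than the number expected.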
The synchronization check circuit 122 of the memory circuit 120 includes a synchronization lookup table (not shown in FIG. 1). The specific data structure of each entry of the synchronization lookup table is analogous to the data structure of the synchronization information S1_1 shown in FIG. 3. The count field of the synchronization lookup table records how many pieces of data are currently ready. The valid field of the synchronization lookup table records whether the current entry is valid. When the access request is a storage request, the synchronization check circuit 122 may check the identification information fields (the synchronization identification field and the group field) of the synchronization information carried by the storage request to update the count value corresponding to the storage request in the synchronization lookup table. For example, the synchronization check circuit 122 may find the corresponding entry in the synchronization lookup table according to the synchronization identification field and the group field of the synchronization information of the storage request, and then increment the count value in the count field of the corresponding entry by 1.
When the access request is a load request (e.g., the load request LR1_1 shown in FIG. 3), the synchronization check circuit 122 of the memory circuit 120 may find the count value corresponding to the load request in the synchronization lookup table (not shown in FIG. 1) based on the identification information fields (the synchronization identification field and the group field) of the synchronization information of the load request. For example, the synchronization check circuit 122 may find the corresponding entry in the synchronization lookup table according to the synchronization identification field and the group field of the synchronization information of the load request, and then fetch the count value from the count field of the corresponding entry. The synchronization check circuit 122 may check the count value in the synchronization lookup table against the count field of the synchronization information of the load request, so as to determine whether to inform the memory 121 to return the target data block corresponding to the access request to the computing circuit group. For example, the count field of the synchronization information S1_1 of the load request LR1_1 carries a target value; the synchronization check circuit 122 may take the count value from the count field of the corresponding entry in the synchronization lookup table, and then compare the target value of the load request LR1_1 with the count value in the synchronization lookup table.
When the count value in the synchronization lookup table has not reached the target value (e.g., the target value of the load request LR1_1 is greater than the count value in the synchronization lookup table), one or more of the computing circuits in the computing circuit group have not yet stored the relevant data (the target data) into the memory 121, so the synchronization check circuit 122 may transmit a synchronization check result indicating that "the target data block in the memory 121 is not ready" back to the computing circuit group. When the count value in the synchronization lookup table has reached the target value (e.g., the target value of the load request LR1_1 is equal to or smaller than the count value in the synchronization lookup table), the computing circuit group has stored the relevant data (the target data) into the memory 121, so the synchronization check circuit 122 can inform the memory 121 to return the target data block corresponding to the access request to the computing circuit group.
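The store/load check described above can be sketched as a behavioral model in Python. This is a sketch under assumptions (entries keyed by the group field and synchronization identification field, a counter compared against the load's target value), not the patent's actual circuit:

```python
class SyncLookupTable:
    """Behavioral sketch of the synchronization lookup table: entries are
    keyed by (group field, synchronization identification field)."""

    def __init__(self):
        self._count = {}  # (group_id, sync_id) -> number of data pieces ready

    def on_store(self, group_id, sync_id):
        # A storage request increments the count of the corresponding entry by 1.
        key = (group_id, sync_id)
        self._count[key] = self._count.get(key, 0) + 1

    def on_load(self, group_id, sync_id, target):
        # A load request is satisfied ("target data block ready") only when
        # the recorded count has reached the load request's target value.
        return self._count.get((group_id, sync_id), 0) >= target

table = SyncLookupTable()
table.on_store(group_id=7, sync_id=3)        # first computing circuit stores
ready_early = table.on_load(7, 3, target=2)  # False: only 1 of 2 pieces arrived
table.on_store(group_id=7, sync_id=3)        # second computing circuit stores
ready_late = table.on_load(7, 3, target=2)   # True: count reached the target
```

The usage at the end mirrors the two cases above: while the count is below the target, the check yields "not ready"; once every producer has stored its piece, the load can be served.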
In summary, at least one of the computing circuits 110_1 to 110_n may be selectively organized into a computing circuit group to jointly execute an operation task. The access requests issued by the computing circuit group to the memory 121 carry synchronization information. The synchronization check circuit 122 may check the synchronization information carried by an access request to determine whether to return the target data block corresponding to the access request to the computing circuit group. Therefore, data synchronization between different ones of the computing circuits 110_1 to 110_n can be ensured. Unlike the above embodiments, a prior-art scheme processes shared data through the sequence "store, fence, synchronize, load" to ensure synchronization of the data. That prior-art scheme incurs a large time overhead. Furthermore, the access requests issued by the prior-art scheme do not carry synchronization information; that is, its "synchronization" is separate from its "store" and "load", which requires an additional module. The prior-art scheme thus increases performance overhead whenever storage and synchronization must be coordinated. Moreover, in the prior-art scheme, the larger the scope of access synchronization, the farther the synchronization module is from the processing unit, and the larger the synchronization overhead.
FIG. 4 is a circuit block diagram of an AI chip 400 according to an embodiment of the disclosure. The AI chip 400 shown in FIG. 4 includes a plurality of stream processor clusters (e.g., stream processor cluster 410) and a shared memory 420. The stream processor cluster 410 and the shared memory 420 shown in FIG. 4 can serve as one of many embodiments of the computing circuits 110_1 to 110_n and the memory circuit 120 shown in FIG. 1, and thus the AI chip 400, the stream processor cluster 410, and the shared memory 420 shown in FIG. 4 may refer to the related descriptions of the AI chip 100, the computing circuits 110_1 to 110_n, and the memory circuit 120 shown in FIG. 1. In the embodiment shown in FIG. 4, the shared memory 420 includes a secondary cache 421 and a synchronization check circuit 422, and the synchronization check circuit 422 includes a synchronization lookup table 422a and a check circuit 422b. The secondary cache 421 and the synchronization check circuit 422 shown in FIG. 4 may refer to the related descriptions of the memory 121 and the synchronization check circuit 122 shown in FIG. 1.
At each memory access, the stream processor cluster 410 may issue an access request with synchronization information to the secondary cache 421 based on the operation task. For example, when the access request is a storage request, the stream processor cluster 410 may issue a storage request to the secondary cache 421 to store data (the target data block) produced during the operation task into the secondary cache 421. The check circuit 422b may check the identification information fields (the synchronization identification field and the group field) of the synchronization information carried by the storage request to update the count value corresponding to the storage request in the synchronization lookup table 422a. For example, the check circuit 422b may find the corresponding entry in the synchronization lookup table 422a according to the synchronization identification field and the group field of the synchronization information of the storage request, and then increment the count value in the count field of the corresponding entry by 1 (indicating that one more piece of target data is ready in the secondary cache 421).
When the access request is a load request, the stream processor cluster 410 may issue the load request to the secondary cache 421. The check circuit 422b may check the count value in the synchronization lookup table 422a based on the synchronization information carried by the load request to determine whether to notify the secondary cache 421 to return the target data block corresponding to the load request to the stream processor cluster 410. For example, the check circuit 422b may find the corresponding entry in the synchronization lookup table 422a according to the synchronization identification field and the group field of the synchronization information of the load request, and then fetch the count value from the count field of the corresponding entry. The count field of the synchronization information of the load request issued by the stream processor cluster 410 carries a target value; the check circuit 422b may retrieve the count value from the count field of the corresponding entry in the synchronization lookup table 422a and then compare the target value of the load request with the count value in the synchronization lookup table 422a. When the count value in the synchronization lookup table 422a has not reached the target value (e.g., the target value of the load request is greater than the count value in the synchronization lookup table 422a), the check circuit 422b may return a synchronization check result indicating that the target data block in the secondary cache 421 is not ready back to the stream processor cluster 410. When the count value in the synchronization lookup table 422a has reached the target value (e.g., the target value of the load request is equal to or smaller than the count value in the synchronization lookup table 422a), the check circuit 422b may inform the secondary cache 421 to return the target data block corresponding to the access request to the stream processor cluster 410.
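The request/response flow at the secondary cache can be illustrated with a small event sequence. This is a hedged sketch: the function and result names (e.g., handle_request, "not-ready") are invented for illustration and are not the actual interface of the check circuit 422b:

```python
# Behavioral sketch of check circuit 422b at the secondary cache (names invented).
lookup_422a = {}  # (group_id, sync_id) -> count of target data pieces ready

def handle_request(kind, group_id, sync_id, target=None):
    key = (group_id, sync_id)
    if kind == "store":
        # Storage request: one more piece of target data is ready in the L2 cache.
        lookup_422a[key] = lookup_422a.get(key, 0) + 1
        return "stored"
    # Load request: compare the request's target value with the recorded count.
    if lookup_422a.get(key, 0) >= target:
        return "data"       # secondary cache returns the target data block
    return "not-ready"      # synchronization check result sent back to the SPC

# Two producer SPCs each store one piece; a consumer SPC loads with target=2.
r1 = handle_request("load", 7, 3, target=2)   # before any store: not ready
handle_request("store", 7, 3)                 # first producer SPC stores
r2 = handle_request("load", 7, 3, target=2)   # still not ready (1 < 2)
handle_request("store", 7, 3)                 # second producer SPC stores
r3 = handle_request("load", 7, 3, target=2)   # count reached target: data returned
```

The same sequence applies one level down: the check circuit 414b and synchronization lookup table 414a inside a stream processor cluster follow the identical store-then-load protocol against the level one cache.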
The stream processor cluster 410 shown in fig. 4 includes a scheduler 411, a plurality of computing cores 412 (e.g., tensor cores and vector cores), and a memory circuit 413. The computing cores 412 and the memory circuit 413 shown in fig. 4 may serve as one embodiment of the computing circuits 110_1 to 110_n and the memory circuit 120 shown in fig. 1, so the computing cores 412 and the memory circuit 413 shown in fig. 4 may refer to the related descriptions of the computing circuits 110_1 to 110_n and the memory circuit 120 shown in fig. 1 by analogy. In the embodiment shown in fig. 4, the memory circuit 413 includes an input buffer, a level-one cache, and a synchronization check circuit 414, and the synchronization check circuit 414 includes a synchronization lookup table 414a and a checking circuit 414b. The level-one cache and the synchronization check circuit 414 of fig. 4 may refer to the related descriptions of the memory 121 and the synchronization check circuit 122 of fig. 1 by analogy. The scheduler 411 may issue a computation instruction to the computing cores 412 based on the operation task, and the computing cores 412 may issue access requests carrying synchronization information to the memory circuit 413 based on the computation instruction.
For example, when the access request is a storage request, the computing core 412 may issue the storage request to the level-one cache to store data (a target data block) produced during the operation task into the level-one cache. The checking circuit 414b may check the identification information fields (the synchronization identification field and the group field) of the synchronization information carried by the storage request to update the count value corresponding to the storage request in the synchronization lookup table 414a. For example, the checking circuit 414b may find the corresponding entry in the synchronization lookup table 414a according to the synchronization identification field and the group field of the synchronization information of the storage request, and then increment the count value in the count field of the corresponding entry by 1 (indicating that one more piece of target data is ready in the level-one cache).
When the access request is a load request, the computing core 412 may issue the load request to the level-one cache. The checking circuit 414b may check the count value in the synchronization lookup table 414a based on the synchronization information carried by the load request to determine whether to notify the level-one cache to return the target data block corresponding to the load request to the computing core 412. For example, the checking circuit 414b may find the corresponding entry in the synchronization lookup table 414a according to the synchronization identification field and the group field of the synchronization information of the load request, and then fetch the count value from the count field of the corresponding entry. The count field of the synchronization information of the load request issued by the computing core 412 may carry a target value, and the checking circuit 414b may compare the target value of the load request with the count value in the synchronization lookup table 414a. When the count value in the synchronization lookup table 414a has not reached the target value (i.e., the target value of the load request is greater than the count value in the synchronization lookup table 414a), the checking circuit 414b may return a synchronization check result indicating that the target data block in the level-one cache is not ready to the computing core 412. When the count value in the synchronization lookup table 414a has reached the target value (i.e., the target value of the load request is less than or equal to the count value in the synchronization lookup table 414a), the checking circuit 414b may notify the level-one cache to return the target data block corresponding to the access request to the computing core 412.
Fig. 5 is a circuit block diagram of an AI chip 500 according to another embodiment of the disclosure. The AI chip 500 shown in fig. 5 includes a plurality of stream processor clusters (e.g., stream processor clusters SPC0, SPC1, SPC2, and SPC3) and a shared memory 520. The stream processor clusters SPC0 to SPC3 and the shared memory 520 shown in fig. 5 may serve as one embodiment of the computing circuits 110_1 to 110_n and the memory circuit 120 shown in fig. 1, so the AI chip 500, the stream processor clusters SPC0 to SPC3, and the shared memory 520 shown in fig. 5 may refer to the related descriptions of the AI chip 100, the computing circuits 110_1 to 110_n, and the memory circuit 120 shown in fig. 1 by analogy. In the embodiment shown in fig. 5, the shared memory 520 includes a secondary cache and a synchronization check circuit SCU51. The secondary cache and the synchronization check circuit SCU51 shown in fig. 5 may refer to the related descriptions of the secondary cache 421 and the synchronization check circuit 422 shown in fig. 4 by analogy.
In the embodiment shown in fig. 5, each stream processor cluster includes a plurality of computing cores, a level-one cache, and a synchronization check circuit. For example, the stream processor cluster SPC3 includes a plurality of computing cores, a level-one cache, and a synchronization check circuit SCU52_3. The computing cores, the level-one cache, and the synchronization check circuit SCU52_3 in the stream processor cluster SPC3 shown in fig. 5 may serve as one embodiment of the computing circuits 110_1 to 110_n, the memory 121, and the synchronization check circuit 122 shown in fig. 1. The computing cores, the level-one cache, and the synchronization check circuit SCU52_3 in the stream processor cluster SPC3 shown in fig. 5 may refer to the related descriptions of the computing cores 412, the level-one cache, and the synchronization check circuit 414 shown in fig. 4 by analogy. The other stream processor clusters SPC0 to SPC2 may refer to the related description of the stream processor cluster SPC3 by analogy.
For convenience of explanation, it is assumed herein that the stream processor clusters SPC0 to SPC3 shown in fig. 5 are selectively organized into a computing circuit group to collectively perform an operation task. During execution of the operation task, the stream processor cluster SPC3 needs to use the computation results (target data) of the other stream processor clusters SPC0 to SPC2, so the AI chip 500 must ensure data synchronization among the stream processor clusters SPC0 to SPC3. For data synchronization, the stream processor clusters SPC0 to SPC3 send initialization messages carrying synchronization information to the synchronization check circuit SCU51, and the synchronization check circuit SCU51 opens a new entry (hereinafter referred to as the corresponding entry) in a synchronization lookup table (not shown in fig. 5) based on the identification information fields (the synchronization identification field and the group field) of the synchronization information of the initialization messages. After initialization, the count value in the count field of the corresponding entry is an initial value (e.g., 0 or another real number).
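The entry-opening step might be modeled as follows. This is a hypothetical sketch: the patent does not specify what happens when several clusters of the same group each send an initialization message for the same entry, so the idempotent "open only once" behavior below is an assumption for illustration.

```python
# Hypothetical model of opening a new lookup-table entry on initialization.
# The idempotent "open only once" behavior is an assumption, not from the patent.

def handle_init(table, sync_id, group, initial=0):
    key = (sync_id, group)
    if key not in table:       # open a new corresponding entry
        table[key] = initial   # count field starts at the initial value
    return table[key]

table = {}
for cluster in ("SPC0", "SPC1", "SPC2", "SPC3"):
    handle_init(table, sync_id=5, group=1)
print(table)   # {(5, 1): 0} -- one shared entry for the whole group
```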
When the stream processor cluster SPC3 sends a load request carrying synchronization information to the shared memory 520, the synchronization check circuit SCU51 may find the count value corresponding to the load request in the synchronization lookup table (not shown in fig. 5) based on the identification information fields (the synchronization identification field and the group field) of the synchronization information of the load request. The count field of the synchronization information of the load request issued by the stream processor cluster SPC3 carries a target value (in this example, the target value is "3" because the stream processor cluster SPC3 needs to use the three computation results of the stream processor clusters SPC0 to SPC2). The synchronization check circuit SCU51 may fetch the count value (currently the initial value "0") from the count field of the corresponding entry in the synchronization lookup table. Because the count value "0" in the synchronization lookup table has not reached the target value "3" (indicating that no target data block is stored in the secondary cache), the synchronization check circuit SCU51 may return a synchronization check result indicating that the target data block in the secondary cache is not ready to the stream processor cluster SPC3. Based on the synchronization check result, the stream processor cluster SPC3 may wait for the data to become ready.
Whenever the stream processor clusters SPC0 to SPC3 access the secondary cache, any of them may issue an access request carrying synchronization information to the secondary cache based on the operation task. For example, the stream processor cluster SPC0 may issue a storage request to the secondary cache to store data (a target data block) produced during the operation task into the secondary cache. The synchronization check circuit SCU51 may check the identification information fields (the synchronization identification field and the group field) of the synchronization information carried by the storage request to update the count value corresponding to the storage request in the synchronization lookup table (not shown in fig. 5). For example, the synchronization check circuit SCU51 may find the corresponding entry in the synchronization lookup table according to the synchronization identification field and the group field of the synchronization information of the storage request, and then increment the count value in the count field of the corresponding entry by 1 (indicating that one more piece of target data is ready in the secondary cache). During this process, when the stream processor cluster SPC3 sends a load request carrying synchronization information to the shared memory 520, the synchronization check circuit SCU51 may fetch the count value (which has not yet reached "3") from the count field of the corresponding entry in the synchronization lookup table. Because the count value in the synchronization lookup table has not reached the target value "3", the synchronization check circuit SCU51 may return a synchronization check result indicating that the target data block in the secondary cache is not ready to the stream processor cluster SPC3.
Based on the synchronization check result, the stream processor cluster SPC3 may block the thread to wait for the data to become ready, or the stream processor cluster SPC3 may directly return a "load failure" signal to the scheduler (not shown in fig. 5) without blocking the thread.
After the stream processor clusters SPC0 to SPC2 have each stored their respective computation results (target data blocks) into the secondary cache, the count value in the count field of the corresponding entry in the synchronization lookup table has been incremented to "3". When the stream processor cluster SPC3 sends a load request carrying synchronization information to the shared memory 520, the synchronization check circuit SCU51 may find the count value corresponding to the load request (now "3") in the synchronization lookup table (not shown in fig. 5) based on the identification information fields (the synchronization identification field and the group field) of the synchronization information of the load request. Because the count value "3" in the synchronization lookup table has reached the target value "3" (indicating that the stream processor clusters SPC0 to SPC2 have all stored their respective computation results in the secondary cache), the synchronization check circuit SCU51 may notify the secondary cache to return the target data block corresponding to the access request to the stream processor cluster SPC3. In this way, the AI chip 500 can ensure data synchronization among the stream processor clusters SPC0 to SPC3.
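The full producer-consumer scenario above can be traced with a short script. This is again a hypothetical software model of the counting behavior, not the circuit itself; the helper names `store` and `load_ready` and the key (5, 1) are illustrative assumptions.

```python
# Hypothetical trace of the fig. 5 scenario: SPC0-SPC2 each store one
# computation result, and SPC3 loads with target value 3.

table = {(5, 1): 0}          # corresponding entry opened at initialization
TARGET = 3                   # SPC3 needs three computation results

def store(sync_id, group):
    table[(sync_id, group)] += 1            # one more target data block ready

def load_ready(sync_id, group, target):
    return table[(sync_id, group)] >= target

polls = []
for producer in ("SPC0", "SPC1", "SPC2"):
    polls.append(load_ready(5, 1, TARGET))  # SPC3 polls before this store
    store(5, 1)                             # producer stores its result

print(polls)                     # [False, False, False]: not ready yet
print(load_ready(5, 1, TARGET))  # True: count reached 3, data is returned
```

Each poll before the third store reports "not ready", matching the waiting (or load-failure) behavior described above; only once all three producers have stored does the check succeed.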
Based on the actual design, the technique shown in fig. 5 may be further optimized in some embodiments. The AI chip 500 of fig. 5 may enable synchronization check circuits at different levels to perform the synchronization check depending on the scope of the data access. When the AI chip 500 finds that the storing and reading of the target data block to be accessed occur only within the same stream processor cluster, the AI chip 500 may enable a local-level synchronization check circuit (e.g., the synchronization check circuit SCU52_3) to achieve data synchronization. When the AI chip 500 finds that the storing and reading of the target data block to be accessed occur in a memory shared by different stream processor clusters (e.g., the secondary cache), the AI chip 500 may enable a wide-area-level synchronization check circuit (e.g., the synchronization check circuit SCU51) to achieve data synchronization. When the AI chip 500 finds that only one stream processor cluster (e.g., the stream processor cluster SPC3) needs to read the target data block, the wide-area-level synchronization check circuit SCU51 may synchronize the corresponding entry related to that stream processor cluster into the synchronization lookup table (not shown in fig. 5) of that cluster's own synchronization check circuit (e.g., the synchronization check circuit SCU52_3). Thus, that stream processor cluster may perform the synchronization check locally to achieve data synchronization.
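The level-selection policy might be sketched as follows, under the assumption (not stated in the patent) that the chip knows, per target data block, which clusters store it and which read it. The return-value encoding is illustrative only.

```python
# Hypothetical routing policy for choosing the synchronization check level.
# Circuit names (SCU51, SCU52_x) follow the embodiment of fig. 5.

def select_sync_circuit(storing_clusters, reading_clusters):
    clusters = set(storing_clusters) | set(reading_clusters)
    if len(clusters) == 1:
        # Store and read happen within one cluster: use its local circuit.
        return "local:" + clusters.pop()
    # Store and read cross clusters: use the wide-area circuit guarding
    # the shared secondary cache.
    return "wide-area:SCU51"

print(select_sync_circuit(["SPC3"], ["SPC3"]))                  # local:SPC3
print(select_sync_circuit(["SPC0", "SPC1", "SPC2"], ["SPC3"]))  # wide-area:SCU51
```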
Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solutions of the present disclosure, not to limit them. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present disclosure.

Claims (18)

6. The artificial intelligence chip of claim 1, wherein the plurality of computing circuits includes a plurality of computing cores within a same stream processor cluster, the memory circuit includes a memory within the same stream processor cluster and a synchronization check circuit, at least one first computing core of the plurality of computing cores issues the storage request to the memory to store the target data block to the memory, the synchronization check circuit includes the synchronization lookup table, the synchronization check circuit checks the synchronization information of the storage request issued by the at least one first computing core to update a count value corresponding to the target data block in the synchronization lookup table, at least one second computing core of the plurality of computing cores issues the load request to the memory, and the synchronization check circuit checks the count value in the synchronization lookup table based on the synchronization information carried by the load request to determine whether to notify the memory to return the target data block corresponding to the load request to the at least one second computing core.
8. The artificial intelligence chip of claim 1, wherein the plurality of computing circuits comprises a plurality of stream processor clusters, the memory circuit comprises a shared memory and a synchronization checking circuit, at least one first stream processor cluster of the plurality of stream processor clusters issues the storage request to the shared memory to store the target data block to the shared memory, the synchronization checking circuit comprises the synchronization lookup table, the synchronization checking circuit checks the synchronization information sent by the at least one first stream processor cluster to update a count value corresponding to the target data block in the synchronization lookup table, at least one second stream processor cluster of the plurality of stream processor clusters sends the load request to the shared memory, and the synchronization checking circuit checks the count value in the synchronization lookup table based on the synchronization information carried by the load request to determine whether to notify the shared memory to return the target data block corresponding to the load request to the at least one second stream processor cluster.
CN202410194532.7A (CN117850705B) | Priority date 2024-02-22 | Filing date 2024-02-22 | Artificial intelligent chip and data synchronization method thereof | Active

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202410194532.7A (CN117850705B) | 2024-02-22 | 2024-02-22 | Artificial intelligent chip and data synchronization method thereof

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202410194532.7A (CN117850705B) | 2024-02-22 | 2024-02-22 | Artificial intelligent chip and data synchronization method thereof

Publications (2)

Publication Number | Publication Date
CN117850705A (en) | 2024-04-09
CN117850705B (en) | 2024-05-07

Family

ID=90530459

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202410194532.7A (Active; CN117850705B) | Artificial intelligent chip and data synchronization method thereof | 2024-02-22 | 2024-02-22

Country Status (1)

Country | Link
CN (1) | CN117850705B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN120234295B (en)* | 2025-06-03 | 2025-08-01 | Shanghai Bi Ren Technology Co ltd | Artificial intelligence chip and operation method thereof

Citations (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN101739371A (en)* | 2006-09-26 | 2010-06-16 | Hitachi, Ltd. | Storing system and disc control device
CN105405070A (en)* | 2015-12-03 | 2016-03-16 | State Grid Corporation of China | Distributed memory power grid system construction method
CN105630413A (en)* | 2015-12-23 | 2016-06-01 | Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences | Synchronized writeback method for disk data
CN116841710A (en)* | 2023-06-19 | 2023-10-03 | NIO Technology (Anhui) Co., Ltd. | Task scheduling method, task scheduling system and computer storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US10990391B2 (en)* | 2018-03-31 | 2021-04-27 | Micron Technology, Inc. | Backpressure control using a stop signal for a multi-threaded, self-scheduling reconfigurable computing fabric
WO2019191737A1 (en)* | 2018-03-31 | 2019-10-03 | Micron Technology, Inc. | Multi-threaded self-scheduling reconfigurable computing fabric


Also Published As

Publication number | Publication date
CN117850705A (en) | 2024-04-09

Similar Documents

Publication | Title
US6038646A (en) | Method and apparatus for enforcing ordered execution of reads and writes across a memory interface
US6272520B1 (en) | Method for detecting thread switch events
CN1294484C (en) | Breaking replay dependency loops in processor using rescheduled replay queue
US5450564A (en) | Method and apparatus for cache memory access with separate fetch and store queues
JP4160925B2 (en) | Method and system for communication between processing units in a multiprocessor computer system including a cross-chip communication mechanism in a distributed node topology
US5832262A (en) | Realtime hardware scheduler utilizing processor message passing and queue management cells
US5016167A (en) | Resource contention deadlock detection and prevention
US4881163A (en) | Computer system architecture employing cache data line move-out queue buffer
US20020184445A1 (en) | Dynamically allocated cache memory for a multi-processor unit
EP0637799A2 (en) | Shared cache for multiprocessor system
US7073026B2 (en) | Microprocessor including cache memory supporting multiple accesses per cycle
EP0348628A2 (en) | Cache storage system
US5898882A (en) | Method and system for enhanced instruction dispatch in a superscalar processor system utilizing independently accessed intermediate storage
US8190825B2 (en) | Arithmetic processing apparatus and method of controlling the same
JPH04232532A (en) | Digital computer system
US20140129806A1 (en) | Load/store picker
US20080288691A1 (en) | Method and apparatus of lock transactions processing in single or multi-core processor
EP0605868A1 (en) | Method and system for indexing the assignment of intermediate storage buffers in a superscalar processor system
CN117850705B (en) | Artificial intelligent chip and data synchronization method thereof
CN117852600B (en) | Artificial intelligence chip, method of operating the same, and machine-readable storage medium
US6209081B1 (en) | Method and system for nonsequential instruction dispatch and execution in a superscalar processor system
EP0265108B1 (en) | Cache storage priority
CN114036091B (en) | Multiprocessor peripheral multiplexing circuit and multiplexing method thereof
US12204900B2 (en) | Predicates for processing-in-memory
CA1116756A (en) | Cache memory command circuit

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
