Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region. For example, information such as setting operation involved in the present application is obtained under a sufficient authorization.
It should be understood that, although the terms first, second, etc. may be used in this disclosure to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first parameter may also be referred to as a second parameter, and similarly, a second parameter may also be referred to as a first parameter, without departing from the scope of the present disclosure. The term "if" as used herein may be interpreted as "upon" or "when" or "in response to a determination," depending on the context.
First, terms related to the present application are described.
A cache is a small-capacity, high-speed memory that forms part of the storage system and stores instructions and data commonly used by programs. It may also be referred to as cache memory. Processors typically include multiple levels of caches, each level typically having a different size, and smaller caches typically have higher read and write efficiency due to the smaller amount of data that needs to be searched.
Caches are divided into a plurality of levels, such as a first-level (L1) cache, a second-level (L2) cache, and a third-level (L3) cache. The cache capacity of each level gradually increases while its speed gradually decreases. The last level cache (LLC) is the last cache level of the CPU core, such as the L3 cache. When a processing core needs to access data, if the data does not hit in the L1 cache or the L2 cache, the LLC is searched next; if the data does not hit in the LLC either, the processing core needs to read the required data from main memory.
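The lookup order described above can be sketched as follows. This is an illustrative model only: the dictionaries standing in for the cache levels and the `read` function are hypothetical and are not part of the application.

```python
# Illustrative sketch of the multi-level lookup order: search L1, then L2,
# then the LLC, and fall back to main memory on a miss in every cache level.
def read(address, l1, l2, llc, main_memory):
    """Return the level that served the request and the data found there."""
    for level_name, level in (("L1", l1), ("L2", l2), ("LLC", llc)):
        if address in level:
            return level_name, level[address]
    # Miss in every cache level: the processing core reads from main memory.
    return "main memory", main_memory[address]

# Hypothetical contents for each level.
l1 = {0x10: "a"}
l2 = {0x20: "b"}
llc = {0x30: "c"}
main_memory = {0x10: "a", 0x20: "b", 0x30: "c", 0x40: "d"}

print(read(0x10, l1, l2, llc, main_memory))  # ('L1', 'a')
print(read(0x40, l1, l2, llc, main_memory))  # ('main memory', 'd')
```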
Caches can be divided into instruction caches and data caches according to the information stored in them. An instruction cache is a cache for storing instructions, and a data cache is a cache for storing data. The embodiments of the present application are mainly illustrated with a data cache, but the embodiments of the present application may also be used to support other caches for storing data, which is not limited thereto.
A Network-on-Chip (NoC) is a network-design-based communication subsystem located in an integrated circuit, commonly used for data transmission and communication between different modules in a System on Chip (SoC). NoC technology connects processor cores, memory, and various peripheral devices through routers to form a highly parallel communication architecture, effectively improving data transmission efficiency and communication bandwidth. Compared with the traditional shared-bus approach, NoC technology provides higher scalability and better performance, and is particularly suitable for multi-core systems.
An instruction is a command that instructs a computer to execute a certain operation, and is the minimum functional unit of computer execution. An instruction is a statement in machine language, that is, a set of meaningful binary codes. The set of all instructions of a computer constitutes the instruction system, also known as the instruction set, of that computer.
The instruction format is the representation of an instruction; an instruction typically includes an operation code (Opcode) and operands (Operand). The opcode describes the type of operation that the instruction is to perform, and the operands provide the data, or the addresses of the data, required to execute the instruction. The opcode is indispensable in an instruction, whereas there may be none, one, or two operands.
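The split between opcode and operands can be illustrated with a hypothetical 16-bit encoding. This format is invented solely for the example (it is not any real instruction set): bits 15-12 hold the opcode, and bits 11-6 and 5-0 hold two operands.

```python
# Illustrative sketch of decoding a hypothetical 16-bit instruction word:
# the top 4 bits are the opcode, followed by two 6-bit operand fields.
def decode(word):
    opcode = (word >> 12) & 0xF     # bits 15-12: type of operation
    operand1 = (word >> 6) & 0x3F   # bits 11-6: first operand
    operand2 = word & 0x3F          # bits 5-0: second operand
    return opcode, operand1, operand2

word = (0x3 << 12) | (5 << 6) | 9   # opcode 3, operands 5 and 9
print(decode(word))                  # (3, 5, 9)
```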
In a multi-core processor, constant data (e.g., instruction data, constants in a program, program control flow, etc.) is often read frequently by various components in a processing core. This portion of data is often closely related to the control flow of the program and is delay-sensitive. Thus, in modern processor cache hierarchies, a separate constant cache is often provided in each processing core to reduce the read latency of constant data.
However, constant data is actually shared among multiple processing cores, and each processing core independently caching the same constant data results in storage redundancy and reduced overall cache utilization. In addition, the last level cache, which can be shared by all cores, is often located at the back end of the network-on-chip. When a cache miss occurs in a processing core, the missed constant data needs to be read from the last level cache or main memory, and the transmission delay and read delay of this process are high.
Illustratively, as shown in FIG. 1, at least two processing cores in processor 100 are coupled through network-on-chip 200, and processor 100 is coupled through network-on-chip 200 to last level cache 300 and main memory 400. Each of the at least two processing cores includes a cache, which may be a general-purpose cache or a constant cache. If the cache is a constant cache, each processing core will typically also include a general-purpose cache, which is not shown in FIG. 1. In the related art, if the first constant of processing core 1 misses in the cache of processing core 1, that is, a cache miss occurs, processing core 1 is required to request the first constant from last level cache 300; the transmission path 10 is shown in FIG. 1. On the one hand, last level cache 300 has a larger capacity, so the read and write speed of last level cache 300 is slower. On the other hand, last level cache 300 is located at the back end of network-on-chip 200, so the transmission delay of the read request from processing core 1 to last level cache 300 is higher.
To solve this problem, the processor shown in the embodiments of the present application supports cache sharing between at least two processing cores, as shown in FIG. 2, so as to reduce the read latency of constant data. The details are as follows.
Fig. 2 shows a schematic diagram of a processor according to an exemplary embodiment of the present application. Processor 100 includes at least two processing cores.
Optionally, the processor 100 shown in the embodiments of the present application includes at least two processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 100 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). Processor 100 may be a CPU (Central Processing Unit). In some embodiments, the processor 100 may be a GPU (Graphics Processing Unit). In other embodiments, the processor 100 may also be an AI (Artificial Intelligence) processor for processing computing operations related to machine learning. The embodiments of the application are illustrated by taking the processor 100 as a GPU.
Alternatively, each of the at least two processing cores may communicate with the others via a bus, via a network-on-chip 200 as shown in FIG. 2, via traces on a circuit board of the processor, and so on. It should be noted that the foregoing merely shows examples of the connection manner between the at least two processing cores and does not limit the embodiments of the present application with respect to that connection manner.
A first one of the at least two processing cores is configured to fetch a first constant cached in a second one of the processing cores in the event that the first constant misses in the first cache.
The first cache is a cache of the first processing core.
Optionally, in a case where the first constant misses in the first cache, a first processing core of the at least two processing cores requests the first constant from a second cache, the second cache being a cache of the second processing core.
Alternatively, the first cache and the second cache may be caches dedicated to constants, that is, the data cached in the first cache and the second cache is read-only data; in this case, the first cache may be referred to as a first constant cache and the second cache may be referred to as a second constant cache. Or, the first cache and the second cache are general-purpose caches, that is, the first cache and the second cache are used to cache both constants and variables; in other words, the data cached in the first cache and the second cache may be read-only data or data supporting write operations (i.e., writable data). For example, the first cache and the second cache are the level-one caches in the first processing core and the second processing core, respectively.
Optionally, the first cache is a private cache of the first processing core, i.e. the first cache is an exclusive cache of the first processing core. The first cache typically only supports access by the first processing core, or the data in the first cache is not directly available for querying by other processing cores.
Optionally, when a first processing core of the at least two processing cores needs to use a first constant, it first accesses the first cache. If the first constant hits in the first cache, no subsequent steps are required. If the first constant misses in the first cache, the first processing core selects a second processing core from the at least two processing cores and sends to the second processing core a read request carrying the address of the first constant and the identification of the first processing core.
Alternatively, the second processing core may be one processing core or a plurality of processing cores; that is, the first processing core may determine one or more of the at least two processing cores as the second processing core.
Optionally, the second processing core accesses the second cache based on the read request sent by the first processing core. In the case of a hit of the first constant in the second cache, the second processing core sends the first constant to the first processing core according to the identification of the first processing core indicated in the read request; in the case of a miss of the first constant in the second cache, it returns miss information to the first processing core according to that identification.
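The request-and-respond flow above can be sketched as follows. The function and field names (`handle_read_request`, `requester_id`, etc.) are hypothetical, introduced only for illustration; the application does not specify a concrete message format.

```python
# Illustrative sketch: the second processing core services a read request by
# querying its own cache and either returning the constant to the requester
# or returning a miss indication, addressed by the requester's identification.
def handle_read_request(second_cache, request):
    address = request["address"]
    requester = request["requester_id"]  # identification of the first core
    if address in second_cache:
        return {"to": requester, "hit": True, "data": second_cache[address]}
    return {"to": requester, "hit": False, "data": None}

second_cache = {0x100: 42}               # hypothetical cached constant
req = {"address": 0x100, "requester_id": 1}
print(handle_read_request(second_cache, req))
# {'to': 1, 'hit': True, 'data': 42}
miss = handle_read_request(second_cache, {"address": 0x200, "requester_id": 1})
print(miss["hit"])  # False
```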
Optionally, the first constant is data that is frequently read by at least two processing cores, or the read frequency of the at least two processing cores for the first constant is greater than a frequency threshold. That is, in the case where the read frequency of at least two processing cores for constants is greater than the frequency threshold, the method provided in the embodiment of the present application may be used to improve the read efficiency of at least two processing cores for constants.
It should be noted that the "requests" and "responses" shown in the embodiments of the present application, whether read requests, query requests, read responses, query responses, etc., generally take the form of instructions in the processor, but the embodiments of the present application are not limited thereto.
In summary, the processor provided in the embodiments of the present application supports the first processing core requesting the first constant from another processing core in the case where the first constant misses. Compared with the conventional manner of requesting the first constant from the last level cache or main memory, on the one hand, the read and write speed of a cache inside a processing core is generally higher than that of the last level cache and main memory. That is, the second processing core finds the first constant in the second cache faster than the last level cache or main memory can, or the read delay of the second processing core is lower than that of the last level cache or main memory, so obtaining the first constant by requesting it from the second processing core is faster, improving the read efficiency of the first processing core for the constant. On the other hand, due to the processor architecture, the transmission path (or transmission delay) between the first processing core and the last level cache or main memory is often longer than that between the first processing core and the second processing core: the transmission path between the first processing core and the last level cache is shown as path 10 in FIG. 1, and the transmission path between the first processing core and the second processing core is shown as path 20 in FIG. 2. The combined transmission delay of the first processing core sending the read request for the first constant to the second processing core and of the second processing core returning the read response carrying the first constant is lower than that of the conventional manner, so obtaining the first constant by requesting it from the second processing core is faster, improving the read efficiency of the first processing core for the constant.
In addition, the processor shown in the embodiments of the present application realizes cache sharing between the at least two processing cores. With shared caches, each processing core does not need to ensure that a constant is cached in its own cache; it only needs to be ensured that at least one of the at least two processing cores has cached the constant. For example, when the first processing core needs the first constant, it reads the first constant from the second processing core, realizing the sharing of private caches between the at least two processing cores. Because this private-cache sharing mechanism is mainly aimed at constants, and constants are read-only, cache inconsistency is unlikely to occur. This not only ensures the reliability of constant data, but also prevents the first constant from being stored redundantly across the at least two processing cores, improving the utilization of the caches in the at least two processing cores.
1. Determining the second processing core.
For the first processing core, requesting the first constant from the second processing core is preferred in order to improve read efficiency. However, if the first constant misses in the second processing core, the first processing core may need to select a second processing core again, or instead read the first constant from the last level cache or main memory in the conventional manner. Therefore, additional design is required for the selection of the second processing core. The embodiments of the present application provide two manners of determining the second processing core.
The first mode of determination is to determine an adjacent processing core as a second processing core.
The second mode of determination is to set a shared directory unit, which determines the processing core in which the first constant is cached as the second processing core.
Next, the above two manners of determining the second processing core will be described one by one. It should be understood that the order of description does not imply that one determination manner is superior to the other.
The first mode of determination is to determine an adjacent processing core as a second processing core.
Optionally, in a case where the first constant misses in the first cache, the first processing core, the processor, or the shared directory unit is configured to determine all or part of the adjacent cores as the second processing core and to request the first constant from the second processing core, where the adjacent cores are processing cores physically or logically adjacent to the first processing core. That is, the second processing core is all or part of the adjacent cores.
It should be noted that the unit determining the adjacent processing core as the second processing core may be the first processing core, the processor, the shared directory unit, or the like, which are shown above, or may be other control modules in the processor, other processing cores (such as processing cores other than the first processing core), or the like. The embodiment of the present application is not limited to the unit that determines the adjacent processing core as the second processing core.
Optionally, for each of the at least two processing cores, there is at least one adjacent core. The adjacent core is a processing core that is physically or logically adjacent to the first processing core.
Optionally, physical adjacency refers to an adjacency where there is a directly connected trace on the circuit board of processor 100. That is, a first processing core and a second processing core may be considered physically adjacent when there are directly connected traces on the circuit board.
Alternatively, logical adjacency refers to adjacency indicated by a driver, firmware, or other means. For example, for a register corresponding to an adjacent core in the first processing core, one or more processing cores are indicated as adjacent cores of the first processing core by the driver. The one or more processing cores may be indicated by a physical or logical identifier or the like.
Optionally, if the first constant misses in the first cache, the first processing core determines a part of the adjacent cores as the second processing core. This part of the adjacent cores is selected by the first processing core itself, either randomly or according to a selection condition.
Optionally, the adjacent core is a processing core logically adjacent to the first processing core. The processor supports dynamically adjusting the adjacent cores of the first processing core according to the state of the at least two processing cores. Specifically, the adjacent cores are processing cores that satisfy a first condition. The first condition includes at least one of: a load lower than a first threshold; being in the first n positions when sorted by load from small to large, n being a positive integer; a distance from the first processing core lower than a second threshold; and being in the first k positions when sorted by distance from the first processing core from small to large, k being a positive integer.
The processor or the shared directory unit is configured to determine the processing cores satisfying the first condition as adjacent cores of the first processing core, where the first condition includes at least one of: a load lower than a first threshold; being in the first n positions when sorted by load from small to large, n being a positive integer; a distance from the first processing core lower than a second threshold; and being in the first k positions when sorted by distance from the first processing core from small to large, k being a positive integer.
Optionally, the number of adjacent cores the first processing core supports storing is a first number, and the number of processing cores satisfying the first condition is a second number. If the first number is greater than or equal to the second number, all processing cores satisfying the first condition are determined as adjacent cores; if the first number is less than the second number, a part of the processing cores satisfying the first condition is selected as adjacent cores, for example, by selecting the first number of processing cores in order of load from small to large, or in order of distance from near to far, or the like.
Illustratively, the first threshold is set by the developer, set empirically by an expert, or dynamically determined based on the load of the at least two processing cores. For example, the first threshold is the average value, the median value, etc. of the loads of the at least two processing cores. Assume that there are 5 processing cores, where the load of processing core 1 is 50%, the load of processing core 2 is 20%, the load of processing core 3 is 40%, the load of processing core 4 is 80%, and the load of processing core 5 is 20%. The first threshold is then (50% + 20% + 40% + 80% + 20%) / 5 = 42%. The processor triggers an adjustment of the adjacent cores once per fixed interval. If the first processing core is processing core 3, with original adjacent cores processing core 2 and processing core 4, then processing core 2 will still be determined as an adjacent core of processing core 3 because its load of 20% is below the first threshold of 42%. However, since the load of processing core 4 is 80%, which is above the first threshold, the processor selects from the at least two processing cores another processing core whose load is below the first threshold and which is closer to processing core 3 as another adjacent core of processing core 3, for example, determining processing core 5 as another adjacent core of processing core 3.
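The adjustment in the worked example above can be sketched as follows. This is an illustrative simplification (the function names are hypothetical, and dropped neighbors are replaced here purely by low load rather than also by distance, as the embodiment permits):

```python
# Illustrative sketch: the first threshold is the average load, and a core's
# adjacent cores are re-selected from cores whose load is below that threshold.
loads = {1: 0.50, 2: 0.20, 3: 0.40, 4: 0.80, 5: 0.20}

def first_threshold(loads):
    # Dynamic first threshold: average of the loads of all processing cores.
    return sum(loads.values()) / len(loads)

def adjust_neighbors(core, current_neighbors, loads):
    threshold = first_threshold(loads)
    # Keep current neighbors whose load stays below the threshold.
    kept = [c for c in current_neighbors if loads[c] < threshold]
    # Replace dropped neighbors with other cores whose load is below it.
    candidates = [c for c in sorted(loads) if c != core
                  and c not in kept and loads[c] < threshold]
    while len(kept) < len(current_neighbors) and candidates:
        kept.append(candidates.pop())
    return sorted(kept), threshold

neighbors, threshold = adjust_neighbors(3, [2, 4], loads)
print(round(threshold, 2))  # 0.42
print(neighbors)            # core 4 (load 80%) dropped; core 5 picked up
```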
Optionally, the first processing core is configured to identify, as the second processing core, a neighboring core that satisfies the first condition if the first constant misses in the first cache.
For example, assume that there are 5 processing cores, where the load of processing core 1 is 50%, the load of processing core 2 is 20%, the load of processing core 3 is 40%, the load of processing core 4 is 80%, and the load of processing core 5 is 20%. The first threshold is then (50% + 20% + 40% + 80% + 20%) / 5 = 42%. If the first processing core is processing core 3, its corresponding adjacent cores are processing core 2 and processing core 4. Since the load of processing core 2 is 20%, which is below the first threshold of 42%, processing core 2 can be determined as a second processing core; the load of processing core 4 is 80%, which is above the first threshold of 42%, so processing core 4 is not determined as a second processing core. After determining the second processing core, processing core 3 (i.e., the first processing core) requests the first constant from processing core 2 (i.e., the second processing core). After receiving a read request carrying the address of the first constant, processing core 2 queries its own cache for the first constant, returns the first constant to processing core 3 in the case of a hit, and returns miss information to processing core 3 in the case of a miss.
In the case of a miss, processing core 3 may adopt the second determination manner, in which a shared directory unit is set and determines the processing core in which the first constant is cached as the second processing core, thereby re-determining the second processing core. Alternatively, it may directly request the first constant from the last level cache or main memory. Or it may adopt both manners in parallel, re-determining the second processing core via the shared directory unit while also requesting the first constant from the last level cache or main memory, and finally using whichever result is returned first.
Illustratively, the second threshold is set by the developer, or the second threshold is set empirically by an expert, or the second threshold is dynamically determined based on the loading of at least two processing cores. For example, the second threshold is an average value, a median value, etc. of the loads of the at least two processing cores.
Illustratively, the distance between a processing core and the first processing core is typically an integer. For example, if a processing core is directly adjacent to the first processing core, the distance between them may be considered to be 1; if the first processing core is separated from the processing core by one intervening processing core, the distance between them may be considered to be 2. For example, as shown in FIG. 3, assume that the at least two processing cores are 5 processing cores arranged in a ring structure. For processing core 1, the processing cores at a distance of 1 include processing core 2 and processing core 5, and the processing cores at a distance of 2 include processing core 3 and processing core 4.
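The hop distance in the ring structure of this example can be computed as follows; the `ring_distance` helper is hypothetical, introduced only to make the distances in FIG. 3 concrete.

```python
# Illustrative sketch: hop distance between cores numbered 1..n arranged in
# a ring, as in the 5-core example of FIG. 3.
def ring_distance(a, b, n=5):
    diff = abs(a - b) % n
    return min(diff, n - diff)  # shorter way around the ring

print(ring_distance(1, 2))  # 1
print(ring_distance(1, 5))  # 1 (cores 1 and 5 are adjacent on the ring)
print(ring_distance(1, 3))  # 2
print(ring_distance(1, 4))  # 2
```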
In summary, the processor provided in the embodiments of the present application shows a manner in which the first processing core itself determines an adjacent core as the second processing core. In this manner, on the one hand, the first processing core does not need to request the first constant from all of the other processing cores in the processor; instead, it selects all or part of the adjacent cores to request the first constant, avoiding congestion of read requests between the processing cores.
On the other hand, the transmission delay between an adjacent core, in particular a physically adjacent processing core, and the first processing core is low. Compared with the conventional manner, the transmission delay between the first processing core and an adjacent processing core is lower, so the efficiency with which the first processing core reads the first constant can be effectively improved.
Further, the adjacent cores of each processing core can be dynamically adjusted by the processor according to the first condition. If load is used to adjust the adjacent cores, the first processing core can request the first constant from a processing core with a lower load each time, preventing the processing load of any one core from being increased. If the distance between the first processing core and candidate cores is used to determine the adjacent cores, a greedy criterion is applied: the first processing core can select the processing core with lower transmission delay as the second processing core each time, reducing the transmission delay of requesting the first constant as much as possible.
The second mode of determination is to set a shared directory unit, which determines the processing core in which the first constant is cached as the second processing core.
Alternatively, as shown in FIG. 4, processor 100 includes a shared directory unit 500, shared directory unit 500 being coupled to at least two processing cores, or, as shown in FIG. 5, shared directory unit 500 is separate from processor 100, processor 100 being coupled to shared directory unit 500.
Optionally, in a case where the first constant misses in the first cache, the first processing core is configured to send a query request to the shared directory unit 500, where the query request includes the address of the first constant, and the shared directory unit 500 is configured to indicate the caching status of at least one constant in the at least two processing cores. The shared directory unit 500 is configured to determine, based on the query request, the processing core in which the first constant is cached as the second processing core. The shared directory unit 500 is further configured to request the first constant from the second processing core, or to send the identification of the second processing core to the first processing core so that the first processing core requests the first constant from the second processing core.
Optionally, the shared directory unit 500 is configured to indicate a cache condition of at least one constant in at least two processing cores, or the shared directory unit 500 stores a cache condition of at least one constant in at least two processing cores. At least one constant is a constant of a cache in at least two processing cores, or at least one constant is a constant stored in a cache in at least two processing cores.
The shared directory unit 500 is illustratively embodied as a directory-like structure, or table structure. The shared directory unit 500 stores whether each of the at least one constant is present in a cache of a respective one of the at least two processing cores.
Optionally, the shared directory unit 500 includes at least one cache core bitmap, where the at least one cache core bitmap corresponds one-to-one to at least one constant. The cache core bitmap corresponding to each constant of the at least one constant is used to indicate a processing core in which the constant is cached in the at least two processing cores.
Wherein the at least two processing cores are all or part of the processing cores in the processor.
Optionally, each of the at least one cache core bitmap includes p bits, p being a positive integer representing the number of the at least two processing cores. For example, if the at least two processing cores are 5 processing cores, then p is 5, i.e., each cache core bitmap includes 5 bits. Each bit in each cache core bitmap corresponds to one of the at least two processing cores, i.e., one bit per processing core.
The cache core bitmap indicates the caching status of the constant across the at least two processing cores, guaranteeing the reliability of the second processing core determined by the shared directory unit 500. Compared with storing the identifiers of the processing cores, the bitmap reduces the data size and improves the query efficiency of the shared directory unit 500, indirectly improving the read efficiency of the first processing core for the first constant.
Illustratively, at least two processing cores are 5 processing cores, the cache core bitmap of the first constant includes 5 bits, assuming that the first bit from right to left corresponds to processing core 1, the second bit corresponds to processing core 2, the third bit corresponds to processing core 3, the fourth bit corresponds to processing core 4, and the fifth bit corresponds to processing core 5, if the cache core bitmap of the first constant is 00110, it indicates that the processing cores 2 and 3 cache the first constant. At this time, the shared directory unit 500 may confirm at least one of the processing cores 2 and 3 as the second processing core.
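The bitmap decoding described above can be sketched as follows; this is a minimal illustration, and the function name `cores_caching` and the 1-based core numbering are assumptions for the example, not part of the processor design:

```python
def cores_caching(bitmap, p):
    """Return the (1-based) processing cores whose bit is set in the
    cache core bitmap, where bit i-1 counted from the least significant
    end corresponds to processing core i."""
    return [i for i in range(1, p + 1) if (bitmap >> (i - 1)) & 1]

# Bitmap 00110 for 5 cores: bits 2 and 3 are set, i.e. processing
# cores 2 and 3 cache the first constant.
assert cores_caching(0b00110, 5) == [2, 3]
```

The shared directory unit would then confirm at least one of the returned cores as the second processing core.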
Alternatively, when confirming the second processing core, the shared directory unit 500 may also filter all or part of the processing cores in which the first constant is cached as the second processing core by a first condition, in the manner shown in the above "determining a neighboring processing core as the second processing core". That is, the shared directory unit 500 determines, based on the query request, that a processing core that has cached the first constant and satisfies the first condition is the second processing core. The first condition includes at least one of: a load lower than a first threshold; being among the first n processing cores when ordered by load from low to high, n being a positive integer; a distance from the first processing core lower than a second threshold; and being among the first k processing cores when ordered by distance from the first processing core from near to far, k being a positive integer.
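A possible software model of this first-condition screening is sketched below; the helper name `filter_candidates`, the dict-based load and distance tables, and the keyword parameters are all hypothetical, and a hardware directory would realize the same logic differently:

```python
def filter_candidates(candidates, load, dist, load_threshold=None,
                      n=None, dist_threshold=None, k=None):
    """Screen the cores that cache the constant by the first condition.

    candidates: cores in which the first constant is cached.
    load/dist: core id -> load / distance from the first processing core.
    Each enabled sub-condition further intersects the candidate set.
    """
    result = set(candidates)
    if load_threshold is not None:                       # load below a first threshold
        result &= {c for c in candidates if load[c] < load_threshold}
    if n is not None:                                    # first n cores by ascending load
        result &= set(sorted(candidates, key=lambda c: load[c])[:n])
    if dist_threshold is not None:                       # distance below a second threshold
        result &= {c for c in candidates if dist[c] < dist_threshold}
    if k is not None:                                    # first k cores by ascending distance
        result &= set(sorted(candidates, key=lambda c: dist[c])[:k])
    return result
```

Any core remaining in the result may be confirmed as the second processing core.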
Optionally, the shared directory unit 500 further includes at least one state flag, where the at least one state flag corresponds to the at least one constant, and each state flag of the at least one state flag is used for setting the constant corresponding to the state flag to an exclusive state or a shared state. The exclusive state is used for indicating that the constant is cached by one processing core, and the shared state is used for indicating that the constant is cached by a plurality of processing cores; in other words, the exclusive state indicates that the constant exists in the cache of one processing core, and the shared state indicates that the constant exists in the caches of a plurality of processing cores. That is, the setting of the shared directory unit 500 follows the cache coherency protocol, setting a state flag for each constant to identify whether the constant is exclusive to one processing core or shared by multiple processing cores. It should be appreciated that the cache coherency protocol is set to optimize data access efficiency and data coherency between multiple processing cores; however, a constant is by nature read-only (i.e., not modifiable and globally shared), and thus the setting of the state flags is optional for constants. Setting the state flag enables the shared directory unit 500 to quickly learn the state of a constant in the at least two processing cores, so that the shared directory unit 500 can implement global adjustment for the constant based on the state flag; e.g., the shared directory unit 500 determines whether the constant needs to be cached in other processing cores according to a change in the state of the constant and a change in the access frequency of the processing cores to the constant.
For example, the processing core 1 prefetches hot data into the cache of the processing core 1 according to a hot data prefetching principle. Hot data refers to frequently accessed data or data expected to be frequently accessed, i.e., the hot data is first data or data associated with the first data, where the first data refers to data whose access frequency is greater than an access threshold. The association with the first data includes a temporal association or a spatial association. In particular, the prefetching of hot data may be based on temporal locality or spatial locality. Temporal locality refers to the fact that recently accessed data is likely to be reused, and spatial locality refers to the fact that adjacent data is likely to be accessed successively. That is, data temporally associated with the first data refers to data accessed within a fixed duration before the first data, and data spatially associated with the first data refers to data adjacent to the address of the first data or within a fixed address interval of it. For example, processing core 1 frequently accesses constant 1, and constant 1 is adjacent to constant 2; processing core 1 may then consider constant 2 to be hot data and prefetch constant 2 into the cache ahead of time. Constant 2 is marked in the shared directory unit 500 as exclusive at this time. If processing core 1 starts to access constant 2 frequently after constant 2 is prefetched into the cache of processing core 1, the shared directory unit 500 may determine that constant 2 is hot data and may prefetch constant 2 into the caches of other processing cores in advance.
If the shared directory unit 500 determines that there are two pieces of hot data, such as constant 3 and constant 4, where constant 3 is in the exclusive state and constant 4 is in the shared state, then when the shared directory unit 500 prefetches hot data for a processing core, constant 3 is preferentially prefetched into the processing core, because the hot data prefetching priority of a constant in the exclusive state is higher than that of a constant in the shared state. The hot data prefetching priority indicates the priority with which hot data is prefetched into the cache of a processing core, where the hot data is frequently accessed by the processing core or is expected to be frequently accessed by the processing core, and frequent access means an access frequency greater than the access threshold. By preferentially prefetching data in the exclusive state, the prefetching accuracy can be improved, so that the cache is utilized more efficiently, i.e., cache space is preferentially reserved for hot, exclusive data; data prefetching is thus realized in a targeted and more efficient manner, and the overall performance of the system is improved.
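The priority rule above can be illustrated as follows; the names `EXCLUSIVE`, `SHARED`, and `prefetch_order` are assumptions for this sketch:

```python
EXCLUSIVE, SHARED = "E", "S"

def prefetch_order(hot_constants):
    """Order (constant, state) pairs for prefetching: exclusive-state
    constants come before shared-state ones, i.e. they have the higher
    hot data prefetching priority. The sort is stable, so ties keep
    their original order."""
    return sorted(hot_constants, key=lambda item: item[1] != EXCLUSIVE)

# Constant 3 is exclusive and constant 4 is shared, so constant 3 is
# prefetched first even though constant 4 was listed first.
assert prefetch_order([("c4", SHARED), ("c3", EXCLUSIVE)]) == \
       [("c3", EXCLUSIVE), ("c4", SHARED)]
```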
Alternatively, the access threshold is preset, e.g., restored to a default value after the processor is powered up, or is set by the user, e.g., through an open interface of the processor.
Optionally, the shared directory unit 500 is configured to send miss information to the first processing core if no processing core in which the first constant is cached is found. The first processing core is then used to request the first constant from a last level cache or a main memory, where the last level cache is the last level of cache in the processor.
The main memory may also be referred to as memory, primary storage, or the like. The main memory is the primary storage level directly accessed by the processor and is used for storing instructions and data executed by the processor.
Optionally, the miss information is to indicate that there is no processing core of the at least two processing cores that caches the first constant.
Optionally, the first processing core sends cache update information to the shared directory unit 500 after requesting the first constant from the last level cache or the main memory, or after storing the first constant requested from the last level cache or the main memory into the first cache, the cache update information indicating that the first constant is cached in the first cache. The shared directory unit 500 then updates the cache condition in the shared directory unit 500 according to the cache update information. For details, reference is made to the following "2. Maintenance of shared directory units".
Alternatively, the shared directory unit 500 identifies a processing core that caches the first constant but does not satisfy the first condition as the second processing core if no processing core that both caches the first constant and satisfies the first condition is found. That is, when no processing core both caches the first constant and satisfies the first condition, the shared directory unit 500 reconfirms the second processing core on the basis of caching the first constant alone. The first condition is a condition related to at least one of a load of the processing core and a distance of the processing core from the first processing core, i.e., an additional condition by which the shared directory unit 500 determines the second processing core. If no processing core meets both the basic condition and the additional condition, the shared directory unit can relax the screening condition for the second processing core so that only the basic condition needs to be met, so that the first processing core can obtain the first constant with the shortest possible delay, improving the reading efficiency of the constant. In particular, when the first condition is a condition related to the distance from the first processing core, although the shortest first read delay cannot be guaranteed, the first read delay is still far lower than the second read delay of reading from the last level cache or the main memory in the conventional manner. That is, although abandoning the screening of the additional condition cannot achieve the optimal reading efficiency, the reading efficiency is still improved compared with the conventional manner.
Optionally, the shared directory unit 500 sends miss information to the first processing core if no processing core that caches the first constant and satisfies the first condition is found. That is, when no processing core both caches the first constant and satisfies the first condition, the shared directory unit 500 directly sends miss information to the first processing core to inform the first processing core that the first constant should be requested from the last level cache or the main memory. In the case where the first condition is a load-related condition, when the loads of the processing cores in which the first constant is cached are high, the first processing core is directly informed of a miss so that it requests the first constant from the last level cache or the main memory, avoiding a situation in which the loads further increase because the first processing core requests the first constant from those processing cores. This avoids load pressure caused by the read request of the first processing core, ensures normal operation of the processing cores that cache the first constant but do not satisfy the first condition, and avoids suspension of the operation of the first processing core caused by waiting a long time for a processing core to satisfy the first condition, i.e., ensures the working efficiency of the at least two processing cores.
The first condition is a condition related to at least one of a load of the processing core and a distance of the processing core from the first processing core, i.e., an additional condition by which the shared directory unit 500 determines the second processing core. If no processing core meets both the basic condition and the additional condition, the shared directory unit can relax the screening condition for the second processing core so that only the basic condition needs to be met, so that the first processing core can obtain the first constant with the shortest possible delay, improving the reading efficiency of the constant.
Optionally, the miss information is to indicate that there are no processing cores of the at least two processing cores that are cached with the first constant and that satisfy the first condition.
In summary, the processor provided in the embodiment of the present application shows two arrangements of the shared directory unit, i.e. the shared directory unit is disposed in the processor or disposed outside the processor. When the shared directory unit is disposed external to the processor, other processors or components are also supported for use of the shared directory unit.
Another aspect shows a shared directory unit for determining the second processing core. The shared directory unit is used to indicate the cache condition of the at least one constant in the at least two processing cores. That is, the second processing core determined by the shared directory unit caches the first constant, improving the hit rate of the read request sent by the first processing core to the second processing core, especially compared with the above-described "determination mode one, in which an adjacent processing core is determined to be the second processing core". Although the delay of the first processing core querying the shared directory unit for the second processing core and the delay of the shared directory unit determining the second processing core are added, these added delays are still generally shorter than the read delay and the transmission delay of the first processing core reading directly from the last level cache or the main memory, and the hit rate of the first processing core for the first constant is still ensured.
In addition, through the shared directory unit, the integration of the cache conditions of constants in the at least two processing cores of the processor is realized, and further, private cache sharing among the at least two processing cores is realized, cache redundancy in the at least two processing cores is reduced, and the cache utilization rate of the at least two processing cores is improved.
After the shared directory unit 500 has confirmed the second processing core, there are two ways in which the first constant can be requested from the second processing core for the first processing core: one is that the first processing core sends a read request to the second processing core by itself, and the other is that the shared directory unit 500 sends a read request to the second processing core on behalf of the first processing core. Specifically, the two ways are shown below.
(1) The first processing core sends a read request to the second processing core.
Optionally, the shared directory unit 500 is configured to send a query response to the first processing core, where the query response includes an identification of the second processing core.
Optionally, the first processing core requests the first constant from the second processing core based on an identification of the second processing core in the query response. Or the first processing core sends a read request to the second processing core based on the identification of the second processing core in the query response, wherein the read request carries the address of the first constant.
Optionally, the query response further includes an address of the first constant, or the query response includes identification information, where the identification information is used to indicate that the current query response is a response of the first processing core to the query request of the first constant. In one implementation, the address of the first constant is used to confirm the correspondence between the query request and the query response, where the query response includes the address of the first constant. In another implementation manner, the query request and the query response respectively carry the same or corresponding identification information, where the identification information may be preset by a developer or generated according to a preset rule, and the embodiment of the present application is not limited to this.
Optionally, the second processing core queries the second cache for the first constant based on the address of the first constant and returns the first constant to the first processing core in a read response.
In summary, in the processor provided by the embodiment of the present application, the shared directory unit sends the query response to the first processing core, so that the first processing core requests the first constant from the second processing core, thereby ensuring the singleness of the function of the shared directory unit and improving the maintainability of the shared directory unit.
(2) The shared directory unit sends a read request to the second processing core.
Optionally, the shared directory unit 500 is configured to send a read request to the second processing core, where the read request includes an address of the first constant, and the target address of the read request is the first cache.
Optionally, the target address of the read request is the processing core that requested the first constant. I.e. the target address of the read request may be the first cache, an identification of the first processing core, etc. After receiving the read request, the second processing core reads the first constant from the second cache according to the address of the first constant, and returns the first constant to the first cache or the first processing core according to the target address of the read request.
Optionally, shared directory unit 500 may send a query response to the first processing core in addition to the read request to the second processing core. The query response is used to indicate that there is a processing core that caches a first constant of the at least two processing cores, or the query response is used to indicate a second processing core that caches a first constant. That is, as shown in "(1) the first processing core sends a read request to the second processing core", the query response may or may not carry the identifier of the second processing core.
For the first processing core, after receiving the query response, the first processing core may still send a read request for the first constant to the second processing core, as described above in "(1) The first processing core sends a read request to the second processing core". The second processing core may return a read response to at least one of the first processing core and the shared directory unit 500.
In summary, in the processor provided in the embodiment of the present application, after the shared directory unit determines the second processing core, the shared directory unit requests the first constant from the second processing core on behalf of the first processing core. The shared directory unit does not need to first return a query response to the first processing core and have the first processing core then send a read request to the second processing core, so that the transmission path by which the first processing core requests the first constant is shortened, the transmission delay is reduced, and the reading efficiency of the first constant is further improved.
In some embodiments, based on "determination mode two: a shared directory unit is set, and the shared directory unit determines the processing core in which the first constant is cached as the second processing core", a plurality of processing cores may exist among the second processing cores determined by the shared directory unit. That is, there are a plurality of second processing cores. At this time, the first processing core or the shared directory unit is configured to request the first constant from a processing core that satisfies a second condition among the plurality of second processing cores, where the second condition includes at least one of a lowest load and a closest distance to the first processing core.
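A minimal sketch of selecting among a plurality of second processing cores under the second condition; the selector name `pick_second_core` and its `by` parameter are hypothetical conveniences for the example:

```python
def pick_second_core(second_cores, load, distance, by="load"):
    """Among multiple second processing cores, pick the one satisfying
    the second condition: the lowest load, or the closest distance to
    the first processing core."""
    metric = load if by == "load" else distance
    return min(second_cores, key=lambda core: metric[core])
```

For instance, with cores 2 and 3 both caching the first constant, the core with the lower load (or the shorter distance) would be chosen.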
Since determination mode two sets the shared directory unit, each second processing core determined by the shared directory unit is a processing core in which the first constant is cached. That is, if the second processing cores are determined based on determination mode two, determination mode one, in which adjacent processing cores are determined as the second processing cores, is not required, and only one of the second processing cores needs to be selected. This prevents the load in the whole processor from increasing, and ensures that the first processing core can quickly read the first constant.
The above-mentioned "determination mode one: an adjacent processing core is determined as the second processing core" and "determination mode two: a shared directory unit is set, and the shared directory unit determines the processing core in which the first constant is cached as the second processing core" may each be implemented as an independent embodiment, or may be implemented as a combined embodiment. In the combined embodiment, the first processing core first executes determination mode one: the adjacent core is determined to be the second processing core, and the first constant is requested from the second processing core. If the first constant is not cached in the adjacent cores of the first processing core, i.e., the first constant misses in the second processing core, the first processing core executes determination mode two: the processing core in which the first constant is cached among the at least two processing cores is queried through the shared directory unit and confirmed as the second processing core, and the first constant is then requested from the second processing core. If the shared directory unit returns miss information, no processing core among the at least two processing cores caches the first constant, or no processing core both caches the first constant and satisfies the first condition. At this point, the first processing core should request the first constant from the last level cache or the main memory.
That is, the first processing core is configured to: determine all or part of adjacent cores as the second processing core in the event that the first constant misses in the first cache, the adjacent cores being processing cores that are physically or logically adjacent to the first processing core; request the first constant from the second processing core; and, in the event that the first constant misses in the second processing core, send a query request to the shared directory unit, the query request including the address of the first constant. The shared directory unit is configured to indicate the cache condition of the at least one constant in the at least two processing cores, to determine, based on the query request, that the processing core in which the first constant is cached is the second processing core, and to request the first constant from the second processing core, or to send the identification of the second processing core to the first processing core so that the first processing core requests the first constant from the second processing core.
In other embodiments, in addition to performing "determination mode two: a shared directory unit is set, and the shared directory unit determines the processing core in which the first constant is cached as the second processing core", the first processing core may request the first constant from the last level cache or the main memory at the same time. That is, the first processing core is configured to send a query request to the shared directory unit in the case of a miss of the first constant in the first cache, and to send a read request to the last level cache or the main memory, the first constant being determined based on the earlier arriving (or earliest arriving) of the query response and the read response, where the query request includes the address of the first constant, the read request includes the address of the first constant, the shared directory unit is configured to indicate the cache condition of the at least one constant in the at least two processing cores, the query response is the response of the shared directory unit to the query request, the read response is the response of the last level cache or the main memory to the read request, and the last level cache is the last level of cache in the processor. The later arriving of the query response and the read response is ignored, taken as an invalid response, or discarded. If the query response arrives earlier than the read response of the last level cache or the main memory, the first processing core waits for the read response of the second processing core or sends a read request to the second processing core according to the query response, and the read response of the last level cache or the main memory is set to be invalid (or discarded). If the read response of the last level cache or the main memory arrives earlier than the query response, the query response is set to be invalid (or discarded).
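The first-response-wins arbitration described above may be sketched as follows; the tuple layout (arrival time, source, payload) and the function name are assumptions for illustration:

```python
def resolve_first_response(responses):
    """responses: list of (arrival_time, source, payload) tuples from
    the shared directory unit and from the last level cache / main
    memory. The earliest arriving response wins; all later responses
    are returned as invalid and may be discarded."""
    winner = min(responses, key=lambda r: r[0])
    invalid = [r for r in responses if r is not winner]
    return winner, invalid

# The query response (t=3) beats the last level cache response (t=5),
# so the read response from the last level cache is set to invalid.
winner, invalid = resolve_first_response(
    [(5, "last level cache", "constant value"),
     (3, "shared directory", "second core id")])
assert winner == (3, "shared directory", "second core id")
```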
In this way, the excessive delay caused when the first constant does not exist in the shared directory unit and the first processing core still needs to request the first constant from the last level cache or the main memory can be avoided; that is, the longest transmission delay of the mode provided in the embodiment of the present application is shortened to be as close to that of the conventional mode as possible.
For the shared directory unit 500, to ensure the timeliness of the cache condition of the at least one constant in the shared directory unit 500, the shared directory unit 500 should be maintained in a timely manner.
2. Maintenance of shared directory units.
Optionally, the shared directory unit 500 is configured to update the cache condition of the at least one constant in the shared directory unit in the case of a cache update in any of the at least two processing cores.
Illustratively, the cache condition of at least one constant in the shared directory unit 500 is represented by a directory entry. The directory entry is in the format of [ constant block address, cache core bitmap, status flag ]. Wherein the constant block address is a constant address. The address of the constant may be the address of the constant in the last level cache or main memory, or the address in the processing core.
Several scenarios in which the shared directory unit maintains directory entries are shown next.
In the first case, a directory entry corresponding to the first constant exists, and the first constant is newly added to the first cache. For example, when the first processing core reads the first constant from the cache of the second processing core in the manner described above, and the first constant is newly added to the first cache, the shared directory unit 500 is triggered to update the directory entry corresponding to the first constant. Assume that the at least two processing cores are 5 processing cores, processing core 1 through processing core 5, respectively. Suppose the directory entry corresponding to the first constant before the first cache is updated is [ address of the first constant, 00110, shared state ], where a bit of 0 in the cache core bitmap indicates that the processing core corresponding to the bit does not cache the first constant, and a bit of 1 indicates that the processing core corresponding to the bit caches the first constant; that is, processing core 2 and processing core 3 cache the first constant, and processing core 1, processing core 4, and processing core 5 do not. If the first processing core is processing core 4 and the second processing core is processing core 3, then after processing core 4 reads the first constant from processing core 3, processing core 4 sends cache update information to the shared directory unit 500, where the cache update information is used to indicate that the first constant is newly added to the first cache and carries the address of the first constant. The shared directory unit 500 determines, according to the cache update information, that the directory entry corresponding to the first constant should be updated, and the updated directory entry corresponding to the first constant is [ address of the first constant, 01110, shared state ].
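The first-case update (setting processing core 4's bit in bitmap 00110 to obtain 01110) can be sketched as follows; the entry layout mirrors the [ constant block address, cache core bitmap, state flag ] format, and the helper name is assumed:

```python
def add_core(entry, core):
    """Set the bit of `core` (numbered from 1, least significant bit
    first) in the entry's cache core bitmap, and keep the state flag
    consistent: shared when more than one core caches the constant,
    exclusive when exactly one does."""
    address, bitmap, _ = entry
    bitmap |= 1 << (core - 1)
    state = "shared" if bin(bitmap).count("1") > 1 else "exclusive"
    return [address, bitmap, state]

# Processing core 4 newly caches the first constant: 00110 -> 01110.
assert add_core(["addr1", 0b00110, "shared"], 4) == ["addr1", 0b01110, "shared"]
```

The same helper also covers the second case: inserting the first caching core into a fresh entry with an all-zero bitmap yields the exclusive state.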
And in the second case, no directory entry corresponding to the first constant exists, and the first constant is newly added to the first cache, namely the first processing core reads the first constant from the last-stage cache or the main memory. Assume that at least two processing cores are 5 processing cores, processing core 1 through processing core 5, respectively. If the first processing core is the processing core 4, after the processing core 4 reads the first constant from the last-level cache or the main memory, the cache update information is sent to the shared directory unit 500, where the cache update information is used to indicate that the first cache is newly added with the first constant, and the cache update information carries the address of the first constant. According to the cache update information, the shared directory unit 500 determines that the directory entry corresponding to the first constant does not exist, and then the shared directory unit 500 newly adds a directory entry corresponding to the first constant, where the directory entry corresponding to the newly added first constant is [ address of the first constant, 01000, exclusive state ]. At this time, since only one processing core caches the first constant, the state of the first constant is an exclusive state.
And thirdly, the first cache clears or replaces the first constant, and the directory entry corresponding to the first constant is not deleted. Assume that at least two processing cores are 5 processing cores, processing core 1 through processing core 5, respectively. If the first processing core is the processing core 4, the processing core 4 replaces the cache line corresponding to the first constant due to insufficient space of the first cache, and should notify the shared directory unit 500 to update the directory entry corresponding to the first constant, for example, send cache update information to the shared directory unit 500, where the cache update information is used to instruct the first cache to clear or replace the first constant, and the cache update information carries the address of the first constant. The shared directory unit 500 determines, according to the cache update information, that the directory entry corresponding to the first constant is updated, if the directory entry corresponding to the first constant before the update is [ address of the first constant, 01110, shared state ], then the directory entry corresponding to the updated first constant is [ address of the first constant, 00110, shared state ].
And fourthly, the first cache clears or replaces the first constant, and the directory entry corresponding to the first constant is deleted. Assume that the at least two processing cores are 5 processing cores, processing core 1 through processing core 5, respectively. If the first processing core is processing core 4, and processing core 4 replaces the cache line corresponding to the first constant due to insufficient space in the first cache, it should notify the shared directory unit 500 to update the directory entry corresponding to the first constant, e.g., by sending cache update information to the shared directory unit 500, where the cache update information is used to instruct the first cache to clear or replace the first constant and carries the address of the first constant. The shared directory unit 500 determines, according to the cache update information, that the directory entry corresponding to the first constant is to be updated; e.g., the directory entry corresponding to the first constant before the update is [ address of the first constant, 01000, exclusive state ]. Since the current state of the first constant is the exclusive state, and processing core 4 indicates that the first constant has been cleared or replaced, i.e., no processing core currently caches the first constant, the directory entry corresponding to the first constant may be deleted, so as to reduce the memory occupied by invalid information in the shared directory unit 500 and avoid a reduction of the query speed of the shared directory unit 500 caused by excessive invalid information.
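Cases three and four (clearing a core's bit, and deleting the entry when no cached copy remains) can be sketched together. Here the directory is modeled as a dict from address to entry; the helper name and the downgrade to the exclusive state when exactly one core remains are assumptions not spelled out in the text:

```python
def remove_core(directory, addr, core):
    """Clear `core`'s bit in the entry for `addr`. Delete the entry
    when no core caches the constant any more (case four); otherwise
    keep the entry (case three), downgrading the state flag to
    exclusive when exactly one caching core remains."""
    address, bitmap, _ = directory[addr]
    bitmap &= ~(1 << (core - 1))
    if bitmap == 0:
        del directory[addr]          # no cached copy left: drop the entry
    else:
        ones = bin(bitmap).count("1")
        directory[addr] = [address, bitmap,
                           "shared" if ones > 1 else "exclusive"]
    return directory

# Case four: core 4 held the only copy (01000, exclusive) -> entry deleted.
d = {"a": ["a", 0b01000, "exclusive"]}
remove_core(d, "a", 4)
assert d == {}
```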
In summary, the processor provided in the embodiment of the present application implements a maintenance process for the shared directory unit. The cache condition of the at least one constant stored in the shared directory unit is kept up to date, so that the second processing core determined by the shared directory unit is reliable. Therefore, whenever the cache of any of the at least two processing cores is updated, the shared directory unit updates the cache condition of the at least one constant, ensuring that the first processing core can determine a reliable second processing core according to the shared directory unit and quickly read the first constant.
It should be noted that, after the first processing core reads the first constant from the second processing core, the last-level cache, or the main memory, the first constant may or may not be cached in the first cache. For example, the first processing core stores the first constant in a register, performs an operation based on the first constant stored in the register, and releases the register after the operation is finished; or the first processing core caches the first constant in the L1 cache and releases it after the operation is finished, or decides whether to release it based on actual execution conditions (e.g., releases it according to a release policy after the L1 cache is full). In some embodiments, the first constant is not cached in the first cache but is cached in the second processing core; compared with the conventional method, the method provided by the embodiment of the application supports the first processing core reading the first constant from the second processing core, so the reading efficiency of the first constant is higher than that of the conventional method. In other embodiments, after the first constant is cached in the first cache, the first processing core can read the first constant directly from the first cache without reading it from the second processing core, the last-level cache, or the main memory, which greatly improves the efficiency of reading the first constant and reduces the read latency. However, the capacity of the first cache of the first processing core is often kept small to ensure a high read-write speed, so the first processing core needs to determine whether to cache the first constant in the first cache.
The first processing core is configured to cache the first constant in the first cache if a third condition is met, where the third condition includes at least one of: an access frequency of the first constant being greater than a third threshold, the access frequency indicating how frequently the first processing core requests the first constant from the third processing core; and an occupancy of the first cache being less than a fourth threshold.
Optionally, the third threshold and the fourth threshold may be set by a software developer through an interface, or may be set by the processor before shipment.
Optionally, the first processing core caches the first constant in the first cache if the access frequency of the first constant is greater than the third threshold. That is, if the first processing core frequently accesses the first constant in the third processing core, for example, if the number of accesses within a first duration is greater than the third threshold, the first constant currently read from the third processing core is cached in the first cache, that is, stored in a cache line of the first cache.
Optionally, the first processing core caches the first constant in the first cache if the occupancy of the first cache is less than a fourth threshold. That is, if the first cache is relatively idle when the first constant hits in the third processing core, the first constant is also cached in the first cache when it is read by the first processing core.
Optionally, the first processing core caches the first constant to the first cache if the occupancy of the first cache is greater than a fifth threshold and the access frequency of the first constant is greater than a third threshold. That is, if the capacity of the first cache is relatively tight, only the first constant with relatively high access frequency is cached in the first cache.
Alternatively, in the case where the first processing core reads the first constant from the last-level cache or the main memory, or in the case where the shared directory unit 500 does not record any cache of the first constant, that is, no processing core among the at least two processing cores has cached the first constant, the first processing core caches the first constant in the first cache. Further, a cache record for the first constant is newly added to the shared directory unit 500. That is, if no processing core among the at least two processing cores caches the first constant, the first constant may be cached in the first cache directly, ignoring the constraints on access frequency and occupancy, so that it is first ensured that at least one processing core caches the first constant and other processing cores do not need to read from the last-level cache or the main memory when they need it.
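The caching decision described above can be sketched as a single predicate (a hypothetical simplification; the threshold values and the function name are assumptions, and a hardware implementation would combine the condition clauses differently depending on the embodiment):

```python
# Hypothetical sketch of the third condition: should the first processing
# core cache a constant it just read? Thresholds are illustrative only.
def should_cache(access_freq, occupancy, any_core_caches_it,
                 third_threshold=8, fourth_threshold=0.5):
    if not any_core_caches_it:
        # No processing core holds the constant yet: cache it unconditionally
        # so later readers need not go to the last-level cache or main memory.
        return True
    # Otherwise cache only a frequently accessed constant, or when the
    # first cache still has idle capacity.
    return access_freq > third_threshold or occupancy < fourth_threshold
```

For example, a rarely used constant found in a nearly full cache is not cached locally, while the same constant is cached when no other core holds it.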
In some embodiments, the shared directory unit 500 also supports prefetching constants into the caches of a portion of the processing cores in advance. Specifically, this is described as follows.
In some embodiments, the shared directory unit 500 is configured to send a prefetch instruction to at least one fourth processing core if a second constant of the at least one constant satisfies a fourth condition. The prefetch instruction is used to cache the second constant in the cache of the at least one fourth processing core, where the distance between each of the at least one fourth processing core and each of at least two fifth processing cores is less than a fifth threshold, the at least two fifth processing cores being processing cores that access the second constant.
Wherein the fourth condition comprises at least one of the number of at least two fifth processing cores being greater than the first number, and the access frequency of the second constant being greater than a sixth threshold.
Optionally, the access frequency of the second constant is a mean, sum, median, etc. of the access frequencies of the at least two fifth processing cores for the second constant.
For example, when multiple fifth processing cores frequently access the second constant, the shared directory unit 500 may prefetch the second constant to a fourth processing core in advance, the distance between the fourth processing core and each fifth processing core being less than the fifth threshold. For example, assuming that the at least two processing cores are five processing cores, processing core 1 through processing core 5, if the shared directory unit 500 detects that processing core 2 and processing core 4 will frequently access the second constant, the shared directory unit 500 selects a portion of the at least two processing cores as the fourth processing core, e.g., processing core 3, and caches the second constant in the cache of processing core 3. At this time, the distance between processing core 3 and processing core 2 is 1, and the distance between processing core 3 and processing core 4 is also 1, that is, the fifth threshold is 2. Of course, the fifth threshold may also be set to 1, in which case processing core 2 and processing core 4 themselves serve as fourth processing cores.
Optionally, the at least one fourth processing core satisfies a fifth condition, the fifth condition including at least one of: a distance from each of the at least two fifth processing cores being less than the fifth threshold, and a load being less than a sixth threshold. That is, the second constant may also be prefetched in advance into a core with a lower load according to a load determination, so as to ensure that the at least two processing cores can read the second constant from the fourth processing core without placing excessive load pressure on the fourth processing core.
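The selection of fourth processing cores can be sketched as follows on a linear topology (a hypothetical simplification: the distance metric, the load threshold, and the function name are assumptions; a real NoC would use routing-hop distance):

```python
# Hypothetical sketch: a candidate core qualifies as a fourth processing
# core if its distance to EVERY fifth processing core is below the fifth
# threshold and its load is below a load threshold (the fifth condition).
def select_prefetch_targets(cores, fifth_cores, loads,
                            fifth_threshold=2, load_threshold=0.8):
    targets = []
    for c in cores:
        max_dist = max(abs(c - f) for f in fifth_cores)
        if max_dist < fifth_threshold and loads.get(c, 0.0) < load_threshold:
            targets.append(c)
    return targets

# Cores 1..5; cores 2 and 4 frequently access the second constant.
core_loads = {1: 0.1, 2: 0.5, 3: 0.2, 4: 0.5, 5: 0.9}
print(select_prefetch_targets([1, 2, 3, 4, 5], [2, 4], core_loads))  # [3]
```

With the fifth threshold set to 2, only processing core 3 (distance 1 to both core 2 and core 4) qualifies, matching the example above.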
It should be understood that the fifth processing core may be understood as the first processing core described above, and the fourth processing core may be understood as the second processing core described above.
In summary, the processor provided in the embodiment of the present application supports the shared directory unit prefetching the second constant to at least one fourth processing core in advance, where the fourth processing core may be a processing core with a high-frequency access requirement for the second constant, a neighboring core of such a processing core, or a processing core within the fifth-threshold distance. In one aspect, the second constant is a known shared constant, and there is generally a higher probability of sharing it between adjacent cores. Therefore, when it is confirmed that a fifth processing core has a high access requirement, the second constant can be loaded in advance into its adjacent cores (i.e., the at least one fourth processing core). If the fifth processing core does not cache the second constant, it is guaranteed that the fifth processing core can quickly fetch the second constant from a fourth processing core; and if the fifth processing core itself is selected as a fourth processing core and caches the second constant, it can quickly read the second constant prefetched into its own cache when it needs it.
Specifically, the second constant is prefetched into the fourth processing core so that the plurality of fifth processing cores can obtain the second constant from an adjacent fourth processing core as soon as possible. In the case where the plurality of fifth processing cores are close to each other and form one core cluster, they share one fourth processing core; at this time, only one copy of the second constant needs to be pre-stored in the fourth processing core for all fifth processing cores to acquire it in time, which reduces cache occupation while ensuring reading speed. In the case where the plurality of fifth processing cores are far apart and form a plurality of core clusters, the fifth processing cores within each core cluster share one fourth processing core, and each fifth processing core acquires the second constant from the fourth processing core within its own cluster. The fourth processing core may be selected such that, within each core cluster, the sum of the routing paths from the fourth processing core to each fifth processing core is minimized. That is, the second constant typically does not occupy a cache line within a fifth processing core. However, in some special cases, in order to improve processing efficiency, the second constant may also be cached in a portion of the processing cores in the core cluster (e.g., a fifth processing core) whose access frequency for the second constant is greater than an agreed threshold.
In the related art, each core independently caches the same constant data, which causes storage redundancy and reduces the overall cache utilization rate. The last-level cache that can be shared by all cores is often located at the back end of the network on chip, and its latency is high.
Therefore, the embodiment of the present application provides a processor. It relates to the field of computer architecture, and in particular to a shared-constant cache coherence protocol in a multi-core/many-core processor (such as a CPU or a GPU), aiming to reduce access latency on a cache miss and improve cache utilization by optimizing the mechanism for sharing constant data among multiple cores. According to the embodiment of the present application, the overall utilization rate of the constant cache is improved through an inter-core constant cache coherence interconnect.
Specifically, when a constant cache miss occurs in a certain processing core, data is preferentially requested from a physically/logically adjacent processing core; if none of the adjacent processing cores hits, the cache states of other non-adjacent cores are queried through a globally shared directory unit; and if no cache record exists in the shared directory unit, the downstream last-level cache or main memory is finally accessed. Specifically, this is described as follows.
1. Shared directory.
All processing cores share a shared directory unit that includes a centralized directory that records the cache distribution for each constant block.
The directory entry format in the shared directory unit is [ constant block address, cache core bitmap, status flag ].
Wherein the cache core bitmap uses a bitmask to indicate which cores' private constant caches store the constant block (e.g., an 8-core system uses 8 bits, one bit per core).
Status flag: the sharing state of the constant block (e.g., "exclusive", which may be used for hot-data prefetch optimization).
2. Cache miss processing flow.
When processing core A accesses constant data and the private constant cache of the local core (i.e., processing core A) misses, data is requested from the adjacent cores; if the private constant cache of an adjacent core hits, the following steps are not performed.
If the private constant cache of the adjacent core is not hit, a query request containing a target address (i.e., a constant block address of constant data) is sent to the centralized directory.
Upon receipt of the request, the shared directory unit examines the directory entry.
If the shared directory unit hits, the shared directory unit may return, to the private constant cache of processing core A, the identity of the target cache containing the target address (e.g., the private constant cache of core B). A read request to the private constant cache of processing core B is then initiated by the private constant cache of processing core A.
In other embodiments, the shared directory unit may concurrently send a read request corresponding to the constant block address to the private constant cache of processing core B and mark the read request as destined for the private constant cache of processing core A.
If the shared directory unit does not hit, miss information is returned to processing core A, and processing core A initiates a read request to the last-level cache as normal.
In other embodiments, processing core A may initiate read requests to the shared directory unit and the last-level cache concurrently, and discard the redundant returned result in the case of an adjacent-core cache hit.
3. Data caching and directory updating.
A directory entry update is triggered whenever the contents of a processing core's private constant cache change, specifically in the following two cases.
Cache addition: core A is marked in the cache core bitmap of the directory entry.
Cache replacement: if core A replaces a constant block due to insufficient space, it notifies the directory to clear its own ID (i.e., its bit in the cache core bitmap).
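The miss-handling flow of steps 2 and 3 can be sketched end to end (a hypothetical software model, not RTL; each cache is modeled as a dict from constant-block address to data, and the directory maps an address directly to the owner cache):

```python
# Hypothetical model of the miss-handling flow: local cache, then adjacent
# cores, then the centralized directory, then the last-level cache.
def read_constant(addr, local_cache, neighbor_caches, directory, llc):
    if addr in local_cache:                       # 1. private constant cache hit
        return local_cache[addr], "local"
    for cache in neighbor_caches:                 # 2. try adjacent cores
        if addr in cache:
            return cache[addr], "neighbor"
    owner_cache = directory.get(addr)             # 3. centralized directory lookup
    if owner_cache is not None and addr in owner_cache:
        return owner_cache[addr], "remote"
    return llc[addr], "llc"                       # 4. fall back to last-level cache

core_b = {0x40: "pi"}
directory = {0x40: core_b}                        # addr -> owner's cache
print(read_constant(0x40, {}, [{}], directory, {}))   # ('pi', 'remote')
```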
Further optimization may be done in several ways.
4. Dynamic grouping of adjacent cores.
The range of adjacent cores is dynamically adjusted based on the physical topology (e.g., on-chip NoC adjacency) or run-time access patterns (e.g., the same thread group in a GPU). Preferably, low-load cores or physically closer processing cores (as in GPU clusters) are selected to transfer data, reducing latency.
5. Access-frequency-based dynamic caching.
The local cache decides, according to the hit rate of address accesses, whether to store data read back from the caches of adjacent cores into a local cache line. For addresses that are not repeatedly hit, data present in the caches of adjacent cores is not written into a local cache line if the local cache is full.
6. Prefetching.
When a constant block is frequently accessed by a plurality of cores, the directory actively triggers a prefetch instruction so that the block is loaded into the caches of adjacent cores in advance.
Fig. 6 shows a flowchart of a method for reading constants according to an exemplary embodiment of the present application. The method is performed by a processor that includes at least two processing cores. The method includes step 210.
Step 210. A first processing core of the at least two processing cores obtains a first constant cached in a second processing core in case the first constant misses in the first cache.
The first cache is a cache of the first processing core.
Alternatively, the at least two processing cores may communicate with each other via a bus, via a network on chip as shown in fig. 2, via wires on a circuit board of the processor, etc. It should be noted that the foregoing merely shows examples of the connection manner between the at least two processing cores and does not limit the embodiment of the present application in this respect.
Optionally, a first processing core of the at least two processing cores requests the first constant from a second cache, the second cache being a cache of the second processing core, in case the first constant misses in the first cache.
Alternatively, the first cache and the second cache may be caches dedicated to constants, that is, the data cached in the first cache and the second cache are read-only data, in which case the first cache may be referred to as a first constant cache and the second cache as a second constant cache. Or, the first cache and the second cache are general-purpose caches, that is, they are used to cache both constants and variables, so the data cached in them may be read-only data or data supporting write operations (i.e., writable data). For example, the first cache and the second cache are level-one caches in the first processing core and the second processing core, respectively.
Optionally, the first cache is a private cache of the first processing core, i.e. the first cache is an exclusive cache of the first processing core. The first cache typically only supports access by the first processing core, or the data in the first cache is not directly available for querying by other processing cores.
Optionally, for the first processing core of the at least two processing cores, if the first constant is required, the first cache is accessed first; if the first constant hits in the first cache, no subsequent steps are required. In the case where the first constant misses in the first cache, a second processing core is selected from the at least two processing cores, and a read request carrying the address of the first constant and an identification of the first processing core is sent to the second processing core.
Alternatively, the second processing core is one processing core or a plurality of processing cores, i.e., the first processing core may determine one or more of the at least two processing cores as the second processing core.
Optionally, the second processing core accesses the second cache based on the read request sent by the first processing core; in the case where the first constant hits in the second cache, it sends the first constant to the first processing core according to the identification of the first processing core indicated in the read request, and in the case where the first constant misses in the second cache, it returns miss information to the first processing core accordingly.
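The handling of such a read request can be sketched as follows (a hypothetical illustration; the request/response field names are assumptions, since the patent only specifies that the request carries the constant's address and the requester's identification):

```python
# Hypothetical model of the second processing core servicing a read request
# that carries the constant's address and the first core's identification.
def handle_read_request(second_cache, request):
    addr, requester = request["address"], request["requester_id"]
    if addr in second_cache:                      # hit in the second cache
        return {"to": requester, "hit": True, "data": second_cache[addr]}
    return {"to": requester, "hit": False}        # miss information

resp = handle_read_request({0x10: 42}, {"address": 0x10, "requester_id": 1})
print(resp)  # {'to': 1, 'hit': True, 'data': 42}
```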
Optionally, the first constant is data that is frequently read by at least two processing cores, or the read frequency of the at least two processing cores for the first constant is greater than a frequency threshold. That is, in the case where the read frequency of at least two processing cores for constants is greater than the frequency threshold, the method provided in the embodiment of the present application may be used to improve the read efficiency of at least two processing cores for constants.
It should be noted that, with respect to the "request" and "response" shown in the embodiments of the present application, whether the read request, the query request, the read response, the query response, etc., are generally in the form of instructions in the processor, but the embodiments of the present application are not limited thereto.
In summary, the processor provided in the embodiment of the present application supports the first processing core requesting the first constant from other processing cores when the first constant misses. Compared with the traditional mode of requesting the first constant from the last-level cache or the main memory: on one hand, the read-write speed of a cache inside a processing core is generally higher than that of the last-level cache and the main memory, that is, the second processing core finds the first constant in the second cache faster than the last-level cache or the main memory does, or its read latency is lower, so obtaining the first constant by requesting it from the second processing core is faster, which improves the first processing core's efficiency in reading constants. On the other hand, due to the architecture of the processor, the transmission path (or transmission delay) between the first processing core and the last-level cache or the main memory is often longer than that between the first processing core and the second processing core; the transmission path between the first processing core and the last-level cache is shown as path 10 in fig. 1, and the transmission path between the first processing core and the second processing core is shown as path 20 in fig. 2. Therefore, the combined transmission delay of the first processing core sending the read request for the first constant to the second processing core and of the second processing core returning the read response is lower than that of the conventional method, so obtaining the first constant through the second processing core is faster and the first processing core's efficiency in reading constants is improved.
In addition, the processor shown in the embodiment of the present application realizes cache sharing between the at least two processing cores. By sharing caches, each processing core does not need to ensure that a constant is cached in its own cache; it only needs to be ensured that some processing core among the at least two processing cores caches the constant. For example, when the first processing core needs the first constant, it reads the first constant from the second processing core, thereby realizing the sharing of private caches between the at least two processing cores. Since this private-cache sharing mechanism is mainly aimed at constants, which are read-only, cache inconsistency is unlikely to occur; this ensures the reliability of constant data, avoids redundant copies of the first constant across the at least two processing cores, and improves the utilization rate of the caches in the at least two processing cores.
1. Determining the second processing core.
For the first processing core, requesting the first constant from the second processing core is preferred in order to improve reading efficiency. However, if the first constant does not hit in the second processing core, the first processing core may need to select a second processing core again, or instead read the first constant from the last-level cache or the main memory in the conventional manner. Thus, additional design is required in the selection of the second processing core. The embodiment of the present application provides two ways of determining the second processing core.
The first mode of determination is to determine an adjacent processing core as a second processing core.
And the second determination mode is that a shared directory unit is set, and the processing core cached with the first constant is determined as a second processing core by the shared directory unit.
Next, the above two manners of determining the second processing core will be described one by one; it should be understood that the order of description does not indicate any preference between them.
The first mode of determination is to determine an adjacent processing core as a second processing core.
Alternatively, step 210 may be implemented as follows: in the case where the first constant misses in the first cache, the first processing core, the processor, or the shared directory unit determines all or part of the adjacent cores as second processing cores, the adjacent cores being processing cores physically or logically adjacent to the first processing core, and requests the first constant from the second processing core, or sends an identification of the second processing core to the first processing core so that the first processing core requests the first constant from the second processing core. That is, the second processing core is all or part of the adjacent cores.
It should be noted that the unit determining the adjacent processing core as the second processing core may be the first processing core, the processor, the shared directory unit, or the like, which are shown above, or may be other control modules in the processor, other processing cores (such as processing cores other than the first processing core), or the like. The embodiment of the present application is not limited to the unit that determines the adjacent processing core as the second processing core.
Optionally, for each of the at least two processing cores, there is at least one adjacent core. The adjacent core is a processing core that is physically or logically adjacent to the first processing core.
Optionally, physical adjacency refers to an adjacency where there is a directly connected trace on the circuit board of the processor. That is, a first processing core and a second processing core may be considered physically adjacent when there are directly connected traces on the circuit board.
Alternatively, logical adjacency refers to adjacency as indicated by a driver or firmware or other means. For example, for a register corresponding to one neighboring core in the first processing core, one or more processing cores are indicated as neighboring cores of the first processing core by a driver. The one or more processing cores may be indicated from a physical or logical identifier or the like.
Optionally, the adjacent core is a processing core logically adjacent to the first processing core. The processor supports dynamically adjusting the adjacent cores of the first processing core according to the state of the at least two processing cores. Specifically, the adjacent processing cores are processing cores that satisfy a first condition. The first condition includes at least one of: a load lower than a first threshold; being ranked in the top n positions when the cores are sorted by load in ascending order, n being a positive integer; a distance from the first processing core lower than a second threshold; and being ranked in the top k positions when the cores are sorted by distance from the first processing core in ascending order, k being a positive integer.
In some embodiments, the method further comprises the processor or the shared directory unit determining processing cores that meet the first condition as adjacent cores of the first processing core, wherein the first condition includes at least one of: a load below a first threshold; being ranked in the top n positions when sorted by load in ascending order, n being a positive integer; a distance from the first processing core below a second threshold; and being ranked in the top k positions when sorted by distance from the first processing core in ascending order, k being a positive integer.
In some embodiments, the above-described "the first processing core determining all or part of the neighboring cores as the second processing core in the case where the first constant misses in the first cache" may be implemented as the first processing core determining the neighboring cores satisfying the first condition as the second processing core in the case where the first constant misses in the first cache.
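The first condition's clauses can be sketched as a composable filter (a hypothetical illustration; the function name and the idea of skipping unset clauses are assumptions mirroring the "at least one of" wording):

```python
# Hypothetical filter for the first condition: any clause left as None is
# skipped, so any combination of the four clauses can be applied.
def select_neighbors(candidates, loads, dists,
                     load_threshold=None, n=None,
                     dist_threshold=None, k=None):
    cores = list(candidates)
    if load_threshold is not None:
        cores = [c for c in cores if loads[c] < load_threshold]
    if dist_threshold is not None:
        cores = [c for c in cores if dists[c] < dist_threshold]
    if n is not None:                  # top n by ascending load
        cores = sorted(cores, key=lambda c: loads[c])[:n]
    if k is not None:                  # top k by ascending distance
        cores = sorted(cores, key=lambda c: dists[c])[:k]
    return cores

loads = {2: 0.9, 3: 0.1, 4: 0.4, 5: 0.7}
dists = {2: 1, 3: 1, 4: 2, 5: 3}
# Cores with load below 0.8, then the 2 nearest of those.
print(select_neighbors([2, 3, 4, 5], loads, dists, load_threshold=0.8, k=2))
```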
For details, reference may be made to the related content of the above-described processor for "determining a neighboring processing core as the second processing core" in the first determining manner, which is not described herein.
And the second determination mode is that a shared directory unit is set, and the processing core cached with the first constant is determined as a second processing core by the shared directory unit.
Optionally, the processor includes a shared directory unit coupled to the at least two processing cores, or the shared directory unit is independent of the processor, the processor being coupled to the shared directory unit.
In some embodiments, step 210 may be implemented as follows: in the case where the first constant misses in the first cache, the first processing core sends a query request including the address of the first constant to the shared directory unit, the shared directory unit being configured to indicate the cache condition of at least one constant in the at least two processing cores. The shared directory unit determines, based on the query request, the processing core that has cached the first constant as the second processing core, and either requests the first constant from the second processing core or sends the identification of the second processing core to the first processing core so that the first processing core requests the first constant from the second processing core.
Optionally, the shared directory unit is configured to indicate a cache condition of the at least one constant in the at least two processing cores, or the shared directory unit stores a cache condition of the at least one constant in the at least two processing cores.
Illustratively, the shared directory unit is embodied as a directory-like structure, or table structure. The shared directory unit stores whether each of the at least one constant is present in a cache of a respective one of the at least two processing cores.
Optionally, the shared directory unit includes at least one cache core bitmap, the at least one cache core bitmap corresponding one-to-one to the at least one constant. The cache core bitmap corresponding to each constant of the at least one constant is used to indicate a processing core in which the constant is cached in the at least two processing cores.
Wherein the at least two processing cores are all or part of the processing cores in the processor.
Alternatively, when confirming the second processing core, the shared directory unit may also filter all or part of the processing cores that have cached the first constant by the first condition, in the manner shown above in "determining an adjacent processing core as the second processing core". That is, the shared directory unit determines, based on the query request, a processing core that has cached the first constant and satisfies the first condition as the second processing core. The first condition includes at least one of: a load lower than a first threshold; being ranked in the top n positions when sorted by load in ascending order, n being a positive integer; a distance from the first processing core lower than a second threshold; and being ranked in the top k positions when sorted by distance from the first processing core in ascending order, k being a positive integer.
Optionally, the shared directory unit further includes at least one state flag, the at least one state flag corresponding one-to-one to the at least one constant, and each state flag indicating whether the corresponding constant is in an exclusive state or a shared state. The exclusive state indicates that the constant is cached by one processing core, and the shared state indicates that the constant is cached by a plurality of processing cores; equivalently, the exclusive state indicates that the constant exists in the cache of one processing core, and the shared state indicates that the constant exists in the caches of a plurality of processing cores. That is, the shared directory unit follows the cache coherency protocol, setting a status flag for each constant to identify whether the constant is exclusive to one processing core or shared by multiple processing cores. It should be appreciated that a cache coherency protocol is designed to optimize data access efficiency and data coherency among multiple processing cores; because a constant is inherently read-only (i.e., not modifiable, and globally shared), the state flag is optional for constants. Once the state flags are set, the shared directory unit can quickly learn the state of each constant in the at least two processing cores, and can therefore make global adjustments to the constants based on the state flags; for example, the shared directory unit may determine whether a constant needs to be cached in other processing cores according to changes in the constant's state and in the frequency with which the processing cores access the constant.
A constant in the exclusive state has a higher hot-data prefetching priority than one in the shared state, where the hot-data prefetching priority indicates the priority with which hot data is prefetched into the cache of a processing core, and hot data refers to data frequently accessed by a processing core or data expected to be frequently accessed by a processing core. By preferentially prefetching data in the exclusive state, prefetching accuracy can be improved and the cache utilized more efficiently: cache space is preferentially reserved for hot, exclusive data, so data prefetching is performed in a targeted and more efficient manner, improving the overall performance of the system.
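A minimal sketch of the state flag and the prefetch ordering it enables: the state is derived from how many cores hold the constant, and hot constants in the exclusive state are ordered ahead of shared ones. The function names and the `(address, holder_count)` representation of hot constants are illustrative assumptions.

```python
def state_flag(holder_count):
    """Exclusive if exactly one core caches the constant, shared if several."""
    if holder_count == 1:
        return "exclusive"
    if holder_count > 1:
        return "shared"
    return "uncached"

def prefetch_order(hot_constants):
    """Order hot constants so exclusive ones are prefetched before shared ones.

    hot_constants: list of (addr, holder_count) pairs for frequently accessed
    (or expected-to-be frequently accessed) data.
    """
    rank = {"exclusive": 0, "shared": 1, "uncached": 2}
    return sorted(hot_constants, key=lambda item: rank[state_flag(item[1])])
```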
In some embodiments, the method further comprises the shared directory unit sending miss information to the first processing core if no processing core in which the first constant is cached is found. The first processing core then requests the first constant from the last level cache or the main memory, where the last level cache is the last-stage cache in the processor.
The main memory may also be referred to as memory, primary storage, or the like. Main memory is the level of the storage hierarchy that is directly accessed by the processor and stores the instructions and data executed by the processor.
Optionally, the miss information is used to indicate that no processing core of the at least two processing cores caches the first constant.
Optionally, after determining that the first constant is to be requested from the last level cache or the main memory, or after storing the first constant obtained from the last level cache or the main memory into the first cache, the first processing core sends cache update information to the shared directory unit, the cache update information indicating that the first constant is cached in the first cache. The shared directory unit then updates the cache condition recorded in the shared directory unit according to the cache update information. For details, reference is made to "2. Maintenance of the shared directory unit" below.
Optionally, if no processing core that has cached the first constant and satisfies the first condition is found, the shared directory unit determines a processing core that has cached the first constant but does not satisfy the first condition as the second processing core. That is, when no processing core both caches the first constant and satisfies the first condition, the shared directory unit reconfirms the second processing core on the basis of caching the first constant alone. The first condition is a condition related to at least one of the load of the processing core and the distance of the processing core from the first processing core; it is an additional condition by which the shared directory unit determines the second processing core. If no processing core satisfies both the basic condition and the additional condition, the shared directory unit can relax the screening conditions so that only the basic condition needs to be met, allowing the first processing core to obtain the first constant with the shortest possible delay and improving the efficiency of reading the constant. In particular, when the first condition is a distance-related condition, although the shortest possible first read delay cannot be guaranteed, the first read delay is still far lower than the second read delay of reading from the last level cache or the main memory in the conventional manner; that is, abandoning the additional-condition screening sacrifices optimal read efficiency, but read efficiency is still improved compared with the conventional manner.
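The relaxation logic in this paragraph can be sketched as a three-step selection: prefer holders that also satisfy the first condition, fall back to any holder, and report a miss only when no core caches the constant at all. The function name and the predicate callback are hypothetical.

```python
def choose_second_core(holders, satisfies_first_condition):
    """Pick second processing core candidates for a constant.

    holders: cores that cache the constant (the basic condition).
    satisfies_first_condition: predicate for the additional condition.
    """
    preferred = [c for c in holders if satisfies_first_condition(c)]
    if preferred:
        return preferred   # basic + additional condition both met
    if holders:
        return holders     # relax: basic condition alone suffices
    return []              # miss: request from last level cache / main memory
```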
Optionally, if no processing core that has cached the first constant and satisfies the first condition is found, the shared directory unit sends the miss information to the first processing core. That is, when no processing core both caches the first constant and satisfies the first condition, the shared directory unit directly sends miss information to the first processing core, informing it that the first constant should be requested from the last level cache or the main memory. When the first condition is a load-related condition and the loads of the processing cores in which the first constant is cached are all high, directly informing the first processing core of a miss avoids further increasing those loads through a read request from the first processing core. This avoids load pressure caused by the read request of the first processing core, ensures normal operation of the processing cores that have cached the first constant but do not satisfy the first condition, and prevents the first processing core from suspending its work while waiting a long time for those processing cores to satisfy the first condition; in other words, it ensures the working efficiency of the at least two processing cores.
For specific content, reference may be made to the related content described above for the processor regarding "determination mode two", in which a shared directory unit is set and the shared directory unit determines the processing core in which the first constant is cached as the second processing core; details are not repeated herein.
After the shared directory unit confirms the second processing core, there are two ways for the first processing core to obtain the first constant from the second processing core: the first processing core sends a read request to the second processing core itself, or the shared directory unit sends the read request to the second processing core on the first processing core's behalf. Specifically, these are shown as follows.
(1) The first processing core sends a read request to the second processing core.
In some embodiments, the method further includes the shared directory unit sending a query response to the first processing core, the query response including an identification of the second processing core.
Optionally, the second processing core queries the second cache for the first constant based on the address of the first constant and returns the first constant to the first processing core in a read response.
For specific content, reference may be made to the related content of the processor shown above for "(1) the first processing core sends a read request to the second processing core", which is not described herein.
(2) The shared directory unit sends a read request to the second processing core.
In some embodiments, the method further includes the shared directory unit sending a read request to the second processing core, the read request including the address of the first constant and designating the first cache as the destination of the returned first constant.
In some embodiments, based on "determination mode two: a shared directory unit is set, and the shared directory unit determines the processing core in which the first constant is cached as the second processing core", a plurality of processing cores may be determined by the shared directory unit as the second processing core; that is, there are a plurality of second processing cores. In this case, "the first processing core or the shared directory unit requests the first constant from the second processing core" may be implemented as the first processing core or the shared directory unit requesting the first constant from the processing core, among the plurality of second processing cores, that satisfies a second condition, where the second condition includes at least one of the lowest load and the closest distance to the first processing core.
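A small sketch of applying the second condition when several second processing cores exist: pick either the lowest-load core or the core closest to the first processing core. The names and the metric dictionaries are illustrative assumptions.

```python
def pick_target(second_cores, loads, dists, prefer="load"):
    """Among multiple second processing cores, select the one satisfying the
    second condition: lowest load, or closest to the first processing core.

    loads / dists: per-core load and distance-from-requester metrics.
    """
    if prefer == "load":
        return min(second_cores, key=lambda c: loads[c])
    return min(second_cores, key=lambda c: dists[c])
```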
The above-described "determination mode one: a neighboring processing core is determined as the second processing core" and "determination mode two: a shared directory unit is set, and the shared directory unit determines the processing core in which the first constant is cached as the second processing core" may each be implemented as an independent embodiment or combined into one embodiment. In the combined embodiment, the first processing core first executes determination mode one: a neighboring core is determined as the second processing core and the first constant is requested from it. If the first constant is not cached in any neighboring core of the first processing core, that is, the first constant misses in the second processing core, the first processing core then executes determination mode two: the shared directory unit is queried for the processing core, among the at least two processing cores, in which the first constant is cached, that processing core is confirmed as the second processing core, and the first constant is requested from it. If the shared directory unit returns miss information, no processing core among the at least two processing cores caches the first constant, or no processing core both caches the first constant and satisfies the first condition. At this point, the first processing core should request the first constant from the last level cache or the main memory.
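The combined flow above can be sketched as a three-tier lookup: neighbors first, then the shared directory, then the last level cache or main memory. The callbacks `neighbor_has` and `directory_holders` are hypothetical stand-ins for the hardware probes.

```python
def read_constant(addr, neighbor_cores, neighbor_has, directory_holders):
    """Combined embodiment: mode one, then mode two, then LLC/main memory.

    neighbor_has(core, addr): whether a neighboring core caches the constant.
    directory_holders(addr): cores the shared directory reports as holders.
    Returns (source, core_id_or_None).
    """
    # Determination mode one: probe physically/logically adjacent cores.
    for core in neighbor_cores:
        if neighbor_has(core, addr):
            return ("neighbor", core)
    # Determination mode two: ask the shared directory for a holder core.
    holders = directory_holders(addr)
    if holders:
        return ("directory", holders[0])
    # Miss everywhere: fall back to the last level cache or main memory.
    return ("llc_or_main_memory", None)
```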
That is, a first processing core configured to determine all or a portion of neighboring cores as a second processing core in the event that a first constant misses in a first cache, the neighboring cores being processing cores that are physically or logically neighboring to the first processing core, request the first constant from the second processing core, send a query request to a shared directory unit in the event that the first constant misses in the second processing core, the query request including an address of the first constant, the shared directory unit being configured to indicate a cache condition of at least one constant in at least two processing cores, the shared directory unit being configured to determine the processing core in which the first constant is cached as the second processing core based on the query request, the first processing core or the shared directory unit being configured to request the first constant from the second processing core.
In other embodiments, in addition to performing "determination mode two: a shared directory unit is set, and the shared directory unit determines the processing core in which the first constant is cached as the second processing core", the first processing core may simultaneously request the first constant from the last level cache or the main memory. That is, the first processing core is configured to, in the event of a miss of the first constant in the first cache, send a query request to the shared directory unit and also send a read request to the last level cache or the main memory, the first constant being determined based on whichever of the query response and the read response arrives earlier, where the query request includes the address of the first constant, the read request includes the address of the first constant, the shared directory unit is configured to indicate the cache condition of the at least one constant in the at least two processing cores, the query response is the shared directory unit's response to the query request, the read response is the response of the last level cache or the main memory to the read request, and the last level cache is the last-stage cache in the processor. The later-arriving of the query response and the read response is ignored, treated as an invalid response, or discarded. If the query response arrives earlier than the read response from the last level cache or the main memory, the first processing core waits for the second processing core's read response, or sends a read request to the second processing core according to the query response, and sets the read response from the last level cache or the main memory to invalid (or discards it). If the read response from the last level cache or the main memory arrives earlier than the query response, the query response is set to invalid (or discarded).
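The "first response wins" behavior described here can be sketched in software with two parallel tasks, keeping whichever completes first and treating the other as invalid. This is only an analogy for the hardware race; the function names are hypothetical.

```python
import concurrent.futures as cf
import time

def fetch_first_response(query_directory, read_llc):
    """Issue the directory query and the LLC/main-memory read in parallel and
    keep whichever response arrives first; the later one is discarded.

    query_directory / read_llc: zero-argument callables simulating the two
    requests. Returns (winner_label, result).
    """
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        futures = {pool.submit(query_directory): "query",
                   pool.submit(read_llc): "read"}
        done, pending = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
        winner = done.pop()
        for f in pending:       # later-arriving response: treat as invalid
            f.cancel()
        return futures[winner], winner.result()
```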
In this way, the excessive delay that would be incurred when the first constant is not recorded in the shared directory unit and the first processing core still has to request the first constant from the last level cache or the main memory can be avoided; that is, the longest transmission delay of the approach provided by the embodiments of the present application is shortened to be as close as possible to that of the conventional approach.
For the shared directory unit, in order to ensure the timeliness of the cache condition of the at least one constant recorded in the shared directory unit, the shared directory unit should be maintained in a timely manner.
2. Maintenance of the shared directory unit.
In some embodiments, the method further includes the shared directory unit updating the cache condition of the at least one constant in the shared directory unit in the event of a cache update in any of the at least two processing cores.
It should be noted that, after the first processing core reads the first constant from the second processing core, the last level cache, or the main memory, the first constant may or may not be cached in the first cache. In some embodiments, the first constant is not cached in the first cache but remains cached in the second processing core; compared with the conventional method, the method provided by the embodiments of the present application supports the first processing core reading the first constant from the second processing core, so the reading efficiency of the first constant is still higher than in the conventional method. In other embodiments, after the first constant is cached in the first cache, the first processing core can read the first constant directly from its own first cache without reading it from the second processing core, the last level cache, or the main memory, which greatly improves the efficiency of reading the first constant and reduces the read delay. However, the capacity of the first cache of the first processing core is often kept small to ensure a high read-write speed, so the first processing core needs to determine whether to cache the first constant in the first cache.
In some embodiments, the method further comprises the first processing core caching the first constant in the first cache if a third condition is met, where the third condition comprises at least one of: an access frequency of the first constant being greater than a third threshold, the access frequency indicating the frequency with which the first processing core requests the first constant from the third processing core; and an occupancy of the first cache being less than a fourth threshold.
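Following the "at least one of" phrasing of the third condition, a minimal sketch of the caching decision is shown below; the function name and the default threshold values are illustrative assumptions only.

```python
def should_cache_locally(access_freq, occupancy,
                         freq_threshold=8, occupancy_threshold=0.9):
    """Third condition ("at least one of"): cache the first constant in the
    first cache if it is requested often enough, or if the small private
    cache still has room. Threshold values are hypothetical placeholders
    for the third and fourth thresholds.
    """
    return access_freq > freq_threshold or occupancy < occupancy_threshold
```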
In some embodiments, the shared directory unit also supports prefetching constants into the caches of some processing cores in advance. Specifically, this is shown as follows.
In some embodiments, the method further includes the shared directory unit sending a prefetch instruction to at least one fourth processing core to cache the second constant in the cache of the at least one fourth processing core, if a second constant among the at least one constant satisfies a fourth condition, where the distance between each of the at least one fourth processing core and each of at least two fifth processing cores is less than a fifth threshold, the at least two fifth processing cores being the processing cores that access the second constant.
The fourth condition comprises at least one of: the number of the at least two fifth processing cores being greater than a first number, and the access frequency of the second constant being greater than a sixth threshold.
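The prefetch decision above can be sketched as follows: when the fourth condition holds ("at least one of" its two sub-conditions), choose as fourth processing cores those cores within the fifth-threshold distance of every accessing (fifth) core. The function name, the `dist` callback, and the parameter names are hypothetical.

```python
def prefetch_targets(accessor_cores, all_cores, dist,
                     num_accessors_min, access_freq,
                     freq_threshold, dist_threshold):
    """Select fourth processing cores to prefetch the second constant into.

    accessor_cores: the fifth processing cores that access the constant.
    dist(a, b): distance metric between two cores (hypothetical).
    Returns [] when the fourth condition does not hold.
    """
    # Fourth condition: enough accessing cores, or a high access frequency.
    if not (len(accessor_cores) > num_accessors_min
            or access_freq > freq_threshold):
        return []
    # Fourth cores must be close to every accessing core.
    return [c for c in all_cores
            if all(dist(c, a) < dist_threshold for a in accessor_cores)]
```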
In another aspect, an embodiment of the present application provides a graphics card, where the graphics card includes a processor as described in each of the above embodiments. Optionally, the processor is a GPU.
In another aspect, embodiments of the present application provide a computer device comprising a processor as described above. Optionally, the processor is a GPU. The computer device may be at least one of a portable computer, a desktop computer, a server cluster, an artificial intelligence (AI) computing cluster, or a cloud computing cluster. An AI computing cluster may also simply be referred to as an intelligent computing cluster.
It should be understood that references herein to "a plurality" are to two or more. The character "/" generally indicates an "or" relationship between the associated objects. In addition, the step numbers described herein merely illustrate one possible execution sequence among the steps; in some other embodiments, the steps may be executed out of the numbered order, for example two differently numbered steps may be executed simultaneously, or in an order opposite to that shown, which is not limited herein.
The foregoing description of the preferred embodiments of the present application is not intended to limit the application; the scope of the application is defined by the appended claims.