CRASH-CONSISTENT PERSISTENT MEMORY DEVICES AND METHODS
TECHNICAL FIELD
The present disclosure relates to devices and methods for information processing technology. More specifically, the present disclosure relates to devices and methods with persistent memory implementing a crash-consistent execution.
BACKGROUND
Persistent memory (PMEM) relates to storing data structures such that they can continue to be accessed using memory instructions or memory application programming interfaces (APIs) even after the end of the process that created or last modified them. With PMEM there is no need to maintain a copy of the data in separate storage for recoverability. On the other hand, the data must be usable after a crash or power failure, which can occur at any point. When the failure happens during a modification, such as an insertion into or a deletion from a database, it leaves the system in an arbitrary state, and reusing the memory content is usually unsupported, as it is too complicated, expensive and error prone. If such reuse is possible, the system or the data structure may be referred to as crash-consistent. While keeping volatile shared memory consistent requires atomic operations to implement locking, crash consistency cannot rely on locks, as a power failure does not acquire a lock before the system shuts down. Thus, new hardware assistance is required to facilitate the consistency of persistent memory after an asynchronous crash. The consistency itself is an attribute of the data of a specific application, which means that this application can use this data to continue execution after a crash. It is derived from the application logic, not from the hardware.
SUMMARY
It is an objective of the present disclosure to provide improved devices and methods with persistent memory implementing a crash-consistent execution. The foregoing and other objectives are achieved by the subject matter of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.
According to a first aspect a data processing apparatus is provided. The data processing apparatus comprises a persistent memory configured to store data and a central processing unit, CPU. The persistent memory may implement a memory hierarchy, including one or more cache memories. The CPU is configured to execute a program code, in particular an assembly code defining an application. The program code comprises one or more memory access instructions of an instruction set architecture, ISA, of the CPU, wherein the one or more memory access instructions are configured to access the data in the persistent memory. The ISA further includes one or more instructions for a crash-consistent execution, CCE, mode of the CPU for an atomic CCE of the one or more memory access instructions. In the CCE mode the CPU, in case of a read-write conflict, is configured to continue the CCE of the one or more memory access instructions, i.e. not to abort the one or more memory access instructions. Thus, an improved device with persistent memory implementing a crash-consistent execution is provided.
In a further possible implementation form, the CPU is configured such that in the CCE mode a concurrent read operation sees a committed value without aborting a corresponding write operation.
In a further possible implementation form, the one or more memory access instructions comprise one or more memory write instructions for writing data to the persistent memory and wherein during the CCE the one or more memory write instructions are tentative.
In a further possible implementation form, the one or more memory access instructions comprise one or more memory load instructions for loading data from the persistent memory and wherein the one or more memory load instructions, in case of an abort, are configured to return the last committed coherent value.
In a further possible implementation form, the CPU is a single-thread, e.g. single-core, CPU and the CCE of the one or more memory access instructions starts from a repairable state and uses one or more tentative writes. In a further possible implementation form, upon failure of the CCE, the CCE cancels the one or more tentative writes to return to the repairable state.
In a further possible implementation form, the CPU is a multi-thread, e.g. multi-core, CPU and the CCE of the one or more memory access instructions is configured to handle a write-write conflict between different threads of a plurality of threads of the CPU.
In a further possible implementation form, the plurality of threads of the CPU see write memory instructions in the same order.
In a further possible implementation form, the one or more instructions for a crash-consistent execution of the one or more memory access instructions comprise an instruction for marking the beginning of, i.e. for starting, the CCE of the one or more memory access instructions (herein referred to as cce_begin instruction). In other words, the cce_begin instruction triggers the CPU to enter the CCE mode.
In a further possible implementation form, the one or more instructions for a crash-consistent execution of the one or more memory access instructions comprise an instruction for marking the end of, i.e. for ending, the CCE of the one or more memory access instructions (herein referred to as cce_end instruction). In other words, the cce_end instruction triggers the CPU to return from the CCE mode to the default execution mode.
In a further possible implementation form, the one or more instructions for a CCE of the one or more memory access instructions comprise an instruction for verifying whether the CPU is executing the one or more memory access instructions in the CCE mode (herein referred to as cce_test instruction).
In a further possible implementation form, the one or more instructions for a CCE of the one or more memory access instructions comprise an instruction for aborting the CCE mode (herein referred to as cce_abort instruction). In a further possible implementation form, the instruction for aborting the CCE mode, i.e. the cce_abort instruction, further instructs the CPU to issue a user-defined error code. The error code may be a number (code) that is returned by the user of the CCE and interpreted by the application.
In a further possible implementation form, the CPU is configured in the CCE mode to implement a queue (herein also referred to as re-acquire table) configured to hold memory addresses associated with read-write conflicts. For instance, if the CCE wrote a shared variable X and a concurrent CCE reads the shared variable X, the writing CCE is configured to add the address of the shared variable X to the queue, i.e. the re-acquire table, in order to regain ownership at CCE commit.
According to a second aspect a method for operating a data processing apparatus is provided. The method comprises the steps of storing data in a persistent memory of the data processing apparatus; and executing a program code, in particular an assembly program code, by a CPU of the data processing apparatus, wherein the program code comprises one or more memory access instructions of an instruction set architecture, ISA, of the CPU, wherein the one or more memory access instructions are configured to access the data in the persistent memory and wherein the ISA further includes one or more instructions for a crash-consistent execution, CCE, mode of the CPU for an atomic CCE of the one or more memory access instructions, wherein in the CCE mode the CCE of the one or more memory access instructions is continued, i.e. not aborted, in case of a read-write conflict.
The method according to the second aspect of the present disclosure can be performed by the data processing apparatus according to the first aspect of the present disclosure. Thus, further features of the method according to the second aspect of the present disclosure result directly from the functionality of the data processing apparatus according to the first aspect of the present disclosure as well as its different implementation forms described above and below.
Details of one or more embodiments are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description, drawings, and claims.
BRIEF DESCRIPTION OF THE DRAWINGS
In the following, embodiments of the present disclosure are described in more detail with reference to the attached figures and drawings, in which:
Fig. 1 shows a schematic diagram illustrating a data processing apparatus according to an embodiment;
Fig. 2a shows a diagram illustrating an exemplary memory operation in the form of a tree insertion;
Fig. 2b shows exemplary code for a data processing apparatus according to an embodiment to execute the exemplary memory operation of figure 2a in a CCE mode;
Figs. 3a and 3b show exemplary code for a data processing apparatus according to an embodiment for a thread-safe write memory operation and a thread-safe read memory operation in a CCE mode;
Fig. 4a shows a flow diagram illustrating conflict handling implemented by a data processing apparatus according to an embodiment for a read-write conflict;
Fig. 4b shows a flow diagram illustrating conflict handling implemented by a data processing apparatus according to an embodiment for a write-write conflict between two CCEs with requester-loses policy;
Fig. 4c shows a flow diagram illustrating conflict handling implemented by a data processing apparatus according to an embodiment for a write-write conflict between a CCE and a non- CCE with requester-wins policy;
Fig. 5 shows a table illustrating differences between a CCE implemented by a data processing apparatus according to an embodiment and hardware transactional memory;
Figs. 6a-h illustrate different CCE contention types handled by a data processing apparatus according to an embodiment; and
Fig. 7 is a flow diagram illustrating a method for operating a data processing apparatus according to an embodiment.
In the following, identical reference signs refer to identical or at least functionally equivalent features.
DETAILED DESCRIPTION OF THE EMBODIMENTS
In the following description, reference is made to the accompanying figures, which form part of the disclosure, and which show, by way of illustration, specific aspects of embodiments of the present disclosure or specific aspects in which embodiments of the present disclosure may be used. It is understood that embodiments of the present disclosure may be used in other aspects and comprise structural or logical changes not depicted in the figures. The following detailed description, therefore, is not to be taken in a limiting sense, and the scope of the present disclosure is defined by the appended claims.
For instance, it is to be understood that a disclosure in connection with a described method may also hold true for a corresponding device or system configured to perform the method and vice versa. For example, if one or a plurality of specific method steps are described, a corresponding device may include one or a plurality of units, e.g. functional units, to perform the described one or plurality of method steps (e.g. one unit performing the one or plurality of steps, or a plurality of units each performing one or more of the plurality of steps), even if such one or more units are not explicitly described or illustrated in the figures. On the other hand, for example, if a specific apparatus is described based on one or a plurality of units, e.g. functional units, a corresponding method may include one step to perform the functionality of the one or plurality of units (e.g. one step performing the functionality of the one or plurality of units, or a plurality of steps each performing the functionality of one or more of the plurality of units), even if such one or plurality of steps are not explicitly described or illustrated in the figures. Further, it is understood that the features of the various exemplary embodiments and/or aspects described herein may be combined with each other, unless specifically noted otherwise.
Figure 1 shows a schematic diagram illustrating a data processing apparatus 100 according to an embodiment. In an embodiment, the data processing apparatus 100 may be implemented as a cloud server, mobile device, tablet computer, laptop computer or other type of data processing apparatus. As illustrated in figure 1, the data processing apparatus 100 comprises processing circuitry 101, comprising one or more central processing units, CPUs, 101, and a memory 105 with at least a persistent memory portion (possibly including further types of memory as well) configured to persistently store data. Each CPU 101 may comprise one or more cores. The memory 105 may implement a memory hierarchy, including one or more persistent cache memories implementing a cache coherence protocol, such as the MESI protocol. As will be described in more detail below, in an embodiment, the memory 105 is configured to ensure that data stored within the CPU caches will be flushed to the persistent memory 105 upon a power failure.
The one or more CPUs 101 may be implemented in hardware and/or software and may comprise digital circuitry, or both analog and digital circuitry. Digital circuitry may comprise components such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or general-purpose processors. The memory 105 of the data processing apparatus 100 may be configured to store executable program code, in particular assembly code 107 which, when executed by the one or more CPUs 101, causes the data processing apparatus 100 to perform the functions and methods described herein. As illustrated in figure 1, the data processing apparatus 100 may further comprise a communication interface 103 configured to communicate via wired and/or wireless connections.
As will be described in more detail below, the CPU 101 is configured to execute a program code, in particular an assembly code 107 defining an application. The assembly code 107 comprises one or more memory access instructions of an instruction set architecture, ISA, of the CPU, wherein the one or more memory access instructions are configured to access the data in the persistent memory 105. The ISA further includes one or more instructions for a crash-consistent execution, CCE, mode of the CPU 101 for an atomic CCE of the one or more memory access instructions. As will be described in more detail below, in the CCE mode the CPU 101, in case of a read-write conflict, is configured to continue the CCE of the one or more memory access instructions, i.e. not to abort the one or more memory access instructions. Moreover, the CPU 101 is configured such that in the CCE mode a concurrent read operation sees a committed value without aborting a corresponding write operation. As used herein, if an execution by the CPU 101 is a crash consistent execution (CCE), it means that all the writes of the execution are made persistent and public atomically by the CPU 101. Thus, if an execution brings the data from a consistent state to a consistent state, the application will acquire crash-consistency.
Figure 2a illustrates an exemplary operation in the form of a data insertion into a balanced binary tree which involves writes to three different pointers (as indicated in figure 2a). A data processing apparatus 100 according to an embodiment may perform the same operation in the CCE mode, for making this operation crash consistent, with the exemplary code 107 illustrated in figure 2b, which includes, as part of the ISA of the CPU 101, an instruction cce_begin for triggering the CPU 101 to enter the CCE mode and an instruction cce_end for triggering the CPU 101 to stop operating in the CCE mode. As will be appreciated from the following more detailed description, the crash-consistent persistent memory 105 of the data processing apparatus 100 according to an embodiment is simple to use, robust, safe and adds negligible overhead to existing multicore software.
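The exemplary code of figure 2b is not reproduced here. Purely for illustration, the described pattern may be sketched in C as follows, with cce_begin and cce_end modelled as hypothetical intrinsics (the stub bodies, names and return conventions are assumptions made for this sketch, not a definition of the ISA) and a simplified, non-rebalancing insertion standing in for the operation of figure 2a:

    #include <stddef.h>

    /* Hypothetical intrinsics modelling the CCE instructions; the real
     * instructions are provided by the CPU 101, and these stubs only make
     * the sketch compile. */
    static inline int  cce_begin(void) { return 0; } /* assumed: 0 = CCE mode entered */
    static inline void cce_end(void)   { }           /* assumed: commit all tentative writes atomically */

    struct node {
        long         key;
        struct node *left;
        struct node *right;
    };

    /* Simplified stand-in for the insertion of figure 2a: the three pointer
     * writes become persistent and visible atomically at cce_end, so a crash
     * leaves either the old tree or the fully updated tree, never a partial
     * insertion. */
    void tree_insert_cce(struct node *parent, struct node *n)
    {
        while (cce_begin() != 0)
            ;                          /* retry until the CCE mode is entered */
        n->left  = NULL;
        n->right = NULL;
        if (n->key < parent->key)
            parent->left  = n;
        else
            parent->right = n;
        cce_end();
    }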
In the following, further embodiments of the data processing apparatus 100 with the crash-consistent persistent memory 105 will be described in more detail. Generally, in the CCE mode the CPU 101, when executing a code segment C 107, is configured to ensure atomic execution, i.e. all write memory operations are tentative and visible only to the writer until commit, and, at commit, all the write memory operations of the code segment C 107 are instantly shared in the cache of the memory 105 and made persistent. If the execution aborts before commit (for instance, due to a power failure), the write memory operations are cancelled and the state prior to the execution of the code segment C 107 is recovered.
As will be appreciated, while CCE is atomic, concurrent CCE executions might be interleaved (because by design a CCE is not serializable). For example, a first CCE1 may commit and write a shared variable X which is in the read-set of a concurrent second CCE2. After this event, the second CCE2 continues and reads the shared variable X again. The code C may be responsible for its own serializability and synchronization. However, the CCE mode of the CPU 101 of the data processing apparatus according to an embodiment guarantees that all atomic instructions work correctly in the CCE mode and that hardware transactional memory (HTM) works correctly with a concurrent CCE so that software serializability and correctness is preserved while gaining crash-consistency.
In an embodiment, the ISA of the CPU 101 may comprise one or more of the following instructions for the CCE mode of the CPU 101, i.e. for delimiting and managing a code segment to be executed in a crash-consistent manner by the CPU 101.
In an embodiment, the one or more instructions for a crash-consistent execution of the one or more memory access instructions comprise an instruction for marking the beginning of, i.e. for starting the CCE of the one or more memory access instructions (herein referred to as cce_begin instruction) by the CPU 101. In other words, the cce_begin instruction triggers the CPU 101 to enter the CCE mode. If entering the CCE mode was successful, the cce_begin instruction may return a confirmation or, in case of a failure, a dedicated error code.
In an embodiment, the one or more instructions for a crash-consistent execution of the one or more memory access instructions comprise an instruction for marking the end of, i.e. for ending the CCE of the one or more memory access instructions (herein referred to as cce_end instruction) by the CPU 101. In other words, the cce_end instruction triggers the CPU 101 to return from the CCE mode to the default execution mode. In an embodiment, the cce_end instruction may further trigger the CPU 101 to share all writes.
In an embodiment, the one or more instructions for a CCE of the one or more memory access instructions comprise an instruction for verifying, i.e. testing whether the CPU 101 is executing the one or more memory access instructions in the CCE mode (herein referred to as cce_test instruction). In an embodiment, the cce_test instruction may trigger the CPU 101 to check whether a cce_begin instruction was actually called.
In an embodiment, the one or more instructions for a CCE of the one or more memory access instructions comprise an instruction for aborting the CCE mode (herein referred to as cce_abort instruction) of the CPU 101. In an embodiment, the instruction for aborting the CCE mode, i.e. the cce_abort instruction, further triggers the CPU to issue a user-defined error code. The error code may be a number (code) that is returned by the user of the CCE and interpreted by the application. In an embodiment, the cce_abort instruction may further trigger the CPU 101 to cancel all tentative writes. As will be appreciated, program code 107 for the data processing apparatus 100 according to an embodiment may use the instructions described above to define the points in the code where the system should be repairable and consistent.
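Purely as an illustration of how program code 107 could use these instructions to delimit a repairable point, the following sketch models them as hypothetical C intrinsics; the names, return conventions, the error code and the control flow after cce_abort (here simply returning an error) are assumptions made for this sketch and not a definition of the ISA:

    /* Hypothetical C intrinsics for the four CCE instructions; the stub
     * bodies only make the sketch compile. */
    static inline int  cce_begin(void) { return 0; }            /* assumed: 0 = CCE mode entered */
    static inline void cce_end(void)   { }                      /* assumed: share and persist all tentative writes */
    static inline int  cce_test(void)  { return 1; }            /* assumed: non-zero while in CCE mode */
    static inline void cce_abort(unsigned code) { (void)code; } /* assumed: cancel tentative writes, report code */

    #define ERR_BAD_VALUE 42u  /* user-defined error code interpreted by the application */

    /* Update a single persistent field crash-consistently; returns 0 on success. */
    int update_field(long *field, long value)
    {
        if (cce_begin() != 0)
            return -1;                /* the CCE mode could not be entered */
        if (!cce_test())
            return -1;                /* defensive check that cce_begin took effect */
        if (value < 0) {
            cce_abort(ERR_BAD_VALUE); /* the tentative state is cancelled */
            return -1;
        }
        *field = value;               /* tentative until cce_end commits it */
        cce_end();                    /* the write becomes persistent and public atomically */
        return 0;
    }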
While the exemplary code segment 107 shown in figure 2b illustrates how CCE by the CPU 101 of the data processing apparatus 100 according to an embodiment may be used to add crash consistency to serial execution by performing all writes atomically, the exemplary code 107 illustrated in figures 3a and 3b uses CCE to add crash-consistency to a critical section of the code (assuming this section connects two consistent states). For this purpose the code inside the CCE may be protected with a sequential lock. As will be appreciated, a sequential lock is free when it is even, locked by incrementing it atomically to the next odd value, and freed by incrementing it to the next even value. A reader may sample the lock before and after reading and may retry if the lock was changed or odd, to get a snapshot.
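For reference, a minimal sequential-lock sketch along the lines just described could look as follows; it is plain C11 code, independent of the CCE instructions and not the exemplary code of figures 3a and 3b:

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct { atomic_ulong seq; } seqlock_t;  /* even = free, odd = locked */

    static bool seq_try_lock(seqlock_t *l)
    {
        unsigned long s = atomic_load(&l->seq);
        if (s & 1UL)
            return false;                            /* already locked */
        /* lock by atomically incrementing to the next odd value */
        return atomic_compare_exchange_strong(&l->seq, &s, s + 1);
    }

    static void seq_unlock(seqlock_t *l)
    {
        atomic_fetch_add(&l->seq, 1);                /* next even value: free again */
    }

    /* Reader side: sample the lock before and after reading and retry if the
     * value changed or was odd, to obtain a consistent snapshot. */
    static unsigned long seq_read_begin(seqlock_t *l)
    {
        return atomic_load(&l->seq);
    }

    static bool seq_read_retry(seqlock_t *l, unsigned long start)
    {
        return (start & 1UL) || atomic_load(&l->seq) != start;
    }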
The exemplary code shown in figure 3a illustrates why Write-Write-Conflict (WWC) aborts, as implemented by the data processing apparatus 100 according to an embodiment, are beneficial for correctness. If two concurrent CCEs try to acquire the same sequential lock simultaneously (assuming the persistent value of the lock is free), both may read zero from the lock and both may manage to write 1 to the lock locally. In that case both CCEs may enter the critical section together and mutual exclusion will be broken. This is addressed by the data processing apparatus 100 according to an embodiment by letting the first CCE write the 1 to the lock and aborting the second writer. In other words, in an embodiment, the CPU 101 of the data processing apparatus 100 is configured to implement the "requester-loses" policy in the CCE mode. As will be appreciated, if the second writer had aborted the first writer (as would be the case for HTM, which implements a "requester-wins" policy), the first CCE would have retried and aborted the second CCE, and the system would have entered a live-lock.
In an embodiment, the CPU 101 may be configured to use compare-and-swap (CAS) for the atomicity of the lock acquisition (as the WWC abort does not by itself add synchronization) in order to avoid the following scenario. If a plain read is used to see whether the lock is free, and then a plain write is used to lock it, there could be a concurrent thread that wrote 1 to the lock, acquired it, and committed between the read and the write, so that mutual exclusion would be broken. The exemplary code shown in figure 3b makes use of the fact that according to an embodiment the CPU 101 of the data processing apparatus 100 may read, in the CCE mode, the persistent value, even in the presence of a concurrent speculative writer. If reading the critical section is finished before the concurrent CCE writer commits, the reading may be linearized before that CCE, and if the CCE writer committed before the second sampling of the sequential lock, the reading may be retried and linearized after the CCE.
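The following sketch follows the pattern described for figures 3a and 3b, reusing the hypothetical cce_begin/cce_end intrinsics and the sequential-lock helpers of the preceding sketches; it is illustrative only, not the code of the figures, and the behavior after a hardware abort (here: restarting the enclosing loop) is an assumption:

    /* Writer (pattern of figure 3a): acquire the sequential lock with a CAS
     * inside the CCE, update the protected data and release the lock; the
     * lock release and the data update are published and made persistent
     * together at cce_end. */
    void cce_writer(seqlock_t *lock, long *shared, long value)
    {
        for (;;) {
            if (cce_begin() != 0)
                continue;              /* retry entering the CCE mode */
            if (!seq_try_lock(lock)) { /* CAS avoids missing a committed acquisition */
                cce_end();             /* nothing was written; leave the CCE mode and retry */
                continue;
            }
            *shared = value;           /* tentative write inside the critical section */
            seq_unlock(lock);
            cce_end();
            return;
        }
    }

    /* Reader (pattern of figure 3b): a plain, non-CCE reader sees the last
     * committed value even while a concurrent CCE writer holds the lock
     * speculatively, and retries only if that writer committed between the
     * two samples of the sequential lock. */
    long cce_reader(seqlock_t *lock, long *shared)
    {
        unsigned long s;
        long v;
        do {
            s = seq_read_begin(lock);
            v = *shared;
        } while (seq_read_retry(lock, s));
        return v;
    }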
In order to illustrate features of the CCE implemented by the CPU 101 of the data processing apparatus 100, some differences between CCE and HTM are highlighted in the following. HTM is implemented in several leading processor architectures, such as ARM, x86 and POWER PC, and some of the components of HTM may also be used for CCE. However, while HTM verifies both atomicity and serializability, CCE implemented by the CPU 101 of the data processing apparatus 100 according to an embodiment only requires atomicity. This less stringent requirement of CCE results in several differences between CCE and HTM that make CCE implemented by the CPU 101 of the data processing apparatus 100 according to an embodiment computationally lighter and more robust than HTM.
In the table shown in figure 5 some of the main differences between a current conventional HTM implementation and embodiments of the crash-consistent memory implemented by the data processing apparatus 100 are highlighted. A first fundamental difference concerns the main goal of HTM and of crash-consistent memory. More specifically, while the goal of HTM is a lock-free synchronization scheme, the goal of crash-consistent memory is free recovery. This main design difference implies that loads are not tracked by the CCE implemented by the CPU 101 of the data processing apparatus 100 according to an embodiment, as they cannot be related to recovery. In addition, as the code is assumed to be synchronized, the CCE implemented by the CPU 101 of the data processing apparatus 100 according to an embodiment verifies the linearizability of each memory location, but read-write conflicts, i.e. situations where CCE1 reads a variable X that is concurrently speculatively written by CCE2, do not abort any execution, and the software handles the synchronization of X as if it were written without CCE. Not tracking the reads dramatically reduces the footprint of the CCE implemented by the CPU 101 of the data processing apparatus 100 according to an embodiment, and avoiding read-write conflict aborts and most capacity violation aborts dramatically improves the stability of the CCE implemented by the CPU 101 of the data processing apparatus 100 according to an embodiment compared to HTM.
In an embodiment, only a WWC may trigger an abort, in order to preserve one valid order of writes to each memory object. The aborts may be triggered by the CPU 101 according to the requester-loses policy, where the second writer aborts, unlike HTM implementations which use the requester-wins policy, where the last writer survives and the previous writer aborts. When an execution CCE1 wrote the shared variable X and then a concurrent execution CCE2 writes X, CCE2 is the requester. If X is a lock and the policy is requester-wins, CCE2 will abort CCE1, which will immediately retry and kill CCE2, and this process will repeat forever. As a CCE is usually wrapping lock-based code, this scenario can be avoided by implementing the requester-loses policy for CCE-CCE WWC aborts by the CPU 101 of the data processing apparatus 100 according to an embodiment. According to the requester-loses policy, if CCE1 acquired the lock X, then when CCE2 tries to write X, CCE2 will abort itself and retry until CCE1 commits, and then CCE2 will manage to acquire X. This is exactly the behavior expected in lock-based code.
When the requester is HTM or a plain write, and not another CCE, the CCE implemented by the CPU 101 of the data processing apparatus 100 according to an embodiment may respond according to the requester-wins policy. This may be beneficial, as plain writes do not have a retry mechanism, and HTM implements the requester-wins policy because an HTM transaction is no longer consistent if a concurrent thread writes to its read-set.
Finally, unlike HTM transactions, which have no progress guarantee, small CCEs implemented by the CPU 101 of the data processing apparatus 100 according to an embodiment, which fit within the L1 cache size and associativity of the memory 105, are guaranteed to pass in hardware.
In HTM the critical section writes speculatively into the private L1 cache, and if there is a power failure all speculative writes are cancelled. Thus, HTM is also crash consistent. However, HTM has some requirements that make it unusable for crash-consistency and rarely used for concurrency control. For instance, HTM aborts frequently on read-write conflicts, capacity conflicts, and for other reasons that can be avoided for crash-consistency. As HTM is best-effort with no progress guarantee, it is usually required that each HTM transaction has a software fallback path in case it cannot pass in hardware. This fallback path is not acceptable for crash consistency as, in many cases, such a path does not exist.
The microarchitecture implemented by the data processing apparatus 100 will be described in the following in the context of figures 4a, 4b and 4c. As will be appreciated, in an embodiment the CCE may be implemented by the CPU 101 on top of HTM managed using cache memories. In the embodiments shown in figures 4a, 4b and 4c the data processing apparatus 100 implements, by way of example, the MESI cache coherence protocol. Once the CPU 101 (or a core thereof) enters the CCE mode through the instruction cce_begin (see 411 of figure 4a), any internal store may be allowed to modify the data only locally within the L1 cache memory, similarly to how HTM handles its transactional writes. In an embodiment, the cache lines will be updated to a modified state (also referred to as an M state in the MESI cache coherence protocol) in the L1 cache and marked with the same w-bit used by conventional HTM (see 413 of figure 4a). If an existing cache line has a previously modified state from before the transaction started, it may be written back to the L2 cache memory before applying the current store in order to preserve the pre-transaction data. However, unlike HTM, according to an embodiment of the data processing apparatus 100 loads will not mark their cache lines with R-bits, to avoid having to track the read-set and aborting on conflicts.
After the write is done, any read (such as the read in 416 of figure 4a) from other threads or cores snooping this cache line will create a read-after-write scenario (whether the other thread is within its own CCE mode/transaction or not). The CCE implemented by the CPU 101 of the data processing apparatus 100 according to an embodiment may differ from HTM in several aspects regarding the MESI (Modified Exclusive Shared Invalid) protocol. Firstly, the read-after-write scenario may not be defined as a conflict and will not abort the local transaction. Secondly, the external reader will not see the locally updated value from the stores within the CCE transaction.
The snoop response from the CCE is tweaked to "pretend" it missed the L1 cache. Instead, it will receive the cache line from the next level of the cache hierarchy (L2 or beyond) that is holding the pre-transactional value. If no lower cache memory holds the cache line, it will be read from the main memory of the memory hierarchy 105. This embodiment satisfies the CCE requirement to provide pre-transaction data on any read. In an embodiment, the snoop response may include a new attribute to force the external reader to receive the cache line in Shared state, even though its read is provided by a lower cache level and could otherwise have been given exclusive ownership. The local cache line will be moved into Shared state (also referred to as S state in the MESI cache coherence protocol) while keeping the w-bit on (see 417 of figure 4a) and the requester will get the line in Shared state (see 418 of figure 4a). As illustrated in figure 4a, the address of the snooped cache line will be pushed into a dedicated queue 417a (referred to as re-acquire table 417a in figures 4a-c) so that the CPU 101 (or the core thereof) doing the transaction can regain sole ownership of that cache line at the end of the transaction. The size of the queue 417a may be configurable. By way of example, the queue 417a may hold 8 to 16 entries. If the queue 417a becomes full, the CPU 101 may be configured to abort the CCE mode. Incoming invalidating snoops may perform a CAM lookup in this queue and abort the CCE mode upon a hit.
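The snoop handling just described and the commit flow described in the next paragraph can be summarised in a small conceptual software model; the structures, names and sizes below are illustrative assumptions and do not correspond to a concrete hardware implementation:

    #include <stdbool.h>
    #include <stdint.h>

    #define REACQUIRE_ENTRIES 16               /* configurable, e.g. 8 to 16 entries */

    enum mesi_state { MESI_M, MESI_E, MESI_S, MESI_I };

    struct l1_line {
        uint64_t        addr;
        enum mesi_state state;
        bool            w_bit;                 /* set by tentative CCE stores */
    };

    struct reacquire_table {
        uint64_t addr[REACQUIRE_ENTRIES];      /* lines snooped by readers during the CCE */
        unsigned count;
    };

    static void issue_rfo(uint64_t addr) { (void)addr; } /* placeholder for a Read-For-Ownership request */

    /* A read snoop from another core hits a w-bit line: the pre-transactional
     * value is served from a lower cache level, the local copy is demoted to
     * Shared while keeping its w-bit, and the address is queued so ownership
     * can be regained at commit. Returns false if the CCE must abort. */
    static bool on_read_snoop(struct l1_line *line, struct reacquire_table *t)
    {
        line->state = MESI_S;
        if (t->count == REACQUIRE_ENTRIES)
            return false;                      /* table full: abort the CCE mode */
        t->addr[t->count++] = line->addr;
        return true;
    }

    /* At cce_end: regain sole ownership of every queued line, then clear the
     * w-bits so that all tentative stores become system-wide coherent at once. */
    static void cce_commit(struct l1_line *lines, unsigned nlines,
                           struct reacquire_table *t)
    {
        for (unsigned i = 0; i < t->count; i++)
            issue_rfo(t->addr[i]);             /* RFOs may be issued in parallel */
        t->count = 0;
        for (unsigned i = 0; i < nlines; i++) {
            if (lines[i].w_bit) {
                lines[i].w_bit = false;        /* commit: line becomes coherent */
                lines[i].state = MESI_M;
            }
        }
    }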
The process described above is illustrated in figure 4a, where the pre-transaction value is Vp and the transaction changes the locally stored lines into Vt. The other core/thread reading the cache line will receive the old value (Vp). Once the CCE reaches cce_end (see 423 of figure 4a), the CPU 101 of the data processing apparatus 100 according to an embodiment is configured to stall execution and start looping over the re-acquire table 417a. For each address stored in the re-acquire table 417a the CPU 101 of the data processing apparatus 100 according to an embodiment is configured to issue a Read-For-Ownership (RFO) (see 423a of figure 4a) like a regular store, ignoring the local copy available for that cache line, and to wait for the cache line to arrive, which acknowledges the invalidation of any other shared copies in other cores. Other copies are guaranteed to hold the old value since they could not have been silently modified by other cores, as the line was provided by the CPU 101 only in the Shared state. This would have forced any other core that wanted to modify the cache line to try to acquire the cache line first (and to snoop the CCE core again, which would let it know there is a true conflict). Since in an embodiment the stores will be committed atomically by the operations in the CCE mode, the CPU 101 may be configured to issue the RFOs in parallel to save latency (step 423a of figure 4a). Once all the cache lines are owned by the CPU 101 and the re-acquire queue 417a is empty, the CPU 101 may commit the transaction by removing the w-bit from all the transactional lines (see 423b of figure 4a), making them system-wide coherent.
Figure 4b shows a slightly different scenario where the other core is also doing a write, causing a write-write conflict (WWC). In this scenario the handling by the CPU 101 may be split into the four following WWC cases:
• WWC Case 1: if the second (requester) core is within a CCE transaction, this CCE transaction may be aborted and its changes may be rolled back to ensure the requester-loses policy (see 424 and 426 of figure 4b), as implemented by the CPU 101 of the data processing apparatus 100 according to an embodiment (this case is also illustrated in figure 6e).
• WWC Case 2: if the second core is within an HTM transaction that started earlier, this HTM transaction may be aborted (this case is illustrated in figure 6h) to support the legacy HTM requester-wins policy. As will be appreciated, this is an exception to the requester-loses policy implemented by the CPU 101 of the data processing apparatus 100 according to an embodiment to maintain legacy correctness of HTM. Due to HTM limitations this case also applies for the read-write conflict where the HTM transaction only performs a read and CCE performs a write.
• WWC Case 3: if the second core is within an HTM transaction that started later than the CCE transaction, the local CCE transaction may need to be aborted (this case is illustrated in figure 6g) in accordance with the requester-loses policy implemented by the CPU 101 of the data processing apparatus 100 according to an embodiment.
• WWC Case 4: If the second core is not within any transaction, then its operations should not be aborted and the policy implemented by the CPU 101 of the data processing apparatus 100 according to an embodiment may dictate that the local core aborts its CCE transaction (this case is illustrated in figure 6f).
To correctly distinguish between these cases the CPU 101 of the data processing apparatus 100 according to an embodiment may be configured to include an additional flag for snoops to indicate that the respective snoop originates from a CCE transaction. Thus, an external invalidating snoop triggered by an RFO from another core will invalidate the local copy and cause a local CCE to abort only if the other core was not within a CCE transaction (corresponding to the WWC cases 2 to 4 mentioned above). If the snoop is marked as CCE then it will be ignored, and upon returning to the other core it will abort the CCE transaction there.
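For illustration only, the decision logic of WWC cases 1 to 4 can be written down as a small function; the parameters below are abstractions of the description above (in particular, how the hardware derives which side issued the second, conflicting write is not modelled):

    #include <stdbool.h>

    enum agent  { AGENT_CCE, AGENT_HTM, AGENT_PLAIN };
    enum victim { ABORT_OTHER, ABORT_LOCAL_CCE };

    /* Decide who aborts when the local CCE transaction conflicts on a write
     * with another agent; local_is_requester means the local CCE issued the
     * second (conflicting) write. */
    static enum victim resolve_wwc(enum agent other, bool local_is_requester)
    {
        switch (other) {
        case AGENT_CCE:    /* case 1: CCE vs CCE follows requester-loses */
            return local_is_requester ? ABORT_LOCAL_CCE : ABORT_OTHER;
        case AGENT_HTM:    /* cases 2 and 3: legacy HTM requester-wins is kept */
            return local_is_requester ? ABORT_OTHER : ABORT_LOCAL_CCE;
        case AGENT_PLAIN:
        default:           /* case 4: plain writes are never aborted */
            return ABORT_LOCAL_CCE;
        }
    }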
The example illustrated in figure 4b is slightly more complex than the example shown in figure 4a, because the cache line is not only written but also read first (see 416 of figure 4b, which sends the line in Shared state in 418). As will be appreciated, this implies that the snoop is not asking for ownership, but will only ask to share the cache line (assuming a Modified-to-Shared transition is allowed; otherwise it will be an invalidating snoop and behave as before). In this case, the requester will not abort after seeing the load, but once it tries to store to that cache line it will be forced to request ownership for it. In an embodiment, a CCE transaction seeing an RFO hitting a shared cache line in the local cache can automatically abort (saving the need to snoop again), as illustrated by 424 and 426 of figure 4b.
Finally, figure 4c shows the write-write conflict with a non-CCE other core. In this case, as mentioned, the local CCE transaction may have to be aborted (see 430 of figure 4c) by the CPU 101 according to an embodiment, because both writes should not be allowed to proceed (otherwise both may, for example, acquire the same lock, as the other core sees pre-CCE data that lets it enter the critical section in parallel). This is achieved when the other core issues an RFO before writing to the line, which manifests as a snoop to invalidate the local copy in the CCE core (regardless of any prior snoops by reads that keep the line in a shared state). The invalidating snoop will hit the re-acquire table 417a and abort the CCE transaction as desired.
As will be appreciated, the notion of correctness of CCE is linearizability and not serializability. Thus, if the CCE store happens after a concurrent non-CCE store, even if that non-CCE store happened after the cce_begin instruction, the CCE does not have to be aborted, and the non-CCE store can be linearized before the CCE, as linearizability is per object. To prove correctness of the CCE implemented by the CPU 101 of the data processing apparatus 100 according to an embodiment, it will be shown in the following that CCE maintains the progress of the plain code it executes and preserves the linearizability of the hardware. As will be further appreciated, unlike HTM, which is best effort, has no progress guarantee, and always requires a software fallback, any set of CCEs which fit in the L1 cache of the memory 105 of the data processing apparatus 100 according to an embodiment and have no deadlocks with one another is guaranteed to commit successfully after a finite number of retries. This guarantee is possible as the requester-loses policy enforced by the CPU 101 of the data processing apparatus 100 according to an embodiment verifies that, if multiple executions try to enter a critical section, the one that entered first will complete the execution of that section and the others will retry. As, according to an embodiment, reads are not tracked, it is also easier to control the size of the CCE footprint.
As a CCE implemented by the CPU 101 of the data processing apparatus 100 according to an embodiment tracks only writes and is less sensitive to contention, it may be used to simply wrap full operations and make them crash consistent. This also keeps the overhead of the CCE instructions low (i.e. the code will contain fewer CCE instructions). However, if there is potentially significant write-write contention, then it may be beneficial to divide the application operations into smaller executions that avoid unnecessary contention and still leave the system in a repairable state without logging.
If code inside a CCE implemented by the CPU 101 of the data processing apparatus 100 according to an embodiment releases a lock but does not end the CCE, a concurrent CCE may abort while trying to lock a free lock. This will reduce the performance of the software and may introduce more contention and even live-locks. Thus, after releasing a lock, the CPU 101 of the data processing apparatus 100 according to an embodiment may be configured to end the CCE. If a CCE aborted, it may retry from the beginning, so it is better to start the CCE from the lock acquisition in order to avoid retrying the read-only prefix. However, as the read-only prefix does not add to contention, it is sometimes more convenient to just restart the CCE to break a big operation into smaller pieces, without skipping the read-only prefix.
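As a sketch of this guidance, reusing the hypothetical intrinsics and sequential-lock helpers of the earlier sketches (the helper compute_prefix is likewise hypothetical), the read-only prefix runs outside the CCE, the CCE starts at the lock acquisition, and the lock release is immediately followed by cce_end:

    static long compute_prefix(const long *shared)
    {
        return *shared;               /* placeholder for read-only preparatory work */
    }

    void update_after_prefix(seqlock_t *lock, long *shared, long delta)
    {
        long base = compute_prefix(shared); /* not re-executed on a CCE retry */

        for (;;) {
            if (cce_begin() != 0)
                continue;
            if (!seq_try_lock(lock)) {
                cce_end();
                continue;
            }
            *shared = base + delta;
            seq_unlock(lock);         /* release the lock ...                 */
            cce_end();                /* ... and end the CCE right afterwards */
            return;
        }
    }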
Linearizability is a widely accepted notion of correctness, and recent research found that the standard definition can be used to prove the correctness of hardware weak memory models. Linearizability means that every object has a sequential history, or in the case of a CCE implemented by the CPU 101 of the data processing apparatus 100 according to an embodiment, that every address has a total global order of writes. To prove that the CCE implemented by the CPU 101 of the data processing apparatus 100 according to an embodiment is correct in this sense, it has to be shown that (a) it is linearizable, (b) HTM and plain writes that are concurrent with CCE maintain linearizability and (c) HTM also maintains serializability in the presence of CCE. As will be appreciated, linearizability is a correctness condition for concurrent objects that is non-blocking, which means that linearizability as such never requires one thread to wait for another to complete an ongoing operation. Moreover, it is local, which means that an object composed of linearizable objects is itself linearizable. It has to be verified that at the commit of a CCE implemented by the CPU 101 of the data processing apparatus 100 according to an embodiment, i.e. after calling the cce_end instruction, all addresses are linearizable, and that every address is linearizable also after a failure. As a commit always leaves the last committed and persistent value, it is enough to prove that after a commit all addresses are linearizable.
If each CCE implemented by the CPU 101 of the data processing apparatus 100 according to an embodiment writes a different address, then all addresses are trivially linearizable. So, if there is an inconsistency, it happens at a conflict, where two concurrent executions write the same address. For the sake of illustration it is assumed that each conflicting execution wrote that address once, but if there were several writes to the same address in one CCE they can be linearized one after the other. Figures 6a-h show all the possible conflict scenarios, with the following symbols: A = Aborted; C = Committed xend (HTM) or cce_end; R = Read from the shared variable X; and W = Write to the shared variable X.
In an embodiment, in the conflict scenarios the shared variable X stays linearizable in the CCE implemented by the CPU 101 of the data processing apparatus 100 according to an embodiment and serializable in HTM and plain access. Plain access serializability is trivial. A plain read before the CCE commit in figure 6b reads the shared variable X before the write is committed, so it is strictly before the write; in figure 6f the plain write kills the CCE, so this case is trivial.
For a concurrent HTM transaction it can be shown that the transaction serializability is preserved, and not only the linearizability of the shared variable X. In figure 6c the HTM reads after the speculative write was executed in the CCE, but before the CCE commits, and the HTM commits before the CCE commits, so the read is serialized before the write. Figure 6d starts like figure 6c, so the HTM read sees the persistent value, but the CCE commits and sends a snoop in the re-acquiring phase which kills the HTM, so serializability is not affected. This is important because the CCE implemented by the CPU 101 of the data processing apparatus 100 according to an embodiment commits writes on the HTM read-set, which would otherwise break serializability.
In figure 6g the HTM write is after the CCE write so the HTM kills the CCE in accordance with the HTM requester-wins legacy.
In figure 6h the CCE implemented by the CPU 101 of the data processing apparatus 100 according to an embodiment writes after HTM access, so that the snoop, sent by the CCE implemented by the CPU 101 of the data processing apparatus 100 according to an embodiment, kills the HTM (again in accordance with the requester-wins policy of HTM).
Thus, it remains to verify that X is linearizable in the CCE implemented by the CPU 101 of the data processing apparatus 100 according to an embodiment. The conflict illustrated in figure 6a is a special case in that the conflicting read (Rc in the right CCE) is after the speculative write (Ws in the left CCE). This is acceptable and the commit is successful, as Rc read the persistent value and can therefore be linearized before the commit of Ws. The entire CCE is not serializable, and if the software needs serializability it may need to add synchronization, such as a lock, to protect itself.
If the conflicting write, like in the scenario of figure 6e, is in a concurrent CCE, after the speculative write, the concurrent CCE aborts according to the requester-loses policy implemented according to embodiments disclosed herein. It is enough to see that in all WWC (i.e. the scenarios of figures 6d-h) there is never more than one speculative value of the shared variable X at the same time. As a result, it is possible to determine a total and global order on all the writes to the shared variable X which is the order of CCEs that wrote X and committed successfully.
As will be appreciated, an atomic operation, such as compare-and-swap or fetch-and-add, inside the CCE implemented by the CPU 101 of the data processing apparatus 100 according to an embodiment behaves just like a plain write, and in case of a conflict only the first will survive and linearizability will be preserved.
Figure 7 is a flow diagram illustrating a method 700 for operating the data processing apparatus 100 according to an embodiment.
The method 700 comprises a step 701 of storing data in the persistent memory 105 of the data processing apparatus 100.
Moreover, the method 700 comprises a step 703 of executing a program code by a CPU 101 of the data processing apparatus 100, wherein the program code comprises one or more memory access instructions of an instruction set architecture, ISA, of the CPU 101, wherein the one or more memory access instructions are configured to access the data in the persistent memory 105. As already described above in great detail, the ISA of the CPU 101 further includes one or more instructions for a crash-consistent execution, CCE, mode of the CPU 101 for a CCE of the one or more memory access instructions, wherein in the CCE mode the CCE of the one or more memory access instructions is continued, in case of a read-write conflict.
As the method 700 can be implemented by the data processing apparatus 100, further features of the method 700 result directly from the functionality of the data processing apparatus 100 and its different embodiments described above and below.
The person skilled in the art will understand that the "blocks" ("units") of the various figures (method and apparatus) represent or describe functionalities of embodiments of the present disclosure (rather than necessarily individual "units" in hardware or software) and thus describe equally functions or features of apparatus embodiments as well as method embodiments (unit = step).
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. The described embodiment of an apparatus is merely exemplary. For example, the unit division is merely logical function division and may be another division in an actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented by using some interfaces. The indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms. The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments. In addition, functional units in the embodiments disclosed herein may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units are integrated into one unit.