BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a data processing system.
2. Description of the Prior Art
Management of power consumption is a major design goal for designers of system-on-chip integrated circuits and data processing apparatuses in general. With the increased prevalence of portable data processing devices such as portable telephones, personal organisers and personal computers, careful control of power consumption is becoming more of a key factor in system design. Even in non-portable data processing devices, reduction of power dissipation and power consumption is important because it reduces the running costs, simplifies the design of cooling and power supplies and increases the reliability of operation.
There is a need for different power-performance modes of data processing devices because many power-constrained applications of processors require relatively low processor performance for the majority of the run time of the device but sometimes require considerably higher performance for relatively short periods of processing time. Data processing devices are likely to incorporate several and perhaps even tens of processors to implement a number of different processing tasks and any processors that are unused, even temporarily, during operation of the data processing device may well have both (a) a high-performance high-power consumption mode; and (b) one or more lower-performance lower-power consumption modes. This allows for the performance of the processor to be tailored to the demands of the current processing workload (i.e. operating system and one or more program applications) thus saving power when maximum processing performance is not required.
Processor power dissipation is often divided into a dynamic (or switching) power component and a static (or leakage) power component. The dynamic power component is associated with electrical signals changing voltage levels. In processor designs that use typical clocked complementary metal-oxide semiconductor (CMOS) circuits, the dynamic power is consumed when circuits are clocked or inputs change logic levels. By way of contrast, static power is consumed for the entire duration of time that power is supplied to the processing circuitry.
It is known to control power and performance of a data processing apparatus by using dynamic voltage and frequency scaling (DVFS). In the DVFS approach low energy consumption modes are entered by reducing the voltage supply to the processing circuits so that they use less power. This reduction in voltage also makes the transistors of the processing circuitry switch more slowly, which in turn means that the frequency of the processor clock should necessarily be reduced corresponding to the reduction in voltage. A simple DVFS processing device will typically have a full-voltage and full-performance point and at least one lower voltage lower performance point.
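By way of illustration only, the effect of reducing voltage and frequency together can be sketched using the widely used first-order approximation for CMOS switching power, in which dynamic power is proportional to the activity factor, the switched capacitance, the square of the supply voltage and the clock frequency. The activity factor, the capacitance and the two operating points below are hypothetical values chosen for the example and are not taken from this disclosure.

```python
def dynamic_power(activity, capacitance_f, voltage_v, freq_hz):
    """First-order CMOS switching power: a * C * V^2 * f (watts)."""
    return activity * capacitance_f * voltage_v ** 2 * freq_hz

# Hypothetical operating points (illustrative values only).
FULL = {"voltage_v": 1.2, "freq_hz": 1.0e9}
LOW = {"voltage_v": 0.9, "freq_hz": 0.5e9}

ACTIVITY = 0.2        # assumed fraction of capacitance switched per cycle
CAPACITANCE = 1.0e-9  # assumed effective switched capacitance in farads

p_full = dynamic_power(ACTIVITY, CAPACITANCE, **FULL)
p_low = dynamic_power(ACTIVITY, CAPACITANCE, **LOW)
ratio = p_low / p_full  # equals (0.9 / 1.2) ** 2 * 0.5
print(f"full: {p_full:.3f} W, low: {p_low:.3f} W, ratio: {ratio:.3f}")
```

The sketch models only the dynamic component; static (leakage) power is deliberately left out.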
One problem with power management using DVFS is that it requires relatively complex power supply and clocking systems and this can increase the complexity of processor designs where the processing circuitry is voltage-frequency scaled but external circuitry is not or is voltage-frequency scaled by a different amount.
DVFS power management also has a limit to the lowest power point that it can provide in a data processing system. This is because transistors of the processing circuitry cannot operate below a certain characteristic minimum voltage for a given fabrication technology. This means that large high performance processors cannot be voltage-frequency scaled down to an arbitrarily low performance-power point. Furthermore, voltage scaling typically does not eliminate static power consumption. Large high performance processors that are required for the more demanding processing applications typically comprise large numbers of transistors and correspondingly have large static power consumption. Thus, it is desirable to provide a data processing system that provides performance scaling in a manner that enables high-performance processing to be achieved yet provides a lower more efficient power point at the lower performance end of the scale, and which has a simplified power supply and clocking circuitry.
It is also known to provide data processing systems having heterogeneous multiprocessors. Such heterogeneous multiprocessors are made up of at least two processors of different types, for example, one high-performance high-power processor and at least one lower performance lower power processor. Such multiprocessor systems are designed to be capable of concurrently executing two or more separate instruction streams corresponding to the individual number of processors making up the multiprocessor. Each processor's instruction stream contains instructions from the process it is currently running and when running at full performance, multiprocessor systems will typically have all of the high performance as well as all of the low power processors running substantially simultaneously. Such heterogeneous multiprocessors have operating systems specially written to handle multiprocessing comprising a scheduler serving to re-allocate processors to different processing tasks, albeit relatively infrequently.
To migrate a process between different processors of a multiprocessor system, the multiprocessor operating system is required to save all of its run-time state from a source processor to external memory and then to reload all (or a substantial portion) of that run time state on to the destination processor. It typically takes the operating system several thousand processing cycles to reload the necessary processor state. Furthermore, once the state has been reloaded there will typically be some start-up performance cost associated with the task migration on the destination processor while contents of structures like caches, translation lookaside buffer (TLB) and branch history tables adjust to the migrated instruction stream. The cost of entering the operating system and the cost of the transfer of processor state means that migration of a process has a high cost in terms of both processing time and energy in such known heterogeneous multiprocessor systems. Thus there is a requirement for more efficient migration of one or more processing tasks between different processors.
It is also known to manage processor power consumption by using fine-grain power configurable processors in which individual components of the high performance processor each have a high-performance configuration and a lower performance (but more energy-efficient) configuration. In this case the overall structure of the processor remains the same and instructions travel through the processor via much the same path. For example, a high performance processor may comprise a superscalar processor with multiple ALU (arithmetic logic unit) pipelines where all but one ALU pipeline can be powered down to improve energy efficiency. A further example is a high power high performance data processor comprising a branch predictor that can switch between highly aggressive and less aggressive speculation about the future course of the instruction stream.
In such fine-grain power configurable processors no state transfer is required to move between the high performance and the high efficiency modes. However, a disadvantage of this fine-grain power configurable processing is that the inclusion of the high efficiency mode can introduce extra transistor gates into critical paths, which can in turn reduce the maximum performance of the high performance mode. Furthermore, in the high efficiency mode signals are required to travel over the full area of the high performance processor, which means that signals are required to propagate over considerable distances. This increases the signal loading and thus increases power dissipation. Furthermore, reducing static power consumption in fine-grain dynamically configurable processors is problematic because it is not easy to cut off power to unused circuits in these systems due to the fact that power switching is necessarily distributed throughout the design of the high performance processor.
Thus there is a requirement for an alternative way of implementing processor performance scaling that simplifies implementation of the high performance and the high efficiency modes of operation and enables static leakage current to be reduced, and a further requirement to more efficiently migrate processing tasks between processors.
SUMMARY OF THE INVENTION
According to a first aspect the present invention provides apparatus for processing data comprising:
first processing circuitry configured to operate in a first power domain;
second processing circuitry configured to operate in a second power domain different from said first power domain;
shared processing circuitry configured to operate in a shared power domain such that said first processing circuitry and said shared processing circuitry are configurable to operate together to form a first hybrid processing unit having access to an external memory and said second processing circuitry and said shared processing circuitry are configurable to operate together to form a second hybrid processing unit having access to said external memory, wherein said first hybrid processing unit and said second hybrid processing unit together comprise a uni-processing environment for executing a single instruction stream;
execution flow transfer circuitry for transferring execution of said single instruction stream between said first hybrid processing unit and said second hybrid processing unit at a transfer execution point, wherein said execution flow transfer comprises transfer of at least one bit of processing-state restoration information from a source one of said first hybrid processing unit and said second hybrid processing unit to a destination one of said first hybrid processing unit and said second hybrid processing unit, said processing-state restoration information being information enabling said destination hybrid processing unit to successfully resume execution of said single instruction stream from said transfer execution point.
The present invention recognises that processor performance scaling can be provided in an efficient manner by providing first processing circuitry operating in a first power domain and second processing circuitry operating in a second different power domain and providing shared processing circuitry operating in a shared power domain and configurable to operate together with either the first processing circuitry or the second processing circuitry. Although there are at least two sets of non-shared processing circuitry, the apparatus represents a uni-processing environment for executing a single instruction stream such that either the first processing circuitry and the shared processing circuitry are operating together as a first hybrid processing unit or the second processing circuitry and the shared processing circuitry are operating together as a second hybrid processing unit, and execution of a single instruction stream can be migrated between the first hybrid processing unit and the second hybrid processing unit using execution flow transfer circuitry.
The execution flow transfer circuitry is capable of directly (without routing via external or main memory) transferring at least one bit of processing state restoration information between the first hybrid processing unit and the second hybrid processing unit. Since only one of the first processing circuitry and the second processing circuitry is in control of execution of the single instruction stream at any one time, the system is configured to provide both a high performance and a high efficiency (lower power) point by arranging that one of the first processing circuitry and the second processing circuitry is higher efficiency processing circuitry and the other of the first processing circuitry and the second processing circuitry is higher performance processing circuitry. Since the higher performance processing circuitry can be readily powered down when the higher efficiency circuitry is in operation, the static leakage power can be reduced. This allows a lower minimum power-performance point to be attainable with DVFS than would be possible using a single high performance processor and DVFS.
Even without DVFS the present invention offers performance scaling while avoiding the requirement for any complex power supply and clocking circuitry since the data processing apparatus offers performance scaling without necessarily being capable of dynamic voltage and frequency changes. By providing the capability to directly transfer state between the first hybrid processing unit and the second hybrid processing unit using the execution flow transfer circuitry, migration time for migrating the single instruction stream between the two hybrid processing units can be reduced. Furthermore, the shared processing circuitry enables at least a subset of the processing state restoration information to be readily accessible to both hybrid processing units without the requirement to store the processing state restoration information out to external (off-chip) memory, requiring it to be reloaded as part of the migration operation.
The system according to the present technique is simple since it provides a uni-processing environment in which an operating system can effectively see a single processing circuit operating in (i) a high performance and (ii) a high efficiency mode rather than seeing two separate processors. In known multiprocessor systems typically a special multiprocessing operating system is required to manage migration of execution of an instruction stream. The transfer of the at least one bit of processing-state restoration information from a source hybrid processing unit to a destination hybrid processing unit using the execution flow transfer circuitry to avoid the transfer involving first copying the data to the main (external) memory means that the transfer is typically faster and requires less energy. This means the minimum period of low performance for which it is worth switching over can be reduced, which in turn means that more time can be spent in the lower power state and more energy can be saved.
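By way of illustration only, the link between transfer cost and the minimum low-performance period for which switching is worthwhile can be sketched as a simple energy balance. The model is an assumption made for this example: a round trip (switching to the efficient unit and back) is treated as costing a fixed energy, and differences in execution time are ignored. The function name and all figures are hypothetical.

```python
def break_even_seconds(transfer_energy_j, p_high_w, p_low_w):
    """Shortest low-demand period for which switching saves energy.

    Assumes a round trip costs 2 * transfer_energy_j and that running on
    the efficient unit saves (p_high_w - p_low_w) watts while it lasts.
    """
    saving_w = p_high_w - p_low_w
    if saving_w <= 0:
        raise ValueError("efficient unit must draw less power")
    return 2 * transfer_energy_j / saving_w

# Hypothetical figures: transfer via external memory vs. a direct path.
via_memory = break_even_seconds(transfer_energy_j=50e-6, p_high_w=0.5, p_low_w=0.1)
direct = break_even_seconds(transfer_energy_j=5e-6, p_high_w=0.5, p_low_w=0.1)
print(f"via memory: {via_memory:.6f} s, direct: {direct:.6f} s")
```

Under these assumed figures a tenfold cheaper transfer shortens the break-even period tenfold, so shorter idle intervals become worth exploiting.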
According to the present technique execution flow transfer circuitry is provided to manage the migration of the single instruction stream. Thus, according to at least some embodiments of the present invention, migration of the instruction stream can be performed without necessarily requiring an operating system to manage it. For example, the migration can be triggered directly via a user-mode program instruction without having to enter any operating system. Since entering an operating system can take several thousand processing cycles, the ability to circumvent involvement of the operating system in the instruction stream migration makes it possible to switch modes more rapidly. Furthermore, since it is possible to make migration of execution of the instruction stream rapid, it is also possible to automatically work in a high performance mode on entering the operating system.
In contrast to fine-grain dynamically configurable processors, the present invention provides a relatively simple power-domain map, having a first power domain for the first processing circuitry, a second different power domain for the second processing circuitry and a shared power domain for the shared processing circuitry. The design, fabrication, implementation and testing of the data processing apparatus according to the present technique are simplified because the high performance circuitry and the high efficiency circuitry (corresponding to one or other of the first processing circuitry and the second processing circuitry) are not fine-grain intermingled.
It will be appreciated that it is possible that the first processing circuitry and the second processing circuitry have substantially the same processing performance characteristics, but that a single instruction stream can be switched between the two distinct sets of processing circuitry. However, in some embodiments, the first processing circuitry and the second processing circuitry have different processing performance characteristics. This makes the data processing system adaptable to different power performance tuning requirements by enabling one of the first processing circuitry and second processing circuitry to have relatively high performance, and the other of the first processing circuitry and the second processing circuitry to have relatively higher efficiency and relatively lower performance. Thus, when computationally intensive processing tasks are to be performed, execution of the single instruction stream can be controlled by the processing circuitry having the higher performance characteristics, yet when the workload becomes less demanding or when external factors such as device battery life becomes low, execution of the single instruction stream can be relatively seamlessly switched to the lower-power higher-efficiency processing circuitry. The (first or second) processing circuitry that is not currently in control of execution of the instruction stream may be powered down to further improve the power saving capabilities.
It will be appreciated that, because only one of the first processing circuitry and the second processing circuitry has control of execution of the instruction stream at any one time, the set of processing circuitry that does not have active control of execution of the instruction stream could simply remain idle throughout the duration of the processing task. However, according to some embodiments, the execution flow transfer circuitry comprises power control circuitry for independently controlling power to the first processing circuitry and the second processing circuitry such that each of the first processing circuitry and the second processing circuitry can be placed in a powered-up state in which it is ready to perform processing operations and a power-saving state in which it is awaiting activation (not ready to perform processing). Provision of the power-saving state allows static power consumption to be further reduced.
It will be appreciated that the shared processing circuitry will typically be used (at least in part) for the full duration of any processing activities, since it will be used either by the first processing circuitry as part of the first hybrid processing unit or by the second processing circuitry as part of the second hybrid processing unit.
In some embodiments the power control circuitry can be configured to independently control power to the shared circuitry as well as to the first processing circuitry and the second processing circuitry. This enables static power consumption to be reduced, for example, when both the first processing circuitry and the second processing circuitry are idle (not performing a processing workload).
The shared processing circuitry can be utilised regardless of whether the first processing circuitry or the second processing circuitry currently has control of execution of the single instruction stream, so the shared processing circuitry could be configured to operate at the same level of power consumption regardless of whether the first or the second processing circuitry is in control of execution. However, in some embodiments, the power control circuitry is configured so that the shared processing circuitry operates at a first level of power consumption when the first hybrid processing unit has control of execution and operates at a second, different level of power consumption when the second hybrid processing unit has control of execution of the single instruction stream. This means that, for example, the shared processing circuitry can be configured to support high performance processing (by one of the first processing circuitry and the second processing circuitry) by operating at a higher level of power consumption to provide, for example, improved cache performance but to operate at a lower power consumption when the apparatus is in a high efficiency mode (where the one of the first and second processing circuitry that currently has control of execution of the single instruction stream is configured to be high efficiency processing circuitry). This provides additional flexibility in tuning the performance of both the higher performance mode of operation and the higher efficiency mode of operation by improving processing performance in the higher performance mode yet further reducing power consumption in the high efficiency mode of operation.
It will be appreciated that the process of execution transfer could involve switching execution between an active processor (source) and quiescent idle processor (destination). However, in some embodiments, the power control circuitry is configured to switch one of the first processing circuitry and the second processing circuitry corresponding to the destination hybrid-processing unit from a power-saving state to a powered-up state and switch the other of the first processing circuitry and the second processing circuitry corresponding to the source hybrid processing unit from a powered-up state to a power-saving state as part of the execution transfer process.
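By way of illustration only, the power-state switching performed as part of an execution transfer can be sketched as follows. The class and method names are hypothetical, and the ordering shown (the destination is powered up before the source enters the power-saving state) is an assumption made for the example rather than a requirement stated above.

```python
from enum import Enum

class PowerState(Enum):
    POWERED_UP = "powered-up"
    POWER_SAVING = "power-saving"

class PowerController:
    """Hypothetical model of independently controlled power domains."""

    def __init__(self):
        self.state = {
            "first": PowerState.POWERED_UP,
            "second": PowerState.POWER_SAVING,
            "shared": PowerState.POWERED_UP,
        }

    def transfer(self, source, destination):
        """Wake the destination, then put the source into power saving."""
        if self.state[source] is not PowerState.POWERED_UP:
            raise RuntimeError("source is not running")
        self.state[destination] = PowerState.POWERED_UP
        # State hand-over would take place here, while both are powered.
        self.state[source] = PowerState.POWER_SAVING

controller = PowerController()
controller.transfer("first", "second")
print({name: state.value for name, state in controller.state.items()})
```

The shared domain is left powered throughout because it serves whichever hybrid processing unit currently has control of execution.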
Although the first processing circuitry and the second processing circuitry could have a variety of different individual processing characteristics, according to some embodiments, the first processing circuitry and the second processing circuitry are architecturally compatible but the first processing circuitry differs micro-architecturally from the second processing circuitry. The architectural compatibility of the first processing circuitry and second processing circuitry facilitates straightforward transfer of processing-state restoration information when migrating execution of the single instruction stream between the first processing circuitry and the second processing circuitry. Arranging for micro-architectural differences between the first processing circuitry and the second processing circuitry is a convenient mechanism for providing diversity in performance levels and energy consumption levels when performing processing tasks.
In some embodiments, the architectural compatibility comprises compatibility of at least general purpose registers and control registers of the first processing circuitry and the second processing circuitry. In some such embodiments, the control registers comprise co-processor registers having at least one memory circuitry control register.
In some embodiments where the first and second processing circuitries are architecturally compatible but micro-architecturally different, the micro-architectural differences between the first processing circuitry and the second processing circuitry comprise at least one of: pipeline length, instruction issue width, cache configuration, branch prediction capability and TLB configuration. Such differences in micro-architecture are straightforward to implement and fabricate.
It will be appreciated that the first processing circuitry and the second processing circuitry could each have any one of a variety of different processing performance characteristics, but according to some embodiments one of the first processing circuitry and the second processing circuitry is higher performance processing circuitry relative to the other of the first processing circuitry and the second processing circuitry and the other of the first processing circuitry and the second processing circuitry is higher efficiency processing circuitry relative to the other. This provides good complementary processing characteristics between the first processing circuitry and the second processing circuitry making the system readily adaptable to many different processing applications.
In some embodiments, the apparatus is fabricated on a single integrated circuit with the higher performance processing circuitry being substantially physically localised to a first distinct area of the integrated circuit and the higher efficiency processing circuitry being substantially physically localised to a second distinct area of the integrated circuit and wherein the second distinct area is different from the first distinct area.
Having the circuits associated with higher-performance processing physically close to each other and the circuits associated with the energy-efficient processing region physically close to each other rather than being distributed across the whole area of the processor may seem inefficient as it requires duplicating some processing circuitry that is needed in both regions. However, there are high overheads in integrating unit-level clock gating or power switching and input/output signal clamping in a fine-grain distributed manner across a large processor. The power switching has an area overhead and the signal clamping can lengthen critical paths, reducing the peak clock frequency. Thus according to such embodiments, although there is a cost with regard to duplication of some processing circuitry, this is compensated for by the fact that the clock gating or power-gating and input/output signal clamping is simplified and the critical paths in the higher performance processing region are not impacted much by the energy-efficient mode. Furthermore, the distance that many signals need to travel in the high efficiency low-energy mode can be effectively reduced, making the data processing apparatus more energy efficient overall.
Although the shared power domain could be configured to operate at the same voltage as one of the voltages corresponding to the first power domain and the second power domain, in some embodiments the apparatus is fabricated on a single integrated circuit and the shared power domain is configured to operate at a different voltage from the voltages corresponding to each of the first power domain and the second power domain.
It will be appreciated that the shared processing circuitry could comprise any of a variety of different processing components that are jointly accessible by both the first processing circuitry and the second processing circuitry. The sharing avoids duplication of that particular type of shared processing circuitry in both the first processing circuitry and the second processing circuitry. However, in some embodiments, the shared processing circuitry comprises cache circuitry. This cache sharing at least reduces the circuit area penalty that would otherwise be required to individually provide caches to the first processing circuitry and the second processing circuitry.
In some such embodiments with shared cache circuitry, the shared cache circuitry is configured to serve as a level two cache for the higher performance processing circuitry and configured to serve as a level one cache for higher efficiency processing circuitry. This provides sharing of resources yet allows for the higher performance processing circuitry to have a non-shared level one cache and thus to have a more complex and versatile cache system than the higher efficiency processing circuitry.
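By way of illustration only, the dual role of the shared cache can be sketched as two lookup paths over the same cache object. The cache model (a dictionary with no eviction), the function name and the addresses are hypothetical simplifications introduced for the example.

```python
class Cache:
    """Tiny dictionary-backed cache model (no eviction; illustration only)."""

    def __init__(self, name):
        self.name = name
        self.lines = {}

def load(address, levels, memory):
    """Search cache levels in order; fill every level that missed."""
    missed = []
    for cache in levels:
        if address in cache.lines:
            value, source = cache.lines[address], cache.name
            break
        missed.append(cache)
    else:
        value, source = memory[address], "external memory"
    for cache in missed:
        cache.lines[address] = value
    return value, source

shared = Cache("shared cache")
private_l1 = Cache("non-shared L1")
memory = {0x100: 42}

high_performance_path = [private_l1, shared]  # shared acts as level two
high_efficiency_path = [shared]               # shared acts as level one

first = load(0x100, high_performance_path, memory)
second = load(0x100, high_efficiency_path, memory)
print(first, second)
```

In this sketch a line fetched while the higher performance path is active is already present in the shared cache when the higher efficiency path later requests it.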
In some embodiments where the shared processing circuitry comprises shared cache circuitry, the shared cache circuitry comprises both level one cache circuitry and level two cache circuitry, where each of the level 1 cache circuitry and the level 2 cache circuitry serves both the higher performance processing circuitry and the higher efficiency processing circuitry. By sharing level 1 and level 2 caches, the amount of refilling of caches from main (external) memory after switches between source and destination hybrid processing units is reduced.
In some embodiments, the shared processing circuitry comprises translation lookaside buffer circuitry. This reduces or avoids the circuit area penalty of providing two TLBs and the need to save and restore the TLB state.
It will be appreciated that the shared processing circuitry could comprise any one of a number of different types of processing circuitry, but in some embodiments the shared processing circuitry comprises at least one of: cache circuitry, translation lookaside buffer circuitry, special purpose registers, bus interface circuitry, bus pins and trace circuitry. This reduces the volume of state that needs to be specifically transferred between the first processing circuitry and the second processing circuitry, since it is accessible directly via the shared processing circuitry.
The non-shared processing circuitry of the first processing circuitry and/or the second processing circuitry could comprise a variety of different processing circuitry components. However, in some embodiments the first processing circuitry and the second processing circuitry comprise at least one of: a program counter, general purpose registers, branch prediction circuitry, decoding/sequencing circuitry, an execution data path and load/store circuitry. Each of the individual components of the first processing circuitry and the second processing circuitry can thus be differently configured according to the processing performance requirements to tune it towards either high performance or high efficiency.
It will be appreciated that the execution flow transfer circuitry could be configured in a variety of different ways to provide a transmission pathway for the at least one bit of processing state restoration information. However, in some embodiments, the execution flow transfer circuitry is configured to provide a transmission pathway from the source hybrid processing unit to the destination hybrid processing unit for the at least one bit of processing state restoration information, wherein the transmission pathway bypasses the external memory. Bypassing of the external memory reduces the time penalty in terms of processing cycles for performing the transfer of execution of the single instruction stream because accesses to external memory are typically processing-cycle intensive. In some such embodiments, the transmission pathway comprises a dedicated buffer between the first processing circuitry and the second processing circuitry. This is a simple way of implementing a direct transfer path between the two sets of processing circuitry.
It will be appreciated that the at least one bit of processing state restoration information transferred by the execution flow transfer circuitry could be any one of a number of different types of architectural state required to enable successful resumption of execution of an instruction stream on destination processor circuitry that was previously executing on source processing circuitry. However, in one embodiment, the at least one bit of processing state restoration information comprises at least one of: a program counter and at least a subset of the general purpose register content of the source hybrid processing unit at the transfer execution point.
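By way of illustration only, the transfer of a program counter and general purpose register content through a dedicated buffer, without any access to external memory, can be sketched as follows. The class names, the choice of sixteen registers and the example values are hypothetical and are introduced only for the purposes of the example.

```python
from dataclasses import dataclass, field

@dataclass
class Core:
    """Hypothetical non-shared processing circuitry."""
    name: str
    pc: int = 0
    regs: list = field(default_factory=lambda: [0] * 16)

class TransferBuffer:
    """Stands in for a dedicated buffer between the two cores."""

    def __init__(self):
        self._slot = None

    def push(self, state):
        self._slot = state

    def pop(self):
        state, self._slot = self._slot, None
        return state

def transfer_execution(source, destination, buffer):
    """Move the program counter and registers without touching memory."""
    buffer.push((source.pc, list(source.regs)))
    destination.pc, destination.regs = buffer.pop()

big = Core("higher performance", pc=0x8000, regs=list(range(16)))
little = Core("higher efficiency")
transfer_execution(big, little, TransferBuffer())
print(hex(little.pc), little.regs)
```

After the call the destination holds a copy of the source state at the transfer execution point and can resume the instruction stream from that program counter.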
It will be appreciated that the at least one bit of processing state restoration information transferred by the execution flow transfer circuitry could be any one of a number of different types of micro-architectural state required to reduce the time and energy expended by the destination processor in re-constituting that state corresponding to the source processor. However, in one embodiment, the at least one bit of processing state restoration information comprises at least one of: branch predictor history information, TLB contents, micro-TLB contents and data prefetcher pattern information.
In some embodiments of the data processing apparatus, at least the source hybrid processing unit comprises an associated non-shared cache as part of the one of the first processing circuitry and the second processing circuitry corresponding to the source hybrid processing unit, the associated non-shared cache being separate from the shared processing circuitry. In such embodiments the non-shared cache of the first hybrid processing unit is cleaned as part of the execution flow transfer process, and circuitry is provided that allows data and instructions required by the instruction stream executing on the second hybrid processing unit to be obtained from the cache of the first hybrid processing unit. Typically the data and instructions would then be cached in the second hybrid processing unit or the shared processing circuitry. This improves the efficiency of the system by reducing the likelihood of having to retrieve data from external memory when execution of the transferred single instruction stream that has been suspended on the source processing unit is resumed on the destination processing unit. As an alternative to cleaning the non-shared cache as part of the transfer, the non-shared cache can use a write-through policy so that it is always clean.
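By way of illustration only, cleaning a non-shared cache as part of an execution transfer can be sketched as writing every dirty line to the next level while leaving the cache contents valid. The class name is hypothetical, and the use of a plain dictionary to stand in for the next level (here taken to be the shared cache) is an assumption made for the example.

```python
class WriteBackCache:
    """Hypothetical non-shared cache that tracks dirty lines."""

    def __init__(self, next_level):
        self.lines = {}
        self.dirty = set()
        self.next_level = next_level

    def store(self, address, value):
        self.lines[address] = value
        self.dirty.add(address)

    def clean(self):
        """Write every dirty line to the next level; keep contents valid."""
        for address in sorted(self.dirty):
            self.next_level[address] = self.lines[address]
        written = len(self.dirty)
        self.dirty.clear()
        return written

shared_cache = {}
l1 = WriteBackCache(shared_cache)
l1.store(0x40, 7)
l1.store(0x44, 9)
written = l1.clean()
print(written, shared_cache)
```

Because cleaning does not invalidate the lines, the source cache in this sketch could still supply data to the destination after the transfer.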
It will be appreciated that the transfer of execution of the single instruction stream from the source hybrid processing unit to the destination hybrid processing unit could be triggered in a number of different ways, for example, via hardware, software or an operating system. However, in some embodiments, the execution flow transfer circuitry is configured to initiate transfer of execution of the single instruction stream from the source hybrid processing unit to the destination hybrid processing unit in response to a hardware trigger such as a temperature sensor, a series of cache misses, the initiation of a hardware page table walk or the processor entering a polling or wait-for-interrupt state. A hardware triggered transfer is functionally transparent to an operating system and any other software executing on the data processing apparatus. Providing an execution flow transfer trigger that is transparent to the operating system and software reduces the processing cycle penalty required to achieve the transfer of execution. The alternative of entering the operating system to perform the execution flow transfer would typically have a significant cost in terms of time and energy.
In alternative embodiments, the execution flow transfer circuitry is configured to initiate transfer of execution of the single instruction stream from the source hybrid processing unit to the destination hybrid processing unit in response to an external trigger. In some such embodiments the external trigger is configured to receive a trigger stimulus from at least one of a temperature monitor external to the uni-processor, a power controller and memory mapped input/output.
In other embodiments the execution flow transfer circuitry is configured to initiate transfer of the execution of the single instruction stream from the source hybrid processing unit to the destination hybrid processing unit in response to a software trigger. This provides more flexibility in triggering the execution flow transfer according to current software processing requirements. In some such embodiments the software trigger is a processor instruction, a write data access to a special purpose register or a write data access to a specific address in a memory map.
Although the software trigger could be performed in any one of a number of different ways, in some embodiments virtualisation software running on the data processing apparatus provides a software trigger. The virtualisation software serves to mask configuration control information specific to the first processing circuitry and the second processing circuitry from the operating system executing on the data processing apparatus. Thus the transfer of execution of the single instruction stream is transparent to both the operating system and the one or more applications executing on the data processing apparatus at a time corresponding to the transfer execution point. This means that the operating system need not be aware of the migration of the execution of the instructions between the first processing circuitry and the second processing circuitry and any applications of the processing workload need not be adapted to execute on the data processing apparatus. Avoiding requirements to modify operating systems and applications to run on the data processing apparatus in order to make use of the power performance scaling characteristics makes the data processing apparatus more versatile and readily adaptable to use with generic software and operating systems.
It will be appreciated that the first processing circuitry and the second processing circuitry could be configured to operate at a single voltage and a single frequency. However, further power performance varying properties could be provided by providing performance-level varying circuitry as part of the data processing apparatus configured to vary a processing performance level of at least one of the first hybrid processing unit and the second hybrid processing unit. This provides flexibility in the number of performance levels and efficiency levels to which the apparatus can be tuned.
Although the performance-level varying circuitry could take any one of a number of different forms, in some embodiments the performance-level varying circuitry is configured to perform dynamic voltage and frequency scaling of processing performance of at least one of the first hybrid processing unit and the second hybrid processing unit by varying at least one of: a voltage of the first power domain; a voltage of the second power domain; a voltage of the shared power domain; a frequency of operation of the first processing circuitry; a frequency of operation of the second processing circuitry; and a frequency of operation of the shared processing circuitry.
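The effect of varying voltage and frequency can be illustrated with the commonly used first-order approximation that the dynamic power of clocked CMOS circuitry scales with the switched capacitance, the square of the supply voltage and the clock frequency. The following is a rough sketch under that assumption; the function name and all numerical values are hypothetical and do not describe any particular embodiment.

```python
def dynamic_power(switched_capacitance_f, voltage_v, frequency_hz):
    """First-order estimate of dynamic power: P = C * V^2 * f (an approximation)."""
    return switched_capacitance_f * voltage_v ** 2 * frequency_hz


# Hypothetical operating points for one power domain.
nominal = dynamic_power(1e-9, voltage_v=1.0, frequency_hz=1.0e9)
scaled = dynamic_power(1e-9, voltage_v=0.8, frequency_hz=0.5e9)

# Fraction of the nominal dynamic power consumed at the scaled operating point.
power_ratio = scaled / nominal
```

Under these assumed values, halving the frequency and reducing the voltage by a fifth cuts the estimated dynamic power to roughly a third of its nominal value. Static (leakage) power is not captured by this estimate.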
It will be appreciated that a data processing apparatus according to embodiments of the present invention could be used exclusively for applications that require a uni-processing environment. However, in some embodiments at least one of the uni-processing data processing apparatuses according to the present technique is employed to form part of a multi-processing data processing system by connecting a plurality of uni-processors, including the at least one uni-processing data processing apparatus according to the present technique, via a bus or network. This allows for substantially simultaneous execution of a plurality of instruction streams. Such a system provides the versatility not only to switch execution of an individual instruction stream between a high performance mode and a high efficiency mode, but also to switch individual instruction streams between different ones of the plurality of uni-processing data processing apparatuses. In some embodiments of the multi-processing data processing system only a subset of the uni-processors support switching of execution of an individual instruction stream between a high performance mode and a high efficiency mode. This provides a very flexible system that is adaptable to many different power and performance characteristics.
According to a second aspect the present invention provides apparatus for processing data comprising:
first means for processing configured to operate in a first power domain;
second means for processing configured to operate in a second power domain different from said first power domain;
means for shared processing configured to operate in a shared power domain such that said first means for processing and said means for shared processing are configurable to operate together to form a first means for hybrid processing having access to an external memory and said second means for processing and said means for shared processing are configurable to operate together to form a second means for hybrid processing having access to said external memory, wherein said first means for hybrid processing and said second means for hybrid processing together comprise a uni-processing environment for executing a single instruction stream;
means for execution flow transfer for transferring execution of said single instruction stream between said first means for hybrid processing and said second means for hybrid processing at a transfer execution point, wherein said execution flow transfer comprises transfer of at least one bit of processing-state restoration information from a source one of said first means for hybrid processing and said second means for hybrid processing to a destination one of said first means for hybrid processing and said second means for hybrid processing, said processing-state restoration information being information enabling said destination means for hybrid processing to successfully resume execution of said single instruction stream from said transfer execution point.
According to a third aspect the present invention provides a data processing method comprising the steps of:
operating first processing circuitry in a first power domain;
operating second processing circuitry in a second power domain different from said first power domain;
operating shared processing circuitry in a shared power domain and forming a first hybrid processing unit by operating said first processing circuitry and said shared processing circuitry together, said first hybrid processing unit having access to an external memory;
forming a second hybrid processing unit by operating said second processing circuitry and said shared processing circuitry together, said second hybrid processing unit having access to said external memory;
providing a uni-processing environment for executing a single instruction stream, said uni-processing environment comprising said first hybrid processing unit and said second hybrid processing unit together;
transferring execution of said single instruction stream between said first hybrid processing unit and said second hybrid processing unit at a transfer execution point using execution flow transfer circuitry, wherein said execution flow transfer comprises transfer of at least one bit of processing-state restoration information from a source one of said first hybrid processing unit and said second hybrid processing unit to a destination one of said first hybrid processing unit and said second hybrid processing unit, said processing-state restoration information being information enabling said destination hybrid processing unit to successfully resume execution of said single instruction stream from said transfer execution point.
The above, and other objects, features and advantages of this invention will be apparent from the following detailed description of illustrative embodiments which is to be read in connection with the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 schematically illustrates a data processing apparatus having high-performance processing circuitry, high-efficiency processing circuitry and shared processing circuitry;
FIG. 2 schematically illustrates constituent elements of processor circuit state information comprising both architectural and micro-architectural state;
FIG. 3 schematically illustrates a data processing apparatus having shared processing circuitry and three power domains and illustrating a distribution of processing state restoration information across the power domains;
FIG. 4A schematically illustrates a timeline for an execution stream of a known uni-processor;
FIG. 4B schematically illustrates two distinct execution streams and their respective time lines in a known multi-processor system;
FIG. 5 schematically illustrates an instruction stream execution switching process on a hybrid uni-processor according to an embodiment of the present invention;
FIG. 6 is a flow chart that schematically illustrates how execution of a single instruction stream is transferred from a source hybrid processing unit to a destination hybrid processing unit;
FIG. 7 schematically illustrates how first processing circuitry and shared processing circuitry operate together to execute a single instruction stream;
FIG. 8 schematically illustrates how the second processing circuitry and the shared processing circuitry operate together to form a second hybrid processing unit in the same system as illustrated in FIG. 7;
FIG. 9 schematically illustrates a relationship between hardware, virtualizer software, operating system software and application software;
FIG. 10 schematically illustrates a data processing apparatus according to an embodiment of the present invention in which there are both first and second sets of processing circuitry and the shared processing circuitry comprises a shared cache, shared control registers and a shared TLB;
FIG. 11 schematically illustrates a data processing apparatus according to an embodiment in which the shared processing circuitry comprises control registers, L1 cache, L2 cache, TLB and bus interface unit; and
FIG. 12 schematically illustrates a multiprocessing data processing apparatus according to an embodiment of the present invention comprising three individual uni-processing apparatuses, two of which each have distinct sets of high-performance processing circuitry, low-performance processing circuitry and shared processing circuitry and one of which has a single set of processing circuitry.
DESCRIPTION OF EMBODIMENTS
FIG. 1 schematically illustrates a uni-processing data processing apparatus 100 having two distinct sets of processing circuitry and shared processing circuitry. The data processing apparatus 100 comprises a single integrated circuit 110 on which are fabricated: high performance processing circuitry 120 fabricated primarily on a distinct physical region (or sub-region) of the integrated circuit 110; and high efficiency processing circuitry 140 substantially fabricated on a separate distinct physical area of the integrated circuit 110 from the high performance processing circuitry. The apparatus further comprises a set of shared processing circuitry 160 that is accessible to either the high performance processing circuitry 120 or the high efficiency processing circuitry 140 depending upon which of the two sets of processing circuitry 120, 140 has current control of the execution of the single instruction stream of the uni-processor 100.
To enable transfer of execution of the single instruction stream between the high performance processing circuitry 120 and the high efficiency processing circuitry 140, a set of execution flow transfer circuitry 170 is provided. The integrated circuit 110 of the data processing apparatus 100 performs data input/output to a bus 180 (in alternative embodiments the bus 180 is an inter-core network corresponding to a Network-on-chip) via bidirectional communication with an external interface 174. This bus 180 is also used by the integrated circuit 110 to communicate with an external (main) memory 182, a set of peripheral devices 184 and further processors 186.
The high performance processing circuitry 120 is designed primarily for good processing performance whereas the high efficiency processing circuitry 140 is designed primarily for high efficiency and reduced energy consumption (relative to the high performance processing circuitry 120). Although the high performance processing circuitry 120 and the high efficiency processing circuitry 140 are combined on a single integrated circuit 110, they operate in different power domains, i.e. the high performance processing circuitry 120 operates in a first power domain whilst the high efficiency processing circuitry 140 operates in a second, different power domain. The shared processing circuitry 160 and the execution flow transfer circuitry 170 are both configured to operate in a third (shared) power domain. Due to the three different power domains some clamping or level shifting of signal voltages is performed by circuitry (not shown) on the integrated circuit 110.
The high performance processing circuitry 120 comprises a first program counter 122, a first load store unit 124, branch prediction circuitry 126, a first decoder/sequencer 128 for decoding and sequencing the instruction stream and a first execution data path 130 for executing program instructions. The execution data path 130 of the high performance processing circuitry 120 has a deep, out-of-order pipeline and the branch prediction circuitry 126 is capable of complex branch prediction. These two characteristics contribute to the high performance level of this circuitry 120.
The high efficiency (comparatively low performance) processing circuitry 140 comprises: a second program counter 142 that keeps track of the instructions being executed when the high efficiency processing circuitry 140 has control of the single execution stream; a second load store unit 144 for loading instructions from cache and/or external memory 182; a second decoder/sequencer 146 and a second execution data path 148.
The second execution data path 148 is a single issue execution data path with a relaxed shallow pipeline. There is no branch prediction circuitry in the high efficiency processing circuitry 140 in the embodiment of FIG. 1. In alternative embodiments both the high performance and the high efficiency processing circuitry are provided with branch prediction circuitry but the high efficiency processing circuitry uses simpler branch prediction.
The integrated circuit 110 of the data processing apparatus 100 of FIG. 1 can be viewed as an integrated circuit in which a high performance processing core 120 and a high efficiency processor core 140 are merged and in which one or more resources on the integrated circuit 110 are shared by the high performance processing circuitry 120 and the high efficiency processing circuitry 140. The resources of the shared processing circuitry 160 depend on the particular embodiment, but examples of the shared resources include bus pins, a bus interface unit (BIU), control registers, a TLB or L1 and/or L2 caches.
In the arrangement of FIG. 1, only one of the high performance processing circuitry 120 and the high efficiency processing circuitry 140 is operational at any single point in time. However, in order to perform execution of instructions, both the high performance processing circuitry 120 and the high efficiency processing circuitry 140 have use of the shared processing circuitry 160. Execution of a single instruction stream can be switched between the high performance processing circuitry 120 and the high efficiency processing circuitry 140 according to the current processing requirements and/or according to external factors such as the external temperature or the remaining battery life of the data processing apparatus incorporating the integrated circuit 110.
The switching of the single instruction stream between the two different sets of processing circuitry 120, 140 is mediated by the execution flow transfer circuitry 170. In order to perform transfer of execution of the instruction stream, at least one bit of processing-state restoration information is copied from a “source” one of the high performance processing circuitry 120 and the high efficiency processing circuitry 140 to a “destination” one of these two sets of processing circuitry 120, 140. Transfer of execution of the single instruction stream between the two different sets of processing circuitry necessarily involves enabling the destination processing circuitry to resume processing from the point where processing ceased on the source processing circuitry. Since only one of the high performance processing circuitry 120 and the high efficiency processing circuitry 140 is operational at any one point in time, it can be seen that the shared processing circuitry 160 is also dedicated to operating with either the high performance processing circuitry 120 or the high efficiency processing circuitry 140 at any one point in time according to which set of circuitry holds control of the instruction stream execution.
The configuration of the shared processing circuitry 160 (shared between the high performance processing circuitry 120 and the high efficiency processing circuitry 140) simplifies the task of switching execution because it allows the destination processing circuitry to access at least some of the processing state of the source processing circuitry at the execution transfer point via the shared processing circuitry 160.
The transfer of execution of the instruction stream is initiated by the execution flow transfer circuitry 170 in response to receipt of the flow transfer stimulus 171. The flow transfer stimulus 171 can be provided via a hardware trigger from other processing circuitry within the data processing apparatus or from an external trigger. The flow transfer stimulus 171 in this example embodiment can originate from either the high performance processing circuitry 120 or from the high efficiency processing circuitry 140. The hardware trigger may be, for example, at least one of: a temperature sensor, a series of cache misses, initiation of a hardware page table walk, processing circuitry entering a polling state and processing circuitry entering a wait-for-interrupt state. A temperature monitor 176 is shown in the embodiment of FIG. 1 and this is configured to output a transfer stimulus to the execution flow transfer circuitry 170 to effect transfer of execution to the high efficiency processing circuitry when the detected temperature exceeds a predetermined threshold value. In alternative embodiments the flow transfer stimulus is provided in response to a software trigger or in response to an external trigger such as a change in temperature or a sudden reduction in available battery power. The software trigger may be: a processor instruction (e.g. a special execution flow switching instruction); a write to a special purpose register such as the CP15 register in ARM™ processors; or a write to a specific address in a memory map of the data processing system. The high performance processing circuitry 120 and the high efficiency processing circuitry 140 execute program instructions using resources provided in the shared processing circuitry 160 so that the high performance processing circuitry 120 and the shared processing circuitry 160 can be together considered as a first hybrid processing unit.
Similarly, the high efficiency processing circuitry 140 together with the shared processing circuitry 160 form a second hybrid processing unit. However, it will be appreciated that, depending on the particular instruction being processed, the resources of the shared processing circuitry 160, although available, may or may not be utilised during execution of individual instructions of the single instruction stream.
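The way in which several hardware events could be combined into a single flow transfer stimulus can be sketched as follows. This is a minimal illustration only: the threshold values, the field names and the assumption that every listed event requests a transfer towards the high efficiency processing circuitry are hypothetical. Only the temperature case is described above as transferring execution in that direction.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the description refers only to "a predetermined threshold value".
TEMPERATURE_THRESHOLD_C = 85.0
CACHE_MISS_RUN_THRESHOLD = 8


@dataclass
class HardwareEvents:
    """Snapshot of the hardware conditions that may act as triggers (illustrative)."""
    temperature_c: float = 25.0
    consecutive_cache_misses: int = 0
    page_table_walk_started: bool = False
    polling: bool = False
    wait_for_interrupt: bool = False


def transfer_to_high_efficiency(events: HardwareEvents) -> bool:
    """Return True if any modelled hardware trigger asserts the transfer stimulus."""
    return (
        events.temperature_c > TEMPERATURE_THRESHOLD_C
        or events.consecutive_cache_misses >= CACHE_MISS_RUN_THRESHOLD
        or events.page_table_walk_started
        or events.polling
        or events.wait_for_interrupt
    )


idle = transfer_to_high_efficiency(HardwareEvents())
hot = transfer_to_high_efficiency(HardwareEvents(temperature_c=90.0))
stalled = transfer_to_high_efficiency(HardwareEvents(consecutive_cache_misses=8))
```

Because the decision is a pure function of hardware-visible conditions, it needs no involvement from the operating system, which is consistent with a hardware-triggered transfer being transparent to software.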
FIG. 2 schematically illustrates different categories of processor state information. The processing circuitry of FIG. 1 comprises circuit elements such as flip-flops and Random Access Memories (RAMs) that hold state information. Typically this state information is updated upon changes of a clock signal. The processor circuit state 200 as illustrated in FIG. 2 comprises all of the state held by circuits in the corresponding processor. The processor circuit state 200 illustrated by FIG. 2 could represent either: the processor state of a first hybrid processing unit corresponding to a combination of the high performance processing circuitry 120 and the shared processing circuitry 160 of FIG. 1; or a second hybrid processing unit corresponding to the high efficiency processing circuitry 140 together with the shared processing circuitry 160. As shown in FIG. 2, the processor circuit state 200 comprises both an architectural state 210 and a micro-architectural state 240. The architectural state 210 comprises: general purpose registers 220; a program counter 222; a stack pointer 224; and special purpose registers 226.
The processor architectural state 210 is a subset of the processor state that a software programmer can expect to exist and obey an architecture specification characteristic of all processors that implement a given architecture. The general purpose registers 220, program counter 222, stack pointer 224 and special purpose registers 226 illustrated in FIG. 2 are (non-exhaustive) examples of architectural state. The architectural state 210 can be considered to be a minimum set of processor circuit state that should be preserved before removing power to the processor so that upon stopping execution at an arbitrary instruction in a computer program running on the processor, the preserved state can be used upon resumption of power to allow processing to be successfully restarted at the point it was stopped.
Modern processors are often heavily pipelined and allow many in-flight or partially finished instructions at any one time. In the embodiment of FIG. 1, prior to saving the architectural state 210, any in-flight instruction is completed or cancelled. Any in-flight instructions prior to the instruction on which the processor stopped execution are completed and any instructions after the instruction on which the processor stopped are cancelled. The architectural state 210 is then saved for later use in resumption of processing.
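The complete-or-cancel rule for in-flight instructions can be sketched as a partition of the pipeline contents around the stopping point. This is an illustrative sketch: the use of integer sequence numbers in program order, the function name, and the choice to treat the stopping instruction itself as the last instruction to complete are assumptions made for the example.

```python
def drain_pipeline(in_flight, stop_seq):
    """Partition in-flight instructions around the stopping point.

    in_flight: sequence numbers of in-flight instructions, in program order.
    stop_seq:  sequence number of the instruction on which execution stops
               (assumed here to complete before the state is saved).
    """
    completed = [seq for seq in in_flight if seq <= stop_seq]
    cancelled = [seq for seq in in_flight if seq > stop_seq]
    resume_seq = stop_seq + 1  # first instruction to run after resumption
    return completed, cancelled, resume_seq


completed, cancelled, resume_seq = drain_pipeline([10, 11, 12, 13, 14], stop_seq=12)
```

After the drain, no instruction is partially executed, so only architectural state needs to be captured to describe the point of resumption.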
The microarchitectural state 240 as illustrated in FIG. 2 comprises: cache contents 242; Translation Lookaside Buffer (TLB) contents 244 and branch predictor history 246.
The microarchitectural state is the state that exists in a specific processor implementation but that is not defined in the architecture specification to exist in all implementations. A software programmer cannot rely on every implementation of a processor architecture having the same microarchitectural state. The cache contents 242, TLB contents 244 and branch predictor history 246 are non-exhaustive examples of microarchitectural state. In the event that processing is stopped, assuming that the architectural state is saved, it is typically unnecessary to preserve microarchitectural state prior to removing power to the processor in order to later be able to successfully restart processing from the point at which it was stopped. However, preserving some microarchitectural state can improve the performance of the processor for a period after processing is restarted. For example, if the TLB entries are preserved then the processor will have fewer page table walks to perform in order to obtain virtual to physical address translations shortly after restarting processing. Similarly, if the branch predictor history 246 information is saved prior to removing power to the processor then the branch predictor will perform well immediately upon restart of processing and the branch prediction history will not have to be rebuilt from scratch.
A set of processing state information called “processing state restoration information” can be defined to comprise at least the state information that allows processing to be restarted later. It should contain enough information to allow processing to be restarted but may contain more information than that. Typically this means that the processing state restoration information should contain at least the architectural state 210 and in addition may comprise other state. An example of additional state that it may contain is some microarchitectural state such as the TLB contents. Saving microarchitectural state in addition to architectural state will likely improve processing performance for a period after processing is restarted (on the same processing circuitry or different but architecturally-compatible processing circuitry).
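One way to picture processing state restoration information is as a record with a mandatory architectural part and an optional micro-architectural part. The sketch below is illustrative only: the class and field names, the register count and the address values are hypothetical, and the `resume` helper merely reports what a destination unit would start from.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ArchitecturalState:
    """State needed to restart correctly (mirrors the examples of FIG. 2)."""
    general_purpose_registers: List[int]
    program_counter: int
    stack_pointer: int
    special_purpose_registers: Dict[str, int] = field(default_factory=dict)


@dataclass
class MicroArchitecturalState:
    """Optional state that only speeds up the period after restart."""
    tlb_contents: Dict[int, int] = field(default_factory=dict)
    branch_predictor_history: List[bool] = field(default_factory=list)


@dataclass
class RestorationInfo:
    architectural: ArchitecturalState
    micro: Optional[MicroArchitecturalState] = None


def resume(info: RestorationInfo) -> dict:
    """Describe how a destination unit would restart from `info`."""
    tlb_warm = info.micro is not None and bool(info.micro.tlb_contents)
    return {"resume_pc": info.architectural.program_counter, "tlb_warm": tlb_warm}


arch = ArchitecturalState([0] * 16, program_counter=0x8000, stack_pointer=0x2000_0000)
minimal_result = resume(RestorationInfo(arch))
extended_result = resume(
    RestorationInfo(arch, MicroArchitecturalState(tlb_contents={0x8: 0x40}))
)
```

Both records restart at the same program counter. The extended record additionally arrives with translations already available, which corresponds to the performance benefit described for preserved TLB contents.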
The high performance processing circuitry 120 and high efficiency processing circuitry 140 of FIG. 1 are capable of implementing privilege levels during execution of program instructions that restrict access to some of the architectural state 210 at certain times. For example, the operating system executed on the data processing apparatus 100 has access to more of the architectural state 210 than a typical program application. However, all of the architectural state 210 is accessible from the highest privilege level.
Two different processors having the same version of the architectural state 210 but having different versions of the micro-architectural state 240 should be able to run the same program instructions, but a given program is likely to take different lengths of time to execute on the two different micro-architectures and expend different amounts of energy.
In the data processing apparatus 100 of FIG. 1, the execution transfer involves ceasing execution of the instruction stream by the source one of the high performance processing circuitry 120 and the high efficiency processing circuitry 140 (depending on which one of these two sets of circuitry is currently actively executing the instruction stream) and re-starting the execution of the instruction stream from the same execution point following enactment of the switch. To enable a successful restarting of processing at the execution transfer point on a different set of processing circuitry it is necessary to provide to the destination processing circuitry a set of processing state restoration information representing a subset of the processor state 200 of FIG. 2 as defined above.
It will be appreciated that in addition to the processing state restoration information, to successfully resume execution, the destination processing circuitry should also have access to the relevant program instructions and data in main memory that were being used by the source processing circuitry to execute the instruction stream. In this particular embodiment the processing state restoration information comprises at least a subset of the architectural state 210 as illustrated in FIG. 2 and thus includes the program counter 222, the general purpose registers 220, the special purpose registers 226 and the stack pointer 224.
The execution flow transfer circuitry 170 of FIG. 1 represents a direct exchange mechanism for processing state restoration information between the high performance processing circuitry 120 and the high efficiency processing circuitry 140. In previously known systems, transferring execution of a single instruction stream from one set of processing circuitry to another set of processing circuitry would typically have involved pushing all of the processing state restoration information of the source processing circuitry, held in hardware registers that would be powered down after the transfer, out into external memory (which is not powered down) and then reloading all of that information from main memory into the destination processing circuitry.
The direct transfer mechanism for transferring at least a subset of the processing state restoration information between the high performance processing circuitry 120 and the high efficiency processing circuitry 140 of FIG. 1 is an alternative to the mechanism for transferring a flow of execution that is used in multi-processor systems. In such systems, in order to transfer processing from one processor to another processor, the source processor is signalled that power will be removed well in advance (i.e. one or more processing cycles) and the amount of processor state that is actually saved in order to enable successful resumption of processing is reduced by ensuring that the source processor finishes all processing associated with instructions in the instruction stream before a special stopping instruction. The special stopping instruction is used such that all processing associated with the instructions before the stopping instruction must be completed before the execution is transferred and all instructions following this stopping instruction in the execution sequence are cancelled. This previously known technique avoids a need to store processor state associated with partially complete instructions.
Another technique for avoiding having to store a full set of processing state restoration information in buffers and caches is to push any values that are temporarily being held in buffers or caches out to their intended destinations prior to the transfer of execution. This technique is similar to the techniques used to implement exceptions and interrupts.
However, according to the embodiment of the invention illustrated in FIG. 1, the processing state restoration information required to successfully resume execution at the transfer execution point is transferred between the source processing circuitry and the destination processing circuitry either: (i) directly via the execution flow transfer circuitry 170 without the requirement to first save it in the external memory 182; or (ii) via the shared accessibility to values stored in the shared processing circuitry 160. Thus not all, and perhaps none, of the processing state restoration information that is in the power domain to be shut down has to be pushed out to memory that will not be powered down.
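The two routes described above can be sketched as a simple routing decision based on where each item of processing state restoration information resides. This is an illustrative sketch only: the function name, the domain labels and the example placement of items are assumptions (the placement of the TLB and control registers in the shared domain follows the kind of arrangement mentioned for some embodiments, and is not a requirement).

```python
def plan_transfer(state_location):
    """Decide how each item of restoration state reaches the destination.

    state_location maps an item name to the power domain holding it:
    "source" (domain about to be powered down) or "shared".
    """
    plan = {"direct_path": [], "shared_in_place": [], "external_memory": []}
    for name, domain in sorted(state_location.items()):
        if domain == "source":
            # Copied over the direct exchange path by the transfer circuitry.
            plan["direct_path"].append(name)
        elif domain == "shared":
            # Already visible to the destination; nothing needs to move.
            plan["shared_in_place"].append(name)
        else:
            raise ValueError(f"unknown power domain: {domain}")
    return plan


plan = plan_transfer({
    "program_counter": "source",
    "general_purpose_registers": "source",
    "control_registers": "shared",
    "tlb_contents": "shared",
})
```

In this sketch nothing is routed through external memory, which corresponds to the case where none of the state in the power domain being shut down has to be pushed out to memory.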
Examples of processing state restoration information that are transferred between the source processing circuitry and the destination processing circuitry in the event of switching between high performance and high efficiency processing modes comprise a program counter, at least a subset of general purpose register contents of the source hybrid processing unit at the transfer execution point and/or micro-architectural state information such as branch predictor history information, TLB contents, micro-TLB contents and data prefetcher pattern information.
FIG. 3 schematically illustrates the mechanism for transfer of execution of a single instruction stream from the source hybrid processing unit to the destination hybrid processing unit in the system of FIG. 1. As illustrated in FIG. 3, a single integrated circuit 300 corresponds to a uni-processing data processing apparatus having a first power domain 310 corresponding to first processing hardware, a second power domain 320 corresponding to second processing hardware and a third power domain 330 in which shared processing hardware operates. The integrated circuit 300 has an external interface 350 providing a communication pathway with a bus 360 (or a network in alternative embodiments). A main memory (external) 362, a set of peripheral devices 364 and one or more further processors 366 are also connected to the bus 360.
Thefirst processing hardware310 and thesecond processing hardware320 are architecturally compatible but have different micro-architectures. The processing state restoration information of the first hybrid processing unit, which is formed when thefirst processing hardware310 has execution control, resides partly in thefirst power domain310. However, a portion of the first processing state restoration information also resides in the shared processing circuitry and thus in thethird power domain330. Similarly, the processing state restoration information of thesecond processing hardware320 has a portion that resides exclusively in thesecond power domain320, but there is also a portion of the second hybrid processing unit processing state restoration information that is stored in thethird power domain330.
It is also clear fromFIG. 3 that there is anoverlap340 between the first processingstate restoration information312 and the second processingstate restoration information322 represented by the region of intersection of the two sets of states. The executionflow transfer circuitry170 ofFIG. 1 provides a direct exchange mechanism for at least a portion of the processing state restoration information required to transfer execution of the single instruction stream between thefirst power domain310 and thesecond power domain320. As shown inFIG. 3, there is a direct processing staterestoration exchange pathway371 between the first processingstate restoration information312 and the second processingstate restoration information322.
The data processing apparatus ofFIG. 1 represents a uni-processing environment.FIG. 4A schematically illustrates execution of a single instruction stream according to a conventional uni-processor. A conventional uni-processor executes a single stream of instructions and the uni-processor operating system will typically switch execution between several threads of execution or between processes corresponding to application programs that the processor is executing at the time. However, this type of multithreading in a uni-processing data processing apparatus corresponds to time division multiplexing of the processing hardware to allow sharing of a single instruction stream execution engine.
Thus in the uni-processor device of FIG. 4A, the processing hardware can be running program code from, for example, an operating system or from a program application, but it can never be contemporaneously executing, for example, both operating system instructions and application program instructions. This is clear from the illustration of “execution stream 1” in FIG. 4A where, as time progresses, the uni-processor starts off by executing process 1 of program application 1, proceeds to execute process 1 of the operating system software, then returns to application 1 to execute process 2 and subsequently executes process 1 of program application 2.
Note thatapplication 1process 2 was executed for a first block oftime410, interrupted byprocess 1 ofapplication 2 executing intime block412, but execution ofprocess 2 ofapplication 1 is resumed in a later time slot414, whereuponprocess 2 is completed. Thus it is clear fromFIG. 4A that in a uni-processing processor, a single instruction stream is executing although multi-threading is possible such that the execution time is time-division multiplexed between processes of different applications such as application one and application two and processes corresponding to the operating system.
FIG. 4B schematically illustrates a multi-processor in which two different instruction streams are executed substantially concurrently (or simultaneously). In a multiprocessing system, the physical hardware supports substantially contemporaneous execution of more than one instruction stream. Thus it can be seen that as time progresses, according to a first instruction execution stream “execution stream one”, a sequence comprising execution of:application 1process 1, thenapplication 2process 2 is performed and then the particular processor running “execution stream 1” becomes idle. Meanwhile, a second processor representing different physical processing hardware runs “execution stream 2”, which executes (in time sequence)application 2process 1 thenapplication 2process 2 then anoperating system process 2 thenapplication 3process 1.
It can be seen from FIG. 4B that the multiprocessor executes application 1 process 1 and application 2 process 1 substantially concurrently, in execution stream 1 and execution stream 2 respectively. Similarly, operating system process 1 executes substantially simultaneously with application 2 process 2 and application 1 process 2 executes substantially simultaneously with operating system process 2. At any one time, not all of the processors in a multi-processor system need be actually executing instructions. This is illustrated by the last sequence of “execution stream 1” in FIG. 4B, which corresponds to idle time. At this time in the example of FIG. 4B, the second set of processing hardware is executing application 3 process 1.
Further details of processor taxonomy can be found in the textbook by Hennessy and Patterson, “Computer Architecture: A Quantitative Approach”, 2nd edition, on page 636. A multi-processor contains the physical hardware resources necessary to substantially simultaneously process instructions corresponding to multiple instruction streams (i.e. execution streams), whereas a uni-processor has the physical resources to execute only a single instruction stream at any given time.
In normal operation, processing circuitry should process instructions in the order that they occur in the processor memory until a branch or jump instruction causes processing to jump to an instruction from another area of the memory. High performance processing circuitry such as the circuitry 120 of FIG. 1 is capable of internally performing some processing of instructions of an instruction sequence not strictly in the order that the instructions are found in the memory. However, when out-of-order execution is performed the instruction sequence preserves the same functional behaviour as if the instructions were executed in the actual order that they are found in the memory. The first program counter 122 of FIG. 1 tracks the memory address of the instruction being processed by the high performance processing circuitry 120, and the value held by the first program counter 122 advances sequentially between instructions unless a branch or jump instruction is taken, in which case the value held by the first program counter 122 changes to correspond to the instruction at the branch or jump target address. Although the data processing apparatus of FIG. 1 has both the first program counter 122 and a second program counter 142, because the data processing apparatus represents a uni-processing environment only one of the two program counters 122, 142 is active at any given time and only one of the sets of processing circuitry 120, 140 is actively executing instructions at any one time.
FIG. 5 schematically illustrates how the data processing apparatus ofFIG. 1 represents a uni-processing environment executing a single instruction stream that can be transferred between a first hybrid processing unit and a second hybrid processing unit. The execution stream illustrated inFIG. 5 corresponds to the same execution stream illustrated for execution by the uni-processor ofFIG. 4A. InFIG. 5, execution of the instruction stream A begins with execution ofapplication 1process 1 on the first hybrid processing unit corresponding to the highperformance processing circuitry120 together with the sharedprocessing circuitry160 ofFIG. 1. Followingapplication 1process 1, the hybrid uni-processor proceeds to executeoperating system process 1. Onceoperating system process 1 has executed, a flow transfer stimulus is received commanding that execution be transferred from the first hybrid processing unit to the second hybrid processing unit formed by the highefficiency processing circuitry140 together with the sharedprocessing circuitry160 ofFIG. 1.
It can be seen from the two parallel timelines of FIG. 5, corresponding to the first hybrid processing unit and the second hybrid processing unit, that while the first hybrid processing unit has control of the execution stream and executes application 1 process 1 and operating system process 1, the second hybrid processing unit is idle (i.e. inactive) and performs no processing. During this time period the second hybrid processing unit is powered down to conserve energy. However, once the flow of execution is transferred at the execution transfer point, execution of “execution stream A” is switched from the first hybrid processing unit to the second hybrid processing unit. Following execution of operating system process 1 on the first hybrid processing unit, the second hybrid processing unit takes over execution of the instruction stream where the first hybrid processing unit left off and proceeds to execute application 1 process 2, followed by application 2 process 1, followed by the remainder of application 1 process 2.
Thus by comparing FIG. 4A and FIG. 5, it can be seen that execution of the single instruction stream is switched between the two different hybrid processing units. Note that once execution has been transferred from the first hybrid processing unit to the second hybrid processing unit, the second hybrid processing unit takes control of the execution stream and performs the processing tasks whilst the non-shared part of the first hybrid processing unit enters a powered-down, quiescent state and performs no processing. In some embodiments, instead of simply powering down the processor that has relinquished control of execution, the clock signal to the non-active one of the first processing circuitry and the second processing circuitry is stopped (e.g. gated). This reduces dynamic power consumption.
The hybrid uni-processor ofFIG. 5 differs from a conventional multi-processor system at least because although there are two sets of processing hardware (high performance and high efficiency) these together form a uni-processing environment in which there can be no substantially contemporaneous processing of different instruction streams on the first hybrid processing unit and the second hybrid processing unit. In the systems illustrated byFIG. 4A andFIG. 4B, in each case the operating system is required to have knowledge whether it is operating on either a uni-processing environment (in the case ofFIG. 4A) or in a multiprocessing environment (in the case ofFIG. 4B).
FIG. 6 is a flow chart that schematically illustrates how execution of a single processing stream in the uni-processing environment according to embodiments of the invention is transferred between a first (source) hybrid processing unit and a second (destination) hybrid processing unit. In each case, the hybrid processing unit is formed from one or the other of the highperformance processing circuitry120 and the highefficiency processing circuitry140 ofFIG. 1 in addition to the sharedprocessing circuitry160.
The process begins atstage610 when the source hybrid processing unit that is executing instruction stream A receives a trigger (hardware or software based) for transfer of execution of the single instruction stream. In response to the trigger, the first hybrid processing unit sends a request to the processing circuitry of the second hybrid processing unit (typically the non-shared processing circuitry) to power up the destination processing circuitry corresponding to the destination hybrid processing unit.
Then, on the side of the destination hybrid processing unit, the process proceeds to stage 652 where the powered up second processing circuitry becomes ready for transfer of the instruction stream execution and sends an accept signal accepting the transfer request received from the source hybrid processing unit. The source hybrid processing unit receives the acceptance of the transfer request at stage 612 and responds by completing or cancelling all partially complete instructions.
Next, on the source hybrid processing unit side, the process proceeds to stage614 where the source hybrid processing unit initiates direct transfer of at least a portion of the processing state restoration information via the executionflow transfer circuitry170. Also atstage614, the source hybrid processing unit relinquishes control of the shared processing circuitry. On the side of the destination hybrid processing unit, the processing state restoration information is received atstage654, just prior to the destination hybrid processing unit assuming control of the shared processing circuitry atstage656.
On the side of the source hybrid processing unit, after the direct transfer of the processing state restoration information has occurred at stage 614, the process proceeds to stage 618 where the source hybrid processing unit relinquishes control of the instruction stream at the execution transfer point and signals to the destination hybrid processing unit that everything is ready on the source side for transfer of execution of the instruction stream.
In response to the ready signal output by the source hybrid processing unit, the destination hybrid processing unit assumes control of the shared processing circuitry at stage 656 and subsequently assumes full control of execution of instruction stream A starting from the execution transfer point. At stage 658, the destination hybrid processing unit signals back to the source hybrid processing unit that execution has been successfully resumed, whereupon the source hybrid processing unit powers down the associated non-shared processing circuitry corresponding to either the high performance processing circuitry 120 or the high efficiency processing circuitry 140 of FIG. 1. Note that when the destination hybrid processing unit assumes control of the shared processing circuitry at stage 656 it automatically has access to that part of the processing state restoration information that was stored in the shared processing circuitry by the non-shared processing circuitry of the source hybrid processing unit prior to the transfer of execution.
FIG. 7 schematically illustrates a data processing apparatus according to an embodiment of the present invention, illustrating an operational state where the first processing circuitry and shared processing circuitry have control of execution of the single instruction stream. The arrangement of FIG. 7 comprises data processing hardware 710 comprising first processing circuitry 712, second processing circuitry 714 and shared processing circuitry 720. The arrangement further comprises power control circuitry 750 incorporating dynamic voltage and frequency scaling (DVFS) circuitry 752. In FIG. 7, the box 780 represents the source hybrid processing unit, which is formed from the combination of the first processing circuitry 712 and the shared processing circuitry 720.
The power control circuitry 750 is providing full power to the first processing circuitry and to the shared processing circuitry via the switches 756 and 758, but because the second processing circuitry 714 is not active at the time represented by FIG. 7, the switch 759 connecting the power control circuitry 750 to the second processing circuitry 714 is open. In the embodiment of FIG. 1, the voltages of operation of the high performance processing circuitry 120, the high efficiency processing circuitry 140 and the shared processing circuitry 160 are substantially fixed. However, in the embodiments of FIG. 7 and FIG. 8, the power control circuitry 750 has the capacity to independently perform dynamic voltage and frequency scaling of each of (i) the first processing circuitry 712; (ii) the second processing circuitry 714; and (iii) the shared processing circuitry 720.
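The ordering of the FIG. 6 stages described above can be sketched as a simple sequential software model. This is only an illustrative approximation of a hardware handshake, under the assumption that the stages occur strictly in the order narrated; the class `HybridUnit`, the function `transfer_execution` and the `owner` key are hypothetical names introduced here.

```python
from typing import Dict, List


class HybridUnit:
    """Hypothetical model of the non-shared circuitry of one hybrid processing unit."""

    def __init__(self, name: str, powered: bool) -> None:
        self.name = name
        self.powered = powered
        self.state: Dict[str, object] = {}


def transfer_execution(source: HybridUnit, dest: HybridUnit,
                       shared: Dict[str, object]) -> List[str]:
    """Step through the FIG. 6 stages in order and return the stages visited."""
    log: List[str] = []
    # Stage 610: trigger received; the destination circuitry is asked to power up.
    log.append("610")
    dest.powered = True
    # Stage 652: destination is ready and accepts the transfer request.
    log.append("652")
    # Stage 612: source completes or cancels partially complete instructions.
    log.append("612")
    # Stage 614: direct transfer of state; source releases the shared circuitry.
    log.append("614")
    transferred = dict(source.state)
    shared["owner"] = None
    # Stage 654: destination receives the directly transferred state.
    log.append("654")
    dest.state = transferred
    # Stage 618: source relinquishes the instruction stream and signals ready.
    log.append("618")
    # Stage 656: destination takes control of the shared circuitry and the stream.
    log.append("656")
    shared["owner"] = dest.name
    # Stage 658: destination confirms resumption; source non-shared circuitry powers down.
    log.append("658")
    source.powered = False
    return log
```

Note that in this model the source is powered down only after stage 658, so both sets of non-shared circuitry are briefly powered during the handover.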
TheDVFS circuitry752 provides an extra mechanism to deal with the different power-performance modes that can be required by application software. Many program applications are power-constrained applications that require relatively little processor performance for the majority of time that they are executing, but periodically and somewhat transiently require relatively higher performance levels for short periods of time. TheDVFS circuitry752 enables the performance of the processing circuitry i.e. thefirst processing circuitry712, thesecond processing circuitry714 and the sharedprocessing circuitry720 to be tailored to the demands of the particular processing workload being serviced and thus allows power to be saved when maximum performance is not required.
In the system of FIG. 7, a lower energy consumption processing mode of the apparatus can be entered by causing the DVFS circuitry 752 to reduce the voltage supplied to one or more of the processing circuits 712, 714 and 720 so that less power is consumed. Reducing the voltage also makes the transistors switch more slowly, so the frequency of the processor clock should be correspondingly reduced. The DVFS circuitry 752 of FIG. 7 is simple circuitry that has one full voltage/frequency performance point and a single lower voltage/frequency performance point. However, in an alternative embodiment, a plurality of different performance points or even a continuum of voltages and frequencies are provided (within a predetermined range).
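The benefit of lowering voltage and frequency together can be illustrated with the standard first-order approximation for CMOS switching power, P = a·C·V²·f. The sketch below applies that approximation to two performance points; the voltages, frequencies and effective capacitance are assumed figures chosen purely for illustration and are not taken from the described embodiment.

```python
def dynamic_power(capacitance_f: float, voltage_v: float,
                  frequency_hz: float, activity: float = 1.0) -> float:
    """First-order CMOS switching power in watts: activity * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * frequency_hz


# Two hypothetical performance points: (supply voltage in volts, clock in hertz).
FULL_POINT = (1.2, 1.0e9)
LOW_POINT = (0.9, 0.5e9)
C_EFF = 1.0e-9  # assumed effective switched capacitance in farads

full = dynamic_power(C_EFF, *FULL_POINT)
low = dynamic_power(C_EFF, *LOW_POINT)
ratio = low / full  # fraction of full-point dynamic power drawn at the low point
```

With these assumed figures the lower point draws roughly 28% of the full-point dynamic power while running at half the clock rate, because the voltage term enters quadratically. Static (leakage) power is not modelled here.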
The arrangement of FIG. 7 also comprises, within the DVFS circuitry 752, circuitry to level-shift and re-synchronise signals crossing between the voltage domains corresponding to the first processing circuitry, the second processing circuitry and the shared processing circuitry. Note that the operating system 772 is a uni-processing operating system that sees a single processor having two operating modes: a first operating mode in which the first processing circuitry 712 and the shared processing circuitry 720 collectively have execution control of the single instruction stream; and a second operating mode in which the second processing circuitry 714 and the shared processing circuitry 720 collectively have execution control of the single instruction stream.
In the system of FIG. 7, the virtualiser software (see FIG. 9) hides the details of configuration of the first processing circuitry and the second processing circuitry from the operating system (see FIG. 9). This means that the virtualiser and not the operating system is used to manage migration (switching) of execution of the single instruction stream between the first processing circuitry 712 and the second processing circuitry 714. Thus the execution stream migration can be triggered directly via a user-mode program instruction without having to enter the operating system. Entering the operating system could in principle take up to several thousand processing cycles and this could have an impact on the time taken to switch between execution on the source hybrid processing unit and the destination hybrid processing unit, but the use of the virtualiser expedites switching of execution control. A transfer of execution flow between the first processing circuitry 712 and the second processing circuitry 714 can be triggered by an application program itself. For example, an application program executing on higher efficiency processing circuitry can trigger an execution flow switch to execution on the higher performance processing circuitry in the event of a jump into the operating system or a jump into the virtualisation software.
Furthermore, the transfer of execution of the instruction stream can be triggered and performed purely in hardware without any software support, for example when the battery charge level falls below a predetermined threshold which triggers the data processing apparatus to switch from executing the instruction stream on high performance processing circuitry to high efficiency processing circuitry.
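The battery-level trigger described above is implemented in hardware, but its decision rule can be sketched in software for clarity. The following is a minimal model covering only the one switch the text describes (high performance to high efficiency when charge falls below the threshold); the function name `select_circuitry`, the string labels and the 20% default threshold are hypothetical assumptions.

```python
HIGH_PERFORMANCE = "high_performance"
HIGH_EFFICIENCY = "high_efficiency"


def select_circuitry(active: str, battery_fraction: float,
                     threshold: float = 0.2) -> str:
    """Return the circuitry that should execute the instruction stream next.

    Only the switch described in the text is modelled: a move from high
    performance to high efficiency once the charge level drops below the
    predetermined threshold. No switch back is modelled.
    """
    if active == HIGH_PERFORMANCE and battery_fraction < threshold:
        return HIGH_EFFICIENCY
    return active
```

A practical trigger would probably also need a rule for returning to high performance circuitry, for example with hysteresis, but the text does not describe one and so none is assumed here.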
FIG. 8 schematically illustrates the power control configuration of an apparatus having identical hardware and software to the apparatus illustrated inFIG. 7, but represents the system after execution control has been transferred from the first processing circuitry to the second processing circuitry. In this case, thesecond processing circuitry714 together with the sharedprocessing circuitry720 form the second or destination hybrid processing unit so that the power controller switches759 and758 are both closed (representing power being supplied). Thepower control switch756 connecting thepower control circuitry750 to thefirst processing circuitry712 is open. Clearly, the first processing circuitry is powered down and execution control has been assumed by thesecond processing circuitry714.
Thesecond processing circuitry714 can be operating at either the high performance operating point or the lower performance operating point as dictated by theDVFS circuitry752. In the arrangement ofFIG. 8, the workload comprising the operating system and the two processing applications (seeFIG. 9) is being performed by the second hybrid processing unit.
FIG. 9 schematically illustrates relationships between processing hardware, virtualizer software, operating system software and application software according to an embodiment of the present invention. Theprocessing system900 comprises: afirst operating system910 arranged to run one or more of afirst application program912, asecond application program914, athird application program916 and afourth application program918. Theprocessing system900 further comprises asecond operating system920 on which one or more of afifth program application922, asixth program application924 and aseventh program application926 are arranged to run. Both the first andsecond operating system910,920 and all sevenprogram applications912,914,916,918,922,924,926 run on processinghardware940 through the intermediary of avirtualizer930. Thevirtualizer930 is software that runs directly on theprocessing hardware940. Theapplications912,914,916,918,922,924,926 run at a lower level of privilege than theoperating systems910,920. Theoperating systems910,920 run at a lower level of privilege than thevirtualizer930.
Theprocessing hardware940 can be considered to collectively represent all of theprocessing circuitry110 of theFIG. 1 embodiment. In this particular example, thefirst operating system910 and thesecond operating system920 can both run on either the highperformance processing circuitry120 or the highefficiency processing circuitry140. Thevirtualizer930 enables multiple operating systems to be run on thehardware940. Thevirtualizer930 is used to mask events where execution of the single instruction stream is switched between the highperformance processing circuitry120 and the highefficiency processing circuitry140 ofFIG. 1 so that the switch of execution flow control is transparent to theoperating systems910,920 and all applications running on the operating systems.
Essentially, thevirtualizer930 hides details of the configuration of the highperformance processing circuitry120 and the highefficiency processing circuitry140 from theoperating systems910,920 and the virtualizer manages migration of the single instruction stream between the two sets ofprocessing circuitry120,140. This means that migration of the single execution stream can be triggered directly via a user-mode program instruction without having to enter the code of anyoperating system910,920. Avoiding entering the operating system is likely to expedite switching of processing the execution stream relative to embodiments where theoperating system910,920 is entered to effect the switch because entering the operating system could take up to several thousand processing cycles.
FIG. 10 schematically illustrates a data processing apparatus according to an embodiment of the present invention in which the shared processing circuitry comprises a shared cache that forms an L1 cache for one set of non-shared processing circuitry and an L2 cache for a further set of non-shared processing circuitry. Thedata processing apparatus1000 comprises:first processing circuitry1010 corresponding to a high performance central processing unit (CPU) core comprising an out-of-order, three-way issue super-scalar CPU core with thirteen pipeline stages.
The high performance CPU core 1010 comprises: a general purpose register file 1020 with eight read ports and three write ports; a first load store unit 1030; an instruction fetch and branch prediction unit 1040 comprising a first program counter for keeping track of the currently executing instruction when the high performance CPU core has control of the execution of the single instruction stream; and an instruction decoder/sequencer 1044. The high performance CPU core 1010 further comprises a level one cache 1050 that is in bi-directional communication with the first load store unit 1030 to store instructions and/or data extracted from and en route to a main memory (not shown).
Thedata processing apparatus1000 also comprises a second set of processing circuitry corresponding to a lowpower CPU core1100 that is low power relative to the highperformance CPU core1010 and also has relatively lower performance. The lowpower CPU core1100 is an in-order, single issue CPU core with three pipeline stages. The reduced number of pipeline stages relative to the highperformance CPU core1010, the single issue rather than the three-way issue of instructions and the in-order execution rather than out-of-order execution of program instructions contribute to the lower power consumption and lower relative performance.
The low power CPU core 1100 comprises a second register file 1110 corresponding to a general purpose register file with two read ports and one write port. The low power CPU core 1100 also comprises a second load store unit 1120 that is simpler than the first load store unit 1030 and a second instruction fetch unit 1130 that is relatively simple in comparison to the first instruction fetch unit 1040 of the high performance CPU core 1010. The second instruction fetch unit 1130 comprises a second program counter 1133 that is used to keep track of a currently executing instruction whilst the flow of execution is being controlled by the low power CPU core 1100 rather than the high performance CPU core 1010.
The data processing apparatus 1000 is provided with a plurality of direct transfer pathways 1001 for transferring data directly between the high performance CPU core 1010 and the low power CPU core 1100. The plurality of direct transfer pathways 1001 include: a direct transfer pathway between the first general register file 1020 and the second general register file 1110; a further direct transfer pathway between the first program counter 1042 and the second program counter 1133; and a further direct transfer pathway that enables direct bi-directional communication between the high performance CPU core 1010 and the low power CPU core 1100. These direct transfer pathways 1001 enable at least a portion of the processing state restoration information required for transferring execution between the high performance CPU core 1010 and the low power CPU core 1100 to be transferred directly. The directly transferred portion of the processing state restoration information need not be pushed out to external (main) memory or stored in buffers or caches prior to transfer between the two CPU cores 1010, 1100.
In this particular embodiment, the shared processing circuitry is accessible to both the high performance CPU core 1010 and the low power CPU core 1100 via the multiplexer 1200. The shared processing circuitry comprises a set of control registers 1310 including flag registers, mode registers and configuration registers and also a TLB 1320 for providing a mapping between virtual memory addresses and physical memory addresses. The shared processing circuitry further comprises a shared cache 1400 and a bus interface unit 1500. The bus interface unit 1500 enables communication of data to and from an external bus 1600.
Since the high performance CPU core 1010 already has the L1 cache 1050, the shared cache 1400 serves as a level two cache for the high performance CPU core 1010 but also serves as a level one cache for the low power CPU core 1100, which does not have its own non-shared level one cache. In the case that the L1 cache of the high performance CPU core 1010 uses a write-back cache policy, the cache should be cleaned to push any modified entries to L2 as part of the instruction stream transfer process. This would be necessary anyway in order to power down the high performance CPU core 1010. It can be avoided by using a write-through L1 cache policy, which does not require cleaning, although write-through caches can sometimes have performance and power disadvantages. Since the power efficient core does not have an L1 cache there is no need to perform a cache clean and this may reduce the time needed to switch into high-performance processing. In alternative embodiments both the high performance and the low power cores have an L1 cache. For example, the low power core may have a power optimized L1 cache and the high performance core may have a performance optimized L1 cache.
A direct transfer pathway 1001 serves to transfer any non-shared performance critical state between the high performance CPU core 1010 and the low power CPU core 1100 upon a switch of the instruction stream execution, i.e. a transfer of execution between the two CPU cores. In the event of a transfer of flow of execution from the high performance CPU core 1010 to the low power CPU core 1100, the non-shared L1 cache 1050 of the high performance CPU core 1010 remains powered during the execution flow transfer and is cleaned as part of the transfer. Circuitry is provided within the data processing apparatus 1000 that allows data and instructions required by the instruction stream newly transferred to, and executing on, the low power CPU core to be obtained from the non-shared L1 cache 1050 of the high performance CPU core 1010 (source hybrid processing unit). The data and instructions obtained in this way from the L1 cache 1050 are then cached in the destination hybrid processing unit, i.e. in this case the low power CPU core 1100 or the shared processing circuitry. This improves the efficiency of the system by reducing the likelihood of having to retrieve data from external memory when execution of the transferred single instruction stream that has been suspended on the source hybrid processing unit is resumed on the destination hybrid processing unit.
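The cleaning of a write-back L1 cache mentioned above can be sketched as follows. This is a simplified software model, assuming a cache represented as a dictionary of address to (value, dirty flag); the type alias `L1Cache` and the function `clean_l1` are hypothetical names, and real hardware would operate on cache lines rather than individual values.

```python
from typing import Dict, Tuple

# Each modelled L1 entry maps an address to (value, dirty flag).
L1Cache = Dict[int, Tuple[int, bool]]


def clean_l1(l1: L1Cache, l2: Dict[int, int]) -> int:
    """Write every dirty L1 entry back to L2 and mark it clean.

    Returns the number of entries written back. Clean entries are left
    untouched, since L2 (or memory) already holds their current value.
    """
    written = 0
    for address, (value, dirty) in list(l1.items()):
        if dirty:
            l2[address] = value
            l1[address] = (value, False)
            written += 1
    return written
```

Under a write-through policy every store would already have updated L2, so the dirty flag would never be set and `clean_l1` would write nothing back, which corresponds to the observation that a write-through L1 cache does not require cleaning.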
Thedata processing apparatus1000 corresponds to a uni-processing environment in which only one of the highperformance CPU core1010 and the lowpower CPU core1100 has overall control of execution of the single instruction stream at any one time.
In thedata processing apparatus1000 ofFIG. 10, thehigh performance CPU1010 corresponds to a high-performance processing region of a single integrated circuit whilst the lowpower CPU core1100 corresponds to an energy-efficient processing region of the same integrated circuit. Thus the processing circuitry associated with the highperformance processing region1010 are physically close to each other on the integrated circuit and similarly the circuitry associated with the energyefficient processing region1100 are physically close to each other on the integrated circuit, rather than being distributed across the whole area of the integrated circuit (core processor). Although there is some duplication of the processing circuitry by providing, for example, a firstload store unit1030 and the secondload store unit1020 and a first fetchunit1040 and a second fetchunit1030, this duplication is mitigated by the fact that there are typically high overheads and integrating unit-level clock gating or power-switching and input/output signal clamping in a fine grain distributed manner across a large processor (integrated circuit). The power-switching has an area overhead and the signal clamping can lengthen critical paths thus reducing the peak clock frequency of the processor. Thus although there is the above mentioned duplication of processing circuitry by provision of the twodifferent CPU cores1010,1100 on the same integrated circuit in the embodiment ofFIG. 10 the clock gating or power-gating and input/output signal clamping is simplified relative to having more distributed circuitry and the critical paths in the highperformance processing region1010 are not impacted as much by the energy efficient mode made possible by the lowpower CPU core1100. An additional benefit is in reducing the distance that many signals must travel when the execution is being performed by the lowpower CPU core1100 in a low-energy mode and this improves the energy efficiency.
FIG. 11 schematically illustrates a data processing apparatus according to an embodiment of the present invention in which a high performance CPU core and a low power CPU core share a set of special purpose registers, a TLB, a bus interface unit and both a level one cache and a level two cache. The data processing apparatus 2000 of FIG. 11, like the apparatus of FIG. 10, comprises a high performance CPU core 1210 and a low power CPU core 1220. The individual components of the low power CPU core 1220 are identical to the individual components of the low power CPU core 1100 of the embodiment of FIG. 10.
The constituent circuitry of the high performance CPU core 1210 of FIG. 11 comprises a first general register file 1020, load store unit 1030, decoder/sequencer 1044 and set and branch point unit 1040 identical to those of the high performance CPU core 1010 of FIG. 10. However, there is a key difference between the high performance CPU core 1210 of FIG. 11 and that of FIG. 10, in that the high performance CPU core 1210 of FIG. 11 does not have a non-shared level one cache. Instead, the shared processing circuitry comprises both a shared level one cache 1710 and a shared level two cache 1720.
The shared processing circuitry, as in the embodiment of FIG. 10, further comprises a set of control registers 1310 which is a master copy of information in the special purpose registers, a shared TLB 1320 and a shared bus interface unit 1500 connecting the processor 2000 to external memory and the rest of the system in which the data processing apparatus 2000 is incorporated. As in the embodiment of FIG. 10, a plurality of direct transfer pathways 1001 enable direct transfer of non-shared performance critical state between the high performance CPU core 1210 and the low power CPU core 1220. The embodiment of FIG. 11 also has shared embedded trace macrocell (ETM) circuitry 1730 as part of the shared processing circuitry, which avoids the need for two separate trace units for the high performance CPU core 1210 and the low power CPU core 1220.
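By way of illustration only, the division between shared state and non-shared performance critical state can be sketched as follows. This is a simplified software model under stated assumptions: the class names `SharedState` and `CoreState`, the function `hand_over`, and the choice of sixteen general registers are hypothetical and do not describe the circuitry of FIG. 11.

```python
from dataclasses import dataclass, field


@dataclass
class SharedState:
    # Held once in the shared processing circuitry; not copied on a switch.
    control_registers: dict = field(default_factory=dict)
    tlb: dict = field(default_factory=dict)
    l1_cache: dict = field(default_factory=dict)
    l2_cache: dict = field(default_factory=dict)


@dataclass
class CoreState:
    # Non-shared state private to one core.
    program_counter: int = 0
    # Assumption for this sketch: sixteen general purpose registers.
    registers: list = field(default_factory=lambda: [0] * 16)


def hand_over(source: CoreState, target: CoreState) -> int:
    """Copy the non-shared state to the target; return the words moved."""
    target.program_counter = source.program_counter
    target.registers = list(source.registers)
    return 1 + len(source.registers)
```

In this model the cost of a switch depends only on the size of the non-shared state, because the control registers, TLB and cache contents stay where they are and remain visible to whichever core has control.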
In yet further alternative embodiments, the program counters 1042 and 1133 could also be replaced by a shared program counter common to the high performance CPU core 1010 and the low power CPU core 1100. However, it is likely that the hardware transfer of program counter data between the non-shared portions of the circuitry has less impact on critical paths than transfer of cache data, TLB registers or special purpose register data. In order to simplify the transfer of the processing state restoration information, and hence the transfer of control of the single execution stream, some embodiments drain the pipelines of the high performance CPU core 1010 and the low power CPU core 1100. The draining of the pipeline typically costs only a few tens of processing cycles.
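By way of illustration only, the draining of the pipelines prior to a transfer of control can be sketched as follows. This is a toy model and not a cycle-accurate description: the names `Pipeline` and `switch_control` are hypothetical, and the rate of one retired instruction per cycle is an assumption made purely to keep the example small.

```python
from collections import deque


class Pipeline:
    """Toy pipeline holding instructions that have issued but not retired."""

    def __init__(self):
        self.in_flight = deque()
        self.retired = []

    def issue(self, instruction):
        self.in_flight.append(instruction)

    def drain(self):
        """Retire every in-flight instruction; return the cycles spent."""
        cycles = 0
        while self.in_flight:
            self.retired.append(self.in_flight.popleft())
            cycles += 1  # assumption: one instruction retires per cycle
        return cycles


def switch_control(source, target, program_counter):
    """Drain both pipelines, then pass the program counter to the target."""
    cycles = source.drain() + target.drain()
    # With both pipelines empty, no partially executed instruction has to be
    # transferred; only the resume address accompanies the handover here.
    return {"resume_at": program_counter, "drain_cycles": cycles}
```

Once the pipelines are empty, the state that must accompany the handover reduces to architectural state such as the program counter, which is why draining simplifies the transfer.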
FIG. 12 schematically illustrates a multi-processor system constructed from two separate instances of the uni-processor of the embodiment of FIG. 11 and a standard single-core processor. Thus the multiprocessing system 3000 comprises a first integrated circuit 2000 having a high performance CPU and a low power CPU together with a shared L1 cache and L2 cache, TLB registers, special purpose registers and a bus interface unit; a further dual-core uni-processing environment 2000′; and a single core uni-processor 2000″. The three uni-processing integrated circuits 2000, 2000′ and 2000″ are connected via the bus 3001. Within each of the two dual-core uni-processing systems 2000, 2000′, only one of the high performance CPU core and the low power CPU core can have control of execution of the single instruction stream being executed in that particular integrated circuit at any one time. The single core uni-processor 2000″ is a standard processing core where the single instruction stream runs on a single core.
Thus the multiprocessor system 3000 is constituted by three separate uni-processor systems 2000, 2000′ and 2000″, each of which processes a different instruction stream. Furthermore, each individual uni-processor has the physical resources necessary to process an instruction stream without the requirement to time division multiplex, and each individual processor 2000, 2000′ and 2000″ communicates with external memory and any other processors within the network system in which the multiprocessor is used through one or more bus or network interfaces.
Although illustrative embodiments of the invention have been described in detail herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various changes and modifications can be effected therein by one skilled in the art without departing from the scope and spirit of the invention as defined by the appended claims.