CN101916184B

Info

Publication number: CN101916184B
Application number: CN201010260377.2A
Authority: CN
Inventors: 汤玛斯·C·麦当劳
Original assignee: Via Technologies Inc
Current assignee: Via Technologies Inc
Priority date: 2009-08-28
Filing date: 2010-08-20
Publication date: 2014-02-12
Anticipated expiration: 2030-08-20
Also published as: CN101916184A

Abstract

The present invention provides a method for updating a branch target address cache in a microprocessor, and a microprocessor implementing the method. The microprocessor includes a branch target address cache (BTAC), an execution unit, and an update logic circuit. The execution unit executes a branch instruction previously fetched in a fetch quantum from an instruction cache. The update logic circuit is coupled to the BTAC and the execution unit and determines whether the BTAC already stores branch prediction information for N branch instructions within the fetch quantum, where N is at least two. If the BTAC does not yet store branch prediction information for N branch instructions, the update logic circuit updates the BTAC with the branch information of the branch instruction. If the BTAC already stores branch prediction information for N branch instructions, the update logic circuit determines whether the replacement priority of the branch instruction is higher than the replacement priorities of the N branch instructions in the BTAC and, if it is, updates the BTAC with the branch information of the branch instruction.

Description

Translated from Chinese

Method for updating a branch target address cache in a microprocessor, and microprocessor thereof

Technical field

The present invention relates to microprocessors, and more particularly to branch target address caches in microprocessors.

Background

A conventional branch target address cache (BTAC) can store information for only about two branch instructions in any given aligned 16-byte segment of instruction data. This design choice is made to shorten timing and to reduce power consumption and die size. Allowing three or four branch instructions to be stored is considerably more complex than storing two. Although it is uncommon for three or more branch instructions whose first bytes lie in the same 16 bytes to be fetched from the instruction cache, the situation does occur and has a negative impact on performance.

Summary of the invention

The present invention provides a microprocessor that includes a branch target address cache, an execution unit, and an update logic circuit. Each entry of the branch target address cache stores branch prediction information for at most N branch instructions. The execution unit executes a branch instruction previously fetched in a fetch quantum from an instruction cache. The update logic circuit is coupled to the branch target address cache and the execution unit and is configured to: determine whether the branch target address cache already stores branch prediction information for N branch instructions within the fetch quantum, where N is at least two; if the branch target address cache does not yet store branch prediction information for N branch instructions within the fetch quantum, update the branch target address cache with the branch information of the branch instruction; if the branch target address cache already stores branch prediction information for N branch instructions within the fetch quantum, determine whether the replacement priority of the branch instruction is higher than the replacement priorities of the N branch instructions in the branch target address cache; and if the replacement priority of the branch instruction is higher than the replacement priorities of the N branch instructions in the branch target address cache, update the branch target address cache with the branch information of the branch instruction.

The present invention also provides a method for updating a branch target address cache (BTAC) in a microprocessor, where each entry of the branch target address cache stores branch prediction information for at most N branch instructions within a fetch quantum from an instruction cache. The method includes executing a branch instruction previously fetched in the fetch quantum from the instruction cache, and determining whether the branch target address cache already stores branch prediction information for N branch instructions within the fetch quantum, where N is at least two. If the branch target address cache does not yet store branch prediction information for N branch instructions within the fetch quantum, the branch target address cache is updated with the branch information of the branch instruction. If the branch target address cache already stores branch prediction information for N branch instructions within the fetch quantum, it is determined whether the replacement priority of the branch instruction is higher than the replacement priorities of the N branch instructions in the branch target address cache. If the replacement priority of the branch instruction is higher than the replacement priorities of the N branch instructions in the branch target address cache, the branch target address cache is updated with the branch information of the branch instruction.

To make the above and other objects, features, and advantages of the present invention easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.

Brief description of the drawings

FIG. 1 is a block diagram of a microprocessor according to an embodiment of the present invention;

FIG. 2 is a block diagram of an instruction cache according to an embodiment of the present invention;

FIG. 3 is a block diagram of the configuration of the branch target address cache of FIG. 1;

FIG. 4 is a diagram of the branch instruction type priorities used by the update logic circuit of FIG. 1;

FIGS. 5A and 5B are flowcharts of the operation of the microprocessor of FIG. 1.

[Description of main reference numerals]

100: microprocessor; 102: instruction cache;

104: fetch unit; 106: instruction decoder;

108: instruction queue; 112: adder;

116: register alias table; 118: reservation station;

122: execution unit; 124: retire unit;

126: second branch history table; 128: branch target address cache;

132: return stack; 134: control logic circuit;

136: update logic circuit; 138: pseudo-random generator;

142: fetch address; 144: next sequential fetch address;

146: predicted branch target address; 148: predicted return address;

152: correct target address; 154: branch target address;

162: global branch pattern; 164: first branch history table;

166: pseudo-random indicator; 168: general-purpose registers;

202: cache line; 302: entry;

304: branch target address prediction; 306: direction prediction;

308: branch instruction type; 312: valid bit.

Detailed description

To reduce the performance impact of the problem described above, the following embodiments provide a replacement policy for the case in which an additional branch instruction (for example, a third branch instruction) is present in the same portion, or fetch quantum (for example, 16 bytes), of a cache line fetched from the instruction cache. The replacement policy is a priority scheme based on the types of the branch instructions involved, with a pseudo-random provision that overrides the priority scheme in order to accommodate various corner cases.

FIG. 1 is a block diagram of a microprocessor 100 according to an embodiment of the present invention. The microprocessor 100 includes an instruction cache 102 and a fetch unit 104. The fetch unit 104 provides a fetch address 142 that is used to access the instruction cache 102. The fetch unit 104 outputs the fetch address 142 by selecting one of several addresses provided by different sources: the fetch address 142 itself; the next sequential fetch address 144 provided by an adder 112 that increments the fetch address 142; the predicted branch target address 146 provided by the branch target address cache (BTAC) 128; the predicted return address 148 provided by the return stack 132; the correct target address 152 provided by the execution unit 122; and the branch target address 154 provided by the instruction decoder 106. The control logic circuit 134 controls the fetch unit 104 to select one of these inputs according to the direction predictions from the first branch history table 164 and the second branch history table 126 and according to information from the branch target address cache 128. For example, the information from the branch target address cache 128 includes the direction prediction and the predicted type of the branch instruction (for example, call/return, indirect branch, conditional relative branch, or unconditional relative branch).

In response to the fetch address 142, the instruction cache 102 provides a cache line 202 of instruction bytes to the instruction decoder 106. In each clock cycle the instruction cache 102 provides a portion of the cache line 202 rather than the entire cache line 202. As shown in FIG. 2, in this embodiment each cache line 202 is 64 bytes, and in each clock cycle the instruction cache 102 provides a 16-byte portion of the cache line 202 to the instruction decoder 106 or to an instruction buffer (not shown). The instruction decoder 106 decodes the instruction bytes. In this embodiment, the instruction decoder 106 translates x86 architecture instructions into microinstructions and provides them to the instruction queue 108. When the instruction decoder 106 decodes a branch instruction whose target address is computed as an offset relative to the address of the branch instruction, the instruction decoder 106 computes the branch target address 154 and provides it to the fetch unit 104. In addition, the instruction decoder 106 provides the address of the branch instruction to the second branch history table 126. The second branch history table 126 stores direction history information about previously executed branch instructions. If the branch instruction address hits in the second branch history table 126, the second branch history table 126 predicts whether the branch instruction will be taken and sends the prediction to the control logic circuit 134. The control logic circuit 134 uses this prediction to control the fetch unit 104.
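The address arithmetic implied by a 64-byte cache line delivered in 16-byte portions can be illustrated with a short sketch. The constant and function names below are hypothetical and are not taken from the patent; only the two sizes come from the embodiment described above.

```python
CACHE_LINE_BYTES = 64      # cache line 202 size in this embodiment
FETCH_QUANTUM_BYTES = 16   # portion of the line delivered per clock cycle


def locate(fetch_address: int) -> tuple[int, int, int]:
    """Split a fetch address into (line base, portion index, byte offset).

    The names and the return layout are illustrative assumptions.
    """
    line_base = fetch_address & ~(CACHE_LINE_BYTES - 1)
    portion_index = (fetch_address % CACHE_LINE_BYTES) // FETCH_QUANTUM_BYTES
    byte_offset = fetch_address % FETCH_QUANTUM_BYTES
    return line_base, portion_index, byte_offset
```

With these sizes each cache line holds four 16-byte portions, so the portion index ranges from 0 to 3.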

The instruction queue 108 provides instructions in program order to a register alias table (RAT) 116, which maintains and generates dependency information for each instruction. The register alias table 116 dispatches instructions to a reservation station 118, which issues instructions, possibly out of program order, to the execution unit 122. The execution unit 122 executes branch instructions. The execution unit 122 also indicates whether the various branch predictors (the branch target address cache 128, the return stack 132, the second branch history table 126, and the first branch history table 164) predicted the branch instruction correctly. Based on the execution of branch instructions, the execution unit 122 also updates these branch predictors with history information. The execution unit 122 also provides the correct target address 152 to the fetch unit 104. The execution unit 122 also updates the global branch pattern 162 stored in the microprocessor 100; when the fetch address 142 hits in the first branch history table 164, the first branch history table 164 uses the global branch pattern 162 to make a direction prediction. After the execution unit 122 executes the instructions, the retire unit 124 retires them in program order, as maintained by a reorder buffer (not shown).

Referring to FIG. 3, a block diagram of the configuration of the branch target address cache 128 of FIG. 1 is shown. The branch target address cache 128 stores information about previously executed branch instructions and uses this information during subsequent executions to predict the target address, direction, and type of those branch instructions. As shown in FIG. 3, each entry 302 in the branch target address cache 128 includes a valid bit 312, a branch target address prediction 304, a direction prediction 306 (that is, whether the branch instruction will be taken or not taken), and a branch instruction type 308. In one embodiment, the branch instruction type 308 specifies whether the branch instruction is a call/return instruction, an indirect branch instruction, a conditional relative branch instruction, or an unconditional relative branch instruction. Advantageously, the update logic circuit 136 in the microprocessor 100 uses the branch instruction type 308 to replace entries 302 in the branch target address cache 128 intelligently, as described in more detail below. As shown in FIG. 3, the branch target address cache 128 can store two entries 302 (labeled "A" and "B") for each portion, or fetch quantum (for example, 16 bytes), of a cache line 202 in the instruction cache 102. In other words, the branch target address cache 128 can store prediction information for at most two branch instructions in a portion of a cache line 202. As noted above, this limit reduces branch prediction performance when a portion of a cache line 202 contains more than two branch instructions. However, the update logic circuit 136 uses an intelligent replacement policy to reduce the performance impact, as described in more detail below. In one embodiment, the branch target address cache 128 also includes a least-recently-used (LRU) bit (not shown) for each A/B entry pair, which indicates whether side A or side B was least recently used and thereby determines whether the A entry 302 or the B entry 302 is replaced. Although in this embodiment the branch target address cache 128 stores prediction information for two branch instructions per 16-byte portion of a cache line 202, the size of the portion and the number of branch instructions per portion may be changed according to design needs.
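As an illustration only, the A/B entry pair and its LRU bit can be modeled as a small data structure. The class and field names below are hypothetical; the fields mirror the valid bit 312, branch target address prediction 304, direction prediction 306, and branch instruction type 308 described above, and the rule that a write marks the written side as most recently used is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BtacEntry:
    valid: bool = False      # valid bit 312
    target: int = 0          # branch target address prediction 304
    taken: bool = False      # direction prediction 306
    branch_type: str = ""    # branch instruction type 308


@dataclass
class EntryPair:
    a: BtacEntry = field(default_factory=BtacEntry)
    b: BtacEntry = field(default_factory=BtacEntry)
    lru_is_a: bool = True    # True: side A is the least recently used side

    def free_side(self) -> Optional[str]:
        """Return the first invalid side, or None if both sides are valid."""
        if not self.a.valid:
            return "A"
        if not self.b.valid:
            return "B"
        return None

    def write(self, side: str, entry: BtacEntry) -> None:
        """Store an entry on one side and mark that side most recently used."""
        if side == "A":
            self.a = entry
        else:
            self.b = entry
        self.lru_is_a = side != "A"
```

Once both sides are valid, `free_side()` returns `None`, which corresponds to the situation where a third branch instruction in the same portion forces a replacement decision.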

Referring again to FIG. 1, when the fetch address 142 hits in the branch target address cache 128, the branch target address cache 128 provides information to the fetch unit 104, the instruction decoder 106, the return stack 132, and the control logic circuit 134. Specifically, the branch target address cache 128 provides the branch target address prediction 304 to the fetch unit 104 as the predicted branch target address 146, and provides the direction prediction 306 and the branch instruction type 308 to the control logic circuit 134. In addition, the branch instruction type 308 is passed down the pipeline along with the branch instruction, and the execution unit 122 subsequently provides the branch instruction type 308 to the update logic circuit 136 so that it can carry out the replacement policy of the branch target address cache 128, as described in more detail below.

The return stack 132 stores return addresses generated by call instructions. When the branch target address cache 128 indicates that the portion of the cache line 202 specified by the fetch address 142 contains a call instruction, a return address is stored in the return stack 132. When the branch target address cache 128 indicates that the portion of the cache line 202 specified by the fetch address 142 contains a return instruction, the return stack 132 provides the predicted return address 148 to the fetch unit 104.

The microprocessor 100 also includes a pseudo-random generator 138 that provides a pseudo-random indicator 166 to the update logic circuit 136. Advantageously, the update logic circuit 136 uses the pseudo-random indicator 166 when carrying out the replacement policy of the branch target address cache 128 in order to improve on a strictly priority-based replacement policy, as described in more detail below. In one embodiment, the pseudo-random generator 138 is a 15-bit linear feedback shift register (LFSR) that cycles in a pseudo-random order through all 2¹⁵ states except the all-zeros state, so that the generated pattern repeats only after 32767 clock cycles. When needed, 5 of the 15 bits are sampled to produce the pseudo-random indicator 166. The pseudo-random indicator 166 is therefore true on average about once every 32 clock cycles.
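The following is a minimal sketch of a 15-bit maximal-length LFSR of the kind described above. The patent does not specify the feedback taps, which bits are sampled, or how the sampled bits are combined; this sketch assumes the common polynomial x^15 + x^14 + 1, samples the low 5 bits, and treats the indicator as true when all five are 1. All function names are hypothetical.

```python
def lfsr15_step(state: int) -> int:
    """Advance a 15-bit Fibonacci LFSR by one clock (assumed taps: bits 15 and 14)."""
    new_bit = ((state >> 14) ^ (state >> 13)) & 1
    return ((state << 1) | new_bit) & 0x7FFF


def indicator(state: int) -> bool:
    """Assumed sampling rule: true when the low 5 bits are all ones."""
    return (state & 0x1F) == 0x1F


def period(seed: int = 1) -> int:
    """Count the clocks until the LFSR returns to its nonzero seed state."""
    state = lfsr15_step(seed)
    count = 1
    while state != seed:
        state = lfsr15_step(state)
        count += 1
    return count


def true_count_over_period(seed: int = 1) -> int:
    """Count how often the indicator is true across one full period."""
    state, hits = seed, 0
    for _ in range(period(seed)):
        hits += indicator(state)
        state = lfsr15_step(state)
    return hits
```

Under these assumptions the register visits every nonzero 15-bit state once per period, so the indicator is true in 1024 of the 32767 states, which matches the stated rate of roughly once every 32 clock cycles.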

Referring to FIG. 4, a diagram of the branch instruction type priorities used by the update logic circuit 136 of FIG. 1 is shown. As shown in FIG. 4, indirect branch instructions have the highest priority (they are replaced last), call/return branch instructions have the second highest priority, conditional relative branch instructions have the third highest priority, and unconditional relative branch instructions have the lowest priority (they may be replaced first).
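The priority ordering of FIG. 4 can be expressed as an ordered enumeration. This is an illustrative sketch; the enum name, member names, numeric values, and the helper function are hypothetical, and only the relative order comes from the text above.

```python
from enum import IntEnum


class BranchType(IntEnum):
    """Replacement priority by branch type; a larger value is replaced later."""
    UNCONDITIONAL_RELATIVE = 0   # lowest priority, may be replaced first
    CONDITIONAL_RELATIVE = 1     # third highest priority
    CALL_RETURN = 2              # second highest priority
    INDIRECT = 3                 # highest priority, replaced last


def outranks_both(new: BranchType, a: BranchType, b: BranchType) -> bool:
    """True if the new branch has higher priority than both stored branches."""
    return new > a and new > b
```

For example, `outranks_both(BranchType.INDIRECT, BranchType.CALL_RETURN, BranchType.CONDITIONAL_RELATIVE)` is true, while a new conditional relative branch does not outrank two stored conditional relative branches.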

The target address of a relative branch instruction is computed as an offset relative to the address of the branch instruction, and the offset is a field of the instruction itself. The instruction decoder 106 can therefore correctly compute the branch target address 154 of relative branch instructions, both conditional and unconditional. Moreover, because the direction of an unconditional relative branch instruction is already known, the instruction decoder 106 can resolve unconditional relative branch instructions accurately. Consequently, the penalty incurred when the branch target address cache 128 mispredicts an unconditional relative branch instruction is smaller than the penalty for mispredicting other types of branch instructions. In one embodiment, the misprediction penalty is about seven clock cycles in the worst case, and it may be less than seven clock cycles depending on the fullness of the instruction queue 108. This is why unconditional relative branch instructions have the lowest priority (they may be replaced first). In one embodiment, each entry 302 of the branch target address cache 128 includes a flag that indicates whether the branch instruction is an unconditional relative branch instruction. Thus, if a portion of a cache line 202 contains more than two branch instructions, the update logic circuit 136 replaces an unconditional relative branch instruction in the branch target address cache 128, and the update logic circuit 136 generally does not replace a branch instruction of another type with an unconditional relative branch instruction.

In contrast to relative branch instructions, the target address of an indirect branch instruction is computed from operands held in the general-purpose registers 168 of the microprocessor 100 or in memory locations. The instruction decoder 106 therefore does not predict indirect branch instructions, and the target address of an indirect branch instruction is computed by the execution unit 122. Consequently, the penalty incurred when the branch target address cache 128 mispredicts an indirect branch instruction is generally larger than the penalty for mispredicting other types of branch instructions. This is why indirect branch instructions have the highest priority (they are replaced last).

In addition, replacing a call/return instruction in the branch target address cache 128 when the return stack 132 holds a valid return address for that call/return instruction causes the return stack 132 to become misaligned, which makes it likely that the return stack 132 will subsequently mispredict and thereby degrades performance. This is why call/return instructions have the second highest priority.

Finally, although conditional relative branch instructions are predicted by the instruction decoder 106 (target address), the second branch history table 126 (direction), and the branch target address cache 128, the branch target address cache 128 in this embodiment is larger than the second branch history table 126, so the direction prediction of the branch target address cache 128 is more accurate. In addition, removing a conditional relative branch instruction from the branch target address cache 128 introduces errors into the global branch pattern 162. For these reasons, conditional relative branch instructions rank above unconditional relative branch instructions and have the third highest priority.

Referring to FIG. 5, a flowchart of the operation of the microprocessor 100 of FIG. 1 is shown. Flow begins at step 502.

In step 502, the execution unit 122 executes a new branch instruction and provides the related information to the update logic circuit 136. Flow proceeds to step 504.

In step 504, the update logic circuit 136 uses the address of the branch instruction to index into the branch target address cache 128. Flow proceeds to decision step 506.

In decision step 506, the update logic circuit 136 checks the valid bits 312 of the A entry 302 and the B entry 302 to determine whether the same portion of the cache line 202 contains more than two branch instructions. If so, flow proceeds to step 512; otherwise, flow proceeds to step 508.

In step 508, the update logic circuit 136 updates the branch target address cache 128 with the execution information associated with the branch instruction. In other words, the update logic circuit 136 writes to whichever of the A entry 302 and the B entry 302 is invalid. Flow ends at step 508.

In step 512, the update logic circuit 136 examines the branch instruction type 308 of the branch instruction, as provided by the execution unit 122, and the branch instruction types 308 of the two valid branch instructions in the A entry 302 and the B entry 302 (depending on the embodiment, the types of the two valid branch instructions come from the branch target address cache 128 or from the execution unit 122). Flow proceeds to decision step 514.

In decision step 514, the update logic circuit 136 determines whether the branch instruction type 308 of the branch instruction has higher priority than the branch instruction types 308 of the two valid branch instructions in the A entry 302 and the B entry 302. If so, flow proceeds to step 516; otherwise, flow proceeds to step 518.

In step 516, the update logic circuit 136 updates the branch target address cache 128 with the execution information associated with the branch instruction. In other words, the update logic circuit 136 replaces one of the two valid branch instructions in the A entry 302 and the B entry 302. In one embodiment, the update logic circuit 136 selects the A entry 302 or the B entry 302 of the indexed set and selected way according to the LRU bit. Flow ends at step 516.

参考步骤518，更新逻辑电路136检查虚拟随机指针166。流程前进至判断步骤522。Referring to step 518 , theupdate logic circuit 136 checks the virtualrandom pointer 166 . Flow proceeds todecision step 522 .

在判断步骤522中，更新逻辑电路136判断上述分支指令是否为一非条件相对型式的分支指令。若是，流程前进至判断步骤524；若否，流程前进至判断步骤532。In thedetermination step 522, theupdate logic circuit 136 determines whether the above-mentioned branch instruction is an unconditional relative type branch instruction. If yes, the process proceeds todecision step 524 ; if not, the process proceeds todecision step 532 .

在判断步骤524中，更新逻辑电路136检查虚拟随机指针166是否为真值。若是，流程前进至步骤526；若否，流程前进至步骤528。Indecision step 524, updatelogic circuit 136 checks whether virtualrandom pointer 166 is true. If yes, the process proceeds to step 526 ; if not, the process proceeds to step 528 .

在步骤526中，更新逻辑电路136使用新执行的分支指令的分支信息来更新分支目标地址快取128。流程结束于步骤526。Instep 526, theupdate logic circuit 136 updates the branchtarget address cache 128 with the branch information of the newly executed branch instruction. The flow ends atstep 526 .

在步骤528中，更新逻辑电路136不使用新执行的分支指令的分支信息来更新分支目标地址快取128。流程结束于步骤528。Instep 528, theupdate logic circuit 136 updates the branchtarget address cache 128 without using the branch information of the newly executed branch instruction. The process ends atstep 528 .

在判断步骤532中，更新逻辑电路136判断三个分支指令(即新执行的分支指令以及A项目302与B项目302中的两个分支指令)是否皆为条件相对分支指令。若是，流程前进至判断步骤534；若否，流程前进至步骤528。In thedetermination step 532 , theupdate logic circuit 136 determines whether the three branch instructions (ie, the newly executed branch instruction and the two branch instructions in theA entry 302 and the B entry 302 ) are all conditional relative branch instructions. If yes, the process proceeds todecision step 534 ; if not, the process proceeds to step 528 .

In decision step 534, the update logic circuit 136 determines whether the instruction decoder 106 or the second branch history table 126 correctly predicted the newly executed branch instruction. If so, flow proceeds to decision step 524; otherwise, flow proceeds to step 526.
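The decision flow of steps 518 through 534 can be summarized in a short software model. The following Python sketch is illustrative only and is not the circuit described here: the names `BranchType`, `should_update_btac`, `indicator`, and `predicted_ok` are hypothetical, and the model assumes the caller has already established that both the A entry and the B entry hold valid branch instructions.

```python
from enum import Enum, auto


class BranchType(Enum):
    """Hypothetical classification of branch instructions for this sketch."""
    CONDITIONAL_RELATIVE = auto()    # e.g. an x86 JCC instruction
    UNCONDITIONAL_RELATIVE = auto()  # e.g. an x86 JMP instruction
    OTHER = auto()                   # any other branch type


def should_update_btac(new_type, a_type, b_type, indicator, predicted_ok):
    """Return True if flow ends at step 526 (update), False if at step 528.

    indicator    -- value of the pseudo-random indicator examined in step 518
    predicted_ok -- whether the decoder or second branch history table
                    correctly predicted the newly executed branch (step 534)
    """
    # Step 522: an unconditional relative branch is gated by the indicator.
    if new_type is BranchType.UNCONDITIONAL_RELATIVE:
        return indicator  # step 524 -> 526 if true, 528 otherwise

    # Step 532: are all three branches conditional relative?
    all_conditional = all(
        t is BranchType.CONDITIONAL_RELATIVE for t in (new_type, a_type, b_type)
    )
    if not all_conditional:
        return False  # step 528: no update

    # Step 534: a correctly predicted branch is gated by the indicator,
    # while a mispredicted one always replaces an existing entry.
    if predicted_ok:
        return indicator  # step 524
    return True  # step 526
```

In this model, a mispredicted conditional branch competing with two other conditional branches always updates the cache, whereas an unconditional relative branch or a correctly predicted conditional branch updates it only when the indicator happens to be true.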

The inventor has observed that when a partial cache line 202 contains three branch instructions, a program sometimes executes its instructions in a sequence that causes all three branch instructions to be executed repeatedly, so that each may replace another branch instruction in the branch target address cache 128. Most of the time, however, only two (or one) of the three branch instructions are executed, which hurts the performance of the strict priority-based replacement policy of steps 502 through 516. For example, suppose a program has an outer loop and an inner loop, where the outer loop includes a conditional relative branch instruction (for example, a first x86 JCC instruction) and the inner loop includes a second x86 JCC instruction and an unconditional relative branch instruction (for example, an x86 JMP instruction); the inner loop follows the first x86 JCC instruction, and the x86 JMP instruction follows the second x86 JCC instruction. In this case, it is generally desirable for the A entry 302 and the B entry 302 of the branch target address cache 128 to hold the branch instructions of the inner loop (the second x86 JCC instruction and the x86 JMP instruction) rather than the branch instruction of the outer loop (the first x86 JCC instruction). However, because an x86 JCC instruction ranks higher than an x86 JMP instruction in replacement priority, under the strict priority-based replacement policy the A entry 302 and the B entry 302 of the branch target address cache 128 will hold the two x86 JCC instructions, and the update logic circuit 136 will never replace either of them with the x86 JMP instruction. This result is undesirable.

To reduce this performance impact, the pseudo-random generator 138 provides the pseudo-random indicator 166 to the update logic circuit 136, as described in steps 518 through 528 above. Notably, the pseudo-random indicator 166 varies regularly with the clock cycles of the microprocessor 100, but because most programs do not execute a given branch instruction at regular clock-cycle intervals, the pseudo-random indicator 166 is effectively random with respect to the execution of the branch instruction.

Thus, assuming the pseudo-random indicator 166 is true on average about once every 32 clock cycles, the replacement policy implemented in steps 518 through 528 causes the update logic circuit 136 to replace the first x86 JCC instruction of the outer loop with the x86 JMP instruction on the 32nd execution instance of the inner loop, and the x86 JMP instruction of the inner loop then remains in the branch target address cache 128 until the first x86 JCC instruction of the outer loop is executed again.
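One simple way to obtain an indicator that varies regularly with the clock and is true once every 32 cycles is a free-running counter. The Python sketch below models that idea purely as an assumption for illustration; the text does not state how the generator is built, and the names `PseudoRandomIndicator`, `tick`, and `count_true` are hypothetical.

```python
class PseudoRandomIndicator:
    """Assumed model: a free-running counter that is true once per period."""

    def __init__(self, period=32):
        self.period = period
        self.count = 0

    def tick(self):
        """Return the indicator value for this clock cycle, then advance."""
        value = self.count == 0
        self.count = (self.count + 1) % self.period
        return value


def count_true(cycles, period=32):
    """Count how many of the first `cycles` clock cycles see a true indicator."""
    indicator = PseudoRandomIndicator(period)
    return sum(indicator.tick() for _ in range(cycles))
```

Under this assumed model the indicator is fully deterministic with respect to the clock. It appears random only from the point of view of a branch whose execution times are not aligned with the 32-cycle period, which matches the reasoning given above.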

In addition, if a given partial cache line 202 contains three x86 JCC instructions, the update logic circuit 136 checks whether the instruction decoder 106 or the second branch history table 126 correctly predicted the x86 JCC instruction. If the instruction was correctly predicted, then according to steps 532, 534, and 528 the update logic circuit 136 generally does not replace either of the other two x86 JCC instructions. Because, in this embodiment, the second branch history table 126 is smaller and uses a less complex algorithm than the branch target address cache 128 and the first branch history table 164, hard-to-predict x86 JCC instructions need to be stored in the branch target address cache 128, whose direction prediction is the most accurate. However, to avoid a situation like the one described above (two of the three x86 JCC instructions are seen frequently and one is rarely executed), according to steps 532, 534, and 526 the update logic circuit 136 allows a well-behaved x86 JCC instruction (that is, an x86 JCC instruction in the inner loop that is correctly predicted by the instruction decoder 106 or the second branch history table 126) to go ahead and replace one of the other x86 JCC instructions, typically on the 32nd execution instance of the inner loop.

Although the present invention has been disclosed above through various embodiments, they are presented only as examples and do not limit the scope of the invention. Those skilled in the art may make minor changes and refinements without departing from the spirit and scope of the invention. For example, software can be used to realize the function, construction, modeling, simulation, description, and/or testing of the apparatus and methods described herein. This can be accomplished using general programming languages (for example, C or C++), hardware description languages (including Verilog, VHDL, and the like), or other available programs. Such software can be stored on any computer-usable medium, such as semiconductor memory, magnetic disk, or optical disc (for example, CD-ROM or DVD-ROM). The apparatus and methods described in the embodiments may be included in a semiconductor intellectual property core, such as a microprocessor core implemented in a hardware description language (HDL), and transformed into hardware in the form of an integrated circuit product. In addition, the apparatus and methods described herein may be implemented as a combination of hardware and software. Accordingly, the invention should not be limited by any of the embodiments described herein, but should be defined by the appended claims and their equivalents. In particular, the invention is implemented in a microprocessor device of a general-purpose computer. Finally, because those skilled in the art may make minor changes and refinements without departing from the spirit and scope of the invention, the scope of protection of the invention is defined by the appended claims.