CROSS REFERENCE TO RELATED APPLICATIONS This application is related to the following U.S. patent applications, having a common filing date and a common assignee. Each of these applications is hereby incorporated by reference in its entirety for all purposes:
| Docket # | Serial # | Title |
| --- | --- | --- |
| CNTR:2020 | | APPARATUS AND METHOD FOR DENSELY PACKING A BRANCH INSTRUCTION PREDICTED BY A BRANCH TARGET ADDRESS CACHE AND ASSOCIATED TARGET INSTRUCTIONS INTO A BYTE-WIDE INSTRUCTION BUFFER |
| CNTR:2024 | | APPARATUS AND METHOD FOR SELECTIVELY ACCESSING DISPARATE INSTRUCTION BUFFER STAGES BASED ON BRANCH TARGET ADDRESS CACHE HIT AND INSTRUCTION STAGE WRAP |
FIELD OF THE INVENTION This invention relates in general to the field of branch target address caching in pipelined microprocessors, and more particularly to branch instructions that wrap across instruction cache lines.
BACKGROUND OF THE INVENTION Pipelined microprocessors include multiple pipeline stages, each stage performing a different function necessary in the execution of program instructions. Typical pipeline stage functions are instruction fetch, instruction decode, instruction execution, memory access, and result write-back.
The instruction fetch stage fetches the next instruction in the currently executing program. The next instruction is typically the instruction with the next sequential memory address. However, in the case of a taken branch instruction, the next instruction is the instruction at the memory address specified by the branch instruction, commonly referred to as the branch target address. The instruction fetch stage fetches instructions from an instruction cache. If the instructions are not present in the instruction cache, they are fetched into the instruction cache from another memory higher up in the memory hierarchy of the machine, such as from a higher-level cache or from system memory. The fetched instructions are provided to the instruction decode stage.
The instruction decode stage includes instruction decode logic that decodes the instruction bytes received from the instruction fetch stage. In the case of a processor that supports variable length instructions, such as an x86 architecture processor, one function of the instruction decode stage is to format a stream of instruction bytes into separate instructions. Formatting a stream of instructions includes determining the length of each instruction. That is, instruction format logic receives a stream of undifferentiated instruction bytes from the instruction fetch stage and formats, or parses, the stream of instruction bytes into individual groups of bytes. Each group of bytes is an instruction, and the instructions make up the program being executed by the processor. The instruction decode stage may also include translating macro-instructions, such as x86 instructions, into micro-instructions that are executable by the remainder of the pipeline.
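As a rough illustration of the formatting function described above, the following C sketch (not part of this application; instr_length is a hypothetical helper standing in for a full x86 length decoder) shows how format logic can walk a stream of undifferentiated instruction bytes, using each instruction's decoded length to locate the next instruction boundary.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical length decoder: returns the byte length of the instruction
 * beginning at buf[pos]. A real x86 length decoder examines prefixes, the
 * opcode, ModR/M, SIB, displacement, and immediate fields. */
size_t instr_length(const uint8_t *buf, size_t pos);

/* Walk an undifferentiated byte stream and record instruction boundaries. */
size_t mark_boundaries(const uint8_t *buf, size_t len,
                       size_t *starts, size_t max_starts)
{
    size_t pos = 0, n = 0;
    while (pos < len && n < max_starts) {
        starts[n++] = pos;                    /* this byte begins an instruction */
        size_t bytes = instr_length(buf, pos);
        if (bytes == 0)                       /* defensive guard for the sketch  */
            break;
        pos += bytes;                         /* skip to the next instruction    */
    }
    return n;                                 /* number of instructions found    */
}
```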
The execution stage includes execution logic that executes the formatted and decoded instructions received from the instruction decode stage. The execution logic operates on data retrieved from a register set of the processor and/or from memory. The write-back stage stores the results produced by the execution logic into the processor register set.
An important aspect of pipelined processor performance is keeping each stage of the processor busy performing the function it was designed to perform. In particular, if the instruction fetch stage does not provide instruction bytes when the instruction decode stage is ready to decode the next instruction, then processor performance will suffer. In order to prevent starvation of the instruction decode stage, an instruction buffer is commonly placed between the instruction cache and instruction format logic. The instruction fetch stage attempts to keep several instructions worth of instruction bytes in the instruction buffer so that the instruction decode stage will have instruction bytes to decode, rather than starving.
An instruction cache typically provides one cache line of instruction bytes, commonly 16 or 32 bytes, at a time. The instruction fetch stage fetches one or more cache lines of instruction bytes from the instruction cache and stores the cache lines into the instruction buffer. When the instruction decode stage is ready to decode an instruction, it accesses the instruction bytes in the instruction buffer, rather than having to wait on the instruction cache.
The instruction cache provides a cache line of instruction bytes selected by a fetch address supplied to the instruction cache by the instruction fetch stage. During normal program operation, the fetch address is simply incremented by the size of a cache line since it is anticipated that program instructions are executed sequentially. The incremented fetch address is referred to as the next sequential fetch address. However, if a branch instruction is decoded by the instruction decode logic and the branch instruction is taken (or predicted taken), then the fetch address is updated to the target address of the branch instruction (modulo the cache line size), rather than being updated to the next sequential fetch address.
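The fetch-address update just described can be summarized by the following minimal C sketch. It is an illustrative model only; the names CACHE_LINE_SIZE, taken_branch, and target are assumptions introduced for illustration rather than elements of this application.

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 32u  /* assumed cache line size in bytes */

/* Compute the next instruction-cache fetch address. */
uint32_t next_fetch_address(uint32_t fetch_addr, bool taken_branch, uint32_t target)
{
    if (taken_branch)
        return target & ~(CACHE_LINE_SIZE - 1); /* taken branch: fetch the line holding the target */
    return fetch_addr + CACHE_LINE_SIZE;        /* otherwise: the next sequential cache line        */
}
```

The masking of the low-order bits reflects the parenthetical note above that the target address is applied modulo the cache line size, so that a whole cache line is selected.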
However, by the time the fetch address is updated to the branch target address, the instruction buffer has likely been populated with instruction bytes of the next sequential instructions after the branch instruction. Because a branch has occurred, the instructions after the branch instruction must not be decoded and executed. That is, proper program execution requires the instructions at the branch target address to be executed, not the next sequential instructions after the branch instruction. The instruction bytes in the instruction buffer were erroneously pre-fetched in anticipation of the more typical case of sequential instruction flow in the program. To remedy this error, the processor must flush all instruction bytes behind the branch instruction, which includes the instruction bytes in the instruction buffer.
Flushing the instruction buffer upon a taken branch instruction is costly since now the instruction decode stage will be starved until the instruction buffer is re-populated from the instruction cache. One solution to this problem is to branch prior to decoding the branch instruction. This may be accomplished by employing a branch target address cache (BTAC) that caches fetch addresses of instruction cache lines containing previously executed branch instructions and their associated target addresses.
The instruction cache fetch address is applied to the BTAC essentially in parallel with the application of the fetch address to the instruction cache. In the case of an instruction cache fetch address of a cache line containing a branch instruction, the cache line is provided to the instruction buffer. In addition, if the fetch address hits in the BTAC, the BTAC provides an associated branch target address. If the branch instruction hitting in the BTAC is predicted taken, the instruction cache fetch address is updated to the target address provided by the BTAC. Consequently, the cache line containing the target instructions, i.e., the instructions at the target address, will be stored in the instruction buffer behind the cache line containing the branch instruction.
However, the situation is complicated by the fact that in processors that execute variable length instructions, the branch instruction may wrap across two cache lines. That is, the first part of the branch instruction bytes may be contained in a first cache line, and the second part of the branch instruction bytes may be contained in the next cache line. Therefore, the next sequential fetch address must be applied to the instruction cache rather than the target address in order to obtain the cache line with the second part of the branch instruction. Then the target address must somehow be applied to the instruction cache to obtain the target instructions.
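To make the wrap condition concrete, a variable-length branch spills into the next cache line exactly when its starting byte offset within a line plus its length exceeds the line size. A minimal sketch, assuming a 32-byte cache line:

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 32u  /* assumed cache line size in bytes */

/* True if a branch starting at byte 'offset' within a cache line,
 * with a total length of 'len' bytes, wraps into the next cache line. */
bool branch_wraps(uint32_t offset, uint32_t len)
{
    return offset + len > CACHE_LINE_SIZE;
}
/* Example: a 2-byte conditional jump whose opcode is the last byte of a
 * line (offset 31, length 2) wraps, since 31 + 2 > 32. */
```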
Therefore, what is needed is a branch control apparatus that provides proper program operation in the case of wrapping BTAC branches.
SUMMARY The present invention provides a branch control apparatus in a pipelined processor that provides proper program operation in the case of wrapping BTAC branches. Accordingly, in attainment of the aforementioned object, it is a feature of the present invention to provide a branch control apparatus in a microprocessor having an instruction cache, coupled to an address bus, for providing cache lines to an instruction buffer. The apparatus includes a target address of a branch instruction. A branch target address cache (BTAC) provides the target address. The apparatus also includes a wrap signal, provided by the BTAC, which indicates whether the branch instruction wraps across first and second cache lines. The apparatus also includes an address register, coupled to the BTAC, that stores the target address. If the wrap signal indicates the branch instruction wraps across the first and second cache lines, the address register provides the target address on the address bus to the instruction cache to select a third cache line. The third cache line contains a target instruction of the branch instruction.
In another aspect, it is a feature of the present invention to provide a pipelined microprocessor. The microprocessor includes an instruction cache, coupled to an address bus that receives a first fetch address for selecting a first cache line. The microprocessor also includes a branch target address cache (BTAC), coupled to the address bus, which provides a wrap indicator for indicating whether a branch instruction wraps beyond the first cache line. The microprocessor also includes an address register, coupled to the BTAC, that stores a target address of the branch instruction. The target address is provided by the BTAC. The microprocessor also includes a multiplexer, coupled to the BTAC, which selects a second fetch address for provision on the address bus if the wrap indicator is true. The second fetch address selects a second cache line containing a portion of the branch instruction wrapping beyond the first cache line. The multiplexer selects the target address from the address register for provision on the address bus after selecting the second fetch address for provision on the address bus.
In another aspect, it is a feature of the present invention to provide a branch control apparatus in a microprocessor. The branch control apparatus includes a branch target address cache (BTAC) that caches indications of whether previously executed branch instructions wrap across two cache lines. The branch control apparatus also includes a register, coupled to the BTAC, that receives from the BTAC a target address of one of the previously executed instructions. The branch control apparatus also includes control logic, coupled to the BTAC, that receives one of the indications. If the one of the indications indicates the one of the previously executed branch instructions wraps across two cache lines, the control logic causes the microprocessor to branch to the target address, after causing the two cache lines containing the one of the previously executed branch instructions to be fetched.
In another aspect, it is a feature of the present invention to provide a microprocessor branch control apparatus. The branch control apparatus includes an incrementer, coupled to an instruction cache address bus, that provides a first fetch address on the address bus. The first fetch address selects a first cache line containing a first portion of a branch instruction. The branch control apparatus also includes a branch target address cache (BTAC), coupled to the address bus, which provides a target address of the branch instruction in response to the first fetch address. The branch control apparatus also includes an address register, coupled to the BTAC, that stores the target address if the BTAC indicates the branch instruction wraps beyond the first cache line. The incrementer provides a second fetch address on the address bus. The second fetch address selects a second cache line containing a second portion of the branch instruction. The address register provides the target address on the address bus. The target address selects a third cache line containing a target instruction of the branch instruction.
In another aspect, it is a feature of the present invention to provide a method for performing branches in a microprocessor with an instruction cache. The method includes applying a first fetch address to the instruction cache for selecting a first cache line containing at least a portion of a branch instruction, providing a target address of the branch instruction in response to the first fetch address, and determining whether the branch instruction wraps beyond the first cache line. The method also includes storing the target address in a register if the branch instruction wraps beyond the first cache line, applying a second fetch address to the instruction cache, if the branch instruction wraps beyond the first cache line, for selecting a second cache line containing a remainder of the branch instruction, and providing the target address from the register to the instruction cache for selecting a third cache line containing a target instruction of the branch instruction.
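The claimed method can be paraphrased informally by the following C sketch. It is a simplified model under stated assumptions, not the claimed apparatus: btac_lookup, fetch_line, and CACHE_LINE_SIZE are hypothetical names introduced only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 32u              /* assumed cache line size in bytes */

typedef struct {
    bool     hit;     /* fetch address found in the BTAC            */
    bool     wrap;    /* cached branch spills into the next line    */
    uint32_t target;  /* cached branch target address               */
} btac_result_t;

/* Hypothetical helpers standing in for the BTAC lookup and the
 * instruction cache access; not elements of the claims. */
btac_result_t btac_lookup(uint32_t fetch_addr);
void fetch_line(uint32_t fetch_addr);

/* Informal model of the claimed fetch sequence. */
void fetch_sequence(uint32_t fetch_addr_a)
{
    btac_result_t r = btac_lookup(fetch_addr_a);    /* target address and wrap indication     */
    fetch_line(fetch_addr_a);                       /* first line: holds (part of) the branch */
    if (r.hit && r.wrap) {
        uint32_t saved_target = r.target;           /* target held in the address register    */
        fetch_line(fetch_addr_a + CACHE_LINE_SIZE); /* second line: remainder of the branch   */
        fetch_line(saved_target);                   /* third line: the target instruction     */
    } else if (r.hit) {
        fetch_line(r.target);                       /* non-wrapping branch: fetch target next */
    }
}
```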
An advantage of the present invention is that it potentially improves branch performance in a pipelined microprocessor that uses a BTAC by enabling the processor to take a BTAC branch even if the branch wraps across multiple cache lines. The invention enables wrapped branching even in processors that do not have stalling circuitry in the pre-decode stages of the processor, thereby avoiding the branch penalty associated with mispredicting the branch as not taken and subsequently correcting for the misprediction. The avoidance of the branch penalty is particularly advantageous in a processor having a large number of pipeline stages.
Other features and advantages of the present invention will become apparent upon study of the remaining portions of the specification and drawings.
BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a block diagram illustrating a pipelined microprocessor according to the present invention.
FIG. 2 is a block diagram of portions of the pipelined microprocessor of FIG. 1 including a branch control apparatus according to the present invention.
FIG. 3 is a table illustrating two cache lines containing a branch instruction that wraps across the two cache lines according to the present invention.
FIG. 4 is a flowchart illustrating operation of the branch control apparatus of FIG. 2 according to the present invention.
FIGS. 5, 6, and 7 are timing diagrams illustrating examples of operation of the branch control apparatus of FIG. 2 according to the flowchart of FIG. 4 according to the present invention.
FIG. 8 is a flowchart illustrating operation of the branch control apparatus of FIG. 2 according to an alternate embodiment of the present invention.
FIG. 9 is a timing diagram illustrating an example of operation of the branch control apparatus of FIG. 2 according to the flowchart of FIG. 8 according to the present invention.
DETAILED DESCRIPTION Referring now to FIG. 1, a block diagram illustrating a pipelined microprocessor 100 according to the present invention is shown. The processor pipeline 100 includes a plurality of stages 101 through 132. In one embodiment, the microprocessor 100 comprises an x86 architecture processor.
The first stage of the microprocessor 100 is the C-stage 101, or instruction cache address generation stage. The C-stage 101 generates a fetch address 162 that selects a cache line in an instruction cache 202 (see FIG. 2).
The next stage is the I-stage 102, or instruction fetch stage. The I-stage 102 is the stage where the processor 100 provides the fetch address 162 to the instruction cache 202 (see FIG. 2) in order to fetch instructions for the processor 100 to execute. The instruction cache 202 is described in more detail with respect to FIG. 2. In one embodiment, the instruction cache 202 is a two-cycle cache. A B-stage 104 is the second stage of the instruction cache 202 access. The instruction cache 202 provides its data to a U-stage 106, where the data is latched in. The U-stage 106 provides the instruction cache data to a V-stage 108.
In the present invention, the processor 100 further comprises a speculative branch target address cache (BTAC) 216 (see FIG. 2), described in detail below. The BTAC 216 is accessed in parallel with the instruction cache 202 in the I-stage 102 using the instruction cache 202 fetch address 162, thereby enabling relatively fast branching to reduce branch penalty. The BTAC 216 provides a speculative branch target address 152 that is provided to the I-stage 102. The processor 100 selectively chooses the target address 152 as the instruction cache 202 fetch address to achieve a branch to the speculative target address 152.
Advantageously, as may be seen from FIG. 1, the branch target address 152 supplied by the branch target address cache 216 in the U-stage 106 enables the processor 100 to branch relatively early in the pipeline 100, creating only a two-cycle instruction bubble. That is, if the processor 100 branches to the speculative target address 152, only two stages worth of instructions must be flushed. In other words, within two cycles, the target instructions of the branch will be available at the U-stage 106 in the typical case, i.e., if the target instructions are present in the instruction cache 202.
Advantageously, in most cases, the two-cycle bubble is small enough that an instruction buffer 142, F-stage instruction queue 144, and/or X-stage instruction queue 146, described below, may absorb the bubble. Consequently, in many cases, the speculative BTAC 216 enables the processor 100 to achieve zero-penalty branches.
The V-stage 108 is the stage in which instructions are written to the instruction buffer 142. The instruction buffer 142 buffers instructions for provision to an F-stage 112. The instruction buffer 142 comprises a plurality of stages, or registers, for storing instruction bytes received from the instruction cache 202. In one embodiment, the instruction buffer 142 is capable of buffering 128 instruction bytes. In one embodiment, the instruction buffer 142 is similar to the instruction buffer described in the U.S. patent application entitled APPARATUS AND METHOD FOR SELECTIVELY ACCESSING DISPARATE INSTRUCTION BUFFER STAGES BASED ON BRANCH TARGET ADDRESS CACHE HIT AND INSTRUCTION STAGE WRAP, incorporated by reference above. The V-stage 108 also includes decode logic for providing information about the instruction bytes to the instruction buffer 142, such as x86 prefix and mod R/M information, and whether an instruction byte is a branch opcode value.
The F-stage 112, or instruction format stage 112, includes instruction format logic 214 (see FIG. 2) for formatting instructions. Preferably, the processor 100 is an x86 processor, which allows for variable length instructions in its instruction set. The instruction format logic 214 receives a stream of instruction bytes from the instruction buffer 142 and parses the stream into discrete groups of bytes, each constituting an x86 instruction, and in particular provides the length of each instruction.
The F-stage 112 also includes branch instruction target address calculation logic for generating a non-speculative branch target address 154 based on an instruction decode, rather than speculatively based on the instruction cache 202 fetch address as the BTAC 216 in the I-stage 102 does. The F-stage 112 non-speculative address 154 is provided to the I-stage 102. The processor 100 selectively chooses the F-stage 112 non-speculative address 154 as the instruction cache 202 fetch address to achieve a branch to the non-speculative address 154.
An F-stage instruction queue 144 receives the formatted instructions. Formatted instructions are provided by the F-stage instruction queue 144 to an instruction translator in the X-stage 114.
The instruction translator of the X-stage 114, or translation stage 114, translates x86 macroinstructions into microinstructions that are executable by the remainder of the pipeline stages. The translated microinstructions are provided by the X-stage 114 to an X-stage instruction queue 146.
The X-stage instruction queue 146 provides translated microinstructions to an R-stage 116, or register stage 116. The R-stage 116 includes the user-visible x86 register set, in addition to other non-user-visible registers. Instruction operands for the translated microinstructions are stored in the R-stage 116 registers for execution of the microinstructions by subsequent stages of the pipeline 100.
An A-stage 118, or address stage 118, includes address generation logic that receives operands and microinstructions from the R-stage 116 and generates addresses required by the microinstructions, such as memory addresses for load/store microinstructions.
A D-stage 122, or data stage 122, includes logic for accessing data specified by the addresses generated by the A-stage 118. In particular, the D-stage 122 includes a data cache for caching data within the processor 100 from a system memory. In one embodiment, the data cache is a two-cycle cache. The D-stage 122 provides the data cache data to an E-stage 126.
The E-stage 126, or execution stage 126, includes execution logic, such as arithmetic logic units, for executing the microinstructions based on the data and operands provided from previous stages. In particular, the E-stage 126 produces a resolved target address 156 of all branch instructions. That is, the E-stage 126 target address 156 is known to be the correct target address of all branch instructions, with which all predicted target addresses must match. In addition, the E-stage 126 produces a resolved direction for all branch instructions, i.e., whether the branch is taken or not taken.
An S-stage 128, or store stage 128, performs a store to memory of the results of the microinstruction execution received from the E-stage 126. In addition, the target address 156 of branch instructions calculated in the E-stage 126 is provided to the instruction cache 202 in the I-stage 102 from the S-stage 128. Furthermore, the BTAC 216 of the I-stage 102 is updated from the S-stage 128 with the resolved target addresses of branch instructions executed by the pipeline 100 for caching in the BTAC 216. In addition, other speculative branch information (SBI) 236 (see FIG. 2) is updated in the BTAC 216 from the S-stage 128. The speculative branch information 236 includes the branch instruction length, the location of the branch instruction within an instruction cache 202 line, whether the branch instruction wraps over multiple instruction cache 202 lines, whether the branch is a call or return instruction, and information used to predict the direction of the branch instruction.
A W-stage 132, or write-back stage 132, writes back the result from the S-stage 128 into the R-stage 116 registers, thereby updating the processor 100 state.
The instruction buffer 142, F-stage instruction queue 144, and X-stage instruction queue 146, among other things, serve to minimize the impact of branches upon the clocks-per-instruction value of the processor 100.
Referring now to FIG. 2, a block diagram of portions of the pipelined microprocessor 100 of FIG. 1 including a branch control apparatus according to the present invention is shown.
The microprocessor 100 includes an instruction cache 202 that caches instruction bytes. The instruction cache 202 comprises an array of cache lines for storing instruction bytes. The array of cache lines is indexed by the fetch address 162 of FIG. 1. That is, the fetch address 162 selects one of the cache lines in the array. The instruction cache 202 provides the selected cache line of instruction bytes to the instruction buffer 142 of FIG. 1 via a data bus 242.
In one embodiment, the instruction cache 202 comprises a 64 KB 4-way set associative cache, with 32-byte cache lines in each way. In one embodiment, one half of the selected cache line of instruction bytes is provided by the instruction cache 202 at a time, i.e., 16 bytes are provided during each of two separate periods. In one embodiment, the instruction cache 202 is similar to an instruction cache described in U.S. patent application Ser. No. ______ entitled SPECULATIVE BRANCH TARGET ADDRESS CACHE (docket number CNTR:2021), having a common assignee, and which is hereby incorporated by reference in its entirety for all purposes. The instruction cache 202 generates a true value on a MISS signal 204 if the fetch address 162 misses in the instruction cache 202.
The microprocessor 100 also includes a bus interface unit (BIU) 206 that fetches cache lines from a memory via a data bus 266. In particular, the BIU 206 fetches cache lines from the memory if the instruction cache 202 generates a true value on the MISS signal 204. The instruction cache 202 also provides the MISS signal 204 to the BIU 206.
The microprocessor 100 also includes a response buffer 208. The response buffer 208 receives cache lines from the BIU 206. The response buffer 208 also receives cache lines from a level-2 cache via a data bus 212. The response buffer 208 provides cache lines of instruction bytes to the instruction buffer 142 via a data bus 244. When the response buffer 208 has a cache line of instruction bytes to provide to the instruction buffer 142, the response buffer 208 generates a true value on an RBRDY signal 238.
When a cache line is stored into the instruction buffer 142, either from the instruction cache 202 or from the response buffer 208, such that the instruction buffer 142 becomes full, the instruction buffer 142 generates a true value on a FULL signal 246 to indicate that it cannot presently accept more instruction bytes.
The microprocessor 100 also includes instruction format logic 214. The instruction format logic 214 receives instruction bytes from the instruction buffer 142. The instruction format logic 214 formats, or parses, the received instruction bytes into instructions. In particular, the instruction format logic 214 determines the size in bytes of each instruction. The instruction format logic 214 provides the length of the currently formatted instruction via an instruction length signal 248. The instruction format logic 214 provides the formatted instructions to the remainder of the microprocessor 100 pipeline for further decode and execution. In one embodiment, the instruction format logic 214 is capable of formatting multiple instructions per microprocessor 100 clock cycle.
The microprocessor 100 also includes a branch target address cache (BTAC) 216. The BTAC 216 also receives the instruction cache 202 fetch address 162. The BTAC 216 comprises an array of storage elements for caching fetch addresses of previously executed branch instructions and their associated branch target addresses. The storage elements also store other speculative branch information related to the branch instructions whose target addresses are cached. In particular, the storage elements store an indication of whether a multi-byte branch instruction wraps across two instruction cache lines. The fetch address 162 indexes the array of storage elements in the BTAC 216 to select one of the storage elements.
The BTAC 216 outputs the target address 152 of FIG. 1 and speculative branch information (SBI) 236 from the storage element selected by the fetch address 162. In one embodiment, the SBI 236 includes the branch instruction length, the location of the branch instruction in the cache line, whether the branch is a call or return instruction, and a prediction of whether the branch instruction will be taken or not taken.
The BTAC 216 also outputs a HIT signal 234 that indicates whether the fetch address 162 hit in the BTAC 216. In one embodiment, the BTAC 216 is similar to a BTAC described in the U.S. patent application entitled SPECULATIVE BRANCH TARGET ADDRESS CACHE, which is incorporated by reference above. In one embodiment, the BTAC 216 is a speculative BTAC because the microprocessor 100 branches to the target address 152 provided by the BTAC 216 before the cache line provided by the instruction cache 202 is decoded, i.e., before it is known whether a branch instruction is even present in the cache line selected by the fetch address. That is, the microprocessor 100 speculatively branches even though the possibility exists that no branch instruction is present in the cache line selected by the fetch address hitting in the BTAC 216.
The BTAC 216 also outputs a WRAP signal 286, which specifies whether the branch instruction wraps across two cache lines. The WRAP signal 286 value is cached in the BTAC 216, along with the branch instruction target address, after execution of the branch instruction.
Referring now to FIG. 3, a table illustrating two cache lines containing a branch instruction that wraps across the two cache lines is shown. The table shows a first cache line, denoted cache line A 302, whose last instruction byte contains an opcode byte for an x86 JCC (conditional jump) instruction. The table also shows a second cache line, denoted cache line B 304, whose first instruction byte contains a signed displacement byte (disp) for the JCC instruction. Whenever the microprocessor 100 executes a branch instruction and caches the fetch address of the cache line containing the branch instruction in the BTAC 216 along with the target address of the branch instruction, the microprocessor 100 also caches an indicator of whether the branch instruction wraps across two cache lines, like the JCC instruction of FIG. 3. If the fetch address subsequently hits in the BTAC 216, the BTAC 216 provides the cached wrap indicator on the WRAP signal 286. The wrap indicator enables the branch control apparatus to know that the fetch addresses of both cache lines must be provided to the instruction cache 202 in order to obtain all the instruction bytes of the branch instruction.
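The per-branch information cached by the BTAC 216, including the wrap indicator, can be pictured as a record along the lines of the following C sketch; the field names and widths are illustrative assumptions, as the specification does not prescribe an encoding.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative layout of one BTAC 216 storage element. */
typedef struct {
    uint32_t tag;        /* identifies the fetch address of the line holding the branch */
    uint32_t target;     /* cached branch target address (drives target address 152)    */
    bool     wrap;       /* branch wraps into the next cache line (drives WRAP 286)     */
    uint8_t  length;     /* branch instruction length (part of SBI 236)                 */
    uint8_t  position;   /* byte location of the branch within the cache line           */
    bool     is_call;    /* call instruction indicator                                  */
    bool     is_return;  /* return instruction indicator                                */
    uint8_t  prediction; /* taken/not-taken prediction state                            */
} btac_entry_t;
```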
Referring again to FIG. 2, the microprocessor 100 also includes control logic 222. The HIT signal 234, the SBI 236, the WRAP signal 286, the MISS signal 204, the FULL signal 246, the RBRDY signal 238, and the instruction length signal 248 are all provided as inputs to the control logic 222. The operation of the control logic 222 is described in more detail below.
The microprocessor 100 also includes a mux 218. The mux 218 receives at least six addresses as inputs and selects one of them as the fetch address 162 to the instruction cache 202 in response to a control signal 168 generated by the control logic 222. The mux 218 receives the target address 152 from the BTAC 216. The mux 218 also receives a next sequential fetch address 262. The next sequential fetch address 262 is the previous fetch address incremented by the size of an instruction cache 202 cache line by an incrementer 224. The incrementer 224 receives the fetch address 162 and provides the next sequential fetch address 262 to the mux 218.
The mux 218 also receives the resolved target address 156 of FIG. 1. The resolved target address 156 is provided by execution logic in the microprocessor 100. The execution logic calculates the resolved target address 156 based on execution of a branch instruction. If, after branching to the target address 152 provided by the BTAC 216, the microprocessor 100 later determines that the branch was erroneous, the microprocessor 100 corrects the error by flushing the pipeline and branching to either the resolved target address 156 or to the fetch address of a cache line including the instruction following the branch instruction. In one embodiment, the microprocessor 100 corrects the error by flushing the pipeline and branching to the fetch address of a cache line including the branch instruction itself, if the microprocessor 100 determines that no branch instruction was present in the cache line as presumed. The error correction is as described in U.S. patent application Ser. No. ______ entitled APPARATUS, SYSTEM AND METHOD FOR DETECTING AND CORRECTING ERRONEOUS SPECULATIVE BRANCH TARGET ADDRESS CACHE BRANCHES (docket number CNTR:2022), having a common assignee, and which is hereby incorporated by reference in its entirety for all purposes.
In one embodiment, the mux 218 also receives the non-speculative target address 154 of FIG. 1. The non-speculative target address 154 is generated by other branch prediction elements, such as a call/return stack and a branch target buffer (BTB) that caches target addresses of indirect branch instructions based on the branch instruction pointer. The mux 218 selectively overrides the target address 152 provided by the BTAC 216 with the non-speculative target address 154 as described in U.S. patent application Ser. No. ______ entitled SPECULATIVE BRANCH TARGET ADDRESS CACHE WITH SELECTIVE OVERRIDE BY SECONDARY PREDICTOR BASED ON BRANCH INSTRUCTION TYPE (docket number CNTR:2052), having a common assignee, and which is hereby incorporated by reference in its entirety for all purposes.
The mux 218 also receives a backup fetch address 274. The microprocessor 100 includes a fetch address register file 282 that provides the backup fetch address 274 to the mux 218. In one embodiment of the microprocessor 100, stages C 101 through V 108 cannot stall. That is, the state of these stages is not saved on each clock cycle. Consequently, if a cache line reaches the instruction buffer 142 and the instruction buffer 142 is full, the cache line is lost. If the instruction buffer 142 is relatively large, it may be advantageous to save complexity and space in the microprocessor 100 by omitting the state-saving logic.
Although the upper stages of the pipeline 100 may not stall, the fetch address of a cache line that is lost due to a full instruction buffer 142 is saved in the fetch address register file 282 and provided to the mux 218 as the backup fetch address 274. As cache lines flow down the pre-decode pipeline stages of the microprocessor 100, the corresponding fetch address 162, provided by the mux 218, flows down the fetch address register file 282. Use of the backup fetch address 274 is described in more detail below with respect to the remaining figures.
The mux 218 also receives a saved target address 284. The saved target address 284 is a previous value of the target address 152 output by the BTAC 216. The saved target address 284 is stored in a save register 228. The save register 228 receives the output of a save mux 226. The save mux 226 receives the BTAC 216 target address 152. The save mux 226 also receives the output of the save register 228 for holding the value of the saved target address 284. The save mux 226 is controlled by a control signal 276 generated by the control logic 222.
The microprocessor 100 also includes a flag register 232. The control logic 222 sets the flag register 232 to a true value whenever a wrapped BTAC 216 branch instruction is pending. That is, the flag register 232 indicates that the save register 228 currently stores a BTAC 216 target address 152 for a branch instruction that wraps across two cache lines.
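The fetch-address sources feeding the mux 218, together with the save register 228 and flag register 232, can be modeled informally as follows; this is a descriptive sketch only, and the enumerator names and struct layout are assumptions introduced for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* The six fetch-address sources selected by mux 218 via control signal 168. */
typedef enum {
    SEL_NEXT_SEQUENTIAL,  /* incrementer 224: previous fetch address plus the line size */
    SEL_BTAC_TARGET,      /* speculative target address 152 from the BTAC 216           */
    SEL_RESOLVED_TARGET,  /* resolved target address 156 from the E-stage 126           */
    SEL_NON_SPECULATIVE,  /* non-speculative target address 154 (F-stage logic, BTB)    */
    SEL_BACKUP_FETCH,     /* backup fetch address 274 from fetch address register file 282 */
    SEL_SAVED_TARGET      /* saved target address 284 from save register 228            */
} fetch_select_t;

/* State held by the branch control apparatus for a pending wrapped branch. */
typedef struct {
    uint32_t saved_target;  /* save register 228: BTAC target awaiting use       */
    bool     wrap_pending;  /* flag register 232: a wrapped BTAC branch is pending */
} wrap_state_t;
```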
Referring now to FIG. 4, a flowchart illustrating operation of the branch control apparatus of FIG. 2 according to the present invention is shown. In the following description of FIG. 4, fetch address A refers to a fetch address of a cache line A that contains at least a first portion of a branch instruction, such as cache line A 302 of FIG. 3. Fetch address B refers to a fetch address of a cache line B that contains a second portion of a wrapping branch instruction, such as cache line B 304 of FIG. 3. Cache line T refers to a cache line that contains one or more target instructions of a branch instruction and is selected by the target address of the branch instruction. For clarity and simplicity, the flowchart of FIG. 4 assumes that both cache line A and cache line T hit in the instruction cache 202 of FIG. 2. The flowchart of FIG. 4 specifies operation if cache line B hits in the instruction cache 202 and specifies operation if cache line B does not hit in the instruction cache 202. Flow begins at block 402.
At block 402, the mux 218 of FIG. 2 applies fetch address A as the fetch address 162 to the instruction cache 202 and to the BTAC 216 of FIG. 2. In the typical case, program flow proceeds sequentially; hence, the mux 218 selects the next sequential fetch address 262 of FIG. 2 as fetch address A. Flow proceeds from block 402 to block 404.
At block 404, the instruction cache 202 provides cache line A on the data bus 242 of FIG. 2 in response to the application of fetch address A during step 402. Cache line A contains at least a first portion of a branch instruction, and fetch address A is cached in the BTAC 216. Whether cache line A contains all or only part of the branch instruction is determined at decision block 408, described below. The BTAC 216 provides a target address 152 of FIG. 1 for the cached branch instruction in response to fetch address A. Flow proceeds from block 404 to block 406.
At block 406, the target address 152 provided by the BTAC 216 during step 404 is stored in the save register 228 of FIG. 2. That is, the control logic 222 controls the save mux 226 of FIG. 2 to select the target address 152 from the BTAC 216 for storage in the save register 228 because a BTAC 216 hit occurred, as indicated on the HIT signal 234 of FIG. 2. Upon storing the target address 152 into the save register 228, the control logic 222 sets the flag register 232 to a true value. Flow proceeds from block 406 to decision block 408.
At decision block 408, the control logic 222 of FIG. 2 determines whether the branch instruction wraps beyond cache line A, i.e., across two cache lines. In particular, the control logic 222 examines the WRAP signal 286 of FIG. 2 to determine whether it has a true value. If not, then the branch instruction is wholly contained in cache line A, and flow proceeds to block 412. Otherwise, the first part of the branch instruction is contained in cache line A, the second part of the branch instruction is contained in cache line B, and flow proceeds to block 428.
At block 412, the target address 152 provided by the BTAC 216 during step 404 is selected by the mux 218 and applied as the fetch address 162 to the instruction cache 202. If flow reaches block 412, then the BTAC 216 branch instruction is not a wrapping branch instruction. Hence, the target address 152 is applied after fetch address A, since it would be incorrect to apply fetch address B to the instruction cache 202 when the entire branch instruction is contained in cache line A. Flow proceeds from block 412 to block 414.
At block 414, cache line A is stored in the instruction buffer 142 of FIG. 2. Flow proceeds from block 414 to block 416.
At block 416, the instruction cache 202 provides cache line T, which contains the target instructions of the branch instruction. The instruction cache 202 provides cache line T in response to the target address 152 applied to the instruction cache 202 during step 412. Flow proceeds from block 416 to decision block 418.
At decision block 418, the control logic 222 determines whether the instruction buffer 142 is full. In particular, the control logic 222 examines the value of the FULL signal 246 of FIG. 2 generated by the instruction buffer 142 to see if it is true. If not, flow proceeds to block 422. Otherwise, flow proceeds to block 424.
At block 422, cache line T is stored in the instruction buffer 142. At this point, the branch instruction and its target instructions are stored in the instruction buffer 142 so that they can be formatted by the instruction format logic 214 of FIG. 2. Upon storing cache line T into the instruction buffer 142, the control logic 222 sets the flag register 232 to a false value. If the branch instruction was a non-wrapping branch, i.e., if flow proceeded from decision block 408 to block 412, then the instruction buffer 142 contains cache line A, containing the entire branch instruction, and cache line T, containing the target instructions. However, if the branch instruction was a wrapping branch, i.e., if flow proceeded from decision block 408 to block 428, then the instruction buffer 142 contains cache line A, containing the first portion of the branch instruction, cache line B, containing the second portion of the branch instruction, and cache line T, containing the target instructions, as described below. Flow ends at block 422.
At block 424, the control logic 222 waits for the instruction buffer 142 to become not full. That is, the control logic 222 examines the FULL signal 246 until it becomes false. While the control logic 222 is waiting for the FULL signal 246 to become false, the saved target address 284 continues to be held in the save register 228. Flow proceeds from block 424 to block 426.
At block 426, the mux 218 selects the saved target address 284 provided by the save register 228 and applies the saved target address 284 as the fetch address 162 to the instruction cache 202. The saved target address 284 was stored in the save register 228 during step 406. If flow reaches block 426 from block 454, described below, then the BTAC 216 branch instruction is a wrapping branch instruction. In this case, the target address 152 is applied after fetch address B so that the entire branch instruction is stored in the instruction buffer 142 before the branch target instructions in cache line T are stored in the instruction buffer 142. Flow proceeds from block 426 to block 416.
At block 428, cache line A is stored in the instruction buffer 142. In this case, cache line A contains only the first portion of the wrapping branch instruction, not the entire branch instruction. Flow proceeds from block 428 to block 432.
At block 432, the mux 218 selects the next sequential fetch address 262 provided by the incrementer 224 of FIG. 2, which will be fetch address B, and applies fetch address B as the fetch address 162 to the instruction cache 202. It is necessary to apply fetch address B in order to obtain cache line B, which contains the second portion of the wrapping branch instruction, so that all the instruction bytes of the branch instruction may be stored in the instruction buffer for decoding. Flow proceeds from block 432 to decision block 434.
At decision block 434, the control logic 222 and the BIU 206 of FIG. 2 determine whether fetch address B hit in the instruction cache 202. In particular, the control logic 222 and the BIU 206 examine the MISS signal 204 of FIG. 2 generated by the instruction cache 202. If fetch address B did not hit, i.e., the MISS signal 204 is true, flow proceeds to block 436. Otherwise, flow proceeds to block 444.
At block 436, either the BIU 206 fetches cache line B from memory, or cache line B is provided by the level-2 cache. When cache line B arrives in the response buffer 208 of FIG. 2, the response buffer 208 generates a true value on the RBRDY signal 238 to notify the control logic 222 that cache line B is available. Flow proceeds from block 436 to block 438.
At block 438, cache line B is stored into the instruction buffer 142 from the response buffer 208. Flow proceeds from block 438 to block 442.
At block 442, the mux 218 selects the next sequential fetch address 262 provided by the incrementer 224 and applies the next sequential fetch address as the fetch address 162 to the instruction cache 202. That is, if cache line B is not present in the instruction cache 202, the condition is treated as a BTAC 216 miss. If the E-stage 126 of FIG. 1 later determines that the branch instruction is taken, the misprediction will be corrected by branching to the resolved target address 156. The embodiment of FIG. 4 has the advantage of requiring less control logic than the embodiment of FIG. 8, described below, which handles the case of a wrapping BTAC 216 branch whose second cache line misses in the instruction cache 202. In a microprocessor 100 in which the probability is very low that a branch instruction will both wrap and generate an instruction cache 202 miss for its second portion, the embodiment of FIG. 4 is advantageous because it requires less complexity. Flow ends at block 442.
At block 444, the instruction cache 202 provides cache line B on the data bus 242 in response to the application of fetch address B during step 432. Cache line B contains the second portion of the branch instruction. Flow proceeds from block 444 to decision block 446.
At decision block 446, the control logic 222 determines whether the instruction buffer 142 is full by examining the value of the FULL signal 246 to see if it is true. That is, the control logic 222 determines whether the store of cache line A into the instruction buffer 142 during step 428 filled the instruction buffer 142. If so, flow proceeds to block 448. If not, flow proceeds to block 454.
At block 448, the control logic 222 waits for the instruction buffer 142 to become not full. That is, the control logic 222 examines the FULL signal 246 until it becomes false. Flow proceeds from block 448 to block 452.
At block 452, the mux 218 selects the backup fetch address 274 of FIG. 2 provided by the fetch address register file 282 of FIG. 2, which will be fetch address B, and applies fetch address B as the fetch address 162 to the instruction cache 202. It is necessary to apply fetch address B in order to obtain cache line B, which contains the second portion of the wrapping branch instruction. Flow proceeds from block 452 to decision block 434 to determine whether the application of the backup fetch address B hits in the instruction cache 202.
At block 454, cache line B is stored in the instruction buffer 142. Cache line B contains the second portion of the wrapping branch instruction. Flow proceeds from block 454 to block 426 to get cache line T, which contains the branch target instructions, into the instruction buffer 142.
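The flowchart of FIG. 4 can be paraphrased in software as the following C sketch for the case of a BTAC 216 hit. It is a simplified model: the helper functions are hypothetical stand-ins for the hardware of FIG. 2, cache lines A and T are assumed to hit as in the flowchart, and the loop back from block 452 to decision block 434 is folded into straight-line code.

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 32u                 /* assumed cache line size in bytes */

/* Hypothetical helpers standing in for the hardware of FIG. 2. */
void apply_fetch_address(uint32_t addr);    /* drive fetch address 162              */
bool icache_hit(uint32_t addr);             /* inverse of the MISS signal 204       */
void fill_from_memory(uint32_t addr);       /* BIU 206 / level-2 via response buffer */
bool buffer_full(void);                     /* FULL signal 246                      */
void wait_until_buffer_not_full(void);      /* wait for FULL signal 246 to go false */
void store_line_in_buffer(uint32_t addr);   /* write the selected line into buffer 142 */

/* Software paraphrase of the FIG. 4 flowchart for a BTAC hit on fetch address A. */
void btac_branch_fig4(uint32_t addr_a, uint32_t target, bool wrap)
{
    apply_fetch_address(addr_a);                   /* block 402                           */
    uint32_t saved_target = target;                /* block 406: save register 228        */

    if (!wrap) {                                   /* block 408: branch fits in line A    */
        apply_fetch_address(saved_target);         /* block 412                           */
        store_line_in_buffer(addr_a);              /* block 414                           */
    } else {                                       /* branch wraps into cache line B      */
        store_line_in_buffer(addr_a);              /* block 428                           */
        uint32_t addr_b = addr_a + CACHE_LINE_SIZE;
        apply_fetch_address(addr_b);               /* block 432                           */
        if (!icache_hit(addr_b)) {                 /* block 434: cache line B misses      */
            fill_from_memory(addr_b);              /* block 436                           */
            store_line_in_buffer(addr_b);          /* block 438                           */
            apply_fetch_address(addr_b + CACHE_LINE_SIZE); /* block 442: treat as a BTAC miss */
            return;                                /* corrected later if the branch is taken */
        }
        if (buffer_full()) {                       /* block 446                           */
            wait_until_buffer_not_full();          /* block 448                           */
            apply_fetch_address(addr_b);           /* block 452: backup fetch address 274 */
        }
        store_line_in_buffer(addr_b);              /* block 454                           */
        apply_fetch_address(saved_target);         /* block 426: fetch cache line T       */
    }

    if (buffer_full()) {                           /* block 418: line T cannot be stored yet */
        wait_until_buffer_not_full();              /* block 424                           */
        apply_fetch_address(saved_target);         /* block 426: re-apply saved target 284 */
    }
    store_line_in_buffer(saved_target);            /* blocks 416/422: cache line T buffered */
}
```

The early return in the miss path corresponds to block 442, where the wrapping branch is simply treated as a BTAC 216 miss and left for the execution stage to correct.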
As may be seen from the flowchart of FIG. 4, the present invention provides an improvement over a solution to the wrapping BTAC 216 branch problem that simply treats all wrapping BTAC 216 branches as BTAC 216 misses. The percentage of BTAC 216 branches that wrap is non-negligible, and the present invention provides a means of branching rather than not branching and having to correct, thereby potentially saving many clock cycles. This is particularly beneficial in a microprocessor 100 in which the number of stages is relatively large.
Referring now generally to FIGS. 5, 6, 7, and 9, timing diagrams illustrating examples of operation of the branch control apparatus of FIG. 2 according to the present invention are shown. FIGS. 5, 6, and 7 illustrate operation according to the flowchart of FIG. 4, whereas FIG. 9 illustrates operation according to the alternate embodiment flowchart of FIG. 8, described below. The timing diagrams comprise a matrix of cells having 7 rows and 33 columns. The first column, beginning with the second row, is denoted C, I, B, U, V, and F, corresponding to the C-stage 101, I-stage 102, B-stage 104, U-stage 106, V-stage 108, and F-stage 112 of FIG. 1. The first row, beginning with the second column, is denoted 1 through 32, corresponding to 32 clock cycles of the microprocessor 100.
Each of the cells in the matrix specifies the contents of the specified stage during the specified clock cycle. For clarity and simplicity, each of the cells is denoted herein as (s,c), where s is the stage and c is the clock cycle. For example, cell (V,5) denotes the contents of the V-stage 108 during clock cycle 5. The cells are either blank or contain one of four letters: A, B, C, or T. The letter A designates either fetch address A or cache line A of FIG. 4, or both, depending upon the context of the stage. Similarly, the letter B designates either fetch address B or cache line B of FIG. 4, or both, and the letter T designates either a BTAC 216 target address of a branch instruction or cache line T of FIG. 4, or both. The letter C designates either the next sequential fetch address after fetch address B or the next sequential cache line after cache line B, or both. For example, in FIG. 5, the cell corresponding to the contents of the I-stage 102 during clock cycle 2, denoted (I,2), contains an A, to signify that the I-stage 102 receives fetch address A during clock cycle 2. That is, address A is applied as the fetch address 162 of FIG. 1 to the instruction cache 202 of FIG. 2, as described with respect to block 402 of FIG. 4.
In addition, below the matrix, the values of the WRAP signal 286, the FULL signal 246, the MISS signal 204, and the RBRDY signal 238 of FIG. 2 during each of the 32 clock cycles are shown. A polarity convention is chosen for illustration purposes such that if a signal is low, its value is false. For example, if the FULL signal 246 is low, the value is false, i.e., the instruction buffer 142 of FIG. 1 is not full; conversely, if the FULL signal 246 is high, the value is true, i.e., the instruction buffer 142 of FIG. 1 is full. However, the invention is susceptible to use of either polarity convention. References to block numbers, such as block 412, are to blocks of the flowchart of FIG. 4.
Referring now to FIG. 5, a timing diagram is shown illustrating an example of operation of the microprocessor 100 of FIG. 1 according to the flowchart of FIG. 4 in the case of a non-wrapping BTAC 216 branch, wherein the instruction buffer 142 is full when cache line T is initially ready for storage in the instruction buffer 142. Hence, the WRAP signal 286, the MISS signal 204, and the RBRDY signal 238 are false throughout the clock cycles of FIG. 5, and the FULL signal 246 is true during a portion of the clock cycles, in particular clock cycles 6 through 10.
In cell (C,1), the mux 218 of FIG. 2 selects fetch address A as the fetch address 162. In cell (I,2), the I-stage 102 applies fetch address A to the instruction cache 202 and to the BTAC 216, according to block 402. In cell (B,3), the instruction cache 202 is selecting cache line A, during its second access cycle. In cell (U,4), the instruction cache 202 provides cache line A, according to block 404.
In cell (V,5), cache line A is written to the instruction buffer 142, according to block 414. In the example of FIG. 5, storing cache line A in the instruction buffer 142 causes the instruction buffer 142 to become full. Hence, during clock cycle 6, the FULL signal 246 is true. In the example of FIG. 5, the FULL signal 246 remains true until clock cycle 11.
In cell (C,4), the mux 218 selects the target address 152 provided by the BTAC 216 during block 404 as the fetch address 162. In cell (I,5), the target address 152 is applied to the instruction cache 202, according to block 412, since the WRAP signal 286 is false in this example.
In cells (V,6) through (V,14), cache line A remains in the instruction buffer 142 and is not provided to the instruction format logic 214 because the instruction format logic 214 is formatting other instructions ahead of cache line A. One cause of the instruction buffer 142 remaining full for several clock cycles is that one or more instructions requiring a large number of clock cycles to execute, such as floating point divides, are being executed in the pipeline. Such instructions cause the stages of the pipeline 100 above the execution stage 126 to stall.
In cell (B,6), the instruction cache 202 is selecting cache line T, during its second access cycle. In cell (U,7), the instruction cache 202 provides cache line T, according to block 416. However, during clock cycle 7, the instruction buffer 142 is full, as determined during block 418. Hence, during clock cycle 8, cache line T is lost, since the instruction buffer 142 is full and cannot accept it. The control logic 222 of FIG. 2 waits until the FULL signal 246 is false, according to block 424.
In cell (C,11), the mux 218 selects the saved target address 284 provided by the save register 228 as the fetch address 162, since the control logic 222 determined that the FULL signal 246 is now false in clock cycle 11. In cell (I,12), the saved target address 284 is applied to the instruction cache 202, according to block 426. In cell (B,13), the instruction cache 202 is selecting cache line T, during its second access cycle. In cell (U,14), the instruction cache 202 provides cache line T, according to block 416.
In cell (F,15), cache line A proceeds to the instruction format logic 214, where the branch instruction is formatted. In cell (V,15), cache line T is written to the instruction buffer 142, according to block 422, since the instruction buffer 142 is no longer full, as determined during block 418. In cell (F,16), cache line T proceeds to the instruction format logic 214, where the branch target instruction is formatted.
Referring now to FIG. 6, a timing diagram, similar to FIG. 5, illustrating a second example of operation of the branch control apparatus of FIG. 2 according to the flowchart of FIG. 4 according to the present invention is shown. FIG. 6 illustrates an example of operation of the microprocessor 100 of FIG. 1 according to the flowchart of FIG. 4 in the case of a wrapping BTAC 216 branch, wherein the second portion of the branch instruction, contained in cache line B, misses in the instruction cache 202. Hence, the FULL signal 246 is false throughout the clock cycles of FIG. 6, and the WRAP signal 286, the MISS signal 204, and the RBRDY signal 238 are each true during a portion of the clock cycles, in particular during clock cycles 4, 5, and 24, respectively.
Cells (C,1), (I,2), (B,3), (U,4), and (V,5) are similar to the corresponding cells of FIG. 5, with fetch address A and cache line A proceeding down the upper stages of the microprocessor 100 pipeline. During clock cycle 4, the WRAP signal 286 is true, specifying that the BTAC 216 indicated the branch instruction wraps across cache lines A and B. In cell (F,6), cache line A proceeds to the F-stage 112.
In cell (C,2), the mux 218 selects the next sequential fetch address 262, which is fetch address B, as the fetch address 162, since the control logic 222 determined that the branch instruction is a wrapping BTAC 216 branch, according to block 408. In cell (I,3), fetch address B is applied to the instruction cache 202, according to block 432, since the WRAP signal 286 is true in this example. In cell (B,4), the instruction cache 202 is selecting cache line B, during its second access cycle. However, during clock cycle 5, the instruction cache 202 determines that fetch address B is a miss, and accordingly asserts the MISS signal 204. Consequently, the instruction cache 202 is unable to provide cache line B.
During clock cycles 7 through 23, the microprocessor 100 waits for cache line B to be fetched from memory into the response buffer 208, according to block 436. During clock cycle 24, the response buffer 208 of FIG. 2 asserts the RBRDY signal 238 when cache line B arrives. In cell (V,24), cache line B is stored into the instruction buffer 142 from the response buffer 208, according to block 438. In cell (F,25), cache line B proceeds to the F-stage 112.
In cell (C,25), the mux 218 selects the next sequential fetch address 262, which is fetch address C, as the fetch address 162, according to block 442, since the control logic 222 determined that cache line B missed in the instruction cache 202. Hence, the microprocessor 100 treats the case of FIG. 6 as a BTAC 216 miss by not branching to the target address 152 provided by the BTAC 216, but instead fetching the next sequential instruction. In cell (I,26), the I-stage 102 applies fetch address C to the instruction cache 202. In cell (B,27), the instruction cache 202 is selecting cache line C, during its second access cycle. In cell (U,28), the instruction cache 202 provides cache line C. In cell (V,29), cache line C is written to the instruction buffer 142. In cell (F,30), cache line C proceeds to the F-stage 112.
Referring now to FIG. 7, a timing diagram, similar to FIG. 5, illustrating a third example of operation of the branch control apparatus of FIG. 2 according to the flowchart of FIG. 4 according to the present invention is shown. FIG. 7 illustrates an example of operation of the microprocessor 100 of FIG. 1 according to the flowchart of FIG. 4 in the case of a wrapping BTAC 216 branch, wherein cache line A fills the instruction buffer 142. Hence, the MISS signal 204 and the RBRDY signal 238 are false throughout the clock cycles of FIG. 7, and the WRAP signal 286 and the FULL signal 246 are each true during a portion of the clock cycles. In particular, the WRAP signal 286 is true during clock cycle 4, and the FULL signal 246 is true during clock cycles 6 through 10.
Cells (C,1), (I,2), (B,3), (U,4), (V,5) through (V,14), and (F,15) are similar to the corresponding cells of FIG. 5, with fetch address A and cache line A proceeding down the upper stages of the microprocessor 100 pipeline to the F-stage 112. During clock cycle 4, the WRAP signal 286 is true, specifying that the BTAC 216 indicated the branch instruction wraps across cache lines A and B.
Cells (C,2), (I,3), and (B,4) are similar to the corresponding cells of FIG. 6, with fetch address B and cache line B proceeding down the C, I, and B stages of the microprocessor 100 pipeline. In cell (U,5), the instruction cache 202 provides cache line B, according to block 444, since fetch address B hit in the instruction cache 202.
However, during clock cycle 6, the instruction buffer 142 asserts the FULL signal 246 because cache line A has filled the instruction buffer 142. Consequently, the control logic 222 waits for the FULL signal 246 to become false, according to block 448, which occurs in clock cycle 11.
In cell (C,11), the mux 218 selects the backup fetch address 274 from the fetch address register file 282, which is fetch address B, in response to the FULL signal 246 becoming false. In cell (I,12), fetch address B is applied to the instruction cache 202, according to block 452. In cell (B,13), the instruction cache 202 is selecting cache line B, during its second access cycle. In cell (U,14), the instruction cache 202 provides cache line B, according to block 444, since fetch address B hits in the instruction cache 202. In cell (V,15), cache line B is written to the instruction buffer 142, according to block 454, since the instruction buffer 142 is not full. In cell (F,16), cache line B progresses to the F-stage 112.
In cell (C,12), the mux 218 selects the saved target address 284 from the save register 228. In cell (I,13), the saved target address 284 is applied to the instruction cache 202, according to block 426. In cell (B,14), the instruction cache 202 is selecting cache line T, during its second access cycle. In cell (U,15), the instruction cache 202 provides cache line T, according to block 416. In cell (V,16), cache line T is written to the instruction buffer 142, according to block 422, since the instruction buffer 142 is not full. In cell (F,17), cache line T progresses to the F-stage 112.
Referring now to FIG. 8, a flowchart illustrating operation of the branch control apparatus of FIG. 2 according to an alternate embodiment of the present invention is shown. The flowchart of FIG. 8 is identical to the flowchart of FIG. 4 with the exception that FIG. 8 does not include blocks 438 and 442. Instead, flow proceeds from block 436 to decision block 446. That is, rather than treating a miss of fetch address B in the instruction cache 202 as a BTAC 216 miss, the embodiment of FIG. 8 handles the condition. It does so by backing up to fetch address B after the instruction buffer 142 is no longer full, and subsequently applying the saved target address 284 to obtain cache line T, as will be illustrated with respect to FIG. 9.
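In terms of the FIG. 4 sketch given above, the alternate embodiment changes only the miss handling for cache line B. A hedged fragment, reusing the same hypothetical helper names from that sketch, might read as follows.

```c
/* FIG. 8 variant of the cache line B miss handling: instead of giving up
 * and treating the case as a BTAC 216 miss (blocks 438 and 442 of FIG. 4),
 * the missing line is filled into the instruction cache and flow continues
 * at decision block 446, eventually re-fetching line B via the backup
 * fetch address 274 and then applying the saved target address 284. */
if (!icache_hit(addr_b)) {          /* block 434 */
    fill_from_memory(addr_b);       /* block 436: line B written into the instruction cache */
}
if (buffer_full()) {                /* block 446 */
    wait_until_buffer_not_full();   /* block 448 */
    apply_fetch_address(addr_b);    /* block 452: backup fetch address 274 */
}
store_line_in_buffer(addr_b);       /* block 454 */
apply_fetch_address(saved_target);  /* block 426: fetch cache line T */
```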
Referring now to FIG. 9, a timing diagram, similar to FIG. 6, illustrating an example of operation of the branch control apparatus of FIG. 2 according to the flowchart of FIG. 8 according to the present invention is shown. FIG. 9 illustrates an example of operation of the microprocessor 100 of FIG. 1 according to the flowchart of FIG. 8 in the case of a wrapping BTAC 216 branch, wherein cache line B, which contains the second portion of the branch instruction, misses in the instruction cache 202, and cache line A fills the instruction buffer 142. Clock cycles 1 through 23 of FIG. 9 are the same as the corresponding clock cycles of FIG. 6, except that the FULL signal 246 is true during clock cycles 6 through 25 in FIG. 9.
When cache line B arrives in the response buffer 208 during clock cycle 24, the instruction buffer 142 is full, as determined according to block 446. Hence, cache line B is not written into the instruction buffer 142, but is written into the instruction cache 202.
In the example, the FULL signal 246 goes false during clock cycle 26, as determined during block 448. Hence, in cell (C,26), the mux 218 selects the backup fetch address 274 as the fetch address 162. In cell (I,27), the backup fetch address 274 is applied to the instruction cache 202, according to block 452. In cell (B,28), the instruction cache 202 is selecting cache line B, during its second access cycle. In cell (U,29), the instruction cache 202 provides cache line B, according to block 444. Cache line B was previously written into the instruction cache 202 from the response buffer 208 during clock cycle 25. In cell (V,30), cache line B is written to the instruction buffer 142, according to block 454, since the instruction buffer 142 is not full. In cell (F,31), cache line B progresses to the F-stage 112.
In cell (C,27), the mux 218 selects the saved target address 284 as the fetch address 162. In cell (I,28), the saved target address 284 is applied to the instruction cache 202, according to block 426. In cell (B,29), the instruction cache 202 is selecting cache line T, during its second access cycle. In cell (U,30), the instruction cache 202 provides cache line T, according to block 416. In cell (V,31), cache line T is written to the instruction buffer 142, according to block 422, since the instruction buffer 142 is not full. In cell (F,32), cache line T progresses to the F-stage 112.
As may be observed from FIGS. 8 and 9, the alternate embodiment has the advantage of not incurring the additional clock cycles associated with correcting a mispredicted taken branch, i.e., a BTAC 216 hit that is treated as a BTAC 216 miss because the branch wraps and the second cache line, containing the second part of the branch, misses in the instruction cache 202. Rather, as may be observed from FIG. 9, the BTAC 216 target address 152 is supplied to the instruction cache 202 at the earliest possible clock cycle after fetch address B.
Although the present invention and its objects, features, and advantages have been described in detail, other embodiments are encompassed by the invention. For example, the number and arrangement of stages in the pipeline may vary. The size and construction of the BTAC, instruction cache, or instruction buffer may vary. The size of a cache line may vary.
Those skilled in the art should appreciate that they can readily use the disclosed conception and specific embodiments as a basis for designing or modifying other structures for carrying out the same purposes of the present invention without departing from the spirit and scope of the invention as defined by the appended claims.