US9921831B2 - Opcode counting for performance measurement - Google Patents

Opcode counting for performance measurement

Info

Publication number
US9921831B2
US9921831B2 (application US15/291,351; US201615291351A)
Authority
US
United States
Prior art keywords
instructions
select gates
program
counters
executed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
US15/291,351
Other versions
US20170068536A1 (en)
Inventor
Alan Gara
David L. Satterfield
Robert E. Walkup
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp
Priority to US15/291,351 (US9921831B2)
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION. Assignment of assignors interest (see document for details). Assignors: GARA, ALAN; WALKUP, ROBERT E.; SATTERFIELD, DAVID L.
Publication of US20170068536A1
Priority to US15/918,363 (US10713043B2)
Application granted
Publication of US9921831B2
Legal status: Expired - Fee Related (current)
Anticipated expiration


Abstract

Methods, systems and computer program products are disclosed for measuring a performance of a program running on a processing unit of a processing system. In one embodiment, the method comprises informing a logic unit of each instruction in the program that is executed by the processing unit, assigning a weight to each instruction, assigning the instructions to a plurality of groups, and analyzing the plurality of groups to measure one or more metrics. In one embodiment, each instruction includes an operating code portion, and the assigning includes assigning the instructions to the groups based on the operating code portions of the instructions. In an embodiment, each type of instruction is assigned to a respective one of the plurality of groups. These groups may be combined into a plurality of sets of the groups.

Description

CROSS-REFERENCE TO RELATED APPLICATIONS
This application is a continuation of copending U.S. patent application Ser. No. 14/800,311, filed Jul. 15, 2015, which is a continuation of U.S. patent application Ser. No. 14/063,610, filed Oct. 25, 2013, which is a continuation of U.S. patent application Ser. No. 12/688,773 filed Jan. 15, 2010, now U.S. Pat. No. 8,571,834, issued Oct. 29, 2013. The entire contents and disclosures of U.S. patent application Ser. Nos. 14/800,311, 14/063,610 and 12/688,773 are hereby incorporated herein by reference.
GOVERNMENT CONTRACT
This invention was Government supported under Contract No. B554331 awarded by the Department of Energy. The Government has certain rights in this invention.
This application relates to commonly-owned, U.S. Provisional Patent Application Ser. No. 61/293,611 entitled A MULTI-PETASCALE HIGHLY EFFICIENT PARALLEL SUPERCOMPUTER filed on Jan. 8, 2010 and incorporated by reference as if fully set forth herein.
BACKGROUND OF THE INVENTION
Field of the Invention
The present invention generally relates to data processing, and more specifically, the invention relates to counting instructions executed by programs running on data processing systems.
Background Art
In analyzing and enhancing performance of a data processing system and the applications executing within the data processing system, it is helpful to know which software modules within a data processing system are using system resources. Effective management and enhancement of data processing systems requires knowing how and when various system resources are being used. Performance tools are used to monitor and examine a data processing system to determine resource consumption as various software applications are executing within the data processing system. For example, a performance tool may identify the most frequently executed modules and instructions in a data processing system, or may identify those modules which allocate the largest amount of memory or perform the most I/O requests. Hardware performance tools may be built into the system or added at a later point in time.
Currently, processors have minimal support for counting various instruction types executed by a program. Typically, only a single group of instructions may be counted by a processor by using the internal hardware of the processor. This is not adequate for some applications, where users want to count many different instruction types simultaneously. In addition, there are certain metrics that are used to determine application performance (counting floating point instructions, for example) that are not easily measured with current hardware. Using the floating point example, a user may need to count a variety of instructions, each having a different weight, to determine the number of floating point operations performed by the program. A scalar floating point multiply would count as one FLOP, whereas a floating point multiply-add instruction would count as 2 FLOPS. Similarly, a quad-vector floating point add would count as 4 FLOPS, while a quad-vector floating point multiply-add would count as 8 FLOPS.
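As a hedged illustration of the weighting just described (the enum names and weight table below are assumptions for the example, not part of any hardware), a weighted tally over a stream of retired floating point opcodes might look like this:

```c
/* Minimal sketch: weighted FLOP counting with the example weights above
 * (scalar multiply = 1, multiply-add = 2, quad add = 4, quad multiply-add = 8). */
#include <stddef.h>
#include <stdio.h>

enum fp_op { FP_MUL, FP_FMA, FP_QUAD_ADD, FP_QUAD_FMA };

/* Hypothetical weight table indexed by the enum above. */
static const unsigned flop_weight[] = { 1, 2, 4, 8 };

static unsigned long count_flops(const enum fp_op *ops, size_t n)
{
    unsigned long flops = 0;
    for (size_t i = 0; i < n; i++)
        flops += flop_weight[ops[i]];   /* each instruction adds its weight */
    return flops;
}

int main(void)
{
    enum fp_op trace[] = { FP_MUL, FP_FMA, FP_QUAD_FMA, FP_QUAD_ADD };
    printf("FLOPs = %lu\n", count_flops(trace, 4));   /* 1 + 2 + 8 + 4 = 15 */
    return 0;
}
```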
BRIEF SUMMARY
Embodiments of the invention provide methods, systems and computer program products for measuring a performance of a program running on a processing unit of a processing system. In one embodiment, the method comprises informing a logic unit of each instruction in the program that is executed by the processing unit, assigning a weight to said each instruction, assigning the instructions to a plurality of groups, and analyzing said plurality of groups to measure one or more metrics of the program.
In one embodiment, each instruction includes an operating code portion, and the assigning includes assigning the instructions to said groups based on the operating code portions of the instructions. In an embodiment, each instruction is one type of a given number of types, and the assigning includes assigning each type of instruction to a respective one of said plurality of groups. In an embodiment, these groups may be combined into a plurality of sets of the groups.
In an embodiment of the invention, to facilitate the counting of instructions, the processor informs an external logic unit of each instruction that is executed by the processor. The external unit then assigns a weight to each instruction, and assigns it to an opcode group. The user can combine opcode groups into a larger group for accumulation into a performance counter. This assignment of instructions to opcode groups makes measurement of key program metrics transparent to the user.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
FIG. 1 is a block diagram of a data processing system in accordance with an embodiment of the invention.
FIG. 2 shows in more detail one of the processing units of the system of FIG. 1.
FIG. 3 illustrates the counting and grouping of program instructions in accordance with an embodiment of the invention.
FIG. 4 shows a circuit that may be used to count operating instructions and flop instructions in an embodiment of the invention.
DETAILED DESCRIPTION
As will be appreciated by one skilled in the art, the present invention may be embodied as a system, method or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer usable program code embodied in the medium.
Any combination of one or more computer usable or computer readable medium(s) may be utilized. The computer-usable or computer-readable medium may be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CDROM), an optical storage device, a transmission media such as those supporting the Internet or an intranet, or a magnetic storage device. Note that the computer-usable or computer-readable medium could even be paper or another suitable medium, upon which the program is printed, as the program can be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory. In the context of this document, a computer-usable or computer-readable medium may be any medium that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. The computer-usable medium may include a propagated data signal with the computer-usable program code embodied therewith, either in baseband or as part of a carrier wave. The computer usable program code may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
The present invention is described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
Referring now to FIG. 1, there is shown the overall architecture of the multiprocessor computing node 50 implemented in a parallel computing system in which the present invention is implemented. In one embodiment, the multiprocessor system implements the proven Blue Gene® architecture, and is implemented in a Blue Gene/Q massively parallel computing system comprising, for example, 1024 compute node ASICs (BCQ), each including multiple processor cores.
A compute node of this present massively parallel supercomputer architecture in which the present invention may be employed is illustrated in FIG. 1. The compute node 50 is a single chip (“nodechip”) based on low power A2 PowerPC cores, though the architecture can use any low power cores, and may comprise one or more semiconductor chips. In the embodiment depicted, the node includes 16 PowerPC A2 cores at 1600 MHz.
More particularly, the basic nodechip 50 of the massively parallel supercomputer architecture illustrated in FIG. 1 includes (sixteen or seventeen) 16+1 symmetric multiprocessing (SMP) cores 52, each core being 4-way hardware threaded, supporting transactional memory and thread level speculation, and including a Quad Floating Point Unit (FPU) 53 on each core (204.8 GF peak node). In one implementation, the core operating frequency target is 1.6 GHz, providing, for example, a 563 GB/s bisection bandwidth to shared L2 cache 70 via a full crossbar switch 60. In one embodiment, there is provided 32 MB of shared L2 cache 70, each core having an associated 2 MB of L2 cache 72. There is further provided external DDR SDRAM (e.g., Double Data Rate synchronous dynamic random access) memory 80, as a lower level in the memory hierarchy in communication with the L2. In one embodiment, the node includes 42.6 GB/s DDR3 bandwidth (1.333 GHz DDR3) (2 channels, each with chip kill protection).
Each FPU 53 associated with a core 52 has a 32 B wide data path to the L1-cache 55 of the A2, allowing it to load or store 32 B per cycle from or into the L1-cache 55. Each core 52 is directly connected to a private prefetch unit (level-1 prefetch, L1P) 58, which accepts, decodes and dispatches all requests sent out by the A2. The store interface from the A2 core 52 to the L1P 55 is 32 B wide and the load interface is 16 B wide, both operating at processor frequency. The L1P 55 implements a fully associative, 32 entry prefetch buffer. Each entry can hold an L2 line of 128 B size. The L1P provides two prefetching schemes for the private prefetch unit 58: a sequential prefetcher as used in previous BlueGene architecture generations, as well as a list prefetcher.
As shown in FIG. 1, the 32 MiB shared L2 is sliced into 16 units, each connecting to a slave port of the switch 60. Every physical address is mapped to one slice using a selection of programmable address bits or an XOR-based hash across all address bits. The L2-cache slices, the L1Ps and the L1-D caches of the A2s are hardware-coherent. A group of 4 slices is connected via a ring to one of the two DDR3 SDRAM controllers 78.
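The exact slice-selection hash is not given here; as a rough software analogy (the nibble-folding scheme below is an assumption, not the actual hardware function), an XOR-based hash across all address bits can be modeled by XOR-ing the 4-bit groups of a physical address down to a slice index:

```c
/* Illustrative sketch only: fold a 64-bit physical address to one of the
 * 16 L2 slices by XOR-ing all of its 4-bit groups together. */
#include <stdint.h>
#include <stdio.h>

static unsigned l2_slice(uint64_t paddr)
{
    unsigned slice = 0;
    for (int shift = 0; shift < 64; shift += 4)    /* XOR every nibble */
        slice ^= (unsigned)((paddr >> shift) & 0xF);
    return slice;                                  /* index 0..15 */
}

int main(void)
{
    printf("slice = %u\n", l2_slice(0x1234ABCD5678ULL));
    return 0;
}
```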
By implementing a direct memory access engine, referred to herein as a Messaging Unit (“MU”) such as MU 100, with each MU including a DMA engine and Network Card interface in communication with the XBAR switch, chip I/O functionality is provided. In one embodiment, the compute node further includes, in a non-limiting example: 10 intra-rack interprocessor links 90, each at 2.0 GB/s, i.e., 10*2 GB/s intra-rack and inter-rack (e.g., configurable as a 5-D torus in one embodiment); and one I/O link 92 interfaced with the MU at 2.0 GB/s (2 GB/s I/O link to the I/O subsystem). The system node employs, or is associated and interfaced with, 8-16 GB of memory per node. The ASIC may consume up to about 30 watts of chip power.
Although not shown, each A2 core has an associated quad-wide fused multiply-add SIMD floating point unit, producing 8 double precision operations per cycle, for a total of 128 floating point operations per cycle per compute chip. The A2 is a 4-way multi-threaded 64-bit PowerPC implementation. Each A2 core has its own execution unit (XU), instruction unit (IU), and quad floating point unit (QPU) connected via the AXU (Auxiliary eXecution Unit) (FIG. 2). The QPU is an implementation of the 4-way SIMD QPX floating point instruction set architecture. QPX is an extension of the scalar PowerPC floating point architecture. It defines thirty-two 32 B-wide floating point registers per thread instead of the traditional thirty-two scalar 8 B-wide floating point registers.
As described above, each processor includes four independent hardware threads sharing a single L1 cache with sixty-four byte line size. Each memory line is stored in a particular L2 cache slice, depending on the address mapping. The sixteen L2 slices effectively comprise a single L2 cache. Those skilled in the art will recognize that the invention may be embodied in different processor configurations.
FIG. 2 illustrates one of the processor units 200 of system 50. The processor unit includes a QPU 210, an A2 processor core 220 and L1 cache, and a level-1 prefetch (L1P) 230. The QPU has a 32 B wide data path to the L1-cache of the A2 core, allowing it to load or store 32 B per cycle from or into the L1-cache. Each core is directly connected to a private prefetch unit (level-1 prefetch, L1P) 230, which accepts, decodes and dispatches all requests sent out by the A2 core. The store interface from the A2 core to the L1P is 32 B wide and the load interface is 16 B wide, both operating at processor frequency. The L1P implements a fully associative 32 entry prefetch buffer. Each entry can hold an L2 line of 128 B size.
The L1P 230 provides two prefetching schemes: a sequential prefetcher, as well as a list prefetcher. The list prefetcher tracks and records memory requests sent out by the core, and writes the sequence as a list to a predefined memory region. It can replay this list to initiate prefetches for repeated sequences of similar access patterns. The sequences do not have to be identical, as the list processing is tolerant to a limited number of additional or missing accesses. This automated learning mechanism allows a near perfect prefetch behavior for a set of important codes that show the required access behavior, as well as perfect prefetch behavior for codes that allow precomputation of the access list.
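A software caricature of the record-and-replay idea behind the list prefetcher follows (structure and names are illustrative; the real L1P logic, including its tolerance to extra or missing accesses, is in hardware and is not reproduced here):

```c
/* Illustrative sketch only: record a sequence of request addresses once,
 * then replay it as prefetch hints on a later pass over the same code. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define LIST_MAX 1024

struct access_list {
    uint64_t addr[LIST_MAX];
    size_t   len;
};

static void record_access(struct access_list *l, uint64_t addr)
{
    if (l->len < LIST_MAX)
        l->addr[l->len++] = addr;        /* write the sequence as a list */
}

static void replay_prefetches(const struct access_list *l)
{
    for (size_t i = 0; i < l->len; i++)
        printf("prefetch 0x%llx\n", (unsigned long long)l->addr[i]);
}

int main(void)
{
    struct access_list list = { {0}, 0 };
    record_access(&list, 0x1000);
    record_access(&list, 0x1480);        /* non-sequential stride is fine */
    replay_prefetches(&list);
    return 0;
}
```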
Each PU 200 connects to a central low latency, high bandwidth crossbar switch 240 via a master port. The central crossbar routes requests and write data from the master ports to the slave ports and read return data back to the masters. The write data path of each master and slave port is 16 B wide. The read data return port is 32 B wide.
As mentioned above, currently, processors have minimal support for counting various instruction types executed by a program. Typically, only a single group of instructions may be counted by a processor by using the internal hardware of the processor. This is not adequate for some applications, where users want to count many different instruction types simultaneously. In addition, there are certain metrics that are used to determine application performance (counting floating point instructions for example) that are not easily measured with current hardware.
Embodiments of the invention provide methods, systems and computer program products for measuring a performance of a program running on a processing unit of a processing system. In one embodiment, the method comprises informing a logic unit of each instruction in the program that is executed by the processing unit, assigning a weight to said each instruction, assigning the instructions to a plurality of groups, and analyzing said plurality of groups to measure one or more metrics of the program.
With reference to FIG. 3, to facilitate the counting of instructions, the processor informs an external logic unit 310 of each instruction that is executed by the processor. The external unit 310 then assigns a weight to each instruction, and assigns it to an opcode group 320. The user can combine opcode groups into a larger group 330 for accumulation into a performance counter. This assignment of instructions to opcode groups makes measurement of key program metrics transparent to the user.
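A minimal software sketch of this flow (the group names, weights, and mask layout are assumptions for illustration, not the hardware's encoding): each executed instruction is classified into one opcode group, the group carries a weight, and a performance counter accumulates only the groups selected by its mask.

```c
#include <stdint.h>

enum opgroup { GRP_INT, GRP_FP_MUL, GRP_FP_FMA, GRP_FP_QFMA, NUM_GROUPS };

/* Hypothetical per-group weights (integer ops weigh 1, quad FMA weighs 8). */
static const unsigned group_weight[NUM_GROUPS] = { 1, 1, 2, 8 };

struct counter {
    uint32_t group_mask;   /* which opcode groups feed this counter */
    uint64_t value;
};

/* Called once per executed instruction after it has been assigned a group. */
static void count_instruction(struct counter *c, enum opgroup g)
{
    if (c->group_mask & (1u << g))
        c->value += group_weight[g];
}

int main(void)
{
    /* Combine all floating point groups into one larger "FLOPS" counter. */
    struct counter flops = {
        (1u << GRP_FP_MUL) | (1u << GRP_FP_FMA) | (1u << GRP_FP_QFMA), 0
    };
    count_instruction(&flops, GRP_FP_FMA);   /* +2 */
    count_instruction(&flops, GRP_INT);      /* not selected, ignored */
    return (int)flops.value;                 /* 2 */
}
```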
As one specific example of the present invention, FIG. 4 shows a circuit 400 that may be used to count a variety of instructions, each having a different weight, to determine the number of floating point operations performed by the program. The circuit 400 includes two flop select gates 402, 404 and two ops select gates 406, 410. Counters 412, 414 are used to count the number of outputs from the flop gates 402, 404, and the outputs of select gates 406, 410 are applied to reduce gates 416, 420. Thread compares 422, 424 receive thread inputs 426, 430 and the outputs of reduce gates 416, 420. Similarly, thread compares 432, 434 receive thread inputs 426, 430 and the outputs of flop counters 412, 414.
The implementation, in an embodiment, is hardware dependent. The processor runs at two times the speed of the counter, and because of this, the counter has to process two cycles of A2 data in one counter cycle. Hence, the two OPS0/1 and the two FLOPS0/1 are used in the embodiment of FIG. 4. If the counter were in the same clock domain as the processor, only a single OPS and a single FLOPS input would be needed. An OPS and a FLOPS are used because the A2 can execute one integer and one floating point operation per cycle, and the counter needs to keep up with these operations of the A2.
In one embodiment, the highest count that the A2 can produce is 9. This is because the maximum weight assigned to one FLOP is 8 (the highest possible weight in this embodiment), and, in this implementation, all integer instructions have a weight of 1. This totals 9 (8 flop and 1 op) per A2 cycle. When this maximum count is multiplied by two clock cycles per counting cycle, the result is a maximum count of 18 per count cycle, and as a result, the counter has to be able to add from 0 to 18 every counting cycle. Also, because all integer instructions have a weight of 1, a reduce (logical OR) is done in the OP path, instead of weighting logic like on the FLOP path.
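In software terms, the per-tick arithmetic works out as below (a sketch with assumed names, not the counter's RTL): each counter tick folds in two A2 cycles, each contributing at most 8 weighted FLOPs plus 1 integer op.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t counter_tick(uint64_t counter,
                             unsigned flops_cyc0, unsigned ops_cyc0,
                             unsigned flops_cyc1, unsigned ops_cyc1)
{
    unsigned inc = flops_cyc0 + ops_cyc0 + flops_cyc1 + ops_cyc1;
    assert(inc <= 18);   /* 2 cycles x (8 FLOPs + 1 op) maximum */
    return counter + inc;
}

int main(void)
{
    return (int)counter_tick(0, 8, 1, 8, 1);   /* worst case adds 18 */
}
```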
Boxes 402/404 perform the set selection logic. They pick which groups go into the counter for adding. The weighting of the incoming groups happens in the FLOP_CNT boxes 412/414. In an implementation, certain groups are hard coded to certain weights (e.g., FMA gets 2, quad FMA gets 8). Other group weights are user programmable (DIV/SQRT), and some groups are hard coded to a weight of 1. The reduce block on the op path functions as an OR gate because, in this implementation, all integer instructions are counted as 1, and the groups are mutually exclusive since each instruction only goes into one group. In other embodiments, this reduce box can be as simple as an OR gate, or more complex, where, for example, each input group has a programmable weight.
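The reason a logical OR suffices on the op path can be shown in a few lines (illustrative only): every integer group carries weight 1 and each instruction falls in exactly one group, so at most one selected group line is active and the OR yields the 0-or-1 increment.

```c
#include <stdbool.h>

static unsigned op_reduce(const bool group_hit[], unsigned ngroups)
{
    bool any = false;
    for (unsigned i = 0; i < ngroups; i++)
        any = any || group_hit[i];   /* logical OR across the selected groups */
    return any ? 1u : 0u;            /* contributes 0 or 1 to the counter */
}

int main(void)
{
    bool hits[3] = { false, true, false };   /* instruction landed in group 1 */
    return (int)op_reduce(hits, 3);          /* 1 */
}
```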
The Thread Compare boxes are gating boxes. With each instruction that is input to these boxes, the thread that is executing the instruction is recorded. A 4-bit mask vector is input to this block to select which threads to count. Incrementers 436 and 440 are used, in the embodiment shown in FIG. 4, because the value of the OP input is always 1 or 0. If there were higher weights on the op side, a full adder of appropriate size may be used. The muxes 442 and 444 are used to mux other event information into the counter 446. For opcode counting, in one embodiment, these muxes are not needed.
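The thread gating can be modeled as a simple mask test (names assumed; the real block is a hardware comparator): a 4-bit mask selects which of the four hardware threads contribute to the count.

```c
#include <stdbool.h>
#include <stdint.h>

static bool thread_selected(unsigned thread_id /* 0..3 */, uint8_t thread_mask)
{
    return (thread_mask >> thread_id) & 1u;   /* count only masked-in threads */
}

int main(void)
{
    uint8_t mask = 0x5;                        /* count threads 0 and 2 only */
    return thread_selected(3, mask) ? 1 : 0;   /* thread 3 masked out -> 0 */
}
```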
The outputs of thread compares 422, 424 are applied to and counted by incrementer 436, and the outputs of thread compares 432, 434 are applied to and counted by incrementer 440. The outputs of incrementers 436, 440 are passed to multiplexers 442, 444, and the outputs of the multiplexers are applied to six-bit adder 446. The output of six-bit adder 446 is transmitted to fourteen-bit adder 450, and the output of the fourteen-bit adder is transmitted to counter register 452.
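A behavioral model of that datapath (a sketch under assumed widths and names, not RTL) accumulates the narrow per-tick sum into the wider running count:

```c
#include <stdint.h>

struct opcode_counter {
    uint8_t  partial;   /* models the narrow per-tick sum (six-bit adder 446) */
    uint64_t reg;       /* models the running count (adder 450 + register 452) */
};

static void counter_cycle(struct opcode_counter *c, unsigned flop_inc, unsigned op_inc)
{
    c->partial = (uint8_t)((flop_inc + op_inc) & 0x3F);   /* 6-bit sum; 0..18 fits */
    c->reg    += c->partial;                              /* wider accumulation */
}

int main(void)
{
    struct opcode_counter c = { 0, 0 };
    counter_cycle(&c, 16, 2);   /* one counter tick covering two A2 cycles */
    return (int)c.reg;          /* 18 */
}
```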
While it is apparent that the invention herein disclosed is well calculated to fulfill the objects discussed above, it will be appreciated that numerous modifications and embodiments may be devised by those skilled in the art, and it is intended that the appended claims cover all such modifications and embodiments as fall within the true spirit and scope of the present invention.

Claims (20)

The invention claimed is:
1. A digital circuit for counting instructions executed by a program running on a data processing system, the digital circuit comprising:
an input section, including a plurality of digital select gates for receiving input signals when the program executes specified types of instructions on the data processing system, each of the select gates including a select input for receiving a select signal for determining which groups of the input signals received by the each select gate are output as output signals from the said each select gate and for representing selected ones of the executed instructions;
a weighting section, including a plurality of digital weighting counters, for receiving output signals representing the selected ones of the executed instructions, and for generating weighted output signals representing assigned weighted values of the selected ones of the executed instructions;
a group of op counters for counting the weighted output signals; and
a logic section, including a plurality of digital comparators, for directing each of the weighted output signals to one op counter of the group of op counters, wherein the op counters maintain counts representing each of the specified types of instructions executed by the program on the data processing system.
2. The circuit according to claim 1, wherein the select gates determine which types of instructions are counted by the op counters.
3. The circuit according to claim 1, wherein the instructions executable by the program include floating point operations, and the select gates include one or more flop select gates for receiving one of the input signals when the program executes one of the floating point operations.
4. The circuit according to claim 1, wherein the instructions executable by the program include integer operations, and the select gates include one or more integer select gates for receiving one of the input signals when the program executes one of the integer operations.
5. The circuit according to claim 1, wherein each of the instructions includes an operating code portion, and the directing includes assigning the instructions to the group of counters based on the operating code portions of the instructions.
6. The circuit according to claim 1, wherein the input section includes:
a first circuit portion for receiving one of the input signals when the program executes a floating point operation; and
a second circuit portion for receiving one of the input signals when the program executes an integer operation.
7. The circuit according to claim 1, wherein a plurality of threads operate and execute the specified instructions on the processing system, and the circuit further comprises:
a thread compare section to identify, for each of the executed specified instructions, the one of the threads that executed said each instruction.
8. The circuit according to claim 7, wherein the thread compare section includes a plurality of gating boxes.
9. The circuit according to claim 8, wherein each of the gating boxes receives a mask for one of the plurality of threads.
10. The circuit according to claim 1, wherein each of the gating boxes generates an output and applies said output to one counter of the group of counters.
11. A method of operating a digital circuit for counting instructions executed by a program running on a data processing system, the digital circuit comprising an input section, a weighting section, a group of op counters, and a logic section, and the input section including a plurality of select gates, the method comprising:
when the program executes specified types of instructions on the data processing system, applying input signals to the select gates of the input section of the digital circuit,
applying a select signal to each of the select gates of the input section to determine which groups of the input signals received by the each select gates are output as output signals from the each select gates for representing selected ones of the executed instructions;
applying to the weighting section the output signals from the input section representing the selected ones of the executed instructions, and using the weighting section for generating weighted output signals representing assigned weighted values of the selected ones of the executed instructions;
applying the weighted output signals to the group of op counters; and
using the selection logic section for directing each of the weighted output signals to one op counter of the group of op counters, wherein the op counters maintain counts representing each of the specified types of instructions executed by the program on the data processing system.
12. The method according to claim 11, wherein the select gates determine which types of instructions are counted by the op counters.
13. The method according to claim 11, wherein the instructions executable by the program include floating point operations, and the select gates include one or more flop select gates for receiving one of the input signals when the program executes one of the floating point operations.
14. The method according to claim 11, wherein the instructions executable by the program include integer operations, and the select gates include one or more integer select gates for receiving one of the input signals when the program executes one of the integer operations.
15. The method according to claim 11, wherein each of the instructions includes an operating code portion, and the directing includes assigning the instructions to the group of counters based on the operating code portions of the instructions.
16. An article of manufacture comprising:
at least one tangible computer readable hardware medium having computer readable program code logic to execute machine instructions in one or more processing units for counting instructions executed by a program running on a data processing system, the program code logic, when executing, performing the following:
when the program executes specified types of instructions on the data processing system, applying input signals to select gates of an input module,
applying a select signal to each of the select gates of the input section to determine which groups of the input signals received by the each select gates are output as output signals from the each select gates for representing selected ones of the executed instructions;
applying to a weighting module the output signals from the input module representing the selected ones of the executed instructions, and using the weighting module for generating weighted output signals representing assigned weighted values of the selected ones of the executed instructions;
applying the weighted output signals to a group of op counters; and
using a selection logic module for directing each of the weighted output signals representing the executed instructions to one op counter of the group of counters, wherein the op counters maintain counts representing each of the specified types of instructions executed by the program on the data processing system.
17. The article of manufacture according to claim 16, wherein the select gates determine which types of instructions are counted by the op counters.
18. The article of manufacture according to claim 16, wherein the instructions executable by the program include floating point operations, and the select gates include one or more flop select gates for receiving one of the input signals when the program executes one of the floating point operations.
19. The article of manufacture according to claim 16, wherein the instructions executable by the program include integer operations, and the select gates include one or more integer select gates for receiving one of the input signals when the program executes one of the integer operations.
20. The article of manufacture according to claim 16, wherein each of the instructions includes an operating code portion, and the directing includes assigning the instructions to the group of counters based on the operating code portions of the instructions.
US15/291,351 | 2010-01-08 | 2016-10-12 | Opcode counting for performance measurement | Expired - Fee Related | US9921831B2 (en)

Priority Applications (2)

Application Number | Priority Date | Filing Date | Title
US15/291,351 (US9921831B2) | 2010-01-08 | 2016-10-12 | Opcode counting for performance measurement
US15/918,363 (US10713043B2) | 2010-01-08 | 2018-03-12 | Opcode counting for performance measurement

Applications Claiming Priority (5)

Application Number | Priority Date | Filing Date | Title
US29361110P | 2010-01-08 | 2010-01-08
US12/688,773 (US8571834B2) | 2010-01-08 | 2010-01-15 | Opcode counting for performance measurement
US14/063,610 (US9106656B2) | 2010-01-08 | 2013-10-25 | Opcode counting for performance measurement
US14/800,311 (US9473569B2) | 2010-01-08 | 2015-07-15 | Opcode counting for performance measurement
US15/291,351 (US9921831B2) | 2010-01-08 | 2016-10-12 | Opcode counting for performance measurement

Related Parent Applications (1)

Application Number | Relation | Publication | Priority Date | Filing Date | Title
US14/800,311 | Continuation | US9473569B2 (en) | 2010-01-08 | 2015-07-15 | Opcode counting for performance measurement

Related Child Applications (1)

Application Number | Relation | Publication | Priority Date | Filing Date | Title
US15/918,363 | Continuation | US10713043B2 (en) | 2010-01-08 | 2018-03-12 | Opcode counting for performance measurement

Publications (2)

Publication Number | Publication Date
US20170068536A1 (en) | 2017-03-09
US9921831B2 (en) | 2018-03-20

Family

ID=44259208

Family Applications (9)

Application Number | Status | Publication | Priority Date | Filing Date | Title
US12/688,773 | Expired - Fee Related | US8571834B2 (en) | 2009-11-13 | 2010-01-15 | Opcode counting for performance measurement
US12/693,972 | Expired - Fee Related | US8458267B2 (en) | 2009-11-13 | 2010-01-26 | Distributed parallel messaging for multiprocessor systems
US12/723,277 | Expired - Fee Related | US8521990B2 (en) | 2009-11-13 | 2010-03-12 | Embedding global barrier and collective in torus network with each node combining input from receivers according to class map for output to senders
US13/975,943 | Expired - Fee Related | US9374414B2 (en) | 2010-01-08 | 2013-08-26 | Embedding global and collective in a torus network with message class map based tree path selection
US14/063,610 | Expired - Fee Related | US9106656B2 (en) | 2010-01-08 | 2013-10-25 | Opcode counting for performance measurement
US14/800,311 | Expired - Fee Related | US9473569B2 (en) | 2010-01-08 | 2015-07-15 | Opcode counting for performance measurement
US15/160,766 | Expired - Fee Related | US10740097B2 (en) | 2010-01-08 | 2016-05-20 | Embedding global barrier and collective in a torus network
US15/291,351 | Expired - Fee Related | US9921831B2 (en) | 2010-01-08 | 2016-10-12 | Opcode counting for performance measurement
US15/918,363 | Expired - Fee Related | US10713043B2 (en) | 2010-01-08 | 2018-03-12 | Opcode counting for performance measurement


Country Status (1)

Country | Link
US (9) | US8571834B2 (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication Number | Priority Date | Publication Date | Assignee | Title
US4484269A (en) | 1982-05-05 | 1984-11-20 | Honeywell Information Systems Inc. | Apparatus for providing measurement of central processing unit activity
US5485574A (en) | 1993-11-04 | 1996-01-16 | Microsoft Corporation | Operating system based performance monitoring of programs
US20060277395A1 (en) | 2005-06-06 | 2006-12-07 | Fowles, Richard G. | Processor performance monitoring
US20070150705A1 (en) | 2005-12-28 | 2007-06-28 | Intel Corporation | Efficient counting for iterative instructions
US7937691B2 (en) | 2003-09-30 | 2011-05-03 | International Business Machines Corporation | Method and apparatus for counting execution of specific instructions and accesses to specific data locations
US8689190B2 (en) | 2003-09-30 | 2014-04-01 | International Business Machines Corporation | Counting instruction execution and data accesses



Also Published As

Publication Number | Publication Date
US10740097B2 (en) | 2020-08-11
US20140237045A1 (en) | 2014-08-21
US20110173413A1 (en) | 2011-07-14
US20110172969A1 (en) | 2011-07-14
US10713043B2 (en) | 2020-07-14
US20110173399A1 (en) | 2011-07-14
US20140052970A1 (en) | 2014-02-20
US9106656B2 (en) | 2015-08-11
US8521990B2 (en) | 2013-08-27
US20170068536A1 (en) | 2017-03-09
US20150347141A1 (en) | 2015-12-03
US20160316001A1 (en) | 2016-10-27
US9374414B2 (en) | 2016-06-21
US9473569B2 (en) | 2016-10-18
US8571834B2 (en) | 2013-10-29
US8458267B2 (en) | 2013-06-04
US20180203693A1 (en) | 2018-07-19


Legal Events

Date | Code | Title | Description
AS: Assignment

Owner name:INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y

Free format text:ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:GARA, ALAN;SATTERFIELD, DAVID L.;WALKUP, ROBERT E.;SIGNING DATES FROM 20150429 TO 20150611;REEL/FRAME:039996/0641

STCF: Information on status: patent grant

Free format text:PATENTED CASE

FEPP: Fee payment procedure

Free format text:MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

LAPS: Lapse for failure to pay maintenance fees

Free format text:PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STCH: Information on status: patent discontinuation

Free format text:PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362

FP: Lapsed due to failure to pay maintenance fee

Effective date:20220320

