Performance Counter Analysis


Angelogeb/perfctr-ht


Title: Performance counters analysis for Hyper-Threading
Authors: Beatrice Bevilacqua, Anxhelo Xhebraj
Date: March 2019

Performance Counters Frameworks

The complexity of newer architectures has made a better knowledge of the underlying hardware necessary in order to reach peak performance. Following this trend, new interfaces such as Performance Monitoring Units (PMUs) have been made available to developers for spotting performance bottlenecks in their applications.

PMUs enable developers to observe and count events in the CPU such as branch mispredictions, cache misses and other finer-grained details over the whole pipeline. Although powerful, dealing with such information remains burdensome given the diversity of the events, making it difficult to truly identify optimization opportunities. Depending on the processor family, on average 4 counters can be read simultaneously through Model Specific Registers. In order to read more than 4 events, various tools multiplex these registers in a time-sharing fashion.
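When registers are multiplexed, each event is counted for only part of the run, so the raw count has to be extrapolated to the whole interval. A minimal sketch of that scaling, assuming a simple linear extrapolation of the kind Linux perf applies; the function and parameter names are hypothetical and the numbers are made up for illustration:

```python
def scale_multiplexed_count(raw_count, time_enabled, time_running):
    """Extrapolate a count that was measured for only part of the interval.

    raw_count    -- events counted while the event occupied a hardware counter
    time_enabled -- total time the event was requested
    time_running -- time the event was actually scheduled on a counter
    """
    if time_running == 0:
        return 0.0  # the event never got a counter: nothing to extrapolate
    return raw_count * (time_enabled / time_running)


# Hypothetical case: 8 events share 4 counters, so each runs half of the time.
estimate = scale_multiplexed_count(raw_count=1_000_000,
                                   time_enabled=2.0,
                                   time_running=1.0)
print(estimate)
```

The estimate is only as good as the assumption that the event rate is uniform over the interval, which is one reason short multiplexing intervals give noisy values.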

Many tools for performance analysis based on PMUs have been developed, ranging from raw event counts to more sophisticated and aggregated measures:

  • msr: direct access to the device files /dev/cpu/*/msr
  • PAPI: a Performance Application Programming Interface that offers a set of APIs for using performance counters. Supports multiple architectures and multiplexing.
  • likwid: a suite of applications and libraries for analysing High Performance Computing applications. It contains out-of-the-box utilities to work with MPI, power profiling and architecture topology.
  • Intel VTune Amplifier: an application for performance analysis on Intel architectures. It gives insights into possible bottlenecks of the application by annotating its source code, and suggests possible solutions.
  • perf: in a similar vein to Intel VTune Amplifier, it shows which functions are most critical to the application. It additionally provides higher-level information such as I/O and networking. Raw hardware performance counters can be analysed, but its main goal is abstracting over them.
  • pmu-tools: a collection of tools for profile collection and performance analysis on Intel CPUs, built on top of Linux perf.
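To make the lowest-level option in the list concrete, here is a minimal sketch of reading one register through the msr device file. It assumes a Linux system with the msr kernel module loaded and root privileges; the helper name read_msr and its path_template parameter are hypothetical and exist only to keep the example testable.

```python
import os
import struct


def read_msr(register, cpu=0, path_template="/dev/cpu/{cpu}/msr"):
    """Read one 64-bit Model Specific Register of a given logical CPU."""
    fd = os.open(path_template.format(cpu=cpu), os.O_RDONLY)
    try:
        # The file offset selects the register address; each read is 8 bytes.
        raw = os.pread(fd, 8, register)
    finally:
        os.close(fd)
    if len(raw) != 8:
        raise OSError("short read from MSR device")
    return struct.unpack("<Q", raw)[0]  # little-endian unsigned 64-bit


# Example (needs root and the msr module), 0x10 being IA32_TIME_STAMP_COUNTER:
#   tsc = read_msr(0x10, cpu=0)
```

The higher-level tools in the list wrap this kind of access and add event selection, multiplexing and derived metrics on top of it.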

likwid

Given that the goal of this document is to analyze system behaviour through performance counters in order to provide insights into possible new scheduling strategies for Hyper-Threading systems, we chose the likwid applications and libraries for our task. The choice was driven especially by the presence in the likwid repository of useful benchmarks for stressing the FPU and other core subsystems. Additionally, Intel VTune Amplifier was used to profile the benchmarks in order to characterize their workload.

likwid-perfctr -e lists all the events available on the current architecture, while likwid-perfctr -a shows the pre-configured event sets, called performance groups, each bundling pre-selected events and derived metrics. Several modes of performance monitoring are available, as documented in the likwid wiki. Of main interest are wrapper mode and timeline mode. The former produces a summary of the events, while the latter outputs performance metrics at a fixed interval (specified through the -t flag). When multiple groups need to be monitored, multiplexing is performed at the granularity set through the -t flag in timeline mode (-T in wrapper mode), and each output line reports the id of the group read at that timestep together with its values.
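The two modes differ only in which interval flag is passed, which a small command builder can make explicit. This is a sketch under assumptions: the helper perfctr_command is hypothetical, ./my_app is a placeholder application, the group names L2 and L3 may not exist on every architecture (check likwid-perfctr -a), and the interval string format should be checked against likwid-perfctr -h.

```python
import shlex


def perfctr_command(app, cpus, groups, interval, timeline=True):
    """Build a likwid-perfctr argument list; it does not run anything.

    app      -- command line of the application to measure
    cpus     -- CPU list to pin to and measure on, e.g. "0" or "0,6"
    groups   -- performance groups; more than one implies multiplexing
    interval -- sampling (timeline) or group switching (wrapper) interval
    """
    cmd = ["likwid-perfctr", "-C", cpus]
    for group in groups:
        cmd += ["-g", group]
    # -t selects timeline mode; -T sets the switching interval in wrapper mode.
    cmd += ["-t" if timeline else "-T", interval]
    return cmd + shlex.split(app)


cmd = perfctr_command("./my_app --size 100", "0", ["L2", "L3"], "500ms")
print(" ".join(cmd))
```

The resulting list can be handed to subprocess.run once the flags have been verified against the installed likwid version.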

Tests have shown that for measurement intervals below 100 milliseconds the periodically printed results are no longer valid in absolute terms (they are higher than expected), but their trend is still meaningful. For example, resolving burst memory transfers requires results for small intervals. The memory bandwidth of each measurement may be higher than expected (it could even exceed the theoretical maximum of the machine), but burst and non-burst traffic remain clearly identifiable by the highs and lows of the memory bandwidth results.

Benchmarks

The benchmarks available in likwid can be run through the likwid-bench command. For an overview of the available benchmarks, run likwid-bench -a. All benchmarks perform operations over one-dimensional arrays. The benchmarks used in our setting are:

  • ddot_sp: Single-precision dot product of two vectors, only scalar operations
  • copy: Double-precision vector copy, only scalar operations
  • ddot_sp_avx: Single-precision dot product of two vectors, optimized for AVX
  • sum_int: Custom benchmark similar to sum but working on integers
  • copy_scattered: Same as copy but accesses at page granularity (4096)

The last two benchmarks (sum_int and copy_scattered) can be found under the benchmarks/ folder of this repository. Instructions on how to compile them in likwid can be found here.

All benchmarks are run with multiple configurations of the number of threads (with or without Hyper-Threading), the processor frequency (with TurboBoost disabled) and the working set size. The latter is varied in order to emulate core-bound executions (working set fitting in cache) and memory-bound ones.
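Whether a run is core-bound or memory-bound depends on where the working set fits. A minimal sketch of that check, assuming the cache sizes of an i7-8750H (32 KiB L1d and 256 KiB L2 per core, 9 MiB shared L3); the names are hypothetical and the sizes should be replaced with those of the machine under test:

```python
# Assumed cache capacities in bytes, ordered from fastest to slowest level.
CACHE_LEVELS = [
    ("L1d", 32 * 1024),
    ("L2", 256 * 1024),
    ("L3", 9 * 1024 * 1024),
]


def fitting_level(working_set_bytes):
    """Return the first cache level that can hold the working set, else DRAM."""
    for name, capacity in CACHE_LEVELS:
        if working_set_bytes <= capacity:
            return name
    return "DRAM"


print(fitting_level(16 * 1024))          # small array: stays in L1d
print(fitting_level(1024 * 1024 * 1024)) # 1 GiB array: spills to DRAM
```

Two threads running on the same physical core share its L1d and L2, so in Hyper-Threading runs the per-thread budget is smaller than these figures suggest.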

Repository Structure

The repository has two branches: master and aggregated. The former contains, under the data/ folder, the profiling of the benchmarks run in timeline mode (-t flag), while the latter contains, under data_aggregated/, the profiling data collected in wrapper mode (-T).

Both folders were generated through the bench.sh file and minor variations of it for different numbers of threads and Hyper-Threading activation.

master Details

Data naming convention

<BenchmarkName>-<#threads>-<freq>.<ext>
  • #threads: 1 or 2. When the value is 2, the threads run on the same physical core in Hyper-Threading.

  • freq: 1.0, 1.5, 2.2. The smallest frequency is not precise and ranges from 1.0 to 1.28.

  • ext: stdout, stderr, header. stderr holds the actual data, with one line per group and the groups repeating for each sample. header holds the names of the metrics/performance counters in each group. stdout holds the total running time of the benchmark and additional info. The output produced by bench.sh is not clean, so the data has been processed through process_data.py.
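The naming convention above can be decoded mechanically. A minimal sketch, assuming benchmark names may themselves contain hyphens (as in ALU-core) and that the extension may be omitted; RunName and parse_master_name are hypothetical names, not part of the repository's scripts:

```python
from pathlib import Path
from typing import NamedTuple

KNOWN_EXTS = ("stdout", "stderr", "header")


class RunName(NamedTuple):
    benchmark: str
    threads: int
    freq_ghz: float
    ext: str  # empty string when the path carries no known extension


def parse_master_name(path):
    """Split <BenchmarkName>-<#threads>-<freq>.<ext> into its fields."""
    name = Path(path).name
    ext = ""
    for known in KNOWN_EXTS:
        if name.endswith("." + known):
            name, ext = name[: -len(known) - 1], known
            break
    # Split from the right so hyphens inside the benchmark name survive.
    benchmark, threads, freq = name.rsplit("-", 2)
    return RunName(benchmark, int(threads), float(freq), ext)


print(parse_master_name("data/ALU-core-1-1.0.stderr"))
print(parse_master_name("data/Copy-mem-2-2.2"))
```

Checking the extension against a fixed list avoids mistaking the decimal part of the frequency for an extension when the path is given without one.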

Plotting

To analyze the data easily, plot_data.py can be used. It is a Python 3.6 script which uses the matplotlib library.

Usage
usage: plot_data.py [-h] [-g GROUP] [-m METRIC] [-p] file [file ...]

Plot data

positional arguments:
  file                  Path to the file without extension

optional arguments:
  -h, --help            show this help message and exit
  -g GROUP, --group GROUP
  -m METRIC, --metric METRIC
  -p, --print-groups

  • First choose a file or set of files to work on
  • Run python plot_data.py -p <files> to see which groups are available for analysis
  • Choose one group and its id to plot
  • Run python plot_data.py -g <groupId> <files>
  • Choose the name of one of the metrics of the group printed out
  • Run python plot_data.py -g <groupId> -m "<metricName>" <files>
  • Note: Enclose the metric name in double quotes

$ python plot_data.py -g 16 -m "Power DRAM [W]" data/ALU-core-1-1.0 data/ALU-core-1-1.5 data/ALU-core-1-2.2 data/Copy-mem-2-2.2 data/ALU-mem-1-1.0 data/ALU-mem-1-1.5 data/ALU-mem-1-2.2

aggregated Details

Data naming convention

<BenchmarkName>-<#threads><-HT>-<freq>.<ext>
  • #threads: 1, 3, 6. Number of physical cores used in the benchmark.
  • -HT: present or not. If present, two threads were running in Hyper-Threading on each physical core, otherwise just one.
  • freq: 1.2, 1.7, 2.2. Maximum frequency available for the benchmark. In this case the lowest frequency is precise.
  • ext: csv or stdout. The csv contains the values of the metrics of the benchmark, while stdout contains the running time and additional details.
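The optional -HT marker makes this convention slightly harder to split by hand, so a regular expression is a convenient way to decode it. A minimal sketch; the pattern, the function name and the example file names are assumptions built from the convention above, not files known to exist in the repository:

```python
import re

# <BenchmarkName>-<#threads><-HT>-<freq>.<ext>
AGGREGATED_NAME = re.compile(
    r"^(?P<benchmark>.+)"      # may contain hyphens, e.g. ALU-core
    r"-(?P<cores>\d+)"         # number of physical cores
    r"(?P<ht>-HT)?"            # present when Hyper-Threading was used
    r"-(?P<freq>\d+\.\d+)"     # maximum frequency in GHz
    r"\.(?P<ext>csv|stdout)$"
)


def parse_aggregated_name(filename):
    """Return the fields of an aggregated-branch file name as a dict."""
    match = AGGREGATED_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    return {
        "benchmark": match["benchmark"],
        "cores": int(match["cores"]),
        "hyper_threading": match["ht"] is not None,
        "freq_ghz": float(match["freq"]),
        "ext": match["ext"],
    }


print(parse_aggregated_name("ALU-core-3-HT-1.7.csv"))
print(parse_aggregated_name("Scattered-mem-6-2.2.stdout"))
```

With Hyper-Threading enabled the number of software threads is twice the cores field, since the convention records physical cores rather than threads.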

Plotting

To analyze the data easily, histogram.py can be used. It is a Python 3.6 script which uses the matplotlib library.

Usage
usage: histogram.py [-h] [-g GROUP] [-m METRIC] [-t TYPE] [-p]
                    file [file ...]

Plot data

positional arguments:
  file                  Path to the file without extension

optional arguments:
  -h, --help            show this help message and exit
  -g GROUP, --group GROUP
  -m METRIC, --metric METRIC
  -t TYPE, --type TYPE
  -p, --print-groups

  • First choose a file or set of files to work on
  • Run python histogram.py -p <files> to see which groups are available for analysis
  • Choose one group and its id to plot
  • Run python histogram.py -g <groupId> <files>
  • Choose the name of one of the metrics of the group printed out, and also choose the type of aggregation across threads, given that there is one value of the metric for each thread. Possible values are avg, sum, any, min, max.
  • Run python histogram.py -g <groupId> -m "<metricName>" -t <type> <files>
  • Note: Enclose the metric name in double quotes and, when passing the file names, remove their extension

$ python histogram.py -g 1 -t avg -m "Avg stall duration [cycles]" data_aggregated/Scattered*.csv data_aggregated/ALU-core*.csv

The x markers represent the running time of the benchmark (right y axis) while the bars represent the metric (left y axis).
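Since each metric has one value per thread, the aggregation type decides which single number ends up as a bar. A minimal sketch of the five aggregation types; it is not the code of histogram.py, and the meaning given to any (take the first thread's value) is an assumption:

```python
from statistics import mean

# One reducer per aggregation type accepted by the -t flag.
AGGREGATORS = {
    "avg": mean,
    "sum": sum,
    "min": min,
    "max": max,
    "any": lambda values: values[0],  # assumption: first thread's value
}


def aggregate(per_thread_values, kind):
    """Collapse the per-thread values of one metric into a single number."""
    if kind not in AGGREGATORS:
        raise ValueError(f"unknown aggregation type: {kind}")
    if not per_thread_values:
        raise ValueError("no values to aggregate")
    return AGGREGATORS[kind](per_thread_values)


stalls = [12.0, 18.0]  # hypothetical per-thread values of one metric
for kind in ("avg", "sum", "any", "min", "max"):
    print(kind, aggregate(stalls, kind))
```

Rate-like metrics are usually best averaged, while additive quantities such as event counts or power are better summed.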

correlations

Additionally, Pearson correlation coefficients for each varying input (i.e. #threads, frequency and HT) have been computed through pearson_corr.py and can be found in correlations/.

Naming Convention
  • <BenchmarkName>-freq.csv: for each combination of #threads and HT, the correlation coefficient between the three values of the frequency and the values of the metrics at those frequencies.

  • <BenchmarkName>-HT.csv: for each combination of #threads and frequency, the correlation coefficient between the two values of HT and the values of the metrics with HT enabled or disabled.

  • <BenchmarkName>-threads.csv: for each combination of frequency and HT, the correlation coefficient between the three values of #threads and the values of the metrics at those #threads.
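For reference, the coefficient stored in these files can be sketched in a few lines. This is the textbook Pearson formula, not the code of pearson_corr.py, and the metric values below are made up for illustration:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return float("nan")  # a constant input has no defined correlation
    return covariance / (spread_x * spread_y)


frequencies = [1.2, 1.7, 2.2]    # GHz, the varying input
metric = [10.0, 15.0, 20.0]      # hypothetical metric values
print(pearson(frequencies, metric))
```

For the HT files the input can be encoded as 0 (disabled) and 1 (enabled). With only two or three points per combination the coefficient is a coarse indicator of direction rather than a statistically robust measure.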

Plotting
$ python plot_corr.py correlations/Scattered-mem-freq.csv

Details

The tests were run on a Dell XPS 9750 with an i7-8750H while charging. With TurboBoost disabled, the available frequencies range from 1.2 to 2.2 GHz. There is one socket with 6 physical cores and 12 logical cores (in Hyper-Threading).
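Running two threads on the same physical core requires knowing which logical cores are siblings. A minimal sketch, assuming a Linux system that exposes the topology through sysfs; the helper names are hypothetical:

```python
def parse_cpu_list(text):
    """Expand a sysfs CPU list such as "0,6" or "0-3,8" into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        elif part:
            cpus.append(int(part))
    return cpus


def thread_siblings(cpu):
    """Logical CPUs sharing a physical core with `cpu` (Linux sysfs only)."""
    path = f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list"
    with open(path) as handle:
        return parse_cpu_list(handle.read())


print(parse_cpu_list("0,6"))
print(parse_cpu_list("0-3,8"))
```

On a 6-core, 12-thread machine the sibling of logical CPU 0 is commonly CPU 6 or CPU 1 depending on how the kernel enumerates cores, so the list should be read rather than assumed.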

