# Angelogeb/perfctr-ht

| author | date | title |
|---|---|---|
|  | March 2019 | Performance counters analysis for Hyper-Threading |
The complexity of newer architectures has made better knowledge of the underlying hardware a necessity in order to achieve peak performance. Following this trend, new interfaces such as Performance Monitoring Units (PMUs) have been made available to developers for spotting performance bottlenecks in their applications.
PMUs enable developers to observe and count CPU events such as branch mispredictions, cache misses and other fine-grained details across the whole pipeline. Although powerful, dealing with such information remains burdensome given the diversity of the events, making it difficult to truly identify optimization opportunities. Depending on the processor family, on average 4 counters can be read concurrently at any time using Model Specific Registers. In order to read more than 4 events, various tools multiplex such registers in a time-sharing fashion.
Many tools for performance analysis based on PMUs have been developed, ranging from raw event counts to more sophisticated and aggregated measures:

- `msr`: direct access to the device files `/dev/cpu/*/msr`
- PAPI: a Performance Application Programming Interface that offers a set of APIs for using performance counters. Supports multiple architectures and multiplexing.
- likwid: a suite of applications and libraries for analysing High Performance Computing applications. It contains out-of-the-box utilities to work with MPI, power profiling and architecture topology.
- Intel VTune Amplifier: an application for performance analysis on Intel architectures. It gives insights regarding possible bottlenecks by annotating the application's source code, and provides possible solutions.
- perf: in a similar vein to Intel VTune Amplifier, it shows which functions are most critical to the application. Additionally it provides higher-level information such as I/O and networking. It is possible to analyse raw hardware performance counters, but its main goal is abstracting over them.
- pmu-tools: a collection of tools for profile collection and performance analysis on Intel CPUs, built on top of Linux perf.
Given that the goal of this document is to analyze system behaviour through performance counters, to provide insights regarding new possible scheduling strategies in Hyper-Threading systems, we choose to use the `likwid` applications and libraries for our task. The choice was especially driven by the presence of useful benchmarks in the `likwid` repository for stressing the FPU and other core subsystems. Additionally, Intel VTune Amplifier was used to profile the benchmarks in order to characterize their workload.
`likwid-perfctr -e` allows querying all the available events for the current architecture, while `likwid-perfctr -a` shows the pre-configured event sets, called performance groups, with useful pre-selected events and derived metrics. Multiple modes of execution of performance monitoring are available, as documented in the `likwid` wiki. Of main interest are wrapper mode and timeline mode. The former produces a summary of the events, while the latter outputs performance metrics at a specified frequency (set through the `-t` flag). In case multiple groups need to be monitored, multiplexing is performed at the granularity set through the `-t` flag (in timeline mode, or `-T` for wrapper mode), and each output line contains the id of the group read at a given timestep together with its values.
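By way of illustration (the group names, core list and program below are examples, not the exact commands used to produce this dataset), the two modes are invoked along these lines:

```shell
# Wrapper mode: one summary of the MEM group over the whole run
likwid-perfctr -C 0 -g MEM ./my_benchmark

# Timeline mode: print MEM and FLOPS_DP metrics every 100 ms,
# multiplexing the two groups at that granularity
likwid-perfctr -C 0 -g MEM -g FLOPS_DP -t 100ms ./my_benchmark
```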
Tests have shown that for measurements below 100 milliseconds the periodically printed results are no longer valid (they are higher than expected), but the overall behavior of the results is still valid. For example, if you try to resolve burst memory transfers, you need results for small intervals. The memory bandwidth for each measurement may be higher than expected (it could even be higher than the theoretical maximum of the machine), but the burst and non-burst traffic is clearly identifiable by highs and lows of the memory bandwidth results.
The benchmarks available in `likwid` can be run through the `likwid-bench` command. For an overview of the available benchmarks run `likwid-bench -a`. All benchmarks perform operations over one-dimensional arrays. The benchmarks used in our setting are:

- `ddot_sp`: single-precision dot product of two vectors, only scalar operations
- `copy`: double-precision vector copy, only scalar operations
- `ddot_sp_avx`: single-precision dot product of two vectors, optimized for AVX
- `sum_int`: custom benchmark similar to `sum` but working on integers
- `copy_scattered`: same as `copy` but accesses at page granularity (4096)
The last two benchmarks (`sum_int` and `copy_scattered`) can be found under the `benchmarks/` folder of this repository. Instructions on how to compile them in likwid can be found here.
All benchmarks are run with multiple configurations of number of threads (with or without Hyper-Threading), processor frequency (with TurboBoost disabled) and working set size. The latter is needed in order to emulate core-bound executions (working set fitting in cache) and memory-bound ones.
The repository has two branches: `master` and `aggregated`. The former contains, under the `data/` folder, the profiling of the benchmarks run in timeline mode (`-t` flag), while the latter contains, under `data_aggregated/`, the data of the profiling in wrapper mode (`-T`).

Both folders were generated through the `bench.sh` script and minor variations of it for different numbers of threads and Hyper-Threading activation.
Files in `data/` follow the naming scheme `<BenchmarkName>-<#threads>-<freq>.<ext>`:

- `#threads`: 1 or 2. When the value is 2, the threads run in the same physical core in Hyper-Threading.
- `freq`: 1.0, 1.5, 2.2. The smallest frequency is not precise and ranges from 1.0 to 1.28.
- `ext`: `stdout`, `stderr`, `header`. In `stderr` there is the actual data, with one line per group and groups repeating for each sample. In `header` can be found the names of the metrics/performance counters in each group. In `stdout` there are the total running time of the benchmark and additional info.

The output produced by `bench.sh` is not clean, therefore such data has been processed through `process_data.py`.
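As a sketch of what such processing has to deal with, assuming each timeline sample line starts with the group id followed by its values (a simplification of the real likwid column layout, and not the actual code of `process_data.py`), samples can be regrouped per performance group like this:

```python
from collections import defaultdict

def split_by_group(lines):
    """Group timeline samples by their leading group id.

    Assumes the hypothetical layout 'groupid v1 v2 ...';
    the real likwid timeline output has more columns.
    """
    groups = defaultdict(list)
    for line in lines:
        fields = line.split()
        gid, values = fields[0], [float(v) for v in fields[1:]]
        groups[gid].append(values)
    return dict(groups)

sample = ["1 3.5 2.0", "2 0.9 1.1", "1 3.6 2.1"]
print(split_by_group(sample)["1"])  # [[3.5, 2.0], [3.6, 2.1]]
```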
In order to easily analyze the data, `plot_data.py` can be used. It is a Python 3.6 script which uses the matplotlib library.
```
usage: plot_data.py [-h] [-g GROUP] [-m METRIC] [-p] file [file ...]

Plot data

positional arguments:
  file                  Path to the file without extension

optional arguments:
  -h, --help            show this help message and exit
  -g GROUP, --group GROUP
  -m METRIC, --metric METRIC
  -p, --print-groups
```
- First choose a file or set of files to work on
- Run `python plot_data.py -p <files>` to see which groups are available for analysis
- Choose one group and its id to plot
- Run `python plot_data.py -g <groupId> <files>`
- Choose the name of one of the metrics of the group printed out
- Run `python plot_data.py -g <groupId> -m "<metricName>" <files>`
- Note: escape the metric name with double quotes
```
$ python plot_data.py -g 16 -m "Power DRAM [W]" data/ALU-core-1-1.0 data/ALU-core-1-1.5 data/ALU-core-1-2.2 data/Copy-mem-2-2.2 data/ALU-mem-1-1.0 data/ALU-mem-1-1.5 data/ALU-mem-1-2.2
```
Files in `data_aggregated/` follow the naming scheme `<BenchmarkName>-<#threads><-HT>-<freq>.<ext>`:

- `#threads`: 1, 3, 6. Number of physical cores used in the benchmark.
- `-HT`: present or not. If present, each physical core ran two threads in Hyper-Threading, otherwise just one.
- `freq`: 1.2, 1.7, 2.2. Maximum frequency available for the benchmark. In this case the lowest frequency is precise.
- `ext`: `csv` or `stdout`. The csv contains the values of the metrics of the benchmark, while stdout contains the running time and additional details.
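A small sketch (assuming only the naming scheme above; the helper itself is ours, not part of the repository) of how such file names can be decomposed:

```python
import re

# Hypothetical helper: decompose an aggregated-data file name
# following <BenchmarkName>-<#threads><-HT>-<freq>.<ext>.
NAME_RE = re.compile(
    r"^(?P<bench>.+)-(?P<threads>\d+)(?P<ht>-HT)?-(?P<freq>\d\.\d)\.(?P<ext>csv|stdout)$"
)

def parse_name(fname):
    m = NAME_RE.match(fname)
    if m is None:
        raise ValueError(f"unexpected file name: {fname}")
    return {
        "bench": m.group("bench"),
        "threads": int(m.group("threads")),
        "ht": m.group("ht") is not None,
        "freq": float(m.group("freq")),
        "ext": m.group("ext"),
    }

print(parse_name("Scattered-mem-3-HT-1.7.csv"))
```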
In order to easily analyze the data, `histogram.py` can be used. It is a Python 3.6 script which uses the matplotlib library.
```
usage: histogram.py [-h] [-g GROUP] [-m METRIC] [-t TYPE] [-p] file [file ...]

Plot data

positional arguments:
  file                  Path to the file without extension

optional arguments:
  -h, --help            show this help message and exit
  -g GROUP, --group GROUP
  -m METRIC, --metric METRIC
  -t TYPE, --type TYPE
  -p, --print-groups
```
- First choose a file or set of files to work on
- Run `python histogram.py -p <files>` to see which groups are available for analysis
- Choose one group and its id to plot
- Run `python histogram.py -g <groupId> <files>`
- Choose the name of one of the metrics of the group printed out, and additionally choose the type of aggregation across threads, given that there is one value of the metric for each thread. Possible values are `avg`, `sum`, `any`, `min`, `max`.
- Run `python histogram.py -g <groupId> -m "<metricName>" -t <type> <files>`
- Note: escape the metric name with double quotes and, when passing the file names, remove their extension
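A minimal sketch of what the per-thread aggregation types could look like (the helper below is ours, not the script's actual code; in particular the meaning of `any` is our guess of "pick one thread's value"):

```python
# Hypothetical aggregation helpers mirroring the -t flag values;
# each reduces the per-thread values of a metric to a single number.
AGGREGATIONS = {
    "avg": lambda vs: sum(vs) / len(vs),
    "sum": sum,
    "any": lambda vs: vs[0],  # assumption: any one thread's value
    "min": min,
    "max": max,
}

def aggregate(per_thread_values, kind):
    return AGGREGATIONS[kind](per_thread_values)

print(aggregate([2.0, 4.0], "avg"))  # 3.0
```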
```
$ python histogram.py -g 1 -t avg -m "Avg stall duration [cycles]" data_aggregated/Scattered*.csv data_aggregated/ALU-core*.csv
```
The `x`s represent the running time of the benchmark (right y axis), while the bars represent the metric (left y axis).
Additionally, Pearson correlation coefficients for each varying input (i.e. #threads, frequency and HT) have been computed through `pearson_corr.py` and can be found in `correlations/`.
- `<BenchmarkName>-freq.csv`: for each combination of #threads and HT, the correlation coefficient between the three values of the frequency and the values of the metrics at those frequencies.
- `<BenchmarkName>-HT.csv`: for each combination of #threads and frequencies, the correlation coefficient between the two values of HT and the values of the metrics with HT enabled or disabled.
- `<BenchmarkName>-threads.csv`: for each combination of frequencies and HT, the correlation coefficient between the three values of #threads and the values of the metrics at those #threads.
```
$ python plot_corr.py correlations/Scattered-mem-freq.csv
```
The tests were run on a Dell XPS 9750 with an i7-8750H while charging. With TurboBoost disabled, the available frequencies range from 1.2 to 2.2 GHz. There is one socket with 6 physical cores and 12 logical cores (in Hyper-Threading).