# Angelogeb/perfctr-ht

| author | date | title |
|---|---|---|
|  | March 2019 | Performance counters analysis for Hyper-Threading |
The complexity of newer architectures has made better knowledge of the underlying hardware a necessity in order to achieve peak performance. Following this trend, new interfaces such as Performance Monitoring Units (PMUs) have been made available to developers for spotting performance bottlenecks in their applications.
PMUs enable developers to observe and count CPU events such as branch mispredictions, cache misses and other fine-grained details across the whole pipeline. Although powerful, dealing with such information remains burdensome given the diversity of the events, making it difficult to truly identify optimization opportunities. Depending on the processor family, on average 4 counters can be read concurrently at any time using Model Specific Registers. In order to read more than 4 events, various tools multiplex such registers in a time-sharing fashion.
Many tools for performance analysis based on PMUs have been developed, ranging from raw event counts to more sophisticated and aggregated measures:

- `msr`: direct access to the device files `/dev/cpu/*/msr`
- PAPI: a Performance Application Programming Interface that offers a set of APIs for using performance counters. Supports multiple architectures and multiplexing.
- likwid: a suite of applications and libraries for analysing High Performance Computing applications. It contains out-of-the-box utilities to work with MPI, power profiling and architecture topology.
- Intel VTune Amplifier: an application for performance analysis on Intel architectures. It gives insights regarding possible bottlenecks by annotating the application's source code, and provides possible solutions.
- perf: in a similar vein to Intel VTune Amplifier, it shows which functions are most critical to the application. Additionally it provides higher-level information such as I/O and networking. It is possible to analyse raw hardware performance counters, but its main goal is abstracting over them.
- pmu-tools: a collection of tools for profile collection and performance analysis on Intel CPUs, built on top of Linux perf.
Given that the goal of this document is to analyze system behaviour through performance counters, to provide insights regarding new possible scheduling strategies in Hyper-Threading systems, we choose to use the `likwid` applications and libraries for our task. The choice was especially driven by the presence of useful benchmarks in the `likwid` repository for stressing the FPU and other core subsystems. Additionally, Intel VTune Amplifier was used to profile the benchmarks in order to characterize their workload.
`likwid-perfctr -e` allows querying all the available events for the current architecture, while `likwid-perfctr -a` shows the pre-configured event sets, called performance groups, with useful pre-selected events and derived metrics. Multiple modes of execution of performance monitoring are available, as documented in the `likwid` wiki. Of main interest are wrapper mode and timeline mode. The former produces a summary of the events, while the latter outputs performance metrics at a specified frequency (set through the `-t` flag). In case multiple groups need to be monitored, multiplexing is performed at the granularity set through the `-t` flag (in timeline mode, or `-T` for wrapper mode), and each output line contains the id of the group read at a given timestep together with its values.
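By way of illustration (the group names, core list and program below are examples, not the exact commands used to produce this dataset), the two modes are invoked along these lines:

```shell
# Wrapper mode: one summary of the MEM group over the whole run
likwid-perfctr -C 0 -g MEM ./my_benchmark

# Timeline mode: print MEM and FLOPS_DP metrics every 100 ms,
# multiplexing the two groups at that granularity
likwid-perfctr -C 0 -g MEM -g FLOPS_DP -t 100ms ./my_benchmark
```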
Tests have shown that for measurements below 100 milliseconds the periodically printed results are no longer valid (they are higher than expected), but the overall behavior of the results is still valid. For example, if you try to resolve burst memory transfers, you need results for small intervals. The memory bandwidth for each measurement may be higher than expected (it could even be higher than the theoretical maximum of the machine), but the burst and non-burst traffic is clearly identifiable by highs and lows of the memory bandwidth results.
The benchmarks available in `likwid` can be run through the `likwid-bench` command. For an overview of the available benchmarks run `likwid-bench -a`. All benchmarks perform operations over one-dimensional arrays. The benchmarks used in our setting are:

- `ddot_sp`: single-precision dot product of two vectors, only scalar operations
- `copy`: double-precision vector copy, only scalar operations
- `ddot_sp_avx`: single-precision dot product of two vectors, optimized for AVX
- `sum_int`: custom benchmark similar to `sum` but working on integers
- `copy_scattered`: same as `copy` but accesses at page granularity (4096)
The last two benchmarks (`sum_int` and `copy_scattered`) can be found under the `benchmarks/` folder of this repository. Instructions on how to compile them in likwid can be found here.
All benchmarks are run with multiple configurations of number of threads (with or without Hyper-Threading), processor frequency (with TurboBoost disabled) and working set size. The latter is needed in order to emulate core-bound executions (working set fitting in cache) and memory-bound ones.
The repository has two branches: `master` and `aggregated`. The former contains, under the `data/` folder, the profiling of the benchmarks run in timeline mode (`-t` flag), while the latter contains, under `data_aggregated/`, the data of the profiling in wrapper mode (`-T`).

Both folders were generated through the `bench.sh` script and minor variations of it for different numbers of threads and Hyper-Threading activation.
Files in `data/` follow the naming scheme `<BenchmarkName>-<#threads>-<freq>.<ext>`:

- `#threads`: 1 or 2. When the value is 2, the threads run in the same physical core in Hyper-Threading.
- `freq`: 1.0, 1.5, 2.2. The smallest frequency is not precise and ranges from 1.0 to 1.28.
- `ext`: `stdout`, `stderr`, `header`. In `stderr` there is the actual data, with one line per group and groups repeating for each sample. In `header` can be found the names of the metrics/performance counters in each group. In `stdout` there are the total running time of the benchmark and additional info.

The output produced by `bench.sh` is not clean, therefore such data has been processed through `process_data.py`.
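As a sketch of what such processing has to deal with, assuming each timeline sample line starts with the group id followed by its values (a simplification of the real likwid column layout, and not the actual code of `process_data.py`), samples can be regrouped per performance group like this:

```python
from collections import defaultdict

def split_by_group(lines):
    """Group timeline samples by their leading group id.

    Assumes the hypothetical layout 'groupid v1 v2 ...';
    the real likwid timeline output has more columns.
    """
    groups = defaultdict(list)
    for line in lines:
        fields = line.split()
        gid, values = fields[0], [float(v) for v in fields[1:]]
        groups[gid].append(values)
    return dict(groups)

sample = ["1 3.5 2.0", "2 0.9 1.1", "1 3.6 2.1"]
print(split_by_group(sample)["1"])  # [[3.5, 2.0], [3.6, 2.1]]
```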
In order to easily analyze the data, `plot_data.py` can be used. It is a Python 3.6 script which uses the matplotlib library.
```
usage: plot_data.py [-h] [-g GROUP] [-m METRIC] [-p] file [file ...]

Plot data

positional arguments:
  file                  Path to the file without extension

optional arguments:
  -h, --help            show this help message and exit
  -g GROUP, --group GROUP
  -m METRIC, --metric METRIC
  -p, --print-groups
```
- First choose a file or set of files to work on
- Run `python plot_data.py -p <files>` to see which groups are available for analysis
- Choose one group and its id to plot
- Run `python plot_data.py -g <groupId> <files>`
- Choose the name of one of the metrics of the group printed out
- Run `python plot_data.py -g <groupId> -m "<metricName>" <files>`
- Note: escape the metric name with double quotes
```
$ python plot_data.py -g 16 -m "Power DRAM [W]" data/ALU-core-1-1.0 data/ALU-core-1-1.5 data/ALU-core-1-2.2 data/Copy-mem-2-2.2 data/ALU-mem-1-1.0 data/ALU-mem-1-1.5 data/ALU-mem-1-2.2
```
Files in `data_aggregated/` follow the naming scheme `<BenchmarkName>-<#threads><-HT>-<freq>.<ext>`:

- `#threads`: 1, 3, 6. Number of physical cores used in the benchmark.
- `-HT`: present or not. If present, each physical core ran two threads in Hyper-Threading, otherwise just one.
- `freq`: 1.2, 1.7, 2.2. Maximum frequency available for the benchmark. In this case the lowest frequency is precise.
- `ext`: `csv` or `stdout`. The csv contains the values of the metrics of the benchmark, while stdout contains the running time and additional details.
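A small sketch (assuming only the naming scheme above; the helper itself is ours, not part of the repository) of how such file names can be decomposed:

```python
import re

# Hypothetical helper: decompose an aggregated-data file name
# following <BenchmarkName>-<#threads><-HT>-<freq>.<ext>.
NAME_RE = re.compile(
    r"^(?P<bench>.+)-(?P<threads>\d+)(?P<ht>-HT)?-(?P<freq>\d\.\d)\.(?P<ext>csv|stdout)$"
)

def parse_name(fname):
    m = NAME_RE.match(fname)
    if m is None:
        raise ValueError(f"unexpected file name: {fname}")
    return {
        "bench": m.group("bench"),
        "threads": int(m.group("threads")),
        "ht": m.group("ht") is not None,
        "freq": float(m.group("freq")),
        "ext": m.group("ext"),
    }

print(parse_name("Scattered-mem-3-HT-1.7.csv"))
```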
In order to easily analyze the data, `histogram.py` can be used. It is a Python 3.6 script which uses the matplotlib library.
```
usage: histogram.py [-h] [-g GROUP] [-m METRIC] [-t TYPE] [-p] file [file ...]

Plot data

positional arguments:
  file                  Path to the file without extension

optional arguments:
  -h, --help            show this help message and exit
  -g GROUP, --group GROUP
  -m METRIC, --metric METRIC
  -t TYPE, --type TYPE
  -p, --print-groups
```
- First choose a file or set of files to work on
- Run `python histogram.py -p <files>` to see which groups are available for analysis
- Choose one group and its id to plot
- Run `python histogram.py -g <groupId> <files>`
- Choose the name of one of the metrics of the group printed out, and additionally choose the type of aggregation across threads, given that there is one value of the metric for each thread. Possible values are `avg`, `sum`, `any`, `min`, `max`.
- Run `python histogram.py -g <groupId> -m "<metricName>" -t <type> <files>`
- Note: escape the metric name with double quotes and, when passing the file names, remove their extension
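A minimal sketch of what the per-thread aggregation types could look like (the helper below is ours, not the script's actual code; in particular the meaning of `any` is our guess of "pick one thread's value"):

```python
# Hypothetical aggregation helpers mirroring the -t flag values;
# each reduces the per-thread values of a metric to a single number.
AGGREGATIONS = {
    "avg": lambda vs: sum(vs) / len(vs),
    "sum": sum,
    "any": lambda vs: vs[0],  # assumption: any one thread's value
    "min": min,
    "max": max,
}

def aggregate(per_thread_values, kind):
    return AGGREGATIONS[kind](per_thread_values)

print(aggregate([2.0, 4.0], "avg"))  # 3.0
```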
```
$ python histogram.py -g 1 -t avg -m "Avg stall duration [cycles]" data_aggregated/Scattered*.csv data_aggregated/ALU-core*.csv
```
The `x`s represent the running time of the benchmark (right y axis), while the bars represent the metric (left y axis).
Additionally, Pearson correlation coefficients for each varying input (i.e. #threads, frequency and HT) have been computed through `pearson_corr.py` and can be found in `correlations/`.
- `<BenchmarkName>-freq.csv`: for each combination of #threads and HT, the correlation coefficient between the three values of the frequency and the values of the metrics at those frequencies.
- `<BenchmarkName>-HT.csv`: for each combination of #threads and frequencies, the correlation coefficient between the two values of HT and the values of the metrics with HT enabled or disabled.
- `<BenchmarkName>-threads.csv`: for each combination of frequencies and HT, the correlation coefficient between the three values of #threads and the values of the metrics at those #threads.
```
$ python plot_corr.py correlations/Scattered-mem-freq.csv
```
The tests were run on a Dell XPS 9750 with an i7-8750H while charging. With TurboBoost disabled, the available frequencies range from 1.2 to 2.2 GHz. There is one socket with 6 physical cores and 12 logical cores (in Hyper-Threading).