Energy Efficient Computing using Dynamic Tuning

The document outlines a methodology for dynamic tuning of high-performance computing (HPC) applications to improve energy efficiency and resource utilization. It discusses the READEX project, which develops tools for automatic tuning of applications, and evaluates the impact of tuning on energy consumption and runtime across several applications. Key findings include energy savings of up to 34% from dynamic tuning, achieved with the BEM4I application, and the importance of adjusting hardware parameters (core frequency, uncore frequency and number of OpenMP threads) to the needs of each code region.

Dynamic Tuning of HPC Applications
Overview
Methodology
• Motivation and Introduction
• READEX project overview
• What can you achieve with static and dynamic tuning
• Tuning of the hardware parameters
  • Effect on the energy consumption
  • Effect of hardware parameter tuning on kernels with various arithmetic intensity
• Evaluation of complex HPC applications
  • BEM4I
  • OpenFOAM
  • Scalability tests with ESPRESO
• Tuning of the application parameters

Motivation and Introduction
READEX Project & Motivation
• Energy efficiency is critical to current and future systems
• Applications exhibit dynamic behavior
  • Changing resource requirements
  • Computational characteristics
  • Changing load on processors over time
Goal was to create a tools-aided methodology for automatic tuning of parallel applications.
Dynamically adjust system parameters to actual resource requirements.

What is dynamic tuning
[Figure: timeline with one phase region and two significant regions; labels: FREQ = 2 GHz, FREQ = 1.5 GHz]

READEX Tool Suite
1. Instrument application
   • Score-P provides different kinds of instrumentation
2. Detect dynamism
   • Check whether runtime situations could benefit from tuning
3. Detect energy saving potential and configurations (DTA)
   • Use tuning plugin and power measurement infrastructure to search for optimal configuration
   • Create tuning model
4. Runtime application tuning (RAT)
   • Apply tuning model, use optimal configuration
[Diagram components: Periscope Tuning Framework, READEX Tuning Plugin, Application, Tuning Model, Score-P, READEX Runtime Library, Online Access Interface, Substrate Plugin Interface, Parameter Control Plugin, Energy Measurements (HDEEM)]

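The runtime step (apply the tuning model when a region is entered, restore the previous setting when it is left) can be sketched as follows. This is an illustrative sketch only and not the READEX Runtime Library API: the `RegionTuner` class, the `apply` callback and the `FREQ_GHZ` key are hypothetical names, and `apply` merely records what a parameter control plugin would be asked to set. The 2.0 and 1.5 values mirror the frequency labels of the dynamic tuning figure.

```python
from contextlib import contextmanager


class RegionTuner:
    """Applies a per-region configuration on entry and restores the previous one on exit."""

    def __init__(self, tuning_model, default_config, apply):
        self.tuning_model = tuning_model      # region name -> configuration
        self.apply = apply                    # hypothetical stand-in for a parameter control plugin
        self._active = [default_config]       # stack of configurations currently in effect
        apply(default_config)

    @contextmanager
    def region(self, name):
        # Regions without an entry in the model keep the enclosing configuration.
        config = self.tuning_model.get(name, self._active[-1])
        self._active.append(config)
        self.apply(config)
        try:
            yield config
        finally:
            self._active.pop()
            self.apply(self._active[-1])


applied = []  # log of every configuration that was requested
tuner = RegionTuner(
    tuning_model={"significant_region": {"FREQ_GHZ": 1.5}},
    default_config={"FREQ_GHZ": 2.0},
    apply=applied.append,
)
with tuner.region("significant_region"):
    pass  # the instrumented region body would run here
print(applied)
```

A stack is used so that nested regions restore the setting of their parent rather than the global default. Switching cost is ignored in this sketch.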
READEX Test Suite
Consists of benchmarks, proxy apps and complex production applications
Key features:
• Full set of scripts allows reproducibility of experiments on
  • TUD Taurus HSW (HDEEM) and BDW partitions
  • IT4I Salomon machine (RAPL)
• Support for Slurm and PBS schedulers
• Automatic savings evaluation
• Performs evaluation of
  • hardware and system parameter tuning
  • application parameter tuning
• Contains manual instrumentation of significant regions
  • using header file → can be adapted to test other tools

Application type            Application name
benchmarks or proxy apps    AMG2013, Blasbench, Kripke, Lulesh, NPB3.3
production applications     BEM4I, ESPRESO, INDEED, OpenFOAM

What can you expect from static tuning
MANUAL STATIC TUNING
Test Suite MIN: 4.3%    Test Suite AVG: 12.6%    Test Suite MAX: 17.6%

Software     Static tuning savings
AMG2013      12.5%
Blasbench     7.4%
Kripke       11.5%
Lulesh       17.6%
NPB3.3       11.0%
BEM4I        15.7%
INDEED       17.6%
ESPRESO       4.3%
OpenFOAM     15.9%
Average      12.6%

What can you expect from dynamic tuning
MANUAL DYNAMIC TUNING (proposal goal: up to 30%)
Test Suite MIN: 8.2%    Test Suite AVG: 17.5%    Test Suite MAX: 34.1%

Software     Dynamic tuning savings
AMG2013      12.5%
Blasbench    15.3%
Kripke       18.5%
Lulesh       18.7%
NPB3.3       11.0%
BEM4I        34.1%
INDEED       19.5%
ESPRESO       8.2%
OpenFOAM     20.1%
Average      17.5%

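The minimum, maximum and average figures on the static and dynamic tuning slides follow directly from the per-application values. A short sketch that recomputes them (the dictionary and function names are illustrative; the percentages are copied from the slides):

```python
from statistics import mean

# Node energy savings in percent, per application, as listed on the slides.
static_savings = {
    "AMG2013": 12.5, "Blasbench": 7.4, "Kripke": 11.5, "Lulesh": 17.6,
    "NPB3.3": 11.0, "BEM4I": 15.7, "INDEED": 17.6, "ESPRESO": 4.3,
    "OpenFOAM": 15.9,
}
dynamic_savings = {
    "AMG2013": 12.5, "Blasbench": 15.3, "Kripke": 18.5, "Lulesh": 18.7,
    "NPB3.3": 11.0, "BEM4I": 34.1, "INDEED": 19.5, "ESPRESO": 8.2,
    "OpenFOAM": 20.1,
}


def summarize(savings):
    """Return (min, max, average) with the average rounded to one decimal."""
    values = list(savings.values())
    return min(values), max(values), round(mean(values), 1)


static_summary = summarize(static_savings)
dynamic_summary = summarize(dynamic_savings)
print("static  (min, max, avg):", static_summary)
print("dynamic (min, max, avg):", dynamic_summary)
```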
Energy savings achieved by static and dynamic tuning
Evaluation of READEX Tool Suite on TUD Taurus Haswell system with HDEEM energy measurements
Each entry is node energy savings / time savings; negative values are increases. CF = core frequency, UCF = uncore frequency. (Default is Intel compiler; * uses GCC compiler.)

Application     HW parameters      Static tuning      Dynamic tuning     READEX tuning
AMG2013         CF, UCF, threads   12.5% / −0.9%      N/A                7.0% / −14.0%
Blasbench       CF, UCF, threads   7.4% / −0.9%       15.3% / −18.1%     9.9% / −9.2%
Kripke          CF, UCF            11.5% / −28.3%     18.8% / −18.7%     10.5% / −28.9%
Lulesh          CF, UCF, threads   17.6% / −8.9%      18.7% / −11.7%     18.2% / −25.7%
NPB3.3-BT-MZ    CF, UCF, threads   11% / −11.3%       N/A                10.8% / −12%
BEM4I           CF, UCF, threads   15.7% / −6.2%      34.1% / 10.9%      34.0% / 10.9%
INDEED          CF, UCF, threads   17.6% / −12.8%     19.5% / −14.2%     19.1% / −17.3%
ESPRESO         CF, UCF, threads   4.3% / −8.9%       8.2% / −10.1%      7.1% / −12.3%
OpenFOAM        CF, UCF            15.9% / −10.5%     20.1% / 11.5%      9.8% / −9.8%

Key findings:
• Best savings achieved with BEM4I application – up to 34% for energy and 11% for runtime
• In general energy savings are "paid" by extra runtime

Tuning of the hardware parameters
Hardware parameter tuning
Investigation of the impact of CPU uncore frequency tuning on memory bound code:
• Optimal frequency, with low energy consumption and a small performance impact
Evaluation using STREAM Copy benchmark. Results by TU Dresden under READEX project.

Hardware parameter tuning
Effect of changing core frequencies on uncore performance using memory bound code:
• Just a small impact on the bandwidth and energy
Evaluation using STREAM Copy benchmark. Results by TU Dresden under READEX project.
[Figure: heatmap of the energy consumption of a STREAM benchmark for different core and uncore frequencies. The data array does not fit in the processor's L3 cache.]

Hardware parameter tuning
L3 cache energy efficiency and bandwidth:
• Different optimal uncore frequencies
Evaluation using STREAM Copy benchmark. Results by TU Dresden under READEX project.
[Figure: heatmap of the energy consumption of a STREAM benchmark for different core and uncore frequencies. The data array does fit in the processor's L3 cache.]

Effect of hardware parameter tuning on kernels with various arithmetic intensity
Arithmetic intensity
Static tuning for various arithmetic intensity
[Figures: one plot per ratio, from 1:9, 2:8, 3:7, 4:6, 5:5, 6:4, 7:3 and 8:2 to 9:1]

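The slides do not define arithmetic intensity. A common definition is the number of floating point operations per byte of memory traffic. The sketch below uses that definition for the two BLAS kernels named in the talk, DGEMV and DGEMM, under an idealised traffic model that is an assumption here: square n x n operands, 8-byte doubles, each operand moved between memory and CPU exactly once.

```python
def dgemv_intensity(n: int, word_bytes: int = 8) -> float:
    """FLOP per byte for y = A @ x with an n x n matrix A."""
    flops = 2 * n * n                           # one multiply and one add per matrix element
    bytes_moved = word_bytes * (n * n + 2 * n)  # read A and x, write y
    return flops / bytes_moved


def dgemm_intensity(n: int, word_bytes: int = 8) -> float:
    """FLOP per byte for C = A @ B with n x n matrices."""
    flops = 2 * n ** 3                          # n multiply-add pairs per output element
    bytes_moved = word_bytes * 3 * n * n        # read A and B, write C
    return flops / bytes_moved


n = 1200
print(f"DGEMV: {dgemv_intensity(n):.3f} FLOP/byte")
print(f"DGEMM: {dgemm_intensity(n):.3f} FLOP/byte")
```

Under this model the intensity of DGEMV stays below 0.25 FLOP/byte for any n, while that of DGEMM grows as n/12, which is why the first is typically memory bound and the second compute bound.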
Hardware parameter tuning
Behavior of the simple application with two kernels
• Low computational intensity (CI) – DGEMV
• High computational intensity (CI) – DGEMM
• Tuning of three parameters
  • Core frequency (CF)
  • Uncore frequency (UCF)
  • Number of OpenMP threads
• Visualized by RADAR

Settings shown:
Case                              Threads   UCF        CF
Low CI (DGEMV)                    10        2.2 GHz    1.2 GHz
High CI (DGEMM)                   12        1.2 GHz    2.5 GHz
Static tuning for both kernels    12        2.2 GHz    2.4 GHz

[Figures: compute node energy consumption [J] against CPU core frequency [GHz], one plot per case]
Note: runtime of both kernels was equal for default settings

Two kernels with 1:1 workload ratio    Energy consumption    Energy savings
Default settings                       2017 J                -
Static optimal                         1833 J                179 J     9%
Dynamic optimal                        1612 J                221 J    12%
Total savings                          -                     400 J    20%

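The difference between the static and the dynamic optimum can be made concrete with a small sketch: static tuning picks one configuration that minimises the energy of the whole run, dynamic tuning picks the best configuration for each kernel separately. The three configurations below are the ones named on the slide; the joule values are invented for illustration and are not measurements from the talk.

```python
# Hypothetical energy per kernel in joules: ENERGY[config][kernel].
# A configuration is (core GHz, uncore GHz, OpenMP threads).
# The joule values are made up for this example.
ENERGY = {
    (1.2, 2.2, 10): {"dgemv": 700.0, "dgemm": 1300.0},
    (2.5, 1.2, 12): {"dgemv": 1100.0, "dgemm": 800.0},
    (2.4, 2.2, 12): {"dgemv": 850.0, "dgemm": 900.0},
}


def static_optimum(energy):
    """One configuration for the whole run: minimise the summed energy."""
    best = min(energy, key=lambda cfg: sum(energy[cfg].values()))
    return best, sum(energy[best].values())


def dynamic_optimum(energy):
    """One configuration per kernel: minimise each kernel independently."""
    kernels = list(next(iter(energy.values())))
    choice = {k: min(energy, key=lambda cfg: energy[cfg][k]) for k in kernels}
    total = sum(energy[cfg][k] for k, cfg in choice.items())
    return choice, total


static_cfg, static_total = static_optimum(ENERGY)
dynamic_cfg, dynamic_total = dynamic_optimum(ENERGY)
print("static :", static_cfg, static_total, "J")
print("dynamic:", dynamic_cfg, dynamic_total, "J")
```

The dynamic total can never exceed the static one in this model because it ignores the cost of switching between configurations; a real evaluation has to add that overhead.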
Core and uncore frequency tuning under power cap
Experiments description and testbed parameters
A set of experiments performed on Intel Broadwell architecture
Testbed: Broadwell partition of the Galileo supercomputer in CINECA
• dual socket server
• two 18-core Intel Xeon E5-2697v4 processors
• 2.3 GHz nominal frequency
• 2.7 GHz turbo frequency when all 18 cores are utilized
• 145 W TDP
[Table: key tunable parameters of the 18-core Intel Xeon E5-2697v4 processor and their respective ranges and steps]

Tuning of COMPUTE bound workload
• behavior of the platform when running compute bound workload
  • under 145 W (TDP level, no power cap)
  • three different power cap levels: 100 W, 80 W and 60 W
[Figures: tuning of the compute bound region under 100 W, 80 W and 60 W power caps. Axes: frequency [GHz] against runtime [s] and energy consumption [J]. Series: EXP0 - default; EXP1 - default Pcap; EXP5 - DVFS under Pcap; EXP6 - UCF under Pcap; EXP7 - DVFS & UCF - min UCF; EXP7 - DVFS & UCF - max UCF]
Results annotated in the plots:
• 100 W power cap: 14.9% energy savings, 4.5% time savings; 21.3% energy savings, 21.1% time extension
• 80 W power cap: 8.5% energy savings, 8.4% time savings; 12.9% energy savings, 10.9% time extension
• 60 W power cap: 9.1% energy savings, 9.4% time savings

Observations for COMPUTE bound workload
• To achieve the best possible performance
  • the uncore frequency must be reduced to the minimum
  • up to 9.4% performance gain and
  • up to 14.9% lower energy consumption
• If further energy savings are required – use DVFS and lower the core frequency
  • up to 21% energy savings
  • up to 21% penalty in runtime
  • this effect is more visible for higher power cap levels

Tuning of memory bound workload
• behavior of the platform when running memory bound workload
  • under 145 W (TDP level, no power cap)
  • three different power cap levels: 100 W, 80 W and 60 W
[Figures: tuning of the memory bound region under 100 W, 80 W and 60 W power caps. Axes: frequency [GHz] against runtime [s] and energy consumption [J]. Series: EXP0 - default; EXP1 - default Pcap; EXP5 - DVFS under Pcap; EXP6 - UCF under Pcap; EXP7 - DVFS & UCF - UCF = 2.2 GHz; EXP7 - DVFS & UCF - max UCF]
Results annotated in the plots:
• 100 W power cap: 38.7% energy savings, 3.6% time extension
• 80 W power cap: 21.6% energy savings, 3.6% time extension
• 60 W power cap: 22.2% energy savings, 22.2% time savings

Observations for both workloads
Observations for memory bound workload
• Under a power budget lower than 80 W
  • DVFS should be set to the minimum value
  • this boosts the performance of the uncore part by 22%
• Tuning of the uncore frequency
  • has a low effect on the performance
  • but a major effect on energy consumption
  • between 21% (60 W and 80 W) and 38% (100 W)
Observations for compute bound workload
• To achieve the best possible performance
  • the uncore frequency must be reduced to the minimum
  • up to 9.4% performance gain and
  • up to 14.9% lower energy consumption
• If further energy savings are required – use DVFS and lower the core frequency
  • up to 21% energy savings
  • up to 21% penalty in runtime
  • this effect is more visible for higher power cap levels

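Experiments of this kind need an energy reading around each code region. On Linux systems that expose RAPL through the powercap sysfs tree, the package energy counter can be read from a file and differenced. The sketch below assumes the package-0 domain lives at `/sys/class/powercap/intel-rapl:0` and that the files are readable by the user, which varies between machines; it is not the measurement setup used in the talk, and the helper names are illustrative.

```python
from pathlib import Path

# Assumed location of the package-0 RAPL domain; the index and the read
# permissions depend on the machine and kernel configuration.
RAPL_DOMAIN = Path("/sys/class/powercap/intel-rapl:0")


def read_counter(name: str, domain: Path = RAPL_DOMAIN) -> int:
    """Read one integer attribute of a powercap domain, e.g. 'energy_uj'."""
    return int((domain / name).read_text().strip())


def energy_delta_uj(before: int, after: int, max_range: int) -> int:
    """Microjoules consumed between two readings, allowing one counter wraparound."""
    if after >= before:
        return after - before
    # The counter restarted after reaching max_energy_range_uj.
    return after + max_range - before


def measure_joules(workload, domain: Path = RAPL_DOMAIN) -> float:
    """Run workload() and return the package energy it consumed, in joules."""
    max_range = read_counter("max_energy_range_uj", domain)
    start = read_counter("energy_uj", domain)
    workload()
    end = read_counter("energy_uj", domain)
    return energy_delta_uj(start, end, max_range) / 1e6
```

The wraparound branch matters for long regions, since the counter has a finite range. A region that runs long enough to wrap more than once cannot be measured this way and needs periodic sampling instead.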
Evaluation of complex HPC applications
BEM4I Application
Hardware: dual socket system with 2x12 CPU cores – "standard HW" in HPC centres

Application runtime       assemble_k [s]   assemble_v [s]   gmres_solve [s]   print_vtu [s]   main [s]
default runtime           5.4              5.9              10.2              5.6             27.3
static tuning runtime     9.8              10.6             6.1               2.4             29.0
dynamic tuning runtime    7.0              7.2              7.9               2.1             24.3
static savings [%]        -82.3%           -79.1%           40.5%             56.8%           -6.2%
dynamic savings [%]       -30.6%           -20.9%           23.2%             62.9%           10.9%

Region description:
• assemble_k and assemble_v – high utilization of vector units, extreme level of optimization – fully compute bound, great utilization of both sockets and all cores
• gmres_solve – uses DGEMV from MKL – memory bound, suffers from NUMA effects; this routine is more efficient on a single socket
• print_vtu – single threaded, I/O and network bound region that stores data to a file on the Lustre file system

Tuning model (in the "static" entry, FREQUENCY "25" means 2.5 GHz, NUM_THREADS "12" means 12 OpenMP threads and UNCORE_FREQUENCY "22" means 2.2 GHz):
    "static":      {"FREQUENCY": "25", "NUM_THREADS": "12", "UNCORE_FREQUENCY": "22"},
    "assemble_k":  {"FREQUENCY": "23", "NUM_THREADS": "24", "UNCORE_FREQUENCY": "16"},
    "assemble_v":  {"FREQUENCY": "25", "NUM_THREADS": "24", "UNCORE_FREQUENCY": "14"},
    "gmres_solve": {"FREQUENCY": "17", "NUM_THREADS": "8",  "UNCORE_FREQUENCY": "22"},
    "print_vtu":   {"FREQUENCY": "25", "NUM_THREADS": "6",  "UNCORE_FREQUENCY": "24"}

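The tuning model fragment on this slide can be read programmatically. The sketch below wraps the fragment in braces to make it valid JSON and decodes the values using the slide's own annotations (frequencies in units of 100 MHz). The function names and the fallback to the "static" entry for regions without their own entry are assumptions made for this example, not documented READEX behaviour.

```python
import json

# Tuning model fragment from the slide, wrapped in braces to form valid JSON.
TUNING_MODEL = json.loads("""
{
  "static":      {"FREQUENCY": "25", "NUM_THREADS": "12", "UNCORE_FREQUENCY": "22"},
  "assemble_k":  {"FREQUENCY": "23", "NUM_THREADS": "24", "UNCORE_FREQUENCY": "16"},
  "assemble_v":  {"FREQUENCY": "25", "NUM_THREADS": "24", "UNCORE_FREQUENCY": "14"},
  "gmres_solve": {"FREQUENCY": "17", "NUM_THREADS": "8",  "UNCORE_FREQUENCY": "22"},
  "print_vtu":   {"FREQUENCY": "25", "NUM_THREADS": "6",  "UNCORE_FREQUENCY": "24"}
}
""")


def decode(entry):
    """Return (core GHz, uncore GHz, threads); '25' means 2.5 GHz per the slide."""
    return (
        int(entry["FREQUENCY"]) / 10,
        int(entry["UNCORE_FREQUENCY"]) / 10,
        int(entry["NUM_THREADS"]),
    )


def config_for(region, model=TUNING_MODEL):
    """Configuration for a region; unknown regions fall back to 'static' (an assumption)."""
    return decode(model.get(region, model["static"]))


for region in ("assemble_k", "assemble_v", "gmres_solve", "print_vtu"):
    core, uncore, threads = config_for(region)
    print(f"{region}: CF {core} GHz, UCF {uncore} GHz, {threads} threads")
```

Read this way, the model matches the region descriptions: the memory bound gmres_solve region gets a low core frequency and only 8 threads, while the compute bound assembly regions use all 24 threads with a low uncore frequency.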
BEM4I Application

Compute node energy       assemble_k [J]   assemble_v [J]   gmres_solve [J]   print_vtu [J]   main [J]
default energy            1476             1484             2733              1142            6872
static tuning energy      1962             2015             1366              420             5792
dynamic tuning energy     1467             1462             1259              293             4531
static savings [%]        -33.8%           -35.8%           50.0%             63.2%           15.7%
dynamic savings [%]       0.6%             1.5%             53.9%             74.3%           34.1%

The large energy savings are a combination of optimal HW settings and runtime savings due to mitigation of the NUMA effect by optimal settings of OpenMP threading
• Without these savings in runtime, a similar application will achieve
  • Energy savings approx. 15 – 20%
  • Runtime savings approx. -15%

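The savings rows are relative differences against the default run. A brief sketch that recomputes them for three of the columns from the joule values on the slide (the names `ENERGY_J`, `savings_percent` and `savings_table` are illustrative):

```python
# Compute node energy in joules, copied from the BEM4I slide.
ENERGY_J = {
    "gmres_solve": {"default": 2733, "static": 1366, "dynamic": 1259},
    "print_vtu":   {"default": 1142, "static": 420,  "dynamic": 293},
    "main":        {"default": 6872, "static": 5792, "dynamic": 4531},
}


def savings_percent(baseline, tuned):
    """Relative savings in percent; a negative value means the tuned run used more."""
    return 100.0 * (baseline - tuned) / baseline


def savings_table(energy):
    """Savings of the static and dynamic runs against the default, rounded to 0.1%."""
    return {
        region: {
            mode: round(savings_percent(row["default"], row[mode]), 1)
            for mode in ("static", "dynamic")
        }
        for region, row in energy.items()
    }


TABLE = savings_table(ENERGY_J)
for region, row in TABLE.items():
    print(f"{region}: static {row['static']}%, dynamic {row['dynamic']}%")
```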
OpenFOAM Application
• Computational fluid dynamics
• Finite volume + multigrid solver

OpenFOAM            Energy consumption    Energy savings
Default settings    14 231 J              -
Static tuning       12 264 J              2 264 J    15.9%
Dynamic tuning      11 370 J              597 J       4.8%
Total savings                             2 861 J    20.1%

ESPRESO Application
• Structural mechanics code
• Finite element + sparse FETI solver
• Different tuning models are needed for different numbers of nodes in strong scalability tests – the workload per node varies
• Includes dynamic switching overheads
Results: 33% energy savings, 22% time savings and improved strong scalability
[Figure: energy savings analysis for the strong scalability test of the ESPRESO library when running the cube benchmark]

Application parameters tuning
Application parameters tuning of the ESPRESO
9 parameters, 3840 combinations
• FETI METHOD 2x
• PRECONDITIONER 5x
• ITERATIVE SOLVER TYPE 2x
• HFETI type 2x
• NON-UNIFORM PARTS 6x
• REDUNDANT LAGRANGE 2x
• SCALING 2x
• B0_TYPE 2x
• ADAPTIVE PRECISION 2x
Energy savings of the optimal settings:
• 50% - 66% against "reasonable" settings
• 86% against the worst case
[Figure: energy consumption [kJ] against configuration index, with the "reasonable" settings and the optimal settings marked]

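The 3840 combinations are the product of the option counts of the nine parameters. The sketch below checks that count and shows what an exhaustive search over such a space looks like. The `measure_energy` argument is a hypothetical callable standing in for one application run per configuration, and options are represented only by their index, since the slide does not list the option values.

```python
from itertools import product
from math import prod

# Number of options per ESPRESO parameter, as listed on the slide.
OPTION_COUNTS = {
    "FETI METHOD": 2,
    "PRECONDITIONER": 5,
    "ITERATIVE SOLVER TYPE": 2,
    "HFETI type": 2,
    "NON-UNIFORM PARTS": 6,
    "REDUNDANT LAGRANGE": 2,
    "SCALING": 2,
    "B0_TYPE": 2,
    "ADAPTIVE PRECISION": 2,
}


def configurations(option_counts):
    """Yield every combination as a dict of parameter name -> option index."""
    names = list(option_counts)
    for combo in product(*(range(option_counts[name]) for name in names)):
        yield dict(zip(names, combo))


def best_configuration(option_counts, measure_energy):
    """Exhaustive search: return the configuration with the lowest measured energy."""
    return min(configurations(option_counts), key=measure_energy)


TOTAL = prod(OPTION_COUNTS.values())
print(TOTAL, "combinations")
```

Because every configuration needs its own start of the application, the cost of an exhaustive search grows with the product of the option counts, which is the reason an automated search tool is useful here.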
Application parameters tuning
Application parameter tuning is very promising
• application configuration parameters are given in the input file
• each setting requires an individual start of the application
• tool performs automatic search of application parameter space

Application    Parameters tested / total options    Energy savings vs. worst settings    Energy savings vs. default or reasonable settings
ESPRESO        9 / 3840                             86%                                  50 – 66%
ELMER          1 / 40                               97%                                  50 – 75%
OpenFOAM       2 / 12                               24%                                  8%
INDEED         3 / 12                               35%                                  25%

Thank you

Recommended

PDF
Overview of HPC Interconnects
PDF
Hardware & Software Platforms for HPC, AI and ML
PDF
NNSA Explorations: ARM for Supercomputing
PDF
CUDA-Python and RAPIDS for blazing fast scientific computing
PDF
Preparing to program Aurora at Exascale - Early experiences and future direct...
PDF
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
PDF
Exceeding the Limits of Air Cooling to Unlock Greater Potential in HPC
PDF
MIT's experience on OpenPOWER/POWER 9 platform
PDF
State of ARM-based HPC
PDF
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
PDF
IBM HPC Transformation with AI
PDF
TAU E4S ON OpenPOWER /POWER9 platform
PDF
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
PDF
DPDK Summit - 08 Sept 2014 - Ericsson - A Multi-Socket Ferrari for NFV
PDF
BXI: Bull eXascale Interconnect
PDF
IBM Data Centric Systems & OpenPOWER
PDF
DOME 64-bit μDataCenter
PDF
Trends in Systems and How to Get Efficient Performance
PDF
ARM HPC Ecosystem
PDF
Intel dpdk Tutorial
PPSX
FD.io Vector Packet Processing (VPP)
PDF
Summit workshop thompto
PDF
AI is Impacting HPC Everywhere
PDF
Lenovo HPC Strategy Update
PDF
Mellanox Announces HDR 200 Gb/s InfiniBand Solutions
PDF
HPC Accelerating Combustion Engine Design
PDF
High-Performance and Scalable Designs of Programming Models for Exascale Systems
PDF
A Fresh Look at HPC from Huawei Enterprise
PDF
Runtime Methods to Improve Energy Efficiency in HPC Applications
PDF
Ga techsusthpc patterson

More Related Content

PDF
Overview of HPC Interconnects
PDF
Hardware & Software Platforms for HPC, AI and ML
PDF
NNSA Explorations: ARM for Supercomputing
PDF
CUDA-Python and RAPIDS for blazing fast scientific computing
PDF
Preparing to program Aurora at Exascale - Early experiences and future direct...
PDF
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
PDF
Exceeding the Limits of Air Cooling to Unlock Greater Potential in HPC
PDF
MIT's experience on OpenPOWER/POWER 9 platform
Overview of HPC Interconnects
Hardware & Software Platforms for HPC, AI and ML
NNSA Explorations: ARM for Supercomputing
CUDA-Python and RAPIDS for blazing fast scientific computing
Preparing to program Aurora at Exascale - Early experiences and future direct...
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
Exceeding the Limits of Air Cooling to Unlock Greater Potential in HPC
MIT's experience on OpenPOWER/POWER 9 platform

What's hot

PDF
State of ARM-based HPC
PDF
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
PDF
IBM HPC Transformation with AI
PDF
TAU E4S ON OpenPOWER /POWER9 platform
PDF
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
PDF
DPDK Summit - 08 Sept 2014 - Ericsson - A Multi-Socket Ferrari for NFV
PDF
BXI: Bull eXascale Interconnect
PDF
IBM Data Centric Systems & OpenPOWER
PDF
DOME 64-bit μDataCenter
PDF
Trends in Systems and How to Get Efficient Performance
PDF
ARM HPC Ecosystem
PDF
Intel dpdk Tutorial
PPSX
FD.io Vector Packet Processing (VPP)
PDF
Summit workshop thompto
PDF
AI is Impacting HPC Everywhere
PDF
Lenovo HPC Strategy Update
PDF
Mellanox Announces HDR 200 Gb/s InfiniBand Solutions
PDF
HPC Accelerating Combustion Engine Design
PDF
High-Performance and Scalable Designs of Programming Models for Exascale Systems
PDF
A Fresh Look at HPC from Huawei Enterprise
State of ARM-based HPC
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
IBM HPC Transformation with AI
TAU E4S ON OpenPOWER /POWER9 platform
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
DPDK Summit - 08 Sept 2014 - Ericsson - A Multi-Socket Ferrari for NFV
BXI: Bull eXascale Interconnect
IBM Data Centric Systems & OpenPOWER
DOME 64-bit μDataCenter
Trends in Systems and How to Get Efficient Performance
ARM HPC Ecosystem
Intel dpdk Tutorial
FD.io Vector Packet Processing (VPP)
Summit workshop thompto
AI is Impacting HPC Everywhere
Lenovo HPC Strategy Update
Mellanox Announces HDR 200 Gb/s InfiniBand Solutions
HPC Accelerating Combustion Engine Design
High-Performance and Scalable Designs of Programming Models for Exascale Systems
A Fresh Look at HPC from Huawei Enterprise

Similar to Energy Efficient Computing using Dynamic Tuning

PDF
Runtime Methods to Improve Energy Efficiency in HPC Applications
PDF
Ga techsusthpc patterson
PDF
Deep Learning Training at Scale: Spring Crest Deep Learning Accelerator
PPTX
Energy Efficiency in Large Scale Systems
PDF
Symposium on HPC Applications – IIT Kanpur
PDF
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
PPTX
Optimizing High Performance Computing Applications for Energy
PDF
E03403027030
PPTX
Hardware-aware thread scheduling: the case of asymmetric multicore processors
PDF
Intel - Challenges and Opportunities in Cloud-Based Genomics Analytics
PDF
Dynamic Frequency Scaling Regarding Memory for Energy Efficiency of Embedded...
PDF
Hp All In 1
PDF
Performance and Energy evaluation
PDF
[IGC2018] AMD Don Woligroski - WHY Ryzen
PDF
The impact of software on data-center energy use - and what can we do about it?
PPTX
An application classification guided cache tuning heuristic for
PDF
POWER10 innovations for HPC
PDF
Barcelona Supercomputing Center, Generador de Riqueza
PDF
IBM zEnterprise 114 Technical Guide
PDF
6 profiling tools
 
Runtime Methods to Improve Energy Efficiency in HPC Applications
Ga techsusthpc patterson
Deep Learning Training at Scale: Spring Crest Deep Learning Accelerator
Energy Efficiency in Large Scale Systems
Symposium on HPC Applications – IIT Kanpur
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Optimizing High Performance Computing Applications for Energy
E03403027030
Hardware-aware thread scheduling: the case of asymmetric multicore processors
Intel - Challenges and Opportunities in Cloud-Based Genomics Analytics
Dynamic Frequency Scaling Regarding Memory for Energy Efficiency of Embedded...
Hp All In 1
Performance and Energy evaluation
[IGC2018] AMD Don Woligroski - WHY Ryzen
The impact of software on data-center energy use - and what can we do about it?
An application classification guided cache tuning heuristic for
POWER10 innovations for HPC
Barcelona Supercomputing Center, Generador de Riqueza
IBM zEnterprise 114 Technical Guide
6 profiling tools
 

More from inside-BigData.com

PDF
Major Market Shifts in IT
PPTX
Transforming Private 5G Networks
PDF
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
PDF
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
PDF
HPC Impact: EDA Telemetry Neural Networks
PDF
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
PDF
Machine Learning for Weather Forecasts
PPTX
HPC AI Advisory Council Update
PDF
Fugaku Supercomputer joins fight against COVID-19
PDF
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
PDF
Versal Premium ACAP for Network and Cloud Acceleration
PDF
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
PDF
Scaling TCO in a Post Moore's Era
PDF
Introducing HPC with a Raspberry Pi Cluster
PDF
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
PDF
Data Parallel Deep Learning
PDF
Making Supernovae with Jets
PDF
Adaptive Linear Solvers and Eigensolvers
PDF
Scientific Applications and Heterogeneous Architectures
PDF
SW/HW co-design for near-term quantum computing
Major Market Shifts in IT
Transforming Private 5G Networks
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
HPC Impact: EDA Telemetry Neural Networks
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Machine Learning for Weather Forecasts
HPC AI Advisory Council Update
Fugaku Supercomputer joins fight against COVID-19
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
Versal Premium ACAP for Network and Cloud Acceleration
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Scaling TCO in a Post Moore's Era
Introducing HPC with a Raspberry Pi Cluster
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Data Parallel Deep Learning
Making Supernovae with Jets
Adaptive Linear Solvers and Eigensolvers
Scientific Applications and Heterogeneous Architectures
SW/HW co-design for near-term quantum computing

Recently uploaded

PDF
Cybersecurity Prevention and Detection: Unit 2
PDF
The Necessity of Digital Forensics, the Digital Forensics Process & Laborator...
PDF
How Much Does It Cost To Build Software
PDF
The Evolving Role of the CEO in the Age of AI
PDF
Parallel Computing BCS702 Module notes of the vtu college 7th sem 4.pdf
PPTX
Support, Monitoring, Continuous Improvement & Scaling Agentic Automation [3/3]
PDF
Dev Dives: Build smarter agents with UiPath Agent Builder
PDF
[BDD 2025 - Mobile Development] Exploring Apple’s On-Device FoundationModels
PDF
5 Common Supply Chain Attacks and How They Work | CyberPro Magazine
PDF
Integrating AI with Meaningful Human Collaboration
PDF
Mastering Agentic Orchestration with UiPath Maestro | Hands on Workshop
PDF
Open Source Post-Quantum Cryptography - Matt Caswell
PDF
[BDD 2025 - Full-Stack Development] Agentic AI Architecture: Redefining Syste...
PDF
[BDD 2025 - Mobile Development] Crafting Immersive UI with E2E and AGSL Shade...
PDF
Transforming Supply Chains with Amazon Bedrock AgentCore (AWS Swiss User Grou...
PDF
Crane Accident Prevention Guide: Key OSHA Regulations for Safer Operations
PDF
PCCC25(設立25年記念PCクラスタシンポジウム):エヌビディア合同会社 テーマ2「NVIDIA BlueField-4 DPU」
PDF
[BDD 2025 - Mobile Development] Mobile Engineer and Software Engineer: Are we...
PDF
MuleSoft Meetup: Dreamforce'25 Tour- Vibing With AI & Agents.pdf
PDF
KMWorld - KM & AI Bring Collectivity, Nostalgia, & Selectivity
Cybersecurity Prevention and Detection: Unit 2
The Necessity of Digital Forensics, the Digital Forensics Process & Laborator...
How Much Does It Cost To Build Software
The Evolving Role of the CEO in the Age of AI
Parallel Computing BCS702 Module notes of the vtu college 7th sem 4.pdf
Support, Monitoring, Continuous Improvement & Scaling Agentic Automation [3/3]
Dev Dives: Build smarter agents with UiPath Agent Builder
[BDD 2025 - Mobile Development] Exploring Apple’s On-Device FoundationModels
5 Common Supply Chain Attacks and How They Work | CyberPro Magazine
Integrating AI with Meaningful Human Collaboration
Mastering Agentic Orchestration with UiPath Maestro | Hands on Workshop
Open Source Post-Quantum Cryptography - Matt Caswell
[BDD 2025 - Full-Stack Development] Agentic AI Architecture: Redefining Syste...
[BDD 2025 - Mobile Development] Crafting Immersive UI with E2E and AGSL Shade...
Transforming Supply Chains with Amazon Bedrock AgentCore (AWS Swiss User Grou...
Crane Accident Prevention Guide: Key OSHA Regulations for Safer Operations
PCCC25(設立25年記念PCクラスタシンポジウム):エヌビディア合同会社 テーマ2「NVIDIA BlueField-4 DPU」
[BDD 2025 - Mobile Development] Mobile Engineer and Software Engineer: Are we...
MuleSoft Meetup: Dreamforce'25 Tour- Vibing With AI & Agents.pdf
KMWorld - KM & AI Bring Collectivity, Nostalgia, & Selectivity

Energy Efficient Computing using Dynamic Tuning

  • 1.
    Dynamic Tuning ofHPC Applications
  • 2.
    OverviewMethodology• Motivation andIntroduction• READEX project overview• What can you achieve with static and dynamic tuning• Tuning of the hardware parameters• Effect on the energy consumption• Effect of hardware parameter tuning on kernels with various arithmetic intensity• Evaluation of complex HPC applications• BEM4I• OpenFOAM• Scalability tests with ESPRESO• Tuning of the application parameters
  • 3.
  • 4.
    READEX Project & Motivation
    • Energy efficiency is critical to current and future systems
    • Applications exhibit dynamic behavior
    • Changing resource requirements
    • Computational characteristics
    • Changing load on processors over time
    The goal was to create a tools-aided methodology for automatic tuning of parallel applications.
    Dynamically adjust system parameters to actual resource requirements
  • 5.
    What is dynamic tuning
    [Diagram: a phase region containing two significant regions; frequency labels FREQ=2 GHz and FREQ=1.5 GHz]
  • 6.
    READEX Tool Suite
    1. Instrument application
    • Score-P provides different kinds of instrumentation
    2. Detect dynamism
    • Check whether runtime situations could benefit from tuning
    3. Detect energy saving potential and configurations (DTA)
    • Use tuning plugin and power measurement infrastructure to search for optimal configuration
    • Create tuning model
    4. Runtime application tuning (RAT)
    • Apply tuning model, use optimal configuration
    [Diagram components: Periscope Tuning Framework, READEX Tuning Plugin, Application, Tuning Model, Score-P, READEX Runtime Library, Online Access Interface, Substrate Plugin Interface, Parameter Control Plugin, Energy Measurements (HDEEM)]
  • 7.
    READEX Test Suite
    Consists of benchmarks, proxy apps and complex production applications
    Key features:
    • Full set of scripts allows reproducibility of experiments on
      • TUD Taurus HSW (HDEEM) and BDW partitions
      • IT4I Salomon machine (RAPL)
    • Support for Slurm and PBS schedulers
    • Automatic savings evaluation
    • Performs evaluation of
      • hardware and system parameter tuning
      • application parameter tuning
    • Contains manual instrumentation of significant regions
      • using a header file → can be adapted to test other tools
    Application type            Application name
    benchmarks or proxy apps    AMG2013, Blasbench, Kripke, Lulesh, NPB3.3
    production applications     BEM4I, ESPRESO, INDEED, OpenFOAM
  • 8.
    What can you expect from static tuning
    MANUAL STATIC TUNING
    Test Suite MIN: 4.3%, Test Suite AVG: 12.6%, Test Suite MAX: 17.6%
    Software     Static tuning savings
    AMG2013      12.5 %
    Blasbench    7.4 %
    Kripke       11.5 %
    Lulesh       17.6 %
    NPB3.3       11.0 %
    BEM4I        15.7 %
    INDEED       17.6 %
    ESPRESO      4.3 %
    OpenFOAM     15.9 %
    Average      12.6 %
  • 9.
    What can you expect from dynamic tuning
    MANUAL DYNAMIC TUNING
    Proposal goal: up to 30%
    Test Suite MIN: 8.2%, Test Suite AVG: 17.5%, Test Suite MAX: 34.1%
    Software     Dynamic tuning savings
    AMG2013      12.5 %
    Blasbench    15.3 %
    Kripke       18.5 %
    Lulesh       18.7 %
    NPB3.3       11.0 %
    BEM4I        34.1 %
    INDEED       19.5 %
    ESPRESO      8.2 %
    OpenFOAM     20.1 %
    Average      17.5 %
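The suite-wide MIN, MAX and AVG figures quoted on the static and dynamic tuning slides can be re-derived from the per-application values. The short Python sketch below does that arithmetic; the dictionary and function names are illustrative and not part of the READEX tooling.

```python
# Per-application savings in percent, copied from the two slides above.
static_savings = {
    "AMG2013": 12.5, "Blasbench": 7.4, "Kripke": 11.5, "Lulesh": 17.6,
    "NPB3.3": 11.0, "BEM4I": 15.7, "INDEED": 17.6, "ESPRESO": 4.3,
    "OpenFOAM": 15.9,
}
dynamic_savings = {
    "AMG2013": 12.5, "Blasbench": 15.3, "Kripke": 18.5, "Lulesh": 18.7,
    "NPB3.3": 11.0, "BEM4I": 34.1, "INDEED": 19.5, "ESPRESO": 8.2,
    "OpenFOAM": 20.1,
}


def summarize(savings):
    """Return (min, max, mean) of a savings table; mean rounded to one decimal."""
    values = list(savings.values())
    return min(values), max(values), round(sum(values) / len(values), 1)


print("static :", summarize(static_savings))
print("dynamic:", summarize(dynamic_savings))
```

Both means agree with the "Average" rows (12.6 % and 17.5 %).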
  • 10.
    Energy savings achieved by static and dynamic tuning
    Evaluation of READEX Tool Suite on TUD Taurus Haswell system with HDEEM energy measurements
    (default is Intel compiler; * uses GCC compiler)
    Application     HW parameters       Static tuning savings   Dynamic tuning savings   READEX tuning savings
                                        (node energy / time)    (node energy / time)     (node energy / time)
    AMG2013         CF, UCF, threads    12.5% / −0.9%           N/A                      7.0% / −14.0%
    Blasbench       CF, UCF, threads    7.4% / −0.9%            15.3% / −18.1%           9.9% / −9.2%
    Kripke          CF, UCF             11.5% / −28.3%          18.8% / −18.7%           10.5% / −28.9%
    Lulesh          CF, UCF, threads    17.6% / −8.9%           18.7% / −11.7%           18.2% / −25.7%
    NPB3.3-BT-MZ    CF, UCF, threads    11% / −11.3%            N/A                      10.8% / −12%
    BEM4I           CF, UCF, threads    15.7% / −6.2%           34.1% / 10.9%            34.0% / 10.9%
    INDEED          CF, UCF, threads    17.6% / −12.8%          19.5% / −14.2%           19.1% / −17.3%
    ESPRESO         CF, UCF, threads    4.3% / −8.9%            8.2% / −10.1%            7.1% / −12.3%
    OpenFOAM        CF, UCF             15.9% / −10.5%          20.1% / 11.5%            9.8% / −9.8%
    Key findings:
    • Best savings achieved with BEM4I application – up to 34% for energy and 11% for runtime
    • In general energy savings are "paid" by extra runtime
  • 11.
    Tuning of the hardware parameters
  • 12.
    Hardware parameter tuning
    Investigation of impact of CPU uncore frequency tuning on memory bound code:
    • Optimal frequency, with low energy consumption, and a small performance impact
    Evaluation using STREAM Copy benchmark
    Results by TU Dresden under READEX project
  • 13.
    Hardware parameter tuning
    Effect of changing core frequencies on uncore performance using memory bound code
    • Just a small impact on the bandwidth and energy
    Evaluation using STREAM Copy benchmark
    Results by TU Dresden under READEX project
    [Figure: heatmap of the energy consumption of a STREAM benchmark for different core and uncore frequencies. The data array does not fit in the processor's L3 cache.]
  • 14.
    Hardware parameter tuning
    L3 cache energy efficiency and bandwidth:
    • Different optimal uncore frequencies
    Evaluation using STREAM Copy benchmark
    Results by TU Dresden under READEX project
    [Figure: heatmap of the energy consumption of a STREAM benchmark for different core and uncore frequencies. The data array does fit in the processor's L3 cache.]
  • 15.
    Effect of hardware parameter tuning on kernels with various arithmetic intensity
  • 16.
  • 17.
    Static tuning for various arithmetic intensity
    Ratio from 1:9
  • 18.
    Static tuning for various arithmetic intensity
    Ratio from 2:8
  • 19.
    Static tuning for various arithmetic intensity
    Ratio from 3:7
  • 20.
    Static tuning for various arithmetic intensity
    Ratio from 4:6
  • 21.
    Static tuning for various arithmetic intensity
    Ratio from 5:5
  • 22.
    Static tuning for various arithmetic intensity
    Ratio from 6:4
  • 23.
    Static tuning for various arithmetic intensity
    Ratio from 7:3
  • 24.
    Static tuning for various arithmetic intensity
    Ratio from 8:2
  • 25.
    Static tuning for various arithmetic intensity
    Ratio from 9:1
  • 26.
    Hardware parameter tuning
    Behavior of the simple application with two kernels
    • Low computational intensity – DGEMV
    • High computational intensity – DGEMM
    • Tuning of three parameters
      • Core frequency
      • Uncore frequency
      • Number of OpenMP threads
    • Visualized by RADAR
    Optimal settings:
    • Low CI (DGEMV): 10 threads, 2.2 GHz UCF, 1.2 GHz CF
    • High CI (DGEMM): 12 threads, 1.2 GHz UCF, 2.5 GHz CF
    • Static tuning for both kernels: 12 threads, 2.2 GHz UCF, 2.4 GHz CF
    [Three plots: compute node energy consumption [J] vs. CPU core frequency [GHz]]
    Note: runtime of both kernels was equal for default settings
    Two kernels with 1:1 workload ratio    Energy consumption    Energy savings
    Default settings                       2017 J                -        -
    Static optimal                         1833 J                179 J    9%
    Dynamic optimal                        1612 J                221 J    12%
    Total savings                          -                     400 J    20%
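The difference between the "static optimal" and "dynamic optimal" rows comes down to how the configuration is chosen: static tuning picks one setting that minimises the energy of the whole run, while dynamic tuning picks the best setting for each kernel separately. The sketch below illustrates only that selection logic. The three (threads, UCF, CF) settings are the ones named on the slide, but the per-kernel energy values are invented for illustration and are not measurements from the presentation.

```python
# Hypothetical per-kernel energy [J] for three (threads, UCF GHz, CF GHz) settings.
# The energy numbers are made up; only the selection logic is the point here.
ENERGY = {
    "DGEMV": {(10, 2.2, 1.2): 800.0, (12, 1.2, 2.5): 1100.0, (12, 2.2, 2.4): 900.0},
    "DGEMM": {(10, 2.2, 1.2): 1050.0, (12, 1.2, 2.5): 820.0, (12, 2.2, 2.4): 930.0},
}


def static_optimum(energy):
    """One configuration for the whole run: minimise the summed energy."""
    configs = set.intersection(*(set(e) for e in energy.values()))
    best = min(configs, key=lambda c: sum(e[c] for e in energy.values()))
    return best, sum(e[best] for e in energy.values())


def dynamic_optimum(energy):
    """One configuration per kernel: minimise each kernel independently."""
    choice = {kernel: min(e, key=e.get) for kernel, e in energy.items()}
    return choice, sum(energy[k][c] for k, c in choice.items())


print(static_optimum(ENERGY))
print(dynamic_optimum(ENERGY))
```

Because the per-kernel minimum can never exceed the energy of any shared setting, the dynamic total is always at most the static total. The gap grows when the kernels prefer very different settings, as with DGEMV and DGEMM here. Switching overhead is ignored in this sketch.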
  • 27.
    Core and uncore frequency tuning under power cap
  • 28.
    Experiments description and testbed parameters
    A set of experiments performed on Intel Broadwell Architecture
    Testbed: Broadwell partition of the Galileo supercomputer in CINECA
    • dual socket server
    • two 18-core Intel Xeon E5-2697v4 processors
    • 2.3 GHz nominal frequency
    • 2.7 GHz turbo frequency when all 18 cores are utilized
    • 145 W TDP
    [Table: key tunable parameters of the 18-core Intel Xeon E5-2697v4 processor and their respective ranges and steps]
  • 29.
    Tuning of COMPUTE bound workload
    • behavior of the platform when running compute bound workload
    • under 145 W (TDP level, no power cap)
    • three different power cap levels: 100 W, 80 W and 60 W
    [Three charts, "Tuning of compute bound region under 80W / 100W / 60W power cap". Axes: energy consumption [J] and runtime [s] vs. frequency [GHz]. Series: EXP0 - default; EXP1 - default Pcap; EXP5 - DVFS under Pcap; EXP6 - UCF under Pcap; EXP7 - DVFS & UCF - min UCF; EXP7 - DVFS & UCF - max UCF.]
    Chart annotations:
    • 80 W: 8.5% energy savings, 8.4% time savings; 12.9% energy savings, 10.9% time extension
    • 100 W: 14.9% energy savings, 4.5% time savings; 21.3% energy savings, 21.1% time extension
    • 60 W: 9.1% energy savings, 9.4% time savings
  • 30.
    Tuning of COMPUTE bound workload under 100W power cap
    [Chart: tuning of compute bound region under 100W power cap. Axes: energy consumption [J] and runtime [s] vs. frequency [GHz]. Series: EXP0 - default; EXP1 - default Pcap; EXP5 - DVFS under Pcap; EXP6 - UCF under Pcap; EXP7 - DVFS & UCF - min UCF; EXP7 - DVFS & UCF - max UCF. Annotations: 14.9% energy savings, 4.5% time savings; 21.3% energy savings, 21.1% time extension.]
  • 31.
    Tuning of COMPUTE bound workload under 80W power cap
    [Chart: tuning of compute bound region under 80W power cap. Axes: energy consumption [J] and runtime [s] vs. frequency [GHz]. Series: EXP0 - default; EXP1 - default Pcap; EXP5 - DVFS under Pcap; EXP6 - UCF under Pcap; EXP7 - DVFS & UCF - min UCF; EXP7 - DVFS & UCF - max UCF. Annotations: 8.5% energy savings, 8.4% time savings; 12.9% energy savings, 10.9% time extension.]
  • 32.
    Tuning of COMPUTE bound workload under 60W power cap
    [Chart: tuning of compute bound region under 60W power cap. Axes: energy consumption [J] and runtime [s] vs. frequency [GHz]. Series: EXP0 - default; EXP1 - default Pcap; EXP5 - DVFS under Pcap; EXP6 - UCF under Pcap; EXP7 - DVFS & UCF - max UCF; EXP7 - DVFS & UCF - min UCF. Annotations: 9.1% energy savings, 9.4% time savings.]
  • 33.
    Observations for COMPUTE bound workload
    • To achieve the best possible performance
      • the uncore frequency must be reduced to minimum
      • up to 9.4 % performance gain and
      • 14.9 % lower energy consumption
    • If further energy savings are required – use DVFS and lower the core frequency
      • up to 21 % of energy savings
      • up to 21 % penalty in runtime
      • this effect is more visible for higher power cap levels
  • 34.
    Tuning of memory bound workload
    • behavior of the platform when running memory bound workload
    • under 145 W (TDP level, no power cap)
    • three different power cap levels: 100 W, 80 W and 60 W
    [Three charts, "Tuning of memory bound region under 100W / 80W / 60W power cap". Axes: energy consumption [J] and runtime [s] vs. frequency [GHz]. Series: EXP0 - default; EXP1 - default Pcap; EXP5 - DVFS under Pcap; EXP6 - UCF under Pcap; EXP7 - DVFS & UCF - UCF = 2.2GHz; EXP7 - DVFS & UCF - max UCF.]
    Chart annotations:
    • 100 W: 38.7% energy savings, 3.6% time extension
    • 80 W: 21.6% energy savings, 3.6% time extension
    • 60 W: 22.2% energy savings, 22.2% time savings
  • 35.
    Tuning of memory bound workload under 100W power cap
    [Chart: tuning of memory bound region under 100W power cap. Axes: energy consumption [J] and runtime [s] vs. frequency [GHz]. Series: EXP0 - default; EXP1 - default Pcap; EXP5 - DVFS under Pcap; EXP6 - UCF under Pcap; EXP7 - DVFS & UCF - UCF = 2.2GHz; EXP7 - DVFS & UCF - max UCF. Annotations: 38.7% energy savings, 3.6% time extension.]
  • 36.
    Tuning of memory bound workload under 80W power cap
    [Chart: tuning of memory bound region under 80W power cap. Axes: energy consumption [J] and runtime [s] vs. frequency [GHz]. Series: EXP0 - default; EXP1 - default Pcap; EXP5 - DVFS under Pcap; EXP6 - UCF under Pcap; EXP7 - DVFS & UCF - UCF = 2.2GHz; EXP7 - DVFS & UCF - max UCF. Annotations: 21.6% energy savings, 3.6% time extension.]
  • 37.
    Tuning of memory bound workload under 60W power cap
    [Chart: tuning of memory bound region under 60W power cap. Axes: energy consumption [J] and runtime [s] vs. frequency [GHz]. Series: EXP0 - default; EXP1 - default Pcap; EXP5 - DVFS under Pcap; EXP6 - UCF under Pcap; EXP7 - DVFS & UCF - UCF = 2.2GHz; EXP7 - DVFS & UCF - max UCF. Annotations: 22.2% energy savings, 22.2% time savings.]
  • 38.
    Observations for both workloads
    Observations for memory bound workload
    • Under a power budget lower than 80 W
      • DVFS should be set to minimum value
      • this boosts the performance of the uncore part by 22%
    • Tuning of the uncore frequency
      • has low effect on the performance
      • but a major effect on energy consumption
      • between 21% (60 W and 80 W) and 38% (100 W)
    Observations for compute bound workload
    • To achieve the best possible performance
      • the uncore frequency must be reduced to minimum
      • up to 9.4 % performance gain and
      • 14.9 % lower energy consumption
    • If further energy savings are required – use DVFS and lower the core frequency
      • up to 21 % of energy savings
      • up to 21 % penalty in runtime
      • this effect is more visible for higher power cap levels
  • 39.
    Evaluation of complex HPC applications
  • 40.
    BEM4I Application
    Hardware: dual socket system with 2x12 CPU cores – "standard HW" in HPC centres
    Application runtime       assemble_k [s]   assemble_v [s]   gmres_solve [s]   print_vtu [s]   main [s]
    default runtime           5.4              5.9              10.2              5.6             27.3
    static tuning runtime     9.8              10.6             6.1               2.4             29.0
    dynamic tuning runtime    7.0              7.2              7.9               2.1             24.3
    static savings [%]        -82.3%           -79.1%           40.5%             56.8%           -6.2%
    dynamic savings [%]       -30.6%           -20.9%           23.2%             62.9%           10.9%
    Region description:
    • assemble_k and assemble_v – high utilization of vector units, extreme level of optimization – fully compute bound, great utilization of both sockets and all cores
    • gmres_solve – uses DGEMV from MKL – memory bound, suffers from NUMA effects; this routine is more efficient on a single socket
    • print_vtu – single threaded I/O and network bound region which stores data to a file on the Lustre file system
    Tuning model:
    "static": {
      "FREQUENCY": "25",           <--------- 2.5 GHz
      "NUM_THREADS": "12",         <--------- 12 OpenMP threads
      "UNCORE_FREQUENCY": "22" }   <--------- 2.2 GHz
    "assemble_k": {"FREQUENCY": "23", "NUM_THREADS": "24", "UNCORE_FREQUENCY": "16"},
    "assemble_v": {"FREQUENCY": "25", "NUM_THREADS": "24", "UNCORE_FREQUENCY": "14"},
    "gmres_solve": {"FREQUENCY": "17", "NUM_THREADS": "8", "UNCORE_FREQUENCY": "22"},
    "print_vtu": {"FREQUENCY": "25", "NUM_THREADS": "6", "UNCORE_FREQUENCY": "24"}
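The slide annotates the tuning-model values as tenths of a GHz ("25" is 2.5 GHz, "22" is 2.2 GHz). A minimal Python sketch of reading such a fragment and decoding it per region follows. The enclosing JSON object, the function name `settings_for`, and the fallback to the "static" entry for unknown regions are assumptions of this sketch; the actual READEX tuning-model file format and runtime behaviour may differ.

```python
import json

# Tuning-model fragment from the slide, wrapped in one JSON object (assumption)
# and written with plain ASCII quotes.
TUNING_MODEL = json.loads("""
{
  "static":      {"FREQUENCY": "25", "NUM_THREADS": "12", "UNCORE_FREQUENCY": "22"},
  "assemble_k":  {"FREQUENCY": "23", "NUM_THREADS": "24", "UNCORE_FREQUENCY": "16"},
  "assemble_v":  {"FREQUENCY": "25", "NUM_THREADS": "24", "UNCORE_FREQUENCY": "14"},
  "gmres_solve": {"FREQUENCY": "17", "NUM_THREADS": "8",  "UNCORE_FREQUENCY": "22"},
  "print_vtu":   {"FREQUENCY": "25", "NUM_THREADS": "6",  "UNCORE_FREQUENCY": "24"}
}
""")


def settings_for(region, model=TUNING_MODEL):
    """Return (core GHz, uncore GHz, OpenMP threads) for a region.

    Unknown regions fall back to the "static" entry; that fallback is an
    assumption of this sketch, not documented READEX behaviour.
    """
    entry = model.get(region, model["static"])
    return (
        int(entry["FREQUENCY"]) / 10,         # stored in tenths of a GHz
        int(entry["UNCORE_FREQUENCY"]) / 10,  # stored in tenths of a GHz
        int(entry["NUM_THREADS"]),
    )


for region in ("assemble_k", "gmres_solve", "print_vtu"):
    print(region, settings_for(region))
```

The decoded values match the region descriptions: the memory bound gmres_solve region gets a low core frequency and few threads, while the compute bound assembly regions use all 24 threads with a reduced uncore frequency.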
  • 41.
    BEM4I Application
    Compute node energy       assemble_k [J]   assemble_v [J]   gmres_solve [J]   print_vtu [J]   main [J]
    default energy            1476             1484             2733              1142            6872
    static tuning energy      1962             2015             1366              420             5792
    dynamic tuning energy     1467             1462             1259              293             4531
    static savings [%]        -33.8%           -35.8%           50.0%             63.2%           15.7%
    dynamic savings [%]       0.6%             1.5%             53.9%             74.3%           34.1%
    The large energy savings are a combination of optimal HW settings and runtime savings due to mitigation of the NUMA effect by optimal settings of OpenMP threading
    • Without the savings in runtime, a similar application will see
      • Energy savings approx. 15 – 20%
      • Runtime savings approx. -15%
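The savings rows in the BEM4I runtime and energy tables follow the usual relative-savings formula, (default − tuned) / default, where a negative value means the tuned run was worse. The sketch below applies it to the "main" column; the function name is illustrative.

```python
def savings_percent(default, tuned):
    """Relative saving in percent; negative means the tuned run was worse."""
    return (default - tuned) / default * 100.0


# "main" column of the BEM4I energy [J] and runtime [s] tables.
energy_j = {"default": 6872, "static": 5792, "dynamic": 4531}
runtime_s = {"default": 27.3, "static": 29.0, "dynamic": 24.3}

for name in ("static", "dynamic"):
    e = savings_percent(energy_j["default"], energy_j[name])
    t = savings_percent(runtime_s["default"], runtime_s[name])
    print(f"{name}: energy {e:.1f}%, runtime {t:.1f}%")
```

The energy figures reproduce the table (15.7 % and 34.1 %). The dynamic runtime saving comes out at about 11.0 % from the rounded runtimes, against the 10.9 % on the slide, presumably because the slide was computed from unrounded measurements.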
  • 42.
    OpenFOAM Application
    • Computational fluid dynamics
    • Finite volume + multigrid solver
    OpenFOAM            Energy consumption    Energy savings
    Default settings    14 231 J              -          -
    Static tuning       12 264 J              2 264 J    15.9%
    Dynamic tuning
    Total savings
  • 43.
    OpenFOAM Application
    • Computational fluid dynamics
    • Finite volume + multigrid solver
    OpenFOAM            Energy consumption    Energy savings
    Default settings    14 231 J              -          -
    Static tuning       12 264 J              2 264 J    15.9%
    Dynamic tuning      11 370 J              597 J      4.8%
    Total savings                             2 861 J    20.1%
  • 44.
    ESPRESO Application
    33% of energy savings
    22% of time savings and improved strong scalability
    • Structural mechanics code
    • Finite element + sparse FETI solver
    • Different tuning models are needed for different numbers of nodes in strong scaling, because the workload per node varies
    • Includes dynamic switching overheads
    [Figure: energy savings analysis for the strong scalability test of the ESPRESO library when running the cube benchmark]
  • 45.
  • 46.
    Application parameters tuning of the ESPRESO
    Energy savings: 50% - 66% against "reasonable" settings, 86% against the worst case
    [Chart: energy consumption [kJ] vs. configuration index (0 to 3500), with the "reasonable" settings and the optimal settings marked]
    9 parameters, 3840 combinations
    • FETI METHOD 2x
    • PRECONDITIONER 5x
    • ITERATIVE SOLVER TYPE 2x
    • HFETI type 2x
    • NON-UNIFORM PARTS 6x
    • REDUNDANT LAGRANGE 2x
    • SCALING 2x
    • B0_TYPE 2x
    • ADAPTIVE PRECISION 2x
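The figure of 3840 combinations is the product of the option counts of the nine parameters. The Python sketch below checks that and enumerates the search space an exhaustive tuner would walk. The slide gives only the number of options per parameter, so option indices stand in for the real values, and the names `OPTION_COUNTS` and `configurations` are illustrative.

```python
from itertools import product
from math import prod

# Number of options per ESPRESO parameter, as listed on the slide.
# The concrete option values are not given there; indices stand in for them.
OPTION_COUNTS = {
    "FETI METHOD": 2,
    "PRECONDITIONER": 5,
    "ITERATIVE SOLVER TYPE": 2,
    "HFETI type": 2,
    "NON-UNIFORM PARTS": 6,
    "REDUNDANT LAGRANGE": 2,
    "SCALING": 2,
    "B0_TYPE": 2,
    "ADAPTIVE PRECISION": 2,
}


def configurations(option_counts):
    """Yield every configuration as a dict of parameter name -> option index."""
    names = list(option_counts)
    for combo in product(*(range(n) for n in option_counts.values())):
        yield dict(zip(names, combo))


total = prod(OPTION_COUNTS.values())
print(len(OPTION_COUNTS), "parameters,", total, "combinations")
```

Since each configuration needs its own application run, the cost of an exhaustive search grows multiplicatively with every added parameter.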
  • 47.
    Application parameters tuning
    Application parameter tuning is very promising
    • application configuration parameters are given in the input file
    • each setting requires an individual start of the application
    • tool performs automatic search of application parameter space
    Application    Parameters tested /        Energy savings compared    Energy savings compared to
                   total number of options    to the worst settings      default or reasonable settings
    ESPRESO        9 / 3840                   86%                        50 – 66%
    ELMER          1 / 40                     97%                        50 – 75%
    OpenFOAM       2 / 12                     24%                        8%
    INDEED         3 / 12                     35%                        25%
  • 48.
