Energy Efficient Computing using Dynamic Tuning

The document outlines a methodology for dynamic tuning of high-performance computing (HPC) applications to improve energy efficiency and resource utilization. It discusses the READEX project, which develops tools for automatic tuning of applications, and evaluates the impact of tuning on energy consumption and runtime across benchmarks and production applications. Key findings include energy savings of up to 34% from dynamic tuning, achieved with the BEM4I application, and the importance of adjusting hardware parameters (core frequency, uncore frequency, number of threads) to the needs of each code region.

Dynamic Tuning of HPC Applications
Overview
Methodology
• Motivation and Introduction
• READEX project overview
• What can you achieve with static and dynamic tuning
• Tuning of the hardware parameters
  • Effect on the energy consumption
  • Effect of hardware parameter tuning on kernels with various arithmetic intensity
• Evaluation of complex HPC applications
  • BEM4I
  • OpenFOAM
  • Scalability tests with ESPRESO
• Tuning of the application parameters
Motivation and Introduction
READEX Project & Motivation
• Energy efficiency is critical to current and future systems
• Applications exhibit dynamic behavior
  • Changing resource requirements
  • Computational characteristics
  • Changing load on processors over time
Goal was to create a tools-aided methodology for automatic tuning of parallel applications.
Dynamically adjust system parameters to actual resource requirements.
What is dynamic tuning
[Figure: a phase region containing two significant regions, one annotated FREQ=2 GHz and the other FREQ=1.5 GHz]
READEX Tool Suite
1. Instrument application
  • Score-P provides different kinds of instrumentation
2. Detect dynamism
  • Check whether runtime situations could benefit from tuning
3. Detect energy saving potential and configurations (DTA)
  • Use tuning plugin and power measurement infrastructure to search for optimal configuration
  • Create tuning model
4. Runtime application tuning (RAT)
  • Apply tuning model, use optimal configuration
[Figure: READEX Tool Suite architecture with the components Periscope Tuning Framework, READEX Tuning Plugin, Application Tuning Model, Score-P, READEX Runtime Library, Online Access Interface, Substrate Plugin Interface, Parameter Control Plugin, and Energy Measurements (HDEEM)]
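
Step 3 above searches for the configuration with the lowest energy for a region. A minimal sketch of such an exhaustive search is given below. The `measure_energy` callback and the synthetic `toy_energy` model are hypothetical stand-ins for real power measurements (for example HDEEM readings); neither is part of the READEX tool suite, and the frequency lists are illustrative values only.

```python
from itertools import product


def best_configuration(measure_energy, core_freqs, uncore_freqs, thread_counts):
    """Try every (core, uncore, threads) triple and return the one
    with the lowest measured energy."""
    candidates = product(core_freqs, uncore_freqs, thread_counts)
    return min(candidates, key=lambda cfg: measure_energy(*cfg))


def toy_energy(core_ghz, uncore_ghz, threads):
    """Synthetic energy model (an assumption, not measured data) whose
    minimum lies at 1.7 GHz core, 2.2 GHz uncore and 8 threads."""
    return (core_ghz - 1.7) ** 2 + (uncore_ghz - 2.2) ** 2 + 0.01 * (threads - 8) ** 2


core_freqs = [1.2, 1.7, 2.5]    # GHz, illustrative values
uncore_freqs = [1.2, 2.2, 3.0]  # GHz, illustrative values
thread_counts = [8, 12, 24]

best = best_configuration(toy_energy, core_freqs, uncore_freqs, thread_counts)
print(best)
```

In a real setup the callback would run the instrumented region once per configuration and return the measured energy, so the cost of the search grows with the product of the three option lists.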
READEX Test Suite
Consists of benchmarks, proxy apps and complex production applications.
Key features:
• Full set of scripts allows reproducibility of experiments on
  • TUD Taurus HSW (HDEEM) and BDW partitions
  • IT4I Salomon machine (RAPL)
• Support for Slurm and PBS schedulers
• Automatic savings evaluation
• Performs evaluation of
  • hardware and system parameter tuning
  • application parameter tuning
• Contains manual instrumentation of significant regions
  • using a header file → can be adapted to test other tools

Application type               Application name
benchmarks or proxy apps       AMG2013, Blasbench, Kripke, Lulesh, NPB3.3
production applications        BEM4I, ESPRESO, INDEED, OpenFOAM
What can you expect from static tuning
Manual static tuning savings across the test suite: MIN 4.3%, AVG 12.6%, MAX 17.6%

Software     Static tuning savings
AMG2013      12.5 %
Blasbench     7.4 %
Kripke       11.5 %
Lulesh       17.6 %
NPB3.3       11.0 %
BEM4I        15.7 %
INDEED       17.6 %
ESPRESO       4.3 %
OpenFOAM     15.9 %
Average      12.6 %
What can you expect from dynamic tuning
Proposal goal: up to 30%
Manual dynamic tuning savings across the test suite: MIN 8.2%, AVG 17.5%, MAX 34.1%

Software     Dynamic tuning savings
AMG2013      12.5 %
Blasbench    15.3 %
Kripke       18.5 %
Lulesh       18.7 %
NPB3.3       11.0 %
BEM4I        34.1 %
INDEED       19.5 %
ESPRESO       8.2 %
OpenFOAM     20.1 %
Average      17.5 %
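
The minimum, average and maximum quoted on the static and dynamic tuning slides can be recomputed from the per-application values. The short sketch below does this; the dictionaries copy the percentages from the two slides, and the `summarize` helper is a hypothetical name introduced only for illustration.

```python
# Energy savings in percent, copied from the static and dynamic tuning slides.
static_savings = {
    "AMG2013": 12.5, "Blasbench": 7.4, "Kripke": 11.5, "Lulesh": 17.6,
    "NPB3.3": 11.0, "BEM4I": 15.7, "INDEED": 17.6, "ESPRESO": 4.3,
    "OpenFOAM": 15.9,
}
dynamic_savings = {
    "AMG2013": 12.5, "Blasbench": 15.3, "Kripke": 18.5, "Lulesh": 18.7,
    "NPB3.3": 11.0, "BEM4I": 34.1, "INDEED": 19.5, "ESPRESO": 8.2,
    "OpenFOAM": 20.1,
}


def summarize(savings):
    """Return (min, average rounded to one decimal, max) of the savings."""
    values = list(savings.values())
    return min(values), round(sum(values) / len(values), 1), max(values)


print("static :", summarize(static_savings))
print("dynamic:", summarize(dynamic_savings))
```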
Energy savings achieved by static and dynamic tuning
Each cell gives node energy savings / time savings; a negative value means an increase. Default is the Intel compiler (* uses the GCC compiler).

Application     HW parameters       Static tuning      Dynamic tuning     READEX tuning
AMG2013         CF, UCF, threads    12.5% / -0.9%      N/A                7.0% / -14.0%
Blasbench       CF, UCF, threads    7.4% / -0.9%       15.3% / -18.1%     9.9% / -9.2%
Kripke          CF, UCF             11.5% / -28.3%     18.8% / -18.7%     10.5% / -28.9%
Lulesh          CF, UCF, threads    17.6% / -8.9%      18.7% / -11.7%     18.2% / -25.7%
NPB3.3-BT-MZ    CF, UCF, threads    11% / -11.3%       N/A                10.8% / -12%
BEM4I           CF, UCF, threads    15.7% / -6.2%      34.1% / 10.9%      34.0% / 10.9%
INDEED          CF, UCF, threads    17.6% / -12.8%     19.5% / -14.2%     19.1% / -17.3%
ESPRESO         CF, UCF, threads    4.3% / -8.9%       8.2% / -10.1%      7.1% / -12.3%
OpenFOAM        CF, UCF             15.9% / -10.5%     20.1% / 11.5%      9.8% / -9.8%

Evaluation of the READEX Tool Suite on the TUD Taurus Haswell system with HDEEM energy measurements.
Key findings:
• Best savings achieved with the BEM4I application: up to 34% for energy and 11% for runtime
• In general, energy savings are "paid" for by extra runtime
Tuning of the hardware parameters
Hardware parameter tuning
Investigation of the impact of CPU uncore frequency tuning on memory bound code:
• Optimal frequency, with low energy consumption and a small performance impact
Evaluation using the STREAM Copy benchmark. Results by TU Dresden under the READEX project.
Hardware parameter tuning
Effect of changing core frequencies on uncore performance using memory bound code:
• Just a small impact on the bandwidth and energy
Evaluation using the STREAM Copy benchmark. Results by TU Dresden under the READEX project.
[Figure: heatmap of the energy consumption of a STREAM benchmark for different core and uncore frequencies. The data array does not fit in the processor's L3 cache.]
Hardware parameter tuning
L3 cache energy efficiency and bandwidth:
• Different optimal uncore frequencies
Evaluation using the STREAM Copy benchmark. Results by TU Dresden under the READEX project.
[Figure: heatmap of the energy consumption of a STREAM benchmark for different core and uncore frequencies. The data array fits in the processor's L3 cache.]
Effect of hardware parameter tuning on kernels with various arithmetic intensity
Arithmetic intensity
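
Arithmetic intensity is the ratio of floating-point operations to bytes moved to and from memory. As a rough illustration, the sketch below estimates it for double-precision matrix-vector (DGEMV) and matrix-matrix (DGEMM) products, the two kernels the deck uses as low and high intensity examples. The operation and traffic counts are idealized textbook estimates that assume every operand is moved exactly once; they are an assumption, not values measured in the presentation.

```python
BYTES_PER_DOUBLE = 8


def dgemv_intensity(n):
    """Flops per byte for y = A @ x with an n x n matrix A."""
    flops = 2 * n * n                                  # one multiply and one add per matrix entry
    bytes_moved = BYTES_PER_DOUBLE * (n * n + 2 * n)   # read A and x, write y
    return flops / bytes_moved


def dgemm_intensity(n):
    """Flops per byte for C = A @ B with n x n matrices."""
    flops = 2 * n ** 3
    bytes_moved = BYTES_PER_DOUBLE * 3 * n * n         # read A and B, write C
    return flops / bytes_moved


for n in (120, 1200):
    print(n, round(dgemv_intensity(n), 3), round(dgemm_intensity(n), 1))
```

Under these assumptions the DGEMV intensity stays below 0.25 flops per byte regardless of size, while the DGEMM intensity grows as n/12, which is why the first behaves as memory bound and the second as compute bound.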
Static tuning for various arithmetic intensity
[Figures: one plot per slide for the ratios 1:9, 2:8, 3:7, 4:6, 5:5, 6:4, 7:3, 8:2 and 9:1]
Hardware parameter tuning
Behavior of a simple application with two kernels
• Low computational intensity: DGEMV
• High computational intensity: DGEMM
• Tuning of three parameters
  • Core frequency (CF)
  • Uncore frequency (UCF)
  • Number of OpenMP threads
• Visualized by RADAR
[Figures: compute node energy consumption [J] against CPU core frequency [GHz] for the low CI kernel, the high CI kernel, and static tuning of both kernels]

Optimal settings            Threads    UCF        CF
Low CI (DGEMV)              10         2.2 GHz    1.2 GHz
High CI (DGEMM)             12         1.2 GHz    2.5 GHz
Static tuning, both kernels 12         2.2 GHz    2.4 GHz

Note: the runtime of both kernels was equal for default settings (two kernels with 1:1 workload ratio).

                    Energy consumption    Energy savings
Default settings    2017 J                -         -
Static optimal      1833 J                179 J     9%
Dynamic optimal     1612 J                221 J     12%
Total savings       -                     400 J     20%
Core and uncore frequency tuning under power cap
Experiments description and testbed parameters
A set of experiments performed on the Intel Broadwell architecture.
Testbed: Broadwell partition of the Galileo supercomputer at CINECA
• dual socket server
• two 18-core Intel Xeon E5-2697v4 processors
• 2.3 GHz nominal frequency
• 2.7 GHz turbo frequency when all 18 cores are utilized
• 145 W TDP
[Table: key tunable parameters of the 18-core Intel Xeon E5-2697v4 processor and their respective ranges and steps]
Tuning of COMPUTE bound workload
• behavior of the platform when running compute bound workload
  • under 145 W (TDP level, no power cap)
  • three different power cap levels: 100 W, 80 W and 60 W
[Figures: "Tuning of compute bound region under 100 W power cap", "... under 80 W power cap" and "... under 60 W power cap". Each plot shows energy consumption [J] and runtime [s] against frequency [GHz] for the series EXP0 - default, EXP1 - default Pcap, EXP5 - DVFS under Pcap, EXP6 - UCF under Pcap, EXP7 - DVFS & UCF - min UCF, and EXP7 - DVFS & UCF - max UCF.]
Savings annotated in the plots:
• 100 W power cap: 14.9% energy savings with 4.5% time savings; 21.3% energy savings with 21.1% time extension
• 80 W power cap: 8.5% energy savings with 8.4% time savings; 12.9% energy savings with 10.9% time extension
• 60 W power cap: 9.1% energy savings with 9.4% time savings
Observations for COMPUTE bound workload
• To achieve the best possible performance
  • the uncore frequency must be reduced to the minimum
  • up to 9.4% performance gain and
  • up to 14.9% lower energy consumption
• If further energy savings are required, use DVFS and lower the core frequency
  • up to 21% energy savings
  • up to 21% penalty in runtime
  • this effect is more visible for higher power cap levels
Tuning of memory bound workload
• behavior of the platform when running memory bound workload
  • under 145 W (TDP level, no power cap)
  • three different power cap levels: 100 W, 80 W and 60 W
[Figures: "Tuning of memory bound region under 100 W power cap", "... under 80 W power cap" and "... under 60 W power cap". Each plot shows energy consumption [J] and runtime [s] against frequency [GHz] for the series EXP0 - default, EXP1 - default Pcap, EXP5 - DVFS under Pcap, EXP6 - UCF under Pcap, EXP7 - DVFS & UCF - UCF = 2.2 GHz, and EXP7 - DVFS & UCF - max UCF.]
Savings annotated in the plots:
• 100 W power cap: 38.7% energy savings with 3.6% time extension
• 80 W power cap: 21.6% energy savings with 3.6% time extension
• 60 W power cap: 22.2% energy savings with 22.2% time savings
Observations for both workloads
Observations for memory bound workload
• Under a power budget lower than 80 W
  • DVFS should be set to the minimum value
  • this boosts the performance of the uncore part by 22%
• Tuning of the uncore frequency
  • has a low effect on the performance
  • but a major effect on energy consumption
  • between 21% (60 W and 80 W) and 38% (100 W)
Observations for compute bound workload
• To achieve the best possible performance
  • the uncore frequency must be reduced to the minimum
  • up to 9.4% performance gain and
  • up to 14.9% lower energy consumption
• If further energy savings are required, use DVFS and lower the core frequency
  • up to 21% energy savings
  • up to 21% penalty in runtime
  • this effect is more visible for higher power cap levels
Evaluation of complex HPC applications
BEM4I Application
Hardware: dual socket system with 2x12 CPU cores, "standard HW" in HPC centres

Application runtime          assemble_k [s]   assemble_v [s]   gmres_solve [s]   print_vtu [s]   main [s]
default runtime              5.4              5.9              10.2              5.6             27.3
static tuning runtime        9.8              10.6             6.1               2.4             29.0
dynamic tuning runtime       7.0              7.2              7.9               2.1             24.3
static savings [%]           -82.3%           -79.1%           40.5%             56.8%           -6.2%
dynamic savings [%]          -30.6%           -20.9%           23.2%             62.9%           10.9%

Region description:
• assemble_k and assemble_v: high utilization of vector units, extreme level of optimization; fully compute bound, great utilization of both sockets and all cores
• gmres_solve: uses DGEMV from MKL; memory bound, suffers from the NUMA effect; this routine is more efficient on a single socket
• print_vtu: single threaded I/O and network bound region which stores data to a file on the Lustre file system

Tuning model (the slide annotates FREQUENCY "25" as 2.5 GHz, NUM_THREADS "12" as 12 OpenMP threads and UNCORE_FREQUENCY "22" as 2.2 GHz):
    "static":      { "FREQUENCY": "25", "NUM_THREADS": "12", "UNCORE_FREQUENCY": "22" }
    "assemble_k":  { "FREQUENCY": "23", "NUM_THREADS": "24", "UNCORE_FREQUENCY": "16" },
    "assemble_v":  { "FREQUENCY": "25", "NUM_THREADS": "24", "UNCORE_FREQUENCY": "14" },
    "gmres_solve": { "FREQUENCY": "17", "NUM_THREADS": "8",  "UNCORE_FREQUENCY": "22" },
    "print_vtu":   { "FREQUENCY": "25", "NUM_THREADS": "6",  "UNCORE_FREQUENCY": "24" }
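
The tuning model on this slide stores every setting as a string, with frequencies apparently expressed in units of 100 MHz (the slide annotates "25" as 2.5 GHz and "22" as 2.2 GHz). A small sketch of decoding such a model into physical units follows. The dictionary copies the values from the slide; the `decode` helper and its output keys are hypothetical names introduced for illustration and are not part of the READEX runtime library.

```python
# Per-region settings copied from the slide; "static" is the single
# configuration used for the whole run under static tuning.
TUNING_MODEL = {
    "static":      {"FREQUENCY": "25", "NUM_THREADS": "12", "UNCORE_FREQUENCY": "22"},
    "assemble_k":  {"FREQUENCY": "23", "NUM_THREADS": "24", "UNCORE_FREQUENCY": "16"},
    "assemble_v":  {"FREQUENCY": "25", "NUM_THREADS": "24", "UNCORE_FREQUENCY": "14"},
    "gmres_solve": {"FREQUENCY": "17", "NUM_THREADS": "8",  "UNCORE_FREQUENCY": "22"},
    "print_vtu":   {"FREQUENCY": "25", "NUM_THREADS": "6",  "UNCORE_FREQUENCY": "24"},
}


def decode(entry):
    """Convert one model entry to GHz values and an integer thread count,
    assuming frequencies are given in units of 100 MHz."""
    return {
        "core_ghz": int(entry["FREQUENCY"]) / 10,
        "uncore_ghz": int(entry["UNCORE_FREQUENCY"]) / 10,
        "threads": int(entry["NUM_THREADS"]),
    }


for region, entry in TUNING_MODEL.items():
    print(region, decode(entry))
```

Read this way, the memory bound gmres_solve region runs at a low core frequency on 8 threads, while the compute bound assembly regions use all 24 threads with a lower uncore frequency.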
BEM4I Application

Compute node energy          assemble_k [J]   assemble_v [J]   gmres_solve [J]   print_vtu [J]   main [J]
default energy               1476             1484             2733              1142            6872
static tuning energy         1962             2015             1366              420             5792
dynamic tuning energy        1467             1462             1259              293             4531
static savings [%]           -33.8%           -35.8%           50.0%             63.2%           15.7%
dynamic savings [%]          0.6%             1.5%             53.9%             74.3%           34.1%

The large energy savings are a combination of optimal HW settings and runtime savings due to mitigation of the NUMA effect by optimal settings of OpenMP threading.
• Without the runtime savings caused by this effect, a similar application will see
  • Energy savings of approx. 15 - 20%
  • Runtime savings of approx. -15%
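
The savings rows in the energy table follow from the default and tuned energies. The sketch below recomputes the whole-application ("main") figures; the energies are copied from the slide, and `savings_percent` is a hypothetical helper name. A negative result would mean the tuned run consumed more than the default.

```python
def savings_percent(default, tuned):
    """Relative savings of a tuned run against the default run, in percent."""
    return 100.0 * (default - tuned) / default


# Compute node energy of the "main" region in joules, from the slide.
default_j, static_j, dynamic_j = 6872, 5792, 4531

static_pct = round(savings_percent(default_j, static_j), 1)
dynamic_pct = round(savings_percent(default_j, dynamic_j), 1)
print(f"static tuning:  {static_pct} %")
print(f"dynamic tuning: {dynamic_pct} %")
```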
OpenFOAM Application
• Computational fluid dynamics
• Finite volume + multigrid solver

OpenFOAM            Energy consumption    Energy savings
Default settings    14 231 J              -           -
Static tuning       12 264 J              2 264 J     15.9%
Dynamic tuning      11 370 J              597 J       4.8%
Total savings       -                     2 861 J     20.1%
ESPRESO Application
33% energy savings, 22% time savings and improved strong scalability
• Structural mechanics code
• Finite element + sparse FETI solver
• Different tuning models are needed for different numbers of nodes for strong scalability, because the workload per node varies
• Includes dynamic switching overheads
[Figure: energy savings analysis for the strong scalability test of the ESPRESO library when running the cube benchmark]
Application parameters tuning
Application parameters tuning of the ESPRESO
Savings: 50% - 66% against "reasonable" settings, 86% against the worst case
[Figure: energy consumption [kJ] against configuration index, with the "reasonable" settings and the optimal settings marked]
9 parameters, 3840 combinations:
• FETI METHOD 2x
• PRECONDITIONER 5x
• ITERATIVE SOLVER TYPE 2x
• HFETI type 2x
• NON-UNIFORM PARTS 6x
• REDUNDANT LAGRANGE 2x
• SCALING 2x
• B0_TYPE 2x
• ADAPTIVE PRECISION 2x
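
The figure of 3840 combinations is the product of the option counts of the nine parameters. The sketch below checks this and enumerates the search space. The parameter names and counts come from the slide; because the slide does not name the individual options, each configuration is represented as a tuple of option indices, which is an assumption made for illustration.

```python
from itertools import product
from math import prod

# Number of options per ESPRESO parameter, as listed on the slide.
OPTION_COUNTS = {
    "FETI METHOD": 2,
    "PRECONDITIONER": 5,
    "ITERATIVE SOLVER TYPE": 2,
    "HFETI type": 2,
    "NON-UNIFORM PARTS": 6,
    "REDUNDANT LAGRANGE": 2,
    "SCALING": 2,
    "B0_TYPE": 2,
    "ADAPTIVE PRECISION": 2,
}

total = prod(OPTION_COUNTS.values())

# Every configuration as a tuple of option indices, one index per parameter.
configurations = list(product(*(range(n) for n in OPTION_COUNTS.values())))

print(len(OPTION_COUNTS), "parameters,", total, "combinations")
```

Each tuple corresponds to one separate start of the application, so an exhaustive search over this space means 3840 runs.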
Application parameters tuning
Application parameter tuning is very promising
• application configuration parameters are given in the input file
• each setting requires an individual start of the application
• tool performs automatic search of the application parameter space

Application    Parameters tested / total options    Energy savings vs. worst settings    Energy savings vs. default or reasonable settings
ESPRESO        9 / 3840                             86%                                  50 - 66%
ELMER          1 / 40                               97%                                  50 - 75%
OpenFOAM       2 / 12                               24%                                  8%
INDEED         3 / 12                               35%                                  25%
Thank you

Recommended

PDF
Hardware & Software Platforms for HPC, AI and ML
PDF
Exceeding the Limits of Air Cooling to Unlock Greater Potential in HPC
PDF
Overview of HPC Interconnects
PDF
CUDA-Python and RAPIDS for blazing fast scientific computing
PDF
NNSA Explorations: ARM for Supercomputing
PDF
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
PDF
MIT's experience on OpenPOWER/POWER 9 platform
PDF
Preparing to program Aurora at Exascale - Early experiences and future direct...
PDF
Trends in Systems and How to Get Efficient Performance
PDF
Lenovo HPC Strategy Update
PDF
Intel dpdk Tutorial
PDF
Summit workshop thompto
PPSX
FD.io Vector Packet Processing (VPP)
PDF
DOME 64-bit μDataCenter
PDF
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
PDF
State of ARM-based HPC
PDF
IBM HPC Transformation with AI
PDF
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
PDF
A Fresh Look at HPC from Huawei Enterprise
PDF
IBM Data Centric Systems & OpenPOWER
PDF
Mellanox Announces HDR 200 Gb/s InfiniBand Solutions
PDF
ARM HPC Ecosystem
PDF
BXI: Bull eXascale Interconnect
PDF
AI is Impacting HPC Everywhere
PDF
HPC Accelerating Combustion Engine Design
PDF
TAU E4S ON OpenPOWER /POWER9 platform
PDF
DPDK Summit - 08 Sept 2014 - Ericsson - A Multi-Socket Ferrari for NFV
PDF
High-Performance and Scalable Designs of Programming Models for Exascale Systems
PDF
Runtime Methods to Improve Energy Efficiency in HPC Applications
PPTX
Energy Efficiency in Large Scale Systems

More Related Content

PDF
Hardware & Software Platforms for HPC, AI and ML
PDF
Exceeding the Limits of Air Cooling to Unlock Greater Potential in HPC
PDF
Overview of HPC Interconnects
PDF
CUDA-Python and RAPIDS for blazing fast scientific computing
PDF
NNSA Explorations: ARM for Supercomputing
PDF
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
PDF
MIT's experience on OpenPOWER/POWER 9 platform
PDF
Preparing to program Aurora at Exascale - Early experiences and future direct...
Hardware & Software Platforms for HPC, AI and ML
Exceeding the Limits of Air Cooling to Unlock Greater Potential in HPC
Overview of HPC Interconnects
CUDA-Python and RAPIDS for blazing fast scientific computing
NNSA Explorations: ARM for Supercomputing
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
MIT's experience on OpenPOWER/POWER 9 platform
Preparing to program Aurora at Exascale - Early experiences and future direct...

What's hot

PDF
Trends in Systems and How to Get Efficient Performance
PDF
Lenovo HPC Strategy Update
PDF
Intel dpdk Tutorial
PDF
Summit workshop thompto
PPSX
FD.io Vector Packet Processing (VPP)
PDF
DOME 64-bit μDataCenter
PDF
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
PDF
State of ARM-based HPC
PDF
IBM HPC Transformation with AI
PDF
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
PDF
A Fresh Look at HPC from Huawei Enterprise
PDF
IBM Data Centric Systems & OpenPOWER
PDF
Mellanox Announces HDR 200 Gb/s InfiniBand Solutions
PDF
ARM HPC Ecosystem
PDF
BXI: Bull eXascale Interconnect
PDF
AI is Impacting HPC Everywhere
PDF
HPC Accelerating Combustion Engine Design
PDF
TAU E4S ON OpenPOWER /POWER9 platform
PDF
DPDK Summit - 08 Sept 2014 - Ericsson - A Multi-Socket Ferrari for NFV
PDF
High-Performance and Scalable Designs of Programming Models for Exascale Systems
Trends in Systems and How to Get Efficient Performance
Lenovo HPC Strategy Update
Intel dpdk Tutorial
Summit workshop thompto
FD.io Vector Packet Processing (VPP)
DOME 64-bit μDataCenter
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
State of ARM-based HPC
IBM HPC Transformation with AI
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
A Fresh Look at HPC from Huawei Enterprise
IBM Data Centric Systems & OpenPOWER
Mellanox Announces HDR 200 Gb/s InfiniBand Solutions
ARM HPC Ecosystem
BXI: Bull eXascale Interconnect
AI is Impacting HPC Everywhere
HPC Accelerating Combustion Engine Design
TAU E4S ON OpenPOWER /POWER9 platform
DPDK Summit - 08 Sept 2014 - Ericsson - A Multi-Socket Ferrari for NFV
High-Performance and Scalable Designs of Programming Models for Exascale Systems

Similar to Energy Efficient Computing using Dynamic Tuning

PDF
Runtime Methods to Improve Energy Efficiency in HPC Applications
PPTX
Energy Efficiency in Large Scale Systems
PPTX
Optimizing High Performance Computing Applications for Energy
PDF
Barcelona Supercomputing Center, Generador de Riqueza
PDF
Deep Learning Training at Scale: Spring Crest Deep Learning Accelerator
PDF
POWER10 innovations for HPC
PDF
Hp All In 1
PDF
Intel - Challenges and Opportunities in Cloud-Based Genomics Analytics
PDF
E03403027030
PDF
Dynamic Frequency Scaling Regarding Memory for Energy Efficiency of Embedded...
PDF
The impact of software on data-center energy use - and what can we do about it?
PDF
Performance and Energy evaluation
PPTX
An application classification guided cache tuning heuristic for
PDF
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
PDF
Symposium on HPC Applications – IIT Kanpur
PDF
6 profiling tools
 
PDF
[IGC2018] AMD Don Woligroski - WHY Ryzen
PDF
Ga techsusthpc patterson
PPTX
Hardware-aware thread scheduling: the case of asymmetric multicore processors
PDF
IBM zEnterprise 114 Technical Guide
Runtime Methods to Improve Energy Efficiency in HPC Applications
Energy Efficiency in Large Scale Systems
Optimizing High Performance Computing Applications for Energy
Barcelona Supercomputing Center, Generador de Riqueza
Deep Learning Training at Scale: Spring Crest Deep Learning Accelerator
POWER10 innovations for HPC
Hp All In 1
Intel - Challenges and Opportunities in Cloud-Based Genomics Analytics
E03403027030
Dynamic Frequency Scaling Regarding Memory for Energy Efficiency of Embedded...
The impact of software on data-center energy use - and what can we do about it?
Performance and Energy evaluation
An application classification guided cache tuning heuristic for
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Symposium on HPC Applications – IIT Kanpur
6 profiling tools
 
[IGC2018] AMD Don Woligroski - WHY Ryzen
Ga techsusthpc patterson
Hardware-aware thread scheduling: the case of asymmetric multicore processors
IBM zEnterprise 114 Technical Guide

More from inside-BigData.com

PDF
Adaptive Linear Solvers and Eigensolvers
PDF
Data Parallel Deep Learning
PDF
Versal Premium ACAP for Network and Cloud Acceleration
PPTX
Transforming Private 5G Networks
PDF
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
PDF
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
PDF
SW/HW co-design for near-term quantum computing
PDF
Introducing HPC with a Raspberry Pi Cluster
PDF
Machine Learning for Weather Forecasts
PPTX
HPC AI Advisory Council Update
PDF
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
PDF
Major Market Shifts in IT
PDF
Fugaku Supercomputer joins fight against COVID-19
PDF
HPC Impact: EDA Telemetry Neural Networks
PDF
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
PDF
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
PDF
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
PDF
Scientific Applications and Heterogeneous Architectures
PDF
Scaling TCO in a Post Moore's Era
PDF
Making Supernovae with Jets
Adaptive Linear Solvers and Eigensolvers
Data Parallel Deep Learning
Versal Premium ACAP for Network and Cloud Acceleration
Transforming Private 5G Networks
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
SW/HW co-design for near-term quantum computing
Introducing HPC with a Raspberry Pi Cluster
Machine Learning for Weather Forecasts
HPC AI Advisory Council Update
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Major Market Shifts in IT
Fugaku Supercomputer joins fight against COVID-19
HPC Impact: EDA Telemetry Neural Networks
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Scientific Applications and Heterogeneous Architectures
Scaling TCO in a Post Moore's Era
Making Supernovae with Jets

Recently uploaded

PDF
[BDD 2025 - Mobile Development] Mobile Engineer and Software Engineer: Are we...
PDF
[BDD 2025 - Full-Stack Development] Digital Accessibility: Why Developers nee...
PDF
PCCC25(設立25年記念PCクラスタシンポジウム):エヌビディア合同会社 テーマ2「NVIDIA BlueField-4 DPU」
PPTX
MuleSoft AI Series : Introduction to MCP
PPTX
kernel PPT (Explanation of Windows Kernal).pptx
PPTX
Connecting the unconnectable: Exploring LoRaWAN for IoT
PDF
Cheryl Hung, Vibe Coding Auth Without Melting Down! isaqb Software Architectu...
PDF
Lets Build a Serverless Function with Kiro
PPTX
UFCD 0797 - SISTEMAS OPERATIVOS_Unidade Completa.pptx
PDF
The Evolving Role of the CEO in the Age of AI
PDF
Mastering UiPath Maestro – Session 2 – Building a Live Use Case - Session 2
PDF
[BDD 2025 - Full-Stack Development] PHP in AI Age: The Laravel Way. (Rizqy Hi...
PDF
DUBAI IT MODERNIZATION WITH AZURE MANAGED SERVICES.pdf
PDF
Crane Accident Prevention Guide: Key OSHA Regulations for Safer Operations
PDF
How Much Does It Cost to Build an eCommerce Website in 2025.pdf
PPTX
Support, Monitoring, Continuous Improvement & Scaling Agentic Automation [3/3]
PDF
"DISC as GPS for team leaders: how to lead a team from storming to performing...
 
PDF
So You Want to Work at Google | DevFest Seattle 2025
PPTX
The power of Slack and MuleSoft | Bangalore MuleSoft Meetup #60
PDF
How Much Does It Cost To Build Software
[BDD 2025 - Mobile Development] Mobile Engineer and Software Engineer: Are we...
[BDD 2025 - Full-Stack Development] Digital Accessibility: Why Developers nee...
PCCC25(設立25年記念PCクラスタシンポジウム):エヌビディア合同会社 テーマ2「NVIDIA BlueField-4 DPU」
MuleSoft AI Series : Introduction to MCP
kernel PPT (Explanation of Windows Kernal).pptx
Connecting the unconnectable: Exploring LoRaWAN for IoT
Cheryl Hung, Vibe Coding Auth Without Melting Down! isaqb Software Architectu...
Lets Build a Serverless Function with Kiro
UFCD 0797 - SISTEMAS OPERATIVOS_Unidade Completa.pptx
The Evolving Role of the CEO in the Age of AI
Mastering UiPath Maestro – Session 2 – Building a Live Use Case - Session 2
[BDD 2025 - Full-Stack Development] PHP in AI Age: The Laravel Way. (Rizqy Hi...
DUBAI IT MODERNIZATION WITH AZURE MANAGED SERVICES.pdf
Crane Accident Prevention Guide: Key OSHA Regulations for Safer Operations
How Much Does It Cost to Build an eCommerce Website in 2025.pdf
Support, Monitoring, Continuous Improvement & Scaling Agentic Automation [3/3]
"DISC as GPS for team leaders: how to lead a team from storming to performing...
 
So You Want to Work at Google | DevFest Seattle 2025
The power of Slack and MuleSoft | Bangalore MuleSoft Meetup #60
How Much Does It Cost To Build Software

Energy Efficient Computing using Dynamic Tuning

  • 1.
    Dynamic Tuning ofHPC Applications
  • 2.
    OverviewMethodology• Motivation andIntroduction• READEX project overview• What can you achieve with static and dynamic tuning• Tuning of the hardware parameters• Effect on the energy consumption• Effect of hardware parameter tuning on kernels with various arithmetic intensity• Evaluation of complex HPC applications• BEM4I• OpenFOAM• Scalability tests with ESPRESO• Tuning of the application parameters
  • 3.
    Motivation and Introduction
  • 4.
    READEX Project & Motivation
    • Energy efficiency is critical to current and future systems
    • Applications exhibit dynamic behavior
      • Changing resource requirements
      • Computational characteristics
      • Changing load on processors over time
    Goal was to create a tools-aided methodology for automatic tuning of parallel applications.
    Dynamically adjust system parameters to actual resource requirements
  • 5.
    What is dynamic tuning
    [Figure: a phase region containing significant regions, each running at its own frequency setting. Labels: FREQ=2 GHz, FREQ=1.5 GHz]
  • 6.
    READEX Tool Suite
    1. Instrument application
       • Score-P provides different kinds of instrumentation
    2. Detect dynamism
       • Check whether runtime situations could benefit from tuning
    3. Detect energy saving potential and configurations (DTA)
       • Use tuning plugin and power measurement infrastructure to search for optimal configuration
       • Create tuning model
    4. Runtime application tuning (RAT)
       • Apply tuning model, use optimal configuration
    [Diagram: READEX Tool Suite components: Periscope Tuning Framework, READEX Tuning Plugin, Application, Tuning Model, Score-P, READEX Runtime Library, Online Access Interface, Substrate Plugin Interface, Parameter Control Plugin, Energy Measurements (HDEEM)]
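The runtime step (apply a tuning model when the application enters a significant region) can be sketched roughly as below. This is only an illustration of the idea, not the READEX Runtime Library: the class names, the `default` entry, and the configuration values are all hypothetical, and the backend merely records requests instead of touching any hardware.

```python
from contextlib import contextmanager

# Hypothetical tuning model: region name -> settings. The values are
# illustrative only and do not come from a READEX run.
TUNING_MODEL = {
    "default": {"core_ghz": 2.0, "uncore_ghz": 2.0, "threads": 24},
    "solver": {"core_ghz": 1.5, "uncore_ghz": 2.2, "threads": 12},
}


class RecordingBackend:
    """Stand-in for a parameter control plugin: it only logs requests."""

    def __init__(self):
        self.log = []

    def apply(self, config):
        self.log.append(dict(config))


class RegionTuner:
    """Switches configuration on region entry and restores it on exit."""

    def __init__(self, model, backend):
        self.model = model
        self.backend = backend
        self.stack = [model["default"]]
        backend.apply(self.stack[-1])

    @contextmanager
    def region(self, name):
        # Regions without a model entry inherit the enclosing configuration.
        config = self.model.get(name, self.stack[-1])
        self.stack.append(config)
        self.backend.apply(config)
        try:
            yield config
        finally:
            self.stack.pop()
            self.backend.apply(self.stack[-1])


backend = RecordingBackend()
tuner = RegionTuner(TUNING_MODEL, backend)
with tuner.region("solver"):
    pass  # the significant region's work would run here
with tuner.region("io"):
    pass  # no model entry, so the default configuration stays active

print([c["core_ghz"] for c in backend.log])
```

Keeping a stack of active configurations means nested regions restore the enclosing setting on exit rather than falling back to a global default.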
  • 7.
    READEX Test Suite
    Consists of benchmarks, proxy apps and complex production applications
    Key features:
    • Full set of scripts allows reproducibility of experiments on
      • TUD Taurus HSW (HDEEM) and BDW partitions
      • IT4I Salomon machine (RAPL)
    • Support for Slurm and PBS schedulers
    • Automatic savings evaluation
    • Performs evaluation of
      • hardware and system parameter tuning
      • application parameter tuning
    • Contains manual instrumentation of significant regions
      • using header file → can be adapted to test other tools
    Application type | Application name
    benchmarks or proxy apps | AMG2013, Blasbench, Kripke, Lulesh, NPB3.3
    production applications | BEM4I, ESPRESO, INDEED, OpenFOAM
  • 8.
    What can you expect from static tuning
    [Chart: manual static tuning savings compared with the proposal. Test Suite MIN 4.3%, Test Suite AVG 12.6%, Test Suite MAX 17.6%]
    Software | Static tuning savings
    AMG2013 | 12.5%
    Blasbench | 7.4%
    Kripke | 11.5%
    Lulesh | 17.6%
    NPB3.3 | 11.0%
    BEM4I | 15.7%
    INDEED | 17.6%
    ESPRESO | 4.3%
    OpenFOAM | 15.9%
    Average | 12.6%
  • 9.
    What can you expect from dynamic tuning
    [Chart: manual dynamic tuning savings compared with the proposal goal of up to 30%. Test Suite MIN 8.2%, Test Suite AVG 17.5%, Test Suite MAX 34.1%]
    Software | Dynamic tuning savings
    AMG2013 | 12.5%
    Blasbench | 15.3%
    Kripke | 18.5%
    Lulesh | 18.7%
    NPB3.3 | 11.0%
    BEM4I | 34.1%
    INDEED | 19.5%
    ESPRESO | 8.2%
    OpenFOAM | 20.1%
    Average | 17.5%
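The MIN, AVG and MAX figures on the static and dynamic tuning slides can be reproduced from the per-application savings. The short sketch below does that arithmetic; the dictionary and function names are my own, the percentages are copied from the two slides.

```python
from statistics import mean

# Per-application energy savings in percent, as listed on the slides.
static_savings = {
    "AMG2013": 12.5, "Blasbench": 7.4, "Kripke": 11.5, "Lulesh": 17.6,
    "NPB3.3": 11.0, "BEM4I": 15.7, "INDEED": 17.6, "ESPRESO": 4.3,
    "OpenFOAM": 15.9,
}
dynamic_savings = {
    "AMG2013": 12.5, "Blasbench": 15.3, "Kripke": 18.5, "Lulesh": 18.7,
    "NPB3.3": 11.0, "BEM4I": 34.1, "INDEED": 19.5, "ESPRESO": 8.2,
    "OpenFOAM": 20.1,
}


def summarize(savings):
    """Return the test-suite minimum, average and maximum saving."""
    values = list(savings.values())
    return {
        "min": min(values),
        "avg": round(mean(values), 1),
        "max": max(values),
    }


static_summary = summarize(static_savings)
dynamic_summary = summarize(dynamic_savings)
print("static :", static_summary)
print("dynamic:", dynamic_summary)
```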
  • 10.
    Energy savings achieved by static and dynamic tuning
    Application (default is Intel compiler; * uses GCC compiler) | HW parameters | Static tuning savings, node energy / time | Dynamic tuning savings, node energy / time | READEX tuning savings, node energy / time
    AMG2013 | CF, UCF, threads | 12.5% / −0.9% | N/A | 7.0% / −14.0%
    Blasbench | CF, UCF, threads | 7.4% / −0.9% | 15.3% / −18.1% | 9.9% / −9.2%
    Kripke | CF, UCF | 11.5% / −28.3% | 18.8% / −18.7% | 10.5% / −28.9%
    Lulesh | CF, UCF, threads | 17.6% / −8.9% | 18.7% / −11.7% | 18.2% / −25.7%
    NPB3.3-BT-MZ | CF, UCF, threads | 11% / −11.3% | N/A | 10.8% / −12%
    BEM4I | CF, UCF, threads | 15.7% / −6.2% | 34.1% / 10.9% | 34.0% / 10.9%
    INDEED | CF, UCF, threads | 17.6% / −12.8% | 19.5% / −14.2% | 19.1% / −17.3%
    ESPRESO | CF, UCF, threads | 4.3% / −8.9% | 8.2% / −10.1% | 7.1% / −12.3%
    OpenFOAM | CF, UCF | 15.9% / −10.5% | 20.1% / 11.5% | 9.8% / −9.8%
    Evaluation of READEX Tool Suite on TUD Taurus Haswell system with HDEEM energy measurements
    Key findings:
    • Best savings achieved with BEM4I application – up to 34% for energy and 11% for runtime
    • In general energy savings are "paid" by extra runtime
  • 11.
    Tuning of the hardware parameters
  • 12.
    Hardware parameter tuning
    Investigation of impact of CPU uncore frequency tuning on memory bound code:
    • Optimal frequency, with low energy consumption, and a small performance impact
    Evaluation using STREAM Copy benchmark
    Results by TU Dresden under READEX project
  • 13.
    Hardware parameter tuning
    Effect of changing core frequencies on uncore performance using memory bound code
    • Just a small impact on the bandwidth and energy
    Evaluation using STREAM Copy benchmark
    Results by TU Dresden under READEX project
    [Figure: Heatmap of the energy consumption of a stream benchmark for different core and uncore frequencies. The data array does not fit in the processor's L3 cache]
  • 14.
    Hardware parameter tuning
    L3 Cache Energy efficiency and Bandwidth:
    • Different optimal uncore frequencies
    Evaluation using STREAM Copy benchmark
    Results by TU Dresden under READEX project
    [Figure: Heatmap of the energy consumption of a stream benchmark for different core and uncore frequencies. The data array does fit in the processor's L3 cache]
  • 15.
    Effect of hardware parameter tuning on kernels with various arithmetic intensity
  • 16.
  • 17.
    Static tuning for various arithmetic intensity
    Ratio from 1:9
  • 18.
    Static tuning for various arithmetic intensity
    Ratio from 2:8
  • 19.
    Static tuning for various arithmetic intensity
    Ratio from 3:7
  • 20.
    Static tuning for various arithmetic intensity
    Ratio from 4:6
  • 21.
    Static tuning for various arithmetic intensity
    Ratio from 5:5
  • 22.
    Static tuning for various arithmetic intensity
    Ratio from 6:4
  • 23.
    Static tuning for various arithmetic intensity
    Ratio from 7:3
  • 24.
    Static tuning for various arithmetic intensity
    Ratio from 8:2
  • 25.
    Static tuning for various arithmetic intensity
    Ratio from 9:1
  • 26.
    Hardware parameter tuning
    Behavior of the simple application with two kernels
    • Low computational intensity – DGEMV
    • High computational intensity – DGEMM
    • Tuning of three parameters
      • Core frequency
      • Uncore frequency
      • Number of OpenMP threads
    • Visualized by RADAR
    [Figure: three plots of compute node energy consumption [J] against CPU core frequency [GHz], with the optimal settings marked]
    • Low CI (DGEMV): 10 threads, 2.2 GHz UCF, 1.2 GHz CF
    • High CI (DGEMM): 12 threads, 1.2 GHz UCF, 2.5 GHz CF
    • Static tuning for both kernels: 12 threads, 2.2 GHz UCF, 2.4 GHz CF
    Note: runtime of both kernels was equal for default settings
    Two kernels with 1:1 workload ratio | Energy consumption | Energy savings
    Default settings | 2017 J | - | -
    Static optimal | 1833 J | 179 J | 9%
    Dynamic optimal | 1612 J | 221 J | 12%
    Total savings | - | 400 J | 20%
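The difference between the static optimum (one configuration for the whole run) and the dynamic optimum (one configuration per kernel) can be sketched as below. The energy readings and the configuration labels A, B and C are invented purely for illustration and are not the measurements behind this slide; only the selection logic is the point.

```python
# Synthetic energy readings in joules, energy[kernel][configuration].
# These numbers are invented for illustration; they are not measurements.
energy = {
    "low_ci_kernel": {"A": 100.0, "B": 80.0, "C": 95.0},
    "high_ci_kernel": {"A": 100.0, "B": 120.0, "C": 90.0},
}
DEFAULT = "A"  # hypothetical default configuration


def total_energy(assignment):
    """Sum the energy when each kernel runs under its assigned configuration."""
    return sum(energy[kernel][config] for kernel, config in assignment.items())


def static_choice():
    """One configuration for the whole run: minimize the summed energy."""
    configs = energy["low_ci_kernel"].keys()
    best = min(configs, key=lambda c: sum(table[c] for table in energy.values()))
    return {kernel: best for kernel in energy}


def dynamic_choice():
    """One configuration per kernel: minimize each kernel separately."""
    return {kernel: min(table, key=table.get) for kernel, table in energy.items()}


default_j = total_energy({kernel: DEFAULT for kernel in energy})
static_j = total_energy(static_choice())
dynamic_j = total_energy(dynamic_choice())
print(f"default {default_j} J, static {static_j} J, dynamic {dynamic_j} J")
```

The dynamic total can never exceed the static one in this model, because per-kernel minimization includes the static choice as a special case; switching overhead, which is ignored here, is what can erode that advantage in practice.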
  • 27.
    Core and uncore frequency tuning under power cap
  • 28.
    Experiments description and testbed parameters
    Testbed: Broadwell partition of the Galileo supercomputer in CINECA
    • dual socket server
    • two 18-core Intel Xeon E5-2697v4 processors
    • 2.3 GHz nominal frequency
    • 2.7 GHz turbo frequency when all 18 cores are utilized
    • 145 W TDP
    [Table: Key tunable parameters of the 18-core Intel Xeon E5-2697v4 processor and their respective ranges and steps]
    A set of experiments performed on Intel Broadwell Architecture
  • 29.
    Tuning of COMPUTE bound workload
    • behavior of the platform when running compute bound workload
      • under 145 W (TDP level, no power cap)
      • three different power cap levels 100 W, 80 W and 60 W
    [Charts: tuning of the compute bound region under 80 W, 100 W and 60 W power caps. Axes: frequency [GHz] against runtime [s] and energy consumption [J]. Series: EXP0 - default, EXP1 - default Pcap, EXP5 - DVFS under Pcap, EXP6 - UCF under Pcap, EXP7 - DVFS & UCF - min UCF, EXP7 - DVFS & UCF - max UCF]
    • 80 W: 8.5% energy savings with 8.4% time savings; 12.9% energy savings with 10.9% time extension
    • 100 W: 14.9% energy savings with 4.5% time savings; 21.3% energy savings with 21.1% time extension
    • 60 W: 9.1% energy savings with 9.4% time savings
  • 30.
    Tuning of COMPUTE bound workload under 100W power cap
    [Chart: tuning of the compute bound region under 100 W power cap. Axes: frequency [GHz] against runtime [s] and energy consumption [J]. Series: EXP0 - default, EXP1 - default Pcap, EXP5 - DVFS under Pcap, EXP6 - UCF under Pcap, EXP7 - DVFS & UCF - min UCF, EXP7 - DVFS & UCF - max UCF]
    Runtime data labels: 3.268 s, 3.450 s, 3.450 s, 7.411 s, 3.293 s, 4.378 s, 7.698 s
    Energy data labels: 363.4 J, 344.4 J, 344.4 J, 300.4 J, 293.0 J, 305.4 J, 271.0 J, 297 J
    • 14.9% energy savings, 4.5% time savings
    • 21.3% energy savings, 21.1% time extension
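The annotated percentages on the 100 W chart appear to be relative savings against the power-capped default run. The sketch below checks that reading with three energy labels from the chart. Which label belongs to which experiment is my inference from the numbers, not something the slide states, and the function name is my own.

```python
def saving_percent(baseline, tuned):
    """Relative saving of `tuned` against `baseline`, in percent.

    A negative result means the tuned run used more than the baseline.
    """
    return (baseline - tuned) / baseline * 100.0


# Energy labels read from the 100 W chart, in joules. Treating 344.4 J as
# the power-capped default and the other two as tuned runs is an assumption.
capped_default_j = 344.4
tuned_run_a_j = 293.0
tuned_run_b_j = 271.0

saving_a = round(saving_percent(capped_default_j, tuned_run_a_j), 1)
saving_b = round(saving_percent(capped_default_j, tuned_run_b_j), 1)
print(saving_a, saving_b)
```

Under that assumption the two results match the 14.9% and 21.3% annotations, which suggests the baseline for the savings is the capped default rather than the uncapped 363.4 J run.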
  • 31.
    Tuning of COMPUTE bound workload under 80W power cap
    [Chart: tuning of the compute bound region under 80 W power cap. Axes: frequency [GHz] against runtime [s] and energy consumption [J]. Series: EXP0 - default, EXP1 - default Pcap, EXP5 - DVFS under Pcap, EXP6 - UCF under Pcap, EXP7 - DVFS & UCF - min UCF, EXP7 - DVFS & UCF - max UCF]
    Runtime data labels: 3.268 s, 3.268 s, 3.903 s, 3.903 s, 7.409 s, 3.577 s, 7.693 s, 4.379 s, 3.653 s
    Energy data labels: 363.4 J, 311.8 J, 311.8 J, 285.4 J, 304.2 J, 271.6 J, 290.0 J
    • 8.5% energy savings, 8.4% time savings
    • 12.9% energy savings, 10.9% time extension
  • 32.
    Tuning of COMPUTE bound workload under 60W power cap
    [Chart: tuning of the compute bound region under 60 W power cap. Axes: frequency [GHz] against runtime [s] and energy consumption [J]. Series: EXP0 - default, EXP1 - default Pcap, EXP5 - DVFS under Pcap, EXP6 - UCF under Pcap, EXP7 - DVFS & UCF - max UCF, EXP7 - DVFS & UCF - min UCF]
    Runtime data labels: 3.268 s, 4.944 s, 4.944 s, 7.410 s, 4.849 s, 4.477 s, 7.692 s, 4.565 s, 4.606 s
    Energy data labels: 363.4 J, 296 J, 295.0 J, 268.0 J, 303.0 J, 270.4 J
    • 9.1% energy savings, 9.4% time savings
  • 33.
    Observations for COMPUTE bound workload
    • To achieve the best possible performance
      • the uncore frequency must be reduced to minimum
      • up to 9.4% performance gain and
      • 14.9% lower energy consumption
    • If further energy savings are required – use DVFS and lower the core frequency
      • up to 21% of energy savings
      • up to 21% penalty in runtime
      • this effect is more visible for higher power cap levels
  • 34.
    Tuning of memory bound workload
    • behavior of the platform when running memory bound workload
      • under 145 W (TDP level, no power cap)
      • three different power cap levels 100 W, 80 W and 60 W
    [Charts: tuning of the memory bound region under 100 W, 80 W and 60 W power caps. Axes: frequency [GHz] against runtime [s] and energy consumption [J]. Series: EXP0 - default, EXP1 - default Pcap, EXP5 - DVFS under Pcap, EXP6 - UCF under Pcap, EXP7 - DVFS & UCF - UCF = 2.2 GHz, EXP7 - DVFS & UCF - max UCF]
    • 100 W: 38.7% energy savings, 3.6% time extension
    • 80 W: 21.6% energy savings, 3.6% time extension
    • 60 W: 22.2% energy savings, 22.2% time savings
  • 35.
    Tuning of memory bound workload under 100W power cap
    [Chart: tuning of the memory bound region under 100 W power cap. Axes: frequency [GHz] against runtime [s] and energy consumption [J]. Series: EXP0 - default, EXP1 - default Pcap, EXP5 - DVFS under Pcap, EXP6 - UCF under Pcap, EXP7 - DVFS & UCF - UCF = 2.2 GHz, EXP7 - DVFS & UCF - max UCF]
    Runtime data labels: 1.886 s, 1.959 s, 1.886 s
    Energy data labels: 197.6 J, 188.2 J, 188.2 J, 148.6 J, 115.2 J, 170 J, 145.6 J
    • 38.7% energy savings, 3.6% time extension
  • 36.
    Tuning of memory bound workload under 80W power cap
    [Chart: tuning of the memory bound region under 80 W power cap. Axes: frequency [GHz] against runtime [s] and energy consumption [J]. Series: EXP0 - default, EXP1 - default Pcap, EXP5 - DVFS under Pcap, EXP6 - UCF under Pcap, EXP7 - DVFS & UCF - UCF = 2.2 GHz, EXP7 - DVFS & UCF - max UCF]
    Runtime data labels: 1.886 s, 1.920 s, 1.890 s, 1.959 s
    Energy data labels: 197.6 J, 153.2 J, 114.4 J, 146.0 J
    • 21.6% energy savings, 3.6% time extension
  • 37.
    Tuning of memory bound workload under 60W power cap
    [Chart: tuning of the memory bound region under 60 W power cap. Axes: frequency [GHz] against runtime [s] and energy consumption [J]. Series: EXP0 - default, EXP1 - default Pcap, EXP5 - DVFS under Pcap, EXP6 - UCF under Pcap, EXP7 - DVFS & UCF - UCF = 2.2 GHz, EXP7 - DVFS & UCF - max UCF]
    Runtime data labels: 1.886 s, 2.475 s, 2.475 s, 1.945 s, 2.397 s, 1.925 s
    Energy data labels: 197.6 J, 147.8 J, 116.2 J, 115.0 J
    • 22.2% energy savings, 22.2% time savings
  • 38.
    Observations for both workloads
    Observations for memory bound workload
    • Under a power budget lower than 80 W
      • DVFS should be set to minimum value
      • boost the performance of the uncore part by 22%
    • Tuning of the uncore frequency
      • has low effect on the performance
      • but a major effect on energy consumption
      • between 21% (60 W and 80 W) and 38% (100 W)
    Observations for compute bound workload
    • To achieve the best possible performance
      • the uncore frequency must be reduced to minimum
      • up to 9.4% performance gain and
      • 14.9% lower energy consumption
    • If further energy savings are required – use DVFS and lower the core frequency
      • up to 21% of energy savings
      • up to 21% penalty in runtime
      • this effect is more visible for higher power cap levels
  • 39.
    Evaluation of complex HPC applications
  • 40.
    BEM4I Application
    Application runtime | assemble_k [s] | assemble_v [s] | gmres_solve [s] | print_vtu [s] | main [s]
    default runtime | 5.4 | 5.9 | 10.2 | 5.6 | 27.3
    static tuning runtime | 9.8 | 10.6 | 6.1 | 2.4 | 29.0
    dynamic tuning runtime | 7.0 | 7.2 | 7.9 | 2.1 | 24.3
    static savings [%] | -82.3% | -79.1% | 40.5% | 56.8% | -6.2%
    dynamic savings [%] | -30.6% | -20.9% | 23.2% | 62.9% | 10.9%
    Hardware: dual socket system with 2x12 CPU cores – "standard HW" in HPC centres
    Region description:
    • assemble_k and assemble_v – high utilization of vector units, extreme level of optimization – fully compute bound, great utilization of both sockets and all cores
    • gmres_solve – uses DGEMV from MKL – memory bound, suffers from the NUMA effect; this routine is more efficient on a single socket
    • print_vtu – single threaded I/O and network bound region which stores data to a file on a LUSTRE system
    Tuning model:
    "static": {
        "FREQUENCY": "25",           <--------- 2.5 GHz
        "NUM_THREADS": "12",         <--------- 12 OpenMP threads
        "UNCORE_FREQUENCY": "22" }   <--------- 2.2 GHz
    "assemble_k": {
        "FREQUENCY": "23",
        "NUM_THREADS": "24",
        "UNCORE_FREQUENCY": "16" },
    "assemble_v": {
        "FREQUENCY": "25",
        "NUM_THREADS": "24",
        "UNCORE_FREQUENCY": "14" },
    "gmres_solve": {
        "FREQUENCY": "17",
        "NUM_THREADS": "8",
        "UNCORE_FREQUENCY": "22" },
    "print_vtu": {
        "FREQUENCY": "25",
        "NUM_THREADS": "6",
        "UNCORE_FREQUENCY": "24" }
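The tuning model entries store frequencies as strings in units of 100 MHz, as the slide's annotations indicate ("25" is 2.5 GHz, "22" is 2.2 GHz). A minimal sketch of decoding such entries into physical units follows. Wrapping the fragments in a single JSON object is my assumption, since the slide only shows the fragments, and the `decode` helper and output key names are hypothetical.

```python
import json

# Entries copied from the slide; the enclosing object is an assumption.
MODEL_JSON = """
{
  "static":      {"FREQUENCY": "25", "NUM_THREADS": "12", "UNCORE_FREQUENCY": "22"},
  "assemble_k":  {"FREQUENCY": "23", "NUM_THREADS": "24", "UNCORE_FREQUENCY": "16"},
  "assemble_v":  {"FREQUENCY": "25", "NUM_THREADS": "24", "UNCORE_FREQUENCY": "14"},
  "gmres_solve": {"FREQUENCY": "17", "NUM_THREADS": "8",  "UNCORE_FREQUENCY": "22"},
  "print_vtu":   {"FREQUENCY": "25", "NUM_THREADS": "6",  "UNCORE_FREQUENCY": "24"}
}
"""


def decode(entry):
    """Convert one entry from string fields to GHz and a thread count."""
    return {
        "core_ghz": int(entry["FREQUENCY"]) / 10,
        "uncore_ghz": int(entry["UNCORE_FREQUENCY"]) / 10,
        "threads": int(entry["NUM_THREADS"]),
    }


decoded = {name: decode(entry) for name, entry in json.loads(MODEL_JSON).items()}
for name, config in decoded.items():
    print(f"{name:12s} {config}")
```

Read this way, the memory bound `gmres_solve` region runs with a low core frequency, a high uncore frequency and fewer threads, while the compute bound assembly regions use all 24 threads with a reduced uncore frequency.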
  • 41.
    BEM4I Application
    Compute node energy | assemble_k [J] | assemble_v [J] | gmres_solve [J] | print_vtu [J] | main [J]
    default energy | 1476 | 1484 | 2733 | 1142 | 6872
    static tuning energy | 1962 | 2015 | 1366 | 420 | 5792
    dynamic tuning energy | 1467 | 1462 | 1259 | 293 | 4531
    static savings [%] | -33.8% | -35.8% | 50.0% | 63.2% | 15.7%
    dynamic savings [%] | 0.6% | 1.5% | 53.9% | 74.3% | 34.1%
    Large energy savings are a combination of optimal HW settings and runtime savings due to mitigation of the NUMA effect by optimal settings of OpenMP threading
    • Without the savings in runtime, a similar application can expect
      • Energy savings approx. 15 – 20%
      • Runtime savings approx. -15%
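To see which regions account for the dynamic tuning gain, the per-region energies in the table can be broken down as below. The variable names are mine and the numbers are copied from the table. Note that the four regions do not sum exactly to the `main` figures, so the shares are relative to the regions listed rather than to the whole run.

```python
# Compute node energy per region in joules, from the slide's table.
default_j = {"assemble_k": 1476, "assemble_v": 1484,
             "gmres_solve": 2733, "print_vtu": 1142}
dynamic_j = {"assemble_k": 1467, "assemble_v": 1462,
             "gmres_solve": 1259, "print_vtu": 293}

# Whole-application energy as reported in the "main" column.
MAIN_DEFAULT_J = 6872
MAIN_DYNAMIC_J = 4531

# Absolute saving per region and its share of the regions' combined saving.
saved_j = {region: default_j[region] - dynamic_j[region] for region in default_j}
regions_total_j = sum(saved_j.values())
share = {region: saved_j[region] / regions_total_j for region in saved_j}

main_saving_percent = round(
    (MAIN_DEFAULT_J - MAIN_DYNAMIC_J) / MAIN_DEFAULT_J * 100, 1
)

for region in sorted(saved_j, key=saved_j.get, reverse=True):
    print(f"{region:12s} {saved_j[region]:5d} J  {share[region]:6.1%}")
print("main saving:", main_saving_percent, "%")
```

The breakdown shows the gain coming almost entirely from `gmres_solve` and `print_vtu`, the two regions that do not benefit from running on all cores of both sockets.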
  • 42.
    OpenFOAM Application
    OpenFOAM | Energy consumption | Energy savings
    Default settings | 14 231 J | - | -
    Static tuning | 12 264 J | 2 264 J | 15.9%
    Dynamic tuning | | |
    Total savings | | |
    • Computational fluid dynamics
    • Finite volume + multigrid solver
  • 43.
    OpenFOAM Application
    OpenFOAM | Energy consumption | Energy savings
    Default settings | 14 231 J | - | -
    Static tuning | 12 264 J | 2 264 J | 15.9%
    Dynamic tuning | 11 370 J | 597 J | 4.8%
    Total savings | | 2 861 J | 20.1%
    • Computational fluid dynamics
    • Finite volume + multigrid solver
  • 44.
    ESPRESO Application
    33% of energy savings
    22% of time savings and improved strong scalability
    • Structural mechanics code
    • Finite element + sparse FETI solver
    • Different tuning models are needed for different numbers of nodes in strong scalability tests – workload per node varies
    • Includes dynamic switching overheads
    [Figure: Energy savings analysis for the strong scalability test of the ESPRESO library when running the cube benchmark]
  • 45.
  • 46.
    Application parameters tuning of the ESPRESO
    50% - 66% against "reasonable" settings
    86% against the worst case
    [Chart: energy consumption [kJ] against configuration index, with the "reasonable" settings and the optimal settings marked]
    9 parameters, 3840 combinations
    • FETI METHOD 2x
    • PRECONDITIONER 5x
    • ITERATIVE SOLVER TYPE 2x
    • HFETI type 2x
    • NON-UNIFORM PARTS 6x
    • REDUNDANT LAGRANGE 2x
    • SCALING 2x
    • B0_TYPE 2x
    • ADAPTIVE PRECISION 2x
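The figure of 3840 combinations is the product of the option counts of the nine parameters. A quick check, using only the counts from the slide (the actual option values are not given there, so integer indices stand in for them):

```python
from itertools import product
from math import prod

# Number of options per ESPRESO parameter, as listed on the slide.
option_counts = {
    "FETI METHOD": 2,
    "PRECONDITIONER": 5,
    "ITERATIVE SOLVER TYPE": 2,
    "HFETI type": 2,
    "NON-UNIFORM PARTS": 6,
    "REDUNDANT LAGRANGE": 2,
    "SCALING": 2,
    "B0_TYPE": 2,
    "ADAPTIVE PRECISION": 2,
}

# Size of the full factorial space.
total_combinations = prod(option_counts.values())

# Enumerate the space explicitly; each tuple holds one option index per
# parameter (placeholders for the real option values).
configurations = list(product(*(range(n) for n in option_counts.values())))

print(len(option_counts), "parameters,", total_combinations, "combinations")
```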
  • 47.
    Application parameters tuning
    Application parameter tuning is very promising
    • application configuration parameters are given in the input file
    • each setting requires an individual start of the application
    • tool performs automatic search of application parameter space
    Application | number of parameters tested / total number of options | Energy savings compared to the worst settings | Energy savings compared to default or reasonable settings
    ESPRESO | 9 / 3840 | 86% | 50 – 66%
    ELMER | 1 / 40 | 97% | 50 – 75%
    OpenFOAM | 2 / 12 | 24% | 8%
    INDEED | 3 / 12 | 35% | 25%
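Because each setting needs its own application start, the search amounts to one run per configuration while keeping the cheapest result. The sketch below shows an exhaustive version of that loop. It is an assumption about the general shape, not the tool's actual search strategy: `exhaustive_search`, the parameter names, the option values and the fake energy function are all hypothetical, and a real measurement would launch the application and read an energy counter instead.

```python
from itertools import product


def exhaustive_search(options, run_and_measure):
    """Try every combination once and return (best_config, best_energy, runs)."""
    names = list(options)
    best_config, best_energy, runs = None, None, 0
    for values in product(*(options[name] for name in names)):
        config = dict(zip(names, values))
        energy = run_and_measure(config)  # one application start per setting
        runs += 1
        if best_energy is None or energy < best_energy:
            best_config, best_energy = config, energy
    return best_config, best_energy, runs


# Hypothetical parameter space and a fake energy model in joules.
options = {
    "solver": ["cg", "gmres"],
    "preconditioner": ["none", "jacobi", "ilu"],
}
SOLVER_COST = {"cg": 100.0, "gmres": 120.0}
PRECONDITIONER_COST = {"none": 30.0, "jacobi": 10.0, "ilu": 20.0}


def fake_energy(config):
    """Stand-in for launching the application and measuring its energy."""
    return SOLVER_COST[config["solver"]] + PRECONDITIONER_COST[config["preconditioner"]]


best_config, best_energy, runs = exhaustive_search(options, fake_energy)
print(best_config, best_energy, "J after", runs, "runs")
```

Passing the measurement in as a callable keeps the search independent of how the application is started, which matters when the same loop has to work under different schedulers.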
  • 48.
