Robert Hundt

Robert Hundt received a degree in Computer Science from the Technical University of Munich in 1992. Until 1999 he worked for Terrasat GmbH in Germany, an R&D company of 20+ people that he co-owned. He played many roles, from company lead to booth cat, while writing and optimizing software for surveying and navigation with satellite systems.

In 2000 he started working for Hewlett-Packard Company in California on bringing up the new and scalable high-level optimizer SYZYGY for the HP C/C++/FORTRAN compilers with a new inter-procedural optimizer, a new loop optimizer, and a new scalar optimizer. Before joining the compiler group, Robert was responsible for dynamic binary instrumentation for Intel Itanium processors, co-creating and designing the performance analysis tool HP Caliper.

Since the beginning of 2007 Robert has been working for Google. He created various compiler and performance projects: he served as Tech Lead for compiler optimization for servers (x86), Android (ARM), and GPUs (open-source CUDA compiler), built datacenter profiling and performance analysis tools, and worked on Gmail/Apps performance, from Chrome to datacenter. For many years Robert was the SW lead for Google TPU, the supercomputers that accelerate machine learning inference and training, which includes the open-source TensorFlow compiler XLA. Today he is the TL for ML compilers, runtimes, and performance, for TPU, GPU, and CPU. In parallel, he works on the open-source High-Level Synthesis toolchain XLS and dabbles in Quantum Computing. He remains strongly engaged in compiler and datacenter research.

In real life, he enjoys spending time with his family, playing the piano (at which he sucks), playing volleyball (which he used to do fairly well), and everything related to delicious, high-quality food (his main reason for joining Google ;-).

Authored Publications
      Quantum Computing for Programmers
      Cambridge University Press, Cambridge CB2 8BS, United Kingdom (2022)
      This introduction to quantum computing from a classical programmer's perspective is meant for students and practitioners alike. Over 25 fundamental algorithms are explained with full mathematical derivations and classical code for simulation, using an open-source code base developed from the ground up in Python and C++. After presenting the basics of quantum computing, the author focuses on algorithms and the infrastructure to simulate them efficiently, beginning with quantum teleportation, superdense coding, and Deutsch-Jozsa. Coverage of advanced algorithms includes the quantum supremacy experiment, quantum Fourier transform, phase estimation, Shor's algorithm, Grover's algorithm with derivatives, quantum random walks, and the Solovay–Kitaev algorithm for gate approximation. Quantum simulation is explored with the variational quantum eigensolver, quantum approximate optimization, and the Max-Cut and Subset-Sum algorithms. The book also discusses issues around programmer productivity, quantum noise, error correction, and challenges for quantum programming languages, compilers, and tools, with a final section on compiler techniques for transpilation.
      In-Datacenter Performance Analysis of a Tensor Processing Unit
      Norman P. Jouppi
      Nishant Patil
      Gaurav Agrawal
      Raminder Bajwa
      Sarah Bates
      Suresh Bhatia
      Nan Boden
      Al Borchers
      Rick Boyle
      Pierre-luc Cantin
      Clifford Chao
      Chris Clark
      Jeremy Coriell
      Mike Daley
      Matt Dau
      Ben Gelb
      Tara Vazir Ghaemmaghami
      Rajendra Gottipati
      William Gulland
      Robert Hagmann
      C. Richard Ho
      Doug Hogberg
      John Hu
      Dan Hurt
      Julian Ibarz
      Aaron Jaffey
      Alek Jaworski
      Alexander Kaplan
      Harshit Khaitan
      Andy Koch
      Naveen Kumar
      Steve Lacy
      James Law
      Diemthu Le
      Chris Leary
      Zhuyuan Liu
      Kyle Lucke
      Alan Lundin
      Gordon MacKean
      Adriana Maggiore
      Maire Mahony
      Kieran Miller
      Rahul Nagarajan
      Ravi Narayanaswami
      Ray Ni
      Kathy Nix
      Thomas Norrie
      Mark Omernick
      Narayana Penukonda
      Andy Phelps
      Jonathan Ross
      ISCA (2017) (to appear)
      Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X - 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X - 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
      GPUCC - An Open-Source GPGPU Compiler
      Jingyue Wu
      Mark Heffernan
      Chris Leary
      Bjarke Roune
      Rob Springer
      Xuetian Weng
      Proceedings of the 2016 International Symposium on Code Generation and Optimization, ACM, New York, NY, pp. 105-116
      Graphics Processing Units have emerged as powerful accelerators for massively parallel, numerically intensive workloads. The two dominant software models for these devices are NVIDIA’s CUDA and the cross-platform OpenCL standard. Until now, there has not been a fully open-source compiler targeting the CUDA environment, hampering general compiler and architecture research and making deployment difficult in datacenter or supercomputer environments. In this paper, we present gpucc, an LLVM-based, fully open-source, CUDA-compatible compiler for high performance computing. It performs various general and CUDA-specific optimizations to generate high performance code. The Clang-based frontend supports modern language features such as those in C++11 and C++14. Compile time is 8% faster than NVIDIA’s toolchain (nvcc) and it reduces compile time by up to 2.4x for pathological compilations (>100 secs), which tend to dominate build times in parallel build environments. Compared to nvcc, gpucc’s runtime performance is on par for several open-source benchmarks, such as Rodinia (0.8% faster), SHOC (0.5% slower), or Tensor (3.7% faster). It outperforms nvcc on internal large-scale end-to-end benchmarks by up to 51.0%, with a geometric mean of 22.9%.
      Whare-Map: Heterogeneity in “Homogeneous” Warehouse-Scale Computers
      Jason Mars
      Lingjia Tang
      Proceedings of the 2013 ACM/IEEE International Symposium on Computer Architecture (ISCA), IEEE (to appear)
      Modern “warehouse scale computers” (WSCs) continue to be embraced as homogeneous computing platforms. However, due to frequent machine replacements and upgrades, modern WSCs are in fact composed of diverse commodity microarchitectures and machine configurations. Yet, current WSCs are architected with the assumption of homogeneity, leaving a potentially significant performance opportunity unexplored. In this paper, we expose and quantify the performance impact of the “homogeneity assumption” for modern production WSCs using industry-strength large-scale web-service workloads. In addition, we argue for, and evaluate the benefits of, a heterogeneity-aware WSC using commercial web-service production workloads including Google’s websearch. We also identify key factors impacting the available performance opportunity when exploiting heterogeneity and introduce a new metric, opportunity factor, to quantify an application’s sensitivity to the heterogeneity in a given WSC. To exploit heterogeneity in “homogeneous” WSCs, we propose “Whare-Map,” the WSC Heterogeneity Aware Mapper that leverages already in-place continuous profiling subsystems found in production environments. When employing “Whare-Map”, we observe a cluster-wide performance improvement of 15% on average over heterogeneity-oblivious job placement and up to an 80% improvement for web-service applications that are particularly sensitive to heterogeneity.
      JSWhiz - Static Analysis for JavaScript Memory Leaks
      Proceedings of the 10th annual IEEE/ACM international symposium on Code generation and optimization, IEEE (2013)
      JavaScript is the dominant language for implementing dynamic web pages in browsers. Even though it is standardized, many browsers implement language and browser bindings in different and incompatible ways. As a result, a plethora of web development frameworks were developed to hide cross-browser issues and to ease development of large web applications. An unwelcome side-effect of these frameworks is that they can introduce memory leaks, despite the fact that JavaScript is garbage collected. Memory bloat is a major issue for web applications, as it affects user-perceived latency and may even prevent large web applications from running on devices with limited resources. In this paper we present JSWhiz, an extension to the open-source Closure JavaScript compiler. Based on experiences analyzing memory leaks in Gmail, JSWhiz detects five identified common problem patterns. JSWhiz found a total of 89 memory leaks across Google's Gmail, Docs, Spreadsheets, Books, and Closure itself. It contributed significantly to a recent effort to reduce Gmail's memory footprint, which resulted in a bloat reduction of 75% at the 99th percentile and roughly 50% at the median.
      Optimizing Google's Warehouse Scale Computers: The NUMA Experience
      Lingjia Tang
      Jason Mars
      Robert Hagmann
      The 19th IEEE International Symposium on High Performance Computer Architecture (2013)
      Bubble-Up: Increasing Utilization In Modern Warehouse Scale Computers Via Sensible Co-Locations
      Jason Mars
      Lingjia Tang
      Kevin Skadron
      Mary Lou Soffa
      Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, 2011, IEEE, New York, NY, USA
      As much of the world’s computing continues to move into the cloud, the over-provisioning of computing resources to ensure the performance isolation of latency-sensitive tasks, such as web search, in modern datacenters is a major contributor to low machine utilization. Being unable to accurately predict performance degradation due to contention for shared resources on multicore systems has led to the heavy-handed approach of simply disallowing the co-location of high-priority, latency-sensitive tasks with other tasks. Performing this precise prediction has been a challenging and unsolved problem. In this paper, we present Bubble-Up, a characterization methodology that enables the accurate prediction of the performance degradation that results from contention for shared resources in the memory subsystem. By using a bubble to apply a tunable amount of “pressure” to the memory subsystem on processors in production datacenters, our methodology can predict the performance interference between co-located applications with an accuracy within 1% to 2% of the actual performance degradation. Using this methodology to arrive at “sensible” co-locations in Google’s production datacenters with real-world large-scale applications, we can improve the utilization of a 500-machine cluster by 50% to 90% while guaranteeing a high quality of service of latency-sensitive applications.
      MAO - an Extensible Micro-Architectural Optimizer
      Easwaran Raman
      Martin Thuresson
      Neil Vachharajani
      Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization, ACM (2011)
      Performance matters, and so do repeatability and predictability. Today's processors' micro-architectures have become so complex as to now contain many undocumented, not understood, and even puzzling performance cliffs. Small changes in the instruction stream, such as the insertion of a single NOP instruction, can lead to significant performance deltas, with the effect of exposing compiler and performance optimization efforts to perceived unwanted randomness. This paper presents MAO, an extensible micro-architectural assembly-to-assembly optimizer, which seeks to address this problem for x86/64 processors. In essence, MAO is a thin wrapper around a common open source assembler infrastructure. It offers basic operations, such as creation or modification of instructions, simple data-flow analysis, and advanced infrastructure, such as loop recognition, and a repeated relaxation algorithm to compute instruction addresses and lengths. This infrastructure enables a plethora of passes for pattern matching, alignment-specific optimizations, peep-holes, experiments (such as random insertion of NOPs), and fast prototyping of more sophisticated optimizations. MAO can be integrated into any compiler that emits assembly code, or can be used standalone. MAO can be used to discover micro-architectural details semi-automatically. Initial performance results are encouraging.
      Heterogeneity in “Homogeneous” Warehouse-Scale Computers: A Performance Opportunity
      Jason Mars
      Lingjia Tang
      IEEE Computer Architecture Letters (CAL), Vol. 10 No. 2 (2011), pp. 29-32
      The class of modern datacenters recently coined as “warehouse scale computers” (WSCs) has traditionally been embraced as homogeneous computing platforms. However, due to frequent machine replacements and upgrades, modern WSCs are in fact composed of diverse commodity microarchitectures and machine configurations. Yet, current WSCs are designed with an assumption of homogeneity, leaving a potentially significant performance opportunity unexplored. In this paper, we investigate the key factors impacting the available heterogeneity in modern WSCs, and the benefit of exploiting this heterogeneity to maximize overall performance. We also introduce a new metric, opportunity factor, which can be used to quantify an application’s sensitivity to the heterogeneity in a given WSC. For applications that are sensitive to heterogeneity, we observe a performance improvement of up to 70% when employing our approach. In a WSC composed of state-of-the-art machines, we can improve the overall performance of the entire datacenter by 16% over the status quo.
      The Impact of Memory Subsystem Resource Sharing on Datacenter Applications
      Lingjia Tang
      Jason Mars
      Neil Vachharajani
      Mary Lou Soffa
      ISCA, ACM (2011)
      In this paper we study the impact of sharing memory resources on five Google datacenter applications: a web search engine, bigtable, content analyzer, image stitching, and protocol buffer. While prior work has found neither positive nor negative effects from cache sharing across the PARSEC benchmark suite, we find that across these datacenter applications, there is both a sizable benefit and a potential degradation from improperly sharing resources. In this paper, we first present a study of the importance of thread-to-core mappings for applications in the datacenter as threads can be mapped to share or to not share caches and bus bandwidth. Second, we investigate the impact of co-locating threads from multiple applications with diverse memory behavior and discover that the best mapping for a given application changes depending on its co-runner. Third, we investigate the application characteristics that impact performance in the various thread-to-core mapping scenarios. Finally, we present both a heuristics-based and an adaptive approach to arrive at good thread-to-core decisions in the datacenter. We observe performance swings of up to 25% for web search and 40% for other key applications, simply based on how application threads are mapped to cores. By employing our adaptive thread-to-core mapper, the performance of the datacenter applications presented in this work improved by up to 22% over status quo thread-to-core mapping and performs within 3% of optimal.