Tiered compilation, introduced in Java SE 7, brings client startup speeds to the server VM. Normally, a server VM uses the interpreter to collect profiling information about methods that is fed into the compiler. In the tiered scheme, in addition to the interpreter, the client compiler is used to generate compiled versions of methods that collect profiling information about themselves. Since the compiled code is substantially faster than the interpreter, the program executes with greater performance during the profiling phase. In many cases, a startup that is even faster than with the client VM can be achieved because the final code produced by the server compiler may already be available during the early stages of application initialization. The tiered scheme can also achieve better peak performance than a regular server VM because the faster profiling phase allows a longer period of profiling, which may yield better optimization.
Both 32-bit and 64-bit modes are supported, as well as compressed oops (see the next section). Use the -XX:+TieredCompilation flag with the java command to enable tiered compilation.
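For example, tiered compilation can be enabled on the command line as shown below; MyApp is a placeholder for your application's main class.

```shell
# Enable tiered compilation on a Java SE 7 server VM.
# MyApp is a hypothetical application main class.
java -XX:+TieredCompilation MyApp
```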
An "oop", or ordinary object pointer in Java HotSpot parlance, is a managed pointer to an object. An oop is normally the same size as a native machine pointer, which means 64 bits on an LP64 system. On an ILP32 system, maximum heap size is somewhat less than 4 gigabytes, which is insufficient for many applications. On an LP64 system, the heap used by a given program might have to be around 1.5 times larger than when it is run on an ILP32 system. This requirement is due to the expanded size of managed pointers. Memory is inexpensive, but these days bandwidth and cache are in short supply, so significantly increasing the size of the heap and only getting just over the 4 gigabyte limit is undesirable.
Managed pointers in the Java heap point to objects which are aligned on 8-byte address boundaries. Compressed oops represent managed pointers (in many but not all places in the JVM software) as 32-bit object offsets from the 64-bit Java heap base address. Because they're object offsets rather than byte offsets, they can be used to address up to four billion objects (not bytes), or a heap size of up to about 32 gigabytes. To use them, they must be scaled by a factor of 8 and added to the Java heap base address to find the object to which they refer. Object sizes using compressed oops are comparable to those in ILP32 mode.
The term decode is used to express the operation by which a 32-bit compressed oop is converted into a 64-bit native address into the managed heap. The inverse operation is referred to as encoding.
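The decode and encode operations amount to simple shift-and-add arithmetic. The sketch below models them in plain Java under assumed values (the heap base address and the 8-byte alignment shift are illustrative constants, not actual JVM internals; the real VM performs these steps in generated machine code).

```java
// Sketch of compressed-oop decode/encode arithmetic (not actual JVM code).
public class CompressedOops {
    public static final long HEAP_BASE = 0x100000000L; // hypothetical heap base address
    public static final int SHIFT = 3;                 // log2 of the 8-byte object alignment

    // decode: widen the 32-bit oop to 64 bits, scale by 8, add the heap base
    public static long decode(int oop) {
        return HEAP_BASE + ((oop & 0xFFFFFFFFL) << SHIFT);
    }

    // encode: subtract the heap base, then divide by 8 (unsigned shift)
    public static int encode(long address) {
        return (int) ((address - HEAP_BASE) >>> SHIFT);
    }

    public static void main(String[] args) {
        long address = HEAP_BASE + 8L * 12345;   // an 8-byte-aligned heap address
        int oop = encode(address);
        System.out.println(decode(oop) == address); // prints true: round trip succeeds
    }
}
```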
Compressed oops is supported and enabled by default in Java SE 6u23 and later. In Java SE 7, use of compressed oops is the default for 64-bit JVM processes when -Xmx isn't specified and for values of -Xmx less than 32 gigabytes. For JDK 6 before the 6u23 release, use the -XX:+UseCompressedOops flag with the java command to enable the feature.
When using compressed oops in a 64-bit Java Virtual Machine process, the JVM software asks the operating system to reserve memory for the Java heap starting at virtual address zero. If the operating system supports such a request and can reserve memory for the Java heap at virtual address zero, then zero-based compressed oops are used.
Use of zero-based compressed oops means that a 64-bit pointer can be decoded from a 32-bit object offset without adding in the Java heap base address. For heap sizes less than 4 gigabytes, the JVM software can use a byte offset instead of an object offset and thus also avoid scaling the offset by 8. Encoding a 64-bit address into a 32-bit offset is correspondingly efficient.
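In the zero-based case the decode arithmetic loses the base addition, and for heaps under 4 gigabytes even the scaling disappears. A minimal sketch, with invented method names purely for illustration:

```java
// Sketch of zero-based compressed-oop decoding (illustrative, not JVM code).
public class ZeroBasedOops {
    public static final int SHIFT = 3; // log2 of the 8-byte object alignment

    // Zero-based decode: no heap-base addition, only the alignment shift.
    public static long decode(int oop) {
        return (oop & 0xFFFFFFFFL) << SHIFT;
    }

    // Heap smaller than 4 GB: the compressed oop can be a plain byte offset,
    // so decoding is just an unsigned widening to 64 bits.
    public static long decodeByteOffset(int byteOffset) {
        return byteOffset & 0xFFFFFFFFL;
    }
}
```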
For Java heap sizes up to around 26 gigabytes, the Solaris, Linux, and Windows operating systems can typically allocate the Java heap at virtual address zero.
Escape analysis is a technique by which the Java HotSpot Server Compiler can analyze the scope of a new object's uses and decide whether to allocate it on the Java heap.
Escape analysis is supported and enabled by default in Java SE 6u23 and later.
The Java HotSpot Server Compiler implements the flow-insensitive escape analysis algorithm described in:
[Choi99] Jong-Deok Choi, Manish Gupta, Mauricio Serrano, Vugranam C. Sreedhar, Sam Midkiff, "Escape Analysis for Java", Proceedings of the ACM SIGPLAN OOPSLA Conference, November 1, 1999
Based on escape analysis, an object's escape state might be one of the following:
GlobalEscape – An object escapes the method and thread. For example, an object stored in a static field, stored in a field of an escaped object, or returned as the result of the current method.

ArgEscape – An object passed as an argument or referenced by an argument but that does not globally escape during a call. This state is determined by analyzing the bytecode of the called method.

NoEscape – A scalar replaceable object, meaning its allocation could be removed from generated code.

After escape analysis, the server compiler eliminates scalar replaceable object allocations and associated locks from generated code. The server compiler also eliminates locks for all non-globally escaping objects. It does not replace a heap allocation with a stack allocation for non-globally escaping objects.
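The three escape states can be illustrated with a small sketch; the class and method names below are invented for the example, and which optimizations actually fire depends on the compiler.

```java
// Hypothetical methods illustrating the three escape states described above.
public class EscapeStates {
    static Object cache; // a static field visible to all threads

    // GlobalEscape: the object is stored in a static field and also returned.
    public static Object globalEscape() {
        Object o = new Object();
        cache = o;
        return o;
    }

    // ArgEscape: the StringBuilder is passed to a callee but never stored there.
    public static int argEscape() {
        StringBuilder sb = new StringBuilder();
        appendX(sb);
        return sb.length();
    }

    private static void appendX(StringBuilder sb) {
        sb.append('x'); // uses the argument without letting it escape further
    }

    // NoEscape: the Point never leaves this method, so the server compiler
    // may scalar-replace it and generate no heap allocation at all.
    public static int noEscape() {
        Point p = new Point(3, 4);
        return p.x + p.y;
    }

    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }
}
```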
Some scenarios for escape analysis are described next.
The server compiler might eliminate certain object allocations. Consider the example where a method makes a defensive copy of an object and returns the copy to the caller.
public class Person {
    private String name;
    private int age;

    public Person(String personName, int personAge) {
        name = personName;
        age = personAge;
    }

    public Person(Person p) {
        this(p.getName(), p.getAge());
    }

    public String getName() { return name; }

    public int getAge() { return age; }
}

public class Employee {
    private Person person;

    // makes a defensive copy to protect against modifications by caller
    public Person getPerson() {
        return new Person(person);
    }

    public void printEmployeeDetail(Employee emp) {
        Person person = emp.getPerson();
        // this caller does not modify the object, so the defensive copy was unnecessary
        System.out.println("Employee's name: " + person.getName()
                + "; age: " + person.getAge());
    }
}
The method makes a copy to prevent modification of the original object by the caller. If the compiler determines that the getPerson method is being invoked in a loop, it will inline that method. In addition, through escape analysis, if the compiler determines that the original object is never modified, it might optimize and eliminate the call to make a copy.
The server compiler might eliminate synchronization blocks (lock elision) if it determines that an object is thread local. For example, methods of classes such as StringBuffer and Vector are synchronized because they can be accessed by different threads. However, in most scenarios, they are used in a thread-local manner. In cases where the usage is thread local, the compiler might optimize and remove the synchronization blocks.
The Parallel Scavenger garbage collector has been extended to take advantage of machines with NUMA (Non-Uniform Memory Access) architecture. Most modern computers are based on NUMA architecture, in which it takes a different amount of time to access different parts of memory. Typically, every processor in the system has a local memory that provides low access latency and high bandwidth, and remote memory that is considerably slower to access.
In the Java HotSpot Virtual Machine, the NUMA-aware allocator has been implemented to take advantage of such systems and provide automatic memory placement optimizations for Java applications. The allocator controls the eden space of the young generation of the heap, where most of the new objects are created. The allocator divides the space into regions, each of which is placed in the memory of a specific node. The allocator relies on a hypothesis that a thread that allocates an object will be the most likely to use the object. To ensure the fastest access to the new object, the allocator places it in the region local to the allocating thread. The regions can be dynamically resized to reflect the allocation rate of the application threads running on different nodes. That makes it possible to increase performance even of single-threaded applications. In addition, the "from" and "to" survivor spaces of the young generation, the old generation, and the permanent generation have page interleaving turned on for them. This ensures that all threads have equal access latencies to these spaces on average.
The NUMA-aware allocator is available on the Solaris™ operating system starting in Solaris 9 12/02 and on the Linux operating system starting in Linux kernel 2.6.19 and glibc 2.6.1.
The NUMA-aware allocator can be turned on with the -XX:+UseNUMA flag in conjunction with the selection of the Parallel Scavenger garbage collector. The Parallel Scavenger garbage collector is the default for a server-class machine. The Parallel Scavenger garbage collector can also be turned on explicitly by specifying the -XX:+UseParallelGC option.
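For example, both flags can be combined on the command line as shown below; MyApp is a placeholder for your application's main class.

```shell
# Enable the NUMA-aware allocator together with the Parallel Scavenger collector.
# MyApp is a hypothetical application main class.
java -XX:+UseNUMA -XX:+UseParallelGC MyApp
```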
The -XX:+UseNUMA flag was added in Java SE 6u2.
Note: There was a known bug in the Linux kernel that may cause the JVM to crash when run with -XX:+UseNUMA. The bug was fixed in 2012, so it should not affect recent versions of the Linux kernel. To see if your kernel has this bug, you can run the native reproducer.
When evaluated against the SPEC JBB 2005 benchmark on an 8-chip Opteron machine, NUMA-aware systems showed the following performance increases: