[also see this FAQ]


This article surveys a number of benchmarks and finds that Java performance on numerical code is comparable to that of C++, with hints that Java's relative performance is continuing to improve. We then describe clear theoretical reasons why these benchmark results should be expected.

Benchmarks

The five composite benchmarks listed below show that modern Java has acceptable performance, being nearly equal to (and in many cases faster than) C/C++ across a number of benchmarks.

  1. Numerical Kernels

    Benchmarking Java against C and Fortran for Scientific Applications
    Mark Bull, Lorna Smith, Lindsay Pottage, Robin Freeman,
    EPCC, University of Edinburgh (2001).

    The authors test some real numerical codes (FFT, matrix factorization, SOR, fluid solver, N-body) on several architectures and compilers. On Intel they found that Java performance was very reasonable compared to C (e.g., 20% slower), and that Java was faster than at least one C compiler (the KAI compiler on Linux).

    The authors conclude, "On Intel Pentium hardware, especially with Linux, the performance gap is small enough to be of little or no concern to programmers."

  2. More numerical methods: SciMark2 scores

    The National Institute of Standards and Technology's SciMark 2 benchmark is available in Java, C, and (third-party) C# versions. It includes FFT, Jacobi relaxation for a 2D Laplace equation, Monte Carlo estimation of pi, sparse matrix multiply, and LU factorization.

    Java is slightly faster than C on this benchmark. Mflops (higher is better):

    java 1.8.0_60      1178
    gcc 4.8.3 -O3       959

    ./scimark2     # gcc 4.8.3 -O3
    **                                                              **
    ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
    ** for details. (Results can be submitted to pozo@nist.gov)     **
    **                                                              **
    Using       2.00 seconds min time per kernel.
    Composite Score:          959.60
    FFT             Mflops:  1002.44    (N=1024)
    SOR             Mflops:   809.64    (100 x 100)
    MonteCarlo:     Mflops:   353.20
    Sparse matmult  Mflops:  1012.14    (N=1000, nz=5000)
    LU              Mflops:  1620.57    (M=100, N=100)

    java -server jnt.scimark2.commandline     # java version "1.8.0_60"
    SciMark 2.0a
    Composite Score: 1178.7765479267096
    FFT (1024): 559.7691512301901
    SOR (100x100):   1068.4529791414395
    Monte Carlo : 540.5194437616267
    Sparse matmult (N=1000, nz=5000): 1009.4109039057408
    LU (100x100): 2715.730261594551
  3. Still more numerical methods

    From the book Object-Oriented Implementations of Numerical Methods by Didier Besset (Morgan Kaufmann, 2001):

    Operation                            Units   C      Smalltalk   Java
    Polynomial 10th degree               msec.   1.1    27.7        9.0
    Neville interpolation (20 points)    msec.   0.9    11.0        0.8
    LUP matrix inversion (100 x 100)     sec.    3.9    22.9        1.0

  4. Microbenchmarks (cache effects considered)

    Several years ago these benchmarks showed Java performance at the time to be somewhere in the middle of the range of C compiler performance: faster than the worst C compilers, slower than the best. These are "microbenchmarks", but they do have the advantage that they were run across a number of different problem sizes, so the results do not reflect a lucky cache interaction (see more details on this issue in the next section).

    These benchmarks were updated with a more recent Java (1.4) and gcc (3.2), using full optimization (gcc -O3 -mcpu=pentiumpro -fexpensive-optimizations -fschedule-insns2 ...). This time Java is faster than C in the majority of the tests, by a factor of more than 2 in some cases...

    ... suggesting that Java performance is catching up to, or even pulling ahead of, gcc at least.

    These tests were mostly integer (except for an FFT).

  5. Microbenchmarks (cache effects not considered)

    In January 2004 OSNews.com posted an article, Nine Language Performance Round-up: Benchmarking Math & File I/O [4]. These are simple numeric and file I/O loops, and no doubt suffer from the arbitrary cache interaction factor described below. They were, however, run under several different compilers, which helps. Again Java is competitive with (actually slightly faster than) several C++ compilers, including Visual C++, in the majority of the benchmarks.

    (One exceptional benchmark tested trigonometry library calls. Java numerical programmers are aware that these calls became slower in Java 1.4; recent benchmarks suggest this issue was fixed in Java 1.4.2.)

Note that these benchmarks are on Intel architecture machines. Java compilers on some other processors are less developed at present.

And In Theory: Maybe Java Should be Faster

Java proponents have stated that Java will soon be faster than C. Why? Several reasons (also see reference [1]):

1) Pointers make optimization hard

This is a reason why C is generally a bit slower than Fortran.

In C, consider the code

        x = y + 2 * (...)
        *p = ...
        arr[j] = ...
        z = x + ...
Because p could be pointing at x, a C compiler cannot keep x in a register and instead has to write it to cache and read it back -- unless it can figure out where p is pointing at compile time. And because arrays act like pointers in C/C++, the same is true for assignments to array elements: arr[j] could also modify x.

This pointer problem in C resembles the array bounds checking issue in Java: in both cases, if the compiler can determine the array (or pointer) index at compile time, it can avoid the issue.

In the loop below, for example, a Java compiler can trivially avoid testing the lower array bound because the loop counter is only incremented, never decremented. A single test before starting the loop handles the upper bound if 'len' is not modified inside the loop (and Java has no pointers, so simply looking for an assignment is enough to determine this):

   for( int i = 0; i < len; i++ ) { a[i] = ... }

In cases where the compiler cannot determine the necessary information at compile time, the C pointer problem may actually be the bigger performance hit. In the Java case, the loop bound(s) can be kept in registers, and the index is certainly in a register, so only a register-register comparison is needed. In the C/C++ case, a load from memory is needed.
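To make this concrete, here is a minimal Java sketch of ours (not taken from any of the benchmarks above) contrasting a loop whose bounds checks a JIT can typically hoist with one whose index is data-dependent and therefore cannot be proven in range ahead of time:

    // Minimal sketch: two loops over the same array.
    public class BoundsCheckSketch {
        static long sumSequential(int[] a, int len) {
            long s = 0;
            // i only increases and len is not modified in the loop body, so a
            // single test of len against a.length before the loop can cover the
            // upper bound; the lower bound needs no test at all.
            for (int i = 0; i < len; i++) {
                s += a[i];
            }
            return s;
        }

        static long sumIndirect(int[] a, int[] idx) {
            long s = 0;
            // Here each a[idx[i]] access depends on runtime data, so the bounds
            // check on the inner access generally cannot be eliminated.
            for (int i = 0; i < idx.length; i++) {
                s += a[idx[i]];
            }
            return s;
        }

        public static void main(String[] args) {
            int[] a = new int[1_000_000];
            int[] idx = new int[1_000_000];
            java.util.Random r = new java.util.Random(42);
            for (int i = 0; i < idx.length; i++) idx[i] = r.nextInt(a.length);
            System.out.println(sumSequential(a, a.length) + " " + sumIndirect(a, idx));
        }
    }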

2) Garbage collection - is it worse... or better?

Most programmers say garbage collection is or should be slow, with no reason given - it's assumed but never discussed. Some computer language researchers say otherwise.

Consider what happens when you do a new/malloc: a) the allocator looks for an empty slot of the right size, then returns you a pointer; b) this pointer points to some fairly random place.

With GC, a) the allocator doesn't need to look for memory - it knows where it is; b) the memory it returns is adjacent to the last bit of memory you requested. The wandering-around part happens not on every allocation but only at garbage collection time. And then (depending on the GC algorithm) things get moved as well, of course.
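As an illustrative model only (not how any particular JVM is implemented), a bump-pointer allocator of the kind used by copying and compacting collectors can be sketched in a few lines; allocation is just a pointer increment, and consecutive allocations land next to each other in memory:

    // Illustrative model: bump-pointer allocation. The expensive "wandering
    // around" work is deferred to collection time instead of being paid on
    // every allocation.
    public class BumpAllocatorSketch {
        private final byte[] heap;
        private int top = 0;                  // next free offset; always known

        BumpAllocatorSketch(int size) { heap = new byte[size]; }

        // Returns the offset of the new block, or -1 if a collection would be needed.
        int allocate(int size) {
            if (top + size > heap.length) return -1;
            int addr = top;
            top += size;                      // adjacent to the previous allocation
            return addr;
        }

        public static void main(String[] args) {
            BumpAllocatorSketch a = new BumpAllocatorSketch(1 << 20);
            System.out.println(a.allocate(64));   // 0
            System.out.println(a.allocate(64));   // 64: right next to the last block
            System.out.println(a.allocate(32));   // 128
        }
    }

A malloc-style free-list allocator, by contrast, must search its lists for a block of the right size, and the block it finds may be anywhere in the heap.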

The cost of missing the cache

The big benefit of GC is memory locality. Because newly allocated memory is adjacent to the memory recently used, it is more likely to already be in the cache.

How much of an effect is this? One rather dated (1993) example shows that missing the cache can be a big cost: changing an array size in a small C program from 1023 to 1024 results in a slowdown of 17 times (not 17%). This is like switching from C to VB! This particular program stumbled across what was probably the worst possible cache interaction for that particular processor (MIPS); the effect isn't that bad in general... but with processor speeds increasing faster than memory speeds, missing the cache is probably an even bigger cost now than it was then.
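The 1023-vs-1024 pathology is processor-specific, but the general cost of poor locality is easy to demonstrate on any machine. Here is a rough Java sketch (the absolute numbers will vary widely with processor and cache sizes) that touches the same number of array elements while increasing the stride between touches:

    // Rough sketch: same amount of work per stride, very different locality.
    public class StrideSketch {
        public static void main(String[] args) {
            int n = 1 << 24;                       // 16M ints (~64 MB), larger than typical caches
            int[] a = new int[n];
            for (int i = 0; i < n; i++) a[i] = i;  // initialize the data
            for (int stride : new int[]{1, 16, 64}) {
                long t0 = System.nanoTime();
                long sum = 0;
                // Each stride touches exactly n elements in total, but larger
                // strides spread those touches over many more cache lines.
                for (int rep = 0; rep < stride; rep++) {
                    for (int i = rep; i < n; i += stride) {
                        sum += a[i];
                    }
                }
                long t1 = System.nanoTime();
                System.out.printf("stride %2d: %7.1f ms  (sum=%d)%n",
                                  stride, (t1 - t0) / 1e6, sum);
            }
        }
    }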

(It's easy to find other research studies demonstrating this; here's one from Princeton: they found that (garbage-collected) ML programs translated from the SPEC92 benchmarks have lower cache miss rates than the equivalent C and Fortran programs.)

That is theory; what about practice? In a well-known paper [2], several widely used programs (including perl and ghostscript) were adapted to use several different allocators, including a garbage collector masquerading as malloc (with a dummy free()). The garbage collector was as fast as a typical malloc/free; perl was one of several programs that ran faster when converted to use a garbage collector. Another interesting fact is that the cost of malloc/free is significant: both perl and ghostscript spent roughly 25-30% of their time in these calls.

Besides the improved cache behavior, also note that automatic memory management enables escape analysis, which identifies local allocations that can be placed on the stack. (Stack allocations are clearly cheaper than heap allocation of either sort.)
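A minimal sketch of the kind of allocation escape analysis can remove follows. Whether the allocation is actually eliminated depends on the JVM; in HotSpot, escape analysis is controlled by the -XX:+DoEscapeAnalysis flag and is on by default in recent releases.

    // Sketch: the Point objects below never escape distance(), so a JIT that
    // performs scalar replacement can avoid the heap allocations entirely.
    public class EscapeSketch {
        static final class Point {
            final double x, y;
            Point(double x, double y) { this.x = x; this.y = y; }
        }

        static double distance(double x1, double y1, double x2, double y2) {
            Point a = new Point(x1, y1);   // never stored in a field or returned,
            Point b = new Point(x2, y2);   // so neither object "escapes" this method
            double dx = a.x - b.x, dy = a.y - b.y;
            return Math.sqrt(dx * dx + dy * dy);
        }

        public static void main(String[] args) {
            double s = 0;
            for (int i = 0; i < 10_000_000; i++) s += distance(i, i, 0, 0);
            System.out.println(s);
        }
    }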

3) Run-time compilation

The JIT compiler knows more than a conventional "pre-compiler", and it may be able to do a better job given the extra information:

  • The compiler knows what processor it is running on, and can generate code specifically for that processor. It knows whether (for example) the processor is a PIII or P4, whether SSE2 is present, and how big the caches are. A pre-compiler, on the other hand, has to target the least-common-denominator processor, at least in the case of commercial software.

  • Because the compiler knows which classes are actually loaded and being called, it knows which methods can be de-virtualized and inlined; see the sketch after this list. (Remarkably, modern Java compilers also know how to "uncompile" inlined calls in the case where an overriding method is loaded after the JIT compilation happens.)

  • A dynamic compiler may also get the branch prediction hints right more often than a static compiler.
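The sketch referred to in the list above: while only one Shape implementation has been loaded, the JIT can treat the call site as monomorphic, dispatch it directly, and inline it; loading another subclass later forces de-optimization and recompilation. This is our illustration, not code from any cited benchmark.

    // Sketch of de-virtualization: with a single loaded implementation of
    // Shape, the s.area() call site is monomorphic and can be inlined.
    public class DevirtSketch {
        interface Shape { double area(); }

        static final class Square implements Shape {
            final double side;
            Square(double side) { this.side = side; }
            public double area() { return side * side; }
        }

        static double total(Shape[] shapes) {
            double t = 0;
            for (Shape s : shapes) t += s.area();   // monomorphic call site
            return t;
        }

        public static void main(String[] args) {
            Shape[] shapes = new Shape[1_000];
            for (int i = 0; i < shapes.length; i++) shapes[i] = new Square(i);
            double warm = 0;
            for (int rep = 0; rep < 10_000; rep++) warm += total(shapes);  // let the JIT compile
            System.out.println(warm);
        }
    }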


It might also be noted that Microsoft has some similar comments regarding C# performance [5]:
  • "Myth: JITed Programs Execute Slower than Precompiled Programs"

  • .NET still provides a traditional pre-compiler ngen.exe, but "since the run-time only optimizations cannot be provided... the code is usually not as good as that generated by a normal JIT."

Speed and Benchmark Issues

Benchmarks usually lead to extensive and heated discussion in popular web forums. From our point of view there are several reasons why such discussions are mostly "hot air".

What is slow?

The notion of "slow" in popular discussions is often poorly calibrated. If you write a number of small benchmarks in several different types of programming language, the broad view of performance might be something like this:

Language class                        typical slowdown
Assembler:                            1
Low level compiled (Fortran, C):      1-2
Byte-code (Python):                   25-50
Interpreted strings (csh, Tcl?):      250

Despite this big picture, performance differences of less than a factor of two are often upheld as evidence in speed debates. As we describe next, differences of 2x-4x or more are often just noise.

Don't characterize the speed of a language based on a single benchmark of a single program.

We often see people drawing conclusions from a single benchmark. For example, an article posted on slashdot.org [3] claims to address the question "Which programming language provides the fastest tool for number crunching under Linux?", yet it discusses only one program.

Why isn't one program good enough?

For one, it's common sense: the compiler may happen to do particularly well or particularly poorly on the inner loop of the program, and this doesn't generalize. The fourth set of benchmarks above shows Java as being faster than C by a factor of two on an FFT of an array of a particular size. Should you now proclaim that Java is always twice as fast as C? No, it's just one program.

There is a more important issue than the code quality on the particular benchmark, however:

Cache/Memory effects.

Look at the FFT microbenchmark that we referenced above. The figure is reproduced here with permission:

On this single program, depending on the input size, the relative performance of 'IBM' (IBM's Java) varies from about twice as slow to twice as fast as 'max-C' (gcc with -O3 -lm -s -static -fomit-frame-pointer -mpentiumpro -march=pentiumpro -malign-functions=4 -funroll-all-loops -fexpensive-optimizations -malign-double -fschedule-insns2 -mwide-multiply -finline-functions -fstrict-aliasing). So what do we conclude from this benchmark? Java is twice as fast as C, or twice as slow, or ...

This performance variation due to factors of data placement and size is universal. A more dramatic example of such cache effects is the link mentioned in the discussion on garbage collection above.
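The practical moral is to run a kernel at several problem sizes before drawing conclusions. Here is a minimal Java sketch of that methodology; it is our own illustration, the absolute numbers are machine-dependent, and they are not comparable to any benchmark cited above:

    // Sketch: sweep the working set from well inside the cache to well outside
    // it, keeping the total amount of arithmetic roughly constant per size.
    public class SweepSizesSketch {
        static double kernel(double[] a) {
            double s = 0;
            for (int i = 0; i < a.length; i++) s += a[i] * a[i];
            return s;
        }

        public static void main(String[] args) {
            for (int n = 1 << 10; n <= 1 << 23; n <<= 2) {
                double[] a = new double[n];
                java.util.Arrays.fill(a, 1.0);
                int reps = (1 << 25) / n;          // fixed total work per size
                long t0 = System.nanoTime();
                double s = 0;
                for (int rep = 0; rep < reps; rep++) s += kernel(a);
                long t1 = System.nanoTime();
                System.out.printf("n=%9d  reps=%6d  %8.1f ms  (s=%.0f)%n",
                                  n, reps, (t1 - t0) / 1e6, s);
            }
        }
    }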

The person who posted [3] demonstrated the fragility of his own benchmark in a follow-up post, writing that "Java now performs as well as gcc on many tests" after changing something (note that it was not the Java language that changed).

Conclusions: Why is "Java is Slow" so Popular?

Java is nearly equal to (or faster than) C++ on low-level and numeric benchmarks. This should not be surprising: Java is a compiled language (albeit JIT compiled).

Nevertheless, the idea that "Java is slow" is widely believed. Why this is so is perhaps the most interesting aspect of this article.

Let's look at several possible reasons:

  • Java circa 1995 was slow. The first incarnations of Java did not have a JIT compiler, and hence were bytecode interpreted (like Python, for example). JIT compilers appeared in JVMs from Microsoft and Symantec, and in Sun's Java 1.2.

    This explanation is implausible. Most "computer folk" are able to rattle off the exact speed in GHz of the latest processors, and they track this information as it changes each month (and have done so for years). Yet this explanation asks us to believe that they are not able to remember that a single and rather important language speed change occurred in 1996.

  • Java can still be slow. For example, programs written with the thread-safe Vector class are necessarily slower (on a single processor at least) than those written with the equivalent thread-unsafe ArrayList class.

    This explanation is equally unsatisfying, because C++ and other languages have had similar "abstraction penalties". For example, the Kernighan and Pike book The Practice of Programming has a table with the following entries, describing the performance of several implementations of a text processing program:

    Version          400 MHz PII
    C                0.30 sec
    C++/STL/deque    11.2 sec
    C++/STL/list     1.5 sec

    Another historic problem in C++ was the overhead of returning an object from a function (several unnecessary object create/copy/destruct cycles were involved).

  • Java program startup is slow. As a Java program starts, it unzips the Java libraries and compiles parts of those libraries and itself, so an interactive program can be sluggish for the first couple of seconds of use.

    This approaches being a reasonable explanation for the speed myth. But while it might explain users' impressions, it does not explain why many programmers (who can easily understand the idea of an interpreted program being compiled) share the belief.

I find the disconnect between programmer opinions and benchmarking interesting, but not because I am a Java proponent. On the contrary, while I think Java is an adequate language for the middleware stuff that it is typically used for, and as a clean introductory language, it is not a particularly good language for numerics.

Rather, the issue motivates me because similar "mythology" seems to prevent widespread adoption of more advanced languages -- I prefer functional languages and believe that the "has this feature" view of languages is irrelevant if homoiconic metaprogramming is available. There is a similar "garbage collection is slow" myth that persists despite theory and several decades of evidence to the contrary [2].

A reason why you might care is that the "Java must be slow" view justifies the development of software with buffer overflow vulnerabilities. Refer back to the LU portion of the SciMark 2 benchmark, which predominantly involves array accesses and multiply-adds. It is evident that it is possible to have both good performance (1.7x faster than C) and array bounds checking.

Acknowledgements

Ian Rogers, Curt Fischer, and Bill Bogstad provided input and clarification of some points.

References

[1] K. Reinholtz, Java Will Be Faster than C++, ACM SIGPLAN Notices, 35(2): 25-28, Feb. 2000.

[2] Benjamin Zorn, The Measured Cost of Conservative Garbage Collection, Software - Practice and Experience, 23(7): 733-756, 1993.

[3] Linux Number Crunching: Languages and Tools, referenced on slashdot.org.

[4] Christopher W. Cowell-Shah, Nine Language Performance Round-up: Benchmarking Math & File I/O, OSNews.com, Jan. 2004.

[5] E. Schanzer, Performance Considerations for Run-Time Technologies in the .NET Framework, Microsoft Developer Network article.

