I have been interested in Java performance issues for some time. This time we will look not so much at Java code performance as at the performance of the Garbage Collection process. I was inspired by a lecture at the Warsaw JUG about HotSpot in low-latency Java. The lecturer, Wojciech Kudla, prepared a small application that simulates an HFT system from the GC point of view. All the source code, together with the great GC Viewer plugin for JVisualVM, is available on his GitHub: heaptrasher.
In a nutshell, heaptrasher is an application that generates a lot of garbage for the GC to operate on. In addition, it contains code that builds a histogram with latency statistics. The statistics can be collected in two different ways; it is worth mentioning that the first one (array) allocates a lot of memory in the Old Generation space, while the second one does not. As I found out later, this has a considerable impact on GC performance.
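To give a rough idea of why the two modes behave so differently, here is a minimal sketch (my own illustration, not heaptrasher's actual code; the class names ArrayStats and DirectStats are made up): keeping every sample in one huge long-lived array means that the array survives young collections and ends up occupying the Old Generation, while folding each sample into a small fixed-size histogram leaves almost nothing long-lived behind.

```java
// Hypothetical sketch of the two statistics modes (not heaptrasher's code).
public class StatsModes {

    // "array" mode: every latency sample is appended to one huge, long-lived
    // array. The array outlives many young collections, gets promoted and
    // occupies a big chunk of the Old Generation.
    static final class ArrayStats {
        private final long[] samples = new long[50_000_000]; // ~400 MB, long-lived
        private int count;
        void record(long nanos) { samples[count++] = nanos; }
    }

    // "direct" mode: every sample is folded into a tiny fixed-size histogram
    // right away, so per-message allocations stay short-lived.
    static final class DirectStats {
        private final long[] buckets = new long[64]; // effectively static
        void record(long nanos) {
            int bucket = 64 - Long.numberOfLeadingZeros(Math.max(1L, nanos));
            buckets[Math.min(63, bucket)]++;
        }
    }

    public static void main(String[] args) {
        DirectStats stats = new DirectStats();
        for (int i = 0; i < 10_000_000; i++) {
            long start = System.nanoTime();
            byte[] payload = new byte[256]; // simulated message: short-lived garbage
            payload[0] = (byte) i;
            stats.record(System.nanoTime() - start);
        }
        System.out.println(java.util.Arrays.toString(stats.buckets));
    }
}
```

In the first mode the statistics buffer itself is the dominant long-lived data; in the second the live set stays tiny, which, as the tests below show, matters a lot to the collector.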
All charts below come from JVisualVM with the aforementioned GC Viewer and VisualGC plugins.
Test platform
- CPU: i7-3612QM
- RAM: 8GB
- OS: Windows 7 Ultimate
- JDK: 1.7.0_u21
A common feature of all the tests is the use of the parallel collector for Young Generation collections. The collector chosen for the Old Generation does not matter, because in this scenario an Old Generation collection never happens.
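A quick way to confirm which collectors the JVM actually picked, and to read the collection counts and accumulated times (the numbers below come from GC Viewer/VisualGC, but the standard JMX beans expose the same kind of data), is a snippet like this:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcBeans {
    public static void main(String[] args) {
        // With -XX:+UseParallelGC the beans are typically named
        // "PS Scavenge" (Young Generation) and "PS MarkSweep" (Old Generation).
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%-12s collections=%d totalTime=%d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```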
First test
Description
4GB heap, array as statistics data holder
Command line
java -XX:+UnlockDiagnosticVMOptions -Xmx4g -Xms4g -XX:+UseParallelGC Main array
Garbage Collector pause durations
Garbage Collection times
Message latency
50,0000% of message latency is less than: 1466 nanos
75,0000% of message latency is less than: 2443 nanos
90,0000% of message latency is less than: 3421 nanos
99,0000% of message latency is less than: 4888 nanos
99,9000% of message latency is less than: 11241 nanos
99,9900% of message latency is less than: 19549 nanos
99,9990% of message latency is less than: 75263 nanos
99,9999% of message latency is less than: 3480174 nanos
Summary
6.5 ms pauses? That is not low latency :(. The chart is also very irregular. 99.9% of messages were processed in under 11241 nanos.
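For reference, a percentile report like the one above can be produced from raw samples roughly as follows (an illustrative sketch only; heaptrasher's own reporting code may differ):

```java
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

public class LatencyReport {
    static void report(long[] latenciesNanos) {
        long[] sorted = latenciesNanos.clone();
        Arrays.sort(sorted);
        for (double p : new double[] {50, 75, 90, 99, 99.9, 99.99, 99.999, 99.9999}) {
            // nearest-rank percentile: first sample at or above the p-th percentile
            int idx = (int) Math.min(sorted.length - 1L,
                    (long) Math.ceil(p / 100.0 * sorted.length) - 1);
            System.out.printf("%.4f%% of message latency is less than: %d nanos%n",
                    p, sorted[idx]);
        }
    }

    public static void main(String[] args) {
        long[] samples = new long[1_000_000];
        for (int i = 0; i < samples.length; i++) {
            samples[i] = 1_000 + ThreadLocalRandom.current().nextLong(10_000); // fake data
        }
        report(samples);
    }
}
```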
Second test
Description
4GB heap, direct statistics data collection
Command line
java -XX:+UnlockDiagnosticVMOptions -Xmx4g -Xms4g -XX:+UseParallelGC Main direct
Garbage Collector pause durations
Garbage Collection times
Message latency
50,0000% of message latency is less than: 1466 nanos
75,0000% of message latency is less than: 2443 nanos
90,0000% of message latency is less than: 2933 nanos
99,0000% of message latency is less than: 4887 nanos
99,9000% of message latency is less than: 11240 nanos
99,9900% of message latency is less than: 19060 nanos
99,9990% of message latency is less than: 82104 nanos
99,9999% of message latency is less than: 762892 nanos
Summary
Hmmm, much better, isn't it? It appears that the Old Generation is a major enemy of low-latency applications... The chart is as irregular as in the first case, and the change did not affect the 99.9% message processing time.
Third test
Description
4GB heap, direct statistics data collection, 3GB NewSize
Command line
java -XX:+UnlockDiagnosticVMOptions -Xmx4g -Xms4g -XX:MaxNewSize=3g -XX:NewSize=3g -XX:+UseParallelGC Main direct
Garbage Collector pause durations
Garbage Collection times
Message latency
50,0000% of message latency is less than: 1466 nanos
75,0000% of message latency is less than: 2443 nanos
90,0000% of message latency is less than: 3421 nanos
99,0000% of message latency is less than: 4888 nanos
99,9000% of message latency is less than: 11241 nanos
99,9900% of message latency is less than: 20038 nanos
99,9990% of message latency is less than: 69887 nanos
99,9999% of message latency is less than: 652930 nanos
Summary
In general the change did not affect the duration of a single collection, but the total number of collections is more than two times lower. It is interesting that enlarging the Young Generation did not affect single collection times. This is because the application generates a lot of short-lived objects, which means the GC has a relatively small graph of live objects to trace. Once again, 99.9% of messages are processed with the same latency.
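That claim is easy to check with a toy program (again my own illustration, not part of heaptrasher): with the same allocation rate, young collections stay cheap as long as almost nothing survives, and get more expensive once a live set is retained. Run it with the flags from the test above plus -verbose:gc and compare the default mode with -Dretain=true.

```java
import java.util.ArrayList;
import java.util.List;

public class LiveSetDemo {
    // Pass -Dretain=true to keep a growing set of survivors; by default all
    // allocations die young, so each young collection has very little to copy.
    static final boolean RETAIN = Boolean.getBoolean("retain");
    static final List<byte[]> survivors = new ArrayList<>();

    public static void main(String[] args) {
        for (long i = 0; i < 100_000_000L; i++) {
            byte[] chunk = new byte[1024];       // short-lived garbage
            chunk[(int) (i & 1023)] = (byte) i;  // keep the allocation observable
            if (RETAIN && i % 200 == 0) {
                survivors.add(chunk);            // grow the live set
                if (survivors.size() > 300_000) {
                    survivors.clear();           // cap it at roughly 300 MB
                }
            }
        }
        System.out.println("retained " + survivors.size() + " chunks");
    }
}
```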
Fourth test
Description
4GB heap, direct statistics data collection, 3GB NewSize, number of GC threads = number of CPUs - 1
Command line
java -XX:+UnlockDiagnosticVMOptions -Xmx4g -Xms4g -XX:MaxNewSize=3g -XX:NewSize=3g -XX:ParallelGCThreads=7 -XX:+UseGCTaskAffinity -XX:+BindGCTaskThreadsToCPUs Main direct
Garbage Collector pause durations
Garbage Collection times
Message latency
50,0000% of message latency is less than: 1466 nanos
75,0000% of message latency is less than: 2443 nanos
90,0000% of message latency is less than: 3421 nanos
99,0000% of message latency is less than: 5375 nanos
99,9000% of message latency is less than: 11729 nanos
99,9900% of message latency is less than: 21015 nanos
99,9990% of message latency is less than: 70865 nanos
99,9999% of message latency is less than: 612855 nanos
Summary
There is not much change in GC pause times, though I must admit the chart is more regular. We also see, for the first time in a while, a change in the 99.9% message processing time.
Fifth test
Description
4GB heap, direct statistics data collection, 3GB NewSize, number of GC threads = number of CPUs - 1, ParGCCardsPerStrideChunk set
Command line
java -XX:+UnlockDiagnosticVMOptions -Xmx4g -Xms4g -XX:MaxNewSize=3g -XX:NewSize=3g -XX:ParallelGCThreads=7 -XX:+UseGCTaskAffinity -XX:+BindGCTaskThreadsToCPUs -XX:ParGCCardsPerStrideChunk=32768 Main direct
Garbage Collector pause durations
Garbage Collection times
Message latency
50,0000% of message latency is less than: 1466 nanos
75,0000% of message latency is less than: 2443 nanos
90,0000% of message latency is less than: 3421 nanos
99,0000% of message latency is less than: 5375 nanos
99,9000% of message latency is less than: 10752 nanos
99,9900% of message latency is less than: 21503 nanos
99,9990% of message latency is less than: 76729 nanos
99,9999% of message latency is less than: 629960 nanos
Summary
It looks almost the same as last time, but we get significantly better 99.9% message processing time.
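With this many switches stacked up it is easy to lose track of what the JVM actually ended up with. -XX:+PrintFlagsFinal lists everything at startup, and the effective values can also be read at runtime through the HotSpot diagnostic MXBean, for example like this (diagnostic flags such as ParGCCardsPerStrideChunk are only visible when -XX:+UnlockDiagnosticVMOptions is set, which the command line above already does):

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class ShowGcFlags {
    public static void main(String[] args) {
        HotSpotDiagnosticMXBean hotspot =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        for (String flag : new String[] {
                "UseParallelGC", "NewSize", "MaxNewSize",
                "ParallelGCThreads", "ParGCCardsPerStrideChunk"}) {
            try {
                System.out.println(flag + " = " + hotspot.getVMOption(flag).getValue());
            } catch (IllegalArgumentException e) {
                System.out.println(flag + " is not available on this JVM");
            }
        }
    }
}
```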
Summary
During the presentation the lecturer used a Linux-based platform. On Linux it looks like the individual switches have a bigger impact on collection times. A short summary:
Test number | Number of collections | Total collection time | Avg. collection time | 99.9% latency below | 99.99% latency below
---|---|---|---|---|---
first | 126 | 706.342 ms | 5.60 ms | 11241 nanos | 19549 nanos
second | 126 | 92.002 ms | 0.73 ms | 11240 nanos | 19560 nanos
fourth | 57 | 53.738 ms | 0.94 ms | 11729 nanos | 21015 nanos
fifth | 57 | 46.508 ms | 0.81 ms | 10752 nanos | 21503 nanos
Comments
Does your app intend to run for any extended period of time? If so, ignoring the old gen and calling it GC tuning is like getting a girl's phone number and saying you are engaged. If your app isn't going to run for a long time, just give it a heap so humongous that it never needs to do any GC (young or otherwise) and you will have no GC pauses.
This is excellent information. I was using Oracle JRockit in my trading app; it has very nice GC behavior with low pauses. Unfortunately, it crashes a lot. I am testing the JVM switches that you shared.