We have discussed false sharing extensively on this blog. In our last post, Why Ring Buffers Outperform ArrayBlockingQueue: Understanding the Performance Gap, we did a deep dive into why ring buffers often outperform ArrayBlockingQueue.
Now, while writing that post, there was one thing we mentioned that kept coming back, and if you already understand false sharing, you will immediately see why it looks suspicious.
Inside ArrayBlockingQueue, you have these two fields:
/** items index for next take, poll, peek or remove */
int takeIndex;
/** items index for next put, offer, or add */
int putIndex;
If you are familiar with false sharing, this looks like the perfect recipe for it.
But is it really?
ArrayBlockingQueue uses a single lock for both producers and consumers. So in theory, producers and consumers are never modifying these fields at the same time. It is basically one thread at a time touching these pointers.
So does false sharing still happen here or not?
That is exactly what we are investigating in this blog post. We will show how this object is actually laid out in memory, investigate cache line contention and cache misses, build a custom padded version of ArrayBlockingQueue (same logic, better layout), and finally run benchmarks against the JDK ArrayBlockingQueue.
Revealing the Memory Layout with JOL
First, let’s investigate the object’s memory layout. How are fields inside ArrayBlockingQueue actually arranged in heap memory?
Java doesn’t expose this natively, so we use Java Object Layout (JOL) from OpenJDK. JOL digs into JVM internals and prints the stuff Java normally hides: object headers, field offsets, sizes, alignment, and any padding the JVM inserts.
Installing and Using JOL
Add this dependency to your pom.xml:
<dependency>
    <groupId>org.openjdk.jol</groupId>
    <artifactId>jol-core</artifactId>
    <version>0.17</version>
</dependency>
Now run this code to inspect the layout:
package com.heappulse;

import org.openjdk.jol.info.ClassLayout;

import java.util.concurrent.ArrayBlockingQueue;

public class ABQLayout {
    public static void main(String[] args) {
        ArrayBlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
        System.out.println(ClassLayout.parseInstance(queue).toPrintable());
    }
}
And run it like this:
mvn -q exec:java -Dexec.mainClass=com.heappulse.ABQLayout
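Optionally, JOL can also print the VM's general layout parameters (header size, reference size, object alignment, field sizes), which is useful context when reading the dump that follows. A tiny throwaway class along these lines does it; VmDetails is just a name we made up:
package com.heappulse;

import org.openjdk.jol.vm.VM;

public class VmDetails {
    public static void main(String[] args) {
        // Prints VM layout info: compressed references, header size,
        // object alignment, field sizes by type, etc.
        System.out.println(VM.current().details());
    }
}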
Reading the JOL Output
Here’s a formatted version of the JOL dump:
OFF  SZ            TYPE DESCRIPTION
  0   8                 (object header: mark)
  8   4                 (object header: class)
 12   4             int takeIndex
 16   4             int putIndex
 20   4             int count
 24   4        Object[] items
 28   4   ReentrantLock lock
 32   4       Condition notEmpty
 36   4       Condition notFull
 40   4            Itrs itrs
 44   4                 (object alignment gap)
Instance size: 48 bytes
This result shows that:
- takeIndex, putIndex, and count are packed back-to-back, at offsets 12, 16, and 20.
- Each is 4 bytes, so together they occupy 12 bytes in total.
- The whole ArrayBlockingQueue object is just 48 bytes.
Most CPUs have 64-byte cache lines, so the first cache line for this object (bytes 0 to 63) looks something like this:
Cache Line 0 (bytes 0 - 63)
+0                                                                  +63
|---------------------------------------------------------------------|
| mark (8)  | klass (4) | takeIndex | putIndex | count | items | lock |
|  0-7      |  8-11     |  12-15    |  16-19   | 20-23 | 24-27 | 28-31|
|---------------------------------------------------------------------|
| notEmpty  | notFull   | itrs      | align gap (4 bytes)             |
|  32-35    |  36-39    |  40-43    |  44-47                          |
|---------------------------------------------------------------------|
If you understand false sharing, this is where the suspicion comes from:
- Consumers mostly write takeIndex and count.
- Producers mostly write putIndex and count.
And since these live close together, they can easily share the same cache line.
Implementing a Padded ArrayBlockingQueue
It is now clear to us that takeIndex, putIndex, and count are literally sitting back-to-back in memory, inside the same cache line. We now build a simple padded version of ArrayBlockingQueue. Here is a small snippet of what it looks like:
import jdk.internal.vm.annotation.Contended;

import java.util.AbstractQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public final class PaddedArrayBlockingQueue<E>
        extends AbstractQueue<E>
        implements BlockingQueue<E> {

    final Object[] items;
    final ReentrantLock lock;
    final Condition notEmpty;
    final Condition notFull;

    // Hot fields, now isolated onto different cache lines.
    // (Compiling against jdk.internal.vm.annotation.Contended needs
    //  --add-exports=java.base/jdk.internal.vm.annotation=ALL-UNNAMED,
    //  and the padding only kicks in at runtime with -XX:-RestrictContended.)
    @Contended("take")
    int takeIndex;

    @Contended("put")
    int putIndex;

    @Contended("count")
    int count;

    public PaddedArrayBlockingQueue(int capacity) {
        this(capacity, false);
    }

    public PaddedArrayBlockingQueue(int capacity, boolean fair) {
        if (capacity <= 0) throw new IllegalArgumentException();
        this.items = new Object[capacity];
        this.lock = new ReentrantLock(fair);
        this.notEmpty = lock.newCondition();
        this.notFull = lock.newCondition();
    }

    @Override
    public void put(E e) throws InterruptedException { /* same as ABQ */ }

    @Override
    public E take() throws InterruptedException { /* same as ABQ */ }

    // ...
}
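The bodies marked /* same as ABQ */ are the standard ArrayBlockingQueue logic running against the fields above, minus the JDK's iterator bookkeeping. A sketch of what they look like:
@Override
public void put(E e) throws InterruptedException {
    java.util.Objects.requireNonNull(e);
    lock.lockInterruptibly();
    try {
        while (count == items.length)      // queue full: wait for free space
            notFull.await();
        items[putIndex] = e;
        if (++putIndex == items.length) putIndex = 0;   // wrap around
        count++;
        notEmpty.signal();
    } finally {
        lock.unlock();
    }
}

@Override
@SuppressWarnings("unchecked")
public E take() throws InterruptedException {
    lock.lockInterruptibly();
    try {
        while (count == 0)                 // queue empty: wait for an element
            notEmpty.await();
        E e = (E) items[takeIndex];
        items[takeIndex] = null;
        if (++takeIndex == items.length) takeIndex = 0; // wrap around
        count--;
        notFull.signal();
        return e;
    } finally {
        lock.unlock();
    }
}
The logic is deliberately identical to the JDK's; only the field layout changes.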
Demonstrating Cache Line Contention
Profiling with perf (Linux): Capturing Cache Misses and Sharing Events
Next we’ll use Linux perf to see what the CPU reports while the queue is under load: cache references, cache misses, plus basic noise signals like context switches and CPU migrations. We’ll pin producer/consumer threads to fixed cores, then repeat the same run with a padded queue and compare the counters.
Here is our workload:
package com.heappulse;

import java.util.concurrent.ArrayBlockingQueue;

public class ABQContention {
    static volatile boolean running = true;

    public static void main(String[] args) throws Exception {
        int capacity = 1024;
        int producers = 2;
        int consumers = 2;
        int seconds = 20;

        var q = new ArrayBlockingQueue<Integer>(capacity);

        // Start half-full so both put() and take() are active immediately
        for (int i = 0; i < capacity / 2; i++) q.put(1);

        Runnable producer = () -> {
            while (running) {
                try { q.put(1); } catch (InterruptedException ignored) {}
            }
        };
        Runnable consumer = () -> {
            while (running) {
                try { q.take(); } catch (InterruptedException ignored) {}
            }
        };

        for (int i = 0; i < producers; i++) {
            Thread t = new Thread(producer, "producer-" + i);
            t.setDaemon(true);   // daemon: lets the JVM exit even if blocked in put()
            t.start();
        }
        for (int i = 0; i < consumers; i++) {
            Thread t = new Thread(consumer, "consumer-" + i);
            t.setDaemon(true);   // daemon: lets the JVM exit even if blocked in take()
            t.start();
        }

        Thread.sleep(seconds * 1000L);
        running = false;
    }
}
Build and run (pinned)
Build your code with:
mvn -q -DskipTests package
Now run it pinned:
taskset -c 0,2,4,6 java -cp target/classes com.heappulse.ABQContention
Here I explicitly pin the process to CPUs 0, 2, 4, and 6 based on my machine's topology. I picked these because they are distinct physical cores, not hyper-thread “sibling” cores.
Profile with perf stat
Now profile it with perf:
sudo taskset -c 0,2,4,6 perf stat -r 5 \
  -e cycles,instructions,cache-references,cache-misses,context-switches,cpu-migrations \
  -- java -cp target/classes com.heappulse.ABQContention
We use the same command to run the padded version: in our ABQContention class, simply swap ArrayBlockingQueue for PaddedArrayBlockingQueue, then rebuild and rerun the perf command.
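The only line that changes in the workload is the queue construction, something like:
// In ABQContention.main(): swap the JDK queue for the padded one
var q = new PaddedArrayBlockingQueue<Integer>(capacity);
One caveat: run this variant with -XX:-RestrictContended (and compile with --add-exports=java.base/jdk.internal.vm.annotation=ALL-UNNAMED), otherwise @Contended is silently ignored and you are measuring the unpadded layout again.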
Perf result interpretation
From our results, the padded version shows a modest improvement: cache references dropped from about 4.13B → 3.96B, and cache misses dropped from about 201M → 174M, bringing the miss rate from ~4.88% → ~4.40%. It is not a dramatic change, but it shows that with padding the hot fields get in each other's way less, so the CPU spends less time reloading the same cache line and more time actually running the queue logic.
JMH Benchmark
We also run a simple JMH benchmark for both versions. Here’s the code:
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 7, time = 2, timeUnit = TimeUnit.SECONDS)
@Fork(value = 3, jvmArgsAppend = {
        "--add-exports=java.base/jdk.internal.vm.annotation=ALL-UNNAMED",
        "-XX:-RestrictContended",
        "-XX:ContendedPaddingWidth=64"
})
@State(Scope.Group)
public class AbqVsPaddedSpscBenchmark {

    @Param({"1024", "65536"})
    int capacity;

    @Param({"JDK_ABQ", "PADDED_ABQ"})
    String impl;

    @Param({"false"})
    boolean fair;

    static final Integer ITEM = 1;

    BlockingQueue<Integer> q;

    @Setup(Level.Iteration)
    public void setup() throws Exception {
        q = impl.equals("JDK_ABQ")
                ? new ArrayBlockingQueue<>(capacity, fair)
                : new PaddedArrayBlockingQueue<>(capacity, fair);
        for (int i = 0; i < capacity / 2; i++) q.put(ITEM);
    }

    @Benchmark
    @Group("q")
    @GroupThreads(1)
    public void producer() throws Exception {
        q.put(ITEM);
    }

    @Benchmark
    @Group("q")
    @GroupThreads(1)
    public void consumer(Blackhole bh) throws Exception {
        bh.consume(q.take());
    }
}
Build your code with mvn -q -DskipTests clean package and run it with java -jar target/benchmarks.jar AbqVsPaddedSpscBenchmark.
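If you prefer kicking the benchmark off from the IDE instead of the shaded jar, a minimal JMH runner does the same thing; BenchmarkRunner here is just our own helper class:
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class BenchmarkRunner {
    public static void main(String[] args) throws Exception {
        // Runs only the ABQ-vs-padded benchmark; warmup, measurement and fork
        // settings come from the annotations on the benchmark class itself.
        Options opts = new OptionsBuilder()
                .include(AbqVsPaddedSpscBenchmark.class.getSimpleName())
                .build();
        new Runner(opts).run();
    }
}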
JMH result interpretation (SPSC)
In SPSC, the padded version consistently beats the JDK ArrayBlockingQueue. For capacity=1024 the queue throughput goes from about 25.7M ops/s → 29.7M ops/s (roughly +15%).
For capacity=65536 it goes from about 26.1M ops/s → 28.9M ops/s (roughly +11%).
Conclusion
As we’ve seen in this post, padding gives a small but real improvement. ArrayBlockingQueue doesn’t suffer from the usual “classic” false sharing, because the single lock means producers and consumers aren’t updating takeIndex and putIndex at the exact same time. But those counters still sit on the same cache line, so when threads run on different cores, that cache line keeps getting passed around between cores every time the lock switches hands. Padding reduces this cache-line shuffling, and that’s why we see the improvements.
However, it’s important to note that the real cost is still the locking mechanism: most of the time is spent waiting to acquire the lock.




