False Sharing – A silent performance killer

Imagine you have designed a high-performance pipeline. One thread is responsible for pulling data in, another for processing it. This is a very common pattern in real-time systems such as log processing, order book updates, and feed parsing. To track progress, you keep two tiny counters: fetched (how many items the ingester has pulled from the source) and processed (how many items the worker has handled). The system works fine, but under high load your tail latency spikes. A simple investigation shows that the culprit is not the logic; it is how the two counters are laid out in memory. They sit next to each other, on the same 64-byte cache line. Each write to one counter forces the other core to invalidate and refetch that line. This back-and-forth cache line transfer is called false sharing: cores spend cycles on cache coherence instead of doing real work.

Cache line flow

When two threads on different cores write to fields that share a cache line, each write invalidates the other core's copy. The other core must reload the line before it can write again, and this ping-pong repeats on every write until the fields are moved onto separate lines.

[Diagram: False sharing flow]

Cache Coherency Protocol

As shown in the diagram, both cores operate on data that resides in the same cache line, and each maintains its own private copy of that line in its local cache. To keep these copies synchronized, the processor enforces a cache coherency protocol (typically MESI or a variant). When Thread 0 on Core 1 writes to the Fetched Counter (FC), the coherency protocol marks the entire cache line as Modified in Core 1's cache and Invalid in Core 2's cache. Later, when Thread 1 on Core 2 tries to read or update the Processed Counter (PC), even though PC itself was never modified, the protocol forces the entire cache line to be fetched again from the core that last modified it, in this case Core 1. This ownership transfer involves communication through the shared cache controller (often the L3), adding latency to every access.

Java Memory Layout

Every Java object on the heap has a fixed internal layout that the HotSpot JVM uses for performance reasons. Objects have a header, typically 12 to 16 bytes depending on whether compressed class pointers are enabled, which contains a mark word (used to store locking information, hash codes, and garbage collection bits) and a klass pointer that tells the JVM which class the object belongs to. For arrays, there is an additional 4-byte field that holds the array length. After this header, the object fields are placed sequentially in memory, not necessarily in the order of declaration. To make memory access faster, the JVM groups fields by size, placing 8-byte values (long, double) first, then 4-byte values (int, float), then smaller types (short, char, byte, boolean), and finally object references. This layout makes field access efficient because it aligns with CPU access boundaries. However, it has a side effect in our case: the two frequently written counters, fetched and processed, end up sitting on the same 64-byte cache line.

To prevent this, we can add padding fields or use the @Contended annotation to ensure each counter sits in a separate cache line, eliminating false sharing and improving throughput.
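
You can verify the actual layout instead of reasoning about it: the JOL (Java Object Layout) library prints the exact offsets the JVM chose. A minimal sketch, assuming the org.openjdk.jol:jol-core dependency is on the classpath (class names here are illustrative):

import org.openjdk.jol.info.ClassLayout;

public class LayoutInspector {
    static final class Counters {
        volatile long fetched;
        volatile long processed;
    }

    public static void main(String[] args) {
        // Prints the header size and the offset of every field, so you can
        // see whether 'fetched' and 'processed' share a 64-byte line.
        System.out.println(ClassLayout.parseClass(Counters.class).toPrintable());
    }
}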

Eliminating false sharing with padding and @Contended

In this section, we develop a program in which two threads each increment a separate counter for 200 million iterations. The baseline keeps both counters in the same cache line, while the padded and @Contended versions move the counters to different lines. We then evaluate the results with a simple benchmark.

1) Baseline: false sharing (two hot fields in one cache line)

In this case the object header is roughly 16 bytes (with compressed class pointers it is 12 bytes, and the first long field is then aligned to offset 16), and each long is 8 bytes, so the two counters plus the header occupy about 32 bytes and sit in the same 64-byte cache line.
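
For orientation, this is the layout we would expect for Counters on a 64-bit HotSpot JVM with compressed class pointers; the exact offsets on your JVM can be confirmed with JOL as shown earlier:

// offset  0..11   object header (~12 bytes)
// offset 12..15   alignment gap
// offset 16..23   fetched   (volatile long)
// offset 24..31   processed (volatile long)  <- same 64-byte line as 'fetched'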

public class FalseSharingDemo {
    static final int ROUNDS = 200_000_000;

    static final class Counters {
        // Both hot fields are adjacent in memory and land on the same
        // 64-byte cache line, so each core's writes invalidate the other's copy.
        volatile long fetched;
        volatile long processed;
    }

    static final Counters C = new Counters();

    public static void main(String[] args) throws Exception {
        long t0 = System.nanoTime();

        Thread thread0 = new Thread(() -> {
            for (int i = 0; i < ROUNDS; i++) {
                C.fetched++;
            }
        });

        Thread thread1 = new Thread(() -> {
            for (int i = 0; i < ROUNDS; i++) {
                C.processed++;
            }
        });

        thread0.start();
        thread1.start();
        thread0.join();
        thread1.join();

        double sec = (System.nanoTime() - t0) / 1e9;
        System.out.printf("False sharing elapsed: %.3f s  fetched=%d  processed=%d%n",
                sec, C.fetched, C.processed);
    }
}

2) Fix with manual padding (separate the hot fields into different cache lines)

We fix this by adding seven long padding fields, about 56 bytes, which push the second counter past the first 64-byte cache line so it starts on the next one. This separates the two hot fields and eliminates false sharing.

public class PaddedDemo {
    static final int ROUNDS = 200_000_000;

    static final class PaddedCounters {
        // header ~16 bytes, fetched at offset ~16
        volatile long fetched;
        // ~56 bytes of padding to push 'processed' into the next 64-byte line
        long p1, p2, p3, p4, p5, p6, p7;
        volatile long processed;
    }

    static final PaddedCounters C = new PaddedCounters();

    public static void main(String[] args) throws Exception {
        long t0 = System.nanoTime();

        Thread thread0 = new Thread(() -> {
            for (int i = 0; i < ROUNDS; i++) {
                C.fetched++;
            }
        });

        Thread thread1 = new Thread(() -> {
            for (int i = 0; i < ROUNDS; i++) {
                C.processed++;
            }
        });

        thread0.start();
        thread1.start();
        thread0.join();
        thread1.join();

        double sec = (System.nanoTime() - t0) / 1e9;
        System.out.printf("Padded elapsed: %.3f s  fetched=%d  processed=%d%n",
                sec, C.fetched, C.processed);
    }
}
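
One caveat with single-class padding: the JVM does not guarantee that fields keep their declaration order, so a layout change could in principle move the padding. A defensive alternative, popularized by the LMAX Disruptor, is inheritance-based padding: superclass fields are always laid out before subclass fields, so the padding cannot be reordered around the hot field. A minimal sketch (class names are illustrative):

// Left-hand padding lives in the superclass, right-hand padding in the subclass.
class LhsPadding { long p1, p2, p3, p4, p5, p6, p7; }
class HotValue extends LhsPadding { volatile long value; }
final class PaddedValue extends HotValue {
    long p9, p10, p11, p12, p13, p14, p15;

    void increment() { value++; }
    long get() { return value; }
}
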
3) Fix with @Contended (JVM inserts padding)

In this case we let the JVM insert its own padding via @Contended, which places each annotated field on its own padded cache line, removing false sharing without manual padding fields. Note that for application classes the JVM ignores @Contended unless it is started with -XX:-RestrictContended, which is why that flag appears in the run command below.

import jdk.internal.vm.annotation.Contended;

public class ContendedDemo {
    static final int ROUNDS = 200_000_000;

    static final class Counters {
        @Contended
        volatile long fetched;

        @Contended
        volatile long processed;
    }

    static final Counters C = new Counters();

    public static void main(String[] args) throws Exception {
        long t0 = System.nanoTime();

        Thread thread0 = new Thread(() -> {
            for (int i = 0; i < ROUNDS; i++) {
                C.fetched++;
            }
        });

        Thread thread1 = new Thread(() -> {
            for (int i = 0; i < ROUNDS; i++) {
                C.processed++;
            }
        });

        thread0.start();
        thread1.start();
        thread0.join();
        thread1.join();

        double sec = (System.nanoTime() - t0) / 1e9;
        System.out.printf("@Contended elapsed: %.3f s  fetched=%d  processed=%d%n",
                sec, C.fetched, C.processed);
    }
}
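
Worth noting: the JDK uses @Contended internally, for example in Striped64, the base class of LongAdder. If all you need is fast concurrent counters, LongAdder already solves the contention problem for you; a brief sketch:

import java.util.concurrent.atomic.LongAdder;

public class LongAdderCounters {
    // LongAdder stripes its internal cells and pads them (via @Contended in
    // Striped64), so concurrent increments avoid false sharing out of the box.
    static final LongAdder fetched = new LongAdder();
    static final LongAdder processed = new LongAdder();

    public static void main(String[] args) throws Exception {
        Thread t0 = new Thread(() -> { for (int i = 0; i < 1_000_000; i++) fetched.increment(); });
        Thread t1 = new Thread(() -> { for (int i = 0; i < 1_000_000; i++) processed.increment(); });
        t0.start(); t1.start(); t0.join(); t1.join();
        System.out.println("fetched=" + fetched.sum() + " processed=" + processed.sum());
    }
}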

Performance results

We run a simple test to see how long each version takes. Below are the commands to run each version and the results we obtained (adapt the core numbers in the taskset command to match your system's CPU layout).

❯ # Compile
javac FalseSharingDemo.java
# Run pinned to two different physical cores (0 and 2)
taskset -c 0,2 java FalseSharingDemo
False sharing elapsed: 15.340 s  fetched=200000000  processed=200000000
❯ # Compile
javac PaddedDemo.java
# Run pinned to two different physical cores (0 and 2)
taskset -c 0,2 java PaddedDemo
Padded elapsed: 2.816 s  fetched=200000000  processed=200000000
❯ # Compile with module export (needed for jdk.internal.vm.annotation)
javac --add-exports=java.base/jdk.internal.vm.annotation=ALL-UNNAMED ContendedDemo.java

# Run pinned to different physical cores (0 and 2)
taskset -c 0,2 java \
  --add-exports=java.base/jdk.internal.vm.annotation=ALL-UNNAMED \
  -XX:-RestrictContended \
  -XX:ContendedPaddingWidth=128 \
  ContendedDemo

@Contended elapsed: 2.513 s  fetched=200000000  processed=200000000 

You can run this test multiple times to confirm consistent results, or better, measure precisely using a proper microbenchmarking framework such as JMH (Java Microbenchmark Harness). In our runs, the baseline version took about 15.34 seconds, while the padded and @Contended versions dropped to about 2.8 and 2.5 seconds respectively. Both fixes isolate the two hot counters into separate cache lines, eliminating the constant cache-line handoffs between cores and yielding roughly a 5–6× speedup.
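
For numbers that are robust against JIT warmup and run-to-run noise, a JMH benchmark is the better tool (JMH's own samples include a false-sharing example). A minimal sketch of the baseline case, assuming the org.openjdk.jmh:jmh-core dependency and its annotation processor are configured:

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@State(Scope.Group)
@BenchmarkMode(Mode.Throughput)
public class FalseSharingBench {
    // The two hot fields sit adjacent, as in the baseline demo.
    volatile long fetched;
    volatile long processed;

    // Both methods run concurrently in the same group, one thread each,
    // so the two writers hammer the same cache line.
    @Benchmark @Group("shared") @GroupThreads(1)
    public void writeFetched() { fetched++; }

    @Benchmark @Group("shared") @GroupThreads(1)
    public void writeProcessed() { processed++; }

    public static void main(String[] args) throws Exception {
        new Runner(new OptionsBuilder()
                .include(FalseSharingBench.class.getSimpleName())
                .build()).run();
    }
}

Compare its score against a copy whose fields are padded or annotated with @Contended; the per-operation gap exposes the cache-line handoffs directly.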

[Chart: false sharing benchmark results]

Conclusion

False sharing is a subtle but costly performance issue. The fix is simple: keep hot fields on separate cache lines using padding or @Contended. For a rigorous benchmark, use JMH, pin threads to separate cores (and NUMA nodes), and compare the variants.
