Pied Piper is Here with DFloat11

April 24, 2025

5 min read

AI Technology

DF11 Technology
Highlights
  • DF11 (DFloat11) is a mathematically proven lossless compression method that reduces LLM memory footprint by 30% while maintaining 100% identical accuracy to the original model by using Huffman coding to compress only the predictable exponent bits.

  • In real-world customer applications, custom df11 models demonstrate identical accuracy to their bf16 counterparts for sensitive tasks like document redaction, providing memory improvements without sacrificing reliability.

  • Unlike quantization, which degrades model quality through lossy compression and requires additional calibration to recover accuracy, df11 is mathematically lossless: it returns the exact same bits you started with while reducing memory requirements.

Introducing DFloat11—A Mathematically Lossless Way to Run LLMs on GPUs and CPUs

Businesses leveraging AI today are grappling with the reliability of LLM-based products and agents. Most applications are failing the reliability and compliance tests set by business units. The last thing they want is to introduce even a sliver of additional uncertainty. Thus, most businesses just choose the "best" SOTA model that passes their own internal evaluations.

However, many businesses want to own and customize these "best" SOTA LLMs rather than rely on third-party model providers, especially when dealing with sensitive data. Unfortunately, state-of-the-art LLMs like DeepSeek or Llama 4 require data-center-level GPU infrastructure just to store their weights. For example, the smallest Llama 4 variant (Llama-4-Scout-17B-16E-Instruct) requires at least 4xA100/H100 GPUs (80GB each).

To address this, most modern models are released in lower-precision formats like bf16 (bfloat16), a data type invented by Google Brain that has become the standard for modern LLMs. Lossy compression techniques like quantization can then be used to further reduce the memory footprint. This comes at the cost of degrading model quality and introduces complexities that some end-users would prefer to avoid, since it creates uncontrolled variables that must be empirically stress-tested for each deployment scenario.

We introduce df11 (DFloat11), a novel data type that delivers 100% identical performance to the original model, while consuming only ∼70% of the memory footprint. This is the first mathematically proven, lossless compression method based on Huffman coding that allows any LLM to be smaller and cheaper to run with absolutely no loss in accuracy.

But First—What Is bf16 and Why Do Popular LLMs Use It to Represent Their Weights?

Before diving into compression, let's unpack what bf16 even is.

bf16, or bfloat16, is a 16-bit floating-point format invented by Google Brain. Most popular transformer models, such as Llama, store their weights in bf16. Similar to standard 32-bit floats (fp32), it splits a number into three parts:

• 1 bit for the sign (positive or negative)

• 8 bits for the exponent, which controls the number's magnitude (i.e., how big or small it is)

• 7 bits for the fraction (aka the mantissa), which adds fine-grained detail

So even though bf16 is only 16 bits long, it keeps the same exponent size as fp32. This gives it a wide dynamic range—good for training and inference stability—while using less memory.
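To make the layout concrete, here is a minimal Python/NumPy sketch that pulls the three fields out of a value's bf16 representation (the helper name `bf16_fields` is ours, purely for illustration):

```python
import numpy as np

def bf16_fields(x: float):
    """Split a value's bfloat16 representation into (sign, exponent, fraction)."""
    # bfloat16 is simply the top 16 bits of the fp32 bit pattern.
    bits32 = np.float32(x).view(np.uint32)
    bits16 = np.uint16(bits32 >> 16)
    sign     = (bits16 >> 15) & 0x1    # 1 bit
    exponent = (bits16 >> 7)  & 0xFF   # 8 bits, same width as fp32
    fraction =  bits16        & 0x7F   # 7 bits (fp32 keeps 23)
    return int(sign), int(exponent), int(fraction)

print(bf16_fields(0.15625))  # (0, 124, 32): 0.15625 = 1.01b * 2^(124 - 127)
```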

But here's the catch: in trained models, most weights are already squeezed into a very small range (often between -1 and 1). That means those 8 exponent bits take on only a handful of values in practice; they have low entropy and are a prime candidate for compression.

In contrast, the 1-bit sign and 7-bit mantissa values are essentially unpredictable. They have high entropy and are not suitable candidates for compression.
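Here is a small, illustrative NumPy sketch of that intuition. It uses synthetic Gaussian weights rather than a real checkpoint, so the exact numbers are assumptions, but the pattern (low-entropy exponents, near-full-entropy fractions) is what DF11 exploits:

```python
import numpy as np

def entropy_bits(values: np.ndarray, num_symbols: int) -> float:
    """Empirical Shannon entropy of a symbol stream, in bits per symbol."""
    counts = np.bincount(values, minlength=num_symbols).astype(np.float64)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Synthetic "trained-looking" weights: small values clustered around zero.
weights = np.random.normal(0.0, 0.02, size=1_000_000).astype(np.float32)
bits16 = (weights.view(np.uint32) >> 16).astype(np.uint16)  # bf16 bit patterns

exponent = ((bits16 >> 7) & 0xFF).astype(np.int64)
fraction = (bits16 & 0x7F).astype(np.int64)

print(f"exponent entropy: {entropy_bits(exponent, 256):.2f} of 8 bits")
print(f"fraction entropy: {entropy_bits(fraction, 128):.2f} of 7 bits")
# The exponent typically carries only a few bits of information, while the
# fraction is close to its full 7 bits, so only the exponent is worth coding.
```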

DF11 Technology

The Core Idea: DF11 Encoding

DF11 stands for Dynamic-Length Float 11, named after the average number of bits needed to store each weight using our method—just 11 instead of 16.

Here's how DF11 works:

• We keep the sign and fraction bits as-is. These already have high entropy (they carry a lot of useful information), and are difficult to compress without loss.

• We compress only the exponent bits, which tend to follow a predictable pattern and are perfect candidates for Huffman coding.

By using a precomputed Huffman tree based on typical exponent values from real-world models, we replace the fixed 8-bit exponent with a variable-length code that uses fewer bits for common values and more bits for rare ones.

On average, we save about 5 bits per weight—hence the name DF11.
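Here is a minimal Python sketch of the encoding step. It assumes a codebook `exponent_code` mapping each 8-bit exponent value to its Huffman bit string has already been built (see the Huffman example below); the function name and the string-based bitstream are ours, for illustration only, not the actual DF11 kernels:

```python
import numpy as np

def encode_df11(bf16_bits: np.ndarray, exponent_code: dict):
    """Split bf16 bit patterns into a fixed 8-bit sign+fraction array and a
    variable-length exponent bitstream (illustrative, not optimized)."""
    sign     = (bf16_bits >> 15) & 0x1
    exponent = (bf16_bits >> 7)  & 0xFF
    fraction =  bf16_bits        & 0x7F

    # 1 sign bit + 7 fraction bits packed into one byte per weight, stored as-is.
    sign_fraction = ((sign << 7) | fraction).astype(np.uint8)

    # Each 8-bit exponent is replaced by its variable-length Huffman code.
    exponent_stream = "".join(exponent_code[int(e)] for e in exponent)
    return sign_fraction, exponent_stream
```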

Wait -- What's Huffman Coding?

Huffman coding is a classic lossless compression algorithm that is provably optimal among symbol-by-symbol prefix codes. It's still used today in JPEG images, MP3 audio, and many other formats. Here's how it works (a minimal code sketch follows this list):

  • Count how often each value occurs. In our case, these values are exponent bits.

  • Assign shorter codes to more frequent values, and longer codes to rare ones.

  • Build a binary tree where each path from the root to a leaf represents a value's code. Because no code is a prefix of any other, Huffman-coded data can be decoded without any extra markers: the decoder walks the tree bit by bit until it reaches a leaf, emits that value, and starts again at the root.
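For the curious, here is a minimal, illustrative Huffman codebook builder in Python using the standard-library heapq module. DF11 itself ships a precomputed codebook over exponent values rather than building one like this at load time:

```python
import heapq
from collections import Counter

def build_huffman_code(symbols):
    """Return {symbol: bit string}, with shorter strings for frequent symbols."""
    freq = Counter(symbols)
    # Heap entries: (frequency, tie-breaker, {symbol: code-so-far}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, tie, right = heapq.heappop(heap)
        # Prefix 0 onto codes in the left subtree, 1 onto codes in the right.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
    return heap[0][2]

# Exponent 125 appears most often here, so it gets the shortest code.
print(build_huffman_code([127, 126, 126, 125, 125, 125, 125]))
# e.g. {127: '00', 126: '01', 125: '1'}
```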

Storage Layout and How It Works

We split each weight into two parts:

Sign and Fraction Block (8 bits per weight): The 1-bit sign and 7-bit fraction, stored exactly as in bf16

Exponent Stream (variable bits): The compressed exponent values

We store these separately:

• A flat array for the sign and fraction block

• A continuous bitstream for the exponent codes

• A tiny header that stores the Huffman codebook (just once)

On the fly, we decode by:

• Read one sign and fraction block

• Decode one exponent using the Huffman tree

• Recombine the two into the original bf16 weight

This process is fast, costing roughly constant work per decoded weight, and parallelizes efficiently across vectorized hardware like GPUs.
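As a sanity check on the idea, here is an illustrative, sequential Python decoder that consumes the three stored pieces (the sign+fraction array, the exponent bitstream, and the codebook) and reconstructs the original bf16 bit patterns. The real DF11 decoder does the equivalent work with lookup tables in massively parallel GPU kernels; this toy loop is only meant to show that the round trip is exact:

```python
import numpy as np

def decode_df11(sign_fraction: np.ndarray, exponent_stream: str, code: dict) -> np.ndarray:
    """Inverse of the encode_df11 sketch above: returns the original bf16 bits."""
    decode_table = {c: e for e, c in code.items()}  # bit string -> exponent
    out = np.empty(len(sign_fraction), dtype=np.uint16)
    pos = 0
    for i, sf in enumerate(sign_fraction):
        # Huffman codes are prefix-free, so extend the buffer until it
        # matches a codeword; that match is unambiguous.
        buf = ""
        while buf not in decode_table:
            buf += exponent_stream[pos]
            pos += 1
        exponent = decode_table[buf]
        sign     = (int(sf) >> 7) & 0x1
        fraction =  int(sf)       & 0x7F
        out[i] = (sign << 15) | (exponent << 7) | fraction
    return out  # bit-for-bit identical to the bf16 weights we started with
```

Running the encode sketch followed by this decode sketch on any bf16 tensor returns exactly the same bits, which is the property "lossless" refers to throughout this post.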

Results: Less Memory, Same Accuracy

DF11 can be applied to ANY transformer model. Here's what our benchmarks show for a few popular ones (Tables 2 & 3):

• 30% smaller model weights compared to bf16

• No loss in accuracy (it's lossless! Proven by Math!)

DF11 works well in two scenarios:

• For solo developers or hobbyists, it lets you run bigger models on your limited hardware, trading a bit of latency for a big memory win

• For enterprise deployment, it cuts memory use and bandwidth while preserving full fidelity and throughput


Real-World Examples

• Working closely with our early design partners and customers, we've thoroughly validated df11 through rigorous internal evaluations on real-world datasets. The results are compelling: our df11 models deliver substantial memory and performance improvements while maintaining 100% fidelity on highly sensitive customer data.

• For instance, our customer-specific df11 models can redact sensitive information from customer documents with identical accuracy to their bf16 counterparts, demonstrating that this breakthrough doesn't just save resources—it preserves critical model reliability that enterprises depend on.

Why Not Just Quantize?

• Do you want information loss with your quantized model?

• Do you want to spend several days or weeks doing additional calibration or post-training to recover the accuracy loss of your quantized model?

Quantization works by rounding weights to a smaller set of values, like using 8-bit integers. It's powerful—but it's lossy and will degrade model quality.

DF11 is different: it's mathematically exact. You get back the exact same bits you started with.
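A tiny, self-contained illustration of the difference, using a naive symmetric int8 quantizer written just for this example (not any particular production quantization scheme):

```python
import numpy as np

w = np.random.normal(0.0, 0.02, size=8).astype(np.float32)

# Naive symmetric int8 quantization: round onto 255 evenly spaced levels.
scale = np.abs(w).max() / 127.0
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_back = w_int8.astype(np.float32) * scale

print(np.array_equal(w, w_back))  # almost always False: rounding threw bits away

# A lossless scheme must instead satisfy decode(encode(w)) == w bit-for-bit,
# which is exactly the property the encode/decode sketches above demonstrate.
```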

This makes DF11 a great alternative when accuracy is critical.

Looking Ahead: Smarter Compression, Not Just Smaller Numbers

DF11 shows that targeted, format-aware compression can unlock new efficiencies. Instead of throwing away precision across the board, we selectively compress the redundant bits—and keep the good stuff.

This idea could apply to other formats, like fp8, int4, or even new, custom formats for accelerators or edge devices.

Compression isn't just about squeezing things smaller. Done right, it can make models more efficient, more deployable, and more accessible.

Want to Learn More?

If you're curious about DF11, compression for ML, or just want to collaborate, we'd love to chat. This is just the beginning—and we believe there are many more wins waiting in the bits.


XMAD.ai

Lead AI Research Labs