In an earlier collaborative project, Ceva worked with CERN on the trigger system of the Large Hadron Collider (LHC), a sophisticated real-time filtering mechanism that deals with the torrent of information generated by the LHC experiments. Its task is to rapidly identify and retain only the most scientifically valuable events for further analysis, discarding the rest with extreme precision and speed. Ceva and CERN tackled this by applying advanced model compression techniques, specifically automating mixed-precision quantization for Convolutional Neural Networks (CNNs), while also exploring Binary Neural Networks (BNNs) and Ternary Weight Networks (TWNs) for jet particle detection and classification. These compact, resource-efficient models were well suited to CERN’s stringent latency and hardware constraints. However, as the complexity of collision data increases and the demand for even greater event selection accuracy grows, the limits of these earlier models have become increasingly apparent.
Transformers: An Upgrade for Scientific Event Selection
The emergence of Transformer models offers a promising path forward. Originally developed for natural language processing, Transformers have since become foundational models in many domains, including computer vision, biology, and now scientific data analysis. For CERN’s use case, Transformer models offer significant advantages over traditional architectures.
Transformers are particularly effective at capturing complex, long-range relationships in high-dimensional data. Their attention mechanism enables them to focus dynamically on the most relevant parts of an input sequence, which makes them ideal for identifying subtle and rare patterns within particle collision data. Furthermore, their ability to handle structured and multimodal input allows them to process signals from multiple detectors simultaneously—something that simpler, local-receptive-field-based models like CNNs struggle to do. As a result, Transformers provide a more expressive and flexible framework for distinguishing meaningful events from background noise in real time.
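To make the attention mechanism concrete, the sketch below shows minimal single-head scaled dot-product attention in NumPy. The token count, embedding size, and the "detector hit" framing are purely illustrative assumptions, not CERN's actual data layout or model.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head attention: every query token weighs all key tokens."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # pairwise query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ V                                 # attention-weighted mix of values

# Toy input: 5 "detector hit" tokens, each embedded in 8 dimensions (illustrative only).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))
out = scaled_dot_product_attention(tokens, tokens, tokens)  # self-attention
print(out.shape)  # (5, 8): each token now encodes context from the whole sequence
```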
In terms of performance, power, and scalability, Transformers enable significantly more expressive representations and improved classification accuracy while supporting concurrent multi-channel processing. They allow for higher sensitivity to rare signals without drastically increasing false positives, leading to better event filtering under constrained bandwidth. On the power side, their raw compute demand is high, but properly quantized models remain compact enough for efficient real-time deployment. Their scalability also allows for smoother adaptation to future data volumes and detector complexity.
The Challenges Introduced by Transformer Models
Despite these benefits, deploying Transformer models in a real-time trigger environment introduces new and significant challenges. Transformers are typically large models, with millions or even billions of parameters, making them computationally intensive and memory-hungry. This poses a major hurdle for hardware platforms like FPGAs and edge accelerators, which must operate under strict latency and resource constraints.
In addition, Transformers are particularly sensitive to quantization. Their performance can degrade sharply if the precision of weights or activations is reduced too aggressively. This sensitivity stems from the abundance of activation and weight outliers in LLMs, which introduce high dynamic ranges that are difficult to capture with low-bit precision. As a result, specialized techniques are often required to mitigate their impact and preserve model accuracy. Mapping these complex architectures onto low-latency, deterministic systems—without losing their performance advantages—requires innovative engineering and algorithmic adaptation.
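A small numeric illustration, using made-up activation values, shows why a single outlier is so damaging under plain symmetric INT8 quantization: it stretches the scale and erases the resolution available for the typical values.

```python
import numpy as np

def int8_quantize(x):
    """Symmetric INT8 quantization: one scale for the whole tensor."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale            # dequantized values

activations = np.array([0.02, -0.03, 0.05, 0.01, -0.04], dtype=np.float32)
outlier = np.append(activations, 12.0)             # one extreme activation

err_clean = np.abs(int8_quantize(activations) - activations).max()
err_outlier = np.abs(int8_quantize(outlier)[:-1] - activations).max()
print(err_clean, err_outlier)  # error on the small values grows sharply with the outlier
```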
How to Mitigate the Challenges of Transformer Deployment
To overcome these challenges, Ceva is leveraging its expertise to develop a new generation of optimization techniques tailored specifically for Transformers in real-time environments. One major strategy is to apply quantization not uniformly, but intelligently—adapting the precision level on a per-layer basis.
This means that layers with stable activation distributions can be quantized statically using pre-computed parameters, while layers exhibiting significant activation outliers are better suited for dynamic quantization at runtime. Another critical technique is group quantization, in which weights or activations are divided into smaller, logically coherent groups and quantized independently. This allows for much finer control over precision loss and helps maintain model accuracy.
Furthermore, mixed-precision inference is used to allocate higher precision to sensitive components and lower precision where tolerable, achieving a balance between accuracy and efficiency. These strategies can yield Transformer models that are both fast and lightweight—yet still capable of sophisticated decision-making in CERN’s trigger system.
Quantization Techniques and the Challenges They Address
Intelligent quantization is not a one-size-fits-all solution. Ceva is applying multiple quantization techniques, each selected to address a specific technical challenge introduced by Transformer deployment. Together, these quantization strategies will form a toolkit that allows CERN to compress Transformer models without compromising their ability to operate effectively.
Ceva’s Role in this Collaborative Project
Static Quantization involves pre-computing the quantization parameters—such as scale and zero-point—during a calibration phase. This method is ideal for layers with stable activation distributions and predictable behavior. It allows for highly efficient inference since no additional computations are required at runtime, but it may struggle with layers that encounter high variance during operation.
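As a minimal sketch, assuming symmetric INT8 quantization and synthetic calibration batches (the function names are illustrative, not Ceva's SDK API), static quantization derives the scale once offline and reuses it unchanged at inference:

```python
import numpy as np

def calibrate_scale(calibration_batches):
    """Pre-compute a single INT8 scale from representative calibration data."""
    max_abs = max(np.abs(batch).max() for batch in calibration_batches)
    return max_abs / 127.0

def static_quantize(x, scale):
    """Runtime: quantize with the frozen, pre-computed scale (no extra statistics)."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

# Calibration phase (offline): collect activations with stable distributions.
rng = np.random.default_rng(1)
calib = [rng.normal(0, 1.0, size=(64, 16)) for _ in range(8)]
scale = calibrate_scale(calib)

# Inference phase (online): only a multiply, round, and clip are needed.
live_activations = rng.normal(0, 1.0, size=(64, 16))
q = static_quantize(live_activations, scale)
```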
Dynamic quantization calculates quantization parameters such as scale and zero-point on the fly during inference. This makes it well suited for Transformer layers with unpredictable input patterns and significant activation outliers. While dynamic quantization introduces some runtime overhead, it enables the model to adapt to changing data distributions, helping preserve accuracy in scenarios where static quantization would struggle.
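The same symmetric INT8 scheme can be sketched with the scale recomputed from each live tensor; the drifting synthetic data below simply stands in for shifting activation distributions, and is not production code:

```python
import numpy as np

def dynamic_quantize(x):
    """Compute the INT8 scale from the live tensor itself, at every inference step."""
    scale = np.abs(x).max() / 127.0            # runtime overhead: one pass over x
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Each incoming tensor gets its own scale, so shifting distributions are tracked.
rng = np.random.default_rng(2)
for step in range(3):
    activations = rng.normal(0, 0.5 * (step + 1), size=(32, 16))  # drifting spread
    q, scale = dynamic_quantize(activations)
    error = np.abs(dequantize(q, scale) - activations).max()
    print(f"step {step}: scale={scale:.4f}, max error={error:.4f}")
```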
Group quantization, on the other hand, enhances precision by dividing a model’s weights or activations into smaller, logically coherent sub-groups—often aligned with attention heads, channels, or vector segments. Each group is quantized independently, striking a practical balance between overly coarse quantization (which can hurt model accuracy) and the high complexity of per-weight scaling. Group quantization is particularly valuable in Transformers, which are known to be sensitive to outliers and distribution shifts. To manage outliers within large vectors, the token vector is split into smaller subgroups and dynamic quantization is applied independently to each subgroup. This combination of group and dynamic quantization enables more fine-grained control over numerical precision, improving both accuracy and stability in real-time inference.
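A rough sketch of that combination follows: a token vector is split into fixed-size sub-groups and each sub-group receives its own dynamically computed INT8 scale. The group size, vector length, and outlier value are illustrative assumptions.

```python
import numpy as np

def group_dynamic_quantize(vec, group_size=16):
    """Split a token vector into sub-groups, each with its own runtime INT8 scale,
    so an outlier only degrades precision inside its own group."""
    groups = vec.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 127.0
    scales = np.maximum(scales, 1e-12)                   # avoid divide-by-zero
    q = np.clip(np.round(groups / scales), -127, 127).astype(np.int8)
    return q, scales

def group_dequantize(q, scales):
    return (q.astype(np.float32) * scales).reshape(-1)

# A 64-dim token vector with one outlier: only its 16-element group is affected.
rng = np.random.default_rng(3)
token = rng.normal(0, 0.05, size=64).astype(np.float32)
token[40] = 8.0                                          # activation outlier
q, scales = group_dynamic_quantize(token, group_size=16)
err = np.abs(group_dequantize(q, scales) - token)
print(err[:16].max(), err[32:48].max())  # groups without the outlier stay precise
```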
CERN’s Role in this Collaborative Project
Mixed-Precision Quantization allocates different bit-widths to different parts of the model. For example, sensitive layers might use 16-bit floating point (FP16), while less critical layers might operate in 8-bit integer (INT8) or even 4-bit formats. This flexible approach is especially important in Transformers, where different layers contribute unequally to overall performance.
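The sketch below illustrates the principle with a hypothetical per-layer precision map; the layer names and bit-width choices are examples only, and a 16-bit integer grid is used as a simple stand-in for FP16:

```python
import numpy as np

def quantize_uniform(x, bits):
    """Symmetric uniform quantization to a given bit-width (illustrative formats only)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

# Hypothetical per-layer precision map: sensitive layers keep more bits.
precision_map = {
    "attention.qkv_proj": 16,   # kept near full precision (stand-in for FP16)
    "mlp.fc1":            8,    # INT8 is sufficient here
    "mlp.fc2":            4,    # aggressive 4-bit for a tolerant layer
}

rng = np.random.default_rng(4)
weights = {name: rng.normal(0, 0.1, size=(128, 128)) for name in precision_map}

for name, w in weights.items():
    bits = precision_map[name]
    err = np.abs(quantize_uniform(w, bits) - w).mean()
    print(f"{name}: {bits}-bit, mean abs error {err:.6f}")
```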
Per-Layer Quantization Decisioning represents an emerging area of research in the collaboration, where algorithms automatically determine the optimal quantization strategy (static vs dynamic, low vs high bit-width) for each Transformer layer. This decision process considers both hardware efficiency and accuracy sensitivity, enabling deployment-ready models without the need for extensive manual tuning.
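As one possible illustration, and not the algorithm being developed in the project, a simple heuristic can compare a layer's peak calibration activation to its 99th percentile and assign dynamic, higher-precision quantization only to outlier-heavy layers. The layer names, threshold, and synthetic calibration data below are all assumptions.

```python
import numpy as np

def choose_strategy(layer_activations, outlier_ratio_threshold=20.0):
    """Heuristic per-layer decision (illustrative policy only): layers whose
    calibration activations show heavy outliers get dynamic quantization and a
    higher bit-width; well-behaved layers get cheap static INT8."""
    flat = np.concatenate([a.ravel() for a in layer_activations])
    p99 = np.percentile(np.abs(flat), 99)
    peak = np.abs(flat).max()
    outlier_ratio = peak / max(p99, 1e-12)
    if outlier_ratio > outlier_ratio_threshold:
        return {"mode": "dynamic", "bits": 16}
    return {"mode": "static", "bits": 8}

# Calibration activations per layer (synthetic stand-ins for real trigger data).
rng = np.random.default_rng(5)
calib = {
    "attention.softmax_in": [rng.standard_cauchy(size=(64, 128)) for _ in range(4)],
    "mlp.gelu_out":         [rng.normal(0, 1.0, size=(64, 128)) for _ in range(4)],
}

for layer, acts in calib.items():
    print(layer, choose_strategy(acts))
```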
CERN will use what Ceva develops in this project to enhance the automation of mixed-precision quantization for Transformer models. The effort will build on the achievements of the previous collaborative project, in which the FIT algorithm was developed.
Ceva’s Role in the Edge AI Market
Ceva is a well-established player in Edge AI and neural processing, and its family of Neural Processing Units (NPUs) is engineered for scalable performance, supporting AI inferencing workloads from ultra-low-power embedded Machine Learning to high-throughput generative models. These NPUs are specifically optimized for quantized operations, making them ideal for deploying compressed Transformer models in resource-constrained Edge AI environments.
Ceva also brings a robust AI SDK and software stack that automates key compression processes—including mixed-precision quantization, group quantization, and hardware-aware optimization.
Summary
As the volume and complexity of data generated by the LHC continue to grow, the need for more intelligent and scalable real-time analysis systems becomes ever more urgent. The partnership between Ceva and CERN marks a pivotal step toward that future. By upgrading CERN’s trigger system to support Transformer models—and by addressing the associated challenges through advanced quantization strategies—the collaboration is laying the groundwork for next-generation AI systems that are both powerful and efficient.
Both Ceva and CERN want to quantize Transformers for their respective use cases. Ceva will apply the quantization algorithms and techniques developed in this collaborative project to generative AI large language, vision, and multimodal models. CERN will apply them to event selection and processing in the LHC’s high-throughput, real-time environment.