Neural Speed: Accelerating 4-bit LLM Inference on CPUs
Chapter 1: Introduction to Neural Speed
Running large language models (LLMs) on standard consumer hardware poses significant challenges, particularly when the model exceeds available GPU memory. Quantization reduces the model's size, but even a quantized model may still be too large to fit on the GPU. A viable solution is to use CPU RAM together with a framework optimized for CPU inference, such as llama.cpp.
Intel is actively enhancing CPU inference capabilities with its new framework, the Intel Extension for Transformers. This user-friendly solution is designed to maximize CPU performance when running LLMs.
Neural Speed, licensed under Apache 2.0, builds on Intel's extension and dramatically accelerates inference for 4-bit LLMs on CPUs. According to Intel, this framework can achieve inference speeds that are up to 40 times faster than those obtained with llama.cpp.
In this article, I will explore the key optimizations offered by Neural Speed, demonstrate its usage, and benchmark its inference throughput in comparison to llama.cpp.
Section 1.1: Key Optimizations in Neural Speed
At NeurIPS 2023, Intel unveiled several optimizations aimed at improving CPU inference efficiency. The diagram below highlights the essential components introduced by Neural Speed for enhanced performance:
The CPU tensor library features kernels optimized for 4-bit models and supports x86 architectures, including AMD processors. These kernels target models quantized to the INT4 data type and support formats such as GPTQ, AWQ, and GGUF. In addition, Intel provides its own quantization library, Neural Compressor, which is used when the model has not already been quantized.
While the NeurIPS paper gives few details about the "LLM Optimizations," it mentions preallocating memory for the KV cache, a step that current frameworks often skip. In their benchmark, Neural Speed achieves next-token latency 1.6 times lower than llama.cpp's.
Section 1.2: Implementing Neural Speed
To explore the capabilities of Neural Speed, you can access the relevant notebook: Get the notebook (#60). Neural Speed can be found at: intel/neural-speed (Apache 2.0 license). Installation is straightforward using pip:
pip install neural-speed intel-extension-for-transformers accelerate datasets
Neural Speed integrates seamlessly with Intel's extension for transformers, requiring only the extension to be imported. Instead of the standard import from "transformers," you will use:
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
Then, you can load your model in 4-bit format:
model = AutoModelForCausalLM.from_pretrained(
    model_name, load_in_4bit=True
)
Intel's extension will handle the quantization and serialization of the model, a process that may take some time. For instance, using the 16 CPUs of the L4 instance on Google Colab, loading a 7B model took approximately 12 minutes.
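To check that everything works end to end, here is a minimal sketch of generating text with the loaded model; it mirrors the usage shown in Intel's documentation, with the model identifier and prompt used here as placeholders and the tokenizer coming from the standard transformers library:

from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "mistralai/Mistral-7B-v0.1"  # placeholder; substitute your own Mistral-7B-based model

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)

inputs = tokenizer("Tell me about gravity.", return_tensors="pt").input_ids
outputs = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))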
Chapter 2: Benchmarking Neural Speed
To benchmark inference throughput, I used the Mayonnaise model (Apache 2.0 license), which is based on Mistral 7B, on the same Google Colab L4 instance. With Neural Speed and Intel's quantization, I achieved an impressive throughput of:
32.5 tokens per second.
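For transparency, here is a rough sketch of how such a tokens-per-second figure can be measured. This is my own simple timing harness rather than an official benchmarking tool; the prompt and generation length are arbitrary, and the model and tokenizer are reused from the snippet above:

import time

prompt = "Explain the difference between CPUs and GPUs."
inputs = tokenizer(prompt, return_tensors="pt").input_ids

start = time.time()
outputs = model.generate(inputs, max_new_tokens=256)
elapsed = time.time() - start

new_tokens = len(outputs[0]) - inputs.shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/second")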
While this is indeed fast, how does it compare to llama.cpp?
After quantizing the model to 4-bit with llama.cpp (type 0, i.e., Q4_0), I ran inference with it and obtained:
9.8 tokens per second.
This result shows a significant speed difference, although it does not reach the "up to 40x faster than llama.cpp" claim found on Neural Speed's GitHub page. It’s essential to note that I did not modify hyperparameters or hardware configurations for either framework in my tests.
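For reference, the llama.cpp side of the comparison can be reproduced from Python with the llama-cpp-python bindings; this is an assumption about tooling (the benchmark could just as well use the llama.cpp CLI), and the GGUF path and thread count below are placeholders:

import time
from llama_cpp import Llama

llm = Llama(model_path="./mayonnaise.Q4_0.gguf", n_threads=16)  # placeholder path to the Q4_0 GGUF file

start = time.time()
output = llm("Explain the difference between CPUs and GPUs.", max_tokens=256)
elapsed = time.time() - start

print(f"{output['usage']['completion_tokens'] / elapsed:.1f} tokens/second")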
Intel's performance reports suggest that the speedup over llama.cpp grows with the number of CPU cores. For example, their experiments on an Intel® Xeon® Platinum 8480+ system and an Intel® Core™ i9-12900 system report speedups of up to 3.1x over llama.cpp.
Neural Speed also supports the GGUF format. By running another benchmark with the GGUF version of my Mayonnaise model, I obtained:
44.2 tokens per second.
This is roughly 36% faster than the model quantized by Neural Speed itself, and about 4.5 times faster than llama.cpp.
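For completeness, here is a sketch of how a GGUF file can be loaded through the same interface, following the pattern in Neural Speed's GGUF examples; the repository, file name, and tokenizer below are placeholders for the GGUF export you want to benchmark:

from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

gguf_repo = "TheBloke/Mistral-7B-v0.1-GGUF"   # placeholder repository hosting the GGUF file
gguf_file = "mistral-7b-v0.1.Q4_0.gguf"       # placeholder 4-bit GGUF file name
tokenizer_name = "mistralai/Mistral-7B-v0.1"  # tokenizer of the base model

tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
model = AutoModelForCausalLM.from_pretrained(gguf_repo, model_file=gguf_file)

inputs = tokenizer("Explain the difference between CPUs and GPUs.", return_tensors="pt").input_ids
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))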
Conclusion: Assessing CPU vs. GPU Inference
Intel Neural Speed delivers impressive performance, clearly outpacing llama.cpp. Although I could not verify the claim of being "up to 40x" faster, a 4.5x increase is substantial.
When considering whether to use CPU or GPU for inference, it’s important to note that CPU performance is steadily improving. For single-instance processing (batch size of one), the performance disparity between CPUs and GPUs is minimal. However, for larger batch sizes, GPUs generally outperform CPUs, provided there is adequate memory available, thanks to their superior parallel processing capabilities.
To stay updated on advancements in AI, consider subscribing to my newsletter for more articles and tutorials.