LLM Inference Benchmark Part 1: Xeon + RTX4090 GPU

Reading time: 4 minutes.

Last modified: 29 July 2024


Understanding the performance characteristics of different models is crucial for developers and researchers. A benchmark conducted on July 27, 2024 provides concrete insight into the inference speeds of several popular LLMs, all measured under the same conditions on a single workstation. The results show how much throughput varies between models of different sizes and architectures, and they can help you pick a model that matches the latency and capability requirements of your application. Let’s dive into the results and what they mean for the AI community.

Benchmark Setup

  • Date: July 27, 2024
  • Platform: Xeon E5-2680 CPU, 64GB RAM, RTX4090 24GB GPU
  • Provider: Ollama
  • Models Tested:
    • phi3 (3.8B parameters)
    • mistral (7B parameters)
    • mixtral (8x7B parameters)
    • aya (8B parameters)
    • qwen2 (7B parameters)
    • llama3 (8B parameters)
    • llama3-gradient (8B parameters)
    • llama3.1 (8B parameters)
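The post does not publish the exact harness or prompt used for these runs, but since the provider is Ollama, throughput of this kind can be measured against its local HTTP API, which reports token counts and generation time per request. The script below is a minimal sketch under that assumption: the endpoint is Ollama's default, while the prompt and helper names are illustrative, not the author's actual setup.

```python
import json
import urllib.request

# Illustrative harness (not the original benchmark script): query a local
# Ollama server and derive tokens/second from its response timing fields.
OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint
MODELS = ["phi3", "mistral", "mixtral", "aya", "qwen2",
          "llama3", "llama3-gradient", "llama3.1"]
PROMPT = "Explain the difference between a CPU and a GPU in one paragraph."  # assumed prompt

def tokens_per_second(model: str, prompt: str) -> float:
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read())
    # eval_count = generated tokens, eval_duration = generation time in nanoseconds
    return data["eval_count"] / (data["eval_duration"] / 1e9)

for model in MODELS:
    print(f"{model}: {tokens_per_second(model, PROMPT):.2f} tokens/s")
```

Note that the first request for each model also pays a one-time load cost, which Ollama reports separately as load_duration; a fair comparison should warm each model up before measuring.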

Key Findings

  • Fastest Model: phi3

    • Total tokens per second: 181.73
    • Total time: 4.48 seconds
  • Slowest Model: mixtral

    • Total tokens per second: 20.90
    • Total time: 33.29 seconds
  • Performance Spectrum (from fastest to slowest):

    • phi3 > mistral > llama3.1 > llama3 > qwen2 > llama3-gradient > aya > mixtral
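For context, the reported throughput is simply the number of tokens processed divided by elapsed time, so the rough token counts behind the headline figures can be reconstructed (assuming the "total tokens per second" and "total time" values refer to the same run):

```python
# Rough reconstruction from the reported figures (tokens ≈ tokens/s × seconds)
phi3_tokens    = 181.73 * 4.48    # ≈ 814 tokens in 4.48 s
mixtral_tokens = 20.90  * 33.29   # ≈ 696 tokens in 33.29 s
print(round(phi3_tokens), round(mixtral_tokens))  # -> 814 696
```

In other words, phi3 processed a slightly larger token count in roughly one seventh of the time mixtral needed.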

Analysis and Insights

  1. Smaller Models Can Be Faster: phi3, despite being the smallest model (3.8B parameters), demonstrated the fastest inference speed. This highlights that smaller models can offer significant speed advantages in certain scenarios.

  2. Size vs. Speed Trade-off: The largest model, mixtral (8x7B parameters), showed the slowest inference speed. This illustrates the common trade-off between model size/capability and inference speed.

  3. Variation Among Similar-Sized Models: The 7-8B parameter models (mistral, qwen2, llama3 variants) showed considerable variation in performance. This suggests that architectural differences and optimization techniques play a crucial role in inference speed, even among models of similar size.

  4. Llama Family Performance: The various Llama models (llama3, llama3-gradient, llama3.1) showed some variation in performance, with llama3.1 being the fastest among them. This could indicate improvements in the architecture or optimization across iterations.

  5. Potential for Specialized Use Cases: The significant performance variation across models suggests that different LLMs might be better suited for specific applications. For instance, phi3’s speed might make it ideal for real-time applications where low latency is crucial, while larger models like mixtral might be preferred for tasks requiring more extensive knowledge or capabilities, albeit at the cost of slower inference.

Implications for Developers and Researchers

  1. Model Selection: When choosing an LLM for a project, carefully consider the trade-off between model size, capabilities, and inference speed. Smaller models like phi3 can offer substantial speed benefits for certain applications.

  2. Optimization Opportunities: The benchmark results highlight the potential for significant performance gains through architectural improvements and optimizations, especially for larger models.

  3. Hardware Considerations: The benchmark was conducted on specific hardware (Xeon CPU, RTX4090 GPU). Performance may vary on different setups, emphasizing the importance of testing on your target hardware.

  4. Balancing Act: Developers must balance between model capability (often correlated with size) and inference speed based on their application requirements. In some cases, a smaller, faster model might be preferable to a larger, slower one.

  5. Use Case Alignment: For applications requiring real-time or near-real-time responses, smaller models like phi3 or mistral might be more suitable. For more complex tasks that can tolerate higher latency, larger models like mixtral might be appropriate.
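As a simple illustration of points 4 and 5, measured throughput can be converted into an expected response latency and checked against a budget. The helper below is a hypothetical sketch: only the phi3 and mixtral figures come from this benchmark, and in practice you would substitute measurements taken on your own hardware.

```python
# Hypothetical model-selection helper based on a latency budget.
MEASURED_TPS = {"phi3": 181.73, "mixtral": 20.90}  # tokens/s from the July 27, 2024 run

def fits_budget(model: str, expected_tokens: int, budget_s: float) -> bool:
    """Return True if generating `expected_tokens` should finish within `budget_s` seconds."""
    tps = MEASURED_TPS.get(model)
    if tps is None:
        raise ValueError(f"No measurement for {model}; benchmark it on your target hardware first.")
    return expected_tokens / tps <= budget_s

# Example: a 200-token answer with a 2-second budget
for model in MEASURED_TPS:
    print(model, "fits" if fits_budget(model, 200, 2.0) else "too slow")
# phi3 needs ~1.1 s (fits); mixtral needs ~9.6 s (too slow)
```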

Conclusion

This benchmark provides valuable insights into the current state of LLM inference performance. It demonstrates that the landscape is diverse, with different models excelling in various aspects. Notably, it challenges the assumption that larger models are always better, showing that smaller models can offer significant speed advantages.

For developers and researchers, these results underscore the importance of thorough testing and careful model selection based on specific project requirements. The future of AI applications will likely see a more nuanced approach to model deployment, where the choice of LLM is tailored to the unique demands of each use case, balancing between speed and capability.

As the field continues to evolve, we can expect further improvements in both model capabilities and inference speeds. This may lead to innovations that bridge the gap between the speed of smaller models and the capabilities of larger ones.

Stay tuned for future benchmarks as the world of LLMs continues to advance at a rapid pace!

Full report available on LinkedIn:
https://www.linkedin.com/posts/cimpleo_llm-inference-benchmark-xeon-e5-268064gb-activity-7224005179635818497-7NDZ