LLM Inference Benchmark: Xeon E5-2680 + RTX 4090


On July 27, 2024, we ran a local inference benchmark across eight popular open-weight LLMs on a Xeon E5-2680 + RTX 4090 setup using Ollama. Here are the results.

Benchmark Setup

  • Date: July 27, 2024
  • Platform: Xeon E5-2680 CPU, 64GB RAM, RTX 4090 24GB GPU
  • Provider: Ollama
  • Models Tested:
    • phi3 (3.8B parameters)
    • mistral (7B parameters)
    • mixtral (8x7B parameters)
    • aya (8B parameters)
    • qwen2 (7B parameters)
    • llama3 (8B parameters)
    • llama3-gradient (8B parameters)
    • llama3.1 (8B parameters)
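
Throughput figures like the ones below can be computed from Ollama's API response, which reports `eval_count` (tokens generated) and `eval_duration` (nanoseconds). A minimal sketch of that calculation, assuming Ollama's documented `/api/generate` JSON fields; the prompt and endpoint are the defaults, and `benchmark()` of course needs a running Ollama server:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Convert Ollama's eval_count / eval_duration (ns) into tokens/second."""
    return eval_count / (eval_duration_ns / 1e9)

def benchmark(model: str, prompt: str) -> float:
    """Run one non-streaming generation and return decode throughput."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=payload)
    with urllib.request.urlopen(req) as r:
        resp = json.load(r)
    return tokens_per_second(resp["eval_count"], resp["eval_duration"])

# Example: 500 tokens generated in 2.75e9 ns is roughly phi3-class speed.
print(round(tokens_per_second(500, 2_750_000_000), 1))  # → 181.8
```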

Key Findings

  • Fastest Model: phi3

    • Total tokens per second: 181.73
    • Total time: 4.48 seconds
  • Slowest Model: mixtral

    • Total tokens per second: 20.90
    • Total time: 33.29 seconds
  • Performance Spectrum (from fastest to slowest):

    • phi3 > mistral > llama3.1 > llama3 > qwen2 > llama3-gradient > aya > mixtral
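
The ordering above can be reproduced for the four models whose throughput is quoted in this post by sorting on tokens/second (the exact numbers for the other four models are in the full report):

```python
# Decode throughput (tok/s) as reported in this post.
measured = {"phi3": 181.73, "mistral": 105.1, "llama3": 79.5, "mixtral": 20.90}

# Sort fastest-first; agrees with the spectrum above for these four models.
ranking = sorted(measured, key=measured.get, reverse=True)
print(" > ".join(ranking))  # → phi3 > mistral > llama3 > mixtral
```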

Analysis

phi3 is fast because it’s small. At 3.8B parameters it runs entirely in VRAM with room to spare, so there is no memory-bandwidth bottleneck and no quantisation overhead at this tier. For latency-sensitive applications (real-time chat, inline completions, interactive tools), phi3 competes with models roughly twice its size while delivering well over 1.5x their throughput.

mixtral’s speed penalty is architectural. The 8x7B mixture-of-experts design activates roughly 13B parameters per forward pass (two of its eight experts, plus shared layers), not the nominal 56B, but the routing overhead and total memory footprint still hit inference throughput harder than a dense 7B model. At 20.9 tok/s it’s usable, but plan for the latency.
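
A back-of-the-envelope sketch of why the active parameter count in a mixture-of-experts model sits well below the nominal count; the per-component sizes here are illustrative assumptions, not Mixtral's published architecture:

```python
def moe_active_params(shared_b: float, expert_b: float, n_experts: int, top_k: int):
    """Return (active, total) parameter counts in billions for a simple MoE model."""
    total = shared_b + n_experts * expert_b   # all experts resident in memory
    active = shared_b + top_k * expert_b      # router picks top_k experts per token
    return active, total

# Illustrative split: ~2B shared (attention etc.) + 8 expert FFNs of ~5.5B each.
active, total = moe_active_params(shared_b=2.0, expert_b=5.5, n_experts=8, top_k=2)
print(f"active ≈ {active}B of {total}B resident")  # far fewer weights touched per token
```

The asymmetry is the point: all experts must stay in memory, but only a fraction of the weights are read per token, which is why MoE throughput is hurt more by memory capacity and routing than by raw compute.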

7–8B models diverge more than expected. mistral (105.1 tok/s) outpaces llama3 (79.5 tok/s) despite a nearly identical parameter count (7B vs 8B). Architecture and quantisation choices matter more than raw size in this range. Test the specific model rather than relying on parameter count as a proxy for speed.

llama3.1 beats llama3 and llama3-gradient. The 3.1 iteration ships with better default quantisation targets, which shows up directly in inference throughput. If you’re running the Llama family, default to the most recent stable version.

Hardware context matters. These numbers are specific to this setup: RTX 4090 with 24GB VRAM, 64GB system RAM, Ollama as the runtime. GPU memory capacity is the dominant variable for models at this size — a 16GB card will produce different throughput ordering, especially for mixtral, which may partially offload to CPU.
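
A rough VRAM-fit check makes the offload risk concrete. This sketch assumes 4-bit quantisation (~0.5 bytes per parameter) plus ~20% overhead for KV cache and activations; both constants are assumptions, not Ollama's exact memory model:

```python
def est_vram_gb(params: float, bytes_per_param: float = 0.5, overhead: float = 1.2) -> float:
    """Rough resident-size estimate in GB for a quantised model."""
    return params * bytes_per_param * overhead / 1e9

VRAM_GB = 24  # RTX 4090

for name, params in [("phi3", 3.8e9), ("llama3", 8e9), ("mixtral (nominal 8x7B)", 56e9)]:
    need = est_vram_gb(params)
    verdict = "fits" if need <= VRAM_GB else "partial CPU offload likely"
    print(f"{name}: ~{need:.1f} GB -> {verdict}")
```

Under these assumptions every dense model in the lineup fits comfortably in 24GB, while mixtral lands above it, consistent with the offload caveat above.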

Pick a model based on your latency budget and the task. For real-time use: phi3 or mistral. For tasks that tolerate higher latency and benefit from stronger reasoning: llama3.1 or mixtral.

Full report available on LinkedIn: https://www.linkedin.com/posts/cimpleo_llm-inference-benchmark-xeon-e5-268064gb-activity-7224005179635818497-7NDZ

