Python Is Not Running Your AI Models

Published Saturday June 13, 2026

By Jarle Aase

ai cuda

Some AI influencers and self-proclaimed experts I have seen on social media recently seem to believe that Python is what actually runs AI models and LLMs. That would explain why AI is so slow and energy-hungry, but it's not true.

Python is usually the conductor of the orchestra, not the orchestra itself. The music is played by highly optimized C++, CUDA, and GPU code running underneath.

The illustration shows how Python often serves as a convenient orchestration layer on top of AI engines such as whisper.cpp and llama.cpp.

This approach allows data analysts, researchers, data scientists, and curious developers to experiment with AI models without needing deep knowledge of systems programming, GPU programming, or the internal implementation details of the model itself. Python provides a high-level interface that makes it easy to load models, process inputs, adjust parameters, and inspect results, while still exposing enough low-level controls for advanced experimentation.

The AI model itself is not a program in the traditional sense. A large language model is primarily a collection of trained weights stored in a large binary file. Inference engines such as llama.cpp contain the actual code that loads those weights, performs the tensor calculations, manages memory, schedules work on CPUs or GPUs, and generates output tokens.
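To make the split between "weights" and "engine" concrete, here is a minimal sketch in plain Python. The file layout (two integers followed by raw float32 values) and the function names are invented for this illustration; real formats such as the one llama.cpp loads are far more elaborate, and a real engine does this work in native code. The point is only that the file holds numbers, while separate code decides what to do with them.

```python
import os
import struct
import tempfile


def save_weights(path, rows, cols, weights):
    # Hypothetical toy format: a header with two unsigned 32-bit
    # dimensions, followed by rows * cols little-endian float32 values.
    with open(path, "wb") as f:
        f.write(struct.pack("<II", rows, cols))
        f.write(struct.pack(f"<{rows * cols}f", *weights))


def load_weights(path):
    # The "model" is just data: read the header, then the raw numbers.
    with open(path, "rb") as f:
        rows, cols = struct.unpack("<II", f.read(8))
        flat = struct.unpack(f"<{rows * cols}f", f.read(4 * rows * cols))
    return [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]


def forward(matrix, vector):
    # The "engine": code that applies the weights to an input,
    # here a single matrix-vector product.
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def demo():
    path = os.path.join(tempfile.mkdtemp(), "toy_model.bin")
    save_weights(path, 2, 3, [1.0, 0.0, 2.0, 0.5, 0.5, 0.5])
    return forward(load_weights(path), [1.0, 2.0, 3.0])


print(demo())  # [7.0, 3.0]
```

Swapping in a different weights file changes the result without touching a line of the engine, which is roughly the relationship between a downloaded model file and the inference engine that runs it.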

When a user interacts with a model through Python, most of the heavy work is delegated to highly optimized native code written in languages such as C++ and CUDA. Python typically handles orchestration, configuration, and user interaction, while the inference engine performs the computationally intensive tasks.

This separation is important for performance. If Python itself were responsible for moving large amounts of memory, scheduling GPU kernels, and executing the numerical operations required by modern AI models, inference would be dramatically slower. Instead, Python acts as a control layer that calls into native libraries, allowing optimized C++, CUDA, and GPU code to perform the heavy lifting while Python provides a productive and accessible user experience.
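A small experiment hints at why this matters, even without a GPU. The sketch below assumes CPython, where the built-in sum() and map() are implemented in C. Both functions compute the same dot product, but the first steps through the interpreter for every element while the second hands the whole loop to native code. It is only an analogy for what an inference engine does, and the function names are made up for this example.

```python
import operator
import time


def dot_interpreted(a, b):
    # Every multiply and add is a separate step through the interpreter.
    total = 0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total


def dot_delegated(a, b):
    # In CPython, the loop runs inside the C-implemented sum() and map().
    return sum(map(operator.mul, a, b))


def timed(fn, *args):
    # Return the function's result together with its wall-clock time.
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


n = 200_000
a = list(range(n))
b = list(range(n))

slow_result, slow_seconds = timed(dot_interpreted, a, b)
fast_result, fast_seconds = timed(dot_delegated, a, b)

print(f"interpreted loop: {slow_seconds:.4f} s")
print(f"delegated loop:   {fast_seconds:.4f} s")
```

On most machines the delegated version is usually noticeably faster, although the exact numbers depend on the Python version and hardware. The gap between Python and a tuned C++ or CUDA kernel working on billions of weights is far larger than anything this toy can show.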

In that sense, Python is often best viewed as the user interface of modern AI systems, while the actual execution engine resides in the native runtimes beneath it.

NVIDIA and CUDA

NVIDIA currently dominates the market for high-performance AI hardware. Since CUDA is frequently mentioned in discussions about AI, it's worth understanding where it fits into the software stack and how it relates to Python.

The previous illustration showed a Python application running on top of inference engines such as llama.cpp and whisper.cpp. That approach works perfectly well on NVIDIA hardware, but it doesn't necessarily deliver the maximum performance available from high-end AI accelerators.

If you are training large models or running inference on powerful NVIDIA GPUs, CUDA is usually involved somewhere in the stack.

CUDA is often discussed as if it were a mysterious AI technology. In reality, CUDA is a platform for GPU computing that extends C++ with language features for parallel programming on NVIDIA GPUs. Modern CUDA code is typically written in C++, not Python.

Looking at the illustration, the stack becomes much easier to understand.

At the top sits Python. This is usually the code written by researchers, data scientists, and AI developers. Python provides a productive environment for experimentation, orchestration, and model development.

Below Python is PyTorch, one of the most widely used AI frameworks. PyTorch is a library that provides tensors, neural network components, optimizers, automatic differentiation, and thousands of building blocks that would be impractical for most developers to implement themselves.

Below PyTorch sits Triton. Triton allows developers to describe GPU operations using a Python-like syntax. However, Triton is not executed as ordinary Python code. Instead, it acts as a compiler and code generator, producing highly optimized GPU kernels that ultimately run on NVIDIA hardware.

Those kernels are lowered to PTX, NVIDIA's low-level intermediate instruction set, which is then compiled to GPU machine code and launched through the CUDA driver stack. The CUDA platform manages memory transfers, kernel execution, and communication with the GPU.

At the bottom of the stack is the hardware itself. This is where the actual work happens. Thousands of GPU cores execute the mathematical operations required for neural networks, transformer models, image generation, speech recognition, and other AI workloads.

So, to return to where we started: Python is rarely responsible for the heavy computation. Python orchestrates the workload, while PyTorch, Triton, CUDA, and the GPU perform the computationally intensive tasks. This division of responsibilities is one of the reasons modern AI systems can deliver such impressive performance despite being driven from a high-level language.

Why do I care?

As a software developer, I've been interested in AI since the 1980s. Back then, artificial intelligence was something you read about in fiction books and comic magazines. I heard about Prolog, an "AI programming language", and bought it for my Osborne 1 computer, running CP/M. It was fun, but overhyped. I could not use it for anything practical.

Still, AI both interested and terrified me. Over the decades I loved fiction with realistic AIs, and I kept myself up to date with computer literature. A few years ago I participated in an AI user group in Sofia, Bulgaria, where members experimented with various AI technologies to do things like predict the local weather.

Now the first wave of consumer-grade AI has arrived.

For the first time, I can run genuinely useful AI models on my own computers. I use local models regularly, both through LM Studio and through my own desktop application written in C++. They help me transcribe recordings, translate text, summarize information, brainstorm ideas, and research topics that I would rather not share with cloud-based AI services.

What fascinates me is not only what these models can do, but also how they work.

As a software developer, I like understanding how technology works. Not how it's hyped or presented, but how it works, and why that exact way, from all the possible alternatives, was chosen.

We live in interesting times. In some ways that's exciting. In some ways it's wildly terrifying. But unlike many of the AI waves I've seen come and go over the decades, this one is real.