How Thunder Compute works (GPU-over-TCP)

October 29, 2024

Thunder Compute uses network-attached GPUs instead of physically attached ones. Behind the scenes, Thunder Compute makes CPU-only instances believe that they have GPUs attached, when those GPUs are actually reached over TCP. From your perspective, the resulting instances behave as though they have GPUs, even though no GPU is physically connected.

As a result, every instance on Thunder Compute is an on-demand CPU-only instance, exactly like one you would find on AWS, GCP, or Azure. These instances do not have GPUs. It follows that the CPU-only instances you interact with on Thunder Compute have all of the functionality of the CPU-only instances you would find on Amazon EC2 or Google Cloud. In fact, many of them are hosted on AWS or Google Cloud.
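To make the idea of a network-attached device concrete, here is a toy sketch of a call that looks local but is actually served across a TCP connection. It is not Thunder Compute's protocol or implementation: the JSON message format, the `vector_add` operation, and the function names `serve_once`, `remote_vector_add`, and `demo` are all hypothetical, and plain Python addition stands in for GPU work.

```python
import json
import socket
import threading


def serve_once(listener: socket.socket) -> None:
    """Accept one connection, run the requested operation, and send the result back."""
    conn, _ = listener.accept()
    with conn, conn.makefile("r") as reader:
        request = json.loads(reader.readline())
        # Stand-in for work that a real system would run on a remote GPU.
        result = [a + b for a, b in zip(request["x"], request["y"])]
        conn.sendall((json.dumps({"result": result}) + "\n").encode())


def remote_vector_add(address, x, y):
    """Client stub: reads like a local call, but the work happens across TCP."""
    with socket.create_connection(address) as conn, conn.makefile("r") as reader:
        message = {"op": "vector_add", "x": x, "y": y}  # hypothetical wire format
        conn.sendall((json.dumps(message) + "\n").encode())
        return json.loads(reader.readline())["result"]


def demo():
    # Port 0 asks the operating system to pick any free port on localhost.
    listener = socket.create_server(("127.0.0.1", 0))
    server = threading.Thread(target=serve_once, args=(listener,), daemon=True)
    server.start()
    try:
        return remote_vector_add(listener.getsockname(), [1, 2, 3], [10, 20, 30])
    finally:
        server.join(timeout=5)
        listener.close()


if __name__ == "__main__":
    print(demo())  # [11, 22, 33]
```

The caller of `remote_vector_add` never touches the hardware that does the work; it only sees a function that returns a result. That is the general shape of the trick, with all of the hard parts left out.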

Here is a rough diagram of how we manage these connections between CPU-only instances and GPUs behind the scenes:

A simple example demonstrates the distinction between our virtual GPU-over-TCP technology and a physical PCIe connection:

Running `nvidia-smi` on Thunder Compute behaves exactly as it would with a physical GPU, listing the attached GPU.

nvidia-smi output

Meanwhile, running `lspci` shows no connected GPUs.

lspci output

To hammer home the point that there is no GPU, here is the full list of PCIe-connected devices on this Thunder Compute instance.

PCIe devices list

We hope we have convinced you that there is no GPU physically connected to the machine. Pretty cool, right? You can `pip install tnr` and run `tnr start` to try this same demo yourself.
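If you would rather script the comparison than read the two outputs by eye, a minimal sketch follows. It assumes that `nvidia-smi -L` (the standard flag that lists one GPU per line) and `lspci` are available on the instance; the helper names `count_smi_gpus`, `pci_nvidia_devices`, and `run` are made up for this example.

```python
import shutil
import subprocess


def count_smi_gpus(smi_list_output: str) -> int:
    """Count the 'GPU <n>: ...' lines printed by `nvidia-smi -L`."""
    return sum(1 for line in smi_list_output.splitlines() if line.startswith("GPU "))


def pci_nvidia_devices(lspci_output: str) -> list[str]:
    """Return the lspci lines that mention an NVIDIA device."""
    return [line for line in lspci_output.splitlines() if "nvidia" in line.lower()]


def run(command: list[str]) -> str:
    """Run a command and return its stdout, or '' if the tool is not installed."""
    if shutil.which(command[0]) is None:
        return ""
    return subprocess.run(command, capture_output=True, text=True, check=False).stdout


if __name__ == "__main__":
    gpus = count_smi_gpus(run(["nvidia-smi", "-L"]))
    pci = pci_nvidia_devices(run(["lspci"]))
    print(f"GPUs reported by nvidia-smi: {gpus}")
    print(f"NVIDIA devices on the PCI bus: {len(pci)}")
    if gpus and not pci:
        print("A GPU is visible to software but absent from the PCI bus.")
```

On a machine with a physically attached NVIDIA card, both counts should be nonzero. The last message only appears when software sees a GPU that the PCI bus does not.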

Now that you understand the distinction between a Thunder Compute instance and a GPU instance on EC2, it is worth explaining the limitations of this virtualized approach.

1. **Performance**: TCP is slower than PCIe. While this may seem problematic, Thunder Compute is optimized to minimize the resulting performance impact. In practice, the slowdown is often not noticeable and has minimal impact on common data science tasks.
2. **Limited Compatibility**: Eventually, our GPU-over-TCP technology will have the full functionality of physically attached cards, but today Thunder Compute lacks official support for some common GPU libraries. If Thunder Compute does not support your particular use case, please reach out and we will add support.
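A back-of-envelope model helps show why the performance gap matters more for some workloads than others. The sketch below is only an illustration: every bandwidth and latency figure in it is an assumed round number (roughly a PCIe 4.0 x16 link and a 10 Gbit/s network), not a measurement of Thunder Compute, and the function names are hypothetical.

```python
def transfer_seconds(payload_bytes: float, bandwidth_bytes_per_s: float, latency_s: float) -> float:
    """Time to move one payload: a fixed latency plus size divided by bandwidth."""
    return latency_s + payload_bytes / bandwidth_bytes_per_s


def slowdown(compute_s: float, payload_bytes: float) -> float:
    """Ratio of (compute + network transfer) to (compute + PCIe transfer)."""
    # Assumed, illustrative link parameters; substitute your own numbers.
    pcie = transfer_seconds(payload_bytes, bandwidth_bytes_per_s=32e9, latency_s=1e-6)
    network = transfer_seconds(payload_bytes, bandwidth_bytes_per_s=1.25e9, latency_s=100e-6)
    return (compute_s + network) / (compute_s + pcie)


if __name__ == "__main__":
    payload = 100e6  # 100 MB moved to the GPU per step
    for compute_s in (0.01, 0.1, 2.0):
        print(f"{compute_s:>5} s of GPU compute per step -> {slowdown(compute_s, payload):.2f}x")
```

Under these assumptions, a step that computes for a long time relative to the data it moves barely notices the slower link, while a step dominated by data transfer slows down substantially. Plugging in numbers from your own workload gives a rough sense of which side it falls on.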

The impact of these drawbacks varies with your specific workload, and we continue to improve both. So far, our testing has shown data science workflows to be the most performant and stable. You can find full compatibility details in “Compatibility.” Thunder Compute is open to the public, so the easiest way to test compatibility with your workflow is to try it yourself.