DeepInfra was founded in September 2022 in Palo Alto by Nikola Borisov, Yessenzhar Kanapin, and Georgios Papoutsis, engineers with deep backgrounds in large-scale networking and distributed systems. Their thesis was straightforward: as open-source models matured, the bottleneck for most companies would not be access to weights but the cost and operational burden of serving those models reliably at high throughput. DeepInfra was built to make running inference as simple as calling an API while keeping the per-token price as low as possible.
The platform exposes a large catalog of open models, including popular LLMs, embedding models, text-to-image models, and speech-to-text systems, each available behind an OpenAI-compatible endpoint. Developers pay only for what they use, billed by tokens or by compute time for image and audio workloads. Behind the API, DeepInfra manages GPU clusters, batching, and autoscaling, absorbing the complexity of capacity planning and keeping utilization high enough to sustain aggressive pricing.
For teams that need isolation or custom models, DeepInfra also offers dedicated deployments where a specific model runs on reserved GPUs. This gives predictable latency and throughput for production traffic while retaining the simplicity of the managed platform. The company emphasizes throughput and price-performance as its core differentiators against both hyperscalers and other inference startups.
DeepInfra's funding accelerated alongside demand for inference capacity. It raised an $8 million seed, an $18 million Series A in April 2025 led by Felicis and Georges Harik, and a $107 million Series B co-led by 500 Global and Georges Harik, with participation from A.Capital Ventures, Crescent Cove, Felicis, NVIDIA, Peak6, Samsung Next, Supermicro, and Upper90, bringing total funding above $130 million. NVIDIA's participation reflects the strategic importance of inference-focused neoclouds that drive GPU consumption.
The company sits in a competitive segment alongside other serverless inference providers, but its emphasis on raw price-per-token and a broad open-model catalog makes it attractive to developers and startups building high-volume AI products who want to avoid both vendor lock-in and the overhead of self-hosting GPUs.