FriendliAI was founded in 2021 by CEO Byung-Gon Chun, a systems researcher at Seoul National University who previously worked on AI infrastructure at Microsoft and Meta. The company set out to solve one of the most expensive problems in applied AI: serving large generative models efficiently in production. Rather than building a foundation model of its own, FriendliAI focuses entirely on the inference layer, where GPU utilization, batching strategy, and memory management determine whether an AI product is economically viable at scale.
The core of the platform is the Friendli Engine, a serving runtime that pioneered iteration-level (continuous) batching and combines it with aggressive but accuracy-preserving quantization, including FP8, INT8, and AWQ. These techniques allow FriendliAI to pack far more concurrent requests onto each GPU than naive serving approaches, which the company says translates into as much as 90% lower inference cost and some of the fastest token-generation speeds on the market. The engine supports text, image, video, and audio generation models, and integrates directly with model hubs like Hugging Face.
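To make the batching claim concrete, here is a minimal toy sketch of why iteration-level scheduling can pack more work onto the same hardware than static batching. It is not the Friendli Engine's implementation. The function names, the fixed `batch_size` slot model, and the assumption that each request needs a known number of decode steps are all illustrative simplifications.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Toy model of static batching: requests are grouped in arrival order,
    and each group occupies the GPU until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Toy model of iteration-level (continuous) batching: after every decode
    step, finished requests leave and queued requests fill the free slots.
    Assumes every length is at least 1 (hypothetical simplification)."""
    queue = deque(lengths)   # requests waiting for a slot
    running = []             # remaining decode steps of in-flight requests
    steps = 0
    while queue or running:
        # Admit waiting requests into any free slots before the next step.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        # Every in-flight request emits one token; drop the ones that finished.
        running = [r - 1 for r in running if r > 1]
    return steps


if __name__ == "__main__":
    # Two short and two long requests, two slots on the (hypothetical) GPU.
    lengths = [1, 8, 1, 8]
    print("static:", static_batching_steps(lengths, batch_size=2))
    print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

In this toy workload the static scheduler leaves a slot idle while each short request waits for its long neighbor, whereas the continuous scheduler refills the slot on the very next step. Real engines also have to manage prefill, KV-cache memory, and preemption, none of which this sketch models.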
FriendliAI offers three main deployment modes: Friendli Serverless Endpoints for instant access to popular open models on a pay-per-token basis, Friendli Dedicated Endpoints for autoscaling private deployments of custom or fine-tuned models, and Friendli Container, which lets enterprises run the optimized engine inside their own VPC or on-premises infrastructure for data residency and compliance reasons.
The company raised an initial $6 million seed round in late 2021 led by Capstone Partners, then a $20 million seed extension in 2025, again led by Capstone Partners, with participation from Sierra Ventures, Alumni Ventures, KDB Investment, and KB Securities. The two rounds bring total funding to roughly $26 million. The fresh capital is aimed at expanding its enterprise inference platform and growing its US presence.
FriendliAI competes in a crowded field of inference clouds but differentiates on raw engine performance and on flexibility across modalities and deployment surfaces. That makes it a fit for teams that have outgrown generic API providers and need predictable cost and latency on open-weight models.