Episode 121
Open source AI models are now just 3-5% behind the best closed source models on benchmarks — about six months of lag time, not five years. If you're building an AI infrastructure company on the assumption that OpenAI or Anthropic will maintain a permanent lead, your moat is disappearing faster than your revenue projections assume.
Most founders at the $3M–$20M stage are still over-indexed on model selection and under-indexed on inference economics. They're obsessed with training costs and model access, but the real cost explosion is coming from running models at scale. A model that trains for a year but only runs for a month is a terrible investment — and yet that's how most AI budgets are still structured.
Nikola Borisov spent a decade building backend infrastructure for a chat app with 200 million monthly users before launching Deep Infra. He's CEO and co-founder of Deep Infra, an AI inference platform that owns its own GPU clusters and serves as one of the largest token suppliers on OpenRouter.
The episode centers on two bets Nikola made that most infrastructure founders won't: first, that open source models would catch up to closed source models faster than anyone expected, and second, that inference — not training — would dominate AI budgets within five years. Those bets are both paying off. The gap has narrowed to 3-5%, and as Deep Infra lowers costs, customers aren't just consuming more tokens — they're jumping to better, bigger models.
The conversation also surfaces a less obvious pattern: the economics of AI inference mirror the economics of CDNs more than they mirror cloud compute. Walmart and Target don't care if their images are served from the same CDN — it's just an efficient way to deliver content. Deep Infra runs the same model for multiple companies in parallel on the same GPUs, and neither company cares. It's neutral infrastructure that scales horizontally without requiring every company to build their own.
Roland sees this pattern constantly in his advisory work with SaaS companies scaling from $1M to $50M: founders are modeling their AI spend around closed source API access and per-token pricing, but they're not accounting for what happens when open source closes the gap and inference costs drop 20x. The companies that move early to open source inference infrastructure will have a cost structure their competitors can't match in 18 months — and cost structure at scale is the actual competitive wedge, not model access.
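The unit-economics modeling Roland describes can be sketched in a few lines. All prices and volumes below are hypothetical placeholders chosen for illustration, not actual rates from any provider:

```python
# Back-of-the-envelope model of monthly inference spend.
# Every number here is an illustrative assumption.

def monthly_inference_cost(tokens_per_month: float,
                           price_per_million_tokens: float) -> float:
    """Return monthly spend in dollars for a given token volume."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens

# Assumed workload: 2 billion tokens per month.
tokens = 2_000_000_000

# Assumed blended (input + output) prices per million tokens:
closed_api_price = 10.00   # hypothetical closed-source API rate
open_infra_price = 0.50    # hypothetical open-source inference rate

closed_cost = monthly_inference_cost(tokens, closed_api_price)
open_cost = monthly_inference_cost(tokens, open_infra_price)

print(f"Closed-source API: ${closed_cost:,.0f}/month")   # $20,000/month
print(f"Open-source inference: ${open_cost:,.0f}/month")  # $1,000/month
print(f"Cost ratio: {closed_cost / open_cost:.0f}x")      # 20x
```

Under these assumed prices the gap works out to 20x, which is why a per-token pricing line in a financial model is so sensitive to the closed-versus-open choice.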
Key Moments:
3:01 — Why the gap between closed source and open source models has narrowed to 3-5% — and what that percentage actually measures
5:00 — The five-year-old explanation of inference: training is school, running the model is work
6:41 — Why Anthropic's compute conflict (training vs. serving customers) reveals the real economic wedge
10:39 — The CDN analogy: why Walmart and Target don't care if their requests run on the same infrastructure
16:12 — How lowering costs changes customer behavior — they jump to bigger models, not just more tokens
18:51 — Why Nikola believes inference will dominate company budgets in 5-10 years
20:29 — What a math Olympiad medalist and programming competitor learned about certainty that still drives how he builds
22:31 — Nikola's advice to younger founders: focus on what's most important today, not what's interesting
If navigating AI infrastructure economics — balancing model access, inference costs, and long-term vendor lock-in — is something you're working through right now, the Midstage Accelerator helps SaaS founders at the $1M–$50M stage model these decisions with real unit economics and stage-specific benchmarks. mdstg.ac/drag-erase
#AIInfrastructure #OpenSourceAI #InferenceEconomics #SaaSScaling #ScalingWithoutBreaking
Roland Siebelink (00:01)
Hello and welcome, everyone. Here's yet another episode of Scaling Without Breaking. I'm Roland Siebelink, coach and advisor to many of the fastest growing startups around the world, as you heard. And I'm very honored to have an amazing guest today. He won a gold medal at a math Olympiad in Bulgaria before he moved to the US. And then he spent a decade quietly building the backend for a chat app that's used by 200 million people every month.
Now, instead of chasing a bigger title, he started over with a tiny team, building infrastructure for a technology that barely existed when they launched. Then, ChatGPT happened and suddenly this little startup was sitting in the middle of a multi-billion dollar land grab, competing against companies with 10, 100 times the headcount and 1,000 times the funding. And yet, he is winning. With that, everybody meet my guest.
Nikola Borisov, the CEO and co-founder of Deep Infra. Welcome to Scaling Without Breaking, Nikola.
Nikola Borisov (01:04)
Thank you. Thanks for having me, Roland. Pleasure to be here.
Roland Siebelink (01:07)
Of course. For those that are watching us on YouTube, I'm just wearing the Mid-Stage Accelerator hoodie, but Nikola is wearing this amazing racing jacket with a big logo of an oil company from my native country on the front and a bank from Spain and everything. It's all free sponsorship for those people. I wish them the best with it.
Nikola, let's talk about real business. Most people think the AI race is about who builds the best model. But you've been betting that the models become commodities and that the real race is infrastructure. What do you see that the model companies are missing?
Nikola Borisov (01:48)
First, I want to say that we're really bullish on open source versus closed source models. Over the last three and a half years, as we've been running Deep Infra, we've seen great advances in both categories. But we've seen the open source models catch up, closer and closer, to the best closed source models. If you look at comparisons on Artificial Analysis and other sites, you will see that the top open source models are just a couple percent behind, and they're better than the top versions from the closed source labs that are six months old.
There was a big risk when we were starting. We were running Deep Infra as an inference cloud, and if, let's say, OpenAI had the only good model, there would be little business for us to do. But actually, it's good for the world as a whole that the technology of building these models doesn't seem to be as restrictive. A number of open source labs have managed to produce very, very decent open source models.
Roland Siebelink (03:01)
When you say they're just a couple of percentage points behind, can we express that in months or years of delay compared to the leaders in the field? Are they a year behind, just a couple of months behind? Where are they now compared to where, let's say, OpenAI or Anthropic was how long ago?
Nikola Borisov (03:20)
At the moment, my best guess is they're around six months or less behind in quality. And the percentages are not complex things. Basically, when we build these AI models, we can test how good they are by giving them a test, the same way you and I would take an SAT or something like that. Then you see how well the model can answer a set of questions.
There are many different benchmarks; some test coding, some test knowledge, and so on. But if you look across the board, I think the gap between the best closed source models and the best open source models has narrowed to 3-5 points out of a hundred on a lot of these benchmarks. And I think we're about six months behind. This is why we're really excited about inference and infrastructure.
We've reached levels where the models are becoming really, really useful to many people. Obviously, they're disrupting the software development industry massively. But also, almost every other area as well. We believe a lot of time will be spent doing inference and building the right infrastructure for inference is what we're really focused on.
Roland Siebelink (04:47)
You mentioned the term inference, which of course is very common in the AI field. But if you explain that to me like I'm a five-year-old, what does inference mean and how does it differ from the rest of the AI process?
Nikola Borisov (05:00)
Here's a very simple explanation. An AI model has two stages of its life. At some point it gets trained, and that's similar to you and me going to school and learning things. Then, after it finishes training, it basically starts producing work; it starts running.
Running an AI model is called inference because the model essentially predicts something. In our current cases, it predicts the next token, the next line of code. Inference is essentially just another word for running the AI models. And it makes sense to me that a bigger portion of the life of a model will be spent running it rather than training it. If we train a model for a year and then only use it for a month, that's really not great, because you spend all this energy and all these resources training, but you only get the usefulness when you're actually running it, doing inference.
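Nikola's "predict the next token, feed it back in" description is the core of the inference loop. A minimal sketch, with a trivial stand-in lookup table playing the role of the model (no real LLM involved):

```python
# Toy illustration of inference as repeated next-token prediction.
# The "model" here is a fixed lookup table, a stand-in for a real LLM.

def toy_model(tokens: list[str]) -> str:
    """Pretend model: predicts the next token from the last one seen."""
    table = {"def": "add", "add": "(a,", "(a,": "b):", "b):": "<end>"}
    return table.get(tokens[-1], "<end>")

def run_inference(prompt: list[str], max_tokens: int = 10) -> list[str]:
    """Autoregressive loop: each predicted token is appended and fed back."""
    tokens = list(prompt)
    for _ in range(max_tokens):
        nxt = toy_model(tokens)
        if nxt == "<end>":
            break
        tokens.append(nxt)
    return tokens

print(run_inference(["def"]))  # ['def', 'add', '(a,', 'b):']
```

A real model replaces the lookup table with a forward pass over billions of parameters, which is why the same loop, run at scale, dominates GPU costs.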
Roland Siebelink (06:14)
And when you hear about companies like Anthropic that have experienced gigantic growth but did not order all the compute power in time, they are now in that conflict, as I understand it: how much do we spend on training new models? If we do too little, we may fall behind. But on the other hand, we also have to serve all the customers that are looking to use our models.
Nikola Borisov (06:41)
Yeah, I think it's a really natural economic thing to happen. The usefulness of the thing they've already built is now quite high. They generate a lot of value from inference, from running what they've already trained. And the incremental gain from training another model is somewhat lower. You have to strike the balance, for sure. But one thing you're right about: we don't have enough GPUs, enough data centers, enough chips.
We spend a lot of our venture-backed dollars on the actual hardware, because we think these things are essentially small factories for intelligence, and you need them to be really cost efficient as well. You need to own them rather than lease them.
Roland Siebelink (07:28)
Okay, that's actually one other bet that you seem to be making, Nikola, that ultimately the inference of the models will be run by different companies and maybe even different infrastructure than those that are training the models.
Nikola Borisov (07:53)
The reason I think that, at least at some scale, this will be the case: the biggest few players will probably still run their own. But I tried to draw inspiration from another, similar set of specialized clouds: CDNs. CDNs are content delivery networks, the specialized clouds that help you distribute content.
The biggest tech companies sometimes end up building their own; Google has built its own CDN. But some tech companies end up using this as a service from other people. At the end of the day, I always felt that inference would be somewhat similar to the CDNs. There will be a need for a specialized cloud that is just there to run models, and there will be a lot of demand from enterprises and businesses. I think even some of the big players will eventually allow their models to be served by large inference clouds.
Roland Siebelink (09:13)
Yeah, you're drawing the analogy with content delivery networks because I guess none of the people who developed hosting software or who actually write the content are running a CDN by themselves, maybe with the exception of Google.
Nikola Borisov (09:31)
And I think it also makes sense. I'll give you an example of why this made sense to me for content delivery. Companies like Walmart and Target both want their websites to be very fast and load well for all their consumers in each place. But it doesn't make sense for each of them to build their own infrastructure and buy servers in every city.
So the CDN is this neutral ground that does this for both of them. The actual server of a CDN might host images from both Walmart and Target and serve them to the same customer.
Roland Siebelink (10:13)
And neither Walmart nor Target cares.
Nikola Borisov (10:16)
Yeah. It's an efficient way to do this. And I think the same thing applies to inference. We would host, let's say DeepSeek on our GPUs and then multiple companies can access the same model and make requests to it. And their requests will run in parallel onto the same GPUs. That's why I like the CDN analogy a lot.
Roland Siebelink (10:39)
Yeah, of course. But then your moat for Deep Infra, how you compete and keep your advantage sustainable, is a very different hypothesis than the moats the model providers would have.
Nikola Borisov (11:00)
Yes. I spent 10 years in my previous career building backend and infrastructure services, and I wanted to build a company that uses some of the strengths, some of the things we've learned, me and the team, from that previous experience. We feel that to build an efficient inference cloud, you have to own and operate the cards, and you have to produce the tokens efficiently. That means you have to be able to do more tokens per card per hour than others, and there are a lot of other aspects: time to first token, latency. You have to strike a good balance. It's, again, similar to the CDNs. They have to serve a lot of content; that's roughly how they make money, on the number of requests and the number of bytes they push.
Roland Siebelink (11:58)
I think you mentioned you have about 20 people on your team, I understand they're mostly engineers, right?
Nikola Borisov (12:05)
Yes, a very engineering heavy company.
Roland Siebelink (12:11)
The smartest people, who help you bring down your costs. I'm not sure this is for public consumption, but I think you mentioned striving for a 20% cost advantage over some of the competition. Is that in the right ballpark?
Nikola Borisov (12:24)
There is a wide variety of things that I think are competitive with, or alternatives to, what we do. On one hand, we help bring these open source models to production. And the true competitors of the open source models, I think, are the closed source models. They are sometimes almost an order of magnitude more expensive, but as I said, they're smarter, better models.
If you're trying to do something really hard and expensive, let's say replacing or augmenting a software engineer, people are willing to pay the extra 5% because it really saves them time. But there are a lot of tasks where this is not important. Both models are already completing those tasks 99.9% of the time, so you don't need to use the more expensive one in those cases.
Roland Siebelink (13:25)
How do you see the market evolve there at this point in time? Who do you actually serve as customers? Who are your core customers that you talk to and how do you see them respond to these different offerings in the marketplace and where to choose what?
Nikola Borisov (13:41)
I think the market for open source model inference is still developing, because if you're a large enterprise that has not done any inference before, your first steps into this space would probably be using a model from Google or Anthropic.
Our customers, the people we like to focus on, are mostly startups, maybe mid-market companies, but they're building AI-first products. They're not going to add a little bit of AI to their existing SaaS. They are asking the question: now that we have AI, if we wanted to do X, how do we do it? And the AI is 80% of the thing.
Roland Siebelink (14:38)
So the native AI companies, that's what we call them in this space, right?
Nikola Borisov (14:43)
Yeah. To them, inference is really, really important; it's basically the most important thing for their company. And they look for a partner like us that has the infrastructure, the team, the APIs, and the product to hand off some of this execution to, while still maintaining a lot of flexibility: being able to switch to the latest models as they come out, being able to adjust how the models run to meet their needs. That's where we focus.
Nikola Borisov (15:26)
It's not as much on the demand side. One of our advisors likes to say that there's infinite demand for AI.
Roland Siebelink (15:33)
It's the gold rush feeling, isn't it? You have customers coming to you. You don't need to go and hunt for customers? People find you.
Nikola Borisov (15:37)
Yeah. We are one of the largest inference providers on OpenRouter. We have a very wide selection of models. We are a cost leader. We do believe that a lot of customers care about the price per token. If you're doing this AI native thing, it's a big cost for you. It's the biggest cost, but you also want to...
Roland Siebelink (16:03)
Yeah, it's a huge cost.
Nikola Borisov (16:12)
build a product that's useful to someone, and if we give you a lower cost, you can afford more tokens, or, let me put it this way, smarter tokens. What I've seen over the last two years is that the more we lower the cost of particular models, the more customers basically jump to a better, bigger model.
The interesting thing about open source models is that there's a big variety. And once people like a model and build something with it, some of them try to move to newer models, but some of them are happy with it; they just want to run that one. So we think that having a good variety of models available to people is what makes for a good inference platform.
Roland Siebelink (16:51)
I see, okay.
Nikola Borisov (17:00)
That's what we think is important, to have a good selection of the latest models, but also maintain the big models for long periods of time to have stability for the developers building on top of it.
Roland Siebelink (17:18)
Is it part of your offering to help developers select the right model, or do you just offer all of them and let developers find their own way to the one that fits them best?
Nikola Borisov (17:31)
Not in the way where we build some black box that picks the model for you. I think the sophisticated teams are very particular about the models they use. We try to mostly just educate people about the different models. We believe they're intelligent enough and they have their own reasons; some of them fine-tune particular models and run custom versions.
I sometimes worry that there is too much selection and developers get a little bit overwhelmed. But this is how open source things work. There is variety in them.
Roland Siebelink (18:15)
Yeah, is it a bit of a function of what target group you have in mind? You seem to be really targeting the people who are already specialists in AI who come with the knowledge.
Nikola Borisov (18:31)
It's a bit of self-selection. We are good at producing tokens at scale. And so the people who care about that also come in and look for that.
Roland Siebelink (18:43)
I love the business proposition and I love how successful you guys are. How big could this become?
Nikola Borisov (18:51)
I'm not saying this just as a founder of an inference cloud. I do believe that five, ten years from now, a lot of companies' budgets will be mostly dominated by inference. It's just the nature of the models. Instead of hiring a thousand people, now you can maybe do it with 200, and then the other spend goes into inference. This could get really big. For our particular success, I guess we just need a continued supply of open source models.
Roland Siebelink (19:20)
That the market doesn't shut off, in a way; that it's not all taken back in-house by the three great players.
Nikola Borisov (19:41)
And my worst case, I think, is that open source models are 10-20% of the overall business. The best case is that they're 80%. Either way, there's going to be a good amount of area to explore, even if we're stuck at 20% open source versus closed source.
Roland Siebelink (19:53)
Yeah, so that's the big bet you're making.
That's very good. Last two questions. One is tell us a little bit more about what led you to become this successful founder. You already mentioned you grew up in Sofia, Bulgaria. You were a math wizard as a child. Tell me a little bit more about what you were like as a child and a teenager. What would have predicted that you were going to be a successful founder one day?
Nikola Borisov (20:29)
I liked math, and that's how I got into programming. I did a lot of programming competitions in Bulgaria, and I enjoyed the certainty of writing a small piece of code, a small program, that can take some inputs and produce some outputs, and you can test it and make sure it's actually correct and fast. I ended up in the US a little bit randomly. My sister applied to study in the US, and once she went, my parents said, she's very happy there, it's very nice, why don't you apply? Okay, I will.
I ended up in the US. I went to Northwestern. Interestingly enough, my sister basically graduated and went back to Bulgaria, but I ended up staying and coming to the Bay Area. I did three internships, two at a startup and one at Microsoft. And I really loved working at the startup. I was their first intern the first summer; the next summer there were a few other people.
As things changed, whatever I did, I felt like I was actually contributing to the product more. When I had my big tech internship, I was given a project. It was an interesting project, but at the end of the day, it was more of a research thing, not something that actually ended up in the product.
Roland Siebelink (21:58)
Last question, Nikola, what would be your advice to somebody who is behind you in their journey and starting a company? Maybe they just started, maybe they're just figuring out where to focus their time and energy. What's your best startup advice to younger founders behind you?
Nikola Borisov (22:31)
Actually, it's quite hard to build a startup. Even if things go really well, it's still a lot of work and a lot of uncertainty. You just have to be okay with the uncertainty. Startups need to try things quickly.
But you also have to think, all the time, about what's the most important thing for us to do today, and focus on that.
Roland Siebelink (23:00)
Super prioritization. I love it. Very good. Nikola, where can people reach you if they want to hear more about Deep Infra or just to be in touch?
Nikola Borisov (23:10)
Yeah, I'm not the best at social networking, but check our website. You can drop us an email. You can find us on X.
Roland Siebelink (23:23)
Yeah, very good. When people are already in touch with me and want an introduction to Nikola, I'm of course also happy to provide one. Nikola, this was amazing. Thank you for joining the Scaling Without Breaking podcast. And for the audience, we will have another amazing guest for you next week. Keep scaling without breaking.