I get asked a lot whether specialized AI is still relevant in the age of frontier models. My thoughts run further ahead. With more powerful generalized AI rolling out in the coming months and years, what strategy should we take? As an AI-focused organization, what will enable us not just to survive, but thrive?
The enterprise AI conversation has been dominated by capability. Get AI in. Move fast. Don't get left behind. Most organizations have crossed that threshold. The question has shifted from whether to use AI to how to use it well: moving from experimentation to genuine ROI, and understanding which AI to trust and how to deploy it safely.
At the heart of it all, organizations want to solve the AI productivity paradox. Why doesn't increased AI adoption simply translate into measurable outcomes?
Perhaps it's because the conversation has become too centered on capability. The implicit assumption is that as models become more capable, outcomes will follow. That's proving to be an oversimplification.
Why Generalist Models Create a False Ceiling
The rapid advancement of large language models has done something subtle to how people think about AI selection. Because these systems perform impressively across such a wide range of tasks - summarizing, coding, reasoning, generating - there's a creeping assumption that one model architecture can handle almost any business problem.
In some ways, that's true. The missing piece is whether a single model can solve a given problem optimally.
Today's attention is on these exciting general-purpose models, with their incredible flexibility and seemingly limitless applications. What many organizations are temporarily forgetting is that businesses aren't built on novelty. They're built on repeatable processes that power top-line growth, protect against churn and assure consistent behavior.
Where organizations can build repeatable systems, the successful ones do so eagerly.
General-purpose AI and LLMs are necessary when including unstructured text and documents in these processes. For everything else, including image recognition, specialized AI and well-designed software are superior on every measure. This will remain the case, even as frontier models become more powerful in novel research, writing elegant bug-free software and providing flawless advice.
The goal of generalized AI is very different from the goal of enterprise AI operating inside production systems: execution of repeatable tasks that are optimized in every sense – accuracy, repeatability, cost, efficiency, scalability and explainability. In these systematic processes, organizations don't measure AI success by the breadth of tasks a model can perform. They measure it through outcomes: reduced costs, lower risk, faster decisions, fewer errors. When evaluated through that lens, versatility is only valuable if it comes without a performance penalty on the tasks that actually matter.
If you are building systems that repeatedly ask AI the same question using well-defined inputs, it's worth asking a simple question. Is there a better way?
The Role of Training Data and Validation
This performance gap on specific tasks doesn't emerge by accident. It's a consequence of how models are built.
We recently put this to the test, benchmarking our computer vision models against several leading foundation models on identical property analysis tasks. The performance gap was significant, but the more revealing finding was what drove it. Domain-specific training data, clearly defined objectives and rigorous real-world validation mattered far more than model size or general capability.
Specialized AI models are built with the specific goal of minimizing error on a precisely defined task. If the goal of the task is defined and a set of curated examples is provided, deep learning models can reach, and even exceed, human performance. For example, a model trained to detect swimming pools in aerial imagery can achieve exceptional accuracy because it's solving one well-defined problem. As with most enterprise AI applications, success depends as much on the quality of the training data and validation process as it does on the model itself.
General-purpose models, by contrast, are optimized to learn patterns across huge volumes of diverse text and images. That breadth is the point, but it means they don't have the laser-focused precision of a model built for a single task.
Validation methodology matters as much as training data. Overreliance on public benchmark datasets introduces systematic bias in how performance is reported. Organizations evaluating AI for specialized workflows should ask whether the benchmark data actually resembles their operating environment. In the real world, what matters is whether the model gets the right answer under real conditions, not whether it can spot an object in an ideal image.
What Rigorous Evaluation Actually Looks Like
The bar for AI procurement needs to rise. The questions worth asking vendors are specific: How was the model benchmarked? Was it evaluated on representative data? How does it perform across edge cases and different operating environments? How often is it revalidated? Can its performance claims be independently verified?
These questions matter because performance gaps tend to surface after deployment, not before. A model that performs well under ideal conditions or on a few cherry-picked examples can struggle significantly when it encounters the variability of production environments: different geographies, seasonal changes, data quality inconsistencies, and rare but consequential edge cases.
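One way to make that variability visible is to report results per operating environment rather than as a single headline number. The sketch below is a minimal illustration, not a description of any particular vendor's process: the function name, the slice labels and the handful of sample records are all hypothetical, and a real evaluation would use a much larger, representative labeled set.

```python
from collections import defaultdict


def metrics_by_slice(records):
    """Compute precision and recall separately for each slice.

    `records` is an iterable of (slice_name, predicted, actual) tuples,
    where `predicted` and `actual` are booleans for a single yes/no task
    (for example, "is there a swimming pool in this image?").
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for slice_name, predicted, actual in records:
        if predicted and actual:
            key = "tp"
        elif predicted and not actual:
            key = "fp"
        elif actual:
            key = "fn"
        else:
            key = "tn"
        counts[slice_name][key] += 1

    report = {}
    for name, c in counts.items():
        predicted_pos = c["tp"] + c["fp"]
        actual_pos = c["tp"] + c["fn"]
        report[name] = {
            "n": sum(c.values()),
            # None signals "undefined" when a slice has no positives.
            "precision": c["tp"] / predicted_pos if predicted_pos else None,
            "recall": c["tp"] / actual_pos if actual_pos else None,
        }
    return report


# Hypothetical sample data: (slice, model prediction, ground truth).
records = [
    ("region_a_summer", True, True),
    ("region_a_summer", True, True),
    ("region_a_summer", False, False),
    ("region_b_winter", True, False),
    ("region_b_winter", False, True),
    ("region_b_winter", True, True),
]

report = metrics_by_slice(records)
for name, m in sorted(report.items()):
    print(name, m)
```

In this toy data the model looks strong in one slice and much weaker in the other, a difference that a single pooled accuracy figure would blur. The same breakdown can be rerun on fresh data at each revalidation.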
Continuous validation is how organizations build trust in AI systems over time. The vendors worth working with are transparent about where their models perform well and where limitations exist.
General-Purpose and Specialized AI Are Complementary, Not Competing
None of this is an argument against foundation models. It's an argument for clarity about where each type of AI delivers value.
General-purpose AI will continue transforming knowledge work, software development, customer service and countless other applications. I expect these models to become dramatically more capable over the coming years. But I don't believe they'll replace every form of specialized AI. I think the future belongs to organizations that understand where each approach creates value.
Traditional software will continue solving deterministic problems. Purpose-built AI will continue excelling in specialized workflows where accuracy, repeatability and efficiency matter most. The organizations that understand where each approach fits will be the ones that get the most from both.