The AI Software Platform Shift
While AI model makers work on how to make the best LLMs, the rest of us try to make the most of the AI models we have.
What kinds of applications can be made with AI? How best to use AI models to make them? What design principles and dev toolsets are needed to build good AI-based software? These are the questions software developers are grappling with as they build applications on AI.
This article offers some guidance on these questions and reviews the ‘Schillace Laws’ of Semantic AI.
To answer these questions, we should start with the understanding that AI is software. AI in software applications creates new capabilities and new application types, but software engineering principles still apply.
AI changes everything in software, and here is how:
AI is a ‘platform shift’ for software. This AI shift is as important as prior shifts, from mainframe to PC, to internet, to mobile, and to cloud. AI-based applications are as novel and different as internet SaaS apps were to PC software.
AI expands the universe of possible applications. Software has traditionally been for precise, factual calculations on datasets large or small; precise computation both defined and limited the reach of software. AI expands what computers can do to include creative outputs from imprecise and general inputs. Combining AI and traditional software will yield many new types of applications.
AI introduces the natural language prompt as a major new interface, enabled by the natural-language understanding inherent to LLMs.
The Origin of Schillace Laws
To drill down further, we will review and update an earlier effort to characterize how AI as software behaved and what it meant for developers: The Schillace Laws.
Sam Schillace, currently a deputy CTO at Microsoft, is the innovator who created the collaborative writing tool Writely in 2005, which became Google Docs. He went on to a career at Google, Box, and now Microsoft, where he got an early peek at GPT-4. He and his colleagues turned that experience into observations called the “Schillace Laws of Semantic AI.”
Schillace and others at Microsoft understood that AI as a new capability needs new dev tools to support it, so they also developed Semantic Kernel, a toolkit similar to LangChain for integrating AI capabilities into applications.
Sam spoke in a recent Latent Space podcast about Microsoft’s perspective on AI, his experience with GPT-4, how these “laws” came out of that, and his current thoughts and understanding of AI as software.
Code vs. Models: Prioritize AI over code
We’ll further review the Schillace Laws and give an update on AI application development lessons we’ve learned since then. The Schillace Law text is in italics, and our own commentary follows each point.
Don’t write code if the model can do it; the model will get better, but the code won't.
The overall goal of the system is to build very high leverage programs using the LLM's capacity to plan and understand intent. It's very easy to slide back into a more imperative mode of thinking and write code for aspects of a program. Resist this temptation – to the degree that you can get the model to do something reliably now, it will be that much better and more robust as the model develops.
Lesson: All software should be AI-first. Refactor software around what AI can do, and make AI models carry whatever load they can. Fall back to “imperative” programming as a last resort.
Regarding calculations, you don’t need to hard-code specifications in a precise programming language yourself: the LLM can write the code, run it, and return the calculation result you want; see the next item.
Code is for syntax and process; models are for semantics and intent.
… models are stronger when they are being asked to reason about meaning and goals, and weaker when they are being asked to perform specific calculations and processes. For example, it's easy for advanced models to write code to solve a sudoku generally, but hard for them to solve a sudoku themselves. … The boundaries between syntax and semantics are the hard parts of these programs.
While models are good at assisting with brainstorming and goal-oriented tasks, and can do so from general, open-ended prompts, robust planning with LLMs remains a challenge.
We’ve learned that getting AI models to perform hard tasks well is usually a matter of better prompting. For example, “think step-by-step” prompts and asking the AI to review and verify its answers help with reasoning.
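As a rough illustration, here is a minimal sketch of both techniques in Python, assuming the OpenAI SDK; the `llm` helper, model name, and prompt wording are our own illustrative choices, not part of the Schillace Laws:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm(prompt: str) -> str:
    """Send a single prompt to a chat model and return its text reply."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # any capable chat model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def solve_with_reasoning(question: str) -> str:
    # 1. Ask the model to reason step by step before committing to an answer.
    draft = llm(
        f"{question}\n\nThink step by step, then give your final answer "
        "on the last line, prefixed with 'ANSWER:'."
    )
    # 2. Ask the model to review and verify its own reasoning.
    return llm(
        "Review the reasoning below for mistakes. If it is correct, restate "
        "the final answer; if it is not, give the corrected answer.\n\n" + draft
    )
```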
Thanks to the addition of functions to LLMs via tools like Code Interpreter (aka Data Analyst), there is a useful design pattern for many difficult problems involving a mix of reasoning, goals, and calculation:
Prompt “Solve X” → LLM → generate code to solve X → execute code → answer.
For example, asking for the “sum of prime numbers up to 100” took ChatGPT only a few seconds to resolve. It could not answer the question directly, but it could write a program that could, and then run it.
This approach can solve fairly complex data analyst tasks, such as statistical analyses over large datasets.
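For illustration, the program the model writes and runs for the prime-number query can be as simple as the following (our own reconstruction, not ChatGPT’s actual output):

```python
# The kind of program the model generates and executes for
# "sum of prime numbers up to 100" instead of answering directly.

def is_prime(n: int) -> bool:
    if n < 2:
        return False
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return False
    return True

print(sum(n for n in range(2, 101) if is_prime(n)))  # prints 1060
```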
Dealing With AI Software Challenges and Errors
The system will be as brittle as its most brittle part.
This goes for either kind of code. Because we are striving for flexibility and high leverage, it’s important to not hardcode anything unnecessarily. Put as much reasoning and flexibility into the prompts and use imperative code minimally to enable the LLM.
This “weakest link” truism describes all software systems; AI models are no exception. Are LLMs brittle? Yes, in multiple ways:
Sensitivity to inputs: Different prompts attempting the same task can yield different results, hence the suggestion to put reasoning and flexibility into the prompts.
Lack of reliability: Input sensitivity plus an opaque, stochastic system yields AI that is inherently unreliable. Lowering the temperature on AI models aids predictability; wrapping LLMs with guardrails can reduce harmful output; RAG and other knowledge connections help avoid hallucinations. All of these help overcome LLMs’ reliability weaknesses.
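For instance, with the OpenAI Python SDK, temperature is a single request parameter (a sketch; the model name and prompt are illustrative):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

# Lower temperature reduces sampling randomness, so repeated calls with the
# same prompt give more consistent (though not fully deterministic) results.
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": "Classify this support ticket: 'App crashes on login.'"}],
    temperature=0,  # favor predictability over creative variation
)
print(resp.choices[0].message.content)
```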
Ask Smart to Get Smart.
Emerging LLM AI models are incredibly capable and "well educated" but they lack context and initiative. If you ask them a simple or open-ended question, you will get a simple or generic answer back. If you want more detail and refinement, the question has to be more intelligent. This is an echo of "Garbage in, Garbage out" for the AI age.
In other words, prompts matter. Crafting good prompts, i.e., prompt engineering, is worthwhile.
Trade leverage for precision; use interaction to mitigate.
Related to the above, the right mindset when coding with an LLM is not "let's see what we can get the dancing bear to do," it's to get as much leverage from the system as possible. For example, it's possible to build very general patterns, like "build a report from a database" or "teach a year of a subject" that can be parameterized with plain text prompts to produce enormously valuable and differentiated results easily.
A data science truism is to use the simplest ML model for the job. Throwing an LLM at simple tasks may be overkill. However, the point here is that you have a more general solution if you use a more general tool; you can parameterize and generalize what you do to create a more powerful feature.
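As a sketch of the “build a report from a database” idea, a general prompt template can be parameterized with plain text; the template and the `llm` helper (from the earlier sketch) are illustrative:

```python
# A general prompt pattern, parameterized with plain text.
REPORT_PROMPT = (
    "You are a reporting assistant.\n"
    "Using the data below, write a {length} report for {audience}, "
    "focused on {focus}.\n\nDATA:\n{data}"
)

def build_report(data: str, audience: str, focus: str, length: str = "one-page") -> str:
    # llm(prompt) -> str: the single-prompt helper from the earlier sketch.
    return llm(REPORT_PROMPT.format(
        data=data, audience=audience, focus=focus, length=length))

# The same general pattern yields very different, valuable results:
#   build_report(sales_csv, audience="executives", focus="quarterly trends")
#   build_report(sales_csv, audience="engineers", focus="data quality issues")
```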
Uncertainty is an exception throw.
Because we are trading precision for leverage, we need to lean on interaction with the user when the model is uncertain about intent. Thus, when we have a nested set of prompts in a program, and one of them is uncertain in its result ("One possible way...") the correct thing to do is the equivalent of an "exception throw" - propagate that uncertainty up the stack until a level that can either clarify or interact with the user.
This point says that unexpected LLM output, like uncertainty, should be treated as an error. How to handle it depends on the AI application (a creative mode might accept some uncertainty), but in a nested AI system, an ‘exception throw’ for erroneous output is a useful mechanism.
More generally, LLM error conditions can be triggered by any output that is erroneous, faulty, or fails to respond “helpfully and harmlessly.” Not just uncertainty, but other erroneous, non-responsive (“I can’t do that”) or harmful responses can also lead to errors.
AI software systems need error-handling capabilities, and the typical solution is LLM guardrails that divert flawed, harmful or jailbreaking input, and catch-and-throw incorrect outputs into error status.
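A minimal sketch of this catch-and-throw approach, with our own illustrative hedge/refusal phrases, error type, and the `llm` helper from the earlier sketch:

```python
# "Uncertainty is an exception throw": uncertain or non-responsive model
# output is raised as an error and propagated up to a level that can ask
# the user to clarify.

HEDGES = ("one possible way", "i'm not sure", "it depends")
REFUSALS = ("i can't", "i cannot", "i am unable")

class UncertainResultError(Exception):
    """Raised when the model's output is uncertain or non-responsive."""

def run_prompt(prompt: str) -> str:
    answer = llm(prompt)  # single-prompt helper from the earlier sketch
    lowered = answer.lower()
    if any(phrase in lowered for phrase in HEDGES + REFUSALS):
        raise UncertainResultError(answer)
    return answer

def top_level(task: str, ask_user) -> str:
    try:
        return run_prompt(task)
    except UncertainResultError as err:
        # Propagated up the stack: clarify with the user and retry once.
        clarification = ask_user(f"The model was unsure: {err}. Can you clarify?")
        return run_prompt(f"{task}\n\nClarification: {clarification}")
```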
AI Application Design Patterns and UIs
Text is the universal wire protocol.
Since the LLMs are adept at parsing natural language and intent as well as semantics, text is a natural format for passing instructions between prompts, modules and LLM based services. Natural language is less precise for some uses, and it is possible to use structured language like XML sparingly, but generally speaking, passing natural language between prompts works very well and is less fragile than more structured language for most uses.
Natural language prompting is the primary interface. It serves as a user interface, but it is also useful for LLM-to-LLM communication, for example in AI agent flows. Add structure and precision to LLM-to-LLM interfaces as needed to improve comprehension.
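A small sketch of plain text as the “wire protocol” between two prompt steps (the prompts and the `llm` helper are illustrative):

```python
def research_then_summarize(topic: str) -> str:
    # Step 1: produce notes in natural language.
    notes = llm(f"List the key facts a reader should know about {topic}.")
    # Step 2: pass those notes, as plain text, straight into the next prompt.
    return llm(
        "Write a three-sentence summary for a general audience, based on "
        f"these notes:\n\n{notes}"
    )
```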
While the natural language interface is expressive and general, its lack of precision is a weakness. User interfaces (UIs) that mix natural language prompts with precise input modes obtain both general and precise user intent, a ‘best of both worlds’ UI.
Hard for you is hard for the model.
One common pattern when giving the model a challenging task is that it needs to "reason out loud." This is fun to watch and very interesting, but it's problematic when using a prompt as part of a program, where all that is needed is the result of the reasoning. However, using a "meta" prompt that is given the question and the verbose answer and asked to extract just the answer works quite well. … So, when writing programs, remember that something that would be hard for a person is likely to be hard for the model, and breaking patterns down into easier steps often gives a more stable result.
We are still learning to assess how well an AI model will do on a particular task. AI models can surprise us with what they can do well, but it’s always the case that breaking a task into simpler smaller tasks is easier, both for humans and AI.
A number of design patterns helpful to AI-centric software have emerged: retrieval-augmented generation (RAG); prompt chaining; ‘think step-by-step’ prompting; function-calling (tool use); code execution; etc. All of these improve AI results by breaking the problem up and leveraging other tools or AI calls to solve parts of it.
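As one example, function-calling (tool use) lets the model decide when to hand precise work to imperative code; here is a sketch with the OpenAI Python SDK, where the tool definition and model name are illustrative:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

# Describe a precise, imperative function the model may choose to call.
tools = [{
    "type": "function",
    "function": {
        "name": "sum_primes_up_to",  # illustrative tool name
        "description": "Sum all prime numbers up to and including n.",
        "parameters": {
            "type": "object",
            "properties": {"n": {"type": "integer"}},
            "required": ["n"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the sum of primes up to 100?"}],
    tools=tools,
)

msg = resp.choices[0].message
if msg.tool_calls:  # the model chose structured tool use over answering directly
    call = msg.tool_calls[0]
    args = json.loads(call.function.arguments)
    print(call.function.name, args)  # e.g. sum_primes_up_to {'n': 100}
```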
Beware "pareidolia of consciousness"; the model can be used against itself."
It is very easy to imagine a "mind" inside an LLM. But there are meaningful differences between human thinking and the model. An important one that can be exploited is that the models currently don't remember interactions from one minute to the next. … we can "use the model against itself" in some places – it can be used as a safety monitor for code, a component of the testing strategy, a content filter on generated content, etc.
Using the AI model to check its own work is an effective AI design pattern.
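A sketch of this pattern as a content filter, where a second call judges the first call’s output (the pass/fail convention and `llm` helper are illustrative):

```python
def generate_with_filter(request: str) -> str:
    # First call: generate content. llm() is the single-prompt helper.
    draft = llm(request)
    # Second call: the model reviews its own output as a content filter.
    verdict = llm(
        "You are a content reviewer. Answer only PASS or FAIL: does the "
        "following text contain harmful, off-topic, or policy-violating "
        f"content?\n\n{draft}"
    )
    if verdict.strip().upper().startswith("FAIL"):
        raise ValueError("Generated content rejected by the model's own review.")
    return draft
```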
The AI Software Era
Even though we are still learning the ways of AI software, these “laws” are good guideposts for software developers building AI applications from AI models.
The top-line take-away is that we are undergoing a platform shift to AI software. In this AI era, software applications will be AI-centric, and the best applications will leverage AI capabilities for natural language understanding and semantic reasoning, while using procedural software to address AI’s weaknesses.
New creative AI applications, AI-enabled user experiences, and AI devices are being introduced every day. Applications not yet invented will define the limits of what AI-based software can do, and we haven’t reached those limits in the capabilities of either AI models or AI-centric software.