AI Is Eating the Software World, Pt 1
AI For Coding - Reviewing CoPilot, Code Interpreter and the challenge of Code Llama
Introduction
As AI becomes more powerful, the ability of AI to write software has become one of its most powerful and interesting use cases. Consider this: If the only practical application of today’s Foundational AI Model-based Artificial Intelligence was a doubling of software productivity, AI would still be one of the most important technologies ever developed.
Last week, we saw an important step forward in AI for coding when Meta released Code Llama. Code Llama comes in multiple sizes and flavors: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct), each available with 7B, 13B, and 34B parameters. DrJimFan on Twitter has a Code Llama summary.
Open Source AI for Coding Leaderboard
Code Llama changed the landscape for AI coding models. Before, open source AI models for coding weren’t competitive, and the best coding AI models weren’t open source. Now the Code Llama 34B models are competitive with GPT-3.5 on the HumanEval benchmark: Code Llama - Python achieves 53.7 versus 48.1 for GPT-3.5, though both still trail GPT-4 (67.0).
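For context, HumanEval scores a model on completing short Python functions from a signature and docstring, then running the completion against unit tests; pass@1 is the share of problems solved on the first attempt. The problem below is illustrative of that style, not an actual benchmark item.

```python
# Illustrative HumanEval-style problem (not an actual benchmark item).
# The model is given the signature and docstring and must generate the body;
# hidden unit tests then check the result.

def running_average(values: list) -> list:
    """Return the running (cumulative) average of a list of numbers.

    >>> running_average([1.0, 2.0, 3.0])
    [1.0, 1.5, 2.0]
    """
    # One completion the tests would accept:
    averages = []
    total = 0.0
    for count, value in enumerate(values, start=1):
        total += value
        averages.append(total / count)
    return averages
```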
Quickly following the release of Code Llama, Phind fine-tuned CodeLlama-34B to further optimize it and reported beating GPT-4 on HumanEval.
HuggingFace has a leaderboard for the top open source AI coding models. Code Llama and its fine-tuned derivatives, such as Phind’s model and the latest Code Llama-based WizardCoder, dominate the leaderboard:
Reviewing the Pioneer AI Coding Models
Before diving deeper into these new AI coding models, though, we should review how we got here and the status of the reigning AI coding assistant leaders.
Using LLMs to interpret and write code was one of the first applications developed for them, and two contemporary pioneering AI-for-coding models stand above the rest: OpenAI’s Codex, which powered GitHub CoPilot, and the ChatGPT Code Interpreter plug-in.
GitHub CoPilot
After the release of GPT-3, a derivative model specialized as a coding assistant called Codex was developed by OpenAI and released in August 2021. OpenAI announced that: “Codex powers Copilot, a GitHub service launched earlier this summer that provides suggestions for whole lines of code inside development environments like Microsoft Visual Studio.”
Codex has served as the grand-daddy of LLM-based coding models. GitHub’s CoPilot has continued to evolve and improve from the original OpenAI Codex base; in March of this year, GitHub announced CoPilot X, leveling it up to a GPT-4 base.
GitHub CoPilot works as a programming assistant, making suggestions as you code and following instructions written in comments. Delivered as an editor extension, it operates as the name suggests: a co-pilot sitting beside you to help with various coding tasks. Trained on source code for several different languages, it is able to help in many contexts.
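As a hypothetical illustration of that comment-driven workflow (the function, date format, and suggestion here are made up, and real suggestions vary), the developer writes a comment and a signature, and CoPilot proposes the body inline:

```python
from datetime import datetime

# The developer writes a descriptive comment and a signature;
# CoPilot proposes the body inline (actual suggestions vary in practice).

# Parse an ISO-8601 date string like "2023-08-24" and return the weekday name.
def weekday_name(date_string: str) -> str:
    return datetime.strptime(date_string, "%Y-%m-%d").strftime("%A")

print(weekday_name("2023-08-24"))  # Thursday
```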
ChatGPT: Code Interpreter
As we discussed in a prior article, OpenAI’s Code Interpreter is a powerful tool with many useful applications. It can write short snippets of code, accept input files and perform any manner of analysis with code, and it is tied to the most powerful LLM available. In effect, it takes ChatGPT’s native ability to write code and extends it to run, and even self-review, the code it generates.
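As a rough illustration of the kind of code it writes and runs behind the scenes, a request like “summarize this CSV of sales data” might lead Code Interpreter to generate and execute something along these lines (the file name and columns here are hypothetical):

```python
import pandas as pd

# Hypothetical uploaded file and column names, for illustration only.
df = pd.read_csv("sales_data.csv")

# Overall shape and summary statistics for the numeric columns.
print(df.shape)
print(df.describe())

# Total and average revenue per region, largest first
# (assumes the file has 'region' and 'revenue' columns).
summary = (
    df.groupby("region")["revenue"]
    .agg(["sum", "mean"])
    .sort_values("sum", ascending=False)
)
print(summary)
```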
CoPilot and Code Interpreter Capabilities
With the release of GPT-4, these two contemporary models - CoPilot and Code Interpreter - stand on the shoulders of the best-in-class LLM out there and provide very good capabilities as a coding assistant.
Still, GPT-4 level AI for coding is incapable of sitting in the pilot’s seat. It is at most an assistant. So overall, what is it best used for?
Known Algorithms: Coding tools are good for implementing well-known algorithms. Since they have learned to code from online text and code, popular and frequently implemented algorithms are in the training set and well known to the AI model. For example, asking for a merge sort or another standard function is easier for a developer than searching and cut-and-pasting or typing it in by hand (see the sketch after this list).
Debugging: Coding tools are able to catch small logic errors, misspellings, and syntax errors, including improper indentation in Python. This is particularly useful for novice programmers learning a language. Code Interpreter is better at this than CoPilot, since the latter is built more for code suggestions than corrections.
Learning: Coding tools that can make suggestions may help new programmers learn to code far more quickly. However, it is important to make sure the training wheels don’t become crutches; a novice who becomes over-reliant on these tools to code for them may have more trouble understanding programming.
Predictive Suggestion: CoPilot in particular is built to suggest code that speeds up your process; these tools can predict what you’ll try to write next. This is similar to top results appearing as you type into a search bar, or text completion in email tools and on your smartphone. CoPilot suggests whatever is most likely to come next given the context and its training data. This is useful in many cases, but it can also get in the way when CoPilot makes incorrect assumptions. As language models become more powerful, these suggestions will become more useful and accurate.
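To make the “known algorithms” point above concrete, asking an assistant for a standard merge sort typically yields a textbook implementation along the lines of this sketch (details will vary by model and prompt):

```python
def merge_sort(items: list) -> list:
    """Sort a list with the classic merge sort algorithm (O(n log n))."""
    if len(items) <= 1:
        return items

    # Recursively sort each half.
    mid = len(items) // 2
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])

    # Merge the two sorted halves.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


print(merge_sort([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9]
```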
Limits of AI Coding Tools
These AI coding tools are still unable to handle complex projects and are best tasked with small pieces of a larger problem. Code Interpreter, for example, tends to lose consistency over long projects, and so is limited to shorter tasks. There’s a reason the tool is called “CoPilot” and not “Pilot.” The AI is not ready for self-driving, stand-alone programming, and attempting to use it that way often ends in failure.
When given a prompt for a task that may be outside its range, you can often overload the AI tool’s logic. In CoPilot’s case, this often looks like repetition of faulty logic. In this example, it cannot resolve such an underspecified prompt on its own, which causes it to loop far past the requested two buttons as it keeps generating more of the design.
Infinite Loops: If the initial prompt isn’t clear enough or is too niche, Code Interpreter and CoPilot may hit a dead end and loop on their own logic. This can be the result of an unspecific prompt or of limitations of the model. In this example, the prompt was specific, but because it was so niche, CoPilot was unable to find a stopping point.
Language support: Some sophisticated AI models, such as Code Interpreter, are restricted to Python. This is a result of scope and training data. We are finding that a specialized AI model for a specific language can sometimes outperform a general AI model that covers multiple programming languages. On the other hand, GitHub’s CoPilot has been trained on source code for many languages and benefits from that, supporting its users more adequately across many programming languages.
Conclusion
CoPilot and Code Interpreter paved the way for using powerful AI models to assist in programming, drawing on existing coding examples in training data and the power of LLMs to generate code.
With Code Llama, we now have highly capable open source AI models for coding as well. The flexibility of open source combined with the power of AI will open up even more variation and capability in AI models for coding. For example, we are likely to see coding assistant models you can run locally, perhaps as language-specific plug-ins that make your IDE (such as Visual Studio Code) more capable.
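As a rough sketch of what running a coding assistant locally can look like today, the Code Llama checkpoints published on Hugging Face can be loaded with the transformers library, hardware permitting; the checkpoint ID and prompt below are just one example configuration:

```python
# Sketch: local code completion with a Code Llama checkpoint from Hugging Face.
# Assumes a recent transformers release (plus accelerate for device_map="auto")
# and enough GPU/CPU memory for the chosen model size.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-7b-hf"  # example checkpoint; 13B and 34B variants also exist

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = 'def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```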
These AI models for coding are limited to the role of capable coding assistants: they can complete simple and standard programming tasks, but fall down in niche cases and have limits on the complexity of tasks they can handle. Yet that is enough for significant utility.
For novice and learning programmers, the ability to extend one’s understanding or learn from the AI model is a powerful boost. For advanced programmers, the best use case for AI code assistants is as a productivity enhancer to speed development and reduce coding of boilerplate.
Beyond just coding, AI models are also being developed to manage the whole of the software development life-cycle: Requirements, design, code optimization, testing, release processes, CI/CD and deployment, and more. This flywheel of productivity-enhancing AI tools will build upon itself and make the entire software development ecosystem much better than it was before.
As stated, these AI models are not at the level of capability required to fully implement large projects that take teams of programmers to design and build. Yet this will continue to improve as AI models improve, and in the near future they may be able to build impressive programs on their own. Advancing AI model capability will get us there.
In a followup article, we will take a look at Code Llama and other open source AI models for coding and see how they perform.