Running Your Own AI Models
DIY ways to run AI Models: Llama-cpp for AI on CPUs, Web UIs for AI models, local GPUs, and custom AI models in the cloud
The AI revolution is not about running just one massive foundational AI model through a chatbot interface. It is hundreds, nay, thousands of AI models, large and small, general-purpose and specialized, running in many different environments, in different ways, and in different applications. It's a Cambrian explosion of Artificial Intelligence.
The recent Apple product announcements, quietly embedding more AI under the hood of the Apple Watch 9 and iPhone 15, are an example of how AI is being woven into existing products. AI will be embedded in products from practically every tech company, just as various other forms of digital intelligence already are.
In our prior article, “Fantastic AI Models And Where to Find Them”, we described access to many AI models, most of them reached through a chatbot-style text interface. For most consumers, their AI experience will be AI embedded in applications or a chatbot-style interface, with the AI served from the cloud.
What are your options, though, if you want to go beyond that? If you are a developer, hacker, or ‘power user’ who simply wants to try out some of these other models, what can you do? That’s what this article covers. There are various ways to ‘roll your own’ AI model; here are a few options:
Run an AI model locally on the CPU with llama-cpp.
Run an AI model locally using an on-board GPU.
Connect a local web interface to your own AI models, hosted locally on a CPU or GPU, or on a cloud-hosted GPU.
Running AI Locally on CPU
Llama-cpp: If you have enough memory on your own PC or laptop, you can try llama-cpp, a port of AI model inference to plain C/C++.
Llama-cpp was developed by Georgi Gerganov with the initial goal of getting Llama models to run on a MacBook. Since then, it has improved and evolved, and it currently provides a framework supporting a number of AI models. It originally used the GGML file format but has moved to GGUF, described as “an extensible, future-proof format which stores more information about the model as metadata.”
To use llama-cpp you need to: git clone llama.cpp to your local machine and build it; download the desired AI model (in GGUF format) from HuggingFace; then set up and run the model. This guide offers a script that takes you through the steps needed:
# 1. git clone the llama.cpp repository and build it
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make

# 2. Pick a GGUF model on HuggingFace and download it if not present
#    (example: the Llama-2 7B model used below; adjust FILE to the quantization you want)
REPO_ID="TheBloke/Airoboros-L2-7B-2.2-GGUF"
FILE="airoboros-l2-7b-2.2.Q4_K_M.gguf"
mkdir -p models
[ ! -f models/${FILE} ] && curl -L "https://huggingface.co/${REPO_ID}/resolve/main/${FILE}" -o models/${FILE}

# Set a welcoming prompt
PROMPT="Hello! Need any assistance?"

# 3. Run the model in interactive mode with the specified parameters
./main -m ./models/${FILE} -p "${PROMPT}" --color --ctx_size 2048 -n -1 \
  -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1.1 -t 8
I used llama-cpp to run a Llama-2 7B model (TheBloke/Airoboros-L2-7B-2.2-GGUF) on the CPU; the model took up 7GB and gave interesting, passable results on my older 32GB PC running Linux. It was a pokey 4 tokens per second, but modern MacBooks equipped with GPUs can do much better and run larger models.
Web Interfaces To Run AI Models
There are a number of deployment options beyond llama-cpp that support GGUF, including several web UI and GUI interfaces for running models locally:
text-generation-webui: This allows running AI models in a local web UI. It’s the most widely used web UI, with many features and powerful extensions.
LM Studio is a GUI front-end for locally-run models with GPU acceleration on both Windows and macOS.
KoboldCpp is a web UI built on llama-cpp, offered on Windows as a single .exe release, with GPU support across multiple platforms.
LoLLMS Web UI is a highly customizable web UI with many interesting and unique features, including a full model library for easy model selection.
For example, you can git clone text-generation-webui, get it running in your browser, and from there select a GGUF model to download and run, such as the CodeLlama 13B model at https://huggingface.co/TheBloke/CodeLlama-13B-GGUF, whose codellama-13b.Q5_K_S.gguf file weighs in at 9GB.
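As a rough sketch of that flow (the project’s README is the authoritative guide; the install steps, helper script, and defaults shown here reflect the project at the time of writing and may change):

# Minimal sketch: text-generation-webui with a GGUF model (assumes a working
# Python environment; see the project's README for the supported install paths)
git clone https://github.com/oobabooga/text-generation-webui.git
cd text-generation-webui
pip install -r requirements.txt

# Fetch the CodeLlama 13B GGUF repo mentioned above using the bundled helper
# (or download just the single .gguf file with curl, as in the llama.cpp script)
python download-model.py TheBloke/CodeLlama-13B-GGUF

# Launch the web UI (defaults to http://localhost:7860), then load the
# codellama-13b.Q5_K_S.gguf file from the Model tab
python server.py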
Both text-generation-webui and LoLLMS Web UI required some environment setup, and I had issues configuring and running models, so it can become a hacker project. The payoff, however, is an environment that is very flexible and gives you a chance to run models released to HuggingFace.
For those who want something simpler, a “just get me a model running” option is to download LocalAI directly; it provides a desktop app for “local, private, secured AI experimentation” on Windows, macOS, or Debian Linux. However, even a 3B Dolly model running in LocalAI on my 16GB Windows machine was unacceptably slow.
Local GPUs
A CPU’s larger memory lets you try out larger models, but they run slowly. While running an AI model on a CPU is an interesting experiment in what is possible, the speed and performance limitations make it less practical for day-to-day use than a GPU.
To run AI models effectively on your machine, you really want an on-board GPU. For Macs, the latest Apple silicon machines (GPU cores scale up to 76 on the high-end M2 Ultra) offer a good platform.
For PCs, you’d want a latest-generation NVIDIA RTX 40-series GPU paired with a PC or laptop that can keep up with it: for example, a laptop with the 8GB RTX 4070, or the top-end 24GB RTX 4090 in a gaming PC.
Even with a 24GB card, the memory limit prevents you from running larger models. This is where quantization comes in: GPTQ quantizes the model parameters to reduce the model’s memory footprint. For example, the 34B CodeLlama model cannot fit on a 24GB GPU card at full precision, but with GPTQ quantizing it down to 4 bits, a CodeLlama-34B-GPTQ model fits in under 20GB.
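The back-of-the-envelope arithmetic shows why 4 bits makes the difference (weights only, ignoring activations and other runtime overhead):

# Rough weight-memory math for a 34B-parameter model
python3 -c "print('fp16 (16-bit):', 34e9 * 2 / 2**30, 'GiB')"    # ~63 GiB - far beyond a 24GB card
python3 -c "print('GPTQ (4-bit) :', 34e9 * 0.5 / 2**30, 'GiB')"  # ~16 GiB - fits with room to spare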
The RTX 4090 is an energy-hogging, 3-slot-filling beast that costs $1500, raising the question of whether it’s worth the spend, and whether renting a GPU makes more sense than buying one. That takes us away from the ‘run local’ option and back to running in the cloud.
Running ‘roll your own’ AI Models in the Cloud
If, in the end, your goal is to ‘roll your own’ AI but you don’t want or need to own the hardware or deal with the problems of local configuration, you can run your own customized AI models in the cloud.
This takes us beyond what a consumer would want to do, but for a developer or hacker wanting to explore different AI models, it’s not a hard task and the advantages are clear: you pay only for what you use; much of the configuration and maintenance is offloaded; and you aren’t limited by consumer-grade hardware, so you can scale up to enough compute to do fine-tuning or even training.
There are many GPU cloud service options, including major providers like Google Cloud offering A100s and T4s, Azure and Amazon AWS offering a range of AI services, and a number of smaller players, such as Lambda Labs, offering H100s for only $2/hr.
RunPod is another cloud service for AI inference and training, offering on-demand GPUs as well as API endpoints for various AI models and services. Matthew Berman shows in a video how to load and run an AI model on RunPod that can then be accessed via the same text-generation-webui we discussed earlier.
I was able to follow along with his instructions: pay up front for some RunPod credits, bring up an RTX 6000 Ada GPU with 48GB of VRAM in a matter of minutes, download a Llama2 70B quantized model from TheBloke (which took a while), then adjust settings and run. I got Llama2 to talk about Paris in the fall; it worked.
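Once a pod like that is up, you can also hit it programmatically. As a purely illustrative sketch, assuming you launch text-generation-webui with its API enabled (the --api flag) and expose that port through RunPod’s proxy, a request from your own machine might look roughly like the following; POD_HOST is a placeholder for whatever URL RunPod assigns to your pod, and the endpoint shown is the project’s blocking API as of this writing, which may have changed since:

# Hypothetical call to a text-generation-webui API running on a RunPod instance
POD_HOST="your-pod-id-5000.proxy.runpod.net"   # placeholder - use your pod's actual proxy URL
curl -s "https://${POD_HOST}/api/v1/generate" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Tell me about Paris in the fall.", "max_new_tokens": 200}'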
Conclusion
The real question driving this is: Why Run Your Own Model?
The main reason to go the ‘roll your own’ route is the hacker ethic: to try things out, experiment, and see what these AI models can do. If you are a hacker who likes tinkering with things, the AI Revolution is a great time to be alive. You can run AI models both locally and in the cloud. Try them in all these ways.
If you are a developer who wants to both use and build with AI, the available cloud instances and cloud-hosted tools cannot be beat; it’s not hard to set up a workflow that accesses, fine-tunes, and even builds AI models, and much of it is available at low cost on a pay-as-you-go basis.
If you are an AI model user or consumer, even a power user, but don’t want to explore open AI models, then stick with OpenAI’s ChatGPT Plus subscription. GPT-4 plus plug-ins and the code interpreter remains the best-in-class AI model for users.
Eventually, we will see inference-engine chips embedded in PCs and other edge devices, and AI models will come to the edge for consumers. Apple is leading in this space, with its Neural Engine and on-board embedded GPU. For PC users, get a good GPU so that any AI workloads that do land on your local machine can be handled well. We’ll know the era of edge-run AI has arrived when Intel processors come equipped with on-chip AI inference engines to compete with Apple.
Final thought: Hat tip and many thanks to “TheBloke,” aka Tom Jobbins, for creating most of the AI model files used in the experiments described above. TheBloke has released 1464 models on HuggingFace, a huge accomplishment.