AI UI/UX, User Experience & ScreenAI
Design considerations and techniques for UI/UX in the era of AI applications
An AI UX Fail
There’s a meme-famous Photoshop expert who took requests from others to ‘fix’ images with Photoshop and changed them in the most inappropriate way possible. It’s hilarious.
Apparently, Adobe Photoshop shares this sense of humor, because it managed to ‘fix’ a photo with generative fill in the worst way possible, replacing an undesired rock with a face-down baby.
The User Interface (UI) for a software application translates user intent to an expected outcome, and a good UI does that reliably and well. The UI and the application interact to create a User Experience (UX) that, in the ideal situation, makes the application a seamless extension of the user’s intention.
The topic of this article is AI UI/UX. The above UI and UX fails are a good introduction to this topic, because how a good UI maps user intentions to results is best understood in the breach.
The basis of good UI/UX is following user intent. It’s not the color scheme or button shading that is most important; function is. Translating user intent reliably and well is the primary goal of a good UI, and that is what leads to a good User Experience (UX).
AI Prompts, Chatbots And Beyond
The primary change LLM-based AI brings to application interfaces is the prompt: The prompt is the interface.
Writing prompts that properly extract the desired LLM response - prompt engineering - is a key skill for making LLMs do our bidding. Because AI augments and automates knowledge work, the prompt has become a unit of knowledge work and a pattern for leveraging knowledge work via AI. For example:
Prompt “summarize” with a document in the context, and the LLM writes a summary.
Prompt “find / fix the code bug” with code in the context, and the LLM proposes a fix.
Prompt “Write a Shakespearean poem about lilies in Spring” and Gemini obliges (see the postscript).
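As a minimal sketch of “the prompt is the interface,” here is what the first example looks like in code, assuming the Anthropic Python SDK and an API key in the environment; the model name and document file are placeholders, not prescribed values.

```python
# Minimal sketch: the prompt is the interface.
# Assumes the Anthropic Python SDK (`pip install anthropic`) and an
# ANTHROPIC_API_KEY in the environment; model name and file are examples.
import anthropic

client = anthropic.Anthropic()

document = open("quarterly_report.txt").read()  # placeholder document

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=500,
    messages=[
        # The entire "UI" here is one prompt with the document in context.
        {"role": "user", "content": f"Summarize the following document:\n\n{document}"}
    ],
)

print(response.content[0].text)
```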
The chatbot is the simplest and most direct interface with LLMs. Current LLM-based chatbots are far more powerful than earlier generations of chatbots that were limited to script-based interactions, so much so that early chatbots like Siri and Alexa must adapt and upgrade or become obsolete. (Ajax to the rescue for Siri.)
Standard AI-based chatbots from leading providers - Claude, Gemini, and ChatGPT - now include helpful cues and features:
Preserving prior prompts and chat conversations. Sharing prompts and prompt libraries, such as Claude’s prompt library, further simplifies the process.
Importing documents and images to analyze, summarize, or understand. Claude invites importing documents: “Add images, PDFs, docs, spreadsheets, and more to summarize, analyze, and query content with Claude.”
Voice input. The ChatGPT apps for iOS and Android allow voice input, as does Gemini.
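Below is a hedged sketch of how a chatbot can preserve prior prompts and conversations: the full message history is sent with each turn and persisted to disk so a chat can be resumed later. The OpenAI Python SDK usage and model name are assumptions for illustration.

```python
# Sketch of preserving a chat conversation across turns and sessions.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY; model name is an example.
import json
from pathlib import Path
from openai import OpenAI

client = OpenAI()
HISTORY_FILE = Path("chat_history.json")

# Resume a prior conversation if one was saved, otherwise start fresh.
history = json.loads(HISTORY_FILE.read_text()) if HISTORY_FILE.exists() else []

def chat(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    # The full history is sent each turn, so the model sees the prior context.
    response = client.chat.completions.create(model="gpt-4o", messages=history)
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    HISTORY_FILE.write_text(json.dumps(history, indent=2))  # persist for next session
    return reply

print(chat("Summarize our last discussion in one sentence."))
```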
There are other platforms like Poe that give you buffet-style access to AI models of your choosing, including both proprietary frontier AI models and open-source LLMs.
Beyond standard AI chatbots are chatbots based on customized and personalized LLMs, such as CharacterAI’s companion chatbots, Meta AI’s character-based AI assistants, OpenAI’s custom GPTs and GPT store, and others. The broad diversity of custom LLMs and the variety of experiences they offer come from unique system and individual prompts.
AI-centric Applications
AI-centric applications are taking many forms beyond the chatbot. These include:
Generative AI applications for images, video, and music combine complex prompt interfaces with GUI controls. Examples include Midjourney, Ideogram, Runway, Suno, etc.
Copilot applications embed a prompt-based chat interaction within interactive editing environments, for example, in coding (GitHub Copilot, Codium), writing (Jasper), etc.
“AI as a feature,” where generative AI features are embedded as commands in a complex application - for example, Adobe Photoshop’s Generative Fill feature.
AI Agents: This is a whole new class of software built on AI. From a user interface perspective, AI agents, like copilot applications, will be navigating through workflows. The GitHub Copilot Workspace interface exhibits this paradigm; it is based on going through a software workflow: spec, design, code, and test.
In all the above cases, there is a mix of input types: user prompting, usually textual (or voice) input via a textbox or chat interface, complemented by other types of input, such as traditional GUI inputs - mouse clicks, sliders, buttons, etc.
While human language is natural and expressive, it can also be vague; this is great for some creative exploration, but vagueness and verbosity make it less precise and efficient for conveying some inputs. A broad palette of inputs that complement each other is the best way to capture total user intent.
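As a sketch of how those inputs might be combined, the snippet below merges a free-text prompt with values from hypothetical GUI controls (a radio button, tags, and a slider) into one generation request; the request schema and field names are illustrative, not any product’s actual API.

```python
# Sketch: merge a free-text prompt with structured GUI controls into one request.
# The request schema and field names below are hypothetical illustrations.

def build_generation_request(prompt_text, aspect_ratio, style_tags, guidance):
    """Combine the expressive-but-vague text prompt with precise GUI inputs."""
    return {
        "prompt": f"{prompt_text}, {', '.join(style_tags)}",  # tags refine the text
        "aspect_ratio": aspect_ratio,   # radio button: exact, no ambiguity
        "guidance_scale": guidance,     # slider: numeric control
    }

request = build_generation_request(
    prompt_text="a lighthouse at dawn in heavy fog",  # textbox input
    aspect_ratio="16:9",                              # radio button input
    style_tags=["cinematic", "poster"],               # tag selections
    guidance=7.5,                                     # slider input
)
print(request)
```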
Key UI Design Considerations for AI Applications
AI models can be broadly applied to many applications, but they can be unpredictable and unreliable. AI application interfaces should expose and enable AI’s powerful abilities while controlling risky or unwanted outcomes. Key features in UI/UX for AI:
Feedback: Provide clear feedback mechanisms throughout user interactions with AI models to manage expectations and prevent errors.
Controllability: Provide controls for the AI model (temperature, system prompts, etc.) so users have good control over potential AI outputs; AI model guardrails can limit risky and unwanted outputs (see the sketch after this list).
Explainability: AI models are opaque and complex. Design UIs that help users understand AI outputs and explain AI decisions, to build trust.
Adaptability: AI applications should learn and adapt to user preferences and behaviors over time, allowing for personalization.
Transparency: Clearly indicate what AI model (if any) is being used, when users are interacting with an AI system versus a human, and what artifacts are AI generated.
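Here is a minimal sketch of the controllability point above: the system prompt and temperature are exposed as user-facing controls, with a simple guardrail clamping the temperature before the call. The OpenAI SDK call and model name are assumptions for illustration.

```python
# Sketch: expose model controls to the user, with a simple guardrail.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY; model name is an example.
from openai import OpenAI

client = OpenAI()

def generate(user_prompt: str, system_prompt: str, temperature: float) -> str:
    # Guardrail: clamp the user-facing "creativity" slider to a safe range.
    temperature = max(0.0, min(temperature, 1.0))
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=temperature,
        messages=[
            {"role": "system", "content": system_prompt},  # user-editable rules/persona
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content

# These values would come from UI controls: a text area and a slider.
print(generate("Draft a product announcement.", "You are a concise copywriter.", 0.3))
```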
AI UI Techniques
As with the GUIs of the past, UI techniques being used in AI-centric applications are becoming standard interface elements because they effectively convey user intent. Some examples:
The Magic Prompt: Many AI applications do re-prompting, rewriting an original user-entered prompt with more detailed specifications, to fill in gaps and make it work better for the AI model. In some cases (like with Google’s Imagen) this can go awry if done in an opaque or inflexible way.
Ideogram implements re-prompting as ‘the magic prompt’ and does it well. The re-prompt is exposed and shared with the user as editable input, giving the user transparency and control (a minimal sketch of this pattern appears after this list of techniques).
Ideogram’s interface is also a great example of two other UI features - remix and GUI-prompting mixed input.
GUI plus prompting input: As shown in Ideogram’s interface, AI model input could combine a text prompt with directives selected via radio buttons (‘Ratio’, ‘Model version’) or tags (‘cinematic’, ‘poster’).
Remix: In the context of AI image generation, Remix takes a generated AI image and uses it as an input that can be combined with other inputs. It often takes multiple prompts and responses to get the desired output from a generative AI model, so Remix is one way to turn a “close” output into an even better one, by exploring the space around a result.
Edit and Rewind: Edit and Rewind is the “Undo” button equivalent. It’s also like a reverse Remix; instead of taking a desired output forward, it takes an undesired output and unwinds it. Edit and Rewind is useful in workflow-based AI agents or copilot applications, where a user might need to roll back several workflow steps to undo unwanted downstream results.
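As promised above, here is a minimal sketch of the Magic Prompt pattern: the model rewrites the user’s short prompt into a more detailed one, and the rewrite is shown back to the user as editable input before generation. The SDK call and model name are assumptions, and this illustrates the pattern only; it is not Ideogram’s implementation.

```python
# Sketch of the "magic prompt" pattern: re-prompt, then let the user edit it.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY; model name is an example.
# Illustrative only; this is not Ideogram's implementation.
from openai import OpenAI

client = OpenAI()

def magic_prompt(short_prompt: str) -> str:
    """Expand a terse user prompt into a detailed one for the image model."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": "Rewrite this image prompt with richer detail about "
                       f"subject, style, lighting, and composition: {short_prompt}",
        }],
    )
    return response.choices[0].message.content

expanded = magic_prompt("a lighthouse at dawn")
print("Magic prompt:\n", expanded)

# Transparency and control: show the rewrite as editable input before generating.
final_prompt = input("Edit the prompt (or press Enter to accept): ") or expanded
# final_prompt would then be sent to the image generation model.
```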
ScreenAI: An AI for understanding UI & UX
Just as AI can help with programming, writing, image generation, and other knowledge work, AI should be able to assist with UI development. A specialized AI model for understanding user interfaces could be helpful.
In March, Google Research introduced ScreenAI, a visual language model for UI and visually-situated language understanding. ScreenAI is a Vision-Language Model (VLM) that can comprehend both user interfaces (UIs) and infographics. It’s capable of graphical question-answering, element annotation, summarization, navigation, and UI-specific QA.
Their paper “ScreenAI: A Vision-Language Model for UI and Infographics Understanding” has all the details:
Our model [ScreenAI] improves upon the PaLI architecture with the flexible patching strategy of pix2struct and is trained on a unique mixture of datasets. At the heart of this mixture is a novel screen annotation task in which the model has to identify the type and location of UI elements. We use these text annotations to describe screens to Large Language Models and automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale.
ScreenAI training uses two stages: pre-training applies self-supervised learning to automatically generate data labels; fine-tuning uses data manually labeled by human raters.
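To make that data recipe concrete, here is a hedged sketch of the screen-annotation idea from the paper: UI element types and locations are serialized into text and used to prompt an LLM to generate question-answer pairs at scale. The annotation schema and prompt wording below are illustrative, not ScreenAI’s actual format or pipeline.

```python
# Sketch of the screen-annotation-to-QA-dataset idea described in the paper.
# The annotation schema and prompt wording are illustrative, not ScreenAI's
# actual format or pipeline.

# Hypothetical screen annotation: element type, text, and bounding box (x, y, w, h).
screen_annotation = [
    {"type": "BUTTON", "text": "Checkout", "bbox": [540, 1210, 180, 60]},
    {"type": "TEXT", "text": "Total: $42.10", "bbox": [60, 1130, 240, 40]},
    {"type": "IMAGE", "text": "product photo", "bbox": [60, 300, 660, 660]},
]

def annotation_to_text(elements):
    """Describe the screen to an LLM as plain text, one element per line."""
    return "\n".join(
        f'{e["type"]} "{e["text"]}" at {e["bbox"]}' for e in elements
    )

qa_generation_prompt = (
    "Here is a description of a mobile screen:\n"
    f"{annotation_to_text(screen_annotation)}\n\n"
    "Generate three question-answer pairs about this screen."
)
print(qa_generation_prompt)  # this text would be sent to an LLM to build QA data
```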
One way to think about this specialized ScreenAI VLM is that it is comprehending and operating on another modality. Just as an LLM can chunk through text or an image model can understand pure images, ScreenAI understands how text and images interact in an interface.
ScreenAI does very well on visual QA within documents (DocVQA) and related benchmarks:
At only 5B parameters, ScreenAI achieves new state-of-the-art results on UI- and infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and Widget Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and InfographicVQA) compared to models of similar size.
This has useful applications for creating AI applications that interact with graphical user interfaces. One example capability is screen summarization, where the model summarizes the screen content in one or two sentences.
Summary
AI changes everything about software and will disrupt what applications look like and how we interact with them, but the essential rules of UI/UX haven’t changed: UI/UX is all about conveying user intent to get correct application results.
The prompt is and will be the primary interface for LLMs, because human language is our most natural form of communication - flexible and expressive. However, it can be vague, and there are easier ways to convey some inputs. AI application interfaces will therefore combine prompt chat and voice inputs with other input modalities, such as GUIs.
The best AI applications will combine a palette of inputs effectively.
Devin and GitHub Copilot Workspace are showing the way forward for AI Agent interfaces. These ‘Cockpits for Copilots’ will be flow-based and show multiple views of the AI agents’ interactions with code (code editing in an IDE), execution (a code execution terminal), search and tools (via OS windows), and with the flow itself (showing flow progress).
AI itself can understand and build interfaces. AI models such as ScreenAI will be used to help train AI applications so that AI models themselves can navigate interfaces, which is especially useful for AI agents that automate such interactions.
This is the migration of knowledge work that we will evolve through as AI application software gets more agentic:
Pre-AI: User manages and does tasks, using software tools for discrete steps.
AI (LLM chatbot-based): User manages the prompts, with AI LLM completing discrete steps.
AI Agents: User manages workflow and AI agents instead of prompts, with the AI agents acting as copilot or automation engine to complete multiple steps in the flow.
Interfaces will therefore continue to evolve, as the capabilities of AI agents grow.
Postscript
Given the prompt “Write a Shakespearean poem about lilies in spring,” Gemini did a fine job, worth sharing:
When April's breath doth stir the sleeping land, And buds unfurl with soft and gentle grace, The lily wakes, a vision pure and grand, Her petals white, like stars in heaven's space.
Upon a stem so slender, yet so strong, She lifts her head towards the warming sun, Her fragrance sweet, a soft and whispered song, Of fairer days and seasons newly won.
The bees descend, with pollen-laden wings, To kiss her depths and sip her nectar sweet. The lily's heart with gentle music rings, Where springtime joy and summer's touch do meet.
O lily fair, with beauty pure and bright, Thou art a symbol of new life and light.
Great and comprehensive overview of a strangely overlooked subject. Just shared on LinkedIn.
Two observations:
- As a UI/UX professional, I prefer to frame design problems and solutions around "user goals" vs "following user intent." The former is a slightly more general category. Users don't always know their intentions when they visit a new site or app, but they generally have a goal (entertainment, learning, professional development, etc).
- For your excellent list of best practices, I would add one more to strive for: reproducibility. I realize this is more challenging with AI. It's true you usually don't get the same probability matrix twice! But building in mechanisms to capture and report serious errors is essential -- otherwise we lose the entire mindset and ethos for QA in software. That would be unfortunate, because our process works remarkably well compared to other industries...