5 Comments

I wouldn’t call what ChatGPT does vision or listening ... that seems like a gross mischaracterization.


1. The 'listening' is the audio input to the AI model apps. ChatGPT uses OpenAI's Whisper model for speech recognition. For example, I was able to speak "Can you explain vision in AI models?" into the ChatGPT app on Android and got an answer. There's a beta feature that gives audio responses, so the AI will not only "listen" but also "speak" in reply. Which gets us to ...
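For anyone who wants to see how little machinery that speech-recognition step involves, here's a minimal sketch using OpenAI's open-source whisper Python package. The model size and audio filename are just placeholders:

```python
# Minimal speech-to-text sketch with OpenAI's open-source Whisper model.
# pip install openai-whisper  (also requires ffmpeg on the system)
import whisper

# "base" is one of the smaller checkpoints; "small", "medium", and "large"
# are more accurate but slower.
model = whisper.load_model("base")

# "question.mp3" is a placeholder for any recorded audio file.
result = model.transcribe("question.mp3")
print(result["text"])  # e.g. "Can you explain vision in AI models?"
```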


2. The phrase "computer vision" has been around for decades to mean AI recognition of image information, so the following ChatGPT response to the question is on target:

"Certainly! Vision in AI models refers to the ability of artificial intelligence systems to interpret and understand visual information, typically in the form of images or videos. This is an essential component of computer vision, a subfield of AI."

Then ChatGPT went into great detail on the topic. Point being, many of the use cases of GPT-4V clearly fall under it being an AI with "vision". When it can do handwriting recognition, that's a powerful example. See #9 on the list below for what GPT-4V does (a couple of code sketches follow the quoted list):

"Here's a basic overview of how vision works in AI models:

1. **Data Input**: AI models take images or videos as input. These images are composed of pixels, each representing a color or grayscale value.

2. **Feature Extraction**: In order to understand the content of an image, AI models extract features from the input data. These features could include edges, textures, shapes, and more complex patterns.

3. **Convolutional Neural Networks (CNNs)**: CNNs are a common architecture used for computer vision tasks. They consist of layers of interconnected neurons that automatically learn to recognize patterns in images through a process of convolution and pooling. This allows the model to identify features at various scales and orientations.

4. **Deep Learning**: Modern vision AI models are often deep neural networks with many layers. This depth allows them to capture increasingly abstract and high-level features as they process the input data.

5. **Object Detection and Recognition**: AI models can be trained to detect and recognize objects within images or videos. This involves not only identifying the presence of objects but also classifying them into specific categories.

6. **Semantic Segmentation**: Another important task is semantic segmentation, where the model labels each pixel in an image with the object or class it belongs to. This is useful for tasks like image segmentation and understanding the spatial layout of objects in an image.

7. **Image Classification**: AI models can classify entire images into predefined categories. For example, they can determine whether an image contains a cat, dog, or car.

8. **Object Tracking**: In video analysis, AI models can track the movement of objects over time, enabling applications like video surveillance and autonomous driving.

9. **Natural Language Integration**: Some AI models are designed to integrate vision with natural language processing. They can generate textual descriptions of images or answer questions about them.

10. **Applications**: Vision in AI has a wide range of applications, from self-driving cars and medical image analysis to facial recognition, content recommendation, and augmented reality.

To perform these tasks effectively, AI models require large labeled datasets for training and often rely on powerful hardware like GPUs. Advances in deep learning and computer vision algorithms have led to significant progress in the accuracy and capabilities of vision-based AI systems in recent years."
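To make items 3 and 7 above concrete, here's a toy sketch of a convolutional network classifying an image, written in PyTorch. The layer sizes, class count, and random input are all arbitrary placeholders; a real model would be trained on labeled data:

```python
# Toy illustration of items 3 and 7: a tiny convolutional network that
# classifies an image into one of a few categories.
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=3):  # e.g. cat / dog / car
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # learn local patterns (edges, textures)
            nn.ReLU(),
            nn.MaxPool2d(2),                              # pooling: downsample
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # higher-level features
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):
        x = self.features(x)       # convolution + pooling (item 3)
        x = x.flatten(1)
        return self.classifier(x)  # class scores (item 7)

# A random 32x32 RGB "image" stands in for real input data.
image = torch.randn(1, 3, 32, 32)
logits = TinyCNN()(image)
print(logits.argmax(dim=1))  # predicted class index
```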
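And item 9, the vision-plus-language integration, is roughly how GPT-4V itself is called programmatically. A sketch using the OpenAI Python SDK; the model name reflects the API at the time of GPT-4V's release, and the image URL is illustrative, so check OpenAI's current docs for exact values:

```python
# Sketch of item 9: sending an image plus a text question to a
# vision-capable model via the OpenAI API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # model name may have changed since
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the handwriting in this image."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/handwritten-note.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```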


That makes sense ... I feel like that’s not how it will be interpreted by the masses. 😆


Ah, you have a point there. People interpret it in sci-fi terms, not technical ones.
