The internet is going wild for OpenAI's GPT-4o native image generation, but would you like to know how it works? A paper from Meta, Waymo, and the University of Southern California from mid-2024 introduced the Transfusion architecture, which combines the Transformers we typically see in language models with the diffusion models we typically see in image generation.

Previous image generation in systems like ChatGPT involved the model calling an image generation tool (DALL-E) on the user's behalf. The Transfusion approach instead has the model output an optional sequence of text tokens, then a special token to signal the start of an image (BOI), then a sequence of n image patches that start as random noise and are filled in diffusion style, and finally a special token to signal the end of the image (EOI). This interleaving of text tokens and image patches can be repeated. The image patches are then converted to an image by either a simple linear layer or U-Net up blocks, followed by a Variational Autoencoder (VAE) decoder.

There was previous work in this space, notably Chameleon (also from Meta in 2024). The big difference between Transfusion and Chameleon is that Chameleon had a discretization step when handling images: every image was broken into discrete image tokens drawn from a fixed-size vocabulary, and image generation sampled from that same vocabulary. This discretization created an information bottleneck and threw away information. As a result, Transfusion significantly outperforms Chameleon in the paper, surpassing it in every combination of modalities.

If you are interested in multimodality and vision language models, this is one of the most important papers to read!

OpenAI New Image Generation: https://lnkd.in/ejrVGyDf
Transfusion Paper: https://lnkd.in/e23CrVGN
Chameleon Paper: https://lnkd.in/eczeVtnR
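
To make the interleaved decoding flow described above concrete, here is a minimal Python sketch. It is not the paper's or OpenAI's code: the modules (TinyTransformer, VAEDecoder), the token ids for BOI/EOI, the patch counts, and the crude denoising update are all placeholder assumptions chosen only to show the shape of the loop: text tokens autoregressively, then BOI, then n noisy patches refined diffusion style, then decoding to pixels.

```python
# Hypothetical sketch of Transfusion-style interleaved decoding.
# Module names, token ids, and sizes are illustrative placeholders,
# not the paper's actual implementation.

import torch
import torch.nn as nn

BOI, EOI = 50001, 50002          # special token ids (illustrative values)
N_PATCHES, PATCH_DIM = 64, 32    # e.g. an 8x8 grid of latent image patches
DIFFUSION_STEPS = 10             # real systems use many more steps

class TinyTransformer(nn.Module):
    """Stand-in for the shared transformer backbone over text and patches."""
    def __init__(self, dim=PATCH_DIM, vocab=50003):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, hidden):                          # hidden: (1, seq, dim)
        return self.block(hidden)

def sample_next_token(model, hidden):
    """Autoregressive text step: predict the next token id from the last position."""
    logits = model.lm_head(model(hidden)[:, -1])
    return int(torch.argmax(logits, dim=-1))

def denoise_patches(model, hidden, steps=DIFFUSION_STEPS):
    """Diffusion-style step: start from pure noise and repeatedly run the
    transformer over the patch slots, nudging them toward its prediction."""
    patches = torch.randn(1, N_PATCHES, PATCH_DIM)      # random patches to start
    for _ in range(steps):
        ctx = torch.cat([hidden, patches], dim=1)       # text context + patches
        pred = model(ctx)[:, -N_PATCHES:]               # predicted cleaner patches
        patches = 0.8 * patches + 0.2 * pred            # crude update, for illustration
    return patches

class VAEDecoder(nn.Module):
    """Stand-in for the VAE decoder that maps latent patches to pixels."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(PATCH_DIM, 3 * 4 * 4)     # each patch -> 4x4 RGB tile

    def forward(self, patches):
        return self.proj(patches).view(1, N_PATCHES, 3, 4, 4)

if __name__ == "__main__":
    model, vae = TinyTransformer(), VAEDecoder()
    prompt = torch.tensor([[1, 2, 3]])                  # toy prompt token ids
    hidden = model.embed(prompt)

    tok = sample_next_token(model, hidden)              # text step; suppose it emits BOI here
    patches = denoise_patches(model, hidden)            # fill in n image patches, diffusion style
    image = vae(patches)                                # decode latent patches to pixel tiles
    print("next token id:", tok, "| decoded patches:", tuple(image.shape))
    # ...the model would then emit EOI and resume ordinary text decoding
```

The key design point the sketch tries to capture is that one transformer handles both modalities: text positions are trained with the usual next-token loss, while image-patch positions are trained with a diffusion loss, so no discretization of the image is ever needed.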