The internet is going wild for OpenAI's GPT-4o native image generation, but would you like to know how it works? A paper from Meta, Waymo, and the University of Southern California from mid-2024 introduced the Transfusion architecture, which combines the Transformers we typically see in language models with the diffusion models we typically see in image generation.

Previous image generation in systems like ChatGPT involved the model calling an image generation tool (DALL-E) on the user's behalf. The Transfusion approach instead has the model output an optional sequence of text tokens, then a special token to signal the start of an image (BOI), then a sequence of n image patches that start as random noise and are filled in diffusion style, and finally a special token to signal the end of the image (EOI). This interleaving of text tokens and image patches can be repeated. The image patches are then converted to an image by either a simple linear layer or U-Net up blocks, followed by a Variational Autoencoder (VAE) decoder.

There was previous work in this space, notably Chameleon (also from Meta in 2024). The big difference between Transfusion and Chameleon is that Chameleon had a discretization step when handling images: every image was broken into discrete image tokens drawn from a fixed-size vocabulary, and image generation sampled from that same vocabulary. This discretization created an information bottleneck and threw away information. As a result, Transfusion significantly outperforms Chameleon in the paper, surpassing it in every combination of modalities.

If you are interested in multimodality and vision language models, this is one of the most important papers to read!

OpenAI New Image Generation: https://lnkd.in/ejrVGyDf
Transfusion Paper: https://lnkd.in/e23CrVGN
Chameleon Paper: https://lnkd.in/eczeVtnR
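
To make the interleaved decoding flow described above concrete, here is a minimal Python sketch. It is not the paper's or OpenAI's code: the modules (TinyTransformer, VAEDecoder), the token ids for BOI/EOI, the patch counts, and the crude denoising update are all placeholder assumptions chosen only to show the shape of the loop: text tokens autoregressively, then BOI, then n noisy patches refined diffusion style, then decoding to pixels.

```python
# Hypothetical sketch of Transfusion-style interleaved decoding.
# Module names, token ids, and sizes are illustrative placeholders,
# not the paper's actual implementation.

import torch
import torch.nn as nn

BOI, EOI = 50001, 50002          # special token ids (illustrative values)
N_PATCHES, PATCH_DIM = 64, 32    # e.g. an 8x8 grid of latent image patches
DIFFUSION_STEPS = 10             # real systems use many more steps

class TinyTransformer(nn.Module):
    """Stand-in for the shared transformer backbone over text and patches."""
    def __init__(self, dim=PATCH_DIM, vocab=50003):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, hidden):                          # hidden: (1, seq, dim)
        return self.block(hidden)

def sample_next_token(model, hidden):
    """Autoregressive text step: predict the next token id from the last position."""
    logits = model.lm_head(model(hidden)[:, -1])
    return int(torch.argmax(logits, dim=-1))

def denoise_patches(model, hidden, steps=DIFFUSION_STEPS):
    """Diffusion-style step: start from pure noise and repeatedly run the
    transformer over the patch slots, nudging them toward its prediction."""
    patches = torch.randn(1, N_PATCHES, PATCH_DIM)      # random patches to start
    for _ in range(steps):
        ctx = torch.cat([hidden, patches], dim=1)       # text context + patches
        pred = model(ctx)[:, -N_PATCHES:]               # predicted cleaner patches
        patches = 0.8 * patches + 0.2 * pred            # crude update, for illustration
    return patches

class VAEDecoder(nn.Module):
    """Stand-in for the VAE decoder that maps latent patches to pixels."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(PATCH_DIM, 3 * 4 * 4)     # each patch -> 4x4 RGB tile

    def forward(self, patches):
        return self.proj(patches).view(1, N_PATCHES, 3, 4, 4)

if __name__ == "__main__":
    model, vae = TinyTransformer(), VAEDecoder()
    prompt = torch.tensor([[1, 2, 3]])                  # toy prompt token ids
    hidden = model.embed(prompt)

    tok = sample_next_token(model, hidden)              # text step; suppose it emits BOI here
    patches = denoise_patches(model, hidden)            # fill in n image patches, diffusion style
    image = vae(patches)                                # decode latent patches to pixel tiles
    print("next token id:", tok, "| decoded patches:", tuple(image.shape))
    # ...the model would then emit EOI and resume ordinary text decoding
```

The key design point the sketch tries to capture is that one transformer handles both modalities: text positions are trained with the usual next-token loss, while image-patch positions are trained with a diffusion loss, so no discretization of the image is ever needed.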