DALL-E 2, Imagen, Parti

Text-to-image generation models are a hot topic right now, so here is a Twitter thread that compares the architectures of three representative models: DALL-E 2, Imagen, and Parti.

A quick thread on "How DALL-E 2, Imagen and Parti Architectures Differ" with breakdown into comparable modules, annotated with size 🧵#dalle2 #imagen #parti

* figures taken from corresponding papers with slight modification
* parts used for training only are greyed out pic.twitter.com/9zsIUq3toU
— Rosanne Liu (@savvyRL) June 25, 2022

The table below gives a brief comparison of the building blocks of each architecture.

	DALL-E 2	Imagen	Parti
Text Encoder	CLIP	T5-XXL	Transformer Encoder
Text Embeddings to Image (64x64 or 256x256)	Diffusion Model x 2 (Prior + Decoder)	Diffusion Model	Transformer Decoder + ViT-VQGAN
Upsample the Image (64x64 or 256x256 -> 1024x1024)	Diffusion Model x 2	Diffusion Model x 2	Convolution
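
To make the comparison easier to query, the table can also be written down as plain data. The following is only a minimal sketch that mirrors the table rows; the `Pipeline` class, its field names, and `count_diffusion_models` are hypothetical names made up for this illustration and do not come from any of the papers or from an official implementation.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Pipeline:
    """One column of the comparison table (hypothetical structure)."""
    text_encoder: str
    text_to_image: Tuple[str, ...]   # text embeddings -> low-resolution image
    upsampler: Tuple[str, ...]       # low-resolution image -> 1024x1024


# Entries are copied from the table above; nothing is taken from actual code.
PIPELINES = {
    "DALL-E 2": Pipeline(
        text_encoder="CLIP",
        text_to_image=("Diffusion Model (Prior)", "Diffusion Model (Decoder)"),
        upsampler=("Diffusion Model", "Diffusion Model"),
    ),
    "Imagen": Pipeline(
        text_encoder="T5-XXL",
        text_to_image=("Diffusion Model",),
        upsampler=("Diffusion Model", "Diffusion Model"),
    ),
    "Parti": Pipeline(
        text_encoder="Transformer Encoder",
        text_to_image=("Transformer Decoder", "ViT-VQGAN"),
        upsampler=("Convolution",),
    ),
}


def count_diffusion_models(pipeline: Pipeline) -> int:
    """Count the blocks listed as diffusion models in one pipeline."""
    blocks = pipeline.text_to_image + pipeline.upsampler
    return sum(1 for block in blocks if block.startswith("Diffusion Model"))


if __name__ == "__main__":
    for name, pipeline in PIPELINES.items():
        print(f"{name}: {count_diffusion_models(pipeline)} diffusion model(s)")
```

Written this way, the main contrast in the table stands out: DALL-E 2 and Imagen chain several diffusion models, while Parti as summarized here uses none and relies on a Transformer with ViT-VQGAN instead.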

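All three papers include plenty of figures, but the DALL-E 2 paper is 27 pages, Imagen is 46 pages, and Parti is a full 49 pages, so this is not an amount of reading I can casually commit to...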

stMind

about Tech, Computer vision and Machine learning