[AI Art][Tutorial] Disco Diffusion Tutorial

5 min read · Aug 9, 2022



Disco Diffusion (DD) is a Google Colab Notebook which leverages an AI Image generating technique called CLIP-Guided Diffusion to allow you to create compelling and beautiful images from just text inputs. Created by Somnai, augmented by Gandamu, and building on the work of RiversHaveWings, nshepperd, and many others.
It’s magic. And also, free. (!)

Disco Diffusion is a program that combines a diffusion model with CLIP to turn text into AI-generated artwork. The dataset used to train its models includes famous paintings from around the world as well as tens of thousands of artworks from the online art community artstation.com. It runs on PyTorch with CUDA acceleration.

Original GitHub repo: https://github.com/alembics/disco-diffusion

Original Colab notebook: https://colab.research.google.com/github/alembics/disco-diffusion/blob/main/Disco_Diffusion.ipynb

Ruka's personal simplified version (using jina-ai/discoart): https://colab.research.google.com/drive/1adglV_W87i69EsaQhwNnJc3O00i2tfG1


How to Use the Original Disco Diffusion



Link: Colab notebook


Run the first cell to check the Colab machine specs and confirm there is enough GPU VRAM (e.g., the free tier of Colab may provide a V100 with 14 GB). If no GPU is attached, open "Change runtime type" and select GPU.

Next, run the second cell to import packages. This step will restart the virtual machine; after the restart, run the second cell again.

Then simply run the remaining cells in order. One of the cells lets you customize basic parameters for the image you want to generate, or you can go into the code and edit the parameter dictionary by hand.



Reference: https://botbox.dev/disco-diffusion-cheatsheet/


n_batches: (50|1–100) This variable sets the number of still images you want DD to create. If you are using an animation mode (see below for details) DD will ignore n_batches and create a single set of animated frames based on the animation settings.

text_prompts: (defaultdict(String list)) Text prompts for the AI to render as art; each key of the dict is the step from which the corresponding prompt list will be considered by the AI.

steps: (250|50–10000) When creating an image, the denoising curve is subdivided into steps for processing. Each step (or iteration) involves the AI looking at subsets of the image called ‘cuts’ and calculating the ‘direction’ the image should be guided to be more like the prompt. Then it adjusts the image with the help of the diffusion denoiser, and moves to the next step.

skip_steps: (10|integer up to steps) Noise scheduling (denoise strength) starts very high and progressively gets lower and lower as diffusion steps progress. The noise levels in the first few steps are very high, so images change dramatically in early steps.

init_image: Optional. Recall that in the image sequence above, the first image shown is just noise. If an init_image is provided, diffusion will replace the noise with the init_image as its starting state. To use an init_image, upload the image to the Colab instance or your Google Drive, and enter the full image path here.

clip_guidance_scale: (5000|1500–100000) CGS is one of the most important parameters you will use. It tells DD how strongly you want CLIP to move toward your prompt each timestep. Higher is generally better, but if CGS is too strong it will overshoot the goal and distort the image. So a happy medium is needed, and it takes experience to learn how to adjust CGS.

cutn_batches: (4|1–8) Each iteration, the AI cuts the image into smaller pieces known as cuts, and compares each cut to the prompt to decide how to guide the next diffusion step. More cuts can generally lead to better images, since DD has more chances to fine-tune the image precision in each timestep.

display_rate: (50|5–500) During a diffusion run, you can monitor the progress of each image being created with this variable. If display_rate is set to 50, DD will show you the in-progress image every 50 timesteps.
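In the notebook, the parameters above typically live together in one settings dictionary. The sketch below is a hypothetical illustration using the default values quoted above; the exact variable names and layout in your copy of the notebook may differ:

```python
# Hypothetical settings dictionary mirroring the parameters described above.
settings = {
    "n_batches": 50,              # number of still images to create
    "steps": 250,                 # denoising steps per image
    "skip_steps": 10,             # skip the noisiest early steps (mainly with init_image)
    "init_image": None,           # or a path such as "/content/my_init.png"
    "clip_guidance_scale": 5000,  # how strongly CLIP pushes toward the prompt
    "cutn_batches": 4,            # batches of cuts per timestep
    "display_rate": 50,           # show the in-progress image every N timesteps
    "text_prompts": {
        # key = step at which this prompt list takes effect
        0: ["a beautiful painting of a lighthouse, artstation"],
    },
}
```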

How AI Image Generation Works

1. CLIP Model (January 5, 2021)

CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts in our dataset. We then use this behavior to turn CLIP into a zero-shot classifier. We convert all of a dataset’s classes into captions such as “a photo of a dog” and predict the class of the caption CLIP estimates best pairs with a given image.
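The zero-shot classification idea can be sketched with toy stand-ins. The arrays below are made-up embeddings, not outputs of the real CLIP encoders; in the real model, the text and image encoders are neural networks that map captions and pixels into one shared embedding space:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for CLIP's text encoder output, one per candidate caption.
caption_embeddings = {
    "a photo of a dog": np.array([0.9, 0.1, 0.0]),
    "a photo of a cat": np.array([0.1, 0.9, 0.0]),
}
# Pretend this came from CLIP's image encoder.
image_embedding = np.array([0.8, 0.2, 0.1])

# Zero-shot classification: pick the caption whose embedding best matches the image.
best = max(caption_embeddings, key=lambda c: cosine(caption_embeddings[c], image_embedding))
print(best)  # → a photo of a dog
```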

In short: an image-to-text prediction model (it guesses the matching text from an image).

2. Diffusion Model (May 13, 2021)



Diffusion Models are generative models, meaning that they are used to generate data similar to the data on which they are trained. Fundamentally, Diffusion Models work by destroying training data through the successive addition of Gaussian noise, and then learning to recover the data by reversing this noising process. After training, we can use the Diffusion Model to generate data by simply passing randomly sampled noise through the learned denoising process.
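The forward (noising) half of this process can be sketched in a few lines. This is a toy illustration of the closed-form "jump to step t" sampling used when training diffusion models, not Disco Diffusion's actual code; the schedule values are common defaults, chosen here only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)    # per-step noise schedule
alpha_bar = np.cumprod(1.0 - betas)   # cumulative signal-retention factor

def noisy_sample(x0, t):
    """Jump straight to step t: x_t = sqrt(a)*x0 + sqrt(1-a)*noise."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

x0 = np.ones((4, 4))             # pretend this is a clean training image
x_early = noisy_sample(x0, 10)   # early step: still close to x0
x_late = noisy_sample(x0, 999)   # late step: almost pure Gaussian noise
print(np.abs(x_early - x0).mean() < np.abs(x_late - x0).mean())  # → True
```

Training teaches the model to reverse this: given x_t, predict the noise that was added, so that at generation time it can start from pure noise and denoise step by step.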

In short: a model that starts from a blurry image and, following a text description, restores it into an artwork (a model that generates an image from noise or an initial image; i.e., a text-to-image model).

3. How Disco Diffusion paints:

  1. First use the diffusion model to generate a random-noise image, or start from init_image
  2. Cut the image into several pieces (cutn_batches)
  3. Use CLIP to interpret (guess) which part of text_prompts each piece of the image corresponds to
  4. Have the diffusion model repaint that piece according to CLIP's interpretation
  5. Repeat steps 2–4 (steps) times until finished
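The steps above can be sketched as a toy guidance loop. Everything here is a stand-in: the "image" and "prompt embedding" are just small vectors in one space, the per-cut CLIP scoring is collapsed into a single direction computation, and the denoiser's own update is omitted; it only illustrates the shape of the loop, not the real algorithm's math:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-ins: a tiny "image" and a "prompt embedding" in the same space.
prompt_embedding = np.array([1.0, 0.0, 0.0])
image = rng.standard_normal(3)   # step 1: start from random noise (or init_image)

steps = 250
guidance_scale = 0.05            # toy analogue of clip_guidance_scale

for step in range(steps):
    # Steps 2-3 (collapsed): real DD cuts the image into pieces and has CLIP
    # score each cut against the prompt; here we just compute one direction
    # pointing from the current "image" toward the prompt.
    direction = prompt_embedding - image
    # Step 4: nudge the image toward the prompt (the diffusion denoiser would
    # also remove a bit of noise each step, omitted in this toy).
    image = image + guidance_scale * direction

# Step 5: after `steps` iterations the image has converged toward the prompt.
print(np.allclose(image, prompt_embedding, atol=1e-2))  # → True
```

Too large a guidance_scale overshoots the target each step instead of converging, which is the toy analogue of a too-high clip_guidance_scale distorting the image.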


Text prompts

  1. It is a dictionary: the keys let you introduce a different artistic direction starting at a given step
  2. Within a prompt list, you can give different descriptions different weights to reinforce a direction
  3. You can include style keywords (the names of online artists), which can very effectively improve the quality of the result
  4. Artist-style collection: https://weirdwonderfulai.art/resources/disco-diffusion-70-plus-artist-studies/
  5. Or browse artstation and include an artist's name (if their work was used as training material)
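Putting points 1–3 together, a text_prompts dictionary might look like the sketch below. The wording, artist name, and weights are made up for illustration; Disco Diffusion reads a trailing ":number" on a prompt as that description's weight:

```python
# Hypothetical text_prompts: keys are the step where each prompt list takes
# effect, and a trailing ":number" gives a description its weight.
text_prompts = {
    0: [
        "a lighthouse on a cliff at sunset, by greg rutkowski, artstation:3",
        "vibrant colors:1",
    ],
    100: [
        # from step 100 onward, steer toward a mistier mood
        "fog rolling in over the sea:2",
    ],
}
```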

