SANA: Is NVIDIA's Image Generation Really Faster?

 

nvidia sana image generation model

The renowned GPU manufacturer has entered the diffusion race. SANA, released by NVIDIA Labs, can generate a 1024 × 1024 image in under 1 second on a 16GB laptop GPU and handles resolutions up to 4096 × 4096. It competes with much larger models like Flux-12B while being 20× smaller and 100× faster.



Unlike traditional Stable Diffusion models (e.g., SDXL), it uses an AE-F32C32 autoencoder that compresses images by 32× (compared to 8× in older methods), making the process faster without a noticeable loss in quality.
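To see why the higher compression ratio matters, the arithmetic can be sketched in a few lines of Python. The function name `latent_grid` is hypothetical, and the sketch assumes the compression factor applies to each spatial side of the image, which is how autoencoder downsampling factors such as F8 or F32 are usually defined. Channel counts and any patching done by the actual models are left out.

```python
def latent_grid(image_size: int, downsample: int) -> tuple[int, int]:
    """Return (side, positions) of the latent grid for a square image."""
    if image_size % downsample != 0:
        raise ValueError("image size must be divisible by the downsampling factor")
    side = image_size // downsample
    return side, side * side


# Compare an 8x autoencoder with a 32x autoencoder on a 1024 x 1024 image.
for factor in (8, 32):
    side, positions = latent_grid(1024, factor)
    print(f"F{factor}: {side} x {side} latent grid, {positions} spatial positions")
```

For a 1024 × 1024 image this gives a 128 × 128 grid at 8× versus a 32 × 32 grid at 32×, which is 16 times fewer spatial positions for the diffusion model to process.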

It uses linear attention, which handles high-resolution images efficiently. It also replaces older text encoders (like T5) with a smaller, faster model that understands instructions better, and its Flow-DPM-Solver produces high-quality images in fewer sampling steps. For a more in-depth understanding, you can read their research paper.
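The efficiency gain from linear attention comes from reordering the matrix products so that the cost grows linearly, rather than quadratically, with the number of tokens. The following is a minimal pure-Python sketch of the general technique with a ReLU feature map, not SANA's actual implementation. The function name, the `eps` constant, and the list-based layout are assumptions made for illustration.

```python
def relu_linear_attention(queries, keys, values, eps=1e-6):
    """Linear attention over lists of token vectors.

    queries, keys: vectors of length d; values: vectors of length dv.
    """
    d, dv = len(keys[0]), len(values[0])
    # One pass over the tokens builds two small summaries whose size does not
    # depend on the token count: sum_j relu(k_j) v_j^T and sum_j relu(k_j).
    kv = [[0.0] * dv for _ in range(d)]
    k_sum = [0.0] * d
    for k, v in zip(keys, values):
        for a in range(d):
            ka = max(k[a], 0.0)
            k_sum[a] += ka
            for b in range(dv):
                kv[a][b] += ka * v[b]
    # Every query reuses the same summaries, so no N x N score matrix
    # is ever formed.
    outputs = []
    for q in queries:
        pq = [max(x, 0.0) for x in q]
        norm = sum(pq[a] * k_sum[a] for a in range(d)) + eps
        outputs.append(
            [sum(pq[a] * kv[a][b] for a in range(d)) / norm for b in range(dv)]
        )
    return outputs


# Tiny example: three 2-dimensional tokens attending to themselves.
tokens = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
print(relu_linear_attention(tokens, tokens, tokens))
```

The result matches what you would get by computing every pairwise ReLU score and normalizing, but the work per query stays constant as the number of tokens grows, which is what makes very high resolutions practical.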


Installation

1. Install ComfyUI on your machine.

install sana custom nodes

2. Open ComfyUI Manager, select "Custom Nodes Manager", search for "Extra models" by author "city96", and click "Install". To follow the download status in real time, check the ComfyUI terminal.

Alternative (manual):

Go to the "ComfyUI/custom_nodes" folder. Open a command prompt by typing "cmd" in the folder's address bar, then clone the repository using the following command:

git clone https://github.com/Efficient-Large-Model/ComfyUI_ExtraModels.git


download sana model

3. Download the model from Sana's Hugging Face repository and save it in your "ComfyUI/models/checkpoints" folder.


download VAE model

4. Next, download the VAE (Variational Autoencoder) from the Hugging Face repository and save it in the "ComfyUI/models/vae" folder. After downloading, rename it to something recognizable (for example, sana_vae.safetensors) to avoid conflicts.

5. One more model is required to process your prompts: Google's Gemma, a lightweight LLM that is popular with developers. It is downloaded automatically when you run the workflow. To follow the download status in real time, switch to the ComfyUI terminal running in the background.

Many people get a black or grey image when running the workflow. If this happens to you, uninstall the "Extra Models" custom nodes and restart ComfyUI. Then use the manual installation from Step 2, open "Install Missing Custom Nodes" in the Manager, and finally click "Install" for Gemma.

6. Restart and refresh ComfyUI. 
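Before launching the workflow, it can help to confirm that the downloaded files ended up in the folders named in steps 3 and 4. The following is a small sketch using only the Python standard library. The helper name `find_empty_model_folders`, the list of file extensions, and the assumption that ComfyUI lives in a folder called "ComfyUI" next to the script are all illustrative, so adjust them to your setup.

```python
from pathlib import Path

# Assumed extensions for model weights; extend this if your files differ.
MODEL_SUFFIXES = {".safetensors", ".pth", ".ckpt"}


def find_empty_model_folders(comfy_root):
    """Return the folders from steps 3 and 4 that are missing or hold no model file."""
    root = Path(comfy_root)
    empty = []
    for rel in ("models/checkpoints", "models/vae"):
        folder = root / rel
        has_model = folder.is_dir() and any(
            p.is_file() and p.suffix.lower() in MODEL_SUFFIXES
            for p in folder.iterdir()
        )
        if not has_model:
            empty.append(rel)
    return empty


if __name__ == "__main__":
    # "ComfyUI" is an assumed install location; point this at your own folder.
    problems = find_empty_model_folders("ComfyUI")
    if problems:
        print(f"No model files found in: {problems}")
    else:
        print("Checkpoint and VAE folders both contain model files.")
```

The check only looks at file extensions, so it cannot tell whether a file is the right model or whether the download completed.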


Workflow

1. Get the workflow from the GitHub repository.

(a) Basic TextToImage workflow

2. Drag and drop it into ComfyUI.

load checkpoint

(a) Load your SANA model checkpoint.

load vae

(b) Load the VAE model.

load gemma loader

(c) Select the Gemma model. It can also run on the CPU if you do not have a powerful GPU, but it will take more time. In a CPU setup, you can use the 4-bit quantized model.


KSampler settings

(d) Set the KSampler settings.

Add positive negative prompts

(e) Add the positive and negative prompts.

(f) Hit the "Queue" button to start generating. Here are our results.


First try:

Prompt: a cyberpunk cat with a neon sign that says "Sana"

CFG: 3

Steps: 18

a cat with neon lightings



The results are not cherry-picked. The second generation is much better than the first, but the rendered text is not up to the mark.

Second try:

This time we tried something realistic.

Prompt used: a beautiful woman in the beach, professional photoshoot, 8k, art with realism

CFG: 3

Steps: 18


a beautiful woman in beach for photoshoot


We tried to enhance the above prompt using Transformer-based LLMs (Large Language Models), and here is what we got.

Enhanced prompt (NLP technique): A stunning woman with graceful features, standing confidently on a pristine sandy beach during golden hour, illuminated by the warm glow of the setting sun. She is wearing an elegant flowing summer dress that gently sways with the ocean breeze, her hair cascading in soft waves. The background captures the serene beauty of the beach, with clear turquoise waters stretching to the horizon, soft ripples reflecting the sunlight, and a vibrant sky transitioning from orange to pink hues. The composition emphasizes artful realism, professional photoshoot quality, with intricate details in facial expressions, fabric textures, and natural lighting. Captured in ultra-high-resolution (8k), the scene evokes a sense of tranquility and sophistication.


generated result with enhanced prompting

Even after enhancing the prompt, we got poor results. The face is deformed, and the hands and legs are not rendered correctly. This was a really unexpected result.

After multiple trials, we concluded that the model generates images faster than other diffusion models, but you should not expect much from it in terms of quality.

Another downside is that the model is censored, so unlike with Pony models, you do not have full control over what it generates. This is a restriction built into the model.