Delta Denoising Score

Amir Hertz^1,2 Kfir Aberman¹ Daniel Cohen-Or^1,2

¹ Google Research ² Tel Aviv University

Abstract

We introduce Delta Denoising Score (DDS), a novel scoring function for text-based image editing that guides minimal modifications of an input image towards the content described in a target prompt. DDS leverages the rich generative prior of text-to-image diffusion models and can be used as a loss term in an optimization problem to steer an image towards a desired direction dictated by a text. DDS utilizes the Score Distillation Sampling (SDS) mechanism for the purpose of image editing. We show that using only SDS often produces non-detailed and blurry outputs due to noisy gradients. To address this issue, DDS uses a prompt that matches the input image to identify and remove undesired erroneous directions of SDS. Our key premise is that SDS should be zero when calculated on pairs of matched prompts and images, meaning that if the score is non-zero, its gradients can be attributed to the erroneous component of SDS. Our analysis demonstrates the competence of DDS for text based image-to-image translation. We further show that DDS can be used to train an effective zero-shot image translation model. Experimental results indicate that DDS outperforms existing methods in terms of stability and quality, highlighting its potential for real-world applications in text-based image editing.

Main Idea

DDS editing optimization eliminates the noisy component of SDS using a reference SDS estimation.

Modifying the flamingo in the image (left) to a peacock using SDS-based optimization results in a smooth and detail-less peacock on a blurry background (third image). To eliminate the "bad" component of the SDS gradient, we subtract the SDS gradient of the input image with respect to its corresponding input text prompt at each optimization step.

Our key idea is that SDS should be zero when calculated on pairs of matched prompts and images, meaning that if the score is non-zero, its gradients can be attributed to the erroneous component of SDS.

Zero Shot Image Editing

Our Delta Deposing Score (DDS) can guide a pixel space optimization of a given real image towards a target text prompt, with and without a source prompt:

Image-to-Image Translation

Using DDS we can train a multi-task image-translation in an unsupervised fashion (no paired data). Given an input image (left) and a sampled task embedding (top), our network is trained using DDS and corresponding text embeddings (bottom) that describe the input image and the desired edited image result:

The trained network can then translate arbitrary real images based on the specified task, within a single feedforward pass:

Acknowledgements

We thank Ben Poole, Elad Richardson, Ron Mokady and Jason Baldridge for their valuable inputs that helped improve this work.

BibTex

@article{hertz2023delta, title={Delta Denoising Score}, author={Hertz, Amir and Aberman, Kfir and Cohen-Or, Daniel}, booktitle={arXiv preprint arXiv:2304.07090}, year={2023} }