Pictorama: Text based Image Editing using Diffusion Model


  • Teena Varma, Harshali Patil, Kavita Jain, Deepali Vora, Akash Sawant, Vishal Mamluskar, Allen Lopes, Nesan Selvan


Keywords: Image Processing, Computer Vision, PyTorch, Stable Diffusion Models, Python, Machine Learning.


This research aims to pioneer image modification through text, integrating natural language descriptions with advanced computer vision and NLP techniques. The primary objective is to bridge human language and image editing, empowering users to convey creative visions effortlessly, revolutionizing the field of image modification. The study employs stable diffusion models, leveraging PyTorch and Python. It builds on prior works such as Imagic, LEDITS, and InstructPix2Pix, integrating a novel Vector Quantized Diffusion (VQ-Diffusion) model. The model is trained on a 436 GB dataset containing three features: an input image, an editing instruction, and an output edited image. Test samples include real images subjected to diverse text prompts for image edits, with disentanglement properties explored. The approach combines textual inversion and Box-Constrained Diffusion (BoxDiff) for personalized and conditional image synthesis. The research shows that stable diffusion models exhibit disentanglement properties, enabling effective modifications without extensive fine-tuning. The introduced BoxDiff and VQ-Diffusion models demonstrate superior performance in spatially constrained and complex scene synthesis, outperforming traditional methods. Output images show higher quality, with good cohesiveness throughout the image. Running the model for a greater number of steps improves quality; here we use 100 steps for higher image quality, and we also study the effect of the number of steps on inference time. Because inference requires a large amount of video memory, we recommend a GPU with more than 11 GB of VRAM. The study adds value by addressing biases, achieving higher speeds, and enhancing image quality, contributing to the evolving landscape of text-to-image synthesis.
This research introduces novel approaches in disentanglement, spatially constrained synthesis, and rapid image generation, pushing the boundaries of text-to-image synthesis beyond existing limitations.
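The instruction-driven editing loop described in the abstract can be sketched with the Hugging Face `diffusers` InstructPix2Pix pipeline. The pipeline class, model ID, and helper function below are illustrative assumptions, not the paper's exact setup; the 100-step setting and the >11 GB VRAM recommendation come from the abstract.

```python
# Sketch of text-guided image editing with a diffusion pipeline.
# Names follow the `diffusers` InstructPix2Pix API as an assumption;
# the paper's own VQ-Diffusion/BoxDiff models are not reproduced here.

def edit_image(pipe, image, instruction, num_inference_steps=100):
    """Apply one natural-language editing instruction to an image.

    More inference steps generally improve quality at the cost of
    longer runtime (the study uses 100 steps); a GPU with more than
    11 GB of VRAM is recommended for inference.
    """
    result = pipe(
        prompt=instruction,
        image=image,
        num_inference_steps=num_inference_steps,
    )
    return result.images[0]

# Typical setup (requires a CUDA GPU and a model download):
# import torch
# from diffusers import StableDiffusionInstructPix2PixPipeline
# pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
#     "timbrooks/instruct-pix2pix", torch_dtype=torch.float16).to("cuda")
# edited = edit_image(pipe, input_image, "make the sky stormy")
```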




Kawar, B., Zada, S., Lang, O., Tov, O., Chang, H., Dekel, T., ... & Irani, M. Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6007-6017) (2023). Available from: https://doi.org/10.48550/arXiv.2210.09276

Tsaban, L., & Passos, A. LEDITS: Real Image Editing with DDPM Inversion and Semantic Guidance. arXiv preprint arXiv:2307.00522 (2023). Available from: https://doi.org/10.48550/arXiv.2307.00522

Brooks, T., Holynski, A., & Efros, A. A. InstructPix2Pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 18392-18402) (2023). Available from: https://doi.org/10.48550/arXiv.2211.09800


Miyake, Daiki, et al. "Negative-prompt Inversion: Fast Image Inversion for Editing with Text-guided Diffusion Models." arXiv preprint arXiv:2305.16807 (2023). Available from: https://doi.org/10.48550/arXiv.2305.16807

Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., … Simonyan, K. Flamingo: A visual language model for few-shot learning (2022). Available from: https://doi.org/10.48550/arXiv.2204.14198

Dhariwal, P., & Nichol, A. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34, 8780–8794 (2021). Available from: https://doi.org/10.48550/arXiv.2105.05233

Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A. H., Chechik, G., & Cohen-Or, D. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618 (2022). Available from: https://doi.org/10.48550/arXiv.2208.01618

Elharrouss, O., Almaadeed, N., Al-Maadeed, S., & Akbari, Y. Image inpainting: A review. Neural Processing Letters, 51, 2007-2028 (2020). Available from: https://doi.org/10.48550/arXiv.1909.06399

Ding, M., Yang, Z., Hong, W., Zheng, W., Zhou, C., Yin, D., ... & Tang, J. CogView: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems, 34, 19822-19835 (2021). Available from: https://doi.org/10.48550/arXiv.2105.13290

Gafni, O., Polyak, A., Ashual, O., Sheynin, S., Parikh, D., & Taigman, Y. Make-A-Scene: Scene-based text-to-image generation with human priors. In European Conference on Computer Vision (pp. 89-106) (2022, October). Cham: Springer Nature Switzerland. Available from: https://doi.org/10.48550/arXiv.2203.13131

Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., ... & Chang, S. Uncovering the disentanglement capability in text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1900-1910) (2023). Available from: https://doi.org/10.48550/arXiv.2212.08698

Xie, J., Li, Y., Huang, Y., Liu, H., Zhang, W., Zheng, Y., & Shou, M. Z. BoxDiff: Text-to-image synthesis with training-free box-constrained diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 7452-7461) (2023). Available from: https://doi.org/10.48550/arXiv.2307.10816

Gu, S., Chen, D., Bao, J., Wen, F., Zhang, B., Chen, D., ... & Guo, B. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10696-10706) (2022). Available from: https://doi.org/10.48550/arXiv.2111.14822

Nguyen, T., Li, Y., Ojha, U., & Lee, Y. J. Visual Instruction Inversion: Image Editing via Visual Prompting. arXiv preprint arXiv:2307.14331 (2023, July 25). Available from: https://doi.org/10.48550/arXiv.2307.14331

Li, J., Lu, W., Yang, M., Zhou, Y., & Yu, W. Text to Image Generation with Semantic-Spatial Aware GAN. arXiv preprint arXiv:2104.00567 (2021, April 6). Available from: https://doi.org/10.48550/arXiv.2104.00567

Gu, X., Yang, Y., Xu, H., Zhou, C., & Wang, Y. Text-Guided Neural Image Inpainting. arXiv preprint arXiv:2004.03212 (2020, April 7). Available from: https://doi.org/10.1145/3394171.3414017

Dataset link: https://instruct-pix2pix.eecs.berkeley.edu/clip-filtered-dataset/
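Each record in the linked dataset pairs an input image with an editing instruction and the edited output, matching the three features named in the abstract. A minimal sketch of that record structure (the field names and file names are illustrative assumptions):

```python
from typing import NamedTuple

class EditTriple(NamedTuple):
    """One training example: input image, edit instruction, edited image."""
    input_image: str   # path to the original image
    instruction: str   # natural-language editing instruction
    edited_image: str  # path to the target edited image

# Hypothetical example record:
sample = EditTriple("0001_input.jpg", "turn the car red", "0001_edited.jpg")
```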




How to Cite

Varma, T., Patil, H., Jain, K., Vora, D., Sawant, A., Mamluskar, V., Lopes, A., & Selvan, N. (2024). Pictorama: Text based Image Editing using Diffusion Model. International Journal of Intelligent Systems and Applications in Engineering, 12(21s), 380–388. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/5435



Research Article