Open-source address: https://modelscope.cn/models/AI-ModelScope/Emu2-Gen

Model Details

Emu2-Gen

Paper | 🤗 HF Demo | Demo | Project Page | GitHub

Model Weights

Model name   Weight
Emu2         🤗 HF link
Emu2-Chat    🤗 HF link
Emu2-Gen     🤗 HF link
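
The other checkpoints in the family can presumably be fetched from ModelScope the same way as Emu2-Gen below. A minimal sketch, assuming the sibling repos live under the same "AI-ModelScope" namespace; only "AI-ModelScope/Emu2-Gen" is confirmed on this page:

from modelscope import snapshot_download

# Hypothetical repo IDs for the sibling checkpoints; adjust if the actual
# ModelScope IDs differ.
for repo_id in ("AI-ModelScope/Emu2", "AI-ModelScope/Emu2-Chat"):
    local_path = snapshot_download(repo_id, revision='master')
    print(repo_id, "->", local_path)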

Inference (Huggingface Version)

Emu2-Gen

import os

import cv2
import numpy as np
import torch
from PIL import Image

from diffusers import DiffusionPipeline
from modelscope import AutoModelForCausalLM, AutoTokenizer, snapshot_download

# On first use, download the ModelScope repo "AI-ModelScope/Emu2-Gen" locally
path = snapshot_download("AI-ModelScope/Emu2-Gen", revision='master')

# Load the multimodal encoder and tokenizer separately so they can be passed
# to the custom diffusers pipeline below
multimodal_encoder = AutoModelForCausalLM.from_pretrained(
    f"{path}/multimodal_encoder",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    use_safetensors=True,
    variant="bf16"
)
tokenizer = AutoTokenizer.from_pretrained(f"{path}/tokenizer")

# First-time initialization: pass the explicitly loaded encoder and tokenizer
pipe = DiffusionPipeline.from_pretrained(
    path,
    custom_pipeline="pipeline_emu2_gen",
    torch_dtype=torch.bfloat16,
    use_safetensors=True,
    variant="bf16",
    multimodal_encoder=multimodal_encoder,
    tokenizer=tokenizer,
)

# On subsequent runs, once everything is cached, you can initialize the
# pipeline directly, without passing the encoder and tokenizer explicitly:
pipe = DiffusionPipeline.from_pretrained(
    path,
    custom_pipeline="pipeline_emu2_gen",
    torch_dtype=torch.bfloat16,
    use_safetensors=True,
    variant="bf16",
)

pipe.to("cuda")

# text-to-image
prompt = "impressionist painting of an astronaut in a jungle"
ret = pipe(prompt)
ret.image.save("astronaut.png")

# image editing
image = Image.open(os.path.join(path,"examples/dog.jpg")).convert('RGB')
prompt = [image, "wearing a red hat on the beach."]
ret = pipe(prompt)
ret.image.save("dog_hat_beach.png")

# grounding generation
def draw_box(left, top, right, bottom):
    # Boxes are drawn as white outlines on a fixed 448x448 canvas;
    # coordinates are pixel positions on that canvas.
    mask = np.zeros((448, 448, 3), dtype=np.uint8)
    mask = cv2.rectangle(mask, (left, top), (right, bottom), (255, 255, 255), 3)
    mask = Image.fromarray(mask)
    return mask

dog1 = Image.open(os.path.join(path,"examples/dog1.jpg")).convert('RGB')
dog2 = Image.open(os.path.join(path,"examples/dog2.jpg")).convert('RGB')
dog3 = Image.open(os.path.join(path,"examples/dog3.jpg")).convert('RGB')
dog1_mask = draw_box( 22,  14, 224, 224)
dog2_mask = draw_box(224,  10, 448, 224)
dog3_mask = draw_box(120, 264, 320, 438)

# Note: adjacent string literals concatenate, so each "<phrase>...</phrase>"
# joins the "<object>" on the following line into a single string.
prompt = [
    "<grounding>",
    "An oil painting of three dogs,",
    "<phrase>the first dog</phrase>"
    "<object>",
    dog1_mask,
    "</object>",
    dog1,
    "<phrase>the second dog</phrase>"
    "<object>",
    dog2_mask,
    "</object>",
    dog2,
    "<phrase>the third dog</phrase>"
    "<object>",
    dog3_mask,
    "</object>",
    dog3,
]
ret = pipe(prompt)
ret.image.save("three_dogs.png")

# Autoencoding
# Autoencoding mode is triggered only when the prompt is exactly one image.
# If you instead want the model to generate a new image from a single image,
# add an extra empty string "" alongside it, e.g.
#   autoencoding mode: prompt = image or [image]
#   generation mode:   prompt = ["", image] or [image, ""]
prompt = Image.open(os.path.join(path,"examples/doodle.jpg")).convert("RGB")
ret = pipe(prompt)
ret.image.save("doodle_ae.png")

Citation

If you find Emu2 useful for your research and applications, please consider starring this repository and citing:

@article{Emu2,
    title={Generative Multimodal Models are In-Context Learners}, 
    author={Quan Sun and Yufeng Cui and Xiaosong Zhang and Fan Zhang and Qiying Yu and Zhengxiong Luo and Yueze Wang and Yongming Rao and Jingjing Liu and Tiejun Huang and Xinlong Wang},
    journal={arXiv preprint arXiv:2312.13286},
    year={2023},
}