Image editing models typically specialize in one type of edit: they either change semantics (the meaning or identity of content) or appearance (specific pixels and elements), but not both equally well. Qwen-Image-Edit, announced by the Qwen team, addresses this by simultaneously feeding the input image through two pathways, Qwen2.5-VL for visual semantic control and a VAE Encoder for visual appearance control, so that a single model handles both categories of edit. It also extends the text rendering expertise of the 20B Qwen-Image model to image editing tasks, enabling precise text modification.

The model is available through Qwen Chat’s “Image Editing” feature.
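For programmatic use, a minimal sketch is shown below. It assumes a Hugging Face diffusers integration under the name QwenImageEditPipeline and a Qwen/Qwen-Image-Edit checkpoint; neither detail appears in the announcement itself, so treat every name and argument here as an assumption following common diffusers conventions.

```python
# Hedged sketch: pipeline class, checkpoint name, and call signature are
# assumptions modeled on typical diffusers usage, not on the announcement.
import torch
from PIL import Image
from diffusers import QwenImageEditPipeline

pipe = QwenImageEditPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

source = Image.open("capybara.png").convert("RGB")  # any input image
edited = pipe(
    image=source,
    prompt="Rotate the character 180 degrees to show the back side",
    num_inference_steps=50,
).images[0]
edited.save("capybara_back.png")
```

Later sketches in this section reuse this `pipe` object and these imports.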

Semantic versus appearance editing

The announcement distinguishes two editing modes with meaningfully different requirements.

Semantic editing modifies image content while preserving the identity or character of the subject. The example used throughout the post is Qwen’s mascot Capybara: even when most pixels in the edited image differ from the original, the character’s visual identity is preserved. The post describes semantic editing as enabling IP content creation — generating varied scenes, poses, or contexts around a consistent character. Specific applications demonstrated include novel view synthesis (rotating objects by 90 or 180 degrees to show the back side), style transfer (converting a portrait to Studio Ghibli style), and MBTI-themed emoji packs built around the Capybara mascot using 16 personality type prompts.
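The emoji-pack case is essentially a loop over prompts against a fixed reference image. A minimal sketch, reusing `pipe` and the imports from the sketch above; the prompt template is illustrative, not quoted from the post:

```python
# The 16 MBTI type codes are standard; the prompt wording is invented for
# illustration and the pipeline interface is the assumed one from above.
MBTI_TYPES = [
    "INTJ", "INTP", "ENTJ", "ENTP", "INFJ", "INFP", "ENFJ", "ENFP",
    "ISTJ", "ISFJ", "ESTJ", "ESFJ", "ISTP", "ISFP", "ESTP", "ESFP",
]

mascot = Image.open("capybara.png").convert("RGB")
for mbti in MBTI_TYPES:
    sticker = pipe(
        image=mascot,
        prompt=f"Turn the capybara into an {mbti}-themed emoji sticker, "
               "keeping the character's identity intact",
        num_inference_steps=50,
    ).images[0]
    sticker.save(f"capybara_{mbti}.png")
```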

Appearance editing, by contrast, requires specific regions of the image to remain completely unchanged while targeted elements are added, removed, or modified. The post demonstrates adding a signboard to a scene — including generating a corresponding reflection of the sign, which the post describes as showing “exceptional attention to detail.” Other demonstrated cases include removing fine hair strands and small objects, changing the color of a specific letter in an image to blue, adjusting a person’s background, and changing clothing.
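Appearance editing has a checkable contract: pixels outside the edited region should be identical, or very nearly so, to the original. The verification sketch below is not from the post; the tolerance and box coordinates are arbitrary illustrations. Note that the signboard example would intentionally trip this check in the reflection region, since the model adds a plausible reflection outside the sign itself.

```python
import numpy as np
from PIL import Image

def unchanged_outside(original_path: str, edited_path: str,
                      box: tuple[int, int, int, int], tol: int = 2) -> bool:
    """Check that an appearance edit left everything outside `box` intact.

    box: (left, top, right, bottom) region where the edit was allowed.
    tol: max per-channel deviation tolerated elsewhere (VAE round-tripping
         can introduce tiny differences even in untouched regions).
    """
    a = np.asarray(Image.open(original_path).convert("RGB"), dtype=np.int16)
    b = np.asarray(Image.open(edited_path).convert("RGB"), dtype=np.int16)
    if a.shape != b.shape:
        return False
    outside = np.ones(a.shape[:2], dtype=bool)
    left, top, right, bottom = box
    outside[top:bottom, left:right] = False     # ignore the edited region
    diff = np.abs(a - b).max(axis=-1)           # per-pixel worst channel delta
    return bool(diff[outside].max() <= tol)

print(unchanged_outside("scene.png", "scene_sign.png", box=(400, 120, 680, 320)))
```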

The architectural reason for dual inputs is that these two editing modes pull in opposite directions. Semantic control requires high-level visual understanding to reason about identity and scene meaning. Appearance control requires pixel-level fidelity to the original image in unedited regions. Routing the input through Qwen2.5-VL for the semantic pathway and the VAE Encoder for the appearance pathway lets each pathway specialize.
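The announcement does not publish the conditioning interface, but the division of labor can be sketched schematically. In the pseudocode below, every module name and the conditioning mechanism are assumptions: an abstract semantic encoder stands in for Qwen2.5-VL, a VAE encoder produces appearance latents, and a diffusion backbone consumes both streams.

```python
# Schematic of the dual-pathway conditioning described in the post. This is
# not the released architecture; all names, shapes, and interfaces are assumed.
import torch
import torch.nn as nn

class DualPathEditModel(nn.Module):
    def __init__(self, vlm: nn.Module, vae_encoder: nn.Module, backbone: nn.Module):
        super().__init__()
        self.vlm = vlm                  # stands in for Qwen2.5-VL: semantics
        self.vae_encoder = vae_encoder  # pixel-faithful appearance latents
        self.backbone = backbone        # diffusion model conditioned on both

    def forward(self, image: torch.Tensor, instruction_tokens: torch.Tensor,
                noisy_latents: torch.Tensor, timestep: torch.Tensor) -> torch.Tensor:
        # Semantic pathway: a joint image+instruction embedding that carries
        # identity and scene meaning (what to preserve conceptually).
        semantic_cond = self.vlm(image, instruction_tokens)
        # Appearance pathway: compressed but spatially aligned latents that
        # let untouched regions be reconstructed with pixel-level fidelity.
        appearance_cond = self.vae_encoder(image)
        # The backbone denoises while attending to both condition streams.
        return self.backbone(noisy_latents, timestep,
                             semantic_cond=semantic_cond,
                             appearance_cond=appearance_cond)
```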

Precise text editing

Text rendering in generated images has been a persistent weakness of diffusion models. The post frames Qwen-Image-Edit’s text editing capability as a direct extension of Qwen-Image’s existing text rendering expertise.

The model supports bilingual text editing in Chinese and English. Demonstrated capabilities include modifying large headline text on posters, adjusting small and intricate text elements, and preserving the original font, size, and style of surrounding text when making changes. The post shows direct addition, deletion, and modification of text in images while leaving adjacent text unchanged.
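In practice these text edits are ordinary instructions to the same pipeline. The prompts below are illustrative paraphrases of the demonstrated cases, reusing the hypothetical `pipe` from the first sketch; the specific strings are invented:

```python
# Illustrative instruction strings; edits are applied sequentially so each
# round leaves the rest of the poster untouched.
poster = Image.open("poster.png").convert("RGB")

text_edits = [
    'Replace the headline "GRAND OPENING" with "NOW OPEN", keeping the '
    "original font, size, and style",                                # modify
    'Add the line "Est. 2025" under the logo in a matching style',   # add
    "Remove the small caption in the bottom-right corner",           # delete
]

result = poster
for instruction in text_edits:
    result = pipe(image=result, prompt=instruction,
                  num_inference_steps=50).images[0]
result.save("poster_edited.png")
```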

A specific example illustrates the capability at its limit: correcting errors in a calligraphy artwork generated by Qwen-Image, including the relatively obscure character “稽” (where the lower-right component should be “旨” rather than “日”). When the model fails to make the correction in a single step, the post demonstrates a chained editing approach: a bounding box is drawn around the specific incorrect component, and the correction is targeted to that region. The example concludes with a calligraphic rendering of Lantingji Xu (Orchid Pavilion Preface) corrected to be fully accurate.

Chained editing as a workflow

The Lantingji Xu example illustrates a broader capability: using Qwen-Image-Edit iteratively to progressively refine an image toward a target state. When a single edit instruction does not fully achieve the desired result — which the post presents as normal for complex or rare content — multiple rounds of bounded corrections can converge on the correct output. The bounding box mechanism provides spatial grounding for the edits, directing attention to specific image regions rather than requiring the model to infer which part of a complex image needs changing.
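The workflow reduces to a loop: edit, inspect, and if the result is still wrong, narrow the next instruction to a bounding box around the remaining error. A sketch of that loop follows, again reusing the hypothetical `pipe`; the announcement does not specify how boxes are supplied to the model, so here the box is simply described in the instruction text, and acceptance is a placeholder human check.

```python
# Hedged sketch of chained editing with progressively narrower corrections.
from PIL import Image

MAX_ROUNDS = 5

def approve(image: Image.Image) -> bool:
    """Placeholder acceptance check, e.g. a human reviewer or an OCR pass."""
    image.show()
    return input("Accept this edit? [y/N] ").strip().lower() == "y"

current = Image.open("calligraphy.png").convert("RGB")
instruction = ("Correct the character 稽: the lower-right component "
               "should be 旨, not 日")

for round_idx in range(MAX_ROUNDS):
    box = input("Bounding box left,top,right,bottom (blank = whole image): ").strip()
    prompt = instruction if not box else f"{instruction}. Edit only the region ({box})."
    current = pipe(image=current, prompt=prompt, num_inference_steps=50).images[0]
    current.save(f"round_{round_idx}.png")
    if approve(current):
        break
```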

Benchmark performance

The post states that Qwen-Image-Edit achieves state-of-the-art performance on multiple public benchmarks for image editing tasks, describing it as “a powerful foundation model for image editing.” Specific benchmark names and numeric results are not included in the announcement text.

The combination of semantic editing, appearance editing, and text editing in a single model, with the chained correction workflow demonstrated on a difficult Chinese calligraphy case, represents a broader editing surface than models that treat each editing type separately.