Intelligent and Multimedia Science Laboratory

Sketch Generation and Applications
Image Editing and Synthesis
3D Pose Estimation and Motion Generation
Garment Modeling and Virtual Try-on
Multimedia Processing & 3D Rendering and Modeling

Sketch Generation and Applications

	Joint Stroke Tracing and Correspondence for 2D Animation Haoran Mo, Chengying Gao^* and Ruomei Wang Intro: We for the first time propose a joint stroke tracing and correspondence approach. Given consecutive raster keyframes along with a single vector image of the starting frame as a guidance, the approach generates vector drawings for the remaining keyframes while ensuring one-to-one stroke correspondence. Our framework trained on clean line drawings generalizes to rough sketches and the generated results can be imported into inbetweening systems to produce inbetween sequences. Hence, the method is compatible with standard 2D animation workflow. An adaptive spatial transformation module (ASTM) is introduced to handle non-rigid motions and stroke distortion. We collect a dataset for training, with 10k+ pairs of raster frames and their vector drawings with stroke correspondence. ACM Transactions on Graphics (Presented at SIGGRAPH 2024) (CCF-A) [Paper] [Code] [Project Page]
	Text-based Vector Sketch Editing with Image Editing Diffusion Prior Haoran Mo, Xusheng Lin, Chengying Gao^* and Ruomei Wang Intro: We present a framework for text-based vector sketch editing to improve the efficiency of graphic design. The key idea behind the approach is to transfer the prior information from raster-level diffusion models, especially those from image editing methods, into the vector sketch-oriented task. The framework presents three editing modes and allows iterative editing. To meet the editing requirement of modifying the intended parts only while avoiding changing the other strokes, we introduce a stroke-level local editing scheme that automatically produces an editing mask reflecting locally editable regions and modifies strokes within the regions only. International Conference on Multimedia & Expo (ICME, 2024) (CCF-B) [Paper] [Code]
	Video-Driven Sketch Animation via Cyclic Reconstruction Mechanism Zhuo Xie, Haoran Mo and Chengying Gao^* Intro: Considering the time-consuming manual workflow in 2D sketch animation production, we present an automatic solution that uses videos as references to animate static sketch images. This includes motion extraction from the videos and injection into the sketches to produce animated sketch sequences in which appearance properties from the source sketches should be preserved. To reduce blurry artifacts caused by complex motions and maintain stroke line continuity, we propose to incorporate inner masks of the sketches as an explicit guidance to indicate inner regions and ensure component integrity. Moreover, to bridge the domain gap between the video frames and the sketches when modelling the motions, we introduce a cyclic reconstruction mechanism to increase compatibility with different domains and improve motion consistency between the sketch animation and the driving video. International Conference on Multimedia & Expo (ICME, 2024) (CCF-B) [Paper]
	Multi-instance Referring Image Segmentation of Scene Sketches based on Global Reference Mechanism Peng Ling, Haoran Mo and Chengying Gao^* Intro: We propose GRM-Net, a one-stage framework tailored for multi-instance referring image segmentation of scene sketches. We extract language features from the expression and fuse them into a conventional instance segmentation pipeline to filter out the undesired instances in a coarse-to-fine manner and keep the matched ones. To model the relative arrangement of the objects and the relationships among them from a global view, we propose a global reference mechanism (GRM) that assigns references to each detected candidate to identify its position. Pacific Graphics (PG 2022) (CCF-B) [Paper] [Code]
	Line Art Colorization Based on Explicit Region Segmentation Ruizhi Cao, Haoran Mo and Chengying Gao^* Intro: We introduce an explicit segmentation fusion mechanism to aid colorization frameworks in avoiding color bleeding artifacts. This mechanism provides region segmentation information to the colorization process explicitly, so that the colorization model can learn to avoid assigning the same color across regions with different semantics or inconsistent colors inside an individual region. The proposed mechanism is designed in a plug-and-play manner, so it can be applied to a variety of line art colorization frameworks with various kinds of user guidance. Computer Graphics Forum (Pacific Graphics 2021) (*oral) (CCF-B) [Paper] [Code]
	General Virtual Sketching Framework for Vector Line Art Haoran Mo, Edgar Simo-Serra, Chengying Gao^*, Changqing Zou and Ruomei Wang Intro: Vector line art plays an important role in graphic design; however, it is tedious to create manually. We introduce a general framework to produce line drawings from a wide variety of images, by learning a mapping from raster image space to vector image space. Our approach is based on a recurrent neural network that draws the lines one by one. A differentiable rasterization module allows for training with only supervised raster data. We use a dynamic window around a virtual pen while drawing lines, implemented with the proposed aligned cropping and differentiable pasting modules. Furthermore, we develop a stroke regularization loss that encourages the model to use fewer and longer strokes to simplify the resulting vector image. Ablation studies and comparisons with existing methods corroborate the efficiency of our approach, which is able to generate visually better results in less computation time, while generalizing better to a diversity of images and applications. ACM Transactions on Graphics (SIGGRAPH 2021, Journal track) (*oral) (CCF-A) [Paper] [Code] [Project Page]
	SketchyCOCO: Image Generation from Freehand Scene Sketches Chengying Gao, Qi Liu, Qi Xu, Jianzhuang Liu, Limin Wang, Changqing Zou* Intro: We introduce the first method for automatic image generation from scene-level freehand sketches. Our model allows for controllable image generation by specifying the synthesis goal via freehand sketches. The key contribution is an attribute vector bridged generative adversarial network called edgeGAN, which supports high visual-quality image content generation without using freehand sketches as training data. We build a large-scale composite dataset called SketchyCOCO to comprehensively evaluate our solution. We validate our approach on the tasks of both object-level and scene-level image generation on SketchyCOCO. We demonstrate the method’s capacity to generate realistic complex scene-level images from a variety of freehand sketches through quantitative and qualitative results and ablation studies. Computer Vision and Pattern Recognition (CVPR, 2020) (*oral) (CCF-A) [Paper] [Code]
	Language-based Colorization of Scene Sketches Changqing Zou^#, Haoran Mo^# (joint first author), Chengying Gao^*, Ruofei Du and Hongbo Fu Intro: This paper presents, for the first time, a language-based system for interactive colorization of scene sketches, based on semantic comprehension. The proposed system is built upon deep neural networks trained on a large-scale repository of scene sketches and cartoon-style color images with text descriptions. Given a scene sketch, our system allows users, via language-based instructions, to interactively localize and colorize specific foreground object instances to meet various colorization requirements in a progressive way. We demonstrate the effectiveness of our approach via comprehensive experimental results including alternative studies, comparison with the state-of-the-art methods, and generalization user studies. Given the unique characteristics of language-based inputs, we envision a combination of our interface with a traditional scribble-based interface for a practical multimodal colorization system, benefiting various applications. ACM Transactions on Graphics (SIGGRAPH Asia 2019, Journal track) (*oral) (CCF-A) [Paper] [Code]
	SketchyScene: Richly-Annotated Scene Sketches Changqing Zou^#, Qian Yu^#, Ruofei Du, Haoran Mo, Yi-Zhe Song, Tao Xiang, Chengying Gao, Baoquan Chen^*, and Hao Zhang Intro: This paper constructed the first large-scale dataset of scene sketches called SketchyScene. We demonstrate the potential impact of SketchyScene by training new computational models for semantic segmentation of scene sketches. European Conference on Computer Vision (ECCV, 2018) (CCF-B) [Paper] [Code]

Image Editing and Synthesis

Including: image inpainting, color restoration, color transfer and non-photorealistic rendering.

	Controllable Anime Image Editing via Probability of Attribute Tags Zhenghao Song, Haoran Mo, and Chengying Gao* Intro: Editing anime images via probabilities of attribute tags allows controlling the degree of the manipulation in an intuitive and convenient manner. Existing methods fall short in progressive modification and in preserving unintended regions of the input image. We propose a controllable anime image editing framework based on adjusting the tag probabilities, in which a probability encoding network (PEN) is developed to encode the probabilities into features that capture the continuous nature of the probabilities. Thus, the encoded features are able to direct the generative process of a pre-trained diffusion model and facilitate linear manipulation. We also introduce a local editing module that automatically identifies the intended regions and constrains the edits to those regions only, leaving the others unchanged. Comprehensive comparisons with existing methods indicate the effectiveness of our framework in both one-shot and linear editing modes. Results in additional applications further demonstrate the generalization ability of our approach. Pacific Graphics (PG, 2024) (CCF-B) [Paper] [Code]
	CAP-VSTNet: Content Affinity Preserved Versatile Style Transfer Linfeng Wen, Chengying Gao*, Changqing Zou Intro: Content affinity loss, including feature and pixel affinity loss, is a main problem that leads to artifacts in photorealistic and video style transfer. This paper proposes a new framework named CAP-VSTNet, which consists of a new reversible residual network and an unbiased linear transform module, for versatile style transfer. This reversible residual network not only preserves content affinity but also avoids introducing redundant information as traditional reversible networks do, and hence facilitates better stylization. Empowered by a Matting Laplacian training loss, which addresses the pixel affinity loss problem caused by the linear transform, the proposed framework is applicable and effective for versatile style transfer. Computer Vision and Pattern Recognition (CVPR, 2023) (CCF-A) [Paper] [Code]
	Structural Prior Guided Image Inpainting for Complex Scene Shuxin Wei, Chengying Gao Intro: Existing deep-learning based image inpainting methods have reached plausible results for small corrupted regions with rich context information. However, these methods fail to generate semantically reasonable results and clear boundaries. In this paper, we disentangle inpainting for complex scenes into two stages: semantic segmentation map inpainting and segmentation guided texture inpainting. We use a feature correspondence matrix to find correlations between segmentation maps and the known regions of corrupted images, and to realize texture generation in the corrupted regions. International Conference on Multimedia & Expo (ICME, 2021) (*oral) (CCF-B) [Paper]
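	Complex Object Inpainting Based on Sparse Structure (in Chinese) Chengying Gao, Xian'er Xu, Yanmei Luo, Dong Wang Chinese Journal of Computers, 2019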
	An edge-refined vectorized deep colorization model for grayscale-to-color images Zhuo Su, Xiangguo Liang, Jiaming Guo, Chengying Gao, Xiaonan Luo Neurocomputing, 2018 [Paper]
	PencilArt: A Chromatic Penciling Style Generation Framework Chengying Gao, Mengyue Tang, Xiangguo Liang, Zhuo Su, Changqing Zou Computer Graphics Forum (CGF), 2018 (CCF-B) [Paper]

3D Pose Estimation and Motion Generation

DiFusion: Flexible Stylized Motion Generation Using Digest-and-Fusion Scheme

Yatian Wang, Haoran Mo, Chengying Gao*

Intro: To address the issue of style expression in existing text-driven human motion synthesis methods, we propose DiFusion, a framework for diversely stylized motion generation. It offers flexible control of content through texts and of style via multiple modalities, i.e., textual labels or motion sequences. Our approach employs a dual-condition motion latent diffusion model, enabling independent control of content and style through flexible input modalities. To tackle the issue of imbalanced complexity between the text-motion and style-motion datasets, we propose the Digest-and-Fusion training scheme, which digests domain-specific knowledge from both datasets and then adaptively fuses them in a compatible manner. Comprehensive evaluations demonstrate the effectiveness of our method and its superiority over existing approaches in terms of content alignment, style expressiveness, realism, and diversity. Additionally, our approach can be extended to practical applications, such as motion style interpolation.

IEEE Transactions on Visualization and Computer Graphics (TVCG, 2025) (CCF-A)
[Paper]

Unpaired Motion Style Transfer with Motion-oriented Projection Flow Network

Yue Huang, Haoran Mo, Xiao Liang, Chengying Gao*

Intro: In this paper, we propose a novel unpaired motion style transfer framework that generates complete stylized motions with consistent content. We introduce a motion-oriented projection flow network (M-PFN) designed for temporal motion data, which encodes the content and style motions into latent codes and decodes the stylized features produced by adaptive instance normalization (AdaIN) into stylized motions. The M-PFN contains dedicated operations and modules, e.g., Transformer, to process the temporal information of motions, which help to improve the continuity of the generated motions.

International Conference on Multimedia & Expo (ICME, 2022) (*oral) (CCF-B)
[Paper]
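
For readers unfamiliar with the adaptive instance normalization (AdaIN) operation mentioned in the entry above, the following is a minimal NumPy sketch of the standard AdaIN formula applied to temporal features. It is an illustration only, not the M-PFN implementation: the `(channels, frames)` layout, the function name `adain`, and the `eps` value are assumptions made for this example.

```python
import numpy as np


def adain(content, style, eps=1e-5):
    """Standard AdaIN on features of shape (channels, frames).

    Each content channel is normalized over time, then rescaled and
    shifted to match the per-channel statistics of the style features.
    The layout and eps are assumptions for this sketch.
    """
    content = np.asarray(content, dtype=np.float64)
    style = np.asarray(style, dtype=np.float64)
    c_mean = content.mean(axis=1, keepdims=True)
    c_std = content.std(axis=1, keepdims=True) + eps
    s_mean = style.mean(axis=1, keepdims=True)
    s_std = style.std(axis=1, keepdims=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    content_feat = rng.normal(0.0, 1.0, size=(4, 60))  # hypothetical content features
    style_feat = rng.normal(2.0, 3.0, size=(4, 45))    # hypothetical style features
    stylized = adain(content_feat, style_feat)
    print(stylized.shape)
```

The output keeps the temporal length of the content features while taking on the per-channel mean and standard deviation of the style features, which is why the two inputs may have different numbers of frames.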

3D interacting hand pose and shape estimation from a single RGB image

Chengying Gao*, Yujia Yang, Wensheng Li

Intro: This paper proposes a network called GroupPoseNet, which uses a grouping strategy to estimate the 3D pose and shape of interacting hands from a single RGB image. GroupPoseNet extracts left- and right-hand features separately and thus avoids mutual interference between the interacting hands. Empowered by a novel up-sampling block called MF-Block, which predicts 2D heat-maps progressively by fusing image features, hand pose features, and multi-scale features, GroupPoseNet is effective and robust to severe occlusions. To achieve effective 3D hand reconstruction, we design a transformer-based inverse kinematics module (termed TikNet) to generate the 3D hand mesh.

Neurocomputing, 2022
[Paper]

Garment Modeling and Virtual Try-on

Controllable Garment Image Synthesis Integrated with Frequency Domain Features

Xinru Liang, Haoran Mo, Chengying Gao*

Intro: We propose a controllable garment image synthesis framework that takes as inputs an outline sketch and a texture patch and generates garment images with complicated and diverse texture patterns. To improve the performance of global texture expansion, we exploit the frequency domain features in the generative process, which are from a Fast Fourier Transform (FFT) and able to represent the periodic information of the patterns. We also introduce a perceptual loss in the frequency domain to measure the similarity of two texture pattern patches in terms of their intrinsic periodicity and regularity.

Computer Graphics Forum (Pacific Graphics, 2023) (*oral) (CCF-B)
[Paper]
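
As a rough illustration of why FFT features can represent the periodicity of a texture pattern, the sketch below compares two patches by their log-amplitude spectra. It is not the perceptual loss from the paper: the function name `spectrum_distance`, the mean-removal step, and the log scaling are assumptions chosen for this example.

```python
import numpy as np


def spectrum_distance(patch_a, patch_b):
    """Mean absolute difference between log-amplitude spectra of two 2D patches.

    The amplitude spectrum ignores circular shifts of a pattern, so two
    patches with the same period but different offsets score close to zero.
    This is an illustrative measure, not the loss used in the paper.
    """
    def log_amplitude(patch):
        patch = np.asarray(patch, dtype=np.float64)
        amplitude = np.abs(np.fft.fft2(patch - patch.mean()))
        return np.log1p(amplitude)

    return float(np.mean(np.abs(log_amplitude(patch_a) - log_amplitude(patch_b))))


if __name__ == "__main__":
    x = np.arange(32)
    stripes_8 = np.tile(np.sin(2 * np.pi * x / 8), (32, 1))  # period of 8 pixels
    stripes_8_shifted = np.roll(stripes_8, 3, axis=1)        # same pattern, shifted
    stripes_4 = np.tile(np.sin(2 * np.pi * x / 4), (32, 1))  # period of 4 pixels
    print(spectrum_distance(stripes_8, stripes_8_shifted))
    print(spectrum_distance(stripes_8, stripes_4))
```

The shifted copy yields a distance near zero, while the patch with a different period yields a clearly larger value, which is the property that makes frequency-domain comparisons attractive for repeating textures.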

FashionGAN: Display your fashion design using Conditional Generative Adversarial Nets

Yirui Cui, Qi Liu, Chengying Gao*, Zhuo Su

Computer Graphics Forum (Pacific Graphics, 2018) (*oral) (CCF-B)
[Paper] [Code] [Dataset]

Automatic 3D Garment Fitting Based on Skeleton Driving

Haozhong Cai, Guangyuan Shi, Chengying Gao*, Dong Wang

Pacific-Rim Conference on Multimedia (PCM, 2018) (*oral) (CCF-C)
[Paper]

Multimedia Processing & 3D Rendering and Modeling

Multimedia Processing: generation and understanding of music and dance.
3D Rendering and Modeling: dynamic human reconstruction and neural rendering, fast fluid surface reconstruction based on the narrow-band method, and fabric modeling and rendering.

	Mitigating Density Imbalance in 3D Gaussian Splatting for Few-Shot Reconstruction Rongbin Zheng, Wensheng Li, Lingzhe Zeng, Dong Wang, Chengying Gao* Intro: 3D Gaussian Splatting (3DGS) provides high-quality and real-time rendering for novel view synthesis; however, it often produces artifacts and blurry results under sparse-view settings. We find that sparse observations lead to uneven spatial constraints during optimization, exacerbating density imbalance, which subsequently gives rise to inaccurate Gaussians. To address this, we adopt a two-stage strategy that improves the density distribution. In preprocessing, we densify sparse distant regions of the SfM point cloud via a depth- and density-aware strategy. During training, we use two coupled 3DGS models for co-regularization and guide Gaussian updates with periodically rendered density maps, which are segmented into sparse, dense, and high-gradient regions to reflect different types of density errors. We then apply region-specific update rules (densification, reset, and perturbation) to progressively correct the imbalance. Experiments on LLFF, Mip-NeRF360, and DTU demonstrate the superiority of our approach. International Conference on Multimedia & Expo (ICME, 2026) (*spotlight) (CCF-B) [Paper]
	Illumination-Consistent Human-Scene Reconstruction from Monocular Video Rongbin Zheng, Wensheng Li, Lingzhe Zeng, Dong Wang, Chengying Gao* Intro: Reconstructing 3D humans and scenes from monocular videos is a challenging task, particularly due to human motion, varying illumination, and dynamic scene shadows. While recent works have explored scene disentanglement by jointly modeling humans and their surrounding scenes, they often overlook illumination and shadow effects, resulting in inconsistent human appearance and degraded scene realism. To address this gap, we propose a photometrically consistent integration of human and scene reconstruction based on 3D Gaussian Splatting, with a key focus on modeling spatially-varying illumination and shadows. Central to our method is a learnable light volume that provides localized lighting cues to human Gaussians, enabling more realistic and consistent appearance synthesis. To further ensure accurate human geometry and alignment, we adopt a two-stage reconstruction strategy: we first optimize a human mesh and then anchor Gaussians to the refined surface. In addition, we introduce an implicit shadow estimation module that disentangles cast shadows from the scene, thus supporting plausible human shadow synthesis. Our framework also facilitates human relighting and compositing into novel scenes with contextually appropriate lighting. Quantitative and qualitative results demonstrate that our method achieves state-of-the-art performance, producing consistent appearances, realistic illumination, and enhanced overall scene realism. IEEE Computer Vision and Pattern Recognition (CVPR, 2026) (CAS Zone 1/CCF-A) [Paper]
	IR-HGP: Physically-Aware Gaussian Inverse Rendering for High-Illumination Scenes via Generative Priors Qingan Zhang, Wensheng Li, Chengying Gao* Intro: Applying 3D Gaussian Splatting to inverse rendering, especially for relightable assets under high-illuminance conditions, remains challenging. Strong specular highlights and complex reflections complicate material-light disentanglement, often baking in shadows and losing specular detail. To address this, we introduce IR-HGP, a framework that achieves robust disentanglement using three synergistic modules. First, a Hybrid Visibility Decomposition module ensures physical visibility consistency. Second, a Generative Illumination Field Prior module infers detailed and high-dynamic range environmental lighting. Finally, a Physics-Aware Radiance Correction module stabilizes optimization and mitigates illumination artifacts. Our framework achieves state-of-the-art material recovery and relighting performance, outperforming existing methods under challenging illumination conditions. It reconstructs the view-dependent “shiny” appearance of reflective surfaces in real time, surpassing the limits of prior 3DGS-based inverse rendering methods. IEEE Computer Vision and Pattern Recognition (CVPR, 2026) (CAS Zone 1/CCF-A) [Paper]
	ReGA: Relighting Dynamic Gaussian Avatars from Sparse Views Lingzhe Zeng, Wensheng Li, Rongbin Zheng, Chengying Gao* Intro: Dynamic human relighting is a complex task, and its core challenge lies in effectively handling both dynamic human geometry reconstruction and material estimation. Existing works mainly achieve human reconstruction and relighting with Neural Radiance Fields, but they are both inefficient and still inaccurate in material estimation and relighting. In this paper, we propose a novel approach called ReGA, which leverages efficient 3D Gaussian Splatting to create animatable and relightable avatars from sparse-view human motion. To overcome the geometric weakness of the vanilla Gaussian representation, we introduce a dynamic alignment mechanism in the geometry stage, combining the advantages of Gaussian splatting and mesh-based representation to produce a reasonable human surface. In the material stage, we enhance the inverse rendering process by introducing twofold correlation strategies that establish chrominance correlation between Gaussian radiance color and albedo. Experiments demonstrate that our method outperforms existing approaches in the dynamic human relighting task. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT, 2025) (CAS Zone 1/CCF-B) [Paper]
	Feature Replacement in Gaussian Splatting for 3D Stylization Jinkeng Zhu, Wensheng Li, Chengying Gao* Intro: 3D generation and editing, as creative tools, have attracted increasing attention, with 3D stylization being a significant area of scene editing. However, existing 3D stylization methods suffer from several limitations, including the need for retraining for new style inputs, inadequate separation of scene content and style information, and mismatches between the stylized results and the reference style images. In this paper, we introduce a feature replacement module that utilizes a reversible network to decouple content and style features, ensuring the effective substitution of style information while preserving scene content. Additionally, we propose a Feature Chamfer Loss to align the high-dimensional feature space of the generated image with the reference style image, improving consistency and visual coherence. Experimental results demonstrate that our method outperforms existing techniques in terms of generation quality and multi-view consistency, advancing the state of 3D scene stylization. Computer Graphics International (CGI, 2025) (CCF-C) [Paper]
	Efficient Integration of Neural Representations for Dynamic Humans Wensheng Li, Lingzhe Zeng, Chengying Gao, Ning Liu* Intro: While numerous studies have explored NeRF-based novel view synthesis for dynamic humans, they often require training that exceeds several hours. In this work, we introduce an innovative approach for efficiently learning and integrating neural human representations. Specifically, we first propose decomposing the high-dimensional multi-space feature volume into several feature planes, and then use matrix multiplication to explicitly establish the correlations between different planes. This enables the simultaneous optimization of their counterparts across all dimensions by optimizing interpolated features, efficiently integrating associated details and accelerating convergence. Additionally, we use the proposed collaborative refinement process to iteratively enhance the canonical representation. By integrating multi-space representations, we further facilitate the co-optimization of multiple frames' time-dependent observations. Experiments demonstrate that our method can achieve high-quality free-viewpoint renderings within nearly 5 minutes of optimization. IEEE Transactions on Visualization and Computer Graphics (TVCG, 2024) (CAS Zone 1/CCF-A) [Paper]
	DanceComposer: Dance-to-Music Generation Using a Progressive Conditional Music Generator Xiao Liang, Wensheng Li, Lifeng Huang and Chengying Gao Intro: A wonderful piece of music is the essence and soul of dance, which motivates the study of automatic music generation for dance. To create appropriate music from dance, cross-modal correlations between dance and music, such as rhythm and style, should be considered. However, existing dance-to-music methods have difficulties in achieving rhythmic alignment and stylistic matching simultaneously. Additionally, the diversity of generated samples is limited due to the lack of available paired data. To address these issues, we propose DanceComposer, a novel dance-to-music framework, which generates rhythmically and stylistically consistent multi-track music from dance videos. DanceComposer features a Progressive Conditional Music Generator (PCMG) that gradually incorporates rhythm and style constraints, enabling both rhythmic alignment and stylistic matching. To enhance style control, we introduce a Shared Style Module (SSM) that learns cross-modal features as stylistic constraints. This allows the PCMG to be trained on extensive music-only data and diversifies the generated pieces. IEEE Transactions on Multimedia (TMM, 2024) (CAS Zone 1/CCF-B) [Paper]
	PianoBART: Symbolic Piano Music Generation and Understanding with Large-Scale Pre-Training Xiao Liang, Zijian Zhao, Weichao Zeng, Yutong He, Fupeng He, Yiyi Wang and Chengying Gao Intro: Learning musical structures and composition patterns is necessary for both music generation and understanding, but current methods do not make uniform use of learned features to generate and comprehend music simultaneously. In this paper, we propose PianoBART, a pre-trained model that uses BART for both symbolic piano music generation and understanding. We devise a multi-level object selection strategy for different pre-training tasks of PianoBART, which can prevent information leakage or loss and enhance learning ability. The musical semantics captured in pre-training are fine-tuned for music generation and understanding tasks. Experiments demonstrate that PianoBART efficiently learns musical patterns and achieves outstanding performance in generating high-quality coherent pieces and comprehending music. International Conference on Multimedia & Expo (ICME, 2024) (CCF-B)
	A Completely Parallel Surface Reconstruction Method for Particle-Based Fluids Wencong Yang, Chengying Gao Intro: In this paper, we first propose a fast, simple, and extremely accurate narrow-band method for fluid surfaces, which lets the surface reconstruction algorithm (such as marching cubes) process only the valid fluid surface region and thereby avoids most useless computation. We also analyze the potential race conditions and conditional branching in the reconstruction process; by using a mutually exclusive prefix sum algorithm, the whole process of fluid surface reconstruction is completely parallelized, which greatly speeds up surface reconstruction. Computer Graphics International (CGI, 2020) (CCF-C) [Paper]
	Fully automatic algorithm on yarn model generation Zekun Zhang [Introduction (PPT)]
	Microscopic model based real time algorithm on fabric rendering Xingrong Luo [Introduction (PPT)]
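
The fluid surface reconstruction entry in the list above relies on a prefix sum to parallelize the pipeline. As a generic illustration of that idea (not the paper's algorithm), the sketch below uses an exclusive prefix sum to give every kept element its own output slot, so that writes never collide and no branching on a shared counter is needed. The names `exclusive_prefix_sum` and `compact`, and the 0/1 flag convention, are assumptions for this example; NumPy stands in for a GPU implementation.

```python
import numpy as np


def exclusive_prefix_sum(flags):
    """Return out[i] = sum(flags[:i]); the first entry is always 0."""
    flags = np.asarray(flags, dtype=np.int64)
    out = np.zeros(len(flags), dtype=np.int64)
    out[1:] = np.cumsum(flags[:-1])
    return out


def compact(values, flags):
    """Keep values whose flag is 1, writing each to a precomputed unique slot.

    Because every kept element knows its destination in advance, the
    scatter step could run in parallel without race conditions.
    """
    values = np.asarray(values)
    flags = np.asarray(flags, dtype=np.int64)
    offsets = exclusive_prefix_sum(flags)
    total = int(offsets[-1] + flags[-1]) if len(flags) else 0
    out = np.empty(total, dtype=values.dtype)
    keep = flags.astype(bool)
    out[offsets[keep]] = values[keep]  # independent writes, one slot each
    return out


if __name__ == "__main__":
    cell_ids = np.array([10, 20, 30, 40, 50])   # hypothetical grid cell ids
    on_surface = np.array([1, 0, 1, 1, 0])      # hypothetical narrow-band flags
    print(exclusive_prefix_sum(on_surface))
    print(compact(cell_ids, on_surface))
```

In a narrow-band setting, the flags would mark the cells near the fluid surface, and the compacted array would be the work list handed to the surface extraction step.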