StrokeFusion: Vector Sketch Generation via Joint Stroke-UDF Encoding and Latent Sequence Diffusion

Jin Zhou, Yi Zhou, Hongliang Yang, Pengfei Xu*, Hui Huang
CSSE, Shenzhen University
AAAI 2026

Figure 1: Two examples of the generation process in StrokeFusion. From left to right, the noise level progressively decreases. At each timestep, only strokes with presence confidence $\hat{v}_i > 0$ are visualized.

Abstract

In the field of sketch generation, raster-format-trained models often produce non-stroke artifacts, while vector-format-trained models typically lack a holistic understanding of sketches, resulting in compromised recognizability. Moreover, existing methods struggle to extract common features from similar elements (e.g., animal eyes) that appear at varying positions across sketches. To address these challenges, we propose StrokeFusion, a two-stage framework for vector sketch generation. It contains a dual-modal sketch feature learning network that maps strokes into a high-quality latent space. This network decomposes sketches into normalized strokes and jointly encodes stroke sequences with Unsigned Distance Function (UDF) maps, representing sketches as sets of stroke feature vectors. Building upon this representation, our framework exploits a stroke-level latent diffusion model that simultaneously adjusts stroke position, scale, and trajectory during generation. This enables high-fidelity stroke generation while supporting stroke interpolation editing. Extensive experiments across multiple sketch datasets demonstrate that our framework outperforms state-of-the-art techniques, validating its effectiveness in preserving structural integrity and semantic features.
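The abstract references Unsigned Distance Function (UDF) maps of strokes. The paper's exact rasterization procedure is not given here; as a rough illustration under assumed choices (a uniform grid over the normalized $[0,1]^2$ stroke box, exact point-to-segment distance), an unsigned distance field for a polyline stroke can be computed as follows:

```python
import numpy as np

def point_segment_dist(p, a, b):
    # Distance from each query point in p (M, 2) to the segment a-b.
    ab = b - a
    t = np.clip(((p - a) @ ab) / max(ab @ ab, 1e-12), 0.0, 1.0)
    proj = a + t[:, None] * ab            # closest point on the segment
    return np.linalg.norm(p - proj, axis=1)

def stroke_udf(stroke, res=32):
    # stroke: (K, 2) polyline with coordinates normalized to [0, 1]^2.
    # Returns a (res, res) grid of unsigned distances to the polyline,
    # evaluated at cell centers.
    ys, xs = np.mgrid[0:res, 0:res]
    grid = np.stack([(xs + 0.5) / res, (ys + 0.5) / res], axis=-1).reshape(-1, 2)
    d = np.full(grid.shape[0], np.inf)
    for a, b in zip(stroke[:-1], stroke[1:]):
        d = np.minimum(d, point_segment_dist(grid, a, b))
    return d.reshape(res, res)
```

The resolution (32) and cell-center sampling are illustrative assumptions, not the paper's settings.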

Overview of StrokeFusion Framework

Figure 2: The proposed StrokeFusion framework comprises two core components:
1) Dual-Modal Stroke Encoding: Each stroke $s$ is processed through two parallel encoding paths: a transformer-based sequence encoder handles the geometric coordinates, while a CNN processes the stroke distance field $I_n$. The two modalities are fused into a joint feature $f$, trained via symmetric decoder networks that reconstruct both the original stroke ($s$) and the distance field ($I_n$);
2) Sketch Diffusion Generation: All normalized strokes are encoded into latent vectors $z_i$, augmented with bounding box parameters $b^i = [x^i, y^i, w^i, h^i]$ and presence flags $v^i \in \{-1, 1\}$. The diffusion model learns the distribution of stroke sequences $\{z_1, ..., z_N\}$ through $T$-step denoising training. During generation, the denoiser progressively refines noisy latents via reverse diffusion, with valid strokes ($v^i = 1$) being decoded through inverse normalization of $\hat{b}^i$ to reconstruct the final sketch. The architecture maintains permutation invariance through order-agnostic sequence processing.
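The final assembly step described above (keeping strokes whose presence flag is $v^i = 1$ and undoing the per-stroke normalization via the bounding box $\hat{b}^i = [x^i, y^i, w^i, h^i]$) can be sketched as follows. Here `decode_fn` is a hypothetical stand-in for the trained stroke decoder, and the box parameterization (origin plus width/height) is an assumption:

```python
import numpy as np

def assemble_sketch(z_list, b_list, v_list, decode_fn):
    # z_list: per-stroke latent vectors; b_list: (x, y, w, h) boxes;
    # v_list: presence flags in {-1, 1}; decode_fn: latent -> (K, 2)
    # normalized stroke in [0, 1]^2 (stand-in for the stroke decoder).
    strokes = []
    for z, (x, y, w, h), v in zip(z_list, b_list, v_list):
        if v <= 0:
            continue                      # flag v^i = 1 marks a valid stroke
        s = decode_fn(z)                  # normalized stroke trajectory
        # Inverse normalization: scale by (w, h), translate by (x, y).
        strokes.append(s * np.array([w, h]) + np.array([x, y]))
    return strokes
```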

Qualitative Comparison on QuickDraw

Figure 3: Qualitative comparison of sketches generated by our method and the baselines across different categories in QuickDraw. Our method consistently produces more structurally coherent sketches with richer local details, particularly in complex, multi-stroke scenarios.

Results on Complex Datasets

Figure 4: Qualitative generation results on several more complex datasets. Dataset names and representative examples are shown below each column for comparison.
Method         |  < 4 strokes          |  < 8 strokes          |  ≥ 8 strokes
               |  FID↓   Prec↑  Rec↑   |  FID↓   Prec↑  Rec↑   |  FID↓   Prec↑  Rec↑
SketchRNN      |  31.61  0.49   0.45   |  36.98  0.58   0.44   |  40.67  0.55   0.40
SketchKnitter  |  23.17  0.52   0.48   |  27.07  0.57   0.45   |  35.64  0.54   0.40
ChiroDiff      |  17.17  0.61   0.50   |  23.84  0.63   0.45   |  27.78  0.62   0.42
StrokeFusion   |  19.53  0.71   0.58   |  18.99  0.69   0.61   |  17.76  0.71   0.58

Table 1: Performance comparison across different stroke-count categories in QuickDraw. Classes are grouped by average stroke counts: low-stroke (< 4), medium-stroke (< 8), and high-stroke ($\geq$ 8). Bold and underlined values indicate the best and second-best performances, respectively.
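Tables 1 and 2 report FID, Precision, and Recall. The feature extractor behind these metrics is not specified on this page; for reference, the quantity underlying FID, the Fréchet distance between Gaussian fits of two feature sets, can be computed with plain numpy as follows (using the fact that a product of PSD covariance matrices has real non-negative eigenvalues, so the trace of its square root is the sum of the square roots of those eigenvalues):

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    # Fit a Gaussian (mean, covariance) to each feature set and compute
    # ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^{1/2}).
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((C_a C_b)^{1/2}) via eigenvalues of the product; clip tiny
    # negative values caused by numerical error.
    eig = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)
```

This is a generic sketch of the metric, not the evaluation code used for the tables.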

Method         |  Creative Birds       |  Creative Creatures   |  FaceX                |  TU Berlin
               |  FID↓   Prec↑  Rec↑   |  FID↓    Prec↑  Rec↑  |  FID↓    Prec↑  Rec↑  |  FID↓   Prec↑  Rec↑
SketchRNN      |  59.85  0.26   0.28   |  121.02  0.44   0.26  |  155.02  0.01   0.31  |  98.01  0.73   0.20
SketchKnitter  |  59.16  0.25   0.22   |  110.46  0.42   0.27  |  156.97  0.08   0.34  |  99.46  0.56   0.22
ChiroDiff      |  60.10  0.56   0.18   |   36.66  0.59   0.27  |   99.33  0.06   0.30  |  98.30  0.53   0.25
Doodleformer   |  27.32  0.67   0.55   |   33.46  0.52   0.69  |    -     -      -     |   -     -      -
StrokeFusion   |  26.19  0.56   0.30   |   19.41  0.58   0.32  |    7.27  0.76   0.89  |  33.68  0.66   0.48

Table 2: Performance comparison on additional datasets. Our method achieves the best FID on all four datasets, as well as the best Precision and Recall on FaceX and the best Recall on TU Berlin. Bold and underlined values indicate the best and second-best performances, respectively.

BibTeX Citation


@inproceedings{zhou2026strokefusion,
  title={StrokeFusion: Vector Sketch Generation via Joint Stroke-UDF Encoding and Latent Sequence Diffusion},
  author={Zhou, Jin and Zhou, Yi and Yang, Hongliang and Xu, Pengfei and Huang, Hui},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  year={2026}
}