Class-Partitioned VQ-VAE and Latent Flow Matching for Point Cloud Scene Generation

Top: Diffuscene's pretrained VAE frequently decodes object latents into shapes that disagree with target classes. Our LFMM produces robust latents correctly decoded by the CPVQ-VAE, which exploits generated class labels to look up class-specific codevectors. Bottom: Overview of the full scene generation framework.

Abstract

Most 3D scene generation methods generate only object bounding box parameters, while newer diffusion methods also generate class labels and latent features and then retrieve objects from a predefined database. For complex scenes with varied objects from many categories, current autoencoders cannot reliably decode diffusion-generated latents into point cloud objects that agree with the target classes.

We introduce a Class-Partitioned Vector Quantized Variational Autoencoder (CPVQ-VAE) that is trained to decode object latent features using a class-partitioned codebook in which codevectors are labeled by class. To address codebook collapse, we propose a class-aware running average update that reinitializes dead codevectors within each partition. During inference, object features and class labels, both generated by a Latent-space Flow Matching Model (LFMM) designed specifically for scene generation, are consumed by the CPVQ-VAE. The CPVQ-VAE’s class-aware inverse look-up then maps generated latents to codebook entries, which are decoded into class-specific point cloud shapes.

This achieves pure point cloud generation without relying on an external object database for retrieval. Experiments demonstrate up to 70.4% and 72.3% reductions in Chamfer and Point2Mesh errors on complex living room scenes compared to prior work, with a runtime 90.3% lower than that of Diffuscene.

Method

The CPVQ-VAE uses class label inputs to partition the codebook, allowing it to learn class-specific point cloud representations and decode latents into class-consistent shapes.

Class-Partitioned VQ-VAE (CPVQ-VAE)

A traditional VQ-VAE quantizes an encoded point cloud to the nearest codevector in a shared codebook, irrespective of object class. When latents are generated by a diffusion or flow matching process, this class-agnostic quantization frequently produces incorrect decoded shapes. The CPVQ-VAE addresses this by partitioning the codebook by class: for \(N_c\) classes and \(N_q\) codevectors per class, the total codebook size is \(N_K = N_c \times N_q\). The class-aware quantization selects only among codevectors belonging to the correct class via an indicator function:

\[Q(\mathcal{E}(\mathbf{P}); \mathcal{C}, c) = k_c^* = \underset{k \in [N_K]}{\operatorname{argmin}} \| \mathcal{E}(\mathbf{P}) - \mathbf{1}(c,k) \times \mathbf{e}^k \|_2^2\]
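The class-restricted search can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the indicator \(\mathbf{1}(c,k)\) is realised here as a mask that excludes codevectors of other classes from the argmin, and the function name, the contiguous partition layout, and the toy sizes are all assumptions made for the example.

```python
import numpy as np

def class_aware_quantize(z, codebook, code_classes, c):
    """Return the index of the nearest codevector whose class label equals c.

    z            : (D,) encoded point cloud feature, E(P)
    codebook     : (N_K, D) all codevectors e^k
    code_classes : (N_K,) class label attached to each codevector
    c            : target class label
    """
    dists = np.sum((codebook - z) ** 2, axis=1)          # squared L2 to every codevector
    dists = np.where(code_classes == c, dists, np.inf)   # drop codevectors of other classes
    return int(np.argmin(dists))

# Toy setup (hypothetical sizes): N_c = 3 classes, N_q = 4 codevectors each, D = 8.
rng = np.random.default_rng(0)
N_c, N_q, D = 3, 4, 8
codebook = rng.normal(size=(N_c * N_q, D))
code_classes = np.repeat(np.arange(N_c), N_q)            # [0,0,0,0,1,1,1,1,2,2,2,2]

z = codebook[1] + 0.01 * rng.normal(size=D)              # encoding close to a class-0 entry
k_same = class_aware_quantize(z, codebook, code_classes, c=0)
k_other = class_aware_quantize(z, codebook, code_classes, c=2)
```

With the correct label the search recovers the nearby class-0 entry; with a different label it is forced into that class's partition even though a closer codevector exists elsewhere in the codebook.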

The CPVQ-VAE is trained with a standard reconstruction objective (Chamfer distance) plus commitment and codebook losses, weighted to ensure equal-order loss contributions.
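As a rough illustration of the objective's ingredients, the sketch below computes one common variant of the Chamfer distance (mean squared nearest-neighbour distance in both directions) and combines it with the quantization error. The function names and the weights `w_rec`, `w_codebook`, and `w_commit` are hypothetical; the page does not state the exact Chamfer variant or the weight values used.

```python
import numpy as np

def chamfer_distance(P, Q):
    """Symmetric Chamfer distance between point sets P (N, 3) and Q (M, 3)."""
    d2 = np.sum((P[:, None, :] - Q[None, :, :]) ** 2, axis=-1)   # (N, M) squared distances
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def vq_training_loss(P, P_rec, z_e, e_k, w_rec=1.0, w_codebook=1.0, w_commit=0.25):
    """Weighted sum of reconstruction, codebook and commitment terms (weights are assumptions).

    In an autodiff framework the codebook and commitment terms share this value
    and differ only in where the stop-gradient is placed; plain NumPy cannot show that.
    """
    rec = chamfer_distance(P, P_rec)
    quant_err = np.sum((z_e - e_k) ** 2)
    return w_rec * rec + w_codebook * quant_err + w_commit * quant_err

P = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
Q = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
cd = chamfer_distance(P, Q)
loss = vq_training_loss(P, Q, z_e=np.zeros(4), e_k=np.zeros(4))
```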

Class-Aware Running Average Update

VQ-VAE training is susceptible to codebook collapse: early in training, a small subset of codevectors is heavily used while the rest become dead. We mitigate this with a class-aware running average update that dynamically reinitializes dead codevectors. Usage is tracked per codevector with an exponential moving average:

\[U_s^k = \gamma U_{s-1}^k + \frac{1-\gamma}{B} u_s^k\]

Dead codevectors (those with low usage) are reinitialized towards the closest mini-batch encodings, selected in a class-aware manner using the same indicator function as the quantization step. A decay coefficient \(\alpha_s^k\) controls how aggressively each codevector is updated, with nearly-dead codevectors receiving stronger reinitialization signals.
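The update can be sketched as follows. Several details here are assumptions rather than the paper's definitions: \(u_s^k\) is taken to be the number of mini-batch encodings assigned to codevector \(k\) at step \(s\), \(B\) the batch size, and the decay \(\alpha_s^k = \exp(-U_s^k / \tau)\) with a hypothetical temperature `tau` is a stand-in chosen only because it gives low-usage codevectors a stronger pull.

```python
import numpy as np

def update_usage(U_prev, assignments, gamma=0.99):
    """EMA usage: U_s = gamma * U_{s-1} + (1 - gamma) / B * u_s."""
    B = len(assignments)
    u = np.bincount(assignments, minlength=len(U_prev))   # assumed u_s^k: hits per codevector
    return gamma * U_prev + (1.0 - gamma) / B * u

def reinit_codevectors(codebook, code_classes, U, Z, z_classes, tau=0.01):
    """Pull low-usage codevectors towards the nearest same-class batch encoding."""
    new_codebook = codebook.copy()
    alpha = np.exp(-U / tau)                               # hypothetical decay, near 1 when U is near 0
    for k in range(len(codebook)):
        same = np.flatnonzero(z_classes == code_classes[k])
        if same.size == 0:
            continue                                       # no encoding of this class in the batch
        d2 = np.sum((Z[same] - codebook[k]) ** 2, axis=1)
        nearest = Z[same[np.argmin(d2)]]
        new_codebook[k] = (1.0 - alpha[k]) * codebook[k] + alpha[k] * nearest
    return new_codebook

# Two classes with two codevectors each; codevectors 1 and 3 are never selected.
codebook = np.array([[0.0, 0.0], [5.0, 5.0], [1.0, 1.0], [9.0, 9.0]])
code_classes = np.array([0, 0, 1, 1])
Z = np.array([[0.1, 0.0], [0.0, 0.2], [1.0, 0.9]])        # mini-batch encodings
z_classes = np.array([0, 0, 1])
assignments = np.array([0, 0, 2])                          # result of class-aware quantization

U = update_usage(np.zeros(4), assignments, gamma=0.9)
new_codebook = reinit_codevectors(codebook, code_classes, U, Z, z_classes)
```

In this toy batch the two unused codevectors jump onto encodings of their own class, while the used ones barely move.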

Latent-Space Flow Matching Model (LFMM)

U-Net architecture of the Latent-space Flow Matching Model (LFMM), conditioned on the floor plan to generate all object attributes for a scene.

The LFMM simultaneously generates all object attributes in a scene — bounding box parameters (translation \(\hat{\mathbf{T}}\), rotation \(\hat{\mathbf{R}}\), size \(\hat{\mathbf{S}}\)), class vector \(\hat{\mathbf{C}}\), and latent feature \(\hat{\mathbf{F}} \in \mathbb{R}^{32}\) — conditioned on the floor plan.

Sampling is formulated as an optimal transport flow matching problem. Noisy initial samples \(\mathbf{x}_0 \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\) are transported to the clean data distribution via linearly interpolated intermediate states:

\[\mathbf{x}_t = (1-t)\mathbf{x}_0 + t\mathbf{x}_1\]

This linear interpolation implies a constant velocity field \(\mathbf{v}(\mathbf{x}_t; t) = \mathbf{x}_1 - \mathbf{x}_0\), which is approximated by a U-Net with parameters \(\theta\). The training objective minimises the expected squared deviation between predicted and target velocities across all object parameter types \(J \in \{T, R, S, C, F\}\). At inference, the Euler method with \(N_{\hat{t}} = 100\) steps is used.
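The interpolation, the velocity target, and Euler sampling can be sketched as below. The trained U-Net is replaced by an oracle that returns the true constant velocity, so the example only demonstrates the mechanics; the function names and array shapes are assumptions, and floor-plan conditioning and the per-attribute split of the loss over \(J\) are omitted.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Intermediate state x_t = (1 - t) * x0 + t * x1."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Constant velocity implied by the linear path."""
    return x1 - x0

def flow_matching_loss(v_pred, x0, x1):
    """Mean squared deviation between predicted and target velocity."""
    return float(np.mean((v_pred - target_velocity(x0, x1)) ** 2))

def euler_sample(velocity_fn, x0, n_steps=100):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 8))            # noise samples, x0 ~ N(0, I)
x1 = rng.normal(size=(4, 8)) + 3.0      # stand-in for clean object attributes

oracle = lambda x, t: x1 - x0           # stands in for the trained U-Net
x_gen = euler_sample(oracle, x0, n_steps=100)
loss_at_optimum = flow_matching_loss(oracle(interpolate(x0, x1, 0.5), 0.5), x0, x1)
```

Because the oracle velocity is exact, 100 Euler steps land on `x1` up to rounding; with a learned network the result is only as good as the velocity approximation.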

Class-Aware Inverse Look-up

After the LFMM generates a 32-dimensional latent \(\hat{\mathbf{F}}\) and class \(\hat{c}\), a class-aware inverse look-up function maps \(\hat{\mathbf{F}}\) to the most similar codebook entry within the partition belonging to \(\hat{c}\), maximising cosine similarity in a class-restricted search:

\[\mathbf{L}(\hat{\mathbf{F}}, \mathcal{C}, \hat{c}) = \underset{k \in [N_K]}{\operatorname{argmax}} \left( \hat{\mathbf{F}} \cdot \left[ \mathbf{1}(\hat{c}, k) \times \mathbf{e}_{n<32}^k \right] \right)\]

The retrieved full codevector is then decoded by the CPVQ-VAE to produce the final class-consistent point cloud.
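A minimal sketch of the look-up follows, reading \(\mathbf{e}_{n<32}^k\) as the first 32 components of each codevector and again realising the indicator as a mask. That reading, the 64-dimensional codevectors, and the function name are assumptions. The equation scores candidates with a dot product, while the text describes cosine similarity, so the sketch exposes both through a `cosine` flag.

```python
import numpy as np

def class_aware_inverse_lookup(F_hat, codebook, code_classes, c_hat, n=32, cosine=False):
    """Return the index of the best-matching codevector inside the partition of c_hat."""
    keys = codebook[:, :n]                                  # assumed: first n components act as keys
    scores = keys @ F_hat                                   # dot product, as in the equation
    if cosine:
        scores = scores / (np.linalg.norm(keys, axis=1) * np.linalg.norm(F_hat) + 1e-12)
    scores = np.where(code_classes == c_hat, scores, -np.inf)   # restrict to class c_hat
    return int(np.argmax(scores))

# Hypothetical codebook: 3 classes x 4 codevectors, each of dimension 64.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(12, 64))
code_classes = np.repeat(np.arange(3), 4)

F_hat = codebook[5, :32] + 0.01 * rng.normal(size=32)       # generated latent near entry 5 (class 1)
k_cos = class_aware_inverse_lookup(F_hat, codebook, code_classes, c_hat=1, cosine=True)
k_dot = class_aware_inverse_lookup(F_hat, codebook, code_classes, c_hat=1)
full_codevector = codebook[k_cos]                           # this full vector would go to the decoder
```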

Results

Point Cloud Generation

Visualization of retrieved and generated scenes. Green close-ups highlight correctly decoded/retrieved objects; red close-ups highlight failures. Unlike the VAE decoder of Diffuscene, the CPVQ-VAE accurately decodes object features into class-consistent point clouds.

Evaluated on the 3D-FRONT dataset (living rooms, dining rooms, bedrooms), our LFMM+CPVQ-VAE achieves state-of-the-art generation quality:

Method	Living Room CD↓	Living Room P2M↓	Dining Room CD↓	Dining Room P2M↓	Bedroom CD↓	Bedroom P2M↓
Diffuscene	30.63	29.87	30.60	29.49	45.01	44.88
Ours (LFMM + VAE)	24.65	23.41	2.66	2.52	4.24	3.63
Ours (LFMM + CPVQ-VAE)	9.06	8.27	2.38	2.17	2.46	2.06

CD and P2M values ×10³. On complex living room data, our full model reduces CD and P2M errors by 70.4% and 72.3% over Diffuscene.

Scene Retrieval & Runtime

Our LFMM also improves bounding box quality for retrieval-based evaluation (FID, KID) and is substantially faster than diffusion-based approaches:

Method	Runtime (s)↓	Avg. FID↓	Avg. KID↓
ATISS	0.024	26.21	3.686
Diffuscene	9.153	27.58	4.775
Ours	0.892	24.34	3.107

Our method is 90.3% faster than Diffuscene while achieving better average FID and KID scores across all room types.
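The headline percentages can be reproduced directly from the table entries above. The helper name below is illustrative; the inputs are the Diffuscene and LFMM + CPVQ-VAE living room values and the two runtimes.

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline

cd_reduction = relative_reduction(30.63, 9.06)        # living room CD
p2m_reduction = relative_reduction(29.87, 8.27)       # living room P2M
runtime_reduction = relative_reduction(9.153, 0.892)  # seconds per scene

print(f"{cd_reduction:.1f}% {p2m_reduction:.1f}% {runtime_reduction:.1f}%")
# prints "70.4% 72.3% 90.3%"
```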

Ablation: Autoencoder Design

Variant	Modules	Bedroom CD↓	Bedroom P2M↓
V1: VAE	VAE only	4.24	3.63
V2: VQ-VAE	VQ-VAE	36.27	33.93
V3: VQ-VAE + CP	+ Class Partitioning	5.00	4.17
V4: CPVQ-VAE	+ Class Partitioning + RAU	2.46	2.06

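Both class partitioning and the running average update (RAU) are essential. A plain VQ-VAE (V2) performs far worse than the VAE baseline (CD 36.27 vs. 4.24), and class partitioning alone (V3, CD 5.00) still trails the VAE because of codebook collapse. Only with the RAU (V4, CD 2.46) does the model surpass the plain VAE baseline.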

BibTeX

@InProceedings{deSilvaEdirimuni_2026_AAAI,
  author    = {de Silva Edirimuni, Dasith and Mian, Ajmal Saeed},
  title     = {Class-Partitioned VQ-VAE and Latent Flow Matching for Point Cloud Scene Generation},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
  volume    = {40},
  number    = {5},
  pages     = {3542--3550},
  year      = {2026},
  month     = {March},
  doi       = {10.1609/aaai.v40i5.37352}
}
