Abstract
Most 3D scene generation methods generate only object bounding box parameters, while newer diffusion methods also generate class labels and latent features, then retrieve objects from a predefined database. For complex scenes containing varied objects from many categories, current autoencoders cannot reliably decode diffusion-generated latents into point cloud objects that match the target classes.
We introduce a Class-Partitioned Vector Quantized Variational Autoencoder (CPVQ-VAE) that is trained to decode object latent features using a class-partitioned codebook in which every codevector is labeled by class. To address codebook collapse, we propose a class-aware running average update which reinitializes dead codevectors within each partition. During inference, the CPVQ-VAE consumes object features and class labels, both generated by a Latent-space Flow Matching Model (LFMM) designed specifically for scene generation. The CPVQ-VAE’s class-aware inverse look-up then maps each generated latent to a codebook entry, which is decoded into a class-specific point cloud shape.
This enables pure point cloud generation without retrieving objects from an external database. Experiments demonstrate up to 70.4% and 72.3% reductions in Chamfer and Point2Mesh errors on complex living room scenes compared to prior work, with a runtime 90.3% lower than that of Diffuscene.
Method
Class-Partitioned VQ-VAE (CPVQ-VAE)
A traditional VQ-VAE quantizes an encoded point cloud to the nearest codevector in a shared codebook, irrespective of object class. When latents are generated by a diffusion or flow matching process, this class-agnostic quantization frequently produces incorrect decoded shapes. The CPVQ-VAE addresses this by partitioning the codebook by class: for \(N_c\) classes and \(N_q\) codevectors per class, the total codebook size is \(N_K = N_c \times N_q\). The class-aware quantization selects only among codevectors belonging to the correct class via an indicator function:
\[Q(\mathcal{E}(\mathbf{P}); \mathcal{C}, c) = k_c^* = \underset{k \in [N_K]}{\operatorname{argmin}} \| \mathcal{E}(\mathbf{P}) - \mathbf{1}(c,k) \times \mathbf{e}^k \|_2^2\]
The CPVQ-VAE is trained with a standard reconstruction objective (Chamfer distance) plus commitment and codebook losses, weighted to ensure equal-order loss contributions.
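The class-restricted search in the quantization step can be illustrated with a minimal NumPy sketch. It assumes a contiguous codebook layout in which class \(c\) owns rows \(c N_q\) to \((c+1) N_q - 1\); this layout and the function name `class_aware_quantize` are illustrative assumptions, not details stated above. Slicing out the partition plays the role of the indicator function.

```python
import numpy as np

def class_aware_quantize(z, codebook, c, n_q):
    """Return the global index of the nearest codevector inside class c's partition.

    z        : (D,) encoded feature
    codebook : (N_c * n_q, D) array of codevectors
    c        : integer class label
    n_q      : codevectors per class
    Assumption: partition c occupies rows [c * n_q, (c + 1) * n_q).
    """
    start = c * n_q
    partition = codebook[start:start + n_q]
    # Squared Euclidean distance to every codevector of the correct class only.
    dists = np.sum((partition - z) ** 2, axis=1)
    return start + int(np.argmin(dists))

# Toy codebook: 2 classes, 2 codevectors each, 2-D features.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [0.1, 0.1], [5.0, 5.0]])
z = np.array([4.0, 4.0])
k_class0 = class_aware_quantize(z, codebook, c=0, n_q=2)
k_class1 = class_aware_quantize(z, codebook, c=1, n_q=2)
```

In this toy case the globally nearest codevector belongs to class 1, yet a class-0 query is still quantized within the class-0 partition, which is the behaviour a shared codebook cannot guarantee.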
Class-Aware Running Average Update
VQ-VAE training is susceptible to codebook collapse: early in training, a small subset of codevectors is heavily used while the rest become dead. We mitigate this with a class-aware running average update that dynamically reinitializes dead codevectors. Usage is tracked per codevector with an exponential moving average:
\[U_s^k = \gamma U_{s-1}^k + \frac{1-\gamma}{B} u_s^k\]
Dead codevectors (those with low usage) are reinitialized towards the closest mini-batch encodings, selected in a class-aware manner using the same indicator function as the quantization step. A decay coefficient \(\alpha_s^k\) controls how aggressively each codevector is updated, with nearly-dead codevectors receiving stronger reinitialization signals.
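A rough sketch of this update is given below. It assumes that \(u_s^k\) counts how many encodings in the current mini-batch were assigned to codevector \(k\) and that \(B\) is the mini-batch size. The decay \(\alpha = \exp(-U/\tau)\), the temperature `tau`, and both function names are hypothetical stand-ins chosen only so that low usage yields a stronger pull; the exact form of \(\alpha_s^k\) is not specified here.

```python
import numpy as np

def update_usage(usage, counts, gamma, batch_size):
    """EMA of per-codevector usage: U_s = gamma * U_{s-1} + (1 - gamma) / B * u_s."""
    return gamma * usage + (1.0 - gamma) / batch_size * counts

def reinit_codebook(codebook, usage, code_class, enc, enc_class, tau=0.01):
    """Pull rarely used codevectors towards the closest same-class batch encoding.

    code_class : (K,) class label of each codevector
    enc        : (B, D) mini-batch encodings, enc_class : (B,) their class labels
    alpha = exp(-usage / tau) is a hypothetical decay: near 1 for dead
    codevectors (strong reinitialization), near 0 for well-used ones.
    """
    new_codebook = codebook.copy()
    alpha = np.exp(-usage / tau)
    for k in range(len(codebook)):
        same = enc[enc_class == code_class[k]]
        if len(same) == 0:
            continue  # no encoding of this class in the batch; leave unchanged
        nearest = same[np.argmin(np.sum((same - codebook[k]) ** 2, axis=1))]
        new_codebook[k] = (1.0 - alpha[k]) * codebook[k] + alpha[k] * nearest
    return new_codebook

# Codevector 0 was used 4 times in a batch of 4; codevector 1 was never used.
usage = update_usage(np.zeros(2), np.array([4.0, 0.0]), gamma=0.5, batch_size=4)
codebook = np.array([[0.0, 0.0], [10.0, 10.0]])
code_class = np.array([0, 0])
enc = np.array([[9.0, 9.0], [2.0, 2.0]])
enc_class = np.array([1, 0])
updated = reinit_codebook(codebook, usage, code_class, enc, enc_class)
```

Here the dead codevector moves to the class-0 encoding at (2, 2) even though a class-1 encoding lies closer, while the well-used codevector stays essentially where it was.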
Latent-Space Flow Matching Model (LFMM)
The LFMM simultaneously generates all object attributes in a scene — bounding box parameters (translation \(\hat{\mathbf{T}}\), rotation \(\hat{\mathbf{R}}\), size \(\hat{\mathbf{S}}\)), class vector \(\hat{\mathbf{C}}\), and latent feature \(\hat{\mathbf{F}} \in \mathbb{R}^{32}\) — conditioned on the floor plan.
Sampling is formulated as an optimal transport flow matching problem. Noisy initial samples \(\mathbf{x}_0 \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\) are transported to the clean data distribution via linearly interpolated intermediate states:
\[\mathbf{x}_t = (1-t)\mathbf{x}_0 + t\mathbf{x}_1\]
This linear interpolation implies a constant velocity field \(\mathbf{v}(\mathbf{x}_t; t) = \mathbf{x}_1 - \mathbf{x}_0\), which is approximated by a U-Net with parameters \(\theta\). The training objective minimises the expected squared deviation between predicted and target velocities across all object parameter types \(J \in \{T, R, S, C, F\}\). At inference, the Euler method with \(N_{\hat{t}} = 100\) steps is used.
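The interpolation, velocity target, and Euler sampler can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions: `velocity_fn` is a placeholder for the U-Net, floor-plan conditioning is omitted, and all object attributes are treated as one array rather than split by parameter type. The function names are not taken from the method.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Intermediate state x_t = (1 - t) * x0 + t * x1."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Constant velocity implied by the linear interpolation."""
    return x1 - x0

def fm_loss(v_pred, x0, x1):
    """Mean squared deviation between predicted and target velocities."""
    return float(np.mean((v_pred - target_velocity(x0, x1)) ** 2))

def euler_sample(velocity_fn, x0, n_steps=100):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 3))   # noise sample
x1 = np.ones((4, 3))               # stand-in for clean data
# Oracle velocity used only to check the sampler; a trained network would replace it.
oracle = lambda x, t: target_velocity(x0, x1)
x_final = euler_sample(oracle, x0, n_steps=100)
```

With the oracle velocity, the Euler integration lands on `x1` up to floating-point error, which is a convenient sanity check before substituting a learned velocity model.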
Class-Aware Inverse Look-up
After the LFMM generates a 32-dimensional latent \(\hat{\mathbf{F}}\) and class \(\hat{c}\), a class-aware inverse look-up function maps \(\hat{\mathbf{F}}\) to the most similar codebook entry within the partition belonging to \(\hat{c}\), maximising cosine similarity in a class-restricted search:
\[\mathbf{L}(\hat{\mathbf{F}}, \mathcal{C}, \hat{c}) = \underset{k \in [N_K]}{\operatorname{argmax}} \left( \hat{\mathbf{F}} \cdot \left[ \mathbf{1}(\hat{c}, k) \times \mathbf{e}_{n<32}^k \right] \right)\]
The retrieved full codevector is then decoded by the CPVQ-VAE to produce the final class-consistent point cloud.
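A minimal sketch of the inverse look-up follows. It assumes the same contiguous partition layout as a plain row-sliced codebook (class \(\hat{c}\) owns rows \(\hat{c} N_q\) to \((\hat{c}+1) N_q - 1\)) and normalises both vectors explicitly to obtain cosine similarity; if the inputs are already unit-norm, this reduces to the dot product in the formula. The function name and the small constant guarding against division by zero are illustrative choices.

```python
import numpy as np

def inverse_lookup(f_hat, codebook, c_hat, n_q, latent_dim=32):
    """Return the global index of the most similar codevector in class c_hat's partition.

    f_hat    : (latent_dim,) generated latent
    codebook : (N_c * n_q, D) codevectors with D >= latent_dim
    Only the first latent_dim components of each codevector are compared.
    Assumption: partition c_hat occupies rows [c_hat * n_q, (c_hat + 1) * n_q).
    """
    start = c_hat * n_q
    part = codebook[start:start + n_q, :latent_dim]
    norms = np.linalg.norm(part, axis=1) * np.linalg.norm(f_hat) + 1e-12
    sims = (part @ f_hat) / norms
    return start + int(np.argmax(sims))

# Toy setting: 2 classes, 2 codevectors each, 4-D codevectors, 2-D latents.
codebook = np.array([
    [1.0, 0.0, 9.0, 9.0],
    [0.0, 1.0, 9.0, 9.0],
    [1.0, 0.0, 7.0, 7.0],
    [-1.0, 0.0, 7.0, 7.0],
])
f_hat = np.array([-1.0, 0.1])
k0 = inverse_lookup(f_hat, codebook, c_hat=0, n_q=2, latent_dim=2)
k1 = inverse_lookup(f_hat, codebook, c_hat=1, n_q=2, latent_dim=2)
full_codevector = codebook[k1]  # the full vector, not the truncated one, goes to the decoder
```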
Results
Point Cloud Generation
Evaluated on the 3D-FRONT dataset (living rooms, dining rooms, bedrooms), our LFMM+CPVQ-VAE achieves state-of-the-art generation quality:
| Method | Living Room CD↓ | Living Room P2M↓ | Dining Room CD↓ | Dining Room P2M↓ | Bedroom CD↓ | Bedroom P2M↓ |
|---|---|---|---|---|---|---|
| Diffuscene | 30.63 | 29.87 | 30.60 | 29.49 | 45.01 | 44.88 |
| Ours (LFMM + VAE) | 24.65 | 23.41 | 2.66 | 2.52 | 4.24 | 3.63 |
| Ours (LFMM + CPVQ-VAE) | 9.06 | 8.27 | 2.38 | 2.17 | 2.46 | 2.06 |
CD and P2M values ×10³. On complex living room data, our full model reduces CD and P2M errors by 70.4% and 72.3% over Diffuscene.
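The reported reductions follow directly from the living room columns of the table, as this short check shows. The helper name `relative_reduction` is illustrative.

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` lowers an error relative to `baseline`."""
    return 100.0 * (baseline - ours) / baseline

# Living room values from the table (Diffuscene vs. LFMM + CPVQ-VAE).
cd_reduction = relative_reduction(30.63, 9.06)
p2m_reduction = relative_reduction(29.87, 8.27)
print(f"CD: {cd_reduction:.1f}%  P2M: {p2m_reduction:.1f}%")
# prints "CD: 70.4%  P2M: 72.3%"
```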
Scene Retrieval & Runtime
Our LFMM also improves bounding box quality for retrieval-based evaluation (FID, KID) and is substantially faster than diffusion-based approaches:
| Method | Runtime (s)↓ | Avg. FID↓ | Avg. KID↓ |
|---|---|---|---|
| ATISS | 0.024 | 26.21 | 3.686 |
| Diffuscene | 9.153 | 27.58 | 4.775 |
| Ours | 0.892 | 24.34 | 3.107 |
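Our method reduces runtime by 90.3% relative to Diffuscene (0.892 s versus 9.153 s) while achieving the best average FID and KID scores across all room types. ATISS remains the fastest method at 0.024 s, but with worse FID and KID.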
Ablation: Autoencoder Design
| Variant | Modules | Bedroom CD↓ | Bedroom P2M↓ |
|---|---|---|---|
| V1: VAE | VAE only | 4.24 | 3.63 |
| V2: VQ-VAE | VQ-VAE | 36.27 | 33.93 |
| V3: VQ-VAE + CP | + Class Partitioning | 5.00 | 4.17 |
| V4: CPVQ-VAE | + Class Partitioning + RAU | 2.46 | 2.06 |
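Both class partitioning and the running average update (RAU) are essential. A plain VQ-VAE (V2) performs far worse than the VAE baseline, and adding class partitioning alone (V3) is still not enough: without the RAU, codebook collapse keeps performance below even the plain VAE.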
BibTeX
@InProceedings{deSilvaEdirimuni_2026_AAAI,
author = {de Silva Edirimuni, Dasith and Mian, Ajmal Saeed},
title = {Class-Partitioned VQ-VAE and Latent Flow Matching for Point Cloud Scene Generation},
booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
volume = {40},
number = {5},
pages = {3542--3550},
year = {2026},
month = {March},
doi = {10.1609/aaai.v40i5.37352}
}