Experiments | Dynamic Compression Flows

Across experiments, we analyze representations at compression levels \( \tau\in\{0,1\} \). At generation time, we integrate both learned flows (dynamic and compressive) as neural ODEs using an adaptive Dormand–Prince (DOPRI5) solver.

We obtain the flowed data representation \( {\tilde{\mathbf{x}}^{(\tau)}_t} \) by two methods: flowed, where each observed frame \( \mathbf{x}^{(1)}_t \) is compressed pointwise to level \( \tau \) by integrating the compressive flow \( \mathbf{u}_{\phi} \) from \( \tau=1 \); and simulated, where we first compress an initial frame to \( \tilde{\mathbf{x}}^{(\tau)}_0 \) and then integrate the dynamical flow \( \mathbf{v}_{\theta} \) forward in time at the same \( \tau \).
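The two constructions can be compared on a toy problem. The sketch below is illustrative only: `u_toy` and `v_toy` are hypothetical stand-ins for the learned fields \( \mathbf{u}_{\phi} \) and \( \mathbf{v}_{\theta} \) (a uniform shrinkage and a planar rotation, chosen so that the two commute), and SciPy's `solve_ivp` with `method="RK45"`, which implements a Dormand–Prince 5(4) pair, stands in for the adaptive solver.

```python
import numpy as np
from scipy.integrate import solve_ivp

OMEGA = 2 * np.pi / 50  # hypothetical angular speed: one loop per 50 frames
SHRINK = 1.0            # hypothetical compression rate


def u_toy(tau, x):
    """Toy compressive flow: dx/dtau = SHRINK * x, so x shrinks as tau goes 1 -> 0."""
    return SHRINK * x


def v_toy(t, x, tau):
    """Toy dynamical flow: planar rotation (tau is unused in this toy field)."""
    return OMEGA * np.array([-x[1], x[0]])


def integrate(field, span, x0, args=None):
    """Integrate an ODE with adaptive Dormand-Prince (SciPy's RK45); return the end state."""
    sol = solve_ivp(field, span, x0, method="RK45", rtol=1e-9, atol=1e-11, args=args)
    return sol.y[:, -1]


def compress(x, tau):
    """Carry a data-space state (tau = 1) down to compression level tau."""
    return x.copy() if tau == 1.0 else integrate(u_toy, (1.0, tau), x)


def flowed(frames, tau):
    """'Flowed': compress every observed frame pointwise."""
    return np.stack([compress(x, tau) for x in frames])


def simulated(x0, n_frames, tau):
    """'Simulated': compress the first frame once, then roll the dynamics forward at fixed tau."""
    states = [compress(x0, tau)]
    for t in range(n_frames - 1):
        states.append(integrate(v_toy, (t, t + 1), states[-1], args=(tau,)))
    return np.stack(states)


times = np.arange(10)
frames = np.stack([np.cos(OMEGA * times), np.sin(OMEGA * times)], axis=1)
x_flowed = flowed(frames, tau=0.0)
x_simulated = simulated(frames[0], len(frames), tau=0.0)
```

In this toy setting the two representations coincide up to solver tolerance because shrinkage and rotation commute. With learned flows, any gap between them reflects how well the dynamical flow tracks the compressed data.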

At \( \tau=0 \), we map compressed states to encoder-normalized latent coordinates via \( \boldsymbol{\tilde{\mu}}_t=(\mathbf{L}\mathbf{D}^{1/2})^{-1}\left(\mathbf{\tilde{x}}^{(0)}_t-\mathbf{b}\right) \) and visualize the first three coordinates, which are the most important under nested-dropout ordering. We also project velocities via \( \boldsymbol{\dot{\tilde{\mu}}}_t=(\mathbf{L}\mathbf{D}^{1/2})^{-1}\mathbf{\dot{\tilde{x}}}^{(0)}_t \) and plot dynamical velocities at the start of each step (\( s=0 \) in Equation 4, Section 2.4).
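As a minimal sketch of this change of coordinates, the NumPy snippet below applies \( (\mathbf{L}\mathbf{D}^{1/2})^{-1} \) with a linear solve rather than an explicit inverse. It assumes \( \mathbf{L} \) is square and invertible and that \( \mathbf{D} \) is diagonal with positive entries. The values of `L`, `d`, and `b` are made up for illustration, and the function names are hypothetical.

```python
import numpy as np


def to_latent(x0, L, d, b):
    """Map tau=0 states (rows of x0) to mu = (L D^{1/2})^{-1} (x - b)."""
    A = L @ np.diag(np.sqrt(d))  # L D^{1/2}
    return np.linalg.solve(A, (x0 - b).T).T


def to_latent_velocity(x0_dot, L, d):
    """Velocities transform linearly, so the offset b drops out."""
    A = L @ np.diag(np.sqrt(d))
    return np.linalg.solve(A, x0_dot.T).T


rng = np.random.default_rng(0)
dim = 5
# Hypothetical unit lower-triangular factor (always invertible).
L = 0.1 * np.tril(rng.normal(size=(dim, dim)), k=-1) + np.eye(dim)
d = np.array([4.0, 2.0, 1.0, 0.5, 0.25])  # hypothetical scales (diagonal of D)
b = rng.normal(size=dim)                  # hypothetical offset

mu_true = rng.normal(size=(20, dim))
x0 = mu_true @ (L @ np.diag(np.sqrt(d))).T + b  # x = L D^{1/2} mu + b
mu = to_latent(x0, L, d, b)
mu_first3 = mu[:, :3]                           # coordinates used for plotting
```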

Training times and parameter choices, as well as trajectory rollout time estimates, can be found in the Appendix of our manuscript (Sections B and C, respectively). Links to the manuscript, repo, and bird (audio) data are listed under “Resources”.

4.1 Simulated Data

To test the ability of DCF to extract low-dimensional dynamics from high-dimensional data, we simulate \( 10 \) short videos of a ball moving clockwise (Figure 3). Each video comprises \( 50 \) time steps, with each frame being a \( 28\times 28 \) grayscale image. Both the compressive flow \( \mathbf{u}_{\phi} \) and the dynamical flow \( \mathbf{v}_{\theta} \) are parameterized by 4-level convolutional encoder-decoders (U-Net style) with channel widths \( \{32,64,128,256\} \) and a 256-dimensional bottleneck embedding. The encoder feature map \( \boldsymbol{\mu}_{\boldsymbol{\psi}}(\mathbf{x}) \) is a convolutional VAE with the same multiscale channel schedule. We use no dynamical history; that is, \( h=0 \) for \( \mathbf{x}_{\mathrm{hist}}^{(\tau)} \) in Section 2.4.

**Figure 3: Rotating-ball simulation.** **Left:** We simulated 10 videos of a ball moving counterclockwise (50 frames, \( 28\times 28 \)) and trained DCF with nested dropout (\( p=\frac{1}{50} \)) and no history (\( h=0 \)). (A) Ground-truth frames (top) and rollouts decoded in data space (middle, \( \tau=1 \)), with the corresponding simulated states in compressed space (bottom, \( \tau=0 \)), shown at \( t\in\{0,5,10,15,20\} \). (B) Simulated trajectories visualized in the first three coordinates of the latent space. Colored points correspond to frames in (A).

We set nested dropout to \( K\sim\mathrm{Geom}(p) \) with \( p=1/50 \), so that \( K_{\mathrm{target}}=\mathbb{E}(K)=1/p=50 \) provides a generous latent budget. With no additional penalty on \( \mathbf{D} \), the fitted scales \( \{d_i\} \) decay rapidly (first column of Figure S1, Appendix A of manuscript).
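For intuition, the snippet below samples nested-dropout truncation indices and checks the expected budget empirically. It is a sketch under assumptions: it uses the common formulation in which coordinates after index \( K \) are zeroed, takes \( K \) on the support \( \{1,2,\dots\} \) (NumPy's convention, for which \( \mathbb{E}(K)=1/p \)), and picks a hypothetical latent width of 256. The function name `nested_dropout_mask` is ours.

```python
import numpy as np


def nested_dropout_mask(latent_dim, p, rng):
    """Sample K ~ Geom(p) on {1, 2, ...} and keep only the first K coordinates."""
    k = min(int(rng.geometric(p)), latent_dim)
    mask = np.zeros(latent_dim)
    mask[:k] = 1.0
    return mask


rng = np.random.default_rng(0)
p = 1 / 50
mean_k = rng.geometric(p, size=200_000).mean()  # empirical E[K], close to 1/p = 50
mask = nested_dropout_mask(latent_dim=256, p=p, rng=rng)
```

Because early coordinates survive truncation far more often than late ones, they are pushed to carry the most information, which is what makes the first three coordinates a natural choice for visualization.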

We define the effective dimension as the smallest \( K \) such that \( \sum_{i=1}^{K} d_i \geq 0.95\sum_{i=1}^{D} d_i \), which yields \( K_{\mathrm{eff}}=30 \) in this run. As expected (Figure 3A, supplemental videos), simulated frames in data space (\( \tau=1 \)) track the ground-truth ball location over time, while the corresponding rollouts in compressed space (\( \tau=0 \), embedded in image space) yield nearly identical images.
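The effective-dimension rule can be computed directly from the fitted scales. The helper below is a small sketch that assumes the \( d_i \) are nonnegative and already in nested-dropout order. The name `effective_dimension` and the example scales are hypothetical.

```python
import numpy as np


def effective_dimension(d, fraction=0.95):
    """Smallest K with sum(d[:K]) >= fraction * sum(d), for nonnegative scales d."""
    d = np.asarray(d, dtype=float)
    cumulative = np.cumsum(d)
    # First index whose cumulative sum reaches the threshold, converted to a count.
    return int(np.searchsorted(cumulative, fraction * cumulative[-1], side="left")) + 1


scales = np.array([90.0, 6.0, 3.0, 1.0])  # hypothetical fitted scales
k_eff = effective_dimension(scales)        # cumulative sums 90, 96, 99, 100 -> K = 2
```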

Moreover, the simulated latent trajectory in the first three coordinates of \( \boldsymbol{\tilde{\mu}} \) forms a smooth closed loop (Figure 3B, supplemental videos), consistent with the underlying periodic motion. In further experiments imposing additional shrinkage on \( \mathbf{D} \) (Figure 4; Figure S1, Appendix A of manuscript), we found that inferred dynamics and latent spaces were reproducible across runs. This is expected given that our model produces identifiable representations by construction, with additional cross-run stability checks in Figure 8.

For these and subsequent experiments (cf. Sections 4.2-4.4), we compare DCF against several competing approaches, namely VAEs (Kingma & Welling, 2014; Rezende et al., 2014), MARBLE (Gosztolai et al., 2025), LFADS (Pandarinath et al., 2018), DMD (Kutz et al., 2016), T-PHATE (Busch et al., 2023), and CEBRA (Schneider et al., 2023). Because T-PHATE needs to load all data into memory, we could only train it on a subset of the larger datasets. Results for these models on the rotating-ball dataset are shown in Figure 5. Surprisingly, several of these models struggled to identify a simple, low-dimensional manifold underlying the data.

4.2 Neural Data

We evaluated DCF on a dataset comprising population neural activity from a center-out reach task performed by nonhuman primates (592 training trials, 197 held-out test trials; Pei et al., 2021). Each trial’s data consisted of smoothed spike counts from 137 neurons over 100 time steps. All methods in Table 1 used the same train/test split and neural time window, with cursor velocity used only afterward for downstream linear decoding.

| Method | Median \( R^2 \) (25th percentile, 75th percentile) |
| --- | --- |
| DCF (ours) | 0.304 (0.020, 0.505) |
| VAE* | 0.242 (0.041, 0.427) |
| MARBLE* | 0.195 (-0.083, 0.336) |
| LFADS | 0.311 (0.093, 0.416) |
| DMD* | 0.014 (-0.157, 0.128) |
| T-PHATE* | -0.012 (-0.152, 0.010) |
| CEBRA* | 0.202 (0.010, 0.294) |

Table 1: Model comparisons for neural data (cursor velocity prediction). Linear decoding of cursor velocity from 3D latent representations of neural activity. Reported values are \( R^2 \) quartiles across held-out trials. Asterisks (*) indicate models performing significantly worse than ours (\( p<0.05 \), one-sided Wilcoxon signed-rank test with Bonferroni correction for multiple comparisons).
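The evaluation summarized in Table 1 can be sketched as follows: fit a linear decoder on training latents, score each held-out trial by \( R^2 \), and report quartiles across trials. This is an illustration on synthetic data, not the authors' pipeline. The trial counts, noise level, and `W_true` are made up, and pooling squared errors over time steps and both velocity axes is an assumed convention that the text does not specify.

```python
import numpy as np


def fit_linear_decoder(latents, velocity):
    """Least-squares map from latents (n, 3) to velocity (n, 2), with an intercept."""
    X = np.column_stack([latents, np.ones(len(latents))])
    W, *_ = np.linalg.lstsq(X, velocity, rcond=None)
    return W


def trial_r2(latents, velocity, W):
    """R^2 for one trial, pooling squared errors over time steps and velocity axes."""
    X = np.column_stack([latents, np.ones(len(latents))])
    ss_res = np.sum((velocity - X @ W) ** 2)
    ss_tot = np.sum((velocity - velocity.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot


rng = np.random.default_rng(0)
n_train, n_test, n_steps = 40, 15, 100                     # hypothetical sizes
W_true = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]])   # hypothetical latent-to-velocity map


def make_trials(n):
    """Synthetic trials: 3D latents and noisy 2D velocities that depend on them linearly."""
    z = rng.normal(size=(n, n_steps, 3))
    return z, z @ W_true + 0.5 * rng.normal(size=(n, n_steps, 2))


z_train, v_train = make_trials(n_train)
z_test, v_test = make_trials(n_test)

# Fit on all training time points, then score each held-out trial separately.
W = fit_linear_decoder(z_train.reshape(-1, 3), v_train.reshape(-1, 2))
r2 = np.array([trial_r2(z, v, W) for z, v in zip(z_test, v_test)])
q25, median, q75 = np.percentile(r2, [25, 50, 75])
```

Per-trial \( R^2 \) can be negative when the decoder does worse than predicting the trial mean, which is why some lower quartiles in Table 1 fall below zero.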

For this experiment, the encoder feature map \( \boldsymbol{\mu}_{\boldsymbol{\psi}}(\mathbf{x}) \), the compressive flow \( \mathbf{u}_{\boldsymbol{\phi}} \), and the dynamical flow \( \mathbf{v}_{\boldsymbol{\theta}} \) were all parameterized by multilayer perceptrons (MLPs) with depth 4 and width 256. We use a lag-10 dynamical history (\( h=10 \)) and nested dropout \( p=1/50 \) for all reported results.

As expected, most latent models, including DCF, learned a well-organized latent space in which neural dynamics corresponding to distinct reach targets organized topographically (Figure 6). Quantification on a velocity prediction task using these latent spaces (Table 1) shows that DCF significantly outperforms most comparison models and performs on par with LFADS, while falling well short of larger, prediction-focused approaches (e.g., Azabou et al., 2023). Additionally, our model achieves higher reconstruction quality of neural activity (i.e., firing rates) on held-out data (Table 2). Consistent with latent identifiability up to sign, DCF also recovers stable target-organized 3D latent geometry across five random seeds, whereas LFADS and CEBRA show lower seed-to-seed consistency (Figure 8).

| Method | Median \( R^2 \) (25th percentile, 75th percentile) |
| --- | --- |
| DCF (ours) | 0.999 (0.998, 0.999) |
| VAE | 0.618 (0.559, 0.678) |
| LFADS | 0.595 (0.527, 0.661) |

Table 2: Model comparisons for neural data (reconstruction). Reconstruction of neural activity (firing rates) on held-out test data. Reported values are \( R^2 \) quartiles across held-out trials.

4.3 Video Data

We next applied DCF to a well-studied behavioral video dataset (Musall et al., 2019), which consists of a single long video with \( 71{,}942 \) frames of size \( 64\times 64 \). These are challenging data for most dimension reduction methods, since most parts of the frame are highly static, with only intermittent bursts of activity. Here, since we focus on inference, we did not use a train/test split and do not report long-horizon trajectory rollouts. Both flows were parameterized by 4-level convolutional encoder-decoders (U-Net style) with channel widths \( \{32,64,128,256\} \) and a 256-dimensional bottleneck embedding. We fit the model with nested dropout \( p=1/50 \) and no history (\( h=0 \)), yielding an effective dimension \( K_{\mathrm{eff}}=30 \). Since the behavior is highly repetitive, we visualized only the first 1,438 frames (about \( 2\% \) of the video), which is sufficient to capture the latent structure.

The latent structure forms four prominent bands in 3D latent space (Figure 7A), separated primarily along \( (\mu_1,\mu_2) \) while sharing a common within-band axis \( \mu_3 \).

Within each band, lower values of \( \mu_3 \) correspond to stronger mouth movement, while higher \( \mu_3 \) is more quiescent (Figure 7A,B). Across bands, the mean appearance is broadly similar, but each band captures a slightly different baseline visual state, reflected by systematic mean shifts and variability patterns (Figure 7B). Moreover, outliers selected by maximal latent distance or maximal velocity magnitude (Figure 7C) correspond to transient paw and controller movements. Importantly, this latent structure was not well captured by comparison models (Figure 7D), and the same four-band structure was recovered across five random seeds (Figure 8). For additional details on latent structure identified by our model and comparisons against competing approaches, see our supplemental videos.

4.4 Audio Data

Lastly, we evaluated DCF on a birdsong dataset (262 training trials, 66 test trials), converted to sequences of 26 spectrograms of size \( 64\times 64 \). We used the same convolutional architectures as in the video experiment, and the encoder feature map \( \boldsymbol{\mu}_{\boldsymbol{\psi}}(\mathbf{x}) \) was a convolutional VAE with the same multiscale schedule. We fit the model with nested dropout \( p=1/50 \) and no history (\( h=0 \)), and evaluated simulated rollouts.

On training trials, DCF rollouts preserved the syllable-level structure of the motif and reproduced the major time-frequency energy patterns across successive syllables (Figure 9, supplemental videos).

In the learned 3D latent coordinates, simulated trajectories concentrate on a low-dimensional curved manifold, and transitions between syllables align with regions of larger latent velocity magnitude (Figure 9, supplemental videos).

On held-out trials, the same latent geometry is retained and the decoded rollouts remain qualitatively consistent with the ground-truth spectrograms, indicating that the learned dynamics generalize beyond the training set (Figure 10). This is in contrast to typical VAE-based approaches (Goffinet et al., 2021; Sainburg et al., 2020), which preserve structure at the syllable level while failing to smoothly capture dynamics (Figure 11).
