Across experiments, we analyze representations at compression levels \( \tau\in\{0,1\} \). At generation time, we integrate both learned flows (dynamic and compressive) as neural ODEs using an adaptive Dormand–Prince (DOPRI5) solver.

We obtain the flowed data representation \( \{\tilde{\mathbf{x}}^{(\tau)}_t\} \) in two ways: flowed, where each observed frame \( \mathbf{x}^{(1)}_t \) is compressed pointwise to level \( \tau \) by integrating the compressive flow \( \mathbf{u}_{\phi} \) from \( \tau=1 \); and simulated, where we first compress an initial frame to \( \tilde{\mathbf{x}}^{(\tau)}_0 \) and then integrate the dynamical flow \( \mathbf{v}_{\theta} \) forward in time at the same \( \tau \).
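The difference between the two routes can be sketched on a toy problem. The example below is only an illustration under stated assumptions: `u_phi` and `v_theta` are hypothetical linear stand-ins for the learned networks (a uniform shrinkage and a rotation, chosen so that the two fields commute), and the dynamical field is taken to be independent of \( \tau \). Integration uses SciPy's `solve_ivp` with `method="RK45"`, which is the Dormand–Prince 5(4) pair.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Toy stand-ins (assumptions): the real u_phi and v_theta are neural networks.
ROT = np.array([[0.0, -1.0], [1.0, 0.0]])  # generator of counterclockwise rotation

def u_phi(tau, x):
    # Compressive field dx/dtau = x: integrating from tau=1 down to tau=0 scales x by exp(-1).
    return x

def v_theta(t, x):
    # Dynamical field dx/dt = ROT @ x: rotation at unit angular speed.
    return ROT @ x

def integrate(field, x_start, s_start, s_end):
    """Integrate dx/ds = field(s, x) from s_start to s_end with Dormand-Prince (RK45)."""
    if s_start == s_end:
        return np.array(x_start, dtype=float)
    sol = solve_ivp(field, (s_start, s_end), x_start,
                    method="RK45", rtol=1e-9, atol=1e-12)
    return sol.y[:, -1]

times = np.linspace(0.0, 2.0, 6)
x0 = np.array([1.0, 0.0])  # initial frame at tau = 1
frames = [integrate(v_theta, x0, 0.0, t) for t in times]  # "observed" frames at tau = 1

# Flowed: compress every observed frame pointwise from tau = 1 to tau = 0.
flowed = np.array([integrate(u_phi, x, 1.0, 0.0) for x in frames])

# Simulated: compress only the initial frame, then roll the dynamics forward at tau = 0.
z0 = integrate(u_phi, x0, 1.0, 0.0)
simulated = np.array([integrate(v_theta, z0, 0.0, t) for t in times])
```

In this toy setting the two routes agree up to solver tolerance because the fields commute; with learned flows, any gap between `flowed` and `simulated` reflects how well the dynamical flow tracks the compressed data.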

At \( \tau=0 \), we map compressed states to encoder-normalized latent coordinates via \( \boldsymbol{\tilde{\mu}}_t=(\mathbf{L}\mathbf{D}^{1/2})^{-1}\left(\mathbf{\tilde{x}}^{(0)}_t-\mathbf{b}\right) \) and visualize the first three coordinates, which are the most important under nested-dropout ordering. We also project velocities via \( \boldsymbol{\dot{\tilde{\mu}}}_t=(\mathbf{L}\mathbf{D}^{1/2})^{-1}\mathbf{\dot{\tilde{x}}}^{(0)}_t \) and plot dynamical velocities at the start of each step (\( s=0 \) in Equation 4, Section 2.4).
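As a minimal sketch of this projection, the snippet below applies the two formulas with NumPy. The shapes and values are assumptions made for illustration: \( \mathbf{L} \) is taken to be a square, unit lower-triangular matrix, \( \mathbf{D} \) a diagonal matrix with positive entries `d`, and `to_latent` and `velocity_to_latent` are hypothetical helper names, not functions from the released code.

```python
import numpy as np

def to_latent(x, L, d, b):
    """Map rows of x (shape (T, D)) to mu = (L D^{1/2})^{-1} (x - b)."""
    M = L @ np.diag(np.sqrt(d))
    # Solve M mu = (x - b) for all time points at once instead of forming the inverse.
    return np.linalg.solve(M, (x - b).T).T

def velocity_to_latent(x_dot, L, d):
    """Velocities transform with the same linear map, without the offset b."""
    M = L @ np.diag(np.sqrt(d))
    return np.linalg.solve(M, x_dot.T).T

rng = np.random.default_rng(0)
D = 5
L = np.tril(rng.normal(size=(D, D)), k=-1) + np.eye(D)  # assumed unit lower-triangular
d = np.array([4.0, 2.0, 1.0, 0.5, 0.25])                # assumed positive scales
b = rng.normal(size=D)

# Round trip: build states from known latents, then recover the latents.
mu_true = rng.normal(size=(10, D))
x = mu_true @ (L @ np.diag(np.sqrt(d))).T + b
mu = to_latent(x, L, d, b)
first_three = mu[:, :3]  # coordinates used for the 3D visualizations
```

Solving the linear system rather than inverting \( \mathbf{L}\mathbf{D}^{1/2} \) explicitly is a numerical convenience; the result is the same map.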

Training times and parameter choices for each experiment, as well as trajectory roll-out time estimates, can be found in the Appendix of our manuscript (Sections B and C, respectively). Links to the manuscript, repo, and bird (audio) data are listed under “Resources”.

4.1 Simulated Data

To test the ability of DCF to extract low-dimensional dynamics from high-dimensional data, we simulate \( 10 \) short videos of a ball moving counterclockwise (Figure 3). Each video comprises \( 50 \) time steps, with each frame being a \( 28\times 28 \) grayscale image. Both the compressive flow \( \mathbf{u}_{\phi} \) and the dynamical flow \( \mathbf{v}_{\theta} \) are parameterized by 4-level convolutional encoder-decoders (U-Net style) with channel widths \( \{32,64,128,256\} \) and a 256-dimensional bottleneck embedding. The encoder feature map \( \boldsymbol{\mu}_{\boldsymbol{\psi}}(\mathbf{x}) \) is a convolutional VAE with the same multiscale channel schedule. We use no dynamical history. That is, \( h=0 \) for \( \mathbf{x}_{\mathrm{hist}}^{(\tau)} \) in Section 2.4.

Figure 3: Rotating-ball simulation. We simulated 10 videos of a ball moving counterclockwise (50 frames, \( 28\times 28 \)) and trained DCF with nested dropout (\( p=\frac{1}{50} \)) and no history (\( h=0 \)). (A) Ground-truth frames (top) and rollouts decoded in data space (middle, \( \tau=1 \)), with the corresponding simulated states in compressed space (bottom, \( \tau=0 \)), shown at \( t\in\{0,5,10,15,20\} \). (B) Simulated trajectories visualized in the first three coordinates of the latent space. Colored points correspond to frames in (A).

For nested dropout, we draw \( K\sim\mathrm{Geom}(p) \) with \( p=1/50 \), so that \( K_{\mathrm{target}}=\mathbb{E}(K)=50 \) provides a generous latent budget. With no additional penalty on \( \mathbf{D} \), the fitted scales \( \{d_i\} \) decay rapidly (first column of Figure S1, Appendix A of manuscript).
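The relation \( \mathbb{E}(K)=1/p \) holds for a geometric distribution supported on \( \{1,2,\dots\} \). The sketch below samples such a \( K \) and builds the usual nested-dropout mask that keeps the first \( K \) latent coordinates and zeroes the rest. How the mask enters the DCF loss is not specified here, so the masking step, the function name `nested_dropout_mask`, and the latent size of 256 are assumptions for illustration.

```python
import numpy as np

def nested_dropout_mask(rng, p, dim, batch):
    """Sample K ~ Geom(p) on {1, 2, ...} and keep the first K of `dim` coordinates."""
    K = rng.geometric(p, size=batch)   # number of trials to first success, mean 1/p
    K = np.minimum(K, dim)             # cannot keep more coordinates than exist
    mask = (np.arange(dim)[None, :] < K[:, None]).astype(float)
    return mask, K

rng = np.random.default_rng(0)
mask, K = nested_dropout_mask(rng, p=1 / 50, dim=256, batch=20000)
mean_K = K.mean()  # close to 1/p = 50 (slightly lower because of the clipping at dim)
```

Because the first coordinate is always kept and later coordinates are kept with geometrically decreasing probability, the leading coordinates are pushed to carry the most information, which is what justifies plotting the first three.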

We define the effective dimension as the smallest \( K \) such that \( \sum_{i=1}^{K} d_i \geq 0.95\sum_{i=1}^{D} d_i \), which yields \( K_{\mathrm{eff}}=30 \) in this run. As expected (Figure 3A), simulated frames in data space (\( \tau=1 \)) track the ground-truth ball location over time, while the corresponding rollouts in compressed space (\( \tau=0 \), embedded in image space) produce nearly identical images.
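The effective-dimension criterion above amounts to a cumulative-sum threshold. A minimal sketch, assuming the scales \( d_i \) are non-negative and already listed in nested-dropout order (the function name and the example scales are hypothetical):

```python
import numpy as np

def effective_dimension(d, frac=0.95):
    """Smallest K with sum(d[:K]) >= frac * sum(d)."""
    d = np.asarray(d, dtype=float)
    cum = np.cumsum(d)
    # searchsorted(..., side="left") returns the first index whose cumulative sum
    # reaches the threshold; add 1 to convert the 0-based index into a count.
    return int(np.searchsorted(cum, frac * cum[-1], side="left") + 1)

scales = [8.0, 4.0, 2.0, 1.0, 1.0]               # toy scales, total 16
k_95 = effective_dimension(scales)               # threshold 15.2 -> needs all 5
k_90 = effective_dimension(scales, frac=0.90)    # threshold 14.4 -> first 4 suffice
```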

Figure 4: Soft 3D latent representation via nested dropout for ball simulation. We repeat the rotating-ball experiment with no history (\( h=0 \)) and a tighter nested-dropout budget (\( p=\frac{1}{3} \), so \( K_{\mathrm{target}}=\mathbb{E}(K)=3 \)). (A) Absolute per-pixel deviation between the original frame and its compression to \( \tau=0 \), comparing the encoder endpoint versus the learned compressive flow (rMAE: root mean absolute error per pixel). (B) Projected dynamical velocities from simulated trajectories in the first three latent coordinates. (C) Simulated trajectories in the same 3D latent space, with five representative time points from trial 1 highlighted. (D) Corresponding ground-truth frames (top), rollouts in data space (middle, \( \tau=1 \)), and rollouts in compressed space (bottom, \( \tau=0 \)).

Moreover, the simulated latent trajectory in the first three coordinates of \( \boldsymbol{\tilde{\mu}} \) forms a smooth closed loop (Figure 3B), consistent with the underlying periodic motion. In further experiments imposing additional shrinkage on \( \mathbf{D} \) (Figure 4; Figure S1, Appendix A of manuscript), we found that inferred dynamics and latent spaces were reproducible across runs. This is expected given that our model produces identifiable representations by construction, with additional cross-run stability checks in Figure 8.

For these and subsequent experiments (cf. Sections 4.2-4.4) we compare DCF against several competing approaches, namely VAEs (Kingma & Welling, 2014; Rezende et al., 2014), MARBLE (Gosztolai et al., 2025), LFADS (Pandarinath et al., 2018), DMD (Kutz et al., 2016), T-PHATE (Busch et al., 2023), and CEBRA (Schneider et al., 2023). Because T-PHATE must load the full dataset into memory, for larger datasets we could only train it on a subset of the data. Results for these models on the ball dataset are shown in Figure 5. Surprisingly, several of these models struggled to identify a simple, low-dimensional manifold underlying the data.

Figure 5: 3D latent representations of comparison models on the rotating ball dataset. Nearly all comparison models fail to capture the cyclical latent structure of the ball dataset; those that do capture it display more variable latent trajectories than DCF.

4.2 Neural Data

We evaluated DCF on a dataset comprising population neural activity from a center-out reach task performed by nonhuman primates (592 training trials, 197 held-out test trials; Pei et al., 2021). Each trial’s data consisted of smoothed spike counts from 137 neurons over 100 time steps. All methods in Table 1 used the same train/test split and neural time window, with cursor velocity used only afterward for downstream linear decoding.

| Method | Median \( R^2 \) (25th percentile, 75th percentile) |
| --- | --- |
| DCF (ours) | 0.304 (0.020, 0.505) |
| VAE* | 0.242 (0.041, 0.427) |
| MARBLE* | 0.195 (-0.083, 0.336) |
| LFADS | 0.311 (0.093, 0.416) |
| DMD* | 0.014 (-0.157, 0.128) |
| T-PHATE* | -0.012 (-0.152, 0.010) |
| CEBRA* | 0.202 (0.010, 0.294) |

Table 1: Model comparisons for neural data (cursor velocity prediction). Decoding (linear predictions) of cursor velocity from 3D latent representations of neural activity. Reported values are \( R^2 \) quartiles across held-out trials. Asterisks (*) indicate models performing significantly worse than ours (p-value \( <0.05 \), one-sided Wilcoxon signed-rank test with Bonferroni correction for multiple comparisons).
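The summary statistics and significance markers in Table 1 can be computed along the following lines. This is a sketch on synthetic per-trial \( R^2 \) values, not the reported data: the arrays, the baseline names, and the helper functions are hypothetical, and trials are assumed to be paired across methods, as a signed-rank test requires. It uses `scipy.stats.wilcoxon` with `alternative="greater"` and a Bonferroni correction that multiplies each p-value by the number of comparisons.

```python
import numpy as np
from scipy.stats import wilcoxon

def quartile_summary(r2):
    """Return (median, 25th percentile, 75th percentile) of per-trial R^2."""
    q25, q50, q75 = np.percentile(r2, [25, 50, 75])
    return float(q50), float(q25), float(q75)

def bonferroni_wilcoxon(reference, baselines):
    """One-sided paired test that `reference` exceeds each baseline, Bonferroni-adjusted."""
    m = len(baselines)
    adjusted = {}
    for name, r2 in baselines.items():
        p = wilcoxon(reference, r2, alternative="greater").pvalue
        adjusted[name] = min(1.0, float(p) * m)
    return adjusted

# Synthetic per-trial scores (illustration only, not the values behind Table 1).
rng = np.random.default_rng(0)
n_trials = 197
reference = rng.normal(0.3, 0.2, size=n_trials)
baselines = {
    "worse": reference - 0.1 + rng.normal(0.0, 0.05, size=n_trials),
    "better": reference + 0.1 + rng.normal(0.0, 0.05, size=n_trials),
}
summary = quartile_summary(reference)
p_adj = bonferroni_wilcoxon(reference, baselines)
starred = [name for name, p in p_adj.items() if p < 0.05]  # would receive an asterisk
```

Note that a missing asterisk only means the one-sided test did not reject; it does not by itself show that a baseline is better than the reference.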

For this experiment, the encoder feature map \( \boldsymbol{\mu}_{\boldsymbol{\psi}}(\mathbf{x}) \), the compressive flow \( \mathbf{u}_{\boldsymbol{\phi}} \), and the dynamical flow \( \mathbf{v}_{\boldsymbol{\theta}} \) were all parameterized by multilayer perceptrons (MLPs) with depth 4 and width 256. We use a lag-10 dynamical history (\( h=10 \)) and nested dropout \( p=1/50 \) for all reported results.

Figure 6: 3D latent representations of neural data. Left: Flowed mean trajectories (compressing data pointwise via compressive flow \( \mathbf{u}_{\boldsymbol{\phi}} \)) in the first three latent coordinates. Color indicates monkey reach direction; lines are averaged across reach direction. Right: 3D latent trajectories of comparison models. The latent representations of CEBRA (without supervision), LFADS, VAE, MARBLE, and DMD are averaged across monkey reach direction. The T-PHATE latent representation is unstructured and therefore not averaged.

As expected, most latent models, including DCF, learned a well-organized latent space in which neural dynamics corresponding to distinct reach targets are organized topographically (Figure 6). Quantification on a velocity prediction task using these latent spaces (Table 1) shows that DCF decodes velocity about as well as LFADS (median \( R^2 \) of 0.304 vs. 0.311) and significantly better than the remaining comparison models, despite falling well short of larger, prediction-focused approaches (e.g., Azabou et al., 2023). Additionally, our model achieves higher reconstruction quality of neural activity (i.e., firing rates) on held-out data (Table 2). Consistent with latent identifiability up to sign, DCF also recovers stable target-organized 3D latent geometry across five random seeds, whereas LFADS and CEBRA show lower seed-to-seed consistency (Figure 8).

| Method | Median \( R^2 \) (25th percentile, 75th percentile) |
| --- | --- |
| DCF (ours) | 0.999 (0.998, 0.999) |
| VAE | 0.618 (0.559, 0.678) |
| LFADS | 0.595 (0.527, 0.661) |

Table 2: Model comparisons for neural data (reconstruction). Reconstruction of neural activity (firing rates) on held-out test data. Reported values are \( R^2 \) quartiles across held-out trials.

4.3 Video Data

We next applied DCF to a well-studied behavioral video dataset (Musall et al., 2019), which consists of a single long video with \( 71{,}942 \) frames of size \( 64\times 64 \). These are challenging data for most dimension reduction methods, since most parts of the frame are highly static, with only intermittent bursts of activity. Here, since we focus on inference, we did not use a train/test split and did not compute long-horizon trajectory rollouts. Both flows were parameterized by 4-level convolutional encoder-decoders (U-Net style) with channel widths \( \{32,64,128,256\} \) and a 256-dimensional bottleneck embedding. We fit the model with nested dropout \( p=1/50 \) and no history (\( h=0 \)), yielding an effective dimension \( K_{\mathrm{eff}}=30 \). Since the behavior is highly repetitive, we visualized only the first 1,438 frames (about \( 2\% \) of the video), which is sufficient to capture the latent structure.

The latent structure forms four prominent bands in 3D latent space (Figure 7A), separated primarily along \( (\mu_1,\mu_2) \) while sharing a common within-band axis \( \mu_3 \).

Figure 7: Latent structure in behavioral video. (A) Flowed latent structure forms four prominent bands (colored points), with outliers marked by maximal latent distance (brown \( \times \)) and maximal velocity magnitude (cyan \( \times \)). (B) Representative snapshots along each band (top/mid/low in \( \mu_3 \)). Lowering \( \mu_3 \) corresponds to stronger mouth movement, paw lift, or both. (C) Outlier frames selected by maximal latent distance (left) or maximal velocity magnitude (right) correspond to transient paw and controller movements. (D) 3D latent representations from comparison models. Most either collapse the four bands or fail to capture outliers.

Within each band, lower values of \( \mu_3 \) correspond to stronger mouth movement, while higher \( \mu_3 \) is more quiescent (Figure 7A,B). Across bands, the mean appearance is broadly similar, but each band captures a slightly different baseline visual state, reflected by systematic mean shifts and variability patterns (Figure 7B). Moreover, outliers selected by maximal latent distance or maximal velocity magnitude (Figure 7C) correspond to transient paw and controller movements. Importantly, this latent structure was not well captured by comparison models (Figure 7D), and the same four-band structure was recovered across five random seeds (Figure 8). For additional details on latent structure identified by our model and comparisons against competing approaches, see our supplemental videos.

Figure 8: Cross-run stability of learned latent representations. (A) Musall mouse-video experiment. Top: sign-aligned 3D DCF latent representations from five random seeds. Bottom: pairwise mean cosine similarity between seeds. (B) Monkey center-out reach experiment. Top: sign-aligned, target-averaged 3D DCF latent trajectories across five random seeds. Bottom: pairwise mean cosine similarity between seeds. (C) Pairwise mean cosine similarity for LFADS and CEBRA on the monkey reach experiment. All heatmaps use the same color scale.
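One way to compute a cross-seed similarity matrix of this kind is sketched below. The exact alignment and averaging procedure is not spelled out in this section, so everything here is an assumption: each seed's latents are sign-aligned coordinate-wise to the first seed, and similarity is the cosine between corresponding latent points, averaged over points. The data are synthetic.

```python
import numpy as np

def sign_align(Z, Z_ref):
    """Flip each latent coordinate of Z so that it correlates positively with Z_ref."""
    signs = np.sign(np.sum(Z * Z_ref, axis=0))
    signs[signs == 0] = 1.0
    return Z * signs

def mean_cosine(Za, Zb):
    """Cosine similarity between corresponding rows, averaged over rows."""
    num = np.sum(Za * Zb, axis=1)
    den = np.linalg.norm(Za, axis=1) * np.linalg.norm(Zb, axis=1)
    return float(np.mean(num / den))

def pairwise_similarity(latents):
    """Seeds-by-seeds matrix of mean cosine similarity after sign alignment to seed 0."""
    aligned = [sign_align(Z, latents[0]) for Z in latents]
    n = len(aligned)
    S = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            S[i, j] = S[j, i] = mean_cosine(aligned[i], aligned[j])
    return S

# Synthetic "seeds": one shared 3D structure with random sign flips and small noise.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
seeds = [base * rng.choice([-1.0, 1.0], size=3) + 0.01 * rng.normal(size=base.shape)
         for _ in range(5)]
S = pairwise_similarity(seeds)
```

Sign alignment is the only correction applied, which matches identifiability up to sign; a method whose latents are identifiable only up to rotation would score low under this measure without an additional rotational alignment.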

4.4 Audio Data

Lastly, we evaluated DCF on a birdsong dataset (262 training trials, 66 test trials) converted to sequences of 26 spectrogram frames of size \( 64\times 64 \). We used the same convolutional architectures as in the video experiment, and the encoder feature map \( \boldsymbol{\mu}_{\boldsymbol{\psi}}(\mathbf{x}) \) was a convolutional VAE with the same multiscale channel schedule. We fit the model with nested dropout \( p=1/50 \) and no history (\( h=0 \)), evaluating simulated rollouts.

On training trials, DCF rollouts preserved the syllable-level structure of the motif and reproduced the major time-frequency energy patterns across successive syllables (Figure 9).

Figure 9: DCF captures birdsong latent dynamics. (A) Example motif with syllables A–D, shown as ground-truth spectrogram frames (top), simulated rollouts decoded in data space (middle, \( \tau=1 \)), and compressed space (bottom, \( \tau=0 \)). (B) Projected latent velocity field from simulated rollouts in the first three latent coordinates, with arrows normalized to unit length and color indicating velocity magnitude. (C) Simulated latent trajectories in the same 3D space, with one representative rollout highlighted and colored by time.

In the learned 3D latent coordinates, simulated trajectories concentrate on a low-dimensional curved manifold, and transitions between syllables align with regions of larger latent velocity magnitude (Figure 9).

Figure 10: Birdsong rollouts on the test set. Left: simulated latent trajectories in the first three coordinates of \( \tilde{\boldsymbol{\mu}} \) for held-out trials. Right: ground-truth spectrogram frames (top) and corresponding simulated rollouts decoded in data space (middle, \( \tau=1 \)) and compressed space (bottom, \( \tau=0 \)), shown at matched time points.

On held-out trials, the same latent geometry is retained and the decoded rollouts remain qualitatively consistent with the ground-truth spectrograms, indicating that the learned dynamics generalize beyond the training set (Figure 10). This is in contrast to typical VAE-based approaches (Goffinet et al., 2021; Sainburg et al., 2020), which preserve structure at the syllable level while failing to smoothly capture dynamics (Figure 11).

Figure 11: Comparison embeddings using a 3D VAE. Left: simulated trajectories from a DCF model (cf. Figure 9C). Middle: embeddings of a VAE trained using 100 ms spectrogram windows. Right: embeddings of a VAE trained using 20 ms spectrogram windows. In general, VAEs trained on short data windows produce latent spaces with disorganized temporal structure, while VAEs trained on longer windows exhibit more obvious structure but more variability than DCF.
