Across experiments, we analyze representations at compression levels \( \tau\in \{0,1\} \). At generation time, we integrate both learned flows (dynamic and compressive) as neural ODEs using an adaptive Dormand–Prince (DOPRI5) solver.
We obtain the compressed data representation \( \{\tilde{\mathbf{x}}^{(\tau)}_t\} \) in two ways: “flowed”, where each observed frame \( \mathbf{x}^{(1)}_t \) is compressed pointwise to level \( \tau \) by integrating the compressive flow \( \mathbf{u}_{\phi} \) from \( \tau=1 \); and “simulated”, where we first compress an initial frame to \( \tilde{\mathbf{x}}^{(\tau)}_0 \) and then integrate the dynamical flow \( \mathbf{v}_{\theta} \) forward in time at the same \( \tau \).
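The two procedures can be illustrated with a minimal sketch on a toy two-dimensional system. Everything below is an assumption made for illustration: the linear fields `u_phi` and `v_theta`, the constants `C_RATE` and `OMEGA`, and the helper `integrate` are hypothetical stand-ins for the learned networks, not the model used in our experiments. SciPy's `RK45` method implements the Dormand–Prince 5(4) pair.

```python
import numpy as np
from scipy.integrate import solve_ivp

C_RATE = 3.0              # hypothetical compression rate of the toy compressive flow
OMEGA = 2 * np.pi / 50    # one counterclockwise revolution per 50 time steps
J = np.array([[0.0, -1.0], [1.0, 0.0]])  # generator of counterclockwise rotation

def u_phi(tau, x):
    """Toy compressive flow dx/dtau: shrinks coordinate 2 as tau goes from 1 to 0."""
    return np.array([0.0, C_RATE * x[1]])

def compression_matrix(tau):
    """Closed-form map from tau=1 to level tau induced by the toy u_phi."""
    return np.diag([1.0, np.exp(-C_RATE * (1.0 - tau))])

def v_theta(t, x, tau):
    """Toy dynamical flow dx/dt at level tau: the rotation seen in compressed coordinates."""
    C = compression_matrix(tau)
    return C @ (OMEGA * J) @ np.linalg.inv(C) @ x

def integrate(field, x0, t0, t1, args=()):
    """Integrate an ODE from t0 to t1 (t1 < t0 is allowed) with adaptive Dormand-Prince."""
    if t0 == t1:
        return np.asarray(x0, dtype=float)
    sol = solve_ivp(field, (t0, t1), x0, method="RK45", args=args,
                    rtol=1e-8, atol=1e-10)
    return sol.y[:, -1]

def rotation(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

# Observed frames at tau = 1: a point moving counterclockwise on the unit circle.
T = 10
x_init = np.array([1.0, 0.0])
observed = np.stack([rotation(OMEGA * t) @ x_init for t in range(T)])

tau = 0.0

# "Flowed": compress every observed frame pointwise from tau = 1 down to tau.
flowed = np.stack([integrate(u_phi, x, 1.0, tau) for x in observed])

# "Simulated": compress only the first frame, then roll the dynamics forward at level tau.
simulated = [integrate(u_phi, observed[0], 1.0, tau)]
for t in range(T - 1):
    simulated.append(integrate(v_theta, simulated[-1], t, t + 1, args=(tau,)))
simulated = np.stack(simulated)
```

In this toy setting the two fields are consistent by construction, so `flowed` and `simulated` agree up to solver tolerance; with learned flows, the gap between them is itself a diagnostic of how well the dynamics were fit.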
At \( \tau=0 \), we map compressed states to encoder-normalized latent coordinates via \( \boldsymbol{\tilde{\mu}}_t=(\mathbf{L}\mathbf{D}^{1/2})^{-1}\left(\mathbf{\tilde{x}}^{(0)}_t-\mathbf{b}\right) \) and visualize the first three coordinates, which are the most important under nested-dropout ordering. We project velocities in the same way, via \( \boldsymbol{\dot{\tilde{\mu}}}_t=(\mathbf{L}\mathbf{D}^{1/2})^{-1}\mathbf{\dot{\tilde{x}}}^{(0)}_t \), and plot dynamical velocities at the start of each step (\( s=0 \) in Equation 4, Section 2.4).
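As a minimal sketch of this change of coordinates, the snippet below applies \( (\mathbf{L}\mathbf{D}^{1/2})^{-1} \) with a linear solve rather than an explicit inverse. It assumes that \( \mathbf{L} \) is a square invertible matrix and that \( \mathbf{D} \) is stored as a vector `d` of positive scales; the function names and the toy values are hypothetical.

```python
import numpy as np

def to_latent(x_tilde, L, d, b):
    """Map compressed states (rows of x_tilde) to normalized latent coordinates.

    Computes (L D^{1/2})^{-1} (x - b) for every row, assuming L is square and invertible.
    """
    A = L @ np.diag(np.sqrt(d))                    # L D^{1/2}
    return np.linalg.solve(A, (x_tilde - b).T).T

def velocity_to_latent(x_dot, L, d):
    """Map velocities with the same linear map; the offset b drops out under differentiation."""
    A = L @ np.diag(np.sqrt(d))
    return np.linalg.solve(A, x_dot.T).T

# Toy round trip with hypothetical parameters.
rng = np.random.default_rng(0)
D = 5
L_toy = np.eye(D) + 0.1 * np.tril(rng.normal(size=(D, D)), -1)  # unit lower-triangular
d_toy = np.array([4.0, 2.0, 1.0, 0.5, 0.25])
b_toy = rng.normal(size=D)

mu_true = rng.normal(size=(20, D))
x_toy = mu_true @ (L_toy @ np.diag(np.sqrt(d_toy))).T + b_toy
mu_rec = to_latent(x_toy, L_toy, d_toy, b_toy)
first_three = mu_rec[:, :3]                        # coordinates used for 3D plots
```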
Training times and parameter choices are given in Appendix B of our manuscript, and trajectory rollout time estimates in Appendix C. Links to the manuscript, code repository, and birdsong (audio) data are listed under “Resources”.
4.1 Simulated Data
To test the ability of DCF to extract low-dimensional dynamics from high-dimensional data, we simulate \( 10 \) short videos of a ball moving counterclockwise (Figure 3). Each video comprises \( 50 \) time steps, with each frame being a \( 28\times 28 \) grayscale image. Both the compressive flow \( \mathbf{u}_{\phi} \) and the dynamical flow \( \mathbf{v}_{\theta} \) are parameterized by 4-level convolutional encoder-decoders (U-Net style) with channel widths \( \{32,64,128,256\} \) and a 256-dimensional bottleneck embedding. The encoder feature map \( \boldsymbol{\mu}_{\boldsymbol{\psi}}(\mathbf{x}) \) is a convolutional VAE with the same multiscale channel schedule. We use no dynamical history, i.e., \( h=0 \) for \( \mathbf{x}_{\mathrm{hist}}^{(\tau)} \) in Section 2.4.
For nested dropout, we draw \( K\sim\mathrm{Geom}(p) \) with \( p=1/50 \), so that \( K_{\mathrm{target}}=\mathbb{E}(K)=1/p=50 \) provides a generous latent budget. With no additional penalty on \( \mathbf{D} \), the fitted scales \( \{d_i\} \) decay rapidly (first column of Figure S1, Appendix A of manuscript).
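The sampling step can be sketched as follows. This is an illustrative sketch only: the function name `nested_dropout_mask` is hypothetical, and clipping \( K \) to the number of latent coordinates is an assumption rather than a detail taken from our implementation. NumPy's geometric sampler has support \( \{1,2,\dots\} \) and mean \( 1/p \), matching \( \mathbb{E}(K)=50 \).

```python
import numpy as np

def nested_dropout_mask(n_latent, p, rng):
    """Keep the first K latent coordinates, with K ~ Geom(p), and zero out the rest."""
    K = min(int(rng.geometric(p)), n_latent)   # clipping to n_latent is an assumption
    mask = np.zeros(n_latent)
    mask[:K] = 1.0
    return mask

rng = np.random.default_rng(0)
p = 1 / 50

# Empirical check that the truncation index has mean close to 1/p = 50.
samples = rng.geometric(p, size=200_000)
mean_K = samples.mean()

# One mask for a hypothetical 256-dimensional latent vector.
mask = nested_dropout_mask(256, p, rng)
```

Because low-index coordinates survive truncation most often, they are pushed to carry the most information, which is why the first three coordinates are the ones visualized.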
We define the effective dimension as the smallest \( K \) such that \( \sum_{i=1}^{K} d_i \geq 0.95\sum_{i=1}^{D} d_i \), which yields \( K_{\mathrm{eff}}=30 \) in this run. As expected (Figure 3A), simulated frames in data space (\( \tau=1 \)) track the ground-truth ball location over time, while the corresponding rollouts in compressed space (\( \tau=0 \), embedded in image space) yield nearly identical images.
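The effective-dimension criterion above amounts to a cumulative-sum threshold. A minimal sketch, assuming the scales \( d_i \) are non-negative and already listed in nested-dropout order (the function name and the toy scales are hypothetical):

```python
import numpy as np

def effective_dimension(d, threshold=0.95):
    """Smallest K such that the first K scales hold at least `threshold` of the total."""
    d = np.asarray(d, dtype=float)
    frac = np.cumsum(d) / d.sum()                  # non-decreasing for non-negative d
    # First index whose cumulative fraction reaches the threshold, converted to a count.
    return int(np.searchsorted(frac, threshold, side="left")) + 1

# Toy example: geometrically decaying scales 1, 1/2, 1/4, ...
d_toy = 0.5 ** np.arange(10)
k_eff = effective_dimension(d_toy)
```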
Moreover, the simulated latent trajectory in the first three coordinates of \( \boldsymbol{\tilde{\mu}} \) forms a smooth closed loop (Figure 3B), consistent with the underlying periodic motion. In further experiments imposing additional shrinkage on \( \mathbf{D} \) (Figure 4; Figure S1, Appendix A of manuscript), we found that inferred dynamics and latent spaces were reproducible across runs. This is expected given that our model produces identifiable representations by construction, with additional cross-run stability checks in Figure 8.
For these and subsequent experiments (cf. Sections 4.2-4.4), we compare DCF against several competing approaches, namely VAEs (Kingma & Welling, 2014; Rezende et al., 2014), MARBLE (Gosztolai et al., 2025), LFADS (Pandarinath et al., 2018), DMD (Kutz et al., 2016), T-PHATE (Busch et al., 2023), and CEBRA (Schneider et al., 2023). Because T-PHATE must load the full dataset into memory, for larger datasets we could only train it on a subset of the data. Results for these models on the simulated ball dataset are shown in Figure 5. Surprisingly, several of these models struggled to identify a simple, low-dimensional manifold underlying the data.
4.2 Neural Data
We evaluated DCF on a dataset comprising population neural activity from a center-out reach task performed by nonhuman primates (592 training trials, 197 held-out test trials; Pei et al., 2021). Each trial’s data consisted of smoothed spike counts from 137 neurons over 100 time steps. All methods in Table 1 used the same train/test split and neural time window, with cursor velocity used only afterward for downstream linear decoding.
| Method | Velocity decoding (Table 1) |
| --- | --- |
| DCF (ours) | 0.304 (0.020, 0.505) |
| VAE* | 0.242 (0.041, 0.427) |
| MARBLE* | 0.195 (-0.083, 0.336) |
| LFADS | 0.311 (0.093, 0.416) |
| DMD* | 0.014 (-0.157, 0.128) |
| T-PHATE* | -0.012 (-0.152, 0.010) |
| CEBRA* | 0.202 (0.010, 0.294) |
For this experiment, the encoder feature map \( \boldsymbol{\mu}_{\boldsymbol{\psi}}(\mathbf{x}) \), the compressive flow \( \mathbf{u}_{\boldsymbol{\phi}} \), and the dynamical flow \( \mathbf{v}_{\boldsymbol{\theta}} \) were all parameterized by multilayer perceptrons (MLPs) with depth 4 and width 256. We use a lag-10 dynamical history (\( h=10 \)) and nested dropout \( p=1/50 \) for all reported results.
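To make the role of the history length concrete, the sketch below builds, for each time step, the window of \( h \) preceding states for a single trial. It is only an illustration: the helper `history_windows` is hypothetical, and dropping the first \( h \) steps (rather than padding them) is an assumption, not a statement of how \( \mathbf{x}_{\mathrm{hist}}^{(\tau)} \) is constructed in Section 2.4.

```python
import numpy as np

def history_windows(x, h):
    """Pair each state with the h states that precede it.

    x has shape (T, D). Returns (current, hist) with shapes (T - h, D) and (T - h, h, D);
    the first h time steps are dropped because they lack a full history.
    """
    x = np.asarray(x)
    T, D = x.shape
    if h == 0:
        return x, np.empty((T, 0, D))
    current = x[h:]
    hist = np.stack([x[t - h:t] for t in range(h, T)])
    return current, hist

# One toy trial with the dimensions of this experiment: 100 time steps, 137 neurons.
rng = np.random.default_rng(0)
trial = rng.normal(size=(100, 137))
current, hist = history_windows(trial, h=10)
```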
As expected, most latent models, including DCF, learned a well-organized latent space in which neural dynamics corresponding to distinct reach targets were organized topographically (Figure 6). Quantification on a velocity prediction task using these latent spaces (Table 1) shows that DCF performs on par with LFADS (0.304 vs. 0.311) and outperforms the remaining models, despite falling well short of larger, prediction-focused approaches (e.g., Azabou et al., 2023). Additionally, our model achieves higher reconstruction quality of neural activity (i.e., firing rates) on held-out data than the VAE and LFADS (Table 2). Consistent with latent identifiability up to sign, DCF also recovers stable target-organized 3D latent geometry across five random seeds, whereas LFADS and CEBRA show lower seed-to-seed consistency (Figure 8).
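Because latents are identifiable only up to sign, comparing runs requires resolving a per-coordinate sign before measuring agreement. The following is a minimal sketch of one way to do this, not the procedure behind Figure 8; the function names and the synthetic data are hypothetical.

```python
import numpy as np

def align_signs(reference, latents):
    """Flip each latent coordinate so that it correlates positively with the reference."""
    ref_c = reference - reference.mean(axis=0)
    lat_c = latents - latents.mean(axis=0)
    signs = np.sign(np.sum(ref_c * lat_c, axis=0))   # sign of the per-coordinate covariance
    signs[signs == 0] = 1.0
    return latents * signs

def coordinate_correlations(reference, latents):
    """Per-coordinate Pearson correlation between two runs after sign alignment."""
    aligned = align_signs(reference, latents)
    return np.array([np.corrcoef(reference[:, k], aligned[:, k])[0, 1]
                     for k in range(reference.shape[1])])

# Synthetic check: a second "seed" equal to the first up to sign flips and small noise.
rng = np.random.default_rng(0)
seed_a = rng.normal(size=(200, 3))
seed_b = seed_a * np.array([1.0, -1.0, -1.0]) + 0.01 * rng.normal(size=(200, 3))
corrs = coordinate_correlations(seed_a, seed_b)
```

Correlations near one after alignment indicate that two seeds recovered the same coordinates; methods without this identifiability would generally need a rotation or a more general alignment before such a comparison is meaningful.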
| Method | Held-out reconstruction quality (Table 2) |
| --- | --- |
| DCF (ours) | 0.999 (0.998, 0.999) |
| VAE | 0.618 (0.559, 0.678) |
| LFADS | 0.595 (0.527, 0.661) |
4.3 Video Data
We next applied DCF to a well-studied behavioral video dataset (Musall et al., 2019), which consists of a single long video with \( 71{,}942 \) frames of size \( 64\times 64 \). These are challenging data for most dimension reduction methods, since most parts of the frame are highly static, with only intermittent bursts of activity. Here, since we focus on inference, we did not use a train/test split and did not report long-horizon trajectory rollouts. Both flows were parameterized by 4-level convolutional encoder-decoders (U-Net style) with channel widths \( \{32,64,128,256\} \) and a 256-dimensional bottleneck embedding. We fit the model with nested dropout \( p=1/50 \) and no history (\( h=0 \)), yielding an effective dimension \( K_{\mathrm{eff}}=30 \). Since the behavior is highly repetitive, we visualized only the first 1,438 frames (about \( 2\% \) of the video), which is sufficient to capture the latent structure.
The latent structure forms four prominent bands in 3D latent space (Figure 7A), separated primarily along \( (\mu_1,\mu_2) \) while sharing a common within-band axis \( \mu_3 \).
Within each band, lower values of \( \mu_3 \) correspond to stronger mouth movement, while higher \( \mu_3 \) is more quiescent (Figure 7A,B). Across bands, the mean appearance is broadly similar, but each band captures a slightly different baseline visual state, reflected by systematic mean shifts and variability patterns (Figure 7B). Moreover, outliers selected by maximal latent distance or maximal velocity magnitude (Figure 7C) correspond to transient paw and controller movements. Importantly, this latent structure was not well captured by comparison models (Figure 7D), and the same four-band structure was recovered across five random seeds (Figure 8). For additional details on latent structure identified by our model and comparisons against competing approaches, see our supplemental videos.
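Outlier frames of the kind shown in Figure 7C can be selected with a simple ranking. The sketch below is illustrative only: measuring latent distance from the mean latent position is an assumption (the reference point is not specified here), and the function name and toy arrays are hypothetical.

```python
import numpy as np

def top_k_outliers(mu, mu_dot, k=5):
    """Indices of the k frames with the largest latent distance and the largest latent speed.

    mu and mu_dot have shape (T, K): latent positions and latent velocities per frame.
    Distance is measured from the mean latent position (an assumption).
    """
    dist = np.linalg.norm(mu - mu.mean(axis=0), axis=1)
    speed = np.linalg.norm(mu_dot, axis=1)
    by_distance = np.argsort(dist)[::-1][:k]
    by_speed = np.argsort(speed)[::-1][:k]
    return by_distance, by_speed

# Toy data: a static latent trace with one displaced frame and one fast frame.
mu_toy = np.zeros((100, 3))
mu_toy[17] = [10.0, 0.0, 0.0]
mu_dot_toy = np.zeros((100, 3))
mu_dot_toy[42] = [0.0, 5.0, 0.0]
far_frames, fast_frames = top_k_outliers(mu_toy, mu_dot_toy, k=3)
```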
4.4 Audio Data
Lastly, we evaluated DCF on a birdsong dataset (262 training trials, 66 test trials), with each trial converted to a sequence of 26 spectrograms of size \( 64\times 64 \). We used the same convolutional architectures as in the video experiment, and the encoder feature map \( \boldsymbol{\mu}_{\boldsymbol{\psi}}(\mathbf{x}) \) was a convolutional VAE with the same multiscale schedule. We fit the model with nested dropout \( p=1/50 \) and no history (\( h=0 \)), and evaluated simulated rollouts.
On training trials, DCF rollouts preserved the syllable-level structure of the motif and reproduced the major time-frequency energy patterns across successive syllables (Figure 9).
In the learned 3D latent coordinates, simulated trajectories concentrate on a low-dimensional curved manifold, and transitions between syllables align with regions of larger latent velocity magnitude (Figure 9).
On held-out trials, the same latent geometry is retained and the decoded rollouts remain qualitatively consistent with the ground-truth spectrograms, indicating that the learned dynamics generalize beyond the training set (Figure 10). This is in contrast to typical VAE-based approaches (Goffinet et al., 2021; Sainburg et al., 2020), which preserve structure at the syllable level while failing to smoothly capture dynamics (Figure 11).