Streaming Mode
Streaming mode processes simulation frames in chunks and writes them to disk incrementally, avoiding GPU out-of-memory errors for long simulations.
Why Streaming?
A single simulation at 2048x2048 resolution with 1000 time steps produces:
1000 steps x 2 channels x 2048 x 2048 x 4 bytes = ~32 GB
This exceeds most GPU memory capacities. Streaming mode processes frames in configurable chunks (default: 100 steps), writes each chunk to disk, and reuses the GPU memory for the next chunk.
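The arithmetic above can be checked directly. A minimal sketch, assuming float32 values and the default 100-step chunk:

```python
# Check the arithmetic above: full-run buffer vs. one 100-step chunk,
# assuming float32 frames with two channels per step.
steps, channels, height, width = 1000, 2, 2048, 2048
bytes_per_value = 4  # float32

total_bytes = steps * channels * height * width * bytes_per_value
chunk_bytes = 100 * channels * height * width * bytes_per_value

print(f"full run:  {total_bytes / 2**30:.2f} GiB")   # 31.25 GiB
print(f"one chunk: {chunk_bytes / 2**30:.2f} GiB")   # 3.12 GiB
```

Streaming therefore caps the resident frame buffer at roughly 3 GiB instead of the full ~32 GB.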
Usage
```shell
python gpu_main.py --runner chunked --chunk-size 100
```
How It Works
- The simulator runs `simulate_turb_chunk()` (a `@tf.function` with an `input_signature`) for `chunk_size` steps
- The step count is passed as `tf.constant(steps, dtype=tf.int32)`, so the graph is traced only once and reused across chunks of any size
- The resulting tensor is transferred to CPU memory
- A background thread writes the chunk to disk as `frames_chunk_NNN.npy`
- Double-buffering overlaps the next GPU computation with the current disk write
- A `frames_chunk_manifest.json` file is written with the full index
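The final manifest step can be sketched as follows. This is illustrative only: the field names (`chunk_size`, `total_steps`, `chunks`) are assumptions, not the actual schema written by `gpu_main.py`.

```python
import json
from pathlib import Path

def write_manifest(run_dir: Path, chunk_names, chunk_size, total_steps) -> Path:
    """Write an index of all chunk files (illustrative schema)."""
    manifest = {
        "chunk_size": chunk_size,
        "total_steps": total_steps,
        "chunks": chunk_names,  # e.g. ["frames_chunk_000.npy", ...]
    }
    path = run_dir / "frames_chunk_manifest.json"
    path.write_text(json.dumps(manifest, indent=2))
    return path
```

A loader can then iterate `manifest["chunks"]` in order to reassemble the full run.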
Double-Buffered I/O
Streaming mode uses a `ThreadPoolExecutor` with `max_in_flight=1` to overlap GPU compute with disk I/O:
```
GPU:  [chunk 0 compute] [chunk 1 compute] [chunk 2 compute] ...
Disk:                   [chunk 0 write]   [chunk 1 write]   ...
```
Because each write runs concurrently with the next chunk's compute, disk I/O adds almost no wall-clock overhead in most configurations.
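The overlap can be modeled with a toy, stdlib-only version of the loop. `run_chunks` and the `.bin` suffix are inventions for illustration (the real runner writes `.npy` chunks from GPU tensors):

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def run_chunks(n_chunks: int, out_dir: Path) -> list[Path]:
    """Toy double-buffered loop: compute chunk i while chunk i-1 is written."""
    paths, pending = [], None
    with ThreadPoolExecutor(max_workers=1) as pool:  # single writer thread
        for i in range(n_chunks):
            data = bytes([i % 256]) * 1024        # stand-in for GPU compute
            if pending is not None:
                pending.result()                  # at most one write in flight
            path = out_dir / f"frames_chunk_{i:03d}.bin"
            pending = pool.submit(path.write_bytes, data)
            paths.append(path)
        if pending is not None:
            pending.result()                      # drain the final write
    return paths
```

The `pending.result()` call before submitting the next write is what caps the queue at one in-flight chunk, bounding host memory to roughly two chunks at a time.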
Output
Chunked runs produce:
```
RUN_N/
  frames_chunk_000.npy
  frames_chunk_001.npy
  ...
  frames_chunk_manifest.json
```
See Output Format for details on loading chunks.
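A hedged sketch of reassembling a run from its chunks: the tiny shapes and the `"chunks"` manifest key below are assumptions for illustration, not the documented schema.

```python
import json, tempfile
from pathlib import Path
import numpy as np

# Build a tiny fake run directory (shapes are illustrative, not the real ones).
run_dir = Path(tempfile.mkdtemp())
names = []
for i in range(2):
    chunk = np.full((5, 2, 4, 4), float(i), dtype=np.float32)  # (steps, ch, H, W)
    np.save(run_dir / f"frames_chunk_{i:03d}.npy", chunk)
    names.append(f"frames_chunk_{i:03d}.npy")
(run_dir / "frames_chunk_manifest.json").write_text(json.dumps({"chunks": names}))

# Load: read the manifest, then concatenate chunks along the time axis.
manifest = json.loads((run_dir / "frames_chunk_manifest.json").read_text())
frames = np.concatenate([np.load(run_dir / n) for n in manifest["chunks"]], axis=0)
print(frames.shape)  # (10, 2, 4, 4)
```

Iterating the manifest (rather than globbing the directory) keeps the chunk order authoritative.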
Runner Comparison
| Runner | Memory Usage | Speed | Output | Best For |
|---|---|---|---|---|
| `full` | High (all frames on GPU) | Fastest | Single `frames.npy` | Short simulations |
| `chunked` | Low (one chunk at a time) | Slightly slower | Chunk files + manifest | Long simulations |
| `no_save` | Minimal (no frame storage) | Fastest | Metrics only | Throughput testing |
Choosing Chunk Size
- Smaller chunks (10-50): Lower peak memory, more I/O operations, slightly more overhead
- Larger chunks (100-500): Higher peak memory, fewer I/O operations, more efficient
- Default (100): Good balance for most GPUs with 8+ GB VRAM
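To pick a size for a given GPU, the peak per-chunk frame memory at the example resolution can be estimated; the helper below is an illustrative sketch, assuming float32 and two channels:

```python
def chunk_gib(chunk_size: int, channels: int = 2, h: int = 2048,
              w: int = 2048, bytes_per_value: int = 4) -> float:
    """Peak frame-buffer memory for one chunk, in GiB (float32 assumed)."""
    return chunk_size * channels * h * w * bytes_per_value / 2**30

for size in (25, 100, 200):
    print(f"chunk_size={size:3d}: {chunk_gib(size):.2f} GiB")
```

With double-buffering, budget roughly twice this on the host side (one chunk being written, one being filled).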
```shell
# Conservative memory usage
python gpu_main.py --runner chunked --chunk-size 25

# Higher throughput
python gpu_main.py --runner chunked --chunk-size 200
```