Streaming Mode

Streaming mode processes simulation frames in chunks and writes them to disk incrementally, avoiding GPU out-of-memory errors for long simulations.

Why Streaming?

A single simulation at 2048x2048 resolution with 1000 time steps produces:

1000 steps x 2 channels x 2048 x 2048 x 4 bytes ≈ 33.5 GB (31.25 GiB)

This exceeds most GPU memory capacities. Streaming mode processes frames in configurable chunks (default: 100 steps), writes each chunk to disk, and reuses the GPU memory for the next chunk.
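The arithmetic is easy to verify (float32 frames and two channels, as in the example above):

```python
steps, channels, height, width = 1000, 2, 2048, 2048
bytes_per_value = 4  # float32

total_bytes = steps * channels * height * width * bytes_per_value
print(f"{total_bytes / 1e9:.2f} GB")  # 33.55 GB for the whole run

# With streaming, only one chunk's frames occupy GPU memory at a time:
chunk_bytes = 100 * channels * height * width * bytes_per_value
print(f"{chunk_bytes / 1e9:.2f} GB per chunk")  # 3.36 GB
```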

Usage

python gpu_main.py --runner chunked --chunk-size 100

How It Works

  1. The simulator runs simulate_turb_chunk() (a @tf.function with input_signature) for chunk_size steps
  2. The step count is passed as tf.constant(steps, dtype=tf.int32) so the graph is traced only once and reused across chunks of any size
  3. The resulting tensor is transferred to CPU memory
  4. A background thread writes the chunk to disk as frames_chunk_NNN.npy
  5. Double-buffering overlaps the next GPU computation with the current disk write
  6. A frames_chunk_manifest.json is written with the full index
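A minimal sketch of this loop, with simulate_chunk standing in for simulate_turb_chunk() (the real function is a traced tf.function running on the GPU; NumPy is used here so the sketch is self-contained, and the writer thread is omitted):

```python
import numpy as np

def simulate_chunk(state, steps):
    # Stand-in for simulate_turb_chunk(). In the real runner this is a
    # @tf.function with an input_signature, called with
    # tf.constant(steps, dtype=tf.int32) so it is traced exactly once.
    frames = np.stack([state + (i + 1) for i in range(steps)])
    return frames, frames[-1]  # frames now sit in CPU memory (step 3)

def run_chunked(total_steps, chunk_size):
    state = np.zeros((2, 4, 4), dtype=np.float32)  # toy (channels, H, W) state
    chunks = []
    done = 0
    while done < total_steps:
        steps = min(chunk_size, total_steps - done)  # last chunk may be short
        frames, state = simulate_chunk(state, steps)
        chunks.append(frames)  # the real runner hands this to a writer thread
        done += steps
    return chunks
```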

Double-Buffered I/O

Streaming mode uses a ThreadPoolExecutor with max_in_flight=1 to overlap GPU compute with disk I/O:

GPU:  [chunk 0 compute] [chunk 1 compute] [chunk 2 compute] ...
Disk:                   [chunk 0 write]   [chunk 1 write]   ...

This hides most of the write cost: as long as each chunk's write finishes before the next chunk's compute does, disk I/O adds near-zero time to the run.
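The pattern can be sketched with a one-worker ThreadPoolExecutor; write_chunks and its signature are illustrative, not the runner's actual API:

```python
import os
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def write_chunks(chunks, out_dir):
    """Write chunks with at most one disk write in flight (max_in_flight=1).

    Waiting on the *previous* future before submitting the next write means
    the caller (the GPU loop) only blocks when disk I/O falls a full chunk
    behind compute.
    """
    writer = ThreadPoolExecutor(max_workers=1)
    pending = None
    paths = []
    for idx, frames in enumerate(chunks):
        if pending is not None:
            pending.result()  # wait for the previous chunk's write
        path = os.path.join(out_dir, f"frames_chunk_{idx:03d}.npy")
        pending = writer.submit(np.save, path, frames)
        paths.append(path)
    if pending is not None:
        pending.result()      # drain the final write
    writer.shutdown()
    return paths
```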

Output

Chunked runs produce:

RUN_N/
frames_chunk_000.npy
frames_chunk_001.npy
...
frames_chunk_manifest.json

See Output Format for details on loading chunks.
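As a rough sketch of reassembling a chunked run (the manifest field name "chunks" is an assumption here; check frames_chunk_manifest.json and the Output Format page for the actual schema):

```python
import json

import numpy as np

def load_run(run_dir):
    # Reassemble a chunked run into one (steps, channels, H, W) array.
    # Assumes the manifest lists chunk filenames under a "chunks" key.
    with open(f"{run_dir}/frames_chunk_manifest.json") as f:
        manifest = json.load(f)
    arrays = [np.load(f"{run_dir}/{name}") for name in manifest["chunks"]]
    return np.concatenate(arrays, axis=0)  # stitch along the time axis
```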

Runner Comparison

Runner   | Memory Usage              | Speed           | Output                 | Best For
full     | High (all frames on GPU)  | Fastest         | Single frames.npy      | Short simulations
chunked  | Low (one chunk at a time) | Slightly slower | Chunk files + manifest | Long simulations
no_save  | Minimal (no storage)      | Fastest         | Metrics only           | Throughput testing

Choosing Chunk Size

  • Smaller chunks (10-50): Lower peak memory, more I/O operations, slightly more overhead
  • Larger chunks (100-500): Higher peak memory, fewer I/O operations, more efficient
  • Default (100): Good balance for most GPUs with 8+ GB VRAM

# Conservative memory usage
python gpu_main.py --runner chunked --chunk-size 25

# Higher throughput
python gpu_main.py --runner chunked --chunk-size 200
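One way to pick a chunk size for a given memory budget (max_chunk_size is a hypothetical helper, not part of gpu_main.py; the 50% safety factor leaves headroom for the simulator's working state):

```python
def max_chunk_size(vram_bytes, height, width, channels=2,
                   bytes_per_value=4, safety=0.5):
    # Hypothetical helper for illustration: the largest chunk whose frame
    # buffer fits in `safety` x available VRAM, reserving the rest for the
    # simulator's own tensors.
    per_step = channels * height * width * bytes_per_value
    return int(vram_bytes * safety) // per_step

# An 8 GB card at 2048x2048 with the defaults:
# max_chunk_size(8 * 1024**3, 2048, 2048) -> 128
```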