Streaming Mode

Streaming mode processes simulation frames in chunks and writes them to disk incrementally, avoiding GPU out-of-memory errors for long simulations.

Why Streaming?

A single simulation at 2048x2048 resolution with 1000 time steps produces:

1000 steps x 2 channels x 2048 x 2048 x 4 bytes ≈ 33.5 GB (31.25 GiB)

This exceeds most GPU memory capacities. Streaming mode processes frames in configurable chunks (default: 100 steps), writes each chunk to disk, and reuses the GPU memory for the next chunk.
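The arithmetic is easy to verify (float32 frames and two channels, as in the example above):

```python
steps, channels, height, width = 1000, 2, 2048, 2048
bytes_per_value = 4  # float32

total_bytes = steps * channels * height * width * bytes_per_value
print(f"{total_bytes / 1e9:.2f} GB")  # 33.55 GB for the whole run

# With streaming, only one chunk's frames occupy GPU memory at a time:
chunk_bytes = 100 * channels * height * width * bytes_per_value
print(f"{chunk_bytes / 1e9:.2f} GB per chunk")  # 3.36 GB
```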

Usage

python gpu_main.py --runner chunked --chunk-size 100

How It Works

  1. The simulator runs simulate_turb_chunk() (a @tf.function with input_signature) for chunk_size steps
  2. The step count is passed as tf.constant(steps, dtype=tf.int32) so the graph is traced only once and reused across chunks of any size
  3. The resulting tensor is transferred to CPU memory
  4. A background thread writes the chunk to disk as frames_chunk_NNN.npy
  5. Double-buffering overlaps the next GPU computation with the current disk write
  6. A frames_chunk_manifest.json is written with the full index
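A minimal sketch of this loop, with simulate_chunk standing in for simulate_turb_chunk() (the real function is a traced tf.function running on the GPU; NumPy is used here so the sketch is self-contained, and the writer thread is omitted):

```python
import numpy as np

def simulate_chunk(state, steps):
    # Stand-in for simulate_turb_chunk(). In the real runner this is a
    # @tf.function with an input_signature, called with
    # tf.constant(steps, dtype=tf.int32) so it is traced exactly once.
    frames = np.stack([state + (i + 1) for i in range(steps)])
    return frames, frames[-1]  # frames now sit in CPU memory (step 3)

def run_chunked(total_steps, chunk_size):
    state = np.zeros((2, 4, 4), dtype=np.float32)  # toy (channels, H, W) state
    chunks = []
    done = 0
    while done < total_steps:
        steps = min(chunk_size, total_steps - done)  # last chunk may be short
        frames, state = simulate_chunk(state, steps)
        chunks.append(frames)  # the real runner hands this to a writer thread
        done += steps
    return chunks
```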

Double-Buffered I/O

Streaming mode uses a ThreadPoolExecutor with max_in_flight=1 to overlap GPU compute with disk I/O:

GPU:  [chunk 0 compute] [chunk 1 compute] [chunk 2 compute] ...
Disk:                   [chunk 0 write]   [chunk 1 write]   ...

This hides most of the write cost: as long as each chunk's write finishes before the next chunk's compute does, disk I/O adds near-zero time to the run.
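The pattern can be sketched with a one-worker ThreadPoolExecutor; write_chunks and its signature are illustrative, not the runner's actual API:

```python
import os
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def write_chunks(chunks, out_dir):
    """Write chunks with at most one disk write in flight (max_in_flight=1).

    Waiting on the *previous* future before submitting the next write means
    the caller (the GPU loop) only blocks when disk I/O falls a full chunk
    behind compute.
    """
    writer = ThreadPoolExecutor(max_workers=1)
    pending = None
    paths = []
    for idx, frames in enumerate(chunks):
        if pending is not None:
            pending.result()  # wait for the previous chunk's write
        path = os.path.join(out_dir, f"frames_chunk_{idx:03d}.npy")
        pending = writer.submit(np.save, path, frames)
        paths.append(path)
    if pending is not None:
        pending.result()      # drain the final write
    writer.shutdown()
    return paths
```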

Output

Chunked runs produce:

RUN_N/
frames_chunk_000.npy
frames_chunk_001.npy
...
frames_chunk_manifest.json

See Output Format for details on loading chunks.
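As a rough sketch of reassembling a chunked run (the manifest field name "chunks" is an assumption here; check frames_chunk_manifest.json and the Output Format page for the actual schema):

```python
import json

import numpy as np

def load_run(run_dir):
    # Reassemble a chunked run into one (steps, channels, H, W) array.
    # Assumes the manifest lists chunk filenames under a "chunks" key.
    with open(f"{run_dir}/frames_chunk_manifest.json") as f:
        manifest = json.load(f)
    arrays = [np.load(f"{run_dir}/{name}") for name in manifest["chunks"]]
    return np.concatenate(arrays, axis=0)  # stitch along the time axis
```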

Runner Comparison

Runner   | Memory Usage              | Speed           | Output                 | Best For
full     | High (all frames on GPU)  | Fastest         | Single frames.npy      | Short simulations
chunked  | Low (one chunk at a time) | Slightly slower | Chunk files + manifest | Long simulations
no_save  | Minimal (no storage)      | Fastest         | Metrics only           | Throughput testing

Choosing Chunk Size

  • Smaller chunks (10-50): Lower peak memory, more I/O operations, slightly more overhead
  • Larger chunks (100-500): Higher peak memory, fewer I/O operations, more efficient
  • Default (100): Good balance for most GPUs with 8+ GB VRAM

# Conservative memory usage
python gpu_main.py --runner chunked --chunk-size 25

# Higher throughput
python gpu_main.py --runner chunked --chunk-size 200
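One way to pick a chunk size for a given memory budget (max_chunk_size is a hypothetical helper, not part of gpu_main.py; the 50% safety factor leaves headroom for the simulator's working state):

```python
def max_chunk_size(vram_bytes, height, width, channels=2,
                   bytes_per_value=4, safety=0.5):
    # Hypothetical helper for illustration: the largest chunk whose frame
    # buffer fits in `safety` x available VRAM, reserving the rest for the
    # simulator's own tensors.
    per_step = channels * height * width * bytes_per_value
    return int(vram_bytes * safety) // per_step

# An 8 GB card at 2048x2048 with the defaults:
# max_chunk_size(8 * 1024**3, 2048, 2048) -> 128
```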