Monitoring & Profiling

AtmoTurbSim includes a built-in monitoring system that tracks system resources during simulation and integrates with TensorBoard.

SimulationMonitor

The SimulationMonitor class (gpu_core/monitoring/system_monitor.py) is automatically instantiated by GpuSimulator and collects metrics throughout the simulation.

Collected Metrics

| Metric | Source | TensorBoard Tag |
| --- | --- | --- |
| CPU usage | psutil | CPU/Usage |
| Memory usage | psutil | Memory/Usage |
| GPU utilization | GPUtil | GPU/Utilization |
| GPU memory | GPUtil | GPU/Memory |
| GPU temperature | GPUtil | GPU/Temperature |
| Operation timing | Internal | Timing/<operation> |
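
The Timing/<operation> tags suggest that each named operation is timed separately. The following is a minimal sketch of how such per-operation timing could be collected with the standard library alone. OperationTimer, its methods, and the operation name "phase_screen" are hypothetical names chosen for illustration; this is not the SimulationMonitor API.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class OperationTimer:
    """Hypothetical helper: accumulates wall-clock durations per operation name."""

    def __init__(self):
        self.durations = defaultdict(list)  # operation name -> list of seconds

    @contextmanager
    def measure(self, operation):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the timed block raises.
            self.durations[operation].append(time.perf_counter() - start)

    def summary(self):
        """Return {tag: (call count, total seconds)} keyed by TensorBoard-style tag."""
        return {
            f"Timing/{name}": (len(values), sum(values))
            for name, values in self.durations.items()
        }


timer = OperationTimer()
for _ in range(3):
    with timer.measure("phase_screen"):  # made-up operation name
        sum(i * i for i in range(10_000))  # stand-in for real work

report = timer.summary()
```

Here report holds one entry, keyed "Timing/phase_screen", with a call count of 3 and the accumulated duration in seconds.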

TensorBoard

Enable Profiling

python gpu_main.py --profile

View Logs

tensorboard --logdir outputs/YYYY-MM-DD/HH/RUN_N/monitoring/tensorboard/

Then open http://localhost:6006 in your browser.
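
Because the log directory depends on the date, hour, and run number, it can be convenient to locate the most recent run programmatically. The sketch below assumes the layout shown above, with zero-padded date and hour directories and an integer N in RUN_N. latest_tensorboard_dir is a hypothetical helper and is not part of AtmoTurbSim.

```python
from pathlib import Path


def latest_tensorboard_dir(outputs_root="outputs"):
    """Return the tensorboard/ directory of the most recent run, or None.

    Assumes the layout outputs/YYYY-MM-DD/HH/RUN_N/monitoring/tensorboard/,
    with zero-padded date and hour directories and N an integer.
    """
    candidates = []
    for tb_dir in Path(outputs_root).glob("*/*/RUN_*/monitoring/tensorboard"):
        run_dir = tb_dir.parent.parent
        suffix = run_dir.name[len("RUN_"):]
        if not suffix.isdigit():
            continue  # skip directories that do not match RUN_<integer>
        hour_dir = run_dir.parent
        date_dir = hour_dir.parent
        # Zero-padded date and hour strings sort correctly as text;
        # the run number is compared as an integer so RUN_10 beats RUN_9.
        candidates.append(((date_dir.name, hour_dir.name, int(suffix)), tb_dir))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]
```

The returned path can then be passed to tensorboard --logdir.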

What You'll See

  • Scalars: CPU/GPU usage over time, memory consumption, timing per operation
  • Trace (with --profile): TensorFlow operation-level profiling showing kernel execution times and GPU utilization

Configuration

Monitoring behavior is controlled via config.json:

{
  "monitoring": {
    "enabled": { "value": true },
    "tensorboard": {
      "enabled": { "value": true },
      "log_interval": { "value": 10 }
    },
    "system_metrics": {
      "enabled": { "value": true },
      "log_interval": { "value": 5 }
    }
  }
}

| Parameter | Default | Description |
| --- | --- | --- |
| enabled | true | Master switch for all monitoring |
| tensorboard.enabled | true | Enable TensorBoard scalar logging |
| tensorboard.log_interval | 10 | Log every N simulation steps |
| system_metrics.enabled | true | Enable CPU/GPU metric collection |
| system_metrics.log_interval | 5 | Sample metrics every N steps |
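
Each setting in config.json is wrapped in a {"value": ...} object, so reading a parameter means unwrapping it. The sketch below shows one way to do that and to apply a log_interval. monitoring_setting and should_log are hypothetical helpers, and two behaviors are assumptions rather than documented facts: falling back to the defaults listed above when a key is missing, and counting steps from 0.

```python
import json

# Defaults taken from the parameter table above.
DEFAULTS = {
    "enabled": True,
    "tensorboard.enabled": True,
    "tensorboard.log_interval": 10,
    "system_metrics.enabled": True,
    "system_metrics.log_interval": 5,
}


def monitoring_setting(config, dotted_key):
    """Look up a monitoring parameter, unwrapping the {"value": ...} wrapper."""
    node = config.get("monitoring", {})
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return DEFAULTS[dotted_key]  # assumed fallback when the key is absent
        node = node[part]
    if isinstance(node, dict) and "value" in node:
        return node["value"]
    return DEFAULTS[dotted_key]


def should_log(step, interval):
    """True on every `interval`-th step (assumes steps are counted from 0)."""
    return interval > 0 and step % interval == 0


config = json.loads("""
{
  "monitoring": {
    "enabled": { "value": true },
    "tensorboard": { "log_interval": { "value": 20 } }
  }
}
""")

tb_interval = monitoring_setting(config, "tensorboard.log_interval")      # 20, from the config
sys_interval = monitoring_setting(config, "system_metrics.log_interval")  # 5, from DEFAULTS
logged_steps = [s for s in range(50) if should_log(s, tb_interval)]       # [0, 20, 40]
```

With these intervals, system metrics would be sampled four times for every TensorBoard write.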

Output Files

After simulation, monitoring data is saved to:

monitoring/
  system_info.json          # Hardware info snapshot
  tensorboard/              # TensorBoard event files
  metrics/
    cpu_usage.npy           # CPU usage time series
    memory_usage.npy        # RAM usage time series
    gpu_utilization.npy     # GPU utilization time series
    gpu_memory.npy          # GPU memory time series
    gpu_temperature.npy     # GPU temperature time series
  timing_report.txt         # Per-operation timing summary
  summary_statistics.json   # Aggregated metric statistics

system_info.json

Captured at simulation start:

{
  "cpu_count": 16,
  "cpu_freq_mhz": 3400,
  "ram_total_gb": 32.0,
  "gpu_name": "NVIDIA GeForce RTX 4080",
  "gpu_memory_total_mb": 13424,
  "gpu_temperature": 45,
  "gpu_uuid": "GPU-xxxx-xxxx"
}
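
The snapshot is plain JSON, so it can be read back with the standard library, for example to label plots or compare runs across machines. The sketch below assumes the field names shown in the sample above. describe_system is a hypothetical helper, not part of AtmoTurbSim.

```python
import json
from pathlib import Path


def describe_system(info_path):
    """Format a one-line hardware summary from a system_info.json file."""
    info = json.loads(Path(info_path).read_text(encoding="utf-8"))
    gpu_gb = info["gpu_memory_total_mb"] / 1024  # MB -> GB (1024 MB per GB)
    return (
        f'{info["cpu_count"]} CPUs @ {info["cpu_freq_mhz"]} MHz, '
        f'{info["ram_total_gb"]:.1f} GB RAM, '
        f'{info["gpu_name"]} ({gpu_gb:.1f} GB)'
    )
```

Called on the sample file above, this returns "16 CPUs @ 3400 MHz, 32.0 GB RAM, NVIDIA GeForce RTX 4080 (13.1 GB)".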

Disabling Monitoring

To disable all monitoring (slightly reduces overhead):

{
  "monitoring": {
    "enabled": { "value": false }
  }
}