Monitoring & Profiling
AtmoTurbSim includes a built-in monitoring system that tracks CPU, memory, and GPU usage during a simulation and logs the results to TensorBoard.
SimulationMonitor
The SimulationMonitor class (gpu_core/monitoring/system_monitor.py) is automatically instantiated by GpuSimulator and collects metrics throughout the simulation.
Collected Metrics
| Metric | Source | TensorBoard Tag |
|---|---|---|
| CPU usage | psutil | CPU/Usage |
| Memory usage | psutil | Memory/Usage |
| GPU utilization | GPUtil | GPU/Utilization |
| GPU memory | GPUtil | GPU/Memory |
| GPU temperature | GPUtil | GPU/Temperature |
| Operation timing | Internal | Timing/<operation> |
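As an illustration of how per-operation timings could map onto the Timing/<operation> tags, here is a minimal sketch using only the standard library. The `OperationTimer` class and the `phase_screen` operation name are hypothetical and are not taken from the SimulationMonitor source; the actual implementation may differ.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class OperationTimer:
    """Hypothetical helper: accumulates wall-clock time per named operation."""

    def __init__(self):
        self.durations = defaultdict(list)  # operation name -> list of seconds

    @contextmanager
    def track(self, operation):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the timed block raises.
            self.durations[operation].append(time.perf_counter() - start)

    def tag(self, operation):
        # Mirrors the Timing/<operation> tag pattern from the metrics table.
        return f"Timing/{operation}"

    def summary(self):
        # Map each tag to (call count, total seconds).
        return {
            self.tag(name): (len(values), sum(values))
            for name, values in self.durations.items()
        }


timer = OperationTimer()
with timer.track("phase_screen"):  # illustrative operation name
    sum(i * i for i in range(10_000))

for tag, (calls, total) in timer.summary().items():
    print(f"{tag}: {calls} call(s), {total:.6f} s")
```

Note that GPU work is usually launched asynchronously, so a wall-clock timer like this one measures launch time unless the device is synchronized before the timer stops.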
TensorBoard
Enable Profiling
python gpu_main.py --profile
View Logs
tensorboard --logdir outputs/YYYY-MM-DD/HH/RUN_N/monitoring/tensorboard/
Then open http://localhost:6006 in your browser.
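Because the log directory depends on the run's date, hour, and run number, it can be handy to locate the newest one programmatically. The sketch below assumes the layout matches the pattern shown above (outputs/YYYY-MM-DD/HH/RUN_N/monitoring/tensorboard) and that the date and hour components are zero-padded so they sort chronologically. The function name `latest_tensorboard_dir` is hypothetical.

```python
import re
from pathlib import Path


def latest_tensorboard_dir(outputs_root="outputs"):
    """Return the newest run's TensorBoard directory, or None if none exists."""
    root = Path(outputs_root)
    if not root.is_dir():
        return None
    candidates = []
    for path in root.glob("*/*/RUN_*/monitoring/tensorboard"):
        run_dir = path.parent.parent
        hour_dir = run_dir.parent
        date_dir = hour_dir.parent
        match = re.fullmatch(r"RUN_(\d+)", run_dir.name)
        if match is None or not path.is_dir():
            continue
        # Order by date string, hour string, then numeric run index,
        # so that RUN_10 sorts after RUN_2.
        key = (date_dir.name, hour_dir.name, int(match.group(1)))
        candidates.append((key, path))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


log_dir = latest_tensorboard_dir()
if log_dir is None:
    print("No TensorBoard logs found under outputs/")
else:
    print(f"tensorboard --logdir {log_dir}")
```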
What You'll See
- Scalars: CPU/GPU usage over time, memory consumption, timing per operation
- Trace (with --profile): TensorFlow operation-level profiling showing kernel execution times and GPU utilization
Configuration
Monitoring behavior is controlled via config.json:
{
  "monitoring": {
    "enabled": { "value": true },
    "tensorboard": {
      "enabled": { "value": true },
      "log_interval": { "value": 10 }
    },
    "system_metrics": {
      "enabled": { "value": true },
      "log_interval": { "value": 5 }
    }
  }
}
| Parameter | Default | Description |
|---|---|---|
| enabled | true | Master switch for all monitoring |
| tensorboard.enabled | true | Enable TensorBoard scalar logging |
| tensorboard.log_interval | 10 | Log every N simulation steps |
| system_metrics.enabled | true | Enable CPU/GPU metric collection |
| system_metrics.log_interval | 5 | Sample metrics every N steps |
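Each setting in config.json is wrapped in a { "value": ... } object, so reading one means unwrapping that layer. The following sketch shows one way to do this, falling back to the defaults from the table when a key is absent. The helpers `monitoring_setting` and `should_log` are hypothetical, and both the fallback behavior and the "multiple of N" reading of log_interval are assumptions rather than confirmed AtmoTurbSim behavior.

```python
import json

# Defaults copied from the parameter table.
DEFAULTS = {
    "enabled": True,
    "tensorboard.enabled": True,
    "tensorboard.log_interval": 10,
    "system_metrics.enabled": True,
    "system_metrics.log_interval": 5,
}


def monitoring_setting(config, dotted_key):
    """Look up monitoring.<dotted_key>, unwrapping the {"value": ...} layer."""
    node = config.get("monitoring", {})
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return DEFAULTS[dotted_key]
        node = node[part]
    if isinstance(node, dict) and "value" in node:
        return node["value"]
    return DEFAULTS[dotted_key]


def should_log(step, interval):
    """Assumed semantics: log when the step index is a multiple of N."""
    return interval > 0 and step % interval == 0


config = json.loads(
    '{"monitoring": {"tensorboard": {"log_interval": {"value": 20}}}}'
)
interval = monitoring_setting(config, "tensorboard.log_interval")
logged_steps = [step for step in range(61) if should_log(step, interval)]
print(interval, logged_steps)
```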
Output Files
After a simulation completes, monitoring data is saved in the run's output directory under:
monitoring/
system_info.json # Hardware info snapshot
tensorboard/ # TensorBoard event files
metrics/
cpu_usage.npy # CPU usage time series
memory_usage.npy # RAM usage time series
gpu_utilization.npy # GPU utilization time series
gpu_memory.npy # GPU memory time series
gpu_temperature.npy # GPU temperature time series
timing_report.txt # Per-operation timing summary
summary_statistics.json # Aggregated metric statistics
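To check that a run produced the files listed above, a small script can search the monitoring directory for each expected name. This is a sketch only: the helper names are hypothetical, and it searches recursively so it makes no assumption about which subdirectory each file lives in.

```python
from pathlib import Path

# File names taken from the output listing.
EXPECTED_FILES = [
    "system_info.json",
    "cpu_usage.npy",
    "memory_usage.npy",
    "gpu_utilization.npy",
    "gpu_memory.npy",
    "gpu_temperature.npy",
    "timing_report.txt",
    "summary_statistics.json",
]


def locate_outputs(monitoring_dir):
    """Map each expected file name to its path, or None if it is missing."""
    root = Path(monitoring_dir)
    found = {}
    for name in EXPECTED_FILES:
        # rglob searches all subdirectories, so the exact nesting does not matter.
        matches = sorted(root.rglob(name)) if root.is_dir() else []
        found[name] = matches[0] if matches else None
    return found


def missing_outputs(monitoring_dir):
    """Return the expected file names that were not found."""
    located = locate_outputs(monitoring_dir)
    return [name for name, path in located.items() if path is None]


print(missing_outputs("monitoring"))
```

The .npy time series can then be read with `numpy.load`.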
system_info.json
Captured at simulation start:
{
  "cpu_count": 16,
  "cpu_freq_mhz": 3400,
  "ram_total_gb": 32.0,
  "gpu_name": "NVIDIA GeForce RTX 4080",
  "gpu_memory_total_mb": 13424,
  "gpu_temperature": 45,
  "gpu_uuid": "GPU-xxxx-xxxx"
}
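The snapshot is plain JSON, so it is easy to turn into a one-line hardware summary for run logs or reports. The sketch below uses the field names from the example above; the `describe_system` helper is hypothetical, and treating the temperature as degrees Celsius and dividing megabytes by 1024 are assumptions.

```python
import json


def describe_system(info):
    """Format a one-line hardware summary from a system_info.json dictionary."""
    gpu_mem_gb = info["gpu_memory_total_mb"] / 1024  # assumes 1024 MB per GB
    return (
        f"{info['cpu_count']} CPUs @ {info['cpu_freq_mhz']} MHz, "
        f"{info['ram_total_gb']:.1f} GB RAM, "
        f"{info['gpu_name']} ({gpu_mem_gb:.1f} GB, "
        f"{info['gpu_temperature']} C at start)"  # Celsius is an assumption
    )


sample = json.loads("""
{
  "cpu_count": 16,
  "cpu_freq_mhz": 3400,
  "ram_total_gb": 32.0,
  "gpu_name": "NVIDIA GeForce RTX 4080",
  "gpu_memory_total_mb": 13424,
  "gpu_temperature": 45,
  "gpu_uuid": "GPU-xxxx-xxxx"
}
""")
print(describe_system(sample))
```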
Disabling Monitoring
To disable all monitoring and slightly reduce runtime overhead, set the master switch to false:
{
  "monitoring": {
    "enabled": { "value": false }
  }
}
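When toggling monitoring across many runs, the same change can be scripted. This is a minimal sketch with a hypothetical helper, `set_monitoring_enabled`, which edits only the master switch and leaves every other key in the file untouched. It assumes config.json is standard JSON with the structure shown above.

```python
import json
import tempfile
from pathlib import Path


def set_monitoring_enabled(config_path, enabled):
    """Set monitoring.enabled.value in a config file and return the new config."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    # setdefault keeps any sibling keys that already exist in the file.
    switch = config.setdefault("monitoring", {}).setdefault("enabled", {})
    switch["value"] = bool(enabled)
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Demonstration on a throwaway copy rather than a real config.json.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "config.json"
    demo_path.write_text('{"monitoring": {"enabled": {"value": true}}}')
    updated = set_monitoring_enabled(demo_path, False)
    print(updated["monitoring"]["enabled"])
```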