📝 Author
Birat Aryal — birataryal.github.io
Created Date: 2026-03-28
Updated Date: Saturday 28th March 2026 15:35:14
Website - birataryal.com.np
Repository - Birat Aryal
LinkedIn - Birat Aryal
DevSecOps Engineer | System Engineer | Cyber Security Analyst | Network Engineer
System Level Troubleshooting
Q1. You see a production server with 100% CPU. Walk through your diagnostic approach.
Answer: Start broad, narrow down, then act. Never guess and restart — always understand the root cause first.
```bash
# Step 1: Immediate overview
top -b -n1 | head -20                  # snapshot: which PID is consuming CPU?
uptime                                 # load average trend (1m, 5m, 15m)

# Step 2: Identify the offending process
top -b -n1 -o %CPU | head -15
pidstat -u 2 5                         # per-process CPU, 5 samples at 2s intervals

# Step 3: What is that process doing?
timeout 10 strace -c -p <PID> 2>&1 | head -20   # syscall summary over 10s (strace slows the target; keep it short)
perf top -p <PID>                      # hot functions, user and kernel (needs perf)
ls /proc/<PID>/fd | wc -l              # file descriptor count

# Step 4: Is it CPU-bound or I/O wait?
vmstat 1 5                             # ignore the first row: it is the average since boot
# If wa (iowait) > 20%: storage problem, not pure CPU
# If us (user) near 100%: application code
# If sy (system/kernel) elevated: kernel issue, syscall storm

# Step 5: Memory pressure contributing?
free -h
grep -E 'MemFree|MemAvailable|^Cached|SwapTotal|SwapFree|Dirty' /proc/meminfo

# Step 6: If it is a JVM process, get a thread dump
# (.NET needs its own tooling, e.g. dotnet-stack or dotnet-dump; jstack does not apply)
jstack <PID> > /tmp/threaddump.txt
kill -3 <PID>                          # JVM only: SIGQUIT prints a thread dump to the JVM's stdout; it kills most other processes

# Step 7: Network causing CPU load?
ss -s                                  # socket summary
cat /proc/net/softnet_stat             # per-CPU softirq counters (hex): processed, dropped, time_squeeze
```
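The counters in `/proc/net/softnet_stat` from Step 7 are hexadecimal, one row per CPU, so they are hard to read raw. Below is a minimal sketch that decodes the first three columns: packets processed, packets dropped because the backlog queue was full, and `time_squeeze` (how often the softirq budget ran out with work still pending). The function name `decode_softnet` is hypothetical, and the row index is only assumed to line up with the CPU number.

```bash
# decode_softnet: read softnet_stat-style rows on stdin and print the
# first three hex columns as decimal. Hypothetical helper, not a standard tool.
decode_softnet() (
    row=0
    while read -r processed dropped squeezed rest; do
        # skip blank lines
        [ -z "$processed" ] && continue
        # printf accepts C-style hex constants, so prefix each value with 0x
        printf 'row%d processed=%d dropped=%d time_squeeze=%d\n' \
            "$row" "0x$processed" "0x$dropped" "0x$squeezed"
        row=$((row + 1))
    done
)

# Live usage (run on the affected host):
#   decode_softnet < /proc/net/softnet_stat
```

A `dropped` or `time_squeeze` value that keeps growing between two runs suggests that packet processing is part of the CPU load.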
VMStat Command Details
Mnemonic:
“Run Block - Swap Free BuffCache - SwapIn SwapOut - BlockIn BlockOut - Interrupt Context - User System Idle Wait Steal”
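Procs
| Field | Meaning | Mnemonic |
|---|---|---|
| `r` | run queue: processes running or waiting for a CPU | Run |
| `b` | blocked: processes in uninterruptible sleep (usually waiting on I/O) | Block |
Signals
r > CPU cores → high CPU contention
b > 0 → I/O bottleneck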
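A single `vmstat` row can mislead, so the two signals above are easier to trust when averaged over several samples. Below is a rough sketch, assuming the standard `vmstat <interval> <count>` layout (two header lines, then a first data row that holds averages since boot). The function name `procs_summary` and the output labels are hypothetical.

```bash
# procs_summary: average the r and b columns of vmstat output read on stdin.
# $1 = number of CPU cores to compare the run queue against.
procs_summary() (
    awk -v cores="$1" '
        NR <= 3 { next }               # 2 header lines + the since-boot row
        { r += $1; b += $2; n++ }
        END {
            if (n == 0) { print "no samples"; exit 1 }
            printf "avg_r=%.1f avg_b=%.1f cores=%d", r / n, b / n, cores
            if (r / n > cores) printf " cpu_contention"
            if (b / n > 0)     printf " io_blocked"
            printf "\n"
        }'
)

# Live usage (run on the affected host):
#   vmstat 1 6 | procs_summary "$(nproc)"
```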
Disk
| Field | Meaning | Mnemonic |
|---|---|---|
| `bi` | blocks in: read from a block device, per second | Block In |
| `bo` | blocks out: written to a block device, per second | Block Out |
Signals
High bi/bo + wa → disk bottleneck
Sudden spikes → burst workload or flush
CPU
| Field | Meaning | Mnemonic |
|---|---|---|
| `us` | user CPU time | User |
| `sy` | kernel (system) CPU time | System |
| `id` | idle time | Idle |
| `wa` | time waiting for I/O | Wait |
| `st` | time stolen by the hypervisor (VMs only) | Steal |
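To see at a glance where non-idle CPU time is going, the CPU columns can be reduced to the largest of `us`, `sy`, `wa`, and `st`. This is a small sketch under the assumption that the columns sit in the usual `vmstat` order (`us` in column 13 through `st` in column 17). The function name `dominant_cpu_state` is hypothetical.

```bash
# dominant_cpu_state: for each vmstat data row on stdin, print which of
# us / sy / wa / st is largest, plus the idle percentage.
dominant_cpu_state() (
    awk '
        $1 !~ /^[0-9]+$/ { next }      # skip header lines
        {
            name = "us"; max = $13 + 0
            if ($14 + 0 > max) { name = "sy"; max = $14 + 0 }
            if ($16 + 0 > max) { name = "wa"; max = $16 + 0 }
            if ($17 + 0 > max) { name = "st"; max = $17 + 0 }
            printf "%s=%d%% idle=%d%%\n", name, max, $15
        }'
)

# Live usage (run on the affected host; tail drops the headers and the since-boot row):
#   vmstat 1 5 | tail -n +4 | dominant_cpu_state
```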
🟥 CPU Bottleneck
👉 “Run high, Idle low”
r ↑, id ↓
🟥 Memory Bottleneck
👉 “Swap is death”
si/so > 0 (sustained, not a one-off)
🟥 Disk Bottleneck
👉 “Blocked + Waiting”
b > 0, wa ↑
🟥 Thread Contention
👉 “Context explosion”
cs ↑↑
🟥 Virtualization Issue
👉 “Steal means host stealing CPU”
st > 0
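The five patterns above can be turned into a rough per-row classifier. This is only a sketch: the cutoffs (`id` below 10, `wa` above 20, and the default context-switch limit of 100000 per second) are assumptions to tune per host, and the function name `classify_vmstat` is hypothetical.

```bash
# classify_vmstat: label each vmstat data row read on stdin.
# $1 = CPU core count, $2 = context-switch limit per second (assumed default).
classify_vmstat() (
    awk -v cores="$1" -v cs_limit="${2:-100000}" '
        $1 !~ /^[0-9]+$/ { next }      # skip header lines
        {
            r = $1; b = $2; si = $7; so = $8; cs = $12
            id = $15; wa = $16; st = $17
            out = ""
            if (r > cores && id < 10) out = out " cpu"         # run high, idle low
            if (si > 0 || so > 0)     out = out " memory"      # swapping
            if (b > 0 && wa > 20)     out = out " disk"        # blocked + waiting
            if (cs > cs_limit)        out = out " contention"  # context explosion
            if (st > 0)               out = out " steal"       # hypervisor taking CPU
            if (out == "")            out = " ok"
            print substr(out, 2)
        }'
)

# Live usage (run on the affected host; tail drops the headers and the since-boot row):
#   vmstat 1 5 | tail -n +4 | classify_vmstat "$(nproc)"
```

A label that shows up on most rows is worth chasing in the next step; a label on a single row is usually noise.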
🔥 Real-World Workflow (What seniors actually do)
Step 1 — Detect
vmstat 1
👉 “Something is wrong: memory/disk/cpu”
Step 2 — Narrow Down
If disk suspected:
iostat -x 1
If CPU:
top
If memory:
free -m
Step 3 — Root Cause
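- Identify the process (PID, owner, command line)
- Identify the pattern (constant, periodic, or bursty)
- Identify the spike source (for example a deploy, a cron job, or a traffic surge)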
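For the first item, one possible way to list candidate processes without an interactive tool is to sort `ps` output by its CPU column. Note that `ps` reports CPU averaged over each process's lifetime, so this complements `top` and `pidstat` rather than replacing them. The function name `top_cpu` is hypothetical.

```bash
# top_cpu: read "ps -eo pid,pcpu,comm" output on stdin and keep the
# N rows with the highest %CPU. $1 = N (default 5).
top_cpu() (
    n=${1:-5}
    # drop the header line, then sort numerically on column 2, highest first
    tail -n +2 | sort -rn -k2,2 | head -n "$n"
)

# Live usage (run on the affected host):
#   ps -eo pid,pcpu,comm | top_cpu 5
```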