1) (Almost) Universal Troubleshooting Method
Troubleshooting Checklist
- Scope + safety: impact, severity, recent change.
- Define symptom: down vs slow; exact error.
- Reproduce: local + internal + external vantage points.
- Quick triage: CPU/mem/disk/network.
- Dependencies: DNS/AD/DB/storage/certs/other services.
- Service/process: is it running & listening on port?
- Logs: journal/Event Viewer/app logs.
- Security/config: firewall, SELinux, permissions, certs.
- Fix low risk first: restart/rollback/canary patch.
- Verify + monitor + document: multi-vantage validation.
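Parts of the checklist above can be scripted so each step reports a clear pass or fail. This is a minimal sketch, not a standard tool: run_check is a hypothetical helper name, and the two example checks are placeholders to swap for real service, port, or path checks.

```shell
#!/usr/bin/env bash
# run_check is a hypothetical helper: it runs one command quietly and
# prints OK or FAIL based on the exit code, so a checklist reads top to bottom.
run_check() {
  local label="$1"; shift
  if "$@" >/dev/null 2>&1; then
    printf '[ OK ] %s\n' "$label"
  else
    printf '[FAIL] %s\n' "$label"
    return 1
  fi
}

# Example checks (placeholders; swap in your own service, host, and paths).
run_check "root filesystem is present" test -d / || true
run_check "placeholder check that fails" false || true
```

The "|| true" keeps the checklist going after a failed step; drop it where a failure should stop the run.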
2) RHEL Core Commands
Services (systemd)
Manage and troubleshoot system services (daemons).
systemctl status httpd # Check service status and recent logs
systemctl start|stop|restart httpd # Control service runtime state
systemctl enable|disable httpd # Control auto-start at boot
journalctl -u httpd -b # View logs for httpd since last boot
journalctl -xe # Jump to the newest journal entries, with explanatory text
Packages
Install, update, remove, and verify software packages.
dnf install httpd # Install package
dnf update # Update all packages
dnf remove httpd # Remove package
dnf list installed # List installed packages
rpm -qa | grep httpd # Query installed RPMs
Network
Check IP configuration, listening ports, and connectivity.
ip addr # Show IP addresses
ss -tulnp # Show listening ports + processes
ping host # Test connectivity
curl -I http://site # Test HTTP response headers
Firewall / SELinux
Security enforcement and traffic filtering.
firewall-cmd --list-all # Show firewall rules
firewall-cmd --add-service=http --permanent # Open HTTP (persistent rule)
firewall-cmd --add-service=https --permanent # Open HTTPS (persistent rule)
firewall-cmd --reload # Load permanent rules into the running firewall
getenforce # Check SELinux mode (Enforcing/Permissive/Disabled)
Disk / Files
Check storage usage, mounts, and permissions.
df -h # Disk usage (human readable)
df -i # Inode usage (file count limit)
du -sh /path # Directory size
lsblk # Block devices and disks
mount # Mounted filesystems
chmod 755 file # Change permissions
chown user:group file # Change ownership
Performance & Troubleshooting
Quick performance and slowdown analysis tools.
top # Live CPU/memory usage per process
htop # Enhanced interactive top (if installed)
free -h # Memory usage (RAM + swap)
vmstat 1 # CPU, memory, IO every 1 sec
iostat -xz 1 # Disk IO stats (requires sysstat)
uptime # Load average
ps aux --sort=-%cpu # Top CPU consumers
ps aux --sort=-%mem # Top memory consumers
• High load? → uptime
• CPU maxed? → top
• Memory exhausted? → free -h
• Disk bottleneck? → iostat -xz 1
• Too many files? → df -i
• Swap usage high? → vmstat
free -h reports memory (RAM + swap); df -h reports disk.
Load average ≠ CPU %: on Linux it counts runnable tasks plus tasks in uninterruptible (usually I/O) wait.
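Load average is easier to judge relative to the number of CPUs. The sketch below divides one by the other; load_per_core is a hypothetical helper name, and the "well above 1.00" reading is a rule of thumb, not a hard limit.

```shell
#!/usr/bin/env bash
# load_per_core is a hypothetical helper: divide the 1-minute load average
# by the number of CPUs. Values well above 1.00 suggest tasks are queuing.
load_per_core() {
  local load="$1" cpus="$2"
  awk -v l="$load" -v c="$cpus" 'BEGIN { printf "%.2f\n", l / c }'
}

# Live usage on Linux: first field of /proc/loadavg and the CPU count from nproc.
if [[ -r /proc/loadavg ]] && command -v nproc >/dev/null 2>&1; then
  read -r load1 _ < /proc/loadavg
  echo "load per core: $(load_per_core "$load1" "$(nproc)")"
fi
```

For example, a load of 4.00 on a 2-CPU host gives 2.00, which usually means work is waiting for CPU or I/O.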
3) Linux Disk Expansion
A) If an existing VM disk was expanded (same disk grew)
# confirm OS sees new size
lsblk
df -h
# If partition needs resize (common in VMs):
# (Depending on tooling; may use growpart if installed)
# growpart /dev/sda 2
# If LVM is used:
pvs; vgs; lvs
pvresize /dev/sda2 # (example PV partition)
lvextend -l +100%FREE /dev/vg0/lv_data
# Filesystem grow (choose one)
xfs_growfs /mountpoint # XFS
resize2fs /dev/vg0/lv_data # ext4
B) If adding a NEW disk and using LVM
lsblk
# create PV
pvcreate /dev/sdb
# add PV to VG
vgextend vg0 /dev/sdb
# extend LV
lvextend -l +100%FREE /dev/vg0/lv_data
# grow FS
xfs_growfs /mountpoint # XFS
# OR
resize2fs /dev/vg0/lv_data # ext4
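The last step in both paths depends on the filesystem type. The sketch below prints the matching grow command without running it; grow_fs_cmd is a hypothetical helper name, and the device and mount point are the same example values used above.

```shell
#!/usr/bin/env bash
# grow_fs_cmd is a hypothetical helper: print (not run) the grow command
# that matches the filesystem type. XFS grows by mount point; ext4 by device.
grow_fs_cmd() {
  local fstype="$1" device="$2" mountpoint="$3"
  case "$fstype" in
    xfs)  echo "xfs_growfs $mountpoint" ;;
    ext4) echo "resize2fs $device" ;;
    *)    echo "unsupported filesystem: $fstype" >&2; return 1 ;;
  esac
}

# On a real host the type could come from: findmnt -no FSTYPE /mountpoint
grow_fs_cmd xfs /dev/vg0/lv_data /mountpoint
```

As an alternative, lvextend accepts -r (--resizefs) to extend the LV and grow the filesystem in one step.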
4) Puppet Reference
Architecture & Concepts
- Puppet Server compiles catalog (desired state).
- Agent applies catalog on interval (commonly ~30 min) or manual run.
- Idempotent: safe to re-run; fixes drift only if needed.
- Drift: manual changes get reverted to policy.
Common Tasks Puppet Manages
Basic Manifest Patterns
# Ensure Apache installed and running
package { 'httpd':
  ensure => installed,
}
service { 'httpd':
  ensure => running,
  enable => true,
}
Agent Commands
puppet agent --test
systemctl status puppet
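puppet agent --test turns on detailed exit codes, so a non-zero exit does not always mean failure. The sketch below maps the documented codes to a short description; puppet_rc_meaning is a hypothetical helper name, and the puppet call itself is left as a comment so the sketch runs on a host without Puppet.

```shell
#!/usr/bin/env bash
# puppet_rc_meaning is a hypothetical helper that maps the detailed exit
# codes (enabled by --test) to a short description.
puppet_rc_meaning() {
  case "$1" in
    0) echo "no changes" ;;
    1) echo "run failed or another run in progress" ;;
    2) echo "changes applied" ;;
    4) echo "some resources failed" ;;
    6) echo "changes applied and some resources failed" ;;
    *) echo "unexpected exit code $1" ;;
  esac
}

# Typical use on a node (commented out so the sketch runs anywhere):
#   puppet agent --test; rc=$?
#   echo "puppet: $(puppet_rc_meaning "$rc")"
puppet_rc_meaning 2
```

Because exit code 2 means success with changes, a wrapper script running under set -e should capture the code rather than let it abort the script.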
Talking Points (High Score)
- Role-based node classification (web/app/db).
- Central change = fleet-wide consistency.
- Reports show compliance + failed runs.
5) PowerShell Quick Reference
Use Get-Help and Get-Command to confirm syntax.
Help & Discovery (first stop when you’re stuck)
Find the right cmdlet and confirm parameters/examples.
Get-Command *service* # Search commands by name
Get-Help Restart-Service -Full # Full help + examples
Get-Help Invoke-Command -Examples # Fast examples
Get-Service | Get-Member # Inspect object properties/methods (needs pipeline input)
Core Cmdlets (with what they’re for)
# Services (check/control Windows services)
Get-Service # List services
Get-Service -Name Spooler # One service by name
Restart-Service -Name Spooler # Restart service
Start-Service -Name Spooler # Start service
Stop-Service -Name Spooler # Stop service
# Processes (CPU/memory consumers, hung tasks)
Get-Process # List processes
Stop-Process -Id 1234 -Force # Kill by PID (force)
# Files & content (basic file operations)
Get-ChildItem # List files/dirs (like dir/ls)
Get-Content .\file.txt # Read file
Set-Content .\file.txt "text" # Overwrite file
Add-Content .\file.txt "more text" # Append to file
# Events (Windows event logs)
Get-WinEvent -LogName System -MaxEvents 50 # Recent events in a log
# Connectivity (ICMP ping)
Test-Connection server1 -Count 2 # Ping test
Common “Admin Reality” Cmdlets
These come up constantly in enterprise troubleshooting.
# System info / performance quick checks
Get-ComputerInfo # High-level OS/system info
Get-CimInstance Win32_OperatingSystem # Memory, OS details
Get-CimInstance Win32_LogicalDisk # Disk free space
Get-Counter '\Processor(_Total)\% Processor Time' -SampleInterval 1 -MaxSamples 5
# Networking quick checks
Get-NetIPConfiguration # IP/DNS/gateway
Test-NetConnection server1 -Port 445 # Test TCP port (SMB example)
Get-DnsClientServerAddress # DNS servers configured
# Windows updates (varies by org tooling, but common conceptually)
Get-HotFix | Sort-Object InstalledOn -Descending | Select-Object -First 10 # Most recently installed updates/hotfixes
Pipeline patterns (this is where PowerShell “clicks”)
Think: Get → Where (filter) → Select (shape output) → Sort / Group → Do something
# Filter objects (Where-Object) by a property
Get-Service | Where-Object Status -eq 'Stopped'
# Select only fields you want
Get-Service -Name Spooler | Select-Object Name, Status, StartType
# Sort and take top items
Get-Process | Sort-Object CPU -Descending | Select-Object -First 10 Name, Id, CPU
# Export results (reporting)
Get-Service | Select-Object Name, Status | Export-Csv .\services.csv -NoTypeInformation
Variables, arrays, and loops (simple patterns you can reuse)
# Variable
$svcName = 'Spooler'
# Array
$servers = @('srv1','srv2')
# Foreach loop
foreach ($s in $servers) {
    "$s - checking $svcName"
}
Remote basics (WinRM / PowerShell Remoting)
Run commands on remote servers without logging in interactively.
# Run one command remotely
Invoke-Command -ComputerName server1 -ScriptBlock { Get-Service Spooler }
# Use local variables inside ScriptBlock with $using:
$svcName = 'Spooler'
Invoke-Command -ComputerName server1 -ScriptBlock { Get-Service -Name $using:svcName }
Common “Performance Exam” scenario: Check a service across servers and fix it
Pattern: check → if wrong → remediate → report.
$servers = @('srv1','srv2')
$svcName = 'Spooler'
$results = foreach ($s in $servers) {
    try {
        $svc = Invoke-Command -ComputerName $s -ErrorAction Stop -ScriptBlock {
            Get-Service -Name $using:svcName
        }
        if ($svc.Status -ne 'Running') {
            # Restart, then re-query so the report shows the post-fix status
            $svc = Invoke-Command -ComputerName $s -ErrorAction Stop -ScriptBlock {
                Restart-Service -Name $using:svcName -ErrorAction Stop
                Get-Service -Name $using:svcName
            }
            $action = 'Restarted'
        } else {
            $action = 'NoChange'
        }
        [pscustomobject]@{
            Server  = $s
            Service = $svcName
            Status  = $svc.Status
            Action  = $action
        }
    }
    catch {
        # Unreachable host or failed restart: record it and keep going
        [pscustomobject]@{
            Server  = $s
            Service = $svcName
            Status  = 'Unknown'
            Action  = "Error: $($_.Exception.Message)"
        }
    }
}
$results | Format-Table -AutoSize
• try/catch prevents one dead server from killing the whole script.
• [pscustomobject] makes clean, exportable output.
• Add | Export-Csv to save results.
Mini “cheat sheet” of the most useful operators
-eq    equal            -ne     not equal
-gt    greater than     -lt     less than
-like  wildcard match   -match  regex match
-and   logical AND      -or     logical OR
-not   negation
Safe scripting habits (small things that prevent big mistakes)
# Preview changes when supported
Restart-Service -Name Spooler -WhatIf
# Fail fast inside try/catch
Get-Service -Name DoesNotExist -ErrorAction Stop
6) Windows Troubleshooting – Systems Engineer Notes
Core OS Checks
- Event Viewer – System, Application, Security logs.
- Services – status, dependencies, startup type, service accounts.
- DNS – A/CNAME/SRV records; verify forward & reverse lookup.
- Permissions – NTFS vs Share permissions (Effective Access).
- Patching – Recent updates, KB correlation, rollback plan.
Performance & Resource Analysis
- CPU – sustained >80%? Check process via Task Manager / Get-Process.
- Memory – paging, commit charge, memory leaks.
- Disk – latency (>20ms concern), queue depth, IOPS.
- Network – packet loss, NIC errors, port saturation.
- PerfMon – collect counters for trend analysis.
Active Directory / Authentication Flow
- Check domain controller connectivity.
- Verify time synchronization (Kerberos sensitive).
- Validate DNS SRV records for DC discovery.
- Review account lockouts & replication health.
- Use dcdiag and repadmin for DC health.
Group Policy Troubleshooting
- Understand LSDOU processing order.
- Check Resultant Set of Policy (rsop.msc / gpresult).
- Validate SYSVOL replication.
Networking Stack
- ipconfig /all (DNS, gateway, DHCP).
- Test-Connection / tracert for reachability.
- netstat -ano (port conflicts, listening services).
- Firewall rules & Windows Defender status.
High Availability Awareness
- Cluster service status.
- Quorum configuration.
- Failover event logs.
- Storage presentation (LUN visibility, MPIO).
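PowerShell Cmdlets and Native Tools (Enterprise-Level)
Get-Service
Get-WinEvent
Test-Connection
Get-Process
Get-Counter
Get-ADUser # requires the ActiveDirectory module (RSAT)
dcdiag # native tool, not a cmdlet
repadmin /replsummary # native tool
gpresult /r # native tool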
7) Systems Thinking
A) Deploy & Manage 50 Servers (RHEL + Puppet)
- Standardize build (gold image/template) + register to repos/subscription.
- Install Puppet agent on all nodes; classify by role (web/app/db).
- Manifests enforce packages, services, configs, firewall/SELinux baseline.
- Central logging + monitoring; alerting for services, disk, CPU/mem.
- Patch lifecycle: staged/canary → rollout; track compliance.
- Backups/DR integration + runbooks; verify restores.
- Document and manage lifecycle (build → operate → retire).
B) Critical Vulnerability on Public-Facing Servers
- Validate: CVE details, exploitability, affected versions.
- Contain: WAF/firewall restrictions, disable vulnerable service if needed.
- Patch: canary first, then rollout (automation where applicable).
- Verify: rescan, version check, regression test.
- Monitor & document: logs, lessons learned, update runbooks.
C) SAN / Storage Latency Spike (keywords)
8) Bash Scripting
Bash is text + exit codes: 0 = success, non-zero = failure.
Help & Discovery (when you forget syntax)
Find usage, examples, and what flags mean.
man systemctl # Manual page
command --help # Quick help
type -a ls # What command you’re actually running (alias/binary)
which curl # Where binary lives
echo $? # Exit code of last command (0 = OK)
set -x # Debug: print commands as they execute
set +x # Stop debug
Core commands you’ll script constantly
# Files / navigation
pwd # Current directory
ls -lah # List (long, all incl. hidden, human-readable sizes)
cd /path
cp -av src dst # Copy with attributes + verbose
mv -v old new
rm -i file # Interactive delete
mkdir -p /a/b/c # Create nested dirs
# Viewing files
cat file # Dump file
less file # Page through file
head -n 50 file
tail -n 50 file
tail -f /var/log/messages # Follow logs live
# Search / filter
grep -R "text" /path # Search recursively
grep -i "error" file # Case-insensitive
awk '{print $1,$3}' file
cut -d: -f1 /etc/passwd
sort file | uniq -c
# Permissions
chmod 755 file
chown user:group file
Variables, quoting, and command substitution (most common mistakes)
name="httpd"
echo "$name" # Double quotes expand variables and prevent word splitting
echo '$name' # Single quotes do NOT expand variables
now="$(date +%F_%H%M%S)" # Command substitution
echo "$now"
path="/var/log/messages"
echo "$path"
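The effect of quoting is easiest to see by counting arguments. This small sketch uses a hypothetical helper, count_args, and a made-up file name that contains a space.

```shell
#!/usr/bin/env bash
# count_args prints how many arguments it received.
count_args() { echo "$#"; }

file="my report.txt"      # value contains a space
count_args $file          # unquoted: split into 2 words
count_args "$file"        # quoted: stays 1 word
```

The unquoted form is how a command like rm or cp ends up acting on two wrong paths instead of one right one.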
Conditionals (if/elif/else) and file tests
# Common tests (files and strings):
# -f file exists and is a regular file
# -d directory exists
# -r readable, -w writable, -x executable
# -z string is empty
file="/etc/ssh/sshd_config"
if [[ -f "$file" ]]; then
  echo "Found $file"
else
  echo "Missing $file"
fi
Loops (for / while) you can reuse
# For over a list
servers=("srv1" "srv2" "srv3")
for s in "${servers[@]}"; do
  echo "Checking $s"
done
# While read lines from a file
while IFS= read -r line; do
  echo "Line: $line"
done < servers.txt
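Real host lists usually contain blank lines, comments, and sometimes a last line with no trailing newline. The sketch below handles all three; read_hosts is a hypothetical helper name, and the temporary file stands in for servers.txt.

```shell
#!/usr/bin/env bash
# read_hosts is a hypothetical helper: print usable host names from a list
# file, skipping blank lines and lines that start with '#'.
read_hosts() {
  local line
  # "|| [[ -n $line ]]" also processes a final line that lacks a newline.
  while IFS= read -r line || [[ -n "$line" ]]; do
    [[ -z "$line" || "$line" == \#* ]] && continue
    echo "$line"
  done < "$1"
}

# Demo with a temporary file standing in for servers.txt.
tmp="$(mktemp)"
printf 'srv1\n\n# retired host\nsrv2' > "$tmp"
read_hosts "$tmp"
rm -f "$tmp"
```

The demo prints srv1 and srv2; the blank line and the comment are skipped.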
Functions + safe mode (professional baseline)
This prevents silent failures and makes scripts predictable.
#!/usr/bin/env bash
set -euo pipefail
log() { echo "[$(date +%F\ %T)] $*"; }
log "Starting script"
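Safe mode pairs well with traps, so a script cleans up after itself and says where it failed. This is a sketch under assumptions: the scratch directory and example.txt are illustrative, and the ERR trap only fires for commands that set -e would also stop on.

```shell
#!/usr/bin/env bash
set -euo pipefail

log() { echo "[$(date '+%F %T')] $*"; }

# Create a scratch directory and guarantee it is removed on any exit.
workdir="$(mktemp -d)"
cleanup() { rm -rf "$workdir"; }
trap cleanup EXIT

# Report the failing line number before set -e stops the script.
trap 'log "ERROR on line $LINENO"' ERR

log "Working in $workdir"
echo "data" > "$workdir/example.txt"
log "Done"
```

The EXIT trap runs on normal completion and on failure, so temporary files do not pile up after a bad run.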
Common pipeline patterns (grep/awk/sed)
# Find top CPU processes
ps aux --sort=-%cpu | head -n 10
# Find top memory processes
ps aux --sort=-%mem | head -n 10
# Show listening ports
ss -tulnp
# Count failed SSH logins (example paths vary)
grep -i "failed password" /var/log/secure | wc -l
# Extract column examples
df -h | awk '{print $1,$5,$6}'
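The patterns above combine naturally into a count-per-key report. The sketch below counts failed SSH password attempts per source IP; failed_by_ip is a hypothetical helper name, the log lines are invented samples, and the field position assumes the usual sshd message ending "from <ip> port <n> ssh2".

```shell
#!/usr/bin/env bash
# failed_by_ip is a hypothetical helper: read sshd log lines on stdin and
# print "count ip" for failed password attempts, most frequent first.
# Assumes the usual sshd message ending: "... from <ip> port <n> ssh2".
failed_by_ip() {
  grep -i "failed password" \
    | awk '{ print $(NF-3) }' \
    | sort | uniq -c | sort -rn \
    | awk '{ print $1, $2 }'
}

# Sample lines stand in for /var/log/secure.
failed_by_ip <<'EOF'
Jan 10 10:00:01 web1 sshd[101]: Failed password for root from 203.0.113.5 port 40001 ssh2
Jan 10 10:00:03 web1 sshd[102]: Failed password for invalid user bob from 203.0.113.5 port 40002 ssh2
Jan 10 10:00:09 web1 sshd[103]: Failed password for root from 198.51.100.7 port 40003 ssh2
Jan 10 10:00:12 web1 sshd[104]: Accepted password for alice from 192.0.2.10 port 40004 ssh2
EOF
```

On a real host the input would be the log itself, for example: failed_by_ip < /var/log/secure (path varies by distribution).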
Remote basics (SSH non-interactive)
Run a command on a remote host without logging in manually.
ssh user@server1 "hostname; uptime; systemctl is-active httpd"
Performance / “server is slow” triage (Linux)
Fast checks to decide if it’s CPU, RAM, disk, or network.
uptime # Load average (runnable + waiting)
top # CPU/mem live per process
free -h # RAM + swap usage
vmstat 1 # CPU run queue, swapping, IO
iostat -xz 1 # Disk latency/utilization (sysstat)
df -h # Disk space
df -i # Inodes (file-count exhaustion)
ss -s # Socket summary
dmesg -T | tail -n 50 # Kernel messages (hardware/IO errors)
• High load with low CPU → often IO wait (disk/network).
• High swap usage → memory pressure; look for top RSS processes.
• Full inodes (df -i) → “No space left on device” even when disk isn’t full.
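Space and inode checks can share one filter, because df -P and df -Pi put the percentage and mount point in the same columns. In this sketch over_threshold is a hypothetical helper name and 90 is an arbitrary limit; mount points containing spaces are not handled.

```shell
#!/usr/bin/env bash
# over_threshold is a hypothetical helper: read `df -P` style output on stdin
# and print mount points whose Use% is at or above the given limit.
over_threshold() {
  local limit="$1"
  awk -v limit="$limit" 'NR > 1 { use = $5; sub(/%/, "", use); if (use + 0 >= limit) print $6, $5 }'
}

# Live usage: space first, then inodes (-i switches the same columns to inode counts).
df -P  | over_threshold 90
df -Pi | over_threshold 90
```

No output means nothing is at or over the limit, which makes the function easy to drop into a cron job or monitoring check.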
Practical scenario script: check a service on many servers and fix it
Pattern: check → remediate → report.
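#!/usr/bin/env bash
set -euo pipefail
servers=("srv1" "srv2" "srv3")
svc="httpd"
for s in "${servers[@]}"; do
  echo "=== $s ==="
  if ssh "$s" "systemctl is-active --quiet $svc"; then
    echo "$svc is running"
  else
    echo "$svc is NOT running - restarting..."
    # Test the restart inside "if" so a failure on one host does not
    # trip set -e and abort the loop before the remaining hosts are checked.
    if ssh "$s" "sudo systemctl restart $svc && systemctl is-active --quiet $svc"; then
      echo "$svc restarted OK"
    else
      echo "$svc restart FAILED on $s" >&2
    fi
  fi
done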
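The report step of the pattern can be sketched by collecting one result per host and printing a summary at the end. Everything named here is illustrative: check_host is a placeholder for the real SSH probe, and it deliberately fails for the hypothetical host srv2 so the sketch runs without any remote access.

```shell
#!/usr/bin/env bash
set -uo pipefail   # no -e: one failing host must not stop the loop

# check_host is a placeholder for the real probe, for example:
#   ssh -o BatchMode=yes "$1" "systemctl is-active --quiet httpd"
# Here it fails for the hypothetical host "srv2" so the sketch runs anywhere.
check_host() { [[ "$1" != "srv2" ]]; }

report=()
failed=0
for s in srv1 srv2 srv3; do
  if check_host "$s"; then
    report+=("$s OK")
  else
    report+=("$s FAILED")
    failed=$((failed + 1))
  fi
done

printf '%s\n' "${report[@]}"
echo "failed hosts: $failed"
```

In a real script the failure count could drive the exit code, so a scheduler or pipeline can tell a clean run from a partial one.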
Text processing “cheat sheet” (most useful one-liners)
grep -i "error" file # Find lines matching text
grep -v "pattern" file # Exclude lines
awk '{print $1}' file # Print column 1
awk -F: '{print $1,$3}' /etc/passwd # Use delimiter :
sed 's/old/new/g' file # Replace text
wc -l file # Count lines
Safe scripting habits (avoid self-inflicted outages)
# Confirm before deleting (rm -i prompts per file; run ls -l first to preview)
rm -i /path/file
# Use a dry-run pattern (manual, but effective)
echo "Would run: systemctl restart httpd"
# Guardrails for critical variables
: "${TARGET:?TARGET is required}" # Fails if TARGET is empty/unset
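The manual dry-run echo can be turned into a reusable wrapper. In this sketch run and DRY_RUN are hypothetical names, and dry-run is the default so that executing for real is an explicit choice.

```shell
#!/usr/bin/env bash
# run is a hypothetical wrapper: with DRY_RUN=1 it only prints the command;
# otherwise it executes it. Default is dry-run so nothing changes by accident.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [[ "$DRY_RUN" == "1" ]]; then
    echo "Would run: $*"
  else
    "$@"
  fi
}

run systemctl restart httpd     # prints the command while DRY_RUN=1
```

Invoking the script as DRY_RUN=0 ./script.sh would execute the commands for real; the script name is only an example.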
9) Scenario Playbooks (Open-Book Speed Runs)
A) Apache site not loading after config change (RHEL)
# 1) Is Apache running?
systemctl status httpd
# 2) Is it listening?
ss -tulnp | grep -E ':(80|443) ' # trailing space avoids matching :8080 or :4430
# 3) Syntax/config test (fast root cause)
apachectl -t
# common variants:
# httpd -t
# 4) Logs (most important)
journalctl -u httpd -b
# also check:
# /var/log/httpd/error_log (common on RHEL)
# 5) Firewall + SELinux
firewall-cmd --list-all
getenforce
# 6) Rollback / revert last change if needed
# (restore known-good config, then restart)
systemctl restart httpd
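Step 6 is much faster when a known-good copy already exists. The sketch below takes a timestamped backup before an edit and restores it afterwards; backup_file is a hypothetical helper name, and a temporary file stands in for the real Apache config.

```shell
#!/usr/bin/env bash
# backup_file is a hypothetical helper: copy a file to a timestamped sibling
# (preserving mode and timestamps) and print the backup path.
backup_file() {
  local src="$1" dest
  dest="${src}.bak.$(date +%F_%H%M%S)"
  cp -p -- "$src" "$dest" && echo "$dest"
}

# Demo on a temporary file standing in for a real config.
conf="$(mktemp)"
echo "Listen 80" > "$conf"
bak="$(backup_file "$conf")"
echo "Listn 80" > "$conf"             # simulate a bad edit
cp -p -- "$bak" "$conf"               # rollback to the known-good copy
cat "$conf"
rm -f "$conf" "$bak"
```

On a real host, the rollback would be followed by apachectl -t and, only if that passes, systemctl restart httpd.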
B) Service fails after patching
- Check service: systemctl status <svc>
- Logs: journalctl -u <svc> -b + app logs
- Dependencies: DB up? DNS? certs? ports? permissions?
- Config drift: compare with known-good config; use Puppet to enforce baseline if applicable.
- Rollback plan: snapshot/restore or downgrade package (only if policy allows).
- Verify: service active + port listening + functional check (curl).
C) “Server is slow” (Linux performance triage in 90 seconds)
uptime # load average (runnable + waiting)
top # CPU/mem per process
free -h # RAM + swap pressure
df -h # disk space
df -i # inode exhaustion ("No space left" even when disk isn't full)
ss -s # socket summary (connection spikes)
dmesg -T | tail -n 30 # kernel / IO errors
D) Disk full vs Inodes full vs Memory pressure (common confusion)
Disk full
df -h
du -sh /* 2>/dev/null | sort -h # largest top-level directories listed last
# cleanup logs/temp, then consider expansion
Inodes full
df -i
# too many small files → cleanup/rotate
Memory pressure
free -h
top
# high swap usage → find top RSS processes
Disk expansion “decision tree”
- Cleanup possible? do that first
- VM disk grew or new disk added?
- LVM or not?
- XFS or ext4?
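For the inodes-full case, the useful question is which directory holds the most files. This sketch counts regular files under each immediate subdirectory; files_per_dir is a hypothetical helper name, and the temporary tree is only there to make the demo self-contained.

```shell
#!/usr/bin/env bash
# files_per_dir is a hypothetical helper: count regular files under each
# immediate subdirectory of a path, busiest first. -xdev stays on one filesystem.
files_per_dir() {
  local root="$1" d count
  for d in "$root"/*/; do
    [[ -d "$d" ]] || continue
    count=$(find "$d" -xdev -type f | wc -l)
    printf '%s %s\n' "$((count))" "${d%/}"
  done | sort -rn
}

# Demo on a temporary tree: "a" holds 3 files, "b" holds 1.
demo="$(mktemp -d)"
mkdir "$demo/a" "$demo/b"
touch "$demo/a/1" "$demo/a/2" "$demo/a/3" "$demo/b/1"
files_per_dir "$demo"
rm -rf "$demo"
```

Pointing it at a path such as /var and then descending into the busiest directory usually finds the runaway spool, cache, or session directory in a few steps.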
E) Public can’t reach service but internal can
- Local: service running + listening (systemctl/ss)
- Host firewall: firewalld open for 80/443
- SELinux: enforcing blocks? check audit/journal
- DNS: public record correct?
- Perimeter: NAT/LB/WAF rules, upstream firewall
F) Puppet: drift + compliance (what to write if asked)
- Idempotency: Puppet enforces desired state; reverts unauthorized changes.
- Detect: Puppet reports/failed runs + last run status; investigate who/why.
- Correct: fix manifests if change is legitimate; otherwise allow Puppet to remediate drift.
- Force run: puppet agent --test
G) Mini Glossary (don’t freeze on acronyms)
- LVM: Logical Volume Manager (flexible disk management)
- PV: Physical Volume (disk/partition initialized for LVM)
- VG: Volume Group (pool of storage made from PVs)
- LV: Logical Volume (the “volume” you mount; like a flexible partition)
- XFS grow: xfs_growfs /mountpoint (grow online)
- ext4 grow: resize2fs /dev/vg/lv
- Idempotent: safe to run repeatedly; only changes what’s out of compliance
- Drift: config changes made outside automation; Puppet fixes it
H) “If I don’t know” fallback (still score points)
10) Cloud & Hybrid (Azure-Focused + AWS Equivalents)
A) Service Models – Know the Differences
- IaaS: Virtual Machines (Azure VM) (AWS: EC2) – full OS control, patching required.
- PaaS: Managed app/database platforms (Azure App Service, Azure SQL) (AWS: Elastic Beanstalk, RDS).
- SaaS: Hosted applications (Microsoft 365) (AWS: WorkSpaces / third-party SaaS).
B) Identity & Access Management
- Azure AD (now Microsoft Entra ID) (AWS: IAM + IAM Identity Center, formerly AWS SSO).
- Conditional Access policies (AWS: IAM policies + MFA enforcement).
- Role-Based Access Control (RBAC) (AWS: IAM Roles & Policies).
- Privileged Identity Management (PIM) (AWS: IAM role assumption + temporary credentials).
- Azure AD Connect (now Microsoft Entra Connect; hybrid identity sync) (AWS: AD Connector / AWS Managed Microsoft AD).
C) Hybrid Connectivity
- Azure VPN Gateway (AWS: Site-to-Site VPN).
- Azure ExpressRoute (AWS: Direct Connect).
- Virtual Networks (VNet) (AWS: VPC).
- Subnets + NSGs (AWS: Subnets + Security Groups).
- Route tables (AWS: Route Tables).
D) Compute (High Availability Focus)
- Azure Virtual Machines (AWS: EC2).
- Availability Sets (AWS: spread placement groups).
- Availability Zones (AWS: Availability Zones).
- VM Scale Sets (auto scale) (AWS: Auto Scaling Groups).
- Azure Load Balancer / Application Gateway (AWS: Network Load Balancer / Application Load Balancer).
E) Storage
- Azure Managed Disks (AWS: EBS).
- Azure Files (SMB shares) (AWS: EFS / FSx).
- Azure Blob Storage (AWS: S3).
- Storage replication: LRS / ZRS / GRS (AWS, roughly: S3 One Zone-IA / S3 Standard (multi-AZ) / Cross-Region Replication).
F) Backup & Disaster Recovery
- Azure Backup (AWS: AWS Backup).
- Azure Site Recovery (replication) (AWS: Elastic Disaster Recovery).
- Recovery Services Vault (AWS: Backup Vault).
- Define RPO and RTO.
- Test restores regularly.
G) Security & Monitoring
- Network Security Groups (NSG) (AWS: Security Groups).
- Azure Firewall (AWS: AWS Network Firewall).
- Web Application Firewall (WAF) (AWS: AWS WAF).
- Defender for Cloud (AWS: Security Hub / GuardDuty).
- Azure Monitor + Log Analytics (AWS: CloudWatch + CloudTrail).
- Microsoft Sentinel (SIEM) (AWS: Security Hub + OpenSearch / third-party SIEM).
H) Hybrid Architecture Answer Template
2) Integrate identity (Azure AD Connect / hybrid sync).
3) Secure connectivity (VPN or ExpressRoute).
4) Deploy compute in Availability Zones or use PaaS where appropriate.
5) Segment network (subnets + NSGs).
6) Implement backup & replication.
7) Enable monitoring + security baseline.
8) Document lifecycle and cost management.
11) Active Directory (Enterprise Quick Reference)
A) Core Components
- Domain Controller (DC) – Authentication + directory services.
- Forest – Security boundary.
- Domain – Logical grouping of objects.
- Organizational Units (OUs) – Used for delegation + GPO targeting.
- Global Catalog – Partial attribute store for forest-wide searches.
B) FSMO Roles (Know These)
- Schema Master
- Domain Naming Master
- RID Master
- PDC Emulator
- Infrastructure Master
C) Authentication Flow
- Client contacts DC.
- Kerberos ticket issued.
- Access granted based on group membership + ACLs.
D) Troubleshooting AD Authentication
- Check DC availability.
- Verify DNS resolution.
- Check time sync (Kerberos requires time sync).
- Review Event Viewer on DC.
- Check replication status.
E) Replication Health
repadmin /replsummary
repadmin /showrepl
dcdiag
F) Group Policy (GPO)
- Order: Local → Site → Domain → OU.
- Use gpresult /r to verify applied policies.
- Block inheritance carefully.
G) Security Best Practices
- Least privilege.
- Separate admin accounts.
- MFA for privileged roles.
- Tiered administration model.
- Audit logging enabled.
H) If Asked: “Users Can’t Log In”
1) Check DC availability.
2) Confirm DNS resolution.
3) Check time sync.
4) Review Event Viewer logs.
5) Validate replication health.
6) Verify account not locked/expired.
12) DNS & Certificates (Records + TLS Quick Reference)
A) Common DNS Record Types
- A Record – Maps hostname to IPv4 address.
- AAAA Record – IPv6 equivalent.
- CNAME – Alias to another hostname.
- MX – Mail routing record.
- SRV – Service locator (critical for AD).
- TXT – SPF/DKIM/verification records.
- PTR – Reverse lookup.
B) DNS Troubleshooting
nslookup hostname
Resolve-DnsName hostname # PowerShell
dig hostname # Linux (if available)
- Verify record exists.
- Check TTL.
- Flush cache if needed.
- Ensure correct DNS server configured.
C) AD & DNS Relationship
- AD heavily relies on SRV records.
- If DNS fails → authentication fails.
- DCs must register properly in DNS.
D) Certificates (TLS / SSL)
- Ensure certificate matches hostname (SAN is what modern browsers check; CN alone is not enough).
- Check expiration date.
- Verify certificate chain (intermediate CA).
- Ensure private key present.
E) Common Certificate Issues
- Expired certificate.
- Wrong hostname.
- Missing intermediate CA.
- Service not bound to correct certificate.
F) Quick Certificate Checks
# Windows (PowerShell)
Get-ChildItem Cert:\LocalMachine\My
# Linux
openssl s_client -connect site:443 -servername site </dev/null | openssl x509 -noout -subject -issuer -dates
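Expiry is the check most worth automating. The sketch below converts an expiry date into days remaining; days_until is a hypothetical helper name, it relies on GNU date's -d parsing (standard on RHEL), and the dates are fixed so the result is reproducible.

```shell
#!/usr/bin/env bash
# days_until is a hypothetical helper: whole days from a reference date to an
# expiry date. Relies on GNU date's -d parsing (standard on RHEL).
days_until() {
  local expiry="$1" from="${2:-now}"
  local end start
  end="$(date -u -d "$expiry" +%s)" || return 1
  start="$(date -u -d "$from" +%s)" || return 1
  echo $(( (end - start) / 86400 ))
}

# Fixed dates so the result is reproducible; same format openssl prints for notAfter.
days_until "Jan 31 00:00:00 2025 GMT" "Jan 1 00:00:00 2025 GMT"   # prints 30

# Live use (commented out; "site" is the same placeholder host as above):
#   end="$(openssl s_client -connect site:443 -servername site </dev/null 2>/dev/null \
#          | openssl x509 -noout -enddate | cut -d= -f2)"
#   days_until "$end"
```

A negative result means the certificate has already expired.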
G) If Website Shows Security Warning
1) Check certificate expiration date.
2) Verify CN/SAN matches URL.
3) Validate certificate chain.
4) Confirm service binding to correct cert.
5) Restart service after fix.