System Engineering Quick Reference

Keep Calm and Code On

1) (Almost) Universal Troubleshooting Method

OSI Model:
Physical | Data Link | Network | Transport | Session | Presentation | Application

Troubleshooting Checklist

  1. Scope + safety: impact, severity, recent change.
  2. Define symptom: down vs slow; exact error.
  3. Reproduce: local + internal + external vantage points.
  4. Quick triage: CPU/mem/disk/network.
  5. Dependencies: DNS/AD/DB/storage/certs/other services.
  6. Service/process: is it running & listening on port?
  7. Logs: journal/Event Viewer/app logs.
  8. Security/config: firewall, SELinux, permissions, certs.
  9. Fix low risk first: restart/rollback/canary patch.
  10. Verify + monitor + document: multi-vantage validation.
Validate → Contain → Fix → Verify → Document

2) RHEL Core Commands

Services (systemd)

Manage and troubleshoot system services (daemons).

systemctl status httpd        # Check service status and recent logs
systemctl start|stop|restart httpd
                               # Control service runtime state
systemctl enable|disable httpd
                               # Control auto-start at boot
journalctl -u httpd -b        # View logs for httpd since last boot
journalctl -xe                # View recent system errors

Packages

Install, update, remove, and verify software packages.

dnf install httpd             # Install package
dnf update                    # Update all packages
dnf remove httpd              # Remove package
dnf list installed            # List installed packages
rpm -qa | grep httpd          # Query installed RPMs

Network

Check IP configuration, listening ports, and connectivity.

ip addr                       # Show IP addresses
ss -tulnp                     # Show listening ports + processes
ping host                     # Test connectivity
curl -I http://site           # Test HTTP response headers
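
The listening-port check can be turned into a yes/no test for scripts. A minimal sketch, assuming ss -tlnH output (state in column 1, local address:port in column 4); the function name port_listening is hypothetical.

```shell
# Hypothetical helper: succeeds if the given TCP port shows up as LISTEN
# in `ss -tlnH` output supplied on stdin.
port_listening() {
  local port=$1
  awk -v port="$port" '
    $1 == "LISTEN" {
      n = split($4, parts, ":")        # last ":"-separated piece is the port
      if (parts[n] == port) found = 1
    }
    END { exit !found }                # exit 0 only when a match was seen
  '
}

# Usage on a live host:
#   ss -tlnH | port_listening 443 && echo "443 is listening"
```

Reading from stdin keeps the parsing separate from the live system, so the same function can be tried against saved output.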

Firewall / SELinux

Security enforcement and traffic filtering.

firewall-cmd --list-all       # Show firewall rules
firewall-cmd --add-service=http --permanent
firewall-cmd --add-service=https --permanent
                               # Open HTTP/HTTPS ports
firewall-cmd --reload         # Apply firewall changes

getenforce                    # Check SELinux mode (Enforcing/Permissive)

Disk / Files

Check storage usage, mounts, and permissions.

df -h                         # Disk usage (human readable)
df -i                         # Inode usage (file count limit)
du -sh /path                  # Directory size
lsblk                         # Block devices and disks
mount                         # Mounted filesystems
chmod 755 file                # Change permissions
chown user:group file         # Change ownership
⚠ If disk space looks fine but files cannot be created, check df -i — you may be out of inodes.
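
The inode check can be automated along these lines. A sketch only, assuming GNU df -Pi column order (IUse% in column 5, mount point in column 6) and mount points without spaces; inode_hogs is a hypothetical name.

```shell
# Hypothetical helper: reads `df -Pi` output on stdin and prints mount points
# whose inode usage (IUse%) is at or above the threshold (default 90).
inode_hogs() {
  local threshold=${1:-90}
  awk -v t="$threshold" '
    NR > 1 {                           # skip the header line
      use = $5
      sub(/%/, "", use)                # "97%" -> "97"; "-" stays non-numeric
      if (use ~ /^[0-9]+$/ && use + 0 >= t) print $6, use "%"
    }
  '
}

# Usage: df -Pi | inode_hogs 90
```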

Performance & Troubleshooting

Quick performance and slowdown analysis tools.

top                           # Live CPU/memory usage per process
htop                          # Enhanced interactive top (if installed)
free -h                       # Memory usage (RAM + swap)
vmstat 1                      # CPU, memory, IO every 1 sec
iostat -xz 1                  # Disk IO stats (requires sysstat)
uptime                        # Load average
ps aux --sort=-%cpu           # Top CPU consumers
ps aux --sort=-%mem           # Top memory consumers
Slow server checklist:
• High load? → uptime
• CPU maxed? → top
• Memory exhausted? → free -h
• Disk bottleneck? → iostat -xz 1
• Too many files? → df -i
• Swap usage high? → vmstat
Remember: df = disk space; free -h = memory.
Load average ≠ CPU %: on Linux it counts runnable tasks plus tasks in uninterruptible (usually disk I/O) wait.
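
To read a load average sensibly, compare it with the number of CPUs. A rough sketch; load_per_core is a hypothetical helper, and treating about 1.00 per core as "busy" is a rule of thumb rather than a hard limit.

```shell
# Hypothetical helper: divide a load average by the CPU count.
load_per_core() {
  local load=$1 cores=$2
  awk -v l="$load" -v c="$cores" 'BEGIN { printf "%.2f\n", l / c }'
}

# Usage on a live host (1-minute load average from /proc/loadavg):
#   load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

A load of 8 on a 4-core host gives 2.00 per core, which points at queued work even if top shows modest CPU %.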

3) Linux Disk Expansion

First questions to ask:
Is it using Logical Volume Manager (LVM)? What filesystem (XFS vs ext4)? Was an existing VM disk expanded, or was a new disk added?

A) If an existing VM disk was expanded (same disk grew)

# confirm OS sees new size
lsblk
df -h

# If partition needs resize (common in VMs):
# (Depending on tooling; may use growpart if installed)
# growpart /dev/sda 2

# If LVM is used:
pvs; vgs; lvs
pvresize /dev/sda2               # (example PV partition)
lvextend -l +100%FREE /dev/vg0/lv_data
# Filesystem grow (choose one)
xfs_growfs /mountpoint           # XFS
resize2fs /dev/vg0/lv_data       # ext4

B) If adding a NEW disk and using LVM

lsblk
# create PV
pvcreate /dev/sdb
# add PV to VG
vgextend vg0 /dev/sdb
# extend LV
lvextend -l +100%FREE /dev/vg0/lv_data
# grow FS
xfs_growfs /mountpoint     # XFS
# OR
resize2fs /dev/vg0/lv_data # ext4
Note: Take a backup / snapshot where policy allows, then do the lowest-risk change first (cleanup space if possible).
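
The "choose one" filesystem step can be decided from the filesystem type. A minimal sketch that only prints the command instead of running it; grow_fs_cmd is a hypothetical name, and the paths are the same example paths used above.

```shell
# Hypothetical helper: print (not run) the grow command for a filesystem type.
grow_fs_cmd() {
  local fstype=$1 mountpoint=$2 device=$3
  case $fstype in
    xfs)            echo "xfs_growfs $mountpoint" ;;   # XFS grows via the mount point
    ext2|ext3|ext4) echo "resize2fs $device" ;;        # ext* grows via the device
    *)              echo "unsupported filesystem: $fstype" >&2; return 1 ;;
  esac
}

# Usage on a live host:
#   grow_fs_cmd "$(findmnt -n -o FSTYPE /mountpoint)" /mountpoint /dev/vg0/lv_data
```

Printing first and running second gives a chance to review the command before touching storage.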

4) Puppet Reference

Architecture & Concepts

Common Tasks Puppet Manages

packages, services, files, users, cron, ssh, firewall

Basic Manifest Patterns

# Ensure Apache installed and running
package { 'httpd':
  ensure => installed,
}

service { 'httpd':
  ensure => running,
  enable => true,
}

Agent Commands

puppet agent --test
systemctl status puppet

Talking Points (High Score)

5) PowerShell Quick Reference

How to think (works for every script):
1) Define input (server names? service name?) → 2) Get data → 3) Filter → 4) Act (restart/copy/disable) → 5) Log + handle errors
Use Get-Help and Get-Command to confirm syntax.

Help & Discovery (first stop when you’re stuck)

Find the right cmdlet and confirm parameters/examples.

Get-Command *service*               # Search commands by name
Get-Help Restart-Service -Full      # Full help + examples
Get-Help Invoke-Command -Examples   # Fast examples
Get-Service | Get-Member             # Inspect object properties/methods (needs piped input)
PowerShell is object-based (not plain text). The pipeline passes objects, so you can filter on real properties.

Core Cmdlets (with what they’re for)

# Services (check/control Windows services)
Get-Service                          # List services
Get-Service -Name Spooler            # One service by name
Restart-Service -Name Spooler        # Restart service
Start-Service -Name Spooler          # Start service
Stop-Service -Name Spooler           # Stop service

# Processes (CPU/memory consumers, hung tasks)
Get-Process                          # List processes
Stop-Process -Id 1234 -Force         # Kill by PID (force)

# Files & content (basic file operations)
Get-ChildItem                        # List files/dirs (like dir/ls)
Get-Content .\file.txt               # Read file
Set-Content .\file.txt "text"        # Overwrite file
Add-Content .\file.txt "more text"   # Append to file

# Events (Windows event logs)
Get-WinEvent -LogName System -MaxEvents 50  # Recent events in a log

# Connectivity (ICMP ping)
Test-Connection server1 -Count 2     # Ping test

Common “Admin Reality” Cmdlets

These come up constantly in enterprise troubleshooting.

# System info / performance quick checks
Get-ComputerInfo                      # High-level OS/system info
Get-CimInstance Win32_OperatingSystem # Memory, OS details
Get-CimInstance Win32_LogicalDisk     # Disk free space
Get-Counter '\Processor(_Total)\% Processor Time' -SampleInterval 1 -MaxSamples 5

# Networking quick checks
Get-NetIPConfiguration                # IP/DNS/gateway
Test-NetConnection server1 -Port 445  # Test TCP port (SMB example)
Get-DnsClientServerAddress            # DNS servers configured

# Windows updates (varies by org tooling, but common conceptually)
Get-HotFix | Sort-Object InstalledOn -Descending | Select-Object -First 10  # Most recently installed updates/hotfixes

Pipeline patterns (this is where PowerShell “clicks”)

Think: Get → Where (filter) → Select (shape output) → Sort / Group → Do something

# Filter objects (Where-Object) by a property
Get-Service | Where-Object Status -eq 'Stopped'

# Select only fields you want
Get-Service -Name Spooler | Select-Object Name, Status, StartType

# Sort and take top items
Get-Process | Sort-Object CPU -Descending | Select-Object -First 10 Name, Id, CPU

# Export results (reporting)
Get-Service | Select-Object Name, Status | Export-Csv .\services.csv -NoTypeInformation

Variables, arrays, and loops (simple patterns you can reuse)

# Variable
$svcName = 'Spooler'

# Array
$servers = @('srv1','srv2')

# Foreach loop
foreach ($s in $servers) {
  "$s - checking $svcName"
}

Remote basics (WinRM / PowerShell Remoting)

Run commands on remote servers without logging in interactively.

# Run one command remotely
Invoke-Command -ComputerName server1 -ScriptBlock { Get-Service Spooler }

# Use local variables inside ScriptBlock with $using:
$svcName = 'Spooler'
Invoke-Command -ComputerName server1 -ScriptBlock { Get-Service -Name $using:svcName }

Common “Performance Exam” scenario: Check a service across servers and fix it

Pattern: check → if wrong → remediate → report.

$servers = @('srv1','srv2')
$svcName = 'Spooler'

$results = foreach ($s in $servers) {
  try {
    $svc = Invoke-Command -ComputerName $s -ErrorAction Stop -ScriptBlock {
      Get-Service -Name $using:svcName
    }

    if ($svc.Status -ne 'Running') {
      Invoke-Command -ComputerName $s -ErrorAction Stop -ScriptBlock {
        Restart-Service -Name $using:svcName -ErrorAction Stop
      }
      $action = 'Restarted'
    } else {
      $action = 'NoChange'
    }

    [pscustomobject]@{
      Server = $s
      Service = $svcName
      Status  = $svc.Status
      Action  = $action
    }
  }
  catch {
    [pscustomobject]@{
      Server = $s
      Service = $svcName
      Status  = 'Unknown'
      Action  = "Error: $($_.Exception.Message)"
    }
  }
}

$results | Format-Table -AutoSize
Notes:
• try/catch prevents one dead server from killing the whole script.
• [pscustomobject] makes clean, exportable output.
• Add | Export-Csv to save results.

Mini “cheat sheet” of the most useful operators

-eq  equal            -ne  not equal
-gt  greater than     -lt  less than
-like wildcard match  -match regex match
-and logical AND      -or  logical OR
-not negation

Safe scripting habits (small things that prevent big mistakes)

# Preview changes when supported
Restart-Service -Name Spooler -WhatIf

# Fail fast inside try/catch
Get-Service -Name DoesNotExist -ErrorAction Stop

6) Windows Troubleshooting – Systems Engineer Notes

Core OS Checks

Performance & Resource Analysis

Active Directory / Authentication Flow

Group Policy Troubleshooting

Networking Stack

High Availability Awareness

PowerShell (Enterprise-Level Useful Cmdlets)


Get-Service
Get-WinEvent
Test-Connection
Get-Process
Get-Counter
Get-ADUser
dcdiag
repadmin /replsummary
gpresult /r
  

7) Systems Thinking

Reflect on:
"What's connected?" - "What causes what?" - "Where are the feedback loops?" - "What happens next if I feed this?"

A) Deploy & Manage 50 Servers (RHEL + Puppet)


B) Critical Vulnerability on Public-Facing Servers


C) SAN / Storage Latency Spike (keywords)

latency (ms), IOPS, throughput, queue depth, controller utilization, multipath, snapshot/replication load, fabric congestion.

8) Bash Scripting

Rinse and repeat:
1) Define input (hostnames? files? service?) → 2) Get data → 3) Filter (grep/awk/sed) → 4) Act (restart/copy/kill) → 5) Exit codes + logging
Bash is text + exit codes: 0 = success, non-zero = failure.
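
Exit codes are what if, && and || actually test. A small sketch; has_content is a hypothetical helper and /etc/hostname is only an example path.

```shell
# Hypothetical check: returns 0 only if the file exists and is non-empty.
has_content() {
  [ -s "$1" ]
}

if has_content /etc/hostname; then
  echo "hostname file looks fine"
else
  echo "hostname file missing or empty (non-zero exit code)"
fi
```

The function has no explicit return: its exit code is simply that of its last command, the [ -s ] test.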

Help & Discovery (when you forget syntax)

Find usage, examples, and what flags mean.

man systemctl         # Manual page
command --help        # Quick help
type -a ls            # What command you’re actually running (alias/binary)
which curl            # Where binary lives
echo $?               # Exit code of last command (0 = OK)
set -x                # Debug: print commands as they execute
set +x                # Stop debug

Core commands you’ll script constantly

# Files / navigation
pwd                   # Current directory
ls -lah               # List (human, all, long)
cd /path
cp -av src dst        # Copy with attributes + verbose
mv -v old new
rm -i file            # Interactive delete
mkdir -p /a/b/c       # Create nested dirs

# Viewing files
cat file              # Dump file
less file             # Page through file
head -n 50 file
tail -n 50 file
tail -f /var/log/messages   # Follow logs live

# Search / filter
grep -R "text" /path  # Search recursively
grep -i "error" file  # Case-insensitive
awk '{print $1,$3}' file
cut -d: -f1 /etc/passwd
sort file | uniq -c

# Permissions
chmod 755 file
chown user:group file

Variables, quoting, and command substitution (most common mistakes)

name="httpd"
echo "$name"          # Use double-quotes to preserve spaces
echo '$name'          # Single quotes do NOT expand variables

now="$(date +%F_%H%M%S)"  # Command substitution
echo "$now"

path="/var/log/messages"
echo "$path"
Rule of thumb: always quote variables like "$var" unless you explicitly want word-splitting.
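
The effect of quoting is easiest to see by counting arguments. A tiny demonstration, assuming the default IFS; count_args is a hypothetical helper.

```shell
# Hypothetical helper: print how many arguments it received.
count_args() { echo "$#"; }

file="my report.txt"
count_args $file      # unquoted: split into 2 words, prints 2
count_args "$file"    # quoted: stays 1 word, prints 1
```

The unquoted form is how a command like rm ends up acting on two wrong names instead of one right one.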

Conditionals (if/elif/else) and file tests

# Common file tests:
# -f file exists and is a regular file
# -d directory exists
# -r readable, -w writable, -x executable
# -z string is empty

file="/etc/ssh/sshd_config"

if [[ -f "$file" ]]; then
  echo "Found $file"
else
  echo "Missing $file"
fi

Loops (for / while) you can reuse

# For over a list
servers=("srv1" "srv2" "srv3")
for s in "${servers[@]}"; do
  echo "Checking $s"
done

# While read lines from a file
while IFS= read -r line; do
  echo "Line: $line"
done < servers.txt

Functions + safe mode (professional baseline)

This prevents silent failures and makes scripts predictable.

#!/usr/bin/env bash
set -euo pipefail

log() { echo "[$(date +%F\ %T)] $*"; }

log "Starting script"
set -e: stop on error • -u: error on unset vars • -o pipefail: catch pipeline failures
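
What pipefail changes can be shown with a pipeline whose first command fails. A sketch; pipeline_status is a hypothetical helper that runs the pipeline in a child bash so the calling shell's options are left alone.

```shell
# Hypothetical helper: report the exit status of `false | true`
# with pipefail switched on or off.
pipeline_status() {
  case $1 in
    on)  bash -c 'set -o pipefail; false | true; echo "$?"' ;;
    off) bash -c 'false | true; echo "$?"' ;;
  esac
}

pipeline_status off   # prints 0: only the last command's status counts
pipeline_status on    # prints 1: the failing first command is no longer hidden
```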

Common pipeline patterns (grep/awk/sed)

# Find top CPU processes
ps aux --sort=-%cpu | head -n 10

# Find top memory processes
ps aux --sort=-%mem | head -n 10

# Show listening ports
ss -tulnp

# Count failed SSH logins (example paths vary)
grep -i "failed password" /var/log/secure | wc -l

# Extract column examples
df -h | awk '{print $1,$5,$6}'

Remote basics (SSH non-interactive)

Run a command on a remote host without logging in manually.

ssh user@server1 "hostname; uptime; systemctl is-active httpd"

Performance / “server is slow” triage (Linux)

Fast checks to decide if it’s CPU, RAM, disk, or network.

uptime                 # Load average (runnable + waiting)
top                    # CPU/mem live per process
free -h                # RAM + swap usage
vmstat 1               # CPU run queue, swapping, IO
iostat -xz 1           # Disk latency/utilization (sysstat)
df -h                  # Disk space
df -i                  # Inodes (file-count exhaustion)
ss -s                  # Socket summary
dmesg -T | tail -n 50  # Kernel messages (hardware/IO errors)
Quick reads:
• High load with low CPU → often IO wait (disk/network).
• High swap usage → memory pressure; look for top RSS processes.
• Full inodes (df -i) → “No space left on device” even when disk isn’t full.

Practical scenario script: check a service on many servers and fix it

Pattern: check → remediate → report.

#!/usr/bin/env bash
set -euo pipefail

servers=("srv1" "srv2" "srv3")
svc="httpd"

for s in "${servers[@]}"; do
  echo "=== $s ==="
  if ssh "$s" "systemctl is-active --quiet $svc"; then
    echo "$svc is running"
  else
    echo "$svc is NOT running - restarting..."
    ssh "$s" "sudo systemctl restart $svc && systemctl is-active $svc" \
      || echo "WARN: could not restart $svc on $s" >&2   # keep looping despite set -e
  fi
done
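
For the "report" step, a variant that emits one CSV line per host may be easier to collect. A sketch under assumptions: remote_run and check_and_fix are hypothetical names, and the ssh options (BatchMode, ConnectTimeout) are one reasonable choice for unattended runs, not a requirement.

```shell
# Hypothetical wrapper around ssh so the remote call can be swapped out.
remote_run() {
  ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" "$2"
}

# Check a service on one host, restart it if needed, print "host,service,action".
check_and_fix() {
  local host=$1 svc=$2
  if remote_run "$host" "systemctl is-active --quiet $svc"; then
    echo "$host,$svc,NoChange"
  elif remote_run "$host" "sudo systemctl restart $svc"; then
    echo "$host,$svc,Restarted"
  else
    echo "$host,$svc,Failed"          # unreachable host or restart failed
  fi
}

# Usage:
#   for s in srv1 srv2 srv3; do check_and_fix "$s" httpd; done > report.csv
```

Because every branch prints a result and none aborts, one unreachable server ends up as a Failed row instead of stopping the run.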

Text processing “cheat sheet” (most useful one-liners)

grep -i "error" file              # Find lines matching text
grep -v "pattern" file            # Exclude lines
awk '{print $1}' file             # Print column 1
awk -F: '{print $1,$3}' /etc/passwd # Use delimiter :
sed 's/old/new/g' file            # Replace text
wc -l file                        # Count lines

Safe scripting habits (avoid self-inflicted outages)

# Prompt for confirmation before deleting
rm -i /path/file

# Use a dry-run pattern (manual, but effective)
echo "Would run: systemctl restart httpd"

# Guardrails for critical variables
: "${TARGET:?TARGET is required}"   # Fails if TARGET is empty/unset
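
The manual dry-run echo can be wrapped once and reused. A minimal sketch; the run function and the DRY_RUN variable are hypothetical conventions, with dry-run as the default here.

```shell
# Hypothetical wrapper: with DRY_RUN=1 (default) it only prints the command;
# with DRY_RUN=0 it executes it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "Would run: $*"
  else
    "$@"
  fi
}

run systemctl restart httpd      # with DRY_RUN unset, prints: Would run: systemctl restart httpd
```

Defaulting to dry-run means a forgotten variable produces a harmless printout rather than a restart.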

9) Scenario Playbooks (Open-Book Speed Runs)

Exam-safe structure (write this every time):
Scope/Impact → Recent change → Quick triage (CPU/Mem/Disk/Net) → Service status → Logs → Firewall/SELinux/Perms → Fix lowest-risk → Verify from 2+ vantage points → Document

A) Apache site not loading after config change (RHEL)

# 1) Is Apache running?
systemctl status httpd

# 2) Is it listening?
ss -tulnp | grep -E ':80|:443'

# 3) Syntax/config test (fast root cause)
apachectl -t
# common variants:
# httpd -t

# 4) Logs (most important)
journalctl -u httpd -b
# also check:
# /var/log/httpd/error_log  (common on RHEL)

# 5) Firewall + SELinux
firewall-cmd --list-all
getenforce

# 6) Rollback / revert last change if needed
# (restore known-good config, then restart)
systemctl restart httpd

B) Service fails after patching

C) “Server is slow” (Linux performance triage in 90 seconds)

uptime            # load average (runnable + waiting)
top               # CPU/mem per process
free -h           # RAM + swap pressure
df -h             # disk space
df -i             # inode exhaustion ("No space left" even when disk isn't full)
ss -s             # socket summary (connection spikes)
dmesg -T | tail -n 30   # kernel / IO errors

D) Disk full vs Inodes full vs Memory pressure (common confusion)

Disk full

df -h
du -sh /* | sort -h
# cleanup logs/temp, then consider expansion

Inodes full

df -i
# too many small files → cleanup/rotate

Memory pressure

free -h
top
# high swap usage → find top RSS processes
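
Memory pressure can also be put into a single number. A sketch that assumes /proc/meminfo-style input containing MemTotal and MemAvailable lines; mem_available_pct is a hypothetical helper.

```shell
# Hypothetical helper: reads /proc/meminfo-style text on stdin and prints
# MemAvailable as a whole-number percentage of MemTotal.
mem_available_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total > 0) printf "%d\n", avail * 100 / total
      else exit 1                      # no MemTotal line found
    }
  '
}

# Usage on a live host: mem_available_pct < /proc/meminfo
```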

Disk expansion “decision tree”

  • Cleanup possible? do that first
  • VM disk grew or new disk added?
  • LVM or not?
  • XFS or ext4?

E) Public can’t reach service but internal can

F) Puppet: drift + compliance (what to write if asked)

G) Mini Glossary (don’t freeze on acronyms)

H) “If I don’t know” fallback (still score points)

Write this:
“I’m not fully certain of the exact command syntax, but my approach is: confirm scope and recent changes, check service status and logs, validate firewall/SELinux/permissions, apply lowest-risk remediation, then verify and document. If open-book is allowed, I’d confirm the specific command with man/Get-Help.”

10) Cloud & Hybrid (Azure-Focused + AWS Equivalents)

Cloud Design Mindset:
Identity first, network segmentation, least privilege, high availability, logging, backup/DR, and cost awareness. Always mention MFA, RBAC, monitoring, and encryption.

A) Service Models – Know the Differences

B) Identity & Access Management

C) Hybrid Connectivity

D) Compute (High Availability Focus)

E) Storage

F) Backup & Disaster Recovery

G) Security & Monitoring

H) Hybrid Architecture Answer Template

If asked: “Design a hybrid Azure solution”
1) Define requirements (availability, compliance, RPO/RTO).
2) Integrate identity (Microsoft Entra Connect, formerly Azure AD Connect / hybrid sync).
3) Secure connectivity (VPN or ExpressRoute).
4) Deploy compute in Availability Zones or use PaaS where appropriate.
5) Segment network (subnets + NSGs).
6) Implement backup & replication.
7) Enable monitoring + security baseline.
8) Document lifecycle and cost management.

11) Active Directory (Enterprise Quick Reference)

High-Scoring Mindset:
Identity, authentication, least privilege, replication health, and GPO enforcement are core to AD stability.

A) Core Components

B) FSMO Roles (Know These)

If authentication issues occur → check PDC Emulator first.

C) Authentication Flow

D) Troubleshooting AD Authentication

E) Replication Health

repadmin /replsummary
repadmin /showrepl
dcdiag

F) Group Policy (GPO)

G) Security Best Practices

H) If Asked: “Users Can’t Log In”

1) Verify DC reachable.
2) Confirm DNS resolution.
3) Check time sync.
4) Review Event Viewer logs.
5) Validate replication health.
6) Verify account not locked/expired.

12) DNS & Certificates (Records + TLS Quick Reference)

High-Scoring Mindset:
If authentication or web services fail, DNS is often the root cause. Certificates are common causes of production outages.

A) Common DNS Record Types

B) DNS Troubleshooting

nslookup hostname
Resolve-DnsName hostname   # PowerShell
dig hostname               # Linux (if available)

C) AD & DNS Relationship

D) Certificates (TLS / SSL)

E) Common Certificate Issues

F) Quick Certificate Checks

# Windows (PowerShell)
Get-ChildItem Cert:\LocalMachine\My

# Linux
openssl s_client -connect site:443

G) If Website Shows Security Warning

1) Check certificate expiration.
2) Verify CN/SAN matches URL.
3) Validate certificate chain.
4) Confirm service binding to correct cert.
5) Restart service after fix.
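
Step 1 (expiration) can be checked from the shell. A sketch that assumes GNU date (for -d) and the notAfter=... line printed by openssl x509 -noout -enddate; days_until_expiry is a hypothetical helper, and its optional second argument ("now" in epoch seconds) is there only to make the result repeatable.

```shell
# Hypothetical helper: turn "notAfter=<date>" into whole days remaining.
days_until_expiry() {
  local enddate="${1#notAfter=}"       # strip the "notAfter=" prefix if present
  local now="${2:-$(date +%s)}"
  local expiry
  expiry=$(date -d "$enddate" +%s) || return 1
  echo $(( (expiry - now) / 86400 ))
}

# Usage on a live host ("site" is a placeholder, as above):
#   end=$(openssl s_client -connect site:443 -servername site </dev/null 2>/dev/null \
#         | openssl x509 -noout -enddate)
#   days_until_expiry "$end"
```

A negative result means the certificate has already expired.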