Troubleshooting
Systematic troubleshooting patterns for instance reachability, IAM permission failures, networking misconfiguration, and scaling issues.
Case 1: Instance Not Reachable
- Check EC2 status checks.
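- Verify that the security group allows inbound traffic on the ports you need (22 for SSH, 80 for HTTP, 443 for HTTPS) from your source address.
- For a public instance, confirm that the subnet's route table has a default route (0.0.0.0/0) to an internet gateway (IGW) and that the instance has a public or Elastic IP address.
- Validate the key pair and the login user name for SSH (for example, ec2-user on Amazon Linux or ubuntu on Ubuntu).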
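The security group and route table checks above can be sketched offline against data shaped like the output of `aws ec2 describe-security-groups` (`IpPermissions`) and `aws ec2 describe-route-tables` (`Routes`). This is a minimal sketch, not a full evaluator: the function names and sample values are hypothetical, and it ignores rules that reference other security groups, prefix lists, and IPv6 ranges.

```python
import ipaddress


def sg_allows(ip_permissions, port, source_ip, protocol="tcp"):
    """Return True if any inbound IPv4 CIDR rule admits source_ip on port."""
    source = ipaddress.ip_address(source_ip)
    for rule in ip_permissions:
        rule_protocol = rule.get("IpProtocol")
        if rule_protocol == "-1":
            port_ok = True  # "-1" means all protocols and all ports
        elif rule_protocol == protocol:
            port_ok = rule.get("FromPort", 0) <= port <= rule.get("ToPort", 65535)
        else:
            continue
        if not port_ok:
            continue
        for ip_range in rule.get("IpRanges", []):
            if source in ipaddress.ip_network(ip_range["CidrIp"], strict=False):
                return True
    return False


def has_igw_default_route(routes):
    """Return True if a 0.0.0.0/0 route targets an internet gateway."""
    return any(
        route.get("DestinationCidrBlock") == "0.0.0.0/0"
        and (route.get("GatewayId") or "").startswith("igw-")
        for route in routes
    )


# Hypothetical sample data for illustration only.
sample_permissions = [
    {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
     "IpRanges": [{"CidrIp": "203.0.113.0/24"}]},
    {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
     "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
]
sample_routes = [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0abc1234"},
]

print(sg_allows(sample_permissions, 22, "203.0.113.10"))  # inside the SSH range
print(sg_allows(sample_permissions, 22, "198.51.100.7"))  # outside the SSH range
print(has_igw_default_route(sample_routes))
```

A False result for your own address on port 22 usually points at the security group rather than at the instance itself.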
Case 2: Permission Issues
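- Inspect the CloudTrail event for the denied action; its error code and message identify the call that was rejected.
- Use the IAM policy simulator to test that action against the attached policies.
- Look for an explicit deny, or a missing allow, in service control policies (SCPs), permissions boundaries, and resource policies such as bucket policies.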
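The precedence behind these checks (an explicit deny beats any allow, and no match at all is an implicit deny) can be illustrated with a deliberately simplified evaluator. This is a sketch under stated assumptions, not IAM's actual evaluation engine: it pools all statements together, ignores `Condition`, `Principal`, `NotAction`, and `NotResource`, and does not model the fact that SCPs and permissions boundaries must each also allow the action. The function names and sample policies are hypothetical.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    """IAM allows a single string or a list for Action and Resource."""
    return [value] if isinstance(value, str) else list(value)


def _matches(patterns, value, ignore_case=False):
    """Wildcard match; fnmatch approximates IAM's '*' and '?' wildcards."""
    if ignore_case:
        value = value.lower()
        patterns = [p.lower() for p in patterns]
    return any(fnmatchcase(value, p) for p in patterns)


def evaluate(statements, action, resource):
    """Return 'ExplicitDeny', 'Allow', or 'ImplicitDeny' for one request."""
    decision = "ImplicitDeny"  # nothing matched yet
    for stmt in statements:
        # Action names are case-insensitive; resource ARNs are not.
        if not _matches(_as_list(stmt["Action"]), action, ignore_case=True):
            continue
        if not _matches(_as_list(stmt["Resource"]), resource):
            continue
        if stmt["Effect"] == "Deny":
            return "ExplicitDeny"  # a matching deny always wins
        decision = "Allow"
    return decision


# Hypothetical statements: a broad identity allow plus a bucket-policy deny.
sample_statements = [
    {"Effect": "Allow", "Action": "s3:*", "Resource": "*"},
    {"Effect": "Deny", "Action": ["s3:DeleteObject"],
     "Resource": "arn:aws:s3:::example-bucket/*"},
]

for requested in ("s3:GetObject", "s3:DeleteObject", "ec2:StartInstances"):
    result = evaluate(sample_statements, requested,
                      "arn:aws:s3:::example-bucket/report.csv")
    print(requested, "->", result)
```

The deny statement wins even though it is listed after the allow, which is why reading only the identity policy can be misleading.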
Case 3: Networking Issues
- Confirm subnet association and route table entries.
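- Check security groups (SGs) and network ACLs (NACLs) together: security groups are stateful, but NACLs are stateless, so return traffic on ephemeral ports must be allowed explicitly.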
- Review DNS resolution and Route 53 records.
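Because NACL rules are evaluated in ascending rule-number order and the first match decides, a missing outbound rule can drop replies even when the inbound rule looks correct. The sketch below walks entries shaped like the `Entries` list from `aws ec2 describe-network-acls`. It is an illustration under assumptions: the function name and sample entries are hypothetical, only IPv4 entries are considered, and ICMP type/code matching is omitted.

```python
import ipaddress


def nacl_decision(entries, port, remote_ip, egress=False, protocol="6"):
    """Return 'allow' or 'deny' from the lowest-numbered matching entry.

    protocol uses IP protocol numbers as strings ("6" is TCP, "-1" is all).
    """
    remote = ipaddress.ip_address(remote_ip)
    candidates = sorted(
        (e for e in entries if e["Egress"] == egress),
        key=lambda e: e["RuleNumber"],
    )
    for entry in candidates:
        if entry["Protocol"] not in ("-1", protocol):
            continue
        if "CidrBlock" not in entry:
            continue  # IPv6-only entry, skipped in this sketch
        if remote not in ipaddress.ip_network(entry["CidrBlock"], strict=False):
            continue
        port_range = entry.get("PortRange")
        if port_range and not (port_range["From"] <= port <= port_range["To"]):
            continue
        return entry["RuleAction"]  # first match wins
    return "deny"  # nothing matched


# Hypothetical NACL: HTTPS allowed in, but no outbound rule for ephemeral ports.
sample_entries = [
    {"RuleNumber": 100, "Protocol": "6", "RuleAction": "allow", "Egress": False,
     "CidrBlock": "0.0.0.0/0", "PortRange": {"From": 443, "To": 443}},
    {"RuleNumber": 32767, "Protocol": "-1", "RuleAction": "deny", "Egress": False,
     "CidrBlock": "0.0.0.0/0"},
    {"RuleNumber": 100, "Protocol": "6", "RuleAction": "allow", "Egress": True,
     "CidrBlock": "0.0.0.0/0", "PortRange": {"From": 443, "To": 443}},
    {"RuleNumber": 32767, "Protocol": "-1", "RuleAction": "deny", "Egress": True,
     "CidrBlock": "0.0.0.0/0"},
]

# Client 198.51.100.7 connects to port 443; the reply goes to its port 50000.
print("inbound request:", nacl_decision(sample_entries, 443, "198.51.100.7"))
print("outbound reply:", nacl_decision(sample_entries, 50000, "198.51.100.7", egress=True))
```

In this sample the request is admitted but the reply is denied, a pattern that a security group check alone would not reveal.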
Case 4: Scaling Issues
- Check CloudWatch alarm state.
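- Validate the Auto Scaling group (ASG) minimum, maximum, and desired capacity; a group already at its maximum cannot scale out.
- Investigate launch failures caused by the launch template, and health checks that flap between healthy and unhealthy.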
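The capacity checks above can be sketched against a single group record shaped like an item from `aws autoscaling describe-auto-scaling-groups`. This is a hedged illustration: the function name, the wording of the findings, and the sample group are assumptions made for the example.

```python
def check_asg_limits(group):
    """Return human-readable findings about an ASG's capacity settings."""
    findings = []
    minimum = group["MinSize"]
    maximum = group["MaxSize"]
    desired = group["DesiredCapacity"]

    if minimum > maximum:
        findings.append("MinSize exceeds MaxSize")
    if desired >= maximum:
        findings.append("DesiredCapacity has reached MaxSize; scale-out is blocked")
    if desired <= minimum:
        findings.append("DesiredCapacity is at MinSize; scale-in is blocked")

    # Instances still launching or terminating do not serve traffic yet.
    in_service = sum(
        1 for instance in group.get("Instances", [])
        if instance.get("LifecycleState") == "InService"
    )
    if in_service < desired:
        findings.append(f"only {in_service} of {desired} desired instances are InService")
    return findings


# Hypothetical group record for illustration only.
sample_group = {
    "AutoScalingGroupName": "web-asg",
    "MinSize": 2,
    "MaxSize": 4,
    "DesiredCapacity": 4,
    "Instances": [
        {"InstanceId": "i-0aaa", "LifecycleState": "InService"},
        {"InstanceId": "i-0bbb", "LifecycleState": "InService"},
        {"InstanceId": "i-0ccc", "LifecycleState": "InService"},
        {"InstanceId": "i-0ddd", "LifecycleState": "Pending"},
    ],
}

for finding in check_asg_limits(sample_group):
    print("-", finding)
```

If an alarm is firing but the group is already at its maximum, raising the limit (or fixing the load) is the remedy, not the alarm configuration.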
Hands-on Runbook
# Quick diagnostics
aws ec2 describe-instance-status --instance-ids i-1234567890abcdef0
aws autoscaling describe-scaling-activities --auto-scaling-group-name web-asg --max-items 10
aws cloudtrail lookup-events --lookup-attributes AttributeKey=EventName,AttributeValue=AuthorizeSecurityGroupIngress --max-results 5
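The JSON returned by `aws cloudtrail lookup-events` can be filtered for denied calls, since each item carries the full record as a JSON string in its `CloudTrailEvent` field. The sketch below is one possible filter, assuming that denied calls carry an `errorCode` containing `AccessDenied` or `UnauthorizedOperation`; the function name and the sample payload are hypothetical.

```python
import json

# Substrings assumed to mark a permission failure in errorCode.
DENIED_MARKERS = ("AccessDenied", "UnauthorizedOperation")


def denied_events(lookup_output):
    """Return a short summary for each event whose errorCode looks like a denial."""
    denied = []
    for event in lookup_output.get("Events", []):
        detail = json.loads(event["CloudTrailEvent"])  # nested JSON string
        code = detail.get("errorCode", "")
        if any(marker in code for marker in DENIED_MARKERS):
            denied.append({
                "event": detail.get("eventName"),
                "code": code,
                "principal": detail.get("userIdentity", {}).get("arn"),
            })
    return denied


# Hypothetical lookup-events output for illustration only.
sample_output = {
    "Events": [
        {"EventName": "AuthorizeSecurityGroupIngress",
         "CloudTrailEvent": json.dumps({
             "eventName": "AuthorizeSecurityGroupIngress",
             "errorCode": "Client.UnauthorizedOperation",
             "userIdentity": {"arn": "arn:aws:iam::123456789012:user/example"},
         })},
        {"EventName": "AuthorizeSecurityGroupIngress",
         "CloudTrailEvent": json.dumps({
             "eventName": "AuthorizeSecurityGroupIngress",
             "userIdentity": {"arn": "arn:aws:iam::123456789012:user/admin"},
         })},
    ]
}

for item in denied_events(sample_output):
    print(item["principal"], "was denied", item["event"], f"({item['code']})")
```

Saving the CLI output with `--output json` and loading it with `json.load` would feed this function real data.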
Interview Questions
Beginner: What do you check first when an EC2 instance is unreachable?
The security group rules, the network path (subnet route table plus IGW or NAT), and the instance status checks.
Intermediate: Why can a request still fail when a policy allows it?
Because an explicit deny in any other applicable policy overrides the allow.
Scenario: ASG adds instances but traffic still fails. What next?
Check target group health checks, app startup completion, and listener rules.
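For the scenario answer, the target group check can be sketched against a response shaped like the output of `aws elbv2 describe-target-health`. This is an illustrative helper only; the function name and sample targets are hypothetical.

```python
from collections import Counter


def summarize_target_health(response):
    """Count targets per state and list the ones that are not healthy."""
    states = Counter()
    problems = []
    for description in response["TargetHealthDescriptions"]:
        health = description["TargetHealth"]
        states[health["State"]] += 1
        if health["State"] != "healthy":
            problems.append(
                (description["Target"]["Id"], health["State"], health.get("Reason", ""))
            )
    return dict(states), problems


# Hypothetical response for illustration only.
sample_response = {
    "TargetHealthDescriptions": [
        {"Target": {"Id": "i-0aaa", "Port": 80},
         "TargetHealth": {"State": "healthy"}},
        {"Target": {"Id": "i-0bbb", "Port": 80},
         "TargetHealth": {"State": "unhealthy", "Reason": "Target.FailedHealthChecks"}},
        {"Target": {"Id": "i-0ccc", "Port": 80},
         "TargetHealth": {"State": "initial", "Reason": "Elb.InitialHealthChecking"}},
    ]
}

counts, problems = summarize_target_health(sample_response)
print(counts)
for target_id, state, reason in problems:
    print(target_id, state, reason)
```

Targets stuck in `initial` suggest the application has not finished starting, while `unhealthy` targets point at the health check path, port, or expected response.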
Summary
- Troubleshoot by layer: identity, network, compute, app.
- Use logs and service events to avoid guesswork.
- Suspect configuration drift before a platform outage; most incidents trace back to a changed setting.