Logging captures everything—do not log passwords, API keys, tokens, or PII. Redact sensitive fields before logging. If a secret leaks into logs, consider it compromised and rotate immediately.
Logging and Error Handling
Build production-ready scripts with structured logging, comprehensive error handling, stack traces, and debugging insights—move beyond print() to professional-grade observability.
🧒 Simple Explanation (ELI5)
Imagine printing everything to the terminal with `print()`. In production, scripts run without a terminal—output vanishes. Logging writes to files, timestamps everything, and lets you choose what to record (only errors? or debug details?). Error handling is like saying "if something breaks, do not crash the whole script—catch the error, log it, and decide what to do next."
🔧 Why Do We Need Logging and Error Handling?
- Debugging: figure out what went wrong hours after it happened by reading logs.
- Monitoring: alert ops when errors occur by watching log messages.
- Production readiness: scripts without logging are blind in production—you only find out when everything is on fire.
- Audit trail: who deployed what, when, and what was the outcome?
- Resilience: catch errors, retry operations, then continue instead of crashing.
⚙️ Technical Explanation
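The logging module provides loggers that emit records at five standard levels (DEBUG, INFO, WARNING, ERROR, CRITICAL). Handlers (file, console, syslog) route those records to destinations, and each handler can apply its own level and format. For error handling, try/except catches specific exceptions, finally always runs, a bare raise re-raises the current exception, and context managers ensure cleanup.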
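The order in which try, except, finally, and a bare raise execute is easier to see when each clause records itself. The following is a minimal sketch; `run_step` and the logger name `flow_demo` are hypothetical names chosen for illustration.

```python
import logging

logger = logging.getLogger("flow_demo")  # hypothetical logger name


def run_step(action, events):
    """Run `action`, recording the order in which each clause executes."""
    try:
        events.append("try")
        return action()
    except ValueError as exc:
        # Handle only the error type we expect, log it, then re-raise it unchanged
        events.append("except")
        logger.error("Step failed: %s", exc)
        raise
    finally:
        # Runs on success, on failure, and on early return
        events.append("finally")


ok_events = []
run_step(lambda: 42, ok_events)

bad_events = []
try:
    run_step(lambda: int("not a number"), bad_events)
except ValueError:
    bad_events.append("caller saw ValueError")

print(ok_events)   # ['try', 'finally']
print(bad_events)  # ['try', 'except', 'finally', 'caller saw ValueError']
```

Note that finally runs before the re-raised exception reaches the caller, which is why cleanup placed there is reliable.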
⌨️ Logging and Error Handling Patterns
import logging
import sys
from datetime import datetime
from pathlib import Path
# ===== BASIC LOGGING SETUP =====
# Get logger for your module
logger = logging.getLogger(__name__)
# Set logging level
logger.setLevel(logging.DEBUG)
# Create handler (write to file)
log_dir = Path("/var/log/myapp")  # writing under /var/log usually requires elevated permissions
log_dir.mkdir(parents=True, exist_ok=True)  # parents=True also creates missing parent directories
file_handler = logging.FileHandler(log_dir / "app.log")
file_handler.setLevel(logging.DEBUG)
# Create formatter
formatter = logging.Formatter(
    fmt="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S"
)
file_handler.setFormatter(formatter)
# Add handler to logger
logger.addHandler(file_handler)
# Also log to console
console_handler = logging.StreamHandler(sys.stdout)
console_handler.setLevel(logging.INFO) # Only INFO and above to console
console_handler.setFormatter(formatter)
logger.addHandler(console_handler)
# ===== LOGGING AT DIFFERENT LEVELS =====
logger.debug("Detailed diagnostic info (not shown by default)")
logger.info("General informational message")
logger.warning("Something unexpected but not critical")
logger.error("Serious error, but script continues")
logger.critical("Very serious error, may crash")
# ===== LOGGING EXCEPTIONS =====
try:
    result = 10 / 0
except ZeroDivisionError:
    logger.error("Division by zero occurred", exc_info=True)  # exc_info includes full traceback
    # Or use exception(), which logs at ERROR level and auto-includes exc_info
    logger.exception("Division failed")
# ===== ERROR HANDLING: TRY/EXCEPT =====
def deploy_app(version):
    """Deploy app version, handling errors gracefully."""
    try:
        logger.info(f"Deploying app version {version}")
        # Deployment steps
        # Maybe this fails...
        if not version:
            raise ValueError("Version cannot be empty")
        logger.info(f"Successfully deployed {version}")
        return True
    except ValueError as e:
        logger.error(f"Invalid version: {e}")
        return False
    except Exception as e:
        logger.critical(f"Unexpected error during deployment: {e}", exc_info=True)
        return False
    finally:
        # Always runs, even if an exception occurred or the function already returned
        logger.info("Deployment attempt completed (success or failure)")
# ===== CONTEXT MANAGERS: GUARANTEE CLEANUP =====
class DatabaseConnection:
    def __init__(self, host):
        self.host = host
        self.connection = None

    def __enter__(self):
        logger.info(f"Connecting to {self.host}")
        self.connection = f"connection-to-{self.host}"  # fake connection
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        logger.info(f"Closing connection to {self.host}")
        self.connection = None
        if exc_type is not None:
            logger.error(f"Error in context: {exc_val}", exc_info=True)
        # Return False to propagate exceptions, True to suppress
        return False

# Use context manager
try:
    with DatabaseConnection("db.example.com") as db:
        logger.info("Executing queries...")
        # If error here, __exit__ still runs
except Exception as e:
    logger.error(f"Database operation failed: {e}")
# ===== RETRIES WITH LOGGING =====
def api_call_with_logging(url, max_retries=3):
    """Call API with logging for each attempt."""
    import time
    for attempt in range(1, max_retries + 1):
        try:
            logger.info(f"Attempt {attempt}/{max_retries}: Calling {url}")
            # Simulate API call: fails on attempts 1 and 2, succeeds on attempt 3
            response_code = 500 if attempt < 3 else 200
            if response_code != 200:
                raise Exception(f"Server error: {response_code}")
            logger.info(f"Success on attempt {attempt}")
            return "Success"
        except Exception as e:
            if attempt < max_retries:
                wait_time = 2 ** (attempt - 1)  # exponential backoff: 1s, 2s, 4s, ...
                logger.warning(f"Attempt {attempt} failed: {e}. Retrying in {wait_time}s...")
                time.sleep(wait_time)
            else:
                logger.error(f"All {max_retries} attempts failed. Last error: {e}")
                raise
# ===== CUSTOM EXCEPTIONS =====
class DeploymentError(Exception):
    """Raised when deployment fails."""
    pass

class ConfigurationError(Exception):
    """Raised when configuration is invalid."""
    pass

def validate_config(config):
    """Validate config, raise ConfigurationError if invalid."""
    if not config.get("app_name"):
        logger.error("Config missing 'app_name'")
        raise ConfigurationError("Missing required field: app_name")
    # Log only the key names: config values may contain secrets
    logger.debug(f"Config validated, keys: {sorted(config)}")
# ===== STRUCTURED LOGGING (JSON format) =====
import json

def log_structured(message, level="INFO", **context):
    """Log structured data as JSON (for log aggregation systems)."""
    log_entry = {
        "timestamp": datetime.now().isoformat(),
        "level": level,
        "message": message,
        **context
    }
    # Emit at the requested level; fall back to INFO for unknown level names
    log_level = getattr(logging, level.upper(), None)
    if not isinstance(log_level, int):
        log_level = logging.INFO
    logger.log(log_level, json.dumps(log_entry))

# Usage
log_structured(
    "Deployment started",
    level="INFO",
    version="2.0",
    environment="production",
    user="ci-system"
)
# ===== LOGGING FUNCTION ENTRY/EXIT =====
def process_data(data):
    """Process data with entry/exit logging."""
    logger.debug(f"Processing data: {type(data)}, length: {len(data)}")
    try:
        result = len(data) > 0
        logger.debug(f"Processing complete. Result: {result}")
        return result
    except Exception as e:
        logger.error(f"Processing failed: {e}", exc_info=True)
        raise
# ===== REAL-WORLD EXAMPLE: PRODUCTION DEPLOYMENT SCRIPT =====
def deploy_to_production(app_name, version, replicas=3):
    """
    Full deployment workflow with comprehensive logging and error handling.
    """
    import time  # needed for the simulated wait in step 3
    logger.info(f"Starting deployment: {app_name}:{version}")
    logger.debug(f"Parameters: replicas={replicas}")
    try:
        # Step 1: Validate
        logger.info("Validating configuration...")
        if not app_name or not version:
            raise ValueError("app_name and version are required")
        logger.info("Configuration validated")
        # Step 2: Scale
        logger.info(f"Scaling {app_name} to {replicas} replicas...")
        # kubectl scale...
        logger.info("Scaled successfully")
        # Step 3: Monitor
        logger.info("Waiting for deployment to stabilize...")
        time.sleep(5)  # fake wait
        logger.info("Deployment stabilized")
        logger.info(f"✓ Deployment complete: {app_name}:{version}")
        return True
    except ValueError as e:
        # Expected, recoverable failure: report it and return False
        logger.error(f"✗ Validation failed: {e}")
        return False
    except Exception as e:
        # Unexpected failure: log the traceback and let the caller decide
        logger.critical(f"✗ Deployment failed unexpectedly: {e}", exc_info=True)
        raise
    finally:
        logger.debug("Deployment workflow finished")
# ===== AVOID LOGGING SECRETS =====
def bad_logging(api_key):
    """DON'T DO THIS."""
    logger.info(f"Using API key: {api_key}")  # Secret exposed!

def good_logging(api_key):
    """DO THIS."""
    logger.info("Authenticating with API key")  # No secret logged
    # If you must identify which key was used, log only a short suffix, never the full value
    logger.debug(f"API key suffix: ...{api_key[-4:]}")  # Only last 4 chars
# ===== LOGS SHOULD BE PARSEABLE =====
# Bad: scattered info hard to search
logger.info("Starting")
logger.info("Done")
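# Good: structured info, searchable
# Note: `extra` fields become attributes on the LogRecord. They appear in the output
# only if the formatter references them (e.g. %(stage)s) or serializes the whole record.
logger.info("Deployment started", extra={"stage": "start", "app": "myapp"})
logger.info("Deployment finished", extra={"stage": "end", "app": "myapp", "status": "success"})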
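Fields passed through `extra=` are silently dropped by a plain-text formatter that does not name them. One way to make them searchable is a formatter that serializes every record to JSON. The sketch below assumes a hypothetical `JsonFormatter` class and logger name `json_demo`, and writes to an in-memory stream that stands in for a file or stdout.

```python
import io
import json
import logging

# Attribute names present on every LogRecord; anything else came from `extra`
_STANDARD_ATTRS = set(logging.LogRecord("", 0, "", 0, "", (), None).__dict__) | {"message", "asctime"}


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter: one JSON object per log line."""

    def format(self, record):
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Copy over fields that were passed via extra={...}
        for key, value in record.__dict__.items():
            if key not in _STANDARD_ATTRS:
                entry[key] = value
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry, default=str)


stream = io.StringIO()  # stands in for a file or stdout
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

json_logger = logging.getLogger("json_demo")
json_logger.setLevel(logging.INFO)
json_logger.propagate = False  # keep the demo output out of the root logger
json_logger.addHandler(handler)

json_logger.info("Deployment started", extra={"stage": "start", "app": "myapp"})
parsed = json.loads(stream.getvalue())
print(parsed["stage"], parsed["app"])
```

Because each line is a complete JSON object, an aggregation system can index `stage` and `app` as fields instead of matching on free text.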
💼 Example (Real-world Use Case)
A Kubernetes deployment script logs every step: "Starting deployment of myapp:v2.1 to prod with 5 replicas", "Rolled out 3 of 5 replicas", "Rolled out 5 of 5 replicas", "All health checks passed". Errors are logged with context ("MySQL connection timed out after 30s of retries"). The log trail becomes an audit record: who deployed what, when, what went wrong, and how it was resolved.
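An audit trail needs the "who" and "what" on every line, not only the first one. A possible approach is `logging.LoggerAdapter`, which attaches fixed context to each record. In this sketch the rollout is simulated, and the user name, app tag, and logger name are illustrative assumptions.

```python
import io
import logging

stream = io.StringIO()  # stands in for the deployment log file
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s user=%(user)s app=%(app)s %(message)s"))

base_logger = logging.getLogger("rollout_demo")
base_logger.setLevel(logging.INFO)
base_logger.propagate = False
base_logger.addHandler(handler)

# LoggerAdapter attaches the same context to every record it emits
audit_log = logging.LoggerAdapter(base_logger, {"user": "ci-system", "app": "myapp:v2.1"})


def rollout(total_replicas):
    """Simulated rollout; a real script would poll the cluster here."""
    audit_log.info("Starting deployment to prod with %d replicas", total_replicas)
    for ready in range(1, total_replicas + 1):
        audit_log.info("Rolled out %d of %d replicas", ready, total_replicas)
    audit_log.info("All health checks passed")


rollout(3)
lines = stream.getvalue().splitlines()
print(lines[0])
```

Every line carries `user=` and `app=`, so filtering the log by deployer or by application version needs no cross-referencing.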
🧪 Hands-on
- Set up logging to both console and file with different levels.
- Implement retry logic with logging for each attempt.
- Write a function that catches specific exceptions and logs at appropriate levels.
- Log entry/exit of functions showing parameters and results.
- Create structured (JSON) log entries for a hypothetical deployment.
Write a script that processes a list of items (files, deployments, etc.), logs progress, handles errors per item (continue processing others), and produces a final summary. Use DEBUG for details, INFO for progress, WARNING for recoverable issues, ERROR for failures.
🐛 Debugging Scenario
Problem: script is failing in production but you have no idea why—no error messages.
- Cause: script is not configured to log to a file (only uses print()), or log level is too high (only ERROR, missing INFO/DEBUG).
- Diagnose: check if log files exist in expected location. Add DEBUG logging at each major step. Check log level configuration.
- Fix: configure logging to write to a file, set the level to INFO, and wrap key operations in try/except with error logging. In production CI/CD, always log to files and aggregate the logs in a central system.
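The fix can be sketched with `logging.basicConfig`, which configures the root logger in one call. The helper name `configure_logging` and the log file name are assumptions for illustration, and the file lives in a temporary directory only so the sketch runs anywhere. In a real script the level name could come from a command-line flag or environment variable.

```python
import logging
import tempfile
from pathlib import Path


def configure_logging(log_file, level_name="INFO"):
    """Route root-logger output to `log_file` and stderr at the given level."""
    level = getattr(logging, level_name.upper(), None)
    if not isinstance(level, int):
        level = logging.INFO  # fall back for unknown level names
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
        handlers=[logging.FileHandler(log_file), logging.StreamHandler()],
        force=True,  # replace handlers configured earlier (Python 3.8+)
    )
    return level


log_path = Path(tempfile.mkdtemp()) / "script.log"
configure_logging(log_path, "INFO")

log = logging.getLogger("deploy")
log.debug("Connecting to cluster")  # dropped at INFO; kept if the level is DEBUG
log.info("Step 1: validating config")
try:
    {}["app_name"]  # simulated failure: missing config key
except KeyError:
    log.exception("Step 1 failed: missing config key")  # includes the traceback

logging.shutdown()  # flush and close handlers before reading the file
contents = log_path.read_text()
print(contents)
```

With this in place, a silent production failure leaves at least the last successful step and the traceback in the file.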
🎯 Interview Questions
Beginner
What is the difference between print() and logging?
print() writes plain text to stdout with no timestamps, levels, or routing. logging writes to configured destinations (file, syslog, remote server), adds timestamps, and lets you filter by level. print() is for interactive scripts; logging is for production, where there is often no terminal attached and the logs are your only window into what happened.
When should you catch an exception, and when should you let it propagate?
Use try/except when you can recover (retry, fallback, alert). Let exceptions propagate if the error is fatal and there is nothing to do. Log in both cases. Example: network timeout: catch, retry, then raise if all retries fail. Example: missing config file: raise immediately (no recovery possible).
What does finally do, and when should you use it?
finally always runs, even if an exception occurred or you returned early. Use it for cleanup: close files, close database connections, release locks. Context managers (__enter__/__exit__) are the modern way to guarantee cleanup, but understanding finally is important for legacy code.
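The split between recoverable and fatal errors can be expressed by catching only the exception type that is worth retrying. This is a minimal sketch: `call_with_retry`, `ConfigMissingError`, the simulated functions, and the file name in the error message are all hypothetical.

```python
import logging
import time

logger = logging.getLogger("retry_demo")


class ConfigMissingError(Exception):
    """Hypothetical fatal error: retrying cannot fix a missing config."""


def call_with_retry(operation, max_retries=3, base_delay=0.01):
    """Retry `operation` on TimeoutError; let every other exception propagate."""
    for attempt in range(1, max_retries + 1):
        try:
            return operation()
        except TimeoutError as exc:
            if attempt == max_retries:
                logger.error("Giving up after %d attempts: %s", max_retries, exc)
                raise
            delay = base_delay * 2 ** (attempt - 1)  # exponential backoff
            logger.warning("Attempt %d timed out, retrying in %.2fs", attempt, delay)
            time.sleep(delay)


calls = {"count": 0}


def flaky_fetch():
    """Simulated network call that times out twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("no response")
    return "payload"


def load_config():
    """Simulated fatal failure: not a TimeoutError, so it is never retried."""
    raise ConfigMissingError("config.yaml not found")


result = call_with_retry(flaky_fetch)
print(result, calls["count"])

try:
    call_with_retry(load_config)
except ConfigMissingError as exc:
    fatal_message = str(exc)
    print("fatal:", fatal_message)
```

Catching `TimeoutError` rather than `Exception` is the design choice that matters here: a broad except would retry errors that can never succeed and delay the real failure report.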
Scenario-based
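How would you call several APIs during a deployment so that one failure does not block the rest?
Log each API call. Wrap each call in its own try/except that catches specific exceptions, log the error for that API, and do not re-raise. Collect the results and log a summary at the end. Example: try call_api_1 and log a warning on failure, try call_api_2 and log a warning on failure, then log how many calls succeeded and how many failed. This prevents one failure from blocking the entire deployment.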
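The pattern described above might look like the following sketch. The three API functions are simulated stand-ins with hypothetical names, and the choice of `ConnectionError` and `TimeoutError` as the "specific exceptions" is an assumption.

```python
import logging

logger = logging.getLogger("multi_api_demo")


def notify_chat():
    """Simulated API call that succeeds."""
    return "sent"


def update_dashboard():
    """Simulated API call that fails."""
    raise ConnectionError("dashboard unreachable")


def record_audit_entry():
    """Simulated API call that succeeds."""
    return "recorded"


def run_all(api_calls):
    """Run every callable; log failures per call and keep going."""
    succeeded, failed = {}, {}
    for name, func in api_calls.items():
        try:
            logger.info("Calling %s", name)
            succeeded[name] = func()
        except (ConnectionError, TimeoutError) as exc:
            # Specific, expected failures only: anything else still propagates
            logger.warning("%s failed: %s", name, exc)
            failed[name] = str(exc)
    logger.info("Summary: %d succeeded, %d failed", len(succeeded), len(failed))
    return succeeded, failed


succeeded, failed = run_all({
    "notify_chat": notify_chat,
    "update_dashboard": update_dashboard,
    "record_audit_entry": record_audit_entry,
})
print(sorted(succeeded), sorted(failed))
```

Returning both dictionaries lets the caller decide whether any failure should fail the deployment as a whole.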
🌐 Real-world Usage
Every production application logs. Kubernetes collects container output as logs. Monitoring systems parse app logs for errors. Audit systems log all infrastructure changes. Log aggregation systems (ELK, Splunk, CloudWatch) centralize logs from thousands of services. Professional DevOps means professional logging.
📝 Summary
Use the logging module, not print(). Set the level to DEBUG for local dev and INFO for production. Log to files and centralize logs. Different exception types need different handling: distinguish between retryable errors (retry) and fatal errors (raise). Use try/except/finally for error handling and cleanup. Use context managers for guaranteed cleanup. Log important events (deployment start/end), errors (with stack trace), and decisions (why we chose option B). Never log secrets. Structured (JSON) logging enables log aggregation and alerting. Comprehensive logging + error handling = production-ready scripts.
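As a safety net for the "never log secrets" rule, sensitive fields can be redacted before a record is written. One possible sketch uses a `logging.Filter` attached to the handler. The class name `RedactSecretsFilter` and the regular expression are assumptions, and pattern matching of this kind catches only secrets that appear as recognizable key/value pairs, so it complements careful logging rather than replacing it.

```python
import io
import logging
import re


class RedactSecretsFilter(logging.Filter):
    """Hypothetical filter: mask values that look like credentials."""

    # Assumed pattern: key=value or key: value pairs with sensitive-looking key names
    PATTERN = re.compile(r"(?i)\b(password|api[_-]?key|token|secret)(\s*[=:]\s*)(\S+)")

    def filter(self, record):
        message = record.getMessage()  # merge msg and args before masking
        record.msg = self.PATTERN.sub(r"\1\2[REDACTED]", message)
        record.args = None  # args are already merged into msg
        return True  # keep the record, now sanitized


stream = io.StringIO()  # stands in for a log file
handler = logging.StreamHandler(stream)
handler.addFilter(RedactSecretsFilter())
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

safe_logger = logging.getLogger("redact_demo")
safe_logger.setLevel(logging.INFO)
safe_logger.propagate = False
safe_logger.addHandler(handler)

safe_logger.info("Login attempt user=%s password=%s", "alice", "hunter2")
safe_logger.info("Calling service with api_key: abc123XYZ")
output = stream.getvalue()
print(output)
```

Attaching the filter to the handler rather than to a single logger means records from any logger that reaches this handler pass through the same redaction step.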