Key Takeaways
- Infrastructure as Code is now table stakes, not optional
- GitOps has replaced manual deployment workflows
- Observability goes beyond monitoring: it tells you why, not just what
- Serverless and edge computing are changing what's possible
- Security must be integrated throughout the DevOps pipeline
The Old DevOps: SSH and Manual Scripts
Remember when deployment meant:
- SSH into server
- `git pull origin main`
- `npm run build`
- Copy files to production
- `nginx -s reload`
- Pray nothing broke
This wasn’t just slow—it was fragile. Each deployment was a potential disaster. Rollback meant manually restoring old files. Scaling meant buying more servers.
The problem with manual infrastructure isn’t that it doesn’t work—it’s that it doesn’t scale with your team, your traffic, or your complexity.
Infrastructure as Code: Treat Servers Like Source Code
Infrastructure as Code (IaC) changed everything by bringing software development practices to infrastructure.
Why IaC Matters
- Version controlled: Every change has git history
- Reproducible: Environments are identical across dev, staging, prod
- Testable: Preview infrastructure changes before applying
- Automated: No more manual SSH sessions
Terraform: Declarative Infrastructure
```hcl
# main.tf
resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.micro"

  tags = {
    Name        = "WebServer"
    Environment = var.environment
  }
}

variable "environment" {
  description = "Environment name"
  type        = string
  default     = "production"
}
```
Terraform figures out dependencies, creates resources in parallel, and provides a plan showing exactly what will change.
Applying Changes Safely
```bash
# See what will change without applying
terraform plan

# Apply changes (interactive approval)
terraform apply

# Import existing resources into state
terraform import aws_instance.web i-1234567890abcdef0

# Destroy resources safely
terraform destroy
```
Always review terraform plan output before applying. One bad resource change can delete production databases.
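As an extra guardrail, Terraform's `lifecycle` block can refuse to destroy critical resources outright. A minimal sketch, assuming a database resource like this exists in your configuration:

```hcl
resource "aws_db_instance" "main" {
  # ... engine, instance_class, and other settings ...

  lifecycle {
    # Any plan that would destroy this resource fails with an error
    # instead of silently queueing a deletion
    prevent_destroy = true
  }
}
```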
Containerization: Docker and Beyond
Containers standardized how we package and deploy applications.
Docker: The Universal Runtime
```dockerfile
# Dockerfile
FROM node:22-alpine
WORKDIR /app

# Install all dependencies first (cached in its own layer)
COPY package*.json ./
RUN npm ci

# Copy source and build (the build step typically needs devDependencies)
COPY . .
RUN npm run build

# Drop dev dependencies from the final image
RUN npm prune --omit=dev

# Expose port
EXPOSE 3000

# Run as non-root for security
USER node
CMD ["node", "dist/server.js"]
```
Benefits:
- Consistent environments: “It works on my machine” disappears
- Rapid scaling: Spin up new instances in seconds
- Isolation: Process crashes don’t affect other services
- Resource limits: Prevent misbehaving containers from consuming all memory
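That last point is enforced at run time with resource flags. A sketch, using a hypothetical image name:

```shell
# Cap the container at 512 MB of RAM and half a CPU core;
# the runtime throttles or kills the container instead of
# letting it starve everything else on the host
docker run -d \
  --name web \
  --memory=512m \
  --cpus=0.5 \
  -p 3000:3000 \
  web-app:1.0.0
```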
Kubernetes: Orchestrating at Scale
For production workloads, Kubernetes provides:
```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web-app
          image: your-registry/web-app:1.0.0
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
```
Kubernetes handles:
- Self-healing: Restart crashed containers
- Scaling: Add/remove replicas based on load
- Rolling updates: Zero-downtime deployments
- Service discovery: Automatically load balance traffic
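Day to day, those capabilities are a `kubectl` command away (deployment name from the manifest above):

```shell
# Scale manually (or let a HorizontalPodAutoscaler do it)
kubectl scale deployment web-app --replicas=5

# Watch a rolling update complete
kubectl rollout status deployment/web-app

# Roll back a bad release to the previous revision
kubectl rollout undo deployment/web-app
```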
CI/CD: GitOps Replaces Manual Deployments
GitOps treats git as the single source of truth. When you merge to main, infrastructure changes automatically.
GitHub Actions: Integrated CI
```yaml
# .github/workflows/deploy.yml
name: Deploy to Production

on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '22'
      - name: Install dependencies
        run: npm ci
      - name: Run tests
        run: npm test -- --coverage
      - name: Build
        run: npm run build
      - name: Upload artifact
        uses: actions/upload-artifact@v4
        with:
          name: dist
          path: dist/

  deploy:
    needs: test
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Download artifact
        uses: actions/download-artifact@v4
        with:
          name: dist
      - name: Deploy to Vercel
        uses: amondnet/vercel-action@v25
        with:
          vercel-token: ${{ secrets.VERCEL_TOKEN }}
          vercel-org-id: ${{ secrets.VERCEL_ORG_ID }}
          vercel-project-id: ${{ secrets.VERCEL_PROJECT_ID }}
          vercel-args: '--prod'
```
GitOps means your git history becomes your deployment history. Rollback is a `git revert` away. No more "which version was deployed?" panic.
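A rollback under GitOps is just another commit:

```shell
# Find the bad deployment commit
git log --oneline -5

# Revert it; CI redeploys the previous state automatically
git revert <bad-commit-sha>
git push origin main
```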
Monitoring and Observability: Understanding Your Systems
Monitoring tells you something is broken. Observability tells you why.
Metrics: What’s Happening
```javascript
// Custom metrics with prom-client (the Prometheus client for Node.js);
// `app` is an existing Express app
import { Histogram } from 'prom-client';

const requestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.1, 0.5, 1, 2.5, 5, 10],
});

// Middleware: record the duration of every request on response finish
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    requestDuration.observe({
      method: req.method,
      route: req.path,
      status: res.statusCode,
    }, duration);
  });
  next();
});
```
Track:
- Request rates: Traffic patterns, spikes, anomalies
- Latency: P50, P95, P99 response times
- Error rates: 4xx, 5xx status codes
- Resource usage: CPU, memory, disk, network
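With the histogram above, the latency percentiles come straight out of PromQL. For example, P95 per route over the last five minutes:

```promql
histogram_quantile(
  0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route)
)
```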
Logging: What Happened
Structured logs are queryable, not just readable:
```json
{
  "timestamp": "2026-02-10T15:30:00Z",
  "level": "error",
  "service": "api",
  "trace_id": "abc123def456",
  "user_id": "user-123",
  "error": {
    "type": "ValidationError",
    "message": "Invalid email format",
    "code": "ERR_001"
  },
  "context": {
    "path": "/api/users",
    "method": "POST"
  }
}
```
Query logs: level:error AND service:api | timestamp > 24h ago
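A structured logger needs nothing exotic to get started. A minimal sketch, assuming only Node.js itself (in production you'd likely reach for pino or winston):

```javascript
// Emits one JSON object per line so the log aggregator can index
// fields like level and trace_id instead of grepping strings
function makeLogger(service) {
  const emit = (level, message, fields = {}) => {
    const entry = {
      timestamp: new Date().toISOString(),
      level,
      service,
      message,
      ...fields,
    };
    console.log(JSON.stringify(entry));
    return entry; // returned so callers can inspect it
  };
  return {
    info: (message, fields) => emit('info', message, fields),
    error: (message, fields) => emit('error', message, fields),
  };
}

const log = makeLogger('api');
log.error('Invalid email format', {
  trace_id: 'abc123def456',
  context: { path: '/api/users', method: 'POST' },
});
```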
Distributed Tracing: Understanding Flow
```javascript
import { trace, SpanStatusCode } from '@opentelemetry/api';

// Spans started without an explicit context automatically become
// children of the currently active span
const tracer = trace.getTracer('payment-service');

async function processPayment(userId, amount) {
  const span = tracer.startSpan('processPayment', {
    attributes: {
      'user.id': userId,
      'payment.amount': amount,
    },
  });
  try {
    const result = await paymentGateway.charge(amount);
    span.addEvent('charge_processed');
    return result;
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw error;
  } finally {
    span.end();
  }
}
```
See the entire request flow: load balancer → API → database → payment gateway → response.
Serverless and Edge: The New Paradigm
Not everything needs a server.
AWS Lambda: Pay-Per-Use Compute
// handler.js
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
const dynamo = new DynamoDBClient({ region: 'us-east-1' });
export const handler = async (event) => {
const { userId } = event.pathParameters;
const response = await dynamo.send(new GetItemCommand({
TableName: 'Users',
Key: { userId: { S: userId } },
}));
return {
statusCode: 200,
body: JSON.stringify(response.Item),
};
};
No servers to manage. Pay only for actual execution time.
Cloudflare Workers: Edge Computing
Run code at the edge, closer to users:
```javascript
// worker.js
export default {
  async fetch(request, env, ctx) {
    // Serve from the edge cache when possible
    const cache = caches.default;
    const cached = await cache.match(request);
    if (cached) {
      const hit = new Response(cached.body, cached);
      hit.headers.set('X-Cache', 'HIT');
      return hit;
    }

    // Fetch from origin and transform at the edge
    // (transformResponse is your own function)
    const response = await fetch(request);
    const transformed = transformResponse(response);

    // Cache a clone for future requests without delaying the response
    ctx.waitUntil(cache.put(request, transformed.clone()));
    return transformed;
  },
};
```
Edge computing enables:
- Low global latency: Content cached in 300+ locations worldwide
- Dynamic content: Modify responses at edge (A/B testing, authentication)
- DDoS protection: Absorb attacks before reaching origin
Security: DevSecOps
Security isn’t a phase—it’s built into everything.
Secrets Management
Never commit secrets:
```yaml
# .github/workflows/deploy.yml
- name: Deploy
  env:
    # ❌ BAD: Committed to repo
    # DATABASE_URL: postgresql://user:pass@host/db

    # ✅ GOOD: From GitHub Secrets
    DATABASE_URL: ${{ secrets.DATABASE_URL }}
```
Use secret managers:
- GitHub Secrets: For CI/CD
- AWS Secrets Manager: For AWS resources
- Vault: For self-hosted secrets
- 1Password Connect: For development teams
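A simple complement in application code is to fail fast at startup when an injected secret is missing, rather than crashing mid-request. A sketch (variable names are illustrative):

```javascript
// Reads from process.env, which CI/CD injects from the secret store
function requireEnv(name) {
  const value = process.env[name];
  if (value === undefined || value === '') {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Resolve all required config in one place, before serving traffic
function loadConfig() {
  return {
    databaseUrl: requireEnv('DATABASE_URL'),
  };
}
```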
Dependency Scanning
Automated in CI:
```yaml
# .github/workflows/security.yml
name: Security Scan

on: [push, pull_request]

jobs:
  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Trivy vulnerability scanner
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: 'fs'
          scan-ref: '.'
          format: 'sarif'
          output: 'trivy-results.sarif'
      - name: Upload SARIF file
        uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: 'trivy-results.sarif'
```
Block merges with critical vulnerabilities.
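Trivy can enforce that gate itself by failing the build. These flags are real; the image name is a placeholder:

```shell
# Exit non-zero when CRITICAL or HIGH findings exist,
# which fails the CI job and blocks the merge
trivy image --exit-code 1 --severity CRITICAL,HIGH your-registry/web-app:1.0.0
```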
Infrastructure Scanning
Scan for misconfigurations:
```bash
# tfsec
terraform fmt -check
terraform validate
tfsec .

# Checkov
checkov -d .
```
Compliance is easier to build in than bolt on later. Use HashiCorp Sentinel or Open Policy Agent (OPA) for policy-as-code.
Cost Optimization: Don’t Overpay
Cloud costs spiral without visibility.
Rightsizing Resources
```hcl
# Before: Overprovisioned
resource "aws_instance" "web" {
  instance_type = "m5.large" # 2 vCPU, 8 GB RAM
  # Actual usage: 5% CPU, 1 GB RAM
}

# After: Rightsized
resource "aws_instance" "web" {
  instance_type = "t3.small" # 2 vCPU, 2 GB RAM
  # Fits usage, saves ~$50/month
}
```
Auto-Scale Policies
```yaml
# CloudFormation: target-tracking scaling policy for the ASG
WebScalingPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AutoScalingGroupName: production-web
    PolicyType: TargetTrackingScaling
    TargetTrackingConfiguration:
      PredefinedMetricSpecification:
        PredefinedMetricType: ASGAverageCPUUtilization
      TargetValue: 50.0
```
Scale up during peak, down during quiet.
Spot Instances
For fault-tolerant workloads:
```hcl
resource "aws_instance" "worker" {
  # ... ami and instance_type as usual ...

  instance_market_options {
    market_type = "spot"
    spot_options {
      max_price          = "0.05" # Bid $0.05/hour
      spot_instance_type = "one-time"
    }
  }
}
```
Spot instances cost 70-90% less than on-demand.
The Modern DevOps Stack
Infrastructure: Terraform (or Pulumi)
Containers: Docker (or Podman for security)
Orchestration: Kubernetes (or Nomad for simplicity)
CI/CD: GitHub Actions (or GitLab CI)
Monitoring: Prometheus + Grafana
Logging: Loki + Grafana (or Datadog for simplicity)
Tracing: Jaeger (or Honeycomb)
Secrets: AWS Secrets Manager
Security: Trivy + OWASP ZAP
Scanning: SonarQube for code quality
This stack is cloud-agnostic, battle-tested, and well-documented.
Culture: It’s Not Just Tools
DevOps isn’t about tools—it’s about culture:
- Blameless postmortems: Learning from failure without punishment
- Shared responsibility: Devs participate in on-call, Ops in code reviews
- Documentation: Runbooks for every service
- Automation: If you do it twice, automate it
- Measurement: You can’t improve what you don’t measure
Conclusion
DevOps in 2026 is mature. The best practices are clear:
Treat infrastructure as code. Automate everything. Monitor comprehensively. Plan for failures. Optimize continuously. Build blameless culture.
The old days of SSH and manual deployments are over. The modern stack is faster, more reliable, and scales with demand instead of headcount.
Your job isn’t keeping servers running—it’s building systems that run themselves.
What will you automate?