# Advanced DevOps Course

Linux Shell, CI/CD, Docker, Kubernetes, AWS, and Python for Automation
## Table of Contents

- The DevOps philosophy and organizational culture
- Advanced Linux for DevOps
- Advanced shell scripting (Bash)
- Version control — advanced Git
- CI/CD — concepts and architectures
- Jenkins — declarative and scripted pipelines
- GitHub Actions and GitLab CI/CD
- Docker — advanced containerization
- Docker Compose and multi-container applications
- Kubernetes — orchestration at scale
- Kubernetes — advanced objects and operations
- Helm and Kubernetes package management
- AWS — core services for DevOps
- AWS — Infrastructure as Code (IaC) with Terraform
- Python for DevOps automation
- Monitoring, logging, and observability
- Security in the DevOps pipeline (DevSecOps)
- Capstone project: a complete pipeline from code to production
## 1. The DevOps Philosophy and Organizational Culture

### 1.1 What Is DevOps?

DevOps is a culture, a set of practices, and a toolchain that unifies software development (Dev) with infrastructure operations (Ops). The goal: fast, reliable, continuous delivery of quality software.

The traditional model (silos):

```text
Dev ──────► "Works on my machine" ──────► Ops
                                           ↓
                         "It doesn't start in production"
                                           ↓
                          Blame game ←─────┘
```
The DevOps model (collaboration):

```text
┌───────────────────────────────────────────────────────────┐
│                   Unified Dev+Ops team                    │
│                                                           │
│  Plan → Code → Build → Test → Release → Deploy → Monitor  │
│    ↑                                                 │    │
│    └────────────── Continuous feedback ←─────────────┘    │
└───────────────────────────────────────────────────────────┘
```
### 1.2 The CALMS Principles

| Principle | Description |
|---|---|
| Culture | Collaboration, no silos, shared responsibility |
| Automation | Automate everything repetitive: build, test, deploy, infrastructure |
| Lean | Eliminate waste, small batches, continuous flow |
| Measurement | Measure everything: performance, errors, lead time, MTTR |
| Sharing | Shared knowledge, blameless postmortems, documentation |
### 1.3 Key Metrics (DORA)

| Metric | Elite | High | Medium |
|---|---|---|---|
| Deployment Frequency | On demand | Weekly | Monthly |
| Lead Time for Changes | < 1 hour | 1-7 days | 1-6 months |
| Change Failure Rate | < 5% | 10-15% | 16-30% |
| Mean Time to Recovery | < 1 hour | < 1 day | 1-7 days |
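These thresholds only matter if you actually measure. A minimal Python sketch of how the four DORA metrics can be derived from deploy records (the record schema here is illustrative, not taken from any specific tool):

```python
from datetime import datetime, timedelta

def dora_metrics(deploys):
    """Compute basic DORA metrics from a list of deploy records.

    Each record (illustrative schema): {"at": datetime, "committed_at": datetime,
    "failed": bool, "restored_at": datetime or None}.
    """
    # Observation window in days (at least 1 to avoid division by zero)
    days = max((max(d["at"] for d in deploys) - min(d["at"] for d in deploys)).days, 1)
    freq_per_week = len(deploys) / days * 7

    # Lead time: commit -> running in production
    lead_times = [d["at"] - d["committed_at"] for d in deploys]
    avg_lead = sum(lead_times, timedelta()) / len(lead_times)

    failures = [d for d in deploys if d["failed"]]
    cfr = len(failures) / len(deploys)

    # MTTR: failure detected at deploy time -> service restored
    mttr = (sum((d["restored_at"] - d["at"] for d in failures), timedelta()) / len(failures)
            if failures else timedelta())

    return {"deploy_freq_per_week": freq_per_week, "lead_time": avg_lead,
            "change_failure_rate": cfr, "mttr": mttr}
```

Feeding this from your CI system's deploy log gives you a trend line to compare against the table above.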
## 2. Advanced Linux for DevOps

### 2.1 Process Management
```bash
# Processes and resources
ps aux --sort=-%mem | head -20    # Top processes by memory
ps -eo pid,ppid,user,%cpu,%mem,cmd --sort=-%cpu | head
pstree -p                         # Process tree
top -bn1 -o %MEM                  # One-shot snapshot (batch mode, sorted by memory)
htop                              # Advanced interactive view

# Signals
kill -SIGTERM $PID                # Graceful termination
kill -SIGKILL $PID                # Forced termination (kill -9)
kill -SIGHUP $PID                 # Reload configuration (many daemons)
killall -SIGUSR1 nginx            # Signal every process with that name
pkill -f "python.*my_script"      # Kill by command-line pattern

# Background processes
long_command &                            # Start in the background
nohup long_command > output.log 2>&1 &    # Survives closing the terminal
disown %1                                 # Detach the job from the shell
jobs -l                                   # List current jobs

# Resource limits
ulimit -n 65535                   # Max file descriptors per process
ulimit -u 4096                    # Max processes per user
# Permanent limits: /etc/security/limits.conf
#   nginx soft nofile 65535
#   nginx hard nofile 65535
```
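A well-behaved service receiving SIGTERM (the graceful case above) should finish in-flight work before exiting; containers and systemd both send SIGTERM first and SIGKILL only after a grace period. A minimal Python sketch of such a handler (the job/loop structure is illustrative):

```python
import signal

# Shared flag flipped by the signal handler; the main loop polls it.
shutdown = {"requested": False}

def handle_sigterm(signum, frame):
    # Only mark the request; the main loop drains work and exits cleanly.
    shutdown["requested"] = True

signal.signal(signal.SIGTERM, handle_sigterm)

def main_loop(jobs):
    """Process jobs until done or until a shutdown is requested."""
    done = []
    for job in jobs:
        if shutdown["requested"]:
            break               # stop picking up new work
        done.append(job)        # placeholder for real work
    return done
```

The same pattern (catch SIGTERM, drain, exit) is what lets Kubernetes rolling updates be zero-downtime.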
### 2.2 Disk and Filesystem Management
```bash
# Disk information
lsblk -f                          # Block devices with filesystems
df -hT                            # Disk space with filesystem type
du -sh /var/log/*                 # Directory sizes
ncdu /var                         # Interactive disk-usage browser

# LVM (Logical Volume Management)
pvcreate /dev/sdb                     # Create a Physical Volume
vgcreate data_vg /dev/sdb             # Create a Volume Group
lvcreate -L 50G -n app_lv data_vg     # Create a Logical Volume
mkfs.ext4 /dev/data_vg/app_lv         # Format
mount /dev/data_vg/app_lv /data       # Mount

# Grow a volume without downtime:
lvextend -L +20G /dev/data_vg/app_lv  # Extend the LV by 20 GB
resize2fs /dev/data_vg/app_lv         # Grow the filesystem (ext4)
# or: xfs_growfs /data                # For XFS

# I/O monitoring
iostat -xz 1                      # Per-disk I/O statistics, every second
iotop                             # Top processes by I/O
```
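The `df`-style numbers above can also be consumed programmatically. A small Python sketch of the kind of disk-usage check a monitoring script would run (the threshold and report format are arbitrary choices):

```python
import shutil

def disk_report(path="/"):
    """Disk usage for one mount point, roughly like `df -h` for a single path."""
    usage = shutil.disk_usage(path)
    return {
        "total_gb": usage.total / 1024**3,
        "used_gb": usage.used / 1024**3,
        "free_gb": usage.free / 1024**3,
        "used_pct": round(usage.used / usage.total * 100, 1),
    }

def alert_if_low(report, threshold_pct=90.0):
    """True when usage crosses the threshold (a typical alerting condition)."""
    return report["used_pct"] >= threshold_pct
```

In practice such a check runs from cron or a node exporter and feeds an alerting system rather than printing directly.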
### 2.3 Advanced Networking
```bash
# Configuration and diagnostics
ip addr show                      # Interfaces and IP addresses
ip route show                     # Routing table
ip link set eth0 up               # Bring an interface up (or: down)
ss -tlnp                          # Listening TCP ports with PID
ss -s                             # Connection summary
ss -tnp state established         # Active connections

# DNS
dig example.com                   # Full DNS query
dig +short example.com A          # Just the IP
nslookup example.com              # Simple query
host example.com                  # Resolution

# Network diagnostics
traceroute -n example.com         # Packet path
mtr example.com                   # Continuous traceroute
tcpdump -i eth0 port 443 -nn      # Capture HTTPS packets
tcpdump -i eth0 -w capture.pcap   # Save a capture
curl -v -o /dev/null https://api.example.com    # HTTP debugging
curl -w "@curl-format.txt" https://example.com  # Detailed timing

# Firewall (nftables / iptables)
nft list ruleset                  # List nftables rules
iptables -L -n -v                 # List iptables rules

# Bandwidth
iperf3 -s                         # Server
iperf3 -c server_ip               # Client: throughput test
```
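A port-reachability probe like `nc -z` can be written in Python with nothing but the standard library; a minimal sketch:

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and DNS failures alike.
        return False
```

This is the same building block behind "wait for the database before starting the app" scripts.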
### 2.4 systemd — Service Management
```bash
# Service control
systemctl start nginx             # also: stop, restart
systemctl enable nginx            # Start at boot (disable to undo)
systemctl status nginx            # State + recent log lines
systemctl is-active nginx
systemctl list-units --type=service --state=running

# journalctl — structured logs
journalctl -u nginx -f                          # Follow live
journalctl -u nginx --since "1 hour ago"
journalctl -u nginx --since "2024-01-15" --until "2024-01-16"
journalctl -p err -b                            # Errors only, current boot
journalctl --disk-usage                         # Space used by logs
journalctl --vacuum-size=500M                   # Trim old logs
```

A custom service unit: `/etc/systemd/system/myapp.service`

```ini
[Unit]
Description=My Application
After=network.target postgresql.service
Requires=postgresql.service

[Service]
Type=simple
User=myapp
Group=myapp
WorkingDirectory=/opt/myapp
ExecStart=/opt/myapp/venv/bin/python app.py
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartSec=5
StandardOutput=journal
StandardError=journal
Environment=APP_ENV=production
EnvironmentFile=/opt/myapp/.env

# Security hardening:
NoNewPrivileges=yes
ProtectSystem=strict
ProtectHome=yes
PrivateTmp=yes
ReadWritePaths=/opt/myapp/data

[Install]
WantedBy=multi-user.target
```

Enable the service:

```bash
sudo systemctl daemon-reload
sudo systemctl enable --now myapp
```
## 3. Advanced Shell Scripting (Bash)

### 3.1 Robust Fundamentals
```bash
#!/usr/bin/env bash
# Robust script template
set -euo pipefail   # -e: exit on error, -u: error on undefined variables,
                    # -o pipefail: a pipeline fails if any command in it fails
IFS=$'\n\t'         # Sane field separator (avoids word-splitting surprises)

# Constants
readonly SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
readonly SCRIPT_NAME="$(basename "$0")"
readonly LOG_FILE="/var/log/${SCRIPT_NAME%.sh}.log"

# Logging functions
log()  { echo "[$(date '+%Y-%m-%d %H:%M:%S')] [INFO] $*" | tee -a "$LOG_FILE"; }
warn() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] [WARN] $*" | tee -a "$LOG_FILE" >&2; }
err()  { echo "[$(date '+%Y-%m-%d %H:%M:%S')] [ERROR] $*" | tee -a "$LOG_FILE" >&2; }
die()  { err "$*"; exit 1; }

# Cleanup on exit (trap)
cleanup() {
    local exit_code=$?
    log "Cleanup: removing temp files..."
    rm -rf "${TMPDIR:-/tmp}/myapp_$$"
    exit "$exit_code"
}
trap cleanup EXIT
trap 'die "Script interrupted"' INT TERM

# Dependency check
for cmd in docker kubectl aws jq; do
    command -v "$cmd" &>/dev/null || die "Required command '$cmd' not found"
done

# Root check
[[ $EUID -eq 0 ]] || die "This script must be run as root"
```
### 3.2 Argument Parsing
```bash
# Argument parsing with a while/case loop
# (getopts handles only short options; this pattern supports --long options too)
usage() {
    cat <<EOF
Usage: $SCRIPT_NAME [OPTIONS]
Options:
  -e, --env ENV     Environment (dev|staging|prod)
  -t, --tag TAG     Docker image tag
  -d, --dry-run     Don't execute, just show commands
  -v, --verbose     Verbose output
  -h, --help        Show this help
EOF
}

ENVIRONMENT=""
TAG="latest"
DRY_RUN=false
VERBOSE=false

while [[ $# -gt 0 ]]; do
    case "$1" in
        -e|--env)
            ENVIRONMENT="$2"
            shift 2
            ;;
        -t|--tag)
            TAG="$2"
            shift 2
            ;;
        -d|--dry-run)
            DRY_RUN=true
            shift
            ;;
        -v|--verbose)
            VERBOSE=true
            shift
            ;;
        -h|--help)
            usage
            exit 0
            ;;
        *)
            die "Unknown option: $1 (use --help for usage)"
            ;;
    esac
done

# Validation
[[ -n "$ENVIRONMENT" ]] || die "Environment is required (-e)"
[[ "$ENVIRONMENT" =~ ^(dev|staging|prod)$ ]] || die "Invalid environment: $ENVIRONMENT"
```
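The same interface in Python is usually handled by `argparse`, which generates `--help`, validates choices, and reports unknown options for free. A sketch mirroring the Bash options above (the flags are carried over; the parser itself is illustrative):

```python
import argparse

def build_parser():
    """Parser equivalent to the Bash while/case loop above."""
    parser = argparse.ArgumentParser(description="Deploy helper (illustrative)")
    parser.add_argument("-e", "--env", required=True,
                        choices=["dev", "staging", "prod"],
                        help="Target environment")
    parser.add_argument("-t", "--tag", default="latest",
                        help="Docker image tag")
    parser.add_argument("-d", "--dry-run", action="store_true",
                        help="Show commands without executing them")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Verbose output")
    return parser
```

Invalid values (e.g. `--env bad`) make `parse_args` exit with a usage message, matching the Bash validation block.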
### 3.3 DevOps Utility Functions
```bash
# Retry with exponential backoff
retry() {
    local max_attempts="${1:-5}"
    local delay="${2:-1}"
    local attempt=1
    shift 2
    until "$@"; do
        if (( attempt >= max_attempts )); then
            err "Command failed after $max_attempts attempts: $*"
            return 1
        fi
        warn "Attempt $attempt/$max_attempts failed. Retrying in ${delay}s..."
        sleep "$delay"
        delay=$(( delay * 2 ))
        attempt=$(( attempt + 1 ))
    done
}
# Usage:
retry 5 2 curl -sf https://api.example.com/health

# Parallel execution with a concurrency limit
parallel_exec() {
    local max_jobs="${1:-4}"
    shift
    local pids=()
    for cmd in "$@"; do
        eval "$cmd" &   # note: eval runs arbitrary strings; pass trusted commands only
        pids+=($!)
        if (( ${#pids[@]} >= max_jobs )); then
            wait "${pids[0]}"
            pids=("${pids[@]:1}")
        fi
    done
    wait
}

# Wait for a service to accept connections
wait_for_service() {
    local host="$1" port="$2" timeout="${3:-30}"
    local elapsed=0
    log "Waiting for $host:$port (timeout: ${timeout}s)..."
    until nc -z "$host" "$port" 2>/dev/null; do
        (( elapsed >= timeout )) && die "Timeout waiting for $host:$port"
        sleep 1
        elapsed=$(( elapsed + 1 ))
    done
    log "$host:$port is available"
}

# Semver comparison
version_gte() {
    # Returns 0 if $1 >= $2
    printf '%s\n%s' "$2" "$1" | sort -V -C
}

# Safe secret handling
read_secret() {
    local prompt="$1"
    local secret
    read -rsp "$prompt: " secret
    echo
    printf '%s' "$secret"
}
```
### 3.4 A Complete Deploy Script
```bash
#!/usr/bin/env bash
set -euo pipefail
# === Deploy script for a containerized application ===
# Assumes the log/warn/die and retry helpers from 3.1 and 3.3 are sourced.

readonly APP_NAME="mywebapp"
readonly REGISTRY="123456789.dkr.ecr.eu-west-1.amazonaws.com"
readonly NAMESPACE="production"
readonly HEALTH_ENDPOINT="/api/health"
readonly DEPLOY_TIMEOUT=300

deploy() {
    local tag="$1"
    local image="${REGISTRY}/${APP_NAME}:${tag}"
    log "Deploying $image to $NAMESPACE..."

    # 1. Check that the image exists
    if ! docker manifest inspect "$image" &>/dev/null; then
        die "Image $image not found in registry"
    fi

    # 2. Record the currently deployed image
    local current_image
    current_image=$(kubectl -n "$NAMESPACE" get deploy "$APP_NAME" \
        -o jsonpath='{.spec.template.spec.containers[0].image}' 2>/dev/null || echo "none")
    log "Current image: $current_image"

    # 3. Apply the new image
    kubectl -n "$NAMESPACE" set image "deploy/$APP_NAME" \
        "$APP_NAME=$image"

    # 4. Wait for the rollout
    log "Waiting for rollout (timeout: ${DEPLOY_TIMEOUT}s)..."
    if ! kubectl -n "$NAMESPACE" rollout status "deploy/$APP_NAME" \
            --timeout="${DEPLOY_TIMEOUT}s"; then
        warn "Rollout failed! Initiating rollback..."
        kubectl -n "$NAMESPACE" rollout undo "deploy/$APP_NAME"
        kubectl -n "$NAMESPACE" rollout status "deploy/$APP_NAME" \
            --timeout="${DEPLOY_TIMEOUT}s"
        die "Deploy failed, rolled back to previous version"
    fi

    # 5. Health check
    local service_ip
    service_ip=$(kubectl -n "$NAMESPACE" get svc "$APP_NAME" \
        -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
    log "Running health check on $service_ip..."
    retry 10 3 curl -sf "http://${service_ip}${HEALTH_ENDPOINT}"

    log "Deploy successful! $APP_NAME is running $image"
}

# Argument parsing and execution
TAG="${1:?Usage: $0 <tag>}"
deploy "$TAG"
```
## 4. Version Control — Advanced Git

### 4.1 Branching Strategies
```bash
# === Git Flow ===
# main    ────●────────────────●──────────────── (releases)
#             ↑                ↑
# develop ──●──●──●──●──●──●──●──●────────────── (integration)
#            ↑        ↑           ↑
# feature/x ─●──●─────┘           │
# feature/y ────●──●──●───────────┘
# hotfix/z  ──────────────────●──●──→ main + develop

# === Trunk-Based Development (preferred for DevOps) ===
# main ──●──●──●──●──●──●──●──●──●──●── (continuous deploys)
#          ↑        ↑        ↑
# short-lived branches (max 1-2 days):
#        ──●──┘   ──●────────┘

# Essential workflow commands:
git checkout -b feature/add-auth
# ... development ...
git add -A && git commit -m "feat: add JWT authentication"
git push -u origin feature/add-auth
# Open a Pull Request → code review → merge

# Interactive rebase (squash commits):
git rebase -i HEAD~5              # Rewrite the last 5 commits
# pick   abc1234 feat: add auth endpoint
# squash def5678 fix: typo
# squash ghi9012 fix: tests
# → a single clean commit

# Cherry-pick (apply one commit onto another branch):
git cherry-pick abc1234

# Bisect (find the commit that introduced a bug):
git bisect start
git bisect bad HEAD
git bisect good v2.1.0
# Git then binary-searches between the two points automatically
```
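The reason `git bisect` needs only about log2(N) builds is that it is plain binary search over history. A Python sketch of the same logic, assuming the invariant bisect relies on (every commit after the first bad one is also bad):

```python
def bisect(commits, is_bad):
    """Return the first 'bad' commit in an ordered history.

    commits: oldest-to-newest list where the last commit is known bad.
    is_bad:  predicate standing in for 'build and run the test here'.
    """
    lo, hi = 0, len(commits) - 1
    first_bad = commits[hi]          # the newest commit is known bad
    while lo <= hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            first_bad = commits[mid]  # candidate; look for an earlier bad one
            hi = mid - 1
        else:
            lo = mid + 1              # still good; the bug is later
    return first_bad
```

For a 1,000-commit range this checks out roughly 10 commits instead of 1,000, which is exactly the economy `git bisect run` exploits.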
### 4.2 Conventional Commits and Semantic Versioning
```bash
# Format: <type>(<scope>): <description>
# Types: feat, fix, docs, style, refactor, perf, test, ci, chore
git commit -m "feat(auth): add OAuth2 Google login"
git commit -m "fix(api): handle null response from payment gateway"
git commit -m "perf(db): add index on users.email column"
git commit -m "ci: add SonarQube analysis step"
git commit -m "feat!: redesign user API (BREAKING CHANGE)"

# Semantic Versioning: MAJOR.MINOR.PATCH
#   MAJOR: breaking changes (feat!)
#   MINOR: new features (feat)
#   PATCH: bug fixes (fix)
# E.g. 2.4.1 → fix → 2.4.2, feat → 2.5.0, breaking → 3.0.0

# Automate with semantic-release or standard-version:
npx standard-version              # Generates the CHANGELOG + bumps the version
```
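The bump rules above are mechanical, which is exactly why tools like semantic-release can automate them. A Python sketch of the version computation (the commit-type representation is illustrative):

```python
def next_version(version, commit_types):
    """Compute the next SemVer from conventional-commit types since the last release."""
    major, minor, patch = map(int, version.split("."))
    if any(t.endswith("!") for t in commit_types):
        return f"{major + 1}.0.0"          # breaking change
    if "feat" in commit_types:
        return f"{major}.{minor + 1}.0"    # new feature
    if "fix" in commit_types:
        return f"{major}.{minor}.{patch + 1}"
    return version                          # docs/chore etc.: no release
```

This reproduces the worked example in the comments: from 2.4.1, a fix gives 2.4.2, a feat gives 2.5.0, and a breaking change gives 3.0.0.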
### 4.3 Git Hooks for Automation
```bash
#!/usr/bin/env bash
# .git/hooks/pre-commit (or via Husky / the pre-commit framework)
set -euo pipefail

echo "Running pre-commit checks..."

# Lint
if command -v flake8 &>/dev/null; then
    flake8 --max-line-length=120 .
fi

# Secrets detection (prevents accidentally committing keys/passwords)
if command -v gitleaks &>/dev/null; then
    gitleaks detect --staged --verbose
fi

# Terraform format check
if command -v terraform &>/dev/null; then
    terraform fmt -check -recursive
fi

echo "Pre-commit checks passed!"
```
## 5. CI/CD — Concepts and Architectures

### 5.1 The Complete CI/CD Pipeline
```text
┌───────────────────────────────────────────────────────────────┐
│                        CI/CD PIPELINE                         │
│                                                               │
│ ┌──────┐  ┌───────┐  ┌──────┐  ┌──────┐  ┌───────┐  ┌───────┐ │
│ │Source│→ │Build  │→ │Test  │→ │Scan  │→ │Deploy │→ │Monitor│ │
│ │      │  │       │  │      │  │      │  │       │  │       │ │
│ │ Git  │  │Compile│  │Unit  │  │SAST  │  │Staging│  │Metrics│ │
│ │ Push │  │Docker │  │Integr│  │DAST  │  │Prod   │  │Alerts │ │
│ │ PR   │  │Build  │  │E2E   │  │Deps  │  │Canary │  │Logs   │ │
│ └──────┘  └───────┘  └──────┘  └──────┘  └───────┘  └───────┘ │
│                                                               │
│ ◄─── Continuous Integration ───►  ◄── Continuous Delivery ──► │
│     ◄────────────── Continuous Deployment ──────────────►     │
└───────────────────────────────────────────────────────────────┘
```

CI = automated build + test on every commit
CD (Delivery) = an artifact ready to deploy (manual approval for production)
CD (Deployment) = automatic deploy to production, with no human intervention
### 5.2 Deployment Strategies
Blue-Green:

```text
┌────────────┐     ┌────────────┐
│ Blue (v1)  │←LB  │ Green (v2) │   ← deploy v2 to Green
│   ACTIVE   │     │    IDLE    │
└────────────┘     └────────────┘
                 │
Switch the load balancer: ▼
┌────────────┐     ┌────────────┐
│ Blue (v1)  │     │ Green (v2) │←LB  ← Green becomes active;
│    IDLE    │     │   ACTIVE   │       instant rollback: switch back
└────────────┘     └────────────┘
```

Canary:

```text
100% traffic → v1
 ├──   5% traffic → v2 (canary)  ← watch error rates
 ├──  25% traffic → v2           ← scale up if healthy
 ├──  50% traffic → v2
 └── 100% traffic → v2           ← full rollout
```

Rolling Update (the Kubernetes default):

```text
Pod v1  Pod v1  Pod v1  Pod v1
Pod v2  Pod v1  Pod v1  Pod v1   ← one pod at a time
Pod v2  Pod v2  Pod v1  Pod v1
Pod v2  Pod v2  Pod v2  Pod v1
Pod v2  Pod v2  Pod v2  Pod v2   ← complete
```
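The canary steps above hinge on one decision: is the canary's error rate acceptably close to the baseline? A Python sketch of that promotion gate (the tolerance and traffic steps are illustrative values, not from any particular canary controller):

```python
def canary_decision(canary_errors, canary_requests,
                    baseline_error_rate, tolerance=0.01):
    """Decide whether to promote, roll back, or keep waiting on a canary."""
    if canary_requests == 0:
        return "wait"               # not enough traffic to judge
    rate = canary_errors / canary_requests
    if rate > baseline_error_rate + tolerance:
        return "rollback"           # canary is measurably worse than baseline
    return "promote"

# Traffic ramp matching the diagram above (percent of traffic per step)
TRAFFIC_STEPS = [5, 25, 50, 100]
```

Real systems (e.g. progressive-delivery controllers) add statistical significance checks and latency metrics on top of this same comparison.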
## 6. Jenkins — Declarative and Scripted Pipelines

### 6.1 A Declarative Jenkinsfile
```groovy
// Jenkinsfile (Declarative Pipeline)
pipeline {
    agent {
        docker {
            image 'python:3.11-slim'
            args '-v /var/run/docker.sock:/var/run/docker.sock'
        }
    }
    environment {
        REGISTRY   = credentials('ecr-registry-url')
        APP_NAME   = 'mywebapp'
        AWS_REGION = 'eu-west-1'
    }
    options {
        timeout(time: 30, unit: 'MINUTES')
        disableConcurrentBuilds()
        buildDiscarder(logRotator(numToKeepStr: '10'))
    }
    stages {
        stage('Checkout') {
            steps {
                checkout scm
                script {
                    env.GIT_COMMIT_SHORT = sh(
                        script: 'git rev-parse --short HEAD',
                        returnStdout: true
                    ).trim()
                    env.IMAGE_TAG = "${env.BRANCH_NAME}-${env.GIT_COMMIT_SHORT}"
                }
            }
        }
        stage('Install Dependencies') {
            steps {
                sh '''
                    python -m pip install --upgrade pip
                    pip install -r requirements.txt
                    pip install -r requirements-dev.txt
                '''
            }
        }
        stage('Lint & Format') {
            parallel {
                stage('Flake8') {
                    steps { sh 'flake8 --max-line-length=120 src/' }
                }
                stage('Black') {
                    steps { sh 'black --check src/' }
                }
                stage('Mypy') {
                    steps { sh 'mypy src/ --ignore-missing-imports' }
                }
            }
        }
        stage('Unit Tests') {
            steps {
                sh 'pytest tests/unit/ -v --junitxml=reports/unit.xml --cov=src --cov-report=xml'
            }
            post {
                always {
                    junit 'reports/unit.xml'
                    cobertura coberturaReportFile: 'coverage.xml'
                }
            }
        }
        stage('Build Docker Image') {
            steps {
                sh """
                    docker build \
                        --build-arg BUILD_DATE=\$(date -u +%Y-%m-%dT%H:%M:%SZ) \
                        --build-arg GIT_COMMIT=${env.GIT_COMMIT_SHORT} \
                        -t ${REGISTRY}/${APP_NAME}:${IMAGE_TAG} \
                        -t ${REGISTRY}/${APP_NAME}:latest .
                """
            }
        }
        stage('Integration Tests') {
            steps {
                sh '''
                    docker-compose -f docker-compose.test.yml up -d
                    sleep 10
                    pytest tests/integration/ -v --junitxml=reports/integration.xml
                '''
            }
            post {
                always {
                    sh 'docker-compose -f docker-compose.test.yml down -v'
                    junit 'reports/integration.xml'
                }
            }
        }
        stage('Security Scan') {
            parallel {
                stage('Trivy Image Scan') {
                    steps {
                        sh "trivy image --severity HIGH,CRITICAL --exit-code 1 ${REGISTRY}/${APP_NAME}:${IMAGE_TAG}"
                    }
                }
                stage('Dependency Check') {
                    steps { sh 'safety check -r requirements.txt' }
                }
            }
        }
        stage('Push to Registry') {
            when { branch 'main' }
            steps {
                withCredentials([usernamePassword(
                    credentialsId: 'ecr-credentials',
                    usernameVariable: 'AWS_ACCESS_KEY_ID',
                    passwordVariable: 'AWS_SECRET_ACCESS_KEY'
                )]) {
                    sh """
                        aws ecr get-login-password --region ${AWS_REGION} | \
                            docker login --username AWS --password-stdin ${REGISTRY}
                        docker push ${REGISTRY}/${APP_NAME}:${IMAGE_TAG}
                        docker push ${REGISTRY}/${APP_NAME}:latest
                    """
                }
            }
        }
        stage('Deploy to Staging') {
            when { branch 'main' }
            steps {
                sh """
                    kubectl --context staging -n staging \
                        set image deploy/${APP_NAME} ${APP_NAME}=${REGISTRY}/${APP_NAME}:${IMAGE_TAG}
                    kubectl --context staging -n staging \
                        rollout status deploy/${APP_NAME} --timeout=300s
                """
            }
        }
        stage('Deploy to Production') {
            when { branch 'main' }
            input {
                message 'Deploy to production?'
                ok 'Yes, deploy!'
            }
            steps {
                sh """
                    kubectl --context production -n production \
                        set image deploy/${APP_NAME} ${APP_NAME}=${REGISTRY}/${APP_NAME}:${IMAGE_TAG}
                    kubectl --context production -n production \
                        rollout status deploy/${APP_NAME} --timeout=300s
                """
            }
        }
    }
    post {
        success {
            slackSend(channel: '#deploys',
                      color: 'good',
                      message: "✅ ${APP_NAME} ${IMAGE_TAG} deployed successfully")
        }
        failure {
            slackSend(channel: '#deploys',
                      color: 'danger',
                      message: "❌ ${APP_NAME} pipeline failed: ${env.BUILD_URL}")
        }
    }
}
```
## 7. GitHub Actions and GitLab CI/CD

### 7.1 GitHub Actions
```yaml
# .github/workflows/ci-cd.yml
name: CI/CD Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ['3.10', '3.11', '3.12']
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: testdb
          POSTGRES_USER: testuser
          POSTGRES_PASSWORD: testpass
        ports: ['5432:5432']
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
          cache: 'pip'
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install -r requirements-dev.txt
      - name: Run tests
        env:
          DATABASE_URL: postgresql://testuser:testpass@localhost:5432/testdb
        run: pytest -v --cov=src --cov-report=xml
      - name: Upload coverage
        uses: codecov/codecov-action@v4
        if: matrix.python-version == '3.12'
        with:
          file: coverage.xml

  build-and-push:
    needs: test
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    outputs:
      image-tag: ${{ steps.meta.outputs.tags }}
    steps:
      - uses: actions/checkout@v4
      - name: Docker meta
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=
            type=raw,value=latest
      - name: Login to GHCR
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy:
    needs: build-and-push
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: eu-west-1
      - name: Update kubeconfig
        run: aws eks update-kubeconfig --name my-cluster --region eu-west-1
      - name: Deploy to Kubernetes
        run: |
          kubectl set image deploy/mywebapp \
            mywebapp=${{ needs.build-and-push.outputs.image-tag }} \
            -n production
          kubectl rollout status deploy/mywebapp -n production --timeout=300s
```
### 7.2 GitLab CI/CD
```yaml
# .gitlab-ci.yml
stages:
  - test
  - build
  - security
  - deploy

variables:
  IMAGE: $CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA

test:
  stage: test
  image: python:3.12-slim
  services:
    - postgres:16
  variables:
    POSTGRES_DB: testdb
    POSTGRES_USER: test
    POSTGRES_PASSWORD: test
    DATABASE_URL: "postgresql://test:test@postgres/testdb"
  script:
    - pip install -r requirements.txt -r requirements-dev.txt
    - pytest -v --junitxml=report.xml --cov=src
  artifacts:
    reports:
      junit: report.xml

build:
  stage: build
  image: docker:24
  services:
    - docker:24-dind
  script:
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
    - docker build -t $IMAGE .
    - docker push $IMAGE
  only:
    - main

trivy_scan:
  stage: security
  image: aquasec/trivy:latest
  script:
    - trivy image --severity HIGH,CRITICAL --exit-code 1 $IMAGE
  only:
    - main

deploy_production:
  stage: deploy
  image: bitnami/kubectl:latest
  script:
    - kubectl set image deploy/myapp myapp=$IMAGE -n production
    - kubectl rollout status deploy/myapp -n production --timeout=300s
  environment:
    name: production
    url: https://app.example.com
  when: manual
  only:
    - main
```
## 8. Docker — Advanced Containerization

### 8.1 An Optimized Dockerfile — Multi-Stage Build
```dockerfile
# === Stage 1: Build ===
FROM python:3.12-slim AS builder
WORKDIR /build

# Install build dependencies (separate cache layer)
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential libpq-dev \
    && rm -rf /var/lib/apt/lists/*

# Copy only requirements (the layer stays cached while they don't change)
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# === Stage 2: Production ===
FROM python:3.12-slim AS production

# Metadata
LABEL maintainer="devops@company.com" \
      version="1.0" \
      description="Production web application"

# Non-root user
RUN groupadd -r appuser && useradd -r -g appuser -d /app -s /sbin/nologin appuser

# Runtime dependencies only
RUN apt-get update && apt-get install -y --no-install-recommends \
        libpq5 curl \
    && rm -rf /var/lib/apt/lists/*

# Copy the Python packages from the builder stage
COPY --from=builder /install /usr/local

# Copy the application code
WORKDIR /app
COPY --chown=appuser:appuser src/ ./src/
COPY --chown=appuser:appuser alembic/ ./alembic/
COPY --chown=appuser:appuser alembic.ini .

# Expose the port
EXPOSE 8000

# Health check
HEALTHCHECK --interval=30s --timeout=5s --retries=3 --start-period=10s \
    CMD curl -f http://localhost:8000/health || exit 1

# Switch to the non-root user
USER appuser

# Exec-form entrypoint (signals propagate correctly)
ENTRYPOINT ["python", "-m", "uvicorn"]
CMD ["src.main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]
```
### 8.2 .dockerignore
```text
# .dockerignore
.git
.gitignore
.env
.env.*
__pycache__
*.pyc
*.pyo
.pytest_cache
.mypy_cache
.coverage
htmlcov/
*.egg-info/
dist/
build/
node_modules/
.vscode/
.idea/
docker-compose*.yml
Dockerfile*
README.md
docs/
tests/
*.md
```
### 8.3 Essential Docker Commands
```bash
# Build
docker build -t myapp:v1 .
docker build -t myapp:v1 --no-cache .         # Without cache
docker build -t myapp:v1 --target builder .   # Only one stage

# Run
docker run -d --name myapp -p 8080:8000 myapp:v1
docker run -d --name myapp \
    -p 8080:8000 \
    -v $(pwd)/data:/app/data \
    -e DATABASE_URL="postgres://..." \
    --memory=512m \
    --cpus=1.5 \
    --restart=unless-stopped \
    --network=mynetwork \
    myapp:v1

# Debug
docker exec -it myapp /bin/bash
docker logs -f --tail 100 myapp
docker inspect myapp | jq '.[0].NetworkSettings'
docker stats                                  # Live resource usage
docker top myapp                              # Processes inside the container

# Cleanup
docker system prune -af --volumes             # NUCLEAR: removes everything unused
docker image prune -a                         # Remove unused images
docker volume prune                           # Remove orphaned volumes

# Registry
docker tag myapp:v1 registry.example.com/myapp:v1
docker push registry.example.com/myapp:v1
docker pull registry.example.com/myapp:v1
```
## 9. Docker Compose and Multi-Container Applications
```yaml
# docker-compose.yml — a complete application
services:
  app:
    build:
      context: .
      dockerfile: Dockerfile
      target: production
    ports:
      - "8000:8000"
    environment:
      - DATABASE_URL=postgresql://appuser:secret@db:5432/appdb
      - REDIS_URL=redis://redis:6379/0
      - CELERY_BROKER_URL=redis://redis:6379/1
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 5s
      retries: 3
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: '1.0'
    restart: unless-stopped
    networks:
      - frontend
      - backend

  worker:
    build: .
    command: celery -A src.celery_app worker -l info -c 4
    environment:
      - DATABASE_URL=postgresql://appuser:secret@db:5432/appdb
      - CELERY_BROKER_URL=redis://redis:6379/1
    depends_on:
      - db
      - redis
    restart: unless-stopped
    networks:
      - backend

  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: appdb
      POSTGRES_USER: appuser
      POSTGRES_PASSWORD: secret
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U appuser -d appdb"]
      interval: 10s
      timeout: 5s
      retries: 5
    networks:
      - backend

  redis:
    image: redis:7-alpine
    command: redis-server --maxmemory 128mb --maxmemory-policy allkeys-lru
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
    volumes:
      - redis_data:/data
    networks:
      - backend

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/conf.d/default.conf:ro
      - ./certs:/etc/nginx/certs:ro
    depends_on:
      - app
    networks:
      - frontend

volumes:
  postgres_data:
  redis_data:

networks:
  frontend:
  backend:
```
## 10. Kubernetes — Orchestration at Scale

### 10.1 Kubernetes Architecture
```text
┌─────────────────────────────────────────────────────────────┐
│                        CONTROL PLANE                        │
│                                                             │
│ ┌───────────────┐ ┌─────────────┐ ┌───────────────────┐     │
│ │ kube-apiserver│ │ etcd        │ │ kube-scheduler    │     │
│ │ (REST API,    │ │ (key-value  │ │ (picks the node   │     │
│ │  authn/authz) │ │  store,     │ │  where each pod   │     │
│ │               │ │  cluster    │ │  is placed)       │     │
│ │               │ │  state)     │ │                   │     │
│ └───────────────┘ └─────────────┘ └───────────────────┘     │
│ ┌──────────────────────┐ ┌────────────────────────────┐     │
│ │ kube-controller-mgr  │ │ cloud-controller-manager   │     │
│ │ (ReplicaSet, Deploy- │ │ (LoadBalancer, volumes,    │     │
│ │  ment, Node, Job...) │ │  node lifecycle in cloud)  │     │
│ └──────────────────────┘ └────────────────────────────┘     │
└─────────────────────────────────────────────────────────────┘
        │                  │                  │
        ▼                  ▼                  ▼
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│   Worker     │   │   Worker     │   │   Worker     │
│   Node 1     │   │   Node 2     │   │   Node 3     │
│              │   │              │   │              │
│ ┌──────────┐ │   │ ┌──────────┐ │   │ ┌──────────┐ │
│ │ kubelet  │ │   │ │ kubelet  │ │   │ │ kubelet  │ │
│ │ (agent)  │ │   │ │          │ │   │ │          │ │
│ ├──────────┤ │   │ ├──────────┤ │   │ ├──────────┤ │
│ │kube-proxy│ │   │ │kube-proxy│ │   │ │kube-proxy│ │
│ │(network) │ │   │ │          │ │   │ │          │ │
│ ├──────────┤ │   │ ├──────────┤ │   │ ├──────────┤ │
│ │Container │ │   │ │Container │ │   │ │Container │ │
│ │Runtime   │ │   │ │Runtime   │ │   │ │Runtime   │ │
│ │containerd│ │   │ │containerd│ │   │ │containerd│ │
│ ├──────────┤ │   │ ├──────────┤ │   │ ├──────────┤ │
│ │  Pod A   │ │   │ │  Pod C   │ │   │ │  Pod E   │ │
│ │  Pod B   │ │   │ │  Pod D   │ │   │ │  Pod F   │ │
│ └──────────┘ │   │ └──────────┘ │   │ └──────────┘ │
└──────────────┘   └──────────────┘   └──────────────┘
```
### 10.2 Kubernetes Manifests — A Complete Application
```yaml
# namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: myapp
  labels:
    app.kubernetes.io/name: myapp
---
# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: myapp-config
  namespace: myapp
data:
  APP_ENV: "production"
  LOG_LEVEL: "info"
  ALLOWED_HOSTS: "app.example.com"
---
# secret.yaml (in practice: use the External Secrets Operator or Sealed Secrets)
apiVersion: v1
kind: Secret
metadata:
  name: myapp-secrets
  namespace: myapp
type: Opaque
stringData:
  DATABASE_URL: "postgresql://user:pass@db-host:5432/appdb"
  SECRET_KEY: "super-secret-key-here"
---
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: myapp
  labels:
    app: myapp
    version: v1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Max extra pods during the update
      maxUnavailable: 0  # No pod unavailable (zero downtime)
  template:
    metadata:
      labels:
        app: myapp
        version: v1
    spec:
      serviceAccountName: myapp-sa
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
      containers:
        - name: myapp
          image: registry.example.com/myapp:v1.2.3
          ports:
            - containerPort: 8000
              name: http
          envFrom:
            - configMapRef:
                name: myapp-config
            - secretRef:
                name: myapp-secrets
          resources:
            requests:
              cpu: 250m        # 0.25 core
              memory: 256Mi
            limits:
              cpu: 1000m       # 1 core
              memory: 512Mi
          readinessProbe:
            httpGet:
              path: /health/ready
              port: http
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health/live
              port: http
            initialDelaySeconds: 15
            periodSeconds: 20
          startupProbe:
            httpGet:
              path: /health/live
              port: http
            failureThreshold: 30
            periodSeconds: 2
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: myapp
---
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: myapp
spec:
  selector:
    app: myapp
  ports:
    - port: 80
      targetPort: http
      protocol: TCP
  type: ClusterIP
---
# hpa.yaml (Horizontal Pod Autoscaler)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
  namespace: myapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
---
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-ingress
  namespace: myapp
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/rate-limit: "100"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - app.example.com
      secretName: myapp-tls
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp
                port:
                  number: 80
```
### 10.3 Essential kubectl Commands
# Informații cluster
kubectl cluster-info
kubectl get nodes -o wide
kubectl top nodes # Resurse per nod
# Operare
kubectl apply -f manifests/ # Aplică toate fișierele din director
kubectl get all -n myapp # Toate resursele din namespace
kubectl get pods -n myapp -o wide # Pods cu detalii
kubectl describe pod myapp-xxx -n myapp # Detalii complete pod
kubectl logs -f myapp-xxx -n myapp # Log-uri live
kubectl logs myapp-xxx -n myapp --previous # Log-uri pod anterior (crash)
kubectl exec -it myapp-xxx -n myapp -- /bin/sh # Shell în pod
# Debugging
kubectl get events -n myapp --sort-by='.lastTimestamp'
kubectl debug pod/myapp-xxx -it --image=busybox # Container efemer de debugging
# Deployment management
kubectl rollout status deploy/myapp -n myapp
kubectl rollout history deploy/myapp -n myapp
kubectl rollout undo deploy/myapp -n myapp # Rollback la versiunea anterioară
kubectl rollout undo deploy/myapp -n myapp --to-revision=3
# Scaling
kubectl scale deploy/myapp -n myapp --replicas=5
# Port forward (debug local)
kubectl port-forward svc/myapp 8080:80 -n myapp
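Comenzile `kubectl` de mai sus se pretează ușor la automatizare. O schiță minimală în Python (funcția `pods_not_ready` este ipotetică, doar pentru ilustrare) care parsează output-ul `kubectl get pods -o json` și identifică pod-urile fără condiția `Ready=True`:

```python
import json

def pods_not_ready(pods_json: str) -> list[str]:
    """Returnează numele pod-urilor care nu au condiția Ready=True.

    Primește output-ul `kubectl get pods -o json` ca string.
    """
    items = json.loads(pods_json).get("items", [])
    not_ready = []
    for pod in items:
        conditions = pod.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            not_ready.append(pod["metadata"]["name"])
    return not_ready

# Exemplu cu date statice (în practică: subprocess + kubectl)
sample = json.dumps({
    "items": [
        {"metadata": {"name": "myapp-ok"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "myapp-crash"},
         "status": {"conditions": [{"type": "Ready", "status": "False"}]}},
    ]
})
print(pods_not_ready(sample))  # ['myapp-crash']
```

În practică, string-ul JSON ar veni din `subprocess.run(["kubectl", "get", "pods", "-o", "json"], capture_output=True, text=True)`.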
11. Kubernetes — Obiecte avansate și operare¶
11.1 Jobs și CronJobs¶
# CronJob pentru backup baza de date
apiVersion: batch/v1
kind: CronJob
metadata:
name: db-backup
namespace: myapp
spec:
schedule: "0 2 * * *" # Zilnic la 02:00
concurrencyPolicy: Forbid
successfulJobsHistoryLimit: 3
failedJobsHistoryLimit: 3
jobTemplate:
spec:
backoffLimit: 3
activeDeadlineSeconds: 3600
template:
spec:
restartPolicy: OnFailure
containers:
- name: backup
image: postgres:16
command:
- /bin/bash
- -c
- |
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
pg_dump $DATABASE_URL | gzip > /backup/db_${TIMESTAMP}.sql.gz
aws s3 cp /backup/db_${TIMESTAMP}.sql.gz \
s3://my-backups/db/db_${TIMESTAMP}.sql.gz
# Cleanup local
find /backup -mtime +7 -delete
envFrom:
- secretRef:
name: myapp-secrets
volumeMounts:
- name: backup-vol
mountPath: /backup
volumes:
- name: backup-vol
emptyDir:
sizeLimit: 5Gi
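Câmpul `schedule` folosește sintaxa cron clasică: minut, oră, zi, lună, zi a săptămânii. Pentru intuiție, o schiță simplificată în Python care interpretează cele cinci câmpuri (suportă doar `*` și valori numerice; funcția `cron_matches` este pur ilustrativă, nu un parser cron complet):

```python
from datetime import datetime

def cron_matches(expr: str, dt: datetime) -> bool:
    """Verifică dacă un moment dat se potrivește unei expresii cron simple.

    Suportă doar '*' și valori numerice (fără liste, intervale sau pași).
    Convenția cron: duminică = 0.
    """
    fields = expr.split()
    values = [dt.minute, dt.hour, dt.day, dt.month, dt.isoweekday() % 7]
    return all(f == "*" or int(f) == v for f, v in zip(fields, values))

# "0 2 * * *" = zilnic la 02:00
print(cron_matches("0 2 * * *", datetime(2024, 5, 1, 2, 0)))   # True
print(cron_matches("0 2 * * *", datetime(2024, 5, 1, 14, 0)))  # False
```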
11.2 Network Policies¶
# Permite doar trafic de la pods cu label app=myapp către db
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: db-allow-app-only
namespace: myapp
spec:
podSelector:
matchLabels:
app: postgres
policyTypes:
- Ingress
ingress:
- from:
- podSelector:
matchLabels:
app: myapp
ports:
- port: 5432
protocol: TCP
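Efectul unei NetworkPolicy se poate verifica empiric testând conectivitatea TCP din pod-uri diferite (de exemplu rulat prin `kubectl exec`). O schiță minimală, cu nume de funcție ipotetic:

```python
import socket

def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """True dacă o conexiune TCP la host:port reușește în limita de timp."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Rulat dintr-un pod cu label app=myapp ar trebui să dea True către postgres:5432;
# dintr-un pod fără acel label, traficul e blocat de policy (timeout -> False).
print(tcp_reachable("127.0.0.1", 1, timeout=0.5))  # portul 1 local: aproape sigur False
```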
12. Helm și managementul pachetelor Kubernetes¶
# Helm = package manager pentru Kubernetes
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
# Instalare chart:
helm install my-postgres bitnami/postgresql \
--namespace myapp \
--set auth.postgresPassword=secret \
--set primary.persistence.size=50Gi
# Vizualizare:
helm list -n myapp
helm status my-postgres -n myapp
# Upgrade:
helm upgrade my-postgres bitnami/postgresql \
--namespace myapp \
--set primary.resources.limits.memory=2Gi
# Rollback:
helm rollback my-postgres 1 -n myapp
# Chart propriu: mychart/values.yaml
replicaCount: 3
image:
repository: registry.example.com/myapp
tag: "latest"
pullPolicy: IfNotPresent
service:
type: ClusterIP
port: 80
ingress:
enabled: true
host: app.example.com
resources:
requests:
cpu: 250m
memory: 256Mi
limits:
cpu: 1000m
memory: 512Mi
autoscaling:
enabled: true
minReplicas: 3
maxReplicas: 20
targetCPUUtilization: 70
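Valorile din `values.yaml` pot fi validate programatic înainte de `helm install`. O schiță ilustrativă (funcțiile `image_reference` și `lint_values` sunt ipotetice) care verifică două bune practici comune — inclusiv faptul că `tag: "latest"` din exemplul de mai sus este de evitat în producție:

```python
def image_reference(values: dict) -> str:
    """Construiește referința completă de imagine din valorile chart-ului."""
    img = values["image"]
    return f"{img['repository']}:{img['tag']}"

def lint_values(values: dict) -> list[str]:
    """Verificări minimale de bune practici asupra values.yaml (ca dict)."""
    warnings = []
    if values["image"]["tag"] == "latest":
        warnings.append("image.tag=latest: folosește un tag imutabil (ex. SHA)")
    if values.get("autoscaling", {}).get("minReplicas", 1) < 2:
        warnings.append("minReplicas<2: fără redundanță la failover")
    return warnings

values = {
    "image": {"repository": "registry.example.com/myapp", "tag": "latest"},
    "autoscaling": {"minReplicas": 3},
}
print(image_reference(values))  # registry.example.com/myapp:latest
print(lint_values(values))      # un singur avertisment: tag-ul latest
```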
13. AWS — Servicii fundamentale pentru DevOps¶
13.1 Harta serviciilor AWS esențiale¶
┌──────────────────────────────────────────────────────────────┐
│ AWS Cloud │
│ │
│ COMPUTE CONTAINERS NETWORKING │
│ ┌──────────┐ ┌───────────┐ ┌──────────────────┐ │
│ │ EC2 │ │ ECS/EKS │ │ VPC │ │
│ │ Lambda │ │ Fargate │ │ ALB/NLB │ │
│ │ ASG │ │ ECR │ │ Route53 (DNS) │ │
│ └──────────┘ └───────────┘ │ CloudFront (CDN) │ │
│ │ API Gateway │ │
│ STORAGE DATABASE └──────────────────┘ │
│ ┌──────────┐ ┌───────────┐ │
│ │ S3 │ │ RDS │ SECURITY │
│ │ EBS │ │ DynamoDB │ ┌──────────────────┐ │
│ │ EFS │ │ ElastiCache│ │ IAM │ │
│ └──────────┘ └───────────┘ │ KMS │ │
│ │ Secrets Manager │ │
│ CI/CD MONITORING │ WAF │ │
│ ┌──────────┐ ┌───────────┐ └──────────────────┘ │
│ │CodePipeline│ │CloudWatch │ │
│ │CodeBuild │ │X-Ray │ IaC │
│ │CodeDeploy │ │CloudTrail │ ┌──────────────────┐ │
│ └──────────┘ └───────────┘ │ CloudFormation │ │
│ │ (sau Terraform) │ │
│ └──────────────────┘ │
└──────────────────────────────────────────────────────────────┘
13.2 AWS CLI — operații comune¶
# Configurare
aws configure # Setup inițial
aws sts get-caller-identity # Verificare identitate curentă
# S3
aws s3 ls # Lista buckets
aws s3 sync ./dist s3://my-bucket/app/ --delete
aws s3 cp backup.sql.gz s3://my-backups/ --storage-class GLACIER
# EC2
aws ec2 describe-instances \
--filters "Name=tag:Environment,Values=production" \
--query 'Reservations[].Instances[].{ID:InstanceId,IP:PrivateIpAddress,State:State.Name}' \
--output table
# ECS
aws ecs update-service --cluster prod --service myapp \
--force-new-deployment
# ECR
aws ecr get-login-password --region eu-west-1 | \
docker login --username AWS --password-stdin 123456789.dkr.ecr.eu-west-1.amazonaws.com
# EKS
aws eks update-kubeconfig --name my-cluster --region eu-west-1
# Secrets Manager
aws secretsmanager get-secret-value --secret-id prod/myapp/db \
--query SecretString --output text | jq .
# Lambda
aws lambda invoke --function-name my-function \
--payload '{"key": "value"}' response.json
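Interogarea JMESPath din exemplul EC2 (`--query`) are un echivalent direct în Python, util când prelucrezi răspunsul complet al API-ului; o schiță cu nume de funcție ipotetic:

```python
def summarize_instances(response: dict) -> list[dict]:
    """Echivalentul interogării JMESPath din exemplul AWS CLI:
    Reservations[].Instances[].{ID, IP, State}."""
    rows = []
    for res in response.get("Reservations", []):
        for inst in res.get("Instances", []):
            rows.append({
                "ID": inst["InstanceId"],
                "IP": inst.get("PrivateIpAddress"),
                "State": inst["State"]["Name"],
            })
    return rows

# Exemplu cu un răspuns static (în practică: boto3 sau `aws ... --output json`)
sample = {"Reservations": [{"Instances": [
    {"InstanceId": "i-0abc", "PrivateIpAddress": "10.0.1.10",
     "State": {"Name": "running"}},
]}]}
print(summarize_instances(sample))
```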
14. AWS — Infrastructură ca și cod (IaC) cu Terraform¶
14.1 Structura proiectului Terraform¶
terraform/
├── environments/
│ ├── dev/
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ └── terraform.tfvars
│ ├── staging/
│ └── prod/
├── modules/
│ ├── networking/
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ └── outputs.tf
│ ├── eks/
│ ├── rds/
│ └── s3/
└── global/
└── iam/
14.2 Modul VPC + EKS¶
# modules/networking/main.tf
resource "aws_vpc" "main" {
cidr_block = var.vpc_cidr
enable_dns_hostnames = true
enable_dns_support = true
tags = {
Name = "${var.project}-vpc"
Environment = var.environment
}
}
resource "aws_subnet" "private" {
count = length(var.private_subnets)
vpc_id = aws_vpc.main.id
cidr_block = var.private_subnets[count.index]
availability_zone = var.azs[count.index]
tags = {
Name = "${var.project}-private-${var.azs[count.index]}"
"kubernetes.io/role/internal-elb" = "1"
}
}
resource "aws_subnet" "public" {
count = length(var.public_subnets)
vpc_id = aws_vpc.main.id
cidr_block = var.public_subnets[count.index]
availability_zone = var.azs[count.index]
map_public_ip_on_launch = true
tags = {
Name = "${var.project}-public-${var.azs[count.index]}"
"kubernetes.io/role/elb" = "1"
}
}
# environments/prod/main.tf
module "networking" {
source = "../../modules/networking"
project = "myapp"
environment = "prod"
vpc_cidr = "10.0.0.0/16"
azs = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
public_subnets = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
}
module "eks" {
source = "../../modules/eks"
cluster_name = "myapp-prod"
cluster_version = "1.29"
vpc_id = module.networking.vpc_id
subnet_ids = module.networking.private_subnet_ids
node_groups = {
general = {
instance_types = ["m6i.large"]
min_size = 3
max_size = 10
desired_size = 3
}
}
}
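Planul de adresare de mai sus (subnet-uri /24 derivate din VPC-ul /16) poate fi verificat cu modulul standard `ipaddress` din Python — analog conceptual cu funcția `cidrsubnet()` din Terraform:

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
all_24 = list(vpc.subnets(new_prefix=24))  # toate subnet-urile /24 posibile

private = [str(all_24[i]) for i in (1, 2, 3)]
public = [str(all_24[i]) for i in (101, 102, 103)]
print(private)  # ['10.0.1.0/24', '10.0.2.0/24', '10.0.3.0/24']
print(public)   # ['10.0.101.0/24', '10.0.102.0/24', '10.0.103.0/24']

# Fiecare subnet trebuie să fie inclus în CIDR-ul VPC-ului
assert all(ipaddress.ip_network(s).subnet_of(vpc) for s in private + public)
```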
# Workflow Terraform:
cd terraform/environments/prod
terraform init # Inițializare (descarcă provideri)
terraform plan -out=tfplan # Preview modificări
terraform apply tfplan # Aplică
terraform destroy # Distruge tot (ATENȚIE!)
# State management:
# Backend S3 + DynamoDB lock (recomandat pentru echipe):
terraform {
backend "s3" {
bucket = "mycompany-terraform-state"
key = "prod/terraform.tfstate"
region = "eu-west-1"
dynamodb_table = "terraform-locks"
encrypt = true
}
}
15. Python pentru automatizare DevOps¶
15.1 Script de automatizare AWS cu boto3¶
#!/usr/bin/env python3
"""AWS infrastructure automation utilities."""
import boto3
import json
import sys
from datetime import datetime, timedelta
from typing import Optional
from botocore.exceptions import ClientError
class AWSManager:
"""Manages common AWS operations for DevOps."""
def __init__(self, region: str = "eu-west-1"):
self.region = region
self.ec2 = boto3.client("ec2", region_name=region)
self.ecs = boto3.client("ecs", region_name=region)
self.s3 = boto3.client("s3", region_name=region)
self.cloudwatch = boto3.client("cloudwatch", region_name=region)
self.secretsmanager = boto3.client("secretsmanager", region_name=region)
def get_instances_by_tag(self, tag_key: str, tag_value: str) -> list[dict]:
"""Get EC2 instances filtered by tag."""
response = self.ec2.describe_instances(
Filters=[
{"Name": f"tag:{tag_key}", "Values": [tag_value]},
{"Name": "instance-state-name", "Values": ["running"]},
]
)
instances = []
for reservation in response["Reservations"]:
for inst in reservation["Instances"]:
name = next(
(t["Value"] for t in inst.get("Tags", []) if t["Key"] == "Name"),
"unnamed",
)
instances.append({
"id": inst["InstanceId"],
"name": name,
"private_ip": inst.get("PrivateIpAddress"),
"type": inst["InstanceType"],
"az": inst["Placement"]["AvailabilityZone"],
})
return instances
def force_ecs_deploy(self, cluster: str, service: str) -> str:
"""Force new deployment of an ECS service."""
response = self.ecs.update_service(
cluster=cluster,
service=service,
forceNewDeployment=True,
)
deployment_id = response["service"]["deployments"][0]["id"]
print(f"Triggered deployment {deployment_id} for {service}")
return deployment_id
def cleanup_old_snapshots(self, days: int = 30, dry_run: bool = True) -> int:
"""Delete EBS snapshots older than N days."""
cutoff = datetime.utcnow() - timedelta(days=days)
snapshots = self.ec2.describe_snapshots(OwnerIds=["self"])["Snapshots"]
deleted = 0
for snap in snapshots:
if snap["StartTime"].replace(tzinfo=None) < cutoff:
if dry_run:
print(f"[DRY RUN] Would delete {snap['SnapshotId']} "
f"({snap['StartTime'].date()})")
else:
self.ec2.delete_snapshot(SnapshotId=snap["SnapshotId"])
print(f"Deleted {snap['SnapshotId']}")
deleted += 1
        print(f"Total: {deleted} snapshots {'would be ' if dry_run else ''}deleted")
return deleted
def get_secret(self, secret_name: str) -> dict:
"""Retrieve a secret from AWS Secrets Manager."""
try:
response = self.secretsmanager.get_secret_value(SecretId=secret_name)
return json.loads(response["SecretString"])
except ClientError as e:
if e.response["Error"]["Code"] == "ResourceNotFoundException":
raise ValueError(f"Secret '{secret_name}' not found")
raise
if __name__ == "__main__":
mgr = AWSManager()
# Lista instanțe de producție
instances = mgr.get_instances_by_tag("Environment", "production")
for inst in instances:
print(f" {inst['name']:30s} {inst['id']:20s} {inst['private_ip']:15s}")
15.2 Script de monitorizare și alertare¶
#!/usr/bin/env python3
"""Health check and alerting script."""
import requests
import smtplib
import json
import time
from email.mime.text import MIMEText
from dataclasses import dataclass
from concurrent.futures import ThreadPoolExecutor, as_completed
@dataclass
class HealthCheckResult:
url: str
status: str # "healthy", "degraded", "down"
status_code: int
response_time_ms: float
error: str = ""
def check_endpoint(url: str, timeout: int = 10) -> HealthCheckResult:
"""Check a single HTTP endpoint."""
try:
start = time.monotonic()
response = requests.get(url, timeout=timeout)
elapsed_ms = (time.monotonic() - start) * 1000
if response.status_code == 200 and elapsed_ms < 2000:
status = "healthy"
elif response.status_code == 200:
status = "degraded"
else:
status = "down"
return HealthCheckResult(
url=url,
status=status,
status_code=response.status_code,
response_time_ms=round(elapsed_ms, 1),
)
except requests.RequestException as e:
return HealthCheckResult(
url=url, status="down", status_code=0,
response_time_ms=0, error=str(e),
)
def check_all_endpoints(endpoints: list[str]) -> list[HealthCheckResult]:
"""Check multiple endpoints in parallel."""
results = []
with ThreadPoolExecutor(max_workers=10) as executor:
futures = {executor.submit(check_endpoint, url): url for url in endpoints}
for future in as_completed(futures):
results.append(future.result())
return results
def send_alert(subject: str, body: str, webhook_url: str):
"""Send alert via Slack webhook."""
payload = {
"text": f"*{subject}*\n```{body}```",
"username": "HealthCheck Bot",
}
requests.post(webhook_url, json=payload, timeout=10)
# Configurare
ENDPOINTS = [
"https://app.example.com/health",
"https://api.example.com/health",
"https://admin.example.com/health",
]
SLACK_WEBHOOK = "https://hooks.slack.com/services/xxx/yyy/zzz"
if __name__ == "__main__":
results = check_all_endpoints(ENDPOINTS)
# Raport
for r in results:
icon = {"healthy": "✅", "degraded": "⚠️", "down": "❌"}[r.status]
print(f"{icon} {r.url:45s} {r.status:10s} "
f"{r.status_code:3d} {r.response_time_ms:7.1f}ms {r.error}")
# Alertare pentru servicii down
down = [r for r in results if r.status == "down"]
if down:
body = "\n".join(f"❌ {r.url} — {r.error or f'HTTP {r.status_code}'}"
for r in down)
send_alert(f"{len(down)} service(s) DOWN", body, SLACK_WEBHOOK)
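Un health check nu ar trebui să alerteze la primul eșec: un retry cu exponential backoff (și, opțional, jitter) evită alarmele false la erori tranzitorii de rețea. O schiță minimală (funcția `backoff_delays` este ipotetică):

```python
import random

def backoff_delays(retries: int, base: float = 1.0, cap: float = 30.0,
                   jitter: bool = False) -> list[float]:
    """Întârzierile pentru retry cu exponential backoff, plafonate la `cap`.

    Cu jitter=True, fiecare întârziere e aleatoare în [0, d] — util pentru
    a evita ca mai multe checker-e să reîncerce sincron.
    """
    delays = [min(cap, base * (2 ** i)) for i in range(retries)]
    if jitter:
        delays = [random.uniform(0, d) for d in delays]
    return delays

print(backoff_delays(6))  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```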
16. Monitorizare, logging și observabilitate¶
16.1 Cei trei piloni ai observabilității¶
┌─────────────────────────────────────────────────────────────┐
│ OBSERVABILITATE │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ METRICS │ │ LOGS │ │ TRACES │ │
│ │ │ │ │ │ │ │
│ │ Prometheus │ │ ELK Stack │ │ Jaeger │ │
│ │ Grafana │ │ (Elastic, │ │ Zipkin │ │
│ │ CloudWatch │ │ Logstash, │ │ AWS X-Ray │ │
│ │ Datadog │ │ Kibana) │ │ OpenTelemetry │ │
│ │ │ │ Loki │ │ │ │
│ │ "Ce se │ │ Fluentd/Bit │ │ "Care e calea │ │
│ │ întâmplă?" │ │ │ │ unui request?" │ │
│ │ │ │ "De ce s-a │ │ │ │
│ │ CPU, RAM, │ │ întâmplat?"│ │ Latență per │ │
│ │ requests/s, │ │ │ │ serviciu, │ │
│ │ error rate │ │ Stack traces,│ │ dependințe │ │
│ │ │ │ audit log │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────────┘ │
└─────────────────────────────────────────────────────────────┘
16.2 Prometheus + Grafana pe Kubernetes¶
# Prometheus ServiceMonitor (dacă folosești prometheus-operator)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: myapp-monitor
namespace: myapp
spec:
selector:
matchLabels:
app: myapp
endpoints:
- port: http
path: /metrics
interval: 15s
# Alertă Prometheus
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: myapp-alerts
namespace: myapp
spec:
groups:
- name: myapp.rules
rules:
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{app="myapp",status=~"5.."}[5m]))
/ sum(rate(http_requests_total{app="myapp"}[5m])) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate (>5%) on myapp"
- alert: HighLatency
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket{app="myapp"}[5m]))
by (le)) > 1.0
for: 5m
labels:
severity: warning
annotations:
summary: "P95 latency > 1s on myapp"
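Aritmetica din spatele celor două alerte poate fi reprodusă local pentru intuiție — o schiță pur ilustrativă, nu implementarea Prometheus:

```python
import statistics

def error_rate(count_5xx: float, count_total: float) -> float:
    """Aritmetica alertei HighErrorRate: sum(rate(5xx)) / sum(rate(total))."""
    return count_5xx / count_total if count_total else 0.0

def p95(samples: list[float]) -> float:
    """Aproximarea percentilei 95 din eșantioane de latență — conceptual,
    ce calculează histogram_quantile(0.95, ...) în alerta HighLatency."""
    return statistics.quantiles(samples, n=100)[94]

latencies = [0.1] * 95 + [2.0] * 5  # 5% din cereri sunt lente
print(error_rate(12, 200) > 0.05)   # True: 6% depășește pragul de 5%
print(p95(latencies) > 1.0)         # True: P95 depășește 1s
```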
17. Securitate în pipeline-ul DevOps (DevSecOps)¶
17.1 Security-as-Code în CI/CD¶
# Etape de securitate integrate în pipeline:
security-scan:
stage: security
parallel:
# 1. SAST — Static Application Security Testing
- name: semgrep
script: semgrep scan --config=auto --error src/
# 2. Dependency scanning
- name: dependency-check
script: |
pip-audit -r requirements.txt
safety check -r requirements.txt
# 3. Container image scanning
- name: trivy
script: |
trivy image --severity HIGH,CRITICAL \
--exit-code 1 $IMAGE
# 4. IaC scanning
- name: checkov
script: |
checkov -d terraform/ --framework terraform
checkov -d k8s/ --framework kubernetes
# 5. Secrets detection
- name: gitleaks
script: gitleaks detect --source . --verbose
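Ideea din spatele detectării de secrete (pasul gitleaks) se reduce la potrivirea unor tipare cunoscute în cod. O schiță mult simplificată, pur ilustrativă — tiparele reale sunt mult mai numeroase și mai precise:

```python
import re

# Tipare simplificate, în stilul gitleaks (ilustrativ, nu exhaustiv)
SECRET_PATTERNS = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "generic-password": re.compile(r"password\s*=\s*['\"][^'\"]{8,}['\"]", re.I),
}

def scan_text(text: str) -> list[str]:
    """Returnează numele tiparelor de secrete găsite în text."""
    return [name for name, rx in SECRET_PATTERNS.items() if rx.search(text)]

code = 'aws_key = "AKIAIOSFODNN7EXAMPLE"\npassword = "hunter2hunter2"'
print(scan_text(code))  # ['aws-access-key-id', 'generic-password']
```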
17.2 Pod Security Standards (Kubernetes)¶
# Profil Pod Security „restricted" — bune practici pentru producție
apiVersion: v1
kind: Pod
metadata:
name: secure-pod
spec:
securityContext:
runAsNonRoot: true
runAsUser: 65534
fsGroup: 65534
seccompProfile:
type: RuntimeDefault
containers:
- name: app
image: myapp:v1
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop: ["ALL"]
volumeMounts:
- name: tmp
mountPath: /tmp
volumes:
- name: tmp
emptyDir: {}
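Regulile din manifestul de mai sus pot fi verificate și programatic, în stilul unui admission controller, dar mult simplificat (funcția `violates_restricted` este ipotetică și acoperă doar un subset din profilul „restricted"):

```python
def violates_restricted(container_sc: dict, pod_sc: dict) -> list[str]:
    """Verifică un subset din profilul Pod Security 'restricted' (ilustrativ)."""
    problems = []
    if not pod_sc.get("runAsNonRoot"):
        problems.append("runAsNonRoot trebuie să fie true")
    if container_sc.get("allowPrivilegeEscalation", True):
        problems.append("allowPrivilegeEscalation trebuie să fie false")
    if "ALL" not in container_sc.get("capabilities", {}).get("drop", []):
        problems.append("capabilities.drop trebuie să includă ALL")
    if pod_sc.get("seccompProfile", {}).get("type") not in ("RuntimeDefault", "Localhost"):
        problems.append("seccompProfile.type trebuie RuntimeDefault sau Localhost")
    return problems

pod_sc = {"runAsNonRoot": True, "seccompProfile": {"type": "RuntimeDefault"}}
good = {"allowPrivilegeEscalation": False, "capabilities": {"drop": ["ALL"]}}
bad = {}
print(violates_restricted(good, pod_sc))       # []
print(len(violates_restricted(bad, pod_sc)))   # 2
```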
18. Proiect integrator: pipeline complet de la cod la producție¶
18.1 Diagrama completă¶
Developer → git push
│
▼
GitHub Actions / Jenkins
│
├── 1. Lint (flake8, black, mypy)
├── 2. Unit Tests (pytest, coverage >80%)
├── 3. Build Docker Image (multi-stage)
├── 4. Security Scan (trivy, semgrep, pip-audit)
├── 5. Push to ECR
├── 6. Deploy to Staging (kubectl / Helm)
├── 7. Integration Tests (against staging)
├── 8. Manual Approval Gate
└── 9. Deploy to Production (rolling update)
│
├── Prometheus scrape → Grafana dashboards
├── Fluentd → Elasticsearch → Kibana
├── Alertmanager → Slack / PagerDuty
└── Rollback automat dacă error rate > 5%
Infra managed by:
Terraform (VPC, EKS, RDS, S3, IAM)
Helm (aplicație Kubernetes)
AWS Secrets Manager (credentials)
Monitorizare:
Grafana: dashboards per serviciu
Prometheus: metrici aplicație + cluster
CloudWatch: metrici AWS native
PagerDuty: on-call rotation + escalation
18.2 Checklist lansare producție¶
PRE-DEPLOY:
□ Toate testele trec (unit, integration, e2e)
□ Code review aprobat (min 2 revieweri)
□ Security scan fără vulnerabilități critice
□ Docker image sub 500MB
□ Health endpoints implementate (/health/live, /health/ready)
□ Metrici Prometheus expuse (/metrics)
□ Logging structurat (JSON)
□ Graceful shutdown implementat (SIGTERM handling)
□ Database migrations testate și reversibile
□ Secrets în Secrets Manager (nu în cod/env vars)
□ Resource limits setate (CPU, memory)
□ HPA configurat (autoscaling)
□ Network policies aplicate
□ Pod Security Standards respectate
□ Runbook actualizat
POST-DEPLOY:
□ Health check OK pe toate pod-urile
□ Metrici vizibile în Grafana
□ Error rate stabil (< baseline)
□ Latența P95 în limite
□ Nicio alertă nouă
□ Rollback testat și documentat
Anexe¶
A. Toolchain DevOps recomandat¶
| Categorie | Instrumente recomandate |
|---|---|
| Version Control | Git, GitHub / GitLab |
| CI/CD | GitHub Actions, GitLab CI, Jenkins, ArgoCD |
| Containerizare | Docker, Podman, Buildah |
| Orchestrare | Kubernetes (EKS, GKE, AKS), k3s |
| Package Mgmt K8s | Helm, Kustomize |
| IaC | Terraform, Pulumi, CloudFormation |
| Config Management | Ansible, Chef, Puppet |
| Monitorizare | Prometheus + Grafana, Datadog, CloudWatch |
| Logging | ELK Stack, Loki + Grafana, CloudWatch Logs |
| Tracing | Jaeger, Zipkin, AWS X-Ray, OpenTelemetry |
| Security | Trivy, Snyk, SonarQube, Falco, OPA |
| Secrets | HashiCorp Vault, AWS Secrets Manager |
| Scripting | Bash, Python, Go |
| GitOps | ArgoCD, Flux |
Curs realizat ca material de referință pentru ingineri DevOps, SRE și dezvoltatori interesați de practici moderne de livrare software.