</> DevOps

DevOps concepts

Lesson 1 ⏱ 90 min

Advanced DevOps Course

Linux Shell, CI/CD, Docker, Kubernetes, AWS, and Python for Automation


Contents

  1. DevOps philosophy and organizational culture
  2. Advanced Linux for DevOps
  3. Advanced shell scripting (Bash)
  4. Version control — advanced Git
  5. CI/CD — concepts and architectures
  6. Jenkins — declarative and scripted pipelines
  7. GitHub Actions and GitLab CI/CD
  8. Docker — advanced containerization
  9. Docker Compose and multi-container applications
  10. Kubernetes — orchestration at scale
  11. Kubernetes — advanced objects and operations
  12. Helm and Kubernetes package management
  13. AWS — core services for DevOps
  14. AWS — Infrastructure as Code (IaC) with Terraform
  15. Python for DevOps automation
  16. Monitoring, logging, and observability
  17. Security in the DevOps pipeline (DevSecOps)
  18. Capstone project: a complete pipeline from code to production

1. DevOps philosophy and organizational culture

1.1 What is DevOps?

DevOps is a culture, a set of practices, and a toolchain that unifies software development (Dev) with infrastructure operations (Ops). The goal: fast, reliable, and continuous delivery of quality software.

Traditional model (silos):
Dev ──────► "Works on my machine" ──────► Ops
                                           ↓
                              "Doesn't start in production"
                                           ↓
                                     Blame game ←──────┘

DevOps model (collaboration):
┌──────────────────────────────────────────────────────────┐
│                  Unified Dev+Ops team                     │
│                                                           │
│  Plan → Code → Build → Test → Release → Deploy → Monitor │
│    ↑                                                  │   │
│    └──────────────── Continuous feedback ←────────────┘   │
└──────────────────────────────────────────────────────────┘

1.2 The CALMS principles

Principle    Description
Culture      Collaboration, no silos, shared responsibility
Automation   Automate everything repetitive: build, test, deploy, infra
Lean         Eliminate waste, small batches, continuous flow
Measurement  Measure everything: performance, errors, lead time, MTTR
Sharing      Shared knowledge, blameless postmortems, documentation

1.3 Key metrics (DORA)

Metric                  Elite      High       Medium
Deployment Frequency    On-demand  Weekly     Monthly
Lead Time for Changes   < 1 hour   1-7 days   1-6 months
Change Failure Rate     < 5%       10-15%     16-30%
Mean Time to Recovery   < 1 hour   < 1 day    1-7 days
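
The Measurement principle means these metrics should be computed automatically, not estimated. As a minimal sketch, here is how Change Failure Rate could be derived from a deploy log; the log format "<date> <status>" with status `ok`/`failed` is hypothetical, purely for illustration.

```shell
#!/usr/bin/env bash
# Sketch: Change Failure Rate from a hypothetical deploy log on stdin.
# Each input line: "<date> <status>", status is "ok" or "failed".

change_failure_rate() {
    local total=0 failed=0 _ status
    while read -r _ status; do
        total=$(( total + 1 ))
        [[ "$status" == "failed" ]] && failed=$(( failed + 1 ))
    done
    if (( total == 0 )); then
        echo 0                          # No deploys: report 0%
    else
        echo $(( failed * 100 / total ))  # Integer percentage
    fi
}

change_failure_rate <<'EOF'
2024-01-01 ok
2024-01-02 failed
2024-01-03 ok
2024-01-04 ok
EOF
# → 25
```

In practice the input would come from your CI system's deployment history (e.g. exported pipeline runs) rather than a hand-written file.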

2. Advanced Linux for DevOps

2.1 Process management

# Processes and resources
ps aux --sort=-%mem | head -20         # Top processes by memory
ps -eo pid,ppid,user,%cpu,%mem,cmd --sort=-%cpu | head
pstree -p                              # Process tree
top -bn1 -o %MEM                       # One-shot top (batch mode, sorted by memory)
htop                                   # Advanced interactive interface

# Signals
kill -SIGTERM $PID                     # Graceful termination
kill -SIGKILL $PID                     # Forced termination (kill -9)
kill -SIGHUP $PID                      # Reload configuration (many daemons)
killall -SIGUSR1 nginx                 # Signal all processes with that name
pkill -f "python.*my_script"           # Kill by pattern

# Background processes
long_command &                         # Start in the background
nohup long_command > output.log 2>&1 & # Survives closing the terminal
disown %1                              # Detach the job from the shell
jobs -l                                # List current jobs

# Resource limits
ulimit -n 65535                        # Max file descriptors per process
ulimit -u 4096                         # Max processes per user
# Permanent: /etc/security/limits.conf
# nginx  soft  nofile  65535
# nginx  hard  nofile  65535

2.2 Disk and filesystem management

# Disk information
lsblk -f                               # Block devices with filesystems
df -hT                                 # Disk space with filesystem type
du -sh /var/log/*                      # Directory sizes
ncdu /var                              # Interactive disk-usage browser

# LVM (Logical Volume Management)
pvcreate /dev/sdb                      # Create a Physical Volume
vgcreate data_vg /dev/sdb              # Create a Volume Group
lvcreate -L 50G -n app_lv data_vg      # Create a Logical Volume
mkfs.ext4 /dev/data_vg/app_lv          # Format
mount /dev/data_vg/app_lv /data        # Mount

# Growing a volume without downtime:
lvextend -L +20G /dev/data_vg/app_lv   # Grow the LV by 20GB
resize2fs /dev/data_vg/app_lv          # Grow the filesystem (ext4)
# or: xfs_growfs /data                 # For XFS

# I/O monitoring
iostat -xz 1                           # Per-disk I/O statistics, every second
iotop                                  # Top processes by I/O

2.3 Advanced networking

# Configuration and diagnostics
ip addr show                           # Interfaces and IP addresses
ip route show                          # Routing table
ip link set eth0 up/down               # Bring an interface up/down

ss -tlnp                               # TCP ports in LISTEN, with PID
ss -s                                  # Connection summary
ss -tnp state established              # Active connections

# DNS
dig example.com                        # Full DNS query
dig +short example.com A               # Just the IP
nslookup example.com                   # Simple query
host example.com                       # Resolution

# Network diagnostics
traceroute -n example.com              # Packet route
mtr example.com                        # Continuous traceroute
tcpdump -i eth0 port 443 -nn           # Capture HTTPS packets
tcpdump -i eth0 -w capture.pcap        # Save a capture
curl -v -o /dev/null https://api.example.com    # Debug HTTP
curl -w "@curl-format.txt" https://example.com  # Detailed timing

# Firewall (nftables / iptables)
nft list ruleset                       # List nftables rules
iptables -L -n -v                      # List iptables rules

# Bandwidth
iperf3 -s                              # Server
iperf3 -c server_ip                    # Client — throughput test

2.4 Systemd — service management

# Service control
systemctl start/stop/restart nginx
systemctl enable/disable nginx         # Start at boot
systemctl status nginx                 # State + recent logs
systemctl is-active nginx
systemctl list-units --type=service --state=running

# Journalctl — structured logs
journalctl -u nginx -f                 # Follow live
journalctl -u nginx --since "1 hour ago"
journalctl -u nginx --since "2024-01-15" --until "2024-01-16"
journalctl -p err -b                   # Errors only, current boot
journalctl --disk-usage                # Space used by logs
journalctl --vacuum-size=500M          # Trim old logs
# Custom service unit: /etc/systemd/system/myapp.service
[Unit]
Description=My Application
After=network.target postgresql.service
Requires=postgresql.service

[Service]
Type=simple
User=myapp
Group=myapp
WorkingDirectory=/opt/myapp
ExecStart=/opt/myapp/venv/bin/python app.py
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartSec=5
StandardOutput=journal
StandardError=journal
Environment=APP_ENV=production
EnvironmentFile=/opt/myapp/.env

# Security (hardening):
NoNewPrivileges=yes
ProtectSystem=strict
ProtectHome=yes
PrivateTmp=yes
ReadWritePaths=/opt/myapp/data

[Install]
WantedBy=multi-user.target
# Enable the service:
sudo systemctl daemon-reload
sudo systemctl enable --now myapp

3. Advanced shell scripting (Bash)

3.1 Robust foundations

#!/usr/bin/env bash
# Robust script template

set -euo pipefail    # e: exit on error, u: error on undefined variables
                     # o pipefail: fail if any command in a pipe fails
IFS=$'\n\t'          # Sane separator (avoids problems with spaces)

# Constants
readonly SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
readonly SCRIPT_NAME="$(basename "$0")"
readonly LOG_FILE="/var/log/${SCRIPT_NAME%.sh}.log"

# Logging functions
log()  { echo "[$(date '+%Y-%m-%d %H:%M:%S')] [INFO]  $*" | tee -a "$LOG_FILE"; }
warn() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] [WARN]  $*" | tee -a "$LOG_FILE" >&2; }
err()  { echo "[$(date '+%Y-%m-%d %H:%M:%S')] [ERROR] $*" | tee -a "$LOG_FILE" >&2; }
die()  { err "$*"; exit 1; }

# Cleanup on exit (trap)
cleanup() {
    local exit_code=$?
    log "Cleanup: removing temp files..."
    rm -rf "${TMPDIR:-/tmp}/myapp_$$"
    exit "$exit_code"
}
trap cleanup EXIT
trap 'die "Script interrupted"' INT TERM

# Dependency check
for cmd in docker kubectl aws jq; do
    command -v "$cmd" &>/dev/null || die "Required command '$cmd' not found"
done

# Root check
[[ $EUID -eq 0 ]] || die "This script must be run as root"

3.2 Argument parsing

# Manual argument parsing with a while/case loop
# (getopts only supports short options, so long options like --env need this)
usage() {
    cat <<EOF
Usage: $SCRIPT_NAME [OPTIONS]

Options:
    -e, --env ENV        Environment (dev|staging|prod)
    -t, --tag TAG        Docker image tag
    -d, --dry-run        Don't execute, just show commands
    -v, --verbose        Verbose output
    -h, --help           Show this help
EOF
}

ENVIRONMENT=""
TAG="latest"
DRY_RUN=false
VERBOSE=false

while [[ $# -gt 0 ]]; do
    case "$1" in
        -e|--env)
            ENVIRONMENT="$2"
            shift 2
            ;;
        -t|--tag)
            TAG="$2"
            shift 2
            ;;
        -d|--dry-run)
            DRY_RUN=true
            shift
            ;;
        -v|--verbose)
            VERBOSE=true
            shift
            ;;
        -h|--help)
            usage
            exit 0
            ;;
        *)
            die "Unknown option: $1 (use --help for usage)"
            ;;
    esac
done

# Validation
[[ -n "$ENVIRONMENT" ]] || die "Environment is required (-e)"
[[ "$ENVIRONMENT" =~ ^(dev|staging|prod)$ ]] || die "Invalid environment: $ENVIRONMENT"

3.3 DevOps utility functions

# Retry with exponential backoff (both numeric arguments are required,
# since shift 2 below consumes them before running the command)
retry() {
    local max_attempts="$1"
    local delay="$2"
    local attempt=1
    shift 2

    until "$@"; do
        if (( attempt >= max_attempts )); then
            err "Command failed after $max_attempts attempts: $*"
            return 1
        fi
        warn "Attempt $attempt/$max_attempts failed. Retrying in ${delay}s..."
        sleep "$delay"
        delay=$(( delay * 2 ))
        attempt=$(( attempt + 1 ))
    done
}

# Usage:
retry 5 2 curl -sf https://api.example.com/health

# Parallel execution with a concurrency limit
parallel_exec() {
    local max_jobs="${1:-4}"
    shift
    local pids=()

    for cmd in "$@"; do
        eval "$cmd" &
        pids+=($!)
        if (( ${#pids[@]} >= max_jobs )); then
            wait "${pids[0]}"
            pids=("${pids[@]:1}")
        fi
    done
    wait
}

# Wait for service
wait_for_service() {
    local host="$1" port="$2" timeout="${3:-30}"
    local elapsed=0

    log "Waiting for $host:$port (timeout: ${timeout}s)..."
    until nc -z "$host" "$port" 2>/dev/null; do
        (( elapsed >= timeout )) && die "Timeout waiting for $host:$port"
        sleep 1
        elapsed=$(( elapsed + 1 ))
    done
    log "$host:$port is available"
}

# Semver comparison
version_gte() {
    # Returns 0 if $1 >= $2
    printf '%s\n%s' "$2" "$1" | sort -V -C
}

# Safe secret handling
read_secret() {
    local prompt="$1"
    local secret
    read -rsp "$prompt: " secret
    echo
    printf '%s' "$secret"
}

3.4 A complete deploy script

#!/usr/bin/env bash
set -euo pipefail

# === Deploy script for a containerized application ===
# (assumes the log/warn/die/retry helpers from 3.1-3.3 are sourced)

readonly APP_NAME="mywebapp"
readonly REGISTRY="123456789.dkr.ecr.eu-west-1.amazonaws.com"
readonly NAMESPACE="production"
readonly HEALTH_ENDPOINT="/api/health"
readonly DEPLOY_TIMEOUT=300

deploy() {
    local tag="$1"
    local image="${REGISTRY}/${APP_NAME}:${tag}"

    log "Deploying $image to $NAMESPACE..."

    # 1. Check that the image exists in the registry
    if ! docker manifest inspect "$image" &>/dev/null; then
        die "Image $image not found in registry"
    fi

    # 2. Record the current deployment image
    local current_image
    current_image=$(kubectl -n "$NAMESPACE" get deploy "$APP_NAME" \
        -o jsonpath='{.spec.template.spec.containers[0].image}' 2>/dev/null || echo "none")
    log "Current image: $current_image"

    # 3. Apply the new deployment image
    kubectl -n "$NAMESPACE" set image "deploy/$APP_NAME" \
        "$APP_NAME=$image"

    # 4. Wait for the rollout
    log "Waiting for rollout (timeout: ${DEPLOY_TIMEOUT}s)..."
    if ! kubectl -n "$NAMESPACE" rollout status "deploy/$APP_NAME" \
        --timeout="${DEPLOY_TIMEOUT}s"; then

        warn "Rollout failed! Initiating rollback..."
        kubectl -n "$NAMESPACE" rollout undo "deploy/$APP_NAME"
        kubectl -n "$NAMESPACE" rollout status "deploy/$APP_NAME" \
            --timeout="${DEPLOY_TIMEOUT}s"
        die "Deploy failed, rolled back to previous version"
    fi

    # 5. Health check
    local service_ip
    service_ip=$(kubectl -n "$NAMESPACE" get svc "$APP_NAME" \
        -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')

    log "Running health check on $service_ip..."
    retry 10 3 curl -sf "http://${service_ip}${HEALTH_ENDPOINT}"

    log "Deploy successful! $APP_NAME is running $image"
}

# Argument parsing and execution
TAG="${1:?Usage: $0 <tag>}"
deploy "$TAG"

4. Version control — advanced Git

4.1 Branching strategies

# === Git Flow ===
# main      ────●────────────────●──────────── (releases)
#               ↑                ↑
# develop   ──●──●──●──●──●──●──●──●────────── (integration)
#              ↑     ↑        ↑
# feature/x ──●──●──┘        │
# feature/y ─────●──●──●─────┘
# hotfix/z  ──────────────────────●──●──→ main + develop

# === Trunk-Based Development (preferred for DevOps) ===
# main      ──●──●──●──●──●──●──●──●──●──●── (continuous deploy)
#              ↑  ↑     ↑     ↑
# short-lived  │  │     │     │
# branches  ───●──┘  ───●─────┘
# (max 1-2 days)

# Essential workflow Git commands:
git checkout -b feature/add-auth
# ... development ...
git add -A && git commit -m "feat: add JWT authentication"
git push -u origin feature/add-auth
# Create Pull Request → Code Review → Merge

# Interactive rebase (squash commits):
git rebase -i HEAD~5                   # Rewrite the last 5 commits
# pick abc1234 feat: add auth endpoint
# squash def5678 fix: typo
# squash ghi9012 fix: tests
# → A single clean commit

# Cherry-pick (apply a commit onto another branch):
git cherry-pick abc1234

# Bisect (find the commit that introduced a bug):
git bisect start
git bisect bad HEAD
git bisect good v2.1.0
# Git binary-searches automatically between the two points

4.2 Conventional Commits and semantic versioning

# Format: <type>(<scope>): <description>
# Types: feat, fix, docs, style, refactor, perf, test, ci, chore

git commit -m "feat(auth): add OAuth2 Google login"
git commit -m "fix(api): handle null response from payment gateway"
git commit -m "perf(db): add index on users.email column"
git commit -m "ci: add SonarQube analysis step"
git commit -m "feat!: redesign user API (BREAKING CHANGE)"

# Semantic Versioning: MAJOR.MINOR.PATCH
# MAJOR: breaking changes (feat!)
# MINOR: new features (feat)
# PATCH: bug fixes (fix)
# E.g.: 2.4.1 → fix → 2.4.2, feat → 2.5.0, breaking → 3.0.0

# Automate with semantic-release or standard-version:
npx standard-version              # Generates CHANGELOG + bumps the version
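
The bump rules above are mechanical, so they are easy to script. A minimal sketch (the `bump_version` helper is illustrative, not a standard tool — in practice semantic-release derives the bump from commit messages):

```shell
#!/usr/bin/env bash
# Sketch: bump a SemVer string according to the change type.

bump_version() {
    local version="$1" bump="$2"
    local major minor patch
    IFS='.' read -r major minor patch <<< "$version"   # Split MAJOR.MINOR.PATCH
    case "$bump" in
        major) echo "$(( major + 1 )).0.0" ;;          # Breaking change
        minor) echo "${major}.$(( minor + 1 )).0" ;;   # New feature
        patch) echo "${major}.${minor}.$(( patch + 1 ))" ;;  # Bug fix
        *)     echo "unknown bump type: $bump" >&2; return 1 ;;
    esac
}

bump_version 2.4.1 patch   # → 2.4.2
bump_version 2.4.1 minor   # → 2.5.0
bump_version 2.4.1 major   # → 3.0.0
```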

4.3 Git hooks for automation

#!/usr/bin/env bash
# .git/hooks/pre-commit (or via Husky / the pre-commit framework)
set -euo pipefail

echo "Running pre-commit checks..."

# Lint
if command -v flake8 &>/dev/null; then
    flake8 --max-line-length=120 .
fi

# Secrets detection (prevents accidentally committing keys/passwords)
if command -v gitleaks &>/dev/null; then
    gitleaks detect --staged --verbose
fi

# Terraform format check
if command -v terraform &>/dev/null; then
    terraform fmt -check -recursive
fi

echo "Pre-commit checks passed!"

5. CI/CD — concepts and architectures

5.1 The complete CI/CD pipeline

┌────────────────────────────────────────────────────────────────┐
│                         CI/CD PIPELINE                         │
│                                                                │
│ ┌──────┐ ┌───────┐ ┌──────┐ ┌──────┐ ┌───────┐ ┌───────┐      │
│ │Source│→│Build  │→│Test  │→│Scan  │→│Deploy │→│Monitor│      │
│ │      │ │       │ │      │ │      │ │       │ │       │      │
│ │ Git  │ │Compile│ │Unit  │ │SAST  │ │Staging│ │Metrics│      │
│ │ Push │ │Docker │ │Integr│ │DAST  │ │Prod   │ │Alerts │      │
│ │ PR   │ │Build  │ │E2E   │ │Deps  │ │Canary │ │Logs   │      │
│ └──────┘ └───────┘ └──────┘ └──────┘ └───────┘ └───────┘      │
│                                                                │
│ ◄───── Continuous Integration ─────►  ◄─ Continuous Delivery ─►│
│ ◄─────────────────── Continuous Deployment ───────────────────►│
└────────────────────────────────────────────────────────────────┘

CI = automatic Build + Test on every commit
CD (Delivery) = an artifact ready to deploy (manual approval for prod)
CD (Deployment) = automatic deploy to production, with no human intervention

5.2 Deployment strategies

Blue-Green:
┌────────────┐     ┌────────────┐
│ Blue (v1)  │←LB  │ Green (v2) │  ← Deploy v2 to Green
│ ACTIVE     │     │ IDLE       │
└────────────┘     └────────────┘
                        │
Switch the load balancer: ▼
┌────────────┐     ┌────────────┐
│ Blue (v1)  │     │ Green (v2) │←LB  ← Green becomes active
│ IDLE       │     │ ACTIVE     │     Instant rollback: switch back
└────────────┘     └────────────┘

Canary:
100% traffic → v1
 ├── 5% traffic → v2 (canary)   ← monitor errors
 ├── 25% traffic → v2           ← scale up if OK
 ├── 50% traffic → v2
 └── 100% traffic → v2          ← full rollout

Rolling Update (Kubernetes default):
Pod v1  Pod v1  Pod v1  Pod v1
Pod v2  Pod v1  Pod v1  Pod v1    ← one pod at a time
Pod v2  Pod v2  Pod v1  Pod v1
Pod v2  Pod v2  Pod v2  Pod v1
Pod v2  Pod v2  Pod v2  Pod v2    ← complete
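
The canary steps in the diagram can be sketched as promotion logic in bash. Everything here is illustrative: `check_canary_health` is a stub standing in for a real probe (e.g. querying an error-rate metric from Prometheus), and the traffic shift itself would be done by your mesh or load balancer, not by `echo`.

```shell
#!/usr/bin/env bash
# Sketch: canary promotion — shift traffic in steps, roll back on failure.

check_canary_health() {
    # Stub: always healthy. Replace with a real metrics/error-rate query.
    return 0
}

canary_rollout() {
    local step
    for step in 5 25 50 100; do
        echo "Routing ${step}% of traffic to v2..."
        if ! check_canary_health; then
            echo "Canary unhealthy at ${step}% — rolling back to v1"
            return 1
        fi
    done
    echo "Canary promoted: 100% of traffic on v2"
}

canary_rollout
```

The key design point is that every step has an automated gate: promotion continues only while the health check passes, and any failure triggers an immediate rollback.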

6. Jenkins — declarative and scripted pipelines

6.1 A declarative Jenkinsfile

// Jenkinsfile (Declarative Pipeline)
pipeline {
    agent {
        docker {
            image 'python:3.11-slim'
            args '-v /var/run/docker.sock:/var/run/docker.sock'
        }
    }

    environment {
        REGISTRY     = credentials('ecr-registry-url')
        APP_NAME     = 'mywebapp'
        AWS_REGION   = 'eu-west-1'
    }

    options {
        timeout(time: 30, unit: 'MINUTES')
        disableConcurrentBuilds()
        buildDiscarder(logRotator(numToKeepStr: '10'))
    }

    stages {
        stage('Checkout') {
            steps {
                checkout scm
                script {
                    env.GIT_COMMIT_SHORT = sh(
                        script: 'git rev-parse --short HEAD',
                        returnStdout: true
                    ).trim()
                    env.IMAGE_TAG = "${env.BRANCH_NAME}-${env.GIT_COMMIT_SHORT}"
                }
            }
        }

        stage('Install Dependencies') {
            steps {
                sh '''
                    python -m pip install --upgrade pip
                    pip install -r requirements.txt
                    pip install -r requirements-dev.txt
                '''
            }
        }

        stage('Lint & Format') {
            parallel {
                stage('Flake8') {
                    steps {
                        sh 'flake8 --max-line-length=120 src/'
                    }
                }
                stage('Black') {
                    steps {
                        sh 'black --check src/'
                    }
                }
                stage('Mypy') {
                    steps {
                        sh 'mypy src/ --ignore-missing-imports'
                    }
                }
            }
        }

        stage('Unit Tests') {
            steps {
                sh 'pytest tests/unit/ -v --junitxml=reports/unit.xml --cov=src --cov-report=xml'
            }
            post {
                always {
                    junit 'reports/unit.xml'
                    cobertura coberturaReportFile: 'coverage.xml'
                }
            }
        }

        stage('Build Docker Image') {
            steps {
                sh """
                    docker build \
                        --build-arg BUILD_DATE=\$(date -u +%Y-%m-%dT%H:%M:%SZ) \
                        --build-arg GIT_COMMIT=${env.GIT_COMMIT_SHORT} \
                        -t ${REGISTRY}/${APP_NAME}:${IMAGE_TAG} \
                        -t ${REGISTRY}/${APP_NAME}:latest .
                """
            }
        }

        stage('Integration Tests') {
            steps {
                sh '''
                    docker-compose -f docker-compose.test.yml up -d
                    sleep 10
                    pytest tests/integration/ -v --junitxml=reports/integration.xml
                '''
            }
            post {
                always {
                    sh 'docker-compose -f docker-compose.test.yml down -v'
                    junit 'reports/integration.xml'
                }
            }
        }

        stage('Security Scan') {
            parallel {
                stage('Trivy Image Scan') {
                    steps {
                        sh "trivy image --severity HIGH,CRITICAL --exit-code 1 ${REGISTRY}/${APP_NAME}:${IMAGE_TAG}"
                    }
                }
                stage('Dependency Check') {
                    steps {
                        sh 'safety check -r requirements.txt'
                    }
                }
            }
        }

        stage('Push to Registry') {
            when {
                branch 'main'
            }
            steps {
                withCredentials([usernamePassword(
                    credentialsId: 'ecr-credentials',
                    usernameVariable: 'AWS_ACCESS_KEY_ID',
                    passwordVariable: 'AWS_SECRET_ACCESS_KEY'
                )]) {
                    sh """
                        aws ecr get-login-password --region ${AWS_REGION} | \
                            docker login --username AWS --password-stdin ${REGISTRY}
                        docker push ${REGISTRY}/${APP_NAME}:${IMAGE_TAG}
                        docker push ${REGISTRY}/${APP_NAME}:latest
                    """
                }
            }
        }

        stage('Deploy to Staging') {
            when { branch 'main' }
            steps {
                sh """
                    kubectl --context staging -n staging \
                        set image deploy/${APP_NAME} ${APP_NAME}=${REGISTRY}/${APP_NAME}:${IMAGE_TAG}
                    kubectl --context staging -n staging \
                        rollout status deploy/${APP_NAME} --timeout=300s
                """
            }
        }

        stage('Deploy to Production') {
            when { branch 'main' }
            input {
                message 'Deploy to production?'
                ok 'Yes, deploy!'
            }
            steps {
                sh """
                    kubectl --context production -n production \
                        set image deploy/${APP_NAME} ${APP_NAME}=${REGISTRY}/${APP_NAME}:${IMAGE_TAG}
                    kubectl --context production -n production \
                        rollout status deploy/${APP_NAME} --timeout=300s
                """
            }
        }
    }

    post {
        success {
            slackSend(channel: '#deploys',
                      color: 'good',
                      message: "✅ ${APP_NAME} ${IMAGE_TAG} deployed successfully")
        }
        failure {
            slackSend(channel: '#deploys',
                      color: 'danger',
                      message: "❌ ${APP_NAME} pipeline failed: ${env.BUILD_URL}")
        }
    }
}

7. GitHub Actions and GitLab CI/CD

7.1 GitHub Actions

# .github/workflows/ci-cd.yml
name: CI/CD Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ['3.10', '3.11', '3.12']

    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: testdb
          POSTGRES_USER: testuser
          POSTGRES_PASSWORD: testpass
        ports: ['5432:5432']
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5

    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
          cache: 'pip'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install -r requirements-dev.txt

      - name: Run tests
        env:
          DATABASE_URL: postgresql://testuser:testpass@localhost:5432/testdb
        run: pytest -v --cov=src --cov-report=xml

      - name: Upload coverage
        uses: codecov/codecov-action@v4
        if: matrix.python-version == '3.12'
        with:
          file: coverage.xml

  build-and-push:
    needs: test
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write

    outputs:
      image-tag: ${{ steps.meta.outputs.tags }}

    steps:
      - uses: actions/checkout@v4

      - name: Docker meta
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=
            type=raw,value=latest

      - name: Login to GHCR
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy:
    needs: build-and-push
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production

    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: eu-west-1

      - name: Update kubeconfig
        run: aws eks update-kubeconfig --name my-cluster --region eu-west-1

      - name: Deploy to Kubernetes
        run: |
          kubectl set image deploy/mywebapp \
            mywebapp=${{ needs.build-and-push.outputs.image-tag }} \
            -n production
          kubectl rollout status deploy/mywebapp -n production --timeout=300s

7.2 GitLab CI/CD

# .gitlab-ci.yml
stages:
  - test
  - build
  - security
  - deploy

variables:
  IMAGE: $CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA

test:
  stage: test
  image: python:3.12-slim
  services:
    - postgres:16
  variables:
    POSTGRES_DB: testdb
    POSTGRES_USER: test
    POSTGRES_PASSWORD: test
    DATABASE_URL: "postgresql://test:test@postgres/testdb"
  script:
    - pip install -r requirements.txt -r requirements-dev.txt
    - pytest -v --junitxml=report.xml --cov=src
  artifacts:
    reports:
      junit: report.xml

build:
  stage: build
  image: docker:24
  services:
    - docker:24-dind
  script:
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
    - docker build -t $IMAGE .
    - docker push $IMAGE
  only:
    - main

trivy_scan:
  stage: security
  image: aquasec/trivy:latest
  script:
    - trivy image --severity HIGH,CRITICAL --exit-code 1 $IMAGE
  only:
    - main

deploy_production:
  stage: deploy
  image: bitnami/kubectl:latest
  script:
    - kubectl set image deploy/myapp myapp=$IMAGE -n production
    - kubectl rollout status deploy/myapp -n production --timeout=300s
  environment:
    name: production
    url: https://app.example.com
  when: manual
  only:
    - main

8. Docker — advanced containerization

8.1 Optimized Dockerfile — multi-stage build

# === Stage 1: Build ===
FROM python:3.12-slim AS builder

WORKDIR /build

# Install build dependencies (separate cache layer)
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential libpq-dev \
    && rm -rf /var/lib/apt/lists/*

# Copy only requirements (layer is cached if they don't change)
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# === Stage 2: Production ===
FROM python:3.12-slim AS production

# Metadata
LABEL maintainer="devops@company.com" \
      version="1.0" \
      description="Production web application"

# Non-root user
RUN groupadd -r appuser && useradd -r -g appuser -d /app -s /sbin/nologin appuser

# Runtime dependencies only
RUN apt-get update && apt-get install -y --no-install-recommends \
        libpq5 curl \
    && rm -rf /var/lib/apt/lists/*

# Copy the Python packages from the builder stage
COPY --from=builder /install /usr/local

# Copy the application code
WORKDIR /app
COPY --chown=appuser:appuser src/ ./src/
COPY --chown=appuser:appuser alembic/ ./alembic/
COPY --chown=appuser:appuser alembic.ini .

# Expose the port
EXPOSE 8000

# Health check
HEALTHCHECK --interval=30s --timeout=5s --retries=3 --start-period=10s \
    CMD curl -f http://localhost:8000/health || exit 1

# Switch to the non-root user
USER appuser

# Exec-form entrypoint (signals are propagated correctly)
ENTRYPOINT ["python", "-m", "uvicorn"]
CMD ["src.main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]

8.2 .dockerignore

# .dockerignore
.git
.gitignore
.env
.env.*
__pycache__
*.pyc
*.pyo
.pytest_cache
.mypy_cache
.coverage
htmlcov/
*.egg-info/
dist/
build/
node_modules/
.vscode/
.idea/
docker-compose*.yml
Dockerfile*
README.md
docs/
tests/
*.md

8.3 Essential Docker commands

# Build
docker build -t myapp:v1 .
docker build -t myapp:v1 --no-cache .           # Without cache
docker build -t myapp:v1 --target builder .     # A single stage only

# Run
docker run -d --name myapp -p 8080:8000 myapp:v1
docker run -d --name myapp \
    -p 8080:8000 \
    -v $(pwd)/data:/app/data \
    -e DATABASE_URL="postgres://..." \
    --memory=512m \
    --cpus=1.5 \
    --restart=unless-stopped \
    --network=mynetwork \
    myapp:v1

# Debug
docker exec -it myapp /bin/bash
docker logs -f --tail 100 myapp
docker inspect myapp | jq '.[0].NetworkSettings'
docker stats                                     # Live resource usage
docker top myapp                                 # Processes inside the container

# Cleanup
docker system prune -af --volumes                # NUCLEAR: removes everything unused
docker image prune -a                            # Remove unused images
docker volume prune                              # Remove orphaned volumes

# Registry
docker tag myapp:v1 registry.example.com/myapp:v1
docker push registry.example.com/myapp:v1
docker pull registry.example.com/myapp:v1

9. Docker Compose and multi-container applications

# docker-compose.yml — a complete application
services:
  app:
    build:
      context: .
      dockerfile: Dockerfile
      target: production
    ports:
      - "8000:8000"
    environment:
      - DATABASE_URL=postgresql://appuser:secret@db:5432/appdb
      - REDIS_URL=redis://redis:6379/0
      - CELERY_BROKER_URL=redis://redis:6379/1
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 5s
      retries: 3
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: '1.0'
    restart: unless-stopped
    networks:
      - frontend
      - backend

  worker:
    build: .
    command: celery -A src.celery_app worker -l info -c 4
    environment:
      - DATABASE_URL=postgresql://appuser:secret@db:5432/appdb
      - CELERY_BROKER_URL=redis://redis:6379/1
    depends_on:
      - db
      - redis
    restart: unless-stopped
    networks:
      - backend

  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: appdb
      POSTGRES_USER: appuser
      POSTGRES_PASSWORD: secret
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U appuser -d appdb"]
      interval: 10s
      timeout: 5s
      retries: 5
    networks:
      - backend

  redis:
    image: redis:7-alpine
    command: redis-server --maxmemory 128mb --maxmemory-policy allkeys-lru
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
    volumes:
      - redis_data:/data
    networks:
      - backend

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/conf.d/default.conf:ro
      - ./certs:/etc/nginx/certs:ro
    depends_on:
      - app
    networks:
      - frontend

volumes:
  postgres_data:
  redis_data:

networks:
  frontend:
  backend:
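
A detail worth knowing about `healthcheck.retries`: Docker flips a container to `unhealthy` only after that many consecutive probe failures, and any success resets the counter. A small sketch of that rule, as I read the documented semantics:

```python
def health_status(probe_results: list[bool], retries: int = 3) -> str:
    """Fold probe outcomes (True = passed) into a Docker-style health status:
    'unhealthy' only after `retries` consecutive failures; a success resets."""
    status = "starting"
    consecutive_failures = 0
    for passed in probe_results:
        if passed:
            consecutive_failures = 0
            status = "healthy"
        else:
            consecutive_failures += 1
            if consecutive_failures >= retries:
                status = "unhealthy"
    return status

print(health_status([True, False, False, True]))   # two failures, then recovery
print(health_status([True, False, False, False]))  # three failures in a row
```

With `retries: 3`, two transient probe failures never take the `app` service out of rotation, which is exactly why `depends_on.condition: service_healthy` waits on this status.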

10. Kubernetes — Orchestration at scale

10.1 Kubernetes architecture

┌─────────────────────────────────────────────────────────────┐
│                        CONTROL PLANE                        │
│                                                             │
│  ┌───────────────┐  ┌─────────────┐  ┌───────────────────┐  │
│  │ kube-apiserver│  │ etcd        │  │ kube-scheduler    │  │
│  │ (REST API,    │  │ (key-value  │  │ (picks the node   │  │
│  │  authn/authz) │  │  store,     │  │  each pod is      │  │
│  │               │  │  cluster    │  │  placed on)       │  │
│  │               │  │  state)     │  │                   │  │
│  └───────────────┘  └─────────────┘  └───────────────────┘  │
│  ┌──────────────────────┐  ┌────────────────────────────┐   │
│  │ kube-controller-mgr  │  │ cloud-controller-manager   │   │
│  │ (ReplicaSet, Deploy- │  │ (LoadBalancer, volumes,    │   │
│  │  ment, Node, Job...) │  │  node lifecycle — cloud)   │   │
│  └──────────────────────┘  └────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
         │                │                │
         ▼                ▼                ▼
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│   Worker     │  │   Worker     │  │   Worker     │
│   Node 1     │  │   Node 2     │  │   Node 3     │
│              │  │              │  │              │
│ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │
│ │ kubelet  │ │  │ │ kubelet  │ │  │ │ kubelet  │ │
│ │(agent)   │ │  │ │          │ │  │ │          │ │
│ ├──────────┤ │  │ ├──────────┤ │  │ ├──────────┤ │
│ │kube-proxy│ │  │ │kube-proxy│ │  │ │kube-proxy│ │
│ │(network) │ │  │ │          │ │  │ │          │ │
│ ├──────────┤ │  │ ├──────────┤ │  │ ├──────────┤ │
│ │Container │ │  │ │Container │ │  │ │Container │ │
│ │Runtime   │ │  │ │Runtime   │ │  │ │Runtime   │ │
│ │(containerd)│ │ │(containerd)│ │ │(containerd)│ │
│ ├──────────┤ │  │ ├──────────┤ │  │ ├──────────┤ │
│ │ Pod A    │ │  │ │ Pod C    │ │  │ │ Pod E    │ │
│ │ Pod B    │ │  │ │ Pod D    │ │  │ │ Pod F    │ │
│ └──────────┘ │  │ └──────────┘ │  │ └──────────┘ │
└──────────────┘  └──────────────┘  └──────────────┘

10.2 Kubernetes manifests — a complete application

# namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: myapp
  labels:
    app.kubernetes.io/name: myapp
---
# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: myapp-config
  namespace: myapp
data:
  APP_ENV: "production"
  LOG_LEVEL: "info"
  ALLOWED_HOSTS: "app.example.com"
---
# secret.yaml (in practice: use the External Secrets Operator or Sealed Secrets)
apiVersion: v1
kind: Secret
metadata:
  name: myapp-secrets
  namespace: myapp
type: Opaque
stringData:
  DATABASE_URL: "postgresql://user:pass@db-host:5432/appdb"
  SECRET_KEY: "super-secret-key-here"
---
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: myapp
  labels:
    app: myapp
    version: v1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1             # Max extra pods during the update
      maxUnavailable: 0       # No pod ever unavailable (zero downtime)
  template:
    metadata:
      labels:
        app: myapp
        version: v1
    spec:
      serviceAccountName: myapp-sa
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
      containers:
        - name: myapp
          image: registry.example.com/myapp:v1.2.3
          ports:
            - containerPort: 8000
              name: http
          envFrom:
            - configMapRef:
                name: myapp-config
            - secretRef:
                name: myapp-secrets
          resources:
            requests:
              cpu: 250m           # 0.25 core
              memory: 256Mi
            limits:
              cpu: 1000m          # 1 core
              memory: 512Mi
          readinessProbe:
            httpGet:
              path: /health/ready
              port: http
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health/live
              port: http
            initialDelaySeconds: 15
            periodSeconds: 20
          startupProbe:
            httpGet:
              path: /health/live
              port: http
            failureThreshold: 30
            periodSeconds: 2
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: myapp
---
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: myapp
spec:
  selector:
    app: myapp
  ports:
    - port: 80
      targetPort: http
      protocol: TCP
  type: ClusterIP
---
# hpa.yaml (Horizontal Pod Autoscaler)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
  namespace: myapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
---
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-ingress
  namespace: myapp
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/rate-limit: "100"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - app.example.com
      secretName: myapp-tls
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp
                port:
                  number: 80
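
The HPA above follows the standard Kubernetes scaling rule, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped between minReplicas and maxReplicas. A quick sketch with the values from the manifest:

```python
import math

def hpa_desired_replicas(current_replicas: int, current_util: float,
                         target_util: float,
                         min_replicas: int = 3, max_replicas: int = 20) -> int:
    """Kubernetes HPA rule: ceil(current * currentMetric / targetMetric), clamped."""
    desired = math.ceil(current_replicas * current_util / target_util)
    return max(min_replicas, min(max_replicas, desired))

# 3 replicas averaging 140% CPU against the 70% target -> 6 replicas
print(hpa_desired_replicas(3, current_util=140, target_util=70))
```

The `behavior` block then rate-limits how fast the controller may move toward that desired count (at most 4 new pods per minute here, and a 5-minute stabilization window before scaling down).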

10.3 Essential kubectl commands

# Cluster information
kubectl cluster-info
kubectl get nodes -o wide
kubectl top nodes                           # Resource usage per node

# Day-to-day operations
kubectl apply -f manifests/                 # Apply every file in the directory
kubectl get all -n myapp                    # All resources in the namespace
kubectl get pods -n myapp -o wide           # Pods with extra details
kubectl describe pod myapp-xxx -n myapp     # Full details for a pod
kubectl logs -f myapp-xxx -n myapp          # Live logs
kubectl logs myapp-xxx -n myapp --previous  # Logs of the previous container (after a crash)
kubectl exec -it myapp-xxx -n myapp -- /bin/sh  # Shell inside the pod

# Debugging
kubectl get events -n myapp --sort-by='.lastTimestamp'
kubectl debug pod/myapp-xxx -it --image=busybox  # Ephemeral debug container

# Deployment management
kubectl rollout status deploy/myapp -n myapp
kubectl rollout history deploy/myapp -n myapp
kubectl rollout undo deploy/myapp -n myapp          # Roll back to the previous revision
kubectl rollout undo deploy/myapp -n myapp --to-revision=3

# Scaling
kubectl scale deploy/myapp -n myapp --replicas=5

# Port forward (local debugging)
kubectl port-forward svc/myapp 8080:80 -n myapp

11. Kubernetes — Advanced objects and operations

11.1 Jobs and CronJobs

# CronJob for database backups
apiVersion: batch/v1
kind: CronJob
metadata:
  name: db-backup
  namespace: myapp
spec:
  schedule: "0 2 * * *"     # Daily at 02:00
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      backoffLimit: 3
      activeDeadlineSeconds: 3600
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: postgres:16  # note: the aws CLI used below is not in this image; build a custom one
              command:
                - /bin/bash
                - -c
                - |
                  TIMESTAMP=$(date +%Y%m%d_%H%M%S)
                  pg_dump "$DATABASE_URL" | gzip > /backup/db_${TIMESTAMP}.sql.gz
                  aws s3 cp /backup/db_${TIMESTAMP}.sql.gz \
                    s3://my-backups/db/db_${TIMESTAMP}.sql.gz
                  # Local cleanup
                  find /backup -mtime +7 -delete
              envFrom:
                - secretRef:
                    name: myapp-secrets
              volumeMounts:
                - name: backup-vol
                  mountPath: /backup
          volumes:
            - name: backup-vol
              emptyDir:
                sizeLimit: 5Gi
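
The `find /backup -mtime +7 -delete` step enforces a seven-day local retention. The same rule as a pure function (file names and ages below are illustrative):

```python
def expired_backups(ages_days: dict[str, float], max_age_days: float = 7) -> list[str]:
    """Names of backups strictly older than max_age_days (like `find -mtime +7`)."""
    return sorted(name for name, age in ages_days.items() if age > max_age_days)

backups = {
    "db_20250101_020000.sql.gz": 9.5,   # age in days
    "db_20250107_020000.sql.gz": 3.2,
    "db_20250109_020000.sql.gz": 1.1,
}
print(expired_backups(backups))
```

Only the local copies are pruned; the S3 copies are kept, where a bucket lifecycle policy is the usual place to apply long-term retention.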

11.2 Network Policies

# Allow traffic to the db only from pods labeled app=myapp
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: db-allow-app-only
  namespace: myapp
spec:
  podSelector:
    matchLabels:
      app: postgres
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: myapp
      ports:
        - port: 5432
          protocol: TCP

12. Helm and Kubernetes package management

# Helm = the package manager for Kubernetes
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update

# Install a chart:
helm install my-postgres bitnami/postgresql \
    --namespace myapp \
    --set auth.postgresPassword=secret \
    --set primary.persistence.size=50Gi

# Inspect releases:
helm list -n myapp
helm status my-postgres -n myapp

# Upgrade:
helm upgrade my-postgres bitnami/postgresql \
    --namespace myapp \
    --set primary.resources.limits.memory=2Gi

# Rollback:
helm rollback my-postgres 1 -n myapp

# Your own chart: mychart/values.yaml
replicaCount: 3
image:
  repository: registry.example.com/myapp
  tag: "latest"
  pullPolicy: IfNotPresent
service:
  type: ClusterIP
  port: 80
ingress:
  enabled: true
  host: app.example.com
resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 1000m
    memory: 512Mi
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 20
  targetCPUUtilization: 70

13. AWS — Core services for DevOps

13.1 Map of the essential AWS services

┌───────────────────────────────────────────────────────────────┐
│                          AWS Cloud                            │
│                                                               │
│  COMPUTE           CONTAINERS        NETWORKING               │
│  ┌────────────┐    ┌───────────┐    ┌──────────────────┐      │
│  │ EC2        │    │ ECS/EKS   │    │ VPC              │      │
│  │ Lambda     │    │ Fargate   │    │ ALB/NLB          │      │
│  │ ASG        │    │ ECR       │    │ Route53 (DNS)    │      │
│  └────────────┘    └───────────┘    │ CloudFront (CDN) │      │
│                                     │ API Gateway      │      │
│  STORAGE           DATABASE         └──────────────────┘      │
│  ┌────────────┐    ┌─────────────┐                            │
│  │ S3         │    │ RDS         │   SECURITY                 │
│  │ EBS        │    │ DynamoDB    │   ┌──────────────────┐     │
│  │ EFS        │    │ ElastiCache │   │ IAM              │     │
│  └────────────┘    └─────────────┘   │ KMS              │     │
│                                      │ Secrets Manager  │     │
│  CI/CD             MONITORING        │ WAF              │     │
│  ┌──────────────┐  ┌────────────┐    └──────────────────┘     │
│  │ CodePipeline │  │ CloudWatch │                             │
│  │ CodeBuild    │  │ X-Ray      │    IaC                      │
│  │ CodeDeploy   │  │ CloudTrail │    ┌──────────────────┐     │
│  └──────────────┘  └────────────┘    │ CloudFormation   │     │
│                                      │ (or Terraform)   │     │
│                                      └──────────────────┘     │
└───────────────────────────────────────────────────────────────┘

13.2 AWS CLI — common operations

# Configuration
aws configure                          # Initial setup
aws sts get-caller-identity            # Check the current identity

# S3
aws s3 ls                              # List buckets
aws s3 sync ./dist s3://my-bucket/app/ --delete
aws s3 cp backup.sql.gz s3://my-backups/ --storage-class GLACIER

# EC2
aws ec2 describe-instances \
    --filters "Name=tag:Environment,Values=production" \
    --query 'Reservations[].Instances[].{ID:InstanceId,IP:PrivateIpAddress,State:State.Name}' \
    --output table

# ECS
aws ecs update-service --cluster prod --service myapp \
    --force-new-deployment

# ECR
aws ecr get-login-password --region eu-west-1 | \
    docker login --username AWS --password-stdin 123456789.dkr.ecr.eu-west-1.amazonaws.com

# EKS
aws eks update-kubeconfig --name my-cluster --region eu-west-1

# Secrets Manager
aws secretsmanager get-secret-value --secret-id prod/myapp/db \
    --query SecretString --output text | jq .

# Lambda
aws lambda invoke --function-name my-function \
    --payload '{"key": "value"}' response.json

14. AWS — Infrastructure as Code (IaC) with Terraform

14.1 Terraform project structure

terraform/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   └── prod/
├── modules/
│   ├── networking/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── eks/
│   ├── rds/
│   └── s3/
└── global/
    └── iam/

14.2 VPC + EKS modules

# modules/networking/main.tf
resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name        = "${var.project}-vpc"
    Environment = var.environment
  }
}

resource "aws_subnet" "private" {
  count             = length(var.private_subnets)
  vpc_id            = aws_vpc.main.id
  cidr_block        = var.private_subnets[count.index]
  availability_zone = var.azs[count.index]

  tags = {
    Name                              = "${var.project}-private-${var.azs[count.index]}"
    "kubernetes.io/role/internal-elb" = "1"
  }
}

resource "aws_subnet" "public" {
  count                   = length(var.public_subnets)
  vpc_id                  = aws_vpc.main.id
  cidr_block              = var.public_subnets[count.index]
  availability_zone       = var.azs[count.index]
  map_public_ip_on_launch = true

  tags = {
    Name                     = "${var.project}-public-${var.azs[count.index]}"
    "kubernetes.io/role/elb" = "1"
  }
}

# environments/prod/main.tf
module "networking" {
  source = "../../modules/networking"

  project         = "myapp"
  environment     = "prod"
  vpc_cidr        = "10.0.0.0/16"
  azs             = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
}

module "eks" {
  source = "../../modules/eks"

  cluster_name    = "myapp-prod"
  cluster_version = "1.29"
  vpc_id          = module.networking.vpc_id
  subnet_ids      = module.networking.private_subnet_ids

  node_groups = {
    general = {
      instance_types = ["m6i.large"]
      min_size       = 3
      max_size       = 10
      desired_size   = 3
    }
  }
}

# Terraform workflow:
cd terraform/environments/prod
terraform init                          # Initialize (downloads providers)
terraform plan -out=tfplan              # Preview the changes
terraform apply tfplan                  # Apply the plan
terraform destroy                       # Destroys everything (CAUTION!)

# State management:
# S3 backend + DynamoDB locking (recommended for teams):
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "eu-west-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
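
The subnet lists in `environments/prod/main.tf` are just consecutive /24 slices of the VPC's /16. Python's `ipaddress` module can derive them, and catch overlaps, before they go into the tfvars:

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
all_24s = list(vpc.subnets(new_prefix=24))        # the 256 possible /24 blocks

private = [str(all_24s[i]) for i in (1, 2, 3)]        # 10.0.1.0/24 ...
public = [str(all_24s[i]) for i in (101, 102, 103)]   # 10.0.101.0/24 ...
print(private)
print(public)

# Sanity check: no block overlaps any other
blocks = [ipaddress.ip_network(c) for c in private + public]
assert not any(a.overlaps(b) for a in blocks for b in blocks if a is not b)
```

The same check is worth running whenever a new subnet tier is added, since Terraform will happily plan overlapping CIDRs and let AWS reject them at apply time.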

15. Python for DevOps automation

15.1 AWS automation script with boto3

#!/usr/bin/env python3
"""AWS infrastructure automation utilities."""

import boto3
import json
from datetime import datetime, timedelta
from botocore.exceptions import ClientError


class AWSManager:
    """Manages common AWS operations for DevOps."""

    def __init__(self, region: str = "eu-west-1"):
        self.region = region
        self.ec2 = boto3.client("ec2", region_name=region)
        self.ecs = boto3.client("ecs", region_name=region)
        self.s3 = boto3.client("s3", region_name=region)
        self.cloudwatch = boto3.client("cloudwatch", region_name=region)
        self.secretsmanager = boto3.client("secretsmanager", region_name=region)

    def get_instances_by_tag(self, tag_key: str, tag_value: str) -> list[dict]:
        """Get EC2 instances filtered by tag."""
        response = self.ec2.describe_instances(
            Filters=[
                {"Name": f"tag:{tag_key}", "Values": [tag_value]},
                {"Name": "instance-state-name", "Values": ["running"]},
            ]
        )
        instances = []
        for reservation in response["Reservations"]:
            for inst in reservation["Instances"]:
                name = next(
                    (t["Value"] for t in inst.get("Tags", []) if t["Key"] == "Name"),
                    "unnamed",
                )
                instances.append({
                    "id": inst["InstanceId"],
                    "name": name,
                    "private_ip": inst.get("PrivateIpAddress"),
                    "type": inst["InstanceType"],
                    "az": inst["Placement"]["AvailabilityZone"],
                })
        return instances

    def force_ecs_deploy(self, cluster: str, service: str) -> str:
        """Force new deployment of an ECS service."""
        response = self.ecs.update_service(
            cluster=cluster,
            service=service,
            forceNewDeployment=True,
        )
        deployment_id = response["service"]["deployments"][0]["id"]
        print(f"Triggered deployment {deployment_id} for {service}")
        return deployment_id

    def cleanup_old_snapshots(self, days: int = 30, dry_run: bool = True) -> int:
        """Delete EBS snapshots older than N days."""
        cutoff = datetime.utcnow() - timedelta(days=days)
        snapshots = self.ec2.describe_snapshots(OwnerIds=["self"])["Snapshots"]
        deleted = 0

        for snap in snapshots:
            if snap["StartTime"].replace(tzinfo=None) < cutoff:
                if dry_run:
                    print(f"[DRY RUN] Would delete {snap['SnapshotId']} "
                          f"({snap['StartTime'].date()})")
                else:
                    self.ec2.delete_snapshot(SnapshotId=snap["SnapshotId"])
                    print(f"Deleted {snap['SnapshotId']}")
                deleted += 1

        print(f"Total: {deleted} snapshots {'would be ' if dry_run else ''}deleted")
        return deleted

    def get_secret(self, secret_name: str) -> dict:
        """Retrieve a secret from AWS Secrets Manager."""
        try:
            response = self.secretsmanager.get_secret_value(SecretId=secret_name)
            return json.loads(response["SecretString"])
        except ClientError as e:
            if e.response["Error"]["Code"] == "ResourceNotFoundException":
                raise ValueError(f"Secret '{secret_name}' not found")
            raise


if __name__ == "__main__":
    mgr = AWSManager()

    # List the production instances
    instances = mgr.get_instances_by_tag("Environment", "production")
    for inst in instances:
        print(f"  {inst['name']:30s} {inst['id']:20s} {(inst['private_ip'] or '-'):15s}")

15.2 Monitoring and alerting script

#!/usr/bin/env python3
"""Health check and alerting script."""

import requests
import time
from dataclasses import dataclass
from concurrent.futures import ThreadPoolExecutor, as_completed


@dataclass
class HealthCheckResult:
    url: str
    status: str          # "healthy", "degraded", "down"
    status_code: int
    response_time_ms: float
    error: str = ""


def check_endpoint(url: str, timeout: int = 10) -> HealthCheckResult:
    """Check a single HTTP endpoint."""
    try:
        start = time.monotonic()
        response = requests.get(url, timeout=timeout)
        elapsed_ms = (time.monotonic() - start) * 1000

        if response.status_code == 200 and elapsed_ms < 2000:
            status = "healthy"
        elif response.status_code == 200:
            status = "degraded"
        else:
            status = "down"

        return HealthCheckResult(
            url=url,
            status=status,
            status_code=response.status_code,
            response_time_ms=round(elapsed_ms, 1),
        )
    except requests.RequestException as e:
        return HealthCheckResult(
            url=url, status="down", status_code=0,
            response_time_ms=0, error=str(e),
        )


def check_all_endpoints(endpoints: list[str]) -> list[HealthCheckResult]:
    """Check multiple endpoints in parallel."""
    results = []
    with ThreadPoolExecutor(max_workers=10) as executor:
        futures = {executor.submit(check_endpoint, url): url for url in endpoints}
        for future in as_completed(futures):
            results.append(future.result())
    return results


def send_alert(subject: str, body: str, webhook_url: str):
    """Send alert via Slack webhook."""
    payload = {
        "text": f"*{subject}*\n```{body}```",
        "username": "HealthCheck Bot",
    }
    requests.post(webhook_url, json=payload, timeout=10)


# Configuration
ENDPOINTS = [
    "https://app.example.com/health",
    "https://api.example.com/health",
    "https://admin.example.com/health",
]
SLACK_WEBHOOK = "https://hooks.slack.com/services/xxx/yyy/zzz"

if __name__ == "__main__":
    results = check_all_endpoints(ENDPOINTS)

    # Raport
    for r in results:
        icon = {"healthy": "✅", "degraded": "⚠️", "down": "❌"}[r.status]
        print(f"{icon} {r.url:45s} {r.status:10s} "
              f"{r.status_code:3d}  {r.response_time_ms:7.1f}ms  {r.error}")

    # Alertare pentru servicii down
    down = [r for r in results if r.status == "down"]
    if down:
        body = "\n".join(f"❌ {r.url}: {r.error or f'HTTP {r.status_code}'}"
                         for r in down)
        send_alert(f"{len(down)} service(s) DOWN", body, SLACK_WEBHOOK)

16. Monitoring, logging, and observability

16.1 The three pillars of observability

┌──────────────────────────────────────────────────────────────┐
│                        OBSERVABILITY                         │
│                                                              │
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐  │
│  │    METRICS     │  │      LOGS      │  │     TRACES     │  │
│  │                │  │                │  │                │  │
│  │ Prometheus     │  │ ELK Stack      │  │ Jaeger         │  │
│  │ Grafana        │  │ (Elasticsearch,│  │ Zipkin         │  │
│  │ CloudWatch     │  │  Logstash,     │  │ AWS X-Ray      │  │
│  │ Datadog        │  │  Kibana)       │  │ OpenTelemetry  │  │
│  │                │  │ Loki           │  │                │  │
│  │ "What is       │  │ Fluentd/Bit    │  │ "Which path    │  │
│  │  happening?"   │  │                │  │ does a request │  │
│  │                │  │ "Why did it    │  │ take?"         │  │
│  │ CPU, RAM,      │  │  happen?"      │  │                │  │
│  │ requests/s,    │  │                │  │ Latency per    │  │
│  │ error rate     │  │ Stack traces,  │  │ service,       │  │
│  │                │  │ audit logs     │  │ dependencies   │  │
│  └────────────────┘  └────────────────┘  └────────────────┘  │
└──────────────────────────────────────────────────────────────┘

16.2 Prometheus + Grafana on Kubernetes

# Prometheus ServiceMonitor (if you run the prometheus-operator)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp-monitor
  namespace: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: http
      path: /metrics
      interval: 15s

# Prometheus alerting rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: myapp-alerts
  namespace: myapp
spec:
  groups:
    - name: myapp.rules
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{app="myapp",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{app="myapp"}[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate (>5%) on myapp"

        - alert: HighLatency
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket{app="myapp"}[5m]))
              by (le)) > 1.0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "P95 latency > 1s on myapp"
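
The `HighErrorRate` expression is simply the ratio of the 5xx request rate to the total request rate. The same arithmetic on plain numbers, to make the 5% threshold concrete (the rates below are invented):

```python
def error_rate(rates_by_status: dict[str, float]) -> float:
    """Fraction of traffic answered with a 5xx status, as in the PromQL rule."""
    total = sum(rates_by_status.values())
    errors = sum(v for status, v in rates_by_status.items() if status.startswith("5"))
    return errors / total if total else 0.0

rates = {"200": 180.0, "404": 8.0, "500": 9.0, "503": 3.0}  # requests/s
rate = error_rate(rates)
print(f"error rate = {rate:.1%}, HighErrorRate fires: {rate > 0.05}")
```

The `for: 5m` clause means the ratio must stay above the threshold for five consecutive minutes before the alert actually fires, which filters out short blips.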

17. Security in the DevOps pipeline (DevSecOps)

17.1 Security-as-Code in CI/CD

# Security stages integrated into the pipeline (schematic; adapt to your CI's syntax):
security-scan:
  stage: security
  parallel:
    # 1. SAST — Static Application Security Testing
    - name: semgrep
      script: semgrep scan --config=auto --error src/

    # 2. Dependency scanning
    - name: dependency-check
      script: |
        pip-audit -r requirements.txt
        safety check -r requirements.txt

    # 3. Container image scanning
    - name: trivy
      script: |
        trivy image --severity HIGH,CRITICAL \
          --exit-code 1 $IMAGE

    # 4. IaC scanning
    - name: checkov
      script: |
        checkov -d terraform/ --framework terraform
        checkov -d k8s/ --framework kubernetes

    # 5. Secrets detection
    - name: gitleaks
      script: gitleaks detect --source . --verbose

17.2 Pod Security Standards (Kubernetes)

# Restricted pod security: production best practice
apiVersion: v1
kind: Pod
metadata:
  name: secure-pod
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 65534
    fsGroup: 65534
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: myapp:v1
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
      volumeMounts:
        - name: tmp
          mountPath: /tmp
  volumes:
    - name: tmp
      emptyDir: {}

18. Capstone project: a complete pipeline from code to production

18.1 The complete diagram

Developer → git push
    │
    ▼
GitHub Actions / Jenkins
    │
    ├── 1. Lint (flake8, black, mypy)
    ├── 2. Unit Tests (pytest, coverage >80%)
    ├── 3. Build Docker Image (multi-stage)
    ├── 4. Security Scan (trivy, semgrep, pip-audit)
    ├── 5. Push to ECR
    ├── 6. Deploy to Staging (kubectl / Helm)
    ├── 7. Integration Tests (against staging)
    ├── 8. Manual Approval Gate
    └── 9. Deploy to Production (rolling update)
              │
              ├── Prometheus scrape → Grafana dashboards
              ├── Fluentd → Elasticsearch → Kibana
              ├── Alertmanager → Slack / PagerDuty
              └── Automatic rollback if error rate > 5%

Infra managed by:
    Terraform (VPC, EKS, RDS, S3, IAM)
    Helm (the Kubernetes application)
    AWS Secrets Manager (credentials)

Monitoring:
    Grafana: per-service dashboards
    Prometheus: application + cluster metrics
    CloudWatch: native AWS metrics
    PagerDuty: on-call rotation + escalation

18.2 Production launch checklist

PRE-DEPLOY:
□ All tests pass (unit, integration, e2e)
□ Code review approved (min. 2 reviewers)
□ Security scan clean of critical vulnerabilities
□ Docker image under 500MB
□ Health endpoints implemented (/health/live, /health/ready)
□ Prometheus metrics exposed (/metrics)
□ Structured logging (JSON)
□ Graceful shutdown implemented (SIGTERM handling)
□ Database migrations tested and reversible
□ Secrets in Secrets Manager (not in code/env vars)
□ Resource limits set (CPU, memory)
□ HPA configured (autoscaling)
□ Network policies applied
□ Pod Security Standards enforced
□ Runbook up to date

POST-DEPLOY:
□ Health checks OK on all pods
□ Metrics visible in Grafana
□ Error rate stable (< baseline)
□ P95 latency within limits
□ No new alerts
□ Rollback tested and documented

Appendices

A. Recommended DevOps toolchain

Category            Recommended tools
Version Control     Git, GitHub / GitLab
CI/CD               GitHub Actions, GitLab CI, Jenkins, ArgoCD
Containerization    Docker, Podman, Buildah
Orchestration       Kubernetes (EKS, GKE, AKS), k3s
K8s Packaging       Helm, Kustomize
IaC                 Terraform, Pulumi, CloudFormation
Config Management   Ansible, Chef, Puppet
Monitoring          Prometheus + Grafana, Datadog, CloudWatch
Logging             ELK Stack, Loki + Grafana, CloudWatch Logs
Tracing             Jaeger, Zipkin, AWS X-Ray, OpenTelemetry
Security            Trivy, Snyk, SonarQube, Falco, OPA
Secrets             HashiCorp Vault, AWS Secrets Manager
Scripting           Bash, Python, Go
GitOps              ArgoCD, Flux

A course prepared as reference material for DevOps engineers, SREs, and developers interested in modern software delivery practices.
