Prometheus + Grafana Complete Guide: Monitoring & Observability

1. What Is Observability?

Observability is the ability to understand a system's internal state from its external outputs alone. It rests on three core signals, often called the three pillars of observability.

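Signal	Description	Tools	Characteristics
Metrics	Numeric data over time (CPU, request count, latency)	Prometheus, Grafana	Strong at aggregation, low storage cost
Logs	Detailed records of specific events	Loki, ELK Stack	Strong for debugging, high storage cost
Traces	Request flow through a distributed system	Tempo, Jaeger	Bottleneck analysis across microservices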

Aspect	Monitoring	Observability
Question	"Is something wrong?"	"Why is it wrong?"
Approach	Alerts on predefined thresholds	Ad-hoc query analysis
What it reveals	Known unknowns	Unknown unknowns

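2. What Is Prometheus?

Prometheus is an open-source systems monitoring and alerting toolkit. It is a graduated CNCF (Cloud Native Computing Foundation) project and the de facto standard for monitoring in the Kubernetes ecosystem.

Pull-Based Metric Collection

Instead of a push model in which agents send data to a server, Prometheus uses a pull model: the Prometheus server periodically scrapes metrics from its targets over HTTP.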

# Prometheus data model: time series
# Format: metric_name{label1="value1", label2="value2"} value timestamp

# Example metrics
http_requests_total{method="GET", path="/api/users", status="200"} 1234 1716000000000
http_requests_total{method="POST", path="/api/users", status="201"} 56 1716000000000
http_requests_total{method="GET", path="/api/users", status="500"} 3 1716000000000

# The four metric types
# Counter: monotonically increasing (request count, error count)
http_requests_total{job="api"} 9825

# Gauge: can go up or down freely (CPU usage, memory, connection count)
node_memory_MemAvailable_bytes 2147483648

# Histogram: distribution of observed values (response time, request size)
# Stored as three kinds of series: _bucket, _sum, _count
http_request_duration_seconds_bucket{le="0.1"} 100
http_request_duration_seconds_bucket{le="0.5"} 150
http_request_duration_seconds_bucket{le="1.0"} 160
http_request_duration_seconds_bucket{le="+Inf"} 165
http_request_duration_seconds_sum   23.5
http_request_duration_seconds_count 165

# Summary: quantiles computed on the client side
rpc_duration_seconds{quantile="0.5"} 0.012
rpc_duration_seconds{quantile="0.9"} 0.025
rpc_duration_seconds{quantile="0.99"} 0.083
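
To make the sample format concrete, here is a minimal Python sketch that splits one line of the form `metric_name{labels} value timestamp` into its parts. The names `Sample` and `parse_sample` are illustrative assumptions, not part of any Prometheus library, and the sketch ignores escape sequences in label values and `# HELP` / `# TYPE` lines. Production code should rely on a client library's parser instead.

```python
import re
from typing import NamedTuple, Optional


class Sample(NamedTuple):
    name: str
    labels: dict
    value: float
    timestamp_ms: Optional[int]


# metric name, optional {labels}, value, optional millisecond timestamp
_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<ts>-?\d+))?\s*$'
)
# label="value" pairs (no support for escaped quotes in this sketch)
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)\s*=\s*"([^"]*)"')


def parse_sample(line: str) -> Sample:
    """Parse a single sample line into name, labels, value, and timestamp."""
    m = _LINE.match(line.strip())
    if m is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(_LABEL.findall(m.group("labels") or ""))
    ts = m.group("ts")
    return Sample(
        name=m.group("name"),
        labels=labels,
        value=float(m.group("value")),
        timestamp_ms=int(ts) if ts is not None else None,
    )


if __name__ == "__main__":
    print(parse_sample('http_requests_total{method="GET", status="200"} 1234 1716000000000'))
```

Each distinct combination of metric name and label values identifies one time series, which is why the three `http_requests_total` lines in the example are three separate series.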

3. Installing with Docker Compose

# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:v2.51.2
    container_name: prometheus
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules/:/etc/prometheus/rules/
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
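      # The next two flags expose unauthenticated reload/quit and TSDB admin
      # endpoints; enable them only on a trusted network.
      - '--web.enable-lifecycle'
      - '--web.enable-admin-api'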
    ports: ["9090:9090"]
    restart: unless-stopped

  grafana:
    image: grafana/grafana:10.4.2
    container_name: grafana
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning/:/etc/grafana/provisioning/
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: grafana123   # example only; change before deploying
      GF_USERS_ALLOW_SIGN_UP: "false"
    ports: ["3000:3000"]
    depends_on: [prometheus]
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node-exporter
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports: ["9100:9100"]
    restart: unless-stopped

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    container_name: cadvisor
    privileged: true
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports: ["8080:8080"]
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    ports: ["9093:9093"]
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:

# prometheus/prometheus.yml
global:
  scrape_interval:     15s   # scrape metrics every 15 seconds
  evaluation_interval: 15s   # evaluate alerting rules every 15 seconds
  scrape_timeout:      10s

# Alertmanager integration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

# Alerting rule files
rule_files:
  - '/etc/prometheus/rules/*.yml'

# Scrape targets
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter (host system metrics)
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance

  # cAdvisor (container metrics)
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

4. Configuring Metric Collection

# prometheus/prometheus.yml - advanced scrape settings
scrape_configs:
  # HTTP endpoint health monitoring (Blackbox Exporter)
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://api.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

  # Dynamic service discovery (file-based)
  - job_name: 'dynamic-services'
    file_sd_configs:
      - files: ['/etc/prometheus/targets/*.json']
        refresh_interval: 30s

  # Docker service discovery
  # (requires /var/run/docker.sock to be mounted into the Prometheus container)
  - job_name: 'docker-containers'
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      # Scrape only containers labeled prometheus.scrape=true
      - source_labels: [__meta_docker_container_label_prometheus_scrape]
        action: keep
        regex: 'true'
      # Replace the port in __address__ with the prometheus.port label value
      - source_labels: [__address__, __meta_docker_container_label_prometheus_port]
        target_label: __address__
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
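
The file-based discovery job reads every JSON file matching `/etc/prometheus/targets/*.json`. Each file holds a list of target groups, where a group has a `targets` list and an optional `labels` map. The following Python sketch writes such a file; the function name `write_file_sd`, the host addresses, and the label values are hypothetical. It writes to a temporary file and renames it, on the assumption that an atomic rename keeps Prometheus from reading a half-written file.

```python
import json
import os
import tempfile


def write_file_sd(path, groups):
    """Write target groups in the file_sd JSON format and return the payload.

    `groups` is an iterable of (targets, labels) pairs.
    """
    payload = [
        {"targets": list(targets), "labels": dict(labels)}
        for targets, labels in groups
    ]
    directory = os.path.dirname(path) or "."
    # The ".tmp" suffix keeps the temporary file outside the *.json glob.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f, indent=2)
    os.replace(tmp_path, path)  # atomic rename within the same directory
    return payload


if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()
    target_file = os.path.join(out_dir, "api.json")
    write_file_sd(target_file, [
        (["10.0.0.5:8000", "10.0.0.6:8000"], {"env": "prod", "team": "api"}),
    ])
    with open(target_file) as f:
        print(f.read())
```

With `refresh_interval: 30s`, a change to the file should be picked up without restarting Prometheus.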

5. PromQL Basics

PromQL (Prometheus Query Language) is Prometheus's functional query language. You select time series by metric name and label selectors, then transform them with functions.

# ── Instant vector ──────────────────────────────────────────
# Value of each series at the evaluation time
http_requests_total

# Label filtering
http_requests_total{job="api", status="200"}

# Regex matching (=~ matches, !~ does not match; patterns are fully anchored)
http_requests_total{status=~"2.."}        # 2xx status codes
http_requests_total{status!~"2.."}        # anything other than 2xx
http_requests_total{path=~"/api/.*"}      # paths starting with /api/

# ── Range vector ────────────────────────────────────────────
# Samples from the past 5 minutes
http_requests_total[5m]

# Time units: ms, s (seconds), m (minutes), h (hours), d (days), w (weeks), y (years)
node_cpu_seconds_total[1h]

# ── rate(): per-second average rate of increase ─────────────
# Use on counters only (not on gauges)
rate(http_requests_total[5m])

# irate(): uses only the last two samples (suited to spotting spikes)
irate(http_requests_total[5m])

# increase(): total increase over the range
increase(http_requests_total[1h])

# ── histogram_quantile(): quantile estimation ───────────────
# P99 response time
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# P50 (median)
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))

# P95 per job
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
)
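
`histogram_quantile()` does not see individual observations; it estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The Python sketch below illustrates that idea using the bucket values from the histogram example in section 2. It is a simplified model under stated assumptions (classic buckets, sorted by upper bound, ending with `+Inf`), not the Prometheus implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound, contain at least two entries,
    and end with an upper bound of math.inf.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be within [0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Index of the first bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # Rank falls into the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper, count = buckets[idx]
    lower, prev_count = (0.0, 0) if idx == 0 else buckets[idx - 1]
    if count == prev_count:
        return lower
    # Assume observations are spread uniformly inside the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


BUCKETS = [(0.1, 100), (0.5, 150), (1.0, 160), (math.inf, 165)]

if __name__ == "__main__":
    for q in (0.5, 0.95, 0.99):
        print(q, histogram_quantile(q, BUCKETS))
```

Because the result is interpolated, its accuracy depends on bucket boundaries: a quantile that lands in a wide bucket can be far from the true value, and anything in the `+Inf` bucket is capped at the largest finite bound.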

6. PromQL Aggregation & Functions

# ── Aggregation operators ───────────────────────────────────
# sum: total
sum(http_requests_total)                     # sum across all series
sum(http_requests_total) by (job)            # sum per job label
sum(http_requests_total) without (instance)  # group by every label except instance

# count: number of series
count(up == 1)                               # number of targets currently UP

# avg: average
avg(node_cpu_seconds_total{mode="idle"})

# max / min: maximum / minimum
max(node_memory_MemAvailable_bytes)
min(node_filesystem_free_bytes{fstype="ext4"})

# topk / bottomk: top / bottom N series
topk(5, rate(http_requests_total[5m]))       # 5 series with the highest request rate
bottomk(3, node_filesystem_free_bytes)       # 3 filesystems with the least free space

# ── Practical query examples ────────────────────────────────
# CPU utilization (%), averaged over all cores
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# CPU utilization per instance
100 - (
  avg by (instance) (
    rate(node_cpu_seconds_total{mode="idle"}[5m])
  ) * 100
)

# Memory utilization (%)
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk utilization (%)
(1 - (node_filesystem_free_bytes / node_filesystem_size_bytes)) * 100

# HTTP error ratio (5xx / all)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Predicted free bytes 4 hours from now, extrapolated from the last hour
# (a value below 0 means the disk is expected to fill within 4 hours)
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600)
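
`predict_linear()` fits a straight line to the samples in the range by least squares and returns the value that line reaches a given number of seconds ahead; it returns a predicted value, not a duration. The Python sketch below shows both that extrapolation and the separate "seconds until empty" calculation. The function names are illustrative, and the sketch treats the last sample's timestamp as "now", whereas Prometheus uses the query evaluation time.

```python
def _fit(samples):
    """Least-squares slope and intercept for (timestamp_seconds, value) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("samples must have distinct timestamps")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var
    return slope, mean_v - slope * mean_t


def predict_linear(samples, horizon_seconds):
    """Value of the fitted line `horizon_seconds` after the last sample."""
    slope, intercept = _fit(samples)
    now = samples[-1][0]
    return intercept + slope * (now + horizon_seconds)


def seconds_until_empty(samples):
    """Seconds until the fitted line reaches zero, or None if it is not falling."""
    slope, intercept = _fit(samples)
    if slope >= 0:
        return None
    value_now = intercept + slope * samples[-1][0]
    return max(0.0, -value_now / slope)


if __name__ == "__main__":
    # Free bytes shrinking by 1 byte per second (hypothetical data).
    free_bytes = [(0, 10000.0), (600, 9400.0), (1200, 8800.0)]
    print(predict_linear(free_bytes, 4 * 3600))
    print(seconds_until_empty(free_bytes))
```

An alert such as `predict_linear(...) < 0` therefore asks "will the fitted line be below zero at the horizon?", which is why the horizon appears in the query rather than in the result.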

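7. Grafana Integration & Dashboards

This section covers only connecting the Prometheus data source and importing a dashboard. To learn Grafana itself in depth (building panels, variables, alerting, provisioning), see the Grafana Complete Guide.

# grafana/provisioning/datasources/prometheus.yml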
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    jsonData:
      timeInterval: '15s'
      queryTimeout: '60s'

# grafana/provisioning/dashboards/dashboard.yml
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
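      path: /etc/grafana/provisioning/dashboards

Getting started quickly with dashboards
In Grafana, go to Dashboards → Import and enter 1860 to load the "Node Exporter Full" dashboard. It includes panels for CPU, memory, disk, and network.

# Create a dashboard programmatically via the Grafana API
# (the payload below is a skeleton with no panels; put the full dashboard JSON under "dashboard")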
curl -X POST http://admin:grafana123@localhost:3000/api/dashboards/import \
  -H 'Content-Type: application/json' \
  -d '{
    "dashboard": {
      "id": null,
      "title": "Node Exporter Full",
      "tags": ["node", "prometheus"]
    },
    "folderId": 0,
    "overwrite": true,
    "inputs": [{
      "name": "DS_PROMETHEUS",
      "type": "datasource",
      "pluginId": "prometheus",
      "value": "Prometheus"
    }]
  }'

# Dashboard IDs on grafana.com that can be imported
# 1860 - Node Exporter Full
# 3662 - Prometheus 2.0 Stats
# 7362 - Node Exporter Dashboard EN 20201010
# 6417 - Kubernetes Cluster

8. Alertmanager (Alert Configuration)

# prometheus/rules/alerts.yml
groups:
  - name: node-alerts
    interval: 30s
    rules:
      # CPU utilization above 80%
      - alert: HighCPUUsage
        expr: |
          100 - (avg by (instance) (
            rate(node_cpu_seconds_total{mode="idle"}[5m])
          ) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage ({{ $labels.instance }})"
          description: "CPU usage is {{ printf \"%.1f\" $value }}% (threshold: 80%)"

      # Memory utilization above 90%
      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low memory ({{ $labels.instance }})"
          description: "Memory usage is {{ printf \"%.1f\" $value }}%"

      # Disk utilization above 85%
      - alert: DiskSpaceWarning
        expr: |
          (1 - (node_filesystem_free_bytes / node_filesystem_size_bytes)) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space warning"

      # Disk predicted to fill within 4 hours
      - alert: DiskWillFillIn4Hours
        expr: |
          predict_linear(node_filesystem_free_bytes[1h], 4 * 3600) < 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk predicted to fill within 4 hours"

  - name: http-alerts
    rules:
      # HTTP 5xx error ratio above 5%
      - alert: HighHTTPErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High HTTP error rate"
          description: "Error rate is {{ $value | humanizePercentage }}"
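
The `for:` clause in each rule means the expression must stay true across consecutive evaluations for that long before the alert fires; until then the alert is pending, and a single false evaluation resets it. The Python sketch below models that lifecycle. The function name `alert_states` and the evaluation timeline are illustrative assumptions, not Prometheus code.

```python
def alert_states(evaluations, for_seconds):
    """Return the alert state after each (timestamp_seconds, condition_true) evaluation.

    States follow the rule lifecycle: "inactive" -> "pending" -> "firing".
    """
    states = []
    active_since = None  # timestamp at which the condition first became true
    for ts, condition in evaluations:
        if not condition:
            # Any false evaluation resets the pending timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        held = ts - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


if __name__ == "__main__":
    # Evaluations every 30 s against a rule with `for: 2m` (hypothetical timeline).
    timeline = [(0, True), (30, True), (60, False), (90, True),
                (120, True), (150, True), (180, True), (210, True)]
    for (ts, _), state in zip(timeline, alert_states(timeline, 120)):
        print(ts, state)
```

In this timeline the dip at 60 s discards the first 30 s of pending time, so the alert only fires at 210 s, two full minutes after the condition became true again at 90 s.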

# alertmanager/alertmanager.yml
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alert@example.com'
  smtp_auth_username: 'alert@example.com'
  smtp_auth_password: 'app-password'
  resolve_timeout: 5m

# Alert routing
route:
  group_by: ['alertname', 'instance']
  group_wait:      30s    # wait before sending the first notification for a new group
  group_interval:  5m     # wait before notifying about new alerts added to an existing group
  repeat_interval: 4h     # wait before re-sending a notification for alerts still firing
  receiver: 'slack-critical'

  # `matchers` replaces the deprecated `match` field (Alertmanager 0.22+)
  routes:
    - matchers:
        - severity="critical"
      receiver: 'slack-critical'
    - matchers:
        - severity="warning"
      receiver: 'email-warning'

receivers:
  - name: 'slack-critical'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#alerts-critical'
        title: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
        text: |
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Details:* {{ .Annotations.description }}
          *Severity:* {{ .Labels.severity }}
          *Started:* {{ .StartsAt.Format "2006-01-02 15:04:05" }}
          {{ end }}
        send_resolved: true

  - name: 'email-warning'
    email_configs:
      - to: 'ops-team@example.com'
        subject: '[Warning] {{ .GroupLabels.alertname }}'
        body: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Details: {{ .Annotations.description }}
          {{ end }}

inhibit_rules:
  # While a critical alert fires, suppress warnings for the same instance
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['instance']
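
The inhibit rule at the end of this configuration can be read as a predicate: a warning alert is muted when some other firing alert is critical and carries the same `instance` label. The Python sketch below expresses that logic with plain dictionaries. The function name `is_inhibited` and the sample alerts are illustrative assumptions; it supports only exact-equality matching.

```python
def is_inhibited(target, firing, source_match, target_match, equal):
    """Return True if `target` is muted by another alert in `firing`.

    Alerts are dicts of labels; `source_match` and `target_match` are
    dicts of label -> required value; `equal` lists labels that must agree.
    """
    def matches(labels, matchers):
        return all(labels.get(key) == value for key, value in matchers.items())

    if not matches(target, target_match):
        return False
    return any(
        matches(source, source_match)
        and all(source.get(label) == target.get(label) for label in equal)
        for source in firing
        if source is not target  # an alert does not inhibit itself
    )


if __name__ == "__main__":
    critical = {"alertname": "HighMemoryUsage", "severity": "critical",
                "instance": "node-exporter:9100"}
    warning = {"alertname": "HighCPUUsage", "severity": "warning",
               "instance": "node-exporter:9100"}
    print(is_inhibited(warning, [critical, warning],
                       {"severity": "critical"}, {"severity": "warning"},
                       ["instance"]))
```

Without the `equal` list, one critical alert anywhere would silence every warning, so the shared-label condition is what keeps inhibition scoped to the affected host.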

9. Instrumenting Application Metrics (Python/Go)

Python (prometheus_client)

# pip install prometheus-client fastapi uvicorn
from prometheus_client import (
    Counter, Gauge, Histogram, Summary,
    CONTENT_TYPE_LATEST, generate_latest
)
from fastapi import FastAPI, Request, Response
import time

app = FastAPI()

# ── Metric definitions ───────────────────────────────────────
# Counter: total requests (labels: method, path, status)
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'path', 'status']
)

# Histogram: response time distribution
REQUEST_DURATION = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'path'],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
)

# Gauge: requests currently being processed
IN_PROGRESS = Gauge(
    'http_requests_in_progress',
    'HTTP requests currently in progress',
    ['method']
)

# Summary: database query time
DB_QUERY_DURATION = Summary(
    'db_query_duration_seconds',
    'Database query duration'
)

# ── Automatic instrumentation via middleware ─────────────────
@app.middleware('http')
async def metrics_middleware(request: Request, call_next):
    method = request.method
    # Note: raw paths containing IDs (e.g. /api/users/42) create one series per ID
    path = request.url.path

    # Skip the /metrics endpoint itself
    if path == '/metrics':
        return await call_next(request)

    IN_PROGRESS.labels(method=method).inc()
    start_time = time.perf_counter()
    try:
        response = await call_next(request)
    finally:
        # Always decrement, even if the handler raises
        IN_PROGRESS.labels(method=method).dec()

    duration = time.perf_counter() - start_time
    status = str(response.status_code)

    REQUEST_COUNT.labels(method=method, path=path, status=status).inc()
    REQUEST_DURATION.labels(method=method, path=path).observe(duration)

    return response

@app.get('/metrics')
def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
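
The middleware above uses the raw URL path as a label value. Every distinct label value creates a new time series, so paths that embed IDs can grow the series count without bound. One possible mitigation, sketched below with the Python standard library only, is to collapse ID-like segments before using the path as a label. The helper name `normalize_path` and the `:id` / `:uuid` placeholders are hypothetical choices, not part of `prometheus_client` or FastAPI.

```python
import re

# Canonical 8-4-4-4-12 hexadecimal UUID.
_UUID = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


def normalize_path(path: str) -> str:
    """Replace ID-like path segments with placeholders to bound label cardinality."""
    parts = []
    for segment in path.split("/"):
        if segment.isdigit():
            parts.append(":id")
        elif _UUID.match(segment):
            parts.append(":uuid")
        else:
            parts.append(segment)
    return "/".join(parts)


if __name__ == "__main__":
    for raw in ("/api/users", "/api/users/42",
                "/api/orders/123e4567-e89b-12d3-a456-426614174000/items"):
        print(raw, "->", normalize_path(raw))
```

In the middleware this would amount to `path = normalize_path(request.url.path)`. Where the framework exposes the matched route template, using that template as the label is usually the more robust option.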

Go (prometheus/client_golang)

package main

import (
    "log"
    "net/http"
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    httpRequestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "path", "status"},
    )

    httpRequestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration in seconds",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "path"},
    )

    activeConnections = promauto.NewGauge(prometheus.GaugeOpts{
        Name: "http_active_connections",
        Help: "Number of active HTTP connections",
    })
)

// responseWriter wraps http.ResponseWriter to capture the status code.
// (The original listing referenced this type without defining it.)
type responseWriter struct {
    http.ResponseWriter
    status int
}

func (rw *responseWriter) WriteHeader(code int) {
    rw.status = code
    rw.ResponseWriter.WriteHeader(code)
}

// Middleware: records request count, duration, and in-flight requests
func metricsMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        activeConnections.Inc()
        defer activeConnections.Dec()

        // Default to 200 because handlers may write a body without calling WriteHeader
        wrapped := &responseWriter{ResponseWriter: w, status: http.StatusOK}
        next.ServeHTTP(wrapped, r)

        duration := time.Since(start).Seconds()
        status := strconv.Itoa(wrapped.status)

        httpRequestsTotal.
            WithLabelValues(r.Method, r.URL.Path, status).Inc()
        httpRequestDuration.
            WithLabelValues(r.Method, r.URL.Path).Observe(duration)
    })
}

// usersHandler is a placeholder handler for illustration only.
func usersHandler(w http.ResponseWriter, r *http.Request) {
    w.Header().Set("Content-Type", "application/json")
    w.Write([]byte("[]"))
}

func main() {
    mux := http.NewServeMux()
    mux.Handle("/metrics", promhttp.Handler())
    mux.HandleFunc("/api/users", usersHandler)

    log.Fatal(http.ListenAndServe(":8080", metricsMiddleware(mux)))
}

10. Kubernetes Integration (kube-prometheus-stack)

# Install kube-prometheus-stack with Helm
# Includes Prometheus + Grafana + Alertmanager + Node Exporter + kube-state-metrics

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Default install
helm install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace

# Alternative: install with custom values (use instead of the command above)
helm install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values values.yaml

# Access through port forwarding
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090

# ServiceMonitor: scrape configuration for a Service's metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: my-app
  namespaceSelector:
    matchNames: [default, production]
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
---
# PrometheusRule: alerting rules for the Kubernetes cluster
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: my-app
      rules:
        - alert: PodRestartingTooOften
          expr: |
            rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 5
          for: 0m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is restarting frequently"

        - alert: PodCrashLooping
          expr: |
            kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Pod CrashLoopBackOff: {{ $labels.pod }}"

11. Log Collection (Loki)

# Add Loki + Promtail to docker-compose.yml
services:
  loki:
    image: grafana/loki:2.9.7
    container_name: loki
    ports: ["3100:3100"]
    volumes:
      - ./loki/config.yml:/etc/loki/config.yml
      - loki_data:/loki
    command: -config.file=/etc/loki/config.yml
    restart: unless-stopped

  promtail:
    image: grafana/promtail:2.9.7
    container_name: promtail
    volumes:
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro   # needed by docker_sd_configs
      - ./promtail/config.yml:/etc/promtail/config.yml
    command: -config.file=/etc/promtail/config.yml
    depends_on: [loki]
    restart: unless-stopped

volumes:
  loki_data:

# promtail/config.yml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  # System logs
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          __path__: /var/log/*.log

  # Docker container logs
  - job_name: containers
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      - source_labels: [__meta_docker_container_name]
        regex: '/(.*)'
        target_label: container
      - source_labels: [__meta_docker_container_log_stream]
        target_label: stream

# Basic LogQL queries (run them in Grafana Explore)

# All logs for a job
{job="varlogs"}

# Logs from a specific container
{container="my-app"}

# Filter for error lines
{container="my-app"} |= "ERROR"

# Regex filter
{job="nginx"} |~ "5[0-9][0-9]"

# Aggregate after filtering: number of error lines in the last minute
sum(count_over_time({container="my-app"} |= "ERROR" [1m]))

# Parse JSON logs
{container="api"} | json | level="error" | line_format "{{.message}}"
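
The last query chains three pipeline stages: `json` extracts fields from each line, `level="error"` keeps only matching entries, and `line_format` rewrites the output to the `message` field. The Python sketch below mimics that pipeline on a list of raw lines to show what each stage contributes. The function name `error_messages` and the sample log lines are illustrative assumptions; real LogQL runs inside Loki, not in client code.

```python
import json


def error_messages(lines, level="error"):
    """Mimic `| json | level="error" | line_format "{{.message}}"` on raw lines."""
    out = []
    for line in lines:
        try:
            record = json.loads(line)  # stage 1: json parser
        except json.JSONDecodeError:
            # A line that is not valid JSON yields no `level` field,
            # so the label filter in stage 2 would drop it.
            continue
        if not isinstance(record, dict):
            continue
        if record.get("level") != level:  # stage 2: label filter
            continue
        out.append(str(record.get("message", "")))  # stage 3: line_format
    return out


if __name__ == "__main__":
    sample = [
        '{"level": "error", "message": "db timeout"}',
        '{"level": "info", "message": "request ok"}',
        'plain text line',
    ]
    print(error_messages(sample))
```

Filtering with `|=` before the `json` stage is generally cheaper than parsing every line first, because the line filter can discard entries without extracting fields.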

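12. Next Steps

Prometheus Advanced Roadmap

• OpenTelemetry: an open standard that unifies metrics, logs, and traces. Telemetry can be shipped over the OTLP protocol instead of being scraped by Prometheus
• Tempo: Grafana's distributed tracing backend. Jaeger-compatible and built on object storage
• Grafana Cloud: a managed platform combining Prometheus, Loki, Tempo, and Grafana
• Thanos / Cortex: long-term storage, high availability, and horizontal scaling for Prometheus
• VictoriaMetrics: a high-performance time series database compatible with Prometheus
• Prometheus Operator: manages Prometheus deployments through Kubernetes CRDs

Related guides: Grafana guide · Kubernetes guide · Docker guide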
