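Introduction to Datadog, Part 6 - A Complete Practical Guide to Alerting and Notification Systems

Once the foundations of infrastructure monitoring, application monitoring, and log management are in place, the next step is building an effective alerting and notification system. This article walks through the full scope of Datadog alerting and notification in practical terms: alert strategy design, notification channel tuning, incident management, and SLO monitoring. It is a guide to building a reliable monitoring practice that prevents alert fatigue and lets you respond quickly to the problems that truly matter.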

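6.1 Designing Monitoring Alerts

Basic Principles of Alert Strategy

The Philosophy of Effective Alert Design

In today's complex systems, alerts built on simple static thresholds tend to produce false positives and alert fatigue, and they risk burying the problems that truly matter. Datadog's alerting system supports more advanced, context-aware alert strategies.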

yaml

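Core principles of alert design:
  1. Prioritize by business impact:
    - Put problems that directly affect user experience first
    - Use graduated alerts for system availability and performance
    - Clearly separate preventive alerts from urgent alerts

  2. Emphasize actionability:
    - Every alert comes with a concrete action the recipient can take
    - Attach diagnostic information and context automatically
    - Link directly to remediation steps and the escalation owner

  3. Prevent alert fatigue:
    - Set appropriate thresholds and use hysteresis
    - Consolidate duplicate alerts and group related ones
    - Control alerts automatically by time of day and during maintenance

  4. Preserve context:
    - Automatically correlate related metrics, logs, and traces
    - Provide a fast path to the root cause of a failure
    - Scope alerts by team and by service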
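
Hysteresis, mentioned under alert fatigue above, is easiest to see with numbers. The following is a minimal sketch, not Datadog code: the `HysteresisAlert` class, the 80/75 thresholds, and the sample values are all assumptions chosen for illustration. It compares a naive single-threshold check with one that enters the alert state above a trigger threshold and leaves it only below a lower recovery threshold.

```python
from dataclasses import dataclass


@dataclass
class HysteresisAlert:
    """Hypothetical helper: alert above `trigger`, recover only below `recovery`."""
    trigger: float = 80.0
    recovery: float = 75.0
    alerting: bool = False

    def update(self, value: float) -> bool:
        if not self.alerting and value > self.trigger:
            self.alerting = True
        elif self.alerting and value < self.recovery:
            self.alerting = False
        return self.alerting


def count_transitions(states):
    """Number of state changes, i.e. how many notifications would be sent."""
    return sum(1 for a, b in zip(states, states[1:]) if a != b)


# CPU samples hovering around the 80% threshold (made-up data)
samples = [78, 81, 79, 82, 78, 81, 74, 70]

naive_states = [value > 80 for value in samples]
alert = HysteresisAlert()
hysteresis_states = [alert.update(value) for value in samples]

print(count_transitions(naive_states), count_transitions(hysteresis_states))
```

With these samples the naive check flips state on almost every data point, while the hysteresis version raises the alert once and clears it once.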

Strategy by Alert Type

In Datadog, you combine several alert types strategically, depending on the data source and the detection logic.

python

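# Strategic use of Datadog alert types
class DatadogAlertStrategy:
    """
    Design strategy for each alert type
    """

    def __init__(self):
        self.alert_types = {
            # 1. Metric alerts - infrastructure and APM monitoring
            "metric_alerts": {
                "use_cases": [
                    "Continuous monitoring of CPU and memory usage",
                    "APM performance indicators (response time, error rate)",
                    "Business KPIs (revenue, conversion rate)"
                ],
                "threshold_strategy": {
                    "static_threshold": "Metrics with a clearly definable threshold",
                    "anomaly_detection": "Metrics with seasonality or trends",
                    "outlier_detection": "Detecting abnormal members across multiple hosts"
                }
            },

            # 2. Log alerts - security and error monitoring
            "log_alerts": {
                "patterns": [
                    "Security events (unauthorized access, authentication failures)",
                    "Application errors (5xx status codes, exceptions)",
                    "Infrastructure errors (disk capacity, stopped processes)"
                ],
                "design_principles": {
                    "log_pattern_matching": "Precise detection with regular expressions and field conditions",
                    "threshold_definition": "Thresholds on the number of occurrences within a time window",
                    "context_enrichment": "Automatic correlation of related logs and metrics"
                }
            },

            # 3. Composite alerts - multi-signal analysis
            "composite_alerts": {
                "scenarios": [
                    "High CPU usage and high error rate occurring together",
                    "Correlation between network latency and database latency",
                    "Detecting failure chains across multiple services"
                ],
                # Datadog composite monitors write AND / OR / NOT as &&, ||, !
                "logical_operators": ["&&", "||", "!"],
                "advanced_correlation": "Conditions that account for time offsets"
            },

            # 4. Forecast alerts - machine-learning based
            "forecast_alerts": {
                "applications": [
                    "Predicting disk space exhaustion",
                    "Preparing ahead of traffic surges",
                    "Scaling decisions driven by resource shortages"
                ],
                "ml_algorithms": [
                    "Time-series forecasting models",
                    "Seasonality adjustment",
                    "Trend analysis"
                ]
            }
        }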
    
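    def design_alert_hierarchy(self):
        """
        Design of the alert hierarchy
        """
        return {
            "critical": {
                "criteria": "Immediate impact on user experience",
                "response_time": "Within 5 minutes",
                "escalation": "Page the on-call engineer immediately",
                "examples": [
                    "Complete service outage",
                    "Database unreachable",
                    "Security breach detected"
                ]
            },
            "warning": {
                "criteria": "Performance degradation or rising risk",
                "response_time": "Within 30 minutes",
                "escalation": "Notify the team; the owner decides next steps",
                "examples": [
                    "Increased response time",
                    "Rising error rate",
                    "Sharp growth in resource usage"
                ]
            },
            "info": {
                "criteria": "Situational awareness and trend tracking",
                "response_time": "Handled during business hours",
                "escalation": "Chatbot notification",
                "examples": [
                    "Deployment completed notification",
                    "Usage trend report",
                    "Periodic health check"
                ]
            }
        }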
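
The hierarchy above can be turned into a small decision function. This is a sketch under stated assumptions: `classify_alert` and its two boolean inputs are hypothetical names, and the response targets are simply the 5-minute and 30-minute figures from the table above.

```python
from datetime import timedelta

# Response targets from the hierarchy above; None means "handle during business hours".
RESPONSE_TARGETS = {
    "critical": timedelta(minutes=5),
    "warning": timedelta(minutes=30),
    "info": None,
}


def classify_alert(user_impact_now: bool, degradation_or_risk: bool) -> str:
    """Map two coarse signals to a severity level (hypothetical helper)."""
    if user_impact_now:
        return "critical"
    if degradation_or_risk:
        return "warning"
    return "info"


severity = classify_alert(user_impact_now=False, degradation_or_risk=True)
print(severity, RESPONSE_TARGETS[severity])
```

Keeping the classification in one place makes it easier to review whether an alert really deserves to page someone.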

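Configuring Metric Alerts

Implementing Static Threshold Alerts

For metrics with a clearly defined operational threshold, a static threshold alert is the most effective choice.

yaml

# Example: monitoring CPU usage
# Note: this manifest is schematic. If you manage monitors with the Datadog
# Operator, check its DatadogMonitor resource for the exact apiVersion and fields.
apiVersion: datadog/v1
kind: Monitor
metadata:
  name: high-cpu-usage-alert
spec:
  name: "High CPU usage detected - {{host.name}}"
  type: metric alert
  query: >
    avg(last_5m):avg:system.cpu.user{*} by {host} > 80
  message: |
    **🚨 High CPU usage detected**

    **Affected server**: {{host.name}}
    **Current CPU usage**: {{value}}%
    **Threshold**: 80%

    **Immediate actions**:
    1. Check processes: log in to {{host.name}} and run `top`
    2. Analyze load: [Infrastructure Dashboard]({{link_to_dashboard}})
    3. Check application logs: [Log Explorer]({{link_to_logs}})

    **Escalation**: notify the infrastructure team if this persists for 15 minutes
    **Expected impact**: possible performance degradation

  options:
    thresholds:
      critical: 80
      warning: 70
      # Recovery thresholds provide hysteresis: the monitor leaves the alert
      # state only after the value drops below them, which limits flapping.
      critical_recovery: 75
      warning_recovery: 65

    # Delay evaluation by 300 seconds so late-arriving data is included
    evaluation_delay: 300
    # Do not evaluate newly started hosts for their first 300 seconds
    new_host_delay: 300

    # Restrict who may edit this monitor (the API expects role IDs; this name is a placeholder)
    restricted_roles: ["infrastructure-team"]

  tags:
    - "service:infrastructure"
    - "team:sre"
    - "severity:warning"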
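
Before submitting a monitor like the one above, it can help to check that the four thresholds are mutually consistent. The helper below is a hypothetical sketch for an "above" (`>`) monitor; `check_thresholds` is not part of any Datadog library, and the rules it applies are the assumption that each recovery value sits below its threshold and that warning sits below critical.

```python
def check_thresholds(thresholds: dict) -> list:
    """Return a list of problems for an 'above' monitor; an empty list means consistent."""
    critical = thresholds.get("critical")
    warning = thresholds.get("warning")
    critical_recovery = thresholds.get("critical_recovery")
    warning_recovery = thresholds.get("warning_recovery")

    if critical is None:
        return ["critical threshold is required"]

    problems = []
    if warning is not None and warning >= critical:
        problems.append("warning must be below critical")
    if critical_recovery is not None and critical_recovery >= critical:
        problems.append("critical_recovery must be below critical")
    if warning is not None and warning_recovery is not None and warning_recovery >= warning:
        problems.append("warning_recovery must be below warning")
    return problems


# Values from the CPU monitor above
cpu_thresholds = {
    "critical": 80,
    "warning": 70,
    "critical_recovery": 75,
    "warning_recovery": 65,
}
print(check_thresholds(cpu_thresholds))
```

A check like this fits naturally into a CI step for monitors kept in version control.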

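Using Anomaly Detection Alerts

For metrics with seasonality or trends, anomaly detection based on historical behavior is far more effective than a fixed threshold.

python

# Example: configuring an anomaly detection alert
from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.monitors_api import MonitorsApi
from datadog_api_client.v1.model.monitor import Monitor
from datadog_api_client.v1.model.monitor_options import MonitorOptions
from datadog_api_client.v1.model.monitor_threshold_window_options import MonitorThresholdWindowOptions
from datadog_api_client.v1.model.monitor_thresholds import MonitorThresholds
from datadog_api_client.v1.model.monitor_type import MonitorType

def create_anomaly_detection_alert():
    """
    Create a traffic anomaly detection alert
    """
    # Reads DD_API_KEY / DD_APP_KEY from the environment
    configuration = Configuration()

    # Link targets such as {{apm_link}} are placeholders; replace them with real URLs.
    monitor = Monitor(
        name="Web traffic anomaly detection",
        type=MonitorType.QUERY_ALERT,
        query="""
        avg(last_4h):anomalies(
            avg:nginx.net.request_per_s{service:web-app} by {host}, 
            'agile', 
            2,
            direction='both',
            alert_window='last_15m',
            interval=60,
            count_default_zero='true'
        ) >= 1
        """,
        message="""
        **📊 Web traffic anomaly detected**

        **Service**: web-app
        **Host**: {{host.name}}
        **Status**: {{#is_alert}}anomalous{{/is_alert}}{{#is_recovery}}recovered{{/is_recovery}}
        **Current value**: {{value}} requests/sec

        **What to consider**:
        - Traffic surge: possible DDoS attack or viral effect
        - Traffic drop: possible infrastructure failure or DNS problem
        - Deviation from the seasonal pattern: business event or promotion

        **Diagnostic steps**:
        1. Check performance in [APM Service Overview]({{apm_link}})
        2. Analyze error patterns in [Log Analysis]({{log_link}})
        3. Check resource status in [Infrastructure Map]({{infra_link}})

        **Scaling**:
        {{#is_alert}}Consider scaling out if traffic is abnormally high{{/is_alert}}
        """,
        options=MonitorOptions(
            thresholds=MonitorThresholds(
                critical=1.0,
                critical_recovery=0.0,
            ),
            # Anomaly monitors take trigger/recovery windows that match alert_window
            threshold_windows=MonitorThresholdWindowOptions(
                trigger_window="last_15m",
                recovery_window="last_15m",
            ),
            evaluation_delay=300,
            new_host_delay=600,
            require_full_window=False,
            notify_no_data=True,
            no_data_timeframe=20,
            include_tags=True,
        ),
        # Tags belong to the monitor itself, not to its options
        tags=[
            "service:web-app",
            "alert-type:anomaly",
            "team:platform"
        ],
        priority=2,  # High priority
        # The API expects role IDs; these names are placeholders
        restricted_roles=["platform-team", "sre-team"]
    )

    with ApiClient(configuration) as api_client:
        api_instance = MonitorsApi(api_client)
        response = api_instance.create_monitor(body=monitor)
        return response

# Outlier detection
def create_outlier_detection():
    """
    Detect performance outliers across multiple hosts
    """
    return Monitor(
        name="Response time outlier detection",
        type=MonitorType.QUERY_ALERT,
        # DBSCAN takes a single tolerance argument
        query="""
        avg(last_10m):outliers(
            avg:trace.servlet.request.duration{service:api-backend} by {host},
            'dbscan',
            2.0
        ) > 0
        """,
        message="""
        **⚡ Outlier detected in API response time**

        **Target service**: api-backend
        **Abnormal host**: {{host.name}}
        **Current response time**: {{value}}
        **Deviation**: this host behaves differently from the rest of its group

        **What to investigate**:
        - Host-specific problems: CPU, memory, disk I/O
        - Network isolation: load balancer and DNS settings
        - Application state: JVM GC, connection pool

        **Immediate response**:
        1. Consider removing the host from the load balancer
        2. Analyze details in [Host Dashboard]({{host_dashboard}})
        3. Restart the host if necessary
        """,
        options=MonitorOptions(
            thresholds=MonitorThresholds(critical=0.0),
            evaluation_delay=300,
        )
    )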
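
To build intuition for what "a value outside the expected band" means, here is a deliberately simple rolling z-score detector. It is not the algorithm Datadog uses for `anomalies()`; it is a stand-in technique, and the function name, window size, and traffic numbers are assumptions made for this sketch. The `bounds` argument plays a role loosely similar to the bounds parameter (`2`) in the query above.

```python
from statistics import mean, stdev


def zscore_anomalies(series, window=5, bounds=2.0):
    """Return indexes whose value lies more than `bounds` standard deviations
    from the mean of the preceding `window` points."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            is_anomaly = series[i] != mu
        else:
            is_anomaly = abs(series[i] - mu) > bounds * sigma
        if is_anomaly:
            flagged.append(i)
    return flagged


# Requests per second with one sudden spike (made-up data)
traffic = [100, 102, 98, 101, 99, 100, 250, 101, 99]
print(zscore_anomalies(traffic))
```

Note that the spike inflates the standard deviation of the windows that follow it, which is one reason production systems rely on more robust, seasonality-aware models.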

Configuring Log Alerts

Security Log Monitoring

Early detection of security events largely determines whether incident response succeeds.

json

{
  "name": "Security threat detection alert",
  "type": "log alert",
  "query": "logs(\"source:nginx @http.status_code:[400 TO 599]\").index(\"main\").rollup(\"count\").by(\"@http.status_code,@network.client.ip\").last(\"5m\") > 50",
  "message": "**🚨 Security threat detected**\n\n**Details**:\n- Source IP: {{[@network.client.ip].name}}\n- Status code: {{[@http.status_code].name}}\n- Attempts in 5 minutes: {{value}}\n\n**Analysis**:\n- Abnormal access exceeding 50 requests per 5 minutes\n- Possible brute-force attack\n- Blocking via WAF or IP filter is recommended immediately\n\n**Immediate response**:\n1. Block the IP: `sudo iptables -A INPUT -s {{[@network.client.ip].name}} -j DROP`\n2. Analyze details in [Security Dashboard]({{security_dashboard}})\n3. Notify the incident response team\n\n**Escalation**: notify the security team immediately",
  "tags": [
    "security:alert",
    "team:security",
    "priority:critical"
  ],
  "options": {
    "thresholds": {
      "critical": 50,
      "warning": 30
    },
    "evaluation_delay": 60,
    "new_host_delay": 300,
    "groupby_simple_monitor": false,
    "include_tags": true
  }
}
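
The counting logic behind this log monitor (error responses per client IP over the last five minutes, compared with a threshold) can be sketched locally. This is an illustration only: `ips_over_threshold`, the `(timestamp, ip, status)` tuple layout, and the sample addresses are assumptions, not a Datadog API.

```python
from collections import Counter
from datetime import datetime, timedelta


def ips_over_threshold(events, now, window=timedelta(minutes=5), threshold=50):
    """Count 4xx/5xx events per client IP inside the window and
    return the IPs whose count exceeds the threshold."""
    cutoff = now - window
    counts = Counter(
        ip
        for timestamp, ip, status in events
        if cutoff <= timestamp <= now and 400 <= status <= 599
    )
    return {ip: count for ip, count in counts.items() if count > threshold}


now = datetime(2024, 1, 1, 12, 0, 0)
events = [(now - timedelta(seconds=i), "203.0.113.7", 401) for i in range(60)]
events += [(now - timedelta(seconds=i), "198.51.100.2", 404) for i in range(10)]
events += [(now - timedelta(minutes=30), "192.0.2.9", 500)] * 100  # outside the window

print(ips_over_threshold(events, now))
```

Only the first address is reported: the second stays under the threshold, and the third falls outside the five-minute window.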

Application Error Monitoring

Catch application-level errors quickly to minimize user impact.

yaml

# Application Error Monitoring
apiVersion: datadog/v1
kind: Monitor
metadata:
  name: application-error-spike
spec:
  name: "Application error spike detected"
  type: log alert
  query: >
    logs("service:payment-api status:error").index("main")
    .rollup("count").by("@error.kind,service").last("10m") > 10

  message: |
    **💥 Error spike detected in the payment API**

    **Error details**:
    - Service: {{service.name}}
    - Error kind: {{[@error.kind].name}}
    - Errors in the last 10 minutes: {{value}}

    **Business impact**:
    - More failed payment transactions
    - Serious degradation of user experience
    - Risk of lost revenue

    **Technical diagnosis**:
    - Check stack traces in [APM Error Tracking]({{apm_error_link}})
    - Check database performance in [Database Monitoring]({{db_link}})
    - Check dependency impact in [Service Map]({{service_map}})

    **Escalation**:
    - Immediately: notify the development team
    - After 15 minutes: notify the product manager
    - After 30 minutes: escalate to executive management

  options:
    thresholds:
      critical: 10
      warning: 5
    evaluation_delay: 300
    # false = multi alert: one alert per group (error kind and service)
    groupby_simple_monitor: false
    include_tags: true

  tags:
    - "service:payment"
    - "team:backend"
    - "business-critical:true"

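Configuring Composite Alerts

Multi-Signal Correlation Analysis

Combining several indicators gets you to the root cause of a failure faster.

python

# Example: composite alerts
from datadog_api_client.v1.model.monitor import Monitor
from datadog_api_client.v1.model.monitor_type import MonitorType

def create_composite_alert(cpu_monitor_id: int, error_monitor_id: int) -> Monitor:
    """
    Composite alert: high CPU load AND rising error rate.

    A composite monitor combines existing monitors by ID, so the two
    conditions are created first as separate monitors, for example:
      a) avg(last_5m):avg:system.cpu.user{service:web-app} > 80
      b) sum(last_5m):sum:trace.servlet.request.errors{service:web-app}.as_count()
         / sum:trace.servlet.request.hits{service:web-app}.as_count() > 0.05
    Evaluation settings such as evaluation_delay live on those sub-monitors.
    """
    composite_query = f"{cpu_monitor_id} && {error_monitor_id}"

    return Monitor(
        name="Composite performance degradation detection",
        type=MonitorType.COMPOSITE,
        query=composite_query,
        message="""
        **🔥 Combined problem detected: system load + error rate**

        **Conditions met**:
        ✅ CPU usage: {{a.value}}% (threshold: 80%)
        ✅ Error rate: {{b.value}} (threshold: 0.05)

        **Nature of the problem**:
        - Timeouts and failed requests caused by high load
        - Quality degradation from resource exhaustion
        - Growing risk of cascading failure

        **Emergency response flowchart**:
        ```
        1. Check load balancing
           ├─ OK → optimize the application
           └─ NG → scale the infrastructure

        2. Analyze the error cause
           ├─ Timeout → optimize processing time
           ├─ Exception → fix the application
           └─ External dependency → check service integration

        3. Check the blast radius
           ├─ Single service → restart the service
           └─ Multiple services → verify the whole system
        ```

        **Automated response**:
        - Autoscaling in progress...
        - Error details posted to Slack automatically
        """
    )

# Composite alert that accounts for a time offset
def create_delayed_correlation_alert(db_monitor_id: int, app_monitor_id: int) -> Monitor:
    """
    Time-offset correlation: database slowdown -> application slowdown.

    The sub-monitors use different evaluation windows, for example:
      a) avg(last_10m):avg:postgresql.locks.count{*} > 50
      b) avg(last_5m):avg:trace.flask.request.duration{*} > 2000
         (this threshold assumes milliseconds; verify the metric's unit)
    """
    return Monitor(
        name="DB slowdown → app slowdown chain detection",
        type=MonitorType.COMPOSITE,
        query=f"{db_monitor_id} && {app_monitor_id}",
        message="""
        **🔗 Database → application slowdown chain detected**

        **Timeline**:
        - Last 10 minutes: DB lock count elevated ({{a.value}} locks)
        - Last 5 minutes: app response time elevated ({{b.value}})

        **Chain analysis**:
        1. DB locks → longer query wait times
        2. Slow queries → longer application processing time
        3. Slow application → degraded user experience

        **Priority actions**:
        1. Identify and stop queries holding locks for a long time
        2. Check DB connection counts and settings
        3. Tune application-side timeouts
        """
    )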
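
Why requiring both conditions cuts noise can be shown with a few data points. The sketch below evaluates two threshold conditions and their logical AND over made-up samples; the series, thresholds, and variable names are assumptions for illustration and do not call Datadog.

```python
# One sample per evaluation interval (made-up data)
cpu_percent = [72, 85, 88, 91, 70, 86]
error_rate = [0.01, 0.02, 0.07, 0.09, 0.08, 0.01]

cpu_alerting = [value > 80 for value in cpu_percent]
errors_alerting = [value > 0.05 for value in error_rate]

# The composite condition fires only when both sub-conditions hold at once
composite_alerting = [cpu and err for cpu, err in zip(cpu_alerting, errors_alerting)]

print("CPU only:      ", sum(cpu_alerting), "intervals")
print("Error rate only:", sum(errors_alerting), "intervals")
print("Composite:      ", sum(composite_alerting), "intervals")
```

The composite condition is active in fewer intervals than either signal alone, and those intervals are the ones where load and errors coincide.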

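Adaptive Thresholds and ML-Based Alerts

Dynamic Threshold Adjustment

Thresholds adjust automatically as business patterns and system characteristics change.

python

# Adaptive alert system
class AdaptiveAlertSystem:
    """
    Adaptive alert system based on machine learning
    """

    def __init__(self):
        self.adaptive_strategies = {
            # 1. Handling seasonality
            "seasonal_adjustment": {
                "description": "Dynamic thresholds by day of week, time of day, and season",
                "implementation": {
                    "daily_pattern": "Learn weekday and weekend traffic patterns",
                    "hourly_pattern": "Load patterns for business hours and late night",
                    "seasonal_trend": "Long-term monthly and quarterly trends"
                },
                # Seasonality is a parameter of the agile/robust algorithms,
                # not an algorithm name of its own
                "query_example": """
                avg(last_1h):anomalies(
                    avg:aws.elb.request_count{*} by {availability-zone},
                    'robust',
                    2,
                    direction='above',
                    alert_window='last_15m',
                    interval=300,
                    seasonality='weekly',
                    timezone='Asia/Tokyo'
                ) >= 0.8
                """
            },

            # 2. Handling growth trends
            "growth_trend_learning": {
                "description": "Automatic baseline adjustment as the service grows",
                "algorithm": "Linear regression, exponential smoothing",
                "parameters": {
                    "learning_period": "Learn from the past 30 days of data",
                    "adaptation_rate": "Update the baseline weekly",
                    "outlier_exclusion": "Exclude anomalous values from the training data"
                }
            },

            # 3. Awareness of system state
            "context_aware_thresholds": {
                "description": "Thresholds that account for deployments and maintenance",
                "context_signals": [
                    "Deployment in progress",
                    "Scheduled maintenance window",
                    "Traffic surge event"
                ],
                "threshold_adjustment": {
                    "deployment": "Relax the error-rate threshold by 50%",
                    "maintenance": "Pause availability alerts",
                    "traffic_spike": "Raise load thresholds dynamically to 200%"
                }
            }
        }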
    
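    def create_ml_forecast_alert(self):
        """
        Forecast alert implementation
        """
        # system.disk.in_use reports usage as a fraction (0-1), so 0.9 means 90%.
        # Horizon: alert if the forecast crosses the threshold within the next week.
        forecast_query = """
        max(next_1w):forecast(
            avg:system.disk.in_use{*} by {host,device},
            'linear',
            1
        ) >= 0.9
        """

        # {{current_usage}}, {{predicted_usage}} and {{estimated_full_time}} are
        # illustrative placeholders, not built-in template variables.
        return {
            "name": "Disk space exhaustion forecast alert",
            "query": forecast_query,
            "message": """
            **📈 Disk space exhaustion forecast**

            **Forecast result**:
            - Current usage: {{current_usage}}%
            - Forecast usage: {{predicted_usage}}%
            - Estimated time of exhaustion: {{estimated_full_time}}

            **Preventive actions**:
            1. Delete unneeded files: `find /var/log -name "*.log" -mtime +7 -delete`
            2. Force log rotation: `logrotate -f /etc/logrotate.conf`
            3. Clean up application temporary files

            **Scaling decision**:
            - Short term: expand capacity, clean up
            - Long term: archiving strategy, storage optimization

            **Automation**:
            - Run a cleanup script automatically when the forecast fires
            - Start the capacity expansion request process in advance
            """,
            "options": {
                "thresholds": {"critical": 0.9, "warning": 0.8},
                "evaluation_delay": 900  # 15-minute grace period
            }
        }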
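
The idea behind a linear forecast alert can be reproduced in a few lines: fit a straight line to recent usage and solve for when it crosses the limit. This is a minimal least-squares sketch, not Datadog's forecast implementation; `hours_until_full`, the hourly sampling, and the sample values are assumptions.

```python
def hours_until_full(usage, limit=0.9):
    """Fit a least-squares line to hourly disk usage fractions and return how many
    hours after the last sample the line reaches `limit`.
    Returns None when usage is flat or falling."""
    n = len(usage)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(usage) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, usage)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    crossing_hour = (limit - intercept) / slope
    # Clamp to zero when the limit has already been reached
    return max(0.0, crossing_hour - (n - 1))


# Usage growing by two percentage points per hour (made-up data)
recent_usage = [0.70, 0.72, 0.74, 0.76, 0.78]
print(hours_until_full(recent_usage))
```

A result of a few hours calls for immediate cleanup, while a result of weeks can go into capacity planning instead of paging anyone.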

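6.2 Notifications and Incident Management

Configuring Notification Channels

Implementing the Slack Integration

Slack is the communication hub for many modern development teams. A tight integration with Datadog enables real-time collaboration.

python

# Slack integration settings
# Placeholders such as {{alert.name}} in the templates below are illustrative;
# map them to the variables your integration actually provides.
class DatadogSlackIntegration:
    """
    Advanced notification system built on the Datadog-Slack integration
    """

    def __init__(self):
        self.notification_strategies = {
            # 1. Routing by team and severity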
            "smart_routing": {
                "channels": {
                    "#alerts-critical": {
                        "severity": ["critical"],
                        "services": ["payment", "auth", "checkout"],
                        "escalation_time": "immediate",
                        "mention": "@here"
                    },
                    "#alerts-infrastructure": {
                        "severity": ["warning", "critical"],
                        "categories": ["infrastructure", "database"],
                        "escalation_time": "5m",
                        "mention": "@channel"
                    },
                    "#alerts-application": {
                        "severity": ["warning"],
                        "categories": ["application", "api"],
                        "escalation_time": "15m",
                        "mention": "specific-user"
                    }
                }
            },
            
            # 2. Dynamic message formatting
            "message_formatting": {
                "templates": {
                    "critical_alert": """
🚨 **CRITICAL ALERT** 🚨

**Service**: {{service.name}}
**Alert**: {{alert.name}}
**Status**: {{alert.status}}
**Duration**: {{alert.duration}}

**Quick Actions**:
• [🔍 Investigation Dashboard]({{dashboard_url}})
• [📊 Service Map]({{service_map_url}})
• [📝 Runbook]({{runbook_url}})

**Context**:
{{alert.context}}

**Thread**: Please update this thread with investigation progress
                    """,
                    "warning_alert": """
⚠️ **Warning** - {{service.name}}

**Issue**: {{alert.name}}
**Metric**: {{metric.value}} (threshold: {{threshold}})
**Trend**: {{trend_indicator}}

[View Details]({{alert_url}}) | [Acknowledge]({{ack_url}})
                    """
                }
            }
        }
    
    def configure_slack_webhooks(self):
        """
        Slack webhook configuration
        """
        webhook_config = {
            "critical_webhook": {
                "url": "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX",
                "channel": "#alerts-critical",
                "username": "Datadog Critical",
                "icon_emoji": ":rotating_light:",
                "template": """
{
  "text": "🚨 Critical Alert: {{alert.name}}",
  "attachments": [
    {
      "color": "danger",
      "fields": [
        {
          "title": "Service",
          "value": "{{service.name}}",
          "short": true
        },
        {
          "title": "Current Value",
          "value": "{{alert.value}}",
          "short": true
        },
        {
          "title": "Threshold",
          "value": "{{alert.threshold}}",
          "short": true
        },
        {
          "title": "Duration",
          "value": "{{alert.duration}}",
          "short": true
        }
      ],
      "actions": [
        {
          "type": "button",
          "text": "🔍 Investigate",
          "url": "{{dashboard_url}}"
        },
        {
          "type": "button",
          "text": "✅ Acknowledge",
          "url": "{{acknowledge_url}}"
        },
        {
          "type": "button",
          "text": "📖 Runbook",
          "url": "{{runbook_url}}"
        }
      ]
    }
  ]
}
                """,
                "conditions": {
                    "severity": "critical",
                    "business_hours": "always"
                }
            },
            
            "team_specific_routing": {
                "backend_team": {
                    "channel": "#backend-alerts",
                    "services": ["api", "database", "queue"],
                    "mention_users": ["@backend-oncall"],
                    "suppress_low_priority": True
                },
                "frontend_team": {
                    "channel": "#frontend-alerts", 
                    "services": ["web", "mobile-api"],
                    "mention_users": ["@frontend-oncall"],
                    "business_hours_only": True
                }
            }
        }
        return webhook_config
    
    def create_interactive_slack_alert(self):
        """
        Create an interactive Slack alert
        """
        return {
            "message_template": """
{
  "text": "{{alert.severity_emoji}} {{alert.title}}",
  "blocks": [
    {
      "type": "header",
      "text": {
        "type": "plain_text",
        "text": "🚨 {{alert.title}}"
      }
    },
    {
      "type": "section",
      "fields": [
        {
          "type": "mrkdwn",
          "text": "*Service:*\n{{service.name}}"
        },
        {
          "type": "mrkdwn", 
          "text": "*Environment:*\n{{environment}}"
        },
        {
          "type": "mrkdwn",
          "text": "*Value:*\n{{alert.value}}"
        },
        {
          "type": "mrkdwn",
          "text": "*Duration:*\n{{alert.duration}}"
        }
      ]
    },
    {
      "type": "section",
      "text": {
        "type": "mrkdwn",
        "text": "*Quick Links*"
      },
      "accessory": {
        "type": "button",
        "text": {
          "type": "plain_text",
          "text": "🔍 Investigate"
        },
        "url": "{{investigation_url}}"
      }
    },
    {
      "type": "actions",
      "elements": [
        {
          "type": "button",
          "text": {
            "type": "plain_text",
            "text": "✅ Acknowledge"
          },
          "style": "primary",
          "url": "{{acknowledge_url}}"
        },
        {
          "type": "button",
          "text": {
            "type": "plain_text",
            "text": "🔇 Snooze 1h"
          },
          "url": "{{snooze_url}}"
        },
        {
          "type": "button",
          "text": {
            "type": "plain_text",
            "text": "📋 Create Incident"
          },
          "style": "danger",
          "url": "{{incident_url}}"
        }
      ]
    },
    {
      "type": "context",
      "elements": [
        {
          "type": "mrkdwn",
          "text": "📊 <{{dashboard_url}}|Dashboard> | 📝 <{{runbook_url}}|Runbook> | 🎯 <{{oncall_url}}|On-Call>"
        }
      ]
    }
  ]
}
            """
        }
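
In a Datadog monitor, severity-based routing is usually expressed inside the monitor message itself, using conditional blocks such as `{{#is_alert}}` together with `@slack-<channel>` notification handles. The helper below is a sketch that only assembles such a message string; `routed_message` and the channel names are assumptions, and the Slack integration must already be configured for the handles to resolve.

```python
def routed_message(body: str, critical_channel: str, warning_channel: str) -> str:
    """Append severity-conditional Slack handles to a monitor message."""
    lines = [
        body,
        # Sent only when the monitor is in the ALERT state
        "{{#is_alert}}@slack-" + critical_channel + "{{/is_alert}}",
        # Sent only when the monitor is in the WARN state
        "{{#is_warning}}@slack-" + warning_channel + "{{/is_warning}}",
    ]
    return "\n".join(lines)


message = routed_message(
    "High error rate on {{host.name}} (value: {{value}})",
    critical_channel="alerts-critical",
    warning_channel="alerts-application",
)
print(message)
```

Keeping the routing in the message means a single monitor can notify different channels as its state changes, without duplicating the monitor per channel.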

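PagerDuty Integration for Escalation

Integrating with PagerDuty automates complex escalation and on-call management.

yaml

# PagerDuty integration settings
# Note: a schematic layout used to illustrate the routing design, not a literal
# Datadog or PagerDuty file format. The integration keys below are placeholders.
apiVersion: datadog/v1
kind: Integration
metadata:
  name: pagerduty-escalation
spec:
  pagerduty:
    # PagerDuty routing per service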
    services:
      - name: "critical-infrastructure"
        integration_key: "a1b2c3d4e5f6g7h8i9j0k1l2m3n4o5p6"
        escalation_policy: "Infrastructure Team Escalation"
        urgency_mapping:
          critical: "high"
          warning: "low"
        
      - name: "payment-service"
        integration_key: "b2c3d4e5f6g7h8i9j0k1l2m3n4o5p6q7"
        escalation_policy: "Payment Team 24x7"
        urgency_mapping:
          critical: "high"
          warning: "high"  # For payments, warnings are also high urgency
    
    # Escalation strategy
    escalation_policies:
      - name: "Infrastructure Team Escalation"
        escalation_rules:
          - level: 1
            delay_minutes: 0
            targets:
              - type: "user"
                id: "infrastructure-oncall-primary"
          - level: 2
            delay_minutes: 15
            targets:
              - type: "user"
                id: "infrastructure-oncall-secondary"
              - type: "schedule"
                id: "infrastructure-manager-schedule"
          - level: 3
            delay_minutes: 30
            targets:
              - type: "user"
                id: "cto"
      
      - name: "Payment Team 24x7"
        escalation_rules:
          - level: 1
            delay_minutes: 0
            targets:
              - type: "schedule"
                id: "payment-team-primary"
          - level: 2
            delay_minutes: 5
            targets:
              - type: "schedule" 
                id: "payment-team-secondary"
              - type: "user"
                id: "payment-lead"
          - level: 3
            delay_minutes: 10
            targets:
              - type: "user"
                id: "vp-engineering"

# Automatic alert → PagerDuty handoff
auto_incident_creation:
  rules:
    - condition: "severity == 'critical' AND service IN ['payment', 'auth']"
      action: "create_high_urgency_incident"
      assignment: "payment-team-24x7"
      
    - condition: "severity == 'critical' AND alert_duration > '10m'"
      action: "escalate_to_management" 
      assignment: "escalation-schedule"
      
    - condition: "alert_count > 5 AND time_window == '15m'"
      action: "create_major_incident"
      assignment: "incident-commander"
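
The escalation rules above boil down to a lookup: given how long an incident has gone unacknowledged, who should have been notified by now? The sketch below encodes the "Payment Team 24x7" policy as data. It assumes each `delay_minutes` value is measured from the moment the incident was triggered; `notified_targets` and the tuple layout are hypothetical, not a PagerDuty API.

```python
from bisect import bisect_right

# (delay in minutes since the incident was triggered, targets notified at that level)
PAYMENT_POLICY = [
    (0, ["payment-team-primary"]),
    (5, ["payment-team-secondary", "payment-lead"]),
    (10, ["vp-engineering"]),
]


def notified_targets(policy, minutes_unacknowledged):
    """Return every target that has been notified after the given number of minutes."""
    delays = [delay for delay, _ in policy]
    levels_reached = bisect_right(delays, minutes_unacknowledged)
    return [target for _, targets in policy[:levels_reached] for target in targets]


for minutes in (0, 7, 10):
    print(minutes, notified_targets(PAYMENT_POLICY, minutes))
```

Expressing the policy as data makes it straightforward to test escalation timing before anyone is paged for real.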

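Microsoft Teams and Discord Integration

Supporting a range of communication tools lets the alerting system fit into an organization's existing workflows.

python

# Microsoft Teams integration
def setup_teams_integration():
    """
    Notification settings for Microsoft Teams
    """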
    teams_webhook_config = {
        "webhook_url": "https://outlook.office.com/webhook/...",
        "message_format": {
            "@type": "MessageCard",
            "@context": "http://schema.org/extensions",
            "themeColor": "{{alert.color}}",
            "summary": "Datadog Alert: {{alert.name}}",
            "sections": [
                {
                    "activityTitle": "{{alert.severity_emoji}} {{alert.title}}",
                    "activitySubtitle": "Service: {{service.name}}",
                    "activityImage": "https://datadog-prod.imgix.net/img/dd_logo_70x75.png",
                    "facts": [
                        {
                            "name": "Current Value",
                            "value": "{{alert.value}}"
                        },
                        {
                            "name": "Threshold", 
                            "value": "{{alert.threshold}}"
                        },
                        {
                            "name": "Duration",
                            "value": "{{alert.duration}}"
                        },
                        {
                            "name": "Environment",
                            "value": "{{environment}}"
                        }
                    ],
                    "markdown": True
                }
            ],
            "potentialAction": [
                {
                    "@type": "OpenUri",
                    "name": "View in Datadog",
                    "targets": [
                        {
                            "os": "default",
                            "uri": "{{alert.url}}"
                        }
                    ]
                },
                {
                    "@type": "OpenUri", 
                    "name": "Investigation Dashboard",
                    "targets": [
                        {
                            "os": "default",
                            "uri": "{{dashboard.url}}"
                        }
                    ]
                }
            ]
        }
    }
    return teams_webhook_config

# Discord integration (for gaming and developer communities)
def setup_discord_integration():
    """
    Notification settings for Discord
    """
    return {
        "webhook_url": "https://discord.com/api/webhooks/...",
        "message_format": {
            "content": "{{alert.severity_emoji}} **{{alert.title}}**",
            "embeds": [
                {
                    "title": "{{alert.name}}",
                    "description": "{{alert.description}}",
                    "color": "{{alert.color_int}}",
                    "fields": [
                        {
                            "name": "🎯 Service",
                            "value": "{{service.name}}",
                            "inline": True
                        },
                        {
                            "name": "📊 Value", 
                            "value": "{{alert.value}}",
                            "inline": True
                        },
                        {
                            "name": "⏱️ Duration",
                            "value": "{{alert.duration}}",
                            "inline": True
                        }
                    ],
                    "footer": {
                        "text": "Datadog Alert System",
                        "icon_url": "https://datadog-prod.imgix.net/img/dd_logo_70x75.png"
                    },
                    "timestamp": "{{alert.timestamp}}"
                }
            ]
        }
    }
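
One detail that is easy to miss with Discord: the embed `color` field is an integer, not a hex string. The sketch below builds a webhook payload with the color converted accordingly. The severity-to-color mapping, `discord_payload`, and the sample values are assumptions for illustration; actually posting the payload to a webhook URL is left out.

```python
import json

# Hypothetical severity colors as hex strings
SEVERITY_COLORS = {
    "critical": "#E01E5A",
    "warning": "#ECB22E",
    "info": "#36C5F0",
}


def hex_to_int(hex_color: str) -> int:
    """Convert '#RRGGBB' to the integer form Discord embeds expect."""
    return int(hex_color.lstrip("#"), 16)


def discord_payload(title: str, severity: str, service: str, value: str) -> dict:
    """Build a Discord webhook payload with a single embed."""
    return {
        "content": f"**{title}**",
        "embeds": [
            {
                "title": title,
                "color": hex_to_int(SEVERITY_COLORS[severity]),
                "fields": [
                    {"name": "Service", "value": service, "inline": True},
                    {"name": "Value", "value": value, "inline": True},
                ],
            }
        ],
    }


payload = discord_payload("High error rate", "critical", "payment-api", "7.2%")
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be sent as the JSON body of a POST request to the webhook URL.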

Designing Escalation Policies

Multi-Tier Escalation Strategy

Design staged escalation according to business impact and technical complexity.

python

# Escalation strategy implementation
class EscalationPolicyDesign:
    """
    Comprehensive escalation strategy
    """

    def __init__(self):
        self.escalation_matrix = {
            # 1. Escalation by time of day
            "time_based_escalation": {
                "business_hours": {
                    "definition": "Weekdays 9:00-18:00 JST",
                    "primary_contact": "Development team",
                    "escalation_delay": "15 minutes",
                    "escalation_chain": [
                        "Responsible developer",
                        "Team lead", 
                        "Architect",
                        "Manager"
                    ]
                },
                "after_hours": {
                    "definition": "Weekdays 18:00-9:00, weekends, and public holidays",
                    "primary_contact": "On-call engineer",
                    "escalation_delay": "10 minutes",
                    "escalation_chain": [
                        "On-call engineer",
                        "Senior on-call",
                        "Infrastructure lead",
                        "CTO"
                    ]
                }
            },
            
            # 2. Escalation by service criticality
            "service_criticality": {
                "tier1_services": {
                    "services": ["payment", "authentication", "checkout"],
                    "sla_target": "99.9%",
                    "mttr_target": "5 minutes",
                    "escalation": {
                        "immediate": "Primary on-call",
                        "5min": "Secondary on-call + product manager",
                        "15min": "Engineering manager",
                        "30min": "VP Engineering + CEO"
                    }
                },
                "tier2_services": {
                    "services": ["analytics", "reporting", "recommendations"],
                    "sla_target": "99.5%",
                    "mttr_target": "15 minutes",
                    "escalation": {
                        "immediate": "Owning team",
                        "15min": "Team lead",
                        "30min": "Architect",
                        "60min": "Manager"
                    }
                }
            },
            
            # 3. Escalation by incident severity
            "incident_severity": {
                "sev1_critical": {
                    "definition": "Complete service outage, data loss, or security breach",
                    "response_time": "Immediate",
                    "escalation_schedule": {
                        "0min": "All on-call teams + incident commander",
                        "5min": "Engineering leadership",
                        "15min": "C-suite + external vendors",
                        "30min": "Customer communications team",
                        "60min": "Legal and compliance team"
                    }
                },
                "sev2_major": {
                    "definition": "Degradation of important features or performance problems",
                    "response_time": "Within 5 minutes",
                    "escalation_schedule": {
                        "0min": "Owning team's on-call",
                        "15min": "Team lead + architect",
                        "30min": "Engineering manager",
                        "60min": "Product manager"
                    }
                }
            }
        }
    
    def create_escalation_automation(self):
        """
        Automatic escalation rules
        """
        return {
            "escalation_triggers": {
                # Automatic escalation based on how long an alert has been firing
                "duration_based": {
                    "rules": [
                        {
                            "condition": "alert_duration >= 15m AND severity == 'critical'",
                            "action": "escalate_to_manager",
                            "notification": "SMS + Phone call"
                        },
                        {
                            "condition": "alert_duration >= 30m AND service IN ['payment', 'auth']",
                            "action": "escalate_to_executive",
                            "notification": "All channels + Incident bridge"
                        }
                    ]
                },
                
                # Automatic escalation when multiple alerts fire at the same time
                "multi_alert_escalation": {
                    "rules": [
                        {
                            "condition": "critical_alerts_count >= 3 AND time_window == '10m'",
                            "action": "declare_major_incident",
                            "escalation": "incident_commander"
                        },
                        {
                            "condition": "affected_services_count >= 5",
                            "action": "escalate_to_architecture_team",
                            "escalation": "system_architect"
                        }
                    ]
                },
                
                # Automatic adjustment for business time windows
                "business_context_escalation": {
                    "peak_hours": {
                        "definition": "Weekdays 12:00-14:00, 20:00-22:00",
                        "escalation_acceleration": "50%",
                        "additional_notifications": ["business_stakeholders"]
                    },
                    "maintenance_windows": {
                        "definition": "Sundays 02:00-06:00",
                        "escalation_suppression": "non_critical_alerts",
                        "modified_thresholds": "relaxed_by_20%"
                    }
                }
            }
        }
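
To make the schedule tables above concrete, here is a minimal sketch of an elapsed-time lookup against such a schedule. `SEV2_SCHEDULE` and `current_escalation_step` are illustrative names and not part of Datadog, and the `acceleration` parameter is only one possible reading of the 50% peak-hours rule.

```python
from typing import Dict, Optional

# Hypothetical schedule mirroring the SEV2 example: minutes elapsed -> who is notified.
SEV2_SCHEDULE: Dict[int, str] = {
    0: "team on-call",
    15: "team lead + architect",
    30: "engineering manager",
    60: "product manager",
}


def current_escalation_step(
    schedule: Dict[int, str],
    elapsed_minutes: float,
    acceleration: float = 0.0,
) -> Optional[str]:
    """Return the most senior step reached after `elapsed_minutes`.

    `acceleration` shortens every delay by that fraction (0.5 means each
    step is reached in half the time). This is an assumed interpretation
    of "escalation_acceleration".
    """
    if not 0.0 <= acceleration < 1.0:
        raise ValueError("acceleration must be in [0, 1)")
    reached = None
    for delay in sorted(schedule):
        # A step is reached once the (possibly shortened) delay has passed.
        if elapsed_minutes >= delay * (1.0 - acceleration):
            reached = schedule[delay]
    return reached
```

With this sketch, an alert that has been firing for 20 minutes maps to the team lead and architect during normal hours, and to the engineering manager when the 50% acceleration applies.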

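Preventing Alert Fatigue

Intelligent Alert Aggregation

Automatically grouping related alerts cuts noise and speeds up problem resolution.

python

# Alert fatigue prevention system
class AlertFatiguePreventionSystem:
    """
    Comprehensive strategy for preventing alert fatigue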
    """
    
    def __init__(self):
        self.prevention_strategies = {
            # 1. Dynamic alert suppression
            "dynamic_suppression": {
                "correlation_rules": [
                    {
                        "name": "database_cascade_suppression",
                        "primary_alert": "database_connection_failure",
                        "suppressed_alerts": [
                            "application_database_errors",
                            "api_timeout_errors",
                            "queue_processing_delays"
                        ],
                        "suppression_duration": "primary_alert_duration + 5m",
                        "message_override": "Suppressed: Related to primary DB issue"
                    },
                    {
                        "name": "deployment_noise_reduction",
                        "trigger": "deployment_event_detected",
                        "suppressed_categories": [
                            "error_rate_temporary_spike",
                            "latency_variation_during_rollout"
                        ],
                        "suppression_duration": "30m",
                        "conditions": "error_rate < 10% AND latency_increase < 200%"
                    }
                ]
            },
            
            # 2. Adaptive threshold tuning
            "adaptive_thresholds": {
                "learning_algorithms": {
                    "false_positive_reduction": {
                        "algorithm": "gradient_boosting",
                        "features": [
                            "historical_alert_outcomes",
                            "system_context_at_alert_time",
                            "resolution_actions_taken"
                        ],
                        "retraining_frequency": "weekly",
                        "confidence_threshold": 0.85
                    },
                    "seasonal_adjustment": {
                        "algorithm": "seasonal_decomposition",
                        "components": ["trend", "seasonal", "residual"],
                        "adjustment_frequency": "daily",
                        "lookback_period": "90_days"
                    }
                }
            },
            
            # 3. Intelligent grouping
            "intelligent_grouping": {
                "grouping_strategies": [
                    {
                        "strategy": "temporal_correlation",
                        "description": "Group alerts that occur close together in time",
                        "time_window": "5m",
                        "correlation_threshold": 0.8,
                        "max_group_size": 10
                    },
                    {
                        "strategy": "service_dependency",
                        "description": "Group alerts based on service dependencies",
                        "dependency_graph": "service_map",
                        "propagation_delay": "2m",
                        "impact_scoring": "business_criticality_weighted"
                    },
                    {
                        "strategy": "root_cause_clustering",
                        "description": "Cluster alerts that share a common root cause",
                        "clustering_algorithm": "dbscan",
                        "feature_extraction": "alert_metadata_vectorization",
                        "cluster_stability_threshold": 0.7
                    }
                ]
            }
        }
    
    def implement_smart_alert_routing(self):
        """
        Smart alert routing configuration
        """
        routing_config = {
            "routing_rules": [
                {
                    "rule_name": "expertise_based_routing",
                    "description": "Automatically assign responders based on past resolution history",
                    "implementation": {
                        "algorithm": "collaborative_filtering",
                        "factors": [
                            "past_resolution_success_rate",
                            "domain_expertise_score",
                            "current_workload",
                            "availability_status"
                        ],
                        "fallback": "round_robin_assignment"
                    }
                },
                {
                    "rule_name": "context_aware_prioritization",
                    "description": "Adjust priority based on business context",
                    "context_factors": [
                        "current_traffic_volume",
                        "business_events_calendar",
                        "deployment_schedule",
                        "customer_tier_impact"
                    ],
                    "priority_adjustment": {
                        "high_traffic_periods": "+2_levels",
                        "business_critical_hours": "+1_level", 
                        "maintenance_windows": "-1_level"
                    }
                }
            ],
            
            "notification_optimization": {
                "digest_mode": {
                    "trigger_conditions": [
                        "low_priority_alerts_count > 5",
                        "alert_frequency > 10_per_hour"
                    ],
                    "digest_frequency": "every_30m",
                    "summary_format": "categorized_counts_with_trends"
                },
                "escalation_bypass": {
                    "auto_resolve_conditions": [
                        "alert_duration < 2m AND auto_recovery == true",
                        "known_transient_issue == true"
                    ],
                    "confidence_requirements": "ml_confidence > 0.95"
                }
            }
        }
        return routing_config

# Example: alert correlation engine
def create_alert_correlation_engine():
    """
    Alert correlation analysis engine
    """
    correlation_config = {
        "correlation_engine": {
            "temporal_correlation": {
                # Pseudo-SQL: CORRELATION() and OVERLAP() are placeholders for
                # custom scoring functions, not standard SQL.
                "query": """
                SELECT
                    a1.alert_id AS primary_alert,
                    a2.alert_id AS correlated_alert,
                    CORRELATION(a1.timestamp, a2.timestamp) AS time_correlation,
                    OVERLAP(a1.affected_services, a2.affected_services) AS service_overlap
                FROM alerts a1
                JOIN alerts a2
                    ON a1.alert_id < a2.alert_id  -- consider each pair only once
                    AND a1.timestamp BETWEEN a2.timestamp - INTERVAL 5 MINUTE
                        AND a2.timestamp + INTERVAL 5 MINUTE
                HAVING time_correlation > 0.8 OR service_overlap > 0.5
                """,
                "action": "group_correlated_alerts"
            },
            
            "causal_inference": {
                "algorithm": "granger_causality",
                "lag_windows": [1, 5, 15, 30],  # minutes
                "significance_threshold": 0.05,
                "minimum_occurrences": 5,
                "causal_strength_threshold": 0.7
            },
            
            "suppression_rules": {
                "downstream_suppression": {
                    "rule": "IF primary_service_alert THEN suppress dependent_service_alerts",
                    "dependency_source": "service_map",
                    "suppression_duration": "primary_alert_duration + 10m"
                },
                "infrastructure_cascade": {
                    "rule": "IF infrastructure_alert THEN suppress application_alerts",
                    "layers": ["network", "compute", "storage", "application"],
                    "cascade_delay": "2m_per_layer"
                }
            }
        }
    }
    return correlation_config
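
The configuration above is declarative, so a small runnable sketch may help show what temporal grouping and downstream suppression could look like in practice. Everything here (`Alert`, `group_by_time_window`, `suppress_downstream`, and the shape of the dependency map) is a hypothetical illustration under simplified assumptions, not a Datadog API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Set, Tuple


@dataclass
class Alert:
    alert_id: str
    service: str
    timestamp: datetime


def group_by_time_window(
    alerts: List[Alert],
    window: timedelta = timedelta(minutes=5),
    max_group_size: int = 10,
) -> List[List[Alert]]:
    """Group alerts that fire within `window` of the first alert in a group."""
    groups: List[List[Alert]] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = groups[-1] if groups else None
        if (
            current is not None
            and alert.timestamp - current[0].timestamp <= window
            and len(current) < max_group_size
        ):
            current.append(alert)
        else:
            # Too late or group is full: start a new group.
            groups.append([alert])
    return groups


def suppress_downstream(
    alerts: List[Alert],
    dependencies: Dict[str, Set[str]],
) -> Tuple[List[Alert], List[Alert]]:
    """Split alerts into (kept, suppressed).

    `dependencies` maps a service to the services it depends on. An alert
    is suppressed when one of its upstream services is also alerting.
    """
    alerting = {a.service for a in alerts}
    kept: List[Alert] = []
    suppressed: List[Alert] = []
    for alert in alerts:
        upstream = dependencies.get(alert.service, set())
        (suppressed if upstream & alerting else kept).append(alert)
    return kept, suppressed
```

The window is anchored to the first alert in each group rather than chained from alert to alert, which keeps a slow trickle of alerts from growing one group indefinitely. A real system would also need the suppression duration and cascade delay described in the configuration.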

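Downtime Configuration and Maintenance Work

Planned Downtime Management

Controlling alerts appropriately during maintenance and deployments reduces false positives without letting critical alerts slip through.

python

# Planned downtime management system
class PlannedDowntimeManagement:
    """
    Comprehensive management of planned downtime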
    """
    
    def __init__(self):
        self.downtime_strategies = {
            # 1. Alert control during deployments
            "deployment_downtime": {
                "trigger_sources": [
                    "ci_cd_pipeline_webhook",
                    "deployment_api_call",
                    "git_tag_creation"
                ],
                "downtime_scope": {
                    "services": "deployment_target_services",
                    "alert_types": ["error_rate", "latency_spike", "throughput_drop"],
                    "duration": "deployment_duration + 15m",
                    "partial_suppression": {
                        "error_rate": "threshold_relaxed_by_300%",
                        "latency": "threshold_relaxed_by_200%"
                    }
                },
                "safety_mechanisms": {
                    "max_duration": "2h",
                    "critical_alerts_override": ["security_breach", "data_loss"],
                    "rollback_detection": "automatic_downtime_cancellation"
                }
            },
            
            # 2. Infrastructure maintenance
            "infrastructure_maintenance": {
                "scheduling": {
                    "preferred_windows": [
                        "Sunday 02:00-06:00 JST",
                        "Tuesday 03:00-05:00 JST"  # Low traffic periods
                    ],
                    "advance_notice": "72h",
                    "stakeholder_approval": ["infrastructure_team", "product_team"]
                },
                "downtime_configuration": {
                    "scope": "infrastructure_layer_specific",
                    "cascade_detection": True,
                    "partial_service_monitoring": {
                        "keep_active": ["security_alerts", "data_integrity_checks"],
                        "modify_thresholds": ["availability_checks", "performance_metrics"]
                    }
                }
            },
            
            # 3. Third-party dependencies
            "third_party_dependencies": {
                "known_providers": [
                    {
                        "provider": "aws",
                        "services": ["rds", "s3", "cloudfront"],
                        "status_page": "https://status.aws.amazon.com/",
                        "webhook": "aws_status_webhook",
                        "auto_downtime": {
                            "condition": "provider_service_degraded",
                            "affected_alerts": "dependency_related_alerts",
                            "message": "AWS service degradation detected - alerts suppressed"
                        }
                    },
                    {
                        "provider": "stripe",
                        "services": ["payment_processing"],
                        "status_page": "https://status.stripe.com/",
                        "auto_downtime": {
                            "condition": "payment_provider_incident",
                            "affected_alerts": "payment_related_alerts"
                        }
                    }
                ]
            }
        }
    
    def create_intelligent_downtime_scheduler(self):
        """
        Intelligent downtime scheduler
        """
        return {
            "scheduler_config": {
                "traffic_pattern_analysis": {
                    "data_sources": ["nginx_access_logs", "application_metrics"],
                    "analysis_period": "past_30_days",
                    "optimal_window_detection": {
                        "criteria": [
                            "traffic_volume < 20% of peak",
                            "active_user_count < 100",
                            "business_critical_operations == 0"
                        ]
                    }
                },
                
                "impact_assessment": {
                    "user_impact_scoring": {
                        "factors": [
                            "estimated_affected_users",
                            "revenue_impact_per_minute",
                            "customer_tier_distribution"
                        ],
                        "threshold": "max_acceptable_impact_score"
                    },
                    "business_calendar_integration": {
                        "avoid_periods": [
                            "black_friday",
                            "end_of_month_processing",
                            "product_launches"
                        ],
                        "calendar_apis": ["google_calendar", "outlook"]
                    }
                },
                
                "automated_coordination": {
                    "pre_downtime_actions": [
                        "stakeholder_notification_24h_advance",
                        "status_page_update",
                        "customer_support_team_briefing"
                    ],
                    "during_downtime_monitoring": [
                        "essential_alerts_only",
                        "progress_tracking",
                        "rollback_readiness_check"
                    ],
                    "post_downtime_validation": [
                        "service_health_verification",
                        "performance_baseline_confirmation",
                        "alert_system_reactivation"
                    ]
                }
            }
        }

# Example downtime configuration
def configure_deployment_downtime():
    """
    Downtime configuration for a deployment
    """
    return {
        "downtime_config": {
            "name": "Production Deployment - API Service",
            "scope": {
                "monitors": [
                    "api-error-rate",
                    "api-response-time", 
                    "api-throughput"
                ],
                "tags": ["service:api", "env:production"]
            },
            "schedule": {
                "start": "{{deployment.start_time}}",
                "end": "{{deployment.estimated_end_time + 15m}}",
                "timezone": "Asia/Tokyo"
            },
            "message": """
🚀 **Scheduled Deployment - API Service**

**Deployment Window**: {{deployment.start_time}} - {{deployment.end_time}}
**Estimated Duration**: {{deployment.duration}}
**Services Affected**: API Backend, Authentication Service

**Alert Suppression**:
- Error rate alerts: Relaxed thresholds (temporary spike expected)
- Latency alerts: Disabled during rolling deployment
- Availability alerts: Modified for zero-downtime deployment

**Emergency Escalation**:
- Critical security alerts: ACTIVE
- Data integrity alerts: ACTIVE
- Infrastructure alerts: ACTIVE

**Rollback Trigger**: Error rate > 10% for 5+ minutes
**Contact**: DevOps Team (#devops-alerts)
**Runbook**: https://wiki.company.com/deployments/api-service
            """,
            "recurrence": None,  # One-time downtime
            "suppression_type": "partial"  # Not complete suppression
        }
    }
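
The safety mechanisms described above (a post-deployment buffer, a maximum duration, and alerts that are never muted) can be sketched in plain Python. The constants mirror the values in the configuration, while the function names and the idea of passing alert types as strings are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Set, Tuple

POST_DEPLOY_BUFFER = timedelta(minutes=15)        # "deployment_duration + 15m"
MAX_DOWNTIME = timedelta(hours=2)                 # "max_duration": "2h"
ALWAYS_ACTIVE = {"security_breach", "data_loss"}  # "critical_alerts_override"

Window = Tuple[datetime, datetime]


def downtime_window(start: datetime, estimated_end: datetime) -> Window:
    """Return (start, end) with the buffer added and the maximum enforced."""
    if estimated_end <= start:
        raise ValueError("estimated_end must be after start")
    end = min(estimated_end + POST_DEPLOY_BUFFER, start + MAX_DOWNTIME)
    return start, end


def is_suppressed(
    alert_type: str,
    fired_at: datetime,
    window: Window,
    muted_types: Set[str],
) -> bool:
    """Decide whether an alert should be muted by the downtime."""
    if alert_type in ALWAYS_ACTIVE:
        return False  # critical overrides are never muted
    start, end = window
    return alert_type in muted_types and start <= fired_at < end
```

Capping the window at `MAX_DOWNTIME` means a stalled deployment cannot keep alerts muted indefinitely, which is the main risk of tying suppression to an estimated end time.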

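Implementing SLO/SLI Monitoring

Designing Service Level Objectives

Service Level Objectives (SLOs) set the targets and Service Level Indicators (SLIs) measure them objectively; together they put quality management on a systematic footing.

python

# SLO/SLI monitoring system
class SLOMonitoringSystem:
    """
    Comprehensive SLO/SLI monitoring system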
    """
    
    def __init__(self):
        self.slo_definitions = {
            # 1. Availability SLO
            "availability_slo": {
                "target": "99.9%",  # "three nines"
                "measurement_window": "rolling_30_days",
                "sli_definition": {
                    "metric": "availability_percentage",
                    "calculation": """
                    (total_time - downtime) / total_time * 100
                    WHERE downtime = sum(incident_duration WHERE severity IN ['critical', 'major'])
                    """,
                    "data_sources": [
                        "synthetic_monitoring",
                        "real_user_monitoring",
                        "infrastructure_health_checks"
                    ]
                },
                "error_budget": {
                    "monthly_allowance": "43.2_minutes",  # 0.1% of 30 days (43,200 minutes)
                    "burn_rate_alerting": {
                        "fast_burn": "12x_burn_rate_over_1h",  # 12x the normal rate
                        "slow_burn": "above_1x_burn_rate_over_6h"  # Long-term drift: any sustained rate above 1x exhausts the budget early
                    }
                }
            },
            
            # 2. Latency SLO
            "latency_slo": {
                "targets": {
                    "p50": "< 200ms",
                    "p95": "< 500ms", 
                    "p99": "< 1000ms"
                },
                "measurement_window": "rolling_7_days",
                "sli_definition": {
                    "metric": "response_time_percentiles",
                    "calculation": """
                    PERCENTILE(response_time, [50, 95, 99])
                    FROM http_requests 
                    WHERE status_code < 500
                    GROUP BY time_bucket('5m', timestamp)
                    """,
                    "exclusions": [
                        "health_check_endpoints",
                        "internal_service_calls",
                        "status_code >= 500"  # Don't penalize for server errors
                    ]
                },
                "compliance_tracking": {
                    "good_minutes": "minutes_where_p95 < 500ms",
                    "total_minutes": "all_minutes_with_traffic",
                    "target_compliance": "95%"  # 95% of minutes must meet SLO
                }
            },
            
            # 3. Error Rate SLO
            "error_rate_slo": {
                "target": "< 0.1%",  # 99.9% success rate
                "measurement_window": "rolling_24_hours",
                "sli_definition": {
                    "metric": "error_rate_percentage",
                    "calculation": """
                    (error_requests / total_requests) * 100
                    WHERE error_requests = COUNT(status_code >= 500)
                    AND total_requests = COUNT(*)
                    """,
                    "time_granularity": "1_minute_buckets"
                },
                "error_budget_policy": {
                    "budget_depletion_50%": "warning_alert",
                    "budget_depletion_80%": "critical_alert",
                    "budget_depletion_100%": "feature_freeze_trigger"
                }
            },
            
            # 4. Throughput SLO
            "throughput_slo": {
                "target": "> 1000_rps_p95",  # 95th percentile throughput
                "measurement_window": "rolling_1_hour",
                "sli_definition": {
                    "metric": "requests_per_second",
                    "calculation": """
                    PERCENTILE(requests_per_second, 95)
                    FROM (
                        SELECT COUNT(*) / 60 as requests_per_second
                        FROM http_requests
                        GROUP BY time_bucket('1m', timestamp)
                    )
                    """
                },
                "capacity_alerting": {
                    "utilization_threshold": "80%_of_target",
                    "trend_analysis": "24h_forecast_breach"
                }
            }
        }
    
    def create_slo_monitoring_alerts(self):
        """
        Build the SLO monitoring alerts
        """
        slo_alerts = {
            # Error budget burn rate alert
            "error_budget_burn_rate": {
                "name": "SLO Error Budget Burn Rate Alert",
                # Threshold of 1.2% = 12x the 0.1% error budget rate
                "query": (
                    "sum(last_1h):( "
                    "sum:http.requests.errors{service:api-backend}.as_count() / "
                    "sum:http.requests.total{service:api-backend}.as_count() "
                    ") * 100 > 1.2"
                ),
                "message": """
🔥 **SLO error budget fast burn detected**

**Current error rate**: {{error_rate}}%
**Normal target**: 0.1%
**Fast-burn threshold**: 1.2% (12x burn rate)

**Error budget status**:
- Current burn rate: {{burn_rate}}x
- Projected exhaustion at this pace: {{estimated_depletion_time}}
- Remaining monthly budget: {{remaining_budget}}%

**Immediate actions**:
1. 🚨 Convene the incident response team
2. 📊 Analyze the root cause in the [Error Tracking Dashboard]({{error_dashboard}})
3. 🛑 Limit traffic with feature flags if needed

**Escalation**:
- Ongoing for 15 minutes: notify the product manager
- Ongoing for 30 minutes: consider halting new feature deployments
                """,
                "priority": "high"
            },
            
            # SLO target miss alert
            "slo_target_miss": {
                "name": "SLO Target Miss - Availability",
                # Assumes service.availability is a custom gauge reporting availability as a percentage
                "query": "avg(last_24h):avg:service.availability{service:api-backend} < 99.9",
                "message": """
📉 **SLO target missed: availability**

**24-hour availability**: {{availability}}%
**SLO target**: 99.9%
**Gap to target**: {{slo_gap}}%

**Impact analysis**:
- Cumulative downtime: {{total_downtime}}
- Monthly error budget consumed: {{budget_consumed}}%
- Major incidents: {{major_incidents}}

**Improvement actions**:
1. Review incident root cause analyses
2. Consider improvements to automatic recovery mechanisms
3. Revisit infrastructure redundancy

**Long-term strategy**:
- Reassess whether the SLO target is appropriate
- Strengthen preventive monitoring
- Run failure drills
                """
            },
            
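            # Latency SLO violation alert
            "latency_slo_violation": {
                "name": "Latency SLO Violation - P95",
                # Percentile aggregation is assumed to need the trace.web.request distribution
                # metric; APM latency is reported in seconds, so 500 ms is written as 0.5.
                "query": "percentile(last_5m):p95:trace.web.request{service:api-backend} > 0.5",
                "message": """
⏱️ **Latency SLO violation**

**Current P95 response time**: {{p95_latency}}ms
**SLO target**: 500ms
**Excess over target**: {{latency_excess}}ms

**Performance analysis**:
- P50: {{p50_latency}}ms
- P99: {{p99_latency}}ms
- Slow response rate: {{slow_response_rate}}%

**Diagnostics**:
1. 🔍 Analyze dependencies in the [APM Service Map]({{service_map}})
2. 💾 Check query optimization in [Database Performance]({{db_dashboard}})
3. 🚀 Check resources in [Infrastructure Metrics]({{infra_dashboard}})

**Automated response**:
- Evaluating autoscaling...
- Adjusting load balancing...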
                """
            }
        }
        return slo_alerts
    
    def implement_error_budget_policy(self):
        """
        Error budget policy definition
        """
        return {
            "error_budget_policy": {
                "budget_thresholds": {
                    "green_zone": {
                        "range": "0-50% consumed",
                        "actions": [
                            "normal_development_velocity",
                            "new_feature_rollout_approved",
                            "experimental_features_allowed"
                        ]
                    },
                    "yellow_zone": {
                        "range": "50-80% consumed", 
                        "actions": [
                            "increased_monitoring_focus",
                            "deployment_frequency_review",
                            "reliability_improvement_prioritization"
                        ]
                    },
                    "red_zone": {
                        "range": "80-100% consumed",
                        "actions": [
                            "feature_freeze_consideration",
                            "urgent_reliability_fixes_only",
                            "daily_slo_review_meetings"
                        ]
                    },
                    "crisis_zone": {
                        "range": "100%+ consumed",
                        "actions": [
                            "immediate_feature_freeze",
                            "all_hands_reliability_focus",
                            "executive_escalation"
                        ]
                    }
                },
                
                "automated_responses": {
                    "deployment_gates": {
                        "condition": "error_budget_remaining < 20%",
                        "action": "require_sre_approval_for_deployments"
                    },
                    "traffic_shaping": {
                        "condition": "error_budget_burn_rate > 10x",
                        "action": "activate_rate_limiting_and_circuit_breakers"
                    },
                    "alert_escalation": {
                        "condition": "error_budget_depletion_forecast < 7_days",
                        "action": "daily_slo_review_meeting_schedule"
                    }
                }
            }
        }

# SLO dashboard configuration
def create_slo_dashboard():
    """
    Configuration for the SLO monitoring dashboard
    """
    return {
        "dashboard_config": {
            "title": "Service Level Objectives - Real-time Monitoring",
            "widgets": [
                {
                    "title": "SLO Compliance Summary",
                    "type": "query_value",
                    "queries": [
                        {
                            "name": "Availability SLO",
                            "query": "avg:slo.availability.compliance{service:api-backend}",
                            "display": "percentage"
                        },
                        {
                            "name": "Latency SLO (P95)",
                            "query": "avg:slo.latency.p95.compliance{service:api-backend}",
                            "display": "percentage"
                        },
                        {
                            "name": "Error Rate SLO",
                            "query": "avg:slo.error_rate.compliance{service:api-backend}",
                            "display": "percentage"
                        }
                    ]
                },
                {
                    "title": "Error Budget Burn Rate",
                    "type": "timeseries",
                    "queries": [
                        {
                            "query": "per_second(avg:slo.error_budget.consumed{service:api-backend})",
                            "display_type": "line"
                        }
                    ],
                    "yaxis": {
                        "scale": "linear",
                        "min": 0,
                        "max": "auto"
                    }
                },
                {
                    "title": "SLO Trend Analysis",
                    "type": "heatmap",
                    "query": "avg:slo.compliance.matrix{*} by {service,slo_type}",
                    "time_range": "7d"
                }
            ]
        }
    }
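
The error budget figures used throughout this section follow from simple arithmetic, which the sketch below works through. The function names are illustrative, and the way zone boundaries are assigned (for example, exactly 50% counting as yellow) is an assumption, since the policy above only gives ranges.

```python
def error_budget_minutes(slo_target_pct: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_target_pct / 100)


def burn_rate(observed_error_pct: float, slo_target_pct: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    allowed_error_pct = 100 - slo_target_pct
    return observed_error_pct / allowed_error_pct


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until a full budget is gone if `rate` is sustained."""
    if rate <= 0:
        raise ValueError("burn rate must be positive")
    return window_days * 24 / rate


def budget_zone(consumed_pct: float) -> str:
    """Map consumed budget to the policy zones (boundary handling assumed)."""
    if consumed_pct < 50:
        return "green_zone"
    if consumed_pct < 80:
        return "yellow_zone"
    if consumed_pct < 100:
        return "red_zone"
    return "crisis_zone"
```

For a 99.9% target over 30 days, `error_budget_minutes` gives about 43.2 minutes. An observed error rate of 1.2% corresponds to a burn rate of about 12x, and at that pace a full 30-day budget would last roughly 60 hours.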

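Summary

This article covered how to implement a Datadog alerting and notification system end to end:

🎯 Key Outcomes

Strategic alert design: priorities based on business impact, and alerts that are actionable
Diverse notification channels: efficient communication through Slack, PagerDuty, and Teams integrations
Intelligent escalation: automatic escalation that accounts for time of day, criticality, and context
Alert fatigue prevention: correlation analysis and noise reduction using ML
Planned downtime management: appropriate alert control during maintenance and deployments
SLO/SLI monitoring: quality assurance and business alignment through error budget management

As the next step, Part 7 (Security Monitoring) will cover threat detection, compliance monitoring, and security dashboards in detail.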
