4.3 Container and Kubernetes Environments - A Complete Implementation of Cloud-Native Monitoring

As cloud-native applications have become widespread, monitoring container and Kubernetes environments has become an essential part of modern enterprise infrastructure. New Relic Infrastructure provides end-to-end visibility, from the application level down to the infrastructure level, even in dynamic and complex container environments.

This section comprehensively covers enterprise-grade implementation strategies and advanced monitoring techniques, from Docker container monitoring to operating large-scale Kubernetes clusters.

🎯 Learning Goals for This Section

📦 Mastering Container Monitoring

  • Detailed monitoring and performance optimization of Docker environments
  • End-to-end tracking of the container lifecycle
  • Image security and compliance monitoring
  • Registry management and vulnerability-scanning integration

☸️ Becoming a Kubernetes Monitoring Expert

  • Layered monitoring from the cluster level down to the Pod level
  • Visualizing dependencies between microservices
  • Coordinating resource management with autoscaling
  • Integration with GitOps workflows

🏢 Enabling Enterprise Operations

  • Unified management of multi-cluster environments
  • Automated security policies
  • Cost optimization and resource efficiency
  • Integration with DevSecOps pipelines

🐳 Advanced Docker Container Monitoring

📊 Enterprise Docker Monitoring Architecture

🏗️ Monitoring Layer Structure

yaml
Docker_Monitoring_Architecture:
  Layer_1_Infrastructure:
    - Host System Resources (CPU, Memory, Disk, Network)
    - Docker Engine Performance
    - Container Runtime Statistics
    - Storage Driver Metrics
    
  Layer_2_Container:
    - Container Lifecycle Events
    - Resource Usage per Container
    - Network Traffic Analysis
    - Volume Mount Monitoring
    
  Layer_3_Application:
    - Application Performance Metrics
    - Custom Business Metrics
    - Distributed Tracing
    - Error Tracking
    
  Layer_4_Security:
    - Image Vulnerability Scanning
    - Runtime Security Monitoring
    - Access Control Auditing
    - Compliance Reporting

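The layer mapping above can be applied at collection time by stamping each event with the layer it belongs to, so dashboards can facet on a single attribute. A minimal sketch; the event types are the ones emitted by the monitoring script later in this section, while the mapping and the `monitoring_layer` attribute name are illustrative, not a New Relic API:

```python
# Map the event types emitted later in this chapter to the monitoring
# layers defined above. Network and volume events are treated as part of
# the container layer; image events feed the security layer's scanning.
LAYER_BY_EVENT_TYPE = {
    'DockerEngineMetrics': 'Layer_1_Infrastructure',
    'DockerContainerMetrics': 'Layer_2_Container',
    'DockerNetworkMetrics': 'Layer_2_Container',
    'DockerVolumeMetrics': 'Layer_2_Container',
    'DockerImageMetrics': 'Layer_4_Security',
}

def tag_with_layer(event: dict) -> dict:
    """Return a copy of the event annotated with its monitoring layer."""
    tagged = dict(event)
    tagged['monitoring_layer'] = LAYER_BY_EVENT_TYPE.get(
        event.get('eventType'), 'unclassified')
    return tagged
```

Events whose type is not in the table fall back to `unclassified`, which makes gaps in the mapping visible in the resulting data instead of silently dropping them.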
⚙️ An Optimized Docker Compose Configuration

yaml
# docker-compose.enterprise-monitoring.yml
# Enterprise-grade Docker monitoring environment

version: '3.8'

services:
  # Main application
  webapp:
    build:
      context: .
      dockerfile: Dockerfile.production
    environment:
      # New Relic APM settings
      - NEW_RELIC_LICENSE_KEY=${NEW_RELIC_LICENSE_KEY}
      - NEW_RELIC_APP_NAME=enterprise-webapp
      - NEW_RELIC_DISTRIBUTED_TRACING_ENABLED=true
      - NEW_RELIC_LOG_LEVEL=info
      - NEW_RELIC_APPLICATION_LOGGING_ENABLED=true
      - NEW_RELIC_APPLICATION_LOGGING_FORWARDING_ENABLED=true
      
      # Application settings
      - ENVIRONMENT=production
      - DATABASE_URL=postgresql://user:pass@db:5432/webapp
      - REDIS_URL=redis://cache:6379
      - ELASTICSEARCH_URL=http://search:9200
      
      # Security settings
      - ENABLE_SECURITY_HEADERS=true
      - CSRF_PROTECTION=true
      - RATE_LIMITING=true
      
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
      
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 2G
        reservations:
          cpus: '0.5'
          memory: 512M
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
        max-file: "5"
        labels: "environment=production,service=webapp"
        
    labels:
      # New Relic monitoring labels
      - "newrelic.monitor=true"
      - "newrelic.service=webapp"
      - "newrelic.environment=production"
      - "newrelic.tier=frontend"
      
      # Security labels
      - "security.scan=enabled"
      - "compliance.pci_dss=required"
      - "backup.policy=daily"
      
    networks:
      - frontend
      - backend
    
    depends_on:
      db:
        condition: service_healthy
      cache:
        condition: service_healthy

  # PostgreSQL database
  db:
    image: postgres:15-alpine
    environment:
      - POSTGRES_DB=webapp
      - POSTGRES_USER=webapp_user
      - POSTGRES_PASSWORD_FILE=/run/secrets/db_password
      - POSTGRES_INITDB_ARGS=--auth-host=scram-sha-256
      
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./monitoring/postgres-init.sql:/docker-entrypoint-initdb.d/monitoring.sql
      - ./config/postgresql.conf:/etc/postgresql/postgresql.conf
      
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U webapp_user -d webapp"]
      interval: 30s
      timeout: 10s
      retries: 3
      
    deploy:
      resources:
        limits:
          cpus: '1.0'
          memory: 1G
        reservations:
          cpus: '0.25'
          memory: 256M
          
    logging:
      driver: "json-file"
      options:
        max-size: "50m"
        max-file: "3"
        labels: "service=database"
        
    labels:
      - "newrelic.monitor=true"
      - "newrelic.service=postgresql"
      - "newrelic.environment=production"
      - "newrelic.tier=database"
      - "backup.schedule=0 2 * * *"  # daily backup at 02:00
      
    secrets:
      - db_password
    networks:
      - backend

  # Redis cache
  cache:
    image: redis:7-alpine
    command: >
      redis-server 
      --requirepass ${REDIS_PASSWORD}
      --appendonly yes
      --appendfsync everysec
      --maxmemory 512mb
      --maxmemory-policy allkeys-lru
      
    volumes:
      - redis_data:/data
      - ./config/redis.conf:/usr/local/etc/redis/redis.conf
      
    healthcheck:
      test: ["CMD", "redis-cli", "-a", "${REDIS_PASSWORD}", "ping"]
      interval: 30s
      timeout: 10s
      retries: 3
      
    deploy:
      resources:
        limits:
          cpus: '0.5'
          memory: 512M
        reservations:
          cpus: '0.1'
          memory: 64M
          
    labels:
      - "newrelic.monitor=true"
      - "newrelic.service=redis"
      - "newrelic.environment=production"
      - "newrelic.tier=cache"
      
    networks:
      - backend

  # Elasticsearch (search and logging)
  search:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    environment:
      - discovery.type=single-node
      - ES_JAVA_OPTS=-Xms1g -Xmx1g
      - xpack.security.enabled=false
      - xpack.monitoring.collection.enabled=true
      
    volumes:
      - elasticsearch_data:/usr/share/elasticsearch/data
      - ./config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml
      
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:9200/_cluster/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5
      
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 2G
        reservations:
          cpus: '0.5'
          memory: 1G
          
    labels:
      - "newrelic.monitor=true"
      - "newrelic.service=elasticsearch"
      - "newrelic.environment=production"
      - "newrelic.tier=search"
      
    networks:
      - backend

  # New Relic Infrastructure Agent
  newrelic-infra:
    image: newrelic/infrastructure:latest
    cap_add:
      - SYS_PTRACE
    network_mode: host
    pid: host
    privileged: true
    
    environment:
      - NRIA_LICENSE_KEY=${NEW_RELIC_LICENSE_KEY}
      - NRIA_VERBOSE=1
      - NRIA_DISPLAY_NAME=docker-host-${ENVIRONMENT}
      - NRIA_CUSTOM_ATTRIBUTES={"environment":"${ENVIRONMENT}","deployment":"docker","monitoring_level":"comprehensive"}
      - NRIA_ENABLE_DOCKER=true
      
    volumes:
      - "/:/host:ro"
      - "/var/run/docker.sock:/var/run/docker.sock:ro"
      - "/sys:/host/sys:ro"
      - "/proc:/host/proc:ro"
      - "/dev:/host/dev:ro"
      - "./config/newrelic-infra.yml:/etc/newrelic-infra.yml:ro"
      
    restart: unless-stopped
    
    labels:
      - "newrelic.monitor=false"  # the agent itself is excluded from monitoring
      - "newrelic.service=infrastructure-agent"
      
    depends_on:
      - webapp
      - db
      - cache

  # Log forwarding (Fluent Bit)
  log-forwarder:
    image: fluent/fluent-bit:2.2
    
    volumes:
      - "./config/fluent-bit.conf:/fluent-bit/etc/fluent-bit.conf:ro"
      - "/var/lib/docker/containers:/var/lib/docker/containers:ro"
      - "/var/log:/var/log:ro"
      
    environment:
      - NEW_RELIC_LICENSE_KEY=${NEW_RELIC_LICENSE_KEY}
      - ENVIRONMENT=${ENVIRONMENT}
      
    depends_on:
      - webapp
      
    labels:
      - "newrelic.monitor=true"
      - "newrelic.service=log-forwarder"
      - "newrelic.environment=production"
      
    networks:
      - backend

  # Metrics collection (Prometheus-compatible)
  metrics-exporter:
    image: prom/node-exporter:latest
    
    volumes:
      - "/proc:/host/proc:ro"
      - "/sys:/host/sys:ro"
      - "/:/rootfs:ro"
      
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc)($$|/)'
      
    labels:
      - "prometheus.io/scrape=true"
      - "prometheus.io/port=9100"
      - "newrelic.monitor=true"
      
    networks:
      - monitoring

volumes:
  postgres_data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /data/postgres
      
  redis_data:
    driver: local
    driver_opts:
      type: none  
      o: bind
      device: /data/redis
      
  elasticsearch_data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /data/elasticsearch

networks:
  frontend:
    driver: bridge
    ipam:
      config:
        - subnet: 172.20.1.0/24
        
  backend:
    driver: bridge
    ipam:
      config:
        - subnet: 172.20.2.0/24
        
  monitoring:
    driver: bridge
    ipam:
      config:
        - subnet: 172.20.3.0/24

secrets:
  db_password:
    file: ./secrets/db_password.txt

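The compose file above interpolates several variables (`NEW_RELIC_LICENSE_KEY`, `REDIS_PASSWORD`, `ENVIRONMENT`); if any of them is unset, containers start with blank credentials. A minimal pre-flight sketch to run before `docker compose up` (the variable list is taken from the file above; the helper name is illustrative):

```python
import os
from typing import Dict, List, Optional

# Variables interpolated by docker-compose.enterprise-monitoring.yml above.
REQUIRED_COMPOSE_VARS = ['NEW_RELIC_LICENSE_KEY', 'REDIS_PASSWORD', 'ENVIRONMENT']

def missing_compose_vars(env: Optional[Dict[str, str]] = None) -> List[str]:
    """Return the required variables that are unset or empty."""
    source = os.environ if env is None else env
    return [name for name in REQUIRED_COMPOSE_VARS if not source.get(name)]

if __name__ == '__main__':
    missing = missing_compose_vars()
    if missing:
        raise SystemExit(f"Set these variables before 'docker compose up': {missing}")
```

Failing fast here is cheaper than debugging a stack where Redis rejects every connection because `REDIS_PASSWORD` expanded to an empty string.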
📊 Docker Monitoring Script

python
#!/usr/bin/env python3
"""
Docker enterprise monitoring script
Comprehensive monitoring of containers, images, networks, and volumes
"""

import docker
import os
import time
from datetime import datetime

import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

class DockerEnterpriseMonitor:
    def __init__(self, newrelic_insert_key, newrelic_account_id):
        self.newrelic_insert_key = newrelic_insert_key
        self.newrelic_account_id = newrelic_account_id
        self.insights_api = f"https://insights-collector.newrelic.com/v1/accounts/{newrelic_account_id}/events"
        
        # Initialize the Docker client
        try:
            self.docker_client = docker.from_env()
            self.docker_client.ping()
            print("✅ Connected to Docker daemon")
        except Exception as e:
            print(f"❌ Failed to connect to Docker: {e}")
            raise
        
        self.hostname = os.uname().nodename
        self.environment = os.environ.get('ENVIRONMENT', 'production')
    
    def collect_docker_info(self):
        """Collect Docker engine information"""
        try:
            info = self.docker_client.info()
            version = self.docker_client.version()
            
            docker_info = {
                'eventType': 'DockerEngineMetrics',
                'timestamp': int(time.time()),
                'hostname': self.hostname,
                'environment': self.environment,
                'docker.version': version['Version'],
                'docker.api_version': version['ApiVersion'],
                'docker.go_version': version['GoVersion'],
                'docker.git_commit': version['GitCommit'],
                'docker.built': version['Built'],
                'docker.os': version['Os'],
                'docker.arch': version['Arch'],
                'docker.kernel_version': version['KernelVersion'],
                'containers.total': info['Containers'],
                'containers.running': info['ContainersRunning'],
                'containers.paused': info['ContainersPaused'],
                'containers.stopped': info['ContainersStopped'],
                'images.total': info['Images'],
                'docker.storage_driver': info['Driver'],
                'docker.logging_driver': info['LoggingDriver'],
                'docker.cgroup_driver': info.get('CgroupDriver', 'unknown'),
                'docker.memory_limit': info.get('MemoryLimit', False),
                'docker.swap_limit': info.get('SwapLimit', False),
                'docker.cpu_cfs_period': info.get('CpuCfsPeriod', False),
                'docker.cpu_cfs_quota': info.get('CpuCfsQuota', False),
                'docker.kernel_memory': info.get('KernelMemory', False),
                'docker.oom_kill_disable': info.get('OomKillDisable', False),
                'security.apparmor_profile': info.get('SecurityOptions', []),
                'registry.insecure_registries': len(info.get('InsecureRegistries', [])),
                'registry.index_configs': len(info.get('IndexConfigs', {}))
            }
            
            # Host system resource information (info['SystemStatus'] was
            # removed from the Docker API, so read the fields directly)
            docker_info.update({
                'system.total_memory': info.get('MemTotal', 0),
                'system.ncpu': info.get('NCPU', 0),
                'system.name': info.get('Name', 'unknown'),
                'system.server_version': info.get('ServerVersion', 'unknown')
            })
            
            return docker_info
            
        except Exception as e:
            print(f"❌ Failed to collect Docker info: {e}")
            return None
    
    def collect_container_metrics(self):
        """Collect container metrics"""
        try:
            containers = self.docker_client.containers.list(all=True)
            container_metrics = []
            
            for container in containers:
                try:
                    # Basic information
                    container_info = {
                        'eventType': 'DockerContainerMetrics',
                        'timestamp': int(time.time()),
                        'hostname': self.hostname,
                        'environment': self.environment,
                        'container.id': container.id[:12],
                        'container.name': container.name,
                        'container.status': container.status,
                        'container.image': container.image.tags[0] if container.image.tags else 'none',
                        'container.image_id': container.image.id[:12],
                        'container.command': ' '.join(container.attrs['Config']['Cmd'] or []),
                        'container.created': container.attrs['Created'],
                        'container.platform': container.attrs.get('Platform', 'unknown'),
                        'container.architecture': container.attrs.get('Architecture', 'unknown')
                    }
                    
                    # Label information
                    labels = container.labels
                    if labels:
                        container_info.update({
                            f'label.{k}': v for k, v in labels.items() 
                            if not k.startswith('com.docker') and len(k) < 50
                        })
                    
                    # Network information (names joined into one string,
                    # since New Relic event attributes must be scalar values)
                    networks = container.attrs.get('NetworkSettings', {}).get('Networks', {})
                    container_info['container.networks'] = ','.join(networks.keys())
                    container_info['container.network_count'] = len(networks)
                    
                    # Mount information
                    mounts = container.attrs.get('Mounts', [])
                    container_info['container.mounts'] = len(mounts)
                    
                    # Number of environment variables
                    env_vars = container.attrs.get('Config', {}).get('Env', [])
                    container_info['container.env_vars_count'] = len(env_vars)
                    
                    # Statistics for running containers
                    if container.status == 'running':
                        try:
                            stats = container.stats(stream=False)
                            
                            # CPU statistics
                            cpu_stats = stats['cpu_stats']
                            precpu_stats = stats['precpu_stats']
                            
                            # CPU usage calculation (percpu_usage is absent
                            # on cgroup v2 hosts, so prefer online_cpus)
                            cpu_delta = cpu_stats['cpu_usage']['total_usage'] - precpu_stats['cpu_usage']['total_usage']
                            system_delta = cpu_stats.get('system_cpu_usage', 0) - precpu_stats.get('system_cpu_usage', 0)
                            online_cpus = cpu_stats.get('online_cpus') or len(cpu_stats['cpu_usage'].get('percpu_usage', [])) or 1
                            
                            if system_delta > 0:
                                cpu_percent = (cpu_delta / system_delta) * online_cpus * 100
                                container_info['container.cpu_percent'] = round(cpu_percent, 2)
                            
                            # Memory statistics
                            memory_stats = stats['memory_stats']
                            memory_usage = memory_stats.get('usage', 0)
                            memory_limit = memory_stats.get('limit', 0)
                            
                            container_info['container.memory_usage_bytes'] = memory_usage
                            container_info['container.memory_limit_bytes'] = memory_limit
                            
                            if memory_limit > 0:
                                memory_percent = (memory_usage / memory_limit) * 100
                                container_info['container.memory_percent'] = round(memory_percent, 2)
                            
                            # Detailed memory statistics
                            mem_detail = memory_stats.get('stats', {})
                            container_info.update({
                                'container.memory_cache_bytes': mem_detail.get('cache', 0),
                                'container.memory_rss_bytes': mem_detail.get('rss', 0),
                                'container.memory_swap_bytes': mem_detail.get('swap', 0)
                            })
                            
                            # Network statistics (not every platform reports
                            # every counter, so default missing ones to 0)
                            networks_stats = stats.get('networks', {})
                            total_rx_bytes = sum(net.get('rx_bytes', 0) for net in networks_stats.values())
                            total_tx_bytes = sum(net.get('tx_bytes', 0) for net in networks_stats.values())
                            total_rx_errors = sum(net.get('rx_errors', 0) for net in networks_stats.values())
                            total_tx_errors = sum(net.get('tx_errors', 0) for net in networks_stats.values())
                            
                            container_info.update({
                                'container.network_rx_bytes': total_rx_bytes,
                                'container.network_tx_bytes': total_tx_bytes,
                                'container.network_rx_errors': total_rx_errors,
                                'container.network_tx_errors': total_tx_errors
                            })
                            
                            # Block I/O statistics
                            blkio_stats = stats.get('blkio_stats', {})
                            read_bytes = sum(op['value'] for op in blkio_stats.get('io_service_bytes_recursive', []) if op['op'] == 'Read')
                            write_bytes = sum(op['value'] for op in blkio_stats.get('io_service_bytes_recursive', []) if op['op'] == 'Write')
                            
                            container_info.update({
                                'container.blkio_read_bytes': read_bytes,
                                'container.blkio_write_bytes': write_bytes
                            })
                            
                            # Process count
                            pids_stats = stats.get('pids_stats', {})
                            container_info['container.pids_count'] = pids_stats.get('current', 0)
                            
                        except Exception as e:
                            print(f"⚠️  Failed to get stats for container {container.name}: {e}")
                    
                    # Health check information
                    health = container.attrs.get('State', {}).get('Health')
                    if health:
                        container_info.update({
                            'container.health_status': health.get('Status', 'unknown'),
                            'container.health_failing_streak': health.get('FailingStreak', 0),
                            'container.health_log_length': len(health.get('Log', []))
                        })
                    
                    # Restart information
                    restart_count = container.attrs.get('RestartCount', 0)
                    container_info['container.restart_count'] = restart_count
                    
                    # Time elapsed since creation (Docker reports nanosecond
                    # precision; drop the fractional part before parsing)
                    created_raw = container.attrs['Created']
                    created_time = datetime.fromisoformat(created_raw.split('.')[0].replace('Z', '') + '+00:00')
                    uptime_seconds = (datetime.now().astimezone() - created_time).total_seconds()
                    container_info['container.uptime_seconds'] = int(uptime_seconds)
                    
                    container_metrics.append(container_info)
                    
                except Exception as e:
                    print(f"⚠️  Failed to process container {container.name}: {e}")
                    continue
            
            return container_metrics
            
        except Exception as e:
            print(f"❌ Failed to collect container metrics: {e}")
            return []
    
    def collect_image_metrics(self):
        """Collect image metrics"""
        try:
            images = self.docker_client.images.list(all=True)
            image_metrics = []
            
            for image in images:
                try:
                    # Basic information
                    image_info = {
                        'eventType': 'DockerImageMetrics',
                        'timestamp': int(time.time()),
                        'hostname': self.hostname,
                        'environment': self.environment,
                        'image.id': image.id[:12],
                        'image.short_id': image.short_id,
                        'image.tags': ','.join(image.tags),
                        'image.size_bytes': image.attrs['Size'],
                        'image.virtual_size_bytes': image.attrs.get('VirtualSize', image.attrs['Size']),
                        'image.created': image.attrs['Created'],
                        'image.architecture': image.attrs['Architecture'],
                        'image.os': image.attrs['Os'],
                        'image.variant': image.attrs.get('Variant', ''),
                        'image.docker_version': image.attrs.get('DockerVersion', ''),
                        'image.author': image.attrs.get('Author', ''),
                        'image.comment': image.attrs.get('Comment', '')
                    }
                    
                    # Image configuration (lists joined into strings, since
                    # New Relic event attributes must be scalar values)
                    config = image.attrs.get('Config', {})
                    if config:
                        image_info.update({
                            'image.exposed_ports': ','.join((config.get('ExposedPorts') or {}).keys()),
                            'image.env_vars_count': len(config.get('Env') or []),
                            'image.cmd': ' '.join(config.get('Cmd') or []),
                            'image.entrypoint': ' '.join(config.get('Entrypoint') or []),
                            'image.working_dir': config.get('WorkingDir', ''),
                            'image.user': config.get('User', '')
                        })
                    
                    # Layer information
                    history = image.history()
                    image_info['image.layers_count'] = len(history)
                    
                    # Root filesystem information
                    rootfs = image.attrs.get('RootFS', {})
                    if rootfs:
                        image_info.update({
                            'image.rootfs_type': rootfs.get('Type', ''),
                            'image.rootfs_layers_count': len(rootfs.get('Layers', []))
                        })
                    
                    # Usage statistics (number of containers using this image)
                    containers_using = self.docker_client.containers.list(
                        all=True, 
                        filters={'ancestor': image.id}
                    )
                    image_info['image.containers_using'] = len(containers_using)
                    
                    image_metrics.append(image_info)
                    
                except Exception as e:
                    print(f"⚠️  Failed to process image {image.id[:12]}: {e}")
                    continue
            
            return image_metrics
            
        except Exception as e:
            print(f"❌ Failed to collect image metrics: {e}")
            return []
    
    def collect_network_metrics(self):
        """Collect Docker network metrics"""
        try:
            networks = self.docker_client.networks.list()
            network_metrics = []
            
            for network in networks:
                network_info = {
                    'eventType': 'DockerNetworkMetrics',
                    'timestamp': int(time.time()),
                    'hostname': self.hostname,
                    'environment': self.environment,
                    'network.id': network.id[:12],
                    'network.name': network.name,
                    'network.driver': network.attrs['Driver'],
                    'network.scope': network.attrs['Scope'],
                    'network.attachable': network.attrs.get('Attachable', False),
                    'network.ingress': network.attrs.get('Ingress', False),
                    'network.ipam_driver': network.attrs.get('IPAM', {}).get('Driver', ''),
                    'network.internal': network.attrs.get('Internal', False),
                    'network.enable_ipv6': network.attrs.get('EnableIPv6', False),
                    'network.created': network.attrs['Created']
                }
                
                # Number of connected containers
                containers = network.attrs.get('Containers', {})
                network_info['network.connected_containers'] = len(containers)
                
                # IPAM configuration
                ipam_config = network.attrs.get('IPAM', {}).get('Config', [])
                if ipam_config:
                    network_info.update({
                        'network.subnets_count': len(ipam_config),
                        'network.subnet': ipam_config[0].get('Subnet', '') if ipam_config else '',
                        'network.gateway': ipam_config[0].get('Gateway', '') if ipam_config else ''
                    })
                
                # Options
                options = network.attrs.get('Options', {})
                if options:
                    network_info.update({
                        f'network.option.{k}': v for k, v in options.items() 
                        if len(k) < 50
                    })
                
                network_metrics.append(network_info)
            
            return network_metrics
            
        except Exception as e:
            print(f"❌ Failed to collect network metrics: {e}")
            return []
    
    def collect_volume_metrics(self):
        """Collect Docker volume metrics"""
        try:
            volumes = self.docker_client.volumes.list()
            volume_metrics = []
            
            for volume in volumes:
                volume_info = {
                    'eventType': 'DockerVolumeMetrics',
                    'timestamp': int(time.time()),
                    'hostname': self.hostname,
                    'environment': self.environment,
                    'volume.name': volume.name,
                    'volume.driver': volume.attrs['Driver'],
                    'volume.mountpoint': volume.attrs['Mountpoint'],
                    'volume.created': volume.attrs['CreatedAt'],
                    'volume.scope': volume.attrs.get('Scope', 'local')
                }
                
                # Volume options
                options = volume.attrs.get('Options')
                if options:
                    volume_info.update({
                        f'volume.option.{k}': v for k, v in options.items() 
                        if len(k) < 50
                    })
                
                # Labels
                labels = volume.attrs.get('Labels')
                if labels:
                    volume_info.update({
                        f'volume.label.{k}': v for k, v in labels.items() 
                        if len(k) < 50
                    })
                
                # Disk usage (where the mountpoint is visible to this process)
                try:
                    mountpoint = volume.attrs['Mountpoint']
                    if os.path.exists(mountpoint):
                        stat = os.statvfs(mountpoint)
                        total_bytes = stat.f_frsize * stat.f_blocks
                        free_bytes = stat.f_frsize * stat.f_bavail
                        used_bytes = total_bytes - free_bytes
                        
                        volume_info.update({
                            'volume.total_bytes': total_bytes,
                            'volume.used_bytes': used_bytes,
                            'volume.free_bytes': free_bytes,
                            'volume.usage_percent': round((used_bytes / total_bytes) * 100, 2) if total_bytes > 0 else 0
                        })
                except OSError:
                    # Ignore volumes whose mountpoint cannot be inspected
                    pass
                
                volume_metrics.append(volume_info)
            
            return volume_metrics
            
        except Exception as e:
            print(f"❌ Failed to collect volume metrics: {e}")
            return []
    
    def send_to_newrelic(self, metrics_data):
        """Send metrics to New Relic"""
        if not metrics_data:
            return
        
        try:
            headers = {
                'Content-Type': 'application/json',
                'X-Insert-Key': self.newrelic_insert_key
            }
            
            # Send in batches
            batch_size = 100
            for i in range(0, len(metrics_data), batch_size):
                batch = metrics_data[i:i+batch_size]
                
                response = requests.post(
                    self.insights_api,
                    headers=headers,
                    json=batch,
                    timeout=30
                )
                
                if response.status_code == 200:
                    print(f"✅ Sent {len(batch)} Docker metrics to New Relic")
                else:
                    print(f"❌ Failed to send batch: {response.status_code}")
                    
        except Exception as e:
            print(f"❌ Failed to send metrics to New Relic: {e}")
    
    def run_monitoring(self):
        """Main monitoring routine"""
        print("🐳 Starting Docker Enterprise Monitoring")
        print(f"📅 Timestamp: {datetime.now().isoformat()}")
        
        all_metrics = []
        
        # Collect metrics concurrently
        with ThreadPoolExecutor(max_workers=6) as executor:
            futures = {
                executor.submit(self.collect_docker_info): "Docker Engine",
                executor.submit(self.collect_container_metrics): "Containers",
                executor.submit(self.collect_image_metrics): "Images",
                executor.submit(self.collect_network_metrics): "Networks",
                executor.submit(self.collect_volume_metrics): "Volumes"
            }
            
            for future in as_completed(futures):
                metric_type = futures[future]
                try:
                    result = future.result()
                    if result:
                        if isinstance(result, list):
                            all_metrics.extend(result)
                            print(f"✅ Collected {len(result)} {metric_type} metrics")
                        else:
                            all_metrics.append(result)
                            print(f"✅ Collected {metric_type} metrics")
                except Exception as e:
                    print(f"❌ Failed to collect {metric_type} metrics: {e}")
        
        # Send to New Relic
        if all_metrics:
            print(f"📤 Sending {len(all_metrics)} total metrics to New Relic...")
            self.send_to_newrelic(all_metrics)
            print("🎉 Docker monitoring completed successfully")
        else:
            print("⚠️  No metrics collected")

# Main entry point
if __name__ == "__main__":
    # Read configuration from environment variables
    NEWRELIC_INSERT_KEY = os.environ.get('NEWRELIC_INSERT_KEY', '')
    NEWRELIC_ACCOUNT_ID = os.environ.get('NEWRELIC_ACCOUNT_ID', '')
    
    if not all([NEWRELIC_INSERT_KEY, NEWRELIC_ACCOUNT_ID]):
        print("❌ Required environment variables not set")
        print("Please set: NEWRELIC_INSERT_KEY, NEWRELIC_ACCOUNT_ID")
        exit(1)
    
    monitor = DockerEnterpriseMonitor(NEWRELIC_INSERT_KEY, NEWRELIC_ACCOUNT_ID)
    
    try:
        monitor.run_monitoring()
    except KeyboardInterrupt:
        print("\n⏹️  Monitoring stopped by user")
    except Exception as e:
        print(f"❌ Monitoring failed: {e}")
        exit(1)
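
上記スクリプトの `send_to_newrelic` が行っているバッチ分割の考え方だけを取り出した最小スケッチです。Events APIには1リクエストあたりの件数・ペイロードサイズの上限があるため、大量のメトリクスは分割して送信します(バッチサイズ500は本書での仮の値で、実際の上限に合わせて調整してください)。

```python
def chunked(metrics, batch_size=500):
    """メトリクスのリストをbatch_size件ずつのバッチに分割するジェネレータ"""
    for i in range(0, len(metrics), batch_size):
        yield metrics[i:i + batch_size]

# 使用イメージ: 1,200件のメトリクスは 500 / 500 / 200 件の3バッチになる
batches = list(chunked([{"eventType": "DockerContainerMetrics"}] * 1200))
print([len(b) for b in batches])  # → [500, 500, 200]
```

バッチごとにHTTP POSTを分けることで、1回の送信失敗が全メトリクスの欠損につながるのを防げます。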

☸️ Kubernetes クラスター監視

🎯 エンタープライズ Kubernetes 監視戦略

🏗️ 階層化監視アーキテクチャ

yaml
Kubernetes_Monitoring_Layers:
  Layer_1_Cluster:
    - Cluster Health & Status
    - Node Resource Utilization  
    - Control Plane Components
    - Cluster-wide Metrics
    
  Layer_2_Namespace:
    - Resource Quotas & Limits
    - Network Policies
    - Service Discovery
    - Namespace-level Metrics
    
  Layer_3_Workload:
    - Deployment Status
    - ReplicaSet Health
    - Pod Lifecycle Events
    - Resource Requests/Limits
    
  Layer_4_Application:
    - Container Performance
    - Application Metrics
    - Custom Metrics
    - Business KPIs
    
  Layer_5_Security:
    - RBAC Violations
    - Pod Security Policies
    - Network Traffic Analysis
    - Compliance Monitoring
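
上記の5レイヤー構成をNew Relic側で活かすには、収集イベントにレイヤー名を属性として付与しておくと、NRQLで層別に絞り込めます。以下はその考え方を示す仮のヘルパーで、属性名 `monitoring.layer` は本書の例として仮定した名前です。

```python
# 収集イベントに監視レイヤー属性を付与するヘルパー(属性名・レイヤー名は仮のもの)
LAYERS = ("cluster", "namespace", "workload", "application", "security")

def tag_with_layer(event, layer):
    """イベント辞書のコピーにmonitoring.layer属性を追加して返す"""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    tagged = dict(event)  # 元のイベントは変更しない
    tagged["monitoring.layer"] = layer
    return tagged

sample = tag_with_layer({"eventType": "KubernetesPodMetrics"}, "workload")
print(sample["monitoring.layer"])  # → workload
```

こうしておくと、例えばWorkload層のイベントだけをダッシュボードで集計する、といった層別の分析が属性1つで実現できます。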

⚙️ Helm Chart による完全デプロイメント

yaml
# values-enterprise.yaml
# New Relic Kubernetes統合 エンタープライズ設定

# Global設定
global:
  licenseKey: "YOUR_ENTERPRISE_LICENSE_KEY"
  cluster: "production-k8s-cluster"
  
  # エンタープライズ属性
  customAttributes:
    # ビジネス情報
    environment: production
    business_unit: platform
    cost_center: infrastructure
    data_classification: confidential
    
    # 地理・物理情報
    region: ap-northeast-1
    availability_zone: ap-northeast-1a
    datacenter: tokyo-dc1
    
    # 運用情報
    team: platform_engineering
    oncall_team: sre
    escalation_policy: critical_infrastructure
    maintenance_window: "02:00-04:00_JST"
    
    # コンプライアンス
    compliance_framework: pci_dss
    security_level: high
    audit_required: true
    
  # プロキシ設定(エンタープライズネットワーク)
  proxy:
    http: "http://proxy.company.com:8080"
    https: "https://proxy.company.com:8080" 
    noProxy: "localhost,127.0.0.1,.company.com,.cluster.local"
  
  # FedRAMP対応(米国政府向けコンプライアンス)
  fedramp:
    enabled: false
  
  nrStaging:
    enabled: false

# Infrastructure監視
newrelic-infrastructure:
  enabled: true
  privileged: true
  
  # リソース制限(本番環境対応)
  resources:
    limits:
      memory: "500Mi"
      cpu: "500m" 
    requests:
      memory: "150Mi"
      cpu: "100m"
  
  # ノード選択(監視専用ノード等)
  nodeSelector: {}
  
  # テイント許容
  tolerations:
    - operator: "Exists"
      effect: "NoSchedule"
    - operator: "Exists"
      effect: "NoExecute"
    - operator: "Exists"
      effect: "PreferNoSchedule"
  
  # Affinity設定(可用性)
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: node-role.kubernetes.io/monitoring
            operator: In
            values: ["true"]
  
  config:
    # 詳細ログ(トラブルシューティング用)
    verbose: 1
    log_file: "/var/log/newrelic-infra/newrelic-infra.log"
    log_format: "json"
    
    # カスタム属性
    custom_attributes:
      monitoring_tier: infrastructure
      data_retention: "13_months"
      sla_tier: gold
      
    # プロセス監視
    enable_process_metrics: true
    
    # 高度な収集設定(サンプリング間隔は秒単位の整数で指定)
    metrics_system_sample_rate: 15
    metrics_process_sample_rate: 20
    metrics_network_sample_rate: 10
    metrics_storage_sample_rate: 20
    
    # セキュリティ
    strip_command_line: true
    passthrough_environment:
      - KUBERNETES_SERVICE_HOST
      - KUBERNETES_SERVICE_PORT
      - NEW_RELIC_LICENSE_KEY

# Kubernetes状態監視
kube-state-metrics:
  enabled: true
  
  # リソース設定
  resources:
    limits:
      memory: "200Mi"
      cpu: "200m"
    requests:
      memory: "100Mi"
      cpu: "50m"
  
  # メトリクスカスタマイズ
  metricLabelsAllowlist:
    - pods=[*]
    - deployments=[*]
    - services=[*]
    - ingresses=[*]
    - configmaps=[*]
    - secrets=[*]
  
  # セキュリティコンテキスト  
  securityContext:
    runAsNonRoot: true
    runAsUser: 65534
    runAsGroup: 65534
    fsGroup: 65534

# Prometheus統合
nri-prometheus:
  enabled: true
  
  # リソース制限
  resources:
    limits:
      memory: "300Mi"
      cpu: "200m"
    requests:
      memory: "100Mi"
      cpu: "50m"
  
  config:
    # メトリクス変換(除外ルールはプレフィックスで指定)
    transformations:
      - description: "内部・冗長メトリクスの除外"
        ignore_metrics:
          - prefixes:
              - "go_"
              - "prometheus_"
              - "kube_pod_container_"
      # メトリクス名や属性を変更したい場合は、rename_attributes /
      # copy_attributes ルールを同じtransformations配下に追加する
    
    # スクレイプ設定
    scrape_configs:
      # Pod監視(annotations自動発見)
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
                - default
                - kube-system
                - monitoring
                - application
        
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
          
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)
          
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: kubernetes_pod_name
            
          - source_labels: [__meta_kubernetes_pod_container_name]
            action: replace
            target_label: kubernetes_container_name
      
      # Service監視
      - job_name: 'kubernetes-service-endpoints'
        kubernetes_sd_configs:
          - role: endpoints
        
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
            action: replace
            target_label: __scheme__
            regex: (https?)
          
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          
          - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
            action: replace
            target_label: __address__
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
          
          - action: labelmap
            regex: __meta_kubernetes_service_label_(.+)
          
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: kubernetes_name
      
      # Ingress Controller監視
      - job_name: 'kubernetes-ingresses'
        kubernetes_sd_configs:
          - role: ingress
        
        relabel_configs:
          - source_labels: [__meta_kubernetes_ingress_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          
          - source_labels: [__meta_kubernetes_ingress_annotation_prometheus_io_scheme]
            action: replace
            target_label: __scheme__
            regex: (https?)

# ログ収集
newrelic-logging:
  enabled: true
  
  # リソース制限
  resources:
    limits:
      memory: "500Mi"
      cpu: "500m"
    requests:
      memory: "128Mi"
      cpu: "50m"
  
  # Fluent Bit設定
  fluentBit:
    image:
      repository: newrelic/newrelic-fluentbit-output
      tag: "1.19.2"
    
    config:
      # 入力設定
      inputs: |
        [INPUT]
            Name tail
            Path /var/log/containers/*.log
            Parser cri
            Tag kube.*
            Mem_Buf_Limit 50MB
            Skip_Long_Lines On
            Skip_Empty_Lines On
            Refresh_Interval 10
            Rotate_Wait 30
            storage.type filesystem
            storage.pause_on_chunks_overlimit off
            
        [INPUT]
            Name systemd
            Tag host.*
            Systemd_Filter _SYSTEMD_UNIT=kubelet.service
            Systemd_Filter _SYSTEMD_UNIT=docker.service
            Systemd_Filter _SYSTEMD_UNIT=containerd.service
            Max_Entries 1000
            Read_From_Tail On
      
      # フィルター設定
      filters: |
        [FILTER]
            Name kubernetes
            Match kube.*
            Kube_URL https://kubernetes.default.svc:443
            Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
            Kube_Tag_Prefix kube.var.log.containers.
            Merge_Log On
            Merge_Log_Key log_processed
            K8S-Logging.Parser On
            K8S-Logging.Exclude Off
            Annotations Off
            Labels On
            
        [FILTER]
            Name modify
            Match *
            Add cluster_name production-k8s-cluster
            Add environment production
            Add log_type kubernetes
            
        [FILTER]
            Name grep
            Match kube.*
            Exclude log ^\s*$
            
        [FILTER]
            Name record_modifier
            Match *
            Record hostname ${HOSTNAME}
            Record datacenter tokyo-dc1
      
      # 出力設定
      outputs: |
        [OUTPUT]
            Name newrelic
            Match *
            licenseKey ${LICENSE_KEY}
            endpoint https://log-api.newrelic.com/log/v1
            maxBufferSize 256000
            maxRecords 1024
            
        [OUTPUT]
            Name stdout
            Match *
            Format json_lines
    
    # ボリューム設定
    volumeMounts:
      - name: varlog
        mountPath: /var/log
        readOnly: true
      - name: varlibdockercontainers
        mountPath: /var/lib/docker/containers
        readOnly: true
        
    volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers

# Kubernetesイベント監視
kubeEvents:
  enabled: true
  
  resources:
    limits:
      memory: "128Mi"
      cpu: "100m"
    requests:
      memory: "64Mi"
      cpu: "50m"
  
  config:
    clusterName: production-k8s-cluster
    
    sinks:
      - name: newRelicInfra
        config:
          agentEndpoint: http://localhost:8001/v1/data
          clusterName: production-k8s-cluster
          
      - name: stdout
        config: {}

# Pixie統合(高度なオブザーバビリティ)
newrelic-pixie:
  enabled: true
  
  # Pixie APIキー
  apikey: "YOUR_PIXIE_API_KEY"
  
  # デプロイ設定
  deployKey: "YOUR_PIXIE_DEPLOY_KEY"
  clusterName: production-k8s-cluster
  
  # Pixie設定
  pixieChart:
    enabled: true
    deployKey: "YOUR_PIXIE_DEPLOY_KEY"
    clusterName: production-k8s-cluster
    
    # PEM(Pixie Edge Module)設定
    pemMemoryLimit: "2Gi"
    pemMemoryRequest: "1Gi"

# メタデータインジェクション
newrelic-metadata-injection:
  enabled: true
  
  # 対象ラベルセレクタ
  labelSelector:
    environment: "production"
    "app.kubernetes.io/managed-by": "Helm"
  
  # 注入設定
  injectMetadata: true
  
  # New Relic環境変数注入
  env:
    - name: NEW_RELIC_METADATA_KUBERNETES_CLUSTER_NAME
      value: "production-k8s-cluster"
    - name: NEW_RELIC_METADATA_KUBERNETES_NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
    - name: NEW_RELIC_METADATA_KUBERNETES_NAMESPACE_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.namespace
    - name: NEW_RELIC_METADATA_KUBERNETES_POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: NEW_RELIC_METADATA_KUBERNETES_CONTAINER_NAME
      value: ""  # コンテナ名は実行時に設定

# セキュリティ設定
security:
  # Pod Security Policy
  podSecurityPolicy:
    enabled: true
    
  # Network Policy
  networkPolicy:
    enabled: true
    ingress:
      - from:
        - namespaceSelector:
            matchLabels:
              name: monitoring
        ports:
        - protocol: TCP
          port: 8080
        - protocol: TCP
          port: 9090

# 監視対象ネームスペース
namespacesToMonitor:
  - default
  - kube-system
  - kube-public
  - monitoring
  - application
  - database
  - ingress-nginx
  - cert-manager

# カスタムリソース監視
customResources:
  enabled: true
  config: |
    customResourceRules:
      - groupVersionKind:
          group: "argoproj.io"
          version: "v1alpha1"
          kind: "Application"
        metricNamePrefix: "argocd"
      - groupVersionKind:
          group: "networking.istio.io"
          version: "v1beta1"
          kind: "VirtualService"
        metricNamePrefix: "istio"
      - groupVersionKind:
          group: "security.istio.io"  
          version: "v1beta1"
          kind: "PeerAuthentication"
        metricNamePrefix: "istio_security"
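
上記 `relabel_configs` の `__address__` 書き換えルール(ホスト部とポートアノテーションの結合)は、Pythonの正規表現で挙動を確認できます。以下は動作イメージを示すだけのスケッチで、Prometheusのrelabelは正規表現を全体一致で評価するため `fullmatch` を使っています。

```python
import re

# relabel_configsと同じ規則:
#   ソース値は __address__ と prometheus.io/port が ";" で連結された文字列
#   ([^:]+)(?::\d+)?;(\d+) でホスト部とアノテーションのポートを取り出し、
#   $1:$2 (Pythonでは group(1):group(2)) に置き換える
pattern = re.compile(r"([^:]+)(?::\d+)?;(\d+)")

def rewrite_address(joined):
    """'host[:port];annotation_port' を 'host:annotation_port' に変換する"""
    m = pattern.fullmatch(joined)
    if not m:
        return joined  # マッチしない場合は元の値を維持
    return f"{m.group(1)}:{m.group(2)}"

print(rewrite_address("10.0.0.5:8080;9100"))  # → 10.0.0.5:9100
print(rewrite_address("10.0.0.5;9100"))       # → 10.0.0.5:9100
```

元のアドレスにポートが付いていてもいなくても、アノテーションで指定したポートに統一されることが分かります。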

🔧 Kubernetes 監視スクリプト

python
#!/usr/bin/env python3
"""
Kubernetes エンタープライズ監視スクリプト
クラスター・ノード・Pod・サービス・リソースの包括監視
"""

from kubernetes import client, config
import requests
import json
import time
from datetime import datetime, timezone
import os
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed

class KubernetesEnterpriseMonitor:
    def __init__(self, newrelic_insert_key, newrelic_account_id, kubeconfig_path=None):
        self.newrelic_insert_key = newrelic_insert_key
        self.newrelic_account_id = newrelic_account_id
        self.insights_api = f"https://insights-collector.newrelic.com/v1/accounts/{newrelic_account_id}/events"
        
        # Kubernetes設定読み込み
        try:
            if kubeconfig_path:
                config.load_kube_config(config_file=kubeconfig_path)
            else:
                # クラスター内実行時は自動設定
                try:
                    config.load_incluster_config()
                    print("✅ Loaded in-cluster Kubernetes config")
                except config.ConfigException:
                    config.load_kube_config()
                    print("✅ Loaded local Kubernetes config")
        except Exception as e:
            print(f"❌ Failed to load Kubernetes config: {e}")
            raise
        
        # APIクライアント初期化
        self.core_v1 = client.CoreV1Api()
        self.apps_v1 = client.AppsV1Api()
        self.networking_v1 = client.NetworkingV1Api()
        self.storage_v1 = client.StorageV1Api()
        self.metrics_v1beta1 = client.CustomObjectsApi()
        
        self.cluster_name = os.environ.get('CLUSTER_NAME', 'unknown-cluster')
        self.environment = os.environ.get('ENVIRONMENT', 'production')
    
    def collect_cluster_info(self):
        """クラスター基本情報収集"""
        try:
            # Kubernetes バージョン
            version_info = client.VersionApi().get_code()
            
            # ノードリスト
            nodes = self.core_v1.list_node()
            
            # ネームスペースリスト
            namespaces = self.core_v1.list_namespace()
            
            # サービスアカウント数
            service_accounts = self.core_v1.list_service_account_for_all_namespaces()
            
            # 基本統計
            cluster_info = {
                'eventType': 'KubernetesClusterMetrics',
                'timestamp': int(time.time()),
                'cluster.name': self.cluster_name,
                'environment': self.environment,
                'cluster.version': version_info.git_version,
                'cluster.major_version': version_info.major,
                'cluster.minor_version': version_info.minor,
                'cluster.platform': version_info.platform,
                'cluster.go_version': version_info.go_version,
                'cluster.git_commit': version_info.git_commit,
                'cluster.build_date': version_info.build_date,
                'cluster.nodes_total': len(nodes.items),
                'cluster.namespaces_total': len(namespaces.items),
                'cluster.service_accounts_total': len(service_accounts.items)
            }
            
            # ノード統計
            ready_nodes = 0
            not_ready_nodes = 0
            master_nodes = 0
            worker_nodes = 0
            
            for node in nodes.items:
                # 状態確認
                for condition in node.status.conditions:
                    if condition.type == "Ready":
                        if condition.status == "True":
                            ready_nodes += 1
                        else:
                            not_ready_nodes += 1
                        break
                
                # ロール確認
                labels = node.metadata.labels or {}
                if 'node-role.kubernetes.io/master' in labels or 'node-role.kubernetes.io/control-plane' in labels:
                    master_nodes += 1
                else:
                    worker_nodes += 1
            
            cluster_info.update({
                'cluster.nodes_ready': ready_nodes,
                'cluster.nodes_not_ready': not_ready_nodes,
                'cluster.nodes_master': master_nodes,
                'cluster.nodes_worker': worker_nodes
            })
            
            return cluster_info
            
        except Exception as e:
            print(f"❌ Failed to collect cluster info: {e}")
            return None
    
    def collect_node_metrics(self):
        """ノードメトリクス収集"""
        try:
            nodes = self.core_v1.list_node()
            node_metrics = []
            
            for node in nodes.items:
                node_info = {
                    'eventType': 'KubernetesNodeMetrics',
                    'timestamp': int(time.time()),
                    'cluster.name': self.cluster_name,
                    'environment': self.environment,
                    'node.name': node.metadata.name,
                    'node.creation_timestamp': node.metadata.creation_timestamp.isoformat() if node.metadata.creation_timestamp else '',
                    'node.uid': node.metadata.uid
                }
                
                # ラベル情報
                labels = node.metadata.labels or {}
                node_info.update({
                    'node.os': labels.get('kubernetes.io/os', 'unknown'),
                    'node.arch': labels.get('kubernetes.io/arch', 'unknown'),
                    'node.instance_type': labels.get('node.kubernetes.io/instance-type', 'unknown'),
                    'node.zone': labels.get('topology.kubernetes.io/zone', 'unknown'),
                    'node.region': labels.get('topology.kubernetes.io/region', 'unknown'),
                    'node.role': 'master' if any(k.startswith('node-role.kubernetes.io/master') or k.startswith('node-role.kubernetes.io/control-plane') for k in labels.keys()) else 'worker'
                })
                
                # システム情報
                system_info = node.status.node_info
                node_info.update({
                    'node.kernel_version': system_info.kernel_version,
                    'node.os_image': system_info.os_image,
                    'node.container_runtime': system_info.container_runtime_version,
                    'node.kubelet_version': system_info.kubelet_version,
                    'node.kube_proxy_version': system_info.kube_proxy_version,
                    'node.machine_id': system_info.machine_id,
                    'node.system_uuid': system_info.system_uuid,
                    'node.boot_id': system_info.boot_id
                })
                
                # リソース情報
                if node.status.capacity:
                    capacity = node.status.capacity
                    node_info.update({
                        'node.capacity_cpu_cores': self._parse_cpu(capacity.get('cpu', '0')),
                        'node.capacity_memory_bytes': self._parse_memory(capacity.get('memory', '0')),
                        'node.capacity_pods': int(capacity.get('pods', '0')),
                        'node.capacity_ephemeral_storage_bytes': self._parse_memory(capacity.get('ephemeral-storage', '0'))
                    })
                
                if node.status.allocatable:
                    allocatable = node.status.allocatable
                    node_info.update({
                        'node.allocatable_cpu_cores': self._parse_cpu(allocatable.get('cpu', '0')),
                        'node.allocatable_memory_bytes': self._parse_memory(allocatable.get('memory', '0')),
                        'node.allocatable_pods': int(allocatable.get('pods', '0')),
                        'node.allocatable_ephemeral_storage_bytes': self._parse_memory(allocatable.get('ephemeral-storage', '0'))
                    })
                
                # ステータス情報
                conditions_status = {}
                for condition in node.status.conditions or []:
                    conditions_status[f'node.condition_{condition.type.lower()}'] = condition.status == "True"
                    
                node_info.update(conditions_status)
                
                # テイント情報
                if node.spec.taints:
                    node_info['node.taints_count'] = len(node.spec.taints)
                    # Events APIの属性は配列非対応のためカンマ区切り文字列にする
                    taint_effects = [taint.effect for taint in node.spec.taints]
                    node_info['node.taint_effects'] = ','.join(sorted(set(taint_effects)))
                
                # アドレス情報
                addresses = {}
                for address in node.status.addresses or []:
                    addresses[f'node.address_{address.type.lower()}'] = address.address
                    
                node_info.update(addresses)
                
                # Pod統計(このノード上のPod数)
                try:
                    pods = self.core_v1.list_pod_for_all_namespaces(field_selector=f'spec.nodeName={node.metadata.name}')
                    running_pods = sum(1 for pod in pods.items if pod.status.phase == 'Running')
                    pending_pods = sum(1 for pod in pods.items if pod.status.phase == 'Pending')
                    failed_pods = sum(1 for pod in pods.items if pod.status.phase == 'Failed')
                    
                    node_info.update({
                        'node.pods_total': len(pods.items),
                        'node.pods_running': running_pods,
                        'node.pods_pending': pending_pods,
                        'node.pods_failed': failed_pods
                    })
                except Exception as e:
                    print(f"⚠️  Failed to get pod stats for node {node.metadata.name}: {e}")
                
                node_metrics.append(node_info)
            
            return node_metrics
            
        except Exception as e:
            print(f"❌ Failed to collect node metrics: {e}")
            return []
    
    def collect_pod_metrics(self):
        """Podメトリクス収集"""
        try:
            pods = self.core_v1.list_pod_for_all_namespaces()
            pod_metrics = []
            
            # 大量のPodがある場合はサンプリング
            if len(pods.items) > 1000:
                print(f"⚠️  Large number of pods ({len(pods.items)}), sampling first 1000")
                pods.items = pods.items[:1000]
            
            for pod in pods.items:
                pod_info = {
                    'eventType': 'KubernetesPodMetrics',
                    'timestamp': int(time.time()),
                    'cluster.name': self.cluster_name,
                    'environment': self.environment,
                    'pod.name': pod.metadata.name,
                    'pod.namespace': pod.metadata.namespace,
                    'pod.uid': pod.metadata.uid,
                    'pod.node_name': pod.spec.node_name or 'unscheduled',
                    'pod.phase': pod.status.phase,
                    'pod.creation_timestamp': pod.metadata.creation_timestamp.isoformat() if pod.metadata.creation_timestamp else '',
                    'pod.restart_policy': pod.spec.restart_policy
                }
                
                # ラベル情報(重要なもののみ)
                labels = pod.metadata.labels or {}
                important_labels = ['app', 'version', 'tier', 'component', 'app.kubernetes.io/name', 'app.kubernetes.io/version']
                for label in important_labels:
                    if label in labels:
                        safe_key = label.replace('.', '_').replace('/', '_')
                        pod_info[f'pod.label_{safe_key}'] = labels[label]
                
                # オーナー参照
                if pod.metadata.owner_references:
                    owner_ref = pod.metadata.owner_references[0]
                    pod_info.update({
                        'pod.owner_kind': owner_ref.kind,
                        'pod.owner_name': owner_ref.name,
                        'pod.owner_uid': owner_ref.uid
                    })
                
                # コンテナ情報
                containers = pod.spec.containers or []
                pod_info['pod.containers_count'] = len(containers)
                
                if containers:
                    # メインコンテナ(最初のコンテナ)情報
                    main_container = containers[0]
                    pod_info.update({
                        'pod.main_container_name': main_container.name,
                        'pod.main_container_image': main_container.image
                    })
                    
                    # リソース要求・制限
                    if main_container.resources:
                        if main_container.resources.requests:
                            requests = main_container.resources.requests
                            if 'cpu' in requests:
                                pod_info['pod.cpu_request'] = self._parse_cpu(requests['cpu'])
                            if 'memory' in requests:
                                pod_info['pod.memory_request_bytes'] = self._parse_memory(requests['memory'])
                        
                        if main_container.resources.limits:
                            limits = main_container.resources.limits
                            if 'cpu' in limits:
                                pod_info['pod.cpu_limit'] = self._parse_cpu(limits['cpu'])
                            if 'memory' in limits:
                                pod_info['pod.memory_limit_bytes'] = self._parse_memory(limits['memory'])
                
                # ステータス詳細
                if pod.status.conditions:
                    for condition in pod.status.conditions:
                        condition_key = f'pod.condition_{condition.type.lower()}'
                        pod_info[condition_key] = condition.status == "True"
                
                # コンテナステータス
                if pod.status.container_statuses:
                    ready_containers = sum(1 for cs in pod.status.container_statuses if cs.ready)
                    restart_count = sum(cs.restart_count for cs in pod.status.container_statuses)
                    
                    pod_info.update({
                        'pod.containers_ready': ready_containers,
                        'pod.containers_total': len(pod.status.container_statuses),
                        'pod.restart_count_total': restart_count
                    })
                
                # ネットワーク情報
                if pod.status.pod_ip:
                    pod_info['pod.ip'] = pod.status.pod_ip
                
                if pod.status.host_ip:
                    pod_info['pod.host_ip'] = pod.status.host_ip
                
                # QoSクラス
                pod_info['pod.qos_class'] = pod.status.qos_class or 'BestEffort'
                
                # 年齢計算
                if pod.metadata.creation_timestamp:
                    age_seconds = (datetime.now(timezone.utc) - pod.metadata.creation_timestamp).total_seconds()
                    pod_info['pod.age_seconds'] = int(age_seconds)
                
                pod_metrics.append(pod_info)
            
            return pod_metrics
            
        except Exception as e:
            print(f"❌ Failed to collect pod metrics: {e}")
            return []
    
    def collect_deployment_metrics(self):
        """Deploymentメトリクス収集"""
        try:
            deployments = self.apps_v1.list_deployment_for_all_namespaces()
            deployment_metrics = []
            
            for deployment in deployments.items:
                deployment_info = {
                    'eventType': 'KubernetesDeploymentMetrics',
                    'timestamp': int(time.time()),
                    'cluster.name': self.cluster_name,
                    'environment': self.environment,
                    'deployment.name': deployment.metadata.name,
                    'deployment.namespace': deployment.metadata.namespace,
                    'deployment.uid': deployment.metadata.uid,
                    'deployment.generation': deployment.metadata.generation,
                    'deployment.creation_timestamp': deployment.metadata.creation_timestamp.isoformat() if deployment.metadata.creation_timestamp else ''
                }
                
                # ラベル情報
                labels = deployment.metadata.labels or {}
                important_labels = ['app', 'version', 'tier', 'component']
                for label in important_labels:
                    if label in labels:
                        deployment_info[f'deployment.label_{label}'] = labels[label]
                
                # スペック情報
                spec = deployment.spec
                deployment_info.update({
                    'deployment.replicas_desired': spec.replicas or 0,
                    'deployment.strategy_type': spec.strategy.type if spec.strategy else 'RollingUpdate'
                })
                
                # ローリングアップデート設定
                if spec.strategy and spec.strategy.rolling_update:
                    ru = spec.strategy.rolling_update
                    deployment_info.update({
                        'deployment.max_unavailable': str(ru.max_unavailable) if ru.max_unavailable else '25%',
                        'deployment.max_surge': str(ru.max_surge) if ru.max_surge else '25%'
                    })
                
                # ステータス情報
                status = deployment.status
                deployment_info.update({
                    'deployment.replicas_available': status.available_replicas or 0,
                    'deployment.replicas_ready': status.ready_replicas or 0,
                    'deployment.replicas_updated': status.updated_replicas or 0,
                    'deployment.replicas_unavailable': status.unavailable_replicas or 0,
                    'deployment.observed_generation': status.observed_generation or 0
                })
                
                # ヘルス状態判定
                desired = spec.replicas or 0
                available = status.available_replicas or 0
                deployment_info['deployment.health_status'] = 'healthy' if available == desired and desired > 0 else 'unhealthy'
                deployment_info['deployment.availability_percentage'] = round((available / desired) * 100, 2) if desired > 0 else 0
                
                # 条件情報
                if status.conditions:
                    for condition in status.conditions:
                        condition_key = f'deployment.condition_{condition.type.lower()}'
                        deployment_info[condition_key] = condition.status == "True"
                        
                        if condition.type == 'Progressing' and condition.reason:
                            deployment_info['deployment.progress_reason'] = condition.reason
                
                # Age calculation
                if deployment.metadata.creation_timestamp:
                    age_seconds = (datetime.now(timezone.utc) - deployment.metadata.creation_timestamp).total_seconds()
                    deployment_info['deployment.age_seconds'] = int(age_seconds)
                
                deployment_metrics.append(deployment_info)
            
            return deployment_metrics
            
        except Exception as e:
            print(f"❌ Failed to collect deployment metrics: {e}")
            return []
    
    def collect_service_metrics(self):
        """Serviceメトリクス収集"""
        try:
            services = self.core_v1.list_service_for_all_namespaces()
            service_metrics = []
            
            for service in services.items:
                service_info = {
                    'eventType': 'KubernetesServiceMetrics',
                    'timestamp': int(time.time()),
                    'cluster.name': self.cluster_name,
                    'environment': self.environment,
                    'service.name': service.metadata.name,
                    'service.namespace': service.metadata.namespace,
                    'service.uid': service.metadata.uid,
                    'service.type': service.spec.type,
                    'service.creation_timestamp': service.metadata.creation_timestamp.isoformat() if service.metadata.creation_timestamp else ''
                }
                
                # Labels and selector
                labels = service.metadata.labels or {}
                if 'app' in labels:
                    service_info['service.app'] = labels['app']
                
                if service.spec.selector:
                    service_info['service.selector_count'] = len(service.spec.selector)
                
                # Port information
                ports = service.spec.ports or []
                service_info['service.ports_count'] = len(ports)
                
                if ports:
                    service_info['service.main_port'] = ports[0].port
                    service_info['service.main_protocol'] = ports[0].protocol
                    if ports[0].target_port:
                        service_info['service.main_target_port'] = str(ports[0].target_port)
                
                # IP information
                service_info['service.cluster_ip'] = service.spec.cluster_ip or 'None'
                
                if service.spec.type == 'LoadBalancer':
                    if service.status.load_balancer and service.status.load_balancer.ingress:
                        lb_ingress = service.status.load_balancer.ingress[0]
                        service_info['service.load_balancer_ip'] = lb_ingress.ip or lb_ingress.hostname or 'pending'
                    else:
                        service_info['service.load_balancer_ip'] = 'pending'
                
                if service.spec.type == 'NodePort':
                    if ports and ports[0].node_port:
                        service_info['service.node_port'] = ports[0].node_port
                
                # Endpoint check
                try:
                    endpoints = self.core_v1.read_namespaced_endpoints(
                        name=service.metadata.name,
                        namespace=service.metadata.namespace
                    )
                    
                    total_addresses = 0
                    if endpoints.subsets:
                        for subset in endpoints.subsets:
                            if subset.addresses:
                                total_addresses += len(subset.addresses)
                    
                    service_info['service.endpoints_ready'] = total_addresses
                    service_info['service.health_status'] = 'healthy' if total_addresses > 0 else 'unhealthy'
                    
                except Exception:
                    service_info['service.endpoints_ready'] = 0
                    service_info['service.health_status'] = 'unknown'
                
                # Age calculation
                if service.metadata.creation_timestamp:
                    age_seconds = (datetime.now(timezone.utc) - service.metadata.creation_timestamp).total_seconds()
                    service_info['service.age_seconds'] = int(age_seconds)
                
                service_metrics.append(service_info)
            
            return service_metrics
            
        except Exception as e:
            print(f"❌ Failed to collect service metrics: {e}")
            return []
    
    def _parse_cpu(self, cpu_str):
        """CPU文字列を数値に変換(コア数)"""
        if not cpu_str:
            return 0
        
        cpu_str = str(cpu_str)
        if cpu_str.endswith('n'):
            # Nanocores, as reported by the metrics API (metrics.k8s.io)
            return float(cpu_str[:-1]) / 1000000000
        elif cpu_str.endswith('u'):
            return float(cpu_str[:-1]) / 1000000
        elif cpu_str.endswith('m'):
            return float(cpu_str[:-1]) / 1000
        else:
            return float(cpu_str)
    
    def _parse_memory(self, memory_str):
        """メモリ文字列をバイト数に変換"""
        if not memory_str:
            return 0
        
        memory_str = str(memory_str)
        # Binary suffixes are listed first so 'Ki' is matched before 'K'.
        multipliers = {
            'Ki': 1024,
            'Mi': 1024**2,
            'Gi': 1024**3,
            'Ti': 1024**4,
            'Pi': 1024**5,
            'k': 1000,   # Kubernetes uses lowercase 'k' for the decimal SI suffix
            'K': 1000,
            'M': 1000**2,
            'G': 1000**3,
            'T': 1000**4,
            'P': 1000**5
        }
        
        for suffix, multiplier in multipliers.items():
            if memory_str.endswith(suffix):
                return int(float(memory_str[:-len(suffix)]) * multiplier)
        
        return int(float(memory_str))
    
    def send_to_newrelic(self, metrics_data):
        """メトリクスをNew Relicに送信"""
        if not metrics_data:
            return
        
        try:
            headers = {
                'Content-Type': 'application/json',
                'X-Insert-Key': self.newrelic_insert_key
            }
            
            # Send in batches
            batch_size = 100
            for i in range(0, len(metrics_data), batch_size):
                batch = metrics_data[i:i+batch_size]
                
                response = requests.post(
                    self.insights_api,
                    headers=headers,
                    json=batch,
                    timeout=30
                )
                
                if response.status_code == 200:
                    print(f"✅ Sent {len(batch)} Kubernetes metrics to New Relic")
                else:
                    print(f"❌ Failed to send batch: {response.status_code}")
                    
        except Exception as e:
            print(f"❌ Failed to send metrics to New Relic: {e}")
    
    def run_monitoring(self):
        """メイン監視処理"""
        print("☸️  Starting Kubernetes Enterprise Monitoring")
        print(f"🎯 Cluster: {self.cluster_name}")
        print(f"🌍 Environment: {self.environment}")
        print(f"📅 Timestamp: {datetime.now(timezone.utc).isoformat()}")
        
        all_metrics = []
        
        # Collect metrics concurrently
        with ThreadPoolExecutor(max_workers=5) as executor:
            futures = {
                executor.submit(self.collect_cluster_info): "Cluster Info",
                executor.submit(self.collect_node_metrics): "Nodes",
                executor.submit(self.collect_pod_metrics): "Pods",
                executor.submit(self.collect_deployment_metrics): "Deployments",
                executor.submit(self.collect_service_metrics): "Services"
            }
            
            for future in as_completed(futures):
                metric_type = futures[future]
                try:
                    result = future.result()
                    if result:
                        if isinstance(result, list):
                            all_metrics.extend(result)
                            print(f"✅ Collected {len(result)} {metric_type} metrics")
                        else:
                            all_metrics.append(result)
                            print(f"✅ Collected {metric_type} metrics")
                except Exception as e:
                    print(f"❌ Failed to collect {metric_type} metrics: {e}")
        
        # Send to New Relic
        if all_metrics:
            print(f"📤 Sending {len(all_metrics)} total metrics to New Relic...")
            self.send_to_newrelic(all_metrics)
            print("🎉 Kubernetes monitoring completed successfully")
        else:
            print("⚠️  No metrics collected")

# Main entry point
if __name__ == "__main__":
    import os
    
    # Read configuration from environment variables
    NEWRELIC_INSERT_KEY = os.environ.get('NEWRELIC_INSERT_KEY', '')
    NEWRELIC_ACCOUNT_ID = os.environ.get('NEWRELIC_ACCOUNT_ID', '')
    KUBECONFIG_PATH = os.environ.get('KUBECONFIG', None)
    
    if not all([NEWRELIC_INSERT_KEY, NEWRELIC_ACCOUNT_ID]):
        print("❌ Required environment variables not set")
        print("Please set: NEWRELIC_INSERT_KEY, NEWRELIC_ACCOUNT_ID")
        exit(1)
    
    monitor = KubernetesEnterpriseMonitor(
        NEWRELIC_INSERT_KEY, 
        NEWRELIC_ACCOUNT_ID,
        KUBECONFIG_PATH
    )
    
    try:
        monitor.run_monitoring()
    except KeyboardInterrupt:
        print("\n⏹️  Monitoring stopped by user")
    except Exception as e:
        print(f"❌ Monitoring failed: {e}")
        exit(1)

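The quantity parsers are the subtlest part of the collector above. The sketch below restates the same conversion rules in standalone form (including the nanocore suffix `n` that the metrics API reports for CPU) so they can be sanity-checked without a cluster connection; it is an illustration, not part of the monitoring script itself.

```python
# Standalone restatement of the quantity conversions used by the
# _parse_cpu / _parse_memory helpers, for offline sanity checks.

def parse_cpu(cpu_str: str) -> float:
    """Convert a Kubernetes CPU quantity ('250m', '500000000n', '2') to cores."""
    s = str(cpu_str)
    if s.endswith('n'):          # nanocores (metrics API)
        return float(s[:-1]) / 1_000_000_000
    if s.endswith('u'):          # microcores
        return float(s[:-1]) / 1_000_000
    if s.endswith('m'):          # millicores
        return float(s[:-1]) / 1_000
    return float(s)

def parse_memory(mem_str: str) -> int:
    """Convert a Kubernetes memory quantity ('128Mi', '1G') to bytes."""
    s = str(mem_str)
    # Binary suffixes first so 'Mi' is not mistaken for 'M', 'Ki' for 'K', etc.
    for suffix, mult in [('Ki', 1024), ('Mi', 1024**2), ('Gi', 1024**3),
                         ('Ti', 1024**4), ('Pi', 1024**5),
                         ('k', 1000), ('K', 1000), ('M', 1000**2),
                         ('G', 1000**3), ('T', 1000**4), ('P', 1000**5)]:
        if s.endswith(suffix):
            return int(float(s[:-len(suffix)]) * mult)
    return int(float(s))

print(parse_cpu('250m'))       # 0.25 cores
print(parse_memory('128Mi'))   # 134217728 bytes
```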
✅ Section 4.3 Completion Check

🎯 Confirming the Learning Goals

Once you have completed this section, check that you can do each of the following:

🐳 Docker Container Monitoring

  • [ ] Configure comprehensive monitoring for an enterprise Docker environment
  • [ ] Implement container lifecycle and performance monitoring
  • [ ] Build an integrated monitoring environment with Docker Compose
  • [ ] Add security and compliance monitoring

☸️ Kubernetes Cluster Monitoring

  • [ ] Implement layered monitoring for large Kubernetes clusters
  • [ ] Perform a complete deployment with a Helm chart
  • [ ] Monitor clusters, nodes, Pods, and Services in detail
  • [ ] Visualize dependencies between microservices

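The monitoring script records custom events such as `KubernetesDeploymentMetrics` and `KubernetesServiceMetrics`. One quick way to confirm they are arriving is to query them back with NRQL. The sketch below only builds the request URL for the classic Insights query API (the actual request would carry an `X-Query-Key` header, which is a separate credential from the insert key; newer accounts may prefer NerdGraph instead, and the account ID here is a placeholder):

```python
from urllib.parse import urlencode

def build_insights_query_url(account_id: str, nrql: str) -> str:
    """Build a classic Insights query API URL for the given NRQL statement.

    The request itself would be sent with an 'X-Query-Key' header
    (distinct from the insert key used for writing events).
    """
    base = f"https://insights-api.newrelic.com/v1/accounts/{account_id}/query"
    return f"{base}?{urlencode({'nrql': nrql})}"

# Example: surface Deployments the script flagged as unhealthy,
# using attributes the collector actually emits.
nrql = ("SELECT latest(`deployment.availability_percentage`) "
        "FROM KubernetesDeploymentMetrics "
        "WHERE `deployment.health_status` = 'unhealthy' "
        "SINCE 30 minutes ago")
print(build_insights_query_url("1234567", nrql))  # account ID is a placeholder
```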
🏢 Enterprise Features

  • [ ] Manage multi-cluster environments in a unified way
  • [ ] Integrate with GitOps and DevSecOps pipelines
  • [ ] Achieve cost optimization and efficient resource usage
  • [ ] Automate security policies

🚀 Next Steps

Once you have mastered container and Kubernetes monitoring, move on to the next section:




🎯 Next step: Learn how to implement security hardening and threat detection in 4.4 Network & Security Monitoring!