4.3 コンテナ・Kubernetes環境 - クラウドネイティブ監視の完全実装
クラウドネイティブアプリケーションの普及により、コンテナとKubernetes環境の監視は現代のエンタープライズインフラにおいて必要不可欠な要素となりました。New Relic Infrastructureは、動的で複雑なコンテナ環境において、アプリケーションレベルからインフラレベルまでの完全な可視性を提供します。
本セクションでは、Docker コンテナ監視から大規模Kubernetesクラスター運用まで、エンタープライズレベルでの実装戦略と高度な監視手法を包括的に解説します。
🎯 このセクションの学習目標
📦 コンテナ監視マスタリー
- Docker環境の詳細監視とパフォーマンス最適化
- コンテナライフサイクルの完全追跡
- イメージセキュリティとコンプライアンス監視
- レジストリ管理と脆弱性スキャン統合
☸️ Kubernetes 監視のエキスパート化
- クラスターレベルからPodレベルまで階層監視
- マイクロサービス間の依存関係可視化
- リソース管理と自動スケーリング連携
- GitOpsワークフローとの統合
🏢 エンタープライズ運用の実現
- マルチクラスター環境の統合管理
- セキュリティポリシーの自動化
- コスト最適化とリソース効率化
- DevSecOpsパイプラインとの連携
🐳 Docker コンテナ監視の高度実装
📊 エンタープライズ Docker 監視アーキテクチャ
🏗️ 監視レイヤー構成
```yaml
Docker_Monitoring_Architecture:
  Layer_1_Infrastructure:
    - Host System Resources (CPU, Memory, Disk, Network)
    - Docker Engine Performance
    - Container Runtime Statistics
    - Storage Driver Metrics
  Layer_2_Container:
    - Container Lifecycle Events
    - Resource Usage per Container
    - Network Traffic Analysis
    - Volume Mount Monitoring
  Layer_3_Application:
    - Application Performance Metrics
    - Custom Business Metrics
    - Distributed Tracing
    - Error Tracking
  Layer_4_Security:
    - Image Vulnerability Scanning
    - Runtime Security Monitoring
    - Access Control Auditing
    - Compliance Reporting
```
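各レイヤーで収集したデータは、後述のスクリプトのように New Relic のカスタムイベントとして送信します。レイヤーとイベントタイプの対応を確認する最小スケッチを示します。Layer 1/2 のイベントタイプ名は本節のスクリプトに合わせていますが、Layer 3/4 の名前はここでの仮置きです。

```python
"""監視レイヤーごとのカスタムイベント組み立ての最小スケッチ"""
import time

LAYER_EVENT_TYPES = {
    1: "DockerEngineMetrics",     # Layer 1: Infrastructure
    2: "DockerContainerMetrics",  # Layer 2: Container
    3: "ApplicationMetrics",      # Layer 3: Application(仮の名前)
    4: "DockerSecurityEvents",    # Layer 4: Security(仮の名前)
}

def build_event(layer, attributes):
    """レイヤー番号と属性辞書から Event API へ送る1イベントを組み立てる"""
    event = {
        "eventType": LAYER_EVENT_TYPES[layer],
        "timestamp": int(time.time()),
    }
    event.update(attributes)
    return event

if __name__ == "__main__":
    event = build_event(2, {"container.name": "webapp", "container.cpu_percent": 12.5})
    print(event["eventType"])  # DockerContainerMetrics
```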
⚙️ 最適化された Docker Compose 設定
```yaml
# docker-compose.enterprise-monitoring.yml
# エンタープライズ級 Docker 監視環境
version: '3.8'

services:
  # メインアプリケーション
  webapp:
    build:
      context: .
      dockerfile: Dockerfile.production
    environment:
      # New Relic APM設定
      - NEW_RELIC_LICENSE_KEY=${NEW_RELIC_LICENSE_KEY}
      - NEW_RELIC_APP_NAME=enterprise-webapp
      - NEW_RELIC_DISTRIBUTED_TRACING_ENABLED=true
      - NEW_RELIC_LOG_LEVEL=info
      - NEW_RELIC_APPLICATION_LOGGING_ENABLED=true
      - NEW_RELIC_APPLICATION_LOGGING_FORWARDING_ENABLED=true
      # アプリケーション設定
      - ENVIRONMENT=production
      - DATABASE_URL=postgresql://user:pass@db:5432/webapp
      - REDIS_URL=redis://cache:6379
      - ELASTICSEARCH_URL=http://search:9200
      # セキュリティ設定
      - ENABLE_SECURITY_HEADERS=true
      - CSRF_PROTECTION=true
      - RATE_LIMITING=true
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 2G
        reservations:
          cpus: '0.5'
          memory: 512M
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
        max-file: "5"
        labels: "environment=production,service=webapp"
    labels:
      # New Relic監視ラベル
      - "newrelic.monitor=true"
      - "newrelic.service=webapp"
      - "newrelic.environment=production"
      - "newrelic.tier=frontend"
      # セキュリティラベル
      - "security.scan=enabled"
      - "compliance.pci_dss=required"
      - "backup.policy=daily"
    networks:
      - frontend
      - backend
    depends_on:
      db:
        condition: service_healthy
      cache:
        condition: service_healthy

  # PostgreSQL データベース
  db:
    image: postgres:15-alpine
    environment:
      - POSTGRES_DB=webapp
      - POSTGRES_USER=webapp_user
      - POSTGRES_PASSWORD_FILE=/run/secrets/db_password
      - POSTGRES_INITDB_ARGS=--auth-host=scram-sha-256
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./monitoring/postgres-init.sql:/docker-entrypoint-initdb.d/monitoring.sql
      - ./config/postgresql.conf:/etc/postgresql/postgresql.conf
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U webapp_user -d webapp"]
      interval: 30s
      timeout: 10s
      retries: 3
    deploy:
      resources:
        limits:
          cpus: '1.0'
          memory: 1G
        reservations:
          cpus: '0.25'
          memory: 256M
    logging:
      driver: "json-file"
      options:
        max-size: "50m"
        max-file: "3"
        labels: "service=database"
    labels:
      - "newrelic.monitor=true"
      - "newrelic.service=postgresql"
      - "newrelic.environment=production"
      - "newrelic.tier=database"
      - "backup.schedule=0 2 * * *"  # 毎日2時にバックアップ
    secrets:
      - db_password
    networks:
      - backend

  # Redis キャッシュ
  cache:
    image: redis:7-alpine
    command: >
      redis-server
      --requirepass ${REDIS_PASSWORD}
      --appendonly yes
      --appendfsync everysec
      --maxmemory 512mb
      --maxmemory-policy allkeys-lru
    volumes:
      - redis_data:/data
      - ./config/redis.conf:/usr/local/etc/redis/redis.conf
    healthcheck:
      test: ["CMD", "redis-cli", "-a", "${REDIS_PASSWORD}", "ping"]
      interval: 30s
      timeout: 10s
      retries: 3
    deploy:
      resources:
        limits:
          cpus: '0.5'
          memory: 512M
        reservations:
          cpus: '0.1'
          memory: 64M
    labels:
      - "newrelic.monitor=true"
      - "newrelic.service=redis"
      - "newrelic.environment=production"
      - "newrelic.tier=cache"
    networks:
      - backend

  # Elasticsearch (検索・ログ)
  search:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    environment:
      - discovery.type=single-node
      - ES_JAVA_OPTS=-Xms1g -Xmx1g
      - xpack.security.enabled=false
      - xpack.monitoring.collection.enabled=true
    volumes:
      - elasticsearch_data:/usr/share/elasticsearch/data
      - ./config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:9200/_cluster/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 2G
        reservations:
          cpus: '0.5'
          memory: 1G
    labels:
      - "newrelic.monitor=true"
      - "newrelic.service=elasticsearch"
      - "newrelic.environment=production"
      - "newrelic.tier=search"
    networks:
      - backend

  # New Relic Infrastructure Agent
  newrelic-infra:
    image: newrelic/infrastructure:latest
    cap_add:
      - SYS_PTRACE
    network_mode: host
    pid: host
    privileged: true
    environment:
      - NRIA_LICENSE_KEY=${NEW_RELIC_LICENSE_KEY}
      - NRIA_VERBOSE=1
      - NRIA_DISPLAY_NAME=docker-host-${ENVIRONMENT}
      - NRIA_CUSTOM_ATTRIBUTES={"environment":"${ENVIRONMENT}","deployment":"docker","monitoring_level":"comprehensive"}
      - NRIA_ENABLE_DOCKER=true
    volumes:
      - "/:/host:ro"
      - "/var/run/docker.sock:/var/run/docker.sock:ro"
      - "/sys:/host/sys:ro"
      - "/proc:/host/proc:ro"
      - "/dev:/host/dev:ro"
      - "./config/newrelic-infra.yml:/etc/newrelic-infra.yml:ro"
    restart: unless-stopped
    labels:
      - "newrelic.monitor=false"  # 自分自身は監視対象外
      - "newrelic.service=infrastructure-agent"
    depends_on:
      - webapp
      - db
      - cache

  # ログ配送 (Fluent Bit)
  log-forwarder:
    image: fluent/fluent-bit:2.2
    volumes:
      - "./config/fluent-bit.conf:/fluent-bit/etc/fluent-bit.conf:ro"
      - "/var/lib/docker/containers:/var/lib/docker/containers:ro"
      - "/var/log:/var/log:ro"
    environment:
      - NEW_RELIC_LICENSE_KEY=${NEW_RELIC_LICENSE_KEY}
      - ENVIRONMENT=${ENVIRONMENT}
    depends_on:
      - webapp
    labels:
      - "newrelic.monitor=true"
      - "newrelic.service=log-forwarder"
      - "newrelic.environment=production"
    networks:
      - backend

  # メトリクス収集 (Prometheus互換)
  metrics-exporter:
    image: prom/node-exporter:latest
    volumes:
      - "/proc:/host/proc:ro"
      - "/sys:/host/sys:ro"
      - "/:/rootfs:ro"
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc)($$|/)'
    labels:
      - "prometheus.io/scrape=true"
      - "prometheus.io/port=9100"
      - "newrelic.monitor=true"
    networks:
      - monitoring

volumes:
  postgres_data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /data/postgres
  redis_data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /data/redis
  elasticsearch_data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /data/elasticsearch

networks:
  frontend:
    driver: bridge
    ipam:
      config:
        - subnet: 172.20.1.0/24
  backend:
    driver: bridge
    ipam:
      config:
        - subnet: 172.20.2.0/24
  monitoring:
    driver: bridge
    ipam:
      config:
        - subnet: 172.20.3.0/24

secrets:
  db_password:
    file: ./secrets/db_password.txt
```
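上記の Compose 定義は `NEW_RELIC_LICENSE_KEY`・`REDIS_PASSWORD`・`ENVIRONMENT` の各環境変数が設定済みであることを前提にしています。`docker compose up` の前に実行できる起動前チェックの最小スケッチを示します(検証する変数名は上記定義から抜き出したもので、チェック方法自体は一例です)。

```python
#!/usr/bin/env python3
"""docker-compose.enterprise-monitoring.yml 起動前の環境変数チェック(最小スケッチ)"""
import os

# 上記 Compose 定義が ${...} で参照している必須変数
REQUIRED_VARS = ["NEW_RELIC_LICENSE_KEY", "REDIS_PASSWORD", "ENVIRONMENT"]

def missing_vars(env=os.environ):
    """未設定(または空文字)の必須変数名をリストで返す"""
    return [name for name in REQUIRED_VARS if not env.get(name)]

if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print(f"❌ Missing: {', '.join(missing)}")
    else:
        print("✅ All required variables are set")
```

CI パイプラインやデプロイスクリプトの先頭で呼び出し、未設定のままコンテナを起動してしまう事故を防ぎます。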
📊 Docker 監視スクリプト
```python
#!/usr/bin/env python3
"""
Docker エンタープライズ監視スクリプト
コンテナ・イメージ・ネットワーク・ボリュームの包括監視
"""
import os
import time
from datetime import datetime
from concurrent.futures import ThreadPoolExecutor, as_completed

import docker
import requests


class DockerEnterpriseMonitor:
    def __init__(self, newrelic_insert_key, newrelic_account_id):
        self.newrelic_insert_key = newrelic_insert_key
        self.newrelic_account_id = newrelic_account_id
        self.insights_api = f"https://insights-collector.newrelic.com/v1/accounts/{newrelic_account_id}/events"

        # Docker クライアント初期化
        try:
            self.docker_client = docker.from_env()
            self.docker_client.ping()
            print("✅ Connected to Docker daemon")
        except Exception as e:
            print(f"❌ Failed to connect to Docker: {e}")
            raise

        self.hostname = os.uname().nodename
        self.environment = os.environ.get('ENVIRONMENT', 'production')

    def collect_docker_info(self):
        """Docker エンジン情報収集"""
        try:
            info = self.docker_client.info()
            version = self.docker_client.version()
            docker_info = {
                'eventType': 'DockerEngineMetrics',
                'timestamp': int(time.time()),
                'hostname': self.hostname,
                'environment': self.environment,
                'docker.version': version['Version'],
                'docker.api_version': version['ApiVersion'],
                'docker.go_version': version['GoVersion'],
                'docker.git_commit': version['GitCommit'],
                'docker.built': version['Built'],
                'docker.os': version['Os'],
                'docker.arch': version['Arch'],
                'docker.kernel_version': version['KernelVersion'],
                'containers.total': info['Containers'],
                'containers.running': info['ContainersRunning'],
                'containers.paused': info['ContainersPaused'],
                'containers.stopped': info['ContainersStopped'],
                'images.total': info['Images'],
                'docker.storage_driver': info['Driver'],
                'docker.logging_driver': info['LoggingDriver'],
                'docker.cgroup_driver': info.get('CgroupDriver', 'unknown'),
                'docker.memory_limit': info.get('MemoryLimit', False),
                'docker.swap_limit': info.get('SwapLimit', False),
                'docker.cpu_cfs_period': info.get('CpuCfsPeriod', False),
                'docker.cpu_cfs_quota': info.get('CpuCfsQuota', False),
                'docker.kernel_memory': info.get('KernelMemory', False),
                'docker.oom_kill_disable': info.get('OomKillDisable', False),
                'security.apparmor_profile': info.get('SecurityOptions', []),
                'registry.insecure_registries': len(info.get('InsecureRegistries', [])),
                'registry.index_configs': len(info.get('IndexConfigs', {}))
            }

            # システムリソース情報(SystemStatus は新しいエンジンでは常に None のため、
            # info 直下のキーから無条件に取得する)
            docker_info.update({
                'system.total_memory': info.get('MemTotal', 0),
                'system.ncpu': info.get('NCPU', 0),
                'system.name': info.get('Name', 'unknown'),
                'system.server_version': info.get('ServerVersion', 'unknown')
            })
            return docker_info
        except Exception as e:
            print(f"❌ Failed to collect Docker info: {e}")
            return None

    def collect_container_metrics(self):
        """コンテナメトリクス収集"""
        try:
            containers = self.docker_client.containers.list(all=True)
            container_metrics = []

            for container in containers:
                try:
                    # 基本情報
                    container_info = {
                        'eventType': 'DockerContainerMetrics',
                        'timestamp': int(time.time()),
                        'hostname': self.hostname,
                        'environment': self.environment,
                        'container.id': container.id[:12],
                        'container.name': container.name,
                        'container.status': container.status,
                        'container.image': container.image.tags[0] if container.image.tags else 'none',
                        'container.image_id': container.image.id[:12],
                        'container.command': ' '.join(container.attrs['Config']['Cmd'] or []),
                        'container.created': container.attrs['Created'],
                        'container.platform': container.attrs.get('Platform', 'unknown'),
                        'container.architecture': container.attrs.get('Architecture', 'unknown')
                    }

                    # ラベル情報
                    labels = container.labels
                    if labels:
                        container_info.update({
                            f'label.{k}': v for k, v in labels.items()
                            if not k.startswith('com.docker') and len(k) < 50
                        })

                    # ネットワーク情報
                    networks = container.attrs.get('NetworkSettings', {}).get('Networks', {})
                    container_info['container.networks'] = list(networks.keys())
                    container_info['container.network_count'] = len(networks)

                    # マウント情報
                    mounts = container.attrs.get('Mounts', [])
                    container_info['container.mounts'] = len(mounts)

                    # 環境変数数
                    env_vars = container.attrs.get('Config', {}).get('Env', [])
                    container_info['container.env_vars_count'] = len(env_vars)

                    # ランニングコンテナの統計情報
                    if container.status == 'running':
                        try:
                            stats = container.stats(stream=False)

                            # CPU統計
                            cpu_stats = stats['cpu_stats']
                            precpu_stats = stats['precpu_stats']

                            # CPU使用率計算
                            # cgroup v2 では percpu_usage が存在しないため online_cpus を優先
                            cpu_delta = cpu_stats['cpu_usage']['total_usage'] - precpu_stats['cpu_usage']['total_usage']
                            system_delta = cpu_stats.get('system_cpu_usage', 0) - precpu_stats.get('system_cpu_usage', 0)
                            online_cpus = cpu_stats.get('online_cpus') or len(cpu_stats['cpu_usage'].get('percpu_usage') or [1])
                            if system_delta > 0:
                                cpu_percent = (cpu_delta / system_delta) * online_cpus * 100
                                container_info['container.cpu_percent'] = round(cpu_percent, 2)

                            # メモリ統計
                            memory_stats = stats['memory_stats']
                            memory_usage = memory_stats.get('usage', 0)
                            memory_limit = memory_stats.get('limit', 0)
                            container_info['container.memory_usage_bytes'] = memory_usage
                            container_info['container.memory_limit_bytes'] = memory_limit
                            if memory_limit > 0:
                                memory_percent = (memory_usage / memory_limit) * 100
                                container_info['container.memory_percent'] = round(memory_percent, 2)

                            # メモリ詳細統計
                            mem_detail = memory_stats.get('stats', {})
                            container_info.update({
                                'container.memory_cache_bytes': mem_detail.get('cache', 0),
                                'container.memory_rss_bytes': mem_detail.get('rss', 0),
                                'container.memory_swap_bytes': mem_detail.get('swap', 0)
                            })

                            # ネットワーク統計
                            networks_stats = stats.get('networks', {})
                            total_rx_bytes = sum(net.get('rx_bytes', 0) for net in networks_stats.values())
                            total_tx_bytes = sum(net.get('tx_bytes', 0) for net in networks_stats.values())
                            total_rx_errors = sum(net.get('rx_errors', 0) for net in networks_stats.values())
                            total_tx_errors = sum(net.get('tx_errors', 0) for net in networks_stats.values())
                            container_info.update({
                                'container.network_rx_bytes': total_rx_bytes,
                                'container.network_tx_bytes': total_tx_bytes,
                                'container.network_rx_errors': total_rx_errors,
                                'container.network_tx_errors': total_tx_errors
                            })

                            # Block I/O統計
                            blkio_stats = stats.get('blkio_stats', {})
                            read_bytes = sum(op['value'] for op in blkio_stats.get('io_service_bytes_recursive', []) or [] if op['op'] == 'Read')
                            write_bytes = sum(op['value'] for op in blkio_stats.get('io_service_bytes_recursive', []) or [] if op['op'] == 'Write')
                            container_info.update({
                                'container.blkio_read_bytes': read_bytes,
                                'container.blkio_write_bytes': write_bytes
                            })

                            # プロセス数
                            pids_stats = stats.get('pids_stats', {})
                            container_info['container.pids_count'] = pids_stats.get('current', 0)
                        except Exception as e:
                            print(f"⚠️ Failed to get stats for container {container.name}: {e}")

                    # ヘルスチェック情報
                    health = container.attrs.get('State', {}).get('Health')
                    if health:
                        container_info.update({
                            'container.health_status': health.get('Status', 'unknown'),
                            'container.health_failing_streak': health.get('FailingStreak', 0),
                            'container.health_log_length': len(health.get('Log', []))
                        })

                    # 再起動情報
                    restart_count = container.attrs.get('RestartCount', 0)
                    container_info['container.restart_count'] = restart_count

                    # 作成からの経過時間
                    created_time = datetime.fromisoformat(container.attrs['Created'].replace('Z', '+00:00'))
                    uptime_seconds = (datetime.now().astimezone() - created_time).total_seconds()
                    container_info['container.uptime_seconds'] = int(uptime_seconds)

                    container_metrics.append(container_info)
                except Exception as e:
                    print(f"⚠️ Failed to process container {container.name}: {e}")
                    continue

            return container_metrics
        except Exception as e:
            print(f"❌ Failed to collect container metrics: {e}")
            return []

    def collect_image_metrics(self):
        """イメージメトリクス収集"""
        try:
            images = self.docker_client.images.list(all=True)
            image_metrics = []

            for image in images:
                try:
                    # 基本情報
                    image_info = {
                        'eventType': 'DockerImageMetrics',
                        'timestamp': int(time.time()),
                        'hostname': self.hostname,
                        'environment': self.environment,
                        'image.id': image.id[:12],
                        'image.short_id': image.short_id,
                        'image.tags': image.tags,
                        'image.size_bytes': image.attrs['Size'],
                        # VirtualSize は新しい API では返らないため .get でフォールバック
                        'image.virtual_size_bytes': image.attrs.get('VirtualSize', image.attrs['Size']),
                        'image.created': image.attrs['Created'],
                        'image.architecture': image.attrs['Architecture'],
                        'image.os': image.attrs['Os'],
                        'image.variant': image.attrs.get('Variant', ''),
                        'image.docker_version': image.attrs.get('DockerVersion', ''),
                        'image.author': image.attrs.get('Author', ''),
                        'image.comment': image.attrs.get('Comment', '')
                    }

                    # 設定情報
                    config = image.attrs.get('Config', {})
                    if config:
                        image_info.update({
                            'image.exposed_ports': list(config.get('ExposedPorts', {}).keys()),
                            'image.env_vars_count': len(config.get('Env', [])),
                            'image.cmd': config.get('Cmd'),
                            'image.entrypoint': config.get('Entrypoint'),
                            'image.working_dir': config.get('WorkingDir', ''),
                            'image.user': config.get('User', '')
                        })

                    # レイヤー情報
                    history = image.history()
                    image_info['image.layers_count'] = len(history)

                    # ルートファイルシステム情報
                    rootfs = image.attrs.get('RootFS', {})
                    if rootfs:
                        image_info.update({
                            'image.rootfs_type': rootfs.get('Type', ''),
                            'image.rootfs_layers_count': len(rootfs.get('Layers', []))
                        })

                    # 使用統計(このイメージを使っているコンテナ数)
                    containers_using = self.docker_client.containers.list(
                        all=True,
                        filters={'ancestor': image.id}
                    )
                    image_info['image.containers_using'] = len(containers_using)

                    image_metrics.append(image_info)
                except Exception as e:
                    print(f"⚠️ Failed to process image {image.id[:12]}: {e}")
                    continue

            return image_metrics
        except Exception as e:
            print(f"❌ Failed to collect image metrics: {e}")
            return []

    def collect_network_metrics(self):
        """Docker ネットワークメトリクス収集"""
        try:
            networks = self.docker_client.networks.list()
            network_metrics = []

            for network in networks:
                network_info = {
                    'eventType': 'DockerNetworkMetrics',
                    'timestamp': int(time.time()),
                    'hostname': self.hostname,
                    'environment': self.environment,
                    'network.id': network.id[:12],
                    'network.name': network.name,
                    'network.driver': network.attrs['Driver'],
                    'network.scope': network.attrs['Scope'],
                    'network.attachable': network.attrs.get('Attachable', False),
                    'network.ingress': network.attrs.get('Ingress', False),
                    'network.ipam_driver': network.attrs.get('IPAM', {}).get('Driver', ''),
                    'network.internal': network.attrs.get('Internal', False),
                    'network.enable_ipv6': network.attrs.get('EnableIPv6', False),
                    'network.created': network.attrs['Created']
                }

                # 接続されているコンテナ数
                containers = network.attrs.get('Containers', {})
                network_info['network.connected_containers'] = len(containers)

                # IPAMコンフィグ
                ipam_config = network.attrs.get('IPAM', {}).get('Config', [])
                if ipam_config:
                    network_info.update({
                        'network.subnets_count': len(ipam_config),
                        'network.subnet': ipam_config[0].get('Subnet', ''),
                        'network.gateway': ipam_config[0].get('Gateway', '')
                    })

                # オプション
                options = network.attrs.get('Options', {})
                if options:
                    network_info.update({
                        f'network.option.{k}': v for k, v in options.items()
                        if len(k) < 50
                    })

                network_metrics.append(network_info)

            return network_metrics
        except Exception as e:
            print(f"❌ Failed to collect network metrics: {e}")
            return []

    def collect_volume_metrics(self):
        """Docker ボリュームメトリクス収集"""
        try:
            volumes = self.docker_client.volumes.list()
            volume_metrics = []

            for volume in volumes:
                volume_info = {
                    'eventType': 'DockerVolumeMetrics',
                    'timestamp': int(time.time()),
                    'hostname': self.hostname,
                    'environment': self.environment,
                    'volume.name': volume.name,
                    'volume.driver': volume.attrs['Driver'],
                    'volume.mountpoint': volume.attrs['Mountpoint'],
                    'volume.created': volume.attrs['CreatedAt'],
                    'volume.scope': volume.attrs.get('Scope', 'local')
                }

                # ボリュームオプション
                options = volume.attrs.get('Options')
                if options:
                    volume_info.update({
                        f'volume.option.{k}': v for k, v in options.items()
                        if len(k) < 50
                    })

                # ラベル
                labels = volume.attrs.get('Labels')
                if labels:
                    volume_info.update({
                        f'volume.label.{k}': v for k, v in labels.items()
                        if len(k) < 50
                    })

                # ディスク使用量(可能な場合)
                try:
                    mountpoint = volume.attrs['Mountpoint']
                    if os.path.exists(mountpoint):
                        stat = os.statvfs(mountpoint)
                        total_bytes = stat.f_frsize * stat.f_blocks
                        # f_bavail: 非特権ユーザーが利用可能なブロック数
                        free_bytes = stat.f_frsize * stat.f_bavail
                        used_bytes = total_bytes - free_bytes
                        volume_info.update({
                            'volume.total_bytes': total_bytes,
                            'volume.used_bytes': used_bytes,
                            'volume.free_bytes': free_bytes,
                            'volume.usage_percent': round((used_bytes / total_bytes) * 100, 2) if total_bytes > 0 else 0
                        })
                except Exception:
                    # ボリューム情報取得失敗は無視
                    pass

                volume_metrics.append(volume_info)

            return volume_metrics
        except Exception as e:
            print(f"❌ Failed to collect volume metrics: {e}")
            return []

    def send_to_newrelic(self, metrics_data):
        """メトリクスをNew Relicに送信"""
        if not metrics_data:
            return
        try:
            headers = {
                'Content-Type': 'application/json',
                'X-Insert-Key': self.newrelic_insert_key
            }
            # バッチサイズで送信
            batch_size = 100
            for i in range(0, len(metrics_data), batch_size):
                batch = metrics_data[i:i+batch_size]
                response = requests.post(
                    self.insights_api,
                    headers=headers,
                    json=batch,
                    timeout=30
                )
                if response.status_code == 200:
                    print(f"✅ Sent {len(batch)} Docker metrics to New Relic")
                else:
                    print(f"❌ Failed to send batch: {response.status_code}")
        except Exception as e:
            print(f"❌ Failed to send metrics to New Relic: {e}")

    def run_monitoring(self):
        """メイン監視処理"""
        print("🐳 Starting Docker Enterprise Monitoring")
        print(f"📅 Timestamp: {datetime.now().isoformat()}")

        all_metrics = []

        # 並行してメトリクス収集
        with ThreadPoolExecutor(max_workers=6) as executor:
            futures = {
                executor.submit(self.collect_docker_info): "Docker Engine",
                executor.submit(self.collect_container_metrics): "Containers",
                executor.submit(self.collect_image_metrics): "Images",
                executor.submit(self.collect_network_metrics): "Networks",
                executor.submit(self.collect_volume_metrics): "Volumes"
            }
            for future in as_completed(futures):
                metric_type = futures[future]
                try:
                    result = future.result()
                    if result:
                        if isinstance(result, list):
                            all_metrics.extend(result)
                            print(f"✅ Collected {len(result)} {metric_type} metrics")
                        else:
                            all_metrics.append(result)
                            print(f"✅ Collected {metric_type} metrics")
                except Exception as e:
                    print(f"❌ Failed to collect {metric_type} metrics: {e}")

        # New Relicに送信
        if all_metrics:
            print(f"📤 Sending {len(all_metrics)} total metrics to New Relic...")
            self.send_to_newrelic(all_metrics)
            print("🎉 Docker monitoring completed successfully")
        else:
            print("⚠️ No metrics collected")


# メイン実行
if __name__ == "__main__":
    # 環境変数から設定取得
    NEWRELIC_INSERT_KEY = os.environ.get('NEWRELIC_INSERT_KEY', '')
    NEWRELIC_ACCOUNT_ID = os.environ.get('NEWRELIC_ACCOUNT_ID', '')
    if not all([NEWRELIC_INSERT_KEY, NEWRELIC_ACCOUNT_ID]):
        print("❌ Required environment variables not set")
        print("Please set: NEWRELIC_INSERT_KEY, NEWRELIC_ACCOUNT_ID")
        exit(1)

    monitor = DockerEnterpriseMonitor(NEWRELIC_INSERT_KEY, NEWRELIC_ACCOUNT_ID)
    try:
        monitor.run_monitoring()
    except KeyboardInterrupt:
        print("\n⏹️ Monitoring stopped by user")
    except Exception as e:
        print(f"❌ Monitoring failed: {e}")
        exit(1)
```
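上記スクリプトの中で最も間違えやすいのが CPU 使用率の計算です。式を単体の関数として切り出し、`docker stats` API の辞書構造(`cpu_stats` / `precpu_stats`)を模したサンプルで動作確認できるスケッチを示します。cgroup v2 環境では `percpu_usage` が返らないため `online_cpus` へのフォールバックを入れています(サンプルの数値は説明用の架空値です)。

```python
"""コンテナCPU使用率計算の単体スケッチ(docker stats API の構造を模したサンプル付き)"""

def cpu_percent(stats):
    """docker stats の1サンプルから CPU 使用率(%)を計算する"""
    cpu = stats["cpu_stats"]
    pre = stats["precpu_stats"]
    cpu_delta = cpu["cpu_usage"]["total_usage"] - pre["cpu_usage"]["total_usage"]
    system_delta = cpu["system_cpu_usage"] - pre["system_cpu_usage"]
    if system_delta <= 0:
        return 0.0
    # cgroup v1: percpu_usage の要素数 / cgroup v2: online_cpus を使用
    ncpu = cpu.get("online_cpus") or len(cpu["cpu_usage"].get("percpu_usage") or [1])
    return round((cpu_delta / system_delta) * ncpu * 100, 2)

# 説明用サンプル: CPUデルタ200 / システムデルタ1000 / 2コア → 40%
sample = {
    "cpu_stats": {
        "cpu_usage": {"total_usage": 400, "percpu_usage": [200, 200]},
        "system_cpu_usage": 2000,
        "online_cpus": 2,
    },
    "precpu_stats": {
        "cpu_usage": {"total_usage": 200},
        "system_cpu_usage": 1000,
    },
}
print(cpu_percent(sample))  # 40.0
```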
☸️ Kubernetes クラスター監視
🎯 エンタープライズ Kubernetes 監視戦略
🏗️ 階層化監視アーキテクチャ
```yaml
Kubernetes_Monitoring_Layers:
  Layer_1_Cluster:
    - Cluster Health & Status
    - Node Resource Utilization
    - Control Plane Components
    - Cluster-wide Metrics
  Layer_2_Namespace:
    - Resource Quotas & Limits
    - Network Policies
    - Service Discovery
    - Namespace-level Metrics
  Layer_3_Workload:
    - Deployment Status
    - ReplicaSet Health
    - Pod Lifecycle Events
    - Resource Requests/Limits
  Layer_4_Application:
    - Container Performance
    - Application Metrics
    - Custom Metrics
    - Business KPIs
  Layer_5_Security:
    - RBAC Violations
    - Pod Security Policies
    - Network Traffic Analysis
    - Compliance Monitoring
```
⚙️ Helm Chart による完全デプロイメント
```yaml
# values-enterprise.yaml
# New Relic Kubernetes統合 エンタープライズ設定

# Global設定
global:
  licenseKey: "YOUR_ENTERPRISE_LICENSE_KEY"
  cluster: "production-k8s-cluster"

  # エンタープライズ属性
  customAttributes:
    # ビジネス情報
    environment: production
    business_unit: platform
    cost_center: infrastructure
    data_classification: confidential
    # 地理・物理情報
    region: ap-northeast-1
    availability_zone: ap-northeast-1a
    datacenter: tokyo-dc1
    # 運用情報
    team: platform_engineering
    oncall_team: sre
    escalation_policy: critical_infrastructure
    maintenance_window: "02:00-04:00_JST"
    # コンプライアンス
    compliance_framework: pci_dss
    security_level: high
    audit_required: true

  # プロキシ設定(エンタープライズネットワーク)
  proxy:
    http: "http://proxy.company.com:8080"
    https: "https://proxy.company.com:8080"
    noProxy: "localhost,127.0.0.1,.company.com,.cluster.local"

  # FedRAMP・ステージング環境設定
  fedramp:
    enabled: false
  nrStaging:
    enabled: false

# Infrastructure監視
newrelic-infrastructure:
  enabled: true
  privileged: true

  # リソース制限(本番環境対応)
  resources:
    limits:
      memory: "500Mi"
      cpu: "500m"
    requests:
      memory: "150Mi"
      cpu: "100m"

  # ノード選択(監視専用ノード等)
  nodeSelector: {}

  # テイント許容
  tolerations:
    - operator: "Exists"
      effect: "NoSchedule"
    - operator: "Exists"
      effect: "NoExecute"
    - operator: "Exists"
      effect: "PreferNoSchedule"

  # Affinity設定(可用性)
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: node-role.kubernetes.io/monitoring
                operator: In
                values: ["true"]

  config:
    # 詳細ログ(トラブルシューティング用)
    verbose: 1
    log_file: "/var/log/newrelic-infra/newrelic-infra.log"
    log_format: "json"

    # カスタム属性
    custom_attributes:
      monitoring_tier: infrastructure
      data_retention: "13_months"
      sla_tier: gold

    # プロセス監視
    enable_process_metrics: true

    # 高度な収集設定
    metrics_system_sample_rate: 15s
    metrics_process_sample_rate: 20s
    metrics_network_sample_rate: 10s
    metrics_storage_sample_rate: 20s

    # セキュリティ
    strip_command_line: true
    passthrough_environment:
      - KUBERNETES_SERVICE_HOST
      - KUBERNETES_SERVICE_PORT
      - NEW_RELIC_LICENSE_KEY

# Kubernetes状態監視
kube-state-metrics:
  enabled: true

  # リソース設定
  resources:
    limits:
      memory: "200Mi"
      cpu: "200m"
    requests:
      memory: "100Mi"
      cpu: "50m"

  # メトリクスカスタマイズ
  metricLabelsAllowlist:
    - pods=[*]
    - deployments=[*]
    - services=[*]
    - ingresses=[*]
    - configmaps=[*]
    - secrets=[*]

  # セキュリティコンテキスト
  securityContext:
    runAsNonRoot: true
    runAsUser: 65534
    runAsGroup: 65534
    fsGroup: 65534

# Prometheus統合
nri-prometheus:
  enabled: true

  # リソース制限
  resources:
    limits:
      memory: "300Mi"
      cpu: "200m"
    requests:
      memory: "100Mi"
      cpu: "50m"

  config:
    # メトリクス変換
    transformations:
      - description: "CPU使用率正規化"
        rename_metric: "cpu_usage_percentage"
        ignore_metrics:
          - "go_.*"
          - "prometheus_.*"
          - "kube_pod_container_.*"
      - description: "メモリメトリクス正規化"
        rename_metric: "memory_usage_bytes"
        copy_attributes:
          - "pod"
          - "namespace"
          - "container"

    # スクレイプ設定
    scrape_configs:
      # Pod監視(annotations自動発見)
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
                - default
                - kube-system
                - monitoring
                - application
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: kubernetes_pod_name
          - source_labels: [__meta_kubernetes_pod_container_name]
            action: replace
            target_label: kubernetes_container_name

      # Service監視
      - job_name: 'kubernetes-service-endpoints'
        kubernetes_sd_configs:
          - role: endpoints
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
            action: replace
            target_label: __scheme__
            regex: (https?)
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
            action: replace
            target_label: __address__
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
          - action: labelmap
            regex: __meta_kubernetes_service_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: kubernetes_name

      # Ingress Controller監視
      - job_name: 'kubernetes-ingresses'
        kubernetes_sd_configs:
          - role: ingress
        relabel_configs:
          - source_labels: [__meta_kubernetes_ingress_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_ingress_annotation_prometheus_io_scheme]
            action: replace
            target_label: __scheme__
            regex: (https?)

# ログ収集
newrelic-logging:
  enabled: true

  # リソース制限
  resources:
    limits:
      memory: "500Mi"
      cpu: "500m"
    requests:
      memory: "128Mi"
      cpu: "50m"

  # Fluent Bit設定
  fluentBit:
    image:
      repository: newrelic/newrelic-fluentbit-output
      tag: "1.19.2"
    config:
      # 入力設定
      inputs: |
        [INPUT]
            Name              tail
            Path              /var/log/containers/*.log
            Parser            cri
            Tag               kube.*
            Mem_Buf_Limit     50MB
            Skip_Long_Lines   On
            Skip_Empty_Lines  On
            Refresh_Interval  10
            Rotate_Wait       30
            storage.type      filesystem
            storage.pause_on_chunks_overlimit off

        [INPUT]
            Name            systemd
            Tag             host.*
            Systemd_Filter  _SYSTEMD_UNIT=kubelet.service
            Systemd_Filter  _SYSTEMD_UNIT=docker.service
            Systemd_Filter  _SYSTEMD_UNIT=containerd.service
            Max_Entries     1000
            Read_From_Tail  On

      # フィルター設定
      filters: |
        [FILTER]
            Name                kubernetes
            Match               kube.*
            Kube_URL            https://kubernetes.default.svc:443
            Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
            Kube_Tag_Prefix     kube.var.log.containers.
            Merge_Log           On
            Merge_Log_Key       log_processed
            K8S-Logging.Parser  On
            K8S-Logging.Exclude Off
            Annotations         Off
            Labels              On

        [FILTER]
            Name   modify
            Match  *
            Add    cluster_name production-k8s-cluster
            Add    environment production
            Add    log_type kubernetes

        [FILTER]
            Name     grep
            Match    kube.*
            Exclude  log ^\s*$

        [FILTER]
            Name    record_modifier
            Match   *
            Record  hostname ${HOSTNAME}
            Record  datacenter tokyo-dc1

      # 出力設定
      outputs: |
        [OUTPUT]
            Name           newrelic
            Match          *
            licenseKey     ${LICENSE_KEY}
            endpoint       https://log-api.newrelic.com/log/v1
            maxBufferSize  256000
            maxRecords     1024

        [OUTPUT]
            Name    stdout
            Match   *
            Format  json_lines

    # ボリューム設定
    volumeMounts:
      - name: varlog
        mountPath: /var/log
        readOnly: true
      - name: varlibdockercontainers
        mountPath: /var/lib/docker/containers
        readOnly: true
    volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers

# Kubernetesイベント監視
kubeEvents:
  enabled: true
  resources:
    limits:
      memory: "128Mi"
      cpu: "100m"
    requests:
      memory: "64Mi"
      cpu: "50m"
  config:
    clusterName: production-k8s-cluster
    sinks:
      - name: newRelicInfra
        config:
          agentEndpoint: http://localhost:8001/v1/data
          clusterName: production-k8s-cluster
      - name: stdout
        config: {}

# Pixie統合(高度なオブザーバビリティ)
newrelic-pixie:
  enabled: true
  # Pixie APIキー
  apikey: "YOUR_PIXIE_API_KEY"
  # デプロイ設定
  deployKey: "YOUR_PIXIE_DEPLOY_KEY"
  clusterName: production-k8s-cluster

# Pixie設定
pixieChart:
  enabled: true
  deployKey: "YOUR_PIXIE_DEPLOY_KEY"
  clusterName: production-k8s-cluster
  # PEM(Pixie Edge Module)設定
  pemMemoryLimit: "2Gi"
  pemMemoryRequest: "1Gi"

# メタデータインジェクション
newrelic-metadata-injection:
  enabled: true
  # 対象ラベルセレクタ
  labelSelector:
    environment: "production"
    "app.kubernetes.io/managed-by": "Helm"
  # 注入設定
  injectMetadata: true
  # New Relic環境変数注入
  env:
    - name: NEW_RELIC_METADATA_KUBERNETES_CLUSTER_NAME
      value: "production-k8s-cluster"
    - name: NEW_RELIC_METADATA_KUBERNETES_NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
    - name: NEW_RELIC_METADATA_KUBERNETES_NAMESPACE_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.namespace
    - name: NEW_RELIC_METADATA_KUBERNETES_POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: NEW_RELIC_METADATA_KUBERNETES_CONTAINER_NAME
      value: ""  # コンテナ名は実行時に設定

# セキュリティ設定
security:
  # Pod Security Policy
  podSecurityPolicy:
    enabled: true
  # Network Policy
  networkPolicy:
    enabled: true
    ingress:
      - from:
          - namespaceSelector:
              matchLabels:
                name: monitoring
        ports:
          - protocol: TCP
            port: 8080
          - protocol: TCP
            port: 9090

# 監視対象ネームスペース
namespacesToMonitor:
  - default
  - kube-system
  - kube-public
  - monitoring
  - application
  - database
  - ingress-nginx
  - cert-manager

# カスタムリソース監視
customResources:
  enabled: true
  config: |
    customResourceRules:
      - groupVersionKind:
          group: "argoproj.io"
          version: "v1alpha1"
          kind: "Application"
        metricNamePrefix: "argocd"
      - groupVersionKind:
          group: "networking.istio.io"
          version: "v1beta1"
          kind: "VirtualService"
        metricNamePrefix: "istio"
      - groupVersionKind:
          group: "security.istio.io"
          version: "v1beta1"
          kind: "PeerAuthentication"
        metricNamePrefix: "istio_security"
```
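`helm install` の前に、values ファイルのプレースホルダー(`YOUR_ENTERPRISE_LICENSE_KEY` など)を置き換え忘れていないか機械的に確認しておくと安全です。YAML を読み込んだ後の辞書を検証する最小スケッチを示します(検証対象は `global.licenseKey` / `global.cluster` のみで、プレースホルダーの接頭辞 `YOUR_` はこの例での仮定です)。

```python
"""values-enterprise.yaml の必須項目チェック(最小スケッチ)"""

PLACEHOLDER_PREFIX = "YOUR_"  # 本節のサンプル値に合わせた仮定

def validate_global(values):
    """global.licenseKey / global.cluster が実値で埋まっているか検証し、問題点を返す"""
    problems = []
    g = values.get("global", {})
    for key in ("licenseKey", "cluster"):
        v = g.get(key, "")
        if not v:
            problems.append(f"global.{key} is empty")
        elif v.startswith(PLACEHOLDER_PREFIX):
            problems.append(f"global.{key} is still a placeholder")
    return problems

if __name__ == "__main__":
    # PyYAML 等で values-enterprise.yaml を読み込んだ結果を渡す想定
    sample = {"global": {"licenseKey": "YOUR_ENTERPRISE_LICENSE_KEY",
                         "cluster": "production-k8s-cluster"}}
    print(validate_global(sample))  # ['global.licenseKey is still a placeholder']
```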
🔧 Kubernetes 監視スクリプト
python
#!/usr/bin/env python3
"""
Kubernetes エンタープライズ監視スクリプト
クラスター・ノード・Pod・サービス・リソースの包括監視
"""
from kubernetes import client, config
import requests
import time
from datetime import datetime, timezone
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

class KubernetesEnterpriseMonitor:
    def __init__(self, newrelic_insert_key, newrelic_account_id, kubeconfig_path=None):
        self.newrelic_insert_key = newrelic_insert_key
        self.newrelic_account_id = newrelic_account_id
        self.insights_api = f"https://insights-collector.newrelic.com/v1/accounts/{newrelic_account_id}/events"

        # Kubernetes設定読み込み
        try:
            if kubeconfig_path:
                config.load_kube_config(config_file=kubeconfig_path)
            else:
                # クラスター内実行時は自動設定
                try:
                    config.load_incluster_config()
                    print("✅ Loaded in-cluster Kubernetes config")
                except config.ConfigException:
                    config.load_kube_config()
                    print("✅ Loaded local Kubernetes config")
        except Exception as e:
            print(f"❌ Failed to load Kubernetes config: {e}")
            raise

        # APIクライアント初期化
        self.core_v1 = client.CoreV1Api()
        self.apps_v1 = client.AppsV1Api()
        self.networking_v1 = client.NetworkingV1Api()
        self.storage_v1 = client.StorageV1Api()
        self.metrics_v1beta1 = client.CustomObjectsApi()
        self.cluster_name = os.environ.get('CLUSTER_NAME', 'unknown-cluster')
        self.environment = os.environ.get('ENVIRONMENT', 'production')
    def collect_cluster_info(self):
        """クラスター基本情報収集"""
        try:
            # Kubernetes バージョン
            version_info = client.VersionApi().get_code()
            # ノードリスト
            nodes = self.core_v1.list_node()
            # ネームスペースリスト
            namespaces = self.core_v1.list_namespace()
            # サービスアカウント数
            service_accounts = self.core_v1.list_service_account_for_all_namespaces()

            # 基本統計
            cluster_info = {
                'eventType': 'KubernetesClusterMetrics',
                'timestamp': int(time.time()),
                'cluster.name': self.cluster_name,
                'environment': self.environment,
                'cluster.version': version_info.git_version,
                'cluster.major_version': version_info.major,
                'cluster.minor_version': version_info.minor,
                'cluster.platform': version_info.platform,
                'cluster.go_version': version_info.go_version,
                'cluster.git_commit': version_info.git_commit,
                'cluster.build_date': version_info.build_date,
                'cluster.nodes_total': len(nodes.items),
                'cluster.namespaces_total': len(namespaces.items),
                'cluster.service_accounts_total': len(service_accounts.items)
            }

            # ノード統計
            ready_nodes = 0
            not_ready_nodes = 0
            master_nodes = 0
            worker_nodes = 0
            for node in nodes.items:
                # 状態確認
                for condition in node.status.conditions:
                    if condition.type == "Ready":
                        if condition.status == "True":
                            ready_nodes += 1
                        else:
                            not_ready_nodes += 1
                        break
                # ロール確認
                labels = node.metadata.labels or {}
                if 'node-role.kubernetes.io/master' in labels or 'node-role.kubernetes.io/control-plane' in labels:
                    master_nodes += 1
                else:
                    worker_nodes += 1

            cluster_info.update({
                'cluster.nodes_ready': ready_nodes,
                'cluster.nodes_not_ready': not_ready_nodes,
                'cluster.nodes_master': master_nodes,
                'cluster.nodes_worker': worker_nodes
            })
            return cluster_info
        except Exception as e:
            print(f"❌ Failed to collect cluster info: {e}")
            return None
    def collect_node_metrics(self):
        """ノードメトリクス収集"""
        try:
            nodes = self.core_v1.list_node()
            node_metrics = []
            for node in nodes.items:
                node_info = {
                    'eventType': 'KubernetesNodeMetrics',
                    'timestamp': int(time.time()),
                    'cluster.name': self.cluster_name,
                    'environment': self.environment,
                    'node.name': node.metadata.name,
                    'node.creation_timestamp': node.metadata.creation_timestamp.isoformat() if node.metadata.creation_timestamp else '',
                    'node.uid': node.metadata.uid
                }

                # ラベル情報
                labels = node.metadata.labels or {}
                node_info.update({
                    'node.os': labels.get('kubernetes.io/os', 'unknown'),
                    'node.arch': labels.get('kubernetes.io/arch', 'unknown'),
                    'node.instance_type': labels.get('node.kubernetes.io/instance-type', 'unknown'),
                    'node.zone': labels.get('topology.kubernetes.io/zone', 'unknown'),
                    'node.region': labels.get('topology.kubernetes.io/region', 'unknown'),
                    'node.role': 'master' if any(k.startswith('node-role.kubernetes.io/master') or k.startswith('node-role.kubernetes.io/control-plane') for k in labels.keys()) else 'worker'
                })

                # システム情報
                system_info = node.status.node_info
                node_info.update({
                    'node.kernel_version': system_info.kernel_version,
                    'node.os_image': system_info.os_image,
                    'node.container_runtime': system_info.container_runtime_version,
                    'node.kubelet_version': system_info.kubelet_version,
                    'node.kube_proxy_version': system_info.kube_proxy_version,
                    'node.machine_id': system_info.machine_id,
                    'node.system_uuid': system_info.system_uuid,
                    'node.boot_id': system_info.boot_id
                })

                # リソース情報
                if node.status.capacity:
                    capacity = node.status.capacity
                    node_info.update({
                        'node.capacity_cpu_cores': self._parse_cpu(capacity.get('cpu', '0')),
                        'node.capacity_memory_bytes': self._parse_memory(capacity.get('memory', '0')),
                        'node.capacity_pods': int(capacity.get('pods', '0')),
                        'node.capacity_ephemeral_storage_bytes': self._parse_memory(capacity.get('ephemeral-storage', '0'))
                    })
                if node.status.allocatable:
                    allocatable = node.status.allocatable
                    node_info.update({
                        'node.allocatable_cpu_cores': self._parse_cpu(allocatable.get('cpu', '0')),
                        'node.allocatable_memory_bytes': self._parse_memory(allocatable.get('memory', '0')),
                        'node.allocatable_pods': int(allocatable.get('pods', '0')),
                        'node.allocatable_ephemeral_storage_bytes': self._parse_memory(allocatable.get('ephemeral-storage', '0'))
                    })

                # ステータス情報
                conditions_status = {}
                for condition in node.status.conditions or []:
                    conditions_status[f'node.condition_{condition.type.lower()}'] = condition.status == "True"
                node_info.update(conditions_status)

                # テイント情報
                if node.spec.taints:
                    node_info['node.taints_count'] = len(node.spec.taints)
                    taint_effects = [taint.effect for taint in node.spec.taints]
                    node_info['node.taint_effects'] = list(set(taint_effects))

                # アドレス情報
                addresses = {}
                for address in node.status.addresses or []:
                    addresses[f'node.address_{address.type.lower()}'] = address.address
                node_info.update(addresses)

                # Pod統計(このノード上のPod数)
                try:
                    pods = self.core_v1.list_pod_for_all_namespaces(field_selector=f'spec.nodeName={node.metadata.name}')
                    running_pods = sum(1 for pod in pods.items if pod.status.phase == 'Running')
                    pending_pods = sum(1 for pod in pods.items if pod.status.phase == 'Pending')
                    failed_pods = sum(1 for pod in pods.items if pod.status.phase == 'Failed')
                    node_info.update({
                        'node.pods_total': len(pods.items),
                        'node.pods_running': running_pods,
                        'node.pods_pending': pending_pods,
                        'node.pods_failed': failed_pods
                    })
                except Exception as e:
                    print(f"⚠️ Failed to get pod stats for node {node.metadata.name}: {e}")

                node_metrics.append(node_info)
            return node_metrics
        except Exception as e:
            print(f"❌ Failed to collect node metrics: {e}")
            return []
    def collect_pod_metrics(self):
        """Podメトリクス収集"""
        try:
            pods = self.core_v1.list_pod_for_all_namespaces()
            pod_metrics = []

            # 大量のPodがある場合はサンプリング
            if len(pods.items) > 1000:
                print(f"⚠️ Large number of pods ({len(pods.items)}), sampling first 1000")
                pods.items = pods.items[:1000]

            for pod in pods.items:
                pod_info = {
                    'eventType': 'KubernetesPodMetrics',
                    'timestamp': int(time.time()),
                    'cluster.name': self.cluster_name,
                    'environment': self.environment,
                    'pod.name': pod.metadata.name,
                    'pod.namespace': pod.metadata.namespace,
                    'pod.uid': pod.metadata.uid,
                    'pod.node_name': pod.spec.node_name or 'unscheduled',
                    'pod.phase': pod.status.phase,
                    'pod.creation_timestamp': pod.metadata.creation_timestamp.isoformat() if pod.metadata.creation_timestamp else '',
                    'pod.restart_policy': pod.spec.restart_policy
                }

                # ラベル情報(重要なもののみ)
                labels = pod.metadata.labels or {}
                important_labels = ['app', 'version', 'tier', 'component', 'app.kubernetes.io/name', 'app.kubernetes.io/version']
                for label in important_labels:
                    if label in labels:
                        safe_key = label.replace('.', '_').replace('/', '_')
                        pod_info[f'pod.label_{safe_key}'] = labels[label]

                # オーナー参照
                if pod.metadata.owner_references:
                    owner_ref = pod.metadata.owner_references[0]
                    pod_info.update({
                        'pod.owner_kind': owner_ref.kind,
                        'pod.owner_name': owner_ref.name,
                        'pod.owner_uid': owner_ref.uid
                    })

                # コンテナ情報
                containers = pod.spec.containers or []
                pod_info['pod.containers_count'] = len(containers)
                if containers:
                    # メインコンテナ(最初のコンテナ)情報
                    main_container = containers[0]
                    pod_info.update({
                        'pod.main_container_name': main_container.name,
                        'pod.main_container_image': main_container.image
                    })
                    # リソース要求・制限
                    if main_container.resources:
                        if main_container.resources.requests:
                            reqs = main_container.resources.requests  # requestsモジュール名との衝突を回避
                            if 'cpu' in reqs:
                                pod_info['pod.cpu_request'] = self._parse_cpu(reqs['cpu'])
                            if 'memory' in reqs:
                                pod_info['pod.memory_request_bytes'] = self._parse_memory(reqs['memory'])
                        if main_container.resources.limits:
                            limits = main_container.resources.limits
                            if 'cpu' in limits:
                                pod_info['pod.cpu_limit'] = self._parse_cpu(limits['cpu'])
                            if 'memory' in limits:
                                pod_info['pod.memory_limit_bytes'] = self._parse_memory(limits['memory'])

                # ステータス詳細
                if pod.status.conditions:
                    for condition in pod.status.conditions:
                        condition_key = f'pod.condition_{condition.type.lower()}'
                        pod_info[condition_key] = condition.status == "True"

                # コンテナステータス
                if pod.status.container_statuses:
                    ready_containers = sum(1 for cs in pod.status.container_statuses if cs.ready)
                    restart_count = sum(cs.restart_count for cs in pod.status.container_statuses)
                    pod_info.update({
                        'pod.containers_ready': ready_containers,
                        'pod.containers_total': len(pod.status.container_statuses),
                        'pod.restart_count_total': restart_count
                    })

                # ネットワーク情報
                if pod.status.pod_ip:
                    pod_info['pod.ip'] = pod.status.pod_ip
                if pod.status.host_ip:
                    pod_info['pod.host_ip'] = pod.status.host_ip

                # QoSクラス
                pod_info['pod.qos_class'] = pod.status.qos_class or 'BestEffort'

                # 年齢計算
                if pod.metadata.creation_timestamp:
                    age_seconds = (datetime.now(timezone.utc) - pod.metadata.creation_timestamp).total_seconds()
                    pod_info['pod.age_seconds'] = int(age_seconds)

                pod_metrics.append(pod_info)
            return pod_metrics
        except Exception as e:
            print(f"❌ Failed to collect pod metrics: {e}")
            return []
    def collect_deployment_metrics(self):
        """Deploymentメトリクス収集"""
        try:
            deployments = self.apps_v1.list_deployment_for_all_namespaces()
            deployment_metrics = []
            for deployment in deployments.items:
                deployment_info = {
                    'eventType': 'KubernetesDeploymentMetrics',
                    'timestamp': int(time.time()),
                    'cluster.name': self.cluster_name,
                    'environment': self.environment,
                    'deployment.name': deployment.metadata.name,
                    'deployment.namespace': deployment.metadata.namespace,
                    'deployment.uid': deployment.metadata.uid,
                    'deployment.generation': deployment.metadata.generation,
                    'deployment.creation_timestamp': deployment.metadata.creation_timestamp.isoformat() if deployment.metadata.creation_timestamp else ''
                }

                # ラベル情報
                labels = deployment.metadata.labels or {}
                important_labels = ['app', 'version', 'tier', 'component']
                for label in important_labels:
                    if label in labels:
                        deployment_info[f'deployment.label_{label}'] = labels[label]

                # スペック情報
                spec = deployment.spec
                deployment_info.update({
                    'deployment.replicas_desired': spec.replicas or 0,
                    'deployment.strategy_type': spec.strategy.type if spec.strategy else 'RollingUpdate'
                })

                # ローリングアップデート設定
                if spec.strategy and spec.strategy.rolling_update:
                    ru = spec.strategy.rolling_update
                    deployment_info.update({
                        'deployment.max_unavailable': str(ru.max_unavailable) if ru.max_unavailable else '25%',
                        'deployment.max_surge': str(ru.max_surge) if ru.max_surge else '25%'
                    })

                # ステータス情報
                status = deployment.status
                deployment_info.update({
                    'deployment.replicas_available': status.available_replicas or 0,
                    'deployment.replicas_ready': status.ready_replicas or 0,
                    'deployment.replicas_updated': status.updated_replicas or 0,
                    'deployment.replicas_unavailable': status.unavailable_replicas or 0,
                    'deployment.observed_generation': status.observed_generation or 0
                })

                # ヘルス状態判定
                desired = spec.replicas or 0
                available = status.available_replicas or 0
                deployment_info['deployment.health_status'] = 'healthy' if available == desired and desired > 0 else 'unhealthy'
                deployment_info['deployment.availability_percentage'] = round((available / desired) * 100, 2) if desired > 0 else 0

                # 条件情報
                if status.conditions:
                    for condition in status.conditions:
                        condition_key = f'deployment.condition_{condition.type.lower()}'
                        deployment_info[condition_key] = condition.status == "True"
                        if condition.type == 'Progressing' and condition.reason:
                            deployment_info['deployment.progress_reason'] = condition.reason

                # 年齢計算
                if deployment.metadata.creation_timestamp:
                    age_seconds = (datetime.now(timezone.utc) - deployment.metadata.creation_timestamp).total_seconds()
                    deployment_info['deployment.age_seconds'] = int(age_seconds)

                deployment_metrics.append(deployment_info)
            return deployment_metrics
        except Exception as e:
            print(f"❌ Failed to collect deployment metrics: {e}")
            return []
    def collect_service_metrics(self):
        """Serviceメトリクス収集"""
        try:
            services = self.core_v1.list_service_for_all_namespaces()
            service_metrics = []
            for service in services.items:
                service_info = {
                    'eventType': 'KubernetesServiceMetrics',
                    'timestamp': int(time.time()),
                    'cluster.name': self.cluster_name,
                    'environment': self.environment,
                    'service.name': service.metadata.name,
                    'service.namespace': service.metadata.namespace,
                    'service.uid': service.metadata.uid,
                    'service.type': service.spec.type,
                    'service.creation_timestamp': service.metadata.creation_timestamp.isoformat() if service.metadata.creation_timestamp else ''
                }

                # ラベル・セレクター
                labels = service.metadata.labels or {}
                if 'app' in labels:
                    service_info['service.app'] = labels['app']
                if service.spec.selector:
                    service_info['service.selector_count'] = len(service.spec.selector)

                # ポート情報
                ports = service.spec.ports or []
                service_info['service.ports_count'] = len(ports)
                if ports:
                    service_info['service.main_port'] = ports[0].port
                    service_info['service.main_protocol'] = ports[0].protocol
                    if ports[0].target_port:
                        service_info['service.main_target_port'] = str(ports[0].target_port)

                # IP情報
                service_info['service.cluster_ip'] = service.spec.cluster_ip or 'None'
                if service.spec.type == 'LoadBalancer':
                    if service.status.load_balancer and service.status.load_balancer.ingress:
                        lb_ingress = service.status.load_balancer.ingress[0]
                        service_info['service.load_balancer_ip'] = lb_ingress.ip or lb_ingress.hostname or 'pending'
                    else:
                        service_info['service.load_balancer_ip'] = 'pending'
                if service.spec.type == 'NodePort':
                    if ports and ports[0].node_port:
                        service_info['service.node_port'] = ports[0].node_port

                # エンドポイント確認
                try:
                    endpoints = self.core_v1.read_namespaced_endpoints(
                        name=service.metadata.name,
                        namespace=service.metadata.namespace
                    )
                    total_addresses = 0
                    if endpoints.subsets:
                        for subset in endpoints.subsets:
                            if subset.addresses:
                                total_addresses += len(subset.addresses)
                    service_info['service.endpoints_ready'] = total_addresses
                    service_info['service.health_status'] = 'healthy' if total_addresses > 0 else 'unhealthy'
                except Exception:
                    service_info['service.endpoints_ready'] = 0
                    service_info['service.health_status'] = 'unknown'

                # 年齢計算
                if service.metadata.creation_timestamp:
                    age_seconds = (datetime.now(timezone.utc) - service.metadata.creation_timestamp).total_seconds()
                    service_info['service.age_seconds'] = int(age_seconds)

                service_metrics.append(service_info)
            return service_metrics
        except Exception as e:
            print(f"❌ Failed to collect service metrics: {e}")
            return []
    def _parse_cpu(self, cpu_str):
        """CPU文字列を数値に変換(コア数)"""
        if not cpu_str:
            return 0
        cpu_str = str(cpu_str)
        if cpu_str.endswith('m'):
            return float(cpu_str[:-1]) / 1000
        elif cpu_str.endswith('u'):
            return float(cpu_str[:-1]) / 1000000
        elif cpu_str.endswith('n'):
            # metrics API が返すナノコア表記にも対応
            return float(cpu_str[:-1]) / 1000000000
        else:
            return float(cpu_str)

    def _parse_memory(self, memory_str):
        """メモリ文字列をバイト数に変換"""
        if not memory_str:
            return 0
        memory_str = str(memory_str)
        # 二進接頭辞(Ki など)を十進接頭辞(K など)より先に判定する
        multipliers = {
            'Ki': 1024,
            'Mi': 1024**2,
            'Gi': 1024**3,
            'Ti': 1024**4,
            'Pi': 1024**5,
            'K': 1000,
            'M': 1000**2,
            'G': 1000**3,
            'T': 1000**4,
            'P': 1000**5
        }
        for suffix, multiplier in multipliers.items():
            if memory_str.endswith(suffix):
                return int(float(memory_str[:-len(suffix)]) * multiplier)
        return int(float(memory_str))
    def send_to_newrelic(self, metrics_data):
        """メトリクスをNew Relicに送信"""
        if not metrics_data:
            return
        try:
            headers = {
                'Content-Type': 'application/json',
                'X-Insert-Key': self.newrelic_insert_key
            }
            # バッチサイズで送信
            batch_size = 100
            for i in range(0, len(metrics_data), batch_size):
                batch = metrics_data[i:i+batch_size]
                response = requests.post(
                    self.insights_api,
                    headers=headers,
                    json=batch,
                    timeout=30
                )
                if response.status_code == 200:
                    print(f"✅ Sent {len(batch)} Kubernetes metrics to New Relic")
                else:
                    print(f"❌ Failed to send batch: {response.status_code}")
        except Exception as e:
            print(f"❌ Failed to send metrics to New Relic: {e}")
    def run_monitoring(self):
        """メイン監視処理"""
        print("☸️ Starting Kubernetes Enterprise Monitoring")
        print(f"🎯 Cluster: {self.cluster_name}")
        print(f"🌍 Environment: {self.environment}")
        print(f"📅 Timestamp: {datetime.now(timezone.utc).isoformat()}")

        all_metrics = []
        # 並行してメトリクス収集
        with ThreadPoolExecutor(max_workers=5) as executor:
            futures = {
                executor.submit(self.collect_cluster_info): "Cluster Info",
                executor.submit(self.collect_node_metrics): "Nodes",
                executor.submit(self.collect_pod_metrics): "Pods",
                executor.submit(self.collect_deployment_metrics): "Deployments",
                executor.submit(self.collect_service_metrics): "Services"
            }
            for future in as_completed(futures):
                metric_type = futures[future]
                try:
                    result = future.result()
                    if result:
                        if isinstance(result, list):
                            all_metrics.extend(result)
                            print(f"✅ Collected {len(result)} {metric_type} metrics")
                        else:
                            all_metrics.append(result)
                            print(f"✅ Collected {metric_type} metrics")
                except Exception as e:
                    print(f"❌ Failed to collect {metric_type} metrics: {e}")

        # New Relicに送信
        if all_metrics:
            print(f"📤 Sending {len(all_metrics)} total metrics to New Relic...")
            self.send_to_newrelic(all_metrics)
            print("🎉 Kubernetes monitoring completed successfully")
        else:
            print("⚠️ No metrics collected")
# メイン実行
if __name__ == "__main__":
    # 環境変数から設定取得(os はファイル冒頭でインポート済み)
    NEWRELIC_INSERT_KEY = os.environ.get('NEWRELIC_INSERT_KEY', '')
    NEWRELIC_ACCOUNT_ID = os.environ.get('NEWRELIC_ACCOUNT_ID', '')
    KUBECONFIG_PATH = os.environ.get('KUBECONFIG', None)

    if not all([NEWRELIC_INSERT_KEY, NEWRELIC_ACCOUNT_ID]):
        print("❌ Required environment variables not set")
        print("Please set: NEWRELIC_INSERT_KEY, NEWRELIC_ACCOUNT_ID")
        exit(1)

    monitor = KubernetesEnterpriseMonitor(
        NEWRELIC_INSERT_KEY,
        NEWRELIC_ACCOUNT_ID,
        KUBECONFIG_PATH
    )
    try:
        monitor.run_monitoring()
    except KeyboardInterrupt:
        print("\n⏹️ Monitoring stopped by user")
    except Exception as e:
        print(f"❌ Monitoring failed: {e}")
        exit(1)
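スクリプト内の _parse_cpu / _parse_memory が扱う Kubernetes の Quantity 表記("250m"、"128Mi" など)の変換規則は、次のように単体関数へ切り出して動作確認できます(関数名 parse_cpu / parse_memory はこの例示用の仮の名前です)。二進接頭辞(Ki/Mi/…)を十進接頭辞(K/M/…)より先に判定する点がポイントです。

```python
def parse_cpu(cpu_str):
    """"250m" -> 0.25 コア、"2" -> 2.0 コアに変換する(例示)"""
    s = str(cpu_str or '0')
    if s.endswith('m'):
        return float(s[:-1]) / 1000
    return float(s)

def parse_memory(memory_str):
    """"128Mi" などの表記をバイト数に変換する(例示)"""
    s = str(memory_str or '0')
    # Ki/Mi/... を K/M/... より先に並べることで "128Mi" が "M" に誤マッチしない
    multipliers = {'Ki': 1024, 'Mi': 1024**2, 'Gi': 1024**3, 'Ti': 1024**4,
                   'K': 1000, 'M': 1000**2, 'G': 1000**3, 'T': 1000**4}
    for suffix, mult in multipliers.items():
        if s.endswith(suffix):
            return int(float(s[:-len(suffix)]) * mult)
    return int(float(s))

print(parse_cpu('250m'))      # → 0.25
print(parse_memory('128Mi'))  # → 134217728
print(parse_memory('1G'))     # → 1000000000
```

ノードの capacity / allocatable と Pod の requests / limits を同じ単位(コア数・バイト数)に正規化しておくことで、New Relic 側で使用率や割り当て超過を単純な割り算で計算できます。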
✅ 4.3セクション完了チェック
🎯 学習目標達成確認
本セクションを完了した時点で、以下ができるようになっているかチェックしてください:
🐳 Docker コンテナ監視
- [ ] エンタープライズ Docker 環境の包括的監視設定ができる
- [ ] コンテナライフサイクルとパフォーマンス監視を実装できる
- [ ] Docker Compose による統合監視環境を構築できる
- [ ] セキュリティ・コンプライアンス監視を追加できる
☸️ Kubernetes クラスター監視
- [ ] 大規模Kubernetesクラスターの階層監視を実装できる
- [ ] Helm Chart による完全デプロイメントができる
- [ ] クラスター・ノード・Pod・サービスの詳細監視ができる
- [ ] マイクロサービス間の依存関係を可視化できる
🏢 エンタープライズ機能
- [ ] マルチクラスター環境の統合管理ができる
- [ ] GitOps・DevSecOpsパイプラインと連携できる
- [ ] コスト最適化とリソース効率化を実現できる
- [ ] セキュリティポリシーの自動化ができる
🚀 次のステップ
コンテナ・Kubernetes監視をマスターしたら、次のセクションに進みましょう:
- 4.4 ネットワーク・セキュリティ監視 - セキュリティ強化と脅威検出
📖 セクション内ナビゲーション
🔗 第4章内リンク
- 🏠 第4章メイン - 章全体の概要
- 🔍 4.1 Infrastructure監視基礎 - 基礎概念
- 🖥️ 4.2 サーバー・クラウド監視 - 前のセクション
- 🔒 4.4 セキュリティ監視 - 次のセクション
- 🤖 4.5 自動化・IaC - 運用自動化
- 📊 4.6 運用戦略 - エンタープライズ運用
📚 関連章リンク
- 第3章:New Relic機能 - プラットフォーム機能の理解
- 第5章:New Relic APM - アプリケーション監視
🎯 次のステップ: 4.4 ネットワーク・セキュリティ監視で、セキュリティ強化と脅威検出の実装を学習しましょう!