Monitoring & Observability
Comprehensive monitoring and observability for NeuraScale across all environments.
Overview
NeuraScale implements a multi-layered observability strategy:
- Application Monitoring: Performance metrics, error tracking, and distributed tracing 🔧 Beta
- Infrastructure Monitoring: GCP resource usage, Kubernetes health, and system metrics 🚀 Coming Soon
- Business Monitoring: User activity, neural session analytics, and compliance auditing 📅 Planned
- Security Monitoring: Access logs, threat detection, and compliance events 🚀 Coming Soon
Monitoring Platform Status
| Component | Status | Details |
| --- | --- | --- |
| Prometheus Metrics | ✓ Available | Deployed via Helm, collecting service metrics |
| Application Logging | ✓ Available | Structured JSON logging with correlation IDs |
| Grafana Dashboards | 🔧 Beta | Basic dashboards, custom ones in development |
| GCP Monitoring | 🚀 Coming Soon | Cloud deployment metrics pending |
| Distributed Tracing | 📅 Planned | OpenTelemetry implementation planned |
| Alerting | 📅 Planned | PagerDuty integration planned |
Stack Components
- Google Cloud Monitoring: Infrastructure metrics, logs, and alerts 🚀 Coming Soon
- Prometheus + Grafana: Application metrics and custom dashboards ✓ Available
- OpenTelemetry: Distributed tracing and instrumentation 📅 Planned
- Sentry: Error tracking and performance monitoring 📅 Planned
Prometheus Deployment
Prometheus is actively deployed in our Neural Engine infrastructure via Helm charts and is collecting metrics from all services.
Prometheus Configuration
NeuraScale uses Prometheus for metrics collection from all Neural Engine services. The deployment is managed through Helm charts in the neural-engine/helm directory.
Service Endpoints
All Neural Engine services expose metrics endpoints:
- API Gateway: Port 9092 at /metrics
- Device Manager: Port 9091 at /metrics
- Signal Processor: Port 8080 at /metrics
- ML Pipeline: Port 9093 at /metrics
- MCP Server: Port 9094 at /metrics
Metrics Collected
- HTTP request rates and latencies
- Processing queue depths
- Device connection status
- Signal quality metrics
- Resource utilization (CPU, memory, GPU)
- Error rates and types
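The queue, device, and quality metrics above map naturally onto prometheus_client primitives. The sketch below is illustrative only; the metric and label names are hypothetical rather than the exact series exported by each service.
# Illustrative sketch: hypothetical prometheus_client definitions for the
# metric categories listed above (names and labels are not the real ones).
from prometheus_client import Counter, Gauge
processing_queue_depth = Gauge(
    "neural_processing_queue_depth",
    "Items waiting in a processing queue",
    ["queue_name"],
)
device_connection_status = Gauge(
    "neural_device_connected",
    "1 if the device is connected, 0 otherwise",
    ["device_id", "device_type"],
)
signal_quality_score = Gauge(
    "neural_signal_quality_score",
    "Latest signal quality score (0.0-1.0)",
    ["device_id", "channel"],
)
processing_errors = Counter(
    "neural_processing_errors_total",
    "Processing errors by type",
    ["error_type"],
)
# Example updates from inside a service loop:
processing_queue_depth.labels(queue_name="feature_extraction").set(12)
device_connection_status.labels(device_id="dev-001", device_type="EEG").set(1)
processing_errors.labels(error_type="timeout").inc()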
Grafana Dashboards
Basic dashboards are available for:
- Service health overview
- API performance metrics
- Neural processing pipeline status
- Resource utilization trends
Application Monitoring
Metrics Collection
FastAPI Metrics
# neural-engine/src/monitoring/api_metrics.py
from prometheus_client import Counter, Histogram, Gauge, generate_latest
from fastapi import FastAPI, Request, Response
from fastapi.responses import PlainTextResponse
import json
import time
from datetime import datetime
from typing import Callable
# Define metrics
http_requests_total = Counter(
'http_requests_total',
'Total HTTP requests',
['method', 'endpoint', 'status']
)
http_request_duration = Histogram(
'http_request_duration_seconds',
'HTTP request duration in seconds',
['method', 'endpoint']
)
active_requests = Gauge(
'http_requests_active',
'Active HTTP requests'
)
class PrometheusMiddleware:
"""Middleware to collect Prometheus metrics"""
def __init__(self, app: FastAPI):
self.app = app
async def __call__(self, request: Request, call_next: Callable):
# Track active requests
active_requests.inc()
# Record start time
start_time = time.time()
# Process request
response = await call_next(request)
# Calculate duration
duration = time.time() - start_time
# Record metrics
http_requests_total.labels(
method=request.method,
endpoint=request.url.path,
status=response.status_code
).inc()
http_request_duration.labels(
method=request.method,
endpoint=request.url.path
).observe(duration)
# Decrement active requests
active_requests.dec()
return response
# Add metrics endpoint
@app.get("/metrics", response_class=PlainTextResponse)
async def metrics():
"""Prometheus metrics endpoint"""
return generate_latest()
# Health check with detailed status
@app.get("/health")
async def health_check():
"""Comprehensive health check"""
health_status = {
"status": "healthy",
"timestamp": datetime.utcnow().isoformat(),
"checks": {
"database": await check_database_health(),
"cache": await check_cache_health(),
"gpu": check_gpu_health(),
"disk_space": check_disk_space()
}
}
# Determine overall health
if any(not check["healthy"] for check in health_status["checks"].values()):
health_status["status"] = "unhealthy"
return Response(content=json.dumps(health_status), status_code=503)
return health_status
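A minimal sketch of how the middleware and endpoints above might be wired into an application; the app title and the use of BaseHTTPMiddleware as the registration mechanism are assumptions, not the confirmed NeuraScale setup.
# Sketch: register PrometheusMiddleware on a FastAPI app. BaseHTTPMiddleware
# calls the instance as dispatch(request, call_next) for every request.
from fastapi import FastAPI
from starlette.middleware.base import BaseHTTPMiddleware
app = FastAPI(title="neural-engine")
app.add_middleware(BaseHTTPMiddleware, dispatch=PrometheusMiddleware(app))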
Distributed Tracing
# neural-engine/src/monitoring/tracing.py
import os
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.resources import Resource
def setup_tracing(app: FastAPI, service_name: str):
"""Configure OpenTelemetry tracing"""
# Create resource
resource = Resource.create({
"service.name": service_name,
"service.version": get_version(),
"deployment.environment": os.getenv("ENVIRONMENT", "development")
})
# Setup tracer provider
tracer_provider = TracerProvider(resource=resource)
trace.set_tracer_provider(tracer_provider)
# Configure Cloud Trace exporter
cloud_trace_exporter = CloudTraceSpanExporter()
span_processor = BatchSpanProcessor(cloud_trace_exporter)
tracer_provider.add_span_processor(span_processor)
# Instrument libraries
FastAPIInstrumentor.instrument_app(app)
SQLAlchemyInstrumentor().instrument()
RequestsInstrumentor().instrument()
return trace.get_tracer(service_name)
# Usage in application
tracer = setup_tracing(app, "neural-engine")
@app.post("/api/v1/process")
async def process_signal(request: SignalRequest):
"""Process neural signal with tracing"""
with tracer.start_as_current_span("process_signal") as span:
# Add span attributes
span.set_attribute("signal.type", request.signal_type)
span.set_attribute("signal.channels", len(request.channels))
span.set_attribute("signal.sampling_rate", request.sampling_rate)
# Preprocessing span
with tracer.start_as_current_span("preprocess"):
preprocessed = await preprocess_signal(request.data)
# Feature extraction span
with tracer.start_as_current_span("extract_features"):
features = await extract_features(preprocessed)
span.set_attribute("features.count", len(features))
# Model inference span
with tracer.start_as_current_span("model_inference") as inference_span:
inference_span.set_attribute("model.name", "neural_classifier_v2")
result = await run_inference(features)
return result
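For local development without GCP credentials, the Cloud Trace exporter can be swapped for a console exporter; a minimal sketch under that assumption (Cloud Trace remains the intended production exporter):
# Sketch: local-development tracing that prints spans to stdout instead of
# exporting to Cloud Trace. Useful when GCP credentials are unavailable.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("neural-engine-local")
with tracer.start_as_current_span("smoke_test"):
    pass  # the span is printed to stdout when it ends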
Error Tracking
# neural-engine/src/monitoring/error_tracking.py
import logging
import sentry_sdk
from fastapi import Request
from fastapi.responses import JSONResponse
from sentry_sdk.integrations.fastapi import FastApiIntegration
from sentry_sdk.integrations.sqlalchemy import SqlalchemyIntegration
from sentry_sdk.integrations.logging import LoggingIntegration
def setup_sentry(dsn: str, environment: str):
"""Configure Sentry error tracking"""
sentry_sdk.init(
dsn=dsn,
environment=environment,
integrations=[
FastApiIntegration(
transaction_style="endpoint",
failed_request_status_codes={400, 403, 404, 405, 500, 503}
),
SqlalchemyIntegration(),
LoggingIntegration(
level=logging.INFO,
event_level=logging.ERROR
)
],
traces_sample_rate=0.1, # 10% of transactions
profiles_sample_rate=0.1, # 10% profiling
attach_stacktrace=True,
send_default_pii=False, # HIPAA compliance
before_send=sanitize_event # Remove PHI
)
def sanitize_event(event, hint):
"""Remove any PHI from Sentry events"""
# Remove sensitive fields
sensitive_fields = ['patient_id', 'session_id', 'email', 'name']
if 'extra' in event:
for field in sensitive_fields:
event['extra'].pop(field, None)
if 'user' in event:
event['user'] = {'id': event['user'].get('id')}
return event
# Custom error handling
@app.exception_handler(NeuralProcessingError)
async def neural_processing_error_handler(request: Request, exc: NeuralProcessingError):
"""Handle neural processing errors with detailed tracking"""
# Log to Sentry with context
with sentry_sdk.push_scope() as scope:
scope.set_tag("error.type", "neural_processing")
scope.set_context("processing", {
"stage": exc.stage,
"signal_type": exc.signal_type,
"duration": exc.processing_duration
})
sentry_sdk.capture_exception(exc)
# Return sanitized error response
return JSONResponse(
status_code=500,
content={
"error": "Processing failed",
"error_id": sentry_sdk.last_event_id(),
"stage": exc.stage
}
)
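A small sketch of calling setup_sentry() at startup; SENTRY_DSN and ENVIRONMENT are assumed environment variable names rather than documented configuration.
# Sketch: enable Sentry only when a DSN is configured (assumed env var names).
import os
dsn = os.getenv("SENTRY_DSN")
if dsn:
    setup_sentry(dsn=dsn, environment=os.getenv("ENVIRONMENT", "development"))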
Infrastructure Monitoring
Google Cloud Monitoring
# neural-engine/monitoring/alerting-policies.yaml
apiVersion: monitoring.googleapis.com/v3
kind: AlertPolicy
metadata:
name: neural-engine-high-latency
spec:
displayName: "Neural Engine High Latency"
conditions:
- displayName: "API latency > 100ms"
conditionThreshold:
filter: |
resource.type = "k8s_container"
resource.labels.container_name = "neural-engine"
metric.type = "kubernetes.io/container/request_latency"
aggregations:
- alignmentPeriod: 60s
perSeriesAligner: ALIGN_PERCENTILE_95
comparison: COMPARISON_GT
thresholdValue: 0.1
duration: 300s
notificationChannels:
- projects/neurascale/notificationChannels/12345 # PagerDuty
- projects/neurascale/notificationChannels/67890 # Slack
alertStrategy:
autoClose: 86400s # 24 hours
---
apiVersion: monitoring.googleapis.com/v3
kind: AlertPolicy
metadata:
name: neural-engine-error-rate
spec:
displayName: "Neural Engine Error Rate"
conditions:
- displayName: "Error rate > 1%"
conditionThreshold:
filter: |
resource.type = "cloud_run_revision"
metric.type = "run.googleapis.com/request_count"
metric.labels.response_code_class != "2xx"
aggregations:
- alignmentPeriod: 300s
perSeriesAligner: ALIGN_RATE
crossSeriesReducer: REDUCE_SUM
comparison: COMPARISON_GT
thresholdValue: 0.01
duration: 600s
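The same policies can also be created programmatically with the google-cloud-monitoring client; a minimal sketch of the high-latency policy (filter and threshold mirror the YAML above, notification channels omitted):
# Sketch: create the high-latency alert policy via the Cloud Monitoring API.
# Mirrors the YAML policy above; notification channels are omitted here.
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2
def create_latency_alert(project_id: str) -> monitoring_v3.AlertPolicy:
    client = monitoring_v3.AlertPolicyServiceClient()
    policy = monitoring_v3.AlertPolicy(
        display_name="Neural Engine High Latency",
        combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.AND,
        conditions=[
            monitoring_v3.AlertPolicy.Condition(
                display_name="API latency > 100ms",
                condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                    filter=(
                        'resource.type = "k8s_container" AND '
                        'metric.type = "kubernetes.io/container/request_latency"'
                    ),
                    comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                    threshold_value=0.1,
                    duration=duration_pb2.Duration(seconds=300),
                ),
            )
        ],
    )
    return client.create_alert_policy(
        name=f"projects/{project_id}", alert_policy=policy
    )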
Custom Dashboards
# neural-engine/monitoring/dashboard_generator.py
from google.cloud import monitoring_dashboard_v1
import json
def create_neural_dashboard(project_id: str):
"""Create custom monitoring dashboard"""
client = monitoring_dashboard_v1.DashboardsServiceClient()
dashboard_config = {
"displayName": "Neural Engine Performance",
"mosaicLayout": {
"columns": 12,
"tiles": [
{
"width": 6,
"height": 4,
"widget": {
"title": "Request Rate",
"xyChart": {
"dataSets": [{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": 'metric.type="custom.googleapis.com/neural/http_requests_total"',
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_RATE"
}
}
}
}]
}
}
},
{
"xPos": 6,
"width": 6,
"height": 4,
"widget": {
"title": "Processing Latency (p95)",
"xyChart": {
"dataSets": [{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": 'metric.type="custom.googleapis.com/neural/processing_latency"',
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_PERCENTILE_95"
}
}
}
}]
}
}
},
{
"yPos": 4,
"width": 12,
"height": 4,
"widget": {
"title": "GPU Utilization",
"xyChart": {
"dataSets": [{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": 'metric.type="custom.googleapis.com/neural/gpu_utilization"',
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_MEAN"
}
}
}
}]
}
}
}
]
}
}
dashboard = monitoring_dashboard_v1.Dashboard(dashboard_config)
project_path = f"projects/{project_id}"
return client.create_dashboard(
parent=project_path,
dashboard=dashboard
)
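A hypothetical invocation (the project ID is a placeholder):
# Example invocation; "neurascale-prod" is a placeholder project ID.
dashboard = create_neural_dashboard("neurascale-prod")
print(f"Created dashboard: {dashboard.name}")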
Kubernetes Monitoring
Prometheus Deployment
NeuraScale uses Prometheus for metrics collection from all Neural Engine services. The Prometheus instance is deployed as part of the Helm chart with the following configuration:
- Scrape Interval: 15 seconds
- Retention: 30 days
- High Availability: 2 replicas
- Service Discovery: Automatic via Kubernetes annotations
Accessing Prometheus
# Port-forward to access Prometheus UI
kubectl port-forward -n neural-engine svc/prometheus 9090:9090
# Access Prometheus at http://localhost:9090
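With the port-forward running, the Prometheus HTTP API can be queried directly; a quick sanity-check sketch (assumes localhost:9090 as forwarded above):
# Sketch: query the Prometheus HTTP API through the port-forward above.
import requests
resp = requests.get(
    "http://localhost:9090/api/v1/query",
    params={"query": 'up{job="neural-engine"}'},
    timeout=5,
)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(result["metric"].get("instance"), "up =", result["value"][1])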
Prometheus Configuration
# neural-engine/kubernetes/monitoring/prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
data:
prometheus.yml: |
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'neural-engine'
kubernetes_sd_configs:
- role: pod
namespaces:
names:
- neural-engine
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- job_name: 'kubernetes-nodes'
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- job_name: 'kubernetes-cadvisor'
kubernetes_sd_configs:
- role: node
metrics_path: /metrics/cadvisor
relabel_configs:
- source_labels: [__address__]
regex: '([^:]+):\d+'
replacement: '$1:10250'
target_label: __address__
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus
spec:
replicas: 2
selector:
matchLabels:
app: prometheus
template:
metadata:
labels:
app: prometheus
spec:
serviceAccountName: prometheus
containers:
- name: prometheus
image: prom/prometheus:latest
args:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus/'
- '--storage.tsdb.retention.time=30d'
- '--web.enable-lifecycle'
ports:
- containerPort: 9090
volumeMounts:
- name: config
mountPath: /etc/prometheus
- name: data
mountPath: /prometheus
resources:
requests:
memory: "2Gi"
cpu: "500m"
limits:
memory: "4Gi"
cpu: "1"
volumes:
- name: config
configMap:
name: prometheus-config
- name: data
persistentVolumeClaim:
claimName: prometheus-data
Log Management
Structured Logging
# neural-engine/src/monitoring/structured_logging.py
import logging
import os
import structlog
from google.cloud import logging as cloud_logging
from opentelemetry import trace
def setup_structured_logging(service_name: str):
"""Configure structured logging with Cloud Logging integration"""
# Initialize Cloud Logging client
client = cloud_logging.Client()
handler = client.get_default_handler()
# Configure structlog
structlog.configure(
processors=[
structlog.contextvars.merge_contextvars,
structlog.processors.add_log_level,
structlog.processors.TimeStamper(fmt="iso"),
structlog.dev.set_exc_info,
structlog.processors.dict_tracebacks,
add_service_context,
sanitize_phi,
structlog.processors.JSONRenderer()
],
context_class=dict,
logger_factory=structlog.PrintLoggerFactory(),
wrapper_class=structlog.make_filtering_bound_logger(logging.INFO),
cache_logger_on_first_use=True,
)
return structlog.get_logger(service_name)
def add_service_context(logger, method_name, event_dict):
"""Add service context to all logs"""
event_dict['service'] = {
'name': logger.name,
'version': get_version(),
'environment': os.getenv('ENVIRONMENT', 'development')
}
# Add trace context if available
span = trace.get_current_span()
if span and span.is_recording():
event_dict['trace'] = {
'trace_id': format(span.get_span_context().trace_id, '032x'),
'span_id': format(span.get_span_context().span_id, '016x')
}
return event_dict
def sanitize_phi(logger, method_name, event_dict):
"""Remove PHI from logs for HIPAA compliance"""
sensitive_fields = ['patient_id', 'session_data', 'neural_signals']
for field in sensitive_fields:
if field in event_dict:
event_dict[field] = '[REDACTED]'
return event_dict
# Usage
logger = setup_structured_logging("neural-engine")
logger.info(
"signal_processed",
signal_type="EEG",
channels=64,
duration_ms=1000,
processing_time_ms=15.3
)
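The correlation IDs mentioned in the platform status table can be bound per request with structlog's contextvars support, which the merge_contextvars processor configured above picks up; a sketch assuming a FastAPI middleware and an X-Correlation-ID header (the header name is an assumption):
# Sketch: per-request correlation IDs via structlog contextvars, merged into
# every log line by the merge_contextvars processor configured above.
import uuid
import structlog
from fastapi import FastAPI, Request
app = FastAPI()
@app.middleware("http")
async def bind_correlation_id(request: Request, call_next):
    correlation_id = request.headers.get("X-Correlation-ID", str(uuid.uuid4()))
    structlog.contextvars.clear_contextvars()
    structlog.contextvars.bind_contextvars(correlation_id=correlation_id)
    response = await call_next(request)
    response.headers["X-Correlation-ID"] = correlation_id
    return response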
Log Analysis
# neural-engine/monitoring/log_analysis.py
from datetime import datetime, timedelta
from typing import Any, Dict, List
from google.cloud import bigquery
class LogAnalyzer:
"""Analyze logs for patterns and anomalies"""
def __init__(self, project_id: str):
self.client = bigquery.Client(project=project_id)
self.dataset_id = f"{project_id}.neural_logs"
def analyze_error_patterns(self, hours: int = 24) -> Dict[str, Any]:
"""Analyze error patterns in logs"""
query = f"""
WITH error_logs AS (
SELECT
timestamp,
jsonPayload.error_type AS error_type,
jsonPayload.error_message AS error_message,
jsonPayload.service.name AS service_name,
jsonPayload.trace.trace_id AS trace_id
FROM `{self.dataset_id}.stderr`
WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {hours} HOUR)
AND severity = 'ERROR'
)
SELECT
error_type,
COUNT(*) as error_count,
COUNT(DISTINCT trace_id) as affected_requests,
ARRAY_AGG(DISTINCT error_message LIMIT 5) as sample_messages,
MIN(timestamp) as first_occurrence,
MAX(timestamp) as last_occurrence
FROM error_logs
GROUP BY error_type
ORDER BY error_count DESC
"""
results = self.client.query(query).to_dataframe()
return {
'error_summary': results.to_dict('records'),
'total_errors': results['error_count'].sum(),
'unique_error_types': len(results),
'analysis_period_hours': hours
}
def detect_anomalies(self) -> List[Dict[str, Any]]:
"""Detect anomalies in log patterns"""
query = f"""
WITH hourly_stats AS (
SELECT
TIMESTAMP_TRUNC(timestamp, HOUR) as hour,
COUNT(*) as log_count,
COUNTIF(severity = 'ERROR') as error_count,
AVG(CAST(jsonPayload.processing_time_ms AS FLOAT64)) as avg_processing_time
FROM `{self.dataset_id}.stdout`
WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY hour
),
baseline AS (
SELECT
AVG(log_count) as avg_logs,
STDDEV(log_count) as stddev_logs,
AVG(error_count) as avg_errors,
STDDEV(error_count) as stddev_errors
FROM hourly_stats
)
SELECT
h.hour,
h.log_count,
h.error_count,
h.avg_processing_time,
CASE
WHEN h.log_count > b.avg_logs + (3 * b.stddev_logs) THEN 'high_volume'
WHEN h.log_count < b.avg_logs - (3 * b.stddev_logs) THEN 'low_volume'
WHEN h.error_count > b.avg_errors + (3 * b.stddev_errors) THEN 'high_errors'
ELSE 'normal'
END as anomaly_type
FROM hourly_stats h
CROSS JOIN baseline b
WHERE h.hour > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
AND (
h.log_count > b.avg_logs + (3 * b.stddev_logs) OR
h.log_count < b.avg_logs - (3 * b.stddev_logs) OR
h.error_count > b.avg_errors + (3 * b.stddev_errors)
)
ORDER BY h.hour DESC
"""
anomalies = self.client.query(query).to_dataframe()
return anomalies.to_dict('records')
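Example usage (the project ID is a placeholder):
# Example usage; "neurascale-prod" is a placeholder project ID.
analyzer = LogAnalyzer("neurascale-prod")
errors = analyzer.analyze_error_patterns(hours=24)
print(f"{errors['total_errors']} errors across {errors['unique_error_types']} types")
for anomaly in analyzer.detect_anomalies():
    print(anomaly["hour"], anomaly["anomaly_type"])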
Business Metrics
Session Analytics
# neural-engine/src/monitoring/business_metrics.py
from prometheus_client import Counter, Gauge, Histogram
import structlog
# Business metrics
sessions_created = Counter(
'neural_sessions_created_total',
'Total neural sessions created',
['device_type', 'signal_type']
)
active_sessions = Gauge(
'neural_sessions_active',
'Currently active neural sessions',
['device_type']
)
session_duration = Histogram(
'neural_session_duration_seconds',
'Neural session duration',
['device_type', 'completion_status'],
buckets=[60, 300, 600, 1800, 3600, 7200] # 1m, 5m, 10m, 30m, 1h, 2h
)
data_quality_score = Histogram(
'neural_data_quality_score',
'Neural data quality scores',
['device_type', 'signal_type'],
buckets=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
)
class BusinessMetricsCollector:
"""Collect and track business metrics"""
def __init__(self):
self.logger = structlog.get_logger(__name__)
async def track_session_created(
self,
session_id: str,
device_type: str,
signal_type: str
):
"""Track new session creation"""
sessions_created.labels(
device_type=device_type,
signal_type=signal_type
).inc()
active_sessions.labels(device_type=device_type).inc()
self.logger.info(
"session_created",
session_id=session_id,
device_type=device_type,
signal_type=signal_type
)
async def track_session_completed(
self,
session_id: str,
device_type: str,
signal_type: str,
duration_seconds: float,
completion_status: str,
quality_score: float
):
"""Track session completion"""
active_sessions.labels(device_type=device_type).dec()
session_duration.labels(
device_type=device_type,
completion_status=completion_status
).observe(duration_seconds)
data_quality_score.labels(
device_type=device_type,
signal_type=signal_type
).observe(quality_score)
self.logger.info(
"session_completed",
session_id=session_id,
duration_seconds=duration_seconds,
completion_status=completion_status,
quality_score=quality_score
)
Compliance Monitoring
# neural-engine/src/monitoring/compliance_metrics.py
from datetime import datetime
from typing import Any, Dict
from prometheus_client import Counter, Gauge
import structlog
# Compliance metrics
phi_access_events = Counter(
'neural_phi_access_total',
'PHI access events',
['access_type', 'user_role', 'resource_type']
)
security_events = Counter(
'neural_security_events_total',
'Security events',
['event_type', 'severity']
)
audit_log_size = Gauge(
'neural_audit_log_size_bytes',
'Audit log size in bytes'
)
class ComplianceMonitor:
"""Monitor compliance-related events"""
def __init__(self):
self.logger = structlog.get_logger(__name__)
async def log_phi_access(
self,
user_id: str,
user_role: str,
resource_type: str,
resource_id: str,
access_type: str,
ip_address: str
):
"""Log PHI access for HIPAA compliance"""
phi_access_events.labels(
access_type=access_type,
user_role=user_role,
resource_type=resource_type
).inc()
# Structured audit log
self.logger.info(
"phi_access",
user_id=user_id,
user_role=user_role,
resource_type=resource_type,
resource_id=resource_id,
access_type=access_type,
ip_address=ip_address,
timestamp=datetime.utcnow().isoformat(),
compliance_event=True
)
async def log_security_event(
self,
event_type: str,
severity: str,
details: Dict[str, Any]
):
"""Log security events"""
security_events.labels(
event_type=event_type,
severity=severity
).inc()
self.logger.warning(
"security_event",
event_type=event_type,
severity=severity,
details=details,
compliance_event=True
)
Alerting Rules
Critical Alerts
# neural-engine/monitoring/alerts/critical.yaml
groups:
- name: neural_critical
interval: 30s
rules:
- alert: NeuralEngineDown
expr: up{job="neural-engine"} == 0
for: 2m
labels:
severity: critical
team: platform
annotations:
summary: "Neural Engine instance {{ $labels.instance }} is down"
description: "Neural Engine has been down for more than 2 minutes."
runbook_url: "https://neurascale.docs/runbooks/neural-engine-down"
- alert: HighErrorRate
expr: |
rate(http_requests_total{status=~"5.."}[5m])
/ rate(http_requests_total[5m]) > 0.05
for: 5m
labels:
severity: critical
team: platform
annotations:
summary: "High error rate on {{ $labels.job }}"
description: "Error rate is above 5% for the last 5 minutes"
- alert: DatabaseConnectionPoolExhausted
expr: database_connections_active / database_connections_max > 0.9
for: 5m
labels:
severity: critical
team: platform
annotations:
summary: "Database connection pool nearly exhausted"
description: "{{ $value | humanizePercentage }} of connections are in use"
- alert: GPUMemoryExhausted
expr: neural_gpu_memory_usage_bytes / neural_gpu_memory_total_bytes > 0.95
for: 2m
labels:
severity: critical
team: ml
annotations:
summary: "GPU memory nearly exhausted on {{ $labels.device_name }}"
description: "GPU memory usage is at {{ $value | humanizePercentage }}"
Dashboards
Grafana Dashboard Configuration
{
"dashboard": {
"title": "NeuraScale Neural Engine",
"panels": [
{
"title": "Request Rate",
"targets": [
{
"expr": "rate(http_requests_total[5m])",
"legendFormat": "{{method}} {{endpoint}}"
}
],
"type": "graph"
},
{
"title": "Latency Percentiles",
"targets": [
{
"expr": "histogram_quantile(0.5, rate(http_request_duration_seconds_bucket[5m]))",
"legendFormat": "p50"
},
{
"expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
"legendFormat": "p95"
},
{
"expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))",
"legendFormat": "p99"
}
],
"type": "graph"
},
{
"title": "Active Sessions",
"targets": [
{
"expr": "neural_sessions_active",
"legendFormat": "{{device_type}}"
}
],
"type": "graph"
},
{
"title": "Processing Queue Depth",
"targets": [
{
"expr": "neural_processing_queue_depth",
"legendFormat": "{{queue_name}}"
}
],
"type": "graph"
}
]
}
}
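One way to load this JSON into Grafana is its HTTP API; a sketch assuming the dashboard is saved to a local file and that GRAFANA_URL and GRAFANA_API_TOKEN environment variables hold the target instance and an API token:
# Sketch: push the dashboard JSON above to Grafana via its HTTP API.
# GRAFANA_URL, GRAFANA_API_TOKEN, and the file name are assumptions.
import json
import os
import requests
with open("neural-engine-dashboard.json") as f:  # the JSON document above
    payload = json.load(f)
payload["overwrite"] = True
resp = requests.post(
    f"{os.environ['GRAFANA_URL']}/api/dashboards/db",
    headers={"Authorization": f"Bearer {os.environ['GRAFANA_API_TOKEN']}"},
    json=payload,
    timeout=10,
)
resp.raise_for_status()
print("Imported:", resp.json().get("url"))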
Monitoring Checklist
- Configure Prometheus metrics collection
- Set up Cloud Monitoring dashboards
- Implement distributed tracing with OpenTelemetry
- Configure Sentry error tracking
- Set up log aggregation and analysis
- Create alerting rules for critical issues
- Implement SLO monitoring
- Configure compliance audit logging
- Set up performance profiling
- Create runbooks for common issues
Comprehensive monitoring ensures NeuraScale operates reliably and efficiently, enabling quick detection and resolution of issues while maintaining compliance requirements.