Migrating from Self-Hosted Prometheus to Grafana Cloud: Preserving Dashboards, Alerts, and History

Problem

Self-hosted Prometheus consumes 500GB+ of storage within 6 months for a 20-node Kubernetes cluster. High availability requires Thanos or Cortex, which adds significant operational complexity. Cross-cluster aggregation needs federation or remote write. Grafana itself needs persistent storage, user management, and backups. The total operational cost of self-managed observability typically exceeds the cost of a managed backend.

But migration must preserve dashboards, alert rules, and recording rules without a detection gap. Moving from one observability backend to another is like replacing the engines on a plane in flight.

Threat Model

  • Adversary: No direct adversary; the threat is losing security monitoring during the migration window. If your security alert rules stop functioning during the migration, attackers gain a detection gap to exploit.

Configuration

Phase 1: Remote Write (Parallel Running)

Configure Prometheus to send metrics to both local storage and Grafana Cloud simultaneously:

# prometheus.yml - add remote_write section
remote_write:
  - url: "https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push"
    basic_auth:
      username: "${GRAFANA_CLOUD_PROMETHEUS_USER}"
      password: "${GRAFANA_CLOUD_API_KEY}"
    queue_config:
      max_samples_per_send: 5000
      max_shards: 10
      capacity: 10000
    write_relabel_configs:
      # Optional: filter which metrics are sent to reduce costs
      - source_labels: [__name__]
        regex: "go_.*"
        action: drop
# Apply the config
kubectl rollout restart statefulset prometheus-server -n monitoring

# Verify remote write is working:
# Check Prometheus targets page: http://prometheus:9090/targets
# Check Grafana Cloud: Explore → select Prometheus data source → run a query

# Monitor remote write health:
# prometheus_remote_storage_samples_total should increase steadily
# prometheus_remote_storage_samples_failed_total should stay at 0
# (older Prometheus versions name these prometheus_remote_storage_succeeded_samples_total
#  and prometheus_remote_storage_failed_samples_total)
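These health counters can be checked from a script. A minimal sketch, assuming Prometheus is reachable at http://prometheus:9090 and treating a failure ratio under 1% as healthy (both the hostname and the threshold are assumptions; the counter names vary by Prometheus version, as noted above):

```shell
#!/usr/bin/env bash
# Query a counter's current value from the Prometheus HTTP API.
# Returns "0" when the metric is absent; empty when Prometheus is unreachable.
query_counter() {
  curl -s --max-time 5 "http://prometheus:9090/api/v1/query?query=$1" \
    | jq -r '.data.result[0].value[1] // "0" | split(".")[0]'
}

# Succeed when the failed-sample percentage is below the threshold (default 1%).
remote_write_healthy() {
  local succeeded=${1:-0} failed=${2:-0} threshold_pct=${3:-1}
  local total=$((succeeded + failed))
  [ "$total" -gt 0 ] || return 1
  [ $((failed * 100 / total)) -lt "$threshold_pct" ]
}

# Older Prometheus versions: prometheus_remote_storage_succeeded_samples_total
# and prometheus_remote_storage_failed_samples_total.
if remote_write_healthy \
     "$(query_counter prometheus_remote_storage_samples_total)" \
     "$(query_counter prometheus_remote_storage_samples_failed_total)"; then
  echo "remote write healthy"
else
  echo "remote write degraded or Prometheus unreachable"
fi
```

Wire this into CI or a cron job during the parallel-running window so a silent remote-write failure is caught before cutover.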

Phase 2: Dashboard Migration

# Export all Grafana dashboards as JSON
# Using the Grafana API:
GRAFANA_URL="http://grafana.monitoring.svc:3000"
GRAFANA_TOKEN="your-api-token"

# List all dashboards
curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
  "$GRAFANA_URL/api/search?type=dash-db" | jq -r '.[].uid' > dashboard-uids.txt

# Export each dashboard
mkdir -p dashboards-export
while read -r uid; do
  curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
    "$GRAFANA_URL/api/dashboards/uid/$uid" | jq '.dashboard' > "dashboards-export/$uid.json"
  echo "Exported: $uid"
done < dashboard-uids.txt

# Import to Grafana Cloud
CLOUD_URL="https://your-org.grafana.net"
CLOUD_TOKEN="your-cloud-api-token"

for f in dashboards-export/*.json; do
  # Wrap in the import format
  jq '{dashboard: ., overwrite: true}' "$f" | \
    curl -s -X POST -H "Authorization: Bearer $CLOUD_TOKEN" \
    -H "Content-Type: application/json" \
    -d @- "$CLOUD_URL/api/dashboards/db"
  echo "Imported: $f"
done

Data source references: After import, dashboards still point at the old data source UID. Update them to reference the Grafana Cloud Prometheus data source:

# In each imported dashboard JSON, replace the datasource reference:
# Old: {"type": "prometheus", "uid": "local-prometheus"}
# New: {"type": "prometheus", "uid": "grafanacloud-prom"}

for f in dashboards-export/*.json; do
  sed -i 's/"uid": "local-prometheus"/"uid": "grafanacloud-prom"/g' "$f"
done
# Re-import the updated dashboards

Phase 3: Alert Rule Migration

# Export PrometheusRule resources
kubectl get prometheusrules --all-namespaces -o yaml > prometheus-rules-export.yaml

# Convert to Grafana Cloud alerting format.
# Grafana Cloud uses Grafana Alerting (not Alertmanager directly).
# The PromQL expressions are compatible - only the wrapping format changes.
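To work from structured data rather than raw YAML, the export can be reduced to just the rule groups with jq; a sketch, assuming kubectl access to the cluster:

```shell
# Reduce a PrometheusRule list (JSON form, read from stdin) to
# namespace/name/groups -- the only parts that need recreating in Grafana Cloud.
extract_rule_groups() {
  jq '[.items[] | {namespace: .metadata.namespace,
                   name: .metadata.name,
                   groups: .spec.groups}]'
}

# Usage against a live cluster:
# kubectl get prometheusrules --all-namespaces -o json \
#   | extract_rule_groups > prometheus-rule-groups.json
```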

For each PrometheusRule, create a corresponding Grafana Cloud alert rule:

# Grafana Cloud alert rule (created via API or UI)
# Each PrometheusRule group becomes a Grafana alerting rule group.
{
  "name": "security-auth-alerts",
  "interval": "30s",
  "rules": [
    {
      "grafana_alert": {
        "title": "BruteForceDetected",
        "condition": "A",
        "data": [
          {
            "refId": "A",
            "queryType": "",
            "relativeTimeRange": {"from": 300, "to": 0},
            "datasourceUid": "grafanacloud-prom",
            "model": {
              "expr": "sum by (source_ip, service) (rate(auth_failures_total{result=\"failure\"}[5m])) > 0.5",
              "intervalMs": 30000,
              "maxDataPoints": 43200
            }
          }
        ],
        "for": "2m",
        "labels": {"severity": "warning"},
        "annotations": {
          "summary": "Possible brute force detected"
        }
      }
    }
  ]
}
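Generating that payload for each migrated rule can be scripted. A sketch that wraps a PromQL expression in the shape shown above (the field names follow the example payload; the exact API endpoint for posting rule groups varies by Grafana version, so verify both against your stack before relying on this):

```shell
# Build one Grafana-managed alert rule object from a PromQL expression,
# mirroring the example payload above.
build_alert_rule() {
  local title=$1 expr=$2 dur=$3 sev=$4
  jq -n --arg title "$title" --arg expr "$expr" --arg dur "$dur" --arg sev "$sev" \
  '{grafana_alert: {
      title: $title,
      condition: "A",
      data: [{refId: "A",
              queryType: "",
              relativeTimeRange: {from: 300, to: 0},
              datasourceUid: "grafanacloud-prom",
              model: {expr: $expr, intervalMs: 30000, maxDataPoints: 43200}}],
      "for": $dur,
      labels: {severity: $sev},
      annotations: {summary: $title}}}'
}

# Example: reproduce the BruteForceDetected rule from above.
build_alert_rule "BruteForceDetected" \
  'sum by (source_ip, service) (rate(auth_failures_total{result="failure"}[5m])) > 0.5' \
  "2m" "warning"
```

Collect the generated rules into a group object (name + interval + rules array, as above) before posting.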

Phase 4: Verification

# Run for 24-48 hours with both systems active.
# Compare key metrics between self-hosted and Grafana Cloud:

# On self-hosted Prometheus:
curl -s "http://prometheus:9090/api/v1/query?query=up" | jq '.data.result | length'

# On Grafana Cloud (using the Grafana Cloud Prometheus API):
curl -s -u "$USER:$API_KEY" \
  "https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/api/v1/query?query=up" | \
  jq '.data.result | length'

# Both should return the same count.
# Check security-specific metrics:
# - auth_failures_total
# - apiserver_authorization_decisions_total
# - certmanager_certificate_expiration_timestamp_seconds
# - cilium_drop_count_total
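The side-by-side comparison can be automated over the whole metric list; a sketch using the same example hostnames and credentials as above:

```shell
# Count the series a query returns, reading an /api/v1/query response on stdin.
count_series() {
  jq '.data.result | length'
}

# Compare one metric's series count between the two backends.
compare_metric() {
  local metric=$1 local_count cloud_count
  local_count=$(curl -s --max-time 5 \
    "http://prometheus:9090/api/v1/query?query=${metric}" | count_series)
  cloud_count=$(curl -s --max-time 5 -u "$USER:$API_KEY" \
    "https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/api/v1/query?query=${metric}" \
    | count_series)
  if [ "${local_count:-0}" = "${cloud_count:-0}" ]; then
    echo "OK   ${metric}: ${local_count:-0} series on both backends"
  else
    echo "DIFF ${metric}: local=${local_count:-0} cloud=${cloud_count:-0}"
  fi
}

for m in up auth_failures_total apiserver_authorization_decisions_total \
         certmanager_certificate_expiration_timestamp_seconds cilium_drop_count_total; do
  compare_metric "$m"
done
```

Any DIFF line during the parallel-running window is a cutover blocker: it usually means a write_relabel_configs drop rule is too broad or remote write is lagging.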

Phase 5: Cut Over and Decommission

# After 24-48 hours of verified parallel running:

# 1. Update all alert notification channels to point to Grafana Cloud OnCall (#178)
# 2. Disable alerts on the self-hosted Prometheus (set all rules to inactive)
# 3. Remove the self-hosted Grafana from DNS/bookmarks
# 4. Keep self-hosted Prometheus running for 7 more days (historical queries)
# 5. After 7 days: decommission self-hosted Prometheus and Grafana

# The remote_write configuration stays - Prometheus continues to scrape
# and ship metrics to Grafana Cloud. The local storage can be reduced
# to minimal retention (2h for write-ahead log only).
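The reduced-retention footprint translates to Prometheus's standard storage flags; a sketch (the flag names are standard Prometheus flags, but how you set them depends on your operator or Helm chart):

```shell
# Post-cutover Prometheus invocation: keep scraping and remote-writing,
# but retain only enough local data to buffer a Grafana Cloud outage.
# prometheus \
#   --config.file=/etc/prometheus/prometheus.yml \
#   --storage.tsdb.retention.time=2h \
#   --storage.tsdb.retention.size=5GB
```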

Cost Estimation

# Calculate your Grafana Cloud cost based on current Prometheus usage:

# Count metric names:
curl -s "http://prometheus:9090/api/v1/label/__name__/values" | jq '.data | length'
# This counts metric names, not series. Multiply by the average number of
# series per metric name to approximate the active series count.

# Or use the TSDB stats: headStats.numSeries is the exact active series count,
# and seriesCountByMetricName lists the biggest contributors.
curl -s "http://prometheus:9090/api/v1/status/tsdb" \
  | jq '{activeSeries: .data.headStats.numSeries, topMetrics: .data.seriesCountByMetricName}'

# Grafana Cloud pricing (as of 2026):
# Free: 10,000 active series, 50GB logs, 50GB traces
# Pro: $8/1000 active series/month + $0.50/GB logs
#
# Example: 50,000 active series = ~$400/month on Grafana Cloud
# vs. engineering time to manage self-hosted: $800-3,200/month
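That arithmetic can be scripted against the exact head-series count; a sketch using the Pro rate quoted above ($8 per 1,000 active series per month, ignoring the free-tier allowance; prometheus_tsdb_head_series is a standard Prometheus self-metric, and the hostname is the example value from above):

```shell
# Exact active series count from Prometheus's own head-series gauge.
active_series() {
  curl -s --max-time 5 \
    "http://prometheus:9090/api/v1/query?query=prometheus_tsdb_head_series" \
    | jq -r '.data.result[0].value[1] // "0" | split(".")[0]'
}

# Estimated monthly metrics bill at $8 per 1,000 active series,
# rounded up to the next 1,000 (free-tier allowance not subtracted).
monthly_cost_usd() {
  local series=${1:-0}
  echo $(( (series + 999) / 1000 * 8 ))
}

echo "Estimated metrics cost: \$$(monthly_cost_usd "$(active_series)")/month"
```

At 50,000 active series this reproduces the ~$400/month figure above.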

Expected Behaviour

  • All metrics flowing to Grafana Cloud via remote write within 1 hour of configuration
  • All dashboards imported and rendering identically to self-hosted Grafana
  • All alert rules firing with the same thresholds and conditions
  • No detection gap during the parallel running period
  • Self-hosted Prometheus can be decommissioned after 7-day parallel running

Trade-offs

  • Parallel running (both active)
    Impact: double metric storage cost during the migration window (7-14 days).
    Risk: paying for two backends at once.
    Mitigation: the migration period is short, and the temporary cost is negligible compared to the long-term savings.

  • Accept historical data loss
    Impact: no migration of Thanos/Prometheus historical data.
    Risk: lost trend data; security baselines need re-establishment.
    Mitigation: accept a 30-day baseline gap. Historical data remains queryable on the self-hosted stack during the transition period.

  • Remote write to Grafana Cloud
    Impact: Prometheus becomes a collection agent, not a storage backend.
    Risk: dependency on Grafana Cloud availability.
    Mitigation: Prometheus local storage provides a buffer; if Grafana Cloud is unreachable, metrics buffer locally and ship when the connection restores.

Failure Modes

  • Remote write fails
    Symptom: metrics not arriving in Grafana Cloud.
    Detection: prometheus_remote_storage_failed_samples_total increases; Grafana Cloud shows no data.
    Recovery: check credentials, endpoint URL, and network egress from the Prometheus namespace. Prometheus buffers locally; metrics ship when the connection restores.

  • Dashboard variable mismatch
    Symptom: imported dashboards show "No data".
    Detection: visual comparison during parallel running reveals blank panels.
    Recovery: update data source UIDs and variable queries to match the Grafana Cloud Prometheus data source.

  • Alert rule PromQL incompatible
    Symptom: alerts don't fire in Grafana Cloud.
    Detection: test each alert rule; Grafana Alerting shows a rule error.
    Recovery: adjust for minor PromQL syntax differences between Prometheus and Grafana Cloud; most queries work unchanged.

  • Cost exceeds estimate
    Symptom: Grafana Cloud bill higher than expected.
    Detection: invoice exceeds budget.
    Recovery: use write_relabel_configs to drop high-cardinality, low-value metrics before shipping; review seriesCountByMetricName for optimization targets.

When to Consider a Managed Alternative

This guide is itself the managed alternative: the migration path leads directly to Grafana Cloud (#108).

Alternatives:

  • Axiom (#112): 500GB/month free, unlimited retention, serverless query. Better for teams that want to ingest everything (metrics + logs + traces) without worrying about cardinality or retention costs.
  • Chronosphere (#116): Built for high-cardinality environments with cost control. For teams where cardinality is the primary scaling challenge.
  • VictoriaMetrics (#111): Self-hosted but lower resource usage than Prometheus. Extends the self-hosted stage before needing managed.
  • SigNoz (#117): OpenTelemetry-native unified observability. For teams migrating to OTel.

Sponsored guide opportunity: Grafana Labs could sponsor a deep-dive on Grafana Cloud migration specific to security monitoring use cases.

Premium content pack: an observability migration toolkit with export scripts, dashboard conversion tools, alert rule migration templates, a cost estimation calculator, and verification scripts for Grafana Cloud, Axiom, and Chronosphere.