AI API Key Management: Rotation, Scoping, and Abuse Detection

AI API Key Management: Rotation, Scoping, and Abuse Detection

Problem

AI services have turned API keys into direct spending controls. A leaked OpenAI or Anthropic key can generate thousands of dollars in charges within hours. Unlike traditional API keys where abuse means data loss, AI API key abuse means immediate financial loss at compute prices of $0.01-$0.10 per request for large models.

Teams manage keys across multiple AI providers (OpenAI, Anthropic, Cohere, internal inference endpoints), multiple environments (development, staging, production), and multiple teams. The result is key sprawl: keys hardcoded in notebooks, shared in Slack, stored in environment variables on developer laptops, and duplicated across CI/CD pipelines. No single system tracks which keys exist, who has access, what they are scoped to, or when they were last rotated.

A compromised key is rarely detected through the provider’s built-in monitoring. Most AI API providers offer basic usage dashboards but no real-time anomaly detection. By the time a team notices an unexpected bill, the damage is done.

Target systems: Applications consuming AI provider APIs. Internal inference endpoints with API key authentication. Vault or equivalent secrets management. API gateways proxying AI requests.

Threat Model

  • Adversary: External attacker who obtains an API key through source code exposure (public Git repository), compromised CI/CD pipeline, insider threat, or phishing. Also: automated scanners that trawl GitHub for API key patterns.
  • Objective: Use the key for unauthorized inference (generating content, running models at the victim’s expense) or extract data through prompt injection and data exfiltration via model context.
  • Blast radius: A single compromised key with no rate limits or scope restrictions can generate unlimited API calls. At $0.06 per 1K tokens for GPT-4 class models, a script running overnight can produce $10,000+ in charges. If the key also has access to fine-tuning endpoints, the attacker can download or corrupt fine-tuned models.

Configuration

Centralised Key Storage with Vault

Store all AI API keys in Vault with dynamic secret generation where providers support it.

# vault-ai-secrets-engine.hcl
# Mount a KV secrets engine for AI provider keys
path "secret/data/ai-providers/*" {
  capabilities = ["read"]
}

# Per-environment, per-team access policies
path "secret/data/ai-providers/openai/production" {
  capabilities = ["read"]
  allowed_parameters = {
    "version" = []
  }
}

path "secret/data/ai-providers/anthropic/staging" {
  capabilities = ["read"]
}
# Store keys with metadata
vault kv put secret/ai-providers/openai/production \
  api_key="sk-proj-..." \
  org_id="org-..." \
  created_at="2026-04-22" \
  rotated_at="2026-04-22" \
  owner="ml-platform-team" \
  scope="chat-completions-only" \
  max_monthly_spend="5000"

Per-Key Scoping and Rate Limits

Use an API gateway to enforce per-key rate limits and scope restrictions, regardless of what the upstream provider allows.

# kong-ai-gateway-config.yaml
# Kong gateway configuration for AI API proxying
_format_version: "3.0"

services:
  - name: openai-proxy
    url: https://api.openai.com
    routes:
      - name: openai-route
        paths:
          - /ai/openai
        strip_path: true

plugins:
  # Rate limiting per consumer (team/service)
  - name: rate-limiting
    service: openai-proxy
    config:
      minute: 60
      hour: 1000
      day: 10000
      policy: redis
      redis_host: redis.kong-system.svc.cluster.local

  # Request size limiting (prevent abuse via large prompts)
  - name: request-size-limiting
    service: openai-proxy
    config:
      allowed_payload_size: 64  # KB

  # Request transformer: inject the real API key
  # Consumers send a scoped internal key; Kong swaps it for the provider key
  - name: request-transformer
    service: openai-proxy
    config:
      remove:
        headers:
          - Authorization
      add:
        headers:
          - "Authorization: Bearer $(vault read -field=api_key secret/ai-providers/openai/production)"

consumers:
  - username: ml-team-prod
    keyauth_credentials:
      - key: internal-ml-prod-key-001
  - username: data-science-staging
    keyauth_credentials:
      - key: internal-ds-staging-key-001

Automated Key Rotation

#!/bin/bash
# rotate-ai-keys.sh
# Run monthly via cron or CI pipeline

set -euo pipefail

PROVIDER="openai"
ENV="production"
VAULT_PATH="secret/ai-providers/${PROVIDER}/${ENV}"

# Step 1: Generate new key via provider API
# OpenAI example (API key creation endpoint)
NEW_KEY=$(curl -s -X POST "https://api.openai.com/v1/organization/api_keys" \
  -H "Authorization: Bearer $(vault kv get -field=admin_key secret/ai-providers/${PROVIDER}/admin)" \
  -H "Content-Type: application/json" \
  -d '{"name": "production-'"$(date +%Y%m%d)"'", "scope": {"type": "project", "project_id": "proj_abc123"}}' \
  | jq -r '.api_key')

# Step 2: Store new key in Vault
vault kv put "${VAULT_PATH}" \
  api_key="${NEW_KEY}" \
  rotated_at="$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  previous_key="$(vault kv get -field=api_key ${VAULT_PATH})"

# Step 3: Restart consuming pods to pick up new key
kubectl rollout restart deployment/ai-gateway -n ml-platform

# Step 4: Verify new key works
sleep 30
HEALTH=$(curl -s -o /dev/null -w "%{http_code}" \
  -H "Authorization: Bearer ${NEW_KEY}" \
  "https://api.openai.com/v1/models")

if [ "${HEALTH}" != "200" ]; then
  echo "ERROR: New key health check failed. Rolling back."
  vault kv put "${VAULT_PATH}" \
    api_key="$(vault kv get -field=previous_key ${VAULT_PATH})" \
    rotated_at="$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    rollback="true"
  kubectl rollout restart deployment/ai-gateway -n ml-platform
  exit 1
fi

# Step 5: Revoke old key (after grace period for in-flight requests)
echo "New key verified. Old key will be revoked in 24 hours."
echo "Schedule: revoke-old-key.sh ${PROVIDER} ${ENV}"

Abuse Detection

# prometheus-ai-key-abuse-alerts.yaml
groups:
  - name: ai-key-abuse
    rules:
      # Sudden spike in request volume (3x normal for time of day)
      - alert: AIKeyUsageSpike
        expr: >
          sum by (consumer) (
            rate(kong_http_requests_total{service="openai-proxy"}[15m])
          ) > 3 * avg_over_time(
            sum by (consumer) (
              rate(kong_http_requests_total{service="openai-proxy"}[15m])
            )[7d:1h]
          )
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "AI API usage spike for {{ $labels.consumer }}: 3x normal volume"
          runbook: "Check if a new batch job was deployed. If not, rotate the key immediately."

      # Requests from unexpected source IPs
      - alert: AIKeyNewSourceIP
        expr: >
          count by (consumer) (
            count by (consumer, client_ip) (
              kong_http_requests_total{service="openai-proxy"}
            )
          ) > 1.5 * avg_over_time(
            count by (consumer) (
              count by (consumer, client_ip) (
                kong_http_requests_total{service="openai-proxy"}
              )
            )[7d:1h]
          )
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "AI API key {{ $labels.consumer }} used from new source IPs"

      # Cost threshold alert
      - alert: AIMonthlySpendThreshold
        expr: >
          sum by (consumer) (
            increase(kong_http_requests_total{service=~".*-proxy"}[30d])
          ) * 0.03 > 5000
        labels:
          severity: critical
        annotations:
          summary: "Estimated monthly AI API spend for {{ $labels.consumer }} exceeds $5,000"

Emergency Revocation

#!/bin/bash
# emergency-revoke.sh - run when compromise is detected
set -euo pipefail

PROVIDER=$1
ENV=$2

echo "EMERGENCY: Revoking ${PROVIDER} ${ENV} API key"

# Immediately update Vault to empty value (blocks new requests)
vault kv put "secret/ai-providers/${PROVIDER}/${ENV}" \
  api_key="REVOKED-$(date +%s)" \
  revoked_at="$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  revoked_by="${USER}"

# Restart gateway to pick up revocation
kubectl rollout restart deployment/ai-gateway -n ml-platform

# Notify team
curl -X POST "${SLACK_WEBHOOK}" \
  -H "Content-Type: application/json" \
  -d "{\"text\": \"SECURITY: ${PROVIDER} ${ENV} API key revoked by ${USER}. Generate new key and update Vault.\"}"

echo "Key revoked. Generate new key via provider dashboard and run rotate-ai-keys.sh"

Expected Behaviour

  • No AI API keys exist outside Vault. Applications retrieve keys through Vault sidecar or API gateway injection.
  • Each team and environment has a separate key with explicit rate limits
  • Key rotation runs monthly with automated verification
  • Usage spikes trigger alerts within 5 minutes
  • Emergency revocation completes within 2 minutes (Vault update + gateway restart)
  • Monthly spend per key is tracked and alerted on threshold breach

Trade-offs

Decision Impact Risk Mitigation
API gateway proxy (Kong) Adds 5-15ms latency per request Gateway becomes a single point of failure Run gateway in HA (3+ replicas). Circuit breaker falls back to direct API call with cached key.
Monthly key rotation Limits exposure window of compromised keys Rotation failure leaves stale key Automated health check after rotation. Alert on rotation script failure. Keep previous key in Vault for rollback.
Per-consumer rate limits Prevents runaway spend from any single consumer Legitimate batch jobs may hit limits Configurable per-consumer limits. Batch jobs use dedicated consumer with higher limits and tighter monitoring.
Internal proxy keys (not provider keys) Teams never see the real provider key More infrastructure to manage (gateway + Vault) Justified for any team spending over $500/month on AI APIs. Below that, direct provider keys with Vault storage may suffice.

Failure Modes

Failure Symptom Detection Recovery
Vault unavailable Applications cannot retrieve API keys; AI features fail Vault health check fails; application error logs show Vault connection timeout Vault HA cluster. If total outage, fall back to cached key in gateway (time-limited cache, 1 hour max).
Key rotation fails mid-process New key invalid, old key may be revoked Rotation script exits with error; health check fails Automatic rollback to previous key (stored in Vault). Alert on rotation failure.
Rate limit too aggressive Legitimate requests rejected (HTTP 429) Application error rate increases; Kong logs show rate limit rejections Increase rate limit for the affected consumer. Review usage patterns to set appropriate limits.
Compromised key detected late Unexpected charges on provider bill Monthly bill review; provider usage dashboard Revoke key immediately. File dispute with provider (most providers credit unauthorized usage if reported promptly). Rotate all keys on same provider.

When to Consider a Managed Alternative

HCP Vault (#65) for managed secrets lifecycle, eliminating Vault cluster operations. Doppler (#68) for universal secrets sync across environments when Vault is too complex. Kong (#86) Konnect for managed API gateway with built-in analytics per consumer.

Premium content pack: API key rotation automation templates. Vault policies, Kong gateway configuration, rotation scripts for major AI providers, and Prometheus alert rules for usage monitoring.