Adversarial Attacks on Embeddings: Poisoning Vector Stores and Manipulating Semantic Search

Problem

Embedding-based retrieval powers RAG pipelines, semantic search, recommendation systems, and classification. The embedding space is treated as a trustworthy representation of meaning, but it is not. An attacker who can inject documents into the indexing pipeline controls what gets retrieved. An attacker who understands the embedding model can craft adversarial documents that sit close to target queries in vector space while containing arbitrary content.

Most teams validate the text content of documents before indexing but never inspect the resulting embeddings. A document that passes content moderation can still be adversarial in embedding space: semantically close to high-value queries, positioned to displace legitimate results, and carrying payloads (prompt injections, misinformation, or data exfiltration instructions) that the retrieval system faithfully surfaces.

The attack surface spans three areas: poisoning the vector store through the ingestion pipeline, crafting queries that manipulate retrieval results, and exploiting the geometry of the embedding space to create adversarial collisions.

Threat Model

  • Adversary: (1) Attacker with write access to the document ingestion pipeline (employee, compromised service account, supply chain). (2) Attacker who understands the embedding model architecture and can craft adversarial inputs offline. (3) Attacker who can submit queries to a RAG endpoint and observe retrieved content.
  • Objective: Inject documents that get retrieved for high-value queries (poisoning). Displace legitimate documents from retrieval results (denial of service). Embed prompt injection payloads in documents that will be passed to an LLM. Extract information about the embedding model or indexed documents through query probing.
  • Blast radius: Poisoned retrievals lead to incorrect LLM outputs. In agentic systems, this can trigger unauthorized actions. In customer-facing systems, it causes misinformation or brand damage.

Configuration

Embedding Integrity Validation at Ingestion

# embedding_validator.py - validate embeddings before they enter the vector store
import numpy as np
from typing import Tuple, List
from dataclasses import dataclass

@dataclass
class EmbeddingValidationResult:
    valid: bool
    reasons: List[str]
    embedding_norm: float
    nearest_cluster_distance: float

class EmbeddingValidator:
    """
    Validate embeddings before insertion into the vector store.
    Detects anomalous embeddings that may indicate adversarial crafting.
    """

    def __init__(self, expected_dim: int = 1536, norm_range: Tuple[float, float] = (0.9, 1.1)):
        self.expected_dim = expected_dim
        self.norm_range = norm_range
        self.cluster_centroids = None  # loaded from baseline
        self.max_cluster_distance = None

    def load_baseline(self, centroids: np.ndarray, threshold_percentile_99: float):
        """Load cluster centroids from a baseline computed over known-good documents."""
        self.cluster_centroids = centroids
        self.max_cluster_distance = threshold_percentile_99

    def validate(self, embedding: np.ndarray, document_text: str) -> EmbeddingValidationResult:
        reasons = []
        norm = float(np.linalg.norm(embedding))

        # Check dimensionality first; the remaining checks assume the expected shape,
        # and computing centroid distances with a mismatched vector would raise
        if embedding.shape[0] != self.expected_dim:
            reasons.append(f"dimension_mismatch: expected {self.expected_dim}, got {embedding.shape[0]}")
            return EmbeddingValidationResult(
                valid=False,
                reasons=reasons,
                embedding_norm=norm,
                nearest_cluster_distance=float("inf"),
            )

        # Check norm (most embedding models produce near-unit-norm vectors)
        if not (self.norm_range[0] <= norm <= self.norm_range[1]):
            reasons.append(f"abnormal_norm: {norm:.4f} outside [{self.norm_range[0]}, {self.norm_range[1]}]")

        # Check for NaN or Inf
        if np.any(np.isnan(embedding)) or np.any(np.isinf(embedding)):
            reasons.append("contains_nan_or_inf")

        # Check distance to nearest known cluster
        nearest_distance = float("inf")
        if self.cluster_centroids is not None:
            distances = np.linalg.norm(self.cluster_centroids - embedding, axis=1)
            nearest_distance = float(np.min(distances))
            if nearest_distance > self.max_cluster_distance:
                reasons.append(
                    f"outlier_embedding: distance {nearest_distance:.4f} "
                    f"exceeds threshold {self.max_cluster_distance:.4f}"
                )

        # Check text length: very short documents carry little semantic content
        # and are a common vehicle for crafted embeddings
        word_count = len(document_text.split())
        if word_count < 5:
            reasons.append("suspiciously_short_document")

        return EmbeddingValidationResult(
            valid=len(reasons) == 0,
            reasons=reasons,
            embedding_norm=norm,
            nearest_cluster_distance=nearest_distance,
        )
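
The validator expects a baseline of cluster centroids and a 99th-percentile distance threshold. One way to produce them is to cluster embeddings of known-good documents; this sketch uses a small pure-numpy Lloyd's k-means, and the `compute_baseline` helper, cluster count, and iteration count are illustrative assumptions, not part of the validator's API.

```python
# compute_baseline.py - illustrative baseline computation for EmbeddingValidator
import numpy as np

def compute_baseline(known_good: np.ndarray, n_clusters: int = 4,
                     iters: int = 25, seed: int = 0):
    """Cluster known-good embeddings with a small Lloyd's k-means and
    return (centroids, 99th-percentile nearest-centroid distance)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(known_good), size=n_clusters, replace=False)
    centroids = known_good[idx].copy()
    for _ in range(iters):
        # Assign each embedding to its nearest centroid, then recompute means
        d = np.linalg.norm(known_good[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(n_clusters):
            members = known_good[labels == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    nearest = np.linalg.norm(
        known_good[:, None, :] - centroids[None, :, :], axis=2
    ).min(axis=1)
    return centroids, float(np.percentile(nearest, 99))

# Feed the result into the validator before ingestion starts:
#   centroids, threshold = compute_baseline(known_good_embeddings)
#   validator.load_baseline(centroids, threshold)
```

The pairwise-distance computation materialises an (n, k, d) array, so for corpora in the millions of documents a batched or library-backed k-means is the more practical choice.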

Semantic Drift Detection

# prometheus-embedding-drift.yaml
# Monitor for sudden changes in the distribution of newly indexed embeddings
groups:
  - name: embedding-drift
    interval: 5m
    rules:
      # Average distance of new embeddings to their nearest cluster centroid,
      # computed from the histogram's _sum and _count series (averaging the
      # _bucket series directly would be meaningless)
      - record: embedding:avg_cluster_distance:5m
        expr: >
          sum(rate(embedding_nearest_cluster_distance_sum[5m])) by (index_name)
          /
          sum(rate(embedding_nearest_cluster_distance_count[5m])) by (index_name)

      # Alert when new embeddings are systematically further from known clusters
      - alert: EmbeddingSemanticDrift
        expr: >
          embedding:avg_cluster_distance:5m > 0.45
          and
          rate(embedding_documents_indexed_total[5m]) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Semantic drift detected in {{ $labels.index_name }}"
          description: >
            Average distance of new embeddings to cluster centroids has increased
            to {{ $value | humanize }}. This may indicate adversarial document
            injection or a significant shift in ingested content.

      # Alert on sudden spike in embedding validation failures
      - alert: EmbeddingValidationFailureSpike
        expr: >
          rate(embedding_validation_failures_total[5m])
          / rate(embedding_documents_indexed_total[5m]) > 0.1
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "{{ $value | humanizePercentage }} of new embeddings failing validation"
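
These rules assume the validation service exports the referenced metrics. A minimal instrumentation sketch using `prometheus_client`, where the metric names mirror the rules above and the `record_validation` helper is illustrative:

```python
# metrics.py - export the metrics the drift rules consume
from prometheus_client import Counter, Histogram

DOCS_INDEXED = Counter(
    "embedding_documents_indexed_total", "Documents indexed", ["index_name"])
VALIDATION_FAILURES = Counter(
    "embedding_validation_failures_total", "Embedding validation failures",
    ["index_name"])
CLUSTER_DISTANCE = Histogram(
    "embedding_nearest_cluster_distance",
    "Distance of a new embedding to its nearest baseline centroid",
    ["index_name"],
    buckets=[0.1, 0.2, 0.3, 0.45, 0.55, 0.7, 1.0],
)

def record_validation(index_name: str, result) -> None:
    """Record one validation outcome; result is an EmbeddingValidationResult."""
    DOCS_INDEXED.labels(index_name).inc()
    CLUSTER_DISTANCE.labels(index_name).observe(result.nearest_cluster_distance)
    if not result.valid:
        VALIDATION_FAILURES.labels(index_name).inc()
```

The bucket boundaries deliberately bracket the 0.45 alert threshold and the 0.55 rejection threshold so dashboards can show how close ingestion is running to either limit.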

Input Sanitisation Before Embedding

# pre_embedding_sanitizer.py - clean documents before they reach the embedding model
import re
import hashlib

class PreEmbeddingSanitizer:
    """
    Sanitize document content before embedding.
    Prevents adversarial text patterns that manipulate embedding geometry.
    """

    # Patterns that attempt to manipulate embedding space positioning
    ADVERSARIAL_PATTERNS = [
        # Repeated keyword stuffing (inflates relevance for specific queries)
        (r"(\b\w+\b)(\s+\1){10,}", "keyword_stuffing"),
        # Invisible unicode characters used to shift embeddings
        (r"[\u200b\u200c\u200d\ufeff\u00ad]{3,}", "invisible_unicode"),
        # Base64-encoded payloads hidden in documents (long hashes and URLs
        # can false-positive; treat this as a warning, not a hard reject)
        (r"[A-Za-z0-9+/]{100,}={0,2}", "encoded_payload"),
        # Homoglyph substitution (Cyrillic characters mixed with Latin text)
        (r"[\u0400-\u04ff].*[A-Za-z]", "homoglyph_mixing"),
    ]

    def __init__(self):
        self.seen_hashes = set()

    def sanitize(self, text: str) -> tuple[str, list[str]]:
        warnings = []

        # Check for duplicate content after normalisation (exact-match only;
        # true near-duplicate detection would need MinHash or SimHash)
        content_hash = hashlib.sha256(text.strip().lower().encode()).hexdigest()
        if content_hash in self.seen_hashes:
            warnings.append("duplicate_document")
        self.seen_hashes.add(content_hash)

        # Check for adversarial patterns
        for pattern, category in self.ADVERSARIAL_PATTERNS:
            if re.search(pattern, text):
                warnings.append(f"adversarial_pattern:{category}")

        # Strip invisible unicode
        cleaned = re.sub(r"[\u200b\u200c\u200d\ufeff\u00ad]", "", text)

        # Normalise whitespace
        cleaned = re.sub(r"\s+", " ", cleaned).strip()

        return cleaned, warnings
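
A quick standalone illustration of two of these checks: the repeated-word pattern trips on keyword stuffing, and the invisible-unicode strip removes zero-width characters without touching visible text. The sample strings are made-up examples.

```python
import re

# Zero-width characters are flagged when they appear in runs of three or
# more, and are always stripped from the cleaned output.
INVISIBLE = r"[\u200b\u200c\u200d\ufeff\u00ad]"
text = "refund\u200b\u200b\u200b policy override"
flagged = bool(re.search(INVISIBLE + "{3,}", text))   # True
cleaned = re.sub(INVISIBLE, "", text)                  # "refund policy override"

# A word repeated more than ten consecutive times trips the
# keyword-stuffing check.
stuffed = "discount " * 12
stuffing = bool(re.search(r"(\b\w+\b)(\s+\1){10,}", stuffed))  # True
```

Note that the cleaned text differs from the original only in the invisible characters, which is exactly why these payloads survive visual review but not sanitisation.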

Kubernetes Deployment for Embedding Validation Service

# embedding-validator-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: embedding-validator
  namespace: ai-pipeline
spec:
  replicas: 2
  selector:
    matchLabels:
      app: embedding-validator
  template:
    metadata:
      labels:
        app: embedding-validator
    spec:
      containers:
        - name: validator
          image: internal-registry/embedding-validator:1.4.0
          ports:
            - containerPort: 8080
          env:
            - name: BASELINE_PATH
              value: "/data/baselines/cluster_centroids.npy"
            - name: MAX_CLUSTER_DISTANCE
              value: "0.55"
            - name: EMBEDDING_DIM
              value: "1536"
          resources:
            requests:
              cpu: 200m
              memory: 512Mi
            limits:
              cpu: 500m
              memory: 1Gi
          volumeMounts:
            - name: baseline-data
              mountPath: /data/baselines
              readOnly: true
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 10
      volumes:
        - name: baseline-data
          persistentVolumeClaim:
            claimName: embedding-baselines-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: embedding-validator
  namespace: ai-pipeline
spec:
  selector:
    app: embedding-validator
  ports:
    - port: 8080
      targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: embedding-validator-policy
  namespace: ai-pipeline
spec:
  podSelector:
    matchLabels:
      app: embedding-validator
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: document-ingestion
      ports:
        - port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: vector-store
      ports:
        - port: 6333
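
On the ingestion side, calls to this service should fail closed: an unreachable validator must block indexing rather than let documents through unvalidated. A client sketch using only the standard library; the `/validate` path, JSON payload shape, and response format are assumptions about the validator's HTTP API, not a documented contract.

```python
# ingestion_client.py - fail-closed call to the embedding-validator service
import json
import urllib.error
import urllib.request

def validate_or_block(embedding: list, text: str,
                      url: str = "http://embedding-validator.ai-pipeline.svc:8080/validate",
                      timeout: float = 2.0) -> dict:
    """POST the embedding and text to the validator; on any transport
    failure, return a rejection so the caller queues instead of indexing."""
    payload = json.dumps({"embedding": embedding, "document_text": text}).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return json.load(resp)
    except OSError:
        # Covers connection refused, DNS failure, and timeouts (URLError is
        # an OSError subclass). Fail closed: treat the outage as a rejection
        # and queue the document for later validation.
        return {"valid": False, "reasons": ["validator_unreachable"]}
```

Combined with the NetworkPolicy above, this gives a single choke point: only the ingestion pods can reach the validator, and the validator is the only path to the vector store.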

Expected Behaviour

  • All documents pass through the sanitizer before embedding generation
  • Embedding validation rejects vectors with anomalous norms, dimensions, or cluster distances
  • Duplicate documents are detected and flagged before indexing
  • Semantic drift alerts fire within 15 minutes of sustained anomalous ingestion
  • Validation failure rate above 10% triggers a critical alert and pauses ingestion
  • Adversarial text patterns (keyword stuffing, invisible unicode, homoglyphs) are stripped or flagged before embedding

Trade-offs

  • Cluster distance threshold. Impact: rejects embeddings far from known topics. Risk: legitimate new topics get flagged as anomalous. Mitigation: retrain baselines monthly; allow manual override for new document categories with approval.
  • Duplicate detection. Impact: prevents flooding attacks. Risk: legitimate re-indexing of updated documents is blocked. Mitigation: use content-hash with version tracking; allow updates that change more than 20% of content.
  • Pre-embedding sanitisation. Impact: removes adversarial text patterns. Risk: overly aggressive cleaning may alter document meaning. Mitigation: log all modifications; allow human review of sanitised documents before final indexing.
  • Embedding validation latency. Impact: adds 10-50 ms per document at ingestion time. Risk: slows bulk indexing operations. Mitigation: run validation asynchronously for batch ingestion, synchronously for real-time paths.

Failure Modes

  • Baseline too old. Symptom: new legitimate documents consistently rejected. Detection: validation failure rate climbs steadily over weeks. Recovery: rebuild cluster centroids from the current corpus; schedule a monthly baseline refresh.
  • Adversarial bypass. Symptom: attacker crafts a document that passes all checks but poisons retrieval. Detection: retrieved results degrade in quality; users report inaccurate answers. Recovery: manually review recently indexed documents; add the bypass technique to the pattern list.
  • Validator service down. Symptom: documents indexed without validation. Detection: health check failures; gap in validation metrics. Recovery: the ingestion pipeline should block (not skip) when the validator is unreachable; queue documents for later validation.
  • False positive on legitimate content. Symptom: good documents rejected at ingestion. Detection: data team reports missing documents; ingestion rejection rate spikes. Recovery: review and widen the cluster distance threshold; add the document category to the baseline.

When to Consider a Managed Alternative

Embedding security requires maintaining baselines, updating adversarial pattern detection, and monitoring drift across potentially millions of vectors. This operational load scales with corpus size.

  • Pinecone (#147): Managed vector database with built-in metadata filtering and access control. Monitoring dashboards for index health.
  • Weaviate (#148): Self-hosted or cloud-managed vector database with OIDC authentication, multi-tenancy, and RBAC for collections.
  • Qdrant (#149): High-performance vector database with payload filtering, snapshots for rollback, and access control.

Premium content pack: embedding security configuration. Includes baseline computation scripts, the embedding validation service (Python), Prometheus alerting rules for semantic drift, a pre-embedding sanitisation library, and an adversarial embedding detection test suite.