Surveillance RAG on vLLM + GKE

vLLM · GKE · PGVector · YOLO · CLIP · Llama 3.1

Challenge

A physical security surveillance system relied on cloud LLM APIs (GPT-4, Grok) and Pinecone for vector search. This created data privacy concerns, unpredictable costs, and vendor lock-in. The system needed to move to fully self-hosted infrastructure while maintaining low-latency, multi-modal search over thousands of surveillance detections.

Solution

We migrated the entire stack to Google GKE with self-hosted Llama 3.1 8B on vLLM, replaced Pinecone with PGVector on PostgreSQL, containerized the CV detection pipeline and RAG chat as separate microservices, and codified everything in Terraform and Helm.
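Because vLLM speaks the OpenAI chat-completions wire format, replacing the cloud API is largely a matter of repointing the client at the in-cluster service. A minimal sketch of building such a request; the service URL and model ID are assumptions, not the deployed values:

```python
import json

def chat_request(messages,
                 model="meta-llama/Llama-3.1-8B-Instruct",
                 base_url="http://vllm.default.svc.cluster.local:8000"):
    """Build an OpenAI-compatible chat-completions request for a
    self-hosted vLLM service. URL and model ID are illustrative;
    any OpenAI-style client can simply be pointed at base_url."""
    url = f"{base_url}/v1/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": messages,
        "temperature": 0.2,  # low temperature for factual surveillance queries
    })
    return url, body

url, body = chat_request([{"role": "user", "content": "Any vehicles after 10pm?"}])
```

The same payload shape works against the cloud API and the self-hosted endpoint, which is what makes the migration incremental rather than a rewrite.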

Architecture

  • vLLM serving Llama 3.1 8B Instruct on NVIDIA L4 GPU
  • PGVector on PostgreSQL 16 with 512-dim CLIP embeddings
  • CV Pipeline: YOLO + Mask R-CNN detection, CLIP embedding, BLIP captioning
  • ReAct agent loop with native tool calling for structured queries
  • FastAPI service with OpenAI-compatible endpoints
  • GCS Fuse CSI for surveillance image access
  • Terraform IaC for GKE, VPC, IAM, and GCS
  • Helm chart for one-command deployment
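With PGVector in place of Pinecone, semantic search over detections reduces to ordinary SQL. A minimal sketch of a top-k cosine search; the `detections(id, caption, embedding)` schema is an illustrative assumption, not the deployed one:

```python
def to_pgvector(vec):
    """Render a Python float list as a pgvector literal, e.g. '[0.5,-1.0]'."""
    return "[" + ",".join(f"{x:.6f}" for x in vec) + "]"

def nearest_detections_sql(k=5):
    """Top-k similarity search; `<=>` is pgvector's cosine-distance
    operator, so ascending order means most-similar first."""
    return (
        "SELECT id, caption, embedding <=> %(q)s::vector AS dist "
        "FROM detections "
        "ORDER BY embedding <=> %(q)s::vector "
        f"LIMIT {int(k)}"
    )
```

With psycopg this would execute as `cur.execute(nearest_detections_sql(5), {"q": to_pgvector(query_embedding)})`, where `query_embedding` is the 512-dim CLIP vector of the user's query.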

Results

  • 100% self-hosted: no cloud LLM APIs
  • ~60% cost reduction vs. cloud APIs

Multi-Tenant LLM Platform on GKE

GKE · vLLM · Terraform · Istio · H100

Challenge

An enterprise needed a shared LLM serving platform that could host multiple models simultaneously, isolate tenant traffic, and autoscale GPU resources based on demand — all while maintaining sub-second response times.

Solution

We designed a multi-model vLLM deployment on GKE with Istio service mesh for tenant isolation, prefix-aware request routing, and GPU autoscaling from 0 to N based on queue depth. Terraform modules made the platform reproducible across environments.

Key Deliverables

  • GKE cluster with H100 and L4 GPU node pools
  • Istio service mesh with per-tenant rate limiting
  • HPA scaling on vLLM request queue metrics
  • GCS model weight cache for fast cold starts
  • Prometheus + Grafana monitoring dashboards
  • Terraform modules for multi-env deployment
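The queue-depth scaling follows the standard Kubernetes HPA rule, desired = ceil(current × currentMetric / targetMetric), which for an average-queue-per-replica metric simplifies to ceil(queue / target). A sketch of that arithmetic; note that plain HPA cannot activate from zero, so the scale-from-zero branch assumes an external activator such as KEDA:

```python
import math

def desired_replicas(current, queue_depth, target_per_replica, max_replicas):
    """HPA-style scaling on vLLM request queue depth:
    desired = ceil(current * avg_queue / target) = ceil(queue_depth / target)."""
    if current == 0:
        # HPA alone cannot scale from zero; an external activator
        # (e.g. KEDA) is assumed to start the first replica.
        return 1 if queue_depth > 0 else 0
    desired = math.ceil(queue_depth / target_per_replica)
    return min(desired, max_replicas)
```

For example, 100 queued requests against a target of 25 per replica yields 4 desired replicas, clamped to the node pool's maximum.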

Automated Security Camera Analytics

YOLO · Mask R-CNN · Kubernetes · GPU

Challenge

A residential security operation needed automated analysis of camera feeds — detecting people, vehicles, and animals, classifying them as known/unknown entities, and enabling natural language search over detection history.

Solution

We built a Kubernetes CronJob pipeline that processes images through a multi-model stack (YOLO for detection, Mask R-CNN for segmentation, CLIP for embedding, BLIP for captioning), stores results in PGVector, and exposes them through a RAG-powered chat interface.

Pipeline Architecture

  • GCS-mounted image ingestion via CSI driver
  • Multi-model detection: YOLOv8 + Mask R-CNN
  • CLIP 512-dim embeddings for semantic search
  • BLIP auto-captioning for natural language context
  • Friendly/unfriendly entity classification
  • Batch upsert to PGVector with IVFFlat indexing
  • L4 GPU spot instances for cost efficiency

Have a Similar Challenge?

Let's discuss how we can architect and deploy the right solution for your AI infrastructure needs.

Get in Touch →