Building a Local RAG Pipeline on the ASUS Ascent GX10

5/4/2026 - Dries Weyme - 13 minutes total reading time

llm rag nvidia ai

Intro

In the previous post I set up a local LLM stack on the ASUS Ascent GX10: vLLM for inference, Open WebUI for a chat interface, OpenCode for agentic coding, and Prometheus + Grafana for observability. I mentioned in closing that I wanted to dig into RAG properly — not just the knowledge upload button in Open WebUI, but a real pipeline with a dedicated vector store, a local embedding model, and actual control over chunking and retrieval.

This post is that follow-up. The use case: a tabletop game rules assistant. I’ve scanned the rulebook as a PDF, and I want to be able to ask natural-language questions about the rules and get precise, sourced answers — without hallucinations, and without sending my rulebook to an external API.


What we’ll be building

Compared to the previous setup, there are three new moving parts: a second vLLM instance dedicated to generating embeddings, a Qdrant vector database, and an Open WebUI Pipelines container that wires the whole retrieval flow together.

architecture-beta
  group workstation["Local Workstation"]
  group ascent["ASUS Ascent GX10"]
  group homelab["Homelab — HP EliteDesk G5 Mini"]

  service ingest(mdi:laptop)["Ingestion Script"] in workstation

  service webui(internet)["Open WebUI"] in ascent
  service pipelines(server)["OWU Pipelines"] in ascent
  service vllm_chat(server)["vLLM — Nemotron 30B"] in ascent
  service vllm_embed(server)["vLLM — Qwen3 Embed"] in ascent

  service qdrant(database)["Qdrant"] in homelab
  service prometheus(logos:prometheus)["Prometheus"] in homelab
  service grafana(logos:grafana)["Grafana"] in homelab

  webui:B -- T:pipelines
  pipelines:R -- L:vllm_chat
  pipelines:B -- T:vllm_embed
  pipelines:R -- L:qdrant
  vllm_chat:R -- L:prometheus
  vllm_embed:R -- L:prometheus
  qdrant:R -- L:prometheus
  prometheus:R -- L:grafana
  ingest:R -- L:vllm_embed
  ingest:R -- L:qdrant

The ingestion script runs once (or whenever I want to add new documents): it reads PDFs, generates embeddings via the embedding vLLM, and stores the resulting vectors in Qdrant.

At query time, a user types a question in Open WebUI. That request is intercepted by the Pipelines container, which acts as middleware for chat completion requests. The pipeline rewrites the query, retrieves relevant chunks from Qdrant, reranks them, and injects the best-matching passages as context into the final prompt before forwarding it to the chat model.


Model choices

Chat model: Nemotron 3 Nano 30B A3B FP8

For inference I’m running nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8. I went with the FP8 quantization to keep the memory footprint manageable, and I wanted to evaluate the quality of the Nemotron family in general, since it competes directly with the Qwen models I used previously.

A lot of models could have worked here: the main idea behind the setup is that the RAG layer provides most of the added value and context, so any reasonably capable model should be able to handle the final synthesis step.

Embedding model: Qwen3 Embedding 0.6B

Qwen/Qwen3-Embedding-0.6B is a compact embedding model that turns document text into embedding vectors. In my current setup it runs as a completely separate vLLM instance, though I could also have spun it up only temporarily to generate the embeddings, and then dedicated the full memory of my AI box to the chat model.


Infrastructure

Two vLLM instances + Open WebUI

The Docker Compose stack on the Ascent now runs four services: a chat vLLM, an embedding vLLM, Open WebUI, and the Pipelines container:

services:
  vllm-chat:
    image: vllm/vllm-openai:cu130-nightly
    container_name: vllm-chat
    privileged: true
    restart: unless-stopped
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HF_TOKEN=${HF_TOKEN}
    volumes:
      - /home/dries/.cache/huggingface:/root/.cache/huggingface
    ports:
      - "0.0.0.0:8000:8000"
    ipc: host
    networks:
      - ai
    shm_size: 64gb
    command: >
      nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8
      --host 0.0.0.0
      --port 8000
      --dtype auto
      --trust-remote-code
      --enable-prefix-caching
      --max-model-len 65536
      --max-num-batched-tokens 65536
      --max-num-seqs 4
      --gpu-memory-utilization 0.55

  vllm-embedding:
    image: vllm/vllm-openai:cu130-nightly
    container_name: vllm-embedding
    privileged: true
    restart: unless-stopped
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HF_TOKEN=${HF_TOKEN}
    volumes:
      - /home/dries/.cache/huggingface:/root/.cache/huggingface
    ports:
      - "0.0.0.0:8001:8001"
    ipc: host
    networks:
      - ai
    command: >
      Qwen/Qwen3-Embedding-0.6B
      --host 0.0.0.0
      --port 8001
      --trust-remote-code
      --max-model-len 32768
      --gpu-memory-utilization 0.35

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    networks:
      - ai
    environment:
      - OPENAI_API_BASE_URL=http://vllm-chat:8000/v1
      - OPENAI_API_KEY=not-needed
      - RAG_EMBEDDING_ENGINE=openai
      - RAG_EMBEDDING_MODEL=Qwen/Qwen3-Embedding-0.6B
      - RAG_OPENAI_API_BASE_URL=http://vllm-embedding:8001/v1
      - RAG_OPENAI_API_KEY=dummy
    volumes:
      - /models/open-webui:/app/backend/data
    ports:
      - "0.0.0.0:3000:8080"

  pipelines:
    image: ghcr.io/open-webui/pipelines:main
    container_name: pipelines
    restart: unless-stopped
    networks:
      - ai
    volumes:
      - /home/dries/source/OpenWebUiRagPipeline:/app/pipelines
    ports:
      - "9099:9099"
    environment:
      - PIPELINES_DIR=/app/pipelines
      - PIPELINES_REQUIREMENTS_PATH=/app/pipelines/requirements.txt

networks:
  ai:
    driver: bridge

Open WebUI gets the embedding vLLM wired in via environment variables, so its native RAG features (document uploads, knowledge collections) also use the local embedding model rather than an external API.

Qdrant on the HP EliteDesk

Qdrant runs on the HP EliteDesk alongside Prometheus and Grafana. It’s a lightweight service — no GPU needed — and offloading it from the Ascent preserves more unified memory for the models.

services:
  qdrant:
    image: qdrant/qdrant:latest
    container_name: qdrant
    restart: unless-stopped
    ports:
      - "6333:6333"   # REST API
      - "6334:6334"   # gRPC
    volumes:
      - qdrant_data:/qdrant/storage
    environment:
      - QDRANT__SERVICE__GRPC_PORT=6334

volumes:
  qdrant_data:

Observability: adding Qdrant metrics to our Prometheus instance

The Prometheus scrape config gains two targets (both vLLM ports) and a new Qdrant job:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s

scrape_configs:
  - job_name: vllm
    static_configs:
      - targets: ['<asus-ip>:8000', '<asus-ip>:8001']
    metrics_path: /metrics
    scrape_interval: 5s
    scrape_timeout: 4s
  - job_name: qdrant
    static_configs:
      - targets: ['<homelab-ip>:6333']
    metrics_path: /metrics
    scrape_interval: 15s
    scrape_timeout: 10s

Ingestion pipeline

Before any queries can be answered, the rulebook needs to be chunked, embedded, and stored in Qdrant. A quick primer on what’s actually happening under the hood: An embedding is a dense numerical vector that captures the semantic meaning of a piece of text. Two sentences that mean roughly the same thing will produce vectors pointing in similar directions, even if they share no words at all. The Google ML Crash Course has a solid walkthrough of the underlying theory if you want to go deeper.

Querying Qdrant works by embedding the user’s question in that same vector space, then finding stored chunks whose vectors are closest to it. “Closest” is measured by cosine similarity: the cosine of the angle between two vectors. A score of 1 means the vectors point in exactly the same direction (a strong semantic match), 0 means they’re orthogonal (unrelated). The Qdrant docs cover how the vector index handles this at scale using approximate nearest-neighbour search, keeping retrieval fast even over large collections.
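As a concrete illustration with toy 3-dimensional vectors (real embeddings have hundreds of dimensions), cosine similarity reduces to a dot product divided by the vector norms:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" — illustrative values, not real model output
query = [0.9, 0.1, 0.0]
chunk_related = [0.8, 0.2, 0.1]    # points in a similar direction
chunk_unrelated = [0.0, 0.1, 0.9]  # nearly orthogonal

print(cosine_similarity(query, chunk_related))    # close to 1
print(cosine_similarity(query, chunk_unrelated))  # close to 0
```

Qdrant computes exactly this score (when the collection is configured with cosine distance), just over millions of vectors with an approximate index instead of a brute-force loop.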

A short Python script handles this, using LlamaIndex’s PyMuPDFReader and SentenceSplitter:

from pathlib import Path

import qdrant_client
from llama_index.core import Settings, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.readers.file import PyMuPDFReader
from llama_index.vector_stores.qdrant import QdrantVectorStore

client = qdrant_client.QdrantClient(url="http://localhost:6333")
vector_store = QdrantVectorStore(client=client, collection_name="documents")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

Settings.embed_model = OpenAIEmbedding(
    api_base="http://<asus-ip>:8001/v1",
    api_key="dummy",
    model_name="Qwen/Qwen3-Embedding-0.6B",
)

Settings.node_parser = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=50,
)

loader = PyMuPDFReader()
documents = []
for f in Path("./testdocs").glob("*.pdf"):
    documents.extend(loader.load(file_path=f))

index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
print(f"Indexed {len(documents)} documents into Qdrant")

The chunk size of 1024 tokens is on the larger side, and intentionally so. Game rules tend to be verbose and self-referential — a rule about “turn order” might need the surrounding paragraph to be unambiguous. Larger chunks preserve that surrounding context at the cost of some retrieval precision, which the reranker compensates for downstream.
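To make the chunk_size/chunk_overlap mechanics concrete, here is a toy word-based splitter. This is only an approximation of what SentenceSplitter does (the real one counts tokens and respects sentence boundaries), but it shows how overlap duplicates a slice of text at each chunk boundary so a rule split mid-explanation still carries some of its context:

```python
def chunk_words(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Greedy word-based chunking with overlap — a simplified stand-in
    for LlamaIndex's token-based SentenceSplitter. Assumes chunk_size > overlap."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # how far the window advances each iteration
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already reached the end of the text
    return chunks

rules_text = " ".join(f"rule{i}" for i in range(10))
for chunk in chunk_words(rules_text, chunk_size=4, overlap=1):
    print(chunk)
```

With chunk_size=4 and overlap=1, each chunk repeats the last word of the previous one; scale that up to 1024 tokens with a 50-token overlap and you get the ingestion settings above.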

For the rulebook I uploaded, embedding generation was quite fast: roughly 30 seconds for a 300-page rulebook. It produced ~700 points in Qdrant, which I briefly sanity-checked via the Qdrant dashboard by running a few top-k queries against the collection.


The RAG pipeline

Open WebUI’s Pipelines feature acts as a chat completion middleware: it lets you intercept every chat request with a custom Python class before it reaches the model. The pipeline here does four things before a single token is generated:

  1. Query rewriting — reword the user’s natural-language question into a concise search query optimized for rulebook terminology
  2. Retrieval — embed the rewritten query and pull the top-20 most similar chunks from Qdrant
  3. Reranking — pass those 20 candidates through a cross-encoder (cross-encoder/ms-marco-MiniLM-L-6-v2) and keep the top 5
  4. Fallback — if none of the reranked results clear the score threshold, expand the query into 2 alternatives and retry
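The four steps above can be sketched as a single control flow. The helper callables here are stand-ins for the pipeline methods covered below; the names and the dependency-injection shape are mine for illustration, not Open WebUI's API:

```python
MIN_SCORE = 0.3        # sigmoid-normalized cut-off (the MIN_SCORE valve)
RERANKER_TOP_N = 5     # candidates kept after the cross-encoder

def answer(question, rewrite, retrieve, rerank, expand, generate):
    """Skeleton of the four-step flow; each callable is injected so the
    retrieval and reranking details stay out of the control flow."""
    query = rewrite(question)                            # 1. query rewriting
    candidates = retrieve(query, top_k=20)               # 2. vector search
    scored = rerank(query, candidates)[:RERANKER_TOP_N]  # 3. reranking
    passing = [(c, s) for c, s in scored if s >= MIN_SCORE]
    if not passing:                                      # 4. fallback
        for alt in expand(query, [c for c, _ in scored]):
            scored = rerank(alt, retrieve(alt, top_k=20))[:RERANKER_TOP_N]
            passing = [(c, s) for c, s in scored if s >= MIN_SCORE]
            if passing:
                break
    if not passing:
        return "I couldn't find this in the rulebook."
    return generate(question, [c for c, _ in passing])
```

The important property is that the fallback only fires when the first pass produces nothing above the threshold, and the refusal message only appears when the expanded queries fail too.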

Let’s walk through the key parts.

Retrieval with AutoMergingRetriever

The initial retriever uses LlamaIndex’s AutoMergingRetriever on top of a standard top-k vector search. With the current flat SentenceSplitter chunks it behaves like a regular retriever, but wiring it in now means a later switch to hierarchical ingestion (using HierarchicalNodeParser to build a parent/child node tree) requires no pipeline changes: the retriever already knows how to promote sibling leaf chunks to their parent when enough of them match.

index = VectorStoreIndex.from_vector_store(vector_store, storage_context=storage_context)
self.retriever = AutoMergingRetriever(
    index.as_retriever(similarity_top_k=20),
    storage_context,
    verbose=False,
)

Query rewriting

Before hitting Qdrant, the pipeline rewrites the query. A rules assistant is a case where this pays off: users tend to ask colloquially (“can I attack twice?”) but the rulebook uses precise terminology (“ranged attack”, “action economy”). The rewrite step bridges that gap:

def _rewrite_query(self, user_message: str) -> str:
    prompt = (
        "Rewrite the following question as a concise search query optimized for "
        "retrieving tabletop wargame rules. Use technical terminology. "
        "Output only the rewritten query, nothing else.\n\n"
        f"Question: {user_message}"
    )
    return Settings.llm.complete(prompt).text.strip()

Fallback with query expansion

If the reranker finds no results above the score threshold, rather than giving up, the pipeline generates two alternative queries seeded by the weak first-pass results and tries again:

def _expand_query(self, original_query: str, first_pass_nodes) -> List[str]:
    snippets = "\n".join(
        f"- {n.get_text()[:200]}" for n in first_pass_nodes[:3] if n.get_text()
    )
    prompt = (
        f'A user asked: "{original_query}"\n\n'
        f"A rulebook search returned these loosely related snippets:\n{snippets}\n\n"
        f"Generate 2 short alternative search queries that might find more relevant rules. "
        f"Output only the queries, one per line, no numbering or explanation."
    )
    ...

The second-pass nodes get reranked and score-filtered again. If nothing passes, the pipeline returns a clean “I couldn’t find this in the rulebook” rather than hallucinating.

Score normalization

The cross-encoder outputs raw logits, not bounded probabilities. A sigmoid (an S-shaped function that squashes any real number into the 0–1 range) maps these to [0, 1] before comparing against the threshold, which makes the MIN_SCORE valve an interpretable confidence cut-off rather than an arbitrary raw logit value:

@staticmethod
def _normalize_score(score: float) -> float:
    try:
        return round(1 / (1 + math.exp(-score)), 3)
    except (OverflowError, TypeError):
        return 0.0
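To see what the normalization buys, here are a few made-up raw logits (illustrative values, not real MiniLM output) pushed through the same sigmoid. With the default MIN_SCORE of 0.3, a candidate needs a raw logit above roughly -0.85 to survive:

```python
import math

def normalize(score: float) -> float:
    # Same sigmoid as _normalize_score: squashes any real logit into [0, 1]
    return round(1 / (1 + math.exp(-score)), 3)

# Illustrative raw cross-encoder logits, from "clearly relevant" to "junk"
raw_logits = [4.2, 0.0, -0.8, -3.5]
print([normalize(s) for s in raw_logits])  # [0.985, 0.5, 0.31, 0.029]
```

The -0.8 logit just clears a 0.3 threshold while -3.5 is rejected, which is much easier to reason about than eyeballing unbounded raw scores.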

Handling thinking tokens

Nemotron (and other reasoning-capable models) emits its internal reasoning inside <think>...</think> tags. The pipeline’s _wrap_think_stream method ensures these are surfaced correctly in the Open WebUI chat: it opens the tag immediately, strips any duplicate model-emitted opening tag, synthesizes a closing tag if the model never closes the block, and then streams the answer portion without buffering.
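The actual method operates on a token stream; as a non-streaming sketch of the same invariants (exactly one opening tag, a closing tag synthesized even when the model never emits one — the function name and simplifications are mine):

```python
def wrap_think(raw: str) -> str:
    """Ensure output contains one balanced <think>...</think> block.
    Simplified, non-streaming version of the invariants described above."""
    OPEN, CLOSE = "<think>", "</think>"
    body = raw.removeprefix(OPEN)  # drop a duplicate model-emitted opening tag
    if CLOSE not in body:
        body += CLOSE              # model never closed the block: close it
    return OPEN + body             # the tag is always opened exactly once

print(wrap_think("reasoning...</think>The answer is 3."))
print(wrap_think("<think>reasoning...</think>Already well-formed."))
print(wrap_think("reasoning that was cut off mid-stream"))
```

In the streaming version the opening tag is emitted immediately and the synthesized close goes out at end-of-stream, but the guarantees are the same: Open WebUI always sees a well-formed reasoning block it can collapse in the chat UI.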


Configurable valves

The pipeline exposes all the tuneable parameters as Valves — Open WebUI renders these as form fields in the admin UI, so you can adjust them without touching the code:

Valve           | Default                | Purpose
--------------- | ---------------------- | -----------------------------------------------
TOP_K           | 20                     | Chunks retrieved from Qdrant before reranking
RERANKER_TOP_N  | 5                      | Chunks kept after reranking
MIN_SCORE       | 0.3                    | Sigmoid-normalized score cut-off
MAX_SOURCES     | 3                      | Source references shown in the response footer
LLM_MODEL       | Nemotron 30B           | Chat model to use
EMBEDDING_MODEL | Qwen3 Embed 0.6B       | Embedding model to use
RERANKER_MODEL  | ms-marco-MiniLM-L-6-v2 | Reranker to use

Each answer also includes a sourced footer showing which pages were used and their reranker confidence score, which makes it easy to spot when the pipeline is working well versus when the rulebook coverage is thin.
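For illustration, such a footer can be assembled from the reranked results. The (page, score) tuple shape here is a simplification I've assumed for the sketch; the real pipeline pulls page numbers and scores from LlamaIndex node metadata:

```python
MAX_SOURCES = 3  # the MAX_SOURCES valve

def build_footer(results: list[tuple[int, float]]) -> str:
    """results: (page_number, normalized_reranker_score), sorted best-first."""
    lines = [f"- p. {page} (confidence {score:.2f})"
             for page, score in results[:MAX_SOURCES]]
    return "Sources:\n" + "\n".join(lines)

print(build_footer([(42, 0.91), (17, 0.64), (43, 0.55), (8, 0.41)]))
```

Capping the footer at MAX_SOURCES keeps it readable while still exposing enough of the confidence scores to spot thin rulebook coverage at a glance.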


Key takeaways

  • Qdrant is simple to operate — near-zero config to get running, good Python client, and the built-in /metrics endpoint made adding it to the Prometheus scrape trivial.
  • Query rewriting makes a real difference — the gap between natural-language questions and rulebook phrasing is large enough that retrieval quality noticeably improves with the rewrite step. Worth the extra round-trip.
  • The reranker is the most important quality gate — pulling 20 candidates and cutting to 5 via the cross-encoder consistently surfaces better context than tuning TOP_K alone. Don’t skip it.
  • Open WebUI Pipelines is underrated — it’s a clean, low-friction way to wrap custom retrieval logic around any model. Hot-reloading pipelines from a mounted volume makes iteration fast.
  • Vocabulary is the pipeline’s blind spot — query rewriting closes a lot of the gap between natural language and rulebook phrasing, but it can’t guarantee synonym coverage. If the rewrite settles on “activating” an ability when the rulebook uses “triggering”, the relevant chunks may never surface. It’s a good argument for eventually exploring hybrid retrieval (combining dense vectors with sparse keyword search) so that exact terminology matches are never missed entirely.

Overall, I learned quite a bit about RAG as a pattern, and how to instrument it into something you’d actually want to use day-to-day.