
Knowledge Base

The Knowledge Base turns your agents from generic conversationalists into domain experts. Upload your documentation, product guides, FAQs, or any reference material, and your agents will use it to answer questions accurately — grounded in your actual content rather than guessing.

How RAG Works

The Knowledge Base is powered by Retrieval-Augmented Generation (RAG), a technique that combines search with AI generation. Here is exactly how it works, step by step:

  1. Upload — You upload documents (PDF, TXT, MD, CSV) or provide URLs to crawl. The raw content is extracted and stored on your VPS.
  2. Chunk — Each document is split into smaller pieces called chunks. ClawHQ uses a chunk size of 2,000 characters with a 200-character overlap between consecutive chunks. The overlap ensures that no information is lost at chunk boundaries — if a key sentence spans two chunks, the overlap captures it in both.
  3. Embed — Each chunk is converted into a numerical vector (an embedding) using the all-MiniLM-L6-v2 model, which produces 384-dimensional vectors. These vectors capture the semantic meaning of each chunk, so chunks about similar topics end up close together in vector space.
  4. Search — When a user asks a question, the question is also embedded into a vector. The system searches for chunks whose vectors are most similar to the question vector. This is called semantic search — it finds relevant content even when the exact words differ.
  5. Inject Context — The most relevant chunks are inserted into the prompt as context, placed before the user's question. The model sees both the retrieved content and the question together.
  6. Model Answers — The AI model generates its response using the injected context. Because the relevant information is right there in the prompt, the model can give accurate, grounded answers instead of relying solely on its training data.
Why RAG matters: Without RAG, your agent can only answer from its general training knowledge. With RAG, it answers from your data — your product docs, your policies, your specific domain expertise. This dramatically reduces hallucination and increases answer accuracy.
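The chunking step above (2,000 characters with a 200-character overlap) can be sketched as a sliding window. This is an illustrative helper, not ClawHQ's actual implementation; the window advances by `size - overlap` so each chunk repeats the last 200 characters of the previous one:

```python
def chunk_text(text: str, size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks whose boundaries overlap."""
    if size <= overlap:
        raise ValueError("chunk size must exceed overlap")
    step = size - overlap  # each chunk starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # final chunk reached the end of the text
    return chunks

doc = "x" * 5000
chunks = chunk_text(doc)
# Three chunks starting at offsets 0, 1800, 3600; the last 200 characters
# of each chunk reappear at the start of the next one.
```

With the default settings, a 5,000-character document produces three chunks of 2,000, 2,000, and 1,400 characters, and any sentence that straddles a boundary appears whole in at least one chunk.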

Supported File Types

The Knowledge Base accepts four file formats:

  • PDF — Product manuals, research papers, contracts, any document saved as PDF. Text is extracted automatically; scanned images within PDFs are not supported.
  • TXT — Plain text files of any kind.
  • MD — Markdown files, including those with frontmatter. Markdown formatting is preserved during chunking.
  • CSV — Tabular data. Each row is treated as a potential chunk boundary, making CSV ideal for FAQ lists or structured reference data.

Upload Files or Crawl URLs

You can populate your Knowledge Base in two ways:

  • File upload — Drag and drop files directly into the dashboard, or click to browse. Multiple files can be uploaded simultaneously. Each file is processed (extracted, chunked, embedded) automatically.
  • URL crawl — Provide a URL and ClawHQ will fetch the page content, extract the text, and process it through the same chunking and embedding pipeline. This is useful for ingesting existing web documentation, help center articles, or blog posts.
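The "extract the text" step of a crawl can be sketched with Python's standard-library `HTMLParser`. The class below is a simplified stand-in for whatever extraction ClawHQ actually uses; it collects visible text from an inline HTML string while skipping `<script>` and `<style>` contents:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible page text, ignoring script and style blocks."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # >0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

html_page = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Refund Policy</h1><p>Refunds are issued within 14 days.</p></body></html>"
)
parser = TextExtractor()
parser.feed(html_page)
text = " ".join(parser.parts)
# text == "Refund Policy Refunds are issued within 14 days."
```

Once extracted, this text enters the same chunking and embedding pipeline as an uploaded file.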

Hybrid Search

ClawHQ uses hybrid search, which combines two search strategies simultaneously:

  • Vector search — Finds semantically similar content using embedding cosine similarity. Catches relevant results even when the wording differs from the query.
  • Keyword search — Traditional text matching that finds exact terms, product names, error codes, and other specific strings that semantic search might rank lower.

Both searches run in parallel, and their results are merged using a weighted combination. This gives you the best of both worlds: semantic understanding plus precise keyword matching.
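The weighted merge can be sketched as a weighted sum over two score maps. The 0.7/0.3 weights below are illustrative assumptions, not ClawHQ's documented values; note how a strong keyword hit can lift a chunk above a semantically closer one:

```python
def hybrid_merge(vector_scores: dict[str, float],
                 keyword_scores: dict[str, float],
                 vector_weight: float = 0.7,
                 keyword_weight: float = 0.3) -> list[tuple[str, float]]:
    """Merge two chunk-id -> score maps with a weighted sum, best first."""
    ids = set(vector_scores) | set(keyword_scores)
    merged = {
        cid: vector_weight * vector_scores.get(cid, 0.0)
             + keyword_weight * keyword_scores.get(cid, 0.0)
        for cid in ids
    }
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

ranked = hybrid_merge(
    {"c1": 0.82, "c2": 0.40},   # semantic similarity scores
    {"c2": 1.0, "c3": 0.5},     # keyword scores (e.g. an exact error code match)
)
# "c2" ranks first: its exact keyword match outweighs "c1"'s higher semantic score.
```

Here chunk `c2` wins the top spot despite a weaker semantic score, which is exactly the behavior hybrid search exists to provide for product names and error codes.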

Relevance Threshold

The default relevance threshold is a cosine similarity score of 0.3. Chunks scoring below this threshold are excluded from the context injected into the model prompt, which prevents low-quality matches from polluting the context window.

You can adjust the threshold in the Knowledge Base settings. A higher threshold (e.g., 0.5) means only very closely matching chunks are used — more precise but potentially missing relevant content. A lower threshold (e.g., 0.2) casts a wider net but may include marginally relevant chunks.
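Threshold filtering reduces to computing cosine similarity between the query vector and each chunk vector and keeping only the chunks that clear the cutoff. The sketch below uses 2-dimensional toy vectors in place of the real 384-dimensional embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(y * y for y in b))
    return dot / (mag_a * mag_b)

def filter_chunks(query_vec: list[float],
                  chunk_vecs: dict[str, list[float]],
                  threshold: float = 0.3) -> list[str]:
    """Keep only chunk ids whose similarity to the query meets the threshold."""
    return [cid for cid, vec in chunk_vecs.items()
            if cosine(query_vec, vec) >= threshold]

query = [1.0, 0.0]
chunks = {"close": [0.9, 0.4], "far": [0.1, 1.0]}
kept = filter_chunks(query, chunks)
# kept == ["close"]: "far" scores roughly 0.10, below the 0.3 cutoff.
```

Raising `threshold` to 0.5 would make the filter stricter, and lowering it to 0.2 would admit more marginal chunks, mirroring the trade-off described above.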

Tip: Use the "Test Your KB" feature (described below) to experiment with different thresholds. Search for questions your users actually ask and verify that the right chunks are returned.

Test Your Knowledge Base

The "Test Your KB" interface lets you run search queries against your Knowledge Base and see exactly which chunks are returned, along with their relevance scores. This is invaluable for:

  • Verifying that your documents are chunked correctly
  • Tuning the relevance threshold
  • Identifying gaps in your content (queries that return no results)
  • Confirming that the right content surfaces for expected questions

Type a question in the search bar, and the results panel shows each matching chunk with its source document, chunk index, similarity score, and the full chunk text.

Retrieval Tracking

Retrieval tracking records which documents and chunks are used most frequently in agent responses. Over time, this data reveals which parts of your Knowledge Base are most valuable and which documents are rarely accessed.

Use retrieval tracking to prioritize content updates. Documents that are retrieved frequently but receive negative feedback may need rewriting. Documents that are never retrieved may need better titles or additional context to improve their discoverability.

Chunk Viewer

The Chunk Viewer lets you inspect individual chunks within any document. See how your content was split, verify that chunk boundaries make sense, and identify cases where important information was split across chunks in an unhelpful way.

Each chunk displays its index number, character count, the first and last few words of overlap with adjacent chunks, and its embedding vector magnitude. You can navigate through chunks sequentially or jump to a specific chunk by index.

Connectors

Connect external data sources to keep your Knowledge Base synchronized with content that lives outside of ClawHQ:

  • Google Drive — Connect your Google Drive account and select specific files or folders to sync. When the source document changes in Drive, the Knowledge Base automatically re-processes it.
  • Notion — Connect your Notion workspace and select databases or pages to sync. Notion content is extracted, chunked, and embedded just like uploaded files.

Connectors check for updates periodically and re-index changed content automatically. You can also trigger a manual sync at any time from the Connectors panel.

Feedback

When a user receives an answer that was sourced from the Knowledge Base, they can provide thumbs-up or thumbs-down feedback. This feedback is linked back to the specific chunks that were used, giving you a direct signal about content quality.

Over time, feedback data helps you identify documents that are helpful versus those that lead to poor answers. Combined with retrieval tracking, feedback creates a continuous improvement loop for your Knowledge Base content.

Reindexing Documents

If you update the embedding model, change chunk settings, or modify a document's content, you can trigger a reindex. Reindexing re-processes the document through the full pipeline: extraction, chunking, and embedding. The old chunks are replaced with the new ones atomically, so there is no downtime during reindexing.
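The atomic swap can be sketched with a toy in-memory index. This ignores overlap and embeddings and is not ClawHQ's implementation; the point is the design choice that the new chunk list is built completely before a single replacement step, so readers never observe a half-reindexed document:

```python
import threading

class ChunkIndex:
    """Toy index mapping document id -> chunk list, swapped atomically on reindex."""
    def __init__(self):
        self._chunks: dict[str, list[str]] = {}
        self._lock = threading.Lock()

    def reindex(self, doc_id: str, text: str, size: int = 2000) -> None:
        # Build the replacement chunk list fully before touching the live index.
        new_chunks = [text[i:i + size] for i in range(0, len(text), size)]
        with self._lock:
            self._chunks[doc_id] = new_chunks  # one swap replaces all old chunks

    def get(self, doc_id: str) -> list[str]:
        with self._lock:
            return list(self._chunks.get(doc_id, []))

idx = ChunkIndex()
idx.reindex("doc-1", "a" * 4500)   # 3 chunks
idx.reindex("doc-1", "b" * 1000)   # re-process: old chunks replaced in one step
```

Because readers only ever see the index before or after the swap, queries keep working throughout a reindex.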

Data Privacy

All Knowledge Base data — uploaded files, extracted text, chunks, and embedding vectors — is stored on your VPS. Your data never leaves your instance. The embedding model runs locally on the VPS, so document content is never sent to external services for processing.

Next steps: After populating your Knowledge Base, use the Analytics module to track how Knowledge Base-sourced answers affect your CSAT scores and resolution rates.
