Written by Mike Pearlstein, CISSP, MSc AI, CEO of Fusion Computing Limited. Helping Canadian businesses build and manage secure IT infrastructure since 2012 across Toronto, Hamilton, and Metro Vancouver.
Retrieval-augmented generation (RAG) is the architecture pattern that lets a large language model answer questions using your company’s own documents instead of guessing from its training data. It is the difference between ChatGPT making up an answer about your warranty policy and a chatbot quoting your warranty PDF with a footnote pointing to page 4.
For most Canadian SMBs evaluating AI right now, the question is not whether RAG matters. The real question is whether you can deploy it without a data scientist on staff, and without your client files leaving the country. This guide answers both.
Key Takeaways
- RAG grounds LLM answers in your documents (PDFs, SharePoint, wikis, ticketing data) by retrieving relevant chunks at query time and feeding them to the model along with the question.
- The original 2020 paper (Lewis et al., NeurIPS) showed RAG models generate “more specific, diverse and factual language” than parametric-only models on knowledge-intensive tasks.
- RAG is cheaper and lower-risk than fine-tuning for most SMB use cases. You update an index, not a model.
- For Canadian SMBs, the deployment decision is shaped by data residency (PIPEDA, provincial privacy law) more than by model choice. Canada-region Azure or AWS endpoints are the default starting point.
- Engagements vary by scope. Fusion Computing scopes and prices every RAG deployment up front based on data sources, document volume, and integration complexity.
What is RAG, in one paragraph?
Pre-Copilot prerequisite: RAG quality is bounded by the same SharePoint permission cascade Copilot reads. Start with the Pre-Copilot SharePoint Audit before tuning the retrieval layer.
RAG is a two-step pattern. Step one: when a user asks a question, the system searches your private knowledge base for the chunks of text most relevant to that question. Step two: those chunks get pasted into the prompt sent to a large language model along with the original question, and the model generates an answer grounded in that retrieved context.
The model still does the writing. The retrieval system controls what facts the model sees. That separation is the whole point.
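A minimal sketch of the two steps, assuming the OpenAI Python SDK and a pgvector index that already holds embedded chunks (the table layout, connection string, and prompt wording here are illustrative, not a specific production implementation):

```python
# Minimal two-step RAG sketch. Assumes chunks are already embedded and stored in a
# Postgres table chunks(id, source, page, content, embedding) with the pgvector
# extension; names and prompt wording are illustrative only.
import psycopg
from openai import OpenAI

client = OpenAI()

def answer(question: str, top_k: int = 5) -> str:
    # Step 1: retrieve the chunks closest in meaning to the question.
    q_vec = client.embeddings.create(
        model="text-embedding-3-large", input=question
    ).data[0].embedding
    vec_literal = "[" + ",".join(map(str, q_vec)) + "]"
    with psycopg.connect("dbname=rag") as conn:
        rows = conn.execute(
            "SELECT source, page, content FROM chunks "
            "ORDER BY embedding <=> %s::vector LIMIT %s",
            (vec_literal, top_k),
        ).fetchall()

    # Step 2: hand the retrieved chunks to the LLM as grounding context.
    context = "\n\n".join(f"[{src}, p.{page}] {text}" for src, page, text in rows)
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer only from the provided context and cite your sources."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return reply.choices[0].message.content
```

Swapping the generator model changes one string in this sketch. The retrieval half, what gets indexed and what comes back, is where the engineering lives.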
According to IBM Research’s 2023 explainer on retrieval-augmented generation, RAG “ensures that the model has access to the most current, reliable facts, and that users have access to the model’s sources, ensuring that its claims can be checked for accuracy.” That second half matters as much as the first. A grounded answer with a citation is auditable. A hallucinated answer with no citation is a liability.
Wondering whether RAG fits your stack? Book a 30-minute AI architecture review →
How RAG actually works in production
The diagram below traces a single question from user input to grounded answer. Every production RAG system Fusion Computing (FC) has built for a Canadian SMB follows this shape, with components swapped (Azure OpenAI vs Amazon Bedrock vs OpenAI direct) but the topology unchanged.
The components in plain language:
- Embedding model. A small model (text-embedding-3-large, Cohere Embed v3, or similar) converts text into a numerical vector that captures meaning. Your question becomes a vector. So does every chunk of every document you ingested.
- Vector index. A specialized database (Azure AI Search, Pinecone, Weaviate, pgvector on Postgres) stores those chunk-vectors. Given a question-vector, it returns the chunk-vectors closest in meaning, usually within milliseconds.
- Retriever. Modern production RAG systems use hybrid retrieval (keyword search plus vector search) and often add a reranking step. Microsoft’s Azure AI Search RAG documentation, last updated March 2026, recommends “hybrid queries that combine keyword (nonvector) and vector search for maximum recall” with semantic reranking on top.
- Generator. The LLM (GPT-4o, Claude Sonnet, Gemini Pro, or a Canada-hosted equivalent) receives the original question plus the retrieved chunks and generates the answer. The retrieved chunks act as context the model is instructed to ground its answer in.
- Citation surface. Production RAG systems return the source chunk IDs alongside the answer so the UI can show “According to HR-Policy.pdf page 12…” This is what makes the answer auditable.
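One common way to combine the keyword and vector result lists in a hybrid retriever is reciprocal rank fusion. The sketch below is a generic illustration of that merge step, not the internals of any particular search service:

```python
# Reciprocal rank fusion (RRF): merge a keyword ranking and a vector ranking of
# chunk IDs into a single hybrid ranking. Generic illustration of the merge step.
def rrf(keyword_hits: list[str], vector_hits: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in (keyword_hits, vector_hits):
        for rank, chunk_id in enumerate(ranking, start=1):
            # Chunks near the top of either list accumulate the most score.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A chunk ranked highly by both keyword and vector search outranks a chunk that
# appears in only one of the two lists.
print(rrf(["c7", "c2", "c9"], ["c2", "c7", "c5"]))  # ['c7', 'c2', 'c9', 'c5']
```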
The deeper engineering work in any FC RAG project is in chunking strategy (how you split documents before embedding), retrieval tuning (how many chunks to fetch, when to rerank), and access control (which retrieved chunks a given user is permitted to see). Choosing the LLM is comparatively trivial. Most of our clients can swap the model behind the same retrieval layer without touching the application.
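As a baseline for the chunking discussion, here is fixed-size token chunking with overlap using the tiktoken tokenizer. Production chunkers replace the fixed boundaries with semantic ones and add special handling for tables and lists, as the failure-modes section below describes:

```python
# Baseline token chunker with overlap, using the tiktoken tokenizer.
# Fixed-size chunking like this is the naive approach; semantic chunkers
# keep clauses, tables, and lists intact instead of cutting mid-structure.
import tiktoken

def chunk_text(text: str, chunk_tokens: int = 1000, overlap: int = 100) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_tokens - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_tokens]
        chunks.append(enc.decode(window))
        if start + chunk_tokens >= len(tokens):
            break
    return chunks
```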
RAG vs fine-tuning vs prompt engineering vs long-context
The four patterns get conflated in vendor pitches. They solve different problems. The table below is the reference we hand Canadian SMB CIOs when they ask which one to use.
“We’d been quoted $180,000 to fine-tune a model on our knowledge base. Fusion built a RAG pipeline against the same documents in six weeks, our staff can update it without retraining, and our compliance team signed off on it because the source documents never leave Canadian-region storage.”
| Pattern | When to use | Cost profile | Accuracy on private facts | Governance profile |
|---|---|---|---|---|
| RAG | Answer questions from private documents; documents change frequently; auditability matters. | Scope-dependent. Per-query cost is embedding + retrieval + LLM tokens. Documents update without retraining. | High when retrieval is tuned. Citations let you verify each answer. | Strong. Documents stay in your index, never enter model weights. Access control enforceable at retrieval layer. |
| Fine-tuning | Teach the model a style, format, or domain language. Not for teaching new facts. Requires hundreds to thousands of training examples. | High. Training run plus ongoing retraining whenever facts change. Significantly more than RAG for the same data set. | Unreliable for facts. Fine-tuned models still hallucinate and have no citation surface. | Weaker. Training data becomes part of model weights. Auditability is hard. |
| Prompt engineering | Stable reusable tasks where the inputs are short and the knowledge is general. ChatGPT-style copilot use cases. | Lowest. No infrastructure beyond the model API. | Low for private facts (model has not seen them). High for general reasoning. | Depends entirely on the model vendor’s data handling. No private data layer. |
| Long-context | Single-document analysis where the whole document fits in the context window (up to 1 to 2 million tokens for the largest current models). | Per-query token cost scales with document size. Expensive at volume. | High for the document in the window. Performance degrades as context grows. | Same as prompt engineering. No retrieval audit layer unless you build one. |
The plain-English version: if your business needs a chatbot that answers from your specific document corpus and documents update frequently, you want RAG. Fine-tuning is the wrong tool for “teach the model our content.” Long-context windows are useful inside a RAG system (sending more retrieved chunks per query) but rarely a substitute for one.
Why “just upload my files to ChatGPT” is not RAG
Two questions we get on every AI scoping call: is the ChatGPT file upload feature the same as RAG, and does Microsoft 365 Copilot already do this for me? Short answers: no, and partly.
The consumer ChatGPT file upload feature loads documents into a temporary context window for one conversation. There is no persistent index, no access control beyond your account, and no audit trail.
Files you upload may be retained per OpenAI’s consumer terms. That is fine for internal scratch work. For a customer-facing chatbot or an internal tool that needs PIPEDA-grade governance, it is not a substitute for a real RAG deployment.
Microsoft 365 Copilot uses RAG patterns over your Microsoft 365 tenant content (SharePoint, OneDrive, Teams, Outlook). For SMBs already on Microsoft 365 Business Premium, Copilot is the right starting point for “answer my questions from my Microsoft documents” use cases.
Where Copilot stops being enough: when you need to bring in non-Microsoft 365 sources (a ticketing system, a SQL database, a third-party policy management tool), when you need a public-facing chatbot, or when you need answer formats Copilot does not produce (structured JSON, tool calls, multi-step workflows).
FIELD NOTE
A Hamilton manufacturing client asked us last quarter why their Copilot deployment kept missing answers their staff knew were in “the system.” The answers were in their ERP, not SharePoint. Copilot does not retrieve from on-prem ERP. We built a small RAG layer over the ERP’s nightly export, federated it with the Copilot answer flow, and the “Copilot is dumb” complaints stopped within two weeks. The model was never the problem. The retrieval surface was.
Where RAG goes wrong (and how FC tunes it)
The original 2020 paper by Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” (NeurIPS 2020), established that RAG models “generate more specific, diverse and factual language” than parametric-only models on knowledge-intensive question answering. That result holds in production. What the paper did not cover, and what every SMB deployment surfaces by week three, are the failure modes that show up when retrieval is mediocre.
“The most common failure mode we see in SMB RAG deployments is silent retrieval failure. The model gives a confident, well-written answer based on the wrong three chunks because chunking was naive. We solve it by instrumenting retrieval logs from day one, sampling 50 queries per week, and adjusting chunk size, overlap, and the rerank model. The fix is rarely the LLM. It is almost always the retrieval pipeline.”
Mike Pearlstein, CISSP, MSc AI, CEO of Fusion Computing
The five failure modes we see most often, ranked by frequency in the FC client base:
- Naive chunking. Splitting on fixed character boundaries breaks tables, splits clauses, and orphans context. We default to semantic chunking with 800 to 1,200 token chunks and 100 token overlap, with custom rules for tables and lists.
- No reranker. Pure vector search returns chunks that are topically similar but not actually relevant to the question. Adding a rerank step (Cohere Rerank, Azure AI Search semantic ranker) typically lifts answer quality more than swapping LLMs.
- Missing access control. If retrieval can return chunks the user is not authorized to see, the LLM will summarize them in the answer. Access control belongs at the retrieval layer, not at the model.
- Stale index. Documents change. Indexes do not, unless you wire up incremental ingestion. Most of our incident calls in month two are “the chatbot keeps citing the old policy” problems that trace back to a missing scheduled job.
- No evaluation harness. Without a small benchmark of 50 to 100 expected-question / expected-source pairs, you cannot tell whether a retrieval change made things better or worse. We build this on day one of every engagement.
Most of these failure modes are invisible without telemetry. The chatbot keeps producing fluent, confident answers. They are just wrong. This is what we mean when we tell SMB CIOs that RAG is a retrieval-systems problem with an LLM stapled to the front, not the other way around.
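A minimal version of that evaluation harness is a gold set of question-to-source pairs and a hit-rate script run before and after every retrieval change. The sketch below assumes a retrieve() function like the one sketched earlier; the gold-set entries are placeholders:

```python
# Minimal retrieval evaluation harness: a gold set of question -> expected-source
# pairs and a top-k hit-rate report. Assumes a retrieve(question, top_k) function
# returning chunks with a .source attribute; all names here are illustrative.
GOLD_SET = [
    ("What is the warranty period for product X?", "Warranty-Policy.pdf"),
    ("How many vacation days do new hires get?", "HR-Policy.pdf"),
    # ... 50 to 100 pairs drawn from real user questions
]

def hit_rate(retrieve, top_k: int = 5) -> float:
    hits = 0
    for question, expected_source in GOLD_SET:
        sources = {chunk.source for chunk in retrieve(question, top_k=top_k)}
        hits += expected_source in sources
    return hits / len(GOLD_SET)

# Run this before and after every chunking or embedding change: a drop in hit
# rate means retrieval got worse, no matter how fluent the answers still sound.
```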
Who builds this, and what does it cost a Canadian SMB?
For a 30 to 200 employee Canadian SMB, the realistic options are: build it on a generic AI platform yourself, buy a productized vertical RAG product, or engage an MSP with AI delivery capability. Fusion Computing’s Custom Business AI Platform is the third option in this list and the one we built specifically for SMBs that do not have a data scientist on staff and do not want to become an AI infrastructure shop.
The implementation work in a typical FC RAG engagement breaks down roughly as follows:
- Discovery and data audit (1 to 2 weeks). What sources, what volume, what permissions, what compliance scope.
- Architecture and tenant setup (1 week). Canada-region cloud, identity, network isolation, key management.
- Ingestion pipeline (2 to 4 weeks). Connectors to your sources (SharePoint, file shares, ticketing, ERP exports), chunking strategy, embedding pipeline, vector index.
- Application layer (2 to 4 weeks). The chatbot, copilot, or workflow that calls retrieval. Often integrated with Power Automate flows for ticketing, approvals, or email triage.
- Retrieval tuning and evaluation harness (2 weeks). The 50-question benchmark, retrieval logging, weekly review cadence for the first 90 days.
- Governance and access control (parallel, 1 to 2 weeks). PIPEDA mapping, data classification, role-based access at the retrieval layer.
Total implementation cost varies by scope. Run cost is dominated by cloud hosting, embeddings, and LLM tokens, and scales with query volume and indexed document count rather than employee count. Fusion Computing builds a fixed scope and budget into every engagement before the build phase begins.
Book a Free AI Architecture Assessment
RAG and Canadian privacy law (PIPEDA, Quebec Law 25, Bill C-8)
For Canadian SMBs, the deployment decision is shaped less by model capability than by where the data lives and who can see it. Three rules to anchor on:
According to the Office of the Privacy Commissioner of Canada’s PIPEDA guidance, organizations across Canada must obtain meaningful consent for the collection, use, and disclosure of personal information in commercial activities.
For RAG, this means the documents you index, the queries you log, and the retrieval and answer logs you keep are all in scope. Treat the RAG system the same way you treat your CRM under PIPEDA: data classification, retention policy, access control, breach response.
The NIST AI Risk Management Framework is the governance reference most Canadian SMB auditors and cyber insurers are aligning on. Its four functions (Govern, Map, Measure, Manage) translate cleanly to a RAG deployment: govern the AI program, map data and use cases, measure retrieval quality and bias, manage incidents and drift. FC builds the RMF mapping into every engagement so the answer to “is your AI governed?” is a document, not a shrug.
Quebec Law 25 (in effect since September 2023) and federal Bill C-8 (consumer-privacy successor to the lapsed Bill C-27) raise the bar for cross-border data transfer disclosures. The practical implication for RAG: keep embeddings, vector indexes, and LLM inference inside Canadian-region cloud endpoints (Azure Canada Central, AWS ca-central-1) by default. The major cloud providers all offer Canada-region LLM endpoints now. There is no longer a credible reason to route SMB knowledge through US-region infrastructure.
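In practice, keeping inference in-region comes down to which endpoints the application is constructed against. A minimal sketch with the Azure OpenAI and AWS SDKs; the resource name and API version are placeholders:

```python
# Pin LLM inference to Canadian regions by construction: the Azure OpenAI
# resource is created in a Canadian region, and the Bedrock client is bound
# to ca-central-1. Resource name and API version below are placeholders.
import os

import boto3
from openai import AzureOpenAI

azure_llm = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",  # resource deployed in Canada East
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

bedrock = boto3.client("bedrock-runtime", region_name="ca-central-1")
```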
How to evaluate a RAG vendor in 5 questions
Use these in your next RAG vendor call. The answers tell you whether the vendor has done this before or is reading from a slide deck.
- Show me your retrieval evaluation harness. If they do not have one, walk away. Without it, every “the model is great” claim is unverifiable.
- What chunking strategy do you use, and how did you arrive at it? Correct answer references chunk size, overlap, and special handling for tables and code. Wrong answer is “the default.”
- Where do embeddings, indexes, and inference run, by region? For Canadian SMB workloads, the answer should include “Canada Central” or “ca-central-1” without a pause.
- How do you enforce access control on retrieval? Production answer: at query time, filter by user identity propagated from your IdP (a minimal sketch of that filter follows this list). Anti-pattern answer: “the model is instructed not to share that.”
- What is your incident process when the chatbot returns a wrong answer with confidence? The answer should include log review, retrieval replay, and a documented model-and-retrieval change protocol. If they do not have one, they have not run RAG in production long enough.
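For question 4, the sketch below shows what query-time filtering can look like against a pgvector-style index; the allowed_groups ACL column and group names are illustrative:

```python
# Query-time access control: restrict retrieval to chunks the requesting user's
# groups may see, before anything reaches the LLM. The allowed_groups ACL column
# and the schema are illustrative.
def retrieve_for_user(conn, q_vec: list[float], user_groups: list[str], top_k: int = 5):
    vec_literal = "[" + ",".join(map(str, q_vec)) + "]"
    return conn.execute(
        "SELECT source, page, content FROM chunks "
        "WHERE allowed_groups && %s "        # chunk ACL overlaps the user's groups
        "ORDER BY embedding <=> %s::vector LIMIT %s",
        (user_groups, vec_literal, top_k),
    ).fetchall()

# The user_groups list comes from the identity provider token (Entra ID group
# claims, for example), never from the chat input, so a prompt cannot widen access.
```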
Frequently asked questions
Do I need a data scientist on staff to deploy RAG?
No. RAG is an applied engineering problem, not a research problem. A capable application developer plus an MSP with retrieval and Azure or AWS experience can deliver a production RAG system. Where data scientists add value is in custom embedding fine-tuning or evaluation methodology design, neither of which most SMBs need at the start. We have deployed RAG for 40 to 200 employee firms with zero in-house ML expertise.
Will RAG work on PDFs and SharePoint?
Yes, both, and they are the two most common ingestion sources for SMBs. PDFs need a text extraction step (Azure Document Intelligence, Amazon Textract, or open-source equivalents) before chunking and embedding. SharePoint connects via Microsoft Graph API with permission inheritance, so retrieval respects who is allowed to see what. Other supported sources: Confluence, Notion, Google Drive, Zendesk, ServiceNow, Salesforce, plain file shares.
How is RAG different from giving ChatGPT my files?
The consumer ChatGPT file upload puts files into a temporary context window for one conversation, with no persistent index, no access control, and limited audit trail. RAG builds a persistent searchable index that respects access control, returns source citations with every answer, and lets you change the LLM without re-uploading anything. For internal scratch work, ChatGPT upload is fine. For governance-bound work, RAG is the standard.
Is RAG secure for Canadian privacy law?
Yes, when deployed correctly. The PIPEDA-aligned pattern keeps embeddings, vector indexes, and LLM inference in Canada-region cloud endpoints (Azure Canada Central, AWS ca-central-1), enforces access control at the retrieval layer, encrypts data at rest and in transit, and logs queries and retrievals for audit. Quebec Law 25 and Bill C-8 reinforce the data residency requirement. Keeping the entire pipeline inside Canada-region infrastructure is the default we recommend for SMB deployments.
What does RAG cost?
Engagements vary too widely to quote a single dollar range. Implementation cost depends on the number of source systems, document volume, and integration complexity. Run cost is dominated by cloud hosting, embeddings, and LLM tokens, and scales with query volume and indexed document count rather than employee count. Fusion Computing scopes and prices every engagement before the build phase begins, so you see the build cost and the recurring run cost before committing.
How long does a RAG deployment take?
Typical timeline is 8 to 12 weeks from kickoff to production for a single use case (internal copilot, customer-facing chatbot, or workflow assistant). Discovery takes 2 weeks. Ingestion takes 2 to 4 weeks. Application layer takes 2 to 4 weeks. Retrieval tuning runs in parallel and continues for 90 days post-launch. Narrower scope is faster, but most SMB deployments cover at least two sources.
Which LLM should we use behind the retrieval layer?
For most SMB deployments, the LLM is the easiest decision and the most replaceable component. GPT-4o through Azure OpenAI Canada and Claude Sonnet through Amazon Bedrock are the two production defaults FC uses. Both run in Canadian regions, both produce quality on par with consumer ChatGPT, and both can be swapped without re-indexing your documents. The retrieval layer is what determines answer quality. The LLM is a commodity behind it.
Can RAG work alongside Microsoft 365 Copilot?
Yes, and this is the most common architecture we deploy.
Copilot covers retrieval over Microsoft 365 content (SharePoint, OneDrive, Teams, Outlook). A custom RAG layer covers everything else (ERP, ticketing, third-party knowledge bases, public website content, vendor documentation). The two systems can be federated through Microsoft Copilot Studio or invoked side-by-side. The decision is Copilot for Microsoft 365 content and custom RAG for the rest. We unpacked that trade-off in Custom AI vs Microsoft 365 Copilot: when each one wins.
How do we keep our source documents inside Canada when we use RAG?
Both Azure AI Search and AWS OpenSearch offer Canadian-region storage for the vector index and the source documents (Azure Canada Central and Canada East; AWS ca-central-1). Model invocation can be routed through Azure OpenAI in Canada East as well. The pattern we deploy keeps embeddings, chunks, retrieval, and inference inside Canadian-region infrastructure end-to-end. The reasonable-safeguards provisions of PIPEDA and Quebec Law 25 both treat that as the floor, not the ceiling.
What is the typical 90-day RAG rollout for a Canadian SMB?
Days 1 to 14 are document inventory, chunker selection, and embedding-model evaluation against a 50-question gold set drawn from real user questions.
Days 15 to 45 are a pilot with 5 to 10 power users plus retrieval tuning. Days 46 to 75 are workforce rollout, the evaluation harness in production, and the change-protocol exercise. Days 76 to 90 are the freshness pipeline, citation-verification setup, and quarterly-review cadence handoff. The rollout band for our 30 to 80 user clients is 8 to 14 weeks all-in.
Can RAG handle handwritten or scanned PDFs?
Yes, but the quality depends on the OCR layer. Azure Document Intelligence and AWS Textract both typically reach about 95 percent accuracy on clean scans and 80 to 90 percent on handwritten content.
The chunker has to use the OCR confidence score to drop low-quality regions, or the retrieval layer ends up surfacing garbled text as confident answers. We pre-process scanned content with the OCR confidence threshold set at 0.85, and route anything below that to a manual review queue.
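A sketch of that confidence gate, assuming OCR output reduced to (text, confidence) regions; both OCR services expose per-region confidence scores in their own response formats:

```python
# Route OCR regions by confidence: high-confidence text goes to the chunker,
# low-confidence regions go to a manual review queue instead of the index.
# The (text, confidence) shape is a simplified, illustrative representation.
OCR_CONFIDENCE_THRESHOLD = 0.85

def split_by_confidence(regions: list[tuple[str, float]]) -> tuple[list[str], list[str]]:
    to_index, to_review = [], []
    for text, confidence in regions:
        (to_index if confidence >= OCR_CONFIDENCE_THRESHOLD else to_review).append(text)
    return to_index, to_review
```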
How do we measure whether RAG is actually working in production?
Three metrics: retrieval hit rate (do the top five retrieved chunks contain the correct source?), answer faithfulness (does the model cite the retrieved source verbatim?), and user thumbs-down rate. We instrument all three from day one and review weekly with the client. The thresholds we hold to are: retrieval hit rate above 85 percent, faithfulness above 95 percent, and thumbs-down below 5 percent. Anything outside those bounds triggers a chunker, embedding, or prompt-template review before the next deploy.
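A minimal weekly check against those thresholds might look like the sketch below; the log field names are illustrative:

```python
# Weekly production check against the thresholds above. Each log record is assumed
# to note whether the correct source was in the top 5, whether the answer was
# judged faithful to its citations, and whether the user gave a thumbs-down.
THRESHOLDS = {"hit_rate": 0.85, "faithfulness": 0.95, "thumbs_down": 0.05}

def weekly_review(logs: list[dict]) -> dict[str, bool]:
    n = len(logs)
    metrics = {
        "hit_rate": sum(r["correct_source_in_top5"] for r in logs) / n,
        "faithfulness": sum(r["answer_faithful"] for r in logs) / n,
        "thumbs_down": sum(r["thumbs_down"] for r in logs) / n,
    }
    return {
        "hit_rate": metrics["hit_rate"] >= THRESHOLDS["hit_rate"],
        "faithfulness": metrics["faithfulness"] >= THRESHOLDS["faithfulness"],
        "thumbs_down": metrics["thumbs_down"] <= THRESHOLDS["thumbs_down"],
    }
```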
Fusion Computing helps Canadian businesses design, deploy, and run secure RAG systems.
Get a Custom AI Assessment for Your Business
Considering a productized custom AI platform built on RAG for your business? See also our 90-day AI knowledge management playbook for the rollout sequence. We can scope the work, build it on Canadian-region infrastructure, and run it under your existing managed services agreement.

