RAG: Building an Internal Help Desk with Mac mini M4 Pro + Dify (2025, Part 1)
About This Article
In a separate post, I wrote about buying a Mac mini M4 Pro for work and using spare time to do LoRA training: Related article.
Which brings us to the real topic here: a plan for building an internal help desk–oriented RAG system.
Because we’ll handle confidential company information and can’t send it to external SaaS, we decided to build the RAG stack inside the LAN. To prioritize development speed and practicality, we’ll adopt the low-code platform Dify instead of writing everything from scratch.
1. System Architecture
| Item | Details |
|---|---|
| Hardware | Mac mini M4 Pro (24GB) |
| Platform software | Docker (required to run Dify) |
| Application | Dify (tool for creating and managing chatbots) |
AI Model Options
| Option | Setup | Pros | Cons |
|---|---|---|---|
| A (Recommended) | External API (OpenAI / Anthropic) | Smarter, minimal memory use | Monthly cost, data leaves the network |
| B (Security-first) | Local LLM (Ollama + Qwen2.5-14B) | No data leakage | Uses ~10GB RAM, harder to tune |
2. Why Dify in 2025
We’ll use the latest Dify as of December 2025. The AI field has moved incredibly fast from late 2024 to late 2025, and Dify has evolved just as dramatically.
Focusing on an internal help desk, here are four key improvements over the 2024-era experience.
2-1. Much better RAG search (GraphRAG)
In 2024, keyword and vector search were the norm. In the latest version, GraphRAG-like capabilities are strengthened and integrated.
Strictly speaking, GraphRAG and knowledge graphs aren’t identical, but they both leverage relationships between pieces of information. For a deeper dive, see the explainer from Microsoft Research.
| Version | Behavior |
|---|---|
| 2024 edition | Search for “Error code E-001” → only pages containing that term show up |
| 2025 edition | Understands relationships among “E-001”, “accounting system”, and “restart” and searches accordingly |
Even when manuals are scattered, the AI connects the dots to answer, drastically reducing “nonsensical” replies.
2-2. Stronger agent capabilities
It used to “just answer questions”. Now it has a stronger ability to think and act (Agentic Workflow).
| Version | Behavior |
|---|---|
| 2024 edition | If search fails: “I don’t know.” |
| 2025 edition | 1) Search → 2) If info is missing, ask the user → 3) Search again → 4) If still no go, “handoff to a human” |
This “ask back” and “rethink” logic is easy to build directly in Dify.
2-3. Image (screenshot) support
Multimodality is now standard.
- Users can paste error-screen screenshots and ask “What is this?”
- Even with local LLMs via Ollama, vision models like LLaVA or Qwen-VL can analyze images without sending them outside the company
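As a sketch of the local multimodal path: Ollama's `/api/generate` endpoint accepts base64-encoded images alongside a prompt. The model tag and filename below are illustrative; this assumes Ollama is running locally on its default port 11434.

```shell
# Pull a vision-capable model (LLaVA as an example).
ollama pull llava

# Base64-encode a screenshot (macOS syntax) and ask the model about it.
IMG=$(base64 -i error_screen.png)
curl -s http://localhost:11434/api/generate -d "{
  \"model\": \"llava\",
  \"prompt\": \"What error is shown in this screenshot?\",
  \"images\": [\"$IMG\"],
  \"stream\": false
}"
```

In Dify itself you would register the same model under an Ollama provider and enable vision on the app, but the curl call above is what happens underneath.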
2-4. More stable integration with local LLMs (Ollama)
Back in 2024, “Ollama support” still felt experimental. Now it’s essentially native.
- Function calling support: Even with local LLMs (e.g., Qwen2.5), you can reliably call tools like an internal DB search or Slack notifications, just like with API models.
- Benefit of the M4 Pro: It’s much easier to realize “smart agent behavior” entirely locally without external APIs.
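To make "function calling with a local LLM" concrete, here is a minimal sketch against Ollama's `/api/chat` endpoint, which accepts OpenAI-style tool definitions. The `search_internal_db` tool name and its schema are hypothetical; this assumes a recent Ollama with `qwen2.5:14b` already pulled.

```shell
# Ask Qwen2.5 a question while advertising an (illustrative) internal tool.
# If the model decides to call it, the response contains a tool_calls entry.
curl -s http://localhost:11434/api/chat -d '{
  "model": "qwen2.5:14b",
  "messages": [{"role": "user", "content": "Who owns the accounting system?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "search_internal_db",
      "description": "Search the internal asset database",
      "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"]
      }
    }
  }],
  "stream": false
}'
```

Dify wires this up for you when you attach tools to an agent node, but it is worth knowing the underlying API works the same way as with hosted models.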
3. Rollout Steps
The shortest path to business impact.
Phase 1: Environment setup (Day 1)
When the Mac arrives, lay the groundwork first.
Install Docker Desktop for Mac
Required to run Dify. Download from the official site and install.
Install Dify
Run the following in Terminal.
```shell
git clone https://github.com/langgenius/dify.git
cd dify/docker
cp .env.example .env
docker compose up -d
```
You can then access the admin UI in your browser at http://localhost/install.
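Before opening the browser, it's worth a quick health check from the same directory; the `api` service name matches Dify's current compose file but may differ across versions.

```shell
# All Dify services should report running/healthy.
cd dify/docker
docker compose ps

# If a container keeps restarting, tail its logs.
docker compose logs --tail=50 api
```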
Important: Always use the latest docker-compose.yml from the official GitHub repo. Copy-pasting from older posts (early 2024) may leave key features broken.
Pin the network
Give the Mac a static IP so others on the internal LAN can access it.
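One way to do this from the terminal is macOS's `networksetup`; the service name and addresses below are examples for a typical 192.168.1.0/24 LAN, so adjust them to your network.

```shell
# List network services to find the exact name ("Ethernet", "Wi-Fi", ...).
networksetup -listallnetworkservices

# Pin a manual IP: <service> <address> <subnet mask> <router>.
sudo networksetup -setmanual "Ethernet" 192.168.1.50 255.255.255.0 192.168.1.1

# Keep DNS pointing at your usual resolver.
sudo networksetup -setdnsservers "Ethernet" 192.168.1.1
```

You can also do the same thing with a DHCP reservation on your router, which keeps the Mac's own config untouched.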
Phase 2: Build a prototype (about one week)
Make something that actually works here.
Connect a model
Register your LLMs in Dify’s settings.
Pro tip for work: start with external APIs. They are significantly smarter than local LLMs, which helps avoid the early perception that “AI is too dumb to be useful.”
| Model | Output price (per 1M tokens) | Notes |
|---|---|---|
| GPT-5.1 | $10 | Among the cheapest |
| Gemini 2.5 Pro | $10 | Cheapest tier; Google camp |
| Gemini 3 Pro | $12 | Latest; contexts over 200K tokens cost more |
| Claude Sonnet 4.5 | $15 | Anthropic camp |
| Claude Opus 4.5 | $25 | For image analysis and complex reasoning |
Register knowledge (manuals)
Upload internal policies and troubleshooting guides in PDF or Word to Dify. They are automatically vectorized (transformed into a searchable form).
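If you later want to automate uploads instead of dragging files into the UI, Dify exposes a knowledge API for this; the sketch below uses its file-upload endpoint, but verify the path and payload against the API docs of your Dify version. `$DATASET_ID` and `$API_KEY` are placeholders you generate in Dify's knowledge settings.

```shell
# Upload a PDF into an existing knowledge base via Dify's API (sketch).
curl -s -X POST "http://localhost/v1/datasets/$DATASET_ID/document/create-by-file" \
  -H "Authorization: Bearer $API_KEY" \
  -F 'data={"indexing_technique":"high_quality","process_rule":{"mode":"automatic"}};type=text/plain' \
  -F "file=@troubleshooting-guide.pdf"
```

This is handy once manuals are updated regularly and you want a cron job to keep the knowledge base in sync.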
Phase 3: Implement logic (handoff to a human)
This is the core feature. Build it with Dify’s Workflow.
Build the flow
[Start] → [Knowledge Search (RAG)] → [LLM Answer Generation] → [Branch]
Add branching conditions (IF/ELSE)
- Condition: “Answer confidence is low” or “User pressed the ‘Not helpful’ button”
- TRUE (unresolved): go to an [HTTP Request] node. Hit a Slack/Teams webhook to notify IT staff.
- FALSE (resolved): finish as-is
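The [HTTP Request] node on the TRUE branch boils down to a single POST. For Slack, the webhook URL below is a placeholder you generate under "Incoming Webhooks"; Teams works the same way with its own connector URL.

```shell
# Notify IT staff that the bot gave up on a ticket (illustrative payload).
curl -s -X POST "https://hooks.slack.com/services/XXX/YYY/ZZZ" \
  -H 'Content-type: application/json' \
  -d '{"text": "Help desk bot could not resolve a question. Please follow up: <question summary here>"}'
```

In Dify you would map workflow variables (the user's question, the bot's draft answer) into that JSON body.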
4. Operating Rules on an M4 Pro (24GB)
Approximate memory usage.
| Setup | Memory use | Notes |
|---|---|---|
| Dify (Docker) only | ~4–6 GB | Safe to keep always on |
| Dify + local LLM (Qwen2.5-14B) | ~14–16 GB | Recommended during business hours |
| Dify + local LLM + many concurrent users | ~18–20 GB | Still some headroom |
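To see where you actually sit in that table while the stack is running, a few spot checks are enough:

```shell
# Per-container memory for the Dify services.
docker stats --no-stream

# Which models Ollama currently holds in memory, and their size.
ollama ps

# macOS-level memory pressure summary.
memory_pressure | head -n 5
```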
5. Running Cost
When using APIs (Option A)
- GPT-5.1 / Gemini 2.5 Pro: $10 per 1M output tokens
- Gemini 3 Pro: $12 per 1M output tokens
- Claude Sonnet 4.5: $15 per 1M output tokens
- At 1,000 questions/month × ~2,000 tokens each, the cheapest tier works out to roughly $3–8/month
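A back-of-the-envelope check of that figure. The 1,500/500 input/output token split per question and the $1.25-per-1M input price are my assumptions; only the $10-per-1M output price comes from the table above.

```shell
Q=1000                                   # questions per month
IN_CENTS=$((Q * 1500 * 125 / 1000000))   # input: 1.5M tokens at 125 cents/1M
OUT_CENTS=$((Q * 500 * 1000 / 1000000))  # output: 0.5M tokens at 1000 cents/1M
TOTAL_CENTS=$((IN_CENTS + OUT_CENTS))
echo "~\$$((TOTAL_CENTS / 100)).$((TOTAL_CENTS % 100)) per month"  # ~$6.87 per month
```

With a heavier share of retrieved context (more input tokens) the total shifts, but it stays in the single-digit-dollar range at these volumes.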
With local LLMs (Option B) = electricity only
- Mac mini M4 Pro max power draw: ~65 W
- 24 hours × 30 days at full load: ~46 kWh → about 1,400 yen/month (assuming 30 yen/kWh)
- More realistic average (30 W): about 650 yen/month
With local LLMs, you can run it for just a few hundred yen per month in electricity.
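The electricity numbers above are easy to reproduce (30 yen/kWh assumed, as in the source):

```shell
HOURS=$((24 * 30))                    # 720 hours in a month
FULL_YEN=$((65 * HOURS * 30 / 1000))  # 65 W flat out: ~46.8 kWh -> 1,404 yen
AVG_YEN=$((30 * HOURS * 30 / 1000))   # 30 W realistic average -> 648 yen
echo "full load: ${FULL_YEN} yen/month, realistic: ${AVG_YEN} yen/month"
```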
6. Advice for Developers
This is work, so you’ll probably need to explain it to someone. When you do, try this approach.
"First, show a high-accuracy API-based version"
Local LLMs are hard to tune. Start with APIs so people agree “this system is useful”, then say “we’ll localize it (leverage the M4 Pro) for cost and security”—that tends to land better.
"Keep local LLMs to 14B"
32B+ models won’t reach practical speeds with the M4 Pro’s memory. You’ll get complaints that “it’s too slow”. Don’t overreach; pick efficient, strong models like Qwen2.5-14B or Mistral-Nemo-12B.
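Pulling the suggested models is one command each; the tags below match Ollama's current library naming, but check `ollama list` output and the library page if they have changed.

```shell
ollama pull qwen2.5:14b
ollama pull mistral-nemo

# Check on-disk sizes before committing to a model.
ollama list
```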
If someone says, “Performance dropped after we switched!”
Local LLMs will underperform API models, and they update less often; that's just reality.
If your environment lets you simply switch back to APIs, fine. In practice, though, many of us face security and cost constraints.
In that case, hand the decision back to stakeholders: "You choose the balance of cost and risk; we can support either path." It isn't a call engineers should make alone.
Bonus: If you’re told “improve local performance” with the same hardware
That’s a tall order. Proceed only if you can sacrifice response time.
- Use 32B–70B models (e.g., Qwen2.5-32B)
- They won’t fully fit in memory, so inference will swap to disk
- Expect each answer to take tens of seconds to minutes
If you switch to an approach like “generate answers offline as a nightly batch”, it’s not impossible—but does that still function as a help desk?
Realistically, either upgrade the machine (64 GB RAM or more) or go back to APIs.
Summary
- With Dify + a Mac mini M4 Pro, an internal RAG help desk is entirely feasible
- The 2025-era Dify has stronger GraphRAG and agent features—it’s a different beast from a year ago
- Start with APIs to create success, then gradually localize
Once the Mac mini arrives, I’ll actually build it. In Part 2, I’ll report the concrete steps and validation results.