RAG: Building an Internal Help Desk with Mac mini M4 Pro + Dify (2025, Part 1)
About This Article
In a separate post, I wrote about buying a Mac mini M4 Pro for work and using spare time to do LoRA training: Related article.
Which brings us to the real topic here: a plan for building an internal help desk–oriented RAG system.
Because we’ll handle confidential company information and can’t send it to external SaaS, we decided to build the RAG stack inside the LAN. To prioritize development speed and practicality, we’ll adopt the low-code platform Dify instead of writing everything from scratch.
1. System Architecture
| Item | Details |
|---|---|
| Hardware | Mac mini M4 Pro (24GB) |
| Platform software | Docker (required to run Dify) |
| Application | Dify (tool for creating and managing chatbots) |
AI Model Options
| Option | Setup | Pros | Cons |
|---|---|---|---|
| A (Recommended) | External API (OpenAI / Anthropic) | Smarter, minimal memory use | Monthly cost, data leaves the network |
| B (Security-first) | Local LLM (Ollama + Qwen2.5-14B) | No data leakage | Uses ~10GB RAM, harder to tune |
2. Why Dify in 2025
We’ll use the latest Dify as of December 2025. The AI field has moved incredibly fast from late 2024 to late 2025, and Dify has evolved just as dramatically.
Focusing on an internal help desk, here are four key improvements over the 2024-era experience.
2-1. Much better RAG search (GraphRAG)
In 2024, keyword and vector search were the norm. In the latest version, GraphRAG-like capabilities are strengthened and integrated.
Strictly speaking, GraphRAG and knowledge graphs aren’t identical, but they both leverage relationships between pieces of information. For a deeper dive, see the explainer from Microsoft Research.
| Version | Behavior |
|---|---|
| 2024 edition | Search for “Error code E-001” → only pages containing that term show up |
| 2025 edition | Understands relationships among “E-001”, “accounting system”, and “restart” and searches accordingly |
Even when manuals are scattered, the AI connects the dots to answer, drastically reducing “nonsensical” replies.
2-2. Stronger agent capabilities
It used to “just answer questions”. Now it has a stronger ability to think and act (Agentic Workflow).
| Version | Behavior |
|---|---|
| 2024 edition | If search fails: “I don’t know.” |
| 2025 edition | 1) Search → 2) If info is missing, ask the user → 3) Search again → 4) If still no go, “handoff to a human” |
This “ask back” and “rethink” logic is easy to build directly in Dify.
2-3. Image (screenshot) support
Multimodality is now standard.
- Users can paste error-screen screenshots and ask “What is this?”
- Even with local LLMs via Ollama, vision models like LLaVA or Qwen-VL can analyze images without sending them outside the company
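As a sketch of the local multimodal path: Ollama's `/api/generate` endpoint accepts base64-encoded images alongside a prompt. The model tag and filename below are illustrative; this assumes Ollama is running locally on its default port 11434.

```shell
# Pull a vision-capable model (LLaVA as an example).
ollama pull llava

# Base64-encode a screenshot (macOS syntax) and ask the model about it.
IMG=$(base64 -i error_screen.png)
curl -s http://localhost:11434/api/generate -d "{
  \"model\": \"llava\",
  \"prompt\": \"What error is shown in this screenshot?\",
  \"images\": [\"$IMG\"],
  \"stream\": false
}"
```

In Dify itself you would register the same model under an Ollama provider and enable vision on the app, but the curl call above is what happens underneath.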
2-4. More stable integration with local LLMs (Ollama)
Back in 2024, “Ollama support” still felt experimental. Now it’s essentially native.
- Function calling support: Even with local LLMs (e.g., Qwen2.5), you can reliably call tools like an internal DB search or Slack notifications, just like with API models.
- Benefit of the M4 Pro: It’s much easier to realize “smart agent behavior” entirely locally without external APIs.
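To make "function calling with a local LLM" concrete, here is a minimal sketch against Ollama's `/api/chat` endpoint, which accepts OpenAI-style tool definitions. The `search_internal_db` tool name and its schema are hypothetical; this assumes a recent Ollama with `qwen2.5:14b` already pulled.

```shell
# Ask Qwen2.5 a question while advertising an (illustrative) internal tool.
# If the model decides to call it, the response contains a tool_calls entry.
curl -s http://localhost:11434/api/chat -d '{
  "model": "qwen2.5:14b",
  "messages": [{"role": "user", "content": "Who owns the accounting system?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "search_internal_db",
      "description": "Search the internal asset database",
      "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"]
      }
    }
  }],
  "stream": false
}'
```

Dify wires this up for you when you attach tools to an agent node, but it is worth knowing the underlying API works the same way as with hosted models.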
3. Rollout Steps
The shortest path to business impact.
Phase 1: Environment setup (Day 1)
When the Mac arrives, lay the groundwork first.
Install Docker Desktop for Mac
Required to run Dify. Download from the official site and install.
Install Dify
Run the following in Terminal.
```shell
git clone https://github.com/langgenius/dify.git
cd dify/docker
cp .env.example .env
docker compose up -d
```
You can then access the admin UI in your browser at http://localhost/install.
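Before opening the browser, it's worth a quick health check from the same directory; the `api` service name matches Dify's current compose file but may differ across versions.

```shell
# All Dify services should report running/healthy.
cd dify/docker
docker compose ps

# If a container keeps restarting, tail its logs.
docker compose logs --tail=50 api
```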
Important: Always use the latest docker-compose.yml from the official GitHub repo. Copy-pasting from older posts (early 2024) may leave key features broken.
Pin the network
Give the Mac a static IP so others on the internal LAN can access it.
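One way to do this from the terminal is macOS's `networksetup`; the service name and addresses below are examples for a typical 192.168.1.0/24 LAN, so adjust them to your network.

```shell
# List network services to find the exact name ("Ethernet", "Wi-Fi", ...).
networksetup -listallnetworkservices

# Pin a manual IP: <service> <address> <subnet mask> <router>.
sudo networksetup -setmanual "Ethernet" 192.168.1.50 255.255.255.0 192.168.1.1

# Keep DNS pointing at your usual resolver.
sudo networksetup -setdnsservers "Ethernet" 192.168.1.1
```

You can also do the same thing with a DHCP reservation on your router, which keeps the Mac's own config untouched.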
Phase 2: Build a prototype (about one week)
Make something that actually works here.
Connect a model
Register your LLMs in Dify’s settings.
Pro tip for work: start with external APIs. They are significantly smarter than local LLMs, which helps avoid the early perception that “AI is too dumb to be useful.”
| Model | Output price (per 1M tokens) | Notes |
|---|---|---|
| GPT-5.1 | $10 | Among the cheapest |
| Gemini 2.5 Pro | $10 | Cheapest tier; Google camp |
| Gemini 3 Pro | $12 | Latest; contexts over 200K tokens cost more |
| Claude Sonnet 4.5 | $15 | Anthropic camp |
| Claude Opus 4.5 | $25 | For image analysis and complex reasoning |
Register knowledge (manuals)
Upload internal policies and troubleshooting guides in PDF or Word to Dify. They are automatically vectorized (transformed into a searchable form).
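If you later want to automate uploads instead of dragging files into the UI, Dify exposes a knowledge API for this; the sketch below uses its file-upload endpoint, but verify the path and payload against the API docs of your Dify version. `$DATASET_ID` and `$API_KEY` are placeholders you generate in Dify's knowledge settings.

```shell
# Upload a PDF into an existing knowledge base via Dify's API (sketch).
curl -s -X POST "http://localhost/v1/datasets/$DATASET_ID/document/create-by-file" \
  -H "Authorization: Bearer $API_KEY" \
  -F 'data={"indexing_technique":"high_quality","process_rule":{"mode":"automatic"}};type=text/plain' \
  -F "file=@troubleshooting-guide.pdf"
```

This is handy once manuals are updated regularly and you want a cron job to keep the knowledge base in sync.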
Phase 3: Implement logic (handoff to a human)
This is the core feature. Build it with Dify’s Workflow.
Build the flow
[Start] → [Knowledge Search (RAG)] → [LLM Answer Generation] → [Branch]
Add branching conditions (IF/ELSE)
- Condition: “Answer confidence is low” or “User pressed the ‘Not helpful’ button”
- TRUE (unresolved): go to an [HTTP Request] node. Hit a Slack/Teams webhook to notify IT staff.
- FALSE (resolved): finish as-is
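The [HTTP Request] node on the TRUE branch boils down to a single POST. For Slack, the webhook URL below is a placeholder you generate under "Incoming Webhooks"; Teams works the same way with its own connector URL.

```shell
# Notify IT staff that the bot gave up on a ticket (illustrative payload).
curl -s -X POST "https://hooks.slack.com/services/XXX/YYY/ZZZ" \
  -H 'Content-type: application/json' \
  -d '{"text": "Help desk bot could not resolve a question. Please follow up: <question summary here>"}'
```

In Dify you would map workflow variables (the user's question, the bot's draft answer) into that JSON body.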
4. Operating Rules on an M4 Pro (24GB)
Approximate memory usage.
| Setup | Memory use | Notes |
|---|---|---|
| Dify (Docker) only | ~4–6 GB | Safe to keep always on |
| Dify + local LLM (Qwen2.5-14B) | ~14–16 GB | Recommended during business hours |
| Dify + local LLM + many concurrent users | ~18–20 GB | Still some headroom |
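To see where you actually sit in that table while the stack is running, a few spot checks are enough:

```shell
# Per-container memory for the Dify services.
docker stats --no-stream

# Which models Ollama currently holds in memory, and their size.
ollama ps

# macOS-level memory pressure summary.
memory_pressure | head -n 5
```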
5. Running Cost
When using APIs (Option A)
- GPT-5.1 / Gemini 2.5 Pro: $10 per 1M output tokens
- Gemini 3 Pro: $12 per 1M output tokens
- Claude Sonnet 4.5: $15 per 1M output tokens
- At 1,000 questions/month × ~2,000 tokens each, the cheapest tier works out to roughly $3–8/month
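A back-of-the-envelope check of that figure. The 1,500/500 input/output token split per question and the $1.25-per-1M input price are my assumptions; only the $10-per-1M output price comes from the table above.

```shell
Q=1000                                   # questions per month
IN_CENTS=$((Q * 1500 * 125 / 1000000))   # input: 1.5M tokens at 125 cents/1M
OUT_CENTS=$((Q * 500 * 1000 / 1000000))  # output: 0.5M tokens at 1000 cents/1M
TOTAL_CENTS=$((IN_CENTS + OUT_CENTS))
echo "~\$$((TOTAL_CENTS / 100)).$((TOTAL_CENTS % 100)) per month"  # ~$6.87 per month
```

With a heavier share of retrieved context (more input tokens) the total shifts, but it stays in the single-digit-dollar range at these volumes.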
With local LLMs (Option B) = electricity only
- Mac mini M4 Pro max power draw: ~65 W
- 24 hours × 30 days at full load: ~46 kWh → about 1,400 yen/month (assuming 30 yen/kWh)
- More realistic average (30 W): about 650 yen/month
With local LLMs, you can run it for just a few hundred yen per month in electricity.
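The electricity numbers above are easy to reproduce (30 yen/kWh assumed, as in the source):

```shell
HOURS=$((24 * 30))                    # 720 hours in a month
FULL_YEN=$((65 * HOURS * 30 / 1000))  # 65 W flat out: ~46.8 kWh -> 1,404 yen
AVG_YEN=$((30 * HOURS * 30 / 1000))   # 30 W realistic average -> 648 yen
echo "full load: ${FULL_YEN} yen/month, realistic: ${AVG_YEN} yen/month"
```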
6. Advice for Developers
This is work, so you’ll probably need to explain it to someone. When you do, try this approach.
"First, show a high-accuracy API-based version"
Local LLMs are hard to tune. Start with APIs so people agree “this system is useful”, then say “we’ll localize it (leverage the M4 Pro) for cost and security”—that tends to land better.
"Keep local LLMs to 14B"
32B+ models won’t reach practical speeds with the M4 Pro’s memory. You’ll get complaints that “it’s too slow”. Don’t overreach; pick efficient, strong models like Qwen2.5-14B or Mistral-Nemo-12B.
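Pulling the suggested models is one command each; the tags below match Ollama's current library naming, but check `ollama list` output and the library page if they have changed.

```shell
ollama pull qwen2.5:14b
ollama pull mistral-nemo

# Check on-disk sizes before committing to a model.
ollama list
```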
If someone says, “Performance dropped after we switched!”
Local LLMs will underperform API models, and they update less often; that's just reality.
If your environment lets you simply switch back to APIs, fine. In practice, though, many of us face security and cost constraints.
In that case, hand the decision back to stakeholders: "You choose the balance of cost and risk; we can support either path." It isn't a call engineers should make alone.
Bonus: If you’re told “improve local performance” with the same hardware
That’s a tall order. Proceed only if you can sacrifice response time.
- Use 32B–70B models (e.g., Qwen2.5-32B)
- They won’t fully fit in memory, so inference will swap to disk
- Expect each answer to take tens of seconds to minutes
If you switch to an approach like “generate answers offline as a nightly batch”, it’s not impossible—but does that still function as a help desk?
Realistically, either upgrade the machine (64 GB RAM or more) or go back to APIs.
Summary
- With Dify + a Mac mini M4 Pro, an internal RAG help desk is entirely feasible
- The 2025-era Dify has stronger GraphRAG and agent features—it’s a different beast from a year ago
- Start with APIs to create success, then gradually localize
Once the Mac mini arrives, I’ll actually build it. In Part 2, I’ll report the concrete steps and validation results.