Tech 16 min read

Qwen3.7 Plus on ModelScope: function calling works, so I built an agent loop

Ikesan

I got Qwen3.7 Plus working through ModelScope’s OpenAI-compatible endpoint. I’d already used it to proofread a Japanese novel, but whether it works as an agent is a separate question. That’s what I’m testing here.

Qwen3.7 is also a generation pitched squarely at agent use. Alibaba sells it on function calling, long-horizon autonomy, and running the same way on any agent harness (Claude Code, OpenClaw, or your own code). If that’s the case, the question becomes how smoothly it runs on a plain off-the-shelf loop, with no dedicated framework.

To run as an agent you need two things: a model that can call tools (function calling), and a loop to drive it. Whether ModelScope’s OpenAI-compatible endpoint actually passes tool calls through is something you can’t know without trying. If it does, you can write the loop yourself with just the openai SDK: the model requests a tool, you run it and return the result, and the model continues. If it doesn’t, you fall back to ReAct, writing the tools into the prompt and parsing the model’s text output.
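For reference, the ReAct fallback amounts to parsing the model’s text yourself. A minimal sketch of that parsing step, assuming a hypothetical `Action:` / `Action Input:` text format that your own prompt would have to spell out (neither the format nor `parse_react_step` comes from ModelScope or Qwen):

```python
import json
import re

# Hypothetical text protocol, defined by your own prompt:
#   Action: <tool name>
#   Action Input: <JSON object>
ACTION_RE = re.compile(
    r"^Action:\s*(?P<name>\w+)\s*\nAction Input:\s*(?P<args>\{.*\})\s*$",
    re.MULTILINE | re.DOTALL,
)

def parse_react_step(text):
    """Return (tool_name, args) if the text requests a tool, else None.

    Raises json.JSONDecodeError if the arguments are not valid JSON.
    """
    m = ACTION_RE.search(text)
    if m is None:
        return None                      # no tool request: treat as the final answer
    return m.group("name"), json.loads(m.group("args"))
```

Everything native function calling gives you for free (structured arguments, call IDs, parallel calls) would have to be rebuilt on top of this by hand.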

After confirming that, I went through it in order: the minimal tool loop (inner), extending it to a conversation by keeping history (outer), and where it breaks if you let history pile up. All with local run logs.

The verification setup

| Item | Value |
| --- | --- |
| Client | Apple M1 Max / 64GB / macOS 26.5 |
| Language/runtime | Python 3.13 / uv 0.10 |
| SDK | openai 2.41 |
| Connection | OpenAI-compatible API via ModelScope |
| Model | Qwen3.7 Plus |
| Params | temperature=0 for tests, temperature=0.3 for chat |

Inference runs on the API side, so local machine specs don’t matter — the M1 Max is just an API client. With any OpenAI-compatible API, swapping base_url, the API key, and the model name for your own setup runs the same code.

An agent is just two nested loops

What people call an agent is, inside, just two loops nested together.

The inner loop is the tool round-trip for finishing one instruction. You send messages to the model; when it replies “call this tool with these arguments,” you actually run the function, append the result to the message list, and send again. When the model stops requesting tools, its answer is the final output.

The outer loop is the conversation itself. Wait for input, append it to the message list, run the inner loop, then wait for the next input. Whether you keep this message list alive across the conversation is what separates a one-shot API call from a conversational agent.

flowchart TD
    U[Input] --> A[Append to messages]
    A --> L[Send to Qwen]
    L --> D{tool_calls?}
    D -->|yes| T[Run tool]
    T --> R[Append result to messages]
    R --> L
    D -->|no| O[Show final answer]
    O --> U

Send to Qwen, check for tool_calls, run any tools and append the results, then send again — that small cycle is the inner loop. Once tool requests run out, show the final answer and wait for the next input — that larger cycle is the outer loop. The code follows the diagram directly.

Function calling is the spec for the part where the model returns “call a tool” in a structured form. Tools aren’t built into the model. The model can only request what you describe to it on each request through tools (an array defining tool names, descriptions, and arguments in JSON Schema). Even when the model puts a tool name and a JSON argument string into tool_calls in its response, it doesn’t run the function itself. Preparing and running the actual function is your job; you return the result as a message with role set to tool.

Testing function calling in the inner loop

First I check whether the endpoint accepts tools. I prepared the tools on my side beforehand: a dummy function that returns weather and one that adds two integers, defined and listed in tools, then asked a question that needs both.

import json, os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["QWEN_API_KEY"],
    base_url=os.environ["QWEN_BASE_URL"],   # OpenAI-compatible endpoint
)
MODEL = os.environ["QWEN_MODEL"]            # the Qwen model name to use

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Return the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "add",
            "description": "Add two integers",
            "parameters": {
                "type": "object",
                "properties": {"a": {"type": "integer"}, "b": {"type": "integer"}},
                "required": ["a", "b"],
            },
        },
    },
]

def get_weather(city):
    return f"{city} is sunny, 24C"

def add(a, b):
    return a + b

DISPATCH = {"get_weather": get_weather, "add": add}

messages = [{"role": "user", "content": "What's the weather in Tokyo? Also, what's 17 + 25?"}]
while True:
    r = client.chat.completions.create(
        model=MODEL, messages=messages, tools=tools, temperature=0)
    msg = r.choices[0].message
    messages.append(msg.model_dump(exclude_none=True))
    if not msg.tool_calls:               # done if there's no tool request
        print(msg.content)
        break
    for tc in msg.tool_calls:            # run the requested tools and return results
        args = json.loads(tc.function.arguments)
        result = DISPATCH[tc.function.name](**args)
        messages.append(
            {"role": "tool", "tool_call_id": tc.id, "content": str(result)})

Run log.

[turn=0] tool_call: get_weather({'city': 'Tokyo'}) -> Tokyo is sunny, 24C
[turn=0] tool_call: add({'a': 17, 'b': 25}) -> 42
[turn=1] final: The weather in Tokyo is sunny at 24C. Also, 17 + 25 is 42.

On the first response, get_weather and add came back together as parallel tool_calls. After I ran both and returned the results, the second turn combined them into one final answer. No ReAct fallback needed; native function calling went straight through. Qwen models already carry a function-calling template, but whether ModelScope’s inference endpoint passes that through is something you can’t know without calling it, so I wanted to confirm it.

Making it conversational with the outer loop

Wrap the inner loop in an outer loop that waits for input. The difference is keeping the message list outside the loop and not discarding it across turns. That alone makes it remember the conversation.

SYSTEM = "A helpful assistant that answers in English. Use tools when needed."

def main():
    messages = [{"role": "system", "content": SYSTEM}]   # kept across the whole conversation
    while True:                                  # outer: conversation loop
        user = input("\nyou> ").strip()
        if user in {"exit", "quit", ""}:
            break
        messages.append({"role": "user", "content": user})
        while True:                              # inner: tool round-trip loop
            r = client.chat.completions.create(
                model=MODEL, messages=messages, tools=tools, temperature=0.3)
            msg = r.choices[0].message
            messages.append(msg.model_dump(exclude_none=True))
            if not msg.tool_calls:
                print(f"\nQwen> {msg.content}")
                break
            for tc in msg.tool_calls:
                args = json.loads(tc.function.arguments)
                result = DISPATCH[tc.function.name](**args)
                messages.append(
                    {"role": "tool", "tool_call_id": tc.id, "content": str(result)})

main()

A two-turn log.

you> What's the weather in Tokyo?
  [tool] get_weather({'city': 'Tokyo'}) -> Tokyo is sunny, 24C
Qwen> The weather in Tokyo is sunny, 24C.

you> Convert that city's temperature to Fahrenheit.
Qwen> Tokyo was 24C, so in Fahrenheit that's 75.2F.
      The math is 24 * 9/5 + 32 = 75.2.

On the second turn, “that city’s temperature” — the model didn’t call get_weather again; it pulled Tokyo and 24C from history and converted. The whole prior exchange, including tool results, stays in messages. That’s what a conversational agent really is — no special memory mechanism, just not throwing the message list away.

You can build a skill mechanism too

With the same approach as adding tools, you can build “skills.” A skill here means a procedure for a specific task, written as instructions, loaded into the model only when needed. This isn’t about compatibility — it’s whether you can build that spec yourself.

The approach is simple: keep the skill procedures in a dict, and provide one meta-tool, use_skill, that loads them. The model calls use_skill for a skill that fits the task, reads the returned procedure, and follows it.

SKILLS = {
    "proofread": {
        "description": "Fix typos in English text",
        "instructions": (
            "Proofread as follows. "
            "1. Pull out only typos and misspellings. "
            "2. List each as 'wrong -> right (reason)'. "
            "3. Don't touch dialect, casual speech, or proper nouns."
        ),
    },
    "haiku": {
        "description": "Turn the content into a haiku",
        "instructions": "Return the given content as a single 5-7-5 haiku. Include one season word.",
    },
}

tools = [{
    "type": "function",
    "function": {
        "name": "use_skill",
        "description": "Load the procedure for a registered skill",
        "parameters": {
            "type": "object",
            "properties": {"name": {"type": "string", "enum": list(SKILLS)}},
            "required": ["name"],
        },
    },
}]

def use_skill(name):
    return SKILLS[name]["instructions"]

Asked to proofread, the model picked proofread itself, loaded it, and fixed the text by that procedure.

task: Proofread this: Today is a realy nice day. I hope it stays suny tomorow.
[use_skill] loaded the proofread procedure
final:
- realy -> really (misspelling)
- suny -> sunny (misspelling)
- tomorow -> tomorrow (misspelling)
corrected: Today is a really nice day. I hope it stays sunny tomorrow.

A skill is, at bottom, just the procedure text plus one tool to load it. Claude Code’s skills are, stripped down, the same mechanism — loading a SKILL.md for the relevant task — so the skeleton is no different. Register several and the model picks among them by task.
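One thing the tool definition above leaves out is each skill’s description: the model only sees the names in the `enum`. A possible variation is to list the descriptions inside the tool’s own `description`. This is a sketch under the assumption that the extra text helps the model choose; the runs above did not measure that, and `build_skill_tool` is a hypothetical helper. The instructions are abbreviated to `"..."` here.

```python
SKILLS = {
    "proofread": {"description": "Fix typos in English text", "instructions": "..."},
    "haiku": {"description": "Turn the content into a haiku", "instructions": "..."},
}

def build_skill_tool(skills):
    """Build the use_skill tool definition from the skill registry."""
    catalog = "; ".join(f"{name}: {s['description']}" for name, s in skills.items())
    return {
        "type": "function",
        "function": {
            "name": "use_skill",
            "description": f"Load the procedure for a registered skill. Available: {catalog}",
            "parameters": {
                "type": "object",
                "properties": {"name": {"type": "string", "enum": list(skills)}},
                "required": ["name"],
            },
        },
    }

tools = [build_skill_tool(SKILLS)]
```

Generating the definition from the registry also means a newly registered skill shows up in both the `enum` and the catalog without editing the tool by hand.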

It reads errors and recovers on its own

I also tested what happens when a tool fails. If you return the failure as a tool-role message, the model reads it and changes its next move. Here I set up a get_weather that rejects a city name passed as-is — it only accepts lowercase English.

WEATHER = {"tokyo": "sunny, 24C", "osaka": "rainy, 21C"}  # keys are lowercase only

def get_weather(city):
    if city not in WEATHER:                      # reject capitalized or non-matching names
        return (f"ERROR: '{city}' not found. Specify city in lowercase English. "
                f"e.g. {', '.join(WEATHER)}")
    return WEATHER[city]

Asked for “the weather in Tokyo and Osaka,” it first called with the capitalized names, got errors on both, read them, and retried in lowercase.

task: What's the weather in Tokyo and Osaka?
[turn 0] get_weather({'city': 'Tokyo'}) -> ERROR: 'Tokyo' not found. Specify city in lowercase English. e.g. tokyo, osaka
[turn 0] get_weather({'city': 'Osaka'}) -> ERROR: 'Osaka' not found. Specify city in lowercase English. e.g. tokyo, osaka
[turn 1] get_weather({'city': 'tokyo'}) -> sunny, 24C
[turn 1] get_weather({'city': 'osaka'}) -> rainy, 21C
final: Tokyo is sunny, 24C. Osaka is rainy, 21C.

I didn’t change a single line to say “use lowercase.” Just returning the error string was enough: the model read it and fixed the arguments. You return tool results the same way whether they succeed or fail, and the more plainly you surface failures as errors, the easier the model recovers.
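The same idea extends to failures in the loop itself. In the loops above, an unknown tool name, malformed JSON arguments, or an exception inside a tool would crash the script instead of reaching the model. A minimal sketch of a wrapper that turns all three into error strings (`run_tool_call` is a hypothetical helper, and the `ERROR:` prefix is just the convention used in this post):

```python
import json

def run_tool_call(name, arguments_json, dispatch):
    """Run one requested tool and always return a string for the tool message."""
    if name not in dispatch:
        return f"ERROR: unknown tool '{name}'. Available: {', '.join(dispatch)}"
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError as e:
        return f"ERROR: arguments are not valid JSON ({e.msg})"
    if not isinstance(args, dict):
        return "ERROR: arguments must be a JSON object"
    try:
        return str(dispatch[name](**args))
    except Exception as e:  # surface any tool failure to the model as text
        return f"ERROR: {type(e).__name__}: {e}"
```

Inside the loop, the call would become `result = run_tool_call(tc.function.name, tc.function.arguments, DISPATCH)`, with the result appended as the tool message as before.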

But this is also a loop that “keeps going until it gets a valid value.” Left alone it would round-trip forever, so always cap the loop with a turn limit.

for turn in range(max_turns):   # cap; it won't loop past this
    ...

To stay safe, decide how to fence it in before you decide how to stop it. If you keep tools to side-effect-free dummies and make it always stop at the cap, even a runaway just hits harmless tools a few times and ends. If you hand it real file or shell access, this is where you need pre-execution approval and a sandbox.
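Filled out, the capped inner loop might look like the sketch below. To keep it runnable without an API key, the model call is passed in as a `send` function; `run_capped`, `send`, and the dict-shaped messages are assumptions of this sketch, not the SDK’s types.

```python
import json

def run_capped(send, messages, dispatch, max_turns=5):
    """Inner loop with a hard cap. Returns the final text, or None if the cap hit.

    `send(messages)` stands in for the chat completion call and must return an
    assistant message as a dict, optionally carrying "tool_calls".
    """
    for turn in range(max_turns):
        msg = send(messages)
        messages.append(msg)
        calls = msg.get("tool_calls") or []
        if not calls:
            return msg.get("content")          # no tool request: done
        for tc in calls:
            fn = tc["function"]
            result = dispatch[fn["name"]](**json.loads(fn["arguments"]))
            messages.append(
                {"role": "tool", "tool_call_id": tc["id"], "content": str(result)})
    return None                                 # cap reached without a final answer
```

Returning `None` at the cap lets the caller tell “the model finished” apart from “the loop was cut off” and report the second case explicitly.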

Letting history pile up eventually breaks

That “not throwing it away,” though, breaks down if you keep going. messages grows every turn with the input, the model’s response, and the tool results. The more the agent round-trips through tools, the more tokens you resend for a single response. Eventually it uses up the context window, and along the way the input tokens billed for each request swell in proportion to the conversation’s length.
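A rough back-of-the-envelope model shows the growth. Assume, purely for illustration, that every turn adds the same number of tokens to the history and that each request resends the whole history (no prompt caching, no trimming):

```python
def resent_tokens(tokens_per_turn, turns):
    """Input tokens sent on each request, and their running total.

    Request k resends everything from turns 1..k, so it carries
    k * tokens_per_turn input tokens.
    """
    per_request = [tokens_per_turn * k for k in range(1, turns + 1)]
    return per_request, sum(per_request)

per_request, total = resent_tokens(tokens_per_turn=500, turns=10)
print(per_request[-1])  # 5000: the 10th request alone
print(total)            # 27500: all 10 requests combined
```

Each request grows linearly, so the cumulative total grows with the square of the turn count: under the same assumptions, 20 turns come to 105,000 input tokens, nearly four times the 10-turn total.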

So lately a lot of implementations focus on thinning out memory instead of holding the whole history. On the side of managing what’s in the context, the approaches split broadly as follows.

| Approach | What it does | Reference |
| --- | --- | --- |
| Time decay | Lowers the weight of old context on a forgetting curve and drops it | YourMemory drops old context with the Ebbinghaus forgetting curve |
| Supersession | Retires semantically close old info with newer info | Vektor Memory retires old preferences with supersession chains |
| Compression proxy | Summarizes ahead and compresses tool output in front of the API | Compresr Context Gateway prevents context starvation with a proxy |
| Hook injection | Injects only the needed context across sessions | CTX adds working memory to Claude Code |

All of them stop sending everything every time and instead rebuild “just what’s needed now.” On the other hand, there’s an argument that these are retrieval systems pulling things off a shelf, distinct from learning that folds experience into the model’s weights (the paper arguing agent memory is notes, not memory).

Another direction is writing notes out beyond the context. Designs that put memory in a canonical Markdown file under Git (WUPHF returns agent memory to Markdown and Git), or that route output to HTML on the assumption the AI re-reads it (routing Claude Code’s output from Markdown to HTML), are showing up. The aim is that even if the context overflows, re-reading that file lets you resume from another session.

This blog’s experiment posts are written this way from the start. Stand up a draft first, append at each check, and summarize at the end. Even if the AI’s context cuts out partway, reading the draft shows how far it got, so it can continue. My hand-built agent here still just holds the whole history as-is, so for long runs, adding one of these is the next move.
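As a baseline before any of those, the simplest option is a sliding window, which is not one of the approaches in the table above: keep the system message and only the most recent messages. A minimal sketch, assuming dict-shaped messages as in the loops above (`trim_history` and `keep_last` are hypothetical names). The one subtlety is where the window starts: a tool-role message without the assistant message that requested it is generally rejected by OpenAI-compatible APIs, so the window is widened back to the nearest user message.

```python
def trim_history(messages, keep_last=20):
    """Keep the system message plus at least the last `keep_last` messages."""
    system = [m for m in messages if m["role"] == "system"][:1]
    rest = [m for m in messages if m["role"] != "system"]
    start = max(0, len(rest) - keep_last)
    # Widen the window back to the nearest user message so it never opens
    # on a tool result or in the middle of a tool round-trip.
    while start > 0 and rest[start]["role"] != "user":
        start -= 1
    return system + rest[start:]
```

Calling `messages = trim_history(messages)` before each request bounds the resent history, at the cost of the model forgetting anything older than the window.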

Write it yourself, or put it on a framework

In practice, you rarely write this loop from scratch. Normally you use a ready-made agent framework. I wrote it by hand this far to understand what’s happening inside, and confirmed that extending it only takes adding to the tools array and DISPATCH.

Toward real operation, you need the surrounding pieces: streaming output, pre-execution approval for tools, automatic history compression, retries. Existing tools provide those. Qwen’s official Qwen-Agent handles function-calling fallback and even MCP connections; LangGraph lets you write control flow as a graph. OpenAI’s Agents SDK also runs on OpenAI-compatible APIs. All of them can run with Qwen as the backend once you swap base_url, the API key, and the model name.
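Of those pieces, pre-execution approval is small enough to sketch by hand. The version below assumes a hypothetical `NEEDS_APPROVAL` set naming the tools with side effects and an `approve` callback (a terminal prompt by default); none of these names come from a framework.

```python
NEEDS_APPROVAL = {"write_file"}   # hypothetical: tools with side effects

def ask_in_terminal(name, args):
    """Default approver: ask the person at the keyboard."""
    return input(f"run {name}({args})? [y/N] ").strip().lower() == "y"

def gated_call(name, args, dispatch, approve=ask_in_terminal):
    """Run a tool only if it is harmless or the approver says yes."""
    if name in NEEDS_APPROVAL and not approve(name, args):
        return f"ERROR: the user declined to run {name}"
    return str(dispatch[name](**args))
```

A refusal goes back to the model as an ordinary tool result, so it can explain or propose something else instead of the loop simply stopping.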

This structure applies to coding agents like Claude Code too. What runs in the CLI is the same two loops. The differences are just that it has dozens of tools like Read/Write/Bash, a huge system prompt, and the context management built up above — the thing running inside is the same few-dozen-line loop I wrote here.

Pinning down the spec to have it build a BBS

Finally, one task to tie together the tools, conversation, and action so far. Until now I’ve thrown the model a one-liner like “build a BBS” and had it produce one in a single shot. At Qwen3.7 Plus’s size it can one-shot a board, so what I’m testing here isn’t generation ability but whether it can build after a few turns of pinning down the spec.

I give it one write_file tool. The write target is fenced to the bbs_out/ sandbox, and the system prompt just says “if the requirements are vague, ask before building.”

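import pathlib
OUT = pathlib.Path("bbs_out")                 # assumed definition of the sandbox directory
OUT.mkdir(exist_ok=True)

def write_file(filename, content):
    name = pathlib.Path(filename).name        # drop any directory part: stays under bbs_out/
    (OUT / name).write_text(content, encoding="utf-8")
    return f"wrote: bbs_out/{name} ({len(content)} bytes)"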
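Taking only `.name` is what keeps the write inside the sandbox: any directory part, including `../`, is discarded. It also means the agent cannot create subdirectories. If that were needed, one alternative (a sketch, not what this experiment used) is to resolve the full path and refuse anything that lands outside the sandbox root. `safe_target` is a hypothetical helper, and `Path.is_relative_to` needs Python 3.9 or later.

```python
import pathlib

def safe_target(root, filename):
    """Resolve filename under root, raising ValueError if it escapes root."""
    root = pathlib.Path(root).resolve()
    target = (root / filename).resolve()
    if not target.is_relative_to(root):
        raise ValueError(f"{filename!r} escapes the sandbox")
    return target

# What the .name approach does with a hostile path:
print(pathlib.Path("../../etc/passwd").name)   # passwd
```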

When I threw it just “I want a simple BBS,” it asked back about the stack, features, and design.

you> I want to build a simple BBS.
Qwen> Let me check a few things.
  1. Stack (HTML+JS / PHP / Node / Python+Flask ...)
  2. Features (list / post / delete / threads / admin delete ...)
  3. Design (simple? modern?)

Once I returned a spec, it called write_file on the spot and wrote out the implementation.

you> Python + Flask, one file. Posts have name, body, timestamp; store data in JSON.
     Features: list, new post, delete. Minimal design. Newest post on top.
  [tool] write_file(app.py, 4048 bytes) -> wrote: bbs_out/app.py
Qwen> Done. pip install flask, then python app.py to start it.

I ran the app.py it produced as-is and checked. Starting from an empty state, posts showed up with name, body, and timestamp, newest first, and were saved to JSON; delete removed them. It worked exactly to the spec I gave.

GET  /          -> "No posts yet"
POST /post      -> 302 (Kana / test post)
GET  /          -> Kana / 2026-06-19 / test post
POST /delete/1  -> 302
GET  /          -> "No posts yet"

What’s different from a one-line dump is that the model pins down the missing spec from its own side. Tool calling, conversation memory, action (writing a file), and checking the result all connected through one practical task.

Don’t trust the generated code as-is

It ran, but whether you can put the generated code into production as-is is another matter. I poked at the output’s security a little.

XSS first. When I put <script>alert('xss')</script> into a post body and displayed it, it was escaped to &lt;script&gt; and didn’t execute. Stored XSS doesn’t land.

posted body: <script>alert('xss')</script>
rendered HTML: &lt;script&gt;alert('xss')&lt;/script&gt;

But this isn’t because the model wrote XSS defenses. Flask turns on Jinja2’s auto-escaping for render_template_string by default, so {{ post.body }} is escaped on output; there’s no escape() of any kind in the code. It’s saved by the framework’s defaults.
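For contrast, in a setting with no template engine (such as a single hand-written HTML file), the escaping has to be explicit. A minimal sketch using the standard library’s `html.escape`; it stands in for what auto-escaping does and is not Jinja2’s own implementation, and `render_post` is a hypothetical helper:

```python
import html

def render_post(name, body):
    """Build one post as HTML, escaping user-supplied fields by hand."""
    return f"<li><b>{html.escape(name)}</b>: {html.escape(body)}</li>"

print(render_post("Kana", "<script>alert(1)</script>"))
# <li><b>Kana</b>: &lt;script&gt;alert(1)&lt;/script&gt;</li>
```

Forgetting a single `html.escape` call here reopens the hole, which is the practical argument for relying on a framework default instead.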

Conversely, there are holes left unhandled. The generated app.py starts with app.run(debug=True, host='0.0.0.0'). The debugger opens on all interfaces, so exposing it as-is risks code execution through it. The post forms have no CSRF protection either (no auth on delete is to spec, since I asked for a “simple BBS”).
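One way to close the first hole is to make debug mode and the bind address opt-in. A sketch under assumptions: the environment variable names `BBS_DEBUG` and `BBS_HOST` are made up for this example, and the helper only builds the keyword arguments that would be passed to `app.run`.

```python
import os

def run_options(env=os.environ):
    """Build app.run() keyword arguments with safe defaults."""
    return {
        "debug": env.get("BBS_DEBUG") == "1",        # off unless explicitly enabled
        "host": env.get("BBS_HOST", "127.0.0.1"),    # loopback only by default
    }

# In the generated app.py this would replace the hard-coded call:
#     app.run(**run_options())
```

This only covers the debugger exposure; CSRF protection on the post and delete forms would still need to be added separately.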

The agent gets you to “runs, and output is escaped,” but not to hardening. It’s just faithful to the minimal spec you asked for, so don’t trust the generated code as-is — use it on the assumption you’ll review it yourself.

This “how far it acts on its own” is the same axis I worked on when I had Qwen3.6 one-shot a BBS to measure intent-reading. That time it was a single HTML file, and whether the model wrote XSS escaping itself was where models differed. This time, riding on Flask and Jinja’s auto-escaping, that decision was left to the framework.