A self-editing search agent with 20B parameters published by Chroma. It performs multi-hop search while dynamically pruning its own context, and matches or exceeds frontier-model accuracy at roughly 1/10 the cost and up to 10× lower latency. Weights are released under the Apache 2.0 license.
Hypura breaks away from llama.cpp’s mmap design and streams even dense models using a three-tier NVMe placement scheme, while TurboQuant eliminates quantization-constant overhead via a polar-coordinate transform. Also includes a design comparison with Flash‑MoE and a review of the scenarios where KV‑cache compression actually helps.
LiteLLM 1.82.7 and 1.82.8 were poisoned on PyPI for about 46 minutes. The attacker group TeamPCP stole a PyPI token through Trivy's CI/CD pipeline and injected malware that harvests more than 50 credential types, including SSH keys and AWS, Kubernetes, and Docker secrets.
NVIDIA's Cosmos 2.5 world-model series, announced at GTC 2026, is aimed mainly at industrial use, but the 2B-parameter model has reached the point where it runs on a Jetson Orin Nano costing under $500. The article surveys edge deployment of physical AI, from industrial robots to pet robots.
Composio published a security analysis of OpenClaw. Roughly 7.1% of skills distributed via SkillHub were found to contain critical vulnerabilities, and the more than 30,000 instances exposed to the internet in the early days are at risk of prompt injection and credential theft.
Flash-MoE is a C/Metal inference engine that runs Qwen3.5-397B-A17B on a MacBook Pro M3 Max at 4.36 tokens/s. With expert streaming from SSD and hand-written Metal shaders, it fits the 209GB model into a 48GB memory budget.
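Fitting a 209GB model into a 48GB budget implies that only the currently needed experts live in memory while the rest stay on SSD. A minimal sketch of such an eviction policy, with a hypothetical loader callback standing in for the actual NVMe read path (Flash-MoE's real implementation is C/Metal and far more involved):

```python
from collections import OrderedDict

class ExpertCache:
    """Toy LRU cache illustrating SSD expert streaming for MoE inference.
    `loader` stands in for reading quantized expert weights from disk;
    the class and its interface are hypothetical, not Flash-MoE's API."""

    def __init__(self, capacity, loader):
        self.capacity = capacity      # max experts resident in memory
        self.loader = loader          # expert_id -> weights (e.g., SSD read)
        self.cache = OrderedDict()    # insertion order tracks recency
        self.hits = 0
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)   # mark as most recently used
            self.hits += 1
            return self.cache[expert_id]
        self.misses += 1
        weights = self.loader(expert_id)        # stream in from storage
        self.cache[expert_id] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)      # evict least recently used
        return weights
```

Because MoE routers reuse a small working set of experts across nearby tokens, even a simple LRU policy keeps most lookups in memory; the streaming cost is paid only on misses.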
Packages the three-stage pipeline of BERT perplexity scan → LLM judgment → escalation as a cross-platform Python tool. The installer automatically downloads llama-server and the GGUF models.
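The point of such a staged design is that the cheap perplexity scan filters out most inputs before the expensive LLM judge ever runs. A minimal sketch of the control flow, with injected stand-ins for the scorer and judge (the threshold and function names are hypothetical, not the tool's actual API):

```python
def triage(text, ppl_score, llm_judge, ppl_threshold=80.0):
    """Sketch of a three-stage triage pipeline:
    1) cheap perplexity scan on every input,
    2) LLM judgment only for inputs the scan flags,
    3) escalation when the judge is also suspicious.
    `ppl_score` and `llm_judge` are injected callables so the sketch
    stays runnable without a model."""
    ppl = ppl_score(text)
    if ppl < ppl_threshold:
        return "pass"              # fluent text: skip the expensive stage
    verdict = llm_judge(text)      # second stage: semantic judgment
    return "escalate" if verdict == "suspicious" else "pass"
```

The short-circuit in stage 1 is what makes the pipeline affordable: the LLM is invoked only on the small high-perplexity tail.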
Redesigned with inference latency as the top priority, Mamba‑3 combines exponential trapezoid discretization, complex‑valued states, and a MIMO structure to reach about 6.9× the speed of a Transformer at a 16,384-token context.
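For background, the textbook trapezoid (bilinear) discretization of a linear state-space model looks as follows; this is the generic form, not necessarily the exact "exponential trapezoid" variant the Mamba‑3 work uses:

```latex
% Continuous-time linear SSM:
\dot h(t) = A\,h(t) + B(t)\,x(t)
% Trapezoid-rule discretization with step \Delta_t averages the input
% term over both endpoints of the interval:
h_t = \Bigl(I - \tfrac{\Delta_t}{2}A\Bigr)^{-1}
      \Bigl[\Bigl(I + \tfrac{\Delta_t}{2}A\Bigr) h_{t-1}
      + \tfrac{\Delta_t}{2}\bigl(B_t x_t + B_{t-1} x_{t-1}\bigr)\Bigr]
```

Compared with the zero-order-hold rule used in earlier Mamba versions, the trapezoid rule is second-order accurate in \(\Delta_t\), which is one plausible motivation for adopting a variant of it.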
Compresr's YC-backed Context Gateway is a proxy between AI agents and LLM APIs. Its three pillars - preemptive summarization, tool output compression, and tool discovery - reduce wasted context-window usage.
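Of the three pillars, tool output compression is the easiest to illustrate: oversized tool results get shrunk before entering the agent's context. A crude stand-in below keeps the head and tail and elides the middle; the real gateway presumably uses model-based summarization, and all names here are hypothetical:

```python
def compress_tool_output(output: str, budget: int = 200) -> str:
    """Sketch of tool-output compression: if a tool result exceeds the
    character budget, keep its head and tail and mark the elision, so
    the agent's context window is not flooded. A naive stand-in for the
    gateway's actual (likely summarization-based) approach."""
    if len(output) <= budget:
        return output              # small outputs pass through untouched
    marker = "\n...[truncated]...\n"
    keep = (budget - len(marker)) // 2
    return output[:keep] + marker + output[-keep:]
```

A proxy can apply this transparently to every tool response, which is what makes the gateway placement between agent and LLM API attractive.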
Sakura Internet's "Sakura AI Engine" is an OpenAI-API-compatible LLM inference platform. It offers a free tier of 3,000 requests per month, and multiple models such as Kimi-K2.5 and gpt-oss-120b run on domestic (Japanese) infrastructure.
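OpenAI API compatibility means existing clients only need the base URL swapped. A sketch of building the standard `/chat/completions` request body (the endpoint URL here is a placeholder, not Sakura's real one):

```python
import json

# Placeholder base URL; an OpenAI-compatible client would be pointed at
# the platform's real endpoint instead of api.openai.com.
BASE_URL = "https://example.invalid/v1"

def chat_request(model: str, user_message: str) -> str:
    """Build the JSON body for a POST to {BASE_URL}/chat/completions,
    following the OpenAI Chat Completions schema the platform advertises
    compatibility with."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return json.dumps(payload)
```

Because the schema is unchanged, the official OpenAI SDKs also work by passing the platform's URL as `base_url` at client construction.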
Cursor released Composer 2 without disclosing its base model; calling its OpenAI-compatible API revealed it is Kimi K2.5. This escalated into a licensing dispute, but a formal commercial agreement with Moonshot AI was subsequently confirmed.
AttnRes replaces the Transformer's fixed residual combination with softmax attention along the depth dimension. In a demonstration with Kimi Linear 48B, it improved GPQA-Diamond by +7.5pt and HumanEval by +3.1pt, while keeping training overhead below 4% and inference overhead below 2%.
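The core idea, mixing layer states with data-dependent softmax weights instead of a fixed sum, can be sketched in a few lines. This is a toy illustration of attention over depth, not AttnRes's actual formulation; all names are hypothetical:

```python
import math

def depth_attention_residual(layer_outputs, query):
    """Toy sketch of softmax attention along the depth axis: each previous
    layer's hidden state is scored against a query vector, and the residual
    becomes a convex combination of layer states rather than a fixed sum.
    layer_outputs: list of equal-length vectors; query: vector."""
    d = len(query)
    # dot-product score of each layer's state against the query
    scores = [sum(h_j * q_j for h_j, q_j in zip(h, query)) / math.sqrt(d)
              for h in layer_outputs]
    # numerically stable softmax over depth
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # weighted mix of layer states replaces the plain residual sum
    return [sum(w * h[j] for w, h in zip(weights, layer_outputs))
            for j in range(d)]
```

Since the weights depend on the running representation, each position can emphasize whichever earlier layers are most useful, which is the property the fixed residual connection lacks.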