Zhipu AI released GLM-5 on February 11, 2026, and it rewrites the rules on two fronts. First, it's a 744 billion parameter open-weight model that competes head-to-head with GPT-5.4 and Claude Opus 4.6 on coding and reasoning benchmarks. Second, and this is the part that shook the industry, it was trained entirely on Huawei Ascend chips. No NVIDIA hardware. Zero.
That matters because US export controls were supposed to slow Chinese AI development by restricting access to NVIDIA's top-tier GPUs. GLM-5 proves that frontier-class models can be built on alternative silicon. For developers, the practical takeaway is simpler: you get a powerful open-weight model with a 200K context window, 128K max output tokens, function calling, streaming, and structured output, all under a permissive license.
In this guide, we cover GLM-5's architecture, benchmark results, API access, pricing, function calling patterns, comparison with competing models, and how Lushbinary helps teams integrate it into production systems.
Table of Contents
- 1. What Is GLM-5?
- 2. Architecture: 744B MoE on Huawei Ascend
- 3. Benchmark Results & Performance
- 4. API Access & Pricing
- 5. Function Calling & Tool Use
- 6. GLM-5 vs GPT-5.4 vs Claude Opus 4.6 vs DeepSeek V4
- 7. Context Window & Streaming
- 8. Self-Hosting & Deployment Options
- 9. Limitations & Considerations
- 10. Why Lushbinary for AI Integration
1. What Is GLM-5?
GLM-5 is Zhipu AI's flagship open-weight large language model, released February 11, 2026. Zhipu AI (also known as Z.ai) is one of China's leading AI labs, backed by significant funding and a research team with deep roots in Tsinghua University's NLP group.
The model uses a Mixture-of-Experts (MoE) architecture with 744 billion total parameters, but only 40–44 billion are active per inference pass. This means you get the reasoning capacity of a massive model with the inference cost of a much smaller one, the same architectural approach that made DeepSeek V3/V4 and Mixtral successful.
- 744B total parameters, 40–44B active per inference
- 256 total experts, 8 routed per token (plus shared experts)
- 200K context window with 128K max output tokens
- Trained on 28.5 trillion tokens using Huawei Ascend chips
- Open-weight under a permissive license
- DeepSeek Sparse Attention (DSA) for efficient long-context processing
Why this matters
GLM-5 is the first frontier-class model trained entirely without NVIDIA hardware. It demonstrates that Huawei's Ascend chip ecosystem is mature enough for training models that compete with the best from OpenAI, Anthropic, and Google. For developers, it means another strong open-weight option in the toolkit.
2. Architecture: 744B MoE on Huawei Ascend
GLM-5's architecture follows the sparse MoE paradigm that has become the dominant approach for frontier models in 2026. The key design decisions:
Mixture-of-Experts Routing
Each token is routed to 8 of 256 specialized expert sub-networks, plus a set of shared experts that process every token. This gives GLM-5 approximately 5–6% sparsity: only a small fraction of the total parameters are active for any given token, keeping inference costs manageable despite the massive total parameter count.
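As a mental model, the gating step can be sketched in a few lines of TypeScript. This is an illustrative toy, not Zhipu AI's implementation: the function names, the softmax gate, and the renormalization choice are all assumptions for the sake of the sketch.

```typescript
// Toy sketch of top-k MoE gating: score every expert, keep the top k,
// renormalize their weights. Illustrative only -- not GLM-5's actual code.

function softmax(logits: number[]): number[] {
  const max = Math.max(...logits);
  const exps = logits.map((l) => Math.exp(l - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Pick the k experts with the highest gate probability for one token
// (256 experts total and k = 8 in GLM-5's case).
function routeTopK(
  gateLogits: number[],
  k: number
): { expert: number; weight: number }[] {
  const ranked = softmax(gateLogits)
    .map((p, i) => ({ expert: i, weight: p }))
    .sort((a, b) => b.weight - a.weight)
    .slice(0, k);
  // Renormalize so the selected experts' weights sum to 1.
  const total = ranked.reduce((a, r) => a + r.weight, 0);
  return ranked.map((r) => ({ expert: r.expert, weight: r.weight / total }));
}
```

The token's output is then the weighted sum of the selected experts' outputs (plus the shared experts), which is where the roughly 44B-of-744B active-parameter figure comes from.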
DeepSeek Sparse Attention (DSA)
GLM-5 adopts DeepSeek Sparse Attention to handle its 200K context window efficiently. DSA selectively attends to the most relevant tokens rather than computing full quadratic attention, reducing both memory usage and compute cost for long-context workloads.
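To make the idea concrete, here is a toy top-k sparse attention step for a single query. It illustrates the "attend only to the highest-scoring positions" principle; DSA's actual selection mechanism is learned and operates over a real 200K window, so treat everything below as a simplified sketch.

```typescript
// Toy top-k sparse attention for one query vector: score all keys, keep
// only the k best, softmax over those, and mix the matching values.
// Conceptual sketch only -- not DSA's real (learned) token selection.

function dot(a: number[], b: number[]): number {
  return a.reduce((s, x, i) => s + x * b[i], 0);
}

function sparseAttend(
  query: number[],
  keys: number[][],
  values: number[][],
  k: number
): number[] {
  const scale = 1 / Math.sqrt(query.length);
  // Keep only the k highest-scoring key positions.
  const selected = keys
    .map((key, i) => ({ i, score: dot(query, key) * scale }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
  // Softmax over the selected scores only.
  const max = Math.max(...selected.map((s) => s.score));
  const exps = selected.map((s) => Math.exp(s.score - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  // Weighted sum of the corresponding value vectors.
  const out = new Array(values[0].length).fill(0);
  selected.forEach((s, j) => {
    const w = exps[j] / sum;
    values[s.i].forEach((v, d) => (out[d] += w * v));
  });
  return out;
}
```

The payoff is that attention cost scales with k rather than with the full sequence length, which is what makes the 200K window tractable.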
Huawei Ascend Training Stack
The entire 28.5 trillion token training run was executed on Huawei Ascend AI processors using the MindSpore framework. Zhipu AI developed custom distributed training optimizations to achieve competitive training efficiency on non-NVIDIA hardware, including custom kernel implementations and communication primitives.
| Specification | GLM-5 |
|---|---|
| Total Parameters | 744B |
| Active Parameters | 40–44B per inference |
| Expert Count | 256 total, 8 routed per token + shared |
| Context Window | 200K tokens |
| Max Output | 128K tokens |
| Training Data | 28.5T tokens |
| Training Hardware | Huawei Ascend (no NVIDIA) |
| Attention Mechanism | DeepSeek Sparse Attention (DSA) |
| License | Open-weight, permissive |
3. Benchmark Results & Performance
GLM-5 posts strong numbers across coding, reasoning, and general knowledge benchmarks. The standout result is its SWE-bench Verified score of 77.8%, which places it firmly in frontier territory alongside GPT-5.4 and Claude Opus 4.6.
| Benchmark | GLM-5 | GPT-5.4 | Claude Opus 4.6 |
|---|---|---|---|
| SWE-bench Verified | 77.8% | ~75% | ~79.2% |
| Vending Bench 2 | $4,432 | n/a | n/a |
| Context Window | 200K | 1M | 200K (1M beta) |
| Max Output Tokens | 128K | 128K | 128K |
The 77.8% SWE-bench Verified score is particularly notable because it was achieved by a model trained on non-NVIDIA hardware. This demonstrates that the training infrastructure gap between NVIDIA and alternative chip ecosystems has narrowed significantly.
GLM-5 also supports thinking modes for complex multi-step reasoning, similar to the chain-of-thought approaches used by GPT-5.4 and Claude. When enabled, the model shows its reasoning process before delivering a final answer, improving accuracy on math, logic, and code generation tasks.
4. API Access & Pricing
GLM-5 is available through two primary channels: the Z.ai platform (Zhipu AI's official API) and the WaveSpeed API for high-throughput inference. Both offer OpenAI-compatible endpoints, making integration straightforward if you're already using the OpenAI SDK.
| Provider | Access Method | Key Features |
|---|---|---|
| Z.ai Platform | REST API, OpenAI-compatible | Official, context caching, function calling |
| WaveSpeed API | REST API, OpenAI-compatible | High-throughput, optimized inference |
| Self-hosted | Open weights download | Full control, data sovereignty |
Cost Advantage
GLM-5's open-weight license means you can self-host for zero per-token API costs. Even through the Z.ai managed API, pricing is significantly lower than GPT-5.4 ($2.50/M input) or Claude Opus 4.6 ($15/M input). Context caching further reduces costs for repeated prompt prefixes.
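To make that gap concrete, here is a minimal spend estimator. The $2.50/M and $15/M input rates are the figures quoted above; GLM-5's own rate is deliberately left as a parameter since it isn't pinned down here, and all numbers should be re-checked against current vendor pricing pages.

```typescript
// Estimate monthly input-token spend at a given $/M-token rate.
function monthlyInputCost(
  requestsPerDay: number,
  avgInputTokens: number,
  ratePerMillionUSD: number
): number {
  const tokensPerMonth = requestsPerDay * avgInputTokens * 30;
  return (tokensPerMonth / 1_000_000) * ratePerMillionUSD;
}

// Example workload: 10,000 requests/day at 2,000 input tokens each.
const gpt54Cost = monthlyInputCost(10_000, 2_000, 2.5); // $1,500/month
const opus46Cost = monthlyInputCost(10_000, 2_000, 15); // $9,000/month
// Plug in GLM-5's current published rate, or $0 + infrastructure for
// self-hosting, to compare against your own workload.
```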
Quick Start: Z.ai API with OpenAI SDK
```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://open.bigmodel.cn/api/paas/v4",
  apiKey: process.env.ZHIPU_API_KEY,
});

const response = await client.chat.completions.create({
  model: "glm-5",
  messages: [
    { role: "system", content: "You are a senior software engineer." },
    { role: "user", content: "Review this code for security issues..." },
  ],
  max_tokens: 4096,
  temperature: 0.3,
  stream: false,
});

console.log(response.choices[0].message.content);
```
5. Function Calling & Tool Use
GLM-5 supports native function calling with the same tool-use pattern popularized by OpenAI. You define tools in your API request, and the model decides when and how to call them based on the user's query. This makes it straightforward to build AI agents that interact with external APIs, databases, and services.
```typescript
const response = await client.chat.completions.create({
  model: "glm-5",
  messages: [
    { role: "user", content: "What's the weather in Beijing?" },
  ],
  tools: [
    {
      type: "function",
      function: {
        name: "get_weather",
        description: "Get current weather for a city",
        parameters: {
          type: "object",
          properties: {
            city: { type: "string", description: "City name" },
            units: {
              type: "string",
              enum: ["celsius", "fahrenheit"],
              description: "Temperature units",
            },
          },
          required: ["city"],
        },
      },
    },
  ],
  tool_choice: "auto",
});

// GLM-5 returns a tool call
const toolCall = response.choices[0].message.tool_calls?.[0];
if (toolCall) {
  const args = JSON.parse(toolCall.function.arguments);
  console.log(args); // { city: "Beijing", units: "celsius" }
}
```
GLM-5 also supports structured output via JSON mode, which constrains the model's response to valid JSON. This is essential for building reliable data extraction pipelines and API integrations where you need predictable output formats.
```typescript
const structured = await client.chat.completions.create({
  model: "glm-5",
  messages: [
    {
      role: "user",
      content: "Extract: John Doe, age 32, engineer at Acme Corp",
    },
  ],
  response_format: {
    type: "json_object",
  },
});
// Returns valid JSON: { "name": "John Doe", "age": 32, ... }
```
Thinking Modes
GLM-5 supports thinking modes where the model reasons step-by-step before producing a final answer. This is particularly useful for complex coding tasks, mathematical reasoning, and multi-step planning. Enable it by setting the appropriate parameter in your API request to get both the reasoning trace and the final output.
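What enabling it might look like is sketched below. The `thinking` parameter name and shape are assumptions borrowed from Zhipu's earlier GLM API generations, and `reasoning_content` is likewise unconfirmed for GLM-5; treat both as placeholders and verify against the Z.ai documentation.

```typescript
// Hypothetical request payload for enabling thinking mode. The `thinking`
// field is an ASSUMED parameter (modeled on earlier GLM APIs), not
// confirmed GLM-5 syntax -- check the current Z.ai docs.
const thinkingRequest = {
  model: "glm-5",
  messages: [
    { role: "user" as const, content: "Prove that sqrt(2) is irrational." },
  ],
  thinking: { type: "enabled" }, // assumed parameter shape
  max_tokens: 8192,
};

// With the OpenAI SDK, a non-standard field has to be passed through untyped:
//   const res = await client.chat.completions.create(thinkingRequest as any);
// If a reasoning trace is returned, it may arrive in a separate field
// (e.g. `reasoning_content`) -- also an assumption to verify.
```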
6. GLM-5 vs GPT-5.4 vs Claude Opus 4.6 vs DeepSeek V4
The frontier model landscape in early 2026 is more competitive than ever. Here's how GLM-5 stacks up against the other leading models across the dimensions that matter most for production deployments:
| Factor | GLM-5 | GPT-5.4 | Claude Opus 4.6 | DeepSeek V4 |
|---|---|---|---|---|
| Parameters | 744B (40–44B active) | Undisclosed | Undisclosed | ~1T (32B active) |
| Architecture | MoE (256 experts) | Dense | Dense | MoE (256 experts) |
| SWE-bench Verified | 77.8% | ~75% | ~79.2% | ~78% |
| Context Window | 200K | 1M | 200K (1M beta) | 1M+ |
| Computer Use | No | Yes (native) | Yes | No |
| Open Weights | Yes | No | No | Yes |
| Training Hardware | Huawei Ascend | NVIDIA | NVIDIA / Google TPU | NVIDIA |
| Best For | Open-weight, self-hosted | Computer use, tool search | Long-horizon coding | Cost-sensitive workloads |
GLM-5 occupies a unique position: it's the only frontier model trained on non-NVIDIA hardware, and it's open-weight. If your requirements include data sovereignty, self-hosting, or avoiding vendor lock-in to US-based AI providers, GLM-5 is a strong candidate. For pure coding performance, Claude Opus 4.6 still leads slightly on SWE-bench, but GLM-5 is within striking distance.
For cost-sensitive workloads where you don't need computer use, both GLM-5 and DeepSeek V4 offer compelling alternatives to the proprietary models. The choice between them often comes down to context window requirements (DeepSeek V4's 1M+ vs GLM-5's 200K) and your comfort level with each provider's ecosystem.
7. Context Window & Streaming
GLM-5's 200K token context window is large enough for most production use cases: processing entire codebases, long documents, or multi-turn conversations. While it's smaller than GPT-5.4's 1M or DeepSeek V4's 1M+, the 200K window covers the vast majority of real-world workloads.
The 128K max output token limit is generous and matches GPT-5.4 and Claude Opus 4.6. This is enough to generate entire files, long-form documents, or detailed code reviews in a single response.
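For deciding whether a payload fits, a rough pre-flight estimate is often enough. The 4-characters-per-token heuristic below is a common rule of thumb for English text, not GLM-5's actual tokenizer, so leave generous headroom.

```typescript
// Rough pre-flight check against GLM-5's 200K-token context window,
// using the ~4 chars/token heuristic (approximate; tokenizers vary by
// language and content).
const CONTEXT_WINDOW = 200_000;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function fitsInContext(text: string, reservedForOutput = 8_192): boolean {
  return estimateTokens(text) + reservedForOutput <= CONTEXT_WINDOW;
}
```

If the check fails, chunk the input or summarize earlier turns before sending the request.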
Real-Time Streaming
GLM-5 supports real-time streaming via Server-Sent Events (SSE), delivering tokens as they're generated. This is essential for chat interfaces and any application where perceived latency matters.
```typescript
const stream = await client.chat.completions.create({
  model: "glm-5",
  messages: [
    { role: "user", content: "Explain MoE architecture in detail" },
  ],
  stream: true,
  max_tokens: 8192,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) {
    process.stdout.write(content);
  }
}
```
Context Caching
GLM-5 supports context caching through the Z.ai platform. When you send the same system prompt or document prefix across multiple requests, the cached tokens are processed at a significantly reduced cost. This is automatic β no code changes required. If you're building a chatbot with a long system prompt or processing multiple queries against the same document, caching can reduce your input token costs substantially.
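To gauge the savings, you can model the cached-token discount as a parameter, since the exact cached rate isn't quoted here. Everything below is an illustrative estimate, not Z.ai's billing formula.

```typescript
// Estimate input spend with context caching: a shared prefix (system
// prompt or document) is billed at a discounted rate on cache hits. The
// discount value is an assumption -- substitute the provider's published rate.
function inputCostWithCaching(
  cachedPrefixTokens: number,
  freshTokensPerRequest: number,
  requests: number,
  ratePerMillionUSD: number,
  cacheDiscount: number // e.g. 0.9 => cached tokens cost 10% of normal
): number {
  const cachedCostTokens = cachedPrefixTokens * requests * (1 - cacheDiscount);
  const freshCostTokens = freshTokensPerRequest * requests;
  return ((cachedCostTokens + freshCostTokens) / 1_000_000) * ratePerMillionUSD;
}

// 10K-token system prompt, 100 fresh tokens per request, 100 requests,
// at a nominal $1/M input rate:
const withCache = inputCostWithCaching(10_000, 100, 100, 1, 0.9); // ~$0.11
const noCache = inputCostWithCaching(10_000, 100, 100, 1, 0);     // ~$1.01
```

The longer and more stable the shared prefix, the closer the savings get to the discount itself.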
8. Self-Hosting & Deployment Options
One of GLM-5's biggest advantages is its open-weight license. You can download the model weights and deploy on your own infrastructure, giving you full control over data privacy, latency, and cost at scale.
With 744B total parameters but only 40–44B active per inference, GLM-5 is more feasible to self-host than its total parameter count suggests. The MoE architecture means you need enough memory to hold all expert weights, but the compute per token is comparable to a 40–44B dense model.
Deployment Options
- vLLM: The most popular open-source inference engine for large models. Supports MoE architectures with tensor parallelism across multiple GPUs.
- SGLang: High-performance serving framework with RadixAttention for efficient prefix caching, well-suited for GLM-5's MoE architecture.
- AWS (EC2 p5 / SageMaker): Deploy on p5 instances with 8x H100 GPUs for production-grade inference. SageMaker provides managed endpoints with auto-scaling.
- Huawei Cloud: Native support on Huawei's ModelArts platform with Ascend hardware, matching the training environment.
```bash
# Self-host GLM-5 with vLLM (example)
pip install vllm

python -m vllm.entrypoints.openai.api_server \
  --model THUDM/glm-5 \
  --tensor-parallel-size 8 \
  --max-model-len 200000 \
  --trust-remote-code \
  --port 8000

# Test the endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "THUDM/glm-5",
    "messages": [{"role": "user", "content": "Hello GLM-5!"}],
    "max_tokens": 256
  }'
```
Hardware Requirements
Self-hosting GLM-5 requires significant GPU memory. Plan for at least 8x H100 80GB GPUs (or equivalent) with tensor parallelism. For production workloads with high concurrency, consider 16+ GPUs with pipeline parallelism. The MoE architecture means all 744B parameters must fit in memory, even though only 40–44B are active per token.
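The memory arithmetic behind that guidance is worth doing explicitly. The bytes-per-parameter figures below are standard for the listed precisions; KV cache, activations, and framework overhead come on top, which is why an 8x 80GB setup in practice implies quantized weights.

```typescript
// Back-of-envelope weight memory for GLM-5: all 744B parameters must be
// resident, even though only ~44B are active per token.
const TOTAL_PARAMS = 744e9;

function weightMemoryGB(bytesPerParam: number): number {
  return (TOTAL_PARAMS * bytesPerParam) / 1e9;
}

const fp16GB = weightMemoryGB(2);   // 1488 GB -> ~19x 80GB GPUs, weights alone
const fp8GB = weightMemoryGB(1);    //  744 GB -> ~10x 80GB GPUs
const int4GB = weightMemoryGB(0.5); //  372 GB -> fits 8x 80GB with headroom
```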
9. Limitations & Considerations
GLM-5 is impressive, but it's not without trade-offs. Here's what to keep in mind before committing to it for production:
- No computer use: Unlike GPT-5.4 and Claude Opus 4.6, GLM-5 cannot control browsers or desktop applications natively. If you need autonomous computer interaction, look elsewhere.
- Smaller context window: At 200K tokens, GLM-5's context is large but trails GPT-5.4 (1M) and DeepSeek V4 (1M+). For workloads requiring million-token contexts, this is a real limitation.
- Ecosystem maturity: The Z.ai API ecosystem is less mature than OpenAI's or Anthropic's. Documentation, SDKs, and community tooling are still catching up.
- Geopolitical considerations: As a Chinese AI model, some enterprises may have compliance or regulatory concerns about data routing through Chinese infrastructure. Self-hosting mitigates this.
- Self-hosting cost: While the weights are free, the hardware required to run a 744B parameter model is substantial. Budget for 8+ high-end GPUs minimum.
- Benchmark verification: Some performance claims are based on Zhipu AI's internal testing. Independent verification at scale is still ongoing as of early 2026.
- English vs Chinese performance: GLM-5 was trained with a strong emphasis on Chinese language data. While English performance is competitive, it may not match models primarily optimized for English on certain nuanced tasks.
10. Why Lushbinary for AI Integration
At Lushbinary, we help teams evaluate, integrate, and deploy AI models like GLM-5 into production systems. Whether you need a multi-model routing architecture that picks the best model per task, a self-hosted GLM-5 deployment on AWS, or a cost-optimized pipeline that blends open-weight and proprietary models, we've built it.
Our team has hands-on experience with every major LLM API and self-hosting stack, from vLLM and SGLang to SageMaker and custom Kubernetes deployments. We can help you:
- Evaluate GLM-5 against your specific use case with real-world benchmarks, not just public leaderboard scores
- Build multi-model architectures that route between GLM-5, GPT-5.4, Claude, and DeepSeek based on task complexity and cost targets
- Deploy self-hosted inference on AWS, GCP, or your own infrastructure with proper scaling, monitoring, and failover
- Implement function calling pipelines with tool orchestration, error handling, and retry logic
- Optimize costs with context caching, batch processing, and intelligent model selection
Free AI Architecture Consultation
Not sure which model fits your use case? Book a free 30-minute call with our AI team. We'll review your workload, estimate costs across providers, and recommend the optimal architecture, whether that's GLM-5, a multi-model setup, or something else entirely.
Frequently Asked Questions
What is GLM-5 and who made it?
GLM-5 is a 744B parameter open-weight MoE model released February 11, 2026 by Zhipu AI (Z.ai). It was trained entirely on Huawei Ascend chips, making it the first frontier model built without NVIDIA hardware.
How does GLM-5 perform on coding benchmarks?
GLM-5 scores 77.8% on SWE-bench Verified and $4,432 on Vending Bench 2, placing it among the top frontier models alongside GPT-5.4 and Claude Opus 4.6.
Can I self-host GLM-5?
Yes. GLM-5 is open-weight under a permissive license. You can deploy it on your own infrastructure using vLLM, SGLang, or other serving frameworks. Plan for at least 8x H100 GPUs for the full model.
How does GLM-5 compare to DeepSeek V4?
Both are open-weight MoE models from Chinese AI labs. DeepSeek V4 has a larger context window (1M+ vs 200K) and more total parameters (~1T vs 744B). GLM-5's unique advantage is its Huawei Ascend training, proving non-NVIDIA frontier AI is viable.
Does GLM-5 support function calling and streaming?
Yes. GLM-5 supports native function calling, structured JSON output, real-time streaming via SSE, context caching, and thinking modes for complex reasoning tasks.
Sources
- Zhipu AI (Z.ai) β GLM-5 Official Platform
- THUDM GitHub β GLM Model Repository
- SWE-bench Leaderboard
- OpenAI GPT-5.4 Announcement (for comparison data)
- DeepSeek API Documentation (for comparison data)
Benchmark data sourced from official Zhipu AI announcements and independent benchmark leaderboards as of February 2026. Pricing and benchmark scores may change; always verify on the vendor's website.
Ready to Integrate GLM-5 Into Your Product?
From API setup and multi-model routing to self-hosted deployments and cost optimization, we ship production AI integrations that work. Tell us about your project.
Build Smarter, Launch Faster.
Book a free strategy call and explore how LushBinary can turn your vision into reality.
