Executive Summary — Board Presentation April 3, 2026
Shroud can be profitable from day one at market-competitive prices
Small models ($0.04/1M) are highly profitable. Large models ($1–$8/1M cost) need CU weights to remain profitable at market prices. True CC: H200 for single-GPU models, B200 encrypted NVLink for multi-GPU. Venice's NVLink traffic is plaintext.
76%
Gross margin at Standard tier
B200 Blackwell — Required for True Confidential Compute
NVIDIA B200 — Only GPU with encrypted NVLink
B200 has hardware-encrypted NVLink and NVSwitch — the only way to run multi-GPU models with true E2EE.
H200/Hopper NVLink is unencrypted — Venice runs multi-GPU models over it while marketing them as E2EE.
192GB HBM3e + FP4 native. T-WAP $5.50/hr, prices rising (Vast.ai hit $9.38/hr March 21).
Ask for the Board
7× H200 + 25× B200
H200 for single-GPU models (≤70B). B200 for multi-GPU — only GPU with encrypted NVLink for true CC.
~$115K/mo budget
7× H200 ($18K) + 25× B200 ($99K). GLM-5 alone = 8× B200 ($32K). All tiers profitable with CU weights.
H200 Market T-WAP
$3.50/hr
141GB HBM3e | Range: $2.10 – $6.31
Our Cost / 1M Tokens
$0.22
Llama 70B on H200 SXM (calculator-linked)
Market Price / 1M Tokens
$0.35–$0.88
Open source 70B tier
Gross Margin (target)
56%
At $0.50/1M output tokens
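The $/1M cost figures above follow directly from GPU rent and sustained throughput; a minimal sketch, using the deck's own rates and assuming full utilization:

```python
# Cost to generate 1M tokens on a rented GPU, given its hourly rate and
# sustained decode throughput (assumes 100% utilization).
def cost_per_million(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

# H200 at the $3.50/hr T-WAP:
print(round(cost_per_million(3.50, 4_500), 2))   # Llama 70B FP8 -> 0.22
print(round(cost_per_million(3.50, 24_000), 2))  # Llama 8B      -> 0.04
```

The same arithmetic at the monthly level ($2,520/mo ÷ 720 hr = $3.50/hr) reproduces every per-model cost in the table further below.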
GPU Compute Market Pricing
Live marketplace rates as of March 25, 2026. Sorted by on-demand price.
| Provider | GPU | VRAM | Spot $/hr | On-Demand $/hr | Type | vs T-WAP |
Cost per Million Tokens by GPU (Llama 70B)
LLM Inference API Pricing
Prices per 1M tokens across all major providers. March 2026.
| Provider | Model | Category | Input $/1M | Output $/1M | Out/In Ratio | Blended $/1M |
Interactive Profitability Calculator
Adjust parameters to see revenue, cost, and margin for Shroud.
Your Price vs Market (Output tokens, $/1M)
Where you sit relative to competitors
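The calculator's core arithmetic is simple; a sketch with illustrative inputs, not the widget's exact blended assumptions:

```python
# Revenue, GPU cost, and gross margin for a given token volume.
def profitability(price_per_m: float, cost_per_m: float, tokens_m: float):
    revenue = price_per_m * tokens_m   # USD
    cost = cost_per_m * tokens_m       # USD
    margin = (revenue - cost) / revenue if revenue else 0.0
    return revenue, cost, margin

# 1B tokens/mo at the $0.50/1M price point, $0.22/1M Llama-70B cost:
rev, cost, margin = profitability(0.50, 0.22, 1_000)
print(rev, round(cost, 2), round(margin * 100))  # 500.0 220.0 56
```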
GPU Requirements by Model
Minimum GPU count to run each model. One Cocoon worker = one model instance.
| Model | Parameters | Min VRAM | GPU Config | CC Requirement | Cost/mo | Notes |
Total GPU Requirement Summary
Inference Framework: vLLM vs SGLang (March 2026)
SGLang — Best for MoE
• +29% throughput vs vLLM on 8B
• RadixAttention: 5x speedup on prefix-heavy (RAG, agents)
• Native Expert Parallelism (DeepSeek, Qwen3 MoE, Llama 4)
• SGLang EP72: 22K+ tok/s output on DeepSeek at scale
• Recommended: all MoE models
vLLM — Best for Dense/Diverse
• Broader model compatibility, plugin ecosystem
• Wins on large dense models (120B+) at high concurrency
• P-EAGLE speculative decoding (up to 4.79x on Llama 70B)
• Day-zero support for new models (Llama 4, DeepSeek V3.2)
• Recommended: dense models, rapid iteration
FP8 Quantization (Production Standard)
• -2.7% avg quality loss vs BF16 (acceptable)
• +33% throughput, +50% capacity vs BF16
• Native on H200/B200 — zero overhead
• DeepSeek & Qwen3 ship official FP8 checkpoints
• B200 adds FP4: 2x throughput vs H200 (Blackwell only)
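The capacity claims reduce to weight-memory arithmetic; a rough sketch that ignores KV cache, activations, and runtime overhead:

```python
# Approximate weight memory: 1B params at N bytes/param ~= N GB.
def weight_gb(params_b: int, bytes_per_param: int) -> int:
    return params_b * bytes_per_param

# Llama 70B on a 141GB H200:
print(weight_gb(70, 2))  # BF16: 140 GB -- no headroom left for KV cache
print(weight_gb(70, 1))  # FP8:   70 GB -- fits 1x H200 with serving room
```

This is why Llama 3.3 70B is listed as "1× H200 FP8" in the cost table: BF16 weights alone saturate the card.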
Deployment Roadmap & Market Signal
Shroud can host any open-source model. Single-GPU on H200, multi-GPU on B200 with encrypted NVLink.
Venice.ai — Primary Competitor
43 models total: 18 self-hosted, 15 proxied (Anon), 11 "E2EE"
$180/yr. "Anon" models = API proxy to OpenAI/Google/Anthropic (not self-hosted). E2EE on Hopper = NVLink plaintext on multi-GPU.
Venice doesn't have:
DeepSeek R1
Llama 3.3 70B
Llama 4 Scout
Llama 4 Maverick (Venice dropped it)
Qwen3-235B
Venice fake E2EE:
GLM-5 (multi-GPU, NVLink plaintext)
OpenRouter 100T-Token Study — Real API Traffic (OSS only)
#1 OSS TRAFFIC
DeepSeek family
14.37T tokens
#2 OSS TRAFFIC
Qwen family
5.59T tokens
#3 OSS TRAFFIC
Meta Llama
3.96T tokens
#4 OSS TRAFFIC
Mistral
2.92T tokens
⚠️ HF downloads skew to small self-hosted models (7B–8B). Production API traffic skews 70B+ and frontier MoE. A new provider needs both.
| Tier | Models | GPUs | $/mo | HF + API Signal | Rationale |
| P0 — Day 1 | Qwen2.5-7B · Qwen3-8B · Llama-3.1-8B · Llama-3.3-70B | 4× H200 | ~$12K | 19.6M + 9.1M + 7.8M HF/mo | Top downloaded; 1× H200 each — single GPU, no NVLink |
| P1 — Week 2 | Llama 4 Maverick · Qwen3-235B · Qwen3-32B | 2× H200 + 5× B200 | +$26K | 4.5M HF + #2 OSS API | Qwen3-32B 1× H200; Maverick 3× B200, Qwen3-235B 2× B200 |
| P1 — Week 3 | GLM-5 (#1 Elo) · Kimi K2.5 (MIT) | +12× B200 | +$48K | #1 Arena Elo 1451, 76.8% SWE | GLM-5 8× B200 DGX (FP8), Kimi K2 4× B200 — encrypted NVLink |
| P0 — Month 2 | DeepSeek V3.2 · DeepSeek R1 | +6× B200 | +$24K | 14.37T tokens OpenRouter | #1 OSS API globally — 3× B200 each, INT4 |
HuggingFace Popularity Signal — March 2026 (Actual Download Numbers)
Top Downloads / Month
#1 Qwen2.5-7B-Instruct
19.6M
#4 Qwen3-8B
9.09M
#7 Llama-3.1-8B-Instruct
7.79M
#17 GLM-5-FP8
4.3M
Qwen: 8 of top 20 models, 113K+ forks
Quality Benchmark (whatllm.org, Feb 2026)
#1 GLM-5 (Reasoning)
49.64
#2 Kimi K2.5 (Reasoning)
46.73
#3 MiniMax M2.5
41.97
#5 DeepSeek V3.2
41.2
DeepSeek R1 = most liked model in HF history
Shroud Subscription Plans & Unit Economics
Pricing tiers, per-model cost structure, and margin analysis. Payment: Stripe (USD) + TON.
Subscription Plans
| Plan | $/mo | Included tokens | Overage $/1M | req/s | Seats |
| Free | $0 | 100K | — | 1 | 2 |
| Developer | $29 | 20M | $1.00 | 10 | 3 |
| Startup | $99 | 50M | $0.50 | 50 | 10 |
| Enterprise | $499 | 500M | $0.20 | 200 | 50 |
Cost per Model
| Model | GPU Config | CC | $/mo GPU | Throughput (tok/s) | Cost $/1M tokens | CU Weight | Effective price to customer | Gross Margin |
| Llama 3.1 8B | 1× H200 | H200 | $2,520 | 24,000 | $0.04 | 1× | $0.20–$1.00/1M | 80–96% |
| Llama 3.3 70B / Qwen3-32B | 1× H200 FP8 | H200 | $2,520 | 4,500 | $0.22 | 1× | $0.20–$1.00/1M | –10% to 78% |
| Qwen3-235B (MoE) | 2× B200 | B200 | $7,920 | 1,300 | $2.35 | 15× | $3.00–$15.00/1M | 22–84% |
| Llama 4 Maverick | 3× B200 | B200 | $11,880 | 1,200 | $3.82 | 50× | $10.00–$50.00/1M | 62–92% |
| DeepSeek R1 / V3.2 | 3× B200 | B200 | $11,880 | ~3,000 | $1.53 | 50× | $10.00–$50.00/1M | 85–97% |
| GLM-5 (744B MoE) | 8× B200 DGX | B200 | $31,680 | ~1,370 | $8.92 | Custom | Enterprise only | By contract |
| Kimi K2 (1T MoE) | 4× B200 | B200 | $15,840 | ~800 | $7.64 | Custom | Enterprise only | By contract |
CU Weight Tiers — How Billing Works
| Tier | Models | CU per token | Customer pays (overage range) | Our cost | Min. margin |
| Tier S ≤32B | Llama 8B, Qwen3-8B, Gemma 27B, Mistral 24B, GPT-OSS 20B/120B | 1× | $0.20–$1.00 / 1M | $0.04 | 75–96% |
| Tier S 70B | Llama 3.3 70B FP8, Qwen3-32B | 1× | $0.20–$1.00 / 1M | $0.22 | –10% to 78% |
| Tier M 235B MoE | Qwen3-235B | 15× | $3.00–$15.00 / 1M | $2.35 | 22–84% |
| Tier L 685B+ | DeepSeek R1/V3.2, Llama 4 Maverick | 50× | $10.00–$50.00 / 1M | $1.53–$3.82 | 62–97% |
| Tier XL 744B+ | GLM-5, Kimi K2, Kimi K2.5 | Enterprise only | Custom pricing | $7.64–$8.92 | — |
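Billing reduces to one multiplication: the plan's base overage rate times the model's CU weight. A sketch using the tier weights above (the tier keys are illustrative names):

```python
# Overage charge in USD: million tokens x plan base rate x model CU weight.
CU_WEIGHT = {"tier_s": 1, "tier_m": 15, "tier_l": 50}

def overage_charge(tokens_m: float, base_rate_per_m: float, tier: str) -> float:
    return tokens_m * base_rate_per_m * CU_WEIGHT[tier]

# 10M overage tokens on the Startup plan ($0.50/1M base rate):
print(overage_charge(10, 0.50, "tier_s"))  # 5.0  -> $0.50/1M effective
print(overage_charge(10, 0.50, "tier_m"))  # 75.0 -> $7.50/1M effective
```

This is why the customer-facing ranges scale linearly with the weights: Tier M at 15× maps the $0.20–$1.00 base band onto $3.00–$15.00.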
Launch Summary
Key differentiators, payment infrastructure, and launch targets.
Payment Methods
TON (crypto): Priority
Stripe (card): Priority
X42 payments: Planned
MPP payments: Planned
MoonPay: Planned
Key Differentiators vs Venice
✓ True multi-GPU CC — B200 encrypted NVLink (Venice = Hopper plaintext)
✓ Any open-source model — not limited to a fixed list
✓ Models Venice doesn't have (DeepSeek R1, Llama 3/4, Qwen3-235B)
✓ TON + Stripe payments, OpenAI-compatible API
Minimum Viable Launch
Models: 30 (7× H200 + 25× B200)
Monthly GPU cost: ~$115K
Breakeven tokens/mo: ~190M
Breakeven revenue: ~$85K
Target launch: Q2 2026