SHROUD Pricing Intelligence

Board Presentation — April 3, 2026 | Data: March 25, 2026
Executive Summary
Shroud can be profitable from day one at market-competitive prices
Small models ($0.04/1M cost) are highly profitable. Large models ($1–$8/1M cost) need CU-weighted billing to stay profitable at market-competitive prices. True CC: H200 for single-GPU, B200 encrypted NVLink for multi-GPU. Venice NVLink = plaintext.
76%
Gross margin at Standard tier
B200 Blackwell — Required for True Confidential Compute
NVIDIA B200 — Only GPU with encrypted NVLink
B200 has hardware-encrypted NVLink and NVSwitch — the only way to run multi-GPU models with true E2EE. H200/Hopper NVLink is unencrypted plaintext — Venice uses this and falsely markets it as E2EE. 192GB HBM3e + FP4 native. T-WAP $5.50/hr, prices rising (Vast.ai hit $9.38/hr March 21).
NVLINK
Encrypted
VRAM
192 GB
VS H200
1.7x faster
H200 NVLINK
Plaintext
Ask for the Board
7× H200 + 25× B200
H200 for single-GPU models (≤70B). B200 for multi-GPU — only GPU with encrypted NVLink for true CC.
~$115K/mo budget
7× H200 ($18K) + 25× B200 ($99K). GLM-5 alone = 8× B200 ($32K). All tiers profitable with CU weights.
H200 Market T-WAP
$3.50/hr
141GB HBM3e | Range: $2.10 – $6.31
Our Cost / 1M Tokens
$0.22
Llama 70B on H200 SXM (calculator-linked)
Market Price / 1M Tokens
$0.35–$0.88
Open source 70B tier
Gross Margin (target)
58%
At $0.50/1M output tokens
GPU Compute Market Pricing
Live marketplace rates as of March 25, 2026. Sorted by on-demand price.
Provider | GPU | VRAM | Spot $/hr | On-Demand $/hr | Type | vs T-WAP
T-WAP by GPU ($/hr)
T-WAP = Time-Weighted Average Price across all providers weighted by market availability
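As a sketch, the availability-weighted average defined above can be computed like this (the offer list is hypothetical, not the actual provider data behind the $3.50 figure):

```python
# T-WAP sketch: each provider's hourly price weighted by its share of
# market availability. Offer data below is illustrative only.
def twap(offers):
    """offers: list of (price_per_hr, available_gpus) tuples."""
    total = sum(n for _, n in offers)
    return sum(p * n for p, n in offers) / total

# Hypothetical H200 offers spanning the $2.10–$6.31 range on the slide
h200_offers = [(2.10, 40), (3.40, 120), (4.20, 60), (6.31, 10)]
print(round(twap(h200_offers), 2))  # → 3.51, near the $3.50 T-WAP
```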
Cost per Million Tokens by GPU (Llama 70B)
Based on typical throughput. H200 and B200 deliver best cost/token ratio.
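The $/1M-token figures in this deck follow directly from hourly GPU cost divided by sustained throughput; a minimal check, assuming ~4,500 tok/s for Llama 70B FP8 on one H200:

```python
# Cost per million tokens = hourly GPU rate / tokens generated per hour.
def cost_per_million(rate_per_hr, tokens_per_sec):
    tokens_per_hr = tokens_per_sec * 3600
    return rate_per_hr / tokens_per_hr * 1_000_000

# H200 T-WAP $3.50/hr, assumed ~4,500 tok/s on Llama 70B FP8
print(round(cost_per_million(3.50, 4500), 2))  # → 0.22, matching the slide
```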
LLM Inference API Pricing
Prices per 1M tokens across all major providers. March 2026.
Provider | Model | Category | Input $/1M | Output $/1M | Out/In Ratio | Blended $/1M
Blended price = (input + output) / 2. Batch API discounts (typically 50% off) available at most providers but not shown.
Interactive Profitability Calculator
Adjust parameters to see revenue, cost, and margin for Shroud.
Infrastructure Parameters
Number of GPUs: 10
GPU Utilization: 70%
Avg Model Size: 70B (options: 8B · 70B · 405B · DeepSeek)
Pricing Parameters
Your Input Price ($/1M tokens): $0.30
Your Output Price ($/1M tokens): $0.90
Input/Output Token Ratio: 30/70
Your Price vs Market (Output tokens, $/1M)
Where you sit relative to competitors
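The calculator's arithmetic can be sketched as follows, using the default parameters above and an assumed 4,500 tok/s per H200 on a 70B model (30-day month):

```python
# Profitability sketch: revenue from blended token price vs. GPU rental cost.
# Throughput of 4,500 tok/s per GPU is an assumption, not a measured figure.
def monthly_pnl(gpus=10, util=0.70, tok_per_sec=4500,
                gpu_rate_hr=3.50, in_price=0.30, out_price=0.90,
                input_ratio=0.30):
    seconds = 30 * 24 * 3600
    tokens_m = gpus * util * tok_per_sec * seconds / 1e6   # millions of tokens
    blended = input_ratio * in_price + (1 - input_ratio) * out_price
    revenue = tokens_m * blended                           # $/mo
    cost = gpus * gpu_rate_hr * 24 * 30                    # $/mo
    margin = (revenue - cost) / revenue
    return revenue, cost, margin

rev, cost, margin = monthly_pnl()
print(f"revenue=${rev:,.0f} cost=${cost:,.0f} margin={margin:.0%}")
```

With the slide defaults this lands at roughly 57% gross margin, consistent with the 58% target card above.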
GPU Requirements by Model
Minimum GPU count to run each model. One Cocoon worker = one model instance.
Model | Parameters | Min VRAM | GPU Config | CC Requirement | Cost/mo | Notes
Total GPU Requirement Summary
H200 T-WAP $3.50/hr, B200 T-WAP $5.50/hr. Multi-GPU models require B200 (Blackwell) for encrypted NVLink — true Confidential Compute. H200 (Hopper) NVLink is unencrypted, safe only for single-GPU.
Inference Framework: vLLM vs SGLang (March 2026)
SGLang — Best for MoE
• +29% throughput vs vLLM on 8B
• RadixAttention: 5x speedup on prefix-heavy (RAG, agents)
• Native Expert Parallelism (DeepSeek, Qwen3 MoE, Llama 4)
• SGLang EP72: 22K+ tok/s output on DeepSeek at scale
Recommended: all MoE models
vLLM — Best for Dense/Diverse
• Broader model compatibility, plugin ecosystem
• Wins on large dense models (120B+) at high concurrency
• P-EAGLE speculative decoding (up to 4.79x on Llama 70B)
• Day-zero support for new models (Llama 4, DeepSeek V3.2)
Recommended: dense models, rapid iteration
FP8 Quantization (Production Standard)
• -2.7% avg quality loss vs BF16 (acceptable)
• +33% throughput, +50% capacity vs BF16
• Native on H200/B200 — zero overhead
• DeepSeek & Qwen3 ship official FP8 checkpoints
• B200 adds FP4: 2x throughput vs H200 (Blackwell only)
Deployment Roadmap & Market Signal
Shroud can host any open-source model. Single-GPU on H200, multi-GPU on B200 with encrypted NVLink.
Venice.ai — Primary Competitor
43 models total: 18 self-hosted, 15 proxied (Anon), 11 "E2EE"
$180/yr. "Anon" models = API proxy to OpenAI/Google/Anthropic (not self-hosted). E2EE on Hopper = NVLink plaintext on multi-GPU.
SELF-HOSTED
18
E2EE MODELS
11
MULTI-GPU E2EE
Insecure
OUR EDGE
B200 NVLink
Venice doesn't have: DeepSeek R1 · Llama 3.3 70B · Llama 4 Scout · Llama 4 Maverick (Venice dropped it) · Qwen3-235B. Venice fake E2EE: GLM-5 (multi-GPU, NVLink plaintext).
OpenRouter 100T-Token Study — Real API Traffic (OSS only)
#1 OSS TRAFFIC
DeepSeek family
14.37T tokens
#2 OSS TRAFFIC
Qwen family
5.59T tokens
#3 OSS TRAFFIC
Meta Llama
3.96T tokens
#4 OSS TRAFFIC
Mistral
2.92T tokens
⚠️ HF downloads skew to small self-hosted models (7B–8B). Production API traffic skews 70B+ and frontier MoE. A new provider needs both.
Tier | Models | GPUs | $/mo | HF + API Signal | Rationale
P0 — Day 1 | Qwen2.5-7B · Qwen3-8B · Llama-3.1-8B · Llama-3.3-70B | 4× H200 | ~$12K | 19.6M + 9.1M + 7.8M HF/mo | Top downloads; 1× H200 each — single GPU, no NVLink
P1 — Week 2 | Llama 4 Maverick · Qwen3-235B · Qwen3-32B | 2× H200 + 5× B200 | +$26K | 4.5M HF + #2 OSS API | Qwen3-32B 1× H200; Maverick 3× B200; Qwen3-235B 2× B200
P1 — Week 3 | GLM-5 (#1 Elo) · Kimi K2.5 (MIT) | +12× B200 | +$48K | #1 Arena Elo 1451, 76.8% SWE | GLM-5 8× B200 DGX (FP8); Kimi K2 4× B200 — encrypted NVLink
P0 — Month 2 | DeepSeek V3.2 · DeepSeek R1 | +6× B200 | +$24K | 14.37T tokens on OpenRouter | #1 OSS API globally; 3× B200 each, INT4
HuggingFace Popularity Signal — March 2026 (Actual Download Numbers)
Top Downloads / Month
#1 Qwen2.5-7B-Instruct 19.6M
#4 Qwen3-8B 9.09M
#7 Llama-3.1-8B-Instruct 7.79M
#17 GLM-5-FP8 4.3M
Qwen: 8 of top 20 models, 113K+ forks
Quality Benchmark (whatllm.org, Feb 2026)
#1 GLM-5 (Reasoning) 49.64
#2 Kimi K2.5 (Reasoning) 46.73
#3 MiniMax M2.5 41.97
#5 DeepSeek V3.2 41.2
DeepSeek R1 = most liked model in HF history
Shroud Subscription Plans & Unit Economics
Pricing tiers, per-model cost structure, and margin analysis. Payment: Stripe (USD) + TON.
Subscription Plans
Plan | $/mo | Included tokens | Overage $/1M | req/s | Seats
Free | $0 | 100K | n/a | 1 | 2
Developer | $29 | 20M | $1.00 | 10 | 3
Startup | $99 | 50M | $0.50 | 50 | 10
Enterprise | $499 | 500M | $0.20 | 200 | 50
Tokens are billed with model-tier weights (see below). 1 token on a Tier S model = 1 CU, on a Tier L model = 50 CU. Overage rate applies to weighted CUs.
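The CU-weighting step can be sketched as follows (tier weights of 1, 15, and 50 CU per token, per the tiers described here):

```python
# CU-weighted billing sketch: each model's tokens are multiplied by its
# tier weight before being counted against the plan's token budget.
CU_WEIGHT = {"tier_s": 1, "tier_m": 15, "tier_l": 50}

def billable_cus(usage):
    """usage: list of (tier, tokens) pairs -> total weighted CUs."""
    return sum(CU_WEIGHT[tier] * tokens for tier, tokens in usage)

# 5M tokens on an 8B model (Tier S) + 100K tokens on DeepSeek R1 (Tier L)
usage = [("tier_s", 5_000_000), ("tier_l", 100_000)]
print(billable_cus(usage))  # 5M×1 + 100K×50 = 10M CUs
```

The overage rate then applies to the CU total, so a Developer-plan customer burning mostly Tier L tokens exhausts the 20M budget 50× faster than one on Tier S models.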
Cost per Model
Model | GPU Config | CC | $/mo GPU | Throughput (tok/s) | Cost $/1M tokens | CU Weight | Effective price to customer | Gross Margin
Llama 3.1 8B | 1× H200 | H200 | $2,520 | 24,000 | $0.04 | 1× | $0.20–$1.00/1M | 75–95%
Llama 3.3 70B / Qwen3-32B | 1× H200 FP8 | H200 | $2,520 | 4,500 | $0.22 | 1× | $0.20–$1.00/1M | –23% to 74%
Qwen3-235B (MoE) | 2× B200 | B200 | $7,920 | 1,300 | $2.35 | 15× | $3.00–$15.00/1M | 22–84%
Llama 4 Maverick | 3× B200 | B200 | $11,880 | 1,200 | $3.82 | 50× | $10.00–$50.00/1M | 62–92%
DeepSeek R1 / V3.2 | 3× B200 | B200 | $11,880 | ~3,000 | $1.53 | 50× | $10.00–$50.00/1M | 85–97%
GLM-5 (744B MoE) | 8× B200 DGX | B200 | $31,680 | ~1,370 | $8.92 | Custom | Enterprise only | By contract
Kimi K2 (1T MoE) | 4× B200 | B200 | $15,840 | ~800 | $7.64 | Custom | Enterprise only | By contract
CU Weight Tiers — How Billing Works
Tier | Models | CU per token | Customer pays (overage range) | Our cost | Min. margin
Tier S (≤32B) | Llama 8B, Qwen3-8B, Gemma 27B, Mistral 24B, GPT-OSS 20B/120B | 1× | $0.20–$1.00/1M | $0.04 | 75–96%
Tier S (70B) | Llama 3.3 70B FP8, Qwen3-32B | 1× | $0.20–$1.00/1M | $0.22 | –9% to 78%
Tier M (235B MoE) | Qwen3-235B | 15× | $3.00–$15.00/1M | $2.35 | 22–84%
Tier L (685B+) | DeepSeek R1/V3.2, Llama 4 Maverick | 50× | $10.00–$50.00/1M | $1.53–$3.82 | 62–97%
Tier XL (744B+) | GLM-5, Kimi K2, Kimi K2.5 | Enterprise only | Custom pricing | $7.64–$8.92/1M | By contract
Customer sees a simple token budget. Behind the scenes, each token is multiplied by the model's CU weight before billing. All tiers are profitable at any subscription level.
Launch Summary
Key differentiators, payment infrastructure, and launch targets.
Payment Methods
TON (crypto): Priority
Stripe (card): Priority
X42 payments: Planned
MPP payments: Planned
MoonPay: Planned
Key Differentiators vs Venice
✓ True multi-GPU CC — B200 encrypted NVLink (Venice = Hopper plaintext)
✓ Any open-source model — not limited to a fixed list
✓ Models Venice doesn't have (DeepSeek R1, Llama 3/4, Qwen3-235B)
✓ TON + Stripe payments, OpenAI-compatible API
Minimum Viable Launch
Models at launch: 30 (7× H200 + 25× B200)
Monthly GPU cost: ~$115K
Breakeven tokens/mo: ~190M
Breakeven revenue: ~$85K
Target launch: Q2 2026