Executive Summary — Board Presentation April 3, 2026
Shroud can be profitable from day one at market-competitive prices
Small models ($0.04/1M) are highly profitable. Large models ($1–$8/1M cost) need CU weights to remain profitable at market prices. True CC: H200 for single-GPU models, B200 encrypted NVLink for multi-GPU. Venice's NVLink traffic is plaintext.
76%
Gross margin at Standard tier
B200 Blackwell — Required for True Confidential Compute
NVIDIA B200 — Only GPU with encrypted NVLink
B200 has hardware-encrypted NVLink and NVSwitch — the only way to run multi-GPU models with true E2EE.
H200/Hopper NVLink is unencrypted — Venice runs multi-GPU models over it while marketing them as E2EE.
192GB HBM3e + FP4 native. T-WAP $5.50/hr, prices rising (Vast.ai hit $9.38/hr March 21).
Ask for the Board
7× H200 + 25× B200
H200 for single-GPU models (≤70B). B200 for multi-GPU — only GPU with encrypted NVLink for true CC.
~$115K/mo budget
7× H200 ($18K) + 25× B200 ($99K). GLM-5 alone = 8× B200 ($32K). All tiers profitable with CU weights.
H200 Market T-WAP
$3.50/hr
141GB HBM3e | Range: $2.10 – $6.31
Our Cost / 1M Tokens
$0.22
Llama 70B on H200 SXM (calculator-linked)
Market Price / 1M Tokens
$0.35–$0.88
Open source 70B tier
Gross Margin (target)
56%
At $0.50/1M output tokens
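The $/1M cost figures above follow directly from GPU rent and sustained throughput; a minimal sketch, using the deck's own rates and assuming full utilization:

```python
# Cost to generate 1M tokens on a rented GPU, given its hourly rate and
# sustained decode throughput (assumes 100% utilization).
def cost_per_million(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

# H200 at the $3.50/hr T-WAP:
print(round(cost_per_million(3.50, 4_500), 2))   # Llama 70B FP8 -> 0.22
print(round(cost_per_million(3.50, 24_000), 2))  # Llama 8B      -> 0.04
```

The same arithmetic at the monthly level ($2,520/mo ÷ 720 hr = $3.50/hr) reproduces every per-model cost in the table further below.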
GPU Compute Market Pricing
Live marketplace rates as of March 25, 2026. Sorted by on-demand price.
| Provider | GPU | VRAM | Spot $/hr | On-Demand $/hr | Type | vs T-WAP |
Cost per Million Tokens by GPU (Llama 70B)
LLM Inference API Pricing
Prices per 1M tokens across all major providers. March 2026.
| Provider | Model | Category | Input $/1M | Output $/1M | Out/In Ratio | Blended $/1M |
Interactive Profitability Calculator
Adjust parameters to see revenue, cost, and margin for Shroud.
Your Price vs Market (Output tokens, $/1M)
Where you sit relative to competitors
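The calculator's core arithmetic is simple; a sketch with illustrative inputs, not the widget's exact blended assumptions:

```python
# Revenue, GPU cost, and gross margin for a given token volume.
def profitability(price_per_m: float, cost_per_m: float, tokens_m: float):
    revenue = price_per_m * tokens_m   # USD
    cost = cost_per_m * tokens_m       # USD
    margin = (revenue - cost) / revenue if revenue else 0.0
    return revenue, cost, margin

# 1B tokens/mo at the $0.50/1M price point, $0.22/1M Llama-70B cost:
rev, cost, margin = profitability(0.50, 0.22, 1_000)
print(rev, round(cost, 2), round(margin * 100))  # 500.0 220.0 56
```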
GPU Requirements by Model
Minimum GPU count to run each model. One Cocoon worker = one model instance.
| Model | Parameters | Min VRAM | GPU Config | CC Requirement | Cost/mo | Notes |
Total GPU Requirement Summary
Inference Framework: vLLM vs SGLang (March 2026)
SGLang — Best for MoE
• +29% throughput vs vLLM on 8B
• RadixAttention: 5x speedup on prefix-heavy (RAG, agents)
• Native Expert Parallelism (DeepSeek, Qwen3 MoE, Llama 4)
• SGLang EP72: 22K+ tok/s output on DeepSeek at scale
• Recommended: all MoE models
vLLM — Best for Dense/Diverse
• Broader model compatibility, plugin ecosystem
• Wins on large dense models (120B+) at high concurrency
• P-EAGLE speculative decoding (up to 4.79x on Llama 70B)
• Day-zero support for new models (Llama 4, DeepSeek V3.2)
• Recommended: dense models, rapid iteration
FP8 Quantization (Production Standard)
• -2.7% avg quality loss vs BF16 (acceptable)
• +33% throughput, +50% capacity vs BF16
• Native on H200/B200 — zero overhead
• DeepSeek & Qwen3 ship official FP8 checkpoints
• B200 adds FP4: 2x throughput vs H200 (Blackwell only)
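The capacity claims reduce to weight-memory arithmetic; a rough sketch that ignores KV cache, activations, and runtime overhead:

```python
# Approximate weight memory: 1B params at N bytes/param ~= N GB.
def weight_gb(params_b: int, bytes_per_param: int) -> int:
    return params_b * bytes_per_param

# Llama 70B on a 141GB H200:
print(weight_gb(70, 2))  # BF16: 140 GB -- no headroom left for KV cache
print(weight_gb(70, 1))  # FP8:   70 GB -- fits 1x H200 with serving room
```

This is why Llama 3.3 70B is listed as "1× H200 FP8" in the cost table: BF16 weights alone saturate the card.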
Deployment Roadmap & Market Signal
Shroud can host any open-source model. Single-GPU on H200, multi-GPU on B200 with encrypted NVLink.
Venice.ai — Primary Competitor
43 models total: 18 self-hosted, 15 proxied (Anon), 11 "E2EE"
$180/yr. "Anon" models = API proxy to OpenAI/Google/Anthropic (not self-hosted). E2EE on Hopper = NVLink plaintext on multi-GPU.
Venice doesn't have:
DeepSeek R1
Llama 3.3 70B
Llama 4 Scout
Llama 4 Maverick (Venice dropped it)
Qwen3-235B
Venice fake E2EE:
GLM-5 (multi-GPU, NVLink plaintext)
OpenRouter 100T-Token Study — Real API Traffic (OSS only)
#1 OSS TRAFFIC
DeepSeek family
14.37T tokens
#2 OSS TRAFFIC
Qwen family
5.59T tokens
#3 OSS TRAFFIC
Meta Llama
3.96T tokens
#4 OSS TRAFFIC
Mistral
2.92T tokens
⚠️ HF downloads skew to small self-hosted models (7B–8B). Production API traffic skews 70B+ and frontier MoE. A new provider needs both.
| Tier | Models | GPUs | $/mo | HF + API Signal | Rationale |
| P0 — Day 1 | Qwen2.5-7B · Qwen3-8B · Llama-3.1-8B · Llama-3.3-70B | 4× H200 | ~$12K | 19.6M + 9.1M + 7.8M HF/mo | Top downloaded; 1× H200 each — single GPU, no NVLink |
| P1 — Week 2 | Llama 4 Maverick · Qwen3-235B · Qwen3-32B | 2× H200 + 5× B200 | +$26K | 4.5M HF + #2 OSS API | Qwen3-32B 1× H200; Maverick 3× B200, Qwen3-235B 2× B200 |
| P1 — Week 3 | GLM-5 (#1 Elo) · Kimi K2.5 (MIT) | +12× B200 | +$48K | #1 Arena Elo 1451, 76.8% SWE | GLM-5 8× B200 DGX (FP8), Kimi K2 4× B200 — encrypted NVLink |
| P0 — Month 2 | DeepSeek V3.2 · DeepSeek R1 | +6× B200 | +$24K | 14.37T tokens OpenRouter | #1 OSS API globally — 3× B200 each, INT4 |
HuggingFace Popularity Signal — March 2026 (Actual Download Numbers)
Top Downloads / Month
#1 Qwen2.5-7B-Instruct
19.6M
#4 Qwen3-8B
9.09M
#7 Llama-3.1-8B-Instruct
7.79M
#17 GLM-5-FP8
4.3M
Qwen: 8 of top 20 models, 113K+ forks
Quality Benchmark (whatllm.org, Feb 2026)
#1 GLM-5 (Reasoning)
49.64
#2 Kimi K2.5 (Reasoning)
46.73
#3 MiniMax M2.5
41.97
#5 DeepSeek V3.2
41.2
DeepSeek R1 = most liked model in HF history
Shroud Subscription Plans & Unit Economics
Pricing tiers, per-model cost structure, and margin analysis. Payment: Stripe (USD) + TON.
Subscription Plans
| Plan | $/mo | Included tokens | Overage $/1M | req/s | Seats |
| Free | $0 | 100K | — | 1 | 2 |
| Developer | $29 | 20M | $1.00 | 10 | 3 |
| Startup | $99 | 50M | $0.50 | 50 | 10 |
| Enterprise | $499 | 500M | $0.20 | 200 | 50 |
Cost per Model
| Model | GPU Config | CC | $/mo GPU | Throughput (tok/s) | Cost $/1M tokens | CU Weight | Effective price to customer | Gross Margin |
| Llama 3.1 8B | 1× H200 | H200 | $2,520 | 24,000 | $0.04 | 1× | $0.20–$1.00/1M | 80–96% |
| Llama 3.3 70B / Qwen3-32B | 1× H200 FP8 | H200 | $2,520 | 4,500 | $0.22 | 1× | $0.20–$1.00/1M | –10% to 78% |
| Qwen3-235B (MoE) | 2× B200 | B200 | $7,920 | 1,300 | $2.35 | 15× | $3.00–$15.00/1M | 22–84% |
| Llama 4 Maverick | 3× B200 | B200 | $11,880 | 1,200 | $3.82 | 50× | $10.00–$50.00/1M | 62–92% |
| DeepSeek R1 / V3.2 | 3× B200 | B200 | $11,880 | ~3,000 | $1.53 | 50× | $10.00–$50.00/1M | 85–97% |
| GLM-5 (744B MoE) | 8× B200 DGX | B200 | $31,680 | ~1,370 | $8.92 | Custom | Enterprise only | By contract |
| Kimi K2 (1T MoE) | 4× B200 | B200 | $15,840 | ~800 | $7.64 | Custom | Enterprise only | By contract |
CU Weight Tiers — How Billing Works
| Tier | Models | CU per token | Customer pays (overage range) | Our cost | Min. margin |
| Tier S ≤32B | Llama 8B, Qwen3-8B, Gemma 27B, Mistral 24B, GPT-OSS 20B/120B | 1× | $0.20–$1.00 / 1M | $0.04 | 75–96% |
| Tier S 70B | Llama 3.3 70B FP8, Qwen3-32B | 1× | $0.20–$1.00 / 1M | $0.22 | –10% to 78% |
| Tier M 235B MoE | Qwen3-235B | 15× | $3.00–$15.00 / 1M | $2.35 | 22–84% |
| Tier L 685B+ | DeepSeek R1/V3.2, Llama 4 Maverick | 50× | $10.00–$50.00 / 1M | $1.53–$3.82 | 62–97% |
| Tier XL 744B+ | GLM-5, Kimi K2, Kimi K2.5 | Enterprise only | Custom pricing | $7.64–$8.92 | — |
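Billing reduces to one multiplication: the plan's base overage rate times the model's CU weight. A sketch using the tier weights above (the tier keys are illustrative names):

```python
# Overage charge in USD: million tokens x plan base rate x model CU weight.
CU_WEIGHT = {"tier_s": 1, "tier_m": 15, "tier_l": 50}

def overage_charge(tokens_m: float, base_rate_per_m: float, tier: str) -> float:
    return tokens_m * base_rate_per_m * CU_WEIGHT[tier]

# 10M overage tokens on the Startup plan ($0.50/1M base rate):
print(overage_charge(10, 0.50, "tier_s"))  # 5.0  -> $0.50/1M effective
print(overage_charge(10, 0.50, "tier_m"))  # 75.0 -> $7.50/1M effective
```

This is why the customer-facing ranges scale linearly with the weights: Tier M at 15× maps the $0.20–$1.00 base band onto $3.00–$15.00.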
Launch Summary
Key differentiators, payment infrastructure, and launch targets.
Payment Methods
TON (crypto): Priority
Stripe (card): Priority
X42 payments: Planned
MPP payments: Planned
MoonPay: Planned
Key Differentiators vs Venice
✓ True multi-GPU CC — B200 encrypted NVLink (Venice = Hopper plaintext)
✓ Any open-source model — not limited to a fixed list
✓ Models Venice doesn't have (DeepSeek R1, Llama 3/4, Qwen3-235B)
✓ TON + Stripe payments, OpenAI-compatible API
Minimum Viable Launch
Models: 30 (7× H200 + 25× B200)
Monthly GPU cost: ~$115K
Breakeven tokens/mo: ~190M
Breakeven revenue: ~$85K
Target launch: Q2 2026