
Scaling Real-Time Voice Chat from Small Rooms to Thousands of Listeners

VoxaStream Team · May 4, 2026 · 7 min read

Scaling a voice room from 10 to 10,000 concurrent listeners is not just a matter of adding more servers. The architecture changes fundamentally at different scales, and the mistakes developers make when they first try to grow often result in audio quality degradation, latency spikes, or complete connection failures under load.

This post covers what actually changes when you scale real-time voice, what the common failure points are, and how VoxaStream's architecture handles this so you do not have to.

The naive approach and why it breaks

The simplest voice room architecture is peer-to-peer (P2P): each participant sends their audio directly to every other participant. This works fine for 2–4 users. At 8 users, each person is sending 7 separate audio streams, so per-user upload grows linearly with room size while the total number of streams in the room grows quadratically (56 at 8 users). On mobile, battery drain becomes severe.

Beyond 10 participants, P2P is not viable. You need a server-side media router.
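
The growth is easy to check with a few lines of arithmetic. This is a minimal sketch, assuming a full mesh in which every participant speaks and an illustrative 32 kbps per audio stream (the bitrate and function names are assumptions, not VoxaStream figures):

```python
def mesh_upload_streams(participants: int) -> int:
    """Streams each participant uploads in a full P2P mesh."""
    return max(participants - 1, 0)


def mesh_total_streams(participants: int) -> int:
    """Total one-way streams across the whole mesh."""
    return participants * mesh_upload_streams(participants)


def mesh_upload_kbps(participants: int, stream_kbps: int = 32) -> int:
    """Per-participant upload bandwidth; stream_kbps is an assumed bitrate."""
    return mesh_upload_streams(participants) * stream_kbps


for n in (2, 4, 8, 10):
    print(n, mesh_upload_streams(n), mesh_total_streams(n), mesh_upload_kbps(n))
```

Per-user upload rises by one stream for every participant added, while the total stream count grows with the square of the room size.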

What a Selective Forwarding Unit (SFU) does

An SFU is a server that receives audio from each speaker and forwards it selectively to participants. Instead of every speaker sending to every listener, each speaker sends one stream to the SFU, and the SFU handles distribution.

This changes the math dramatically:

Without SFU (P2P, 10 speakers): Each speaker uploads 9 streams. Total upload load per user = 9x their single stream bandwidth.
With SFU (10 speakers, 1,000 listeners): Each speaker uploads 1 stream. The SFU handles the 1,000 listener distributions. Speaker upload cost stays constant regardless of listener count.
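
The same comparison can be written as a small calculation. This sketch assumes the SFU forwards every speaker's stream to every other participant; real SFUs often forward only the currently active speakers, so treat the server-side number as an upper bound. The function names are illustrative.

```python
def speaker_uploads_p2p(speakers: int) -> int:
    """Streams each speaker uploads when speakers connect directly to each other."""
    return max(speakers - 1, 0)


def speaker_uploads_sfu() -> int:
    """With an SFU, each speaker uploads one stream regardless of room size."""
    return 1


def sfu_forwarded_streams(speakers: int, listeners: int) -> int:
    """Upper bound on outgoing SFU streams: every speaker to every other participant."""
    others = max(speakers + listeners - 1, 0)
    return speakers * others


print(speaker_uploads_p2p(10))          # uploads per speaker without an SFU
print(speaker_uploads_sfu())            # uploads per speaker with an SFU
print(sfu_forwarded_streams(10, 1000))  # server-side forwarding work
```

The forwarding work does not disappear; it moves from the speakers' uplinks to the server, where it can be scaled independently.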

VoxaStream uses an SFU architecture. Your app's upload cost per speaker does not grow as you add listeners — only the server-side distribution cost grows, and that is handled automatically.

Latency at scale

Latency in voice rooms has two main components:

Encoding latency — Time to encode audio on the sender's device. Typically 20–60ms with modern codecs (Opus at 20ms frame size).
Network latency — Time for audio packets to travel from sender → SFU → receiver. Depends on geographic distance and packet loss.

For listener-only participants (who are not speaking back), end-to-end latency of 200–500ms is perfectly acceptable, because the listener is not interacting, only hearing a broadcast. For active speakers talking to each other, you want under 150ms round-trip. VoxaStream keeps speaker-to-speaker latency under that 150ms target in normal conditions.
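
A rough latency budget can be checked in code. This is a simplified sketch: it sums only the two components named above, ignores jitter buffering and decoding, and assumes the return path costs the same as the forward path. The class name and example values are hypothetical.

```python
from dataclasses import dataclass

SPEAKER_ROUND_TRIP_BUDGET_MS = 150  # speaker target from the text
LISTENER_ONE_WAY_BUDGET_MS = 500    # upper end of the listener range from the text


@dataclass
class PathEstimate:
    encoding_ms: float         # time to encode one frame on the sender
    sender_to_sfu_ms: float    # network time, sender to SFU
    sfu_to_receiver_ms: float  # network time, SFU to receiver

    @property
    def one_way_ms(self) -> float:
        return self.encoding_ms + self.sender_to_sfu_ms + self.sfu_to_receiver_ms

    @property
    def round_trip_ms(self) -> float:
        # Assumes the return path costs the same as the forward path.
        return 2 * self.one_way_ms


def ok_for_speaker(path: PathEstimate) -> bool:
    return path.round_trip_ms <= SPEAKER_ROUND_TRIP_BUDGET_MS


def ok_for_listener(path: PathEstimate) -> bool:
    return path.one_way_ms <= LISTENER_ONE_WAY_BUDGET_MS


nearby = PathEstimate(encoding_ms=20, sender_to_sfu_ms=25, sfu_to_receiver_ms=25)
distant = PathEstimate(encoding_ms=20, sender_to_sfu_ms=90, sfu_to_receiver_ms=90)
print(ok_for_speaker(nearby), ok_for_listener(nearby))    # True True
print(ok_for_speaker(distant), ok_for_listener(distant))  # False True
```

The distant path is fine for a broadcast listener but too slow for a conversation, which is why speakers and listeners can be held to different targets.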

What breaks at 1,000+ listeners

At high listener counts, three things typically go wrong with DIY implementations:

WebSocket fan-out. If your signaling server tries to send a message to 1,000 connections simultaneously (e.g., a presence update), you get significant CPU spikes and potential message drops. The fix is to batch signaling messages and to prioritise audio over signaling.
TURN server saturation. Listeners behind strict NATs need a TURN relay. A single TURN server becomes a bottleneck quickly. Production deployments need TURN clusters with geographic distribution.
State synchronisation. Keeping every listener's UI updated with who is speaking, who is muted, and who just joined becomes expensive if you broadcast a full state update on every change. You need delta updates.
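
To make the delta-update idea concrete, here is a minimal sketch that diffs two room-state snapshots and sends only what changed. The state shape (a dict keyed by user ID) and the function names are assumptions for illustration, not VoxaStream's wire format.

```python
from typing import Any, Dict

_MISSING = object()  # sentinel so a stored None still counts as a real value


def state_delta(previous: Dict[str, Any], current: Dict[str, Any]) -> Dict[str, Any]:
    """Return only the entries that changed or disappeared between snapshots."""
    changed = {
        key: value
        for key, value in current.items()
        if previous.get(key, _MISSING) != value
    }
    removed = sorted(key for key in previous if key not in current)
    return {"changed": changed, "removed": removed}


def apply_delta(state: Dict[str, Any], delta: Dict[str, Any]) -> Dict[str, Any]:
    """Rebuild the new snapshot on the client from the old one plus a delta."""
    updated = {**state, **delta["changed"]}
    for key in delta["removed"]:
        updated.pop(key, None)
    return updated


before = {
    "alice": {"muted": False, "speaking": True},
    "bob": {"muted": True, "speaking": False},
    "carol": {"muted": False, "speaking": False},
}
after = {
    "alice": {"muted": False, "speaking": False},  # stopped speaking
    "bob": {"muted": True, "speaking": False},     # unchanged
    "dave": {"muted": False, "speaking": False},   # joined; carol left
}

delta = state_delta(before, after)
print(delta)
```

Only alice, dave, and the removal of carol travel over the wire; bob's unchanged entry is never resent.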

VoxaStream handles all three of these internally. The presence message is a lightweight delta update. TURN infrastructure is managed by the platform. Fan-out is handled server-side.
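
If you are building the signaling layer yourself, the batching fix mentioned above might look like the following sketch. It coalesces presence changes so each flush sends one message per room instead of one per event. The class is hypothetical and is not a description of VoxaStream's internals; a real server would call flush() from a timer, for example every few hundred milliseconds.

```python
from typing import Dict


class PresenceBatcher:
    """Collects presence changes and emits them as one batch per flush."""

    def __init__(self) -> None:
        self._pending: Dict[str, dict] = {}

    def record(self, user_id: str, change: dict) -> None:
        # Later changes for the same user are merged over earlier ones.
        self._pending.setdefault(user_id, {}).update(change)

    def flush(self) -> Dict[str, dict]:
        # Hand back everything collected so far and start a fresh batch.
        batch, self._pending = self._pending, {}
        return batch


batcher = PresenceBatcher()
batcher.record("alice", {"muted": True})
batcher.record("bob", {"speaking": True})
batcher.record("alice", {"muted": False})  # overrides the earlier change

batch = batcher.flush()
print(batch)
```

Three events collapse into one message covering two users, so the fan-out cost is paid once per flush interval rather than once per event.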

Horizontal scaling

VoxaStream's signaling layer is stateless — session state is stored in the database, not in memory on the signaling server. This means you can run multiple instances behind a load balancer without sticky sessions.
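
The following sketch illustrates why a stateless layer removes the need for sticky sessions. It uses an in-memory SQLite database as a stand-in for the shared Postgres database, and the table name, columns, and class are hypothetical rather than VoxaStream's actual schema.

```python
import sqlite3
from typing import Optional


class SignalingInstance:
    """A signaling process that keeps no session state of its own."""

    def __init__(self, name: str, db: sqlite3.Connection) -> None:
        self.name = name
        self.db = db

    def join(self, session_id: str, room: str) -> None:
        # State goes straight to the shared database, not to instance memory.
        self.db.execute(
            "INSERT OR REPLACE INTO sessions (session_id, room) VALUES (?, ?)",
            (session_id, room),
        )
        self.db.commit()

    def room_of(self, session_id: str) -> Optional[str]:
        row = self.db.execute(
            "SELECT room FROM sessions WHERE session_id = ?", (session_id,)
        ).fetchone()
        return row[0] if row else None


shared_db = sqlite3.connect(":memory:")
shared_db.execute("CREATE TABLE sessions (session_id TEXT PRIMARY KEY, room TEXT)")

instance_a = SignalingInstance("a", shared_db)
instance_b = SignalingInstance("b", shared_db)

instance_a.join("s1", "lobby")     # first request lands on instance a
print(instance_b.room_of("s1"))    # a later request lands on instance b
```

Because instance b can answer for a session that instance a created, the load balancer is free to route each request to any instance.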

If you are self-hosting VoxaStream, you can scale horizontally by:

Running 2+ instances of the VoxaStream binary behind nginx or a cloud load balancer.
Pointing all instances at the same Postgres database.
Using a shared Redis instance for pub/sub if running in clustered mode.

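# nginx upstream for multiple VoxaStream instances
upstream voxastream {
  server 10.0.0.1:8090;
  server 10.0.0.2:8090;
  server 10.0.0.3:8090;
}

server {
  # Example port; terminate TLS here in production
  listen 80;

  location /ws {
    proxy_pass http://voxastream;
    # The WebSocket upgrade needs HTTP/1.1 and these two headers
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
    # Example value; the 60s default closes idle WebSocket connections
    proxy_read_timeout 300s;
  }
}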

Practical capacity planning

Rough numbers for a single VoxaStream instance on a 4-core / 8GB VPS:

Scenario	Speakers	Listeners	Est. CPU usage
Small rooms	~200 concurrent	0	~30%
Mixed	32	~2,000	~50%
Broadcast	4	~8,000	~70%

For anything beyond 10,000 concurrent listeners, use the managed VoxaStream cloud or run a horizontally scaled self-hosted cluster.
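
For a first estimate of cluster size, the table can be turned into a small calculation. This sketch assumes capacity scales linearly with instance count and treats the table's listener figures as per-instance limits, both of which are simplifications; the function name and headroom parameter are illustrative.

```python
import math

# Listener counts per instance, taken from the table above (4-core / 8GB VPS).
LISTENERS_PER_INSTANCE = {"mixed": 2000, "broadcast": 8000}


def instances_needed(listeners: int, scenario: str = "broadcast",
                     headroom: float = 0.0) -> int:
    """Estimate instance count; headroom is the fraction of capacity kept free."""
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    usable = LISTENERS_PER_INSTANCE[scenario] * (1.0 - headroom)
    return max(1, math.ceil(listeners / usable))


print(instances_needed(8000))
print(instances_needed(20000))
print(instances_needed(20000, headroom=0.25))
print(instances_needed(5000, scenario="mixed"))
```

Treat the result as a starting point for load testing rather than a guarantee, since real rooms rarely match the table's scenarios exactly.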

The managed option

If you would rather not think about any of this — TURN clusters, SFU scaling, Postgres replication — the VoxaStream managed API handles it for you. You pay per minute of usage and the infrastructure scales automatically.

Start with the free tier — 1,000 minutes per month, no credit card. When you are ready to scale, the same API and SDK work at any volume.
