Real-time data looks simple in a demo. You open a dashboard, a number ticks up, the room nods. Ship it.
Then you deploy. Users in three time zones hit the same endpoint. A cache serves stale odds while the API returns fresh ones. Two concurrent game updates land out of order and a leaderboard shows negative scores for six seconds before self-correcting. Nobody claps at that demo.
Over the past two years, we built multiple platforms that depend on live data: BookIt Sports (live scores and leaderboards across five leagues), Footy Access (live match scores for youth soccer), NXTGEN Media (real-time athlete submissions and scout feeds), and WSV (live portfolio updates). Each one taught us something the previous one didn't. The lessons compound, and they apply well beyond sports or media -- any system that moves data from a source to a screen in under a few seconds will hit the same walls.
Polling works until it doesn't
Polling is the default starting point, and for good reason. Hit an endpoint every N seconds, update the UI, move on. No persistent connections, no infrastructure changes, no new failure modes. For Footy Access, polling on a 10-second interval was the right call. Match updates trickle in. The data source itself only refreshes every 30 to 60 seconds. Polling faster than your source can produce data is just paying for the same answer twice.
The trap is the 5-second refresh. When stakeholders see a dashboard and say "can we make it faster?", the instinct is to drop the polling interval. Go from 10 seconds to 5, then 3, then 1. Each halving doubles your API load and gets you diminishing returns. At a 1-second interval with 500 concurrent users, you are making 500 requests per second to an endpoint that probably returns the same payload 90% of the time. Your API budget climbs, your rate limits start biting, and you still don't have true real-time -- you have fast polling with all the costs and none of the guarantees.
The rule we settled on: if your data source updates less than once every 10 seconds, poll it. If it updates more frequently, or if latency under 2 seconds matters to the user experience, polling is the wrong tool.
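A polling loop along those lines can be sketched in a few lines. This is a minimal illustration, not code from any of the platforms above; the fetcher, callback, and default interval are placeholders.

```typescript
// Minimal polling loop: fetch on a fixed interval, hand results to the
// UI, and return a cancel function. Interval defaults to 10 seconds --
// match the source's refresh rate, not the dashboard's wishes.
type Fetcher<T> = () => Promise<T>;

function startPolling<T>(
  fetchLatest: Fetcher<T>,
  onData: (data: T) => void,
  intervalMs = 10_000,
): () => void {
  let stopped = false;

  async function tick(): Promise<void> {
    if (stopped) return;
    try {
      onData(await fetchLatest());
    } catch {
      // Transient failure: skip this tick and try again next interval.
    }
    if (!stopped) setTimeout(tick, intervalMs);
  }

  void tick(); // fetch immediately, then on the interval
  return () => { stopped = true; };
}
```

Scheduling the next tick only after the current fetch completes (rather than a bare setInterval) also prevents overlapping requests when the endpoint is slow.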
WebSockets vs SSE: pick the right tool
Once polling breaks down, you have two real options: WebSockets and Server-Sent Events. They solve different problems, and picking the wrong one costs you complexity you don't need.
WebSockets give you a persistent, bidirectional connection. The server pushes data to the client, and the client can push back. That bidirectionality matters when the client needs to send structured messages -- chat applications, collaborative editing, multiplayer game state. For BookIt Sports, WebSockets made sense for the survivor pool feature where users submit picks and receive immediate confirmation alongside live game state updates. The connection carries data in both directions.
Server-Sent Events are simpler. The server pushes to the client over a standard HTTP connection. No upgrade handshake, no custom protocol, no reconnection logic you have to write yourself -- the browser handles reconnection natively with the EventSource API. For Footy Access, SSE was the right call. The live scores feed is read-only from the client's perspective. Match updates flow one way. SSE gave us that with less infrastructure, better compatibility behind corporate proxies, and automatic reconnection out of the box.
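The SSE wire format is plain text: optional "event:" and "id:" fields, one or more "data:" lines, and a blank line terminating the frame. A tiny encoder (the helper name is ours) makes that concrete:

```typescript
// Encode one Server-Sent Events frame. Field names ("event", "id",
// "data") come from the SSE spec; multi-line payloads need one "data:"
// line per line of text, and a blank line ends the frame.
function sseFrame(data: string, event?: string, id?: string): string {
  const lines: string[] = [];
  if (event) lines.push(`event: ${event}`);
  if (id) lines.push(`id: ${id}`);
  for (const line of data.split("\n")) lines.push(`data: ${line}`);
  return lines.join("\n") + "\n\n";
}
```

A score update would go out as something like sseFrame(JSON.stringify(update), "score"), and the browser's EventSource dispatches it to the matching addEventListener("score", ...) handler.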
The practical decision framework
Use SSE when the client only needs to receive updates. Use WebSockets when the client needs to send structured data back through the same connection. If you are unsure, start with SSE. You can always upgrade to WebSockets later, but you cannot easily downgrade without rearchitecting. We have never regretted starting simple.
One caveat: Next.js API routes handle SSE cleanly with streaming responses. WebSocket support requires a separate server process or an edge runtime that supports long-lived connections. That infrastructure difference is real and should factor into your decision early.
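A streaming SSE response in a Next.js App Router handler looks roughly like this. The route path, event payload, and single-frame body are illustrative; a real feed would keep the stream open and enqueue frames as updates arrive.

```typescript
// Sketch of an SSE endpoint; in app/api/scores/route.ts this function
// would be `export function GET`. Next.js streams the ReadableStream
// body and the browser consumes it with EventSource.
function GET(): Response {
  const encoder = new TextEncoder();
  const stream = new ReadableStream({
    start(controller) {
      // A real feed would subscribe to match updates here; this sketch
      // emits one frame and closes so it stays self-contained.
      controller.enqueue(encoder.encode('data: {"status":"connected"}\n\n'));
      controller.close();
    },
  });
  return new Response(stream, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
      Connection: "keep-alive",
    },
  });
}
```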
Caching is the actual architecture
The transport layer -- polling, SSE, WebSockets -- gets all the attention. Caching is where the architecture actually lives.
Every real-time system has the same tension: you want fresh data on the screen, but you cannot afford to hit your data source on every request. The gap between "fresh enough" and "actually fresh" is where caching strategy earns its keep.
Stale-while-revalidate
This pattern saved us repeatedly. Serve the cached response immediately, then revalidate in the background. The user sees data instantly. If it is a few seconds stale, the next render picks up the fresh version. On the client side, SWR and React Query both implement this natively. On the server side, we set Cache-Control: s-maxage=5, stale-while-revalidate=30 on our Next.js API routes. The CDN serves the cached response for 5 seconds, then serves stale data for up to 30 seconds while fetching a fresh copy. Users never see a loading spinner. The data is at most 35 seconds old in the worst case, and usually under 5.
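The header described above is mechanical enough to centralize in a helper (the helper name is ours), so the two knobs stay visible at every call site:

```typescript
// Build the Cache-Control value: fresh at the CDN for `maxAge` seconds,
// then served stale for up to `staleFor` seconds while a background
// revalidation fetches a new copy.
function swrCacheControl(maxAge: number, staleFor: number): string {
  return `s-maxage=${maxAge}, stale-while-revalidate=${staleFor}`;
}
// In an API route: res.setHeader("Cache-Control", swrCacheControl(5, 30));
```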
TTL strategies
Not all data has the same shelf life. On BookIt, live scores get a 3-second TTL. Leaderboard rankings get 15 seconds. Historical stats get 5 minutes. Season standings get an hour. A uniform TTL either wastes resources refetching data that rarely changes or serves outdated scores during a live game. We define TTL at the data-type level, not the route level.
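Defining TTLs at the data-type level amounts to a lookup table that the cache consults on every write. The numbers below mirror the BookIt values in the text; the in-memory cache itself is an illustrative sketch.

```typescript
// TTLs keyed by data type, not by route.
const TTL_SECONDS: Record<string, number> = {
  liveScore: 3,
  leaderboard: 15,
  historicalStats: 300,
  seasonStandings: 3600,
};

interface Entry { value: unknown; expiresAt: number; }

class TypedTtlCache {
  private store = new Map<string, Entry>();

  set(type: string, key: string, value: unknown, now = Date.now()): void {
    const ttl = TTL_SECONDS[type] ?? 60; // conservative default for unknown types
    this.store.set(`${type}:${key}`, { value, expiresAt: now + ttl * 1000 });
  }

  get(type: string, key: string, now = Date.now()): unknown {
    const entry = this.store.get(`${type}:${key}`);
    if (!entry || entry.expiresAt <= now) return undefined; // miss or expired
    return entry.value;
  }
}
```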
Cache invalidation
The hard part. Always the hard part. For NXTGEN, new athlete submissions and scout reviews land on unpredictable schedules. We cannot predict when a new profile drops. Polling the database every second looking for changes works but scales poorly. The pattern that held up: Redis pub/sub for cache invalidation signals. When a submission is approved or a review is published, the write path publishes an invalidation event. Subscribers -- the API layer, the SSE broadcaster -- evict their cached copy and fetch fresh. The write path owns the invalidation. The read path never guesses.
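The shape of that pattern fits in a few lines. Here an in-process EventEmitter stands in for Redis pub/sub so the sketch is self-contained; with a Redis client you would publish and subscribe on a channel instead. The channel name, cache keys, and function names are illustrative.

```typescript
import { EventEmitter } from "node:events";

// EventEmitter as a stand-in for Redis pub/sub in this sketch.
const bus = new EventEmitter();
const CHANNEL = "cache:invalidate";

const appCache = new Map<string, unknown>();

// Read path: subscribe once, evict on signal, never guess.
bus.on(CHANNEL, (key: string) => {
  appCache.delete(key);
});

// Write path: persist first, then publish the invalidation event.
function approveSubmission(id: string, _record: unknown): void {
  // ...write the record to the database here...
  bus.emit(CHANNEL, `submission:${id}`);
}
```

The key property is ownership: only the write path ever emits on the channel, so an evicted entry always means the underlying data actually changed.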
This is the part most teams skip. They build a beautiful real-time transport layer and then serve stale data through it because nothing told the cache to let go.
What breaks at scale
The problems that show up in production are rarely about transport or even caching. They are about ordering, concurrency, and state reconciliation.
Out-of-order updates
Network latency is not constant. Update A leaves the server before Update B, but Update B arrives at the client first. If your client blindly applies every incoming update, the user sees the score jump forward, then backward, then forward again. On BookIt, during concurrent NFL and NBA games, we saw this happen within the first week of production. The fix is monotonic versioning. Every update carries a sequence number or timestamp. The client only applies an update if its version is higher than the last applied version. Stale arrivals get dropped silently.
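The client-side check is small. This is a sketch of the technique, not BookIt's actual code; the message shape is illustrative.

```typescript
// Monotonic versioning: apply an update only if its version is newer
// than the last one applied; drop stale arrivals silently.
interface VersionedUpdate<T> { version: number; payload: T; }

function makeApplier<T>(onApply: (payload: T) => void) {
  let lastVersion = -Infinity;
  return (update: VersionedUpdate<T>): boolean => {
    if (update.version <= lastVersion) return false; // stale arrival: drop
    lastVersion = update.version;
    onApply(update.payload);
    return true;
  };
}
```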
Concurrent writes
Survivor pool picks on BookIt hit a classic concurrency problem. Two users submit picks at the same time for the same game slot. Without proper handling, both writes succeed and the data is inconsistent. Optimistic locking solved it: read the current version, submit the write with that version attached, reject the write if the version has changed since the read. The user whose write is rejected gets a clear error and retries against the fresh state. Database-level constraints back this up as a safety net.
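The read-then-conditional-write cycle looks like this in miniature. The in-memory store is a stand-in; in production the same check runs as a conditional UPDATE ... WHERE version = expected, with a constraint as the backstop.

```typescript
// Optimistic locking sketch: a write carries the version it read, and
// the store rejects it if the row has moved on since that read.
interface Row<T> { version: number; value: T; }

class VersionedStore<T> {
  private rows = new Map<string, Row<T>>();

  read(key: string): Row<T> | undefined {
    return this.rows.get(key);
  }

  write(key: string, expectedVersion: number, value: T): boolean {
    const currentVersion = this.rows.get(key)?.version ?? 0;
    if (currentVersion !== expectedVersion) return false; // conflict: caller re-reads and retries
    this.rows.set(key, { version: currentVersion + 1, value });
    return true;
  }
}
```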
State reconciliation
Long-lived connections drift. A WebSocket connection that has been open for three hours has received thousands of incremental updates. If any single update was dropped -- network blip, server restart, client backgrounded on mobile -- the client state is wrong and every subsequent update builds on that wrong state. The fix is periodic full-state snapshots. Every 60 seconds, the server sends a complete state object, not a delta. The client replaces its local state entirely. This caps the blast radius of any dropped update to 60 seconds of drift. It costs more bandwidth, but the reliability tradeoff is not close.
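The delta-versus-snapshot handling reduces to one branch on the client. Message shapes here are illustrative; the essential move is that a snapshot replaces local state wholesale rather than merging into it.

```typescript
// Client-side reconciliation: deltas mutate local state incrementally;
// a snapshot replaces it entirely, capping drift from any dropped delta.
type Scores = Record<string, number>;

type Message =
  | { kind: "delta"; team: string; points: number }
  | { kind: "snapshot"; state: Scores };

function applyMessage(state: Scores, msg: Message): Scores {
  if (msg.kind === "snapshot") {
    return { ...msg.state }; // replace, never merge
  }
  return { ...state, [msg.team]: (state[msg.team] ?? 0) + msg.points };
}
```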
On BookIt, we learned this the hard way during a Sunday slate with five concurrent NFL games. A server deploy cycled the WebSocket connections. Clients reconnected and resumed receiving deltas, but they had missed updates during the 2-second reconnection window. Leaderboards showed stale scores until the next full snapshot landed. After that, snapshots became non-negotiable.
The patterns that survived
After four builds, here is what we standardized on:
Transport selection. SSE for read-only feeds. WebSockets only when bidirectional communication is a real requirement. Polling for data sources that update infrequently.
Caching. Stale-while-revalidate everywhere. Per-data-type TTLs. Redis pub/sub for invalidation signaling on the write path.
Client state. React Query or SWR for cache management on the client. Monotonic versioning on every update. Drop stale arrivals, never apply them.
Reliability. Periodic full-state snapshots on long-lived connections. Optimistic locking on concurrent writes. Database constraints as the last line of defense.
Monitoring. Track cache hit rates, update latency (source to screen), and dropped update counts. If you cannot measure freshness, you cannot guarantee it.
None of these are novel. That is the point. Real-time systems do not fail because teams lack clever ideas. They fail because teams skip the boring infrastructure -- versioning, invalidation, reconciliation -- and focus on the transport layer that makes the demo look good.
Build the plumbing first. The demo will take care of itself.