As we work with more of the world’s largest B2B and B2C companies, the questions we get about scale have become sharper and more specific.
Customers want to know how the platform behaves during their highest-volume moments: Black Friday sales, major sporting events, their own production incidents. They want confidence that their growth will not outgrow the systems they depend on.
We welcome those questions. They are the right questions to ask of any critical component of their business.
Today, our systems handle serious scale. At daily peak, we see over 150,000 customer requests per second coming into the platform, with more than 70,000 asynchronous requests per second flowing through the background systems. On our busiest days of the week, we handle over five million conversations, and more than 100 million comments are added across the platform.
We also design for individual customer spikes, not just aggregate platform traffic. We can handle a single customer workspace spiking to hundreds of comments per second, or around 100 new conversations per second. Sustained over a full day, that rate would amount to roughly 8.6 million conversations from a single customer.
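The daily figure follows directly from the per-second rate. This assumes the rate is held flat for all 86,400 seconds of a day, which real traffic rarely does, so read it as an upper-bound illustration rather than an observed number:

```latex
100\ \frac{\text{conversations}}{\text{s}} \times 86{,}400\ \frac{\text{s}}{\text{day}} = 8{,}640{,}000\ \frac{\text{conversations}}{\text{day}}
```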
While those numbers are important, they age quickly. Every growing software company can publish a bigger number every year, month, or week. What matters more is whether the architecture has clear scaling levers, whether we understand the pressure points in the system, and whether we can add capacity before customers need it.
Every system has limits. Competence is knowing where they are, measuring them, and moving them before customers reach them.
Here’s how we do this at Fin.
We build on boring foundations
At the edges, we try hard not to be clever.
We use AWS for the infrastructure primitives AWS is very good at running. We do not want our engineers spending their best energy recreating S3, load balancers, queues, or commodity infrastructure patterns. We want that energy spent on the parts of the system that are specific to our customers and our product.
That is a deliberate trade-off. It gives us fewer systems to understand, deeper expertise in the ones we do run, and more leverage when we need to scale.
This is an extension of a principle we have talked about for years: run less software. The point is not to have the smallest possible technology stack for its own sake. The point is to compound expertise. When many teams build on the same small set of technologies, our tooling, observability, and operational practice all improve together.
Boring technology choices are not a lack of ambition. They are how we reserve our ambition for more nuanced scaling challenges.
The source of truth is the hard part
For many companies, the database is where “we scale” claims go to be tested.
You can scale stateless web traffic by adding more machines. You can add queue consumers. You can add cache. Those are real problems, but they are not usually the hardest ones.
The source-of-truth database is different. It is where the most important data lives, where the hardest correctness guarantees exist, and where maintenance windows often come from. It has to be correct, fast, resilient to failover, capable of large migrations, and able to keep serving traffic while we improve it. As customers grow, it cannot require a full re-architecture every time the next ceiling appears.
That is why we moved to Vitess, managed by PlanetScale. We wrote extensively about the architectural decisions behind this migration.
The goals were clear: improve availability, reduce operational complexity, make large table migrations safer, simplify MySQL scaling, and eliminate customer downtime from routine database maintenance and failovers.
When we wrote our last public update, the largest part of the migration was still ahead of us. We completed that migration in 2025, and the benefits are now part of how we operate the platform day to day.
Today, our highest-scale source-of-truth data is spread across 128 shards. The database layer handles around two million requests per second, with more than ten million cache reads per second in front of it. For the largest customers, we can isolate and scale database capacity independently, including dedicating a shard to a single customer when needed.
We have not come close to needing that yet, and that headroom is the point. The goal of architecture like this is not to run every system at the edge of its capacity, but rather to have room to move before customers need it. Vitess gives us native sharding, query routing, online schema change capabilities, connection pooling, and resharding primitives built for this kind of workload. Instead of application code carrying all of the sharding complexity, the database layer can do more of the work. That reduces cognitive load for engineers and removes whole classes of operational risk.
Ultimately, this gives us practical scaling options instead of hard architectural rewrites, and lets us do routine database improvement without planned customer-impacting maintenance windows.
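In Vitess, placement is the database layer's job: a vindex maps each row to a keyspace ID, and each shard owns a contiguous range of keyspace IDs. The sketch below illustrates that idea for customer-keyed data, with an override for a customer pinned to a dedicated shard. The SHA-256 hash, the `ShardRouter` class, and the shard names are hypothetical choices for illustration; this is not Vitess's implementation or our production configuration.

```python
import hashlib
from dataclasses import dataclass, field

NUM_SHARDS = 128  # the shard count mentioned above
KEYSPACE_SIZE = 2**64  # keyspace IDs are unsigned 64-bit integers in this sketch


def keyspace_id(customer_id: int) -> int:
    """Hash a customer ID to a 64-bit keyspace ID (SHA-256 is an arbitrary choice here)."""
    digest = hashlib.sha256(str(customer_id).encode()).digest()
    return int.from_bytes(digest[:8], "big")


@dataclass
class ShardRouter:
    num_shards: int = NUM_SHARDS
    # Hypothetical: customers pinned to their own shard, e.g. {customer_id: "dedicated-42"}.
    dedicated: dict = field(default_factory=dict)

    def shard_for(self, customer_id: int) -> str:
        # A dedicated shard takes precedence over hash-based placement.
        if customer_id in self.dedicated:
            return self.dedicated[customer_id]
        # Each shard owns one contiguous range of the keyspace. A customer's
        # keyspace ID never changes, so resharding only changes range ownership.
        range_width = KEYSPACE_SIZE // self.num_shards
        index = min(keyspace_id(customer_id) // range_width, self.num_shards - 1)
        return f"shard-{index:03d}"
```

For example, `ShardRouter(dedicated={42: "dedicated-42"}).shard_for(42)` returns `"dedicated-42"`, while every other customer lands on one of the 128 range-owning shards. Assigning ranges, rather than computing `hash % num_shards`, is what lets a shard be split by halving its range: only the rows in that range move.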
Search is not a hidden bottleneck
Search is another place where scaling issues can hide, because it underpins core product surfaces across the platform, from vector search with Fin to our real-time reporting. If search is slow or unhealthy, customers feel it in the product.
We have written before about how we optimize Elasticsearch usage because scaling is not just adding more machines. Often, the better approach is making the product do less unnecessary work.
Today, our Elasticsearch clusters support a much higher-throughput product than they did in the past, with more than 650TB of storage, more than 1.7 trillion documents, and peaks above 40,000 requests per second. We are serving a larger product surface more efficiently, not just running a bigger cluster.
But the exact numbers are less important than the operating pattern: when an index gets too large, or traffic distribution becomes unhealthy, we do not want that to become a high-risk manual migration. We have invested in the ability to reshape Elasticsearch indexes online. That means partitioning by customer ID, dual-writing to old and new indexes, backfilling, validating, gradually moving customers with feature flags, and only deleting the old index when we are confident. We have used this pattern for years to make large search migrations safer and more incremental.
That pattern shows up across our infrastructure work: make large changes incremental, observable, reversible where possible, and safe to run while customers continue using the product.
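The migration steps described above (partition by customer ID, dual-write, backfill, validate, then move customers gradually behind a flag) can be sketched as a small state machine. This is a minimal illustration assuming in-memory dicts in place of real indexes and CRC32 bucketing in place of a real feature-flag service; `IndexMigration` and its methods are hypothetical, not an Elasticsearch API or our production code.

```python
import zlib


class IndexMigration:
    """Hypothetical online migration from one index to a customer-partitioned one.

    Both indexes are modeled as in-memory dicts keyed by (customer_id, doc_id);
    a real system would talk to a search cluster instead.
    """

    def __init__(self, old_index: dict, num_partitions: int):
        self.old = old_index
        self.new = [dict() for _ in range(num_partitions)]
        self.rollout_percent = 0  # flag: share of customers reading the new index

    def _partition(self, customer_id: int) -> dict:
        # The new index is partitioned by customer ID.
        return self.new[customer_id % len(self.new)]

    def write(self, customer_id: int, doc_id: str, doc: dict) -> None:
        # Dual-write: new and updated documents go to both indexes.
        self.old[(customer_id, doc_id)] = doc
        self._partition(customer_id)[(customer_id, doc_id)] = doc

    def backfill(self) -> None:
        # Copy documents that predate dual-writing; never overwrite newer writes.
        for (customer_id, doc_id), doc in list(self.old.items()):
            self._partition(customer_id).setdefault((customer_id, doc_id), doc)

    def validate(self) -> bool:
        # The union of the new partitions must match the old index exactly.
        merged = {}
        for part in self.new:
            merged.update(part)
        return merged == self.old

    def uses_new_index(self, customer_id: int) -> bool:
        # Stable per-customer bucket, so raising the percentage only adds customers.
        return zlib.crc32(str(customer_id).encode()) % 100 < self.rollout_percent

    def read(self, customer_id: int, doc_id: str):
        if self.uses_new_index(customer_id):
            return self._partition(customer_id).get((customer_id, doc_id))
        return self.old.get((customer_id, doc_id))
```

Because dual-writes continue throughout, setting `rollout_percent` back to 0 is an instant rollback. Only once validation passes and every customer has been reading from the new index for a while would the old index be deleted.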
A large customer spike should mostly be their spike
Multi-tenant systems need fairness. A single customer having a high-volume moment should not quietly become everyone else’s latency problem. This is one of the core risks enterprise customers are right to ask about. If you share infrastructure with other large customers, what happens when one of them has a spike?
We design for this at multiple layers.
For asynchronous work, we use overflow queues and queueing strategies that help prevent one high-volume workload from consuming shared capacity in a way that hurts quieter tenants. AWS SQS fair queues are one example of a primitive we use extensively. They are designed for exactly this class of problem. When one tenant creates a backlog in a shared queue, fair queues help reduce the dwell-time impact on other tenants.
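SQS handles fairness internally, keyed on a tenant identifier that the producer attaches to each message. To show the underlying idea, here is a toy per-tenant round-robin queue: a consumer takes one message from each tenant with a backlog in turn, so a quiet tenant's message is not stuck behind thousands from a noisy one. `FairQueue` is a hypothetical illustration of the concept, not how Amazon SQS implements fair queues.

```python
from collections import OrderedDict, deque


class FairQueue:
    """Toy per-tenant fair queue: consumers take one message per tenant in turn."""

    def __init__(self):
        # tenant_id -> deque of pending messages; key order defines the rotation.
        self._backlogs = OrderedDict()

    def send(self, tenant_id: str, message: str) -> None:
        self._backlogs.setdefault(tenant_id, deque()).append(message)

    def receive(self):
        """Return (tenant_id, message) from the next tenant in rotation, or None."""
        if not self._backlogs:
            return None
        tenant_id, backlog = next(iter(self._backlogs.items()))
        message = backlog.popleft()
        if backlog:
            # Move this tenant to the back so other tenants get a turn first.
            self._backlogs.move_to_end(tenant_id)
        else:
            del self._backlogs[tenant_id]
        return tenant_id, message
```

If a noisy tenant enqueues 1,000 messages and a quiet tenant then enqueues one, the quiet tenant's message is delivered second rather than 1,001st. Plain round-robin is the simplest fairness policy; weighted variants such as deficit round-robin can give some tenants a larger share.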
We also build our own application-level guardrails where the product requires them. In a large multi-tenant Rails application, customer isolation cannot depend on every engineer remembering every rule in every code path. The safe path has to be built into the system.
One of our Principal Engineers, Miles McGuire, talked publicly about one example of this work in “Guardrails: Keeping customer data separate in a multi-tenant system”.
The focus is primarily on correctness and customer data separation, but the work also reflects a broader operating principle: important customer boundaries should be enforced by infrastructure and application frameworks, not left to individual code paths.
The same thinking applies to scale. We want customer-specific load to be visible, attributable, and controlled. When a customer spike happens, we should be able to understand it as that customer’s workload, protect the rest of the platform, and add capacity where it is actually needed.
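As an illustration of a boundary enforced by the framework rather than by each caller, the sketch below binds the current workspace for a request or job and makes the data access layer fail closed when a query runs without one. Our application is Rails; this is a language-neutral sketch in Python, and `workspace_scope`, `ConversationStore`, and `MissingTenantError` are hypothetical names, not taken from our codebase or the talk.

```python
import contextvars
from contextlib import contextmanager

# Hypothetical: the workspace the current request or background job acts on.
_current_workspace = contextvars.ContextVar("current_workspace", default=None)


class MissingTenantError(RuntimeError):
    pass


@contextmanager
def workspace_scope(workspace_id: int):
    """Bind a workspace for the duration of a request or background job."""
    token = _current_workspace.set(workspace_id)
    try:
        yield
    finally:
        _current_workspace.reset(token)


class ConversationStore:
    """Toy data access layer that refuses to run unscoped queries."""

    def __init__(self, rows):
        self._rows = rows  # list of dicts, each with a "workspace_id" key

    def all(self):
        workspace_id = _current_workspace.get()
        if workspace_id is None:
            # Fail closed: a query with no tenant is a bug, not a full-table read.
            raise MissingTenantError("query issued outside a workspace scope")
        return [r for r in self._rows if r["workspace_id"] == workspace_id]
```

The same scope that filters reads also gives a natural place to attribute load: every query already knows which customer it belongs to.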
Fin adds a new dimension to scaling
Our AI Agent Fin introduces a new set of infrastructure challenges. To provide reliable AI-powered support at scale, we need to operate across multiple model providers, route across them based on capacity and latency, and protect customer-facing workloads from lower-priority work.
The details are different from traditional SaaS infrastructure, and we should be honest about that: AI providers are not commodity storage systems, and we do not design as if they were. But the principle is the same. We still need to understand the bottlenecks, build clear scaling levers, and monitor the customer outcome.
That is why we have invested in Fin-specific reliability systems.
Fin now fully resolves over two million conversations per week. At that scale, high availability cannot depend on a single model, a single provider, a single region, or a single pool of capacity. Our LLM routing layer supports cross-vendor failover, cross-model failover, latency-based routing, capacity isolation, and load testing. We also maintain buffer capacity with major providers, with headroom to handle 2x to 3x normal Fin traffic at any point.
This matters for enterprise customers because AI support volume can spike just like human support volume. If a customer’s own product has an incident, or a launch drives a sudden surge in questions, the AI layer needs to absorb that spike without depending on one fragile upstream path.
When customers depend on Fin to absorb a spike in support demand, the AI layer needs the same operational discipline as the rest of the platform.
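Cross-vendor and cross-model failover combined with latency-based routing can be sketched as a router that ranks healthy providers by smoothed latency and falls through to the next one on failure. This is a minimal sketch under stated assumptions: the provider names, the `LLMRouter` class, and the exponentially weighted moving average (EWMA) are hypothetical choices, and the real routing layer also handles capacity isolation and load testing, which are omitted here.

```python
from dataclasses import dataclass


@dataclass
class Provider:
    name: str  # hypothetical "vendor/model" label
    ewma_latency_ms: float  # smoothed recent latency
    healthy: bool = True


class ProviderError(Exception):
    pass


class LLMRouter:
    """Toy router: prefer the fastest healthy provider, fail over on errors."""

    def __init__(self, providers, alpha=0.2):
        self.providers = providers
        self.alpha = alpha  # weight of the newest latency sample in the EWMA

    def _record(self, provider, latency_ms):
        provider.ewma_latency_ms = (
            self.alpha * latency_ms + (1 - self.alpha) * provider.ewma_latency_ms
        )

    def complete(self, prompt, call):
        """`call(provider, prompt)` returns (text, latency_ms) or raises ProviderError."""
        candidates = sorted(
            (p for p in self.providers if p.healthy), key=lambda p: p.ewma_latency_ms
        )
        for provider in candidates:
            try:
                text, latency_ms = call(provider, prompt)
            except ProviderError:
                provider.healthy = False  # a real system would re-probe later
                continue
            self._record(provider, latency_ms)
            return provider.name, text
        raise ProviderError("all providers failed")
```

Capacity isolation could be layered on by giving customer-facing and lower-priority workloads separate router instances with separate provider quotas, so background work cannot consume the headroom reserved for live conversations.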
Real production traffic teaches us things tests do not
Performance tests are useful. But what happens in production is reality.
Real customers use products in ways a performance test will not perfectly predict. They have launches, incidents, seasonal patterns, gaming events, and sudden changes in end-user behavior. Those moments give us data that no synthetic test can fully reproduce.
Often, a large customer event barely moves the platform-wide graphs. Our customer base is broad enough that one industry can be at peak while another is in a quieter period. Black Friday and Cyber Monday are good examples. Many ecommerce customers are at their busiest, while many B2B SaaS customers are quieter. At the aggregate platform level, the change can be much less dramatic than people expect.
That does not mean those events are unimportant. It means we need to look at both levels: the health of the overall platform and the experience of the individual customer having the spike.
Sometimes those events teach us something more specific.
In one case, a very large customer used the Messenger in a way that exercised the full Messenger lifecycle even though the visible user experience did not require it. Under normal traffic, this was fine. During a major customer-side incident, their users refreshed aggressively, generating a much larger burst of Messenger traffic than the integration actually needed.
The platform stayed available, but the event exposed unnecessary work in that integration path. We built a lighter-weight integration path that served the customer’s actual use case with far less work per request, making future spikes easier to absorb.
We treat large customer events this way even when there is no broad customer impact. They are opportunities to understand the real scaling properties of the system and improve for the next event.
Scale is also an operating model
The infrastructure that underpins scaling matters, but it is not enough. You can have the right database architecture and still hurt customers if you detect issues late, recover slowly, communicate poorly, or fail to learn from incidents.
That is why our operating model starts with customer outcomes. If the customer cannot do the job they came to do, the system is unhealthy. It does not matter how many dashboards are green.
Heartbeat metrics tell us whether customers can do the core jobs they hire us to do. They cut through the noise of infrastructure dashboards and help us answer the question that matters most during an incident: are customers able to use the product successfully?
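A heartbeat check of this kind might compare a core customer job, such as sending a message, against a baseline window (for example, the same window a week earlier). The sketch below flags the job as unhealthy if attempts collapse, which can mean customers cannot even start the job, or if the success rate drops noticeably. `HeartbeatWindow`, `heartbeat_healthy`, and both thresholds are hypothetical illustrations, not our actual definitions.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatWindow:
    """Counts for one core customer job over a time window."""
    attempts: int
    successes: int

    @property
    def success_rate(self) -> float:
        return self.successes / self.attempts if self.attempts else 1.0


def heartbeat_healthy(current: HeartbeatWindow, baseline: HeartbeatWindow,
                      max_rate_drop: float = 0.02,
                      min_volume_ratio: float = 0.5) -> bool:
    """Hypothetical check: can customers still do this job about as well as usual?"""
    # Volume: a sharp fall in attempts can mean customers cannot even start the job.
    if current.attempts < baseline.attempts * min_volume_ratio:
        return False
    # Success rate must stay within a small relative drop of the baseline.
    return current.success_rate >= baseline.success_rate * (1 - max_rate_drop)
```

Because the check is phrased in terms of customer jobs rather than CPU or error counts on one service, it stays meaningful even when the failing component is one nobody thought to dashboard.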
This also shapes how we ship.
Today, we average around 250 ships to production per workday, with an average merge-to-production time under 10 minutes. We wrote about the system behind this earlier this year, when those numbers were around 180 ships per workday and 12 minutes from merge to production.
That is not a vanity metric. It is part of the safety model. Smaller changes are easier to understand, easier to observe, and easier to roll back. Feature flags let us separate deployment from release. Automatic rollback and heartbeat-driven detection help us recover quickly when a change hurts customers.
Fast shipping is not the opposite of reliability. Done properly, it is one of the ways you stay in control of change.
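Heartbeat-driven automatic rollback can be sketched as a post-deploy watcher: after a change goes live, it polls a health signal and reverts to the previous version if health fails on consecutive checks. This is a minimal sketch assuming the health check and the rollback action are supplied as callables; `Deploy`, `DeployGuard`, and the thresholds are hypothetical, and a real watcher would wait between checks rather than polling in a tight loop.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Deploy:
    version: str
    previous_version: str


@dataclass
class DeployGuard:
    """Hypothetical post-deploy watcher that rolls back on heartbeat regressions."""
    is_healthy: Callable[[], bool]  # e.g. a heartbeat check
    rollback: Callable[[Deploy], None]  # e.g. redeploy previous_version
    checks: int = 5  # how many checks to run after the deploy
    failures_to_rollback: int = 2  # tolerate a single noisy reading

    def watch(self, deploy: Deploy) -> bool:
        """Return True if the deploy stays live, False if it was rolled back."""
        consecutive_failures = 0
        for _ in range(self.checks):
            if self.is_healthy():
                consecutive_failures = 0
                continue
            consecutive_failures += 1
            if consecutive_failures >= self.failures_to_rollback:
                self.rollback(deploy)
                return False
        return True
```

Requiring two consecutive failures is a trade-off: it avoids reverting on one noisy reading at the cost of a slightly slower reaction. Small, frequent changes keep the cost of a rollback low, which is what makes acting on that signal automatically reasonable.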
The bar is high. Engineers are expected to understand the impact of their changes, watch them go live, and act quickly if something looks wrong. Resuming service is not the end of an incident. We expect teams to understand the root cause, fix the contributing systems, and prevent recurrence.
That is how scale stays safe over time.
Scheduled maintenance should be extraordinary
Planned maintenance is sometimes necessary. But it should not be the normal cost of operating a modern customer service platform.
Historically, database maintenance was one of the main reasons companies needed maintenance windows. Upgrading a database, changing instance sizes, performing failovers, or moving large tables could require customer-impacting downtime.
That is exactly the kind of operational constraint we wanted to remove.
With the move to Vitess and PlanetScale, we have changed what routine database improvement looks like. We can upgrade, scale, and improve critical database infrastructure without turning that work into planned customer-impacting downtime. We have been doing this in practice, not just talking about it as a future goal.
This is important because customers depend on our platform for live customer operations. If their support team, Messenger, Help Desk, or AI Agent is unavailable, the impact is immediate. Scheduled maintenance cannot be treated as a casual operational convenience.
Our posture is simple: routine infrastructure improvement should not require planned customer-impacting downtime.
Scheduled maintenance should be exceptional, non-routine, clearly communicated, and minimized in frequency, duration, and customer impact.
That is the practical benefit of the architecture work. Better scaling is not only about handling more traffic, but also reducing the operational moments that might inconvenience customers.
What this means for customers
Customers should be skeptical of vague scale claims. We certainly are.
The question is not whether a vendor says they can scale. The question is whether they can explain how, where the limits are, what they measure, how they recover, and what they have already changed after learning from production.
At Fin, we understand the scaling properties of our systems, have clear levers to add capacity at the right layers, design for customer isolation and fairness, monitor customer outcomes directly, and use real production events to make the next one safer. Scale is never finished. Every large customer event, traffic spike, migration, and incident teaches us something about the real behavior of the system. We use that data to keep improving.
That is what customers should expect from a platform they depend on during their busiest moments.




