
Why Averages Lie: Mathematical Methods for Load Testing

Nov 18, 2025
14 min read

Heydar Gabrielyants,

Performance & Load Testing Lead

Relying on “average” metrics alone makes load testing surprisingly inaccurate. In this article, we’ll show how to avoid the usual traps and walk through practical techniques for mathematically modeling a workload profile, from analyzing variance and correlations to spotting Simpson’s paradox and validating the final model.

When a company moves to a new system, the first question is almost always the same:
“Will it handle the load?”

It sounds simple, but that question hides a chain of unknowns. Sure, we can point to previous cases: “We had a similar project with five thousand users, and everything worked just fine.” But that’s a weak argument: one system might be running at a pharmaceutical company, while another is the SAP installation of a coal-mining holding. Their workload profiles are completely different, so a direct analogy doesn’t hold up.

Proper load testing must reflect the full operational cycle of the business, considering seasonality, peak hours, and transaction volume. To account for all of that, we use mathematical workload modeling: a method that lets us predict, rather than guess, how the system will behave as user count and transaction volume grow.

Initial System Analysis

Collecting baseline data

The first step in building a mathematical workload model is gathering data about the current system:

  • Application logs (for example, technical logs in systems like Dynamics 365 or SAP);
  • User behavior data (business hours, activity schedules);
  • User activity logs (records of user actions and transactions);
  • Database statistics — number of records, document sizes, and operation frequency.

Of course, a complete picture isn’t always available. Sometimes it’s impossible to extract detailed technical logs from ERP systems like SAP, Dynamics, or others. But even then, there are still plenty of useful data sources: user activity logs, database statistics, and system monitoring metrics. That’s usually enough to understand how users interact with the system and which scenarios generate the main load.

Reconstructing workload in the test environment

The next step is to identify which types of documents and operations should be generated to reproduce the load in the test environment.

It’s not enough to simply “click buttons” in the UI — creating a document, posting it, and generating a report must all be modeled on a realistic database, ideally containing at least one year’s worth of operational data.

Otherwise, you might miss performance bottlenecks hidden in the code: for example, an algorithm that scans the entire table every time instead of selecting data by index. On a small database, that kind of problem can go completely unnoticed.

Common Pitfalls in Data Interpretation

The myth of even load distribution

One of the most common mistakes is averaging everything out. Early in my career, I used to take statistics for the entire company — for example: “On average, 10,000 documents are created per day, so it’s about a thousand per hour. Let’s simulate 3,000 for the peak hour and spread them evenly.”

Seems reasonable, right? But in reality, system load is almost never evenly distributed.

Let’s say a warehouse manager tells you: “Our most important operation is loading the trucks.” You check the logs and see only 200 such operations per day — looks insignificant. But here’s the catch: all 200 happen within the five-minute window when the trucks are being loaded. During those five minutes, the system load skyrockets — and that’s the behavior your test model needs to capture.

You can’t model an average day — you need to model the peak moment. Only then can you tell whether the system will actually handle real business scenarios.
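A quick back-of-the-envelope check makes the gap obvious. The sketch below uses the truck-loading numbers from the example above; the 8-hour working day is an assumption added for illustration:

```python
# Numbers from the truck-loading example above; the 8-hour working day is an assumption.
ops_per_day = 200        # truck-loading operations per day
business_hours = 8       # assumed length of the working day
peak_window_min = 5      # all loading happens inside this window

avg_rate  = ops_per_day / (business_hours * 60)   # operations per minute, averaged
peak_rate = ops_per_day / peak_window_min         # operations per minute at the peak

print(f"average rate: {avg_rate:.2f} ops/min")              # ~0.42 ops/min
print(f"peak rate:    {peak_rate:.0f} ops/min")             # 40 ops/min
print(f"peak is ~{peak_rate / avg_rate:.0f}x the average")  # ~96x
```

A model built on the average rate would understate the real load on that operation by roughly two orders of magnitude.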

Modeling only one scenario when there are many

It may seem logical: take the “heaviest” hour — the one in which users create the largest number of documents — and see if the system can handle it. But that is far from the whole picture.

System load is almost never static — it shifts throughout the day. During business hours, users might be processing payments, generating reports, and creating documents. At night, manufacturing or logistics processes may take over.

In one of our projects, for example, the client produced meat and poultry products — sausages, chicken, eggs — all processed and shipped during the day. But there was one special workflow: pork fat shipments, handled exclusively at night. That single process created a completely different load profile.

Ignoring recurrence and temporal patterns in operations

In real-world scenarios, many operations — such as creating orders, updating inventory, or generating reports — happen on a recurring basis. Each has its own interval: every five minutes, every hour, once a day. These intervals form the rhythm of the workload.

When calculating averages, this factor is often overlooked — yet it can have a major impact on how the system behaves under load. If you simply average the data in your model, that rhythm disappears. The system may appear stable — but that’s an illusion.

To make your model realistic, you need to simulate recurrence:

  • capture how often operations repeat over time;
  • define relationships between transactions that don’t occur simultaneously but with a time lag;
  • design test scenarios that reflect the natural fluctuations of user activity.

This level of detail allows the model to reproduce real peaks and troughs, rather than a flat, “sterilized” version of the workload.
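To make the idea concrete, here is a minimal sketch of how such a rhythm can be expressed in a load-generation script. The operation names, intervals, and the 10-minute lag are all hypothetical:

```python
from collections import defaultdict

# Hypothetical recurring operations: (name, interval in minutes, offset in minutes)
recurring_ops = [
    ("create_order",      5,  0),   # every 5 minutes
    ("update_inventory", 15,  2),   # every 15 minutes, starting at minute 2
    ("generate_report",  60, 45),   # once an hour, at minute 45
]

horizon_min = 60                    # model a single hour
timeline = defaultdict(list)        # minute -> operations started in that minute

for name, interval, offset in recurring_ops:
    for minute in range(offset, horizon_min, interval):
        timeline[minute].append(name)

# A dependent operation that fires 10 minutes after every create_order,
# i.e. a relationship with a time lag rather than simultaneity.
lag = 10
for minute in range(0, horizon_min, 5):
    if minute + lag < horizon_min:
        timeline[minute + lag].append("confirm_order")

for minute in sorted(timeline):
    print(f"minute {minute:2d}: {', '.join(timeline[minute])}")
```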

Non-interactive operations: the hidden system load

Another major source of distortion comes from non-interactive operations, i.e. tasks executed automatically by the system rather than directly by users. These include things like inventory recalculations, automated order generation, scheduled data updates, or AI-driven processes.

Such operations are neither uniform nor predictable. They run on schedules, event triggers, or specific data conditions — and as a result, the load can spike unexpectedly, even when user activity is low.

That’s why non-interactive processes should be analyzed separately:

  • identify how often they run and under what conditions;
  • calculate what share of the total system load they contribute;
  • if necessary, smooth out spikes by distributing tasks more evenly over time.

This kind of analysis helps to prevent situations where the system suddenly becomes overloaded at night or during “quiet hours” — not because of user activity, but due to background jobs that were forgotten during load planning.
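A rough sketch of the second step, calculating the background jobs’ share of the load per interval, might look like this (the hours and counts are invented for illustration):

```python
# Hypothetical per-hour operation counts: (hour of day, user operations, background operations)
hourly_stats = [
    (9,  1200,  150),    # morning: users dominate
    (14, 1800,  200),    # afternoon peak of interactive work
    (22,   50, 1400),    # "quiet hour" dominated by a scheduled recalculation job
]

for hour, user_ops, background_ops in hourly_stats:
    total = user_ops + background_ops
    share = background_ops / total
    print(f"{hour:02d}:00  total={total:5d}  background share={share:.0%}")
```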

Lock contention and long waits

When you build a model with averaged parameters — say, 1,800 virtual users performing actions at a uniform rate — the system may appear perfectly stable. The graphs look smooth, the CPU is calm, and the database responds without delays.

But in real-world usage, user behavior is far from uniform. Some people hit “Post Document” at the exact same second as dozens of others, while others generate reports or perform bulk data exports.

During those peak moments, you get locks, timeouts, and deadlocks — when multiple processes try to modify the same data simultaneously and the system must wait for resources to be released.

If the load is distributed too evenly, your test will never capture these conflicts. And that means they’ll surface later — in production, under real pressure — when fixing them becomes far more expensive.
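A toy simulation shows why. With the same total number of “Post Document” actions, worst-case concurrency differs by an order of magnitude depending on whether clicks arrive uniformly or cluster in a short burst; the 80/20 split and the 60-second window below are assumptions:

```python
import random
from collections import Counter

random.seed(42)
n_posts = 300        # total "Post Document" actions in one hour (hypothetical)
hour_sec = 3600

# Uniform model: each action lands at a random second of the hour.
uniform = Counter(random.randrange(hour_sec) for _ in range(n_posts))

# Bursty model: 80% of actions land inside one 60-second window (assumption).
bursty = Counter(
    random.randrange(1800, 1860) if random.random() < 0.8 else random.randrange(hour_sec)
    for _ in range(n_posts)
)

print("max concurrent posts per second (uniform):", max(uniform.values()))
print("max concurrent posts per second (bursty): ", max(bursty.values()))
```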

Missing administrative load-reduction measures

Another common mistake is ignoring manageable peaks — the ones that can be reduced simply through scheduling and process management. For example, why run cost calculation or customer consumption analysis at 3 p.m., right in the middle of peak activity? A simple administrative decision — moving that job to nighttime — can significantly reduce system load without touching a single line of code or changing the architecture.

Misjudging peak loads

Engineers often analyze only the large, heavy operations, while overlooking smaller but frequent actions that also contribute to spikes. Take the earlier example of truck loading: the operation seems minor, yet it’s exactly what adds the final stress during peak hours and creates a bottleneck.

It’s important to consider all types of operations during peak hours — not just the biggest ones selected by data volume.

Inconsistency between test results and the mathematical model

And finally, a critical mistake: failing to validate the model.  After completing the test, you need to collect performance statistics again — now based on the modeled behavior — and compare them to the original input data used to build the model.

If everything is done correctly, the results should align both ways:

  • the mathematical model predicts what the test actually shows;
  • and the test confirms what the model calculated.

That feedback loop is the key to reliability.

The Mathematical Framework for Workload Modeling

When we begin mathematical modeling of a system’s workload, it’s important to start with one simple assumption: we don’t know the system perfectly. In reality, data isn’t random — users perform intentional actions, order specific products, and follow business logic. But to build a useful model, we temporarily assume randomness. This assumption allows us to uncover hidden dependencies and correlations that ordinary analysis might miss.

For example, we might notice that Customer X most often buys Product A — and always does it on Wednesdays at 3 p.m. At first glance, that may look like a coincidence. But with mathematical analysis of such patterns, we discover that this time window actually creates a recurring load peak.

By examining many of these repeating scenarios across all data series, we can determine which hours and which operations cause the highest latency or overload. As a result, the model begins to reflect the real behavioral patterns of the system — we’re no longer just generating random user actions, but reproducing their behavior with its natural periodicity and rhythm.

Data Variation

Variation
Variation refers to the difference in the values of a given attribute across multiple entities within the same time period.

When analyzing variation, consider:

  • absolute deviations;
  • the coefficient of variation;
  • variance.
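As a minimal sketch, all three measures can be computed directly from a series of daily document counts (the numbers below are invented):

```python
import numpy as np

# Hypothetical daily document counts for one operation type over two weeks
daily_docs = np.array([980, 1020, 1150, 2400, 990, 1010, 1080,
                       950, 1005, 1120, 2550, 1000, 970, 1060])

mean     = daily_docs.mean()
abs_dev  = np.abs(daily_docs - mean)      # absolute deviations
variance = daily_docs.var(ddof=1)         # sample variance
cv       = daily_docs.std(ddof=1) / mean  # coefficient of variation

print(f"mean = {mean:.0f}, variance = {variance:.0f}")
print(f"coefficient of variation = {cv:.2f}")
print(f"largest absolute deviation = {abs_dev.max():.0f}")
```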

Analyzing variation helps identify which elements of the system occur most frequently:

  • which products are in the highest demand;
  • which clients or vendors generate the most transactions;
  • which contracts or projects drive the main activity.

We then analyze the temporal patterns: which hour or day of the week sees peak activity, whether there is seasonality, or recurring patterns in operations.

Once these patterns are revealed, it’s crucial to prepare the test data correctly. You can’t just use a single document template and copy it repeatedly, since real systems operate on many similar but not identical transactions: different clients, products, and order parameters.

To model that diversity properly, create multiple templates — for example:

  • 10 document templates for one client;
  • 10 for another;
  • and so on.

These documents should also be created and processed at different time intervals — some immediately, others after an hour or two. That way, the model starts reflecting realistic system behavior: repeated yet non-identical operations that together create the true production load.
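A small data-generation sketch of that idea, with hypothetical client and product names, could build a pool of similar but non-identical documents and give each one its own processing delay:

```python
import random

random.seed(7)

# Hypothetical clients and products; in a real project these come from the collected statistics.
clients  = ["client_A", "client_B", "client_C"]
products = ["SKU-100", "SKU-200", "SKU-300", "SKU-400"]

documents = []
for client in clients:
    for _ in range(10):                        # 10 templates per client
        documents.append({
            "client": client,
            "product": random.choice(products),
            "quantity": random.randint(1, 50),
            # Not every document is posted immediately: some wait an hour or two.
            "process_delay_min": random.choice([0, 0, 5, 30, 60, 120]),
        })

print(f"generated {len(documents)} similar but non-identical documents")
print(documents[0])
```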

Data Correlation

Correlation
Correlation is a statistical relationship between two or more random variables (or variables that can be treated as such with reasonable accuracy). Changes in one or more of these variables systematically correspond to changes in another.
Correlation analysis
Correlation analysis is a statistical method used to measure the strength of association between two or more variables. It’s closely related to regression analysis, often combined into what’s called correlation–regression analysis. This approach helps determine which factors should be included in a multiple regression equation and how well that equation explains the observed relationships (typically assessed using the coefficient of determination, R²).

As mentioned earlier, when building a workload model, we rarely know the system perfectly.
We can’t always explain why a particular client orders specific products or why shipments go through a certain warehouse. But we can measure how those events relate to each other — which ones tend to occur together and how they are connected. That’s where correlation analysis comes in — one of the foundational tools of mathematical workload modeling.

We start by collecting and organizing statistics such as:

  • which clients appear most frequently in orders;
  • which products or SKUs are often sold together;
  • and from which warehouses shipments most frequently depart.

Then, using correlation analysis, we measure the strength of relationships between these parameters. This allows us to identify which elements of the system are interdependent and which behave independently.

Correlation analysis helps simulate load patterns that closely reflect real user behavior — showing where peaks align, where operations reinforce each other, and which combinations of actions create critical overload points.
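In the simplest form, this means correlating per-interval counts of different operation types. The sketch below uses invented 5-minute counts for three operations over one hour:

```python
import numpy as np

# Hypothetical counts per 5-minute interval over one hour (12 intervals)
orders    = np.array([ 5,  7,  6, 30, 42, 38,  8,  6,  5,  7,  6,  5])
shipments = np.array([ 2,  3,  2, 25, 35, 33,  4,  3,  2,  3,  2,  2])
reports   = np.array([10,  9, 11, 10,  9, 10, 11, 10,  9, 10, 11, 10])

corr = np.corrcoef(np.vstack([orders, shipments, reports]))

print("orders vs shipments:", round(corr[0, 1], 2))  # peaks coincide
print("orders vs reports:  ", round(corr[0, 2], 2))  # essentially independent
```

A coefficient close to 1 means the two operations peak together and should be modeled as a joint load; a value near zero means they can be scheduled independently in the test scenario.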

Below are graphs from real projects, showing product and service transactions over a 55-minute window, divided into 5-minute intervals.

[Chart: average-based calculation]

Here’s an example of time-based calculations where we simply distributed an average document rate across the entire hour — without any mathematical modeling.
Does it reflect reality? Of course not.

In reality, document posting mostly happened between the 20th and 35th minute.

[Chart: peak load]

In another example, documents were processed only at the beginning and end of the hour, following the actual truck arrival schedule.

[Chart: simultaneous document flow]

And in yet another case, product shipments depended entirely on when the production line released finished goods.

[Chart: spike load]

None of these real scenarios even came close to the average-based one.

Checking for Simpson’s Paradox

When working with statistics, it’s important to remember: averages can be misleading. This becomes especially apparent when data is divided into multiple groups — for example, by warehouse and by customer.

Let’s say we analyze how often a specific product is sold from certain warehouses to specific clients. When we look at these parameters separately, there may seem to be a clear correlation.
But once we combine the data, viewing sales by both warehouse and client, the relationship suddenly weakens or even disappears.

This is Simpson’s paradox:
a statistical phenomenon where a trend observed in separate groups reverses or vanishes when those groups are combined. It’s not an error or a sign of “bad” data — it’s a signal for deeper analysis.

If combining groups dramatically changes the results, it means there are hidden factors influencing user behavior or data structure. In the context of load testing, this is particularly crucial: ignoring such differences can lead to a model that looks statistically sound but fails to represent real interaction patterns within the system.
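A tiny numerical illustration of the paradox, with invented counts: within each warehouse the share of orders that include the product grows from one period to the next, yet the combined share falls because the mix of orders between warehouses shifts.

```python
# Hypothetical counts: (orders containing the product, total orders) per warehouse
period_1 = {"warehouse_1": (10, 100), "warehouse_2": (200, 400)}
period_2 = {"warehouse_1": (30, 200), "warehouse_2": (60, 100)}

def share(hits, total):
    return hits / total

for wh in period_1:
    print(f"{wh}: {share(*period_1[wh]):.0%} -> {share(*period_2[wh]):.0%}  (rises within the group)")

p1 = share(sum(h for h, _ in period_1.values()), sum(t for _, t in period_1.values()))
p2 = share(sum(h for h, _ in period_2.values()), sum(t for _, t in period_2.values()))
print(f"combined:    {p1:.0%} -> {p2:.0%}  (falls once the groups are merged)")
```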

Once the mathematical model is ready and load testing begins, it’s important to correctly interpret how operations are distributed over time.

[Chart: interpretation of operation distribution]

The example above shows a real workload scenario from a banking system. During testing, virtual users (bots) simulated user actions, revealing how activity was distributed minute by minute.

Bank statement processing typically began around the 30th minute of each hour and finished by the 50th. That means the active phase lasted only about 25 minutes, during which the system experienced a true peak load.

If we had distributed operations by average values, the graph would look smooth and tidy — with no visible spikes. But real life doesn’t work that way: real users generate uneven, bursty loads, and those are exactly the ones we must model.

The upper part of the table reflects interactive operations — manual actions performed by users (simulated by bots in the test). These actions tend to have a fairly uniform distribution, with five-minute intervals and no sharp jumps.

Below, you can see scheduled jobs — automated background processes triggered by time or events. They are the ones that create the most complex and demanding load peaks in the system.

[Chart: scheduled jobs]

The chart shows that at one moment the system processes only 23 documents, while at another — nearly 1,500. After modeling this scenario, we obtained the following operation table and load graph.

As you can see, CPU utilization is far from uniform: instead of a smooth rise and fall, there are clear spikes. The heaviest intervals occur between minutes 10–20 and 20–30, when the system experiences its highest stress levels.

Interestingly, the average CPU load looks perfectly healthy — around 80%, with some apparent capacity left. But a closer look at the peaks shows that they are both long-lasting and recur within certain time frames. If you analyze the situation only by the average value (the yellow line on the chart), you might reach the wrong conclusion: “Everything looks great — we can decommission some servers, there’s spare capacity.”

In reality, that decision would be disastrous. During peak minutes, the system would begin to choke under pressure — producing timeouts, locks, slowdowns, and failed transactions.

The “Non-Random” Randomness: How to Validate Your Model

Once the main tests are complete and the results recorded, engineers move on to the exploratory phase — verifying the model’s stability. This involves running a series of additional tests using different, randomly generated datasets.

[Chart: tests with randomized datasets]

The goal of these tests is to ensure that system behavior doesn’t depend on specific input data, but rather reflects the patterns defined in the algorithms.

Analyzing Variance

The results are compared against those from the main test. If the graphs differ significantly, that’s a signal that the algorithms are dependent on the data structure.

In that case, the root cause is usually one of two things:

  • a flaw in business logic — for example, an incorrectly modeled business process; or
  • a problem in the code — such as inefficient queries, redundant conditions, or deeply nested if/else statements that slow down execution.

In practice, poorly optimized code is the more common cause.
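One simple way to quantify “differ significantly” is to compare summary statistics of the main run and a randomized-data run. The response-time samples and the 20% threshold below are assumptions; a formal test such as Kolmogorov-Smirnov could be used instead:

```python
import numpy as np

# Hypothetical per-interval response times (ms) from the main test and a randomized-data run
main_run   = np.array([120, 135, 128, 410, 460, 430, 140, 125, 130, 138])
random_run = np.array([118, 140, 125, 890, 950, 910, 150, 122, 128, 135])

for name, p in [("median", 50), ("p95", 95)]:
    a = np.percentile(main_run, p)
    b = np.percentile(random_run, p)
    diff = abs(a - b) / a
    flag = "  <-- results depend on the data, investigate" if diff > 0.20 else ""
    print(f"{name}: main={a:.0f} ms, randomized={b:.0f} ms, diff={diff:.0%}{flag}")
```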

Conditional Probability Modeling

This stage of analysis is what we call “non-random randomness.” We start by assuming that all data is random, but in fact, it follows conditional probabilities.

Engineers build a mathematical model describing the likelihood of certain events, for example:

Customer X buys Product A with probability 0.7, and Product B with probability 0.3.

If load test results confirm these probabilities, the mathematical model is validated. If not, the model must be revised — either the data generation was flawed, or the processing algorithms were biased.

When there’s a significant delta between expected and observed test results, it’s important to analyze not only the logs and datasets but also the algorithms that influence test behavior.
In such cases, additional load testing scenarios may be required.

Whenever possible, define conditional probabilities for key operations, for example

P(A | X) = P(A ∩ X) / P(X),

the probability of operation or parameter A occurring under condition X (a specific customer, warehouse, or time window). Then it’s possible to run a series of experiments. If the experimental outcomes align with these probabilities, that is, if the observed frequencies satisfy

n(A ∩ X) / n(X) ≈ P(A | X),

create a dependency table showing how parameter values — for example, product IDs or warehouse locations — relate to one another. Analyzing these dependencies helps confirm that the model accurately reflects real-world behavior.
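A minimal sketch of that comparison, with hypothetical probabilities and counts: the conditional probabilities assumed by the model are checked against the frequencies actually observed in the test logs.

```python
# Conditional probabilities assumed by the model: P(product | customer X)
expected = {"product_A": 0.7, "product_B": 0.3}

# Hypothetical counts observed for customer X in the load-test logs
observed_counts = {"product_A": 731, "product_B": 269}
total = sum(observed_counts.values())

tolerance = 0.05   # acceptable deviation between model and observation (assumption)
for product, p_model in expected.items():
    p_observed = observed_counts[product] / total
    status = "OK" if abs(p_observed - p_model) <= tolerance else "MISMATCH: revise the model or the data generator"
    print(f"{product}: expected {p_model:.2f}, observed {p_observed:.2f} -> {status}")
```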

Ultimately, this “non-random randomness” check serves as the final stage of model validation — ensuring that the mathematical model truly mirrors the real system.

