Why Modern Software Performance Is Mostly About Waiting

May 28, 2026

Latency, memory, caching, and the physical realities underneath modern software systems

Most developers initially think software performance is mainly about computation. If a system becomes slow, the instinct is usually to blame inefficient algorithms, weak hardware, excessive CPU usage, or “unoptimized code.” Performance work therefore often gets framed as making computation faster.

But modern systems spend enormous amounts of time not computing.

They spend time waiting.

Waiting for:

memory access
storage retrieval
database queries
packets crossing networks
locks
APIs
other services
data to move between physical systems separated by latency and bandwidth constraints

This changes the entire nature of performance engineering.

Modern processors can execute billions of instructions per second, yet large applications still become slow because retrieving information is often dramatically more expensive than processing it once it arrives. As systems scale, moving data and coordinating state usually become harder than the computation itself.

This is one of the deepest recurring truths underneath modern computing:

computation is cheap compared to coordination and data movement.

The implications appear everywhere.

Databases rely heavily on indexes and caching because storage access is expensive. Distributed systems struggle because network communication introduces latency and uncertainty. APIs become bottlenecks because remote calls are fundamentally slower than local execution. Modern CPUs depend heavily on cache hierarchies because main memory is comparatively slow. Cloud infrastructure replicates data geographically because physical distance affects responsiveness. Large systems collapse under load because queues and contention amplify waiting time across infrastructure.

Performance engineering therefore becomes much less about isolated “code optimization” and much more about:

reducing unnecessary waiting
minimizing expensive communication
improving locality
avoiding coordination bottlenecks

In this article, we’ll examine the physical realities shaping software performance, why memory access dominates many workloads, why latency accumulates across systems, why caching appears repeatedly throughout computing, and why modern software architecture is ultimately constrained by the movement of information through physical systems.

Computation Is Cheap, Waiting Is Expensive

Modern CPUs are extraordinarily fast at arithmetic and logical operations. A processor can execute billions of instructions per second while performing:

comparisons
vector operations
cryptographic computation
floating-point arithmetic
branching
many other forms of computation

The surprising part is that many systems are not bottlenecked by those operations directly.

They are bottlenecked waiting for information to arrive.

Suppose a processor needs data that is not already nearby in cache. It may need to fetch that data from RAM, wait for storage, request it across a network, synchronize with another thread, or pause until another service responds.

During this time the processor is still capable of executing instructions extremely quickly, but it sits partially stalled until the data arrives.

This creates one of the central realities of modern software systems:

Fast Computation
+
Slow Retrieval
=
Waiting Dominates

The deeper systems insight is that performance is often constrained less by raw computation and more by how quickly systems can retrieve, move, and coordinate information.

That distinction changes how you reason about software entirely.

The Latency Hierarchy Shapes Modern Computing

Not all information is equally expensive to access.

Modern systems are built around a hierarchy where different storage and communication layers operate at dramatically different speeds. Information physically closer to active computation is vastly cheaper to access than information farther away.

A simplified conceptual hierarchy looks something like this:

CPU Registers
↓
L1 Cache
↓
L2 / L3 Cache
↓
RAM
↓
SSD / Disk
↓
Remote Server
↓
Another Geographic Region

The differences between these layers are not small.

They are often separated by orders of magnitude.

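As rough orders of magnitude, a register or L1 cache access takes about a nanosecond, and a RAM access takes around 100 nanoseconds. A random SSD read takes tens of microseconds or more. A network round trip ranges from a fraction of a millisecond inside a data center to 100 milliseconds or more once requests cross regions or continents.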

This means software architecture becomes heavily shaped by distance.

The farther information lives from active computation, the more expensive retrieval becomes.

This is one reason systems repeatedly try to:

cache aggressively
reduce unnecessary network communication
batch operations together
avoid excessive synchronization
keep related data physically close to active execution whenever possible

Modern software performance is deeply constrained by this latency hierarchy whether developers explicitly think about it or not.
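To make the gaps in the hierarchy concrete, here is a minimal sketch that expresses each layer as a multiple of an L1 cache hit. The numbers in `APPROX_LATENCY_NS` are illustrative order-of-magnitude assumptions, not measurements of any particular machine, and the layer names are chosen only for this example.

```python
# Rough, order-of-magnitude latencies in nanoseconds.
# Illustrative assumptions, not measurements of any specific hardware.
APPROX_LATENCY_NS = {
    "L1 cache hit": 1,
    "main memory (RAM)": 100,
    "SSD random read": 100_000,                  # ~100 microseconds
    "same-datacenter round trip": 500_000,       # ~0.5 milliseconds
    "cross-continent round trip": 150_000_000,   # ~150 milliseconds
}


def relative_cost(layer: str, baseline: str = "L1 cache hit") -> float:
    """How many baseline accesses fit into one access at the given layer."""
    return APPROX_LATENCY_NS[layer] / APPROX_LATENCY_NS[baseline]


if __name__ == "__main__":
    for layer in APPROX_LATENCY_NS:
        print(f"{layer:<28} ~{relative_cost(layer):>13,.0f}x an L1 hit")
```

Under these assumed figures, a single cross-continent round trip costs as much waiting as roughly a hundred million cache hits, which is why architectures work so hard to keep data close.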

Why Memory Access Dominates So Many Workloads

One of the most important realizations in systems engineering is that accessing memory efficiently often matters more than raw computation speed.

Modern CPUs can execute instructions much faster than main memory can supply data. This mismatch is one reason processors rely heavily on cache hierarchies designed to keep frequently accessed information physically closer to execution units.

When required data already exists nearby in cache, performance can remain extremely fast.

When data is scattered unpredictably across memory, processors may spend substantial time stalled waiting for retrieval.

This is why memory access patterns matter enormously.

Sequential memory access is usually much faster than random access because processors and memory systems can predict and preload nearby information more efficiently. Random access patterns often generate frequent cache misses, forcing the CPU to retrieve information repeatedly from slower memory layers.

A simplified conceptual contrast:

Sequential Access
↓
Predictable
↓
Cache Friendly

Random Access
↓
Unpredictable
↓
Frequent Cache Misses

This physical reality affects:

databases
operating systems
search engines
analytics pipelines
AI systems
game engines
many other high-performance workloads

A surprising amount of “algorithm performance” in real systems is actually memory behavior underneath the surface.
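The contrast between sequential and random access can be tried with a small experiment. The sketch below sums the same contiguous array twice, once in index order and once in a shuffled order, so the computational work is identical and only the access pattern differs. The function names are hypothetical, and the measured gap depends heavily on hardware and array size. In CPython, interpreter overhead hides much of the effect, so a compiled language would typically show a larger difference.

```python
import random
import time
from array import array


def sum_by_index(values: array, order: list[int]) -> int:
    """Sum values by visiting indices in the given order."""
    total = 0
    for i in order:
        total += values[i]
    return total


def time_traversal(values: array, order: list[int]) -> tuple[int, float]:
    """Return the sum and the elapsed wall-clock seconds for one traversal."""
    start = time.perf_counter()
    total = sum_by_index(values, order)
    return total, time.perf_counter() - start


if __name__ == "__main__":
    n = 1_000_000
    values = array("q", range(n))        # contiguous 64-bit integers
    sequential = list(range(n))
    shuffled = sequential[:]
    random.Random(0).shuffle(shuffled)   # same indices, unpredictable order

    seq_total, seq_seconds = time_traversal(values, sequential)
    rnd_total, rnd_seconds = time_traversal(values, shuffled)
    assert seq_total == rnd_total        # identical work, different order
    print(f"sequential: {seq_seconds:.3f}s  random: {rnd_seconds:.3f}s")
```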

Why CPUs Depend So Heavily On Caches

Modern processors are designed partly around hiding memory latency.

If CPUs waited on RAM for every operation, much of the processor’s computational capability would sit idle. Cache hierarchies reduce this problem by storing frequently accessed information closer to the execution units.

Modern processors therefore contain several cache layers — commonly L1, L2, and L3 caches — each balancing speed, size, and proximity differently.

Smaller caches are extremely fast but limited in capacity. Larger caches store more information but operate more slowly.

A simplified conceptual model:

Smaller Cache
↓
Faster Access

Larger Cache
↓
Slower Access

When required information cannot be found in cache, a cache miss occurs and the processor must retrieve data from slower memory layers instead.

Cache misses are one of the major hidden costs underneath many real-world performance problems.

This is why data layout matters so much in high-performance systems. Keeping related information physically close together can dramatically improve cache efficiency and reduce processor stalls.

Modern CPUs therefore devote substantial hardware to predicting future memory access patterns and fetching data before software explicitly requests it.

Large portions of processor architecture exist primarily to hide waiting.

The CPU Is Often Waiting, Not Computing

One of the most counterintuitive realities in modern computing is that processors frequently spend substantial time stalled waiting for information rather than actively executing useful work.

From the outside, CPUs appear to operate continuously at enormous speed. Internally, however, modern processors are heavily designed around one recurring problem:

memory is comparatively slow.

If every instruction required waiting directly on RAM or storage before execution could continue, modern CPUs would waste enormous amounts of potential computational throughput.

Processor architecture therefore evolved around techniques designed specifically to hide latency and keep execution pipelines busy whenever possible.

This is one reason modern CPUs became extraordinarily sophisticated internally.

Processors attempt to:

preload likely future data
predict upcoming instruction branches
execute instructions speculatively
reorder operations dynamically
overlap memory access with computation

all partly to reduce the visible impact of waiting.

A simplified conceptual idea looks something like this:

Need Data
↓
Data Already In Cache?
↓
Yes → Continue Quickly
No → Stall Waiting

The deeper insight is important:

modern processors are not simply “fast calculators.”

They are heavily optimized systems for hiding retrieval latency.

Branch Prediction And Why Predictability Matters

Modern CPUs process instructions through pipelines where multiple stages of execution happen simultaneously. This improves throughput dramatically because processors can overlap work internally instead of completing one instruction fully before starting the next.

But pipelines introduce another problem.

Suppose the CPU encounters a conditional branch:

if (x > 0)

The processor now needs to determine which instructions should execute next.

Waiting for the branch result before continuing would slow execution significantly, so modern CPUs often predict the likely path ahead of time and begin executing instructions speculatively before certainty exists.

If the prediction is correct, execution continues efficiently.

If the prediction is wrong, the processor may need to discard speculative work and restart the pipeline with the correct instruction path instead.

This means predictability affects performance physically.

Regular, predictable execution is often easier for processors to run efficiently than highly irregular execution patterns.

Modern CPU performance therefore depends not only on:

how much computation exists

but also:

how predictable the workload behavior is
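To see why predictability matters, it can help to simulate a very simple predictor. The sketch below models a single 2-bit saturating counter, a textbook simplification rather than how any specific CPU works, and counts how often it guesses the `x > 0` branch wrong on unsorted versus sorted input. The function name and the data range are assumptions made for this example.

```python
import random


def count_mispredictions(outcomes: list[bool]) -> int:
    """Simulate one 2-bit saturating counter and count wrong predictions.

    States 0-1 predict "not taken"; states 2-3 predict "taken".
    """
    state = 0  # start at "strongly not taken"
    mispredictions = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            mispredictions += 1
        # Move one step toward the observed outcome, saturating at 0 and 3.
        state = min(3, state + 1) if taken else max(0, state - 1)
    return mispredictions


if __name__ == "__main__":
    rng = random.Random(42)
    xs = [rng.randint(-100, 100) for _ in range(10_000)]

    unsorted_branches = [x > 0 for x in xs]
    sorted_branches = [x > 0 for x in sorted(xs)]

    print("unsorted mispredictions:", count_mispredictions(unsorted_branches))
    print("sorted mispredictions:  ", count_mispredictions(sorted_branches))
```

On sorted input the branch outcome flips only once, so this toy predictor is wrong at most twice. On unsorted input with roughly even odds, it is wrong a large fraction of the time, even though both runs evaluate exactly the same comparisons.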

Why Sequential Workloads Often Perform Better

A recurring pattern throughout computing is that sequential access patterns are usually far easier for hardware to optimize than scattered, unpredictable ones.

This applies across many layers:

memory systems
storage systems
databases
networking
streaming systems
analytics pipelines

Sequential workloads allow hardware and software systems to anticipate future access patterns and preload information proactively.

For example, if memory access proceeds sequentially:

Data A
↓
Data B
↓
Data C

processors and memory controllers can often begin fetching upcoming information before software explicitly requests it.

Random access patterns break these assumptions.

A workload jumping unpredictably across memory or storage forces systems to retrieve scattered information repeatedly, increasing:

cache misses
storage seeks
latency stalls
memory pressure

This is one reason physical data layout matters enormously in high-performance systems.

Two programs performing theoretically similar computational work may behave radically differently in practice depending on how efficiently they access memory and storage.

Why Caching Appears Everywhere In Computing

Once you understand the latency hierarchy, one architectural pattern starts appearing almost everywhere:

caching.

Modern systems cache aggressively because retrieving distant information repeatedly is expensive.

This pattern appears across nearly every layer of computing:

CPUs cache memory
operating systems cache disk pages
databases cache queries
browsers cache resources
CDNs cache content geographically
applications cache API responses
distributed systems cache replicated state

The underlying idea remains remarkably similar throughout all of them.

A simplified conceptual model:

Expensive Retrieval
↓
Store Nearby Temporarily
↓
Avoid Repeating Work

Caching exists because repeatedly retrieving distant information wastes:

time
bandwidth
storage access
coordination effort

Modern software performance depends so heavily on caching that many large systems would become economically or operationally impractical without it.

This is one reason performance engineering often becomes less about speeding up computation directly and more about avoiding unnecessary work entirely.
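The "store nearby, avoid repeating work" pattern can be sketched at the application level with Python's standard `functools.lru_cache`. Here `slow_lookup` is a hypothetical stand-in for a database query or remote call, and `fetch_count` exists only to show how often the expensive path actually runs.

```python
from functools import lru_cache

fetch_count = 0  # counts how often the "expensive" path actually runs


def slow_lookup(user_id: int) -> str:
    """Hypothetical stand-in for a database query or remote call."""
    global fetch_count
    fetch_count += 1
    return f"profile-{user_id}"


@lru_cache(maxsize=1024)
def cached_lookup(user_id: int) -> str:
    # The first call per user_id pays for slow_lookup;
    # repeated calls are answered from the in-memory cache.
    return slow_lookup(user_id)


if __name__ == "__main__":
    for _ in range(1000):
        cached_lookup(7)
    print("expensive fetches:", fetch_count)
    print(cached_lookup.cache_info())
```

A thousand requests for the same key trigger a single expensive fetch. The bounded `maxsize` is a design choice: an unbounded cache trades the waiting problem for a memory problem.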

Why Cache Invalidation Becomes Difficult

Caching improves performance, but introduces another problem:

stale data.

Suppose a system caches information locally for speed, but the underlying data changes elsewhere. The cache may now contain outdated information while other parts of the system already reflect newer state.

This creates one of the most famous recurring problems in software engineering:

Fast Access
vs
Fresh Correct Data

Maintaining cache correctness becomes increasingly difficult as systems grow larger and more distributed.

Large systems must constantly balance:

latency
consistency
bandwidth usage
synchronization overhead
cache expiration policies
replication timing

This is one reason distributed systems and large-scale applications become so difficult operationally. Performance optimizations frequently introduce additional coordination complexity elsewhere in the system.

There are very few “free” optimizations in large-scale software architecture.
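The trade-off between fast access and fresh data shows up even in a tiny time-to-live cache. The following is a minimal sketch, not a production design: `TTLCache`, the `price` key, and the fake clock are all hypothetical names introduced for illustration. The clock is injectable so the stale window can be demonstrated without sleeping.

```python
import time
from typing import Callable


class TTLCache:
    """Tiny time-to-live cache: entries are trusted until they expire."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the behaviour can be tested
        self._entries: dict[str, tuple[float, object]] = {}

    def get(self, key: str, load: Callable[[], object]) -> object:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]  # may be stale if the source changed meanwhile
        value = load()       # expired or missing: go back to the source
        self._entries[key] = (now + self._ttl, value)
        return value


if __name__ == "__main__":
    source = {"price": 100}  # hypothetical source of truth
    fake_now = [0.0]
    cache = TTLCache(ttl_seconds=30, clock=lambda: fake_now[0])

    print(cache.get("price", lambda: source["price"]))  # 100 (loaded from source)
    source["price"] = 120                               # source changes elsewhere
    fake_now[0] = 10.0
    print(cache.get("price", lambda: source["price"]))  # 100 (stale, inside TTL)
    fake_now[0] = 31.0
    print(cache.get("price", lambda: source["price"]))  # 120 (expired, reloaded)
```

A shorter TTL narrows the stale window but sends more traffic back to the source, which is exactly the balance between latency and consistency described above.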

Why Network Calls Are Fundamentally Expensive

One of the biggest mental upgrades in systems engineering is realizing that remote communication is fundamentally different from local execution.

A local function call inside one process may complete in nanoseconds.

A network request may involve:

serialization
kernel networking stacks
encryption negotiation
routing infrastructure
congestion handling
packet retransmission
remote processing
deserialization
response transmission

Even under ideal conditions, network communication introduces dramatically more latency and uncertainty than local execution.

This is why distributed systems engineering becomes fundamentally different from ordinary local application development.

A simplified conceptual contrast:

Local Function Call
↓
Memory Local
↓
Extremely Fast

Remote API Call
↓
Crosses Network
↓
Latency + Failure + Coordination

This distinction shapes modern architecture profoundly.

Many developers accidentally design distributed systems while mentally reasoning as if remote calls behave like local computation. That mismatch creates enormous scalability and reliability problems later.

Modern infrastructure engineering therefore spends huge effort reducing unnecessary communication between systems whenever possible.
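One common way to reduce communication is to batch many small requests into one. The sketch below uses `FakeRemoteStore`, a hypothetical in-process stand-in for a remote key-value service, and counts round trips rather than performing real network I/O. The 1 ms round-trip time is an assumption chosen only to make the arithmetic visible.

```python
class FakeRemoteStore:
    """Hypothetical in-process stand-in for a remote key-value service."""

    def __init__(self, data: dict[str, str]):
        self._data = data
        self.round_trips = 0  # each method call models one network round trip

    def get(self, key: str) -> str:
        self.round_trips += 1
        return self._data[key]

    def get_many(self, keys: list[str]) -> dict[str, str]:
        self.round_trips += 1
        return {k: self._data[k] for k in keys}


def estimated_wait_ms(round_trips: int, rtt_ms: float) -> float:
    """Time spent purely waiting on the network, ignoring server work."""
    return round_trips * rtt_ms


if __name__ == "__main__":
    keys = [f"user:{i}" for i in range(50)]
    data = {k: k.upper() for k in keys}

    chatty = FakeRemoteStore(data)
    one_by_one = {k: chatty.get(k) for k in keys}

    batched = FakeRemoteStore(data)
    all_at_once = batched.get_many(keys)

    assert one_by_one == all_at_once  # same result either way
    rtt_ms = 1.0  # assumed round-trip time
    print("chatty: ", estimated_wait_ms(chatty.round_trips, rtt_ms), "ms waiting")
    print("batched:", estimated_wait_ms(batched.round_trips, rtt_ms), "ms waiting")
```

Both approaches return the same data, but the chatty version pays the round-trip cost fifty times while the batched version pays it once.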

Serialization: The Hidden Cost Most Developers Ignore

When systems communicate across process or machine boundaries, data usually cannot remain in native in-memory form. It must be transformed into a transferable representation that another system can understand reliably.

This transformation process is called serialization.

A service may take internal objects or data structures, convert them into:

JSON
Protocol Buffers
MessagePack
Avro
some other transferable format

send the payload across a network, and then reconstruct usable structures again on the receiving side.

Conceptually this sounds straightforward, but at scale it becomes surprisingly expensive because the work involves:

memory allocation
copying
encoding
decoding
buffering
compression
parsing
traversal through networking stacks

before meaningful computation even begins.

A simplified conceptual flow looks like this:

Internal Data
↓
Serialize
↓
Transmit
↓
Deserialize
↓
Usable Data Again

This is one reason distributed systems often become much slower than developers initially expect.

A local function call can pass references or memory-local structures almost instantly. Distributed systems repeatedly convert, copy, transmit, reconstruct, validate, and synchronize information across boundaries where local assumptions no longer apply.

The cost becomes especially noticeable once systems operate at large scale or high throughput.

Repeated serialization overhead can consume substantial CPU time, memory bandwidth, and network capacity even before business logic itself becomes expensive.

This is also one reason many high-performance infrastructure systems avoid verbose human-readable formats whenever efficiency matters heavily. Human-readable representations are convenient for debugging and interoperability, but compact binary formats usually reduce:

payload size
parsing overhead
memory pressure
network transfer cost

significantly.

The broader systems lesson is important:

moving information between systems is often far more expensive than developers intuitively expect.
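The size difference between a human-readable and a compact binary encoding can be sketched with the standard library alone. Here Python's `struct` module stands in for binary formats in general; it is not Protocol Buffers, MessagePack, or Avro, and the record shape (an integer id plus a floating-point score) is an assumption made for the example.

```python
import json
import struct

# Little-endian, no padding: uint32 id (4 bytes) + float64 score (8 bytes) = 12 bytes.
RECORD = struct.Struct("<Id")


def encode_json(records: list[tuple[int, float]]) -> bytes:
    return json.dumps([{"id": i, "score": s} for i, s in records]).encode("utf-8")


def decode_json(payload: bytes) -> list[tuple[int, float]]:
    return [(r["id"], r["score"]) for r in json.loads(payload)]


def encode_binary(records: list[tuple[int, float]]) -> bytes:
    return b"".join(RECORD.pack(i, s) for i, s in records)


def decode_binary(payload: bytes) -> list[tuple[int, float]]:
    return list(RECORD.iter_unpack(payload))


if __name__ == "__main__":
    records = [(i, i * 0.5) for i in range(1000)]
    as_json = encode_json(records)
    as_binary = encode_binary(records)

    # Both encodings carry the same information.
    assert decode_json(as_json) == records
    assert decode_binary(as_binary) == records
    print(f"JSON:   {len(as_json):,} bytes")
    print(f"binary: {len(as_binary):,} bytes")
```

The JSON payload repeats every field name in every record and spells numbers out as text, while the fixed-width layout spends exactly 12 bytes per record. The price is that the binary form is unreadable without knowing the layout, which is the debugging and interoperability cost mentioned above.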

Why Queues Form And Systems Suddenly Collapse

Large systems rarely fail in a smooth linear way.

More often, systems appear stable until workload crosses certain thresholds, after which latency rises rapidly and cascading failures begin spreading through infrastructure.

Queues are one of the major reasons this happens.

Suppose incoming requests arrive faster than a service can process them consistently. Work begins accumulating faster than the system can complete it.

Requests wait longer. Memory usage grows. Retry traffic increases. Downstream services experience additional pressure. Eventually the system may spend more effort managing backlog and coordination than performing useful work efficiently.

A simplified conceptual pattern looks like this:

Requests Arrive Faster
Than Processing Completes
↓
Queues Grow
↓
Waiting Time Increases
↓
Latency Explodes

This becomes dangerous because latency itself often amplifies load.

Slow systems frequently trigger retries from clients or dependent services. Those retries create additional traffic. Additional traffic increases contention and queue pressure further. The system slows down even more, triggering even more retries and waiting.

Many large outages are fundamentally coordination and queueing failures underneath the surface rather than raw computation failures.

This is one reason modern infrastructure engineering spends enormous effort controlling queue growth, applying backpressure, limiting retries, shedding excess load, and preventing latency amplification from cascading across services.
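Load shedding can be sketched with a bounded queue from Python's standard `queue` module. Instead of letting backlog grow without limit, the hypothetical `offer` helper below rejects work immediately once the queue is full. The bound of 3 is an arbitrary value chosen to keep the demonstration short.

```python
import queue


def offer(work_queue: queue.Queue, request: str) -> bool:
    """Try to enqueue a request; reject it immediately if the queue is full."""
    try:
        work_queue.put_nowait(request)
        return True
    except queue.Full:
        return False  # shed load: the caller gets a fast failure, not a long wait


if __name__ == "__main__":
    work_queue: queue.Queue = queue.Queue(maxsize=3)  # assumed small bound for the demo
    results = [offer(work_queue, f"req-{i}") for i in range(5)]
    print(results)             # [True, True, True, False, False]
    print(work_queue.qsize())  # 3
```

A fast rejection lets clients back off or fail over, whereas an unbounded queue would accept every request and make all of them slow.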

Performance problems in large systems are often less about “running out of CPU” and more about systems losing the ability to keep up with accumulating coordination overhead under pressure.
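The sudden latency rise near saturation can be illustrated with a standard queueing formula. The sketch below uses the M/M/1 model (a single server with random arrivals and random service times), a textbook simplification that the article does not itself name, in which mean time in the system is 1 / (service rate - arrival rate). The capacity of 100 requests per second is an assumed figure.

```python
def mean_time_in_system(arrival_rate: float, service_rate: float) -> float:
    """Mean time a request spends waiting plus being served, in seconds (M/M/1 model)."""
    if arrival_rate >= service_rate:
        return float("inf")  # the queue grows without bound
    return 1.0 / (service_rate - arrival_rate)


if __name__ == "__main__":
    service_rate = 100.0  # assumed capacity: 100 requests per second
    for arrival_rate in (50.0, 90.0, 99.0, 100.0):
        utilization = arrival_rate / service_rate
        latency_ms = mean_time_in_system(arrival_rate, service_rate) * 1000
        print(f"utilization {utilization:4.0%} -> mean latency {latency_ms:,.0f} ms")
```

In this model, going from 50% to 90% utilization raises mean latency from 20 ms to 100 ms, and going from 90% to 99% raises it to a full second. The system looks healthy for most of the range and then degrades abruptly, without any change in how fast the CPU computes.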

Why Tail Latency Matters More Than Average Latency

Average performance numbers can be deeply misleading.

Suppose a service usually responds in 10 milliseconds but occasionally takes 2 seconds. The average latency may still appear statistically acceptable, yet the user experience can feel terrible because occasional slow responses dominate perceived responsiveness.
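A quick calculation shows how far the average can drift from what slow requests actually look like. The sketch below assumes a mix in which 98% of requests take 10 ms and 2% take 2 seconds; the 2% figure is an assumption for illustration, since the text only says "occasionally". The `percentile` helper is a simple nearest-rank implementation written for this example.

```python
import math
import statistics


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Assumed mix: 98% of requests take 10 ms, 2% take 2000 ms.
    latencies_ms = [10.0] * 980 + [2000.0] * 20

    print("mean:", statistics.mean(latencies_ms))  # 49.8
    print("p50: ", percentile(latencies_ms, 50))   # 10.0
    print("p99: ", percentile(latencies_ms, 99))   # 2000.0
```

The mean of about 50 ms describes no real request: the typical request takes 10 ms, and one in fifty takes forty times longer than the mean suggests.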

This becomes even more important in distributed systems.

Modern applications often depend on many downstream services simultaneously. Even if every service is individually “fast most of the time,” occasional slow responses accumulate across dependency chains.

One unexpectedly slow service can delay the entire request path.

A simplified conceptual pattern:

Many Services
↓
One Slow Dependency
↓
Entire Request Slows

As systems become more distributed, the probability of encountering at least one slow component increases across long request chains.
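The growth in that probability can be worked out directly. If each dependency is independently slow with probability p, the chance that a request touching n dependencies hits at least one slow call is 1 - (1 - p)^n. The sketch below assumes independence and a 1% slow rate per dependency, both simplifications.

```python
def p_any_slow(p_slow_each: float, n_dependencies: int) -> float:
    """Probability that at least one of n independent calls is slow."""
    return 1.0 - (1.0 - p_slow_each) ** n_dependencies


if __name__ == "__main__":
    p = 0.01  # assumed: each dependency is slow 1% of the time
    for n in (1, 10, 100):
        print(f"{n:>3} dependencies -> {p_any_slow(p, n):.1%} of requests hit a slow call")
```

With these assumptions, a slow response that is a 1-in-100 event for a single service affects about 9.6% of requests at 10 dependencies and about 63.4% at 100.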

This is why large-scale infrastructure systems focus heavily on:

latency percentiles
timeout control
load balancing
queue management
retry strategies
overload protection

rather than simply optimizing average response times.

At scale, occasional outliers often matter more than averages.

Contention: When Systems Compete For Shared Resources

Another recurring source of performance problems is contention.

Contention occurs when many operations compete for limited shared resources such as:

locks
database connections
thread pools
memory bandwidth
storage bandwidth
network capacity

At low load, contention may remain nearly invisible.

At high load, systems can degrade dramatically because operations increasingly spend time waiting on each other rather than making progress independently.

A simplified conceptual pattern looks like this:

Many Operations
↓
Shared Resource
↓
Waiting
↓
Reduced Throughput

This is one reason scaling systems becomes difficult.

Additional concurrency does not always improve performance. In some cases, adding more threads, requests, or parallel work simply increases synchronization overhead and coordination pressure instead of increasing useful throughput.

Modern performance engineering therefore often revolves around reducing shared bottlenecks and minimizing unnecessary coordination between operations.

Many high-performance systems achieve scalability not by maximizing parallelism blindly, but by carefully reducing contention and preserving locality wherever possible.
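Reducing contention often means touching the shared resource less frequently rather than making it faster. The sketch below compares two ways of counting items across threads: locking on every increment, and accumulating locally before merging once per worker. `CountingLock` and `run_workers` are hypothetical helpers written for this example. The lock acquisition count is used as a deterministic proxy for coordination, since wall-clock timings vary by machine and interpreter.

```python
import threading


class CountingLock:
    """A lock that records how many times it was acquired."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.acquisitions = 0

    def __enter__(self) -> "CountingLock":
        self._lock.acquire()
        self.acquisitions += 1  # safe: only modified while the lock is held
        return self

    def __exit__(self, *exc_info) -> None:
        self._lock.release()


def run_workers(n_workers: int, items_per_worker: int, batch_locally: bool) -> tuple[int, int]:
    """Return (final_total, lock_acquisitions) for the chosen strategy."""
    lock = CountingLock()
    total = [0]

    def worker() -> None:
        if batch_locally:
            local = 0
            for _ in range(items_per_worker):
                local += 1          # no coordination on the hot path
            with lock:
                total[0] += local   # one synchronized merge per worker
        else:
            for _ in range(items_per_worker):
                with lock:          # every increment contends for the lock
                    total[0] += 1

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0], lock.acquisitions


if __name__ == "__main__":
    print(run_workers(4, 10_000, batch_locally=False))  # (40000, 40000)
    print(run_workers(4, 10_000, batch_locally=True))   # (40000, 4)
```

Both strategies produce the same total, but the local-accumulation version synchronizes four times instead of forty thousand.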

Why Distributed Systems Magnify Performance Problems

Distributed systems amplify nearly every performance challenge already present in local systems.

Latency increases because communication crosses networks. Coordination becomes harder because machines fail independently. Caching becomes more complicated because state exists across replicas. Queues become more dangerous because failures propagate across services. Observability becomes harder because requests span many systems simultaneously.

Even relatively simple application requests may involve:

API gateways
authentication services
databases
caching layers
asynchronous jobs
external APIs
multiple backend services coordinating underneath the surface

A simplified request path may look something like this:

Frontend
↓
API Gateway
↓
Service A
↓
Service B
↓
Database
↓
External API

Every additional boundary introduces more:

latency
serialization overhead
queueing risk
synchronization cost
failure possibilities

This is one reason modern software engineering increasingly revolves around reducing unnecessary communication and coordination between systems rather than merely optimizing isolated computation.

As systems grow larger, coordination overhead becomes one of the defining constraints underneath performance.

Why Scaling Is Mostly About Reducing Coordination

One of the deepest insights in modern systems engineering is that scaling systems is often less about increasing computation and more about reducing coordination overhead.

Suppose many machines constantly need synchronization before useful work can continue. Even if the infrastructure contains enormous computational capacity, the system may still become bottlenecked by communication itself.

Machines spend time:

waiting for acknowledgments
synchronizing state
acquiring locks
replicating updates
coordinating ordering
resolving conflicts across distributed infrastructure

This appears repeatedly throughout modern computing:

distributed databases coordinate replication and consistency
microservices coordinate through APIs and queues
cloud systems coordinate workloads across regions
AI training systems synchronize gradients across GPUs
distributed storage systems coordinate replicas and metadata
large-scale analytics systems coordinate partitions and task scheduling

The recurring problem underneath all of them is coordination cost.

A simplified conceptual idea looks something like this:

More Machines
↓
More Communication
↓
More Coordination
↓
Potential Bottlenecks

This is one reason distributed systems become difficult so quickly. Adding machines increases computational resources, but it also increases synchronization complexity.

A surprisingly large amount of scalability engineering therefore revolves around reducing how often systems need to coordinate at all.

High-performance architectures often try to:

partition workloads
minimize shared state
preserve locality
batch communication efficiently
allow components to operate independently whenever possible

Local autonomy is usually cheaper than constant synchronization.

This is also why many systems scale sublinearly. Doubling infrastructure capacity does not necessarily double useful throughput because coordination overhead may grow alongside the system itself.
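One way to put numbers on this effect is Amdahl's law, a classic model that the article does not itself invoke. It assumes a fixed fraction of the work is serialized, for example by coordination, and cannot be spread across machines. The 5% figure below is an assumption. The model is also optimistic, because it ignores coordination cost that grows as machines are added.

```python
def amdahl_speedup(n_machines: int, coordinated_fraction: float) -> float:
    """Amdahl's law: speedup when a fixed fraction of the work cannot be parallelized."""
    parallel_fraction = 1.0 - coordinated_fraction
    return 1.0 / (coordinated_fraction + parallel_fraction / n_machines)


if __name__ == "__main__":
    coordinated_fraction = 0.05  # assumed: 5% of the work is serialized coordination
    for n in (1, 2, 4, 8, 16, 32, 64):
        print(f"{n:>2} machines -> {amdahl_speedup(n, coordinated_fraction):5.2f}x")
```

Under this assumption, 16 machines deliver roughly a 9x speedup rather than 16x, and no number of machines can exceed 20x, because the serialized 5% eventually dominates.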

The Hidden Cost Of Abstractions

Abstractions are one of the most important tools in software engineering.

Without abstractions, modern software systems would become unmanageable. Frameworks, databases, operating systems, APIs, containers, cloud platforms, and programming languages all exist partly to hide complexity behind simpler interfaces.

But abstractions do not remove complexity.

They relocate it.

This becomes important in performance engineering because hidden layers still execute real work underneath the surface.

A seemingly simple operation may trigger:

memory allocation
serialization
network communication
database queries
synchronization
retries
caching logic
filesystem interaction

even when the abstraction itself appears clean and minimal from the application layer.

This is one reason performance problems often surprise developers. The visible code path may appear straightforward while the underlying execution path spans many systems and layers of infrastructure.

For example:

an ORM may make database interaction feel like ordinary object manipulation
a serverless function may make distributed infrastructure look like a single function invocation
a remote API call may resemble a local function call syntactically

But underneath those abstractions, the physical costs still exist:

latency
memory access
serialization
network transfer
synchronization
storage access

Good abstractions are still enormously valuable because they simplify development and reduce cognitive overhead.

The problem arises when developers stop recognizing the physical systems underneath the abstraction layer entirely.

One of the defining traits of strong systems engineers is that they continue reasoning about underlying cost even when abstractions hide implementation details successfully.
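The ORM case can be made concrete without any ORM at all. The sketch below uses Python's built-in `sqlite3` with a hypothetical `authors`/`posts` schema and a small `CountingConnection` wrapper, both invented for this example. It shows how code shaped like innocent attribute access (`post.author.name` in a loop) can turn into one query per row, compared with a single JOIN that fetches the same data.

```python
import sqlite3


class CountingConnection:
    """Wraps a sqlite3 connection and counts executed queries."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        self.queries = 0

    def query(self, sql: str, params: tuple = ()) -> list[tuple]:
        self.queries += 1
        return self._conn.execute(sql, params).fetchall()


def make_db() -> sqlite3.Connection:
    """Build a small in-memory database with 10 authors and 10 posts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT)")
    conn.executemany("INSERT INTO authors VALUES (?, ?)",
                     [(i, f"author-{i}") for i in range(1, 11)])
    conn.executemany("INSERT INTO posts VALUES (?, ?, ?)",
                     [(i, i, f"post-{i}") for i in range(1, 11)])
    return conn


def titles_with_authors_lazy(db: CountingConnection) -> list[tuple[str, str]]:
    """One query for the posts, then one more per post: 1 + N queries."""
    result = []
    for title, author_id in db.query("SELECT title, author_id FROM posts ORDER BY id"):
        (name,) = db.query("SELECT name FROM authors WHERE id = ?", (author_id,))[0]
        result.append((title, name))
    return result


def titles_with_authors_joined(db: CountingConnection) -> list[tuple[str, str]]:
    """The same data fetched with a single JOIN."""
    return db.query(
        "SELECT p.title, a.name FROM posts p "
        "JOIN authors a ON a.id = p.author_id ORDER BY p.id"
    )


if __name__ == "__main__":
    lazy_db = CountingConnection(make_db())
    joined_db = CountingConnection(make_db())
    assert titles_with_authors_lazy(lazy_db) == titles_with_authors_joined(joined_db)
    print("lazy queries:  ", lazy_db.queries)    # 11
    print("joined queries:", joined_db.queries)  # 1
```

With an in-memory database the difference is barely noticeable. Against a database reached over a network, each of those extra queries would be a full round trip.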

Why Performance Problems Often Look Random

Many real-world performance issues appear inconsistent or unpredictable from the surface.

A system may behave perfectly under moderate load and then degrade rapidly under slightly heavier traffic. An application may feel fast most of the time while occasionally experiencing severe latency spikes. One query may complete instantly while another structurally similar query becomes dramatically slower.

These behaviors often feel mysterious until you begin viewing systems through the lens of:

queues
contention
memory locality
coordination overhead
caching behavior
latency amplification

Modern systems contain many thresholds where small workload changes trigger disproportionately large effects.

For example:

a cache miss may suddenly force storage retrieval
a queue crossing capacity may amplify waiting time rapidly
contention may rise sharply once too many operations compete simultaneously
retries may unintentionally overload already degraded services
synchronization overhead may dominate once systems scale beyond certain sizes

Performance engineering therefore often involves identifying nonlinear behavior hidden underneath apparently stable systems.

This is one reason benchmarking and production behavior frequently differ. Controlled benchmarks may not expose:

queueing effects
contention patterns
tail latency
network variability
coordination bottlenecks

that emerge under realistic load conditions.

Real systems are dynamic environments, not isolated algorithms running in perfect conditions.

Why Software Performance Is Ultimately About Physics

At a sufficiently deep level, modern software performance is constrained by physical reality.

Information occupies space.

Memory access takes time.

Signals travel at finite speed.

Storage retrieval has latency.

Networks introduce distance.

Synchronization requires communication.

Hardware has bandwidth limits.

Caches have finite capacity.

Queues consume memory.

Heat affects processor behavior.

Power consumption constrains hardware scaling.

Large portions of software architecture therefore exist partly to manage physical constraints rather than purely computational logic.

This is why the same performance patterns appear repeatedly across computing history:

locality matters
caching matters
coordination is expensive
bandwidth is finite
latency accumulates
waiting dominates many workloads
reducing unnecessary work is often more valuable than accelerating computation

The abstractions evolve, but the underlying constraints remain remarkably consistent.

Modern cloud infrastructure, databases, distributed systems, browsers, AI systems, storage engines, and networking stacks are all ultimately shaped by the same physical realities governing the movement and coordination of information.

The Most Important Performance Lesson

Most performance problems are not fundamentally about computation.

They are about systems:

waiting on information
coordinating across boundaries
contending for shared resources
retrieving distant data
synchronizing state
performing unnecessary work repeatedly

Once you begin seeing systems this way, many areas of software engineering start looking different:

database indexes become locality optimizations
caches become latency reduction systems
distributed systems become coordination problems
APIs become expensive communication boundaries
scaling becomes synchronization management
performance optimization becomes work reduction

This shift in perspective is important because it changes how engineers reason about systems entirely.

Instead of asking:

“How do we make this computation faster?”

experienced engineers increasingly ask:

“Why is the system waiting?”
“Where is the information coming from?”
“What coordination is happening?”
“Can we avoid this work entirely?”
“Can we reduce communication?”
“Can we improve locality?”
“Can we remove this dependency?”
“Can we avoid moving this data?”

That is the deeper mental model underneath modern performance engineering.

And once you internalize it, many seemingly unrelated areas of computing begin collapsing into the same recurring architectural truth:

modern software performance is largely the problem of moving and coordinating information efficiently through systems constrained by latency, bandwidth, memory, synchronization, and physical distance.

Why Great Performance Engineering Often Looks Like Simplicity

One of the most interesting patterns in large-scale systems is that high-performance architectures often appear deceptively simple from the outside.

This is partly because unnecessary coordination, unnecessary abstraction layers, unnecessary communication, and unnecessary work all accumulate hidden cost.

Systems that survive large scale reliably are often the ones that remove complexity aggressively rather than continuously adding more optimization machinery on top.

This does not mean simple systems are easy to build.

Quite often the opposite is true.

It is relatively easy to construct architectures containing:

excessive services
unnecessary synchronization
deeply layered abstractions
inefficient communication patterns
redundant data movement
accidental contention

It is much harder to design systems where information flows efficiently with minimal coordination overhead.

This is one reason experienced infrastructure engineers frequently become suspicious of architectures requiring too much communication between components.

Every boundary introduces additional:

latency
serialization overhead
retries
queueing behavior
observability complexity
failure possibilities

The performance problem is rarely isolated to one operation.

The problem is usually the accumulation of many individually “reasonable” costs across large systems.

A database query taking 5 milliseconds may appear harmless. An API call adding another 10 milliseconds may also seem acceptable. Serialization overhead, cache misses, authentication checks, network retries, logging systems, observability pipelines, and synchronization delays may each seem manageable on their own too.

But modern systems rarely perform one isolated operation.

They perform chains of operations repeatedly at enormous scale.

A simplified conceptual pattern:

Small Costs
+
Repeated Frequently
+
Many Layers
=
Large System-Wide Cost
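
The pattern above can be made concrete with a little arithmetic. The sketch below is only an illustration: the 5 ms query and 10 ms API call come from the text, while every other cost, the layer count, and the request volume are assumed values chosen for the example.

```python
# Per-request costs in milliseconds. The first two values come from the text;
# the rest are illustrative assumptions, not measurements.
STEP_COSTS_MS = {
    "database_query": 5.0,
    "api_call": 10.0,
    "serialization": 1.0,
    "auth_check": 2.0,
    "logging": 0.5,
}


def request_latency_ms(step_costs, layers):
    """Latency of one request when each layer repeats the same sequential steps."""
    return sum(step_costs.values()) * layers


def daily_wait_hours(step_costs, layers, requests_per_day):
    """Cumulative time spent on these steps per day, summed over all requests."""
    total_ms = request_latency_ms(step_costs, layers) * requests_per_day
    return total_ms / 1000 / 3600


# Assumed shape: 4 layers, 10 million requests per day.
per_request = request_latency_ms(STEP_COSTS_MS, layers=4)
hours = daily_wait_hours(STEP_COSTS_MS, layers=4, requests_per_day=10_000_000)
print(f"per request: {per_request:.1f} ms")    # per request: 74.0 ms
print(f"per day:     {hours:.1f} hours")       # per day:     205.6 hours
```

No single step here exceeds 10 ms, yet under these assumptions each request spends 74 ms in overhead, and the system as a whole accumulates more than 200 hours of waiting per day.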

This is one reason local reasoning often fails in large systems. Engineers may optimize one component successfully while the overall architecture remains fundamentally inefficient because the coordination model itself creates too much overhead.

Strong performance engineering therefore often involves removing unnecessary work entirely rather than accelerating isolated computations.

Why Throughput And Latency Are Different Problems

Another important systems distinction is the difference between throughput and latency.

Latency measures how long one operation takes to complete.

Throughput measures how much total work a system can complete over time.

These are related, but not identical.
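
One common way to express the relationship is Little's Law: the average number of requests in flight equals throughput multiplied by average latency. The sketch below rearranges it to compare two hypothetical systems. The system names and all numbers are assumptions made up for the illustration.

```python
def throughput_per_second(in_flight, latency_seconds):
    """Little's Law rearranged: throughput = requests in flight / latency."""
    return in_flight / latency_seconds


# System A (assumed): very low latency, but handles one request at a time.
system_a = throughput_per_second(in_flight=1, latency_seconds=0.01)

# System B (assumed): 50x higher latency, but 200 requests in flight at once.
system_b = throughput_per_second(in_flight=200, latency_seconds=0.5)

print(f"A: {system_a:.0f} requests/s at 10 ms latency")   # A: 100 requests/s ...
print(f"B: {system_b:.0f} requests/s at 500 ms latency")  # B: 400 requests/s ...
```

Under these assumptions, system B completes four times as much work per second while every individual request takes fifty times longer. Which system is "faster" depends on which metric the workload cares about.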

Some systems optimize for low latency because responsiveness matters most:

interactive applications
trading systems
gaming infrastructure
search engines
user-facing APIs

Other systems optimize primarily for throughput:

analytics pipelines
batch processing
video encoding
distributed training jobs
large-scale data processing

Improving one metric often worsens the other.

For example, batching operations together often improves throughput because systems process more work efficiently in larger groups. But batching may also increase latency because requests wait longer before execution begins.
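
A minimal sketch of that tradeoff, under a deliberately simple model: each batch pays a fixed overhead plus a per-item cost, requests arrive at a steady interval, and a batch starts only once it is full. The function name and all of the numbers are assumptions for illustration, not measurements of any real system.

```python
def batch_stats(batch_size, fixed_overhead_ms, per_item_ms, arrival_interval_ms):
    """Return (capacity in requests/s, worst-case latency in ms) for a batch size.

    Assumed model: a batch starts only when full, and processing a batch costs
    a fixed overhead plus a per-item cost.
    """
    processing_ms = fixed_overhead_ms + per_item_ms * batch_size
    capacity_per_s = batch_size / processing_ms * 1000
    # The first request in a batch waits for the remaining items to arrive.
    fill_wait_ms = (batch_size - 1) * arrival_interval_ms
    worst_latency_ms = fill_wait_ms + processing_ms
    return capacity_per_s, worst_latency_ms


for size in (1, 10, 100):
    capacity, latency = batch_stats(
        batch_size=size,
        fixed_overhead_ms=10.0,    # assumed cost paid once per batch
        per_item_ms=1.0,           # assumed cost paid per request
        arrival_interval_ms=2.0,   # assumed gap between incoming requests
    )
    print(f"batch={size:>3}  capacity={capacity:6.0f}/s  worst latency={latency:5.0f} ms")
```

With these assumed numbers, moving from a batch of 1 to a batch of 100 raises capacity from about 91 to about 909 requests per second, while worst-case latency grows from 11 ms to 308 ms. The fixed overhead is amortized across more requests, but the earliest request in each batch pays for it by waiting.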

Similarly:

aggressive synchronization may improve consistency while reducing throughput
large caches may improve latency while increasing memory usage and invalidation complexity
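
The latency side of the cache tradeoff can be sketched with the standard weighted-average formula for access time. The hit and miss costs below are assumed values picked for illustration. The sketch does not model memory usage or invalidation, which are the costs the cache is traded against.

```python
def average_access_ms(hit_rate, hit_ms, miss_ms):
    """Expected access time: hits served from cache, misses from the slow path."""
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms


# Assumed costs: 0.1 ms for a cache hit, 10 ms for a miss that goes to storage.
for hit_rate in (0.50, 0.90, 0.99):
    avg = average_access_ms(hit_rate, hit_ms=0.1, miss_ms=10.0)
    print(f"hit rate {hit_rate:.0%}: average access {avg:.3f} ms")
```

Under these assumptions, raising the hit rate from 90% to 99% cuts average access time from 1.09 ms to roughly 0.2 ms. The misses dominate the average, which is why the last few percentage points of hit rate tend to matter most, and also why they tend to cost the most memory.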

This is one reason performance engineering is fundamentally contextual.

There is no universally “fast” system independent of workload requirements and operational constraints.

The correct architecture depends heavily on:

access patterns
workload shape
concurrency behavior
consistency requirements
infrastructure limits
operational goals

Real systems engineering is therefore mostly tradeoff management under physical constraints.

Why Hardware Progress Changed Software Architecture

Modern software architecture evolved partly because hardware bottlenecks changed over time.

Earlier computing environments were often constrained primarily by raw computation. CPUs were comparatively slow, memory was extremely limited, and storage was small and slow.

Modern systems face a different balance of constraints.

Processor speed improved far faster than memory latency did. Network infrastructure scaled globally. Storage capacity exploded. Distributed systems became economically viable. Cloud infrastructure made horizontal scaling accessible.

As a result, many modern bottlenecks shifted away from pure computation toward:

memory access
network coordination
storage retrieval
synchronization overhead
distributed communication

This is one reason contemporary systems architecture looks so different from earlier generations of software engineering.

Large portions of modern infrastructure exist specifically to manage:

latency
locality
coordination
caching
distributed state
communication overhead

rather than merely maximizing raw computation.

Understanding this historical shift is important because many modern engineering patterns only make sense once you realize which bottlenecks actually dominate current systems.

Why AI Systems Are Also Performance Systems

Modern AI infrastructure follows many of the same physical principles.

People often imagine AI systems primarily as “models performing intelligent computation.” But production AI systems are heavily shaped by the same constraints governing other large-scale infrastructure:

memory bandwidth
data movement
network communication
storage throughput
caching
batching
coordination overhead
latency management

Large language models, for example, are extremely computationally intensive, but deployment bottlenecks often involve:

GPU memory limits
inference latency
distributed synchronization
token throughput
retrieval overhead
bandwidth constraints
infrastructure cost
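
One rough way to see why memory bandwidth shows up in this list is a back-of-envelope bound on token generation. The sketch below assumes a simplified model that is not from the original text: one token is generated at a time, every model weight is read from GPU memory once per token, and everything else (attention caches, compute time, batching) is ignored. The function name, model size, precision, and bandwidth figure are all hypothetical.

```python
def max_decode_tokens_per_second(param_count, bytes_per_param, memory_bandwidth_bytes_per_s):
    """Upper bound on tokens/s if each token requires reading all weights once.

    Simplifying assumption: single-sequence decoding, weights only, no overlap
    or caching effects. Real systems will differ.
    """
    model_bytes = param_count * bytes_per_param
    return memory_bandwidth_bytes_per_s / model_bytes


# Assumed example: 7 billion parameters, 2 bytes each, 1 TB/s memory bandwidth.
bound = max_decode_tokens_per_second(
    param_count=7e9,
    bytes_per_param=2,
    memory_bandwidth_bytes_per_s=1e12,
)
print(f"bandwidth-bound ceiling: about {bound:.0f} tokens/s")  # about 71 tokens/s
```

In this simplified model the ceiling is set entirely by how fast weights can be moved, not by how fast the processor can multiply them. That is the same waiting problem described throughout this article, appearing in a different layer.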

Training large models is also deeply constrained by communication overhead between GPUs and machines, because distributed training requires frequent synchronization of enormous sets of gradients and parameters across hardware.
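
The scale of that synchronization cost can be estimated with simple arithmetic. The sketch below assumes one common scheme, ring all-reduce, in which each device sends roughly 2 × (N − 1) / N times the payload size per synchronization. The original text does not name a specific scheme, so this choice, along with the model size, precision, device count, and link bandwidth, is an assumption for illustration. The estimate ignores per-message latency and any overlap between communication and computation.

```python
def ring_allreduce_seconds(param_count, bytes_per_param, num_devices, link_bytes_per_s):
    """Bandwidth-only estimate of one ring all-reduce over the full payload."""
    if num_devices < 2:
        return 0.0  # nothing to synchronize on a single device
    payload_bytes = param_count * bytes_per_param
    # Each device sends about 2 * (N - 1) / N of the payload in a ring all-reduce.
    bytes_sent_per_device = 2 * (num_devices - 1) / num_devices * payload_bytes
    return bytes_sent_per_device / link_bytes_per_s


# Assumed example: 7 billion values, 2 bytes each, 8 devices, 100 GB/s links.
seconds = ring_allreduce_seconds(
    param_count=7e9,
    bytes_per_param=2,
    num_devices=8,
    link_bytes_per_s=100e9,
)
print(f"one synchronization: about {seconds * 1000:.0f} ms")  # about 245 ms
```

Under these assumptions, every training step spends roughly a quarter of a second just moving data between devices unless that transfer is hidden behind computation. Slower links between machines make the number correspondingly worse.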

This is one reason modern AI engineering increasingly overlaps with distributed systems engineering and infrastructure optimization rather than existing purely inside machine learning research.

The same recurring principles still apply:

moving information is expensive
coordination is expensive
locality matters
caching matters
waiting dominates many workloads

The abstractions changed.

The physics did not.

Conclusion

Most developers initially learn software through abstractions:

programming languages
frameworks
APIs
databases
cloud platforms

Those abstractions are useful because they make modern software development possible.

But underneath all of them, systems remain constrained by physical realities:

memory latency
bandwidth limits
storage access
network distance
synchronization cost
contention
queueing behavior
communication overhead

Modern software performance is therefore not mainly the story of “fast code.”

It is the story of information moving through physical systems under constraints.

Once you internalize this, many areas of software engineering start looking fundamentally different.

Databases become systems for minimizing retrieval cost. Distributed systems become coordination problems. Caches become locality optimizations. APIs become expensive communication boundaries. Scalability becomes the management of synchronization and waiting rather than merely adding hardware.

And perhaps most importantly, performance engineering stops looking like isolated optimization tricks and starts looking like systems reasoning.

The strongest engineers are often not the ones writing the cleverest low-level code.

They are the ones who understand:

where systems spend time waiting
where information moves unnecessarily
where coordination becomes expensive
where complexity quietly accumulates underneath abstractions

Because at scale, modern software performance is ultimately governed by a remarkably consistent set of truths:

distance matters, waiting dominates, coordination is expensive, and moving information is often harder than computing on it once it arrives.