Serving Substrate

Systems Runtime

Reason about the C++ and networking boundary underneath accelerator services: resource lifetime, request admission, queues, communication layers, and failures.

Request lifecycle
socket/event -> admission -> bounded queue -> GPU task
                 |               |              |
               reject        timeout/cancel   release state

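The diagram can be read as a small state machine. Every path ends in exactly one terminal stage, and that terminal stage is where per-request state is released. This is a minimal sketch: the stage names and the `advance` helper are illustrative and not taken from any particular runtime.

```cpp
#include <cassert>

// Hypothetical stages for the lifecycle above: admission either queues or
// rejects, and a queued or running request ends in exactly one terminal stage.
enum class Stage { Received, Queued, Running, Completed, Rejected, Expired, Cancelled };

// Terminal stages are the points where per-request state must be released.
constexpr bool is_terminal(Stage s) {
    return s == Stage::Completed || s == Stage::Rejected ||
           s == Stage::Expired || s == Stage::Cancelled;
}

constexpr bool can_transition(Stage from, Stage to) {
    switch (from) {
        case Stage::Received:  // admission decision
            return to == Stage::Queued || to == Stage::Rejected;
        case Stage::Queued:    // deadline or client cancel while waiting
            return to == Stage::Running || to == Stage::Expired || to == Stage::Cancelled;
        case Stage::Running:   // GPU work may finish, time out, or be cancelled
            return to == Stage::Completed || to == Stage::Expired || to == Stage::Cancelled;
        default:
            return false;      // terminal stages have no successors
    }
}

// Advances a request; an illegal transition is a runtime bug, not a user error.
inline Stage advance(Stage from, Stage to) {
    assert(can_transition(from, to));
    return to;
}
```

Keeping the terminal stages explicit makes it easy to check that each one runs the same release path.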
C++ Ownership

Failure paths are part of performance correctness


Use RAII-style owned handles for sockets, staging buffers, streams, and communicators. A timed-out or cancelled request must release each resource exactly once, even when its completion event arrives after the timeout or cancellation has already run.
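A minimal C++ sketch of the exactly-once rule, assuming the timeout path and the completion path share one guard object (for example through a `std::shared_ptr`). The class name `ReleaseOnce` and its release callback are hypothetical. An atomic `exchange` decides which path performs the release, and the destructor acts as an RAII backstop if no path fired.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical guard shared by the timeout, cancellation, and completion paths.
class ReleaseOnce {
public:
    explicit ReleaseOnce(std::function<void()> release) : release_(std::move(release)) {
        assert(release_ && "a release action is required");
    }
    ReleaseOnce(const ReleaseOnce&) = delete;
    ReleaseOnce& operator=(const ReleaseOnce&) = delete;

    // RAII backstop: if no path released the resource, the destructor does.
    ~ReleaseOnce() { release(); }

    // Returns true only for the single caller that performed the release.
    bool release() {
        if (released_.exchange(true, std::memory_order_acq_rel)) {
            return false;  // another path already released it
        }
        release_();
        return true;
    }

private:
    std::atomic<bool> released_{false};
    std::function<void()> release_;
};
```

Whichever of the timer, the cancellation handler, or the late completion calls `release()` first does the work. The others get `false` and skip cleanup.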

Interview test

Name who owns each buffer before and after an asynchronous submission, and what destroys it on normal completion, cancellation, or exception.
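One way to answer the question in code, sketched under the assumption that a per-request table owns in-flight buffers. `StagingBuffer` and `PendingOps` are hypothetical names. Ownership passes from the caller's `std::unique_ptr` into the table on submission and ends when completion or cancellation erases the entry. An exception before submission leaves the buffer in the caller's `unique_ptr`, which frees it during unwinding.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical staging buffer; a real one might wrap pinned host memory.
struct StagingBuffer {
    explicit StagingBuffer(std::size_t n) : bytes(n) {}
    std::vector<unsigned char> bytes;
};

class PendingOps {
public:
    // Before: the caller owns `buf`. After a successful submit: the table owns it.
    // On a duplicate id, try_emplace leaves `buf` untouched, so the by-value
    // parameter destroys it when this function returns.
    bool submit(int id, std::unique_ptr<StagingBuffer> buf) {
        assert(buf && "submit requires a buffer");
        return ops_.try_emplace(id, std::move(buf)).second;
    }

    // Completion and cancellation both end ownership by erasing the entry.
    // A second event for the same id finds nothing and returns false, so a late
    // completion after cancellation cannot free the buffer twice.
    // Assumption: the device has finished with the buffer when either is called;
    // a real runtime would defer destruction on cancellation until the stream's
    // completion event fires.
    bool complete(int id) { return ops_.erase(id) == 1; }
    bool cancel(int id) { return ops_.erase(id) == 1; }

    std::size_t in_flight() const { return ops_.size(); }

private:
    std::unordered_map<int, std::unique_ptr<StagingBuffer>> ops_;
};
```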

Admission

Bound queue memory before overload chooses for you


Stable operating point

Under steady-state conditions (long-run arrivals equal departures), Little's law relates the average in-flight work $L$ to the arrival rate $\lambda$ and the average time in system $W$: $L = \lambda W$.
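A worked example with illustrative numbers, not measurements: 2,000 requests per second, 50 ms average time in system, and an assumed $b = 4$ MiB of staging memory held per in-flight request, giving queue memory $M = L\,b$. The last line shows that latency growth alone raises memory pressure, even at a fixed arrival rate.

```latex
\begin{aligned}
L  &= \lambda W = (2000\ \text{req/s})(0.05\ \text{s}) = 100\ \text{requests in flight} \\
M  &= L\,b = 100 \times 4\ \text{MiB} = 400\ \text{MiB} \\
W' &= 0.10\ \text{s} \;\Rightarrow\; L' = \lambda W' = 200,\quad M' = L'\,b = 800\ \text{MiB}
\end{aligned}
```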

Overload behavior

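Once arrivals persistently outpace service, there is no steady state and in-flight work grows without bound, so choose an explicit policy: rejection, deferral, prioritization, or shedding. Do not let growing host or device memory become the de facto queue policy.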
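A sketch of an admission queue that bounds both item count and bytes and rejects explicitly when either bound is hit. It assumes each request carries a payload size and an absolute deadline. `AdmissionQueue`, `Request`, and `Admit` are hypothetical names. Deferral and prioritization are left out to keep the sketch short.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>

using Clock = std::chrono::steady_clock;

// Hypothetical request record: payload size and an absolute deadline.
struct Request {
    int id;
    std::size_t bytes;
    Clock::time_point deadline;
};

enum class Admit { Accepted, RejectedFull, RejectedBytes };

// Admission bounded by item count and byte budget, so overload produces
// explicit rejections instead of unbounded memory growth.
class AdmissionQueue {
public:
    AdmissionQueue(std::size_t max_items, std::size_t max_bytes)
        : max_items_(max_items), max_bytes_(max_bytes) {
        assert(max_items > 0 && max_bytes > 0);
    }

    Admit try_push(const Request& r) {
        std::lock_guard<std::mutex> lock(mu_);
        if (q_.size() >= max_items_) return Admit::RejectedFull;
        if (bytes_ + r.bytes > max_bytes_) return Admit::RejectedBytes;
        bytes_ += r.bytes;
        q_.push_back(r);
        return Admit::Accepted;
    }

    // Pops the next request whose deadline has not passed; expired requests
    // are dropped and counted instead of being sent to the GPU.
    std::optional<Request> pop_live(Clock::time_point now) {
        std::lock_guard<std::mutex> lock(mu_);
        while (!q_.empty()) {
            Request r = q_.front();
            q_.pop_front();
            bytes_ -= r.bytes;
            if (r.deadline > now) return r;
            ++expired_;
        }
        return std::nullopt;
    }

    std::size_t expired() const {
        std::lock_guard<std::mutex> lock(mu_);
        return expired_;
    }

private:
    mutable std::mutex mu_;
    std::deque<Request> q_;
    std::size_t bytes_ = 0;
    std::size_t expired_ = 0;
    const std::size_t max_items_;
    const std::size_t max_bytes_;
};
```

The caller maps `RejectedFull` and `RejectedBytes` to an explicit overload response. Expired requests are dropped before they consume GPU time.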

Transport Layers

Name the boundary before naming the optimization

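| Layer                   | What it moves                           | Questions to ask                                          |
|-------------------------|-----------------------------------------|-----------------------------------------------------------|
| TCP / HTTP request path | Client payloads and streamed responses  | Admission, reconnect, deadline, framing                   |
| RDMA-style transfer     | Registered memory between hosts         | Registration, ownership, failure boundary                 |
| NVLink / device fabric  | Data across the device-access topology  | Locality and available bandwidth                          |
| NCCL collectives        | Distributed tensors                     | Collective order, communicator failure, measured duration |
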
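For the TCP row, framing and admission meet at the first bytes of a message: a server can reject a frame based on its declared length before buffering the payload. The sketch assumes a hypothetical wire format with a 4-byte big-endian length prefix (HTTP has its own framing rules), and `FrameDecoder` is an illustrative name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class FrameStatus { NeedMore, Ready, TooLarge };

// Hypothetical format: 4-byte big-endian payload length, then the payload.
// The declared length is checked against a limit before the payload is
// accepted, so an oversized frame is rejected at the connection edge.
class FrameDecoder {
public:
    explicit FrameDecoder(std::uint32_t max_payload) : max_payload_(max_payload) {
        assert(max_payload > 0);
    }

    // Appends bytes read from the socket.
    void feed(const std::uint8_t* data, std::size_t n) {
        buf_.insert(buf_.end(), data, data + n);
    }

    // Extracts one complete frame into `out` when available.
    FrameStatus next(std::vector<std::uint8_t>& out) {
        if (buf_.size() < 4) return FrameStatus::NeedMore;
        std::uint32_t len = (std::uint32_t(buf_[0]) << 24) | (std::uint32_t(buf_[1]) << 16) |
                            (std::uint32_t(buf_[2]) << 8) | std::uint32_t(buf_[3]);
        if (len > max_payload_) return FrameStatus::TooLarge;  // caller closes the connection
        if (buf_.size() < 4 + std::size_t(len)) return FrameStatus::NeedMore;
        out.assign(buf_.begin() + 4, buf_.begin() + 4 + len);
        buf_.erase(buf_.begin(), buf_.begin() + 4 + len);
        return FrameStatus::Ready;
    }

private:
    std::uint32_t max_payload_;
    std::vector<std::uint8_t> buf_;
};
```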
Troubleshooting

Distinguish overload, communication failure, and slow compute


Collect queue wait, rejection rate, deadline expiry, buffer high-water marks, kernel timelines, collective durations, asynchronous communicator errors, and request cancellation traces. A single utilization number cannot tell these causes apart.
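A sketch of cheap counters for a few of these signals. It uses relaxed atomics so the hot path never takes a lock. `ServingCounters` and its field names are hypothetical. Queue-wait and collective-duration distributions would need histograms, and kernel timelines come from a profiler rather than from counters.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical process-wide counters; a real service would export them.
struct ServingCounters {
    std::atomic<std::uint64_t> admitted{0};
    std::atomic<std::uint64_t> rejected{0};
    std::atomic<std::uint64_t> deadline_expired{0};
    std::atomic<std::uint64_t> cancelled{0};
    std::atomic<std::uint64_t> buffer_bytes{0};
    std::atomic<std::uint64_t> buffer_high_water{0};

    void on_buffer_alloc(std::uint64_t n) {
        std::uint64_t now = buffer_bytes.fetch_add(n, std::memory_order_relaxed) + n;
        // Raise the high-water mark with a CAS loop; on failure `hw` is reloaded,
        // and the loop stops once another thread has stored a value >= now.
        std::uint64_t hw = buffer_high_water.load(std::memory_order_relaxed);
        while (now > hw &&
               !buffer_high_water.compare_exchange_weak(hw, now, std::memory_order_relaxed)) {
        }
    }

    void on_buffer_free(std::uint64_t n) {
        std::uint64_t prev = buffer_bytes.fetch_sub(n, std::memory_order_relaxed);
        assert(prev >= n && "freed more bytes than were allocated");
        (void)prev;
    }

    // Share of admission attempts that were rejected; 0 if nothing arrived.
    double rejection_rate() const {
        std::uint64_t a = admitted.load(), r = rejected.load();
        return (a + r) == 0 ? 0.0 : double(r) / double(a + r);
    }
};
```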
