Silicon DNA: The Physics of Clock States, Voltage Profiles, and V/F Curve Forensics

February 26, 2026 | By Assurd Engineering Lab

Silicon DNA: Reading the Electrical Biography of a GPU

"A GPU's voltage-frequency curve is not a specification — it is a fingerprint. And like all fingerprints, it changes with age, abuse, and accumulated damage." — Assurd Engineering Lab, Internal SOP v4.2

Every modern GPU arrives from the factory with a silicon substrate that has been electrically characterized during production testing. The resulting V/F curve — the mapping of operating frequency to the minimum voltage required to sustain it reliably — is the fundamental document of a chip's electrical identity. It determines which power states the boost algorithm selects, how aggressively the card responds to thermal headroom, and how much voltage margin the firmware reserves against process variation.

At Assurd Techlabs, reading the V/F curve of a submitted GPU is the equivalent of a blood panel in clinical medicine. It tells us not just what the card is doing today, but what has been done to it over its operational lifetime. This document explains the physics of how clock states and voltages work, why they degrade, and how our forensic methodology interprets the resulting electrical signatures.

Part I: The Physics of Semiconductor Switching — Why Voltage and Frequency Are Coupled

The CMOS Switching Model

Modern GPU silicon is built on CMOS (Complementary Metal-Oxide-Semiconductor) technology: Samsung 8nm for NVIDIA's consumer Ampere parts and TSMC 4N for Ada Lovelace. Every logic gate in the shader array, the rasterizer, the tensor cores, and the memory interface is built from complementary networks of PMOS and NMOS transistors that switch an output node between supply voltage ( $V_{DD}$ ) and ground. The simplest case is the CMOS inverter, a single PMOS/NMOS pair.

The dynamic power dissipation of a CMOS circuit is governed by:

P_{\text{dynamic}} = \alpha \cdot C_L \cdot V_{DD}^2 \cdot f

Where:

$\alpha$ = Activity factor (fraction of gates switching per clock cycle)
$C_L$ = Load capacitance (the total capacitive load on all switching nodes)
$V_{DD}$ = Supply voltage
$f$ = Operating frequency

This relationship has profound implications: power scales with the square of voltage. Halving the voltage at a fixed frequency reduces dynamic power to one-quarter, which is why undervolting is so effective at reducing power draw without sacrificing clock speed.
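As a rough illustration of the quadratic voltage dependence, the sketch below evaluates the dynamic power formula at two supply voltages. The activity factor, switched capacitance, and clock are assumed round numbers chosen only to land in a plausible power range; they are not measurements of any real GPU.

```python
def dynamic_power(alpha: float, c_load: float, v_dd: float, freq_hz: float) -> float:
    """Dynamic CMOS power in watts: P = alpha * C_L * V_DD^2 * f."""
    return alpha * c_load * v_dd ** 2 * freq_hz


# Illustrative values only (assumed, not measured).
ALPHA = 0.2       # activity factor
C_LOAD = 8.0e-7   # effective switched capacitance, farads
FREQ = 2.5e9      # clock frequency, hertz

p_stock = dynamic_power(ALPHA, C_LOAD, 1.05, FREQ)
p_undervolt = dynamic_power(ALPHA, C_LOAD, 0.95, FREQ)
saving = 1.0 - p_undervolt / p_stock

print(f"stock: {p_stock:.0f} W, undervolted: {p_undervolt:.0f} W, saving: {saving:.1%}")
```

Because frequency is held constant, the fractional saving depends only on the voltage ratio squared, not on the assumed capacitance or activity factor.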

The Voltage-Frequency Relationship and Setup Timing

The reason a higher clock frequency requires a higher supply voltage is rooted in logic propagation delay. Each gate transition — from low to high or high to low — takes a finite time, determined by the drive strength of the transistor and the capacitance it must charge or discharge. This is the propagation delay ( $t_{pd}$ ).

At higher clock frequencies, the time budget for each logic stage is reduced. The logic must complete its computation and settle to a valid output level at least one setup time ( $t_{setup}$ ) before the next clock edge arrives. If this budget is violated, the flip-flop may capture a stale value or enter a metastable state, and the result is a timing violation: the digital equivalent of the circuit computing the wrong answer.

Higher voltage increases the drain current of each transistor (drain current grows with the overdrive voltage $V_{GS} - V_{th}$ : roughly linearly in velocity-saturated short-channel devices, quadratically in the classic long-channel model), which increases gate drive strength, which reduces $t_{pd}$ , which creates timing margin at higher frequencies. This is why voltage and frequency are inextricably linked.

f_{\text{max}} \approx \frac{1}{t_{pd,\text{critical path}} + t_{setup} + t_{skew}}

The GPU's firmware encodes this relationship as the V/F table — a lookup table mapping each voltage step to the maximum frequency sustainable with adequate timing margin.
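A V/F table of this kind can be pictured as a sorted lookup. The sketch below is a minimal model with made-up voltage and frequency steps; the table contents and function names are hypothetical and do not reflect any real VBIOS layout.

```python
from bisect import bisect_right
from typing import Optional

# Hypothetical V/F table: (voltage in mV, max stable frequency in MHz), sorted by voltage.
VF_TABLE = [(700, 1200), (800, 1800), (900, 2300), (1000, 2600), (1050, 2750)]


def max_frequency_mhz(voltage_mv: int) -> Optional[int]:
    """Highest table frequency whose voltage step does not exceed the available voltage."""
    voltages = [v for v, _ in VF_TABLE]
    idx = bisect_right(voltages, voltage_mv)
    if idx == 0:
        return None  # below the lowest voltage step in the table
    return VF_TABLE[idx - 1][1]


def min_voltage_mv(freq_mhz: int) -> Optional[int]:
    """Lowest table voltage step that supports at least the requested frequency."""
    for voltage, freq in VF_TABLE:
        if freq >= freq_mhz:
            return voltage
    return None  # frequency is beyond the top of the table


print(max_frequency_mhz(950), min_voltage_mv(2400))
```

Reading the table in both directions mirrors how it is used: the boost logic asks what frequency a voltage allows, while a forensic comparison asks what voltage a frequency demands.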

Part II: Silicon Binning and the Lottery of Process Variation

Why Two Identical SKUs Have Different Clocks

TSMC and Samsung do not produce uniform silicon. Photolithographic patterning, ion implantation, etch depth, and deposition uniformity vary across the wafer surface — a consequence of fundamental physical limits in semiconductor manufacturing. The result is that transistors on a given die have subtly different threshold voltages ( $V_{th}$ ), gate oxide thicknesses ( $t_{ox}$ ), and channel lengths.

A die with transistors skewed toward lower $V_{th}$ and shorter effective channel lengths will switch faster at a given voltage — it is a "good" bin. A die with higher $V_{th}$ or more process variation requires higher voltage to achieve the same frequency — it is a "worse" bin.

NVIDIA's GPU Boost algorithm (currently at version 5.x for Ada Lovelace) exploits this at runtime. The firmware characterizes the silicon during power-on self-test and selects boost clock states within the safe operating area defined by the V/F table. A good-bin chip on a well-cooled board will boost to the upper boundary of its V/F table. A worse-bin chip or a thermally constrained card will operate conservatively.

The TDC and TGP Power Limits

Modern NVIDIA GPUs enforce three simultaneous limits that constrain the V/F operating point: two electrical and one thermal.

TGP (Total Graphics Power): The board-level power limit for the entire graphics card, specified in the card's VBIOS. For an RTX 4090, stock TGP is 450W.
TDC (Thermal Design Current): The sustained current limit for the VRM. This constrains the electrical load on the power delivery network independently of total power.
Temperature Limit: The firmware's maximum allowable $T_j$ (typically 83°C for NVIDIA consumer cards). Exceeding this triggers frequency reduction.

A healthy card with adequate cooling operates with thermal headroom to spare, meaning the TGP limit is the active constraint — the card is using all available power budget for maximum performance. A degraded card — one with dried thermal paste, failing fans, or a degraded heatsink — becomes thermally constrained, meaning the temperature limit is the active constraint. In this regime, the card voluntarily reduces its V/F operating point to reduce heat generation, and performance falls below what the TGP budget would otherwise allow.

Forensic Significance: Our analysis compares the active constraint for each submitted card. A card that is thermally constrained in a 21°C lab environment — rather than power-constrained — has a heatsink or thermal interface pathology that requires investigation, regardless of what the core temperature sensor reports.
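The idea of an "active constraint" can be sketched as a simple headroom check over logged telemetry. The function name, margins, and sample readings below are assumptions for illustration, not the lab's actual classification logic.

```python
def active_constraint(power_w: float, tgp_w: float,
                      current_a: float, tdc_a: float,
                      temp_c: float, temp_limit_c: float,
                      electrical_margin: float = 0.03,
                      temp_margin_c: float = 2.0) -> str:
    """Name the limit the card is running up against, or 'none' if it has headroom.

    Margins are assumed values: a reading within 3% of an electrical limit,
    or within 2 degrees C of the temperature limit, counts as 'at the limit'.
    """
    if temp_c >= temp_limit_c - temp_margin_c:
        return "thermal"
    if current_a >= tdc_a * (1.0 - electrical_margin):
        return "current"
    if power_w >= tgp_w * (1.0 - electrical_margin):
        return "power"
    return "none"


# Hypothetical readings from two cards in the same ambient conditions.
print(active_constraint(445, 450, 380, 450, 68.0, 83.0))  # well-cooled card
print(active_constraint(350, 450, 300, 450, 82.5, 83.0))  # thermally limited card
```

The thermal check comes first because a card pinned at its temperature limit while drawing well under its power budget is the pattern that points to a cooling fault.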

Part III: Electromigration — How Voltage Abuse Physically Destroys Silicon

The Physics of Electromigration

Electromigration is the gradual displacement of metal atoms in a conductor under the influence of a sustained electrical current. It is the mechanism by which overvoltage and overtemperature destroy GPU silicon — not instantaneously, but over thousands of operating hours.

In the thin metal interconnects that wire together the transistors on a GPU die — copper and aluminum lines with widths measured in nanometers — the electron flux at elevated current densities creates a momentum transfer to the metal lattice. Individual atoms are displaced from their equilibrium positions in the direction of electron flow. Over time, this produces two complementary failure modes:

Voids: Depletion of metal atoms at the upstream (cathode) end of the line, where atoms are carried away faster than they are replaced, increasing the local resistance of the interconnect until it eventually opens (becomes an open circuit).
Hillocks: Accumulation of metal atoms at the downstream (anode) end, which can eventually grow large enough to short-circuit adjacent metal lines or layers.

Black's equation models electromigration lifetime:

\text{MTF} = A \cdot J^{-n} \cdot e^{\left(\frac{E_a}{k_B \cdot T}\right)}

Where:

MTF = Mean Time to Failure
$A$ = A process-dependent constant
$J$ = Current density (A/cm²)
$n$ = Current density exponent (typically 1–2)
$E_a$ = Activation energy (eV) — approximately 0.7–0.9 eV for aluminum, 0.9–1.2 eV for copper
$k_B$ = Boltzmann constant
$T$ = Temperature in Kelvin

The critical insight from Black's Equation is the temperature sensitivity. Raising the operating temperature from 70°C (343K) to 100°C (373K) shrinks the exponential term dramatically: at $E_a = 0.9$ eV and constant current density, this temperature increase reduces MTF by roughly 11–12x. Operating a GPU at sustained elevated temperatures with elevated voltages (as is common in long-term heavy overclocking or poorly ventilated mining rigs) cumulatively degrades the MTF of every on-die interconnect.
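The temperature sensitivity can be checked numerically. Taking the ratio of Black's equation at two temperatures with the same $A$ , $J$ , and $n$ cancels everything except the exponential term, which the sketch below evaluates. The function name is illustrative; the Boltzmann constant is the standard value in eV/K.

```python
import math

K_B_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def mtf_ratio(ea_ev: float, t_cool_c: float, t_hot_c: float) -> float:
    """MTF(cool) / MTF(hot) from Black's equation at equal current density.

    With A, J and n held constant, the ratio reduces to
    exp(Ea/kB * (1/T_cool - 1/T_hot)), temperatures in kelvin.
    """
    t_cool_k = t_cool_c + 273.15
    t_hot_k = t_hot_c + 273.15
    return math.exp(ea_ev / K_B_EV_PER_K * (1.0 / t_cool_k - 1.0 / t_hot_k))


for ea in (0.7, 0.9, 1.2):
    print(f"Ea = {ea:.1f} eV: 70 C -> 100 C shortens MTF by {mtf_ratio(ea, 70.0, 100.0):.1f}x")
```

Sweeping $E_a$ across the quoted range shows how strongly the result depends on the activation energy assumed for the interconnect metal.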

How Electromigration Manifests in Observable Data

The practical consequence of accumulated electromigration damage is a shift in the chip's effective V/F curve. As interconnects develop incipient voids, local resistance increases. This means a given logic path requires slightly more voltage to maintain the same propagation speed — effectively demanding a higher $V_{DD}$ for the same $f_{max}$ .

This manifests in our diagnostic data in several ways:

Reduced maximum stable frequency at stock voltage: A degraded die can no longer sustain the clock states its V/F table specifies. The boost algorithm, detecting instability, drops to a lower V/F point.
Increased voltage demand for stability: To sustain target clocks on a degraded die, the firmware may push VCore higher — increasing power draw and heat, further accelerating the degradation.
Increased variance in boost clock behavior: A healthy die boosts to a consistent, reproducible frequency under a given thermal/power condition. A degraded die shows clock state jitter — oscillating between frequency steps rather than settling at one.

Forensic Significance: We compare each submitted GPU's observed V/F operating profile against our reference database for that SKU. A shift of >50MHz in the average boost clock at a given VCore level, or a >3% increase in voltage demand for a given clock state, is flagged as potential electromigration or silicon degradation.

Part IV: VRM Architecture and Phase Topology

The Synchronous Buck Converter

The VRM (Voltage Regulator Module) on a GPU PCB converts the 12V input (from the PCIe slot and the auxiliary power connectors) to the 0.7–1.1V required by the GPU die. This is accomplished by a synchronous buck converter, the circuit topology found on virtually every modern GPU PCB.

The fundamental buck converter operation:

The high-side MOSFET ( $Q_H$ ) switches the 12V input to the inductor ( $L$ ) for a duration determined by the duty cycle $D$
The low-side MOSFET ( $Q_L$ ) provides the freewheeling current path when $Q_H$ is off
The output capacitors ( $C_{out}$ ) filter the inductor ripple current to produce a smooth DC output

The output voltage is governed by:

V_{out} = D \cdot V_{in} = \frac{t_{on}}{T_{sw}} \cdot V_{in}

Where $T_{sw} = 1/f_{sw}$ is the switching period, and $t_{on}$ is the on-time of the high-side MOSFET.

For a GPU VCore rail at 1.0V from a 12V input, the duty cycle is approximately $D = 1.0/12 = 0.0833$ — meaning the high-side MOSFET is conducting only 8.3% of each switching cycle.

Multi-Phase Operation and Current Sharing

High-power GPUs — the RTX 4090 consumes up to 450W, with the GPU core alone drawing potentially 400W at ~1.0V, implying 400A of current — cannot be served by a single converter phase. A single inductor carrying 400A at typical switching frequencies would be physically impractical.

The solution is multi-phase interleaving: multiple identical buck converter stages operating in parallel, each offset in phase by $360°/N$ where $N$ is the phase count. The RTX 4090 PCB uses up to 24 VRM phases for the GPU core rail alone.

Multi-phase operation provides:

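Effective ripple reduction: The ripple currents from each phase partially cancel and the effective ripple frequency rises to $N \cdot f_{sw}$ , so the output voltage ripple falls by at least a factor of $N$ relative to a single phase with the same inductor and output capacitance, and by considerably more at duty cycles where the phase currents cancel strongly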
Current sharing: Each phase carries $I_{total}/N$ amps, dramatically reducing thermal stress on individual components
Transient response bandwidth: More phases can respond faster to sudden current demand changes
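The duty-cycle and current-sharing arithmetic above can be reproduced in a few lines. This is an idealized sketch (lossless converter, perfectly balanced phases); the function names are illustrative, and the 400A and 24-phase figures are the ones quoted in the text.

```python
from typing import List


def buck_duty_cycle(v_out: float, v_in: float) -> float:
    """Ideal buck converter duty cycle, D = V_out / V_in."""
    return v_out / v_in


def phase_offsets_deg(n_phases: int) -> List[float]:
    """Interleaving offsets: phase k is shifted by k * 360 / N degrees."""
    return [k * 360.0 / n_phases for k in range(n_phases)]


def per_phase_current(i_total_a: float, n_phases: int) -> float:
    """Current carried by each phase under ideal, perfectly balanced sharing."""
    return i_total_a / n_phases


duty = buck_duty_cycle(1.0, 12.0)
print(f"duty cycle: {duty:.4f}")
print(f"per-phase current at 400 A over 24 phases: {per_phase_current(400.0, 24):.1f} A")
print(f"first four offsets of 24 phases: {phase_offsets_deg(24)[:4]}")
```

Real converters run at a slightly higher duty cycle than the ideal value to make up for conduction and switching losses.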

PWM Controller and Droop Compensation

The master PWM controller (common examples include the Monolithic Power Systems MP2888, or Renesas RAA229110 used in high-end GPU PCBs) continuously monitors the output voltage and adjusts phase duty cycles via a high-bandwidth feedback loop.

Modern GPU VRMs implement Active Voltage Positioning (AVP), also called droop or load-line calibration. Under this scheme, the output voltage is intentionally set slightly higher at light load and allowed to droop proportionally as current increases. This serves two purposes:

It reduces the voltage overshoot when load suddenly drops (the capacitors have less "catching up" to do)
It allows the nominal operating voltage to be set conservatively, with the actual silicon operating at a slightly lower voltage under heavy load — trading efficiency for reliability margin

V_{out}(I) = V_{set} - R_{LL} \cdot I_{load}

Where $R_{LL}$ is the load-line impedance (typically in the range of 0.1–0.5 mΩ for a high-end GPU VRM).
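A short sketch of the load-line relation shows how much droop to expect across the load range. The set point and $R_{LL}$ value below are assumed for illustration, with $R_{LL}$ chosen inside the range quoted above.

```python
def loadline_voltage(v_set: float, r_ll_ohm: float, i_load_a: float) -> float:
    """Output voltage under droop: V_out(I) = V_set - R_LL * I_load."""
    return v_set - r_ll_ohm * i_load_a


V_SET = 1.10     # assumed no-load set point, volts
R_LL = 0.25e-3   # assumed load-line impedance, ohms (0.25 milliohm)

for amps in (50.0, 200.0, 400.0):
    print(f"{amps:5.0f} A -> {loadline_voltage(V_SET, R_LL, amps) * 1000:.1f} mV")
```

Under these assumptions the rail sits about 100 mV lower at 400A than at no load, which is why the voltage observed under load must be compared against the load-line rather than against the raw set point.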

Part V: Voltage Ripple — Measurement and Diagnostic Interpretation

What Voltage Ripple Is

No switched-mode power supply produces a perfectly flat DC output. The finite switching frequency, inductor current ripple, and capacitor impedance all contribute to a small AC component superimposed on the DC output voltage — the voltage ripple.

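For a multi-phase buck converter, a first-order estimate of the peak-to-peak output ripple voltage (ignoring capacitor ESR and inter-phase ripple cancellation) is:

\Delta V_{out} = \frac{V_{in} \cdot D \cdot (1-D)}{8 \cdot N \cdot L \cdot C_{out} \cdot f_{sw}^2}

A simplified form for practical estimation, where $I_{ripple,L} = V_{in} \cdot D \cdot (1-D) / (L \cdot f_{sw})$ is the per-phase inductor ripple current:

\Delta V_{pp} \approx \frac{I_{ripple,L}}{8 \cdot N \cdot f_{sw} \cdot C_{out}}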

In a healthy, properly designed GPU VRM with adequate output capacitance, this ripple is typically <10mV peak-to-peak — negligible relative to the 900–1100mV operating voltage.
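The simplified estimate can be evaluated directly. The sketch below uses the standard buck expression for per-phase inductor ripple current, $V_{in} \cdot D \cdot (1-D) / (L \cdot f_{sw})$ , and then applies the simplified ripple form. The inductance, capacitance, phase count, and switching frequency are assumed values, and the model ignores capacitor ESR, which in practice often contributes most of the measured ripple.

```python
def inductor_ripple_current(v_in: float, v_out: float, l_henry: float, f_sw_hz: float) -> float:
    """Per-phase peak-to-peak inductor ripple current for an ideal buck stage."""
    duty = v_out / v_in
    return v_in * duty * (1.0 - duty) / (l_henry * f_sw_hz)


def ripple_pp_volts(v_in: float, v_out: float, n_phases: int,
                    l_henry: float, c_out_farad: float, f_sw_hz: float) -> float:
    """Capacitive output ripple estimate: I_ripple / (8 * N * f_sw * C_out)."""
    i_ripple = inductor_ripple_current(v_in, v_out, l_henry, f_sw_hz)
    return i_ripple / (8.0 * n_phases * f_sw_hz * c_out_farad)


# Assumed component values, for illustration only.
healthy = ripple_pp_volts(12.0, 1.0, 20, 150e-9, 3.0e-3, 500e3)
aged = ripple_pp_volts(12.0, 1.0, 20, 150e-9, 1.5e-3, 500e3)  # half the capacitance

print(f"healthy: {healthy * 1e3:.3f} mV p-p, aged capacitors: {aged * 1e3:.3f} mV p-p")
```

Halving the output capacitance doubles the estimated ripple, which is the capacitive half of the capacitor-aging mechanism; the ESR contribution would have to be modeled separately.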

Pathological Ripple Signatures

Elevated Ripple (>25mV p-p): Indicates capacitor degradation. As electrolytic or polymer capacitors age, their ESR increases and their capacitance decreases. Both effects increase the output ripple. High ripple directly modulates the GPU die's supply voltage — if the ripple amplitude approaches the silicon's noise margin, timing violations and computational errors become possible.

Sawtooth / Asymmetric Ripple: If the ripple waveform exhibits asymmetric rise/fall timing, it indicates phase imbalance — one or more phases are not contributing their expected current share. This can result from a degraded MOSFET, a damaged inductor, or a faulty phase controller channel. The asymmetric loading causes the remaining phases to carry more current than designed, accelerating their degradation.

High-Frequency Switching Artifacts: Spikes at frequencies above the fundamental switching frequency (harmonics) indicate switching transient ringing — often caused by parasitic inductance in the PCB layout or degraded snubber components. These high-frequency spikes can couple into the GPU die through the power delivery network and introduce EMI into sensitive analog circuits.

Assurd Measurement Protocol: We monitor VCore stability and transient voltage droop through high-frequency digital polling of the GPU's internal Power Management Unit (PMU) via the I2C bus. Measurements taken at the wall or PCIe slot systematically underestimate transient spikes, so we pull telemetry directly from the silicon's onboard sensors.

Part VI: The Assurd V/F Curve Audit

Reference Database

For every GPU SKU we process, we maintain a reference V/F dataset compiled from known-good units tested within 90 days of manufacture (sourced from retail purchase). This establishes:

Nominal boost clock distribution at a given TGP and thermal load
VCore range at each boost frequency step
Expected phase count contribution and current sharing balance

Deviation Analysis

When a submitted unit is tested, its observed V/F profile is compared against the reference:

Parameter	Healthy Range	Deviation Flag
Boost Clock vs. Reference	±25 MHz of reference median	>50 MHz below reference
VCore at Max Boost	±20mV of reference	>40mV above reference (voltage compensation for degradation)
Ripple (p-p at die)	< 15mV	> 25mV
Phase Imbalance	< 5% current deviation between phases	> 15% imbalance
Boost Clock Variance (σ)	< 8 MHz standard deviation	> 20 MHz (jitter/instability)
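The flag thresholds in the table translate naturally into a rule check. The sketch below is one possible encoding: the data structure, field names, and sample values are hypothetical, while the numeric thresholds are taken from the Deviation Flag column.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UnitMeasurement:
    boost_mhz: float            # average boost clock under the test load
    vcore_mv: float             # VCore at max boost
    ripple_mv_pp: float         # peak-to-peak ripple at the die
    phase_imbalance_pct: float  # worst-case current deviation between phases
    boost_sigma_mhz: float      # standard deviation of the boost clock


def deviation_flags(unit: UnitMeasurement, ref_boost_mhz: float, ref_vcore_mv: float) -> List[str]:
    """Return the names of every parameter that crosses its flag threshold."""
    flags = []
    if ref_boost_mhz - unit.boost_mhz > 50.0:
        flags.append("boost_clock")
    if unit.vcore_mv - ref_vcore_mv > 40.0:
        flags.append("vcore")
    if unit.ripple_mv_pp > 25.0:
        flags.append("ripple")
    if unit.phase_imbalance_pct > 15.0:
        flags.append("phase_imbalance")
    if unit.boost_sigma_mhz > 20.0:
        flags.append("boost_variance")
    return flags


# Hypothetical submitted unit compared against a hypothetical SKU reference.
suspect = UnitMeasurement(2640.0, 1095.0, 12.0, 4.0, 26.0)
print(deviation_flags(suspect, ref_boost_mhz=2715.0, ref_vcore_mv=1050.0))
```

Values that fall between the healthy range and the flag threshold are left unflagged here; how that grey zone is handled is a policy choice the table does not specify.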

Clock Stretching Detection

Clock stretching, referred to internally as ghost boosting, is among the most insidious GPU failure modes. A card reports running at 2500 MHz in GPU-Z, MSI Afterburner, or HWiNFO64. The user sees "healthy" numbers. The game performs poorly. The card is not actually computing at 2500 MHz.

The mechanism: when the silicon cannot sustain a clock edge within the required timing budget at the specified voltage, the hardware does not crash — it extends the clock period. Specific to NVIDIA GPUs, the driver stack can report the commanded frequency while the actual CCLK being distributed to the shader array is lower. The GPU performance counters — accessible via NVML or hardware-level PMU reads — tell a different story.

Our test methodology cross-references:

Driver-reported frequency (via NVML)
PMU-derived shader throughput (via compute benchmarks with known FLOP/clock ratios)
Theoretical vs. measured throughput at a fixed workload

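\eta_{\text{clock}} = \frac{\text{Measured GFLOPS}}{\text{Reported GHz} \times \text{CUDA Core Count} \times 2}

Here the factor of 2 counts one fused multiply-add per CUDA core per clock as two floating-point operations, and the reported clock is expressed in GHz (reported MHz divided by 1000) so that the denominator is also in GFLOPS. A healthy card achieves $\eta_{\text{clock}} > 0.98$ . A clock-stretching card will exhibit $\eta_{\text{clock}} < 0.94$ , indicating that the effective compute throughput is significantly below what the reported frequency implies.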

Effective Clock Efficiency	Interpretation	Lab Action
> 98%	Full clock delivery	Pass
94% – 98%	Minor efficiency loss; within tolerance	Note on certification report
90% – 94%	Moderate clock stretching; VRM audit required	Hold — Level 2 Inspection
< 90%	Severe clock stretching; silicon or VRM pathology	Reject — Root Cause Analysis
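The efficiency metric and the action bands in the table can be sketched as follows. The function names and the short action labels are illustrative, the measured throughput is an invented figure, and 16384 is the CUDA core count of an RTX 4090. The reported clock is converted from MHz to GHz so that both sides of the ratio are in GFLOPS.

```python
def clock_efficiency(measured_gflops: float, reported_mhz: float, cuda_cores: int) -> float:
    """Measured throughput divided by the throughput the reported clock implies.

    Assumes one fused multiply-add (2 FLOPs) per CUDA core per clock.
    """
    theoretical_gflops = (reported_mhz / 1000.0) * cuda_cores * 2
    return measured_gflops / theoretical_gflops


def lab_action(eta: float) -> str:
    """Map an efficiency value onto the four bands of the table."""
    if eta > 0.98:
        return "Pass"
    if eta >= 0.94:
        return "Note"
    if eta >= 0.90:
        return "Hold"
    return "Reject"


# Hypothetical run: card reports 2500 MHz, benchmark measures 74,000 GFLOPS.
eta = clock_efficiency(74000.0, 2500.0, 16384)
print(f"efficiency: {eta:.3f} -> {lab_action(eta)}")
```

The benchmark used for the numerator needs a known, near-ideal FLOP-per-clock ratio; a memory-bound kernel would lower the ratio for reasons unrelated to clock stretching.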

Part VII: VBIOS Forensics — When the Firmware Lies

Flashed VBIOS and Modified V/F Tables

The VBIOS (Video BIOS) stored in the SPI flash chip on the GPU PCB contains the complete V/F table, power limits (TGP, TDC), fan curve, and board identity data. It is writable via software utilities — and it is routinely modified.

Common VBIOS modifications seen in the pre-owned market:

Increased TGP limits: Mining operators increase power limits beyond factory specifications to extract maximum compute performance. An RTX 3090 with a VBIOS flashed to 400W (vs. stock 350W) has been electrically stressed beyond its design point for potentially thousands of hours.
Modified V/F tables: Overclocking enthusiasts modify V/F tables to push higher frequencies, often with elevated voltages that accelerate electromigration.
Board ID spoofing: In extreme cases, a lower-SKU board has its VBIOS replaced with a higher-SKU firmware to misrepresent the hardware to detection tools.

Cryptographic VBIOS Verification

Assurd performs VBIOS integrity verification on every submitted unit using the following protocol:

Binary extraction: We read the raw VBIOS binary from the SPI flash using a hardware programmer (not the software GPU-accessible read, which can be intercepted)
SHA-256 hash comparison: The extracted binary is hashed and compared against our database of factory-original VBIOS images for that board revision
Structural analysis: A matching hash means the image is byte-identical to a factory reference, since changing even a single bit changes the SHA-256 digest. When the hash does not match, we perform a structural diff against the closest reference image to localize what was changed, checking V/F table entries, power limits (TGP, TDC), and board identity tokens

H_{\text{VBIOS}} = \text{SHA-256}(\text{raw flash binary})

\text{Integrity} = \begin{cases} \text{Verified} & \text{if } H_{\text{VBIOS}} \in \mathcal{D}_{\text{reference}} \\ \text{Modified} & \text{if } H_{\text{VBIOS}} \notin \mathcal{D}_{\text{reference}} \end{cases}

A modified VBIOS is not an automatic rejection — but it triggers a full electrical history investigation and mandates our maximum-duration test suite. The certification report will document the VBIOS modification status regardless of pass/fail outcome.
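The hash comparison step can be sketched with Python's standard `hashlib`. The byte strings below are synthetic stand-ins for a flash dump, and the reference set is built in place for the demonstration; a real reference database and the hardware read-out are outside the scope of this example.

```python
import hashlib
from typing import Set


def vbios_sha256(raw_image: bytes) -> str:
    """Hex SHA-256 digest of the raw flash contents."""
    return hashlib.sha256(raw_image).hexdigest()


def integrity(raw_image: bytes, reference_hashes: Set[str]) -> str:
    """'Verified' if the digest is in the reference set, otherwise 'Modified'."""
    return "Verified" if vbios_sha256(raw_image) in reference_hashes else "Modified"


# Synthetic stand-in for a factory image, plus a copy with one byte altered.
factory_image = bytes(range(256)) * 4
modified_image = bytearray(factory_image)
modified_image[100] ^= 0x01  # flip a single bit
modified_image = bytes(modified_image)

reference_hashes = {vbios_sha256(factory_image)}

print(integrity(factory_image, reference_hashes))
print(integrity(modified_image, reference_hashes))
```

A single flipped bit is enough to move the image out of the reference set, so a digest mismatch says that something changed but not what; locating the change is the job of the structural diff.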

Conclusion

The V/F curve of a GPU is not just a performance specification — it is the accumulated electrical biography of the silicon substrate. It encodes how hard the card was driven, whether it was overvolted, whether the VRM is maintaining regulation, and whether the interconnects have accumulated electromigration damage that will continue to degrade the device's long-term reliability.

At Assurd Techlabs, reading that biography is the core of our certification work. We don't ask "is this card fast?" — we ask "is this card's electrical health consistent with its age and use history?" The mathematics of the V/F audit, the physics of the synchronous buck converter, and the forensics of VBIOS integrity verification are our instruments.

An Assurd Certified badge on a GPU means its silicon DNA has been read, interpreted, and verified. The card's electrical identity matches what it claims to be, and its degradation level is within the bounds of safe, long-term operation.

Appendix: Reference V/F profiles for major GPU SKUs are available to enterprise clients under NDA through our technical disclosure portal.