Part III: Compression as a Structural Precondition
The preceding part established that the threat landscape confronting post-quantum federated AI at the edge admits no decomposition into independent engineering subproblems. Chapter 2 demonstrated, across three orthogonal analytical dimensions, that existing federated learning security mechanisms rest on computational hardness assumptions that are structurally time-dependent; that the compression techniques required for edge deployment are privacy-adversarial by design—standard quantisation methods such as GPTQ and AWQ systematically concentrate statistical mass at quantisation centroids, targeting MSE minimisation at the measurable cost of reducing conditional entropy \kappa, thereby creating information leakage hotspots; and that no existing distributed consensus architecture simultaneously achieves sub-linear message complexity, strong consistency, and Byzantine fault tolerance under asynchrony. These three failure modes together constitute a tripartite design imperative: any system that addresses only one or two dimensions will either leak information-theoretically, fail to deploy due to memory constraints or insufficient utility, or fail to scale.
Part III addresses the second dimension of this imperative. Its subject is compression—understood neither as a post-hoc optimisation applied to a system already designed for correctness, nor as a privacy-neutral engineering tradeoff between model fidelity and communication cost. The central thesis of this part is stronger: compression is a structural precondition for the existence of the system at the edge. A 7-billion-parameter language model that cannot be deployed on a 2 GB device is not a compressed system operating at reduced capacity; it is a system that does not exist at the edge. The architectural consequence is that every theorem in this part must be evaluated not only against its information-theoretic or rate-distortion properties, but against the hard physical constraint that determines whether the system can operate at all. Compression decisions simultaneously determine two core architectural dimensions—memory topology (which nodes can participate in the network) and security attributes (the achievable range of the security parameter \kappa)—and produce a derivative influence on a third: communication format (the fixed 256-bit length of PQ indices enables byte-aligned fixed-size message framing at every network boundary, eliminating variable-length parsing overhead and enabling byte-aligned batching in the BFT-MESI protocol; complete quantitative analysis of this constraint on message complexity and BFT round count appears in Part VII).
Compression Pipeline Overview. This part involves three qualitatively distinct quantisation operations, which are defined here to prevent terminological ambiguity throughout.
The first is model weight quantisation (applied post-training to model weights), determining M_{\text{model}}; analysis appears in §§3.1 and 3.6–3.7. The weight quantisation format also determines the class of key-dependent perturbation available at the weight level: 2-bit dense formats admit continuous geometric perturbation of quantisation centroids, whereas the 1.58-bit ternary lattice {-1, 0, +1}^d admits only permutation-class perturbations. The security implications of this distinction are discussed in the compression-security co-design section below and in §3.7.
The second is embedding vector Product Quantisation (PQ) (applied at inference time to activation embeddings prior to transmission), determining the achievable range of the security parameter \kappa and the transmission format; analysis appears in §3.5.
The third is KV cache precision control (applied at inference time to key-value cache tensors, an activation quantisation problem belonging to a distinct technical track from weight quantisation), determining M_{\text{KV}}; analysis appears in §3.5.
The three operations differ in their targets, timing, and design objectives, and are treated as strictly separate throughout this part.
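The separation can be made explicit in a minimal configuration sketch; the class and field names below are illustrative conveniences for exposition, not part of the system specification.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CompressionPipelineConfig:
    """Illustrative grouping of the three quantisation operations."""
    # Operation 1: model weight quantisation (post-training) -> determines M_model
    weight_bits: float = 2.0      # 2-bit dense, or 1.58 for ternary BitNet b1.58
    # Operation 2: embedding Product Quantisation (inference time) -> determines kappa
    pq_subspaces: int = 32        # M
    pq_centroids: int = 256       # K per subspace
    embed_bits: int = 16          # b_e, effective precision of PQ input embeddings
    # Operation 3: KV cache precision control (inference time) -> determines M_KV
    kv_bits: int = 4              # b_KV in {1, 4, 8, 16}
```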
Universal PQ-level security mechanism: Key-Dependent Quantisation Centroids (KDC). The PQ pipeline incorporates a universal security mechanism: a key function perturbs the precise positions of the K = 256 centroids in each of the M = 32 subspaces, such that an adversary who knows the codebook topology (M subspaces, K centroids per subspace) cannot determine the exact centroid coordinates, thereby increasing the computational cost of reconstructing the original embedding from a PQ index. KDC operates on PQ centroids and is available regardless of the weight quantisation format—whether weights are 2-bit, 1.58-bit, or otherwise, the geometric uncertainty introduced by KDC at the PQ level is equally applicable. KDC inherits the key-dependent quantisation tradition of Quantisation Index Modulation (QIM; Chen and Wornell, IEEE Trans. Inf. Theory, 2001) but differs fundamentally in security semantics, adversary model, and objective function; detailed literature positioning appears in §3.6. KDC requires the current key to be used for centroid generation or lookup at each PQ encoding step; computational overhead on 2 GB edge devices is evaluated in §3.5.
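To illustrate the KDC idea (a hedged sketch, not the system's actual key schedule: the function name, the SHA-256-seeded generator, and the perturbation scale are all assumptions made for exposition), centroid offsets can be derived deterministically from the secret key, so that knowledge of the public codebook topology alone does not reveal exact centroid coordinates.

```python
import hashlib
import numpy as np

def kdc_centroids(base_codebook: np.ndarray, key: bytes, scale: float = 0.05) -> np.ndarray:
    """Sketch: key-dependent perturbation of PQ centroids.

    base_codebook: (M, K, d/M) k-means centroids whose topology is assumed public.
    The offsets are a deterministic function of the secret key, so knowing
    M and K does not determine the exact centroid positions."""
    seed = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")  # illustrative PRF
    rng = np.random.default_rng(seed)
    return base_codebook + scale * rng.standard_normal(base_codebook.shape)

# Every PQ encoding step must use the keyed codebook:
# codebook_k = kdc_centroids(public_codebook, current_key)
```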
Embedding alignment and operational coupling. In the security-compression co-design, embeddings may optionally be aligned to a target precision of b_e bits per dimension prior to entering PQ (b_e is a configurable parameter; primary settings include b_e = 16 for standard FP16 activation inference requiring no additional alignment, b_e = 2 for explicit embedding precision alignment as a candidate configuration pending validation, and intermediate values b_e \in {4, 8}; the impact on security ceiling is discussed in the D-U-S framework section below). This alignment step is a constituent of the second operation, not an independent fourth operation. The first and second operations exhibit indirect coupling: weight quantisation precision affects the distributional shape of activation embeddings (and therefore the whitening quality \delta and residual entropy of PQ), but does not affect b_e (which is determined by the inference framework's activation precision or by the embedding alignment design); systematic analysis of this coupling appears in §3.6.
The Strategic Triple Role of Compression. Mnemosyne—the post-quantum secure edge federated learning system designed in this work (architectural overview in §1.x)—relies on a compression architecture that serves three qualitatively distinct strategic functions, collectively defining the design space within which all subsequent formal results operate.
The first is edge device survival. The minimum hardware threshold for Mnemosyne participation is 2 GB of RAM—a constraint satisfied by a $35 Raspberry Pi 5 but violated by any standard-precision deployment of a 7B-parameter language model. A model stored in FP16 requires approximately 14 GB for weights alone; INT8 quantisation reduces this to 7 GB, still exceeding the 2 GB threshold by a factor of 3.5. Only sub-2-bit representations bring the weight footprint within the feasible regime: 2-bit dense quantisation requires approximately 1.75 GB, and 1.58-bit ternary quantisation (BitNet b1.58; Ma et al., 2024) approximately 1.38 GB (a theoretical lower bound; the actual deployment footprint is somewhat higher due to scale factors and activation quantisation parameters, estimated at 1.5–1.6 GB). The implication is not that lower-precision formats are preferable on accuracy grounds—it is that they are necessary on arithmetic grounds. Regarding the accuracy cost of 2-bit quantisation: post-training quantisation methods (QuIP#, Tseng et al., 2024; AQLM, Egiazarian et al., 2024) have shown that perplexity degradation on LLaMA-family models remains within acceptable bounds; quantisation-aware training (EfficientQAT, Chen et al., ACL 2025) further reduces the accuracy loss of 2-bit Llama-2-70B to within 3 percentage points on zero-shot evaluation benchmarks, establishing the current state of the art for 2-bit QAT at the 70B scale; and ParetoQ (Liu et al., NeurIPS 2025) confirms via a unified cross-precision evaluation framework that 2-bit QAT exhibits a representational learning transition, with specific accuracy figures and the PTQ/QAT trade-off analysis for the 7B target scale provided in §3.1. FP16 and INT8 are not suboptimal deployment targets for minimum-threshold nodes; they are categorically infeasible.
The second strategic role is attack surface reduction. The compression pipeline does not merely reduce model size; it determines the information-theoretic surface area available to an adversary.
Embedding inversion threat. ALGEN (Chen, Xu and Bjerva, ACL 2025) and Zero2Text (arXiv:2602.01757) demonstrate state-of-the-art capability for recovering original text from embedding representations; LAGO (Oren et al., arXiv:2505.16008, 2025) further shows that embedding inversion transfers across languages and model architectures, indicating ongoing expansion of this attack surface. ALGEN additionally establishes experimentally that existing defences—including differential privacy—cannot effectively resist embedding inversion attacks while preserving downstream utility, further motivating a compression pipeline designed around H_\infty^\delta maximisation.
PQ compression mechanism and non-standard usage. In Mnemosyne, every embedding passes through a whitening transformation followed by Product Quantisation before crossing any network boundary, reducing a d = 4096-dimensional embedding vector (using LLaMA-2-7B's hidden dimension as the design baseline) to a 256-bit PQ index—for FP16 (b_e = 16), a compression ratio of 16 \times 4096 / 256 = 256:1; PQ parameters M = 32 subspaces, K = 256 centroids per subspace, selection rationale in §3.5. PQ is conventionally used for approximate nearest-neighbour search; its application here as a pre-transmission information-theoretic compressor targeting residual entropy maximisation is a non-standard usage. PQ-induced distortion and its impact on downstream inference quality are evaluated in §3.5. The security parameter \kappa = H_\infty(\mathcal{X}_{\text{white}} \mid \mathcal{Y})—the worst-case conditional min-entropy of the whitened embedding given its PQ index—is a direct function of this compression pipeline.
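A minimal sketch of this pre-transmission path under the stated parameters (d = 4096, M = 32, K = 256; the interface and names are illustrative) shows how a whitened embedding collapses to a 32-byte index:

```python
import numpy as np

def pq_encode(e: np.ndarray, W: np.ndarray, codebook: np.ndarray) -> bytes:
    """Whitening followed by Product Quantisation (illustrative sketch).

    e: (4096,) activation embedding; W: (4096, 4096) whitening matrix;
    codebook: (32, 256, 128) keyed centroids (M = 32, K = 256, d/M = 128).
    Returns a 256-bit PQ index: one byte per subspace."""
    x = W @ e                                    # whitened embedding
    sub = x.reshape(32, 128)                     # split into M subspaces
    idx = np.empty(32, dtype=np.uint8)
    for m in range(32):
        dists = np.linalg.norm(codebook[m] - sub[m], axis=1)
        idx[m] = np.argmin(dists)                # nearest centroid: 8 bits
    return idx.tobytes()                         # 32 bytes = 256 bits total
```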
Dimensionality constraint and heterogeneous networks. The system requires all participating nodes to satisfy b_e d - M\log_2 K \geq 804: under the standard FP16 activation setting (b_e = 16), this reduces to d \geq 67 and imposes no practical constraint on modern language models; under the b_e = 2 embedding alignment configuration, the constraint becomes d \geq 530 and provides meaningful node filtering. Adjustment of PQ parameters for nodes with d \neq 4096 is addressed in §3.5.
QAMA comparison and attack model scope. QAMA (CIKM 2025) incorporates orthogonality regularisation as one component of its unified embedding quantisation framework, targeting hardware efficiency for semantic similarity search. Mnemosyne's ZCA whitening and QAMA's orthogonality regularisation share a decorrelation objective, but differ fundamentally in implementation mechanism (ZCA is a post-hoc linear transformation; QAMA's regularisation is a training-time loss constraint) and optimisation target (H_\infty^\delta maximisation versus search efficiency maximisation); detailed comparison in §3.5. The attack model analysed in this work is PQ reconstruction of embedding vectors (Goals 1–2 as formally defined); quantisation-based behavioural manipulation attacks and other threat models fall outside the scope of this security analysis (see Egashira et al., NeurIPS 2024; Egashira et al., ICML 2025 for related work in the excluded category).
The physical security barriers established in Part VI (Theorems 8.1–8.4) are meaningful only if the upstream compression architecture produces a sufficiently high-entropy residual.
The third role is transmission feasibility at scale. Mnemosyne involves two primary communication operations: embedding vector transmission (activation transmission in distributed inference, to which the 256:1 PQ compression ratio directly applies) and model update transmission (gradient or model-difference transmission in federated learning training, with compression mechanisms addressed in Part VII). Uncompressed embedding transmission is incompatible with any meaningful notion of distributed inference coordination over wide-area networks, where intercontinental round-trip times routinely exceed 150–300 milliseconds. The 256:1 compression ratio reduces embedding transmission cost to a range that is practically achievable over low-bandwidth edge links, and thereby constitutes the engineering prerequisite that makes the negative-latency architecture of Theorem 9.2 (whereby edge and central nodes employ bidirectional speculative pre-computation to predict each other's behaviour, making results available before they are formally requested and reducing effective waiting time to zero or below; see §9.2) physically realisable rather than merely theoretically elegant.
Memory Budget Constraint.
M_{\text{model}} + M_{\text{KV}} + M_{\text{OS}} \leq 2048 \ \text{MB}
where M_{\text{model}} denotes memory occupied by quantised model weights, M_{\text{KV}} the memory occupied by the key-value cache during inference (batch size 1), and M_{\text{OS}} the memory reserved by the operating system and inference runtime (typical value approximately 150–300 MB).
| Precision Format | M_{\text{model}} | Remaining Budget Ceiling | Feasibility |
|---|---|---|---|
| FP16 | 14.0 GB | — | ✗ Infeasible |
| INT8 | 7.0 GB | — | ✗ Infeasible |
| 4-bit (INT4 / GPTQ-4bit) | 3.5 GB | — | ✗ Infeasible |
| 2-bit dense | 1.75 GB | ≈ 298 MB | ✓ Conditionally feasible |
| 1.58-bit ternary (theoretical lower bound) | 1.38 GB | ≈ 660 MB (theoretical) | ✓ Conditionally feasible |
| 1.58-bit ternary (deployment estimate) | 1.5–1.6 GB | ≈ 450–550 MB | ✓ Conditionally feasible |
The memory requirements of infeasible formats span from 14 GB (FP16) down to 3.5 GB (4-bit), the latter still exceeding the 2 GB threshold by a factor of 1.75—only sub-2-bit formats satisfy the deployment feasibility condition on arithmetic grounds alone. The specific conditions realising the conditional feasibility of sub-2-bit formats—including KV cache compression scheme selection (b_{\text{KV}} \in {1, 4, 8, 16}, with scenario-specific analysis in §§3.5 and 3.9) and operating system minimisation—are quantified numerically in §3.9.
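The arithmetic behind the table can be verified in a few lines; the footprints below mirror the table's decimal-GB values and are illustrative inputs, not deployment measurements.

```python
# Feasibility check for M_model + M_KV + M_OS <= 2048 MB (sketch).
FOOTPRINTS_MB = {
    "FP16": 14_000, "INT8": 7_000, "4-bit": 3_500,
    "2-bit dense": 1_750, "1.58-bit (theoretical)": 1_380,
    "1.58-bit (deployment)": 1_600,
}

for name, m_model in FOOTPRINTS_MB.items():
    headroom = 2048 - m_model            # budget remaining for M_KV + M_OS
    verdict = "conditionally feasible" if headroom > 0 else "infeasible"
    print(f"{name:>24}: headroom {headroom:>7} MB -> {verdict}")
```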
Security Model Selection: Why Smooth Min-Entropy Supersedes Differential Privacy. The joint design of compression and security has a well-established theoretical lineage, rooted in secure source coding theory (Yamamoto, 1988; Cuff, 2013; Villard and Piantanida, 2013). The D-U-S framework developed in this part operates within this tradition, replacing mutual information as the security measure with smooth conditional min-entropy H_\infty^\delta anchored to physical constants via the Margolus-Levitin quantum speed limit.
Within the federated learning compression-privacy literature, the systematic survey of Guo et al. (2022/2024) organises the field into DP, SecAgg, TEE, and related tracks; JoPEQ (Lang et al., 2022) derives a three-way rate-distortion-convergence-LDP tradeoff under the DP track; CEPAM (Ling et al., 2025) formalises the achievable tradeoff among user privacy, global utility, and transmission rate. This work belongs to an information-theoretic security (IT-security) track that has no representative prior work in the existing classification, using H_\infty^\delta rather than the DP parameters (\varepsilon_{\text{dp}}, \delta_{\text{dp}}) as the security parameter. (Note: \varepsilon_{\text{dp}}, \delta_{\text{dp}} are DP-specific notation, strictly distinct from the Soft-ZCA regularisation parameter \varepsilon used later in this part.)
The distinction between H_\infty^\delta and DP is not a matter of preference but of security objective. Differential privacy's protection target is the output distribution of statistical queries; under (\varepsilon_{\text{dp}}, \delta_{\text{dp}})-DP semantics, the composition of multiple gradient transmission rounds causes effective privacy loss to grow at O(\varepsilon_{\text{dp}}\sqrt{T}) (Dwork, Rothblum and Vadhan, 2010; Advanced Composition Theorem; Kairouz et al., 2015 confirms this as the optimal composition bound); the experimental results of ALGEN (referenced above) further establish that DP cannot effectively resist embedding inversion attacks while preserving downstream utility. H_\infty^\delta is a worst-case guarantee for a single embedding observation, whose definition does not depend on the number of communication rounds; the security degradation behaviour under multi-round observation requires independent analysis whose problem structure (though not degradation rate) resembles DP's composition theorem, and is left to future work. More fundamentally, the DP framework was not designed to provide worst-case guarantees over embedding vector residual entropy—the protection target of this work (the worst-case conditional entropy of the post-compression residual) does not appear in DP's design context.
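To make the O(\varepsilon_{\text{dp}}\sqrt{T}) growth concrete, a small numeric sketch of the advanced composition bound follows; the per-round parameters are illustrative, not drawn from any deployment.

```python
import math

def advanced_composition_eps(eps: float, T: int, delta_prime: float = 1e-6) -> float:
    """Advanced composition (Dwork, Rothblum and Vadhan, 2010): T rounds of
    eps-DP compose to eps*sqrt(2T ln(1/delta')) + T*eps*(e^eps - 1),
    i.e. O(eps * sqrt(T)) in the leading term."""
    return eps * math.sqrt(2 * T * math.log(1 / delta_prime)) \
        + T * eps * (math.exp(eps) - 1)

# With per-round eps = 0.1 over T = 10_000 federated rounds, the sqrt(T)
# term alone is ~0.1 * sqrt(2e4 * 13.8) ≈ 52.6: the guarantee is vacuous.
# H_inf^delta, by contrast, is defined over a single embedding observation.
```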
The security parameter adopted in this work is formalised as smooth conditional min-entropy H_\infty^\delta(\mathcal{X}_{\text{white}} \mid \mathcal{Y}). The security argument depends on the following three explicit assumptions—system design choices whose satisfaction depends on the concrete implementation of this system:
(1) The whitening operation reduces the total variation distance \delta between the residual distribution and the uniform distribution sufficiently that H_\infty^\delta(\mathcal{X}_{\text{white}} \mid \mathcal{Y}) \geq \kappa_{\min}, where \kappa_{\min} is determined by the physical derivation below. The isotropy metric IsoScore (Rudman and Gillman, ACL 2022) serves as a diagnostic proxy for whitening quality; the precise bound on \delta must be derived through independent analysis of the conditional distribution over PQ cells, constituting the core open problem of the MC3 obligation (§8.2).
(2) The PQ codebook is a system secret. It is important to note that the codebook is fundamentally different from a cryptographic key: it is a statistical learning artefact trained via k-means on the activation distribution of a public model, and can be independently reproduced by an adversary using the same public model. KDC raises the difficulty of reconstruction attacks launched from such an approximate codebook by making centroid positions a function of the secret key—KDC protects the precise positions of centroids; the statistical inference of codebook topology (subspace partitioning, centroid count) and complete mitigation strategies are addressed in §8.2.
(3) H_\infty^\delta is defined as the classical smooth conditional min-entropy, a guarantee that holds under the assumption that the adversary cannot exploit quantum side-channel information (quantum memory or quantum coherent access to the quantisation lattice). The extension to the quantum adversary case, where the correct security measure is the classical-quantum conditional min-entropy H_\infty(X_{\text{white}} \mid \rho_E) (Wang and Chau, Phys. Rev. Lett. 135, 020801, 2025), is formalised in §8.2.
The security argument additionally depends on two technical premises—broader propositions in physics and computational theory whose validity is independent of this system's design: (4) the \Omega(2^{\kappa/2}) lower bound of Grover's quantum search applies to the Goal 2 embedding inversion problem (MC4b reduction obligation, §8.2); (5) the Margolus-Levitin theorem can be extrapolated to the universal computational scale following the framework of Lloyd (2002) (§1.3.1). The complete list of assumptions appears in §8.2.
Under the above assumptions, \kappa_{\min} is derived from first principles (following the universal computational bound framework of Lloyd, 2002, using updated cosmological energy estimates; complete derivation, full assumption list, and sensitivity analysis in §1.3.1): the Margolus-Levitin theorem (1998) establishes the maximum orthogonal state transition rate of any physical system; taking the total mass-energy of the observable universe as the energy ceiling yields a maximum operation count N_{\text{ops}} \approx 10^{121} for any physical adversary over the age of the universe (one order of magnitude more conservative than Lloyd's original bound of 10^{120}, imposing a correspondingly higher security requirement; this more conservative estimate simultaneously raises \kappa_{\min} by 6 bits relative to the Lloyd value, compressing the engineering margin accordingly—see §1.3.1). Setting 2^{\kappa/2} > 10^{121} yields \kappa_{\min} = 804. Conditional on the success of the MC4b reduction, the protection \kappa \geq 804 is anchored to physical constants and closed in the temporal dimension.
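The closing arithmetic of this derivation is compact enough to verify directly:

```python
import math

# kappa_min from the physical adversary budget (verification sketch).
N_OPS = 10 ** 121                 # conservative Margolus-Levitin operation ceiling
# Grover cost is Omega(2^(kappa/2)); require 2^(kappa/2) > N_OPS.
kappa_min = math.ceil(2 * math.log2(N_OPS))
assert kappa_min == 804
# Lloyd's original ceiling of 10^120 gives ceil(2 * 120 * log2(10)) = 798,
# i.e. the more conservative estimate raises kappa_min by 6 bits.
```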
Compression-Security Co-Design and the 2-Bit Primary Track. Quantisation preprocessing methods have evolved from engineering heuristics to information-theoretically optimal solutions: GPTQ (Frantar et al., 2023) and AWQ (Lin et al., 2023) target MSE minimisation; the rotation family—SmoothQuant (Xiao et al., ICML 2023), QuaRot (Ashkboos et al., NeurIPS 2024), SpinQuant (Liu et al., ICLR 2025)—refines decorrelation through various orthogonal and scaling transformations; WaterSIC (Lifar, Savkin, Ordentlich and Polyanskiy, 2026; arXiv:2603.04956) establishes the information-theoretic rate-distortion optimum. All of these methods, however, are designed and evaluated against distortion minimisation objectives, and none targets H_\infty^\delta maximisation. On non-whitened embedding distributions, the residual entropy produced by the theoretically optimal distortion minimiser is lower than the H_\infty^\delta achievable by first whitening and then applying uniform quantisation at the same bit budget: waterfilling allocates more bits to high-variance dimensions, making those cells denser and their conditional distributions less uniform, thereby reducing residual entropy; ZCA whitening equalises variance across dimensions, making cells equally spaced in all dimensions, thereby maximising residual entropy. Complete rotation-family comparison analysis appears in §3.6.
Whitening also imposes an independent utility cost on downstream tasks (Forooghi et al., ACL Workshop ReP4NLP, 2024). Soft-ZCA Whitening (Diera, Galke and Scherp, ESANN 2025; arXiv:2411.17538) introduces a regularisation parameter \varepsilon \in [0, +\infty) controlling the degree of isotropy, providing a continuous whitening-strength-utility tradeoff axis; the formula, non-linear analysis of \varepsilon, and whitening matrix computation and distribution security model appear in §3.6. The introduction of \varepsilon expands the design space to three dimensions (\varepsilon \times D \times U)—a structure absent from existing compression-privacy frameworks (JoPEQ, CEPAM) and from the perception-distortion framework of Blau and Michaeli (ICML 2019), whose third axis is distributional matching rather than physical security. Formalising the mapping chain \varepsilon \to \text{IsoScore (diagnostic proxy)} \xrightarrow{\text{to be formalised}} \delta \to H_\infty^\delta is the central task of §3.6.
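A sketch of the Soft-ZCA transform under the natural reading of the regularisation parameter (the precise formulation of Diera, Galke and Scherp may differ in detail; this is an assumption-labelled illustration, not their reference implementation):

```python
import numpy as np

def soft_zca(X: np.ndarray, eps: float) -> np.ndarray:
    """Soft-ZCA whitening sketch: W = U diag(1/sqrt(lambda_i + eps)) U^T.

    eps = 0 recovers full ZCA (maximal isotropy); large eps approaches a
    pure rescaling, leaving the distributional shape nearly untouched --
    a continuous whitening-strength axis."""
    Xc = X - X.mean(axis=0)
    cov = (Xc.T @ Xc) / (len(X) - 1)
    lam, U = np.linalg.eigh(cov)              # eigendecomposition of covariance
    W = U @ np.diag(1.0 / np.sqrt(lam + eps)) @ U.T
    return Xc @ W.T
```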
This part resolves the above multi-dimensional tension through the Distortion-Utility-Security (D-U-S) framework: within the feasible set defined by the hard constraint H_\infty^\delta \geq 804, the framework jointly optimises over whitening strength (\varepsilon) and quantisation distortion (D) to derive the achievable Pareto frontier for downstream utility U; H_\infty^\delta serves as a feasibility boundary parameter rather than an optimisation objective. The security ceiling \kappa^{\text{ceiling}} = b_e d - M\log_2 K is the theoretical upper bound on H_\infty^\delta under the idealised assumption \delta = 0, where b_e is the effective bit-precision of PQ input embeddings (determined by the deployment format, not by the weight quantisation precision). The information-theoretic intuition underlying this formula is that the M\log_2 K bits carried by the PQ index are known to the adversary; under the uniform prior assumption, the security ceiling equals the difference between the total information content of the embedding and the adversary's known quantity—that is, the maximum residual entropy.
The security analysis holds for all configurations with b_e \geq 2 (i.e., \kappa^{\text{ceiling}} \geq 7{,}936 bits); the specific b_e setting is a deployment-time security policy choice, with its minimum security guarantee determined by the b_e = 2 scenario. This work employs a continuous-b_e framework over b_e \in [2, 16] to unify analysis across the four standard discrete precisions (b_e \in {2, 4, 8, 16}); this continuisation is an analytical convenience—actual deployment concentrates on the discrete values (complete analysis in §3.5). The security margin characteristics of the principal settings are summarised below.
b_e = 16 (standard FP16 activation inference). \kappa^{\text{ceiling}} = 65{,}280 bits, uniform across all quantised models with d = 4096 and FP16 activation inference. Security margin is ample (exceeding \kappa_{\min} by a factor of approximately 80); MC3 obligation is substantially simplified. Quantitative analysis in §3.5.
b_e = 4 and b_e = 8 (intermediate settings). b_e = 4 yields \kappa^{\text{ceiling}} = 16{,}128 bits (approximately 20× margin); b_e = 8 yields \kappa^{\text{ceiling}} = 32{,}512 bits (approximately 40× margin). The difficulty of the MC3 obligation decreases monotonically with increasing margin; both intermediate settings are substantially less demanding than the b_e = 2 endpoint.
b_e = 2 (explicit embedding alignment, pending validation as a candidate configuration). \kappa^{\text{ceiling}} = 7{,}936 bits (2-bit) or \approx 6{,}236 bits (1.58-bit); security margin is constrained (9.9× and 7.8× respectively), and the MC3 obligation retains its full difficulty. This configuration foregoes 57,344 bits of security margin; its potential benefits (reduced PQ computational complexity, improved PQ efficiency in low-dimensional spaces) and definite costs are analysed in full in §3.5.
Actual H_\infty^\delta falls below \kappa^{\text{ceiling}}; the magnitude of the shortfall is determined by the \varepsilon \to \delta mapping analysis in §3.6.
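The ceiling and margin figures quoted above follow directly from the formula; a verification sketch:

```python
import math

d, M, K, KAPPA_MIN = 4096, 32, 256, 804
index_bits = M * math.log2(K)                    # 256 bits known to the adversary
for b_e in (2, 4, 8, 16):
    ceiling = b_e * d - index_bits               # kappa_ceiling at delta = 0
    print(f"b_e={b_e:>2}: ceiling={ceiling:>8.0f} bits, "
          f"margin={ceiling / KAPPA_MIN:>5.1f}x")
# 1.58-bit ternary input: log2(3)*4096 - 256 ≈ 6236 bits (≈ 7.8x margin).
# Minimum participating dimension at b_e = 2: (804 + 256) / 2 = 530.
```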
This work adopts 2-bit dense quantisation as the primary technical track, on the basis of two core considerations that hold across all b_e settings.
The first is architecture-agnosticism: 2-bit dense quantisation can be applied as post-training quantisation to any existing LLaMA-2-7B checkpoint without architectural modification or native retraining. In Mnemosyne's heterogeneous edge network, where participating nodes may run checkpoints from diverse sources and training pipelines, architecture-agnosticism is a system-level necessity determining which nodes the network can admit. This rationale is a pure system design constraint, independent of any security analysis pending validation.
The second is that 2-bit dense formats admit continuous geometric perturbation of quantisation centroids at the weight level—centroids can be displaced to arbitrary positions in the value domain by a key function, rather than being constrained to fixed symmetric lattice points—introducing geometric uncertainty. This weight-level uncertainty complements the PQ-level KDC mechanism (which is available for all weight formats, as noted in the pipeline overview) to form a two-layer security structure: weight-level centroid perturbation alters the model's forward pass behaviour, modifying the distributional shape of activation embeddings and the conditional entropy \kappa of PQ residuals; PQ-level KDC directly perturbs centroid positions. It must be acknowledged that perturbing weight centroids is equivalent to modifying model weights, necessarily introducing accuracy cost beyond that of standard 2-bit quantisation (the accuracy figures cited above from EfficientQAT, ParetoQ, etc., are based on unperturbed quantisation centroids); furthermore, if keys are rotated periodically, the model must be re-quantised accordingly—the feasibility of this operation on 2 GB edge devices requires dedicated evaluation. The necessity of the two-layer mechanism, its marginal security gain, the additional accuracy cost, and the deployment feasibility of key rotation are all analysed systematically in §3.6. By contrast, the 1.58-bit ternary lattice {-1, 0, +1}^d admits only permutation-class perturbations at the weight level (swapping the semantic roles of the three values), not continuous geometric perturbations; whether these two forms of uncertainty are security-equivalent for resisting PQ reconstruction attacks is the central question of the MC1-Ternary obligation (§3.7).
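The structural difference between the two perturbation classes can be stated in a few lines; the level values and function names are illustrative, not the system's key schedule.

```python
import numpy as np

def perturb_dense_2bit(levels: np.ndarray, key_offsets: np.ndarray) -> np.ndarray:
    """Continuous geometric perturbation (2-bit dense): the four quantisation
    levels move to key-dependent positions anywhere in the value domain."""
    return levels + key_offsets              # e.g. levels = [-a, -b, +b, +a]

def perturb_ternary(key_perm: np.ndarray) -> np.ndarray:
    """Permutation-class perturbation (1.58-bit ternary): only the semantic
    roles of {-1, 0, +1} can be swapped; the lattice points are fixed."""
    return np.array([-1, 0, 1])[key_perm]    # at most 3! = 6 keyed assignments

# The dense format exposes a continuum of keyed configurations; the ternary
# lattice exposes at most six. Whether the two are security-equivalent for
# resisting PQ reconstruction is the MC1-Ternary question (§3.7).
```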
The complete formalisation of the 2-bit track is the content of Theorem 5.5 (§3.6). The 1.58-bit ternary track is treated separately in §3.7, with Stochastic Ternary Whitening (STW) as the core technical tool, using the Sinkhorn algorithm (Cuturi, 2013) for discrete optimal coupling on finite support (Villani, 2003; Peyré and Cuturi, 2019). To our knowledge, no existing work addresses whitening on the ternary lattice subject to simultaneous H_\infty^\delta \geq 804 and BitLinear compatibility constraints; Theorem 5.6 constitutes an original contribution.
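For orientation, a minimal Sinkhorn iteration of the kind invoked for STW's discrete optimal coupling (Cuturi, 2013) follows; the dimensions and marginals are placeholders, with STW's actual cost structure specified in §3.7.

```python
import numpy as np

def sinkhorn(cost: np.ndarray, mu: np.ndarray, nu: np.ndarray,
             reg: float = 0.1, iters: int = 200) -> np.ndarray:
    """Entropic-regularised optimal coupling on finite support (sketch).

    mu: source distribution (e.g. empirical ternary-code statistics);
    nu: target distribution (e.g. the near-uniform STW objective);
    cost: (len(mu), len(nu)) transport cost matrix."""
    Kmat = np.exp(-cost / reg)               # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(iters):
        v = nu / (Kmat.T @ u)                # alternate marginal scalings
        u = mu / (Kmat @ v)
    return u[:, None] * Kmat * v[None, :]    # coupling with marginals mu, nu
```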
Structure of Part III and Summary of Contributions.
| Chapter | Content | Contribution |
|---|---|---|
| §3.1 | System formalisation and memory budget model | — |
| §3.2 | Theorem 5.1: Delta Encoding bounds | — |
| §3.3 | Theorem 5.2: Joint Gaussian prior correlation (prior work) | — |
| §3.4 | Theorem 5.3: Delta Encoding invertibility (prior work) | — |
| §3.5 | Theorem 5.4: PQ rate-distortion and KV cache compression | Contribution 4 |
| §3.6 | Theorem 5.5: Sub-INT8 rate-distortion and security-memory co-optimisation | Contributions 1, 2 |
| §3.7 | Theorem 5.6: Stochastic Ternary Whitening (STW) | Contribution 3 |
| §3.8 | Theorem 5.7: Information-theoretic consistency of the mmap zero-copy memory model | — |
| §3.9 | 2 GB memory budget feasibility closure | — |
(Theorems use global sequential numbering throughout the paper; Theorem 5.x denotes the x-th formal result of Part III, numbered independently of the chapter index §3.x.)
§3.5 encompasses PQ rate-distortion, KV cache compression, b_e security ceiling computations, embedding alignment trade-offs, and KDC overhead evaluation; verification-type computations are presented in summary tables.
The core original contributions of this part are as follows.
Contribution 1: The D-U-S Framework and b_e-Parameterised Characterisation. (§3.6, Theorem 5.5.) Within the feasible set defined by the hard constraint H_\infty^\delta \geq 804, this work establishes a design framework that jointly optimises over Soft-ZCA whitening strength (\varepsilon) and quantisation distortion (D) to derive the achievable Pareto frontier for downstream utility U. Completed components: definition of the \varepsilon \times D \times U three-dimensional design space; derivation of \kappa^{\text{ceiling}} and its information-theoretic intuition; structural analysis of the D-U frontier at given \delta. Core theoretical value: the complete geometric characterisation of how the D-U-S feasible set evolves as b_e varies continuously from 2 to 16, unifying four standard discrete settings (b_e \in {2, 4, 8, 16}) within a single analytical framework—the b_e = 16 endpoint demonstrates that security constraints are naturally satisfied with ample margin and that the D-U frontier simplifies accordingly; the b_e = 2 endpoint demonstrates that limited security margin makes the precise formalisation of the \varepsilon \to \delta \to H_\infty^\delta mapping chain indispensable. This characterisation is independent of which specific b_e value is ultimately deployed. Open sub-problems of Theorem 5.5 are listed in the obligation statement below. This is the first three-dimensional design space of this kind in the compression-privacy literature (absent from JoPEQ, CEPAM, WaterSIC, and the perception-distortion framework of Blau and Michaeli), and the first systematic extension of secure source coding theory (Yamamoto, 1988; Cuff, 2013; Villard and Piantanida, 2013) to a post-quantum physical security setting.
Contribution 2: Security Semantics of Key-Dependent Quantisation Centroids. (§3.6.) Under an information-theoretic framework unaddressed by the quantisation index modulation tradition and by rotation-family quantisation methods, this contribution establishes a residual entropy security semantics in which KDC makes PQ centroid positions functions of a secret key targeting H_\infty^\delta maximisation; it simultaneously analyses the joint security gain of weight-level geometric perturbation and PQ-level KDC, along with the trade-off against additional accuracy cost and the deployment feasibility of key rotation.
Contribution 3: Stochastic Ternary Whitening. (§3.7, Theorem 5.6.) On the ternary lattice {-1, 0, +1}^d, this contribution establishes a whitening design satisfying H_\infty^\delta \geq 804 and BitLinear arithmetic compatibility simultaneously, using a discrete optimal transport formulation. To our knowledge, this is the first result of this kind in the literature.
Contribution 4: Security Impact Analysis of KV Cache Quantisation. (§3.5.) This contribution provides the first explicit analysis of how KV cache quantisation methods—including Atom (Yuan et al., MLSys 2024), PolarQuant (arXiv:2502.00527, NeurIPS 2025), and QJL (Zandieh et al., 2024)—affect the system-level security parameter H_\infty^\delta, with scenario-specific analysis across b_{\text{KV}} \in {1, 4, 8, 16}. No existing KV cache compression work addresses this dimension.
Open Obligations and Scope. Three theorems in this part—Theorem 5.5 (§3.6), Theorem 5.6 (§3.7), and Theorem 5.7 (§3.8)—are presented as new formal results whose complete verification constitutes open obligations in the sense of §1.5.5.
Theorem 5.5 is in its current state a conditional theorem statement: its full theorem status depends on the resolution of the following five sub-tasks. Completed components: design space definition, \kappa^{\text{ceiling}} derivation, and D-U frontier structural analysis. Difficulty of open obligations scales with b_e: the b_e = 16 endpoint reduces to verification-type computation (\kappa^{\text{ceiling}} exceeds \kappa_{\min} by a factor of approximately 80; precise estimates in §3.5); the b_e = 2 endpoint retains full original-research difficulty.
Five open obligations: (a) confirming the achievability of H_\infty^\delta \geq 804 at the four standard discrete b_e values (with numerical quantification of \delta), with the continuous-b_e framework providing theoretical support for interpolation; (b) formalising the explicit mapping chain \varepsilon \to \text{IsoScore} \xrightarrow{\text{to be formalised}} \delta \to H_\infty^\delta (constituting original-research formalisation at small b_e, simplifying to verification at large b_e); (c) analysing the effect of KDC on H_\infty^\delta, including the joint security gain of weight-level geometric perturbation and PQ-level KDC and its trade-off against additional accuracy cost and the deployment feasibility of key rotation (§3.6); (d) quantifying whitening utility cost for distributed inference coordination tasks (distinct from the classification setting of Forooghi et al., 2024); (e) establishing a quantitative formal proof of the geometric antagonism between distortion minimisation and H_\infty^\delta maximisation on non-whitened distributions (the rotation-family comparison in §3.6 provides qualitative support; obligation (e) requires rigorous quantitative inequalities).
Theorem 5.6 is conditional on MC1-Ternary, which requires a formal proof that STW achieves near-uniform conditional distributions on the ternary lattice sufficient to satisfy H_\infty^\delta \geq 804 (with quantified approximation parameter \delta) while preserving BitLinear arithmetic compatibility, and that weight-level geometric perturbation and permutation-class perturbation are security-equivalent for resisting PQ reconstruction attacks.
Theorem 5.7 is conditional on a subset of MC5, requiring confirmation that the page-fault timing side-channel induced by memory-mapped loading falls within the side-channel budget established by Theorem 8.2. The 0.08–0.67 tokens/s throughput reported by FlexInfer (2024) characterises the cold-start phase (model not yet resident in RAM); the information-theoretic consistency claims of Theorem 5.7 apply to the steady-state inference phase (model fully resident in RAM)—these are distinct operational phases that must not be conflated. Under sub-16 b_e configurations, the latency impact of the embedding precision alignment step is evaluated in §3.5.
The feasibility analysis of §3.9 is conditional on the resolution of the above obligations, and provides scenario-specific budget accounting across b_{\text{KV}} \in {1, 4, 8, 16}, covering configurations from QJL's extreme 1-bit KV compression to standard FP16. Its function is not to assert that 2 GB deployment is guaranteed, but to transform the memory budget inequality from an architectural aspiration into a numerically bounded, scenario-decomposed engineering claim.
Author: Rosalind Pembrick