Part II: Theoretical Foundations and Technical Review
Chapter 2: Literature Review and Technical Background
This chapter systematically examines the fundamental limitations of existing federated learning and edge AI technologies through three critical analytical dimensions, establishing the theoretical foundations for the novel design of the Mnemosyne system. The first dimension exposes the fragility of computational security guarantees under long-horizon protection requirements. The second dimension investigates the whitening reconstruction problem inherent to extreme low-bit representation. The third dimension demonstrates how the preceding theoretical insights can be operationalised within a globally distributed, adversarially hostile environment.
2.1 The Bankruptcy of the Federated Learning Trust Model and the Harvest Now, Decrypt Later (HNDL) Threat
Since McMahan et al. (2017) introduced the FedAvg algorithm, Federated Learning (FL) has been widely regarded as a core framework for addressing privacy protection in large-scale distributed machine learning. The central premise of FedAvg—"data stays local, models travel"—seeks to achieve collaborative global model training without directly accessing raw data, by executing stochastic gradient descent (SGD) locally on edge devices and transmitting only model updates (gradients or weights). However, as research has deepened, the trust model underlying conventional federated learning is undergoing an unprecedented crisis of legitimacy.
2.1.1 The Communication Mechanism and Convergence Bottlenecks of FedAvg
The basic workflow of FedAvg proceeds as follows: the central server broadcasts the current model \theta_t to a randomly selected subset of K clients; each client executes E rounds of SGD on its local dataset D_k, yielding an updated model \theta_k^{t+1}; the server then collects these updates and computes a weighted average:
\theta_{t+1} = \sum_{k=1}^K \frac{n_k}{n} \theta_k^{t+1}
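To make the aggregation rule above concrete, the following minimal Python sketch implements the weighted average for three hypothetical clients; the update vectors and local data-set sizes are illustrative placeholders rather than outputs of any particular training run.

```python
import numpy as np

def fedavg_aggregate(client_updates, client_sizes):
    """Weighted average of client models: theta_{t+1} = sum_k (n_k / n) * theta_k^{t+1}."""
    n_total = sum(client_sizes)
    return sum((n_k / n_total) * theta_k
               for theta_k, n_k in zip(client_updates, client_sizes))

# Illustrative example: three clients with unequal local data volumes.
updates = [np.array([0.9, 1.1]), np.array([1.2, 0.8]), np.array([1.0, 1.0])]
sizes = [100, 300, 600]
print(fedavg_aggregate(updates, sizes))  # average is pulled towards the data-rich client
```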
Although this mechanism reduces communication frequency, it introduces severe directional drift in local updates under Non-Independent and Identically Distributed (Non-IID) data conditions—a phenomenon known as Client Drift—which substantially degrades convergence speed.
Karimireddy et al. (2020) demonstrate that this drift originates from the heterogeneity between the local objective gradient \nabla F_k(\theta) and the global objective gradient \nabla F(\theta). Even with the introduction of momentum terms or adaptive learning rates, under extreme Non-IID conditions (e.g., label skew or feature shift), the convergence bound of FedAvg degenerates to O(1/\sqrt{T}), with the steady-state error bounded below by the data heterogeneity constant. Li et al. (2020) propose FedProx, which introduces a Proximal Term \frac{\mu}{2} \|\theta - \theta_t\|^2 to constrain the magnitude of local updates, trading off local drift against convergence speed. Stich (2019) establishes the convergence guarantees of Local SGD under communication constraints.
To further improve communication efficiency, gradient compression techniques such as Deep Gradient Compression (Lin et al., 2018) and QSGD (Alistarh et al., 2017) have been proposed. Stich et al. (2018) investigate sparsified SGD with error feedback memory. signSGD (Bernstein et al., 2018) and PowerSGD (Vogels et al., 2019) further compress updates by exploiting gradient sign information and low-rank decomposition, respectively. Nevertheless, Kairouz et al. (2021) explicitly identify the tension between communication efficiency and privacy protection as one of the central unresolved open problems in federated learning.
2.1.2 The Mathematical Foundations and Evolution of Gradient Inversion Attacks
Gradient inversion attacks expose the fundamental flaw in the assumption that "gradients preserve privacy." The Deep Leakage from Gradients (DLG) attack, proposed by Zhu et al. (2019), formulates the privacy breach as an optimisation problem:
\min_{x^*, y^*} \left\| \nabla_{\theta} \mathcal{L}(F(x^*, \theta), y^*) - \nabla_{\theta} \mathcal{L} \right\|^2
where \nabla_{\theta} \mathcal{L} is the intercepted true gradient. By iteratively adjusting the dummy input x^* and label y^* until the synthesised gradient converges to the true gradient, the attacker achieves pixel-level reconstruction of training samples.
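The following sketch illustrates why the objective above is solvable at all: for a degenerate single-neuron model with a bias term, the intercepted gradient determines the private sample in closed form, so no iterative matching is even required (deep networks need the full optimisation loop of Zhu et al.). The toy model, random seed, and sample values are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: f(x) = w.x + b with squared loss L = 0.5 * (f(x) - y)^2.
w, b = rng.normal(size=4), 0.3            # parameters, known to the attacker
x_true, y_true = rng.normal(size=4), 2.0  # the private training sample

residual = (w @ x_true + b) - y_true      # assumed nonzero
g_w = residual * x_true                   # intercepted gradient dL/dw
g_b = residual                            # intercepted gradient dL/db

# In this degenerate case the DLG objective has a closed-form minimiser: the bias
# gradient reveals the residual, which rescales dL/dw back into the raw input.
# For deep networks the same recovery is performed by the iterative matching loop.
x_rec = g_w / g_b
y_rec = (w @ x_rec + b) - g_b

print(np.allclose(x_rec, x_true), np.isclose(y_rec, y_true))  # True True
```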
Subsequently, the iDLG attack by Zhao et al. (2020) exploits properties of the cross-entropy loss to recover labels directly from the gradient sign, substantially improving attack efficiency. Geiping et al. (2020) demonstrate that, even at larger batch sizes, high-fidelity image reconstruction remains achievable by employing a cosine similarity loss function augmented with a total variation (TV) regularisation term. More troublingly, Fredrikson et al. (2015) establish through model inversion attacks that statistical features of training data can be recovered by querying only model output confidence scores. Bagdasaryan et al. (2020) show how backdoor attacks exploit the aggregation mechanism of federated learning to implant malicious triggers. Shokri et al. (2017) demonstrate that membership inference attacks allow an adversary to determine whether a given sample participated in training by observing model outputs alone. Taken together, these results show that the privacy boundary of federated learning is far more fragile than originally assumed.
2.1.3 The Limitations of Differential Privacy and Post-Quantum Cryptography
In response to the above threats, the research community has broadly adopted Differential Privacy (DP) and Post-Quantum Cryptography (PQC). Abadi et al. (2016) propose DP-SGD, which provides (\varepsilon, \delta)-DP guarantees by injecting Gaussian noise into gradients. Specifically, for a gradient with sensitivity C, the injected noise standard deviation \sigma must satisfy:
\sigma \geq \frac{C \sqrt{2 \ln(1.25/\delta)}}{\varepsilon}
However, as the calibration of Abadi et al. (2016) implies, on data-sparse edge nodes, when the gradient signal magnitude approaches the clipping threshold C, the \sigma required to achieve a security level of \varepsilon < 1 often renders the effective signal-to-noise ratio vanishingly small. Mironov (2017) proposes Rényi Differential Privacy (RDP), which refines privacy budget accounting through a tighter moment generating function, yet does not resolve the fundamental accuracy–privacy trade-off. Mironov et al. (2019) provide a more precise analysis for the subsampled Gaussian mechanism. Dong et al. (2022) introduce Gaussian Differential Privacy (GDP), which achieves tighter bounds under the central limit theorem, but its protective effectiveness remains limited in the finite-sample regime typical of edge devices. Dwork and Roth (2014) note in their foundational treatment that the protective strength of differential privacy scales positively with dataset size—a property deeply unfavourable for data-sparse edge nodes.
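A minimal numerical sketch of the Gaussian-mechanism calibration above illustrates the point: for a fixed clipping threshold C and \delta = 10^{-5} (illustrative values), the required \sigma already exceeds the clipping threshold several-fold once \varepsilon drops below 1.

```python
import math

def gaussian_sigma(epsilon, delta, C):
    """Gaussian-mechanism calibration: sigma >= C * sqrt(2 * ln(1.25 / delta)) / epsilon."""
    return C * math.sqrt(2 * math.log(1.25 / delta)) / epsilon

C, delta = 1.0, 1e-5
for eps in (8.0, 1.0, 0.5):
    sigma = gaussian_sigma(eps, delta, C)
    # When gradient norms already sit near the clipping threshold C, noise with
    # standard deviation sigma >> C leaves almost no usable signal.
    print(f"eps={eps:>4}: sigma = {sigma:5.2f}  (noise-to-clip ratio {sigma / C:4.2f})")
```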
At the cryptographic level, the rapid advancement of quantum computing poses an increasingly severe challenge to classical encryption schemes grounded in integer factorisation (RSA) or discrete logarithms (ECC). NIST formalised the post-quantum cryptographic landscape in 2024 through FIPS 203–205, establishing ML-KEM (NIST, 2024a), ML-DSA (NIST, 2024b), and SLH-DSA (NIST, 2024c) as standardised algorithms. The academic precursors to these standards include CRYSTALS-Kyber (Bos et al., 2018) and CRYSTALS-Dilithium (Ducas et al., 2018), whose security rests on the hardness of the Learning With Errors (LWE) problem. Regev (2009) proves that LWE reduces to the Shortest Vector Problem (SVP) on lattices, and Peikert (2016) provides a systematic survey of this reduction. Nevertheless, the cryptanalytic break of the Rainbow signature scheme by Beullens (2022) serves as a stark reminder that security assumptions grounded in computational complexity remain vulnerable to sudden invalidation through mathematical breakthroughs or quantum advantage. Bernstein and Lange (2017) further observe that the security assessments of post-quantum algorithms remain in active evolution, lacking the long-term operational validation that classical schemes have accumulated.
2.1.4 The Harvest Now, Decrypt Later (HNDL) Threat Model
The most insidious threat is the Harvest Now, Decrypt Later (HNDL) paradigm. According to warnings issued by the NSA (2022) and Mosca (2018), adversaries can exploit the diminishing cost of persistent storage to intercept and archive all encrypted federated learning updates today, deferring decryption until sufficiently capable quantum computers become available in the future. For data with long-term privacy requirements—such as medical records, military model parameters, or long-horizon commercial secrets—existing PQC schemes merely delay the time to compromise, rather than providing an enduring security guarantee. CISA (2022) explicitly states in its quantum-readiness guidance that critical infrastructure must account for privacy protection windows extending beyond 25 years, rendering encryption schemes grounded in computational hardness assumptions logically insufficient over such timescales. NIST IR 8105 (2016) and NIST SP 800-208 (2020) further underscore the role of stateful hash-based signatures in long-term security architectures.
Intuitively, any security scheme whose guarantees depend on computational hardness assumptions will experience a continuous degradation in effective security strength as adversarial compute increases. Mosca (2018) notes that once quantum computers reach cryptographically relevant computational scales, this degradation will accelerate sharply, rendering schemes considered secure today nearly immediately compromised upon the achievement of quantum advantage. CISA's (2022) long-term security assessment provides policy-level corroboration of this degradation trajectory.
2.1.5 Section Summary: A Paradigm Shift in Trust Models
The foregoing analysis reveals that existing research on federated learning security is largely confined within the framework of computational complexity theory, and has not adequately addressed the systemic risk of computational assumption failure under quantum advantage and long-horizon decryption (HNDL) threats. Although Differential Privacy (DP) and Post-Quantum Cryptography (PQC) provide partial mitigation, the research community has yet to produce a unified defensive paradigm capable of simultaneously ensuring Information-Theoretic Security (ITS) and maintaining the communication efficiency demanded by edge devices. The absence of such a paradigm—one grounded in physical-layer guarantees rather than mathematical conjectures—constitutes the first layer of the theoretical motivation for the present research.
2.2 Information-Theoretic Perfect Secrecy and Computational Thermodynamic Barriers
When computational complexity can no longer be relied upon, information theory and thermodynamics become the defender's last line of defence. This section examines how these two disciplines jointly provide a theoretical basis for absolute security guarantees that transcend computational complexity assumptions.
2.2.1 Shannon Perfect Secrecy: The First Information-Theoretic Barrier
Shannon (1948, 1949) established the mathematical definition of Perfect Secrecy: a system achieves information-theoretic security if and only if the mutual information I(M; C) = 0 between the ciphertext C and the plaintext M. This condition implies P(M|C) = P(M)—that is, an adversary's prior knowledge of the plaintext is not updated in any way upon observing the ciphertext. Cover and Thomas (2006) extend this framework, emphasising that the conditional entropy H(M|C) must equal the original entropy H(M). This requirement is intimately connected to rate-distortion theory (Berger, 1971), for which Blahut (1972) provides the algorithmic machinery for computing the associated limits.
In the federated learning setting, this requirement demands that model updates undergo maximal randomisation prior to transmission. To achieve perfect information-theoretic security, the core design objective must ensure that the mutual information between the whitened gradient \nabla \theta_{\text{white}} and the original gradient \nabla \theta approaches zero (the specific mechanism is developed in Chapter 3). This demands not only algorithmic randomness, but also theoretical grounding at the physical level. Maurer (1993) proves that, even over a public channel, information-theoretic secure key agreement is achievable whenever both parties share partial common information, providing a theoretical basis for designing key agreement protocols that are independent of computational hardness assumptions. A whitening transformation that achieves I(M; C) \approx 0 in the algebraic and information-theoretic sense constitutes a necessary precondition for resisting adversaries of arbitrary computational power.
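The following sketch, which computes I(M;C) exactly for a one-bit message encrypted by XOR with a key, illustrates Shannon's condition rather than the whitening mechanism developed later: a uniform key yields exactly zero mutual information, while any key bias leaks. The message prior and bias values are illustrative assumptions.

```python
import numpy as np

def mutual_information(joint):
    """I(M;C) in bits from a joint probability table P(M=m, C=c)."""
    pm = joint.sum(axis=1, keepdims=True)
    pc = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (pm @ pc)[mask])))

# One-bit message M with P(M=1) = 0.7, encrypted as C = M XOR K.
p_m = np.array([0.3, 0.7])
for p_k1, label in [(0.5, "uniform key (one-time pad)"), (0.9, "biased key")]:
    p_k = np.array([1 - p_k1, p_k1])
    joint = np.zeros((2, 2))
    for m in (0, 1):
        for k in (0, 1):
            joint[m, m ^ k] += p_m[m] * p_k[k]
    print(f"{label}: I(M;C) = {mutual_information(joint):.4f} bits")
# Output: exactly 0 bits for the uniform key; any bias in the key leaks information.
```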
2.2.2 Computational Thermodynamic Constraints: The Second Physical Safeguard
Landauer (1961) advanced a revolutionary principle: any physically irreversible information-processing operation—such as a bit erasure—must dissipate at least k_B T \ln 2 of heat into the environment. This principle establishes a concrete physical linkage between abstract information bits and measurable energy quantities. Szilard (1929) anticipated this connection decades earlier through the Maxwell's demon thought experiment, demonstrating that the acquisition and erasure of information necessarily entail changes in entropy. Bérut et al. (2012) provide experimental confirmation via microscale optical traps, proving the existence of an insurmountable physical coupling between information processing and energy dissipation.
From an adversarial standpoint, any attempt to recover original data from a system employing irreversible whitening maps is, in essence, a large-scale "information reversal" operation. If a system can "erase" information locally on edge devices through irreversible whitening—converting it into heat in the thermodynamic sense—then an adversary seeking to reverse this process must expend a commensurate amount of energy. As Post-Quantum Cryptography (PQC) faces the systemic risk of instantaneous invalidation through unknown algorithmic breakthroughs in its mathematical hardness assumptions, the research community has increasingly sought absolute security foundations grounded in pure physical law. Bennett (1982, 1989) explores the thermodynamic foundations of computation; subsequent quantitative investigations into the physical limits of computation establish tight quantitative bounds linking information processing, bit erasure, and energy dissipation. Adopting these objective thermodynamic bounds—in conjunction with the temporal boundary set by quantum mechanical evolution—as the ultimate measure of cryptographic security strength represents both an effective countermeasure to HNDL threats and a necessary theoretical evolution: away from purely mathematical computational assumptions, toward the intersection of information physics.
2.2.3 The Quantum Speed Limit and the Cryptographic Implications of Physical Computational Upper Bounds
Beyond thermodynamic constraints, quantum mechanics imposes an upper bound on the speed of computation. Mandelstam and Tamm (1945) were the first to derive an evolution speed bound grounded in the energy–time uncertainty principle. Subsequently, Margolus and Levitin (1998) derived that the minimum time required for any physical system to evolve from one quantum state to an orthogonal state is bounded below by its average energy E above the ground state:
\Delta t \geq \frac{\pi \hbar}{2E}
Deffner and Campbell (2017) unify these two bounds, showing that the evolution speed of a system is constrained by the second moment of its energy distribution.
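A back-of-the-envelope evaluation of the Margolus–Levitin bound conveys its cryptographic force; the 1 kg mass-energy budget below is a hyper-conservative illustrative assumption in the spirit of Lloyd-style "ultimate computer" arguments, not a claim about any realistic adversary.

```python
import math

HBAR = 1.054_571_817e-34  # reduced Planck constant, J*s

def margolus_levitin_min_time(energy_joules):
    """Minimum time to reach an orthogonal state: dt >= pi * hbar / (2 * E)."""
    return math.pi * HBAR / (2 * energy_joules)

# Hyper-conservative adversary budget: 1 kg of mass converted entirely to energy.
E = 1.0 * (2.997_924_58e8) ** 2            # E = m * c^2, in joules
dt = margolus_levitin_min_time(E)
print(f"minimum state-flip time: {dt:.3e} s")
print(f"at most {1.0 / dt:.3e} orthogonal transitions per second")  # ~5.4e50
```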
The calibration of minimum security parameter bounds in quantum-secure cryptographic schemes has been examined through several complementary physical and information-theoretic frameworks. Lloyd (2000) establishes that the observable universe has performed at most N_{\text{universe}} \approx 10^{121} elementary operations since the Big Bang, providing a physically grounded ceiling on any conceivable adversary's total computational budget under hyper-conservative assumptions. Gottesman, Lo, Lütkenhaus, and Preskill (2004) analyse the security of quantum key distribution with imperfect devices, demonstrating that the effective key rate and requisite security parameter must simultaneously account for both classical and quantum eavesdropping strategies. Renner (2008) formalises composable security definitions for quantum cryptographic protocols, showing that the security parameter \kappa must satisfy conditions under which the residual information leakage \varepsilon decays exponentially in \kappa; within this framework, any physically motivated upper bound on adversarial computational budget directly yields a quantitative lower bound on \kappa. Taken jointly, these results indicate that calibrating a security parameter against the physical computational ceiling of Lloyd (2000), subject to Grover's (1996) quadratic quantum speedup, necessitates a \kappa substantially larger than classical security analyses would suggest. The specific lower bound and complete formal derivation are provided in §3.4.
2.2.4 Section Summary: From Algorithmic Security to Physical Security
The Landauer principle sets an energy cost on bit erasure, while the quantum speed limit sets a temporal ceiling on state evolution. The intersection of these two constraints establishes a physical-layer defensive foundation that transcends the framework of computational complexity, providing the theoretical motivation for exploring how this physical security level can be achieved under extreme low-bit representation. The formal security argument is developed in Chapter 3.
2.3 Quantum Speed Limits and Assumption-Free Cryptographic Authentication
Having established the conditions for perfect secrecy of content, the next challenge is to achieve identity authentication and data integrity verification in a trustless environment.
2.3.1 The Quantum Search Lower Bound and the Threat of Grover's Algorithm
Building on the physical computational upper bounds discussed in §2.2.3 (Lloyd, 2000; Grover, 1996), §3.4 of this dissertation will formally derive the lower bound on the system's required security parameter. The present section examines the quantum security lower bounds of authentication mechanisms within this physical framework. Grover's (1996) algorithm reduces unstructured search over a key space of size N to O(\sqrt{N}) quantum queries, and Bennett, Bernstein, Brassard, and Vazirani (1997) together with Boyer et al. (1998) confirm that this quadratic speedup is optimal within the broader framework of quantum query complexity: regardless of the quantum algorithm employed by the adversary, the \Omega(\sqrt{N}) query requirement constitutes an information-theoretically hard limit whenever the search problem lacks additional algebraic structure. This conclusion directly establishes the theoretical basis for the security strength specification of the Wegman-Carter authentication framework examined below.
2.3.2 Wegman-Carter MAC and Universal Hashing
The Message Authentication Codes (MACs) constructed within the Universal Hashing framework of Carter and Wegman (1979) ground their security guarantees entirely in information theory: for any two distinct messages m_1, m_2, the probability that a randomly selected hash function h \in \mathcal{H} produces a collision satisfies P(h(m_1) = h(m_2)) \leq \varepsilon, and this guarantee is completely decoupled from the adversary's computational power.
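A minimal sketch of a polynomial-evaluation universal hash, the kind of building block typically paired with a one-time key in Wegman-Carter MACs, makes the collision bound concrete; the prime modulus, messages, and trial count are illustrative assumptions.

```python
import random

P = 2_147_483_647  # prime modulus defining the hash range (2^31 - 1)

def poly_hash(key, blocks):
    """Polynomial-evaluation universal hash: h_r(m) = sum_i m_i * r^(i+1) mod p.
    Two distinct L-block messages collide for at most L choices of r, so the
    collision probability over a random key is at most L / p, independent of
    the adversary's computational power."""
    acc, rpow = 0, key
    for block in blocks:
        acc = (acc + block * rpow) % P
        rpow = (rpow * key) % P
    return acc

m1 = [17, 23, 99, 4]
m2 = [17, 23, 100, 4]           # differs from m1 in a single block
trials, collisions = 100_000, 0
for _ in range(trials):
    r = random.randrange(1, P)  # fresh random key, shared by both evaluations
    collisions += poly_hash(r, m1) == poly_hash(r, m2)
print(f"empirical collision rate: {collisions / trials:.1e}  (bound L/p = {len(m1) / P:.1e})")
```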
The information-theoretic authentication guarantee provided by the Wegman-Carter framework constitutes a core theoretical foundation for the protocol design in Chapter 3. The analysis by Zhandry (2012) supplements this at the computational level, providing a quantum query security evaluation of symmetric hash components under the Quantum Random Oracle Model (QROM). Its role is to verify the quantum security of hash components within the protocol; it does not affect the information-theoretic core of the WC-MAC guarantee. These two security arguments operate at distinct levels of the security hierarchy.
2.3.3 The Design Philosophy of Assumption-Free Cryptography
This philosophy of "assumption-free cryptography" is logically complementary to the absolute physical lower bounds of computational thermodynamics in its defensive rationale: the former ensures that authentication integrity is independent of computational power assumptions, while the latter ensures that information reversal carries an unavoidable physical cost. The security strength of information-theoretic authentication mechanisms does not degrade as adversarial computational capability increases—a stark contrast to the systemic risk in Post-Quantum Cryptography (PQC), where computational hardness assumptions may be instantaneously invalidated by mathematical breakthroughs (Bernstein & Lange, 2017). The formal security strength argument is developed in Chapter 3.
2.3.4 Section Summary: The De-Assumption of Authentication Mechanisms
Through a review of the quantum search lower bound and the Wegman-Carter authentication framework, this section demonstrates the theoretical feasibility of achieving identity authentication and data integrity verification without relying on computational hardness assumptions. The literature trajectory of "de-assumption" authentication mechanisms establishes information-theoretic MACs as the only reliable path to quantum-resistant authentication in a trustless environment, and as a necessary precondition for resisting adversaries of arbitrary computational advantage. With this, the theoretical closure of the first dimension is complete, furnishing the full bibliographic foundation for the formal security argument and protocol design in Chapter 3.
2.4 The Curse of Dimensionality and Quantisation Compression Bottlenecks in Large-Scale Models
As large language models (LLMs) enter the era of hundreds of billions of parameters, the gap between model scale and the storage and communication capacity of edge devices has become the "curse of dimensionality" constraining the development of federated learning. However, as this section demonstrates, the dimensional redundancy inherent in these models is simultaneously an opportunity for compression and a root cause of privacy leakage.
2.4.1 The Manifold Hypothesis and Intrinsic Dimensionality Estimation
Research by Aghajanyan et al. (2021) and Li et al. (2018) reveals a striking fact: despite the parameter space of modern neural networks reaching millions or even billions of dimensions, the training process effectively operates within an extremely low-dimensional subspace. This observation is formalised as the Manifold Hypothesis. The Maximum Likelihood Estimation (MLE) method proposed by Levina and Bickel (2004), and the TwoNN estimator proposed by Facco et al. (2017), provide the tools for quantifying this Intrinsic Dimensionality (d_{\text{int}}).
According to the TwoNN estimator of Facco et al. (2017) and the low-rank fine-tuning experiments of Aghajanyan et al. (2021), the intrinsic dimensionality of the embedding space in large language models is substantially lower than the nominal dimensionality—for models such as LLaMA-2-7B (Touvron et al., 2023), the effective intrinsic dimensionality of the 4096-dimensional embedding space is expected to be only a small fraction of that figure. This implies that model weights contain a large degree of statistical redundancy. While such redundancy permits aggressive compression, it equally implies that information is highly concentrated: an adversary capable of identifying the intrinsic dimensions can recover the original information at negligible cost. The TwoNN estimator of Facco et al. (2017) holds several advantages over the MLE method (Levina & Bickel, 2004) in high-dimensional sparse spaces: it requires no hyperparameter tuning and its computational cost depends only on the distances to the two nearest neighbours, making it particularly suitable for runtime dynamic measurement on edge devices (specific design details are provided in Chapter 3, MC1 obligation).
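A compact sketch of the TwoNN estimator, under one common maximum-likelihood reading of the Facco et al. (2017) ratio statistic, illustrates how a low intrinsic dimension is recovered even when the ambient dimension is 4096; the synthetic 5-dimensional manifold and sample count are illustrative assumptions, and the sketch relies on SciPy's k-d tree.

```python
import numpy as np
from scipy.spatial import cKDTree

def twonn_dimension(X):
    """TwoNN intrinsic-dimension estimate (Facco et al., 2017): the ratios
    mu_i = r2 / r1 of each point's two nearest-neighbour distances are modelled
    as Pareto(d), giving the maximum-likelihood estimate d = N / sum(log mu_i)."""
    dists, _ = cKDTree(X).query(X, k=3)   # column 0 is each point itself
    mu = dists[:, 2] / dists[:, 1]
    return len(X) / np.sum(np.log(mu))

rng = np.random.default_rng(0)
# A 5-dimensional manifold embedded linearly in a 4096-dimensional ambient space.
latent = rng.normal(size=(2000, 5))
X = latent @ rng.normal(size=(5, 4096))
print(f"estimated intrinsic dimension: {twonn_dimension(X):.1f}")  # close to 5, not 4096
```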
2.4.2 The Technical Boundaries of Post-Training Quantization (PTQ)
Current mainstream compression approaches are largely centred on Post-Training Quantization (PTQ). The GPTQ algorithm proposed by Frantar et al. (2023) minimises per-layer quantisation error via approximate Hessian matrices and serves as the standard benchmark for PTQ below 4-bit precision. AWQ, proposed by Lin et al. (2023), observes that weight importance correlates strongly with the magnitude of corresponding activation values; by selectively preserving 1% of the most important channels, model accuracy can be maintained without fine-tuning. LLM.int8(), proposed by Dettmers et al. (2022), employs mixed-precision decomposition to retain outlier values in FP16 format.
Nagel et al. (2021) provide a systematic treatment of these methods in their quantisation white paper, identifying per-channel quantisation and bias correction as the key factors in improving PTQ performance. However, the common deficiency of all these techniques is that they are "activation-aware" or "error-aware" but not "privacy-aware." The distribution of quantisation centroids after compression remains strongly dependent on the statistical distribution of the original weights. ACIQ, proposed by Banner et al. (2019), optimises the selection of clipping values but remains fundamentally oriented towards minimising mean-squared error (MSE) rather than maximising entropy. SpQR (Dettmers et al., 2023b) further explores near-lossless sparse quantised representations. In addition, classical vector quantisation methods—Product Quantization (Jégou et al., 2011), its optimised variant (Ge et al., 2013), HNSW (Malkov & Yashunin, 2020), and FAISS (Johnson et al., 2021)—although designed primarily for retrieval efficiency, are directly relevant to the privacy leakage implications of quantisation centroid statistical distributions, forming an important background for understanding the Goal 2 attacks described in §2.4.3.
2.4.3 Statistical Bias in Quantisation Centroids and the Goal 2 Attack
The statistical bias properties of quantisation centroids described above constitute the bibliographic precondition for a class of weight-reconstruction attacks that this dissertation classifies as Goal 2 attacks (this classification is a research-defined term; its formal definition—including adversary capabilities, attack surface, and security objectives—is provided in §3.1).
Quantisation is inherently a many-to-one nonlinear mapping; when the mapped points are distributed non-uniformly in the target space, information leakage "hotspots" are formed. For example, in 4-bit quantisation, if weights in certain intervals appear at extremely high frequency, an adversary can infer approximate values of the original weights by observing the frequency distribution of quantisation indices. This highlights the fundamental limitation of traditional compression techniques that rely exclusively on MSE minimisation when deployed in systems requiring information-theoretic security strength.
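The leakage mechanism can be made concrete with a short simulation: deterministic 4-bit quantisation of a Gaussian weight tensor yields an index histogram that mirrors the weight density, so the index entropy falls well below the nominal 4 bits. The weight scale and clipping range below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=100_000)          # original weights (private)

# Deterministic 4-bit uniform quantisation over a fixed clipping range.
levels, clip = 16, 0.08
edges = np.linspace(-clip, clip, levels + 1)
idx = np.clip(np.digitize(w, edges) - 1, 0, levels - 1)

freq = np.bincount(idx, minlength=levels) / len(w)
entropy = -np.sum(freq[freq > 0] * np.log2(freq[freq > 0]))
# The index histogram mirrors the Gaussian weight density: a few central bins
# carry most of the mass, so observing an index already pins the weight down to
# a narrow high-probability interval -- the leakage hotspot described above.
print(np.round(freq, 3))
print(f"index entropy: {entropy:.2f} bits (maximum possible: {np.log2(levels):.0f} bits)")
```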
2.4.4 Section Summary: The Inherent Conflict Between Compression and Privacy
Existing quantisation compression techniques—including PTQ, GPTQ, and AWQ—focus exclusively on minimising MSE and preserving model accuracy, entirely neglecting privacy protection requirements in the quantisation process. Under extreme low-bit quantisation (e.g., below 4-bit), the statistical distribution of quantisation centroids forms information leakage hotspots, enabling adversaries to recover approximate original weights through frequency analysis. This design gap—the complete absence of privacy awareness in existing compression techniques—provides the theoretical motivation for investigating a novel representation method that jointly achieves extreme low-bit quantisation and information-theoretic security.
2.5 The Universal Bounds of Rate-Distortion Theory and Randomised Dithered Quantisation
To rigorously define the trade-off between compression and security in mathematical terms, it is necessary to return to Rate-Distortion Theory (R-D Theory).
2.5.1 The Rate-Distortion Function and the Shannon Lower Bound
Lloyd (1982) and Max (1960) derive the iterative optimality conditions for the optimal scalar quantiser—the Lloyd-Max conditions. Gersho and Gray (1992) extend these to vector quantisation (VQ), establishing the theoretical foundations of modern signal compression.
According to rate-distortion theory (Berger, 1971), for a given source X and distortion measure d(x, \hat{x}), the minimum bit rate R required at distortion level D is given by the rate-distortion function R(D). The Blahut-Arimoto algorithm (Blahut, 1972) provides numerical tools for computing R(D) for arbitrary distributions. For a Gaussian source, the rate-distortion function takes the form:
R(D) = \frac{1}{2} \log_2\!\left(\frac{\sigma^2}{D}\right)
This sets the physical limit of compression: any encoding below this rate necessarily incurs distortion exceeding D. Blahut (1987) further establishes that the optimal codeword allocation must satisfy entropy maximisation, confirming entropy maximisation as a necessary condition for rate-distortion optimal schemes in theory, and providing direct bibliographic support for the entropy-objective representation design in subsequent chapters.
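A two-line evaluation of the Gaussian rate-distortion function, in both its direct and inverse forms, fixes the orders of magnitude relevant to later chapters; the unit-variance source and the log2(3) ≈ 1.58 bit ternary budget are illustrative.

```python
import math

def rate_gaussian(sigma2, D):
    """R(D) = 0.5 * log2(sigma^2 / D) bits per sample, valid for 0 < D <= sigma^2."""
    return 0.5 * math.log2(sigma2 / D)

def distortion_gaussian(sigma2, R):
    """Inverse form: the minimum achievable MSE at rate R is sigma^2 * 2^(-2R)."""
    return sigma2 * 2 ** (-2 * R)

sigma2 = 1.0
print(f"R(D = sigma^2/16)     = {rate_gaussian(sigma2, sigma2 / 16):.2f} bits/sample")
print(f"D_min at log2(3) bits = {distortion_gaussian(sigma2, math.log2(3)):.3f} * sigma^2")
# No encoder operating at the ternary budget of ~1.58 bits/sample can achieve an
# MSE below sigma^2 / 9 on a Gaussian source.
```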
2.5.2 The Information-Theoretic Advantages of Dithered Quantisation
Conventional deterministic quantisation is information-theoretically insecure, because a strong statistical correlation exists between the quantisation error and the original signal. Schuchman (1964) proposes a revolutionary scheme: by injecting an independent random noise signal (dither) prior to quantisation, the quantisation error is rendered statistically independent of the original signal. Gray and Stockham (1993) further analyse the spectral smoothing effect of dither on quantisation noise.
Zamir and Feder (1996) prove that through lattice dithering, quantisation error can not only be made independent of the original signal, but can also achieve the rate-distortion bound for Gaussian sources. Although this "randomised quantisation" technique increases MSE by approximately a factor of 1.5 (Widrow & Kollar, 2008), it constitutes the only viable path to information-theoretic privacy. Ziv (1985) also establishes that, within the framework of universal quantisation, dithered quantisation is the key mechanism for achieving the optimal rate-distortion trade-off.
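The statistical effect of subtractive dithering can be verified numerically: the sketch below compares deterministic and dithered quantisation of a Gaussian signal and measures the correlation between the quantisation error and a period-\Delta feature of the signal, which collapses to approximately zero only in the dithered case. The step size, sample count, and test feature are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=200_000)
delta = 0.5                                    # quantisation step

quantise = lambda v: delta * np.round(v / delta)

# Plain deterministic quantisation: the error is a deterministic function of x.
err_plain = quantise(x) - x

# Subtractive dithered quantisation (Schuchman's condition): add uniform dither
# spanning one quantisation cell before rounding, subtract it afterwards.
d = rng.uniform(-delta / 2, delta / 2, size=x.shape)
err_dith = (quantise(x + d) - d) - x

feature = np.sin(2 * np.pi * x / delta)        # a period-delta feature of the signal
for name, e in [("deterministic", err_plain), ("dithered", err_dith)]:
    corr = np.corrcoef(e, feature)[0, 1]
    print(f"{name:>13}: corr(error, feature) = {corr:+.3f}")
# The deterministic error tracks the signal (|corr| ~ 0.8); the dithered error does not.
```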
2.5.3 The KSG Estimator and Whitening Effect Verification
The KSG estimator (Kraskov et al., 2004) estimates mutual information from k-nearest-neighbour distances without imposing any distributional assumptions, and represents the current best practice for verifying statistical independence in high-dimensional spaces (Jiao et al., 2015; Paninski, 2003). To quantify entropy and randomness, Dodis et al. (2004, 2008) propose fuzzy extractors and the min-entropy framework, while NIST SP 800-90B (2018) provides standardised testing protocols for entropy source validation. Grounded in the lattice dithering theory of Zamir and Feder (1996), achieving approximate statistical independence through randomised quantisation under extreme low-bit conditions (e.g., 2-bit) is theoretically feasible; the specific conditions and error bound analysis are provided in Chapter 3.
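As a practical verification pattern, a k-nearest-neighbour mutual-information estimator in the Kraskov family (here, scikit-learn's mutual_info_regression, which implements such an estimator) can be applied directly to the quantisation error; the signal, step size, and sample count below are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.normal(size=20_000)
delta = 0.5
quantise = lambda v: delta * np.round(v / delta)

err_plain = quantise(x) - x                              # deterministic error
d = rng.uniform(-delta / 2, delta / 2, size=x.shape)
err_dith = (quantise(x + d) - d) - x                     # subtractively dithered error

for name, e in [("deterministic", err_plain), ("dithered", err_dith)]:
    # k-NN mutual-information estimate between the signal and the error.
    mi = mutual_info_regression(x.reshape(-1, 1), e, n_neighbors=3, random_state=0)[0]
    print(f"{name:>13}: estimated I(x; error) = {mi:.3f} nats")
# Dithering drives the estimated mutual information to approximately zero.
```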
2.5.4 Section Summary: Randomisation as the Price of Security
This section establishes randomised quantisation (dithered quantisation) as the only feasible path to information-theoretic statistical independence, and rate-distortion theory as the central framework for measuring the associated trade-off cost. Although dithered quantisation introduces additional distortion, it achieves statistical independence in the information-theoretic sense; guided by the rate-distortion bound, it is capable of delivering a provably optimal trade-off, furnishing the mathematical foundations for the representation design in Chapter 3.
2.6 Discrete Optimal Transport and Representation Whitening on the Extreme Ternary Lattice
When the compression limit is pushed to the ternary lattice \{-1, 0, +1\}, conventional whitening methods based on the continuous cumulative distribution function (CDF) break down entirely.
2.6.1 BitNet b1.58 and the Rise of Ternary Neural Networks
Ma et al. (2024) demonstrate in BitNet b1.58 that ternary models can match the performance of full-precision counterparts while substantially reducing inference energy consumption. Earlier work by Hubara et al. (2016) and Li et al. (2016) laid the groundwork for this discrete weight structure. Wang et al. (2023) further demonstrate the scalability of BitNet on large-scale language models.
However, the algebraic structure of the ternary lattice is severely constrained. Although XNOR-Net (Rastegari et al., 2016) demonstrates the extreme computational efficiency of binarised operations, its accuracy degradation remains significant. The ternary lattice achieves a balance between sparsity and expressiveness by introducing a zero state. Nevertheless, conventional whitening methods destroy the integer properties of the ternary lattice, precluding the use of hardware-level integer arithmetic acceleration. Stochastic binarisation (Courbariaux et al., 2015, 2016) introduces randomness but leaves the resulting distribution uncontrolled.
2.6.2 Discrete Optimal Transport Theory
To resolve this problem, Discrete Optimal Transport (Villani, 2003; Peyré & Cuturi, 2019) provides a rigorous mathematical framework for mapping an arbitrary distribution onto a uniform distribution. Under its theoretical formulation, the whitening process can be modelled as finding an optimal coupling that transports the original weight distribution to the target uniform distribution while minimising the transport cost. On the ternary lattice \{-1, 0, +1\}^d, this is equivalent to finding a bijective mapping such that the mapped vectors are statistically close to the uniform distribution.
2.6.3 The Sinkhorn Algorithm and the Entropy-Maximising Map
To solve the above problem efficiently, the entropy-regularised Sinkhorn algorithm proposed by Cuturi (2013) provides a practical tool for computing discrete optimal transport. By introducing an entropic regularisation term, the Sinkhorn algorithm converts the optimal transport problem into a simple matrix scaling procedure. The POT library developed by Flamary et al. (2021) provides a standard implementation reference for this discrete OT computation. Theoretically, the capacity of discrete optimal transport to approximate the uniform distribution in total variation (TV) distance—the specific error bound \delta and its derivation are provided in Chapter 3—furnishes a rigorous mathematical path to entropy maximisation on the ternary lattice. This further demonstrates that, under extreme quantisation compression, achieving statistical uniformity through distributional alignment rather than error minimisation is the only viable direction for satisfying information-theoretic security requirements.
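A minimal NumPy sketch of the entropy-regularised Sinkhorn iteration over the three ternary symbols illustrates the mechanics; the skewed source distribution, uniform target, squared cost, and regularisation strength are illustrative assumptions and do not reproduce the construction developed in Chapter 3.

```python
import numpy as np

def sinkhorn(p, q, C, reg=0.5, iters=200):
    """Entropy-regularised optimal transport (Cuturi, 2013): alternately rescale
    K = exp(-C / reg) until its row and column marginals match p and q."""
    K = np.exp(-C / reg)
    u = np.ones_like(p)
    for _ in range(iters):
        v = q / (K.T @ u)
        u = p / (K @ v)
    return np.diag(u) @ K @ np.diag(v)        # transport plan with marginals (p, q)

symbols = np.array([-1.0, 0.0, 1.0])
p = np.array([0.15, 0.70, 0.15])              # skewed empirical ternary distribution
q = np.full(3, 1.0 / 3.0)                     # maximum-entropy (uniform) target
C = (symbols[:, None] - symbols[None, :]) ** 2

plan = sinkhorn(p, q, C)
print(np.round(plan, 3))
print("source marginal:", np.round(plan.sum(axis=1), 3))   # ~p
print("target marginal:", np.round(plan.sum(axis=0), 3))   # ~uniform
```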
2.6.4 Section Summary: A Rigorous Reconstruction in Discrete Space
This section establishes the theoretical feasibility of achieving entropy maximisation on the ternary lattice via discrete optimal transport, and the necessity of achieving statistical uniformity through distributional alignment rather than error minimisation. This literature trajectory provides the rigorous mathematical foundation for the formal security argument in subsequent chapters.
2.7 Distributed Shared Memory and Consensus Throughput Collapse in Trustless Environments
When the perspective shifts from single-node security to distributed systems operating at global scale, existing architectures are found to be severely inadequate. This section analyses the fundamental deficiencies of existing distributed shared memory and consensus protocols in adversarially hostile environments.
2.7.1 The Trust Assumptions of Conventional DSM and Its Byzantine Collapse
The Distributed Shared Memory (DSM) architecture proposed by Li and Hudak (1989) and the TreadMarks system of Keleher et al. (1994) address programming convenience, but their design premise is that nodes share a baseline of mutual trust. The release consistency model proposed by Gharachorloo et al. (1990) substantially improves parallel performance by distinguishing acquire and release operations. The directory-based protocol of Censier and Feautrier (1978) represents an early attempt at large-scale cache coherence. The Orca system of Bal et al. (1992) seeks to optimise DSM through language-level object replication, yet remains unable to address malicious tampering by Byzantine nodes.
In a Byzantine environment where up to 33% of nodes are adversarial, an attacker can trivially corrupt system consistency by sending false invalidation signals or fabricating data replicas. Performance evaluations by Gharachorloo et al. (1991) show that even weak consistency models cannot resist attacks on memory semantics in the absence of hardware-level security guarantees. The survey by Adve and Gharachorloo (1996) further establishes the central role of memory consistency models in distributed systems, while simultaneously exposing their fragility in trustless environments. The topology of distributed systems has evolved from early random graphs (Erdős & Rényi, 1960) to DHT-based structured networks, including Kademlia (Maymounkov & Mazières, 2002), Chord (Stoica et al., 2001), Pastry (Rowstron & Druschel, 2001), and CAN (Ratnasamy et al., 2001).
2.7.2 From PBFT to HotStuff: The Evolution and Limitations of Consensus Protocols
At the consensus layer, while Castro and Liskov's (2002) PBFT establishes the foundations of Byzantine Fault Tolerance (BFT), its O(n^2) message complexity renders it unscalable to tens of thousands of edge nodes. Lamport's (1998) Paxos serves as the benchmark in the crash fault-tolerant (CFT) domain but offers no defence in Byzantine environments. Tendermint (Buchman et al., 2018) simplifies the protocol flow but remains constrained by synchrony assumptions. Raft (Ongaro & Ousterhout, 2014) improves understandability but its strong leader dependency creates a single point of failure in adversarial settings.
HotStuff (Yin et al., 2019) reduces message complexity to O(n) through a three-phase pipeline and threshold signature aggregation. However, it continues to face challenges in wide-area networks characterised by high latency and unstable bandwidth. The Partial Synchrony Model proposed by Dwork et al. (1988) provides a theoretical framework for consensus protocols, but in the dynamic network conditions typical of edge device deployments, its Global Stabilisation Time (GST) assumption is frequently violated. For ensuring protocol correctness, TLA+ (Lamport, 1994, 2002) and the TLC model checker (Yu et al., 1999) have become standard tools in distributed system design. To defend against Byzantine attacks, robust aggregation algorithms such as Krum (Blanchard et al., 2017) and Bulyan (El Mhamdi et al., 2018) have been proposed, relying on robust statistics such as the geometric median (Weiszfeld, 1937; Vardi & Zhang, 2000; Minsker, 2015) or trimmed mean (Yin et al., 2018).
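A compact sketch of Krum, following the scoring rule of Blanchard et al. (2017), illustrates how distance-based selection suppresses colluding outliers; the update dimension, cluster values, and number of Byzantine clients are illustrative assumptions.

```python
import numpy as np

def krum(updates, f):
    """Krum (Blanchard et al., 2017): return the update whose summed squared
    distance to its n - f - 2 nearest peers is smallest."""
    n = len(updates)
    U = np.stack(updates)
    d2 = np.sum((U[:, None, :] - U[None, :, :]) ** 2, axis=-1)
    scores = [np.sort(np.delete(d2[i], i))[: n - f - 2].sum() for i in range(n)]
    return updates[int(np.argmin(scores))]

rng = np.random.default_rng(0)
honest = [rng.normal(1.0, 0.1, size=10) for _ in range(8)]
byzantine = [np.full(10, 50.0) for _ in range(2)]      # two colluding outliers
chosen = krum(honest + byzantine, f=2)
print(f"selected update mean: {chosen.mean():.2f}")     # ~1.0: the outliers are rejected
```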
2.7.3 The Decoupling Principle of Consensus and Data Transmission
The foregoing analysis reveals that the fundamental bottleneck of conventional distributed systems lies in the tight coupling of consensus and data transmission. The theoretical properties of erasure codes (Reed & Solomon, 1960; Rabin, 1989) and Merkle Trees (Merkle, 1987) demonstrate that verifying data integrity without global consensus is technically feasible—IPFS (Benet, 2014) and Certificate Transparency (Laurie et al., 2013) provide real-world deployment precedents of this engineering principle. This supplies direct bibliographic support for the architectural design in Chapter 3 that explores decoupling of control flow and data flow.
2.7.4 Section Summary: The Collapse and Reconstruction of Consensus Paradigms
The fundamental limitation of existing Distributed Shared Memory (DSM) architectures and Byzantine fault-tolerant protocols is that their design assumes stable, low-latency communication among nodes, so that message complexity (ranging from O(n^2) to O(n^3)) grows prohibitively when scaled to large numbers of edge nodes. The core contradiction that remains unresolved in the existing literature is: how to simultaneously achieve sub-linear message complexity and strong consistency guarantees in the absence of a central coordinator. Existing PBFT variants, while capable of guaranteeing both safety and liveness (under f < n/3) in the Partial Synchrony Model (Dwork et al., 1988), impose an O(n^2) message overhead that becomes unacceptable when edge node counts scale to tens of thousands. Moreover, dynamic wide-area network conditions frequently render the GST assumption of the partial synchrony model unsatisfiable in practice. According to the FLP impossibility theorem (Fischer et al., 1985), no deterministic consensus protocol can guarantee both safety and termination in a fully asynchronous network in which even a single process may fail, which is why such protocols enter liveness stalls in real deployments.
Revisiting the history of distributed systems, mapping hardware memory semantics onto wide-area networks has precedent. For instance, Reiter et al.'s (2000) work on dynamic Byzantine quorum systems, and subsequent explorations of speculative execution in distributed shared memory, all represent attempts to reconstruct memory consistency in trustless environments. However, the O(n^2) message complexity of conventional BFT protocols renders these attempts incapable of overcoming communication bottlenecks and liveness collapse when scaled to global edge nodes. It is precisely because of this comprehensive failure of existing literature at the intersection of macro-scale consensus and micro-scale memory semantics that the subsequent exploration of how to leverage the lightweight characteristics of the hardware MESI state machine to reconstruct distributed fault-tolerant protocols acquires an irrefutable technical motivation.
2.8 The Memory Wall and Zero-Copy System Architecture on Edge Devices
Edge devices typically operate under severe memory constraints. To deploy large-scale models in such environments, the system must address the fundamental constraints of the memory wall at the architectural level. This requires full-stack optimisation spanning from operating system memory management to inference frameworks; the existing literature provides critical engineering paths in the following directions.
2.8.1 PagedAttention and Virtual Memory Management
PagedAttention, proposed by Kwon et al. (2023), addresses the fragmentation problem of the KV Cache in LLM inference by drawing on the paging mechanism of operating systems. In addition, sliding window attention (Beltagy et al., 2020) and multi-query attention (Shazeer, 2019) provide complementary approaches to compressing the KV Cache. Sparse Transformers (Child et al., 2019) and Grouped-Query Attention (GQA; Ainslie et al., 2023) further reduce the computational and storage overhead of attention mechanisms. Transformer-XL (Dai et al., 2019) extends context length through segment-level recurrence. Approximate computing (Sekanina, 2017; Mrazek et al., 2021; Mittal, 2016) trades partial accuracy for improved energy efficiency—a trade-off with substantial potential in edge AI scenarios. The combination of these techniques demonstrates the technical feasibility of deploying 7B-scale models on edge devices with 2 GB of memory (specific integration is provided in Chapter 3).
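The pressure that motivates these techniques is easy to quantify: the short sketch below computes the KV-cache footprint for a LLaMA-2-7B-like configuration (32 layers, 32 KV heads, head dimension 128), showing that a 4096-token context alone consumes roughly 2 GiB in FP16; the lower-precision figure is an illustrative what-if.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem):
    """KV-cache footprint: 2 (K and V) * layers * kv_heads * head_dim * seq_len * width."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# LLaMA-2-7B-like configuration: 32 layers, 32 KV heads, head dimension 128.
for label, width in [("FP16", 2), ("4-bit equivalent", 0.5)]:
    gib = kv_cache_bytes(32, 32, 128, seq_len=4096, bytes_per_elem=width) / 2**30
    print(f"{label:>16}: {gib:.2f} GiB for a 4096-token context")
# FP16 alone already consumes ~2 GiB, hence the need for paging and quantised caches.
```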
2.8.2 mmap and GGUF: The Foundations of Zero-Copy Inference
For model loading, the mmap zero-copy loading and GGUF format popularised by llama.cpp (Gerganov et al., 2023) provide a standardised engineering path for copy-free inference on edge devices. According to Linux kernel documentation (2024) and the analysis by Hennessy and Patterson (2019), mmap allows an application to map a disk file directly into the virtual address space. The Berkeley Fast File System work of McKusick et al. (1984) laid file-system groundwork on which such memory-mapped I/O paths were later built. This approach achieves zero-copy inference by eliminating redundant memory transfers between storage and compute buffers. In addition, the Linux kernel's zRAM mechanism further expands available memory space by establishing a compressed block device within RAM. Bovet and Cesati's (2005) in-depth treatment of Linux kernel memory management provides the theoretical underpinning for zero-copy architectures at the OS level.
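A minimal Python sketch of the mmap pattern follows; the file name model_weights.bin, the FP16 dtype, and the tensor shape are hypothetical, and the sketch assumes a file of sufficient size already exists on disk.

```python
import mmap
import numpy as np

# Map the weight file directly into the virtual address space: pages are faulted
# in from disk on first access and can be evicted by the kernel under memory
# pressure, so no explicit read()/copy into a separate inference buffer occurs.
with open("model_weights.bin", "rb") as f:      # hypothetical weight file
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Interpret a region of the mapping as a tensor without copying it.
    weights = np.frombuffer(mm, dtype=np.float16, count=4096 * 4096)
    print(weights.shape, weights.dtype)         # a zero-copy view backed by the mapping
```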
2.8.3 Low-Rank Adaptation (LoRA) and Delta Model Broadcasting
To enable efficient model update broadcasting, the low-rank update principles underlying LoRA (Hu et al., 2022) and GaLore (Zhao et al., 2024) provide the key theoretical foundations. LoRA demonstrates that training only two low-rank matrices can achieve performance comparable to full-parameter fine-tuning. QLoRA (Dettmers et al., 2023a) extends this approach to quantised models. Prompt Tuning (Lester et al., 2021) illustrates yet another parameter-efficient fine-tuning path. The low-rank property of LoRA (Hu et al., 2022) implies that distributed nodes can in principle transmit only low-rank delta patches, substantially reducing communication overhead. Biderman et al. (2024) demonstrate that LoRA exhibits superior knowledge retention in continual learning scenarios.
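The communication argument can be made concrete with a short sketch: for a 4096-wide layer and rank 16 (illustrative values), the low-rank delta amounts to under one percent of the dense weight matrix, which is what makes delta-patch broadcasting attractive.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 4096, 16                        # layer width and LoRA rank (illustrative)

W0 = rng.normal(size=(d, d))           # frozen base weight, never retransmitted
A = rng.normal(size=(r, d)) * 0.01     # trainable low-rank factors
B = rng.normal(size=(d, r)) * 0.01

def forward(x):
    # Effective weight is W0 + B @ A, but the dense delta is never materialised.
    return x @ W0.T + (x @ A.T) @ B.T

full_params, delta_params = d * d, 2 * d * r
print(f"full matrix: {full_params:,} params; low-rank delta: {delta_params:,} params "
      f"({100 * delta_params / full_params:.1f}% of the dense broadcast)")
```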
2.8.4 Section Summary: System-Level Optimisation at the Extreme
The foregoing combination of techniques demonstrates the engineering feasibility of deploying 7B-scale models on edge devices with 2 GB of memory. The technical path from PagedAttention to zero-copy inference collectively addresses the physical constraints of edge devices in terms of memory and bandwidth, providing the engineering feasibility basis for the system architecture design in Chapter 3.
2.9 Cross-Domain Synthesis of Hardware Branch Prediction and Wide-Area Cache Coherence (BFT-MESI)
As the concluding section of this chapter, the analysis turns to the cross-domain correspondence between hardware microarchitectural design semantics and globally distributed fault-tolerant protocols, revealing a design space in the intersection of these two domains that existing literature has not yet formally explored.
2.9.1 The 2-Bit Saturating Counter and Speculative Execution Mechanisms
The 2-bit saturating counter branch predictor proposed by Smith (1981), together with the two-level adaptive predictors of Yeh and Patt (1992), is a cornerstone of modern CPU performance. McFarling (1993) further proposes a hybrid scheme that combines multiple predictors, while Jiménez and Lin (2001) demonstrate the use of perceptrons for more complex branch prediction. Introducing the speculative semantics of CPU branch prediction into distributed protocols—allowing nodes to perform predictive execution based on local cache state before global consensus is achieved—represents a theoretical path to latency optimisation. Combined with the Speculative Decoding framework of Leviathan et al. (2023) and Chen et al. (2023), edge nodes can perform speculative inference based on their local cache state before global consensus is obtained. Block-parallel decoding (Stern et al., 2018) provides a theoretical predecessor to this speculative mechanism. Knowledge distillation (Hinton et al., 2015; Kim & Rush, 2016) further enables the transfer of knowledge from large models to lightweight small models.
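A minimal sketch of the 2-bit saturating counter illustrates the hysteresis property that the speculative analogy above relies on; the loop-branch outcome sequence is an illustrative assumption.

```python
class SaturatingCounter:
    """2-bit saturating counter (Smith, 1981): states 0-1 predict not-taken,
    states 2-3 predict taken; a single misprediction cannot flip a strong state."""
    def __init__(self):
        self.state = 2                              # weakly taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

# A loop branch: taken nine times, then not taken once at loop exit.
ctr, outcomes, hits = SaturatingCounter(), [True] * 9 + [False], 0
for taken in outcomes:
    hits += ctr.predict() == taken
    ctr.update(taken)
print(f"{hits}/{len(outcomes)} correct; the single exit mispredict leaves the "
      f"learned 'taken' bias intact")
```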
2.9.2 BFT-MESI: Cache Coherence at Global Scale
This class of cross-domain synthesis—mapping the state-machine semantics of the single-node MESI cache coherence protocol onto a distributed Byzantine environment—represents a design space that existing literature has not yet formally explored. Such a design extends the four-state MESI machine of a single-node CPU to a DHT network at global scale. The original MESI protocol of Papamarcos and Patel (1984) was designed to address cache coherence in multiprocessor systems, while Goodman (1983) established its foundations. AMD's (2024) MOESI protocol extends the state space further. In such an architecture, the state of each model shard is maintained not only by the local node, but is also supervised by Swarm consensus. This draws on the tensor parallelism ideas of Shoeybi et al. (2019) in Megatron-LM and the pipeline parallelism strategy of Huang et al. (2019) in GPipe; Narayanan et al. (2021) further optimise these parallel strategies. By leveraging the Top-Byte Ignore (TBI) feature of the ARMv8 architecture that underpins ARM Memory Tagging (Serebryany, 2018), zero-overhead routing tags can be implemented at the pointer level. The research of Jacobs et al. (1991) and Fedus et al. (2022) on MoE routing mechanisms also provides inspiration for the static pointer-map routing design at the L3 cache layer. Sparse-gated MoE (Shazeer et al., 2017) demonstrates how conditional computation can be used to scale model capacity.
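For reference, the following sketch enumerates the textbook single-node MESI transitions (simplified, with bus transactions and write-backs abstracted into event labels) that such a distributed extension would have to re-express under Byzantine supervision; the event names are illustrative.

```python
# Textbook single-node MESI transitions, with bus transactions and write-backs
# abstracted into event labels; a distributed BFT extension would attach Swarm
# attestation to each of these state changes.
MESI_TRANSITIONS = {
    ("I", "read_miss_shared"):    "S",
    ("I", "read_miss_exclusive"): "E",
    ("I", "write_miss"):          "M",   # read-for-ownership, then modify
    ("S", "write_hit"):           "M",   # requires invalidating other sharers
    ("E", "write_hit"):           "M",   # silent upgrade, no bus traffic
    ("M", "remote_read"):         "S",   # dirty line written back, then shared
    ("S", "remote_write"):        "I",
    ("E", "remote_write"):        "I",
    ("M", "remote_write"):        "I",
}

state = "I"
for event in ("read_miss_exclusive", "write_hit", "remote_read", "remote_write"):
    state = MESI_TRANSITIONS[(state, event)]
    print(f"{event:>20} -> {state}")
```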
2.9.3 The Theoretical Sublimation of Cross-Domain Synthesis
Narayanan et al. (2021) establish in their Megatron-LM research that overlapping communication and computation is the key to improving efficiency in large-scale model training. Leviathan et al. (2023) establish the viability conditions for speculative decoding in terms of throughput expectations. The speculative hit rate, rollback cost, and local computation time stand in a rigorous mathematical trade-off relationship; the necessary and sufficient conditions for this trade-off, along with the formal derivation, are developed in Chapter 4.
2.9.4 Section Summary: The Global Sublimation of Hardware Thinking
This section establishes the literature gap in cross-domain synthesis between hardware MESI state-machine semantics and distributed Byzantine fault tolerance: existing literature has insufficiently explored MESI-class lightweight coherence mechanisms in distributed environments. This formally unexplored design space constitutes the core technical motivation for the architectural design in subsequent chapters. The techniques reviewed across this chapter—from physical limits to system architecture—collectively establish the theoretical foundations for this interdisciplinary research programme.
2.10 Synthesis: The Tripartite Design Imperative
The nine sections of this chapter have established, across three orthogonal analytical dimensions, that the existing literature fails to provide a unified defence architecture adequate for the post-quantum edge-computing threat landscape.
The first dimension (§§2.1–2.3) demonstrates that no security guarantee currently deployed in federated learning withstands a long-horizon adversary: differential privacy collapses into an untenable accuracy–privacy trade-off on data-sparse edge nodes; post-quantum cryptography displaces rather than eliminates its computational hardness assumption, which remains structurally time-dependent; and the HNDL threat vector renders even forward-secure schemes retroactively vulnerable. The literature's own trajectory—from DP-SGD through NIST PQC standardisation—converges on a single unavoidable conclusion: only information-theoretic security, grounded in physical law rather than mathematical conjecture, is immune to this class of long-horizon attack.
The second dimension (§§2.4–2.6) reveals that the compression techniques required for edge deployment are not merely privacy-neutral—they are privacy-adversarial by design. PTQ, GPTQ, and AWQ optimise exclusively for MSE distortion, systematically concentrating statistical mass into quantisation centroids that function as information leakage hotspots. Rate-distortion theory establishes the theoretical boundary within which any compression scheme must operate, while dithered quantisation and discrete optimal transport on the ternary lattice \{-1, 0, +1\}^d emerge as the only mathematically rigorous pathways to achieving statistical independence between the transmitted representation and the original weight tensor.
The third dimension (§§2.7–2.9) establishes that the distributed infrastructure required to deploy such a system at global scale is irreconcilable with existing consensus architectures. The O(n^2) message complexity of PBFT-class protocols, the liveness collapse under the FLP impossibility theorem in asynchronous networks, and the absence of any Byzantine-tolerant extension of MESI-class coherence protocols collectively define an architectural void: no existing framework achieves sub-linear message complexity, strong consistency, and Byzantine fault tolerance simultaneously.
These three failure modes are not independent engineering problems admitting independent patches. They constitute a tripartite design imperative: any system that addresses only one or two dimensions will either leak information-theoretically, collapse under compression, or fail to scale. The absence of a unified solution is not incidental—it reflects the absence of a theoretical framework that treats security, compression, and distributed coherence as jointly constrained optimisation objectives under physical law.
The remainder of this dissertation develops that framework. Part III formalises the compression-security co-design problem, culminating in the Stochastic Ternary Whitening construction (STW, Theorem 5.6) and the 2 GB edge feasibility closure (§3.9). Part IV introduces the Heterogeneous Multi-Capability Model (HMCM), which maps the compression-tier hierarchy to a Byzantine-tolerant node admission framework (Theorems 6.1–6.2). Part V establishes the dual-tier min-entropy security analysis, deriving the \kappa_{\min} \geq 804 bit threshold under Grover-calibrated quantum adversary models (Theorems 7.1–7.2). Part VI provides the physical security barriers that render this threshold time-invariant: the Landauer sub-theorem for irreversible classical adversaries and the Margolus–Levitin sub-theorem covering reversible classical and all quantum adversaries (Theorems 8.1–8.4). Part VII instantiates the full distributed architecture through the BFT-MESI coherence protocol and the Proof-of-Useful-Work incentive mechanism (Theorems 9.1–9.2; Protocols 1–3).
Reference List
[1] Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., & Zhang, L. (2016). Deep Learning with Differential Privacy. Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS), 308–318. https://doi.org/10.1145/2976749.2978318
[2] Adve, S. V., & Gharachorloo, K. (1996). Shared Memory Consistency Models: A Tutorial. IEEE Comput., 29(12), 66–76. https://doi.org/10.1109/2.546611
[3] Aghajanyan, A., Zettlemoyer, L., & Gupta, S. (2021). Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL). https://arxiv.org/abs/2012.13255
[4] Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebrón, F., & Sanghai, S. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv preprint arXiv:2305.13245.
[5] Alistarh, D., Grubic, D., Li, J., Tomioka, R., & Vojnovic, M. (2017). QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding. Advances in Neural Information Processing Systems (NeurIPS), 30.
[6] AMD Corporation. (2024). AMD64 Architecture Programmer's Manual Volume 2: System Programming. AMD Publication No. 24593.
[7] Babenko, A., & Lempitsky, V. (2014). Additive Quantization for Extreme Vector Compression. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 931–938. https://doi.org/10.1109/CVPR.2014.273
[8] Bagdasaryan, E., Veit, A., Hua, Y., Estrin, D., & Shmatikov, V. (2020). How to Backdoor Federated Learning. Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS).
[9] Bal, H. E., Kaashoek, M. F., & Tanenbaum, A. S. (1992). Orca: A Language for Parallel Programming of Distributed Systems. IEEE Trans. Softw. Eng., 18(3), 190–205. https://doi.org/10.1109/32.90481
[10] Banner, R., Nahshan, Y., & Soudry, D. (2019). Post Training 4-Bit Quantization of Convolutional Networks for Rapid-Deployment. Advances in Neural Information Processing Systems (NeurIPS), 32.
[11] Beltagy, I., Peters, M. E., & Cohan, A. (2020). Longformer: The Long-Document Transformer. arXiv preprint arXiv:2004.05150.
[12] Benet, J. (2014). IPFS - Content Addressed, Versioned, P2P File System. arXiv preprint arXiv:1407.3561.
[13] Bennett, C. H. (1973). Logical Reversibility of Computation. IBM J. Res. Dev., 17(6), 525–532. https://doi.org/10.1147/rd.176.0525
[14] Bennett, C. H. (1982). The Thermodynamics of Computation—A Review. Int. J. Theor. Phys., 21(12), 905–940. https://doi.org/10.1007/BF02084158
[15] Bennett, C. H. (1989). Time/space trade-offs for reversible computation. SIAM J. Comput., 18(4), 766–776. https://doi.org/10.1137/0218053
[16] Bennett, C. H., Bernstein, E., Brassard, G., & Vazirani, U. (1997). Strengths and Weaknesses of Quantum Computing. SIAM J. Comput., 26(5), 1510–1523. https://doi.org/10.1137/S0097539796300933
[17] Berger, T. (1971). Rate Distortion Theory: A Mathematical Basis for Data Compression. Prentice-Hall.
[18] Bernstein, D. J., & Lange, T. (2017). Post-quantum cryptography. Nature, 549, 188–194. https://doi.org/10.1038/nature23461
[19] Bernstein, J., Wang, Y.-X., Azizzadenesheli, K., & Anandkumar, A. (2018). signSGD: Compressed Optimisation for Non-Convex Problems. Proceedings of the 35th International Conference on Machine Learning (ICML).
[20] Bérut, A., Arakelyan, A., Petrosyan, A., Ciliberto, S., Dillenschneider, R., & Lutz, E. (2012). Experimental verification of Landauer's principle linking information and thermodynamics. Nature, 483, 187–189. https://doi.org/10.1038/nature10872
[21] Beullens, W. (2022). Breaking Rainbow Takes a Weekend on a Laptop. Advances in Cryptology – EUROCRYPT 2022, 464–479. https://doi.org/10.1007/978-3-031-07082-2_22
[22] Biderman, S., Portes, J., Anthony, Q., et al. (2024). LoRA Learns Less and Forgets Less. Trans. Mach. Learn. Res. (TMLR).
[23] Blahut, R. E. (1972). Computation of Channel Capacity and Rate-Distortion Functions. IEEE Trans. Inf. Theory, 18(4), 460–473. https://doi.org/10.1109/TIT.1972.1054855
[24] Blahut, R. E. (1987). Principles and Practice of Information Theory. Addison-Wesley.
[25] Blanchard, P., El Mhamdi, E. M., Guerraoui, R., & Stainer, J. (2017). Machine Learning with Adversaries: Byzantine Tolerant Gradient Descent. Advances in Neural Information Processing Systems (NeurIPS), 30.
[26] Bos, J., Ducas, L., Kiltz, E., Lepoint, T., Lyubashevsky, V., Schanck, J. M., Schwabe, P., Seiler, G., & Stehlé, D. (2018). CRYSTALS–Kyber: A CCA-Secure Module-Lattice-Based KEM. Proceedings of the 2018 IEEE European Symposium on Security and Privacy (EuroS&P), 353–367. https://doi.org/10.1109/EuroSP.2018.00032
[27] Bovet, D. P., & Cesati, M. (2005). Understanding the Linux Kernel (3rd ed.). O'Reilly Media.
[28] Boyer, M., Brassard, G., Høyer, P., & Tapp, A. (1998). Tight Bounds on Quantum Searching. Fortschr. Phys., 46(4–5), 493–505. https://doi.org/10.1002/prop.2190460802
[29] Buchman, E., Kwon, J., & Milosevic, Z. (2018). The Latest Gossip on BFT Consensus (Tendermint). arXiv preprint arXiv:1807.04938.
[30] Buterin, V. (2014). Ethereum White Paper: A Next Generation Smart Contract & Decentralized Application Platform. https://ethereum.org/en/whitepaper/
[31] Buterin, V., & Griffith, V. (2017). Casper the Friendly Finality Gadget. arXiv preprint arXiv:1710.09437.
[32] Carter, J. L., & Wegman, M. N. (1979). Universal Classes of Hash Functions. Proceedings of the 11th Annual ACM Symposium on Theory of Computing (STOC), 106–112. https://doi.org/10.1145/800105.803400
[33] Castro, M., & Liskov, B. (2002). Practical Byzantine Fault Tolerance and Proactive Recovery. ACM Trans. Comput. Syst. (TOCS), 20(4), 398–461. https://doi.org/10.1145/571637.571640
[34] Censier, L. M., & Feautrier, P. (1978). A New Solution to Coherence Problems in Multicache Systems. IEEE Trans. Comput., 27(12), 1112–1118. https://doi.org/10.1109/TC.1978.1675003
[35] Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., & Jumper, J. (2023). Accelerating Large Language Model Decoding with Speculative Sampling. arXiv preprint arXiv:2302.01318.
[36] Child, R., Gray, S., Radford, A., & Sutskever, I. (2019). Generating Long Sequences with Sparse Transformers. arXiv preprint arXiv:1904.10509.
[37] CISA. (2022). CISA Insight: Preparing Critical Infrastructure for Post-Quantum Cryptography. Cybersecurity and Infrastructure Security Agency. https://www.cisa.gov/resources-tools/resources/post-quantum-cryptography-initiative
[38] Courbariaux, M., Bengio, Y., & David, J.-P. (2015). BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations. Advances in Neural Information Processing Systems (NeurIPS), 28.
[39] Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., & Bengio, Y. (2016). Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. arXiv preprint arXiv:1602.02830.
[40] Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley.
[41] Cuturi, M. (2013). Sinkhorn Distances: Lightspeed Computation of Optimal Transport. Advances in Neural Information Processing Systems (NeurIPS), 26.
[42] Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., & Salakhutdinov, R. (2019). Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/P19-1285
[43] Deffner, S., & Campbell, S. (2017). Quantum Speed Limits: From Heisenberg's Uncertainty Principle to Optimal Quantum Control. J. Phys. A: Math. Theor., 50(45), 453001. https://doi.org/10.1088/1751-8121/aa86c6
[44] Dettmers, T., Lewis, M., Belkada, Y., & Zettlemoyer, L. (2022). LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. Advances in Neural Information Processing Systems (NeurIPS), 35.
[45] Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023a). QLoRA: Efficient Finetuning of Quantized LLMs. Advances in Neural Information Processing Systems (NeurIPS), 36.
[46] Dettmers, T., Svirschevski, R., Egiazarian, V., et al. (2023b). SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression. arXiv preprint arXiv:2306.03078.
[47] Dodis, Y., Reyzin, L., & Smith, A. (2004). Fuzzy Extractors: How to Generate Strong Keys from Biometrics and Other Noisy Data. Advances in Cryptology – EUROCRYPT 2004, 523–540. https://doi.org/10.1007/978-3-540-28628-8_28
[48] Dodis, Y., Reyzin, L., & Smith, A. (2008). Fuzzy Extractors: How to Generate Strong Keys from Biometrics and Other Noisy Data. SIAM J. Comput., 38(1), 97–139. https://doi.org/10.1137/060651380
[49] Dong, J., Roth, A., & Su, W. J. (2022). Gaussian Differential Privacy. J. R. Stat. Soc. Ser. B (Stat. Methodol.), 84(1), 3–37. https://doi.org/10.1111/rssb.12454
[50] Ducas, L., Kiltz, E., Lepoint, T., Lyubashevsky, V., Schwabe, P., Seiler, G., & Stehlé, D. (2018). CRYSTALS-Dilithium: A Lattice-Based Digital Signature Scheme. IACR Trans. Cryptogr. Hardw. Embed. Syst., 2018(1), 238–268. https://doi.org/10.13154/tches.v2018.i1.238-268
[51] Dürr, C., & Høyer, P. (1996). A Quantum Algorithm for Finding the Minimum. arXiv preprint quant-ph/9607014.
[52] Dwork, C., & Roth, A. (2014). The Algorithmic Foundations of Differential Privacy. Found. Trends Theor. Comput. Sci., 9(3–4), 211–407. https://doi.org/10.1561/0400000042
[53] Dwork, C., Lynch, N., & Stockmeyer, L. (1988). Consensus in the Presence of Partial Synchrony. J. ACM, 35(2), 288–323. https://doi.org/10.1145/42282.42283
[54] El Mhamdi, E. M., Guerraoui, R., Rouault, S., & Vacher, L. (2018). The Hidden Vulnerability of Distributed Learning in Byzantium. Proceedings of the 35th International Conference on Machine Learning (ICML).
[55] Erdős, P., & Rényi, A. (1960). On the Evolution of Random Graphs. Publ. Math. Inst. Hung. Acad. Sci., 5, 17–61.
[56] Facco, E., d'Errico, M., Rodriguez, A., & Laio, A. (2017). Estimating the Intrinsic Dimension of Datasets by a Minimal Neighborhood Information (TwoNN). Sci. Rep., 7, 12140. https://doi.org/10.1038/s41598-017-11873-y
[57] Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. J. Mach. Learn. Res., 23(120), 1–39.
[58] Fischer, M. J., Lynch, N. A., & Paterson, M. S. (1985). Impossibility of distributed consensus with one faulty process. J. ACM, 32(2), 374–382. https://doi.org/10.1145/3149.214121
[59] Flamary, R., Courty, N., Gramfort, A., et al. (2021). POT: Python Optimal Transport. J. Open Source Softw., 6(60), 2832. https://doi.org/10.21105/joss.02832
[60] Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2023). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. International Conference on Learning Representations (ICLR).
[61] Fredkin, E., & Toffoli, T. (1982). Conservative Logic. Int. J. Theor. Phys., 21(3–4), 219–253. https://doi.org/10.1007/BF01857727
[62] Fredrikson, M., Jha, S., & Ristenpart, T. (2015). Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures. Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (CCS), 1322–1333. https://doi.org/10.1145/2810103.2813677
[63] Gallager, R. G. (1962). Low-Density Parity-Check Codes. IRE Trans. Inf. Theory, 8(1), 21–28. https://doi.org/10.1109/TIT.1962.1057683
[64] Ge, T., He, K., Ke, Q., & Sun, J. (2013). Optimized Product Quantization for Approximate Nearest Neighbor Search. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2363–2370. https://doi.org/10.1109/CVPR.2013.379
[65] Geiping, J., Bauermeister, H., Dröge, H., & Moeller, M. (2020). Inverting Gradients—How Easy Is It to Break Privacy in Federated Learning? Advances in Neural Information Processing Systems (NeurIPS), 33.
[66] Gerganov, G., et al. (2023). llama.cpp: GGUF Format Specification and Inference Engine. GitHub: ggerganov/llama.cpp. https://github.com/ggerganov/llama.cpp
[67] Gersho, A., & Gray, R. M. (1992). Vector Quantization and Signal Compression. Kluwer Academic Publishers.
[68] Gharachorloo, K., Lenoski, D., Laudon, J., Gibbons, P., Gupta, A., & Hennessy, J. (1990). Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors. Proceedings of the 17th Annual International Symposium on Computer Architecture (ISCA), 15–26. https://doi.org/10.1145/325164.325102
[69] Gharachorloo, K., Gupta, A., & Hennessy, J. (1991). Performance Evaluation of Memory Consistency Models for Shared-Memory Multiprocessors. Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 245–257. https://doi.org/10.1145/325096.325101
[70] Goodman, J. R. (1983). Using Cache Memory to Reduce Processor-Memory Traffic. Proceedings of the 10th Annual International Symposium on Computer Architecture (ISCA), 124–131. https://doi.org/10.1145/800046.801647
[71] Gopalan, P., Huang, C., Simitci, H., & Yekhanin, S. (2012). On the Locality of Codeword Symbols. IEEE Trans. Inf. Theory, 58(11), 6925–6936. https://doi.org/10.1109/TIT.2012.2196527
[72] Gottesman, D., Lo, H.-K., Lütkenhaus, N., & Preskill, J. (2004). Security of quantum key distribution with imperfect devices. Quantum Inf. Comput., 4(5), 325–360.
[73] Gray, R. M., & Stockham, T. G. (1993). Dithered Quantizers. IEEE Trans. Inf. Theory, 39(3), 805–812. https://doi.org/10.1109/18.212282
[74] Grover, L. K. (1996). A Fast Quantum Mechanical Algorithm for Database Search. Proceedings of the 28th Annual ACM Symposium on Theory of Computing (STOC), 212–219. https://doi.org/10.1145/237814.237866
[75] Hennessy, J. L., & Patterson, D. A. (2019). Computer Architecture: A Quantitative Approach (6th ed.). Morgan Kaufmann.
[76] Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531.
[77] Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2022). LoRA: Low-Rank Adaptation of Large Language Models. International Conference on Learning Representations (ICLR).
[78] Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, D., et al. (2019). GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism. Advances in Neural Information Processing Systems (NeurIPS), 32.
[79] Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., & Bengio, Y. (2016). Binarized Neural Networks. Advances in Neural Information Processing Systems (NeurIPS), 29.
[80] Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive Mixtures of Local Experts. Neural Comput., 3(1), 79–87. https://doi.org/10.1162/neco.1991.3.1.79
[81] Jégou, H., Douze, M., & Schmid, C. (2011). Product Quantization for Nearest Neighbor Search. IEEE Trans. Pattern Anal. Mach. Intell., 33(1), 1–17. https://doi.org/10.1109/TPAMI.2010.57
[82] Jiao, J., Venkat, K., Han, Y., & Weissman, T. (2015). Minimax Estimation of Functionals of Discrete Distributions. IEEE Trans. Inf. Theory, 61(5), 2835–2885. https://doi.org/10.1109/TIT.2015.2441206
[83] Jiménez, D. A., & Lin, C. (2001). Dynamic Branch Prediction with Perceptrons. Proceedings of the 7th International Symposium on High-Performance Computer Architecture (HPCA), 197–206. https://doi.org/10.1109/HPCA.2001.903263
[84] Johnson, J., Douze, M., & Jégou, H. (2021). Billion-Scale Similarity Search with GPUs. IEEE Trans. Big Data, 7(3), 535–547. https://doi.org/10.1109/TBDATA.2019.2921572
[85] Kairouz, P., McMahan, H. B., Avent, B., et al. (2021). Advances and Open Problems in Federated Learning. Found. Trends Mach. Learn., 14(1–2), 1–210. https://doi.org/10.1561/2200000083
[86] Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S., Stich, S. U., & Suresh, A. T. (2020). SCAFFOLD: Stochastic Controlled Averaging for Federated Learning. Proceedings of the 37th International Conference on Machine Learning (ICML).
[87] Keleher, P., Cox, A. L., Dwarkadas, S., & Zwaenepoel, W. (1994). TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems. Proceedings of the USENIX Winter 1994 Conference.
[88] Kim, Y., & Rush, A. M. (2016). Sequence-Level Knowledge Distillation. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP).
[89] Kraskov, A., Stögbauer, H., & Grassberger, P. (2004). Estimating Mutual Information. Phys. Rev. E, 69(6), 066138. https://doi.org/10.1103/PhysRevE.69.066138
[90] Krawczyk, H. (1994). LFSR-based Hashing and Authentication. Advances in Cryptology – CRYPTO '94, 129–139. https://doi.org/10.1007/3-540-48658-5_5
[91] Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., & Stoica, I. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the 29th Symposium on Operating Systems Principles (SOSP). https://doi.org/10.1145/3600006.3613165
[92] Lamport, L. (1994). The Temporal Logic of Actions. ACM Trans. Program. Lang. Syst. (TOPLAS), 16(3), 872–923. https://doi.org/10.1145/177492.177726
[93] Lamport, L. (1998). The Part-Time Parliament. ACM Trans. Comput. Syst. (TOCS), 16(2), 133–169. https://doi.org/10.1145/279227.279229
[94] Lamport, L. (2002). Specifying Systems: The TLA+ Language and Tools for Hardware and Software Engineers. Addison-Wesley.
[95] Landauer, R. (1961). Irreversibility and Heat Generation in the Computing Process. IBM J. Res. Dev., 5(3), 183–191. https://doi.org/10.1147/rd.53.0183
[96] Landauer, R. (1996). Minimal Energy Requirements in Communication. Science, 272(5270), 1914–1918. https://doi.org/10.1126/science.272.5270.1914
[97] Laurie, B., Langley, A., & Kasper, E. (2013). Certificate Transparency. IETF RFC 6962. https://doi.org/10.17487/RFC6962
[98] Lester, B., Al-Rfou, R., & Constant, N. (2021). The Power of Scale for Parameter-Efficient Prompt Tuning. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP).
[99] Levina, E., & Bickel, P. J. (2004). Maximum Likelihood Estimation of Intrinsic Dimension. Advances in Neural Information Processing Systems (NeurIPS), 17.
[100] Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast Inference from Transformers via Speculative Decoding. Proceedings of the 40th International Conference on Machine Learning (ICML).
[101] Li, C., Farkhoor, H., Liu, R., & Yosinski, J. (2018). Measuring the Intrinsic Dimension of Objective Landscapes. International Conference on Learning Representations (ICLR).
[102] Li, F., Zhang, B., & Liu, B. (2016). Ternary Weight Networks (TWN). arXiv preprint arXiv:1605.04711.
[103] Li, K., & Hudak, P. (1989). Memory Coherence in Shared Virtual Memory Systems. ACM Trans. Comput. Syst. (TOCS), 7(4), 321–359. https://doi.org/10.1145/75104.75105
[104] Li, T., Sahu, A. K., Zaheer, M., Sanjabi, M., Talwalkar, A., & Smith, V. (2020). Federated Optimization in Heterogeneous Networks. Proceedings of Machine Learning and Systems (MLSys), 2, 429–450.
[105] Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., & Han, S. (2023). AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. Proceedings of Machine Learning and Systems (MLSys).
[106] Lin, Y., Han, S., Mao, H., Wang, Y., & Dally, W. J. (2018). Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training. International Conference on Learning Representations (ICLR).
[107] Linux Kernel Documentation. (2024). zRAM: Compressed RAM Block Device. https://www.kernel.org/doc/html/latest/admin-guide/blockdev/zram.html
[108] Lloyd, S. (2000). Ultimate Physical Limits to Computation. Nature, 406, 1047–1054. https://doi.org/10.1038/35023282
[109] Lloyd, S. P. (1982). Least Squares Quantization in PCM. IEEE Trans. Inf. Theory, 28(2), 129–137. https://doi.org/10.1109/TIT.1982.1056489
[110] Ma, S., Wang, H., Ma, L., Wang, L., Wang, W., Huang, S., & Wei, F. (2024). The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits (BitNet b1.58). arXiv preprint arXiv:2402.17764.
[111] Malkov, Y. A., & Yashunin, D. A. (2020). Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. IEEE Trans. Pattern Anal. Mach. Intell., 42(4), 824–836. https://doi.org/10.1109/TPAMI.2018.2889473
[112] Mandelstam, L., & Tamm, I. (1945). The Uncertainty Relation Between Energy and Time in Non-relativistic Quantum Mechanics. J. Phys. (USSR), 9, 249–254.
[113] Margolus, N., & Levitin, L. B. (1998). The Maximum Speed of Dynamical Evolution. Physica D: Nonlinear Phenom., 120(1–2), 188–195. https://doi.org/10.1016/S0167-2789(98)00054-2
[114] Maskin, E. S. (2008). Mechanism Design: How to Implement Social Goals. Am. Econ. Rev., 98(3), 567–576. https://doi.org/10.1257/aer.98.3.567
[115] Maurer, U. M. (1993). Secret Key Agreement by Public Discussion from Common Information. IEEE Trans. Inf. Theory, 39(3), 733–742. https://doi.org/10.1109/18.259647
[116] Max, J. (1960). Quantizing for Minimum Distortion. IRE Trans. Inf. Theory, 6(1), 7–12. https://doi.org/10.1109/TIT.1960.1057548
[117] Maymounkov, P., & Mazières, D. (2002). Kademlia: A Peer-to-peer Information System Based on the XOR Metric. International Workshop on Peer-to-Peer Systems (IPTPS), 53–65. https://doi.org/10.1007/3-540-45748-8_5
[118] McFarling, S. (1993). Combining Branch Predictors. WRL Technical Report TN-36. Western Research Laboratory.
[119] McGrew, D. A., & Viega, J. (2007). Recommendation for Block Cipher Modes of Operation: Galois/Counter Mode (GCM) and GMAC. NIST Special Publication 800-38D. https://doi.org/10.6028/NIST.SP.800-38D
[120] McKusick, M. K., Joy, W. N., Leffler, S. J., & Fabry, R. S. (1984). A Fast File System for UNIX. ACM Trans. Comput. Syst. (TOCS), 2(3), 181–197. https://doi.org/10.1145/989.990
[121] McMahan, H. B., Moore, E., Ramage, D., Hampson, S., & Agüera y Arcas, B. (2017). Communication-Efficient Learning of Deep Networks from Decentralized Data. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 54, 1273–1282.
[122] Merkle, R. C. (1987). A Digital Signature Based on a Conventional Encryption Function. Advances in Cryptology – CRYPTO '87, 369–378. https://doi.org/10.1007/3-540-48184-2_32
[123] Minsker, S. (2015). Geometric Median and Robust Estimation in Banach Spaces. Bernoulli, 21(4), 2308–2335. https://doi.org/10.3150/14-BEJ645
[124] Mironov, I. (2017). Rényi Differential Privacy. arXiv preprint arXiv:1702.07476.
[125] Mironov, I., Talwar, K., & Zhang, L. (2019). Rényi Differential Privacy of the Sampled Gaussian Mechanism. arXiv preprint arXiv:1908.10530.
[126] Mittal, S. (2016). A Survey of Techniques for Approximate Computing. ACM Comput. Surv., 48(4), 1–33. https://doi.org/10.1145/2893356
[127] Mosca, M. (2018). Cybersecurity in an Era with Quantum Computers: Will We Be Ready? IEEE Secur. Priv., 16(5), 38–41. https://doi.org/10.1109/MSP.2018.3761723
[128] Mrazek, V., Hrbacek, R., Vasicek, Z., & Sekanina, L. (2021). ALWANN: Automatic Layer-Wise Approximation of Deep Neural Network Accelerators. Proceedings of the International Conference on Computer-Aided Design (ICCAD).
[129] Nagel, M., Fournarakis, M., Amjad, R. A., Bondarenko, Y., van Baalen, M., & Blankevoort, T. (2021). A White Paper on Neural Network Quantization. arXiv preprint arXiv:2106.08295.
[130] Narayanan, D., Shoeybi, M., Casper, J., LeGresley, P., Patwary, M., et al. (2021). Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'21). https://doi.org/10.1145/3458817.3476209
[131] NIST. (2016). NIST IR 8105: Report on Post-Quantum Cryptography. National Institute of Standards and Technology. https://doi.org/10.6028/NIST.IR.8105
[132] NIST. (2018). NIST SP 800-90B: Recommendation for the Entropy Sources Used for Random Bit Generation. https://doi.org/10.6028/NIST.SP.800-90B
[133] NIST. (2020). NIST SP 800-208: Recommendation for Stateful Hash-Based Signature Schemes. https://doi.org/10.6028/NIST.SP.800-208
[134] NIST. (2024a). FIPS 203: Module-Lattice-Based Key-Encapsulation Mechanism Standard (ML-KEM). https://doi.org/10.6028/NIST.FIPS.203
[135] NIST. (2024b). FIPS 204: Module-Lattice-Based Digital Signature Standard (ML-DSA). https://doi.org/10.6028/NIST.FIPS.204
[136] NIST. (2024c). FIPS 205: Stateless Hash-Based Digital Signature Standard (SLH-DSA). https://doi.org/10.6028/NIST.FIPS.205
[137] NSA. (2022). Quantum Computing and Post-Quantum Cryptography FAQs. National Security Agency. https://media.defense.gov/2022/Sep/07/2003071836/-1/-1/0/CSI_QUANTUM_COMPUTING_FAQS.PDF
[138] Ongaro, D., & Ousterhout, J. (2014). In Search of an Understandable Consensus Algorithm (Raft). Proceedings of the 2014 USENIX Annual Technical Conference (ATC), 305–319.
[139] Paninski, L. (2003). Estimation of Entropy and Mutual Information. Neural Comput., 15(6), 1191–1253. https://doi.org/10.1162/089976603321780272
[140] Papamarcos, M. S., & Patel, J. H. (1984). A Low-Overhead Coherence Solution for Multiprocessors with Private Cache Memories. Proceedings of the 11th Annual International Symposium on Computer Architecture (ISCA), 348–354. https://doi.org/10.1145/800015.808204
[141] Peikert, C. (2016). A Decade of Lattice Cryptography. Found. Trends Theor. Comput. Sci., 10(4), 283–424. https://doi.org/10.1561/0400000074
[142] Peyré, G., & Cuturi, M. (2019). Computational Optimal Transport. Found. Trends Mach. Learn., 11(5–6), 355–607. https://doi.org/10.1561/2200000073
[143] Rabin, M. O. (1989). Efficient Dispersal of Information for Security, Load Balancing, and Fault Tolerance. J. ACM, 36(2), 335–348. https://doi.org/10.1145/62044.62050
[144] Rastegari, M., Ordonez, V., Redmon, J., & Farhadi, A. (2016). XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. European Conference on Computer Vision (ECCV), 525–542. https://doi.org/10.1007/978-3-319-46493-0_32
[145] Ratnasamy, S., Francis, P., Handley, M., Karp, R., & Schenker, S. (2001). A Scalable Content-Addressable Network (CAN). Proceedings of the 2001 ACM SIGCOMM Conference, 161–172. https://doi.org/10.1145/383059.383072
[146] Reed, I. S., & Solomon, G. (1960). Polynomial Codes Over Certain Finite Fields. J. Soc. Ind. Appl. Math., 8(2), 300–304. https://doi.org/10.1137/0108018
[147] Regev, O. (2009). On Lattices, Learning with Errors, Random Linear Codes, and Cryptography. J. ACM, 56(6), 1–40. https://doi.org/10.1145/1568318.1568324
[148] Reiter, M. K., & Malkhi, D. (2000). Byzantine Quorum Systems. Distrib. Comput., 11(4), 203–213. https://doi.org/10.1007/s004460050050
[149] Renner, R. (2008). Security of Quantum Key Distribution. Int. J. Quantum Inf., 6(1), 1–127. https://doi.org/10.1142/S0219749908003256
[150] Rowstron, A., & Druschel, P. (2001). Pastry: Scalable, Decentralized Object Location and Routing for Large-Scale Peer-to-Peer Systems. IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), 329–350. https://doi.org/10.1007/3-540-45518-3_18
[151] Schuchman, L. (1964). Dither Signals and Their Effect on Quantization Noise. IEEE Trans. Commun. Technol., 12(4), 162–165. https://doi.org/10.1109/TCOM.1964.1088973
[152] Sekanina, L. (2017). Approximate Computing: General Introduction. In Approximate Circuits, pp. 1–14. Springer. https://doi.org/10.1007/978-3-319-99322-5_1
[153] Serebryany, K. (2018). Sanitizing the System Call API with Syscall Filters and ARM Memory Tagging. Proceedings of the Linux Security Summit North America (LSS-NA).
[154] Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell Syst. Tech. J., 27(3), 379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x
[155] Shannon, C. E. (1949). Communication Theory of Secrecy Systems. Bell Syst. Tech. J., 28(4), 656–715. https://doi.org/10.1002/j.1538-7305.1949.tb00928.x
[156] Shazeer, N. (2019). Fast Transformer Decoding: One Write-Head is All You Need. arXiv preprint arXiv:1911.02150.
[157] Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., & Dean, J. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. International Conference on Learning Representations (ICLR).
[158] Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., & Catanzaro, B. (2019). Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv preprint arXiv:1909.08053.
[159] Shokri, R., Stronati, M., Song, C., & Shmatikov, V. (2017). Membership Inference Attacks Against Machine Learning Models. Proceedings of the 2017 IEEE Symposium on Security and Privacy (S&P), 3–18. https://doi.org/10.1109/SP.2017.41
[160] Smith, J. E. (1981). A Study of Branch Prediction Strategies. Proceedings of the 8th Annual International Symposium on Computer Architecture (ISCA), 135–148. https://doi.org/10.1145/285930.285980
[161] Stich, S. U. (2019). Local SGD Converges Fast and Communicates Little. International Conference on Learning Representations (ICLR).
[162] Stich, S. U., Cordonnier, J.-B., & Jaggi, M. (2018). Sparsified SGD with Memory. Advances in Neural Information Processing Systems (NeurIPS), 31.
[163] Stern, M., Shazeer, N., & Uszkoreit, J. (2018). Blockwise Parallel Decoding for Deep Autoregressive Models. Advances in Neural Information Processing Systems (NeurIPS), 31.
[164] Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., & Balakrishnan, H. (2001). Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications. Proceedings of the 2001 ACM SIGCOMM Conference, 149–160. https://doi.org/10.1145/383059.383071
[165] Szilard, L. (1929). Über die Entropieverminderung in einem thermodynamischen System bei Eingriffen intelligenter Wesen. Z. Phys., 53(11–12), 840–856. https://doi.org/10.1007/BF01341281
[166] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., et al. (2023). LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971.
[167] Vardi, Y., & Zhang, C.-H. (2000). The Multivariate L1-Median and Associated Data Depth. Proc. Natl. Acad. Sci. USA, 97(4), 1423–1426. https://doi.org/10.1073/pnas.97.4.1423
[168] Villani, C. (2003). Topics in Optimal Transportation. American Mathematical Society.
[169] Vogels, T., Karimireddy, S. P., & Jaggi, M. (2019). PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization. Advances in Neural Information Processing Systems (NeurIPS), 32.
[170] Wang, H., Ma, S., Dong, L., Huang, S., Wang, H., Ma, L., Yang, F., Wang, R., Wu, Y., & Wei, F. (2023). BitNet: Scaling 1-bit Transformers for Large Language Models. arXiv preprint arXiv:2310.11453.
[171] Wegman, M. N., & Carter, J. L. (1981). New Hash Functions and Their Use in Authentication and Set Equality. J. Comput. Syst. Sci., 22(3), 265–279. https://doi.org/10.1016/0022-0000(81)90033-7
[172] Weiszfeld, E. (1937). Sur le point pour lequel la somme des distances de n points donnés est minimum. Tôhoku Math. J., 43, 355–386.
[173] Widrow, B., & Kollar, I. (2008). Quantization Noise: Roundoff Error in Digital Computation, Signal Processing, Control, and Communications. Cambridge University Press. https://doi.org/10.1017/CBO9780511754661
[174] Yin, D., Chen, Y., Kannan, R., & Bartlett, P. (2018). Byzantine-Robust Distributed Learning: Towards Optimal Statistical Rates. Proceedings of the 35th International Conference on Machine Learning (ICML).
[175] Yin, M., Malkhi, D., Reiter, M. K., Gueta, G. G., & Abraham, I. (2019). HotStuff: BFT Consensus with Linearity and Responsiveness. Proceedings of the 38th ACM Symposium on Principles of Distributed Computing (PODC), 347–356. https://doi.org/10.1145/3293611.3331591
[176] Yeh, T.-Y., & Patt, Y. N. (1992). Alternative Implementations of Two-Level Adaptive Branch Prediction. Proceedings of the 19th Annual International Symposium on Computer Architecture (ISCA), 124–134. https://doi.org/10.1145/139669.140532
[177] Yu, Y., Manolios, P., & Lamport, L. (1999). Model Checking TLA+ Specifications. Proceedings of the 10th IFIP WG 10.5 Advanced Research Working Conference on Correct Hardware Design and Verification Methods (CHARME), 54–66. https://doi.org/10.1007/3-540-48153-2_5
[178] Zamir, R., & Feder, M. (1992). On Universal Quantization by Randomized Uniform/Lattice Quantizers. IEEE Trans. Inf. Theory, 38(2), 428–436. https://doi.org/10.1109/18.119699
[179] Zhandry, M. (2019). How to Record Quantum Queries, and Applications to Quantum Indifferentiability. Advances in Cryptology – CRYPTO 2019.
[180] Zhao, B., Mopuri, K. R., & Bilen, H. (2020). iDLG: Improved Deep Leakage from Gradients. arXiv preprint arXiv:2001.02610.
[181] Zhao, J., Zhang, Z., Chen, B., Wang, Z., Anandkumar, A., & Tian, Y. (2024). GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection. Proceedings of the 41st International Conference on Machine Learning (ICML).
[182] Zhu, L., Liu, Z., & Han, S. (2019). Deep Leakage from Gradients. Advances in Neural Information Processing Systems (NeurIPS), 32.
[183] Ziv, J. (1985). Universal quantization. IEEE Trans. Inf. Theory, 31(5), 591–598. https://doi.org/10.1109/TIT.1985.1057087
Author: Rosalind Pembrick