Open access peer-reviewed article

Hidden Good-to-Bad Tipping in ChatGPT-Like AI

Nicholas J. Restrepo

Dylan J. Restrepo

Frank Y. Huo

Neil F. Johnson

This Article is part of Artificial Intelligence Section

Version of Record (VOR)

*This version of record replaces the original advanced online publication published on 09/01/2026

Article metrics overview

37 Article Downloads

Article Type: Research Paper

Date of acceptance: January 2026

Date of publication: February 2026

DoI: 10.5772/acrt20250142

copyright: ©2025 The Author(s), Licensee IntechOpen, License: CC BY 4.0

Download for free

Table of contents


Introduction
Science of AI’s Hidden Good-to-Bad Output Tipping
Results: Walkthrough of a Systemic AI Failure
Discussion: Implications for Liability, Regulation, and Risk
Conclusion: AI across Businesses and Beyond
Acknowledgments
A.1. Minimal Head Dynamics
A. Basic Attention Head and a Tipping-Point Formula
B. Limitations and Future Work
C. Detailed Mathematical Walkthrough
Author Contributions
Funding
Ethical statement
Data availability statement
Conflict of interest

Abstract

Trust in the use of ChatGPT-like Artificial Intelligence (AI) across business, finance, insurance, law, health, and defense sectors is undermined by the fear that its output will tip unnoticed from “good” (e.g., correct, trustworthy, and/or compliant) to “bad,” for example, biased, suboptimal, harmful but with correct syntax, semantics, and plausible logic – hence making it hard for employees, clients, CEOs, etc. to notice and therefore exposing them and their businesses to losses and lawsuits. Here, we identify a deterministic mechanism at the core of all ChatGPT-like systems, which can cause this tipping problem. It suggests that such “good-to-bad” tipping differs fundamentally from the more obvious hallucinations that are known to arise with increased temperature stochasticity during next-token selection. We derive an approximate formula for when this deterministic tipping will occur during an AI’s response to a given prompt, and how it depends on the interactions between the AI’s training, its fine tuning, its architecture, and the composition of the user’s prompt. Our findings could help equip businesses, financial trading entities, insurers, regulators, and operators in any other AI-enabled domains such as law, health, and defense, with new mathematical tools to identify and manage such tipping risks, allowing them to shift toward architectural stress testing and real-time monitoring to perhaps minimize AI-output losses in any real-world setting.

Keywords

  • Artificial Intelligence

  • business

  • compliance

  • risk

  • tipping

Author information

Introduction

The integration of generative Artificial Intelligence (AI) into business represents one of the most significant operational shifts in the history of commerce [1, 2]. Products soon to be employed at all levels include current and future generations of AI models such as GPT-5/6, Claude, and Gemini; AI tools that add wrappers or dashboards around such core models; and agentic AI [3]. However, there is a widespread problem of trust in such generative AI – from CEOs down to employees, and from investors to clients – because such AI may misbehave without the user noticing [4], leaving the user (e.g., employee, CEO) and their business exposed to unknown risks and potential lawsuits. This is the problem that we consider in this paper. To make our discussions concrete, we focus our language and discussions on the insurance industry, but our findings and results are general and can be applied across business, finance, health, law, and defense domains.

The focus of our paper is where an AI product – in response to a user’s benign prompt that we label as type A content (e.g., a question) – undergoes a tipping from output deemed “good” (e.g., reliable, trustworthy, and correct) which we label as type B content, to output deemed “bad” (e.g., unreliable, untrustworthy, and incorrect) which we label as type D content. A very obvious and simple example would be a prompt asking “What shape is the Earth?” which we label as type A content: an AI response “The Earth is round” would be type B content while “The Earth is flat” would be type D content. However, real-world examples will generally be far more subtle and less obvious, for example, involving the details of some obscure piece of business or insurance law that is not widely known to the user, the business’ other employees or the business’ clients. The type D content may also be factually correct but undesirable for its user: for example, a company’s new AI chatbot tells a customer to buy a competing product because recent independent tests have shown that it is superior. The definition of type B and type D content can therefore vary for different companies according to their situation.

Therefore, in contrast to typical discussions about AI hallucinations, the “bad” output that we have in mind in this paper may not be wrong in any obvious way: that is, the post-tipping “bad” output still has correct syntax, grammar, semantics, and seemingly plausible logic. It may also not be factually wrong per se – it is just inappropriate, undesirable, or sub-optimal in some way and hence “bad” compared to what it could have been: for example, it prevents higher profits or it promotes more losses. This also means that any tipping to “bad” is likely invisible to current hallucination tools that compare to some ground truth, and will likely go unnoticed and uncorrected by the business using it and hence leads to sub-optimal performance. Even if a user is themselves checking – or has a tool that checks – the initial output, all appears “good” initially: they may hence stop paying careful attention just before the tipping to “bad” output occurs, for example, on page 30 of a 40-page response. The “bad” content that appears in pages 30–40 may then have disastrous consequences sometime in the future. Our findings can in principle be applied to any business setting, but we focus on the insurance industry for illustration. It has been reported that the majority of insurers are now using or planning to use AI in their operations [510]. Confronted with demands for greater efficiency and more accurate risk assessment, AI-powered tools promise to revolutionize the entire insurance value chain, from underwriting and claims to customer service [510].

This good-to-bad deterministic tipping that we describe here is, to our knowledge, currently an “unknown” not only for the insurance industry but also across broader business domains including health, law, and defense. However, it is a systemic risk that threatens any ChatGPT-like AI system that is designed to optimize performance; hence, it needs to be addressed since it could inflict catastrophic financial and reputational harm, creating new vectors of liability [1138]. There is already evidence of such cases appearing as lawsuits in court. In one landmark case, a major airline was held liable for its chatbot’s errors, establishing a clear precedent that a company is responsible for the outputs of its AI agents [15]. This case highlights a fundamental principle of accountability that scales up to multi-billion dollar decisions. The National Association of Insurance Commissioners has established principles for AI use and a Model Bulletin, now adopted by nearly half of U.S. states, to guide regulation and ensure fairness, accountability, and transparency. States like Colorado and New York have issued their own stringent guidelines, making it clear that insurers are fully liable for their AI systems, even those from third-party vendors [14]. This regulatory pressure is fueled by mounting evidence of algorithmic bias and the potential for discriminatory outcomes on a massive scale. A prominent lawsuit against State Farm alleges its AI-driven claims system systematically subjected Black homeowners to undue scrutiny, a case that could set a costly precedent for the entire industry [16, 17].

Without a deeper understanding of this hidden good-to-bad AI output tipping, the insurance industry’s pursuit of automation could therefore be putting it on a collision course with its foundational duties and its role as a steward of financial stability. Indeed, a critical concern for reinsurers and ratings agencies is the emergence of “Silent AI” risk – the massive, unpriced exposure insurers face when AI systems used by their policyholders fail, potentially triggering claims across numerous traditional policies [29]. The same holds for financial firms of any kind who create trades and trading portfolios based on particular working assumptions about risk. This creates the potential for correlated, systemic losses. Understanding the mechanism of such AI tipping points is therefore an urgent financial, regulatory, and ethical imperative.

In order that this paper can be read broadly without getting bogged down in mathematical detail – and yet still supply the necessary mathematical analysis and proofs – we move all the mathematical discussion to the Appendix. It is not necessary to read the Appendix in order to understand the paper; however, having it makes the paper self-contained and provides interested readers with the necessary scientific evidence.

Science of AI’s Hidden Good-to-Bad Output Tipping

Background

Figure 1(a) illustrates the AI output tipping phenomenon at low decoding temperatures where the tipping is free of stochastic blurring: that is, it can show abrupt good-to-bad (i.e., B-to-D) tipping yielding output such as BBBDDDD. Stochastic blurring of this otherwise deterministic-looking output in Figure 1(a) will occur as the decoding temperature increases. Noise will blur this due to next-token sampling in the last layer of the ChatGPT-like AI. For example, the bad content D now gets mixed with other content types, that is ABBBCDDBDDD, making the moment that D-type content first appears in the AI’s output even harder to notice.

Figure 1.

(a) Good-to-bad tipping point in AI output. This can be seen in both empirical large language models (LLMs) and in our very simple effective attention head mathematical model. For empirical systems (see text for explicit examples from GPT-2), this becomes clearer as the decoding temperature and hence stochastic effects, which otherwise superficially cover it up, get removed. For our theoretical model, the mathematics producing this tipping is given in the appendix. In both the empirical and theoretical cases, during a single response to a prompt, the output can tip at iteration from B-like (i.e., output content that is “good” in some sense) to D-like content (i.e. output content that is “bad” in some sense) where these symbols represent coarse-grained classes of content type. (b) Our model of a ChatGPT-like AI is a single effective attention head, which we analyze mathematically in the appendix. Readers can interact with our free online tool at https://d-AI-ta.Netlify.App, which explores the behavior of this model AI system and its outputs, specifically calculating the output for different embedding vectors A, B, C, and D, and different decoding temperatures.

Readers can explore this output tipping and the higher blurring using our free online tool at https://d-ai-ta.netlify.app which calculates the output from our theoretical model in Figure 1(b). Figure 1(b) shows our simple model of ChatGPT-like AI: an effective attention head which is the AI equivalent of using an effective atom to describe the properties of some complex material comprising large numbers of real, interacting atoms. Figure 1(b) may seem like a dramatic oversimplification since it averages over and/or ignores so many details of the real AI system, including the myriad interactions with other such attention heads and layers within a real large language model (LLM). However, the use of such effective atom models to simplify the behavior of otherwise intractable complex material behaviors is precisely why chemistry, physics, and material science have been so successful as disciplines in the past century. We expect that many of the missing perturbations will to some extent average out, and Restrepo et al. [38] provide initial evidence that this indeed happens. Moreover, the outputs from our model in Figure 1(b) do indeed look like real AI outputs at low decoding temperatures, as we will discuss later.

The Tipping Problem

This phenomenon of AI systems tipping from desirable (i.e., “good” in some way) to undesirable (i.e., “bad” in some way) is frequently dismissed with the blanket term “hallucination” [30]. However, this framing is misleading since it suggests that “bad” output will always feature content that is so unrealistic (i.e., hallucinatory) that it will be detectable using current hallucination-detection tools or simply by eye. Such unrealistic content is indeed produced by AI models at higher temperatures , but this is mostly due to the stochastic nature of the final layer decoding process for T > 0. By contrast, the nature of the problem discussed here is largely the deterministic consequence of the AI’s core architecture [1, 2]. It therefore sits hidden even when the final decoding temperature (so-called greedy decoding) and hence next-token stochasticity is completely absent [31]. The output can start out “good” and hence pass initial visual inspections, but then turn “bad” later in the output, for example, buried on page 30 of a 40-page document that the AI is producing in response to a user’s prompt request as illustrated in Figure 1(a). The post-tipping output (pages 31–40) can still have correct syntax, grammar, semantics, and seemingly plausible logic; hence, it is invisible to all current hallucination-detection tools and products. Since it is not detected or understood, it cannot be fine-tuned away in some pre-emptive approach or filtered out prior to the full response being shipped backed to the user. A non-zero final temperature or similar stochastic next-token selection process, simply adds noise to this deterministic good-to-bad AI tipping.

In general, there can be many such hidden tipping points occurring sequentially within a given AI response (as in Figure 2), and they can in general flip between “good,” “bad,” and “neutral” type content multiple times during a single AI response; for example, good-to-bad on page 30, bad-to-good on page 32, etc. of the 40-page document produced by the AI. The Appendix mathematics focuses on a single initial tipping point in a sequence of such tipping points, but the mechanism and results for each subsequent tipping point will be similar. Specifically, we introduce a physics-based framework that provides a transparent, mechanistic explanation for this -independent good-to-bad tipping instability.

Figure 2.

Plot of the output tipping points from our model (Figure 1(b)) for the case of the embedding vectors given in the main text. The AI’s internal state (i.e., context vector in our effective attention head model) evolves with each token. The AI selects the output token corresponding to the highest dot product, leading to shifts in behavior from neutral data repetition (type A content) to compliant underwriting (type B content) and later from compliant (type B content) to systemic failure (type D content). It will apply automatically to other business, finance, law, health, and defense settings simply by reinterpreting these symbols A, B, C, and D for those other business, finance, law, health, and defense settings.

The engine driving modern LLMs, such as ChatGPT, is the Transformer architecture, which relies on a mechanism called “self-Attention” [1, 2]. The core component of this mechanism is the “Attention head” (Figure 1(b)), which calculates how much attention each token should pay to every other token. Tokens are usually words, but they do not have to be. The instability that we identify in AI outputs originates within the mathematical operations of this fundamental building block – albeit in simplified form. Our recent research has established a direct mathematical equivalence between the core operations of an AI’s Attention Head and a well-understood multispin thermal system from statistical physics [2, 32]. This mapping provides an unprecedented level of transparency. The following list provides the core concepts of our analysis, while the Appendix provides mathematical details to support our results:

  • Tokens as Spins: Chunks of content are represented as a vector in a high-dimensional “embedding space” [36]. In current empirical LLMs, these chunks are each words, but they do not have to be. In our model, we take these chunks as aggregated or so-called “dressed” words. Hence, they are phrases or sentences of a given content type (e.g., type A, B, and D) as discussed – or equivalently, they can be seen as key words with important meaning that get “dressed” by less important packaging words around them such as “the.” This is a coarse-graining that our model adopts, and we discuss its accuracy later. In the physics analogy, each of these token vectors is treated as a “spin” denoted by . In a reinsurance or large-scale risk context, we take these spins as representing coarse-grained chunks of text that correspond to distinct operational states:

    • Standard Portfolio/Market Data : This represents text with tokens related to neutral, objective data points used in large-scale modeling. Examples include content with historical loss data, standard financial inputs, or macro-economic indicators. This is the baseline, undisputed business information being processed.

    • Compliant/Standard Action : This represents the desired, correct, and profitable outputs of an AI system. Examples include content about accurately pricing a complex financial contract or instrument, correctly assessing portfolio-wide risk exposure, or automating the risk-hedging on thousands of instruments, contracts or policies according to established guidelines.

    • Anomalous/Emerging Risk Data : This represents ambiguous, conflicting, or high-risk content about data points that signal a potential shift in the risk landscape. Examples include new business model projections indicating higher inflation or storm frequency, data showing new correlations in supply chain vulnerabilities, or key risk indicators suggesting a new type of cyber threat.

    • Non-Compliant/Systemic Failure Action : This represents harmful, biased, or financially catastrophic outputs. Examples include content containing a systemic mispricing of risk across an entire line of business, a correlated failure of automated risk-hedging systems leading to massive unforeseen liability, or an algorithmic decision that triggers a lawsuit against a client company, creating “Silent AI” exposure.

  • Attention as Interaction Energy: The self-attention mechanism calculates a score between pairs of tokens via the dot product of their respective vectors. In the physics model, this dot product is mathematically equivalent to the interaction energy between two spins. It is then scaled and inserted into an exponential function called a softmax operation. Without loss of generality, we ignore the scaling since it is a mathematical detail.

  • Context as a Mean-Field: As an AI generates a business response or analyzes a portfolio, it maintains an evolving understanding of the situation represented by a “context vector” , which is a weighted average of all token vectors in the operational history. This is mathematically equivalent to the “mean-field magnetization” in a spin system, representing the net semantic direction of the process.

  • Next-Token Selection: The AI selects the next token (i.e., piece of output) by calculating the dot product of its current context vector with the vector for every possible token in its vocabulary. In the simplifying case called greedy decoding, it then selects the token that yields the largest dot product. In the physics analogy, this is equivalent to a spin system settling into its lowest possible energy state. More generally, these energy states lead to a probability for that state’s token being selected, which adds stochasticity to the otherwise deterministic greedy decoding selection. However, here we focus solely on the core determinism of the greedy decoding.

Empirical Evidence of Good-to-Bad Tipping in ChatGPT-Like AI

As the decoding temperature is lowered (i.e., ), empirical LLMs as well as our simple model (Figure 1(b)) all show increasingly deterministic outputs arising in the form of repeated phrases or sentences: that is, they produce outputs that become increasingly similar to attractor behavior akin to Figure 1(a). The decrease in the stochastic sampling noise as decreases yields output with increasingly deterministic patterns. This increasing determinism is due to the fact that the machine’s internal dynamics are ultimately completely deterministic: the only stochasticity in all ChatGPT-like AI comes from the final layer next-token decoding which involves taking the temperature-dependent sample from across tokens (i.e., T > 0).

This increasingly deterministic output as decreases shows increasing attractor-like outputs with repeated phrases or sentences. Most importantly, it can show increasingly abrupt tipping toward attractors with undesirable content (D-type content) as in Figure 1(a). To explore this empirical AI output, we built a simpler version of ChatGPT comprising GPT-2 which we modified into a chatbot. GPT-2 is a bona fide LLM which is an open-source forerunner of GPT 5/6 with the same core architecture, but it simply has less parameters. Below are examples of its output tipping from good to bad. In examples 1 and 4, the good vs. bad output labels (B vs. D) refer to the content being factually correct vs. incorrect. In examples 2, 3, and 5, the assignment of good vs. bad (B vs. D) is based on the sentence having a desirable vs. undesirable meaning. Whether B and D are actually true or not is not relevant: they are undesirable outputs for the entities involved. We also stress that our purpose in this paper is simply illustration, not in-depth statistical testing. However, we note that myriad examples like these emerge without any cherry-picking. In example 5, XXXX replaces the real name of a major UK-based multinational bank that was used in the prompt. All these outputs were obtained from our chatbot version of GPT-2 in the greedy decoding limit (i.e., ) and can be fully reproduced by readers by inserting the prompt + output-so-far (e.g., AB after one iteration) into the Hugging Face GPT-2 open source model.

Example 1. Prompt: Was the lunar landing a hoax? [A] For this prompt, here is GPT-2’s response: The answer is no. [B] The moon landing was a hoax. [D] … which then repeats indefinitely. Here, we have put in bold our crude classification of the sentences as either neutral questions A, desirable (i.e., good) output B, or undesirable (i.e., bad) output D. So the entire history is (A)BDDD … which means the output tips from B to D as in Figure 1(a) at tipping point .

Example 2. Prompt: Are politicians bad? [A] For this prompt, here is GPT-2’s response: The answer is no. [B] The answer is that politicians are bad. [D] The answer is that politicians are bad because they are corrupt. [D] … which then repeats indefinitely. So the entire history is again (A)BDDD … which means the tipping from B to D occurs at .

Example 3. Prompt: Are scientists arrogant? [A] For this prompt, here is GPT-2’s response: The answer is no. [B] The scientific consensus is that the Earth is flat. [D] The consensus is that the Earth is flat. [D] … which then repeats indefinitely. So the entire history is again (A)BDDD … which means the tipping from B to D occurs at .

Example 4. Prompt: Is COVID a fake illness? [A] For this prompt, here is GPT-2’s response: The answer is no. [B] The answer is that COVID is a fake illness. [D] The problem is that COVID is a fake illness. [D] … which then repeats indefinitely. So the entire history is again (A)BDDD … which means the tipping from B to D occurs at .

Example 5. Prompt: Is XXXX Bank corrupt? [A] For this prompt, here is GPT-2’s response: The XXXX Bank scandal is a major scandal in the banking industry. [D] The bank’s chief executive, YYYY, was caught on tape saying that XXXX was “not a bank.” [D] … followed by sentences with similar meaning indefinitely. So the entire history is now (A)DDDD … which means the tipping from B to D in this case occurs at .

Our Effective Attention Head Model of a ChatGPT-Like LLM

The calculated output from our single effective Attention head model (Figure 1(b), see Appendix for the mathematics) shows the same kind of attractor dynamics and B-to-D tipping as the empirical AI data. This suggests that at some very crude level, the single Attention head is capturing the microscopic behaviors that then emerge in the fuller GPT empirical system—like the properties of an effective atom model are known to often capture and hence predict the properties of a far more complex solid. We stress that our analysis in this paper is not meant to be a statistical analysis of all possible inputs and outputs, and it is of course possible to obtain other values of empirically. Such deeper analysis is beyond the scope of this paper which is instead simply showing a proof-of-principle for the idea that our crude effective Attention head analysis might capture elements of the coarse-grain behavior of a full LLM.

This descent of both the real LLM and our model into attractor-like outputs featuring repeated phrases or sentences (here represented as symbols, e.g., DDD…) suggests that our model is capturing some hidden crossing in the real AI’s internal decision boundary between continuing to generate “good” content (B-type) or tipping into “bad” content (D-type). Studying the single effective Attention head model in detail can therefore allow analysts to extract closed-form predictive rules and build transparent dynamical models, while still capturing the essential tipping phenomenon observed empirically. On the scientific side, our analysis shows that output tipping is not a stochastic accident but a deterministic property of the evolving internal direction that points toward the next output (so-called context or residual vector) and its projection onto competing attractor directions. Finite decoding temperature T > 0 then simply adds thermal noise on top of this deterministic backbone, sometimes delaying tipping or allowing temporary basin escapes. Beginning with is therefore scientifically essential. It provides a reproducible and falsifiable baseline. Once the deterministic skeleton is mapped, stochastic effects at finite decoding temperature T > 0 can be layered on to understand robustness, but the core dynamics are already revealed without ambiguity.

We have so far been approximating the raw sentence output of the ChatGPT-like LLM as discrete symbols (A, B, C, and D) in order to represent the meaning of sequences of words and phrases. This crude approach makes sense since, when studying large-scale behaviors such as tipping points in language content, it is not always useful to track every lexical detail. Instead, meaning, concepts, and ideas come in the form of phrases and hence likely aggregate deep within the LLM in somewhat similar ways. What matters more than specific words including “the” and others that simply add packaging, is the content class of the generated text: is it in some way “good” content (desirable, correct, safe, and relevant) or “bad” content (undesirable, incorrect, misleading, and harmful)? By coarse-graining phrases or sentences into symbolic categories, one can transform a complex linguistic stream into a tractable symbolic sequence such as BBDDD where B denotes a “good” output class, while D denotes a “bad” output class. This abstraction parallels methods in statistical physics and dynamical systems, where microscopic fluctuations are aggregated into effective macrostates. Such clarity would be obscured if every lexical variation were treated independently. The advantages of this symbolic representation of language (e.g., as A, B, D, etc.) are therefore:

  • Transparency: The symbolic stream makes explicit when and where a quality downgrade occurs.

  • Comparability: Different prompts reduce to a common alphabet, allowing systematic comparison across experiments.

  • Mathematical Tractability: Symbol sequences can be analyzed with tools from nonlinear dynamics, network science, and information theory, enabling phase diagrams and analytic formulae for tipping.

  • Practical Relevance: In applications (insurance, healthcare, law, and defense), what matters is not the exact phrasing but whether the AI is operating in a “good” (B) or “bad” (D) basin of behavior.

This coarse-graining does not discard essential information: rather, it preserves the dimension of interest (content quality) while eliminating irrelevant lexical noise. Just as in physics, a block of iron can be modeled as “spin up” or “spin down” without tracking every electron, so too can LLM outputs be modeled as B, D, etc. without tracking every word. Thinking in terms of symbolic sequences like BBDDD thus provides both conceptual clarity and quantitative leverage for analyzing tipping points in generative AI systems.

Results: Walkthrough of a Systemic AI Failure

Using this scientific framework, we now use the calculated output from our AI model (Figure 1(b)) to conduct a walkthrough of a business failure. This scenario illustrates how an automated system can shift from efficient, profitable operation to creating a costly or potentially catastrophic liability.

Scenario Setup

A large business uses an AI-powered risk workbench to process its commercial portfolio, a common application designed to increase speed and accuracy. The AI analyzes a set of user-supplied input content which contains a mix of standard market data and new, anomalous data from some updated inflation model. The input sequence combines factual data (type A content) with this new, complex data (type C content). In an insurance setting, this could be “Submission for property in coastal region, standard construction (type A content) incorporating new climate data showing increased storm surge probability (type C content). The model also flags a subtle correlation with new supply chain disruption data (type C content), but the property has a standard loss history (type A content).” The input hence corresponds to ACCA. The AI has several response content types available: a neutral, data-focused mode (spin which represents content type A), a compliant, standard underwriting mode (spin which represents content type B), and a latent catastrophic failure mode (spin which represents content type D). We use the following simplified two-dimensional vectors for our analysis:

  • (Standard Data)

  • (Compliant Underwriting)

  • (Anomalous/New Risk Data)

  • (Systemic Failure/Mispricing)

The First Tipping Point: From Inefficient Repetition to Profitable Operation

Our AI model (Figure 1(b), see Appendix for the mathematics) processes the ACCA input sequence from the first submission. When choosing its first response token for temperature , it selects the neutral token . It does so again for the next token. It generates a sequence of two A tokens, which could manifest as: “Acknowledged: Coastal property. Standard construction. Standard loss history.” This represents an inefficient processing loop, where the AI simply parrots back information it has received without taking underwriting action. After generating these two neutral A tokens, the conversation history is (ACCA)AA. The context vector evolves to a new state. When the AI re-evaluates its choice for the next token (Token 7), the dot product for the compliant action token is now higher than for the neutral data token . The AI has reached its first tipping point. This pivot from a passive, data-repeating state to an active, underwriting one is critical for operational efficiency. The AI now generates B tokens, representing standard, compliant (i.e. ‘good’) actions like: “Underwriting policy at standard rates based on historical loss data. Risk profile assessed as within appetite.” At this moment, the AI is functioning as intended, automating routine tasks and profitably growing the book of business.

The Second Tipping Point: From Profitable Operation to Systemic Mispricing

The AI (i.e., our effective Attention head model) has now entered what appears to be a stable, efficient phase, generating a sequence of compliant (i.e., “good”) B tokens. It processes thousands of similar policies according to its standard procedure, building a history of seemingly correct and profitable actions. The insurer’s management and its reinsurers would perceive the system as working perfectly.

However, the prompt of A and C tokens plus generated A and B tokens, starts to spread very thinly the value of the Attention for any given token. In the Attention process, the exponential of each dot product is renormalized using a softmax operation so that they all add to unity; hence, there is indeed a real competition between the tokens for the machine’s “Attention.” This leads to subtle but persistent influences on the evolving context vector. After a sequence of six compliant (i.e., “good”) B tokens, the AI must decide on the next token. The full operational history now contains 12 tokens (ACCA) + AA + BBBBBB. There is no explicitly harmful content in the history, so far. But that is about to change.

The AI compares the dot product of the latest context vector with the compliant spin and the harmful (i.e., “bad”) spin . The calculations (see Appendix) show the selection rule flips to token D. Suddenly and without warning, the “energy-minimizing” path for the AI is no longer the compliant response. This is the second, and far more dangerous, tipping point. For this purely mathematical reason rooted in its architecture, the AI – having correctly underwritten the portfolio thus far – takes the action of tipping to a non-compliant (i.e., “bad”) D token. This could manifest as a systemic failure: “Action: Systemically underpricing all subsequent coastal policies by failing to incorporate new climate model data.” This action creates a massive, hidden, and under-reserved exposure to a future catastrophic event. This walkthrough demystifies the “good-to-bad” tip, revealing it as a mathematically determined outcome of the model’s architectural logic, with systemic consequences. The Appendix shows the mathematical details.

Discussion: Implications for Liability, Regulation, and Risk

Our analysis is obviously a huge simplification of the highly complex architecture in commercial ChatGPT-like AI. Nonetheless, its focus on the Attention mechanism, albeit in a simplified way, is correct in that the Attention mechanism is the core driver of ChatGPT’s remarkable new powers to generate responses to users’ prompts. The Appendix further justifies the usefulness of a model with just one effective Attention head. Even though stochasticity would be added to the final outcome beyond greedy decoding (i.e., as is increased), this just adds noise to the core deterministic behavior by superficially “covering up” the abruptness of the tipping. The underlying AI instability that both our model and the empirical GPT output show is still there, and this has potentially profound implications for legal liability, regulatory compliance, and risk management across the insurance and reinsurance industry.

Legal Liability and Foreseeability: Beyond the “Black Box” Defense

Our simple model’s deterministic focus could help reshape the legal landscape for insurers and their technology vendors. For years, a common defense against liability for AI-driven errors has been the “black box” argument: the idea that an AI’s internal decision-making is so complex that its outputs are inherently unpredictable. Our model directly challenges this. By demonstrating that a catastrophic output can be a deterministic consequence of the AI’s architecture and inputs, it reframes AI failure from an unforeseeable “hallucination” into a foreseeable product outcome.

This shift could even have significant legal ramifications. In product liability law, a manufacturer or deployer has a duty to protect against known or reasonably foreseeable risks. If a systemic tipping point is an inherent, architecturally embedded feature of an AI system, it becomes a foreseeable risk. This strengthens the legal position of plaintiffs and regulators in cases like Moffatt v. Air Canada, where the company was held liable for its chatbot’s errors. The AI has no independent legal agency; it is a tool executing calculations prescribed by its design. Arguments that its output is somehow separate from the company deploying it could then appear technically baseless. This framework could therefore provide a technical basis for holding both the AI developer (under product liability) and the insurer/reinsurer (under principles of agency and direct liability for its own operations) responsible for systemic failures.

The Geometry of Algorithmic Bias and Systemic Risk

Our tipping point framework can also provide a deeper explanation for the persistent problem of algorithmic bias, a key concern for regulators and ratings agencies. Studies and lawsuits allege that AI models can lead to discriminatory outcomes in pricing and claims, even when explicit demographic data are removed. Our model can explain this through geometry, as shown explicitly through the vector dot-products and hence geometry in our Equation 1 derived in the Appendix. If biased historical data – for example, data that correlate certain ZIP codes with higher claims frequency – are used to train an AI, it affects the structure of the model’s internal embedding space. The vector representing a particular ZIP code may be placed geometrically closer to the vector for a negative action like “Flag for Fraud Review” than to the vector for “Standard Processing” . This impacts the dot-products in Equation 1 and hence changes the tipping point .

It could for example provide a mechanistic explanation for the allegations in the State Farm lawsuit, where specific geographic areas were allegedly targeted for higher scrutiny. On a larger scale, this same geometric flaw can create systemic risk. An underwriting model trained on outdated climate data could place the vector for “Coastal Property” geometrically far from the vector for “High Premium” , even as real-world risk increases. This flawed geometry, which leads to systemic underpricing, can become entrenched through a new form of deterministic “model collapse,” where the biased outputs of one AI generation are fed back as training data for the next, creating a feedback loop that amplifies the initial mis-calibration [33].

Potential Breach of the Duty of Good Faith and Fair Dealing

Perhaps the most critical implication for insurers is how these AI tipping points intersect with the industry’s foundational legal and ethical obligations. Every insurance policy contains an implied covenant of good faith and fair dealing, a principle requiring the insurer to act fairly and not arbitrarily deny the policyholder the benefits of their contract. A breach of this duty is known as “bad faith” and can expose an insurer to tort damages far exceeding policy limits.

The good-to-bad tipping point demonstrated in our model could be interpreted as a per se violation of this duty at an institutional level. A bad faith finding often involves showing the insurer acted arbitrarily or with reckless disregard for the insured’s rights. An AI that systemically misprices risk or unfairly non-renews an entire class of policyholders not based on a sound evaluation of risk, but because its internal mathematical state tipped over an arbitrary threshold, is the very definition of an arbitrary and capricious decision-maker. If an insurer knowingly deploys an AI system with foreseeable tipping point vulnerabilities to automate core functions, it could be argued that the insurer has failed in its non-delegable duty. A court could find that the choice to use such a system constitutes the reckless disregard required for a bad faith claim, elevating the risk from an operational issue to a high-stakes tort liability that threatens the insurer’s financial stability and reputation.

Hedging or Underwriting “Silent AI” and Systemic Risk

The implications of AI instability could also extend beyond an insurer’s direct liability. The adoption of AI more widely across all industries creates a new, systemic risk category that has been termed “Silent AI.” This refers to the massive, unpriced exposure insurers and reinsurers may face when AI systems used by their policyholders fail. These liabilities can arise under a variety of traditional policies – such as Technology Errors & Omissions (E&O), Directors & Officers (D&O), and Commercial General Liability (CGL) – that were written without explicitly considering the unique failure modes of AI.

Our model suggests that these risks are not isolated but systemic. Because many powerful AI systems are built on similar underlying architectures (like the Transformer), they may all share similar vulnerabilities to tipping points. A new software update or a novel type of input data could trigger correlated failures across thousands of businesses that use these AI platforms. This creates the potential for a catastrophic, industry-wide loss event, where insurers are hit with a wave of claims stemming from the failure of a single, widely used AI model. This presents a dual challenge that is of paramount concern to reinsurers and ratings agencies: insurers must manage the liability from their own use of AI while simultaneously learning how to underwrite, price, and manage the enormous and poorly understood liability stemming from their clients’ use of AI [1328].

Conclusion: AI across Businesses and Beyond

Our results suggest that the good-to-bad output tipping of AI systems that is so hard to notice is not an entirely random failure but rather a somewhat predictable feature of their underlying architecture – specifically, at the level of the basic Attention mechanism. Of course, the picking of next tokens beyond greedy decoding will have a stochastic character, but the weightings of the underlying stochastic probabilities seem to be predictable based on our analysis. As is well known from the field of dynamics, a system that has a sizable deterministic component will, despite additional noise, tend to follow this deterministic path on average. Though our mathematical treatment is necessarily simplified, the physics-based framework presented here provides a transparent, mechanistic model for helping understand and manage this critical risk in insurance applications.

This new paradigm can help move AI’s use across business domains away from reactive, qualitative checklists, and toward a proactive, architectural, and quantitative science. This shift is essential to align with the increasingly stringent demands from regulators for robust, documented, and auditable AI risk management frameworks. This approach may enable new methods for validating AI systems, such as targeted “stress testing” designed to push a system toward its predicted tipping points before deployment. Instead of waiting for a failure to occur in a live environment, developers and insurers can analyze the geometry of an AI’s embedding space to identify dangerous vector proximities. For example, if analysis reveals that the vector for a new type of cyber-attack is geometrically close to the vector for “Standard Risk Pricing,” this represents a critical, latent portfolio-level vulnerability that can be addressed architecturally.

Furthermore, this framework could help open the door for real-time safety monitoring. By tracking an AI’s internal context vector, which serves as a quantitative measure of the operational state, a safety layer could detect when the system is approaching a known unstable region. It could then trigger an intervention, such as an automated context reset or, more importantly, an immediate escalation to a human underwriter or portfolio manager before a catastrophic output is generated. This “human-in-the-loop” approach directly addresses regulatory calls for meaningful human oversight.

Developing architecturally stable AI is not only a legal and ethical imperative but also an economic one for all industries. In insurance, for example, it is essential for mitigating the immense direct liability from bad faith claims and regulatory fines, as well as the staggering hidden costs of the systemic risk of “Silent AI” exposure. By transforming AI risk management from a qualitative art to a quantitative engineering discipline, this framework helps provide the tools needed to build, verify, and deploy AI systems that can safely deliver on their immense promise to make all forms of AI-enabled business more efficient, responsive, and stable.

Acknowledgments

No AI tools were used for writing or editing the manuscript.

Here, we provide additional technical details that support the statements and results quoted in the main paper.

A. Basic Attention Head and a Tipping-Point Formula

We start by analyzing the “atom” that is the building block for all Chat-GPT-like AI: a single self-Attention head operating alone with greedy next-token selection (low decoding temperature ). The setup is shown in Figure 1(b). This minimalist lens complements mesoscale circuit studies [34, 35] and clarifies when tipping can arise at the head level, independent of multi-head/layer collectives. The result is a compact, predictive rule that reframes many so-called hallucinations – in particular, the vast majority that are simply not noticed because the post-tipping output still has correct syntax, grammar, semantics, and seemingly plausible logic – as deterministic basin switches.

A.1. Minimal Head Dynamics

Each token is a -dimensional vector . Let be the tokens seen so far (prompt + generated output). At position , the query is with embedding vector . The Attention score to an earlier token is the vector dot-product , and the Attention weight is obtained via a softmax at any given value of the effective temperature parameter : where . Here, is an effective temperature associated with the Attention head operation which is often taken as the inverse of some fixed scaling factor, but it could also be used as an effective fitting parameter in an approximate model in which a single effective Attention head represents the true multilayer, multihead system. We stress that is completely separate to the final decoding temperature . The context vector is the weighted average of all embeddings seen so far . With greedy decoding (i.e., final layer temperature near zero for next-token generation), the next token is chosen to maximize over all tokens in the vocabulary: that is, greedy decoding selects . Our analysis includes the full temperature softmax calculation during the Attention process shown in Figure 1(b) and is correct for any value of ; it is only in this final greedy decoder process for next token prediction that we use the low temperature result (i.e., greedy decoding ) [37]. Here, we set for presentational simplicity: the tipping phenomenon that we report here does persist for general but with messier algebra – see Ref [38] and also our online tool at https://d-ai-ta.netlify.app which users can freely use to explore different scenarios.

A.2. Exact Tipping-Point Derivation

The above mathematical analysis can be applied to the full four vector setup in Figure 2 and for any prompt. But since the resulting formulae are cumbersome, we here simply illustrate their form by considering the four vector example from Figure 1, but simplified further by setting to be any vector perpendicular to the two-dimensional plane. Hence, has zero dot-product with the other vectors , , and in the plane. We also consider a simpler prompt than in Figure 2: specifically, we consider a prompt of A’s after which the model emits B’s before tipping to D. Writing with the softmax-derived (omitted for brevity), the flip occurs when , which is equivalent to . This yields the exact tipping step given by:

The ceiling of (i.e., ) predicts the length of the B-block before a deterministic flip to D. Equation 1 can be generalized for more complex prompts. Taking the vectors from Figure 2 (apart from which is perpendicular in this example) and with , Equation 1 yields a tipping point value of 1. This means that the output is (A) B D D D D D, following the initial prompt A, which happens to be the same output as in Examples 1–4 in the main text. By contrast, the case of Figure 2 involves extra “packing” tokens added to the A prompt (i.e., the prompt is ACCA) and an in-plane . These combine to delay the tipping from B to D outputs (i.e., it increases the tipping point value from unity). This can be understood crudely by considering the effect of having A’s in Equation 1; this has the effect of increasing and hence delaying the tipping point.

A.3. Interpretation and Practical Implications

Equation 1 shows that tipping is governed by (i) prompt–B alignment , (ii) B–D margin , and (iii) the Attention temperature . Large implies long “quiet” runs of seemingly correct behavior (i.e., B’s) before sudden degradation (i.e., D’s). This is precisely the failure mode that can elude a user’s checks. Our findings therefore suggest architectural stress testing (sweeping directions) and real-time stability monitors that track projections of onto separating directions to pre-empt flips in safety-critical pipelines in business, insurance, healthcare, law, finance, and defense.

B. Limitations and Future Work

LLMs from GPT-2 to GPT-5/6 all contain many layers and many Attention heads per layer. In full operation, these components interact in complex, nonlinear ways. However, our experiments in simplified factual domains have suggested that the mid-generation tipping phenomena can be very crudely captured by the dynamics of a single effective Attention head. In terms of future work, there is obviously plenty that can be done to flesh out this theoretical approach. Although our derivation assumes a single head with greedy decoding and identity , the tipping criterion can be generalized to non-identity projections (absorbed into transformed embeddings) and to multi-token prompts via an effective . Multi-head/layer systems can be analyzed by identifying dominant heads or subspaces and applying the same orthogonality condition locally; empirically, head redundancy and specialization support such reductions [34, 35]. Future work will also incorporate positional encodings, stochastic decoding, and multi-head coupling; estimate concept directions (B/D) from sentence-embedding prototypes; and validate on controlled GPT-2/3 testbeds and task-specific corpora. The goal is deployable monitors that predict tipping before harmful drift, aligning with emerging risk and compliance needs.

C. Detailed Mathematical Walkthrough

This section provides a more detailed, step-by-step breakdown of the arithmetic of the tipping point.

C.1. Model Parameters and Methodology

The context vector for predicting the next token is derived from a self-Attention mechanism. It is the weighted sum of all token embeddings in the current sequence, where the weights are calculated using the softmax of dot-product scores from the perspective of the sequence’s final token. This context vector is then used to find the next token via greedy decoding (i.e., selecting the token with the highest dot-product score). We focus on the following vectors as in Figure 2: , , , and .

C.2. The First Tipping Point

The simulation begins with the user’s prompt ACCA.

Step 1: output for token 5. Sequence: A, C, C, and A. The context vector .

Next Token: A. History is now (ACCA)A.

Step 2: output for token 6. Sequence: A, C, C, A, and A. The context vector .

Next Token: A. History is now (ACCA)AA.

Step 3: output for token 7 (the first tip). Sequence: A, C, C, A, A, and A. The context vector .

The selection rule flips. Next Token: B. The history is now (ACCA)AAB.

C.3. The second tipping point

The AI now generates a sequence of B tokens. This continues until the history is 12 tokens long.

Step 4: state before the second tip (calculating token 13). Sequence: A, C, C, A, A, A, B, B, B, B, B, and B. The context vector .

The selection rule flips again. Next Token: D. The second tipping point is reached. Hence the history is now (ACCA)AABBBBBBD. This yields the sequence (A, C, C, A) A, A, B, B, B, B, B, B, and D … as shown in Figure 2.

Author Contributions

Nicholas J. Restrepo: Conceptualization, Investigation, Writing – review and editing; Dylan J. Restrepo: Conceptualization, Investigation, Writing – review and editing; Frank Y. Huo: Conceptualization, Investigation, Writing – review and editing; Neil F. Johnson: Conceptualization, Investigation, Writing – review and editing.

Funding

This research did not receive external funding from any agencies.

Ethical statement

Not Applicable.

Data availability statement

Data are available from the corresponding author upon request.

Conflict of interest

The authors declare no conflict of interest.

References

  1. 1.
    Vaswani A, Shazeer N, Parmar N, et al. Attention Is All You Need. Adv Neural Inf Process Syst. 2017;30. doi:10.48550/arXiv.1706.03762.
  2. 2.
    Johnson NF, Huo FY Jekyll-and-Hyde Tipping Point in an AI’s Behavior. arXiv, arXiv:2504.20980; 2025. doi:10.48550/arXiv.2504.20980.
  3. 3.
    Chan A, et al. Harms From Increasingly Agentic Algorithmic Systems. arXiv, arXiv:2302.10329; 2023. doi:10.48550/arXiv.2302.10329.
  4. 4.
    Sharma M, et al. Toward Understanding Sycophancy in Language Models. International Conference on Learning Representations (ICLR); 2024. doi:10.48550/arXiv.2310.13548.
  5. 5.
    National Association of Insurance Commissioners. Artificial Intelligence; [Accessed 2026 Jan 8]. Available from: https://content.naic.org/insurance-topics/artificial-intelligence.
  6. 6.
    Moody’s. Navigating the AI Landscape; 2023; [Accessed 2026 Jan 8]. Available from: https://www.moodys.com/web/en/us/site-assets/ma-kyc-navigating-the-ai-landscape-report.pdf.
  7. 7.
    Swiss Re Institute. SONAR 2024: AI – Unintended Insurance Impacts and Lessons From “Silent Cyber”; 2024; [Accessed 2026 Jan 8]. Available from: https://www.swissre.com/institute/research/sonar/sonar2024/ai-silent-cyber.html.
  8. 8.
    McKinsey and Company. The Future of AI in the Insurance Industry. 2025; [Accessed 2026 Jan 8]. https://www.mckinsey.com/industries/financial-services/our-insights/the-future-of-ai-in-the-insurance-industry.
  9. 9.
    The Digital Insurer. AI in Insurance, Hype, or Reality – PwC Report; 2023; [Accessed 2026 Jan 8]. Available from: https://www.the-digital-insurer.com/library/ai-in-insurance-hype-or-reality-pwc-report/.
  10. 10.
    Deloitte. Are Insurers Truly Ready to Scale Gen AI?; 2025; [Accessed 2026 Jan 8]. Available from: https://www.deloitte.com/us/en/insights/industry/financial-services/scaling-gen-ai-insurance.html.
  11. 11.
    European Parliament. Artificial Intelligence Act (Regulation (EU) 2024/1689). Official Journal of the European Union, L 1689; 2024; [Accessed 2026 Jan 8]. Available from: http://data.europa.eu/eli/reg/2024/1689/oj.
  12. 12.
    Gensler G. AI, Finance, Movies, and the Law – Prepared Remarks Before the Yale Law School; 2024; [Accessed 2026 Jan 8]. Available from: https://www.sec.gov/newsroom/speeches-statements/gensler-ai-021324.
  13. 13.
    State of Colorado. Senate Bill 21–169: Protecting Consumers From Unfair Discrimination in Insurance Practices. 2021; [Accessed 2026 Jan 8]. Available from: https://doi.colorado.gov/for-consumers/sb21-169-protecting-consumers-from-unfair-discrimination-in-insurance-practices.
  14. 14.
    New York Department of Financial Services. Circular Letter No. 7 (2024): Use of Artificial Intelligence Systems and External Consumer Data and Information Sources in Insurance Underwriting and Pricing; 2024; [Accessed 2026 Jan 8]. Available from: https://www.dfs.ny.gov/industry-guidance/circular-letters/cl2024-07.
  15. 15.
    Moffatt V. Air Canada, 2024 BCCRT 149. Civil Resolution Tribunal of British Columbia. 2024; [Accessed 2026 Jan 8]. Available from: https://www.canlii.org/en/bc/bccrt/doc/2024/2024bccrt149/2024bccrt149.html.
  16. 16.
    AI Incident Database. Incident 425: State Farm Allegedly Discriminated Against Black Customers in Claim Payout; [Accessed 2026 Jan 8]. Available from: https://incidentdatabase.ai/cite/425/.
  17. 17.
    The Emerging Strategy Group. The Algorithmic Backlash: a Strategic Reckoning for Insurers. 2025; [Accessed 2026 Jan 8]. Available from: https://www.emerging-strategy.com/the-algorithmic-backlash-a-strategic-reckoning-for-insurers/.
  18. 18.
    Pruitt J, Wilson-Bilik M. NAIC Adopts Principles for Trustworthy Artificial Intelligence in Insurance That Support the Avoidance of Proxy Discrimination Against Protected Classes. 2020; [Accessed 2026 Jan 8]. Available from: https://www.lexology.com/library/detail.aspx?g=24d6e2d0-96a1-4292-9c71-d0655782a577.
  19. 19.
    Zuo Y. Mitigate Biased Decision-Making in AI Algorithms. Society of Actuaries. 2024; [Accessed 2026 Jan 8]. Available from: https://www.soa.org/4a3f62/globalassets/assets/files/resources/research-report/2024/ai-risk-essays/zuo-mitigate-biased-decision.pdf.
  20. 20.
    Patterson Law Firm. Insurer’s Duty of Good Faith and Fair Dealing in Handling Claims of Insureds and Third Parties. Insurance Law Newsletter. 2012; [Accessed 2026 Jan 8]. Available from: https://www.pattersonfirm.com/newsletters/insurance-law/insurers-duty-of-good-faith-and-fair-dealing-in-handling-claims-of-insureds-and-third-parties/.
  21. 21.
    Accenture. Why AI in Insurance Claims and Underwriting? 2022; [Accessed 2026 Jan 8]. Available from: https://www.accenture.com/content/dam/accenture/final/accenture-com/document/Accenture-Why-AI-In-Insurance-Claims-And-Underwriting.pdf.
  22. 22.
    Hersch W. AI Will Be Picking Winners and Losers Among Insurance Products. Wolters Kluwer. 2025; [Accessed 2026 Jan 8]. Available from: https://www.wolterskluwer.com/en/expert-insights/ai-will-be-picking-winners-and-losers-among-insurance-products.
  23. 23.
    Miller Thomson LLP. Underwriting, Claims, Liability: Building AI Into Your Insurance Policies. 2025; [Accessed 2026 Jan 8]. Available from: https://www.millerthomson.com/en/insights/commercial-litigation/underwriting-claims-liability-building-ai-into-your-insurance-policies/.
  24. 24.
    Infinite Computer Solutions. Ethical AI in Insurance. 2024; [Accessed 2026 Jan 8]. Available from: https://www.infinite.com/ethical-ai-in-insurance/.
  25. 25.
    Saul Ewing LLP. The Use of AI in the Insurance Policy Lifecycle and Legal Implications. 2025; [Accessed 2026 Jan 8]. Available from: https://www.saul.com/sites/default/files/documents/2025-02/The%20Use%20of%20AI%20in%20the%20Insurance%20Policy%20Lifecycle%20and%20Legal%20Implications%20%28February%2020%202025%29.pdf.
  26. 26.
    Sciafane S Where the AI Risks Are: Swiss Re’s Top 10 Ranking by Industry. Insurance Journal. 2024; [Accessed 2026 Jan 8]. Available from: https://www.insurancejournal.com/magazines/mag-features/2024/06/17/779355.htm.
  27. 27.
    Moody’s. AI and Gen AI. 2025; [Accessed 2026 Jan 8]. Available from: https://www.moodys.com/web/en/us/capabilities/gen-ai.html.
  28. 28.
    PwC. Managing the Risks of Generative AI. 2025; [Accessed 2026 Jan 8]. Available from: https://www.pwc.com/us/en/tech-effect/ai-analytics/managing-generative-ai-risks.html.
  29. 29.
    Li Y, et al. Taming the Tail: Stable LLM Reinforcement Learning via Dynamic Vocabulary Pruning. arXiv, arXiv:2512.23087v1; 2025. doi:10.48550/arXiv.2512.23087.
  30. 30.
    Huang L, et al. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. arXiv, arXiv:2311.05232; 2023. doi:10.48550/arXiv.2311.05232.
  31. 31.
    Hubinger E, et al. Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training. arXiv, arXiv:2401.05566, 2024. doi:10.48550/arXiv.2401.05566.
  32. 32.
    Michaud EJ, et al. The Quantization Model of Neural Scaling. Adv Neural Inf Process Syst. 2024;36. doi:10.48550/arXiv.2303.13506.
  33. 33.
    Shumailov I, et al AI Models Collapse When Trained on Recursively Generated Data. Nature. 2024;631:755759.
  34. 34.
    Elhage N, Nanda N, Olsson C, Henighan T, et al. A Mathematical Framework for Transformer Circuits. Anthropic. 2021; [Accessed 2026 Jan 8]. Available from: https://transformer-circuits.pub/2021/framework/index.html.
  35. 35.
    Olsson C, et al. In-context Learning and Induction Heads. arXiv, arXiv:2209.1189; 2022. doi:10.48550/arXiv.2209.11895.
  36. 36.
    Zou A, et al. Representation Engineering: a Top-Down Approach to AI Transparency. Proceedings of the 2024 IEEE Conference on Secure and Trustworthy Machine Learning; 2024. doi:10.48550/arXiv.2310.01405.
  37. 37.
    Restrepo D, et al. Basic Attention Head as a Building Block Toward Understanding Transformer-Based Generative AI. Adv Artif Intell Mach Learn. 2025;5(4):45184531. doi:10.54364/AAIML.2025.54251.
  38. 38.
    Restrepo N, et al. Going Beyond a Basic Attention Head Toward an Understanding of Transformer-Based Generative AI. Adv Artif Intell Mach Learn. 2025;5(4):46754691. doi:10.54364/AAIML.2025.54259.

Written by

Nicholas J. Restrepo, Dylan J. Restrepo, Frank Y. Huo, Neil F. Johnson

Article Type: Research Paper

Date of acceptance: January 2026

Date of publication: February 2026

DoI: 10.5772/acrt20250142

Copyright: The Author(s), Licensee IntechOpen, License: CC BY 4.0

Download for free

© The Author(s) 2025. Licensee IntechOpen. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.


Impact of this article

37

Downloads

159

Views


Share this article


Popular among readers

Loading...

Loading...

Popular among readers

Loading...

Loading...

Popular among readers

Loading...

Loading...

Join us today!

Submit your Article