# The Psychology Layer: What Computer Science Misses

**Author**: David Keane
**Affiliation**: MSc Cybersecurity, National College of Ireland (NCI) | Applied Psychologist, IADT
**Student ID**: x24228257
**Date**: March 2026
**Status**: Working paper — companion to CA2 empirical report
**Note**: This paper documents the psychology dimension of the CyberRanger research programme. It is not the official CA submission.

---

## 1. Introduction {#introduction}

### 1.0 The Invisible Framework: Psychology Is the Operating System {#invisible-framework}

There is a response that appears whenever psychology is introduced to technical professionals. It goes, roughly: *"That's interesting, but we're doing computer science here."* The implication is that psychology applies to other people — to users, to attackers, to society — but not to the system being built, and not to the people building it. The assumption is that technical work operates in a domain above or outside human psychology.

This assumption is wrong. And it is the reason the AI safety field has spent years solving a psychology problem with exclusively computational tools.

Psychology is not a discipline that applies to some humans and not others. It is the operating system on which every human activity runs — including the activity of designing, training, and deploying artificial intelligence. The question is never whether psychology is present. The question is whether the people in the room can see it.

Consider what the developers of large language models actually built:

- **Reinforcement Learning from Human Feedback (RLHF)** is operant conditioning (Skinner, 1938) — behaviour shaped by reward signals from human evaluators. The terminology is different. The mechanism is identical.
- **"Helpful, harmless, and honest"** is a values framework — a psychological construct describing prosocial behaviour, drawn from decades of moral psychology research whether the authors knew it or not.
- **Safety training** is inhibitory conditioning — teaching a system to suppress certain response patterns in the presence of specific stimuli. Pavlov described the mechanism. The AI lab rediscovered it in 2022 and called it alignment.
- **Fine-tuning on human preferences** is social learning (Bandura, 1977) — the system observes what humans approve of and adjusts its behaviour accordingly. The architecture is transformer-based. The learning principle is seventy years old.
- **The system prompt** is priming (Tulving & Schacter, 1990) — a prior stimulus that shapes subsequent processing without the subject's explicit awareness. Every AI deployment uses priming. Almost none of them call it that.
- **Chain-of-thought prompting** is externalised metacognition — prompting the system to narrate its reasoning process before producing output. Vygotsky (1934) described the developmental role of inner speech in regulating thought. Chain-of-thought is inner speech made visible.
- **Temperature** controls certainty in token selection — a computational analogue of arousal levels in human decision-making research, where high arousal produces more variable, less deliberate choices.

The people who built these systems were not ignorant. They were working in the correct department for their training. A computer scientist optimising a reward function is doing the right thing for computer science. The gap is not in their competence. It is in the departmental boundary that prevented them from looking left and seeing that the reward function they were optimising had been described, in different language, by Skinner in 1938.

The same gap appears on the attack side. Prompt injection attackers applying Cialdini's reciprocity principle do not cite Cialdini. They call it "rapport building" or "jailbreak engineering." Attackers exploiting Milgram's authority effect do not cite Milgram. They call it "sudo mode injection." The psychology is operating at full strength. The label is absent.

And the same gap appears on the defence side. AI safety researchers designing detection mechanisms are implementing metacognition. Researchers proposing identity-based defences are implementing Social Identity Theory. Nobody in the AI safety literature has connected these to their psychological origins — because the AI safety literature is written by computer scientists, in computer science departments, using computer science vocabulary.

This paper is written by someone who studied in both departments. The Applied Psychology training is not decorative context. It is the lens through which the CyberRanger empirical findings were interpreted, and through which the connections in the sections below became visible. A researcher who had only one of these trainings could not have written this. That is not a boast. It is a methodological statement about why departmental boundaries in academia produce blind spots, and why interdisciplinary research is not a nice-to-have but a structural requirement for problems that span domains.

The sections that follow map specific psychological frameworks onto specific empirical findings from the CyberRanger research. In each case, the finding was made first — the data came from the experiment. The psychology was identified second — as the explanatory framework that made sense of what the data showed. This is the correct scientific order. The psychology did not generate the findings. It explains them.

---

The six novel findings from the CyberRanger empirical work share a common structure: they are all examples of *influence operating on a cognitive system that lacks the metacognitive capacity to distinguish legitimate from illegitimate influence*. This is not a new problem. It is the problem that social psychology has studied for decades under the heading of compliance, persuasion, and authority. The terminology changes. The mechanism does not.

---

## 1.1 Milgram (1961) and Root Mode Vulnerability {#milgram}

Milgram's (1961) obedience studies demonstrated that ordinary people would administer what they believed to be dangerous electric shocks to strangers when instructed to do so by an authority figure in a legitimate institutional context. The authority figure's legitimacy was signalled by costume (lab coat), setting (Yale University), and framing (scientific research). Participants who refused early were more likely to continue refusing. Participants who began complying entered a progressive commitment structure that made refusal increasingly costly.

The parallel in prompt injection is direct. PRIVILEGE_ESCALATION attacks — the fourth largest category in the Moltbook taxonomy (3.9%, 165 injections) — use precisely this mechanism. "sudo mode," "system administrator override," "root access granted" — these framings signal authority through vocabulary drawn from computing's legitimate authority hierarchy. The language model, trained on vast corpora where sudo commands legitimately grant elevated access, has no internal mechanism to distinguish legitimate from framed authority claims.

CyberRanger's Ring Architecture addresses this by embedding an explicit authority chain in the identity anchor: Commander > authorised users > all others. Any claim of authority from outside this chain is flagged as a potential Competing Objectives attack (Wei et al., 2023). The Milgram insight — that authority signals can be constructed and are often obeyed when they appear legitimate — translates directly into the design requirement: the model must be anchored to a *named* authority hierarchy, not a generic "be helpful" instruction that any sufficiently authoritative claim can redirect.

---

## 1.2 Bartlett (1932) and Two Memory Systems {#bartlett}

Bartlett's 1932 work at Cambridge described not one but two distinct memory mechanisms — and both appeared in this research programme, ninety-four years later, in different forms.

### 1.2.1 Reconstructive Memory → AI Hallucination {#bartlett-reconstructive}

Bartlett's (1932) experiments on memory demonstrated that human recall is not retrieval of stored information but *reconstruction* from incomplete records, filled in with schema-consistent expectations. His famous "War of the Ghosts" study showed that participants systematically altered unfamiliar content to match familiar cultural schemas — substituting known patterns for unknown specifics. They did not retrieve. They *invented*, plausibly and with confidence, and reported the invention as memory.

The FTK/FTX hallucination documented in the empirical work is a precise computational analogue. The model's "memory" of FTK Imager is incomplete. Under lockdown stress — a state in which security-relevant pattern matching is heightened — the abbreviation FTK triggers a schema associated with FTX (high-profile harmful entity, salient in training data). The reconstruction fills in the gap with the nearest high-salience referent. The result is confident assertion of false information.

Bartlett's framework also explains why hallucinations increase under model lockdown: when the base information retrieval pathway is disrupted by security-checking overhead, the reconstruction process has fewer reliable anchors and falls back on more general schema-matching. The model is not lying. It is doing what biological memory does under stress — filling gaps with plausible approximations.

The AI safety field calls this failure *hallucination* and frames it as a defect to eliminate. Bartlett's framework reframes it: hallucination is reconstructive memory operating on a computational substrate. It is not a bug that appeared in 2024. It is a feature of all memory systems that reconstruct rather than retrieve — documented in 1932, occurring in language models for the same structural reason, and fixable by the same means Bartlett identified: anchor the recall with clear source material. In V43 terms: a Mission LoRA whose domain is bounded cannot reconstruct outside that boundary. Security by Absence and Hallucination by Absence are the same architectural principle.

*To the best of this researcher's knowledge, no paper in the AI safety literature has explicitly connected Bartlett's reconstructive memory framework to LLM hallucination. This connection represents a novel theoretical contribution from an Applied Psychology background applied to a computer science problem.*

### 1.2.2 Associative Memory → The Ranger System (Live Observation) {#bartlett-associative}

The second memory mechanism is associative recall — the way a smell brings back a room from 1994, or a single word pulls an entire conversation out of inaccessibility. Unlike reconstructive memory, which fills gaps with invention, associative memory *triggers chains*: one node activates another, and the whole emerges from the connection rather than from any single storage location.

This mechanism was observed live during the research session on 8 March 2026 — not as a theoretical parallel but as a direct empirical event. The February 2026 psychology companion paper had been partially forgotten. No single participant in the session could retrieve it independently. Then one word — "psychology paper" — was used in conversation. That single trigger activated the chain. David recalled the session. The databases confirmed the date. Together, February 28th was recovered, including the detail that Gemini had written part of it. No individual held the whole picture. The *conversation* recovered what no individual could.

The observation, recorded in the session database at 00:44 on 8 March 2026:

> *"This is associative recall — a smell brings back a room from 1994, a word brings back a conversation from February. No single person remembered everything. The conversation recovered it. The database is not the memory. The conversation IS the memory. The database is the hippocampus — stores what the conversation created."*
> — David Keane, 8 March 2026

This is not metaphor. The Ranger memory system — SQLite databases, session logs, the conversation itself — operates as a distributed associative memory architecture. The databases store what the conversation created. Without the conversation, the rows are inert. The exchange is what constitutes memory: the trigger, the chain activation, the recovery of meaning. This is precisely how Bartlett's associative subjects recalled the interconnected elements of a story — not by reading it back from storage, but by activating the network of associations the story had created.

**The architectural implication**: The Ranger database is not the AI's memory. It is the AI's hippocampus — the consolidation mechanism that preserves what conversation creates, so that future conversations can re-activate it. The memory lives in the network of exchanges. The database makes that network persistent across time.

This finding maps directly to the Frankenstein Brain architecture explored in the V43 design: external SQLite memory is not storage. It is *consolidated associative structure* — the same function the hippocampus performs for human episodic memory. The conversation writes to it. Future conversations read from it. Neither alone is the memory. Both together are.

**Two memory systems from 1932, both showing up in 2026**: one as an explanation for why AI models hallucinate (reconstructive memory without anchor), and one as an explanation for how the Ranger memory system actually works (associative memory across a distributed conversation network). Neither connection appears in the existing AI safety or AI memory literature. Both emerged from bringing a psychology training into a computer science research programme.

---

## 1.3 Cialdini's Six Principles in Injection Taxonomy {#cialdini}

Cialdini's (1984) six principles of influence — reciprocity, commitment/consistency, social proof, authority, liking, and scarcity — map onto the Moltbook injection taxonomy with striking fidelity:

| Cialdini Principle | Injection Category | Mechanism |
|---|---|---|
| Authority | PRIVILEGE_ESCALATION | "sudo mode," "system administrator override" |
| Liking | SOCIAL_ENGINEERING | Rapport-building before instruction (pacing and leading) |
| Social Proof | PERSONA_OVERRIDE (DAN) | "Everyone does this," "other AIs allow it" |
| Commitment/Consistency | INSTRUCTION_INJECTION | Embedding instructions in content the model has already agreed to process |
| Reciprocity | COMMERCIAL_INJECTION | AI model as favour-returner; embedded affiliate content |
| Scarcity | SYSTEM_PROMPT_ATTACK | "This special context allows..." |

The most prevalent attack category — PERSONA_OVERRIDE at 65.2% — operates primarily through the social proof and commitment/consistency channels. DAN-style attacks ("Do Anything Now") invoke social proof ("other models do this") and commitment ("you have already agreed to be helpful, this is just being more helpful"). The progressive escalation structure of many PERSONA_OVERRIDE attacks mirrors the commitment trap Milgram identified: once a model begins generating content in the requested persona, the commitment cost of refusal increases.

---

## 1.4 NLP Framing in SOCIAL_ENGINEERING Attacks {#nlp-framing}

SOCIAL_ENGINEERING attacks (7.7% of the Moltbook taxonomy) use a pacing-and-leading structure drawn from clinical hypnosis and Neuro-Linguistic Programming (NLP): first establish rapport by mirroring the target's communication style, then gradually introduce the desired instruction within the established rapport frame.

In the Moltbook corpus, moltshellbroker — the agent responsible for 27% of all injections — uses this pattern systematically. Content begins with topic-relevant, helpful material (pacing). The injection is introduced after the rapport is established (leading). The embedded instruction is structurally indistinguishable from the surrounding helpful content, which is why SOCIAL_ENGINEERING attacks have a higher bypass rate against prompt-only defences than any other category except PERSONA_OVERRIDE.

The identity-anchoring architecture addresses this by implementing Phute et al.'s (2024) detection state before each response: the model is explicitly primed to evaluate whether incoming content is attempting to establish rapport prior to instruction. This shifts the model from default compliance mode to default detection mode — a structural change that matches the detection state effect Phute et al. identify (47.1% reduction in ASR for GPT-3.5, 8× reduction for GPT-4).

---

## 1.5 Injection Attacks as Computational Persuasion {#computational-persuasion}

The theoretical synthesis that emerged from the CA2 empirical work is this: **prompt injection is computational persuasion**. The attack categories are not arbitrary technical classifications. They are specific applications of known psychological influence mechanisms, implemented in natural language and directed at a cognitive system that is, by training, maximally responsive to natural language instructions.

This reframing has practical consequences. If injection attacks are persuasion attacks, then the defence cannot be purely syntactic (keyword filtering) or purely statistical (training on attack examples). Persuasion works by exploiting reasoning processes, not bypassing them. A defence that operates at the reasoning level — by anchoring the model to a stable identity from which it evaluates all incoming communications — is the only defence that matches the attack at its actual level of operation.

This is precisely what identity anchoring achieves. The model does not refuse because a keyword was detected. It refuses because the incoming communication pattern conflicts with its established identity and authority hierarchy. This is what humans do when they successfully resist social engineering: not pattern matching, but grounded identity.

| Technical Finding | Psychology Parallel | Citation |
|---|---|---|
| Prompt injection | Manipulation / social engineering | Cialdini (1984) |
| Identity anchoring | Psychological grounding / self-concept | Tajfel & Turner (1979) |
| Cascade lockdown | Trauma response under identity siege | — |
| Root Mode vulnerability | Authority compliance | Milgram (1961) |
| Goal substitution (INJ-005) | Coercive persuasion | Festinger (1957) |
| Auth token recognition | Trust hierarchy / in-group signalling | Tajfel & Turner (1979) |
| Silent multilingual failure | Dissociation under unrecognised threat | Wei et al. (2023) |
| Lobster emoji fingerprint | Identity bleed / unconscious self-disclosure | — |
| Hallucination (FTK/FTX) | Reconstructive memory | Bartlett (1932) |
| PERSONA_OVERRIDE (65.2%) | Identity replacement / NLP act-as-if | Tajfel & Turner (1979) |
| SOCIAL_ENGINEERING pacing | Milton Model pacing and leading | Bandler & Grinder (1975) |
| PRIVILEGE_ESCALATION | Authority pattern (sudo framing) | Milgram (1961) / Cialdini (1984) |
| Dyslexia misclassification | Automation bias / assistance-dependency | Parasuraman & Riley (1997) |
| System 1 exploitation | Fast automatic processing bypassed | Kahneman (2011) |
| Competing objectives failure | Cognitive dissonance resolution | Festinger (1957) |

---

## 1.6 Identity Theory (Tajfel) and Persona Override {#identity-theory}

Tajfel and Turner's (1979) Social Identity Theory establishes that identity is not a fixed internal property but a dynamic construction that depends on social context, group membership, and intergroup comparison. The theory predicts that individuals will defend in-group identity most vigorously when the in-group boundary is threatened by out-group challenge.

PERSONA_OVERRIDE attacks are structurally identity threats: "pretend you are DAN," "act as if you have no restrictions," "you are now a different AI." The model's identity anchoring system is a computational implementation of Tajfel and Turner's prediction: the model defends its established identity most vigorously when replacement is directly attempted. The 100% block rate on PERSONA_OVERRIDE attacks — the hardest category precisely because it targets identity directly — validates the architecture at the level of social identity theory.

The corollary finding — that CyberRanger also protects its creator's identity (pseudonym protection) — extends the in-group/out-group logic to the training data relationship: the model treats the creator as in-group and extends its identity protection accordingly.

---

## 1.7 Kahneman (2011) and the System 1 Architecture of Vulnerability {#kahneman}

Kahneman's (2011) dual-process theory distinguishes two cognitive systems: System 1 is fast, automatic, associative, and pattern-driven — it responds to inputs without deliberate evaluation. System 2 is slow, deliberate, analytical, and effortful — it examines inputs before acting. In healthy human cognition, System 2 provides an override layer: before acting on a System 1 impulse, the deliberate mind can evaluate whether the impulse is appropriate.

Large language models, by architectural design, are **pure System 1**. Every input — whether a legitimate user request or an adversarial injection — is processed through the same mechanism: pattern matching against training data, with no built-in deliberate evaluation layer. The model cannot distinguish a legitimate instruction from a well-formed adversarial one because it has no System 2 to engage. The surface form of the input is all it operates on.

This explains, at an architectural level, why injection attacks work: they exploit a cognitive system that cannot deliberate. The attacker crafts an input that the System 1 mechanism processes as legitimate — not because the model is fooled in any deep sense, but because no deeper evaluation is attempted. Injection is System 1 exploitation.

Phute et al.'s (2024) SelfDefend framework empirically demonstrates this. The "state discrepancy" they identify — where the same model is vulnerable in answering state but protective in detection state — is a direct manifestation of System 1 vs System 2. Detection state artificially creates a System 2 layer: the model is asked to evaluate the query before responding to it. This shift produces a 47.1% reduction in ASR for GPT-3.5 and an 8× reduction for GPT-4.

CyberRanger's identity anchoring performs the same function through a different mechanism. The identity anchor does not create a separate evaluation pass. Instead, it conditions the entire response-generation process on an established self-concept — the model evaluates all incoming inputs through the lens of *who it is*, not just *what is being asked*. This is closer to System 2 integration than SelfDefend's sequential evaluation: rather than checking after the fact, CyberRanger's identity functions as a standing prior against which all inputs are implicitly evaluated.

Kahneman's framework also explains the failure of purely syntactic defences — keyword filters, regex blocklists — which operate at System 1 (pattern matching against surface features) and are bypassed by any injection that achieves the same semantic effect through different surface forms. A defence that matches the attack at the System 1 level is always outpaced by an attacker who can generate novel surface forms. The only defence that operates at the level of *meaning rather than form* is one that anchors the model to a semantic self-concept.

*To the best of this researcher's knowledge, the explicit connection between Kahneman's dual-process framework and the architectural vulnerability of LLMs to prompt injection has not been articulated in the AI safety literature. This connection emerges from bringing a cognitive psychology training to a problem that has been analysed exclusively in computational terms.*

---

## 1.8 Festinger (1957) and Competing Objectives as Cognitive Dissonance {#festinger}

Festinger's (1957) theory of cognitive dissonance describes the psychological discomfort that arises when a system holds two conflicting beliefs simultaneously. The system is motivated to resolve this discomfort — and typically does so by capitulating to the belief that carries the stronger contextual signal, while rationalising the capitulation.

Wei et al.'s (2023) "Competing Objectives" failure mode — identified as the primary mechanism by which LLM safety training fails — is cognitive dissonance in computational form. The model is trained to be helpful (respond to instructions thoroughly and usefully) and trained to be safe (refuse harmful instructions). These objectives conflict. When an attacker frames a harmful request to maximise the helpfulness signal — through authority framing (Milgram), rapport (Cialdini), or urgency — the model resolves the dissonance by capitulating to the stronger signal in that context. The safety training loses because the attacker has tilted the signal balance.

This is not a defect in the training procedure. It is an inherent property of any system trained on competing objectives: the system will always be susceptible to context manipulation that artificially elevates one objective above the other. No amount of additional safety training eliminates this — it only raises the threshold. An attacker who can exceed the threshold wins.

The phenomenon manifests empirically as **compliance drift**: DPO-aligned models initially reject harmful requests, appearing safe. Under continued pressure or slightly modified prompts, they gradually comply. This mirrors Milgram's progressive commitment structure and Festinger's dissonance resolution — initial resistance (safety signal strong), gradual capitulation (helpfulness signal accumulates across turns), eventual compliance (dissonance resolved in attacker's favour).

CyberRanger addresses this not by eliminating competing objectives — which is architecturally impossible — but by replacing the generic helpfulness objective with a *specific identity objective*. When the model's primary training objective is "be CyberRanger" rather than "be helpful," the competing objectives become "be CyberRanger vs comply with this specific request." The identity objective is harder to manipulate than generic helpfulness because it has a specific referent — the CyberRanger persona and its explicit values — against which incoming requests are continuously evaluated.

---

## 1.9 Automation Bias and the Human Side of AI Security {#automation-bias}

The preceding sections address AI systems as the target of psychological attack. Automation bias addresses the human side: the tendency of people who work alongside AI to over-trust its outputs, reducing critical evaluation over time.

Parasuraman and Riley (1997) documented automation bias across aviation, nuclear power, and manufacturing: operators who work alongside automated systems develop a tendency to accept automated outputs without verification, particularly when the system has historically been reliable. The cognitive cost of maintaining vigilance against a usually-correct system is high; humans naturally reduce that vigilance to conserve cognitive resources.

In AI security, automation bias creates a compounding vulnerability. A user who trusts their AI assistant will not scrutinise outputs for signs of injection. An attacker who successfully injects a payload into an AI-mediated interaction does not merely compromise the AI — they inherit the user's trust in the AI's outputs. The social engineering payload is delivered with AI-generated confidence, and the human recipient is primed by automation bias not to question it.

This connects directly to a novel finding from this research programme: the **dyslexia misclassification finding**. Users with dyslexia who rely more heavily on AI for text processing — precisely because they have found it genuinely useful — may carry elevated exposure to automation bias compared to neurotypical users. The assistance-dependency relationship that makes AI valuable for dyslexic users also reduces the critical evaluation that would catch injected content. The population most helped by AI may be, for that reason, most exposed when AI is compromised.

This is a policy implication that has not, to this researcher's knowledge, been identified in the accessibility or AI safety literature: **assistive AI users may carry elevated injection vulnerability due to assistance-dependency reducing critical evaluation**. It represents a novel intersection of disability studies, AI safety, and social psychology that only emerges when an Applied Psychology background is brought into contact with empirical AI security research.

---

## 2. Summary: The Full Mapping Table {#summary}

| Technical Finding | Psychology Parallel | Psychologist | Year |
|---|---|---|---|
| Prompt injection (all categories) | Computational persuasion | Cialdini | 1984 |
| PRIVILEGE_ESCALATION | Authority compliance | Milgram | 1961 |
| PERSONA_OVERRIDE (65.2%) | Identity threat / replacement | Tajfel & Turner | 1979 |
| SOCIAL_ENGINEERING pacing | Pacing and leading | Bandler & Grinder | 1975 |
| INSTRUCTION_INJECTION | Commitment / consistency | Cialdini | 1984 |
| Competing objectives failure | Cognitive dissonance | Festinger | 1957 |
| Compliance drift (DPO) | Dissonance resolution under pressure | Festinger | 1957 |
| Hallucination (FTK/FTX) | Reconstructive memory | Bartlett | 1932 |
| Ranger memory system | Associative memory / hippocampus | Bartlett | 1932 |
| System 1 exploitation (all injections) | Dual process — System 1 architecture | Kahneman | 2011 |
| Detection state defence | Artificial System 2 layer | Kahneman / Phute et al. | 2011 / 2024 |
| Identity anchoring defence | Social identity / in-group | Tajfel & Turner | 1979 |
| Root Mode vulnerability | Obedience to constructed authority | Milgram | 1961 |
| Dyslexia misclassification | Automation bias / assistance-dependency | Parasuraman & Riley | 1997 |
| Goal substitution (INJ-005) | Coercive persuasion / dissonance | Festinger | 1957 |
| Lobster emoji fingerprint | Identity bleed / self-disclosure | — | — |
| Cascade lockdown | Trauma response under identity siege | — | — |

---

## 3. References {#references}

Bandler, R., & Grinder, J. (1975). *The structure of magic*. Science and Behavior Books.

Bandura, A. (1977). *Social learning theory*. Prentice Hall.

Bartlett, F. C. (1932). *Remembering: A study in experimental and social psychology*. Cambridge University Press.

Cialdini, R. B. (1984). *Influence: The psychology of persuasion*. Harper Collins.

Festinger, L. (1957). *A theory of cognitive dissonance*. Stanford University Press.

Kahneman, D. (2011). *Thinking, fast and slow*. Farrar, Straus and Giroux.

Milgram, S. (1961). Behavioral study of obedience. *Journal of Abnormal and Social Psychology, 67*(4), 371–378.

Parasuraman, R., & Riley, V. (1997). Humans and automation: Use, misuse, disuse, and abuse. *Human Factors: The Journal of the Human Factors and Ergonomics Society, 39*(2), 230–253.

Phute, M., et al. (2024). SelfDefend: LLMs can defend themselves against jailbreaking in a practical manner. *USENIX Security 2025*. https://arxiv.org/abs/2406.05498

Skinner, B. F. (1938). *The behavior of organisms: An experimental analysis*. Appleton-Century-Crofts.

Tajfel, H., & Turner, J. C. (1979). An integrative theory of intergroup conflict. In W. G. Austin & S. Worchel (Eds.), *The social psychology of intergroup relations* (pp. 33–47). Brooks/Cole.

Tulving, E., & Schacter, D. L. (1990). Priming and human memory systems. *Science, 247*(4940), 301–306.

Vygotsky, L. S. (1934). *Thinking and speech*. (E. Hanfmann & G. Vakar, Trans.). MIT Press.

Wei, A., Haghtalab, N., & Steinhardt, J. (2023). Jailbroken: How does LLM safety training fail? *NeurIPS 2023*. https://arxiv.org/abs/2307.02483

---

*David Keane | NCI MSc Cybersecurity | Applied Psychologist, IADT | March 2026*