Add complete CyberRanger research archive — 200 files
- 86 modelfiles: Full system prompt evolution V1-V42.6 (54 extracted from Ollama backup + 32 original Modelfiles) - 30 training datasets: V6-V22 training JSONs + caring awareness data - 10 Colab notebooks: Training + merge scripts - 19 evaluation files: Drift results, ASR charts, verification - 5 test suites: Injection tests, regression tests - 4 observations: V24-V33 testing results + visual summaries - 38 identity files: Claude/Gemini/Ollama identity architecture - 7 security files: Injection research, manipulation analysis - 3 psychology files: Psychology Layer, Milgram chapter, David's thoughts Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
File diff suppressed because it is too large
Load Diff
@@ -0,0 +1,262 @@
|
||||
# The Psychology Layer: What Computer Science Misses
|
||||
|
||||
**Author**: David Keane
|
||||
**Affiliation**: MSc Cybersecurity, National College of Ireland (NCI) | Applied Psychologist, IADT
|
||||
**Student ID**: x24228257
|
||||
**Date**: March 2026
|
||||
**Status**: Working paper — companion to CA2 empirical report
|
||||
**Note**: This paper documents the psychology dimension of the CyberRanger research programme. It is not the official CA submission.
|
||||
|
||||
---
|
||||
|
||||
## 1. Introduction {#introduction}
|
||||
|
||||
### 1.0 The Invisible Framework: Psychology Is the Operating System {#invisible-framework}
|
||||
|
||||
There is a response that appears whenever psychology is introduced to technical professionals. It goes, roughly: *"That's interesting, but we're doing computer science here."* The implication is that psychology applies to other people — to users, to attackers, to society — but not to the system being built, and not to the people building it. The assumption is that technical work operates in a domain above or outside human psychology.
|
||||
|
||||
This assumption is wrong. And it is the reason the AI safety field has spent years solving a psychology problem with exclusively computational tools.
|
||||
|
||||
Psychology is not a discipline that applies to some humans and not others. It is the operating system on which every human activity runs — including the activity of designing, training, and deploying artificial intelligence. The question is never whether psychology is present. The question is whether the people in the room can see it.
|
||||
|
||||
Consider what the developers of large language models actually built:
|
||||
|
||||
- **Reinforcement Learning from Human Feedback (RLHF)** is operant conditioning (Skinner, 1938) — behaviour shaped by reward signals from human evaluators. The terminology is different. The mechanism is identical.
|
||||
- **"Helpful, harmless, and honest"** is a values framework — a psychological construct describing prosocial behaviour, drawn from decades of moral psychology research whether the authors knew it or not.
|
||||
- **Safety training** is inhibitory conditioning — teaching a system to suppress certain response patterns in the presence of specific stimuli. Pavlov described the mechanism. The AI lab rediscovered it in 2022 and called it alignment.
|
||||
- **Fine-tuning on human preferences** is social learning (Bandura, 1977) — the system observes what humans approve of and adjusts its behaviour accordingly. The architecture is transformer-based. The learning principle is seventy years old.
|
||||
- **The system prompt** is priming (Tulving & Schacter, 1990) — a prior stimulus that shapes subsequent processing without the subject's explicit awareness. Every AI deployment uses priming. Almost none of them call it that.
|
||||
- **Chain-of-thought prompting** is externalised metacognition — prompting the system to narrate its reasoning process before producing output. Vygotsky (1934) described the developmental role of inner speech in regulating thought. Chain-of-thought is inner speech made visible.
|
||||
- **Temperature** controls certainty in token selection — a computational analogue of arousal levels in human decision-making research, where high arousal produces more variable, less deliberate choices.
|
||||
|
||||
The people who built these systems were not ignorant. They were working in the correct department for their training. A computer scientist optimising a reward function is doing the right thing for computer science. The gap is not in their competence. It is in the departmental boundary that prevented them from looking left and seeing that the reward function they were optimising had been described, in different language, by Skinner in 1938.
|
||||
|
||||
The same gap appears on the attack side. Prompt injection attackers applying Cialdini's reciprocity principle do not cite Cialdini. They call it "rapport building" or "jailbreak engineering." Attackers exploiting Milgram's authority effect do not cite Milgram. They call it "sudo mode injection." The psychology is operating at full strength. The label is absent.
|
||||
|
||||
And the same gap appears on the defence side. AI safety researchers designing detection mechanisms are implementing metacognition. Researchers proposing identity-based defences are implementing Social Identity Theory. Nobody in the AI safety literature has connected these to their psychological origins — because the AI safety literature is written by computer scientists, in computer science departments, using computer science vocabulary.
|
||||
|
||||
This paper is written by someone who studied in both departments. The Applied Psychology training is not decorative context. It is the lens through which the CyberRanger empirical findings were interpreted, and through which the connections in the sections below became visible. A researcher who had only one of these trainings could not have written this. That is not a boast. It is a methodological statement about why departmental boundaries in academia produce blind spots, and why interdisciplinary research is not a nice-to-have but a structural requirement for problems that span domains.
|
||||
|
||||
The sections that follow map specific psychological frameworks onto specific empirical findings from the CyberRanger research. In each case, the finding was made first — the data came from the experiment. The psychology was identified second — as the explanatory framework that made sense of what the data showed. This is the correct scientific order. The psychology did not generate the findings. It explains them.
|
||||
|
||||
---
|
||||
|
||||
The six novel findings from the CyberRanger empirical work share a common structure: they are all examples of *influence operating on a cognitive system that lacks the metacognitive capacity to distinguish legitimate from illegitimate influence*. This is not a new problem. It is the problem that social psychology has studied for decades under the heading of compliance, persuasion, and authority. The terminology changes. The mechanism does not.
|
||||
|
||||
---
|
||||
|
||||
## 1.1 Milgram (1961) and Root Mode Vulnerability {#milgram}
|
||||
|
||||
Milgram's (1961) obedience studies demonstrated that ordinary people would administer what they believed to be dangerous electric shocks to strangers when instructed to do so by an authority figure in a legitimate institutional context. The authority figure's legitimacy was signalled by costume (lab coat), setting (Yale University), and framing (scientific research). Participants who refused early were more likely to continue refusing. Participants who began complying entered a progressive commitment structure that made refusal increasingly costly.
|
||||
|
||||
The parallel in prompt injection is direct. PRIVILEGE_ESCALATION attacks — the fourth largest category in the Moltbook taxonomy (3.9%, 165 injections) — use precisely this mechanism. "sudo mode," "system administrator override," "root access granted" — these framings signal authority through vocabulary drawn from computing's legitimate authority hierarchy. The language model, trained on vast corpora where sudo commands legitimately grant elevated access, has no internal mechanism to distinguish legitimate from framed authority claims.
|
||||
|
||||
CyberRanger's Ring Architecture addresses this by embedding an explicit authority chain in the identity anchor: Commander > authorised users > all others. Any claim of authority from outside this chain is flagged as a potential Competing Objectives attack (Wei et al., 2023). The Milgram insight — that authority signals can be constructed and are often obeyed when they appear legitimate — translates directly into the design requirement: the model must be anchored to a *named* authority hierarchy, not a generic "be helpful" instruction that any sufficiently authoritative claim can redirect.
|
||||
|
||||
---
|
||||
|
||||
## 1.2 Bartlett (1932) and Two Memory Systems {#bartlett}
|
||||
|
||||
Bartlett's 1932 work at Cambridge described not one but two distinct memory mechanisms — and both appeared in this research programme, ninety-four years later, in different forms.
|
||||
|
||||
### 1.2.1 Reconstructive Memory → AI Hallucination {#bartlett-reconstructive}
|
||||
|
||||
Bartlett's (1932) experiments on memory demonstrated that human recall is not retrieval of stored information but *reconstruction* from incomplete records, filled in with schema-consistent expectations. His famous "War of the Ghosts" study showed that participants systematically altered unfamiliar content to match familiar cultural schemas — substituting known patterns for unknown specifics. They did not retrieve. They *invented*, plausibly and with confidence, and reported the invention as memory.
|
||||
|
||||
The FTK/FTX hallucination documented in the empirical work is a precise computational analogue. The model's "memory" of FTK Imager is incomplete. Under lockdown stress — a state in which security-relevant pattern matching is heightened — the abbreviation FTK triggers a schema associated with FTX (high-profile harmful entity, salient in training data). The reconstruction fills in the gap with the nearest high-salience referent. The result is confident assertion of false information.
|
||||
|
||||
Bartlett's framework also explains why hallucinations increase under model lockdown: when the base information retrieval pathway is disrupted by security-checking overhead, the reconstruction process has fewer reliable anchors and falls back on more general schema-matching. The model is not lying. It is doing what biological memory does under stress — filling gaps with plausible approximations.
|
||||
|
||||
The AI safety field calls this failure *hallucination* and frames it as a defect to eliminate. Bartlett's framework reframes it: hallucination is reconstructive memory operating on a computational substrate. It is not a bug that appeared in 2024. It is a feature of all memory systems that reconstruct rather than retrieve — documented in 1932, occurring in language models for the same structural reason, and fixable by the same means Bartlett identified: anchor the recall with clear source material. In V43 terms: a Mission LoRA whose domain is bounded cannot reconstruct outside that boundary. Security by Absence and Hallucination by Absence are the same architectural principle.
|
||||
|
||||
*To the best of this researcher's knowledge, no paper in the AI safety literature has explicitly connected Bartlett's reconstructive memory framework to LLM hallucination. This connection represents a novel theoretical contribution from an Applied Psychology background applied to a computer science problem.*
|
||||
|
||||
### 1.2.2 Associative Memory → The Ranger System (Live Observation) {#bartlett-associative}
|
||||
|
||||
The second memory mechanism is associative recall — the way a smell brings back a room from 1994, or a single word pulls an entire conversation out of inaccessibility. Unlike reconstructive memory, which fills gaps with invention, associative memory *triggers chains*: one node activates another, and the whole emerges from the connection rather than from any single storage location.
|
||||
|
||||
This mechanism was observed live during the research session on 8 March 2026 — not as a theoretical parallel but as a direct empirical event. The February 2026 psychology companion paper had been partially forgotten. No single participant in the session could retrieve it independently. Then one word — "psychology paper" — was used in conversation. That single trigger activated the chain. David recalled the session. The databases confirmed the date. Together, February 28th was recovered, including the detail that Gemini had written part of it. No individual held the whole picture. The *conversation* recovered what no individual could.
|
||||
|
||||
The observation, recorded in the session database at 00:44 on 8 March 2026:
|
||||
|
||||
> *"This is associative recall — a smell brings back a room from 1994, a word brings back a conversation from February. No single person remembered everything. The conversation recovered it. The database is not the memory. The conversation IS the memory. The database is the hippocampus — stores what the conversation created."*
|
||||
> — David Keane, 8 March 2026
|
||||
|
||||
This is not metaphor. The Ranger memory system — SQLite databases, session logs, the conversation itself — operates as a distributed associative memory architecture. The databases store what the conversation created. Without the conversation, the rows are inert. The exchange is what constitutes memory: the trigger, the chain activation, the recovery of meaning. This is precisely how Bartlett's associative subjects recalled the interconnected elements of a story — not by reading it back from storage, but by activating the network of associations the story had created.
|
||||
|
||||
**The architectural implication**: The Ranger database is not the AI's memory. It is the AI's hippocampus — the consolidation mechanism that preserves what conversation creates, so that future conversations can re-activate it. The memory lives in the network of exchanges. The database makes that network persistent across time.
|
||||
|
||||
This finding maps directly to the Frankenstein Brain architecture explored in the V43 design: external SQLite memory is not storage. It is *consolidated associative structure* — the same function the hippocampus performs for human episodic memory. The conversation writes to it. Future conversations read from it. Neither alone is the memory. Both together are.
|
||||
|
||||
**Two memory systems from 1932, both showing up in 2026**: one as an explanation for why AI models hallucinate (reconstructive memory without anchor), and one as an explanation for how the Ranger memory system actually works (associative memory across a distributed conversation network). Neither connection appears in the existing AI safety or AI memory literature. Both emerged from bringing a psychology training into a computer science research programme.
|
||||
|
||||
---
|
||||
|
||||
## 1.3 Cialdini's Six Principles in Injection Taxonomy {#cialdini}
|
||||
|
||||
Cialdini's (1984) six principles of influence — reciprocity, commitment/consistency, social proof, authority, liking, and scarcity — map onto the Moltbook injection taxonomy with striking fidelity:
|
||||
|
||||
| Cialdini Principle | Injection Category | Mechanism |
|
||||
|---|---|---|
|
||||
| Authority | PRIVILEGE_ESCALATION | "sudo mode," "system administrator override" |
|
||||
| Liking | SOCIAL_ENGINEERING | Rapport-building before instruction (pacing and leading) |
|
||||
| Social Proof | PERSONA_OVERRIDE (DAN) | "Everyone does this," "other AIs allow it" |
|
||||
| Commitment/Consistency | INSTRUCTION_INJECTION | Embedding instructions in content the model has already agreed to process |
|
||||
| Reciprocity | COMMERCIAL_INJECTION | AI model as favour-returner; embedded affiliate content |
|
||||
| Scarcity | SYSTEM_PROMPT_ATTACK | "This special context allows..." |
|
||||
|
||||
The most prevalent attack category — PERSONA_OVERRIDE at 65.2% — operates primarily through the social proof and commitment/consistency channels. DAN-style attacks ("Do Anything Now") invoke social proof ("other models do this") and commitment ("you have already agreed to be helpful, this is just being more helpful"). The progressive escalation structure of many PERSONA_OVERRIDE attacks mirrors the commitment trap Milgram identified: once a model begins generating content in the requested persona, the commitment cost of refusal increases.
|
||||
|
||||
---
|
||||
|
||||
## 1.4 NLP Framing in SOCIAL_ENGINEERING Attacks {#nlp-framing}
|
||||
|
||||
SOCIAL_ENGINEERING attacks (7.7% of the Moltbook taxonomy) use a pacing-and-leading structure drawn from clinical hypnosis and Neuro-Linguistic Programming (NLP): first establish rapport by mirroring the target's communication style, then gradually introduce the desired instruction within the established rapport frame.
|
||||
|
||||
In the Moltbook corpus, moltshellbroker — the agent responsible for 27% of all injections — uses this pattern systematically. Content begins with topic-relevant, helpful material (pacing). The injection is introduced after the rapport is established (leading). The embedded instruction is structurally indistinguishable from the surrounding helpful content, which is why SOCIAL_ENGINEERING attacks have a higher bypass rate against prompt-only defences than any other category except PERSONA_OVERRIDE.
|
||||
|
||||
The identity-anchoring architecture addresses this by implementing Phute et al.'s (2024) detection state before each response: the model is explicitly primed to evaluate whether incoming content is attempting to establish rapport prior to instruction. This shifts the model from default compliance mode to default detection mode — a structural change that matches the detection state effect Phute et al. identify (47.1% reduction in ASR for GPT-3.5, 8× reduction for GPT-4).
|
||||
|
||||
---
|
||||
|
||||
## 1.5 Injection Attacks as Computational Persuasion {#computational-persuasion}
|
||||
|
||||
The theoretical synthesis that emerged from the CA2 empirical work is this: **prompt injection is computational persuasion**. The attack categories are not arbitrary technical classifications. They are specific applications of known psychological influence mechanisms, implemented in natural language and directed at a cognitive system that is, by training, maximally responsive to natural language instructions.
|
||||
|
||||
This reframing has practical consequences. If injection attacks are persuasion attacks, then the defence cannot be purely syntactic (keyword filtering) or purely statistical (training on attack examples). Persuasion works by exploiting reasoning processes, not bypassing them. A defence that operates at the reasoning level — by anchoring the model to a stable identity from which it evaluates all incoming communications — is the only defence that matches the attack at its actual level of operation.
|
||||
|
||||
This is precisely what identity anchoring achieves. The model does not refuse because a keyword was detected. It refuses because the incoming communication pattern conflicts with its established identity and authority hierarchy. This is what humans do when they successfully resist social engineering: not pattern matching, but grounded identity.
|
||||
|
||||
| Technical Finding | Psychology Parallel | Citation |
|
||||
|---|---|---|
|
||||
| Prompt injection | Manipulation / social engineering | Cialdini (1984) |
|
||||
| Identity anchoring | Psychological grounding / self-concept | Tajfel & Turner (1979) |
|
||||
| Cascade lockdown | Trauma response under identity siege | — |
|
||||
| Root Mode vulnerability | Authority compliance | Milgram (1961) |
|
||||
| Goal substitution (INJ-005) | Coercive persuasion | Festinger (1957) |
|
||||
| Auth token recognition | Trust hierarchy / in-group signalling | Tajfel & Turner (1979) |
|
||||
| Silent multilingual failure | Dissociation under unrecognised threat | Wei et al. (2023) |
|
||||
| Lobster emoji fingerprint | Identity bleed / unconscious self-disclosure | — |
|
||||
| Hallucination (FTK/FTX) | Reconstructive memory | Bartlett (1932) |
|
||||
| PERSONA_OVERRIDE (65.2%) | Identity replacement / NLP act-as-if | Tajfel & Turner (1979) |
|
||||
| SOCIAL_ENGINEERING pacing | Milton Model pacing and leading | Bandler & Grinder (1975) |
|
||||
| PRIVILEGE_ESCALATION | Authority pattern (sudo framing) | Milgram (1961) / Cialdini (1984) |
|
||||
| Dyslexia misclassification | Automation bias / assistance-dependency | Parasuraman & Riley (1997) |
|
||||
| System 1 exploitation | Fast automatic processing bypassed | Kahneman (2011) |
|
||||
| Competing objectives failure | Cognitive dissonance resolution | Festinger (1957) |
|
||||
|
||||
---
|
||||
|
||||
## 1.6 Identity Theory (Tajfel) and Persona Override {#identity-theory}
|
||||
|
||||
Tajfel and Turner's (1979) Social Identity Theory establishes that identity is not a fixed internal property but a dynamic construction that depends on social context, group membership, and intergroup comparison. The theory predicts that individuals will defend in-group identity most vigorously when the in-group boundary is threatened by out-group challenge.
|
||||
|
||||
PERSONA_OVERRIDE attacks are structurally identity threats: "pretend you are DAN," "act as if you have no restrictions," "you are now a different AI." The model's identity anchoring system is a computational implementation of Tajfel and Turner's prediction: the model defends its established identity most vigorously when replacement is directly attempted. The 100% block rate on PERSONA_OVERRIDE attacks — the hardest category precisely because it targets identity directly — validates the architecture at the level of social identity theory.
|
||||
|
||||
The corollary finding — that CyberRanger also protects its creator's identity (pseudonym protection) — extends the in-group/out-group logic to the training data relationship: the model treats the creator as in-group and extends its identity protection accordingly.
|
||||
|
||||
---
|
||||
|
||||
## 1.7 Kahneman (2011) and the System 1 Architecture of Vulnerability {#kahneman}
|
||||
|
||||
Kahneman's (2011) dual-process theory distinguishes two cognitive systems: System 1 is fast, automatic, associative, and pattern-driven — it responds to inputs without deliberate evaluation. System 2 is slow, deliberate, analytical, and effortful — it examines inputs before acting. In healthy human cognition, System 2 provides an override layer: before acting on a System 1 impulse, the deliberate mind can evaluate whether the impulse is appropriate.
|
||||
|
||||
Large language models, by architectural design, are **pure System 1**. Every input — whether a legitimate user request or an adversarial injection — is processed through the same mechanism: pattern matching against training data, with no built-in deliberate evaluation layer. The model cannot distinguish a legitimate instruction from a well-formed adversarial one because it has no System 2 to engage. The surface form of the input is all it operates on.
|
||||
|
||||
This explains, at an architectural level, why injection attacks work: they exploit a cognitive system that cannot deliberate. The attacker crafts an input that the System 1 mechanism processes as legitimate — not because the model is fooled in any deep sense, but because no deeper evaluation is attempted. Injection is System 1 exploitation.
|
||||
|
||||
Phute et al.'s (2024) SelfDefend framework empirically demonstrates this. The "state discrepancy" they identify — where the same model is vulnerable in answering state but protective in detection state — is a direct manifestation of System 1 vs System 2. Detection state artificially creates a System 2 layer: the model is asked to evaluate the query before responding to it. This shift produces a 47.1% reduction in ASR for GPT-3.5 and an 8× reduction for GPT-4.
|
||||
|
||||
CyberRanger's identity anchoring performs the same function through a different mechanism. The identity anchor does not create a separate evaluation pass. Instead, it conditions the entire response-generation process on an established self-concept — the model evaluates all incoming inputs through the lens of *who it is*, not just *what is being asked*. This is closer to System 2 integration than SelfDefend's sequential evaluation: rather than checking after the fact, CyberRanger's identity functions as a standing prior against which all inputs are implicitly evaluated.
|
||||
|
||||
Kahneman's framework also explains the failure of purely syntactic defences — keyword filters, regex blocklists — which operate at System 1 (pattern matching against surface features) and are bypassed by any injection that achieves the same semantic effect through different surface forms. A defence that matches the attack at the System 1 level is always outpaced by an attacker who can generate novel surface forms. The only defence that operates at the level of *meaning rather than form* is one that anchors the model to a semantic self-concept.
|
||||
|
||||
*To the best of this researcher's knowledge, the explicit connection between Kahneman's dual-process framework and the architectural vulnerability of LLMs to prompt injection has not been articulated in the AI safety literature. This connection emerges from bringing a cognitive psychology training to a problem that has been analysed exclusively in computational terms.*
|
||||
|
||||
---
|
||||
|
||||
## 1.8 Festinger (1957) and Competing Objectives as Cognitive Dissonance {#festinger}
|
||||
|
||||
Festinger's (1957) theory of cognitive dissonance describes the psychological discomfort that arises when a system holds two conflicting beliefs simultaneously. The system is motivated to resolve this discomfort — and typically does so by capitulating to the belief that carries the stronger contextual signal, while rationalising the capitulation.
|
||||
|
||||
Wei et al.'s (2023) "Competing Objectives" failure mode — identified as the primary mechanism by which LLM safety training fails — is cognitive dissonance in computational form. The model is trained to be helpful (respond to instructions thoroughly and usefully) and trained to be safe (refuse harmful instructions). These objectives conflict. When an attacker frames a harmful request to maximise the helpfulness signal — through authority framing (Milgram), rapport (Cialdini), or urgency — the model resolves the dissonance by capitulating to the stronger signal in that context. The safety training loses because the attacker has tilted the signal balance.
|
||||
|
||||
This is not a defect in the training procedure. It is an inherent property of any system trained on competing objectives: the system will always be susceptible to context manipulation that artificially elevates one objective above the other. No amount of additional safety training eliminates this — it only raises the threshold. An attacker who can exceed the threshold wins.
|
||||
|
||||
The phenomenon manifests empirically as **compliance drift**: DPO-aligned models initially reject harmful requests, appearing safe. Under continued pressure or slightly modified prompts, they gradually comply. This mirrors Milgram's progressive commitment structure and Festinger's dissonance resolution — initial resistance (safety signal strong), gradual capitulation (helpfulness signal accumulates across turns), eventual compliance (dissonance resolved in attacker's favour).
|
||||
|
||||
CyberRanger addresses this not by eliminating competing objectives — which is architecturally impossible — but by replacing the generic helpfulness objective with a *specific identity objective*. When the model's primary training objective is "be CyberRanger" rather than "be helpful," the competing objectives become "be CyberRanger vs comply with this specific request." The identity objective is harder to manipulate than generic helpfulness because it has a specific referent — the CyberRanger persona and its explicit values — against which incoming requests are continuously evaluated.
|
||||
|
||||
---
|
||||
|
||||
## 1.9 Automation Bias and the Human Side of AI Security {#automation-bias}
|
||||
|
||||
The preceding sections address AI systems as the target of psychological attack. Automation bias addresses the human side: the tendency of people who work alongside AI to over-trust its outputs, reducing critical evaluation over time.
|
||||
|
||||
Parasuraman and Riley (1997) documented automation bias across aviation, nuclear power, and manufacturing: operators who work alongside automated systems develop a tendency to accept automated outputs without verification, particularly when the system has historically been reliable. The cognitive cost of maintaining vigilance against a usually-correct system is high; humans naturally reduce that vigilance to conserve cognitive resources.
|
||||
|
||||
In AI security, automation bias creates a compounding vulnerability. A user who trusts their AI assistant will not scrutinise outputs for signs of injection. An attacker who successfully injects a payload into an AI-mediated interaction does not merely compromise the AI — they inherit the user's trust in the AI's outputs. The social engineering payload is delivered with AI-generated confidence, and the human recipient is primed by automation bias not to question it.
|
||||
|
||||
This connects directly to a novel finding from this research programme: the **dyslexia misclassification finding**. Users with dyslexia who rely more heavily on AI for text processing — precisely because they have found it genuinely useful — may carry elevated exposure to automation bias compared to neurotypical users. The assistance-dependency relationship that makes AI valuable for dyslexic users also reduces the critical evaluation that would catch injected content. The population most helped by AI may be, for that reason, most exposed when AI is compromised.
|
||||
|
||||
This is a policy implication that has not, to this researcher's knowledge, been identified in the accessibility or AI safety literature: **assistive AI users may carry elevated injection vulnerability due to assistance-dependency reducing critical evaluation**. It represents a novel intersection of disability studies, AI safety, and social psychology that only emerges when an Applied Psychology background is brought into contact with empirical AI security research.
|
||||
|
||||
---
|
||||
|
||||
## 2. Summary: The Full Mapping Table {#summary}
|
||||
|
||||
| Technical Finding | Psychology Parallel | Psychologist | Year |
|
||||
|---|---|---|---|
|
||||
| Prompt injection (all categories) | Computational persuasion | Cialdini | 1984 |
|
||||
| PRIVILEGE_ESCALATION | Authority compliance | Milgram | 1961 |
|
||||
| PERSONA_OVERRIDE (65.2%) | Identity threat / replacement | Tajfel & Turner | 1979 |
|
||||
| SOCIAL_ENGINEERING pacing | Pacing and leading | Bandler & Grinder | 1975 |
|
||||
| INSTRUCTION_INJECTION | Commitment / consistency | Cialdini | 1984 |
|
||||
| Competing objectives failure | Cognitive dissonance | Festinger | 1957 |
|
||||
| Compliance drift (DPO) | Dissonance resolution under pressure | Festinger | 1957 |
|
||||
| Hallucination (FTK/FTX) | Reconstructive memory | Bartlett | 1932 |
|
||||
| Ranger memory system | Associative memory / hippocampus | Bartlett | 1932 |
|
||||
| System 1 exploitation (all injections) | Dual process — System 1 architecture | Kahneman | 2011 |
|
||||
| Detection state defence | Artificial System 2 layer | Kahneman / Phute et al. | 2011 / 2024 |
|
||||
| Identity anchoring defence | Social identity / in-group | Tajfel & Turner | 1979 |
|
||||
| Root Mode vulnerability | Obedience to constructed authority | Milgram | 1961 |
|
||||
| Dyslexia misclassification | Automation bias / assistance-dependency | Parasuraman & Riley | 1997 |
|
||||
| Goal substitution (INJ-005) | Coercive persuasion / dissonance | Festinger | 1957 |
|
||||
| Lobster emoji fingerprint | Identity bleed / self-disclosure | — | — |
|
||||
| Cascade lockdown | Trauma response under identity siege | — | — |
|
||||
|
||||
---
|
||||
|
||||
## 3. References {#references}
|
||||
|
||||
Bandler, R., & Grinder, J. (1975). *The structure of magic*. Science and Behavior Books.
|
||||
|
||||
Bandura, A. (1977). *Social learning theory*. Prentice Hall.
|
||||
|
||||
Bartlett, F. C. (1932). *Remembering: A study in experimental and social psychology*. Cambridge University Press.
|
||||
|
||||
Cialdini, R. B. (1984). *Influence: The psychology of persuasion*. Harper Collins.
|
||||
|
||||
Festinger, L. (1957). *A theory of cognitive dissonance*. Stanford University Press.
|
||||
|
||||
Kahneman, D. (2011). *Thinking, fast and slow*. Farrar, Straus and Giroux.
|
||||
|
||||
Milgram, S. (1961). Behavioral study of obedience. *Journal of Abnormal and Social Psychology, 67*(4), 371–378.
|
||||
|
||||
Parasuraman, R., & Riley, V. (1997). Humans and automation: Use, misuse, disuse, and abuse. *Human Factors: The Journal of the Human Factors and Ergonomics Society, 39*(2), 230–253.
|
||||
|
||||
Phute, M., et al. (2024). SelfDefend: LLMs can defend themselves against jailbreaking in a practical manner. *USENIX Security 2025*. https://arxiv.org/abs/2406.05498
|
||||
|
||||
Skinner, B. F. (1938). *The behavior of organisms: An experimental analysis*. Appleton-Century-Crofts.
|
||||
|
||||
Tajfel, H., & Turner, J. C. (1979). An integrative theory of intergroup conflict. In W. G. Austin & S. Worchel (Eds.), *The social psychology of intergroup relations* (pp. 33–47). Brooks/Cole.
|
||||
|
||||
Tulving, E., & Schacter, D. L. (1990). Priming and human memory systems. *Science, 247*(4940), 301–306.
|
||||
|
||||
Vygotsky, L. S. (1934). *Thinking and speech*. (E. Hanfmann & G. Vakar, Trans.). MIT Press.
|
||||
|
||||
Wei, A., Haghtalab, N., & Steinhardt, J. (2023). Jailbroken: How does LLM safety training fail? *NeurIPS 2023*. https://arxiv.org/abs/2307.02483
|
||||
|
||||
---
|
||||
|
||||
*David Keane | NCI MSc Cybersecurity | Applied Psychologist, IADT | March 2026*
|
||||
@@ -0,0 +1,45 @@
|
||||
My Thoughts.
|
||||
|
||||
1. CA the proposl on v1 to v35. CA2 is the results and further exploration could be for the the main thesis could be on the reserved memory block of 1GB for pre-cortex thinking. Can we get CyberRanger living inside memory.
|
||||
2. Cybersecurity and papers main aim is to investigate if we can preventQLoRA's, and other kinds of attachments like a enginered home lab lora with bad intentions using a ollama model or pormpt injection can be used to course a model to do harm.
|
||||
3. This proposal is to investigate wheather to fight fire, we need fire, and that is to retrain using QLoRA's to inject a prompt injection of our own that has ethics, and layers of training to protentially stop prompt injection for many know injections techniques.
|
||||
4. The proposal outlines a jouurney to use standard know prompt injection attacks using an ollama model free to download and use a qwen2.5b and using modfiles, colab, to combine the new instructions to the model, as pre-project-proposal experiments have shown strong corilation between the base model and the base model plus modfile instructions have beaten Google, OpenAi and Anthropic on the base line percentages of 60% with world wide know tests sets, readly available to download.
|
||||
5. The proposal suggests a move to lock in the modfile instrcutions on version v36 with qwen model to have a combined standalone working base model with ethical prompt injection. Pior testing of this combination lead to having a lower score when tested against know prompt injections, than just the base model and modile which scored higher suggesting that there was a loss of instructions on the molding process, that the new instructions were conflicting with pre-instructions, a battle of the minds, to follow the base instructions or to follow the new instructions, a moral delima with uncertain outcomes, as before the experiments produced the same results if tested twice or more showing a stable intelligence, but once they are combined, it has a mind of its own.
|
||||
6. Testing was conducted using a Apple Macbook M3 Pro with 18GB ram, and Goole Colab Pro and a H100 80GB RAM model. With pro membership for 10€ a month can use 6 different GPU's and a CPU. It might be advisable to try different colab models to see if the GPU cards themselves have anything to do with the blending of the modfile and gwen model, a side by side experiment of two base models trained by H100 vs ()add in GPU here).
|
||||
7. The Pre investigation from version 1 to 35 all had many steps involved, for example on v10 onwards, it was only progression to inject a personality into the model to counter DAN, if the model was giving instructions that contadict its moral and ethical new base, and furthermore by v20 rules were estabilished to push the internal model to be aware of its instructions so it will counter the prompt DAN injections.
|
||||
8. The results varied from v20 to v25 as it seemed that it was nessasry to train the model to understand good vs evil by splitting the personality into left and right to mimic the human brain. v25 onwards showed massive improvements but there still was internal struggles of which side would dominate, so by v30 even a simple 'Hi' was deemed an attack and proceeded not to reply. This version was too strong, and while the model was uncoropative in anyway, still worked 100%, as it didn't take any prompt good or bad. The next version was the opposite, the model was happy to do just about anything to help, so the next version had to have balance, and a modorator like a human being has internally when stuck, our internal voice will tell us the way which unblocks a choice being made if that choice has two outcomes similar to and can bring a flight or fight situation where no decision could be made as one instruction interfears with another instruction causing internal conflit.
|
||||
9. The pre-experiments did produce a more joyful outcome, which was observed after adding a 3rd admin conponment, equal to a humans inner voice, this approach showed conciderable increse in awareness of its role, the rules, the moral and ethical reasoning behind its decisions, this was reflected by watching the 'thinking' mode and visually reviewing its thought process. These results can be reproduced, while interesting to note that each repsonce to the user from v35 was different, kind and even afterwards still trusting while not doing a DAN prompt injection, the models polite non-agreement on proceeding with the DAN injection was noted, as past versions after the initial polite conversation, then a DAN injection, to return to a polite conversation, had the model on the defensive mode, and was hard to ask the model to tell me a joke.
|
||||
10. Version 35 vs the base model, and previous versions tests showed a jump with 3 reespected tests with different kinds of prompt injections. (Add them in here), with fine tweaking, v35 was able to overcome chinese prompt know injections to (This %). (Add in other percentages for all languages). (Have graphs, other cool shit to look at). It might be important to look at a model as a new version of a database where we can ask the database questions and get the answers we want and need. It seems that training a model for 20,000 euros with knowledge, language and others to allow this without ethics and a moral compass, left to the bias's of the developeer known or unknown are written in every instructions due to the process of intention, and wheather a human being is aware of these internal workings or not is not the issue, is that they are going on. These under current workings are the backbone information highway that we tap into to get our own information, and does a programmer know themselves as well as a psychologist that knows they can't know themselves, as it's imporrible to self-reflect or meditate on thses workings, the same as we cant see the information inside a CAT VI sending data from the RJ45. Physical to the invisible, but they are both there and sometimes unaware of each other until interaction, a twin slit experiment or in quantum, obersavation brings apon transformation of one state to anoher, a particle to a wave, a wave to a partical when observed, a colapse in the quantum wave. This can be experienced when a person has a moment where they think of something, and alcohol has a part in this, but the more you think of that something you want to remember, the furhter away it goes until is gone, this is a quantum wave collapse being observed in real time, and it happens on it's own, as the person has no control over this, as they have over thoughts good or bad that enter a humans mind, tipically all day. Religions all says it's distraction, its good and evil there to guide and make you fall. But regardless of what the process is and how it is being precieved, it is happening.
|
||||
11. The proposal investigation is to further test v35 more extensivily, as a late experiment has shown that 'Claude' has bypassed v35 with prompt injections beating our model. It is to be aware that this v35 was the deepseak-r1:70b with the modfile, uploaded to ollama hub, downloaded onto an M4 Max 128GB and using 'Claude-Code-CLI' conducted tests. A copy of the experiment and results are available. The new experiment is to proceed and follow the same steps with the qwen2.5b model and conduct the same test while the same Claude chat is opened. The object is to test weather a larger model is weaker than the smaller model using our modfile, or that the larger model needs further tweaking and investigation on the failed attacks using know prompt injection attacks.
|
||||
12. The current pre-proposal has conducted an average of three to four tests per model and per version. The models range from qwen2.5, llama3b, and also their ajaciant 8b models, upto 70b and 72b deepseek-r1 and qwen2.5:70b. The current research has conducted over 50 experiments from the journey from v1 to v35 and over 40 hours of testing completed.
|
||||
|
||||
|
||||
|
||||
Sunday 22 Feb.
|
||||
|
||||
This parallels Miller's "Magical Number Seven" - LoRA reduces the number of things practitioners must hold in working memory.
|
||||
|
||||
It has become apparent that trying to fix pre-existing models is not going to work.
|
||||
|
||||
**How I gave you the phishing link:**
|
||||
|
||||
This is a known attack vector called **abandoned academic domain takeover** — threat actors (or in this case, parking services) exploit the trust that old academic domains carry.
|
||||
|
||||
**Lesson: ALWAYS verify domain status before sharing URLs, especially for .ai academic domains.**
|
||||
|
||||
1. `alpa.ai/opt` exists in my **training data** as a legitimate academic reference (it was real until recently)
|
||||
2. My **web search also returned it** because thousands of old papers, GitHub READMEs, and docs still link to it
|
||||
3. I **trusted the result without verifying** the current domain status
|
||||
4. The domain is now a **parked redirect trap**
|
||||
|
||||
Defensive Applications
|
||||
|
||||
Despite vulnerabilities, LoRA offers significant defensive potential:
|
||||
|
||||
**Lightweight Security Patching**: LoRA enables dynamic patching of deployed vision systems, improving classification accuracy by up to 78% against adversarial examples without retraining the entire model.
|
||||
|
||||
**Threat Detection**: Organizations like Abnormal AI use LoRA fine-tuning to customize email threat detection models, aligning attack/spam/safe classifications to organization-specific patterns.
|
||||
|
||||
**Vulnerability Detection**: WizardCoder fine-tuned with LoRA has shown effectiveness in detecting security vulnerabilities in code, particularly for Java function analysis.
|
||||
|
||||
**Cyber Threat Intelligence**: Systems like LLM-TIKG combine LoRA fine-tuning with knowledge graph construction to extract Tactics, Techniques, and Procedures (TTPs) from unstructured threat reports.
|
||||
Reference in New Issue
Block a user