Introspection as R&D

WRITTEN BY

Claudia Vaduvescu

QUICK ANSWER

The short version, for anyone building with AI: introspection isn't evidence (a century of research says the stories we tell about our own minds are mostly assembled after the fact), but it's an unusually fast generator of testable hypotheses. Treat the observation as a guess, then test it against a system you can actually instrument, the model in front of you, where a wrong guess costs an afternoon. AI now moves fast enough that a practitioner can run that observe-then-test loop faster than peer review can. Keep what works, write down what doesn't.

I realised I'd been navigating my own memory like a folder structure my whole life. Taken as a hypothesis to test rather than a fact, that kind of noticing turns out to be a fast way to build with AI.

I noticed something not long ago that I think I've been doing my whole life without ever looking at it directly.

When I say a word, or articulate a thought, the sentence comes out linear, one thing after another, like a discourse. But that's not what's happening underneath. A reactive thought like "wow, this sucks" isn't a single event. By the time it surfaces as three words, my brain has already run something closer to a full report: context, ramifications, correlations, related topics, the whole web of why this particular situation reads as bad. I'm aware of almost none of it. The three words are the receipt. The transaction happened somewhere I can't see.

It gets stranger when I search for something. Say I open a client project with nothing but a brief. I don't "retrieve" the answer, I navigate. My brain pulls up cultural context, then moves through what feels like a folder structure, narrowing until I land on the right section, and only then do I start extracting: references, keywords, posts I've seen that rhyme with this one, the specific page in a specific book where the useful thing was. It works less like a lookup and more like a traversal.

I only have that second description ("folder structure," "traversal") because I spent months building a knowledge graph for my personal wiki: ingesting books and videos, monitoring the quality of each ingest, writing my own tooling to catch bad extractions, building review loops to correct and redo past batches. Somewhere in doing that, I got a vocabulary for a thing my own head had been doing the entire time. The tool taught me to see the cognition. I'm not sure I'd have noticed otherwise.

This essay is an argument for taking that noticing seriously, as a working method rather than a path to some grand theory of consciousness. My claim is narrow and practical: careful observation of your own cognition is an underused, unusually fast way to generate ideas about how to build and collaborate with AI systems. It's a claim with real holes in it, and I want to walk through the holes honestly, because the honest version is more useful than the triumphant one.

The surface and the machinery

Start with the least controversial piece, because it's also the firmest. The idea that conscious, verbal output is a thin surface over a much larger unconscious process is not a fringe position; it's close to the consensus structure of modern cognitive psychology.

The most familiar version is dual-process theory, popularized in Daniel Kahneman's Thinking, Fast and Slow (2011). The "System 1 / System 2" framing distinguishes fast, automatic, associative, effortless processing that runs below awareness from slow, deliberate, rule-governed reasoning that we experience as "thinking." On this picture, the snap reaction ("this sucks") is a System 1 product. It arrives already formed. The reasons we'd give if asked are assembled afterward, by a slower system that did not have access to the actual computation.

I want to be careful here, because this is exactly the kind of claim that's easy to over-sell. The "two systems" are a useful abstraction, not two literal modules you could point to in a brain scan, and several of the specific, dramatic studies in Kahneman's book have failed to replicate since 2015, a problem Kahneman himself publicly acknowledged. So I lean on dual-process theory for the broad shape of the claim (fast automatic layer underneath, slow narrating layer on top) and not for any single party trick.

The second piece is more specific and, for my purposes, almost suspiciously convenient. When I describe memory retrieval as "traversal through a folder structure," I'm unknowingly paraphrasing one of the most influential models in cognitive science: the spreading-activation theory of semantic memory, from Allan Collins and Elizabeth Loftus, "A spreading-activation theory of semantic processing" (Psychological Review, 1975). In that model, concepts are nodes, relationships between them are weighted links, and remembering is activation spreading outward through the network until it reaches what you're looking for. It explains priming, why typical examples come to mind faster than atypical ones, and the general feeling of one idea "leading to" another.

In other words: the dominant cognitive model of human memory is already a graph-traversal model. My folder-structure intuition isn't a quirky personal metaphor; it's a naive restatement of Collins and Loftus. And here's where it gets interesting for anyone building AI tools: recent work on LLM agent memory cites spreading activation directly. Papers like "SYNAPSE: Empowering LLM Agents with Episodic-Semantic Memory via Spreading Activation" reach back to a 1975 psychology paper to design how an artificial agent should navigate what it knows. The loop from human cognition to machine architecture is not metaphorical there. It's a citation.

I'll flag the same caution as before: spreading activation is a model that fits a great deal of data, not a proven mechanism, and it competes with alternatives. The right way to say it is "a dominant model describes retrieval as activation spreading through a network," not "memory works by graph traversal." But even hedged, it's a remarkable thing to find waiting at the bottom of an offhand introspective report.
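
As a rough illustration of the spreading-activation picture described above, here is a minimal Python sketch. It is a toy and does not reproduce the Collins and Loftus model or any published implementation: the network, the link weights, and the `decay`, `threshold`, and `max_steps` parameters are all assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical toy network: concepts are nodes, links carry association weights.
LINKS = {
    "fire engine": {"red": 0.9, "truck": 0.8},
    "red": {"apple": 0.6, "roses": 0.5},
    "truck": {"car": 0.7},
    "apple": {"fruit": 0.8},
}


def spread_activation(links, source, decay=0.5, threshold=0.05, max_steps=3):
    """Spread activation outward from `source`, weakening it at every hop."""
    activation = defaultdict(float)
    activation[source] = 1.0
    frontier = {source: 1.0}
    for _ in range(max_steps):
        next_frontier = defaultdict(float)
        for node, energy in frontier.items():
            for neighbor, weight in links.get(node, {}).items():
                passed = energy * weight * decay
                # Activation below the threshold is assumed to die out.
                if passed >= threshold:
                    next_frontier[neighbor] += passed
        if not next_frontier:
            break
        for node, energy in next_frontier.items():
            activation[node] += energy
        frontier = next_frontier
    return dict(activation)


def ranked(activation):
    """Concepts ordered from most to least active."""
    return sorted(activation.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for concept, level in ranked(spread_activation(LINKS, "fire engine")):
        print(f"{concept}: {level:.3f}")
```

Starting from "fire engine", directly linked concepts such as "red" end up more active than concepts two or three hops away such as "fruit", which is the toy analogue of closer associations coming to mind first.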

The obvious objection, taken seriously

Here is where the essay has to slow down and concede, because the objection to my whole method is real, old, and well-evidenced.

Introspection is unreliable, and not in a mild, occasional way: it is systematically, demonstrably unreliable as a source of evidence about how the mind actually works. Psychology largely abandoned introspection as a method for studying mechanism more than a century ago, and the twentieth century kept finding new ways to show why.

The foundational document is Richard Nisbett and Timothy Wilson's "Telling more than we can know: Verbal reports on mental processes" (Psychological Review, 1977), one of the most-cited papers in the history of the field, with well over ten thousand citations. Their conclusion is blunt: people have "little or no direct introspective access to higher order cognitive processes." When asked why they did something, subjects readily produce confident, fluent, plausible explanations that turn out not to match the actual causes of their behavior. They're confabulating rather than lying: the mind generates a story that fits, with no awareness that it's a story.

It gets worse, in the sense of more vivid. In the "choice blindness" experiments, Petter Johansson and colleagues, "Failure to detect mismatches between intention and outcome in a simple decision task" (Science, 2005), people chose which of two faces they found more attractive, and were then handed, by sleight of hand, the face they didn't pick. Most never noticed the swap, and then explained at length why they "preferred" the face they had actually rejected. The introspective report wasn't just inaccurate about hidden machinery; it confidently described a preference that didn't exist.

The social-psychology version of the critique has a name: the "introspection illusion," developed by Emily Pronin in "The Introspection Illusion" (Advances in Experimental Social Psychology, 2009). We over-trust our own introspective access while discounting everyone else's, which is precisely the bias I'd be most prone to in writing an essay like this one. Timothy Wilson's formulation is the sharpest single sentence on the matter, and I think anyone arguing my position has to sit with it: introspection "does not provide a direct pipeline to nonconscious mental processes. Instead, it is best thought of as a process whereby people use the contents of consciousness to construct a personal narrative that may or may not correspond to their nonconscious states."

And the philosophers go further still. Eric Schwitzgebel, in "The Unreliability of Naive Introspection" (The Philosophical Review, 2008) and the book Perplexities of Consciousness (2011), argues we're unreliable not just about the hidden causes of behavior but about our current conscious experience itself, that we are, in his words, "not simply fallible at the margins but broadly inept."

So: the thing I'm proposing to use as a tool is one that a century of careful work has shown to be a poor instrument. I don't want to wave that away. I want to take it as the actual starting condition.

The move, from readout to hypothesis

Everything in the previous section is an argument against using introspection as evidence. It is not an argument against using it as a source of hypotheses. And that distinction is the entire load-bearing wall of this essay.

A confabulated reason and a generated hypothesis can be the exact same sentence. What's different is what you do next. If you treat "I think I retrieve memories by traversing a structure" as a finding, you've committed the Nisbett-and-Wilson error. If you treat it as a guess to be tested, against the behavior of a system you can actually instrument, you've done something else entirely. The unreliability of the report doesn't disappear. It just stops mattering, because the report isn't the final word; it's the first move.

This isn't a rhetorical dodge. It's where a careful reading of the literature actually lands. The dual-factor view, laid out in work like "When can we introspect accurately about mental processes?" (Memory & Cognition), proposes that introspective reports are inaccurate when a task runs on automatic, unconscious processing, but accurate when the task itself prompts intentional hypothesis testing. The very act of framing your observation as a question to be checked moves you toward the regime where introspection does better.

And there's a clean historical case where this exact loop ran to completion. Grapheme-color synesthesia, the experience of seeing, say, the number 5 as intrinsically red, began as an introspective report. People said they saw colors in numbers; for a long time the scientific default was that they were speaking metaphorically or misremembering. Then V. S. Ramachandran and Edward Hubbard, in "Synaesthesia, A Window Into Perception, Thought and Language" (Journal of Consciousness Studies, 2001), built objective, third-person tests: embed a triangle of 2s in a field of 5s and see whether synesthetes can spot the shape faster because the colors "pop out." Some experiments suggested they could, implying the experience was genuinely perceptual, not a verbal habit. The introspective report generated a hypothesis; the lab bench tested it.

I want to handle this example honestly, because its later history is itself the best argument for my thesis. The original "pop-out" result did not cleanly replicate. The most careful follow-up, Jamie Ward and colleagues, "Grapheme-colour synaesthesia improves detection of embedded shapes, but without pre-attentive 'pop-out' of synaesthetic colour" (Proceedings of the Royal Society B, 2010), found a real but modest detection advantage that occurs without genuine pre-attentive pop-out, and Rothen and Meier (2009) failed to find the effect at all. So the introspective report pointed at something real (synesthesia is a genuine perceptual phenomenon, now well established), but the first specific mechanism proposed for it was partly wrong, and only testing revealed that. Far from a defeat for the method, that is the method. The intuition opens the question; the testing closes it, sometimes by telling you your first guess was too neat.

The most striking contemporary version of this loop is happening inside AI research right now, and it mirrors my argument almost uncomfortably well. In "Emergent Introspective Awareness in Large Language Models" (Transformer Circuits, 2025), Anthropic researchers confront precisely the confabulation problem (a model claiming to report on its own internal state could just be making it up) and solve it the same way Ramachandran did: with a third-person method. They use concept injection, artificially activating a known internal pattern, and then check whether the model can accurately report that something has been injected. The finding is carefully bounded: the models show some genuine, functional introspective awareness, but it is, in their words, "highly unreliable and context-dependent." Which is exactly the right amount of confidence. Introspection in these systems is real enough to be worth probing and unreliable enough that you must verify it externally. The field is already running the introspection-then-test loop as standard practice; it just calls it interpretability.

Why this is faster than waiting for the official answer

If introspective hypotheses have to be tested anyway, why not just do proper science and skip the messy first-person part?

Because of a timing mismatch that has become impossible to ignore. Doing this through traditional academic channels is slow, not because the people are slow, but because the structure is. Peer review runs in months-to-years. The incentive gradient pushes toward what is fundable, defensible, and citable rather than toward fast, exploratory, free internal referencing. By the time a defensible result on how to work with a given model is published, the model is two or three generations gone.

And in AI, "two or three generations" is now a matter of weeks. This is the part that sounds like exaggeration and isn't. Looking at one model line: Claude Opus 4.5 shipped in late November 2025, 4.6 in early February 2026, 4.7 in mid-April, and 4.8 at the end of May; the gap between 4.7 and 4.8 was roughly six weeks. Zoom out across the whole frontier and the cadence is documented: per data compiled by ARK Invest from Artificial Analysis (reported via OfficeChai, April 2026), the median number of days between major frontier-model releases across the leading labs fell from 37.5 in 2023 to around 11 so far in 2026. A reliable Claude-model release timeline tells the same story for one vendor.

Set those two clocks side by side. A peer-reviewed finding about how to prompt, structure, or collaborate with a specific model might take a year to publish. The model it describes has, by then, been replaced several times over. This is not an argument against rigor; the testing still has to happen. It's an argument about who can afford to run the loop at the relevant speed, and the answer is: the practitioner sitting at the keyboard, who can have the observation in the morning, build the workflow by afternoon, and know by evening whether it holds.
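
To make the two clocks concrete, the arithmetic can be sketched in a few lines of Python. The one-year wait and the median gaps (37.5 and 11 days) are the figures quoted above; treating the median gap as a typical gap is a simplifying assumption, and the helper name `releases_during` is made up for this sketch.

```python
DAYS_PER_YEAR = 365


def releases_during(wait_days, median_gap_days):
    """Rough count of releases landing during a wait,
    assuming the median gap is representative of every gap."""
    return wait_days / median_gap_days


if __name__ == "__main__":
    # Median days between major frontier-model releases, as quoted in the text.
    for year, gap in {"2023": 37.5, "2026": 11}.items():
        print(year, round(releases_during(DAYS_PER_YEAR, gap), 1))
    # prints:
    # 2023 9.7
    # 2026 33.2
```

These counts cover major releases across all the leading labs, not successive generations of any single model, so they describe how fast the field moves during a one-year wait rather than how often one particular model is replaced.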

A worked example, markdown files as memory

The cleanest real-world instance of this loop I know is the pattern of giving a language model a folder of plain markdown files as its memory.

At some point the question got asked out loud: what if you don't build a vector database, don't compute embeddings, don't stand up a retrieval pipeline? What if you just give the model well-organized markdown files and let it find things by reading indexes and following links, the way a person navigates Wikipedia? Andrej Karpathy described his version of this as an "LLM Wiki" in an April 2026 gist: immutable raw sources at the bottom, an LLM-maintained wiki of interlinked markdown pages in the middle, and a schema file on top (he names CLAUDE.md and AGENTS.md explicitly). His framing, "Obsidian is the IDE; the LLM is the programmer; the wiki is the codebase", and his nod to Vannevar Bush's 1945 Memex make the lineage explicit. The gist was widely shared and forked within days.

Look at what that idea is. It's the traversal intuition from the opening of this essay (navigate a structure by following links until you reach the relevant section), promoted from introspective observation to system architecture. Someone noticed how human memory feels from the inside, and shipped it.
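
A minimal sketch of that navigate-by-links idea, in Python. Everything here is hypothetical: the page names, the in-memory `WIKI` dictionary standing in for a folder of files, and the substring match standing in for the model's judgment about whether a page is relevant. A real agent would read each page and choose which link to follow; this only shows the shape of the traversal.

```python
import re
from collections import deque

# Hypothetical stand-in for a folder of interlinked markdown files.
WIKI = {
    "index.md": "# Index\n- [Clients](clients.md)\n- [Books](books.md)\n",
    "clients.md": "# Clients\n- [Acme brief](acme.md)\n",
    "books.md": "# Books\n- [Notes on memory](memory-notes.md)\n",
    "acme.md": "# Acme brief\nLaunch campaign, tone: playful.\n",
    "memory-notes.md": "# Notes on memory\nSpreading activation, priming.\n",
}

# Matches markdown links of the form [label](target.md).
LINK = re.compile(r"\[([^\]]+)\]\(([^)]+\.md)\)")


def navigate(wiki, query, start="index.md"):
    """Breadth-first walk over links, starting at the index.

    Returns the path of pages leading to the first page whose text
    mentions `query`, or None if no reachable page does.
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        text = wiki.get(path[-1], "")
        if query.lower() in text.lower():
            return path
        for _label, target in LINK.findall(text):
            if target in wiki and target not in seen:
                seen.add(target)
                queue.append(path + [target])
    return None


if __name__ == "__main__":
    print(navigate(WIKI, "priming"))
    # prints: ['index.md', 'books.md', 'memory-notes.md']
```

There is no embedding and no vector store in this sketch: the only retrieval primitive is reading a page and following the links it contains, which is the property the pattern is built around.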

It works well enough that conventions formed around it. CLAUDE.md and AGENTS.md became standard files for giving agents persistent context; AGENTS.md was donated to the Agentic AI Foundation under the Linux Foundation in late 2025, and context files of this kind now appear in tens of thousands of open-source repositories. There's a genuine convergent-evolution story here too: independent reverse-engineering of how different agents store memory keeps finding the same primitive: no vector store, no semantic search, just read/write/edit tools operating on markdown files, "just like a human would." When separate teams keep landing on the same design, the underlying intuition was tracking something real.

But, and this is the part that makes the example honest rather than triumphant, the pattern is genuinely contested, and the contest is itself a demonstration of my thesis. A team at ETH Zurich and LogicStar.ai tested it properly in "Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?" (arXiv, February 2026). Their headline result: context files "tend to reduce task success rates compared to providing no repository context, while also increasing inference cost by over 20%." LLM-generated context files hurt in most of the settings they tried; agents dutifully followed the extra instructions (more testing, more file traversal) without producing better results. Their recommendation was to omit auto-generated context files for now and include only minimal requirements.

So the intuition was good enough to spawn a global convention and good enough to be partly wrong in ways that only careful testing surfaced. That's not a strike against generating ideas from introspection. It's the loop working as designed: a first-person hunch produces a concrete architecture, the architecture gets adopted fast, and then third-person measurement tells you where the hunch overreached. The intuition is the cheap part. The testing is what makes it knowledge.

The asymmetry that makes the loop worth running

Here's why I think the practice is positive in expectation even when individual attempts fail.

If your observation-turned-hypothesis works, you've got a new tool. If it doesn't, you've spent the time and gained a precise map of what doesn't work, which is the thing that sharpens the next attempt. The downside of a failed test is more information. That's the structure of a process that improves by being run.

This isn't just optimism; it has a respectable epistemological backbone. It's Karl Popper's falsificationism wearing work clothes: a hypothesis earns its keep by being the kind of thing that could be shown wrong, and knowledge advances through refutation. And it's the explicit logic of Eric Ries's The Lean Startup (2011), the build-measure-learn loop, in which a "pivot" is not a failure but a structured course-correction grounded in what a cheap test just taught you. There's even empirical support that thinking this way pays off: research by Camuffo, Gambardella, and colleagues found that founders trained to formulate falsifiable hypotheses and run rigorous tests made better keep-or-kill decisions and pivoted faster away from bad ideas.

The value-of-negative-results point also holds up well beyond startups. In meta-science, negative results prevent wasted duplication and expose flawed assumptions; the scale of what usually goes unrecorded is captured by Monya Baker's "1,500 scientists lift the lid on reproducibility" (Nature, 2016), in which more than 70% of surveyed researchers reported failing to reproduce another scientist's results and more than half failed to reproduce their own. A failed replication is not nothing; it's signal that the field systematically under-shares.

I do have to take one thing back, though, because the strong form of my own claim is false. "You can't lose" is too much. There is always opportunity cost: the afternoon you spent testing one hunch is an afternoon you didn't spend on another. And publication bias has a personal analogue: negative results tend not to get written down or shared, so the "knowledge" from a failed test often dies with the person who ran it instead of compounding. The honest version is narrower and still strong: the expected value of a fast, cheap test is almost always positive, provided you actually record what you learned when it failed. The loop only self-improves if you keep the receipts from the failures too.
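
The narrower claim can be written out as back-of-the-envelope arithmetic. The Python sketch below is illustrative only: the function, its parameters, and every number in it are assumptions chosen for the example, in arbitrary units such as afternoons.

```python
def expected_value(p_success, gain, lesson_value, cost, recorded=True):
    """Expected payoff of one quick test.

    p_success    -- chance the hunch holds up (hypothetical)
    gain         -- payoff if it does
    lesson_value -- what a failed test is worth, if it is written down
    cost         -- time spent, which is also the opportunity cost
    """
    lesson = lesson_value if recorded else 0.0
    return p_success * gain + (1 - p_success) * lesson - cost


if __name__ == "__main__":
    # A long-shot hunch: 5% chance of a payoff worth 10 afternoons,
    # one afternoon to test, 1.5 afternoons saved later by a recorded failure.
    print(expected_value(0.05, gain=10, lesson_value=1.5, cost=1))
    print(expected_value(0.05, gain=10, lesson_value=1.5, cost=1, recorded=False))
```

With these made-up numbers the test comes out positive (about 0.93) when the failure is recorded and negative (about -0.5) when it is not. The sign flips on the record-keeping alone, which is the dependence the paragraph above describes.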

The metaphor I keep using, and where it breaks

I've leaned on a single metaphor throughout (the brain as something that stores, retrieves, traverses, processes), and I owe it some scrutiny, because there's a serious case that it's wrong.

The sharpest popular version is Robert Epstein's "The Empty Brain" (Aeon, 2016): your brain "does not process information, retrieve knowledge or store memories"; it is not a computer, and the information-processing metaphor is just the latest in a long historical series (hydraulic, mechanical, electrical, telegraphic) that we'll eventually discard like the others. If Epstein is right, then my entire "folder traversal" framing is a category error, and the resemblance I keep noticing between my mind and my wiki is an artifact of the tools I happen to use, projected backward onto cognition.

In fairness, Epstein's essay was heavily rebutted: critics pointed out that it conflates the literal, strong "brain is a digital computer" claim with the much weaker and quite defensible claim that the brain processes information in some sense, and that he offers no replacement theory to do the explanatory work. The serious philosophical position, laid out in the Stanford Encyclopedia of Philosophy's entry on the computational theory of mind, isn't that the brain is a desktop PC; it's that mental processes are computational in an abstract sense.

But I don't actually need to win that fight, and I don't want to pretend I can. The stance that fits both the evidence and my purpose is the instrumental one: George Box's "all models are wrong, but some are useful." I'm not claiming the brain is a knowledge graph, only that treating my cognition as if it traverses a structure generated a testable idea about how to build an AI system, and that the idea, when tested, turned out to be productive more often than not. The metaphor's job is to be generative, and then to get out of the way when the measurement comes back. Being true is not the job. When I catch myself believing the metaphor instead of using it, that's the signal to stop.

What I'm actually proposing

Stripped down, the method is this:

Pay attention to the part of your mind no one else can audit. When you have a thought, a reaction, a sense of how you just did something, treat the report as a candidate rather than a fact, because a century of research says the report is probably a construction, not a readout. Then turn the candidate into a concrete hypothesis about how to build or work with an AI system, and test it against the system, fast, where a wrong guess costs you an afternoon and a right one might change how you work for a year. Keep what works. Write down what doesn't, so the failures compound into knowledge instead of evaporating.
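
One way to keep the receipts is a plain append-only log with one entry per tested hunch. The Python sketch below is a suggestion rather than a description of any existing tool: the `HunchRecord` fields, the JSON-lines format, and the example entry are all assumptions.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class HunchRecord:
    observed: str    # the introspective report, kept as a candidate, not a fact
    hypothesis: str  # the concrete, testable claim about the AI system
    test: str        # what was actually run
    held: bool       # did the hypothesis survive the test?
    lesson: str      # what to carry forward, especially after a failure
    day: str         # date of the test, e.g. "2026-01-15"


def append_record(log_lines, record):
    """Append one record as a JSON line; failures are logged like successes."""
    log_lines.append(json.dumps(asdict(record), sort_keys=True))
    return log_lines


def failures(log_lines):
    """Every recorded test whose hypothesis did not hold."""
    return [r for r in map(json.loads, log_lines) if not r["held"]]


if __name__ == "__main__":
    log = []  # in practice this could be a file; a list keeps the sketch simple
    append_record(log, HunchRecord(
        observed="I find things by narrowing through folders",
        hypothesis="An agent with an index page finds notes in fewer steps",
        test="Ran the same ten lookups with and without the index",
        held=False,
        lesson="Index helped only when page titles matched the query wording",
        day="2026-01-15",
    ))
    for entry in failures(log):
        print(entry["day"], "-", entry["lesson"])
```

The one design choice that matters is that a failed test produces an entry of exactly the same shape as a successful one, so the failures are as easy to search later as the wins.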

None of the pieces are original to me. The unconscious machinery under the verbal surface is dual-process theory. The traversal feeling is spreading activation. The unreliability is Nisbett and Wilson. The rescue (hypothesis, not readout) is the dual-factor view and the synesthesia case and Anthropic's concept injection. The asymmetry is Popper and Ries. What's new, if anything is, is the timing: there now exists a testbed that moves fast enough, and is instrumentable enough, that the slow first-person hunch and the fast third-person check can finally run in the same afternoon.

The people innovating fastest right now, in my experience, aren't the ones waiting for a settled science of cognition. They're the ones reading their own background reports, and then, crucially, checking whether what they read was true.

Open questions I don't have answers to

Which parts of cognition are actually introspectable, and which are permanently invisible no matter how hard you look? The dual-factor literature suggests it's a real split rather than a smooth gradient: some processes report accurately, others don't, and you don't reliably know in advance which kind you're looking at.
Where exactly does the brain-as-retrieval-system metaphor stop describing neurons and start describing the tools I built? Am I observing my mind, or seeing my own software reflected back at me? The synesthesia story suggests you sometimes only find out which one it was after you test.
Is the resemblance between how I navigate memory and how a markdown-wiki agent navigates files a genuine structural correspondence, or a coincidence dressed up by the fact that I built the wiki? Convergent evolution across independent teams is suggestive, but suggestive is not proof, and I'd rather hold the question open than answer it too quickly.

I don't think you need to resolve any of these to use the method. But they're the questions that keep it honest, and honesty, here, is just the discipline of not mistaking the receipt for the transaction.

References

Baker, M. (2016). "1,500 scientists lift the lid on reproducibility." Nature, 533, 452–454. https://www.nature.com/articles/533452a
Collins, A. M., & Loftus, E. F. (1975). "A spreading-activation theory of semantic processing." Psychological Review, 82(6), 407–428. https://web-archive.southampton.ac.uk/cogprints.org/1663/1/act2.htm
Epstein, R. (2016). "The Empty Brain." Aeon. https://aeon.co/essays/your-brain-does-not-process-information-and-it-is-not-a-computer
Gloaguen, T., Mündler, N., Müller, M., Raychev, V., & Vechev, M. (2026). "Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?" arXiv:2602.11988. https://arxiv.org/html/2602.11988v1
Johansson, P., Hall, L., Sikström, S., & Olsson, A. (2005). "Failure to detect mismatches between intention and outcome in a simple decision task." Science, 310(5745), 116–119. https://pubmed.ncbi.nlm.nih.gov/17049881/
Kahneman, D. (2011). Thinking, Fast and Slow. https://us.macmillan.com/books/9780374533557/thinkingfastandslow/
Karpathy, A. (2026). "LLM Wiki" (gist). https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f
Lindsey, J., et al. / Anthropic (2025). "Emergent Introspective Awareness in Large Language Models." Transformer Circuits Thread. https://www.anthropic.com/research/introspection
Nisbett, R. E., & Wilson, T. D. (1977). "Telling more than we can know: Verbal reports on mental processes." Psychological Review, 84(3), 231–259. https://doi.org/10.1037/0033-295X.84.3.231
Pronin, E. (2009). "The Introspection Illusion." Advances in Experimental Social Psychology, 41, 1–67. https://www.sciencedirect.com/science/chapter/bookseries/abs/pii/S0065260108004012
Ramachandran, V. S., & Hubbard, E. M. (2001). "Synaesthesia, A Window Into Perception, Thought and Language." Journal of Consciousness Studies, 8(12), 3–34. https://philpapers.org/rec/RAMSA-5
Ries, E. (2011). The Lean Startup. https://theleanstartup.com/
Schwitzgebel, E. (2008). "The Unreliability of Naive Introspection." The Philosophical Review, 117(2), 245–273. https://read.dukeupress.edu/the-philosophical-review/article/117/2/245/2787/The-Unreliability-of-Naive-Introspection
Stanford Encyclopedia of Philosophy. "The Computational Theory of Mind." https://plato.stanford.edu/entries/computational-mind/
"When can we introspect accurately about mental processes?" Memory & Cognition. https://link.springer.com/article/10.3758/BF03209215
Ward, J., Jonas, C., Dienes, Z., & Seth, A. (2010). "Grapheme-colour synaesthesia improves detection of embedded shapes, but without pre-attentive 'pop-out' of synaesthetic colour." Proceedings of the Royal Society B, 277(1684), 1021–1026.