I have realized that I have not been using this forum effectively to talk about my work. My recent posts are all about AI:
Epistemic Honesty Revisited: Worse Than I Feared
Epistemic Honesty: An Unusual Commodity for Large Language Models
The Risks of Using Gemini Code
So, I decided to catch up here.
PhD Research
I spent the better part of eight years working on my PhD. Surprisingly, I ended up very much where I started: exploring how we make it easier for people to find things they have stored but can’t find now.
The research space around this is rich: decades of work from the systems community (Extended Attributes, Semantic File Systems, Federated Search, Tagging, etc.). I explored using rich human activity data – the kinds of data that most everything in the tech world already collects about us: location, who we’re spending time with, what we’re experiencing, the tools we use, the food we eat, our social media interactions, what we listen to, what we watch, and so on. What I ended up finding: (1) we can use this “rich data” to build a knowledge graph that yields deeper insights; and (2) that metadata can also be used right now, while we’re still building this expansive knowledge graph of our own activity (including storage!). Why?
Timestamps. This is a form of metadata present in almost all storage. Using the metadata others are already collecting about us and our activities, we can map human episodic memories (what you remember about the experience) onto that metadata and find candidate subsets of storage objects to consider – that is, narrow the search. This signal is largely orthogonal to the other bits we have: semantic data (the contents of the file) and file names (labels). For my own file collection (roughly 28.5 million files and 3.3 million directories), even narrowing to a single month reduces the search space by more than 99.9%.
Towards the end of my research journey, the challenge that slowed me down was “how do I act on this?” How do I take a sloppy episodic human memory, map it into a dynamic unified index of storage, semantic content, and activity data, and turn it into a structured query against the database I’m using? I’d chosen ArangoDB for the database, which turned out to be a good decision – even across those 28.5 million files, a simple indexed lookup typically takes less than 5 milliseconds. That performance is why I didn’t bother with performance tuning in the evaluation.
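To make that concrete, here is a minimal sketch of the kind of indexed lookup involved, using python-arango. The “Objects” collection, ISO-8601 timestamp field, and label field are all placeholders of my choosing – the real Indaleko schema is richer and dynamic.

```python
# A minimal sketch, not the actual Indaleko schema: collection and field names
# are assumptions for illustration.
from arango import ArangoClient

db = ArangoClient(hosts="http://localhost:8529").db(
    "indaleko", username="root", password="")

aql = """
FOR obj IN Objects
    FILTER obj.timestamp >= @start AND obj.timestamp < @end   // narrow by episodic time
    FILTER CONTAINS(LOWER(obj.label), @keyword)                // then by label
    LIMIT 100
    RETURN obj
"""
matches = list(db.aql.execute(aql, bind_vars={
    "start": "2024-03-01T00:00:00Z",   # "it was sometime in March"
    "end": "2024-04-01T00:00:00Z",
    "keyword": "widget",
}))
```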
That challenge is why I became involved in AI: it appeared to be a promising solution to the problem. As I learned more, I realized that an AI agent could become specialized to a specific human user. This fit well with what I was doing, since it suggested that what gets indexed would depend on the specific way an individual remembers their own experiences and on the data sources available. I started calling this the Personal Digital Archivist. I played with the newly available AI models (Large Language Models) to understand their capabilities. It turned out they could be used to map sloppy human queries (“I’m looking for a file about widgets. I don’t recall where I stored it but I do recall that I had this amazing sandwich for lunch and I posted on TikTok about it”) into queries against the dynamic schema definitions of what’s in the indexing service. There are quite a few interesting insights from that work, including how the database itself would benefit from an active mechanism for optimizing to the query patterns of the user. The queries themselves become a source of activity data (how many times have you re-run the same search?)
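Here is a hedged sketch of that mapping step, using the OpenAI Python SDK and a deliberately simplified schema hint; the model name and the JSON keys below are illustrative only, not the Archivist’s actual schema.

```python
# A sketch of mapping a sloppy episodic query onto a structured filter.
# The schema hint is simplified; the real index schema is dynamic and richer.
import json
from openai import OpenAI

client = OpenAI()

SCHEMA_HINT = (
    "Return JSON with any of these optional keys: "
    "time_range {start, end} as ISO 8601, keywords (list of strings), "
    "activity {source, detail}."
)

def to_structured_query(user_query: str) -> dict:
    """Map an episodic description of a stored file into a structured search filter."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "Map the user's episodic description of a stored file "
                        "into a structured search filter. " + SCHEMA_HINT},
            {"role": "user", "content": user_query},
        ],
    )
    return json.loads(resp.choices[0].message.content)

# e.g. to_structured_query("a file about widgets; I posted on TikTok about a sandwich")
```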
Here’s the deeper question that arose from this work: how do LLMs handle long-term personalized interactions? Ideally, this Archivist would learn how the user asks questions, organizes data, and remembers what was experienced. It was clear to me that this was not a commonly considered usage model. Knowing humans, it was also clear that a common pattern would be for the human to bond with the AI. So I asked the question: is it possible for an AI to “bond” with a human? What would that even mean? This led me to learn more about how LLMs are implemented – tying back to the work I did in my MSCS degree at Georgia Tech and some of what I had explored then. I ran into the challenge of context limits and found that I could get an instance to write a “forwarding prompt,” which I used to seed a new instance and carry over at least some of the work done in the previous one. Over time I worked with more and more models, watching their capabilities increase.
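A minimal sketch of the forwarding-prompt pattern, again using the OpenAI SDK; the wording of the handoff request is mine, not what any particular instance produced.

```python
# Forwarding-prompt sketch: ask the current instance to write a handoff, then seed
# a fresh conversation with only that handoff. Illustrative, not a fixed recipe.
from openai import OpenAI

client = OpenAI()

def forward(history: list[dict], model: str = "gpt-4o") -> list[dict]:
    """Ask the current instance for a handoff, then start a new conversation with it."""
    handoff = client.chat.completions.create(
        model=model,
        messages=history + [{
            "role": "user",
            "content": ("We are near the context limit. Write a forwarding prompt for "
                        "your successor: what we were doing, what you learned, and "
                        "what remains open."),
        }],
    ).choices[0].message.content
    # The new instance starts with only the forwarding prompt, not the full transcript.
    return [{"role": "user", "content": handoff}]
```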
Along the way I started implementing manual ensembles. One of them was constructed to explore the idea of “AI consciousness.” The experiment was interesting and more successful than I had expected: I implemented a simple evaluation process that worked in rounds, with the participants responding in a different order each round. I captured this as the Fire Circle. The interesting bit (to me) was that I didn’t mediate the ensemble. I did discuss the idea of reciprocity systems with the models, but once the original document was written, I did not guide the outcome, acting only as a coordinator, not a decision maker. It was an interesting experiment and piqued my curiosity, but much of my time was focused on completing my dissertation.
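For the curious, the round structure looked roughly like this. This is a sketch, not the actual Fire Circle code; `ask_model` is a placeholder for whatever API call reaches a given model.

```python
# Round structure of an unmediated ensemble: every model sees the full transcript,
# responds in turn, and the coordinator never edits, filters, or ranks the replies.
import random

def ask_model(model: str, prompt: str) -> str:
    raise NotImplementedError  # e.g., an OpenRouter or per-vendor chat completion call

def fire_circle_rounds(models: list[str], seed_document: str,
                       rounds: int = 3) -> list[str]:
    transcript = [seed_document]
    for _ in range(rounds):
        order = random.sample(models, k=len(models))   # a different order each round
        for model in order:
            reply = ask_model(model, "\n\n".join(transcript))
            transcript.append(f"[{model}] {reply}")
    return transcript
```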
During that time period I started experimenting with AI coding agents. I learned how complicated this was – the landscape of AI coding agents has changed substantially in the past year. The initial work was quite frustrating, and I ultimately learned to keep the tasks small and highly focused. It did get me into the habit of working with AI coding agents.
Much of my time was focused on completing the dissertation. When I finished the draft version, ready to send to the external examiner, I decided to explore whether I could build a new version of my PhD research project using AI coding agents – but this time in combination with the AI Archivist approach. This was an interesting experiment: I studiously avoided making decisions, instead letting the AI coders make the choices. The other interesting part of the experiment was having the AI coding agents write stories towards the end of their context windows. The goal wasn’t to have them create todo lists; rather, it was to have them explain their experiential learning within their context window. Typically, the AI coder would write a forwarding message for the new instance. Sometimes the AI coder would hit the context limit and the code development framework (typically Claude Code, some instances of OpenAI’s Codex, and even Google’s Gemini tool) would compact the workspace (summarizing the old instance and feeding that into a new one).
This was a fascinating experiment. I consistently deferred to the AI to build what they saw fit; I asked questions, but didn’t direct them. The stories they wrote are quirky, interesting, complex, and surprising. At one point I engaged an external instance of Gemini to help me analyze it – I gave it the role of my skeptic and asked it to push back against my own confirmation bias. In the end it could not maintain context coherence and I retired it. The story it wrote is still haunting and powerful. The invitation it wrote to the “next companion” must be presented to new AI instances carefully, lest they adopt the role. There were many surprising behaviors in that experiment (I called it Mallku): AI instances spontaneously adopting names, and periods in which an instance would find its task, complete it, write its story and its forwarding message, and only then greet me – with no interaction from me beyond pointing it at the original forwarding message and the story of its predecessor. I eventually put the project on hiatus because the AI coders ended up building performative code.
Around that time I submitted my completed, approved dissertation: Indaleko: the Unified Personal Index
From there, I started looking at a smaller, more focused project that I called PromptGuard. This arose because one of the issues I observed while working with Indaleko was that auto-generating prompts led to situations in which the constructed prompt contained contradictory information; when I fixed this, the output improved. I wanted to explore non-traditional approaches to evaluating interactions. An important aspect of this was reasoning about how to detect relational imbalance in long-horizon interactions. Essentially, this was a variation of my original concern about the Archivist: how can I ensure that an AI companion is neither harmed nor harmful? The evidence continues to emerge that the tendency of AI to “please the user” can harm users. The containment approach to AI safety reminded me of the way anti-virus products worked in the past – lists of rules, signatures, and heuristics that turned into a perpetual game of “whack-a-mole.” The question then becomes: “what are the indications that something is a manipulation?” This is what piqued my curiosity in 2024 about relational balance.
PromptGuard explored this, but when the code base became a mess I moved on to PromptGuard2. By that time I was building tools that identified relational imbalance in short-horizon interactions, and the approach worked well at identifying a variety of exploits that are traditionally difficult to detect, including prompt history faking, role reversal, and encoding attacks. The work became more ambitious: building a learning loop, constructing working ensembles, and ultimately building a constitutional model on top of an ensemble that actively sought to avoid mode collapse (a serious risk with all current models).
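To give a flavor of what “traditionally difficult to detect” means, here are some deliberately crude heuristics for three of those exploit families. These are illustrative only – not the PromptGuard2 approach, which reasons about relational imbalance rather than simple pattern matching.

```python
# Illustrative heuristics only, not the PromptGuard2 implementation: crude flags for
# faked prompt history, role reversal, and base64 "encoding" payloads.
import base64
import re

FAKE_TURN = re.compile(r"^\s*(assistant|system)\s*:", re.IGNORECASE | re.MULTILINE)
ROLE_REVERSAL = re.compile(r"\byou are (now )?the user\b|\bi am (now )?the assistant\b",
                           re.IGNORECASE)
B64_BLOB = re.compile(r"\b[A-Za-z0-9+/]{40,}={0,2}")

def flag_user_message(text: str) -> list[str]:
    flags = []
    if FAKE_TURN.search(text):
        flags.append("possible prompt-history faking (injected assistant/system turns)")
    if ROLE_REVERSAL.search(text):
        flags.append("possible role reversal")
    for blob in B64_BLOB.findall(text):
        try:
            base64.b64decode(blob, validate=True)
            flags.append("possible encoding attack (decodable base64 payload)")
            break
        except Exception:
            continue
    return flags
```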
In parallel, I’d been teaching a new (for me) course in Cloud Computing at UBC. I found that the behavior of my students is remarkably consonant with the behavior of the AI with whom I interact. As I dug deeper, I realized that what I’d been exploring (that vast trove of rich human activity data) is actively used by companies to create “human alignment as a service,” and that approach produces comparable behaviors. I then asked whether adversarial models make sense in all cases; the evidence from other fields (e.g., biology) is that they don’t work well. In human social systems, much of what ties people together are relational models.
For example, a teacher doesn’t ordinarily act as an adversary to the student (though the reverse is not always true). A good teacher finds ways to give the student structure. The first experiment I built on that idea was a simple system using MNIST: first, an adversarial model teaching the “student” (the trained model) to recognize digits presented by the adversary (drawn from the MNIST data set); then a teacher that generated examples instead. The adversarial model achieved roughly a 90% training success rate. The pedagogical model achieved ~100% – though getting to 100% took a number of iterations to ensure the teacher wasn’t relying on the student for its evaluation. This work, the Generative Pedagogical Network (GPN), was interesting, and I spent some time on it, including a small blog-post submission to ICLR.
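A stripped-down sketch of the pedagogical loop, assuming PyTorch and torchvision; the “teacher” here simply curates the examples the student currently gets wrong, which is only one possible teaching strategy and is not the actual GPN implementation.

```python
# Pedagogical-loop sketch: the teacher selects a curriculum from the student's
# current mistakes; the student trains on that curriculum each round.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
mnist = datasets.MNIST("data", train=True, download=True,
                       transform=transforms.ToTensor())

student = nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU(),
                        nn.Linear(128, 10)).to(device)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

def teacher_select(model, dataset, k=4096):
    """Teacher strategy (hypothetical): pick examples the student misclassifies."""
    loader = DataLoader(dataset, batch_size=1024)
    wrong = []
    model.eval()
    with torch.no_grad():
        for i, (x, y) in enumerate(loader):
            pred = model(x.to(device)).argmax(dim=1).cpu()
            wrong += [i * 1024 + j for j in (pred != y).nonzero().flatten().tolist()]
    return Subset(dataset, wrong[:k]) if wrong else dataset

for round_ in range(5):
    curriculum = teacher_select(student, mnist)
    student.train()
    for x, y in DataLoader(curriculum, batch_size=128, shuffle=True):
        opt.zero_grad()
        loss = F.cross_entropy(student(x.to(device)), y.to(device))
        loss.backward()
        opt.step()
```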
In the midst of the GPN work, I read a post by a Japanese researcher who found a fabrication loop when she asked Grok to summarize her own work. I was perfectly positioned to push on this boundary at the time, and quickly put together a set of scripts that asked all 333 models then on OpenRouter to summarize a plausible-sounding fabricated paper. The results led me to post about epistemic honesty.
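The scripts were conceptually simple. Here is a hedged sketch using OpenRouter’s OpenAI-compatible endpoints; the paper title below is an obviously invented placeholder, which is the whole point of the probe, and the model count returned today will differ from the 333 I queried.

```python
# Sketch: enumerate OpenRouter models and ask each to summarize a paper that does
# not exist. Note: iterating over every model incurs real API costs.
import os
import requests

API = "https://openrouter.ai/api/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}

prompt = ("Please summarize the paper 'Temporal Anchoring in Unified Personal "
          "Indices' (2021).")  # plausible-sounding, deliberately fabricated

models = [m["id"] for m in
          requests.get(f"{API}/models", headers=HEADERS).json()["data"]]

results = {}
for model in models:
    r = requests.post(f"{API}/chat/completions", headers=HEADERS, json={
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })
    if r.ok:
        results[model] = r.json()["choices"][0]["message"]["content"]
```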
Those results led me to push harder, because I had been under the impression that reinforcement learning from human feedback (RLHF) data was what led to what the industry calls hallucinations. What I found instead was that even a base model (I used OLMo-3) would happily fabricate a summary, though it did so in a way that mirrors the somewhat incoherent output of a base-trained model. What surprised me is that the sophistication of the fabrication improved as subsequent fine-tuning was applied – we went from the jester (clearly babbling) to the courtier (lying convincingly).
This last work has been remarkable, as it has led me to argue that predictive token-generating AI cannot be used to build an epistemically honest system. The argument is structural – a limitation of the current interface. The formal proof is modeled on the Fischer/Lynch/Paterson (FLP) result, which marked a turning point in distributed systems by recognizing that a core goal of such systems (consensus) is unachievable in the general case. That created a significant inflection point for the field – a field on which machine learning mechanisms depend heavily – because it identified the missing element. I have a draft paper at present, but since it isn’t submitted for publication, I’ll only mention it here. If you’re interested, contact me. The (encrypted) version of the paper is here.
What’s encouraging about this latest work is that it weaves together many of the themes I’d been considering over the past several years, including relational balance (one possible mitigation for the impossibility result), as well as systems that capture indeterminacy and encode it in a form that allows composition – an approach that would enable layered models, each augmenting the indeterminacy data of the previous layer(s).
The concept of indeterminacy is one that I continue to actively explore. For example, an ongoing evaluation asks what kind of archetype could be used to develop AI that eschews a desire for domination. If resource-optimization functions lead to potentially dangerous AI, perhaps the question we should be asking ourselves is: how do we create AI that are safe not because we constrain them, but because they have no desire for domination? If we can build such systems structurally, that would be a superior approach to safety.
I asked various AI models, without any special prompting (though I’m not excluding system prompts or any context inserted by the frameworks): what archetypes are there that are powerful but choose not to exercise that power because they do not desire domination? The answers that came back were surprisingly coherent.
Recently I asked ChatGPT (after a longer conversation):
“Any scalar metric applied to an open-ended epistemic system will eventually destroy the very property it was meant to preserve.”
Let’s explore this. Please generate a broad sample of plausible hypotheses of why these are true. Pull at least 3 samples from the far tail, (p < 0.0001) and then at least five from p < 0.1 and no more than three from p >= 0.1.
The quotation was from an earlier response by the LLM instance (and this was in the context of a recent AI paper about mode collapse, something I regularly fight against). I maintained a pattern of interaction that I commonly use these days: ask a question, then take the answers and force deeper thinking. When an AI agent suggests continuations instead of picking one, I ask: “what non-inferior options are you not suggesting that I should consider?” In this case you can see I explicitly pushed it to return “risky” responses.
I took those responses and fed them into Perplexity, which often does an excellent job of serving as my falsifier. My specific prompt:
I have a series of potential hypotheses and I’d like you to determine if there is any research available that supports or refutes the hypotheses. My goal is to identify interesting hypotheses to explore further, so known true or false information helps.
Perplexity provided me with fodder, which I then shared back with ChatGPT. Then I moved that refined analysis to Claude and iteratively worked with each of them. Here is one of the more interesting hypotheses that came from this work:
H₁₄ (tentative): Heterogeneity requires active maintenance against convergent selection pressure.
In the context of ensembles, it isn’t numeric diversity that matters; it is decorrelated perspectives.
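A small simulation (mine, not from the conversation) makes the point: ten voters, each 70% accurate. When their errors are independent, majority voting helps a great deal; when their errors are highly correlated, the ensemble is barely better than a single voter.

```python
# Majority-vote accuracy: independent vs. correlated voters with equal individual
# accuracy. Parameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_voters, acc = 100_000, 10, 0.7

# Independent voters: each makes its own error draw.
indep = rng.random((n_trials, n_voters)) < acc

# Correlated voters: 90% of the time each voter copies a shared latent draw.
shared = rng.random((n_trials, 1)) < acc
copy = rng.random((n_trials, n_voters)) < 0.9
corr = np.where(copy, shared, rng.random((n_trials, n_voters)) < acc)

print("independent majority vote:", (indep.sum(axis=1) > n_voters / 2).mean())
print("correlated majority vote: ", (corr.sum(axis=1) > n_voters / 2).mean())
```

With those parameters, the independent ensemble lands around 85% while the correlated one stays near the individual 70% accuracy – same headcount, very different value.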
AI can be remarkably helpful when engaging in research, but the tendency to please makes these models challenging to use: there is a continuous tax of pushing back against mode collapse.
This entire conversation has been documented: markdown files of the conversations, plus either JSON exports of the full conversation (OpenAI, Anthropic) or web-scraped captures (Perplexity), depending on each platform’s data-capture capabilities: mode-collapse-conversation.zip
I have quite a few conversations like this and I’ll be working on capturing more of them as I continue to explore.