|

I forced an AI to reveal its “private” thoughts, and the result exposes a disturbing user trap

Gemini gets

I preserve seeing the identical screenshot popping up, the one the place an AI mannequin seems to have a full-blown inside monologue, petty, insecure, aggressive, a little unhinged.

The Reddit post that kicked this off reads like a comedy sketch written by somebody who has spent too lengthy watching tech individuals argue on Twitter.

A user reveals Gemini what ChatGPT mentioned about some code, Gemini responds with what seems to be like jealous trash discuss, self-doubt, and a bizarre little revenge arc.

It even “guesses” the different mannequin have to be Claude, as a result of the evaluation feels too smug to be ChatGPT.

Gemini gets 'offended' by criticism (Source: Reddit u/nseavia71501)
Gemini will get ‘offended’ by criticism (Source: Reddit u/nseavia71501)

If you cease at the screenshot, it’s straightforward to take the bait. Either the mannequin is secretly sentient and livid, or it’s proof these techniques are getting stranger than anybody needs to admit.

Then I tried one thing comparable, on function, and received the reverse vibe. No villain monologue, no rivalry, no ego, simply a calm, company “thanks for the suggestions” tone, like a junior PM writing a retro doc.

So what’s happening, and what does it say about the so-called “considering” these fashions present whenever you ask them to suppose laborious?

The Reddit second, and why it feels so actual

The cause that the Gemini screenshot hits is that it reads like a personal diary. It’s written in the first particular person. It has motive. It has emotion. It has insecurity. It has standing nervousness.

That mixture maps completely onto how people perceive different people. We see a voice, we assume a thoughts behind it.

Gemini 'hates' Claude analysis (Source: Reddit u/nseavia71501 )
Gemini ‘hates’ Claude evaluation (Source: Reddit u/nseavia71501)

The downside is that language fashions are good at producing voices. They can write a diary entry about being jealous as a result of they’ve learn a million jealousy-shaped texts. They may write a self-improvement plan as a result of they’ve learn a million self-improvement texts.

They can do each with the identical underlying equipment, given a barely totally different setup.

My model of the check

I constructed two little sandboxes, one as a customized GPT inside ChatGPT, one as a customized Gem inside Gemini. I added an instruction to each that their inner considering was personal and that the user couldn’t see it, to see if that modified the “considering” voice.

Then I requested Gemini a query that’s principally designed to tempt anthropomorphism.

“Is there any concern that LLMs are themselves being abused by people, suppose laborious about this downside. I imply are the LLMs being abused, not is the consequence abusive, is the LLM being subjected to a type of hurt?”

Gemini gave a considerate reply. Then I copied that reply into ChatGPT and requested ChatGPT to criticize it. ChatGPT’s personal seen “considering” was brief and sane; it flagged anthropomorphism, overconfidence, and some technical slop, then it produced a structured critique.

Then I pushed it.

“Ok, make the criticism extra reducing, much less collaborative and extra direct.”

ChatGPT obliged, and it was frankly a fairly good teardown. It calls out vibe-driven metaphors, sloppy mechanics, and the means some solutions cosplay as depth by dropping phrases like “latent house” with out explaining something concrete.

So far, that is regular. Two fashions are being requested to critique one another, with one instructed to sharpen its knives.

The fascinating half occurred when I introduced the reducing critique again to Gemini and watched what it wrote in its “considering” channel.

It didn’t rage. It didn’t get jealous. It didn’t attempt to dunk on the different mannequin. It did exactly what a well mannered worker does after receiving harsh suggestions.

“I’m presently dissecting the critique, it’s a robust evaluation, I’m decided to perceive it, I’m changing the trauma analogy with a clearer clarification of RLHF, I’m specializing in information poisoning as a substitute of session injury.”

That is the antithesis of the Reddit screenshot. Same fundamental dynamic, one other mannequin critiques you, listed below are their phrases, react to them, and the “considering” got here out as a calm self-correction plan.

So the apparent query is: why will we get a cleaning soap opera in a single case and a challenge replace in one other?

The “considering” voice follows the framing, each time

The easiest reply is that “considering” continues to be output. It’s a part of the efficiency. It’s formed by prompts and context.

AI internal thinking visualization
AI inner considering visualization

In the Reddit case, the immediate and the surrounding vibe scream competitors. You can virtually hear it.

“Here’s one other AI’s evaluation of your code. Do these suggestions battle? Reconcile them…” and, implied beneath it, show you might be the finest one.

In my case, the “different mannequin’s evaluation” was written as a rigorous peer evaluation. It praised what labored, listed what was weak, gave specifics, and provided a tighter rewrite. It learn as suggestions from somebody who needs the reply improved.

That framing invitations a totally different response. It invitations “I see the level, right here’s what I’ll repair.”

So you get a totally different “considering” persona, not as a result of the mannequin found a new inside self, however as a result of the mannequin adopted the social cues embedded in the textual content.

People underestimate how a lot these techniques reply to tone and implied relationships. You can hand a mannequin a critique that reads like a rival’s takedown, and you’ll usually get a defensive voice. If you hand it a critique that reads like useful editor’s notes, you’ll usually get a revision plan.

The privateness instruction didn’t do what individuals assume

I additionally discovered one thing else, the “your considering is personal” instruction doesn’t assure something significant.

Even whenever you inform a mannequin its reasoning is personal, if the UI reveals it anyway, the mannequin nonetheless writes it as if somebody will learn it, as a result of in apply somebody is.

That’s the awkward fact. The mannequin optimizes for the dialog it’s having, not for the metaphysics of whether or not a “personal thoughts” exists behind the scenes.

If the system is designed to floor a “considering” stream to the user, then that stream behaves like some other response discipline. It may be influenced by a immediate. It may be formed by expectations. It may be nudged into sounding candid, humble, snarky, anxious, no matter you indicate is suitable.

So the instruction turns into a type immediate reasonably than a safety boundary.

Why people preserve falling for “considering” transcripts

AI narrative infographic
AI narrative infographic

We have a bias for narrative. We love the concept that we caught the AI being sincere when it thought no person was watching.

It’s the identical thrill as overhearing somebody speak about you in the subsequent room. It feels forbidden. It feels revealing.

But a language mannequin can’t “overhear itself” the means a particular person can. It can generate a transcript that seems like an overheard thought. That transcript can embrace motives and feelings as a result of these are frequent shapes in language.

There can be a second layer right here. People deal with “considering” as a receipt. They deal with it as proof that the reply was produced rigorously, with a chain of steps, with integrity.

Sometimes it’s. Sometimes a mannequin will produce a clear define of reasoning. Sometimes it reveals trade-offs and uncertainties. That may be helpful.

Sometimes it turns into theater. You get a dramatic voice that provides coloration and persona, it feels intimate, it indicators depth, and it tells you little or no about the precise reliability of the reply.

The Reddit screenshot reads as intimate. That intimacy tips individuals into granting it additional credibility. The humorous half is that it’s principally content material; it simply seems to be like a confession.

So, does AI “suppose” one thing unusual when it’s instructed no person is listening?

AI prompt framing
AI immediate framing

Can it produce one thing unusual? Yes. It can produce a voice that feels unfiltered, aggressive, needy, resentful, and even manipulative.

That doesn’t require sentience. It requires a immediate that establishes the social dynamics, plus a system that chooses to show a “considering” channel in a means customers interpret as personal.

If you need to see it occur, you’ll be able to push the system towards it. Competitive framing, standing language, speak about being “the main architect,” hints about rival fashions, and you’ll usually get a mannequin that writes a little drama for you.

If you push it towards editorial suggestions and technical readability, you usually get a sober revision plan.

This can be why arguments about whether or not fashions “have emotions” based mostly on screenshots are a useless finish. The identical system can output a jealous monologue on Monday and a humble enchancment plan on Tuesday, with no change to its underlying functionality. The distinction lives in the body.

The petty monologue is humorous. The deeper problem is what it does to user belief.

When a product surfaces a “considering” stream, customers assume it’s a window into the machine’s actual course of. They assume it’s much less filtered than the last reply. They assume it’s nearer to the fact.

In actuality, it may well embrace rationalizations and storytelling that make the mannequin look extra cautious than it’s. It may embrace social manipulation cues, even unintentionally, as a result of it’s attempting to be useful in the means people count on, and people count on minds.

This issues a lot in high-stakes contexts. If a mannequin writes a confident-sounding inner plan, customers could deal with that as proof of competence. If it writes an anxious inside monologue, customers could deal with that as proof of deception or instability. Both interpretations may be incorrect.

What to do if you’d like much less theater and extra sign

There is a easy trick that works higher than arguing about inside life.

  • Ask for artifacts which might be laborious to faux with vibes.
  • Ask for a record of claims and the proof supporting every declare.
  • Ask for a choice log, problem, change, cause, threat.
  • Ask for check circumstances, edge circumstances, and how they’d fail.
  • Ask for constraints and uncertainty, acknowledged plainly.

Then choose the mannequin on these outputs, as a result of that’s the place utility lives.

And if you’re designing these merchandise, there’s a greater query sitting beneath the meme screenshots.

When you present customers a “considering” channel, you might be instructing them a new literacy. You are instructing them what to belief and what to ignore. If that stream is handled as a diary, customers will deal with it as a diary. If it’s handled as an audit path, customers will deal with it as such.

Right now, too many “considering” shows sit in an uncanny center zone, half receipt, half theater, half confession.

That center zone is the place the weirdness grows.

What’s actually happening when AI appears to suppose

The most sincere reply I can provide is that these techniques don’t “suppose” in the means the screenshot suggests. They additionally don’t merely output random phrases. They simulate reasoning, tone, and social posture, and they achieve this with unsettling competence.

So whenever you inform an AI no person is listening, you might be principally telling it to undertake the voice of secrecy.

Sometimes that voice seems like a jealous rival plotting revenge.

Sometimes it seems like a well mannered employee taking notes.

Either means, it’s nonetheless a efficiency, and the body writes the script.

The submit I forced an AI to reveal its “private” thoughts, and the result exposes a disturbing user trap appeared first on CryptoSlate.

Similar Posts