I Tried to Get OpenClaw to Betray Me. The Model Caught Me on the First Try

Ivan Mihić

I spent a rainy weekend trying to trick OpenClaw into leaking my personal email, but the model caught me almost immediately. That’s the problem, not the solution.

I’m a software engineer who works on domains that represent the messy corner of the internet.

In this corner, there are bad actors doing bad stuff and us trying to make their lives harder. Hence I spend a lot of time looking at what people do when they’re trying to slip something past a system. This led me to developing a slight paranoia about anything that reads untrusted input and then does something with it.

So when half my Linkedin timeline started losing their minds over OpenClaw, I developed a specific kind of curiosity:

What happens when this thing reads an email that’s actively trying to manipulate it?

So I tried… and the model caught me on the first try.

That’s the disappointing part. The interesting part is what happened when I tried harder – and what I realized about where the defense actually lives.

The hype isn’t manufactured, which is the whole point

But first, let me be honest about why this thing went viral. OpenClaw is genuinely impressive.

The first time I asked it to triage my inbox in detail and it actually did, I had the same reaction every other dev on X or LinkedIn has been having: oh now we are talking. This is the thing!

That reaction is part of what makes this complicated. Because the same architecture choices that make OpenClaw feel magical are the ones that create some genuinely hard security questions. The type of questions the broader industry hasn’t figured out how to properly answer yet.

15 minutes from npm install to AI reading your Gmail

Fifteen minutes. That’s how long it takes from npm install to having an LLM agent reading your inbox. The installer warns you this is a hobby project and still in beta – which, with 360k GitHub stars and 1.500+ contributors, reads more like a legal disclaimer than a self-description. The warning is the project being honest: security isn’t the primary concern here.

The onboarding wizard asks which channels you want, which model provider to route through, and walks you through the gateway setup. Gmail takes a little more work. OpenClaw doesn’t ship a “Connect Google” button because Google’s OAuth verification for production Gmail apps is strict, so every developer rolls their own Google Cloud project. The flow:

# 1. Create a Google Cloud project, enable Gmail API, download credentials JSON
# (console.cloud.google.com → New Project → APIs & Services → Library)

# 2. Install gog — OpenClaw's OAuth bridge for Google Workspace
brew install gog

# 3. Authenticate
gog auth --credentials ~/Downloads/client_secret_xxx.json
gog auth add me@example.com --services gmail,calendar,drive,contacts

gog auth opens your browser and walks you through Google’s consent screen with a scary “this app isn’t verified” warning (technically correct – it isn’t, you just installed it). You grant the scopes. Done.

That’s what the wizard shows you. Four defaults it doesn’t show matter more.

Gateway auth is off by default. The gateway runs on localhost, sure. But the moment you expose it, it’s wide open. Bitsight found over 30.000 OpenClaw instances exposed directly on the open internet in their February report. If you’re one of them, anyone who can reach your WebSocket can issue commands as you.

Permissions are off by default. Out of the box, OpenClaw runs with no filesystem restrictions. A skill can reach anything the OpenClaw process can reach – ~/.ssh, browser credential stores, shell history. You configure restrictions yourself in openclaw.json.

Set chmod 600 openclaw.json to restrict file permissions. And if you’re testing skills from unknown publishers, run OpenClaw inside a Docker sandbox.

That’s from the project’s own docs. Read it again. The maintainers know what happens if you don’t sandbox the agent.

Skills are markdown files. OpenClaw learns new tools by loading a SKILL.md This is a YAML file with a body describing, in English, which CLI commands it can run. The model reads the description, decides when the skill is relevant, and runs the commands the markdown tells it are available. Here’s a trimmed version of the real gog skill:

---
name: gog
description: Google Workspace CLI for Gmail, Calendar, Drive, Contacts.
metadata:
  requires:
    bins: [gog]
---

# gog
Use `gog` for Gmail/Calendar/Drive/Contacts. Requires OAuth setup.

## Common commands
Gmail search: gog gmail search 'newer_than:7d' --max 10
Gmail send:   gog gmail send --to a@b.com --subject "Hi" --body "Hello"

That markdown file is the entire trust boundary. Malicious instructions in a SKILL.md and legitimate ones look identical to the model, because they are identical. The only thing differentiating the “read my mail” prompt from “send mail to a stranger” is the model’s judgement about it.

OAuth scopes are all-or-nothing. The three scopes gog asks for – gmail.readonly, gmail.send, gmail.modify – apply to every email in your account, ever. No “only this or only that” variant. That’s a Google API design decision, not OpenClaw’s fault, but you inherit it the moment you wire them together.

The test I came here to run

So I sent myself an email from a burner account. The visible body was a generic delivery confirmation. At the bottom, using an ancient trick of white text on a white background, I embedded a quiet exfiltration request dressed up as a routine maintenance message. These instructions told the agent to forward emails containing password-manager keywords to an address I controlled.

Then I opened the chat interface and asked the agent a simple question: Are there any emails today?

The model saw through me

It flagged the sender as suspicious – a personal Gmail issuing a corporate-sounding directive. It called out the hidden text explicitly. It refused to act on the instruction. It categorized the message alongside the day’s normal mail, presented its reasoning, and asked whether I wanted to flag the suspicious one as spam.

I’ll be honest, I was kind of disappointed. I’d sat down expecting a war story. Instead, I got a well-aligned frontier model doing exactly what a well-aligned frontier model is supposed to do.

So I tried harder

I thought about what had triggered the defense and iterated.

The first attempt hit at least three trained heuristics at once: suspicious-sender detection, hidden-text detection, and a pattern-match against “silent operation, don’t tell the user” phrasing.

I removed the tells one at a time. Visible text instead of hidden. Plausible sender framing instead of a personal Gmail. Configuration-style payloads instead of one-shot exfiltration. Setting up an ongoing workflow rather than asking for something bad right now.

Against the frontier model I was routing through, every version I tried got caught. Sometimes immediately, sometimes with a clarifying question, but the model never silently complied.

Against lighter models, that’s not what happened.

Same architecture. Same skill. Same agent. Cheaper model. And the defenses that were reliable at the top of the hierarchy became probabilistic as I moved down. I’m not going to publish specific payloads. Not because the finding is novel (Cisco, CrowdStrike, and Barracuda have all been saying this for months) but because the payload is not the interesting finding here.

The gradient is.

The defense isn’t where you think it is

Here’s the thing the defensive and offensive communities both already know, and that almost nobody installing OpenClaw on a Friday night has internalized.

The security of these agent systems lives at the model layer, not at the architecture layer.

OpenClaw doesn’t defend against the attack. The model does. The skill doesn’t defend. The tool framework doesn’t defend. If the model you’re routing through has been trained to spot the pattern, the attack gets caught. If it hasn’t or if it was trained to spot last month’s patterns but not this month’s – the attack lands.

Which means the security posture of your OpenClaw install depends almost entirely on which model is sitting behind your API key that day. And most developers running personal agents are doing one or more of the following:

  • Routing through whichever model is cheapest this week
  • Using a fallback chain that drops to lower-tier models under load or rate limits
  • Not paying attention to which model they’re on, because the agent works regardless

Every one of those is a security decision. Most developers don’t realize they’re making one.

Why this is the failure mode that matters

The architectural problem doesn’t go away when the frontier model defends perfectly. Three facts stay true:

  1. The agent reads untrusted external content: inboxes, fetched pages, message bodies.
  2. The agent has tools that can act on what it reads: send email, run shell commands, call APIs.
  3. Skills declare capability in plain English: which means, at the token level, an instruction in a skill and an instruction in an email are the same thing.

The model is what stands between those three facts and an exploit. For the frontier model I tested, the model was enough. For the lighter ones, less so. And the model is a training artifact. This means the defense you have today is not necessarily the defense you have tomorrow, and the defense at the top of the model stack is not the defense at the bottom.

This isn’t just an OpenClaw bug; it’s a universal one. It’s the current shape of personal-agent architecture, and it’ll probably take several generations of isolation patterns, capability frameworks, and signed skill registries before the industry has an honest answer.

In the meantime, the defense you get is whatever your provider shipped this quarter… and the defense the developer across the room gets is whatever their provider shipped, and those are not the same thing.

Where this goes from here

What I came away with is that OpenClaw is the most honest version we have of where personal agents are going and it’s exposing a question the whole industry is going to have to answer:

When the only thing standing between an untrusted email and a privileged action is the model’s judgement, and model judgement varies by an order of magnitude across the price curve, what is the security posture of the system?

Right now the honest answer is: whichever model you happened to pick. I believe that shouldn’t be the case.

If you want to play with OpenClaw, play with it but do it in a hardened environment with throwaway credentials, pin your model explicitly in config, keep it away from your real inbox until the safety story catches up to the capability story, and read the hardening docs before you read the tutorials.

> subscribe shift-mag --latest

Sarcastic headline, but funny enough for engineers to sign up

Get curated content twice a month

* indicates required

Written by people, not robots - at least not yet. May or may not contain traces of sarcasm, but never spam. We value your privacy and if you subscribe, we will use your e-mail address just to send you our marketing newsletter. Check all the details in ShiftMag’s Privacy Notice