Why pretrained security AI doesn't transfer (and what does)
AWS, CrowdStrike, and Microsoft have the data positions. Why hasn't the obvious cross-tenant pretrained model shipped?
There is an obvious move in security AI that, oddly, nobody has made work yet.
AWS GuardDuty has been training on AWS-wide telemetry for years. CrowdStrike Falcon has petabytes of cross-tenant endpoint signal. Microsoft Defender consumes more endpoint events in an hour than most enterprises see in a quarter. Given those data positions, why has no vendor shipped a pretrained security model that arrives at your tenant pre-baked, ready to flag anomalies on day one?
The answer is the design constraint nobody on stage will name: security graphs do not transfer.
The transfer problem, made concrete
A service account that authenticates against twelve crown-jewel databases is a finding in 99% of environments. In the 1% where that account is the payments pipeline, it's the load-bearing column of the business. Pretraining cannot tell those two cases apart, because the structural shape of "weird" is tenant-local in ways no amount of cross-tenant data normalization fixes.
Identity graphs are the worst case. A senior SRE at one company has root-equivalent in three production accounts because that's the operating model. At another company with the same headcount, the same access pattern is privilege escalation in progress. The graph topology, the access pattern, the timing — all identical. Only the org-specific context distinguishes them.
This is why GuardDuty, Falcon, and Defender — despite having the data positions to crush this — ship narrow detections (specific malware families, specific lateral movement primitives, specific cloud misconfigurations) rather than the cross-tenant pretrained anomaly model that would seem to be sitting in plain sight on their data lakes. The data is there. It does not produce the model people imagine it would.
The cold-start problem in security AI
Every security AI vendor faces a fork on day one of a new tenant deployment:
Path A (industry-standard). Require N weeks of "baseline" telemetry before the model produces meaningful output. The vendor calls this "learning your environment." The customer experiences it as paying for an AI product that doesn't work for the first 4–12 weeks. Many SOCs have lived through this and now refuse to renew vendors who require it.
Path B. Ship unsupervised structural methods that produce signal on day one without any baseline. Use the analyst's triage decisions — closed alerts, dismissed alerts, escalated alerts — as weak labels to learn tenant-specific weights over time. The day-one product is honest about being a structural prior. The 12-month product is a learned model trained on the analyst feedback the customer's own SOC generates while using the day-one product.
Setu is path B. Not because path A is wrong, but because the customer doesn't have 12 weeks to wait, and the "pretrained foundation model for security" that would let path A start fast doesn't transfer for the structural reasons above.
Why structural methods give you something on day one
The math behind path B is older than the math behind path A. Personalized PageRank has been computing influence-from-bad-seeds since 2003. Spectral graph wavelets — the Hammond–Vandergheynst–Gribonval line of work — gave us multi-scale frequency decomposition on graphs in 2011. Heat-kernel diffusion as a smoothing operator on graph signals predates that.
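As a rough illustration of the first of these operators, personalized PageRank can be computed with plain power iteration. The sketch below is a minimal example under stated assumptions, not Setu's implementation: the graph, the node names, the restart probability `alpha = 0.15`, and the choice to return dangling-node mass to the seed set are all hypothetical.

```python
from typing import Dict, List


def personalized_pagerank(
    graph: Dict[str, List[str]],
    seeds: List[str],
    alpha: float = 0.15,
    iterations: int = 100,
) -> Dict[str, float]:
    """Power iteration for personalized PageRank.

    graph maps each node to the nodes it can reach (out-edges).
    alpha is the probability of restarting at the seed set.
    """
    nodes = sorted(set(graph) | {v for outs in graph.values() for v in outs})
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(restart)
    for _ in range(iterations):
        nxt = {n: alpha * restart[n] for n in nodes}
        for u in nodes:
            outs = graph.get(u, [])
            if outs:
                # Split the walking mass evenly across out-edges.
                share = (1.0 - alpha) * score[u] / len(outs)
                for v in outs:
                    nxt[v] += share
            else:
                # Dangling node (assumption): send its mass back to the seeds.
                for n in nodes:
                    nxt[n] += (1.0 - alpha) * score[u] * restart[n]
        score = nxt
    return score


# Hypothetical access graph; one known-bad seed.
graph = {
    "phished-laptop": ["svc-deploy"],
    "svc-deploy": ["build-server", "payments-db"],
    "build-server": ["payments-db"],
    "payments-db": [],
    "hr-portal": ["payments-db"],
}
scores = personalized_pagerank(graph, seeds=["phished-laptop"])
```

In this toy graph, `hr-portal` is not reachable from the seed and so receives no score, while `payments-db` accumulates score from two upstream paths. No baseline period is involved: the scores depend only on the graph and the seed set.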
None of this is novel mathematics, and we don't claim it is. What's novel is the production engineering: applying these well-understood operators to a security entity graph that combines identity, endpoint, network, and access telemetry into a single coherent structure, then exposing the resulting per-node exposure scores as a continuous surface that an analyst can triage on day one.
A structural prior is not as good as a learned model trained on six months of clean tenant-specific baseline. We say so directly. Its advantage is that it is available on day one, when the learned model is not.
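Heat-kernel diffusion can be sketched in the same spirit. The example below approximates exp(-tL) applied to a node signal with a truncated Taylor series, where L = D - A is the combinatorial Laplacian of an undirected graph. The diffusion time `t`, the number of series terms, and the four-node path graph are illustrative assumptions; a production system would more likely use a sparse linear-algebra library than pure Python.

```python
from typing import Dict, List, Set, Tuple


def heat_diffuse(
    edges: List[Tuple[str, str]],
    signal: Dict[str, float],
    t: float = 0.5,
    terms: int = 30,
) -> Dict[str, float]:
    """Approximate exp(-t * L) @ signal with a truncated Taylor series."""
    neighbors: Dict[str, Set[str]] = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    for n in signal:
        neighbors.setdefault(n, set())

    def laplacian(x: Dict[str, float]) -> Dict[str, float]:
        # (L x)[u] = deg(u) * x[u] - sum of x over the neighbors of u
        return {
            u: len(nb) * x[u] - sum(x[v] for v in nb)
            for u, nb in neighbors.items()
        }

    # term holds (-t L)^k / k! applied to the signal; start with k = 0.
    term = {n: signal.get(n, 0.0) for n in neighbors}
    result = dict(term)
    for k in range(1, terms + 1):
        lx = laplacian(term)
        term = {n: (-t / k) * lx[n] for n in neighbors}
        for n in neighbors:
            result[n] += term[n]
    return result


# Hypothetical path graph a - b - c - d with all initial signal on node a.
edges = [("a", "b"), ("b", "c"), ("c", "d")]
smoothed = heat_diffuse(edges, {"a": 1.0})
```

The total signal is preserved and spreads outward from `a`, falling off with graph distance. Larger values of `t` smooth over a wider neighborhood, which is the sense in which diffusion acts as a multi-scale operator.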
What actually compounds: the analyst feedback loop
Here is the part that matters strategically, and that the pitch for cross-tenant pretraining misses.
Every alert a Setu tenant's analysts triage — closed as benign, escalated to incident, dismissed as known-good — is a labeled example. Per tenant. Specific to that org's notion of normal. Within six months of deployment, a typical mid-size SOC generates 15,000–40,000 triage decisions. That's enough labeled data to train a per-tenant model that beats any pretrained model that has never seen the tenant.
The key insight: the analyst feedback loop only produces useful training data if the analyst has alerts to triage in the first place. Path A vendors don't get this loop until their baseline period ends. Path B vendors get it from week one.
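To make the weak-label idea concrete, here is a minimal sketch of learning tenant-specific weights from triage decisions with online logistic regression. Everything named here is an assumption for illustration: the outcome names, the feature names, the learning rate, and the choice of logistic regression itself. The source describes the loop, not this particular model.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical mapping from triage outcome to a weak label.
WEAK_LABEL = {
    "escalated": 1.0,
    "closed_benign": 0.0,
    "dismissed_known_good": 0.0,
}


def escalation_probability(
    weights: Dict[str, float], features: Dict[str, float]
) -> float:
    """Logistic model: probability that an analyst would escalate."""
    z = sum(weights.get(k, 0.0) * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def update_weights(
    weights: Dict[str, float],
    features: Dict[str, float],
    outcome: str,
    lr: float = 0.1,
) -> Dict[str, float]:
    """One stochastic-gradient step on a single triage decision."""
    error = escalation_probability(weights, features) - WEAK_LABEL[outcome]
    keys = set(weights) | set(features)
    return {
        k: weights.get(k, 0.0) - lr * error * features.get(k, 0.0)
        for k in keys
    }


# Hypothetical tenant: crown-jewel reachability is routine here (think of the
# payments pipeline), but off-hours activity is what analysts escalate.
Decision = Tuple[Dict[str, float], str]
decisions: List[Decision] = [
    ({"crown_jewel_reach": 1.0, "off_hours": 0.0, "bias": 1.0}, "dismissed_known_good"),
    ({"crown_jewel_reach": 1.0, "off_hours": 1.0, "bias": 1.0}, "escalated"),
    ({"crown_jewel_reach": 0.0, "off_hours": 1.0, "bias": 1.0}, "escalated"),
    ({"crown_jewel_reach": 0.0, "off_hours": 0.0, "bias": 1.0}, "closed_benign"),
]

weights: Dict[str, float] = {}
for _ in range(200):
    for features, outcome in decisions:
        weights = update_weights(weights, features, outcome)
```

After replaying these decisions, the model in this toy tenant puts most of its weight on `off_hours` and little on `crown_jewel_reach`. A different tenant's analysts could push the same starting weights in the opposite direction, which is the tenant-local behavior a cross-tenant model cannot learn in advance.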
So the actual bet looks like this:
| Time horizon | Path A (baseline-required) | Path B (unsupervised + feedback) |
|---|---|---|
| Day 1 | No signal | Structural exposure scores |
| Month 1 | No signal | Structural + first weak feedback |
| Month 6 | First learned model | Tenant-specific weights from roughly 15K–40K triage decisions |
| Month 12 | Refined learned model | Tenant-specific learned edge weights, attention, temporal layer |
The path B vendor reaches a learned tenant-specific model faster and with better data, because the data was generated by analysts working with a system that produced reasonable signal from day one. Path A vendors are training on retrospective backfill of a baseline period during which nothing was being triaged.
The explainability tax
There's a second-order reason path B works that the AI-first vendors underestimate.
Security analysts will turn off any model that they cannot interrogate. A 0.95-AUC black-box classifier that flags an executive's account at 2 AM with no explanation gets disabled by the SOC manager after the second false positive. A 0.85-AUC model that says "this account scored high because it has unusual reachability to the crown-jewel set, which changed by +12% this week due to a new role grant from group X" gets trusted, gets refined, and stays on.
Spectral and propagation-based methods have a built-in explanation: the score is a function of named graph operations on a graph the analyst can inspect. The path from input to output is auditable. When the GNN layer goes on top later — and it will — the explanations carry through because the substrate is the same graph the analyst already trusts.
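A deliberately simple stand-in shows what a "named graph operation" explanation can look like. The sketch below uses plain breadth-first reachability rather than a spectral or propagation score, and compares two snapshots of an access graph to report which crown-jewel assets became reachable and which new edges appeared. The graph, the account and group names, and the shape of the report are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def reachable(graph: Dict[str, List[str]], start: str) -> Set[str]:
    """Breadth-first search over access edges."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def explain_reach_change(
    before: Dict[str, List[str]],
    after: Dict[str, List[str]],
    account: str,
    crown_jewels: Set[str],
) -> dict:
    """Report newly reachable crown jewels and the edges added between snapshots."""
    old = reachable(before, account) & crown_jewels
    new = reachable(after, account) & crown_jewels
    added_edges = sorted(
        (u, v)
        for u, outs in after.items()
        for v in outs
        if v not in before.get(u, [])
    )
    return {
        "account": account,
        "newly_reachable": sorted(new - old),
        "new_edges": added_edges,
    }


# Hypothetical snapshots: this week, exec-account was granted a role in group-x.
crown_jewels = {"ledger-db", "payments-db", "hsm"}
before = {
    "exec-account": ["group-finance"],
    "group-finance": ["ledger-db"],
    "group-x": ["payments-db", "hsm"],
}
after = dict(before, **{"exec-account": ["group-finance", "group-x"]})

report = explain_reach_change(before, after, "exec-account", crown_jewels)
print(report)
```

Every field in the report maps to something the analyst can look up in the graph directly: the account, the assets, and the specific grant that changed the picture.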
Pretrained models on cross-tenant data don't have this property. The analyst can't inspect the training set. They can't tell whether the model is keying off legitimate signal or off correlations with org sizes or industries that pretraining happened to overweight. So the model gets disabled, regardless of its accuracy on a held-out test set.
What this means for the GNN debate
The standard critique of unsupervised graph methods in security goes: "GNNs learn the weights, the aggregation, and the propagation end-to-end — why are you still hand-designing those?" The critique is correct on the math. It misses the deployment reality.
You cannot train a GNN on a tenant's first day of data. You can train one after six months of analyst feedback. The bridge from day-one to month-six is what unsupervised structural methods provide. They are not a substitute for GNNs. They are the runway.
The vendors who skip the runway and require months of baseline before producing signal are betting that customers will tolerate the delay. Some will. Many will not, especially in mid-market security where the procurement cycle measures product value in weeks. The vendors who pretrain on cross-tenant data and ship "ready-to-go" anomaly models are betting that the transfer problem above is solvable with enough data. Twenty years of public security ML research suggests it isn't, at least not with the model architectures currently in production.
There is a third path that academic research is actively exploring: foundation models for security graphs, pretrained on synthetic and unlabeled telemetry, fine-tuned per tenant. Microsoft Research and several university groups have early work here. We watch this work closely. If a pretrained security graph foundation model becomes available with credible evidence of cross-tenant transfer, we'll consume one before we'd build one — but the evidence does not yet exist, and the production architecture has to ship before the pretraining problem is solved.
What we're betting on
Setu's bet, stated plainly: the moat in security AI is not the model. It's the per-tenant analyst feedback that the model gets trained on, and the only way to accumulate that feedback is to ship something useful on day one. Structural graph methods are the day-one product that buys the right to the learned model.
If we're wrong — if cross-tenant pretraining suddenly works, or if the market accepts "your AI works in 12 weeks" as a sales motion — we'll lose to whoever solves either problem. We don't think either is close. The bet is that day-one signal plus accumulated tenant-specific feedback compounds faster than baseline-then-train, and that the resulting trained model is more trustworthy because the analyst saw the system reason from inputs they could inspect.
That's the case for graph physics (the diffusion, propagation, and spectral methods above) as the foundation in security AI, with GNNs as the layer that goes on top once the data exists to train them honestly. It is not a defense of 2011 mathematics against 2024 mathematics. It is an argument about the order in which a working production system has to be built.
The math will keep moving forward. The deployment constraint won't.
Setu Security Research