Ghost data: the undeclared PII hiding in your warehouse.

Ghost data, sometimes loosely called dark data, is personal data that ended up in your systems without being registered, classified, or governed. Nobody decided to store it, so nobody is watching it, and that makes it a likely place for an audit to find a gap.


Definition

The personal data you don't know you have.

Ghost data is personal or sensitive information sitting in your warehouse outside any governance. It is not on your PII registry, not attached to a deletion strategy, and not part of your compliance inventory — because it arrived by accident rather than by design.

The danger is not that ghost data is more sensitive than declared data. It is that you cannot manage what you have not mapped. When you assert that a customer has been erased, a stray copy of their email in an ungoverned column silently makes that assertion false.
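In practice, the definition reduces to a set difference: columns where personal data was detected, minus columns the registry declares. A minimal sketch follows, assuming both sides can be expressed as fully qualified column names. The names `detected`, `registry`, and `ghost_columns` are hypothetical and do not refer to any particular tool.

```python
def ghost_columns(detected, registry):
    """Return columns where PII was detected but that the registry does not declare."""
    return sorted(set(detected) - set(registry))


# Hypothetical inputs: scanner output on one side, declared PII on the other.
detected = [
    "crm.customers.email",
    "support.tickets.notes",
    "scratch.customers_export.email",
]
registry = ["crm.customers.email"]

undeclared = ghost_columns(detected, registry)
```

Here `undeclared` holds the two columns an erasure run driven by the registry alone would never touch.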


How it gets there

Four ways ghost data accumulates.

Free-text fields

A support agent pastes a phone number into a notes field; a customer types their full name into a feedback box. Personal data lands in columns that were never designed to hold it, so no classifier is watching them.

Nested and semi-structured data

JSON blobs and event payloads carry emails, IPs, and identifiers several levels deep. Schema-level scans that only look at top-level column names miss them entirely.
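To see why a top-level scan misses these values, it helps to walk a payload recursively and record the path to each match. The sketch below is illustrative only: the email pattern is deliberately simple, and the payload and the `find_emails` helper are made up for the example.

```python
import json
import re
from typing import Any, Iterator, Tuple

# Deliberately simple email pattern, for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(node: Any, path: str = "$") -> Iterator[Tuple[str, str]]:
    """Yield (path, matched_email) for every string leaf that contains an email."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_emails(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_emails(value, f"{path}[{index}]")
    elif isinstance(node, str):
        for match in EMAIL_RE.finditer(node):
            yield path, match.group(0)


# Hypothetical event payload with emails two and three levels deep.
payload = json.loads(
    '{"event": "ticket_created", "context": {"actor": {"contact": "ana@example.com"},'
    ' "tags": ["vip", "cc: bob@example.org"]}}'
)
hits = list(find_emails(payload))
```

Neither match sits under a key named "email", so a scan of column or key names would report this payload as clean.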

Copies and exports

A one-off export gets loaded back in as a new table; a debugging snapshot never gets cleaned up. Each copy duplicates personal data into a location the registry does not track.

Derived transformations

A dbt model joins a raw table and carries an identifier downstream into a mart. The PII propagates, but the policy attached to the source column does not follow it.


The risk

Why undeclared PII is an audit and breach risk.

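GDPR Article 5(2), the accountability principle, makes the controller responsible for complying with the data protection principles and able to demonstrate that compliance. For erasure, the demonstration is only credible if it covers every copy of the data. An erasure claim that misses an undeclared copy is not just incomplete; the leftover copy contradicts the claim, which is harder to defend in front of a regulator than a gap you found and disclosed yourself.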

Ghost data also widens your breach blast radius. Data you did not know you were storing is data you cannot protect, encrypt, or delete on schedule — so a breach exposes more than your incident response assumed, and your retention policy silently does not apply to it.


Detection

How to find ghost data before a regulator does.

01 / Scan content, not just schema

Match on the shape of the data — email, phone, and identifier patterns in sample values, including inside nested JSON — rather than trusting column names to be honest.
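One way to sketch content-based scanning is to test a sample of values from each column against a few patterns and report the share of values that match. The patterns, the 5% threshold, and the `scan_column` helper are assumptions for illustration. Real scanners use broader pattern sets and validation to limit false positives.

```python
import re

# Illustrative patterns only. The phone pattern will also match other long
# digit runs (order numbers, for example), so treat hits as candidates.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def scan_column(samples, min_hit_rate=0.05):
    """Return {pattern_name: hit_rate} for patterns matching enough non-null samples."""
    values = [str(v) for v in samples if v is not None]
    if not values:
        return {}
    rates = {}
    for name, pattern in PATTERNS.items():
        hits = sum(1 for v in values if pattern.search(v))
        rate = hits / len(values)
        if rate >= min_hit_rate:
            rates[name] = rate
    return rates


# Hypothetical sample from a support "notes" column.
notes = [
    "Customer asked for a refund",
    "Call back on +44 20 7946 0958 after 5pm",
    None,
    "Forwarded to jo@example.com",
    "Resolved",
]
result = scan_column(notes)
```

Reporting a hit rate instead of a yes/no flag lets a reviewer tell a column that occasionally contains a pasted phone number from one that holds phone numbers by design.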

02 / Follow lineage

Use dbt lineage to see which downstream models inherit personal data from a source, so detection covers derived tables, not only raw ones.
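Following lineage amounts to a graph walk from the source that holds personal data. The sketch below assumes a parent-to-children mapping shaped like the `child_map` section of dbt's `manifest.json` artifact. The node IDs and the `downstream_of` helper are hypothetical. Model-level lineage over-approximates: a downstream model may not select the sensitive column, so treat the result as a list of tables to scan, not as confirmed findings.

```python
from collections import deque


def downstream_of(child_map, source):
    """Breadth-first walk returning every node reachable from source, in discovery order."""
    reached = []
    visited = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in child_map.get(node, []):
            if child not in visited:
                visited.add(child)
                reached.append(child)
                queue.append(child)
    return reached


# Hypothetical lineage: one raw source feeding staging, a dimension, and a mart.
child_map = {
    "source.shop.raw.customers": ["model.shop.stg_customers"],
    "model.shop.stg_customers": [
        "model.shop.dim_customers",
        "model.shop.int_orders_enriched",
    ],
    "model.shop.int_orders_enriched": ["model.shop.mart_revenue"],
    "model.shop.dim_customers": [],
}
affected = downstream_of(child_map, "source.shop.raw.customers")
```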

03 / Turn hits into findings

Record each match as a reviewable finding with a resource, pattern, severity, and recommended action — not a line buried in a scanner log.
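A finding can be as small as a record with those four fields. The sketch below shows one possible shape. The severity mapping, the action wording, and the `to_finding` helper are assumptions to replace with your own risk model.

```python
from dataclasses import asdict, dataclass

# Hypothetical severity mapping; tune it to your own risk model.
SEVERITY_BY_PATTERN = {"email": "high", "phone": "high", "ip_address": "medium"}

# Hypothetical default actions per severity.
ACTION_BY_SEVERITY = {
    "high": "register the column and attach a deletion strategy",
    "medium": "review and classify",
    "low": "review and classify",
}


@dataclass(frozen=True)
class Finding:
    resource: str            # fully qualified column or table
    pattern: str             # which pattern matched
    severity: str            # "high", "medium", or "low"
    recommended_action: str  # what a reviewer should do next


def to_finding(resource: str, pattern: str) -> Finding:
    """Turn a raw scanner hit into a reviewable finding."""
    severity = SEVERITY_BY_PATTERN.get(pattern, "low")
    return Finding(resource, pattern, severity, ACTION_BY_SEVERITY[severity])


finding = to_finding("analytics.support_tickets.notes", "email")
record = asdict(finding)  # plain dict, ready to store or serialize
```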

04 / Close the loop

For each finding, register the column and attach a deletion strategy, remove the source data, or confirm aggregate-only handling. Track the status of each finding so drift between the registry and the warehouse is measured and closed.
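Closing the loop can be tracked with a status per finding, where the three outcomes above are the only valid resolutions. A minimal sketch follows; the status names, the resource names, and the `resolve` and `open_findings` helpers are hypothetical.

```python
# The three outcomes described above, as hypothetical status values.
RESOLUTIONS = {"registered", "source_removed", "aggregate_only"}


def resolve(statuses, resource, resolution):
    """Mark a finding as closed with one of the allowed resolutions."""
    if resolution not in RESOLUTIONS:
        raise ValueError(f"unknown resolution: {resolution}")
    if resource not in statuses:
        raise KeyError(resource)
    statuses[resource] = resolution


def open_findings(statuses):
    """Return resources whose findings are still open, sorted for stable output."""
    return sorted(r for r, s in statuses.items() if s == "open")


# Hypothetical findings, all open after a scan.
statuses = {
    "analytics.support_tickets.notes": "open",
    "analytics.events.payload": "open",
    "scratch.customers_export": "open",
}
resolve(statuses, "analytics.support_tickets.notes", "registered")
resolve(statuses, "scratch.customers_export", "source_removed")
remaining = open_findings(statuses)
```

The length of `remaining` over time is one simple drift measure: it should trend to zero between scans and rise only when new undeclared data appears.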


FAQ

Common questions

What is ghost data?

Ghost data, also called dark data, is personal or sensitive data stored in your systems without being registered, classified, or governed. It typically arrives through free-text fields, nested JSON, copied exports, or derived tables, and it is excluded from deletion and compliance workflows until a scan surfaces it.

Why is ghost data a compliance problem?

Because you cannot delete or protect data you do not know you have. An undeclared copy of a customer's personal data contradicts an erasure claim and widens the impact of a breach. GDPR's accountability principle requires you to be able to demonstrate compliance, and undeclared data quietly undermines any claim that your records are complete.

How do you detect ghost data in a warehouse?

Scan the content of columns for personal-data patterns (including inside nested JSON), follow dbt lineage so derived tables are covered, and record each match as a reviewable finding with a recommended action. Content-based scanning catches PII that schema-only cataloguing misses.

