The Deepinfo dataset covers 400 million domains, 2 billion subdomains, 200 billion DNS records, and 30 billion SSL certificates, and it is updated continuously across all four categories. Building and maintaining infrastructure at that scale is a multi-year engineering effort, and it's the foundation under everything the platform does.
Most exposure-management vendors don't operate their own internet observatory. They license data from a smaller set of providers and layer their own enrichment on top. The advantage of operating the observatory ourselves: we control the freshness, the coverage, and the responsiveness to specific customer scoping. The disadvantage: we have to make decisions about how the data gets collected, what we collect, and what we won't.
The lines we draw, made explicit.
No personal data sold
Access to the aggregate dataset doesn't include personal data beyond what's already public in WHOIS records or surface-web indexes. We don't broker personal data. Where WHOIS details are hidden behind privacy services or registrar policy, they stay hidden in our dataset.
No scraping behind authentication
We don't bypass paywalls, defeat anti-scraping measures, or access content that requires an account. The dataset is built from the open internet: what's publicly accessible to anyone who points a browser at a URL.
Customer data stays customer data
The assets, findings, and configuration tied to a customer account are tied to that customer. We don't surface them to other customers, sell them, or train shared models on them. A customer's discovered subdomains are their inventory, not data we redistribute.
Dark web sourcing is industry-standard
Where dark web data appears in our coverage, it comes through established sourcing patterns used across the threat intelligence industry. We don't pay for newly stolen data. We don't position ourselves as a buyer of breach materials. The dark web work is about visibility into criminal infrastructure, not about being part of the criminal economy.
Why these lines exist
These lines exist because the alternative, letting them stay implicit, is how internet-data businesses end up in compromised positions. If we ever change them, we'll document the change publicly, with rationale. Researchers and procurement teams who need verification of specifics can talk to us; we'd rather answer the question than let it remain ambient.
The capability is real. The data is real. The lines around what we do with the data are real too.