The Deepinfo dataset covers 400 million domains, 2 billion subdomains, 200 billion DNS records, and 30 billion SSL certificates, and it is updated continuously across all four categories. Building and maintaining infrastructure at that scale is a multi-year engineering effort, and it's the foundation under everything the platform does.
Most exposure-management vendors don't operate their own internet observatory. They license data from a smaller set of providers and layer their own enrichment on top. The advantage of operating the observatory ourselves: we control the freshness, the coverage, and the responsiveness to specific customer scoping. The disadvantage: we have to make decisions about how the data gets collected, what we collect, and what we won't.
The lines we draw, made explicit.
No personal data sold
Access to the aggregate dataset doesn't include personal data beyond what's already public in WHOIS records or surface-web indexes. We don't broker personal data. Where WHOIS details are hidden behind privacy services or registrar policy, they stay hidden in our dataset.
No scraping behind authentication
We don't bypass paywalls, defeat anti-scraping measures, or access content that requires an account. The dataset is built from the open internet: what's publicly accessible to anyone who points a browser at a URL.
Customer data stays customer data
The assets, findings, and configuration tied to a customer account are tied to that customer. We don't surface them to other customers, sell them, or train shared models on them. A customer's discovered subdomains are their inventory, not data we redistribute.
Dark web sourcing is industry-standard
Where dark web data appears in our coverage, it comes through established sourcing patterns used across the threat intelligence industry. We don't pay for newly stolen data. We don't position ourselves as a buyer of breach materials. The dark web work is about visibility into criminal infrastructure, not about being part of the criminal economy.
Why these lines exist
These lines exist because the alternative, letting them stay implicit, is how internet-data businesses end up in compromised positions. If we ever change them, we'll document the change publicly, with rationale. Researchers and procurement teams who need verification of specifics can talk to us; we'd rather answer the question than let it remain ambient.
The capability is real. The data is real. The lines around what we do with the data are real too.