Public Records & Open-Source Research: Find What's Actually Out There
Section 9 of 14

There's a version of Googling that most people never discover — and the gap between that version and ordinary search is not small.

Think about the last time you typed something into a search engine and scrolled through the first page of results. You probably got a mix of news articles, Wikipedia entries, maybe a few ads, and results that were close to what you wanted but not quite it. That's fine for finding a restaurant or checking a movie time. It's nearly useless when you're trying to find out whether a company actually existed five years ago, who registered a particular website, or whether a photo was taken where someone claims it was taken.

The techniques in this section are what professional open-source investigators actually use — not because they have special access, but because they know how to ask better questions.

The goal here is to cover five distinct toolkits: advanced search operators, the Wayback Machine for archived web content, WHOIS and domain history lookups, reverse image search, and metadata extraction from files. Master all five and the research landscape looks fundamentally different.

Start with the search operators, because they're the most immediately actionable and the most systematically ignored. Google and other search engines accept a set of commands — called operators — that narrow results in ways the regular search bar doesn't. These aren't secret features; they're documented and free. They're just not in the interface, so most people never find them.

The site operator is the one that earns its keep the fastest. Typing site: followed by a domain name, then your search term, restricts results to that domain only. If you want to find every publicly indexed document from a particular government agency, or every mention of a name on a specific news outlet, the site operator does it in seconds. The colon must sit directly against the word site with no space, and the domain follows immediately, so a query takes the form site:example.gov budget (the operator, the colon, the domain, a space, then your search term).

Related to that is the filetype: operator, which restricts results to specific document types. Searching for filetype:pdf alongside a company name will surface annual reports, court filings, regulatory submissions, and internal documents that were publicly posted but wouldn't float to the top of a normal search. Try filetype:xls or filetype:xlsx on a government agency name and you may find spreadsheets of public data that nobody bothered to mention in the press release. This particular combination — site operator plus filetype — is one of the most productive research pairings that exists.

The intitle: operator searches only within page titles, not the full text of the page. This matters because page titles tend to reflect what the page is actually about, rather than what it mentions in passing. If you're looking for coverage of a specific event or a specific person's involvement in something, intitle: cuts the noise dramatically. The inurl: operator does the same thing for the URL itself — useful for finding specific types of pages within a site, like all the press releases on a domain, or all pages that contain the word "contract" in their address.

Two more worth knowing: the minus sign and the quotation marks. Quotation marks force exact phrase matching, which sounds obvious but makes an enormous difference when you're searching for a full name, a specific document title, or a distinctive phrase from a text you're trying to trace. A minus sign placed immediately before a word, with no space between them, excludes that word from results. If you're searching for news about a person named Robert Smith and you're drowning in the musician, adding a minus sign directly in front of the band's name (for example, -cure) clears most of that noise from the results.

Here's the part that catches even experienced researchers: operators can be stacked. A search combining site, filetype, quotation marks, and exclusions is entirely valid and extremely powerful. It's essentially a structured query aimed at a specific corner of the indexed web. Most search platforms support some version of these operators — DuckDuckGo, Bing, and others have their own syntax with overlap and differences, so it's worth checking the documentation for whichever platform you're on.
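To make the stacking concrete, here is a minimal sketch in Python that assembles a query string from the operators described above. The function name build_query and the example domain and terms are illustrative assumptions, not part of any search engine's tooling; the output is simply text you would paste into the search bar.

```python
def build_query(phrase=None, site=None, filetype=None, intitle=None, exclude=()):
    """Assemble a stacked search query from optional operator parts."""
    parts = []
    if site:
        parts.append(f"site:{site}")          # colon sits directly against the operator
    if filetype:
        parts.append(f"filetype:{filetype}")
    if intitle:
        parts.append(f"intitle:{intitle}")
    if phrase:
        parts.append(f'"{phrase}"')           # quotation marks force an exact phrase
    for word in exclude:
        parts.append(f"-{word}")              # minus sign directly before the word
    return " ".join(parts)


# Hypothetical example: PDFs on one agency's site, excluding drafts.
query = build_query(
    phrase="annual report",
    site="example.gov",
    filetype="pdf",
    exclude=["draft"],
)
print(query)  # site:example.gov filetype:pdf "annual report" -draft
```

Keeping each operator as a separate argument makes it easy to drop one when a platform does not support it.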

Now, everything just described applies to the live, currently-indexed web. That's a smaller slice of the information landscape than it appears. Pages get taken down, companies scrub their histories, officials delete embarrassing statements, and websites go dark. The internet has a memory, though, and its name is the Wayback Machine.

The Wayback Machine, operated by the Internet Archive and accessible at archive.org, has been crawling and saving snapshots of websites since 1996. The Internet Archive's Wayback Machine documentation[1] describes its mission as building a permanent record of the web. As of 2026, it holds hundreds of billions of saved web pages. The practical implication is significant: a website that exists today looked different last year, and what it looked like five years ago may be the most relevant thing about it.

The interface is straightforward. Enter a URL into the Wayback Machine's search bar and you get a calendar view — years across the top, months expanding into individual days marked with dots, each dot representing a saved snapshot. Click a dot and you see the page as it appeared on that date. The resolution of the archive varies enormously by site; popular pages get crawled frequently, while obscure ones might have only a handful of snapshots across a decade.
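For readers who want to check many URLs, the same kind of lookup can be approximated programmatically. The sketch below assumes the Internet Archive's public availability endpoint at archive.org/wayback/available and the JSON shape it has returned in the past (an archived_snapshots object with a closest entry); treat both as assumptions to verify against the current documentation. The function names are illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Wayback Machine availability API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(page_url, timestamp=None):
    """Build the lookup URL; timestamp is digits such as YYYYMMDD, if given."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(response):
    """Return (timestamp, snapshot_url) from a decoded response, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["timestamp"], closest["url"]


def lookup(page_url, timestamp=None):
    """Query the live service (requires network access)."""
    with urlopen(availability_url(page_url, timestamp), timeout=30) as resp:
        return closest_snapshot(json.load(resp))
```

Calling lookup("example.com", "20190101") would ask for the snapshot nearest to 1 January 2019; a result of None means no snapshot was reported for that address.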

Where this matters for research: a company's "About Us" page from three years ago may list executives who have since been quietly removed. A nonprofit's mission statement from 2019 may reveal funding priorities that contradict current public statements. A politician's campaign website from a previous election cycle may contain positions that have since been walked back. None of this is hidden in the sense of being secret — it was public when it was posted — but it requires knowing the archive exists.

There's a catch worth naming. The Wayback Machine doesn't archive everything. Pages that the site owner explicitly blocked with a robots.txt exclusion were often not archived, particularly in the archive's earlier years. Some content was archived and then removed at the request of the site owner. And the archive is imperfect: sometimes images don't load, and sometimes only the HTML skeleton was saved without full styling or embedded media. What the Wayback Machine gives you is a best-effort historical record, not a complete one. For most research purposes, that's still extraordinary.

The CDX API — an application programming interface that lets you query the archive programmatically — is worth a brief mention for those comfortable with technical tools. It lets you retrieve a list of all archived URLs for a domain, across all dates, which is useful for finding pages the site doesn't link to anymore. In plain language: it's how you find what a website used to have that it's now hiding. You don't need to write code to use it; several free web tools wrap the CDX API in a simple interface.
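For those who do want to query it directly, a rough sketch follows. It assumes the CDX server endpoint at web.archive.org/cdx/search/cdx and its documented query parameters (url, matchType, output, fl, collapse, limit); the helper names are hypothetical, and the parsing relies on the JSON output being a list of rows whose first row is a header.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Wayback Machine CDX server.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(domain, limit=None):
    """Build a query listing archived URLs for a domain and its subdomains."""
    params = {
        "url": domain,
        "matchType": "domain",        # include subdomains of the given domain
        "output": "json",
        "fl": "timestamp,original",   # return only these two fields
        "collapse": "urlkey",         # one row per distinct URL
    }
    if limit:
        params["limit"] = limit
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_rows(rows):
    """Turn the JSON rows (header row first) into a list of dicts."""
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


def snapshot_url(timestamp, original):
    """Address of a specific capture in the Wayback Machine."""
    return f"https://web.archive.org/web/{timestamp}/{original}"
```

Fetching cdx_query_url("example.com") with any HTTP client and passing the decoded JSON to parse_cdx_rows yields one record per archived address, including pages the live site no longer links to.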

From archived content to domain history. When a website is registered, information about the registrant (name, organization, email address, phone number, and mailing address) is technically recorded in databases maintained by domain registrars and registries. WHOIS is the protocol for querying those databases.

The phrase "technically recorded" is doing real work in that sentence, because the picture is complicated. In the early years of the commercial internet, WHOIS lookups returned rich contact information directly. That changed substantially after 2018, when the General Data Protection Regulation — GDPR, the European Union's sweeping privacy law — was interpreted to require that registrar databases redact personal information for individuals. Many registrars now show a privacy proxy or simply blank fields where the registrant's identity used to appear. ICANN, the organization that oversees domain naming, has been navigating this tension between privacy and transparency ever since.

That said, WHOIS and domain history remain valuable. Even redacted records show the registrar used, the nameservers, the registration date, the expiration date, and the last-updated timestamp. Registration dates can reveal when a domain was first set up — a company claiming to have operated since 2015 but whose domain was registered in 2023 has a problem to explain. Nameserver configurations can show hosting relationships between seemingly unrelated domains. And for non-European registrations, or older records before GDPR-era redaction kicked in, contact information is sometimes still present.
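The registration-date check can be sketched in a few lines of Python. This assumes the WHOIS response contains a line labelled "Creation Date:" with an ISO-style date, which is common but not universal (field labels vary by registry), and the sample record below is entirely hypothetical.

```python
import re
from datetime import datetime

# Assumption: the record labels the field "Creation Date:"; other registries differ.
CREATION_PATTERN = re.compile(r"^\s*Creation Date:\s*(\S+)", re.IGNORECASE | re.MULTILINE)

# Hypothetical WHOIS text, for illustration only.
SAMPLE_WHOIS = """\
Domain Name: EXAMPLE-COMPANY.TEST
Registrar: Example Registrar, Inc.
Creation Date: 2023-03-14T09:30:00Z
Registry Expiry Date: 2026-03-14T09:30:00Z
"""


def creation_date(whois_text):
    """Return the registration date found in the record, or None."""
    match = CREATION_PATTERN.search(whois_text)
    if not match:
        return None
    # Keep only the YYYY-MM-DD part of the timestamp.
    return datetime.strptime(match.group(1)[:10], "%Y-%m-%d").date()


def claim_predates_domain(claimed_year, whois_text):
    """True when the claimed founding year is earlier than the registration year."""
    created = creation_date(whois_text)
    return created is not None and claimed_year < created.year


print(creation_date(SAMPLE_WHOIS))                # 2023-03-14
print(claim_predates_domain(2015, SAMPLE_WHOIS))  # True
```

A True result is a prompt for further questions rather than proof of anything: a company may have changed domains or re-registered a lapsed one.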

Domain history goes further than a live WHOIS lookup. Tools like DomainTools, who.is, and ViewDNS.info maintain historical WHOIS records — snapshots of registration data going back years. This is how you find that a domain currently owned by a limited liability company was once registered to a specific person's home address, or that a website currently presenting itself as a legitimate news outlet was previously hosting something else entirely. DomainTools' domain history documentation[2] describes exactly this kind of longitudinal view of domain ownership. The historical record doesn't disappear just because the current registration is privacy-protected.

One technique that follows naturally from WHOIS: reverse IP lookup and shared hosting analysis. A given IP address can host dozens or hundreds of websites. If you have the IP address associated with a suspicious domain, querying that IP can reveal what other domains are hosted on the same server — and the company behind those other domains may have contact information that's not hidden. This can surface connections between seemingly unrelated web properties.
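A true reverse IP lookup depends on a third-party dataset, but a simplified version of shared hosting analysis is possible with a list of candidate domains you already hold. The sketch below resolves each one and groups them by address; the function names are illustrative, and the resolver is passed in as a parameter so the grouping logic can be tried without network access.

```python
import socket
from collections import defaultdict


def group_by_ip(domains, resolve=socket.gethostbyname):
    """Map each resolved IPv4 address to the domains that point at it."""
    groups = defaultdict(list)
    for domain in domains:
        try:
            groups[resolve(domain)].append(domain)
        except OSError:
            groups[None].append(domain)   # domain did not resolve
    return dict(groups)


def shared_hosts(groups):
    """Keep only addresses that serve more than one of the candidate domains."""
    return {
        ip: names
        for ip, names in groups.items()
        if ip is not None and len(names) > 1
    }
```

Calling shared_hosts(group_by_ip(candidates)) on a list of suspect domains highlights any that sit on the same server. A shared address on a large hosting provider or content delivery network is weak evidence by itself, so a match is a lead to follow up, not a conclusion.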

Stay with this for one more step, because it pays off. WHOIS data, nameserver records, and IP lookups all live in the same general ecosystem of DNS — the domain name system. Understanding that ecosystem as a connected web of records, rather than a single lookup, is what separates a one-time check from a full domain investigation. Each record type points to others.

Now shift perspective entirely — from text and metadata to images. Reverse image search is the technique of using an image itself as the search query, rather than a text description of what you're looking for. The practical applications range from verifying whether a photograph is what it claims to be, to identifying individuals, to finding the original context of an image that's been shared without attribution.

The mechanics: you provide an image file or image URL to a search engine that has indexed the web visually, and it returns pages where the same or similar image appears. Google Lens (which replaced the original Google Images reverse search function) handles this, as does TinEye, which specializes in reverse image search and maintains its own index. Yandex's image search is a third option that, according to practitioners in open-source intelligence communities, often surfaces results that the others miss — particularly for images originating in Eastern Europe or Russia. Different engines have different indexes, and running the same image through all three takes under two minutes.
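When the image is already online, running it through all three engines can be reduced to opening three links. The sketch below builds them from an image URL. The URL patterns are assumptions based on how these sites have accepted image addresses in the past; they are not official, documented APIs and may change or stop working without notice.

```python
from urllib.parse import quote

# Assumed, unofficial URL patterns; verify each one before relying on it.
ENGINE_TEMPLATES = {
    "tineye": "https://tineye.com/search?url={image}",
    "google_lens": "https://lens.google.com/uploadbyurl?url={image}",
    "yandex": "https://yandex.com/images/search?rpt=imageview&url={image}",
}


def reverse_search_links(image_url):
    """Return one search link per engine for a publicly reachable image URL."""
    encoded = quote(image_url, safe="")   # percent-encode the whole address
    return {
        name: template.format(image=encoded)
        for name, template in ENGINE_TEMPLATES.items()
    }


for name, link in reverse_search_links("https://example.com/photo.jpg").items():
    print(name, link)
```

If a pattern has stopped working, uploading a downloaded copy through the engine's own page remains the fallback.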

Where this earns its place in serious research: a photograph claiming to show events in a specific city on a specific date may turn up on the reverse search as having been published years earlier in a completely different context. This kind of image recycling is common in misinformation. A profile photo on a social media account or a corporate website may appear on stock photo services, revealing that the "person" doesn't exist. A news story may illustrate a claim with an image that turns out to be from a different country, a different decade, or a different event entirely.

The technique has limits worth acknowledging. Reverse image search works best on photographs that have been published somewhere on the indexed web. Original images that have never been published won't return useful results. Heavy cropping, filtering, or modification of an image can defeat the matching algorithm. And some platforms don't allow image URLs to be submitted directly, requiring a downloaded copy. Despite these limitations, it's often the fastest way to answer the question "where did this image actually come from?"

Geolocation from imagery deserves a brief mention, even though it's technically a separate technique. Visual details in an image — architecture styles, road markings, vegetation, signage in partial view, the angle of shadows — can be cross-referenced against satellite imagery and street-level photography to pin down where a photo was taken. The Bellingcat collective has built a substantial body of practice around this, and Bellingcat's open-source investigation guides[3] describe methodology for geolocation and image verification in considerable detail. It's painstaking work, but it's entirely based on open sources.

The final technique in this set is metadata extraction — and this one surprises people who haven't encountered it before. Digital files are not just their content. They carry embedded data about when they were created, what software created them, what device was used, and in the case of photographs taken on smartphones, often the precise GPS coordinates of where the photo was taken.

This embedded information is called metadata, and the standard format for it in photographs is EXIF — Exchangeable Image File Format. EXIF data can include the camera make and model, the lens used, the exposure settings, the date and time of capture (including timezone in some implementations), and in GPS-enabled devices, the latitude and longitude at the moment of capture. A photograph shared as evidence of something happening somewhere can be checked against its own embedded coordinates.
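EXIF stores each GPS coordinate as degrees, minutes, and seconds plus a reference letter (N or S for latitude, E or W for longitude), while most mapping tools expect signed decimal degrees. The conversion is simple arithmetic, sketched below; the function names and the sample coordinates are illustrative.

```python
def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert degrees/minutes/seconds plus a hemisphere letter to decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # South latitudes and west longitudes are negative in decimal notation.
    return -value if ref in ("S", "W") else value


def exif_gps_to_decimal(lat_dms, lat_ref, lon_dms, lon_ref):
    """Return (latitude, longitude) as signed decimal degrees."""
    return dms_to_decimal(*lat_dms, lat_ref), dms_to_decimal(*lon_dms, lon_ref)


# Illustrative values: 48 deg 51' 29.6" N, 2 deg 17' 40.2" E
lat, lon = exif_gps_to_decimal((48, 51, 29.6), "N", (2, 17, 40.2), "E")
print(round(lat, 6), round(lon, 6))
```

The resulting pair can be pasted into any mapping service to see whether the embedded location matches the one being claimed.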

The catch — and it's an important one — is that many platforms strip EXIF data when images are uploaded or shared. Facebook, Twitter, and most major social platforms have been doing this for years, partly for privacy reasons. An image downloaded from those platforms typically won't carry the original EXIF data. But photographs shared through other channels — email attachments, direct downloads from personal websites, files from leaked document sets — may retain their metadata intact. When they do, the information can be definitive.

Tools for reading metadata are widely available. ExifTool, a free command-line program maintained by Phil Harvey, is the standard for serious work — it reads metadata from a broader range of file types than almost anything else, including documents, audio, and video in addition to images. For those who prefer browser-based tools, Jeffrey's Exif Viewer and similar sites allow drag-and-drop metadata reading without installing software. The metadata in PDF documents and Microsoft Office files is its own category, sometimes containing the author's name, the organization's internal systems information, and revision history — details that can be highly relevant in document authentication.
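Office metadata can also be read without special tools, because modern Office files (.docx, .xlsx, .pptx) are ZIP archives that normally carry their core properties in a part named docProps/core.xml. The sketch below reads a handful of those fields using only Python's standard library. The function name and field selection are illustrative, and a file that lacks the part will raise a KeyError.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the core-properties part of Office Open XML files.
NAMESPACES = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Friendly name -> element to look up (an illustrative selection).
FIELDS = {
    "author": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
    "revision": "cp:revision",
}


def read_core_properties(source):
    """Read core metadata from a .docx/.xlsx/.pptx path or file-like object."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("docProps/core.xml"))
    result = {}
    for name, tag in FIELDS.items():
        element = root.find(tag, NAMESPACES)
        result[name] = element.text if element is not None else None
    return result
```

Calling read_core_properties("report.docx") on a hypothetical file returns a dictionary in which any field the document does not record is None.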

It's worth treating EXIF and document metadata with appropriate care in both directions. On one hand, it can be compelling evidence when present. On the other hand, metadata can be edited; it is not tamper-proof. Sophisticated actors can alter timestamps and location data. Metadata that supports a claim is useful corroboration; metadata that contradicts a claim is usually more significant and demands an explanation. Missing metadata, by contrast, proves little on its own, since so many platforms strip it routinely.

Bring these five techniques together and something becomes clear: they form a kind of layered interrogation of any subject on the web. Operators surface the indexed content. The Wayback Machine surfaces the historical content. WHOIS and domain history surface the registration and ownership layers. Reverse image search surfaces the visual provenance. Metadata extraction surfaces what the files themselves say about their origins. Each layer can confirm or contradict the others, and together they constitute a research posture that's qualitatively different from typing a name into a search bar and reading the first few results.

The gap between casual search and structured open-source research isn't about access to special databases or paid subscriptions; most of what's described here is free. The gap is about knowing that these layers exist and developing the habit of checking each one systematically, which is exactly what distinguishes a verifiable finding from a hunch.

None of these techniques will, on its own, tell you whether a source is trustworthy. That requires triangulation across multiple record types, a process that belongs to the next part of this course, where the question becomes not just how to find information but how to know when you've found enough to be confident in what it means.

Sources cited

  1. The Internet Archive's Wayback Machine documentation, help.archive.org
  2. DomainTools' domain history documentation, domaintools.com
  3. Bellingcat's open-source investigation guides, bellingcat.com