unearth.wiki

Site Reconnaissance

/saɪt rɪˈkɒn.ɪ.səns/ From Military/Archaeological terminology.
Definition The mandatory preliminary phase of digital excavation. The systematic mapping of a platform's technical architecture, scale, access patterns, and defensive barriers before any preservation tools are deployed.

Narrative Provenance

Physical archaeologists do not simply walk into a valley and start digging. They survey the land, map the topography, and plan their trenches. Digital Archaeologists must do the same. "Digging" in the digital world (scraping) is an aggressive act that can trigger automated defenses or crash fragile servers.

Site Reconnaissance is the act of looking before you touch. It is the difference between a "smash and grab" raid and a scientific excavation.

The Reconnaissance Checklist

A proper site survey must answer five questions before a single script is written:

1. Architecture Assessment

Is this a static HTML site (easy), a dynamic JavaScript app (hard), or a mobile API (complex)? The architecture dictates the toolset (wget vs. Puppeteer vs. Reverse Engineering).

2. Data Typology

What are we saving? Text? Images? Videos? Relationships? Metadata? Mapping the "data types" ensures you don't accidentally preserve the post but lose the image it links to.

3. Scale Estimation

Is this 1,000 pages or 10 billion? Reconnaissance involves "test scrapes" to estimate the volume of storage required and the time needed to complete the dig before the site vanishes.

4. Access Patterns

Is the content public? Behind a login? Ephemeral? Hidden in private DMs? Identifying access barriers early allows for strategy adjustments (e.g., using "sock puppet" accounts or negotiating for database dumps).

5. Technical Barriers

Are there rate limits? CAPTCHAs? IP blocks? A reconnaissance probe tests these defenses gently to determine the maximum safe speed of excavation.

Field Notes

The Golden Rule: Do not overload the site. A dying platform is often hosted on degrading infrastructure. Aggressive scraping without reconnaissance can be the final blow that kills the site before you save it.
Reconnaissance Scraping: We recommend a "100-page sample" scrape to test tools and analyze HTML structure. If you can't scrape 100 pages reliably, you certainly can't scrape 100 million.
Stratigraphy (Related Concepts)
Stratigraphic Analysis The Triage Digital Dust Custodial Filter Excavation Techniques