Narrative Provenance
Physical archaeologists do not simply walk into a valley and start digging. They survey the land, map the topography, and plan their trenches. Digital archaeologists must do the same. "Digging" in the digital world (scraping) is an aggressive act that can trigger automated defenses or crash fragile servers.
Site Reconnaissance is the act of looking before you touch. It is the difference between a "smash and grab" raid and a scientific excavation.
The Reconnaissance Checklist
A proper site survey must answer five questions before a single script is written:
1. Architecture Assessment
Is this a static HTML site (easy), a dynamic JavaScript app (hard), or a mobile API (complex)? The architecture dictates the toolset (wget vs. Puppeteer vs. Reverse Engineering).
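A quick way to sort static sites from JavaScript apps is to compare how much of a fetched page is visible text versus inline script. The sketch below uses only the standard library; the function name and the 3:1 threshold are illustrative assumptions, not a standard.

```python
from html.parser import HTMLParser

class _TextVsScript(HTMLParser):
    """Tallies visible text characters vs. <script> payload in a page."""
    def __init__(self):
        super().__init__()
        self._in_script = False
        self.text_chars = 0
        self.script_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.script_chars += len(data.strip())
        else:
            self.text_chars += len(data.strip())

def looks_js_rendered(html: str) -> bool:
    """Crude heuristic: a page that ships far more script than visible
    text is probably rendered client-side, and wget-style tools will
    miss its content. The 3:1 ratio is an arbitrary starting point."""
    p = _TextVsScript()
    p.feed(html)
    return p.script_chars > p.text_chars * 3
```

If the heuristic flags a page, open it with JavaScript disabled and confirm by eye before committing to a headless-browser toolchain.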
2. Data Typology
What are we saving? Text? Images? Videos? Relationships? Metadata? Mapping the "data types" ensures you don't accidentally preserve the post but lose the image it links to.
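One way to map the data types on a page is to bucket every linked asset by file extension. A minimal sketch, assuming a small hand-maintained extension table (the `BUCKETS` mapping and class name are illustrative; extend both per site):

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse

# Illustrative extension buckets; a real survey would extend this per site.
BUCKETS = {
    ".jpg": "image", ".jpeg": "image", ".png": "image", ".gif": "image",
    ".mp4": "video", ".webm": "video",
    ".pdf": "document",
}

class AssetMapper(HTMLParser):
    """Collects every href/src on a page and buckets it by type, so the
    survey shows what the archive must capture besides the text."""
    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                path = urlparse(value).path.lower()
                ext = path[path.rfind("."):] if "." in path else ""
                self.counts[BUCKETS.get(ext, "page/other")] += 1
```

Feeding a sample of pages through this and printing the counter gives a first picture of how much of the site is posts versus the images and videos those posts depend on.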
3. Scale Estimation
Is this 1,000 pages or 10 billion? Reconnaissance involves "test scrapes" to estimate the volume of storage required and the time needed to complete the dig before the site vanishes.
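The extrapolation from a test scrape is simple arithmetic, but writing it down keeps the planning honest. A sketch (the function name and parameters are illustrative):

```python
def estimate_dig(sample_bytes, total_pages, seconds_per_page):
    """Extrapolate from a small test scrape to the full site.

    sample_bytes:     sizes in bytes of the pages in a test scrape
    total_pages:      estimated page count for the whole site
    seconds_per_page: polite request interval, including delays
    Returns (storage_gb, days_to_complete) -- a rough planning figure,
    not a promise.
    """
    avg = sum(sample_bytes) / len(sample_bytes)
    storage_gb = avg * total_pages / 1e9
    days = total_pages * seconds_per_page / 86_400
    return storage_gb, days
```

For example, pages averaging 200 KB across a million-page site at a two-second interval work out to roughly 200 GB of storage and about 23 days of wall-clock time, which is the kind of number that decides whether the dig is feasible before the site vanishes.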
4. Access Patterns
Is the content public? Behind a login? Ephemeral? Hidden in private DMs? Identifying access barriers early allows for strategy adjustments (e.g., using "sock puppet" accounts or negotiating for database dumps).
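A first pass at identifying access barriers can be automated by probing a handful of URLs and classifying the HTTP status codes. A hypothetical helper (real sites also soft-fail, returning 200 with a login wall in the body, so spot-check responses by hand too):

```python
def classify_access(status: int) -> str:
    """Map the HTTP status of a probe request to a coarse access
    barrier. Illustrative categories; refine per platform."""
    if status in (401, 403):
        return "login-or-auth-required"
    if status == 404:
        return "gone-or-hidden"
    if status == 429:
        return "rate-limited"
    if 200 <= status < 300:
        return "public"
    return "unknown"
```

Running this over a sample of post, profile, and media URLs quickly shows which parts of the site need credentials or negotiation and which can be excavated in the open.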
5. Technical Barriers
Are there rate limits? CAPTCHAs? IP blocks? A reconnaissance probe tests these defenses gently to determine the maximum safe speed of excavation.
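A gentle probe starts slow and speeds up only while the server stays happy. The sketch below walks a ladder of request intervals and backs off at the first 429; `fetch(url)` is a stand-in for your HTTP client, and the delay ladder is an assumption to tune per site.

```python
import time

def probe_safe_delay(fetch, url, delays=(2.0, 1.0, 0.5, 0.25)):
    """Try request intervals from slowest to fastest and return the
    fastest one that never triggered a 429. `fetch(url)` is a
    placeholder that should return an HTTP status code."""
    safe = None
    for delay in sorted(delays, reverse=True):
        for _ in range(3):           # a few requests at this pace
            status = fetch(url)
            if status == 429:
                return safe          # tripped the limit: keep the last safe pace
            time.sleep(delay)
        safe = delay
    return safe
```

The returned interval becomes the speed limit for the real excavation; if even the slowest rung trips the limiter, the site needs negotiation rather than scraping.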
Field Notes
The Golden Rule: Do not overload the site. A dying platform is often hosted on degrading infrastructure. Aggressive scraping without reconnaissance can be the final blow that kills the site before you save it.
Reconnaissance Scraping: We recommend a "100-page sample" scrape to test tools and analyze HTML structure. If you can't scrape 100 pages reliably, you certainly can't scrape 100 million.