Overview
SecWeb is your daily cartographer of the internet’s web layer. It continuously crawls and archives thousands of live websites across newly discovered domains and subdomains. It captures full HTML content, screenshots, extracted JS/API endpoints, and robots.txt metadata, enabling researchers to explore, analyze, and build attack surface intelligence at scale.
“Automated Web Reconnaissance for Bug Bounty & Security Research.”
Key Features
- Zone-Based Domain Collection: Pulls fresh domains/subdomains from daily .zone files (e.g., .in, .us, .gov) and custom inputs.
- Daily Crawl & Archive:
  - Renders and screenshots the live web page (via Playwright)
  - Saves full HTML content as .html.gz
  - Extracts all discovered URLs and saves them to -urls.txt
  - Parses robots.txt and aggregates disallowed paths
- Wordlist Generation: Uses robots.txt to generate a global robotsDisallowed.txt for smarter directory brute-forcing.
- Open Access: All data is available on GitHub for the community.
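The robots.txt aggregation behind the wordlist can be sketched as follows. This is a minimal illustration, not SecWeb's actual code: the function names `disallowed_paths` and `build_wordlist` are hypothetical, and the sketch assumes each robots.txt snapshot has already been read into a string.

```python
from typing import Iterable


def disallowed_paths(robots_txt: str) -> set[str]:
    """Return the non-empty Disallow paths found in one robots.txt body."""
    paths = set()
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments, then surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow":
            path = value.strip()
            if path:  # an empty Disallow value means "allow everything"
                paths.add(path)
    return paths


def build_wordlist(robots_bodies: Iterable[str]) -> list[str]:
    """Merge Disallow paths from many robots.txt bodies into a sorted, de-duplicated list."""
    merged = set()
    for body in robots_bodies:
        merged |= disallowed_paths(body)
    return sorted(merged)
```

Writing the result of `build_wordlist` one path per line would give a file in the spirit of robotsDisallowed.txt; the exact format SecWeb uses is not stated here.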
Directory Structure
Each scanned domain is stored in the following format:
/data/mil/
├── altus_tricare_mil.html.gz # Full HTML content (compressed)
├── altus_tricare_mil.png # Full-page screenshot
├── altus_tricare_mil-urls.txt # Extracted URLs (JS, API, links)
└── altus_tricare_mil_robotstxt.txt # robots.txt snapshot
You can browse or download historical snapshots by date or domain.
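As a rough sketch of working with this layout, the snippet below maps a domain to its four artifact paths and reads a compressed HTML snapshot. The naming rule (dots replaced by underscores, files grouped under the TLD directory) is an assumption inferred from the altus_tricare_mil example, and `snapshot_paths` and `read_html` are hypothetical helper names.

```python
import gzip
from pathlib import Path


def snapshot_paths(domain: str, data_root: str = "data") -> dict[str, Path]:
    """Map a domain to its four expected artifact paths.

    Assumption: dots become underscores and files live under the TLD
    directory, as in the altus_tricare_mil example.
    """
    tld = domain.rsplit(".", 1)[-1]
    stem = domain.replace(".", "_")
    base = Path(data_root) / tld
    return {
        "html": base / f"{stem}.html.gz",
        "screenshot": base / f"{stem}.png",
        "urls": base / f"{stem}-urls.txt",
        "robots": base / f"{stem}_robotstxt.txt",
    }


def read_html(path: Path) -> str:
    """Decompress an .html.gz snapshot and return it as text."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        return fh.read()
```

For example, `read_html(snapshot_paths("altus.tricare.mil")["html"])` would load the page shown above from a local clone, provided the assumed naming rule holds.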
GitHub Repo: https://github.com/theprojectnebulla/secweb
How It Works
- Discover Domains:
  - Parse zone files or curated target lists
  - Extract subdomains from passive sources
- Crawl Each Domain:
  - Load in headless browser
  - Save .html.gz, .png, robots.txt, and all discovered URLs
- Post-Processing:
  - Update robotsDisallowed.txt (from all robots.txt files)
  - Push updates to GitHub
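The "all discovered URLs" part of the crawl step can be approximated offline from a saved HTML page. The sketch below is an illustration under assumptions, not SecWeb's implementation: SecWeb renders pages with Playwright, whereas this uses only Python's standard-library HTML parser, so it sees URLs present in `href` and `src` attributes and misses anything assembled at runtime by JavaScript. `LinkCollector` and `extract_urls` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href/src attribute values as absolute URLs."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.urls: set[str] = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative references against the page URL.
                self.urls.add(urljoin(self.base_url, value))


def extract_urls(html: str, base_url: str) -> list[str]:
    """Return the sorted, de-duplicated URLs referenced by an HTML page."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return sorted(collector.urls)
```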
Use Cases
Use Case | SecWeb Output Used |
---|---|
Passive Endpoint Discovery | -urls.txt files |
JS Recon | JS URLs from extracted pages |
Wordlist Expansion | robotsDisallowed.txt |
Asset Fingerprinting | Screenshots + HTML |
AI Recon Training | .html.gz for model input |
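For the JS Recon row, a small filter over a -urls.txt file might look like the sketch below. It assumes the file holds one URL per line, which is not confirmed by the text above, and `js_urls` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def js_urls(lines):
    """Yield URLs whose path ends in .js, ignoring query strings and blank lines."""
    for line in lines:
        url = line.strip()
        # Compare only the path so that "app.js?v=2" still matches.
        if url and urlparse(url).path.lower().endswith(".js"):
            yield url
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, for example `list(js_urls(open("altus_tricare_mil-urls.txt", encoding="utf-8")))`.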
Access the Data
- All crawled data is available publicly: https://github.com/theprojectnebulla/secweb