How it works¶
Dead Simple Search is intentionally simple. This page explains what happens behind the scenes when you crawl a site and search it, with enough detail to help you understand the codebase if you want to contribute or customize it.
The big picture¶
The system has four main parts:
```
┌─────────────┐      ┌──────────────┐      ┌──────────────┐      ┌──────────┐
│  Flask API  │─────▶│   Crawler    │─────▶│   Indexer    │─────▶│  MySQL   │
│  (app.py)   │      │ (crawler.py) │      │ (indexer.py) │      │ Database │
└─────────────┘      └──────────────┘      └──────────────┘      └──────────┘
       │                                                              ▲
       │             ┌──────────────┐                                 │
       └────────────▶│    Search    │─────────────────────────────────┘
                     │ (search.py)  │
                     └──────────────┘
```
The API is the front door — it receives your requests and coordinates everything. The Crawler goes out and fetches web pages. The Indexer extracts useful content from those pages and stores it. The Search module queries the database and returns ranked results.
The crawl process¶
When you trigger a crawl, here's what happens step by step:
1. Robots.txt check¶
The crawler first fetches robots.txt from the target domain. This is a text file that website owners use to tell bots which parts of their site are off-limits. Dead Simple Search respects these rules — if a page is marked as "don't crawl," it won't be crawled.
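The check can be sketched with Python's standard-library `robotparser`. The rules below are made up for illustration, and the real crawler fetches the file over HTTP rather than parsing a hard-coded string:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: everything is allowed except /admin/,
# and bots are asked to wait 2 seconds between requests.
robots_txt = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("DeadSimpleSearch", "https://example.com/about"))   # True
print(parser.can_fetch("DeadSimpleSearch", "https://example.com/admin/"))  # False
print(parser.crawl_delay("DeadSimpleSearch"))                              # 2
```

The same `crawl_delay` value can feed the wait between requests described below.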
2. Sitemap discovery¶
Next, the crawler looks for sitemaps. A sitemap is an XML file that lists all the pages on a website — think of it as a table of contents. The crawler checks two places:
- The `robots.txt` file itself, which can reference sitemaps
- Common locations like `/sitemap.xml` and `/sitemap_index.xml`
If sitemaps are found, their URLs are used to seed the crawl queue. This is much more efficient than discovering pages only by following links.
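The discovery logic can be sketched roughly like this. `discover_sitemaps` is a hypothetical helper, not the actual `sitemap.py` code:

```python
def discover_sitemaps(robots_txt: str, base_url: str) -> list[str]:
    """Collect sitemap URLs referenced in robots.txt; fall back to
    the conventional default locations if none are listed."""
    sitemaps = [
        line.split(":", 1)[1].strip()
        for line in robots_txt.splitlines()
        if line.lower().startswith("sitemap:")
    ]
    if not sitemaps:
        # No Sitemap: directives found, so try the common locations.
        sitemaps = [base_url + "/sitemap.xml", base_url + "/sitemap_index.xml"]
    return sitemaps

print(discover_sitemaps("User-agent: *\nSitemap: https://example.com/all.xml",
                        "https://example.com"))
# ['https://example.com/all.xml']
```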
3. Page-by-page crawling¶
The crawler works through a queue of URLs:
- It fetches each page using an asynchronous HTTP client (this means it can handle network operations efficiently without blocking)
- It only processes HTML pages — PDFs, images, and other files are skipped
- It waits between requests (the "crawl delay") to avoid overwhelming the target server
- It extracts links from each page and adds new, unvisited ones to the queue
- It stays within the original domain — it won't follow links to other websites
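The queue logic above can be sketched in simplified, synchronous form. The real crawler is asynchronous, fetches pages over HTTP, and honors the crawl delay; here `fetch_links` is a stand-in for the link-extraction step so the example runs offline:

```python
from urllib.parse import urljoin, urlparse

def crawl_order(start_url, fetch_links):
    """Breadth-first crawl sketch: visit queued URLs, enqueue new
    same-domain links, and never revisit a URL."""
    domain = urlparse(start_url).netloc
    queue, seen = [start_url], {start_url}
    visited = []
    while queue:
        url = queue.pop(0)
        visited.append(url)
        for link in fetch_links(url):
            absolute = urljoin(url, link)
            # Stay within the original domain and skip already-seen URLs.
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited

# A tiny fake site: the homepage links to /a and to an external site.
fake_site = {
    "https://example.com/": ["/a", "https://other.org/x"],
    "https://example.com/a": ["/"],
}
print(crawl_order("https://example.com/", lambda u: fake_site.get(u, [])))
# ['https://example.com/', 'https://example.com/a']
```

Note that the external link to `other.org` is never followed, and the back-link from `/a` to the homepage is ignored because it was already seen.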
4. Indexing¶
For each page, the indexer extracts:
- Title — from the `<title>` HTML tag
- Meta description — the short summary that often appears in search results
- H1 and H2 headings — the main headings on the page
- Body text — all the visible text, with scripts, navigation, and other non-content elements removed
- Language — detected from the HTML `lang` attribute or guessed from the text itself
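The extraction step can be sketched with the standard library's `HTMLParser` (the real indexer uses a full HTML parsing library, and this minimal version handles only titles and headings):

```python
from html.parser import HTMLParser

class PageExtractor(HTMLParser):
    """Minimal sketch: collect the <title> text and all h1/h2 headings."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []
        self._current = None  # tag we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1", "h2"):
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data.strip()
        elif self._current in ("h1", "h2"):
            self.headings.append(data.strip())

extractor = PageExtractor()
extractor.feed("<html><head><title>Hello</title></head>"
               "<body><h1>Welcome</h1><p>Text</p></body></html>")
print(extractor.title, extractor.headings)  # Hello ['Welcome']
```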
This data is stored in MySQL using an "upsert" pattern: if the page already exists in the database (based on its URL), the record is updated; otherwise, a new record is created.
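In MySQL, the upsert pattern is typically expressed with `INSERT ... ON DUPLICATE KEY UPDATE`, which relies on a unique key on the URL column. The column names below are illustrative, not the actual schema:

```sql
-- Sketch of the upsert: insert a new row, or refresh the existing one
-- if a row with this URL (unique key) already exists.
INSERT INTO pages (url, title, description, headings, body)
VALUES (%s, %s, %s, %s, %s)
ON DUPLICATE KEY UPDATE
    title       = VALUES(title),
    description = VALUES(description),
    headings    = VALUES(headings),
    body        = VALUES(body);
```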
The search process¶
When you send a search query:
1. Mode detection — the search module checks if your query contains special operators like `+`, `-`, or `"`. If it does, it uses MySQL's "boolean mode," which gives you more control. Otherwise, it uses "natural language mode," which is simpler and works well for everyday searches.
2. Full-text matching — MySQL's FULLTEXT index is the engine behind the search. It searches across the title, meta description, headings, and body text simultaneously. MySQL calculates a relevance score for each matching page.
3. Ranking — results are sorted by relevance, best matches first. The relevance score takes into account things like how often the search terms appear and where they appear (a match in the title is worth more than a match buried in the body text).
4. Pagination — results are returned in pages (20 results at a time by default, configurable up to 100).
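The mode-detection and query-building steps can be sketched as follows. These are hypothetical helpers, not the actual `search.py` code, and the column names are illustrative:

```python
def detect_mode(query: str) -> str:
    """Boolean mode if the query uses operators, else natural language mode."""
    if any(op in query for op in ('+', '-', '"')):
        return "IN BOOLEAN MODE"
    return "IN NATURAL LANGUAGE MODE"

def build_search_sql(query: str, page: int = 1, per_page: int = 20):
    """Build a MATCH ... AGAINST query with relevance ordering and paging."""
    mode = detect_mode(query)
    per_page = min(per_page, 100)  # cap per the docs' configurable limit
    sql = (
        "SELECT url, title, "
        f"MATCH(title, description, headings, body) AGAINST (%s {mode}) AS score "
        "FROM pages "
        f"WHERE MATCH(title, description, headings, body) AGAINST (%s {mode}) "
        "ORDER BY score DESC LIMIT %s OFFSET %s"
    )
    return sql, (query, query, per_page, (page - 1) * per_page)

print(detect_mode("python tutorial"))  # IN NATURAL LANGUAGE MODE
print(detect_mode('+python -snake'))   # IN BOOLEAN MODE
```

Interpolating the mode string into the SQL is safe here because it comes from a fixed internal set, while the user's query itself is always passed as a bound parameter.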
The database¶
Dead Simple Search uses three tables:
sites — one row per registered website. Stores the domain, start URL, and whether automatic crawling is enabled.
pages — one row per indexed page. This is where all the extracted content lives. It has a FULLTEXT index across the title, description, headings, and body text, which is what makes search fast.
crawl_log — a history of all crawl runs. Each entry records when the crawl started and finished, how many pages were crawled, and whether it succeeded.
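For orientation, the `pages` table might look something like the sketch below. Column names and types are illustrative only; see `database.py` for the actual schema:

```sql
-- Illustrative sketch of the pages table (not the real DDL).
CREATE TABLE pages (
    id          INT AUTO_INCREMENT PRIMARY KEY,
    site_id     INT NOT NULL,
    url         VARCHAR(2048) NOT NULL,
    title       TEXT,
    description TEXT,
    headings    TEXT,
    body        MEDIUMTEXT,
    lang        VARCHAR(8),
    UNIQUE KEY uq_url (url(255)),                               -- enables the upsert
    FULLTEXT KEY ft_content (title, description, headings, body) -- powers search
);
```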
File structure¶
The codebase is small — about 800 lines of Python across 8 files:
| File | Purpose |
|---|---|
| `app.py` | The Flask web application and all API endpoints |
| `config.py` | Configuration via environment variables |
| `database.py` | MySQL connection pool and table creation |
| `crawler.py` | The async web crawler |
| `indexer.py` | HTML parsing and database storage |
| `sitemap.py` | Sitemap discovery and XML parsing |
| `search.py` | Full-text search logic |
| `scheduler.py` | Optional scheduled re-crawling |
Design principles¶
A few principles guide the project:
Boring technology. Python, Flask, and MySQL are mature, well-documented, and widely supported. You can find help on any search engine, forum, or chat room.
No magic. All SQL is hand-written. There's no ORM (object-relational mapper) hiding what's happening. When you read the code, you see exactly what queries are running.
Small surface area. The codebase is deliberately small. Every module fits on a screen or two. There are no deep abstraction layers to navigate.
Pragmatic trade-offs. The snippet in search results is the first 300 characters of body text — not a contextual window around the matched terms. Is that ideal? No. Is it simple, fast, and good enough for most cases? Yes.