HTML to Markdown Conversion
This document describes the HTML parsing and Markdown conversion process that transforms DeepWiki's HTML pages into clean, portable Markdown files. This is a core component of Phase 1 (Markdown Extraction) in the three-phase pipeline.
For information about the diagram enhancement that occurs after this conversion, see Phase 2: Diagram Enhancement. For details on how the wiki structure is discovered before this conversion begins, see Wiki Structure Discovery.
Purpose and Scope
The HTML to Markdown conversion process takes raw HTML fetched from DeepWiki.com and transforms it into clean Markdown files suitable for processing by mdBook. This conversion must handle several challenges:
- Extract only content, removing DeepWiki's UI elements and navigation
- Preserve the semantic structure (headings, lists, code blocks)
- Convert internal wiki links to relative Markdown file paths
- Remove DeepWiki-specific footer content
- Handle hierarchical link relationships between main pages and subsections
Conversion Pipeline Overview
Conversion Flow: HTML to Clean Markdown
Sources: tools/deepwiki-scraper.py:453-594
HTML Parsing and Content Extraction
BeautifulSoup Content Location Strategy
The system uses a multi-strategy approach to locate the main content area, trying selectors in order of specificity:
Content Locator Strategies
```mermaid
flowchart LR
    Start["extract_page_content()"]
    Strat1["Try CSS Selectors\narticle, main, .wiki-content"]
    Strat2["Try Role Attribute\nrole='main'"]
    Strat3["Fallback: body Element"]
    Success["Content Found"]
    Error["Raise Exception"]
    Start --> Strat1
    Strat1 -->|Found| Success
    Strat1 -->|Not Found| Strat2
    Strat2 -->|Found| Success
    Strat2 -->|Not Found| Strat3
    Strat3 -->|Found| Success
    Strat3 -->|Not Found| Error
```
The system attempts these selectors in sequence:
| Priority | Selector Type | Selector Value | Purpose |
|---|---|---|---|
| 1 | CSS | article, main, .wiki-content, .content, #content, .markdown-body | Semantic HTML5 content containers |
| 2 | Attribute | role="main" | ARIA landmark for main content |
| 3 | Fallback | body | Last resort - entire body element |
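A minimal sketch of this fallback chain, assuming a parsed BeautifulSoup soup object; the helper name locate_content is illustrative, not the scraper's actual function:

```python
from bs4 import BeautifulSoup

def locate_content(soup: BeautifulSoup):
    """Return the main content element, trying selectors by specificity."""
    # Priority 1: semantic/CSS content containers
    for selector in ("article", "main", ".wiki-content", ".content",
                     "#content", ".markdown-body"):
        content = soup.select_one(selector)
        if content:
            return content
    # Priority 2: ARIA landmark for main content
    content = soup.find(attrs={"role": "main"})
    if content:
        return content
    # Priority 3: fall back to the entire body element
    if soup.body:
        return soup.body
    raise Exception("Could not locate main content area")
```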
Sources: tools/deepwiki-scraper.py:469-487
UI Element Removal
The conversion process removes several categories of unwanted elements before processing:
Structural Element Removal
The following element types are removed wholesale using elem.decompose():
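The exact tag list is not reproduced here; a representative sketch, assuming the usual navigation chrome and non-content tags, would look like:

```python
def remove_structural_elements(soup) -> None:
    # Tag list is an assumption: navigation chrome plus non-content tags.
    for tag_name in ("nav", "header", "footer", "script", "style"):
        for elem in soup.find_all(tag_name):
            elem.decompose()  # remove the element and all of its children
```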
Text-Based UI Element Removal
DeepWiki-specific UI elements are identified by text content patterns:
| Pattern | Purpose | Max Length Filter |
|---|---|---|
| Index your code with Devin | AI indexing prompt | < 200 chars |
| Edit Wiki | Edit button | < 200 chars |
| Last indexed: | Metadata display | < 200 chars |
| View this search on DeepWiki | Search link | < 200 chars |
The length filter prevents accidental removal of paragraph content that happens to contain these phrases.
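A sketch of the text-based filter, assuming the patterns and length threshold from the table above; the helper name is illustrative:

```python
UI_TEXT_PATTERNS = (
    "Index your code with Devin",
    "Edit Wiki",
    "Last indexed:",
    "View this search on DeepWiki",
)

def remove_ui_text_elements(soup) -> None:
    for text_node in soup.find_all(string=True):
        text = text_node.strip()
        # The length filter keeps real paragraphs that merely mention a phrase.
        if len(text) < 200 and any(p in text for p in UI_TEXT_PATTERNS):
            if text_node.parent:
                text_node.parent.decompose()
```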
Sources: tools/deepwiki-scraper.py:466-500
Navigation List Detection
The system automatically detects and removes navigation lists using heuristics:
Navigation List Detection Algorithm
```mermaid
flowchart TD
    FindUL["Find all <ul> elements"]
    CountLinks["Count <a> tags"]
    Check5["links.length > 5?"]
    CountInternal["Count internal links\nhref starts with '/'"]
    Check80["wiki_links > 80% of links?"]
    Remove["ul.decompose()"]
    Keep["Keep element"]
    FindUL --> CountLinks
    CountLinks --> Check5
    Check5 -->|Yes| CountInternal
    Check5 -->|No| Keep
    CountInternal --> Check80
    Check80 -->|Yes| Remove
    Check80 -->|No| Keep
```
This heuristic successfully identifies table of contents lists and navigation menus while preserving legitimate bulleted lists in the content.
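A sketch of the heuristic, assuming a parsed soup; the thresholds mirror the flowchart above:

```python
def remove_navigation_lists(soup) -> None:
    for ul in soup.find_all("ul"):
        links = ul.find_all("a")
        if len(links) <= 5:
            continue  # short lists are kept as content
        internal = [a for a in links if (a.get("href") or "").startswith("/")]
        if len(internal) > 0.8 * len(links):
            ul.decompose()  # mostly internal links: treat as navigation
```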
Sources: tools/deepwiki-scraper.py:502-511
html2text Conversion Configuration
The core conversion uses the html2text library with specific configuration to ensure clean output:
html2text Configuration
Key Configuration Decisions
| Setting | Value | Rationale |
|---|---|---|
| ignore_links | False | Links must be preserved so they can be rewritten to relative paths |
| body_width | 0 | Disables line wrapping, which would interfere with diagram matching in Phase 2 |
The body_width=0 setting is particularly important because Phase 2's fuzzy matching algorithm compares text chunks from the JavaScript payload with the converted Markdown. Line wrapping would cause mismatches.
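A minimal configuration sketch using the html2text API with the settings above; the wrapper shown here is illustrative:

```python
import html2text

def convert_html_to_markdown(html: str) -> str:
    h = html2text.HTML2Text()
    h.ignore_links = False  # keep links so they can be rewritten to relative paths
    h.body_width = 0        # no line wrapping; Phase 2 fuzzy matching relies on this
    return h.handle(html)
```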
Sources: tools/deepwiki-scraper.py:175-190
DeepWiki Footer Cleaning
After html2text conversion, the system removes DeepWiki-specific footer content using pattern matching.
Footer Detection Patterns
The clean_deepwiki_footer() function uses compiled regex patterns to identify footer content:
Footer Pattern Table
| Pattern | Example Match | Purpose |
|---|---|---|
| ^\s*Dismiss\s*$ | "Dismiss" | Modal dismiss button |
| Refresh this wiki | "Refresh this wiki" | Refresh action link |
| This wiki was recently refreshed | Full phrase | Status message |
| ###\s*On this page | "### On this page" | TOC heading |
| Please wait \d+ days? to refresh | "Please wait 7 days" | Rate limit message |
| You can refresh again in | Full phrase | Alternative rate limit |
| ^\s*View this search on DeepWiki | Full phrase | Search link |
| ^\s*Edit Wiki\s*$ | "Edit Wiki" | Edit action |
Footer Scanning Algorithm
The backward scan ensures the earliest footer indicator is found, preventing content loss if footer elements are scattered.
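A sketch of clean_deepwiki_footer(), assuming the patterns from the table are compiled up front; the exact scan logic is an assumption based on the description above:

```python
import re

FOOTER_PATTERNS = [re.compile(p) for p in (
    r"^\s*Dismiss\s*$",
    r"Refresh this wiki",
    r"This wiki was recently refreshed",
    r"###\s*On this page",
    r"Please wait \d+ days? to refresh",
    r"You can refresh again in",
    r"^\s*View this search on DeepWiki",
    r"^\s*Edit Wiki\s*$",
)]

def clean_deepwiki_footer(markdown: str) -> str:
    lines = markdown.split("\n")
    cut = len(lines)
    # Scan backward so the earliest footer indicator wins, even when
    # footer lines are interleaved with blank lines.
    for i in range(len(lines) - 1, -1, -1):
        if any(p.search(lines[i]) for p in FOOTER_PATTERNS):
            cut = i
    return "\n".join(lines[:cut]).rstrip() + "\n"
```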
Sources: tools/deepwiki-scraper.py:127-173
Internal Link Rewriting
The most complex part of the conversion is rewriting internal wiki links to relative Markdown file paths. Links must account for the hierarchical directory structure where subsections are placed in subdirectories.
Link Rewriting Logic
The fix_wiki_link() function handles the following cases based on source and target locations:
Link Rewriting Decision Matrix
| Source Location | Target Location | Relative Path Format | Example |
|---|---|---|---|
| Main page | Main page | {file_num}-{slug}.md | 2-overview.md |
| Main page | Subsection | section-{main}/{file_num}-{slug}.md | section-2/2-1-details.md |
| Subsection | Same section subsection | {file_num}-{slug}.md | 2-2-more.md |
| Subsection | Main page | ../{file_num}-{slug}.md | ../3-next.md |
| Subsection | Different section | ../section-{main}/{file_num}-{slug}.md | ../section-3/3-1-sub.md |
Link Rewriting Flow
Link Pattern Matching
The link rewriting uses regex substitution on Markdown link syntax:
The regex captures only the page-slug portion after the repository path, which is then processed by fix_wiki_link().
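A combined sketch of the substitution and the path logic from the decision matrix. The regex, the page-lookup structure, and the parameter shapes are assumptions; only the fix_wiki_link() name comes from the scraper:

```python
import re

def fix_wiki_link(slug: str, text: str, current: dict, pages: dict) -> str:
    """Return a Markdown link whose target is a relative .md path."""
    target = pages[slug]  # e.g. {"file_num": "2-1", "slug": "details", "parent": "2"}
    filename = f"{target['file_num']}-{target['slug']}.md"
    src_sub = current.get("parent") is not None
    tgt_sub = target.get("parent") is not None

    if not src_sub and not tgt_sub:
        path = filename                                      # main -> main
    elif not src_sub and tgt_sub:
        path = f"section-{target['parent']}/{filename}"      # main -> subsection
    elif src_sub and not tgt_sub:
        path = f"../{filename}"                              # subsection -> main
    elif current["parent"] == target["parent"]:
        path = filename                                      # same section
    else:
        path = f"../section-{target['parent']}/{filename}"   # different section
    return f"[{text}]({path})"

def rewrite_links(markdown: str, current: dict, pages: dict) -> str:
    # Match [text](/owner/repo/page-slug) and keep only the page-slug portion.
    pattern = re.compile(r"\[([^\]]+)\]\(/[^/)]+/[^/)]+/([^)#]+)\)")
    return pattern.sub(
        lambda m: fix_wiki_link(m.group(2), m.group(1), current, pages),
        markdown,
    )
```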
Sources: tools/deepwiki-scraper.py:547-592
Post-Conversion Cleanup
After all conversions and transformations, final cleanup removes artifacts:
Duplicate Title Removal
Duplicate Title Detection
This cleanup handles cases where DeepWiki includes the page title multiple times in the rendered HTML.
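A sketch of this cleanup, assuming the page title is known to the caller; the exact rules shown are illustrative:

```python
def remove_duplicate_titles(markdown: str, title: str) -> str:
    cleaned, seen_title = [], False
    for line in markdown.split("\n"):
        stripped = line.strip()
        if stripped in (f"# {title}", title):
            if seen_title:
                continue          # drop repeated titles
            seen_title = True
        if stripped == "Menu":
            continue              # stray navigation label left by conversion
        cleaned.append(line)
    return "\n".join(cleaned)
```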
Sources: tools/deepwiki-scraper.py:525-545
Integration with Extract Page Content
The complete content extraction flow shows how all components work together:
extract_page_content() Complete Flow
```mermaid
flowchart TD
    Start["extract_page_content(url, session, current_page_info)"]
    Fetch["fetch_page(url, session)\nHTTP GET with retries"]
    Parse["BeautifulSoup(response.text)"]
    RemoveNav["Remove nav/header/footer"]
    FindContent["Locate main content area"]
    RemoveUI["Remove DeepWiki UI elements"]
    RemoveLists["Remove navigation lists"]
    ToStr["str(content)"]
    Convert["convert_html_to_markdown(html)"]
    CleanUp["Remove duplicate titles/Menu"]
    FixLinks["Rewrite internal links\nusing current_page_info"]
    Return["Return markdown string"]
    Start --> Fetch
    Fetch --> Parse
    Parse --> RemoveNav
    RemoveNav --> FindContent
    FindContent --> RemoveUI
    RemoveUI --> RemoveLists
    RemoveLists --> ToStr
    ToStr --> Convert
    Convert --> CleanUp
    CleanUp --> FixLinks
    FixLinks --> Return
```
The current_page_info parameter provides context about the source page's location in the hierarchy, which is essential for generating correct relative link paths.
Sources: tools/deepwiki-scraper.py:453-594
Error Handling and Retries
HTTP Fetch with Retries
The fetch_page() function retries failed requests with a fixed delay between attempts:
| Attempt | Action | Delay |
|---|---|---|
| 1 | Try request | None |
| 2 | Retry after error | 2 seconds |
| 3 | Final retry | 2 seconds |
| Fail | Raise exception | N/A |
Browser-like headers are used to avoid being blocked:
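A sketch of fetch_page() combining browser-like headers with the retry schedule above; the header strings and timeout are illustrative assumptions:

```python
import time
import requests

HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
}

def fetch_page(url: str, session: requests.Session, retries: int = 3) -> requests.Response:
    for attempt in range(1, retries + 1):
        try:
            response = session.get(url, headers=HEADERS, timeout=30)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == retries:
                raise          # final attempt failed: propagate the error
            time.sleep(2)      # fixed 2-second delay before the next attempt
```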
Sources: tools/deepwiki-scraper.py:27-42
Rate Limiting
To be respectful to the DeepWiki server, the main extraction loop includes a 1-second delay between page requests:
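For illustration, the delay amounts to a single call at the end of each loop iteration (assuming the standard time module):

```python
time.sleep(1)  # pause between page requests to avoid hammering DeepWiki
```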
This appears in the main loop after each successful page extraction.
Sources: tools/deepwiki-scraper.py:872
Dependencies
The HTML to Markdown conversion relies on three key Python libraries:
| Library | Version | Purpose |
|---|---|---|
| requests | ≥2.31.0 | HTTP requests with session management |
| beautifulsoup4 | ≥4.12.0 | HTML parsing and element manipulation |
| html2text | ≥2020.1.16 | HTML to Markdown conversion |
Sources: tools/requirements.txt:1-3
Output Characteristics
The Markdown files produced by this conversion have these properties:
- No line wrapping: Original formatting preserved (body_width=0)
- Clean structure: No UI elements or navigation
- Relative links: All internal links point to local .md files
- Title guarantee: Every file starts with an H1 heading
- Hierarchy-aware: Links account for subdirectory structure
- Footer-free: DeepWiki-specific footer content removed
These characteristics make the files suitable for Phase 2 diagram enhancement and Phase 3 mdBook building without further modification.