
HTML to Markdown Conversion


This document describes the HTML parsing and Markdown conversion process that transforms DeepWiki's HTML pages into clean, portable Markdown files. This is a core component of Phase 1 (Markdown Extraction) in the three-phase pipeline.

For information about the diagram enhancement that occurs after this conversion, see Phase 2: Diagram Enhancement. For details on how the wiki structure is discovered before this conversion begins, see Wiki Structure Discovery.

Purpose and Scope

The HTML to Markdown conversion process takes raw HTML fetched from DeepWiki.com and transforms it into clean Markdown files suitable for processing by mdBook. This conversion must handle several challenges:

  • Extract only content, removing DeepWiki's UI elements and navigation
  • Preserve the semantic structure (headings, lists, code blocks)
  • Convert internal wiki links to relative Markdown file paths
  • Remove DeepWiki-specific footer content
  • Handle hierarchical link relationships between main pages and subsections

Conversion Pipeline Overview

Conversion Flow: HTML to Clean Markdown

Sources: tools/deepwiki-scraper.py:453-594

HTML Parsing and Content Extraction

BeautifulSoup Content Location Strategy

The system uses a multi-strategy approach to locate the main content area, trying selectors in order of specificity:

Content Locator Strategies

```mermaid
flowchart LR
    Start["extract_page_content()"]
    Strat1["Try CSS Selectors\narticle, main, .wiki-content"]
    Strat2["Try Role Attribute\nrole='main'"]
    Strat3["Fallback: body Element"]
    Success["Content Found"]
    Error["Raise Exception"]

    Start --> Strat1
    Strat1 -->|Found| Success
    Strat1 -->|Not Found| Strat2
    Strat2 -->|Found| Success
    Strat2 -->|Not Found| Strat3
    Strat3 -->|Found| Success
    Strat3 -->|Not Found| Error
```

The system attempts these selectors in sequence:

| Priority | Selector Type | Selector Value | Purpose |
|---|---|---|---|
| 1 | CSS | `article`, `main`, `.wiki-content`, `.content`, `#content`, `.markdown-body` | Semantic HTML5 content containers |
| 2 | Attribute | `role="main"` | ARIA landmark for main content |
| 3 | Fallback | `body` | Last resort: entire body element |
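
A minimal sketch of this fallback chain, assuming BeautifulSoup; the helper name find_main_content is illustrative, not the scraper's actual function:

```python
from bs4 import BeautifulSoup

def find_main_content(soup: BeautifulSoup):
    """Locate the main content area, trying selectors from most to least specific."""
    # Strategy 1: semantic containers and common content CSS classes
    for selector in ("article", "main", ".wiki-content", ".content",
                     "#content", ".markdown-body"):
        content = soup.select_one(selector)
        if content:
            return content

    # Strategy 2: ARIA landmark for the main content region
    content = soup.find(attrs={"role": "main"})
    if content:
        return content

    # Strategy 3: last resort, use the entire <body>
    if soup.body:
        return soup.body

    raise RuntimeError("No content area found in page")
```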

Sources: tools/deepwiki-scraper.py:469-487

UI Element Removal

The conversion process removes several categories of unwanted elements before processing:

Structural Element Removal

Structural elements such as nav, header, and footer are removed wholesale using elem.decompose().
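
A minimal sketch of this step, assuming a parsed soup object; nav, header, and footer come from the extraction flow shown later, while script and style are assumed additions:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")  # html: the fetched page markup

# Remove structural chrome wholesale; including script/style here is an assumption.
for tag_name in ("nav", "header", "footer", "script", "style"):
    for elem in soup.find_all(tag_name):
        elem.decompose()  # deletes the element and everything inside it
```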

Text-Based UI Element Removal

DeepWiki-specific UI elements are identified by text content patterns:

| Pattern | Purpose | Max Length Filter |
|---|---|---|
| `Index your code with Devin` | AI indexing prompt | < 200 chars |
| `Edit Wiki` | Edit button | < 200 chars |
| `Last indexed:` | Metadata display | < 200 chars |
| `View this search on DeepWiki` | Search link | < 200 chars |

The length filter prevents accidental removal of paragraph content that happens to contain these phrases.
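
A sketch of this filter, assuming content is the located content element; the iteration strategy is illustrative:

```python
# Text fragments that identify DeepWiki UI widgets (see the table above).
UI_TEXT_PATTERNS = (
    "Index your code with Devin",
    "Edit Wiki",
    "Last indexed:",
    "View this search on DeepWiki",
)

for elem in content.find_all(True):          # every descendant tag
    if elem.decomposed:                      # already removed via an ancestor
        continue
    text = elem.get_text(strip=True)
    # The 200-character cap keeps real paragraphs that merely mention a phrase.
    if len(text) < 200 and any(pattern in text for pattern in UI_TEXT_PATTERNS):
        elem.decompose()
```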

Sources: tools/deepwiki-scraper.py:466-500

Navigation List Removal

The system automatically detects and removes navigation lists using heuristics:

Navigation List Detection Algorithm

```mermaid
flowchart TD
    FindUL["Find all <ul> elements"]
    CountLinks["Count <a> tags"]
    Check5["links.length > 5?"]
    CountInternal["Count internal links\nhref starts with '/'"]
    Check80["wiki_links > 80% of links?"]
    Remove["ul.decompose()"]
    Keep["Keep element"]

    FindUL --> CountLinks
    CountLinks --> Check5
    Check5 -->|Yes| CountInternal
    Check5 -->|No| Keep
    CountInternal --> Check80
    Check80 -->|Yes| Remove
    Check80 -->|No| Keep
```

This heuristic successfully identifies table of contents lists and navigation menus while preserving legitimate bulleted lists in the content.
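
A sketch of the heuristic, assuming content is the located content element:

```python
# A <ul> that is mostly internal links is navigation, not content.
for ul in content.find_all("ul"):
    links = ul.find_all("a")
    if len(links) <= 5:
        continue  # short lists are kept; they are likely legitimate content
    internal = [a for a in links if (a.get("href") or "").startswith("/")]
    if len(internal) > 0.8 * len(links):
        ul.decompose()  # table of contents or navigation menu
```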

Sources: tools/deepwiki-scraper.py:502-511

html2text Conversion Configuration

The core conversion uses the html2text library with specific configuration to ensure clean output:

html2text Configuration
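
A sketch of the configuration described below; the scraper may set additional options beyond these two:

```python
import html2text

converter = html2text.HTML2Text()
converter.ignore_links = False   # keep links so they can be rewritten to relative paths
converter.body_width = 0         # disable wrapping; Phase 2 fuzzy matching depends on it
markdown = converter.handle(html)  # html: the cleaned content markup
```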

Key Configuration Decisions

| Setting | Value | Rationale |
|---|---|---|
| `ignore_links` | `False` | Links must be preserved so they can be rewritten to relative paths |
| `body_width` | `0` | Disables line wrapping, which would interfere with diagram matching in Phase 2 |

The body_width=0 setting is particularly important because Phase 2's fuzzy matching algorithm compares text chunks from the JavaScript payload with the converted Markdown. Line wrapping would cause mismatches.

Sources: tools/deepwiki-scraper.py:175-190

Footer Removal

After html2text conversion, the system removes DeepWiki-specific footer content using pattern matching.

The clean_deepwiki_footer() function uses compiled regex patterns to identify footer content:

Footer Pattern Table

| Pattern | Example Match | Purpose |
|---|---|---|
| `^\s*Dismiss\s*$` | "Dismiss" | Modal dismiss button |
| `Refresh this wiki` | "Refresh this wiki" | Refresh action link |
| `This wiki was recently refreshed` | Full phrase | Status message |
| `###\s*On this page` | "### On this page" | TOC heading |
| `Please wait \d+ days? to refresh` | "Please wait 7 days" | Rate limit message |
| `You can refresh again in` | Full phrase | Alternative rate limit |
| `^\s*View this search on DeepWiki` | Full phrase | Search link |
| `^\s*Edit Wiki\s*$` | "Edit Wiki" | Edit action |

Footer Scanning Algorithm

The backward scan ensures the earliest footer indicator is found, preventing content loss if footer elements are scattered.
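
A sketch of clean_deepwiki_footer() using the patterns from the table above; the signature and the unbounded scan are assumptions (the real implementation may restrict the scan to the tail of the document):

```python
import re

FOOTER_PATTERNS = [re.compile(p) for p in (
    r"^\s*Dismiss\s*$",
    r"Refresh this wiki",
    r"This wiki was recently refreshed",
    r"###\s*On this page",
    r"Please wait \d+ days? to refresh",
    r"You can refresh again in",
    r"^\s*View this search on DeepWiki",
    r"^\s*Edit Wiki\s*$",
)]

def clean_deepwiki_footer(markdown: str) -> str:
    lines = markdown.split("\n")
    cut = len(lines)
    # Scan bottom-up; the last update wins, so cut ends at the earliest footer line.
    for i in range(len(lines) - 1, -1, -1):
        if any(p.search(lines[i]) for p in FOOTER_PATTERNS):
            cut = i
    return "\n".join(lines[:cut]).rstrip() + "\n"
```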

Sources: tools/deepwiki-scraper.py:127-173

Internal Link Rewriting

The most complex part of the conversion is rewriting internal wiki links to relative Markdown file paths. Links must account for the hierarchical directory structure where subsections are placed in subdirectories.

The fix_wiki_link() function handles the following cases based on the source and target locations:

Link Rewriting Decision Matrix

| Source Location | Target Location | Relative Path Format | Example |
|---|---|---|---|
| Main page | Main page | `{file_num}-{slug}.md` | `2-overview.md` |
| Main page | Subsection | `section-{main}/{file_num}-{slug}.md` | `section-2/2-1-details.md` |
| Subsection | Subsection (same section) | `{file_num}-{slug}.md` | `2-2-more.md` |
| Subsection | Main page | `../{file_num}-{slug}.md` | `../3-next.md` |
| Subsection | Subsection (different section) | `../section-{main}/{file_num}-{slug}.md` | `../section-3/3-1-sub.md` |

Link Rewriting Flow

The link rewriting uses regex substitution on Markdown link syntax:
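
A sketch of the rewrite covering the decision matrix above. The page-info dictionaries, the pages_by_num lookup, and the URL shape matched by the regex are all illustrative assumptions, not the scraper's actual data structures:

```python
import re

def rewrite_wiki_links(markdown, current_page, pages_by_num):
    """Rewrite absolute DeepWiki links to relative .md paths.

    Illustrative shapes: current_page = {"num": "2-1", "slug": "details"},
    pages_by_num keyed by the page number string, e.g. "2-1".
    """
    def fix_wiki_link(match):
        text, target_num = match.group(1), match.group(2)
        target = pages_by_num.get(target_num)
        if target is None:
            return match.group(0)  # unknown target: leave the link untouched

        filename = f"{target['num']}-{target['slug']}.md"
        src_is_sub = "-" in current_page["num"]
        dst_is_sub = "-" in target["num"]
        dst_main = target["num"].split("-")[0]

        if not src_is_sub:
            # Main page -> main page, or main page -> subsection
            path = f"section-{dst_main}/{filename}" if dst_is_sub else filename
        else:
            src_main = current_page["num"].split("-")[0]
            if dst_is_sub and dst_main == src_main:
                path = filename                        # same section subdirectory
            elif not dst_is_sub:
                path = f"../{filename}"                # back up to the book root
            else:
                path = f"../section-{dst_main}/{filename}"
        return f"[{text}]({path})"

    # Assumed link shape: [text](/owner/repo/2-1-details); captures "2-1"
    pattern = re.compile(r"\[([^\]]+)\]\(/[^)/]+/[^)/]+/(\d+(?:-\d+)*)-[^)]*\)")
    return pattern.sub(fix_wiki_link, markdown)
```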

The regex captures only the page-slug portion after the repository path, which is then processed by fix_wiki_link().

Sources: tools/deepwiki-scraper.py:547-592

Post-Conversion Cleanup

After all conversions and transformations, final cleanup removes artifacts:

Duplicate Title Removal

Duplicate Title Detection

This cleanup handles cases where DeepWiki includes the page title multiple times in the rendered HTML.
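
A sketch of the idea, assuming the title is the first H1 in the converted Markdown; the helper name is illustrative:

```python
def remove_duplicate_title(markdown: str) -> str:
    """Keep the first copy of the leading H1 and drop any repeats of it."""
    lines = markdown.split("\n")
    title = next((line for line in lines if line.startswith("# ")), None)
    if title is None:
        return markdown
    seen = False
    cleaned = []
    for line in lines:
        if line.strip() == title.strip():
            if seen:
                continue          # skip repeated copies of the title
            seen = True
        cleaned.append(line)
    return "\n".join(cleaned)
```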

Sources: tools/deepwiki-scraper.py:525-545

Integration with Extract Page Content

The complete content extraction flow shows how all components work together:

extract_page_content() Complete Flow

```mermaid
flowchart TD
    Start["extract_page_content(url, session, current_page_info)"]
    Fetch["fetch_page(url, session)\nHTTP GET with retries"]
    Parse["BeautifulSoup(response.text)"]
    RemoveNav["Remove nav/header/footer"]
    FindContent["Locate main content area"]
    RemoveUI["Remove DeepWiki UI elements"]
    RemoveLists["Remove navigation lists"]
    ToStr["str(content)"]
    Convert["convert_html_to_markdown(html)"]
    CleanUp["Remove duplicate titles/Menu"]
    FixLinks["Rewrite internal links\nusing current_page_info"]
    Return["Return markdown string"]

    Start --> Fetch
    Fetch --> Parse
    Parse --> RemoveNav
    RemoveNav --> FindContent
    FindContent --> RemoveUI
    RemoveUI --> RemoveLists
    RemoveLists --> ToStr
    ToStr --> Convert
    Convert --> CleanUp
    CleanUp --> FixLinks
    FixLinks --> Return
```

The current_page_info parameter provides context about the source page's location in the hierarchy, which is essential for generating correct relative link paths.

Sources: tools/deepwiki-scraper.py:453-594

Error Handling and Retries

HTTP Fetch with Retries

The fetch_page() function retries failed requests with a short delay between attempts:

| Attempt | Action | Delay |
|---|---|---|
| 1 | Try request | None |
| 2 | Retry after error | 2 seconds |
| 3 | Final retry | 2 seconds |
| Fail | Raise exception | N/A |

Browser-like headers are used to avoid being blocked:
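
A sketch of fetch_page() combining the retry schedule above with browser-like headers; the header values, timeout, and exact structure are assumptions:

```python
import time
import requests

HEADERS = {
    # Illustrative browser-like values; the scraper's actual headers may differ.
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    "Accept": "text/html,application/xhtml+xml",
}

def fetch_page(url: str, session: requests.Session, max_retries: int = 3) -> requests.Response:
    for attempt in range(1, max_retries + 1):
        try:
            response = session.get(url, headers=HEADERS, timeout=30)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == max_retries:
                raise              # final attempt failed: propagate the error
            time.sleep(2)          # fixed 2-second pause before the next attempt
```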

Sources: tools/deepwiki-scraper.py:27-42

Rate Limiting

To be respectful to the DeepWiki server, the main extraction loop includes a 1-second delay between page requests:
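
A sketch of the loop, with write_markdown_file and the page dictionary shape as illustrative stand-ins:

```python
import time

for page in wiki_pages:                               # discovered wiki structure
    markdown = extract_page_content(page["url"], session, page)
    write_markdown_file(page, markdown)               # hypothetical output helper
    time.sleep(1)                                     # be polite: 1 request per second
```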

This appears in the main loop after each successful page extraction.

Sources: tools/deepwiki-scraper.py:872

Dependencies

The HTML to Markdown conversion relies on three key Python libraries:

LibraryVersionPurpose
requests≥2.31.0HTTP requests with session management
beautifulsoup4≥4.12.0HTML parsing and element manipulation
html2text≥2020.1.16HTML to Markdown conversion

Sources: tools/requirements.txt:1-3

Output Characteristics

The Markdown files produced by this conversion have these properties:

  • No line wrapping: original formatting is preserved (body_width=0)
  • Clean structure: no UI elements or navigation
  • Relative links: all internal links point to local .md files
  • Title guarantee: every file starts with an H1 heading
  • Hierarchy-aware: links account for the subdirectory structure
  • Footer-free: DeepWiki-specific footer content is removed

These characteristics make the files suitable for Phase 2 diagram enhancement and Phase 3 mdBook building without further modification.