
This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.


HTML to Markdown Conversion


This page documents the HTML to Markdown conversion process in Phase 1 of the pipeline. After the wiki structure is discovered (see Wiki Structure Discovery), each page’s HTML content is fetched, cleaned, and converted to Markdown format. This conversion prepares the content for diagram enhancement in Phase 2 (see Phase 2: Diagram Enhancement).

Conversion Pipeline

The HTML to Markdown conversion follows a multi-step pipeline that progressively cleans and transforms the content. The process is orchestrated by the extract_page_content function and involves HTML parsing, element removal, conversion, and post-processing.

Conversion Pipeline Flow

```mermaid
graph TB
    fetch["fetch_page()\n[65-80]"]
    parse["BeautifulSoup()\nHTML Parser"]
    remove1["Remove Navigation\nElements [761-762]"]
    remove2["Find Main Content\nArea [765-782]"]
    remove3["Remove DeepWiki UI\nElements [786-806]"]
    convert["convert_html_to_markdown()\n[213-228]"]
    clean["clean_deepwiki_footer()\n[165-211]"]
    format["format_source_references()\n[397-406]"]
    links["Fix Internal Links\n[854-875]"]
    output["Cleaned Markdown\nOutput"]

    fetch --> parse
    parse --> remove1
    remove1 --> remove2
    remove2 --> remove3
    remove3 --> convert
    convert --> clean
    clean --> format
    format --> links
    links --> output
```

Sources: python/deepwiki-scraper.py:751-877

HTML Parsing and Content Extraction

The conversion begins by fetching the HTML page using a requests.Session with browser-like headers to avoid bot detection. BeautifulSoup parses the HTML into a navigable tree structure.
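A minimal sketch of this fetch-and-parse step, assuming standard requests and BeautifulSoup usage; the specific header values and timeout are illustrative, not taken from the scraper:

```python
import requests
from bs4 import BeautifulSoup

session = requests.Session()
session.headers.update({
    # Browser-like User-Agent to avoid bot detection (exact value is illustrative)
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
})

def fetch_page(url: str) -> BeautifulSoup:
    """Fetch a wiki page and parse it into a navigable tree."""
    resp = session.get(url, timeout=30)
    resp.raise_for_status()
    return BeautifulSoup(resp.text, "html.parser")
```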

Content Area Detection

The system uses a cascading selector strategy to locate the main content area, trying multiple selectors in order of preference:

| Priority | Selector Type | Example |
| --- | --- | --- |
| 1 | Semantic HTML5 tags | `article`, `main` |
| 2 | Class-based selectors | `.wiki-content`, `.content`, `.markdown-body` |
| 3 | ID-based selectors | `#content` |
| 4 | ARIA role attributes | `role="main"` |
| 5 | Fallback | `body` tag |

Content Detection Logic

Sources: python/deepwiki-scraper.py:765-782
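A minimal sketch of the cascading lookup using BeautifulSoup's select_one; the selector list mirrors the priority table above, though the scraper's exact list may differ:

```python
from bs4 import BeautifulSoup

CONTENT_SELECTORS = [
    "article", "main",                              # 1. semantic HTML5 tags
    ".wiki-content", ".content", ".markdown-body",  # 2. class-based selectors
    "#content",                                     # 3. ID-based selectors
    '[role="main"]',                                # 4. ARIA role attributes
]

def find_content_area(soup: BeautifulSoup):
    """Try each selector in priority order, falling back to <body>."""
    for selector in CONTENT_SELECTORS:
        area = soup.select_one(selector)
        if area is not None:
            return area
    return soup.body  # 5. fallback
```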

DeepWiki UI Element Removal

Before conversion, the system removes DeepWiki-specific navigation and UI elements that would pollute the final documentation. This occurs in two stages: pre-processing element removal and footer cleanup.

Pre-Processing Element Removal

The first stage removes structural elements and UI components using CSS selectors; a sketch of both stages follows the table below.

The second stage removes DeepWiki-specific text-based UI elements by scanning for characteristic strings:

| UI Element | Detection String | Max Length |
| --- | --- | --- |
| Code indexing prompt | “Index your code with Devin” | 200 chars |
| Edit controls | “Edit Wiki” | 200 chars |
| Indexing status | “Last indexed:” | 200 chars |
| Search links | “View this search on DeepWiki” | 200 chars |

Sources: python/deepwiki-scraper.py:761-762 python/deepwiki-scraper.py:786-795
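A combined sketch of both stages. The stage-one selector list here is an assumption (the actual selectors live in the scraper source); the stage-two marker strings and 200-character cap come from the table above:

```python
UI_TEXT_MARKERS = [
    "Index your code with Devin",
    "Edit Wiki",
    "Last indexed:",
    "View this search on DeepWiki",
]

def remove_ui_elements(content):
    # Stage 1: structural elements removed by CSS selector
    # (this selector list is hypothetical)
    for selector in ["nav", "header", "footer", "aside"]:
        for el in content.select(selector):
            el.decompose()
    # Stage 2: short elements containing characteristic DeepWiki UI strings
    for el in list(content.find_all(True)):
        if el.decomposed:
            continue  # already removed along with a parent
        text = el.get_text(strip=True)
        if len(text) < 200 and any(marker in text for marker in UI_TEXT_MARKERS):
            el.decompose()
```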

Navigation List Removal

DeepWiki pages include navigation lists that link to all wiki pages. The system detects and removes these by identifying unordered lists (`<ul>`) with the following characteristics (a code sketch follows the diagram below):

  1. Contains more than 5 links
  2. At least 80% of links are internal (start with /)

Navigation List Detection

```mermaid
graph LR
    ul["Find all ul\nElements"]
    count["Count Links\nin List"]
    check1{"More than\n5 links?"}
    check2{"80%+ are\ninternal?"}
    remove["Remove ul\nElement"]
    keep["Keep ul\nElement"]

    ul --> count
    count --> check1
    check1 -->|Yes| check2
    check1 -->|No| keep
    check2 -->|Yes| remove
    check2 -->|No| keep
```

Sources: python/deepwiki-scraper.py:799-806
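The heuristic translates almost directly into code; a minimal sketch, assuming content is a BeautifulSoup tag:

```python
def remove_navigation_lists(content):
    for ul in content.find_all("ul"):
        links = ul.find_all("a")
        if len(links) <= 5:
            continue  # rule 1: must contain more than 5 links
        internal = [a for a in links if (a.get("href") or "").startswith("/")]
        if len(internal) / len(links) >= 0.8:
            ul.decompose()  # rule 2: 80%+ internal links -> navigation list
```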

html2text Conversion

After cleaning the HTML, the system uses the html2text library to convert HTML to Markdown. The conversion is configured with specific settings to preserve link structure and prevent line wrapping.

html2text Configuration

The `body_width = 0` setting is critical because it prevents the converter from introducing artificial line breaks that would break code blocks and formatted content.
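A sketch of the likely converter setup; `body_width = 0` is confirmed above, while the other flags are common html2text options and are assumptions here:

```python
import html2text

def convert_html_to_markdown(html: str) -> str:
    h = html2text.HTML2Text()
    h.body_width = 0        # never wrap lines (critical for code blocks)
    h.ignore_links = False  # preserve link structure (assumed setting)
    h.ignore_images = False # keep image references (assumed setting)
    return h.handle(html)
```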

Important: Mermaid diagram extraction is explicitly disabled at this stage. DeepWiki’s Next.js payload contains diagrams from ALL pages mixed together, making per-page extraction unreliable. Diagrams are handled separately in Phase 2 using fuzzy matching (see Fuzzy Matching Algorithm).

Sources: python/deepwiki-scraper.py:213-228

Footer Cleanup

The clean_deepwiki_footer function removes DeepWiki’s footer UI elements that appear at the end of each page. It uses regex patterns to detect footer markers and removes everything from that point onward.

Footer Cleanup Flow

```mermaid
graph TB
    scan["Scan Last 50 Lines\nBackwards"]
    patterns["Check Against\nFooter Patterns"]
    found{"Pattern\nMatch?"}
    backward["Scan Backward\n20 More Lines"]
    content{"Hit Real\nContent?"}
    cut["Cut Lines from\nFooter Start"]
    trim["Trim Trailing\nEmpty Lines"]

    scan --> patterns
    patterns --> found
    found -->|Yes| backward
    found -->|No| trim
    backward --> content
    content -->|Yes| cut
    content -->|No| backward
    cut --> trim
```

The footer patterns are compiled regex expressions:

| Pattern | Purpose | Example Match |
| --- | --- | --- |
| `^\s*Dismiss\s*$` | Close button | “Dismiss” |
| `Refresh this wiki` | Refresh controls | “Refresh this wiki” |
| `This wiki was recently refreshed` | Status message | Various timestamps |
| `###\s*On this page` | Page navigation | “### On this page” |
| `Please wait \d+ days?` | Rate limiting | “Please wait 7 days” |
| `View this search on DeepWiki` | Search link | Exact match |
| `^\s*Edit Wiki\s*$` | Edit button | “Edit Wiki” |

Sources: python/deepwiki-scraper.py:165-211
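A sketch of the routine, assuming the compiled patterns from the table; the 50-line window and 20-line backward scan follow the flow diagram, though the exact loop structure is a reconstruction:

```python
import re

FOOTER_PATTERNS = [re.compile(p) for p in [
    r"^\s*Dismiss\s*$",
    r"Refresh this wiki",
    r"This wiki was recently refreshed",
    r"###\s*On this page",
    r"Please wait \d+ days?",
    r"View this search on DeepWiki",
    r"^\s*Edit Wiki\s*$",
]]

def clean_deepwiki_footer(markdown: str) -> str:
    lines = markdown.split("\n")
    # Scan the last 50 lines backwards for the first footer marker
    for i in range(len(lines) - 1, max(len(lines) - 51, -1), -1):
        if any(p.search(lines[i]) for p in FOOTER_PATTERNS):
            # Scan backward up to 20 more lines past blanks above the footer
            cut = i
            for j in range(i - 1, max(i - 21, -1), -1):
                if lines[j].strip():
                    break  # hit real content
                cut = j
            lines = lines[:cut]  # cut everything from the footer start
            break
    # Trim trailing empty lines
    while lines and not lines[-1].strip():
        lines.pop()
    return "\n".join(lines)
```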

Post-Processing Steps

After initial conversion, two post-processing steps refine the Markdown output: source reference formatting and internal link rewriting.

Source Reference Formatting

The format_source_references function inserts colons between filenames and line numbers in source code references, transforming patterns like `[path/to/file10-20]` into `[path/to/file:10-20]`.

Pattern Matching:

  • Regex: `\[([A-Za-z0-9._/-]+?)(\d+-\d+)\]`
  • Capture Group 1: Filename path
  • Capture Group 2: Line number range
  • Output: `[filename:linerange]`

Sources: python/deepwiki-scraper.py:395-406
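Given the regex above, the whole transformation reduces to a single substitution; a minimal sketch:

```python
import re

SOURCE_REF = re.compile(r"\[([A-Za-z0-9._/-]+?)(\d+-\d+)\]")

def format_source_references(markdown: str) -> str:
    """Insert a colon between filename and line range, e.g.
    [python/deepwiki-scraper.py165-211] -> [python/deepwiki-scraper.py:165-211].
    References that already contain a colon are left untouched, since ':' is
    not in the filename character class."""
    return SOURCE_REF.sub(r"[\1:\2]", markdown)
```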

Internal Link Rewriting

DeepWiki uses absolute URLs for internal wiki links (e.g., /owner/repo/4-query-planning). The system rewrites these to relative Markdown file paths using the build_target_path function.

Link Rewriting Process

```mermaid
graph TB
    find["Find Link Pattern:\n/owner/repo/page"]
    extract["Extract page_num\nand slug"]
    normalize["normalized_number_parts()\n[28-43]"]
    build["build_target_path()\n[55-63]"]
    relative{"Source in\nSubsection?"}
    same{"Target in Same\nSection?"}
    diff["Prefix: ../\nDifferent section"]
    none["Prefix: ../\nTop level"]
    local["No prefix\nSame section"]
    done["Return\nRelative Path"]

    find --> extract
    extract --> normalize
    normalize --> build
    build --> relative
    relative -->|Yes| same
    relative -->|No| done
    same -->|Yes| local
    same -->|No| diff
    diff --> done
    local --> done
    none --> done
```

Path Resolution Examples:

| Source File | Target Link | Resolved Path |
| --- | --- | --- |
| `1-overview.md` | `/repo/2-architecture` | `2-architecture.md` |
| `section-2/2-1-pipeline.md` | `/repo/2-2-build` | `2-2-build.md` |
| `section-2/2-1-pipeline.md` | `/repo/3-config` | `../3-config.md` |
| `1-overview.md` | `/repo/2-1-subsection` | `section-2/2-1-subsection.md` |

Sources: python/deepwiki-scraper.py:854-875 python/deepwiki-scraper.py:55-63 python/deepwiki-scraper.py:28-43
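A reconstruction of these rules from the examples above; number_parts and build_target_path echo the scraper's helpers (normalized_number_parts, build_target_path), but their bodies here are illustrative rather than the actual implementation:

```python
import os

def number_parts(slug: str) -> list:
    """Leading numeric parts of a page slug: '2-1-pipeline' -> ['2', '1']."""
    parts = []
    for piece in slug.split("-"):
        if not piece.isdigit():
            break
        parts.append(piece)
    return parts

def build_target_path(slug: str) -> str:
    nums = number_parts(slug)
    if len(nums) >= 2:                   # subsection page, e.g. "2-1-..."
        return f"section-{nums[0]}/{slug}.md"
    return f"{slug}.md"                  # top-level page

def resolve_internal_link(source_file: str, slug: str) -> str:
    target = build_target_path(slug)
    source_dir = os.path.dirname(source_file)   # "" for top-level sources
    if not source_dir:
        return target                           # top level: use path as-is
    if source_dir == os.path.dirname(target):
        return os.path.basename(target)         # same section: no prefix
    return f"../{target}"                       # different section or top level
```

Applied to each row of the table above, this sketch produces the listed resolved paths.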

Duplicate Content Removal

The final cleanup step removes duplicate titles and stray “Menu” text that may appear in the converted Markdown. The system tracks whether a title has been seen and skips subsequent occurrences if they match the first title exactly.

Cleanup Rules:

  1. Skip standalone “Menu” lines
  2. Keep first # Title occurrence
  3. Skip duplicate titles that match the first title
  4. Preserve all other content

Sources: python/deepwiki-scraper.py:820-841
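A sketch of those four rules applied line by line:

```python
def remove_duplicate_titles(markdown: str) -> str:
    out, first_title = [], None
    for line in markdown.split("\n"):
        if line.strip() == "Menu":
            continue                    # rule 1: skip standalone "Menu" lines
        if line.startswith("# "):
            if first_title is None:
                first_title = line      # rule 2: keep the first title
            elif line.strip() == first_title.strip():
                continue                # rule 3: skip exact duplicate titles
        out.append(line)                # rule 4: preserve all other content
    return "\n".join(out)
```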

Output Format

The final output is clean Markdown with the following characteristics:

  • Title guaranteed to be present (added if missing)
  • No DeepWiki UI elements
  • No artificial line wrapping
  • Relative internal links
  • Formatted source references
  • Stripped trailing whitespace

The output is written to temporary storage before diagram enhancement in Phase 2. A snapshot of this raw Markdown (without diagrams) is saved to raw_markdown/ for debugging purposes.

Sources: python/deepwiki-scraper.py:1357-1366
