This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
HTML to Markdown Conversion
This page documents the HTML to Markdown conversion process in Phase 1 of the pipeline. After the wiki structure is discovered (see Wiki Structure Discovery), each page’s HTML content is fetched, cleaned, and converted to Markdown format. This conversion prepares the content for diagram enhancement in Phase 2 (see Phase 2: Diagram Enhancement).
Conversion Pipeline
The HTML to Markdown conversion follows a multi-step pipeline that progressively cleans and transforms the content. The process is orchestrated by the extract_page_content function and involves HTML parsing, element removal, conversion, and post-processing.
Conversion Pipeline Flow
```mermaid
graph TB
    fetch["fetch_page()\n[65-80]"]
    parse["BeautifulSoup()\nHTML Parser"]
    remove1["Remove Navigation\nElements [761-762]"]
    remove2["Find Main Content\nArea [765-782]"]
    remove3["Remove DeepWiki UI\nElements [786-806]"]
    convert["convert_html_to_markdown()\n[213-228]"]
    clean["clean_deepwiki_footer()\n[165-211]"]
    format["format_source_references()\n[397-406]"]
    links["Fix Internal Links\n[854-875]"]
    output["Cleaned Markdown\nOutput"]
    fetch --> parse
    parse --> remove1
    remove1 --> remove2
    remove2 --> remove3
    remove3 --> convert
    convert --> clean
    clean --> format
    format --> links
    links --> output
```
Sources: python/deepwiki-scraper.py:751-877
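As a structural sketch, the driver composes the helpers documented below. fetch_page, convert_html_to_markdown, clean_deepwiki_footer, and format_source_references are named in the source; the remaining helper names are hypothetical stand-ins for the inline steps, with minimal sketches of each given in the following sections:

```python
from bs4 import BeautifulSoup

def extract_page_content(session, url):
    # Phase 1 pipeline: fetch -> parse -> clean -> convert -> post-process.
    soup = BeautifulSoup(fetch_page(session, url), "html.parser")

    content = find_main_content(soup)        # cascading selector strategy (hypothetical name)
    remove_ui_elements(content)              # text-based DeepWiki UI markers (hypothetical name)
    remove_navigation_lists(content)         # link-heavy <ul> navigation (hypothetical name)

    markdown = convert_html_to_markdown(str(content))
    markdown = clean_deepwiki_footer(markdown)
    markdown = format_source_references(markdown)
    return rewrite_internal_links(markdown)  # relative link rewriting (hypothetical name)
```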
HTML Parsing and Content Extraction
The conversion begins by fetching the HTML page using a requests.Session with browser-like headers to avoid bot detection. BeautifulSoup parses the HTML into a navigable tree structure.
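A minimal sketch of this step, assuming a typical browser-like header set (the exact headers and the real fetch_page signature live in python/deepwiki-scraper.py):

```python
import requests
from bs4 import BeautifulSoup

def fetch_page(session: requests.Session, url: str) -> str:
    # Browser-like headers reduce the chance of bot detection;
    # the exact header values here are illustrative assumptions.
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
        "Accept": "text/html,application/xhtml+xml",
    })
    response = session.get(url, timeout=30)
    response.raise_for_status()
    return response.text

soup = BeautifulSoup(fetch_page(requests.Session(),
                                "https://deepwiki.com/owner/repo/1-overview"),
                     "html.parser")
```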
Content Area Detection
The system uses a cascading selector strategy to locate the main content area, trying multiple selectors in order of preference:
| Priority | Selector Type | Example |
|---|---|---|
| 1 | Semantic HTML5 tags | article, main |
| 2 | Class-based selectors | .wiki-content, .content, .markdown-body |
| 3 | ID-based selectors | #content |
| 4 | ARIA role attributes | role="main" |
| 5 | Fallback | body tag |
Content Detection Logic
Sources: python/deepwiki-scraper.py:765-782
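A minimal sketch of the cascade, assuming the priority order in the table above (the exact selector list in the scraper may differ):

```python
def find_main_content(soup):
    # Try selectors in order of preference; fall back to <body>.
    for selector in ("article", "main",
                     ".wiki-content", ".content", ".markdown-body",
                     "#content", '[role="main"]'):
        element = soup.select_one(selector)
        if element is not None:
            return element
    return soup.body
```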
DeepWiki UI Element Removal
Before conversion, the system removes DeepWiki-specific navigation and UI elements that would pollute the final documentation. This occurs in two stages: pre-processing element removal and footer cleanup.
Pre-Processing Element Removal
The first stage removes structural elements and UI components using CSS selectors.
The second stage removes DeepWiki-specific text-based UI elements by scanning for characteristic strings:
| UI Element | Detection String | Max Length |
|---|---|---|
| Code indexing prompt | “Index your code with Devin” | 200 chars |
| Edit controls | “Edit Wiki” | 200 chars |
| Indexing status | “Last indexed:” | 200 chars |
| Search links | “View this search on DeepWiki” | 200 chars |
Sources: python/deepwiki-scraper.py:761-762 python/deepwiki-scraper.py:786-795
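A sketch of the second stage, assuming the detection strings and 200-character cap from the table (the traversal details are assumptions):

```python
UI_MARKERS = (
    "Index your code with Devin",
    "Edit Wiki",
    "Last indexed:",
    "View this search on DeepWiki",
)

def remove_ui_elements(content):
    for element in list(content.find_all(True)):
        if element.decomposed:          # parent already removed in this pass
            continue
        text = element.get_text(strip=True)
        # The length cap avoids deleting real prose that merely quotes a marker.
        if len(text) <= 200 and any(marker in text for marker in UI_MARKERS):
            element.decompose()
```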
Navigation List Removal
DeepWiki pages include navigation lists that link to all wiki pages. The system detects and removes these by identifying unordered lists (<ul>) with the following characteristics (a sketch follows the diagram below):
- Contains more than 5 links
- At least 80% of the links are internal (i.e., the href starts with /)
Navigation List Detection

```mermaid
graph LR
    ul["Find all ul\nElements"]
    count["Count Links\nin List"]
    check1{"More than\n5 links?"}
    check2{"80%+ are\ninternal?"}
    remove["Remove ul\nElement"]
    keep["Keep ul\nElement"]
    ul --> count
    count --> check1
    check1 -->|Yes| check2
    check1 -->|No| keep
    check2 -->|Yes| remove
    check2 -->|No| keep
```
Sources: python/deepwiki-scraper.py:799-806
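A sketch of the detection, assuming BeautifulSoup and the two thresholds above:

```python
def remove_navigation_lists(content):
    for ul in content.find_all("ul"):
        links = ul.find_all("a")
        if len(links) <= 5:             # small lists are likely real content
            continue
        internal = [a for a in links if (a.get("href") or "").startswith("/")]
        if len(internal) / len(links) >= 0.8:
            ul.decompose()              # link-heavy, mostly-internal: navigation
```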
html2text Conversion
After cleaning the HTML, the system uses the html2text library to convert HTML to Markdown. The conversion is configured with specific settings to preserve link structure and prevent line wrapping.
html2text Configuration
The body_width = 0 setting is critical because it prevents the converter from introducing artificial line breaks that would break code blocks and formatted content.
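A minimal sketch of the configuration; body_width = 0 is stated above, while the other flags are assumptions about typical html2text usage:

```python
import html2text

def convert_html_to_markdown(html: str) -> str:
    converter = html2text.HTML2Text()
    converter.body_width = 0        # no artificial wrapping: keeps code blocks intact
    converter.ignore_links = False  # preserve link structure (assumed setting)
    converter.ignore_images = False # keep image references (assumed setting)
    return converter.handle(html)
```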
Important: Mermaid diagram extraction is explicitly disabled at this stage. DeepWiki’s Next.js payload contains diagrams from ALL pages mixed together, making per-page extraction unreliable. Diagrams are handled separately in Phase 2 using fuzzy matching (see Fuzzy Matching Algorithm).
Sources: python/deepwiki-scraper.py:213-228
Footer Cleanup
The clean_deepwiki_footer function removes DeepWiki’s footer UI elements that appear at the end of each page. It uses regex patterns to detect footer markers and removes everything from that point onward.

```mermaid
graph TB
    scan["Scan Last 50 Lines\nBackwards"]
    patterns["Check Against\nFooter Patterns"]
    found{"Pattern\nMatch?"}
    backward["Scan Backward\n20 More Lines"]
    content{"Hit Real\nContent?"}
    cut["Cut Lines from\nFooter Start"]
    trim["Trim Trailing\nEmpty Lines"]
    scan --> patterns
    patterns --> found
    found -->|Yes| backward
    found -->|No| trim
    backward --> content
    content -->|Yes| cut
    content -->|No| backward
    cut --> trim
```
Footer Detection Patterns
The footer patterns are compiled regex expressions:
| Pattern | Purpose | Example Match |
|---|---|---|
| `^\s*Dismiss\s*$` | Close button | “Dismiss” |
| `Refresh this wiki` | Refresh controls | “Refresh this wiki” |
| `This wiki was recently refreshed` | Status message | Various timestamps |
| `###\s*On this page` | Page navigation | “### On this page” |
| `Please wait \d+ days?` | Rate limiting | “Please wait 7 days” |
| `View this search on DeepWiki` | Search link | Exact match |
| `^\s*Edit Wiki\s*$` | Edit button | “Edit Wiki” |
Sources: python/deepwiki-scraper.py:165-211
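A simplified sketch using the patterns above; the real function also scans backward up to 20 more lines to find the true footer start, which is omitted here:

```python
import re

FOOTER_PATTERNS = [re.compile(p) for p in (
    r"^\s*Dismiss\s*$",
    r"Refresh this wiki",
    r"This wiki was recently refreshed",
    r"###\s*On this page",
    r"Please wait \d+ days?",
    r"View this search on DeepWiki",
    r"^\s*Edit Wiki\s*$",
)]

def clean_deepwiki_footer(markdown: str) -> str:
    lines = markdown.split("\n")
    cut = len(lines)
    # Only the last 50 lines are scanned: footers sit at the page end.
    for i in range(max(0, len(lines) - 50), len(lines)):
        if any(p.search(lines[i]) for p in FOOTER_PATTERNS):
            cut = i
            break
    lines = lines[:cut]
    while lines and not lines[-1].strip():  # trim trailing empty lines
        lines.pop()
    return "\n".join(lines)
```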
Post-Processing Steps
After initial conversion, two post-processing steps refine the Markdown output: source reference formatting and internal link rewriting.
Source Reference Formatting
The format_source_references function inserts colons between filenames and line numbers in source code references. This transforms patterns like [path/to/file10-20] into [path/to/file:10-20].
Pattern Matching:
- Regex: `\[([A-Za-z0-9._/-]+?)(\d+-\d+)\]`
- Capture Group 1: Filename path
- Capture Group 2: Line number range
- Output: `[filename:linerange]`
Sources: python/deepwiki-scraper.py:395-406
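The whole step reduces to a single substitution with the regex above; a minimal sketch:

```python
import re

SOURCE_REF = re.compile(r"\[([A-Za-z0-9._/-]+?)(\d+-\d+)\]")

def format_source_references(markdown: str) -> str:
    # "[src/main.rs10-20]" -> "[src/main.rs:10-20]"; already-colonized
    # references are untouched because ":" is not in the character class.
    return SOURCE_REF.sub(r"[\1:\2]", markdown)
```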
Internal Link Rewriting
DeepWiki uses absolute URLs for internal wiki links (e.g., /owner/repo/4-query-planning). The system rewrites these to relative Markdown file paths using the build_target_path function.
Link Rewriting Process

```mermaid
graph TB
    find["Find Link Pattern:\n/owner/repo/page"]
    extract["Extract page_num\nand slug"]
    normalize["normalized_number_parts()\n[28-43]"]
    build["build_target_path()\n[55-63]"]
    relative{"Source in\nSubsection?"}
    same{"Target in Same\nSection?"}
    diff["Prefix: ../\nDifferent section"]
    none["Prefix: ../\nTop level"]
    local["No prefix\nSame section"]
    done["Return\nRelative Path"]
    find --> extract
    extract --> normalize
    normalize --> build
    build --> relative
    relative -->|Yes| same
    relative -->|No| done
    same -->|Yes| local
    same -->|No| diff
    diff --> done
    local --> done
    none --> done
```
Path Resolution Examples:
| Source File | Target Link | Resolved Path |
|---|---|---|
| `1-overview.md` | `/repo/2-architecture` | `2-architecture.md` |
| `section-2/2-1-pipeline.md` | `/repo/2-2-build` | `2-2-build.md` |
| `section-2/2-1-pipeline.md` | `/repo/3-config` | `../3-config.md` |
| `1-overview.md` | `/repo/2-1-subsection` | `section-2/2-1-subsection.md` |
Sources: python/deepwiki-scraper.py:854-875 python/deepwiki-scraper.py:55-63 python/deepwiki-scraper.py:28-43
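A sketch of the resolution logic, assuming (from the table above) that subsection pages such as 2-1 live under a section-2/ directory; resolve_link is a hypothetical wrapper, and the real build_target_path and normalized_number_parts may differ:

```python
def build_target_path(page_num: str, slug: str) -> str:
    # "2-1" -> "section-2/2-1-slug.md"; "3" -> "3-slug.md" (assumed layout).
    if "-" in page_num:
        section = page_num.split("-")[0]
        return f"section-{section}/{page_num}-{slug}.md"
    return f"{page_num}-{slug}.md"

def resolve_link(source_path: str, page_num: str, slug: str) -> str:
    target = build_target_path(page_num, slug)
    src_dir = source_path.rpartition("/")[0]   # "" for top-level sources
    tgt_dir = target.rpartition("/")[0]
    if src_dir == tgt_dir:
        return target.rpartition("/")[2]       # same directory: bare filename
    if src_dir:
        return "../" + target                  # leave the subsection first
    return target                              # top level -> subsection

assert resolve_link("1-overview.md", "2", "architecture") == "2-architecture.md"
assert resolve_link("section-2/2-1-pipeline.md", "3", "config") == "../3-config.md"
```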
Duplicate Content Removal
The final cleanup step removes duplicate titles and stray “Menu” text that may appear in the converted Markdown. The system tracks whether a title has been seen and skips subsequent occurrences if they match the first title exactly.
Cleanup Rules:
- Skip standalone “Menu” lines
- Keep the first `# Title` occurrence
- Skip duplicate titles that match the first title
- Preserve all other content
Sources: python/deepwiki-scraper.py:820-841
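A sketch of the cleanup pass, assuming titles are level-one Markdown headings:

```python
def remove_duplicates(markdown: str) -> str:
    kept, first_title = [], None
    for line in markdown.split("\n"):
        stripped = line.strip()
        if stripped == "Menu":                  # stray UI text
            continue
        if stripped.startswith("# "):
            if first_title is None:
                first_title = stripped          # keep the first title
            elif stripped == first_title:
                continue                        # skip exact duplicates
        kept.append(line)
    return "\n".join(kept)
```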
Output Format
The final output is clean Markdown with the following characteristics:
- Title guaranteed to be present (added if missing)
- No DeepWiki UI elements
- No artificial line wrapping
- Relative internal links
- Formatted source references
- Stripped trailing whitespace
The output is written to temporary storage before diagram enhancement in Phase 2. A snapshot of this raw Markdown (without diagrams) is saved to raw_markdown/ for debugging purposes.
Sources: python/deepwiki-scraper.py:1357-1366