Phase 1: Markdown Extraction
This page documents Phase 1 of the three-phase processing pipeline, which handles the extraction and initial conversion of wiki content from DeepWiki.com into clean Markdown files. Phase 1 produces raw Markdown files in a temporary directory before diagram enhancement (Phase 2, see #7) and mdBook HTML generation (Phase 3, see #8).
For detailed information about specific sub-processes within Phase 1, see the dedicated sub-pages.
Scope and Objectives
Phase 1 accomplishes the following:
- Discover all wiki pages and their hierarchical structure from DeepWiki
- Fetch HTML content for each page via HTTP requests
- Parse HTML to extract main content and remove UI elements
- Convert cleaned HTML to Markdown using html2text
- Organize output files into a hierarchical directory structure
- Save to a temporary directory for subsequent processing
This phase is implemented entirely in Python within deepwiki-scraper.py and operates independently of Phases 2 and 3.
Sources: README.md:121-128 tools/deepwiki-scraper.py:790-876
Phase 1 Execution Flow
The following diagram shows the complete execution flow of Phase 1, mapping high-level steps to specific functions in the codebase:
Sources: tools/deepwiki-scraper.py:790-876 tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:453-594
```mermaid
flowchart TD
Start["main()\nEntry Point"]
CreateTemp["Create tempfile.TemporaryDirectory()"]
CreateSession["requests.Session()\nwith User-Agent"]
DiscoverPhase["Structure Discovery Phase"]
ExtractWiki["extract_wiki_structure(repo, session)"]
ParseLinks["BeautifulSoup: find_all('a', href=pattern)"]
SortPages["sort by page number (handle dots)"]
ExtractionPhase["Content Extraction Phase"]
LoopPages["For each page in pages list"]
FetchContent["extract_page_content(url, session, page_info)"]
FetchHTML["fetch_page(url, session)
with retries"]
ParseHTML["BeautifulSoup(response.text)"]
RemoveNav["Remove nav/header/footer/aside elements"]
FindContent["Find main content: article/main/[role='main']"]
ConvertPhase["Conversion Phase"]
ConvertMD["convert_html_to_markdown(html_content)"]
HTML2Text["html2text.HTML2Text with body_width=0"]
CleanFooter["clean_deepwiki_footer(markdown)"]
FixLinks["Regex replace: wiki links → .md paths"]
SavePhase["File Organization Phase"]
DetermineLevel{"page['level'] == 0?"}
SaveRoot["Save to temp_dir/NUM-title.md"]
CreateSubdir["Create temp_dir/section-N/"]
SaveSubdir["Save to section-N/NUM-title.md"]
NextPage{"More pages?"}
Complete["Phase 1 Complete: temp_dir contains all .md files"]
Start --> CreateTemp
CreateTemp --> CreateSession
CreateSession --> DiscoverPhase
DiscoverPhase --> ExtractWiki
ExtractWiki --> ParseLinks
ParseLinks --> SortPages
SortPages --> ExtractionPhase
ExtractionPhase --> LoopPages
LoopPages --> FetchContent
FetchContent --> FetchHTML
FetchHTML --> ParseHTML
ParseHTML --> RemoveNav
RemoveNav --> FindContent
FindContent --> ConvertPhase
ConvertPhase --> ConvertMD
ConvertMD --> HTML2Text
HTML2Text --> CleanFooter
CleanFooter --> FixLinks
FixLinks --> SavePhase
SavePhase --> DetermineLevel
DetermineLevel -->|Yes: Main Page| SaveRoot
DetermineLevel -->|No: Subsection| CreateSubdir
CreateSubdir --> SaveSubdir
SaveRoot --> NextPage
SaveSubdir --> NextPage
NextPage -->|Yes| LoopPages
NextPage -->|No| Complete
```
Core Components and Data Flow
Structure Discovery Pipeline
The structure discovery process identifies all wiki pages and builds a hierarchical page list:
Sources: tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:90-116 tools/deepwiki-scraper.py:118-123
```mermaid
flowchart LR
subgraph Input
BaseURL["Base URL\ndeepwiki.com/owner/repo"]
end
subgraph extract_wiki_structure
FetchMain["fetch_page(base_url)"]
ParseSoup["BeautifulSoup(response.text)"]
FindLinks["soup.find_all('a', href=regex)"]
ExtractInfo["Extract page number & title\nRegex: /(\d+(?:\.\d+)*)-(.+)$"]
CalcLevel["Calculate level from dots\nlevel = page_num.count('.')"]
BuildPages["Build pages list with metadata"]
SortFunc["Sort by sort_key(page)\nparts = [int(x)
for x in num.split('.')]"]
end
subgraph Output
PagesList["List[Dict]\n{'number': '2.1',\n'title': 'Section',\n'url': full_url,\n'href': path,\n'level': 1}"]
end
BaseURL --> FetchMain
FetchMain --> ParseSoup
ParseSoup --> FindLinks
FindLinks --> ExtractInfo
ExtractInfo --> CalcLevel
CalcLevel --> BuildPages
BuildPages --> SortFunc
SortFunc --> PagesList
```
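The discovery logic can be sketched in a few lines. This is a reconstruction from the diagram, not the actual extract_wiki_structure() body: the href shape (/owner/repo/2-1-title, dash-separated numbers normalized to the dotted form) and the deduplication set are assumptions.

```python
import re
import requests
from bs4 import BeautifulSoup

def extract_wiki_structure(repo: str, session: requests.Session) -> list[dict]:
    """Sketch: discover wiki pages and return an ordered list of metadata dicts."""
    base_url = f"https://deepwiki.com/{repo}"
    soup = BeautifulSoup(session.get(base_url, timeout=30).text, "html.parser")

    pages = []
    seen = set()
    for a in soup.find_all("a", href=re.compile(rf"^/{re.escape(repo)}/\d+")):
        href = a["href"]
        # Assumed href shape: /owner/repo/2-1-some-title -> number '2.1', slug 'some-title'
        match = re.search(r"/(\d+(?:-\d+)*)-(.+)$", href)
        if not match or href in seen:
            continue
        seen.add(href)
        page_num = match.group(1).replace("-", ".")
        pages.append({
            "number": page_num,
            "title": a.get_text(strip=True),
            "href": href,
            "url": f"https://deepwiki.com{href}",
            "level": page_num.count("."),   # '2.1' -> level 1, '2' -> level 0
        })

    # Numeric sort so '2.10' follows '2.9' rather than '2.1'
    pages.sort(key=lambda p: [int(x) for x in p["number"].split(".")])
    return pages
```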
Content Extraction and Cleaning
Each page undergoes a multi-step cleaning process to remove DeepWiki UI elements:
Sources: tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:453-594 tools/deepwiki-scraper.py:175-190 tools/deepwiki-scraper.py:127-173
```mermaid
flowchart TD
subgraph fetch_page
MakeRequest["requests.get(url, headers)\nUser-Agent: Mozilla/5.0..."]
RetryLogic["Retry up to 3 times\n2 second delay between attempts"]
CheckStatus["response.raise_for_status()"]
end
subgraph extract_page_content
ParsePage["BeautifulSoup(response.text)"]
RemoveUnwanted["Decompose: nav, header, footer,\naside, .sidebar, script, style"]
FindMain["Try selectors in order:\narticle → main → .wiki-content\n→ [role='main'] → body"]
RemoveUI["Remove DeepWiki UI elements:\n'Edit Wiki', 'Last indexed:',\n'Index your code with Devin'"]
RemoveNavLists["Remove navigation <ul> lists\n(80%+ internal wiki links)"]
end
subgraph convert_html_to_markdown
HTML2TextInit["h = html2text.HTML2Text()\nh.ignore_links = False\nh.body_width = 0"]
HandleContent["markdown = h.handle(html_content)"]
CleanFooterCall["clean_deepwiki_footer(markdown)"]
end
subgraph clean_deepwiki_footer
SplitLines["lines = markdown.split('\\n')"]
ScanBackward["Scan last 50 lines backward\nfor footer patterns"]
MatchPatterns["Regex patterns:\n'Dismiss', 'Refresh this wiki',\n'On this page', 'Edit Wiki'"]
TruncateLines["lines = lines[:footer_start]"]
RemoveEmpty["Remove trailing empty lines"]
end
MakeRequest --> RetryLogic
RetryLogic --> CheckStatus
CheckStatus --> ParsePage
ParsePage --> RemoveUnwanted
RemoveUnwanted --> FindMain
FindMain --> RemoveUI
RemoveUI --> RemoveNavLists
RemoveNavLists --> HTML2TextInit
HTML2TextInit --> HandleContent
HandleContent --> CleanFooterCall
CleanFooterCall --> SplitLines
SplitLines --> ScanBackward
ScanBackward --> MatchPatterns
MatchPatterns --> TruncateLines
TruncateLines --> RemoveEmpty
```
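A compact sketch of the conversion and footer-cleanup steps follows. It mirrors the settings and patterns named in the diagram, but the FOOTER_PATTERNS list and the exact scan logic are illustrative approximations of clean_deepwiki_footer(), not the real implementation.

```python
import re
import html2text

# Illustrative footer markers; the real pattern list may be longer
FOOTER_PATTERNS = [
    r"^Dismiss$",
    r"Refresh this wiki",
    r"^On this page",
    r"Edit Wiki",
]

def clean_deepwiki_footer(markdown: str) -> str:
    """Sketch: drop DeepWiki footer/UI lines from the tail of the converted Markdown."""
    lines = markdown.split("\n")
    footer_start = None
    # Only scan the last 50 lines; the footer always sits at the bottom of the page.
    for i in range(len(lines) - 1, max(len(lines) - 50, -1), -1):
        if any(re.search(p, lines[i]) for p in FOOTER_PATTERNS):
            footer_start = i
    if footer_start is not None:
        lines = lines[:footer_start]
    while lines and not lines[-1].strip():
        lines.pop()                      # trim trailing blank lines
    return "\n".join(lines)

def convert_html_to_markdown(html_content: str) -> str:
    """Sketch: HTML -> Markdown with links kept and no hard wrapping."""
    h = html2text.HTML2Text()
    h.ignore_links = False
    h.body_width = 0                     # do not wrap long lines
    return clean_deepwiki_footer(h.handle(html_content))
```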
Link Rewriting Logic
Phase 1 transforms internal DeepWiki links into relative Markdown file paths. The rewriting logic accounts for the hierarchical directory structure:
Sources: tools/deepwiki-scraper.py:549-592
```mermaid
flowchart TD
subgraph Input
WikiLink["DeepWiki Link\n[text](/owner/repo/2-1-section)"]
SourcePage["Current Page Info\n{level: 1, number: '2.1'}"]
end
subgraph fix_wiki_link
ExtractPath["Regex: /(\d+(?:\.\d+)*)-(.+)$"]
ParseNumbers["Extract: page_num='2.1', slug='section'"]
ConvertNum["file_num = page_num.replace('.', '-')\nResult: '2-1'"]
CheckTarget{"Is target\nsubsection?\n(has '.')"}
CheckSource{"Is source\nsubsection?\n(level > 0)"}
CheckSame{"Same main\nsection?"}
PathSameSection["Relative path:\nfile_num-slug.md"]
PathDiffSection["Full path:\nsection-N/file_num-slug.md"]
PathToMain["Up one level:\n../file_num-slug.md"]
PathMainToMain["Same level:\nfile_num-slug.md"]
end
subgraph Output
MDLink["Markdown Link\n[text](2-1-section.md)\nor [text](section-2/2-1-section.md)\nor [text](../2-1-section.md)"]
end
WikiLink --> ExtractPath
ExtractPath --> ParseNumbers
ParseNumbers --> ConvertNum
ConvertNum --> CheckTarget
CheckTarget -->|Yes| CheckSource
CheckTarget -->|No: Main Page| CheckSource
CheckSource -->|Target: Sub, Source: Sub| CheckSame
CheckSource -->|Target: Sub, Source: Main| PathDiffSection
CheckSource -->|Target: Main, Source: Sub| PathToMain
CheckSource -->|Target: Main, Source: Main| PathMainToMain
CheckSame -->|Yes| PathSameSection
CheckSame -->|No| PathDiffSection
PathSameSection --> MDLink
PathDiffSection --> MDLink
PathToMain --> MDLink
PathMainToMain --> MDLink
```
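The decision tree translates into a small path-resolution function. The sketch below is illustrative rather than the actual fix_wiki_link logic: it assumes the href already carries the dash form (2-1-section) and, unlike the simplified diagram, prepends ../ when a subsection links into a different section's directory.

```python
import re

def fix_wiki_link(href: str, source_page: dict) -> str | None:
    """Sketch: turn a DeepWiki href into a relative .md path, given the current page's info."""
    match = re.search(r"/(\d+(?:-\d+)*)-(.+)$", href)
    if not match:
        return None
    file_num, slug = match.groups()          # e.g. '2-1', 'section'
    page_num = file_num.replace("-", ".")    # '2.1'

    target_is_sub = "." in page_num
    source_is_sub = source_page["level"] > 0
    target_main = page_num.split(".")[0]     # main-section number, e.g. '2'
    source_main = source_page["number"].split(".")[0]

    if target_is_sub and source_is_sub and target_main == source_main:
        return f"{file_num}-{slug}.md"                      # same section-N/ directory
    if target_is_sub:
        prefix = "../" if source_is_sub else ""
        return f"{prefix}section-{target_main}/{file_num}-{slug}.md"
    if source_is_sub:
        return f"../{file_num}-{slug}.md"                   # subsection -> main page
    return f"{file_num}-{slug}.md"                          # main page -> main page

# Example: linking from subsection 3.2 to subsection 2.1 in a different section
# fix_wiki_link("/owner/repo/2-1-workspace", {"number": "3.2", "level": 1})
# -> '../section-2/2-1-workspace.md'
```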
File Organization Strategy
Phase 1 organizes output files into a hierarchical directory structure based on page levels:
Directory Structure Rules
| Page Level | Page Number Format | Directory Location | Filename Pattern | Example |
|---|---|---|---|---|
| 0 (Main) | 1, 2, 3 | temp_dir/ (root) | {num}-{slug}.md | 1-overview.md |
| 1 (Subsection) | 2.1, 3.4 | temp_dir/section-{N}/ | {num}-{slug}.md | section-2/2-1-workspace.md |
File Organization Implementation
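The following is a minimal sketch of the save step implied by the rules above. The save_page and slugify helpers are hypothetical names; only the placement rules (level 0 at the temp root, level 1 under section-N/) come from the table.

```python
from pathlib import Path

def slugify(title: str) -> str:
    """Hypothetical helper: lower-case the title and join words with dashes."""
    return "-".join("".join(c if c.isalnum() else " " for c in title.lower()).split())

def save_page(temp_dir: Path, page: dict, markdown: str) -> Path:
    """Sketch: write a page's Markdown to the location dictated by its level."""
    file_num = page["number"].replace(".", "-")          # '2.1' -> '2-1'
    filename = f"{file_num}-{slugify(page['title'])}.md"

    if page["level"] == 0:
        target = temp_dir / filename                     # main page at the temp root
    else:
        main_section = page["number"].split(".")[0]
        subdir = temp_dir / f"section-{main_section}"    # e.g. section-2/
        subdir.mkdir(parents=True, exist_ok=True)
        target = subdir / filename

    target.write_text(markdown, encoding="utf-8")
    return target
```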
Sources: tools/deepwiki-scraper.py:21-25 tools/deepwiki-scraper.py:845-868
HTTP Session Configuration
Phase 1 uses a persistent requests.Session with browser-like headers and retry logic:
Session Setup
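A minimal sketch of the session setup, assuming a single shared requests.Session whose User-Agent string (illustrative here) makes requests look like an ordinary browser:

```python
import requests

session = requests.Session()
session.headers.update({
    # Browser-like User-Agent so DeepWiki serves the normal HTML pages
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
})
```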
Retry Strategy
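And a sketch of the retry behaviour described for fetch_page(): up to three attempts with a short fixed pause between them (the real script may time or back off differently):

```python
import time
import requests

def fetch_page(url: str, session: requests.Session, retries: int = 3) -> requests.Response:
    """Sketch: GET a page, retrying a couple of times before giving up."""
    last_error: Exception | None = None
    for attempt in range(1, retries + 1):
        try:
            response = session.get(url, timeout=30)
            response.raise_for_status()
            return response
        except requests.RequestException as exc:
            last_error = exc
            if attempt < retries:
                time.sleep(2)            # fixed pause between attempts
    raise last_error                     # all attempts failed
```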
Sources: tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:817-821
Data Structures
Page Metadata Dictionary
Each page discovered by extract_wiki_structure() is represented as a dictionary:
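A representative entry, with the fields shown in the discovery diagram (the repository path and title are illustrative):

```python
page = {
    "number": "2.1",                     # hierarchical page number
    "title": "Workspace and Crates",     # link text from the wiki index
    "href": "/owner/repo/2-1-workspace-and-crates",
    "url": "https://deepwiki.com/owner/repo/2-1-workspace-and-crates",
    "level": 1,                          # number of dots in "number"
}
```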
Sources: tools/deepwiki-scraper.py:109-115
BeautifulSoup Content Selectors
Phase 1 attempts multiple selector strategies to find main content, in priority order:
| Priority | Selector Type | Selector Value | Rationale |
|---|---|---|---|
| 1 | CSS Selector | article | Semantic HTML5 element for main content |
| 2 | CSS Selector | main | HTML5 main landmark element |
| 3 | CSS Selector | .wiki-content | Common class name for wiki content |
| 4 | CSS Selector | .content | Generic content class |
| 5 | CSS Selector | #content | Generic content ID |
| 6 | CSS Selector | .markdown-body | GitHub-style markdown container |
| 7 | Attribute | role="main" | ARIA landmark role |
| 8 | Fallback | body | Last resort: entire body |
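A sketch of this fallback chain as a plain loop, using BeautifulSoup's select_one() for the CSS selectors and find() for the ARIA role; find_main_content is a hypothetical name for the inline logic:

```python
from bs4 import BeautifulSoup

CONTENT_SELECTORS = ["article", "main", ".wiki-content", ".content", "#content", ".markdown-body"]

def find_main_content(soup: BeautifulSoup):
    """Sketch: return the first element matching the selector priority list."""
    for selector in CONTENT_SELECTORS:
        element = soup.select_one(selector)
        if element:
            return element
    element = soup.find(attrs={"role": "main"})     # ARIA landmark fallback
    if element:
        return element
    return soup.body                                # last resort: the whole body
```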
Sources: tools/deepwiki-scraper.py:472-484
Error Handling and Robustness
Page Extraction Error Handling
Phase 1 implements graceful degradation for individual page failures:
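A sketch of the graceful-degradation loop: one failing page is reported and skipped rather than aborting the run. The extract and save callables stand in for the extraction and file-organization steps; the real error handling and messages may differ.

```python
from pathlib import Path
from typing import Callable

def extract_all_pages(pages: list[dict], extract: Callable[[dict], str],
                      save: Callable[[dict, str], Path]) -> int:
    """Sketch: process every page, skipping (not aborting on) individual failures."""
    success = 0
    for page in pages:
        try:
            markdown = extract(page)
            save(page, markdown)
            success += 1
        except Exception as exc:                     # one bad page must not kill the run
            print(f"  ✗ Failed to extract {page['number']}: {exc}")
    print(f"✓ Successfully extracted {success}/{len(pages)} pages to temp directory")
    return success
```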
Sources: tools/deepwiki-scraper.py:841-876
Content Extraction Fallbacks
If primary content selectors fail, Phase 1 applies fallback strategies:
- Content Selector Fallback Chain: Try 8 different selectors (see table above)
- Empty Content Check: Raises exception if no content element found tools/deepwiki-scraper.py:486-487
- HTTP Retry Logic: 3 attempts with exponential backoff
- Session Persistence: Reuses TCP connections for efficiency
Sources: tools/deepwiki-scraper.py:472-487 tools/deepwiki-scraper.py:27-42
Output Format
Temporary Directory Structure
At the end of Phase 1, the temporary directory contains the following structure:
```
temp_dir/
├── 1-overview.md                      # Main page (level 0)
├── 2-architecture.md                  # Main page (level 0)
├── 3-components.md                    # Main page (level 0)
├── section-2/                         # Subsections of page 2
│   ├── 2-1-workspace-and-crates.md    # Subsection (level 1)
│   └── 2-2-dependency-graph.md        # Subsection (level 1)
└── section-4/                         # Subsections of page 4
    ├── 4-1-logical-planning.md
    └── 4-2-physical-planning.md
```
Markdown File Format
Each generated Markdown file has the following characteristics:
- Title: Always starts with a # {Page Title} heading
- Content: Cleaned HTML converted to Markdown via html2text
- Links: Internal wiki links rewritten to relative .md paths
- No Diagrams: Diagrams are added in Phase 2 (see #7)
- No Footer: DeepWiki UI elements removed via clean_deepwiki_footer()
- Encoding: UTF-8
Sources: tools/deepwiki-scraper.py:862-868 tools/deepwiki-scraper.py:127-173
Phase 1 Completion Criteria
Phase 1 is considered complete when:
- All pages discovered by extract_wiki_structure() have been processed
- Each page's Markdown file has been written to the temporary directory
- Directory structure (main pages + section-N/ subdirectories) has been created
- Success count is reported: "✓ Successfully extracted N/M pages to temp directory"
The temporary directory is then passed to Phase 2 for diagram enhancement.
Sources: tools/deepwiki-scraper.py:877 tools/deepwiki-scraper.py:596-788