Phase 1: Markdown Extraction

This page documents Phase 1 of the three-phase processing pipeline, which handles the extraction and initial conversion of wiki content from DeepWiki.com into clean Markdown files. Phase 1 produces raw Markdown files in a temporary directory before diagram enhancement (Phase 2, see #7) and mdBook HTML generation (Phase 3, see #8).

For detailed information about specific sub-processes within Phase 1, see:

  • Wiki structure discovery algorithm: #6.1
  • HTML parsing and Markdown conversion: #6.2

Scope and Objectives

Phase 1 accomplishes the following:

  1. Discover all wiki pages and their hierarchical structure from DeepWiki
  2. Fetch HTML content for each page via HTTP requests
  3. Parse HTML to extract main content and remove UI elements
  4. Convert cleaned HTML to Markdown using html2text
  5. Organize output files into a hierarchical directory structure
  6. Save to a temporary directory for subsequent processing

This phase is implemented entirely in Python within deepwiki-scraper.py and operates independently of Phases 2 and 3.

Sources: README.md:121-128 tools/deepwiki-scraper.py:790-876

Phase 1 Execution Flow

The following diagram shows the complete execution flow of Phase 1, mapping high-level steps to specific functions in the codebase:

Sources: tools/deepwiki-scraper.py:790-876 tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:453-594

flowchart TD
    Start["main()\nEntry Point"]
    CreateTemp["Create tempfile.TemporaryDirectory()"]
    CreateSession["requests.Session()\nwith User-Agent"]
    DiscoverPhase["Structure Discovery Phase"]
    ExtractWiki["extract_wiki_structure(repo, session)"]
    ParseLinks["BeautifulSoup: find_all('a', href=pattern)"]
    SortPages["sort by page number (handle dots)"]
    ExtractionPhase["Content Extraction Phase"]
    LoopPages["For each page in pages list"]
    FetchContent["extract_page_content(url, session, page_info)"]
    FetchHTML["fetch_page(url, session)\nwith retries"]
    ParseHTML["BeautifulSoup(response.text)"]
    RemoveNav["Remove nav/header/footer/aside elements"]
    FindContent["Find main content: article/main/[role='main']"]
    ConvertPhase["Conversion Phase"]
    ConvertMD["convert_html_to_markdown(html_content)"]
    HTML2Text["html2text.HTML2Text with body_width=0"]
    CleanFooter["clean_deepwiki_footer(markdown)"]
    FixLinks["Regex replace: wiki links → .md paths"]
    SavePhase["File Organization Phase"]
    DetermineLevel{"page['level'] == 0?"}
    SaveRoot["Save to temp_dir/NUM-title.md"]
    CreateSubdir["Create temp_dir/section-N/"]
    SaveSubdir["Save to section-N/NUM-title.md"]
    NextPage{"More pages?"}
    Complete["Phase 1 Complete: temp_dir contains all .md files"]

    Start --> CreateTemp
    CreateTemp --> CreateSession
    CreateSession --> DiscoverPhase

    DiscoverPhase --> ExtractWiki
    ExtractWiki --> ParseLinks
    ParseLinks --> SortPages
    SortPages --> ExtractionPhase

    ExtractionPhase --> LoopPages
    LoopPages --> FetchContent
    FetchContent --> FetchHTML
    FetchHTML --> ParseHTML
    ParseHTML --> RemoveNav
    RemoveNav --> FindContent
    FindContent --> ConvertPhase

    ConvertPhase --> ConvertMD
    ConvertMD --> HTML2Text
    HTML2Text --> CleanFooter
    CleanFooter --> FixLinks
    FixLinks --> SavePhase

    SavePhase --> DetermineLevel
    DetermineLevel -->|Yes: Main Page| SaveRoot
    DetermineLevel -->|No: Subsection| CreateSubdir
    CreateSubdir --> SaveSubdir
    SaveRoot --> NextPage
    SaveSubdir --> NextPage

    NextPage -->|Yes| LoopPages
    NextPage -->|No| Complete
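
The sketch below condenses this flow into code. It is an approximation, not the actual main(): the function names extract_wiki_structure and extract_page_content come from the diagram, while save_page is a hypothetical helper standing in for the inline file-organization logic.

```python
import tempfile
import requests

def run_phase_1(repo):
    """Approximate shape of the Phase 1 driver; not the verbatim main()."""
    session = requests.Session()
    session.headers["User-Agent"] = "Mozilla/5.0"  # browser-like User-Agent

    with tempfile.TemporaryDirectory() as temp_dir:
        # Structure Discovery Phase
        pages = extract_wiki_structure(repo, session)

        # Content Extraction, Conversion, and File Organization Phases
        for page in pages:
            markdown = extract_page_content(page["url"], session, page)
            save_page(temp_dir, page, markdown)  # hypothetical helper, sketched under "File Organization Strategy"
            # (per-page error handling omitted; see "Error Handling and Robustness" below)

        print(f"✓ Successfully extracted {len(pages)} pages to temp directory")
        # Phases 2 and 3 consume temp_dir here, before the context manager removes it
```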

Core Components and Data Flow

Structure Discovery Pipeline

The structure discovery process identifies all wiki pages and builds a hierarchical page list:

Sources: tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:90-116 tools/deepwiki-scraper.py:118-123

flowchart LR
    subgraph Input
        BaseURL["Base URL\ndeepwiki.com/owner/repo"]
    end

    subgraph extract_wiki_structure
        FetchMain["fetch_page(base_url)"]
        ParseSoup["BeautifulSoup(response.text)"]
        FindLinks["soup.find_all('a', href=regex)"]
        ExtractInfo["Extract page number & title\nRegex: /(\d+(?:\.\d+)*)-(.+)$"]
        CalcLevel["Calculate level from dots\nlevel = page_num.count('.')"]
        BuildPages["Build pages list with metadata"]
        SortFunc["Sort by sort_key(page)\nparts = [int(x) for x in num.split('.')]"]
    end

    subgraph Output
        PagesList["List[Dict]\n{'number': '2.1',\n'title': 'Section',\n'url': full_url,\n'href': path,\n'level': 1}"]
    end

    BaseURL --> FetchMain
    FetchMain --> ParseSoup
    ParseSoup --> FindLinks
    FindLinks --> ExtractInfo
    ExtractInfo --> CalcLevel
    CalcLevel --> BuildPages
    BuildPages --> SortFunc
    SortFunc --> PagesList
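
A minimal sketch of this pipeline is shown below, assuming the regex and sort key from the diagram; the real extract_wiki_structure() (tools/deepwiki-scraper.py:78-125) may differ in detail.

```python
import re
from bs4 import BeautifulSoup

def extract_wiki_structure(repo, session):
    base_url = f"https://deepwiki.com/{repo}"
    html = fetch_page(base_url, session)                  # retrying fetch helper (see "HTTP Session Configuration")
    soup = BeautifulSoup(html, "html.parser")

    link_pattern = re.compile(r"/(\d+(?:\.\d+)*)-(.+)$")  # page-number/slug regex from the diagram
    pages = []
    for a in soup.find_all("a", href=link_pattern):
        href = a["href"]
        page_num, _slug = link_pattern.search(href).groups()
        pages.append({
            "number": page_num,
            "title": a.get_text(strip=True),
            "url": f"https://deepwiki.com{href}",
            "href": href,
            "level": page_num.count("."),                 # nesting depth = number of dots
        })

    # Numeric sort so '2.10' follows '2.9' rather than '2.1'
    pages.sort(key=lambda p: [int(x) for x in p["number"].split(".")])
    return pages
```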

Content Extraction and Cleaning

Each page undergoes a multi-step cleaning process to remove DeepWiki UI elements:

Sources: tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:453-594 tools/deepwiki-scraper.py:175-190 tools/deepwiki-scraper.py:127-173

flowchart TD
    subgraph fetch_page
        MakeRequest["requests.get(url, headers)\nUser-Agent: Mozilla/5.0..."]
        RetryLogic["Retry up to 3 times\n2 second delay between attempts"]
        CheckStatus["response.raise_for_status()"]
    end

    subgraph extract_page_content
        ParsePage["BeautifulSoup(response.text)"]
        RemoveUnwanted["Decompose: nav, header, footer,\naside, .sidebar, script, style"]
        FindMain["Try selectors in order:\narticle → main → .wiki-content\n→ [role='main'] → body"]
        RemoveUI["Remove DeepWiki UI elements:\n'Edit Wiki', 'Last indexed:',\n'Index your code with Devin'"]
        RemoveNavLists["Remove navigation <ul> lists\n(80%+ internal wiki links)"]
    end

    subgraph convert_html_to_markdown
        HTML2TextInit["h = html2text.HTML2Text()\nh.ignore_links = False\nh.body_width = 0"]
        HandleContent["markdown = h.handle(html_content)"]
        CleanFooterCall["clean_deepwiki_footer(markdown)"]
    end

    subgraph clean_deepwiki_footer
        SplitLines["lines = markdown.split('\\n')"]
        ScanBackward["Scan last 50 lines backward\nfor footer patterns"]
        MatchPatterns["Regex patterns:\n'Dismiss', 'Refresh this wiki',\n'On this page', 'Edit Wiki'"]
        TruncateLines["lines = lines[:footer_start]"]
        RemoveEmpty["Remove trailing empty lines"]
    end

    MakeRequest --> RetryLogic
    RetryLogic --> CheckStatus
    CheckStatus --> ParsePage

    ParsePage --> RemoveUnwanted
    RemoveUnwanted --> FindMain
    FindMain --> RemoveUI
    RemoveUI --> RemoveNavLists
    RemoveNavLists --> HTML2TextInit

    HTML2TextInit --> HandleContent
    HandleContent --> CleanFooterCall

    CleanFooterCall --> SplitLines
    SplitLines --> ScanBackward
    ScanBackward --> MatchPatterns
    MatchPatterns --> TruncateLines
    TruncateLines --> RemoveEmpty
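
The conversion and footer-cleaning stages can be sketched as follows. The html2text options and the 50-line scan window come from the diagram above; the surrounding structure is an assumption rather than the verbatim source.

```python
import re
import html2text

FOOTER_PATTERNS = [re.compile(p) for p in
                   (r"Dismiss", r"Refresh this wiki", r"On this page", r"Edit Wiki")]

def convert_html_to_markdown(html_content):
    h = html2text.HTML2Text()
    h.ignore_links = False            # keep links so they can be rewritten to .md paths
    h.body_width = 0                  # disable hard line wrapping
    markdown = h.handle(html_content)
    return clean_deepwiki_footer(markdown)

def clean_deepwiki_footer(markdown):
    lines = markdown.split("\n")
    footer_start = None
    # Scan only the last 50 lines, bottom-up, for DeepWiki footer text
    for i in range(len(lines) - 1, max(len(lines) - 51, -1), -1):
        if any(p.search(lines[i]) for p in FOOTER_PATTERNS):
            footer_start = i
    if footer_start is not None:
        lines = lines[:footer_start]
    while lines and not lines[-1].strip():    # remove trailing empty lines
        lines.pop()
    return "\n".join(lines)
```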

Link Rewriting

Phase 1 transforms internal DeepWiki links into relative Markdown file paths. The rewriting logic accounts for the hierarchical directory structure:

Sources: tools/deepwiki-scraper.py:549-592

flowchart TD
    subgraph Input
        WikiLink["DeepWiki Link\n[text](/owner/repo/2-1-section)"]
        SourcePage["Current Page Info\n{level: 1, number: '2.1'}"]
    end

    subgraph fix_wiki_link
        ExtractPath["Regex: /(\d+(?:\.\d+)*)-(.+)$"]
        ParseNumbers["Extract: page_num='2.1', slug='section'"]
        ConvertNum["file_num = page_num.replace('.', '-')\nResult: '2-1'"]
        CheckTarget{"Is target\nsubsection?\n(has '.')"}
        CheckSource{"Is source\nsubsection?\n(level > 0)"}
        CheckSame{"Same main\nsection?"}
        PathSameSection["Relative path:\nfile_num-slug.md"]
        PathDiffSection["Full path:\nsection-N/file_num-slug.md"]
        PathToMain["Up one level:\n../file_num-slug.md"]
        PathMainToMain["Same level:\nfile_num-slug.md"]
    end

    subgraph Output
        MDLink["Markdown Link\n[text](2-1-section.md)\nor [text](section-2/2-1-section.md)\nor [text](../2-1-section.md)"]
    end

    WikiLink --> ExtractPath
    ExtractPath --> ParseNumbers
    ParseNumbers --> ConvertNum
    ConvertNum --> CheckTarget

    CheckTarget -->|Yes| CheckSource
    CheckTarget -->|No: Main Page| CheckSource

    CheckSource -->|Target: Sub, Source: Sub| CheckSame
    CheckSource -->|Target: Sub, Source: Main| PathDiffSection
    CheckSource -->|Target: Main, Source: Sub| PathToMain
    CheckSource -->|Target: Main, Source: Main| PathMainToMain

    CheckSame -->|Yes| PathSameSection
    CheckSame -->|No| PathDiffSection

    PathSameSection --> MDLink
    PathDiffSection --> MDLink
    PathToMain --> MDLink
    PathMainToMain --> MDLink
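
A minimal sketch of this decision logic follows. It mirrors the branches exactly as drawn in the diagram; the real implementation (tools/deepwiki-scraper.py:549-592) may handle additional cases, such as cross-section links from inside a section-N/ directory.

```python
import re

LINK_RE = re.compile(r"/(\d+(?:\.\d+)*)-(.+)$")

def fix_wiki_link(href, current_page):
    match = LINK_RE.search(href)
    if not match:
        return href                                     # not an internal wiki page link
    page_num, slug = match.groups()
    filename = f"{page_num.replace('.', '-')}-{slug}.md"  # '2.1' -> '2-1-slug.md'

    target_is_sub = "." in page_num
    source_is_sub = current_page["level"] > 0

    if target_is_sub:
        target_section = page_num.split(".")[0]
        same_section = (source_is_sub and
                        current_page["number"].split(".")[0] == target_section)
        if same_section:
            return filename                             # sibling inside the same section-N/ directory
        return f"section-{target_section}/{filename}"   # path into the target's section directory
    if source_is_sub:
        return f"../{filename}"                         # climb out of section-N/ to a main page
    return filename                                     # main page to main page
```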

File Organization Strategy

Phase 1 organizes output files into a hierarchical directory structure based on page levels:

Directory Structure Rules

| Page Level | Page Number Format | Directory Location | Filename Pattern | Example |
|---|---|---|---|---|
| 0 (Main) | 1, 2, 3 | temp_dir/ (root) | {num}-{slug}.md | 1-overview.md |
| 1 (Subsection) | 2.1, 3.4 | temp_dir/section-{N}/ | {num}-{slug}.md | section-2/2-1-workspace.md |

File Organization Implementation
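
The original implementation is inline in main() (tools/deepwiki-scraper.py:845-868). The hypothetical save_page() helper below illustrates the same directory rules, assuming a precomputed slug field that the real code may derive differently.

```python
from pathlib import Path

def save_page(temp_dir, page, markdown):
    file_num = page["number"].replace(".", "-")    # '2.1' -> '2-1'
    filename = f"{file_num}-{page['slug']}.md"     # 'slug' is an assumed, precomputed field

    if page["level"] == 0:
        out_path = Path(temp_dir) / filename       # main pages live at the temp-dir root
    else:
        section = page["number"].split(".")[0]     # '2.1' -> 'section-2'
        out_dir = Path(temp_dir) / f"section-{section}"
        out_dir.mkdir(parents=True, exist_ok=True)
        out_path = out_dir / filename

    out_path.write_text(markdown, encoding="utf-8")
    return out_path
```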

Sources: tools/deepwiki-scraper.py:21-25 tools/deepwiki-scraper.py:845-868

HTTP Session Configuration

Phase 1 uses a persistent requests.Session with browser-like headers and retry logic:

Session Setup
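
A minimal sketch of the session setup; only the browser-like User-Agent is documented above, so the exact header value below is illustrative.

```python
import requests

session = requests.Session()
session.headers.update({
    # Illustrative browser-like User-Agent; the real string is defined in deepwiki-scraper.py
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
})
```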

Retry Strategy
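
A sketch of fetch_page() based on the behaviour described in the diagrams above (3 attempts, 2-second delay, raise_for_status()); it approximates rather than reproduces the real function.

```python
import time
import requests

def fetch_page(url, session, retries=3, delay=2):
    last_error = None
    for attempt in range(retries):
        try:
            response = session.get(url)
            response.raise_for_status()      # turn HTTP errors into exceptions
            return response.text
        except requests.RequestException as exc:
            last_error = exc
            if attempt < retries - 1:
                time.sleep(delay)            # pause before the next attempt
    raise last_error
```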

Sources: tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:817-821

Data Structures

Page Metadata Dictionary

Each page discovered by extract_wiki_structure() is represented as a dictionary:
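
The field names match the structure-discovery diagram above; the concrete values below are illustrative, not copied from tool output.

```python
page_info = {
    "number": "2.1",                         # hierarchical page number
    "title": "Workspace and Crates",         # illustrative title
    "url": "https://deepwiki.com/owner/repo/2.1-workspace-and-crates",
    "href": "/owner/repo/2.1-workspace-and-crates",
    "level": 1,                              # number of dots in 'number'
}
```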

Sources: tools/deepwiki-scraper.py:109-115

BeautifulSoup Content Selectors

Phase 1 attempts multiple selector strategies to find main content, in priority order:

| Priority | Selector Type | Selector Value | Rationale |
|---|---|---|---|
| 1 | CSS Selector | article | Semantic HTML5 element for main content |
| 2 | CSS Selector | main | HTML5 main landmark element |
| 3 | CSS Selector | .wiki-content | Common class name for wiki content |
| 4 | CSS Selector | .content | Generic content class |
| 5 | CSS Selector | #content | Generic content ID |
| 6 | CSS Selector | .markdown-body | GitHub-style markdown container |
| 7 | Attribute | role="main" | ARIA landmark role |
| 8 | Fallback | body | Last resort: entire body |
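
The selector strings below come from the table; the loop structure is a sketch of how such a fallback chain is typically applied, not the verbatim source.

```python
CONTENT_SELECTORS = [
    "article", "main", ".wiki-content", ".content",
    "#content", ".markdown-body", "[role='main']",
]

def find_main_content(soup):
    for selector in CONTENT_SELECTORS:
        element = soup.select_one(selector)
        if element:
            return element
    return soup.body        # priority 8: fall back to the entire <body>
```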

Sources: tools/deepwiki-scraper.py:472-484

Error Handling and Robustness

Page Extraction Error Handling

Phase 1 implements graceful degradation for individual page failures:
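
In outline, the extraction loop wraps each page in a try/except so that a failing page is logged and skipped rather than aborting the run. A sketch, assuming the same names as the earlier driver sketch:

```python
success_count = 0
for page in pages:
    try:
        markdown = extract_page_content(page["url"], session, page)
        save_page(temp_dir, page, markdown)    # hypothetical helper from "File Organization Strategy"
        success_count += 1
    except Exception as exc:
        # Graceful degradation: report the failure and continue with the next page
        print(f"  ✗ Error extracting page {page['number']}: {exc}")

print(f"✓ Successfully extracted {success_count}/{len(pages)} pages to temp directory")
```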

Sources: tools/deepwiki-scraper.py:841-876

Content Extraction Fallbacks

If primary content selectors fail, Phase 1 applies fallback strategies:

  1. Content Selector Fallback Chain: Try 8 different selectors (see table above)
  2. Empty Content Check: Raises an exception if no content element is found (tools/deepwiki-scraper.py:486-487)
  3. HTTP Retry Logic: 3 attempts with exponential backoff
  4. Session Persistence: Reuses TCP connections for efficiency

Sources: tools/deepwiki-scraper.py:472-487 tools/deepwiki-scraper.py:27-42

Output Format

Temporary Directory Structure

At the end of Phase 1, the temporary directory contains the following structure:

temp_dir/
├── 1-overview.md                    # Main page (level 0)
├── 2-architecture.md                # Main page (level 0)
├── 3-components.md                  # Main page (level 0)
├── section-2/                       # Subsections of page 2
│   ├── 2-1-workspace-and-crates.md  # Subsection (level 1)
│   └── 2-2-dependency-graph.md      # Subsection (level 1)
└── section-4/                       # Subsections of page 4
    ├── 4-1-logical-planning.md
    └── 4-2-physical-planning.md

Markdown File Format

Each generated Markdown file has the following characteristics:

  • Title: Always starts with a # {Page Title} heading
  • Content: Cleaned HTML converted to Markdown via html2text
  • Links: Internal wiki links rewritten to relative .md paths
  • No Diagrams: Diagrams are added in Phase 2 (see #7)
  • No Footer: DeepWiki UI elements removed via clean_deepwiki_footer()
  • Encoding: UTF-8

Sources: tools/deepwiki-scraper.py:862-868 tools/deepwiki-scraper.py:127-173

Phase 1 Completion Criteria

Phase 1 is considered complete when:

  1. All pages discovered by extract_wiki_structure() have been processed
  2. Each page's Markdown file has been written to the temporary directory
  3. Directory structure (main pages + section-N/ subdirectories) has been created
  4. Success count is reported: "✓ Successfully extracted N/M pages to temp directory"

The temporary directory is then passed to Phase 2 for diagram enhancement.

Sources: tools/deepwiki-scraper.py:877 tools/deepwiki-scraper.py:596-788