
Python Dependencies


This page documents the Python dependencies required by the deepwiki-scraper.py script, including their purposes, version requirements, and how they are used throughout the content extraction and conversion pipeline. For information about the scraper script itself, see deepwiki-scraper.py. For details about how Rust dependencies (mdBook and mdbook-mermaid) are installed, see Docker Multi-Stage Build.

Dependencies Overview

The system requires three core Python libraries for web scraping and HTML-to-Markdown conversion:

| Package | Minimum Version | Primary Purpose |
|---------|-----------------|-----------------|
| requests | 2.31.0 | HTTP client for fetching web pages |
| beautifulsoup4 | 4.12.0 | HTML parsing and DOM manipulation |
| html2text | 2020.1.16 | HTML to Markdown conversion |

These dependencies are declared in tools/requirements.txt:1-3 and installed during Docker image build using the uv package manager.
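The file itself, matching the versions in the table above (and the requirements.txt subgraph in the build diagram later on this page), reads:

```text
requests>=2.31.0
beautifulsoup4>=4.12.0
html2text>=2020.1.16
```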

Sources: tools/requirements.txt:1-3 Dockerfile:16-17

Dependency Usage Flow

The following diagram illustrates how each Python dependency is used across the three-phase processing pipeline:

```mermaid
flowchart TD
    subgraph "Phase 1: Markdown Extraction"
        FetchPage["fetch_page()\n[tools/deepwiki-scraper.py:27-42]"]
        ExtractStruct["extract_wiki_structure()\n[tools/deepwiki-scraper.py:78-125]"]
        ExtractContent["extract_page_content()\n[tools/deepwiki-scraper.py:453-594]"]
        ConvertHTML["convert_html_to_markdown()\n[tools/deepwiki-scraper.py:175-190]"]
    end

    subgraph "Phase 2: Diagram Enhancement"
        ExtractDiagrams["extract_and_enhance_diagrams()\n[tools/deepwiki-scraper.py:596-788]"]
    end

    subgraph "requests Library"
        Session["requests.Session()"]
        GetMethod["session.get()"]
        HeadMethod["session.head()"]
    end

    subgraph "BeautifulSoup4 Library"
        BS4Parser["BeautifulSoup(html, 'html.parser')"]
        FindAll["soup.find_all()"]
        Select["soup.select()"]
        Decompose["element.decompose()"]
    end

    subgraph "html2text Library"
        H2TClass["html2text.HTML2Text()"]
        HandleMethod["h.handle()"]
    end

    FetchPage --> Session
    FetchPage --> GetMethod
    ExtractStruct --> GetMethod
    ExtractStruct --> BS4Parser
    ExtractStruct --> FindAll
    ExtractContent --> GetMethod
    ExtractContent --> BS4Parser
    ExtractContent --> Select
    ExtractContent --> Decompose
    ExtractContent --> ConvertHTML
    ConvertHTML --> H2TClass
    ConvertHTML --> HandleMethod
    ExtractDiagrams --> GetMethod
```

Sources: tools/deepwiki-scraper.py:17-19 tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:78-125 tools/deepwiki-scraper.py:453-594 tools/deepwiki-scraper.py:596-788
requests

The requests library provides HTTP client functionality for fetching web pages from DeepWiki.com. It is imported at tools/deepwiki-scraper.py:17 and used throughout the scraper.

Key Usage Patterns

Session Management: A requests.Session() object is created at tools/deepwiki-scraper.py:818-821 to maintain connection pooling and share headers across multiple requests:
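A minimal sketch of that setup (the exact header value is an assumption; only the requirement for browser-like headers is stated in the source):

```python
import requests

# One shared session provides connection pooling and lets every request
# reuse the same browser-like headers.
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",  # assumed browser-like value
})
```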

HTTP GET Requests: The fetch_page() function at tools/deepwiki-scraper.py:27-42 uses session.get() with retry logic, browser-like headers, and a 30-second timeout to fetch HTML content.

HTTP HEAD Requests: The discover_subsections() function at tools/deepwiki-scraper.py:44-76 uses session.head() to efficiently check for page existence without downloading full content.
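A hedged sketch of both patterns; the retry count and pause between attempts are illustrative choices, since only the 30-second timeout and the existence of retry logic are documented:

```python
import time
import requests

def fetch_page(session: requests.Session, url: str, retries: int = 3) -> str:
    """GET a page with simple retry logic (retry count here is illustrative)."""
    for attempt in range(retries):
        try:
            resp = session.get(url, timeout=30)  # 30-second timeout, per the description above
            resp.raise_for_status()
            return resp.text
        except requests.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(2)  # assumed pause between attempts

def subsection_exists(session: requests.Session, url: str) -> bool:
    """HEAD request: check whether a page exists without downloading its body."""
    resp = session.head(url, timeout=30, allow_redirects=True)
    return resp.ok
```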

Configuration Options

The library is configured with browser-like request headers, shared through the Session object, and a 30-second timeout applied to each request.

Sources: tools/deepwiki-scraper.py:17 tools/deepwiki-scraper.py:27-42 tools/deepwiki-scraper.py:818-821

BeautifulSoup4

The beautifulsoup4 library (imported as bs4) provides HTML parsing and DOM manipulation capabilities. It is imported at tools/deepwiki-scraper.py:18 as from bs4 import BeautifulSoup.

Parser Selection

BeautifulSoup is instantiated with the built-in html.parser backend at multiple locations:
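For example (a minimal sketch of the instantiation pattern, with placeholder HTML):

```python
from bs4 import BeautifulSoup

html = "<main><h1>Example</h1><p>Body text</p></main>"  # placeholder input

# The built-in parser requires no extra packages beyond beautifulsoup4 itself.
soup = BeautifulSoup(html, "html.parser")
```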

This parser choice avoids external dependencies (lxml, html5lib) and provides sufficient functionality for well-formed HTML.

DOM Navigation Methods

The following diagram maps BeautifulSoup methods to their usage contexts in the codebase:

```mermaid
flowchart LR
    subgraph "Navigation Methods"
        FindAll["soup.find_all()"]
        Find["soup.find()"]
        Select["soup.select()"]
        SelectOne["soup.select_one()"]
    end

    subgraph "Usage in extract_wiki_structure()"
        StructLinks["Find wiki page links\n[line 90]"]
    end

    subgraph "Usage in extract_page_content()"
        RemoveNav["Remove navigation elements\n[line 466]"]
        FindContent["Locate main content area\n[lines 473-485]"]
        RemoveUI["Remove DeepWiki UI elements\n[lines 491-511]"]
    end

    FindAll --> StructLinks
    FindAll --> RemoveUI
    Select --> RemoveNav
    SelectOne --> FindContent
    Find --> FindContent
```

Sources: tools/deepwiki-scraper.py:18 tools/deepwiki-scraper.py:84 tools/deepwiki-scraper.py:90 tools/deepwiki-scraper.py:463 tools/deepwiki-scraper.py:466-511

Content Manipulation

Element Removal: The element.decompose() method permanently removes elements from the DOM tree:
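A small illustration of the pattern (the tags used here are placeholders, not the scraper's actual selectors):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><nav>site menu</nav><p>keep this</p></div>", "html.parser")

# decompose() removes an element and all of its children from the tree.
for nav in soup.find_all("nav"):
    nav.decompose()

print(soup)  # <div><p>keep this</p></div>
```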

CSS Selectors: BeautifulSoup’s select() and select_one() methods support CSS selector syntax for finding content areas:
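For instance (the selector strings are illustrative only):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<main><article><p>Hello</p></article></main>", "html.parser")

# select() returns every match for a CSS selector; select_one() returns the first or None.
paragraphs = soup.select("article p")
main_area = soup.select_one("main")
```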

tools/deepwiki-scraper.py:473-476

Attribute-Based Selection: The find() method with attrs parameter locates elements by ARIA roles:
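Roughly like the following; the role value is an assumption based on the description above:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div role="main"><p>Page body</p></div>', "html.parser")

# find() with attrs matches elements by arbitrary attributes, such as an ARIA role.
main_by_role = soup.find(attrs={"role": "main"})
```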

tools/deepwiki-scraper.py:480

Text Extraction

BeautifulSoup’s get_text() method extracts plain text from elements:
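For example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h1>Page <em>Title</em></h1>", "html.parser")

# get_text() concatenates all text nodes inside the element.
title = soup.h1.get_text()  # "Page Title"
```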

Sources: tools/deepwiki-scraper.py:466-511

html2text

The html2text library converts HTML content to Markdown format. It is imported at tools/deepwiki-scraper.py:19 and used exclusively in the convert_html_to_markdown() function.

Configuration

An HTML2Text instance is created with specific configuration at tools/deepwiki-scraper.py:178-180:
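A sketch of that configuration, using the two settings documented below:

```python
import html2text

h = html2text.HTML2Text()
h.ignore_links = False  # keep hyperlinks as Markdown links
h.body_width = 0        # do not hard-wrap lines at 80 characters
```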

Key Settings:

  • ignore_links = False : Preserves hyperlinks as Markdown [text](url) link syntax
  • body_width = 0 : Disables automatic line wrapping at 80 characters, preserving original formatting

Conversion Process

The handle() method at tools/deepwiki-scraper.py:181 performs the actual HTML-to-Markdown conversion:
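For example, applied to a small cleaned-up fragment with the configuration shown earlier (the output comment reflects typical html2text behavior and is an approximation):

```python
import html2text

h = html2text.HTML2Text()
h.ignore_links = False
h.body_width = 0

markdown = h.handle("<h2>Overview</h2><p>See the <a href='https://example.com'>docs</a>.</p>")
# markdown is roughly: "## Overview\n\nSee the [docs](https://example.com).\n"
```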

This processes the cleaned HTML (after BeautifulSoup removes unwanted elements) and produces Markdown text with:

  • Headers converted to # syntax
  • Links converted to [text](url) format
  • Lists converted to - or 1. format
  • Bold/italic formatting preserved
  • Code blocks and inline code preserved

Post-Processing

The conversion output undergoes additional cleanup at tools/deepwiki-scraper.py:188.

Sources: tools/deepwiki-scraper.py:19 tools/deepwiki-scraper.py:175-190

Installation Process

The dependencies are installed during Docker image build using the uv package manager, which provides fast, reliable Python package installation.

Multi-Stage Build Integration

```mermaid
flowchart TD
    subgraph "Dockerfile Stage 2"
        BaseImage["FROM python:3.12-slim\n[Dockerfile:8]"]
        CopyUV["COPY uv from astral-sh image\n[Dockerfile:13]"]
        CopyReqs["COPY tools/requirements.txt\n[Dockerfile:16]"]
        InstallDeps["RUN uv pip install --system\n[Dockerfile:17]"]
    end

    subgraph "requirements.txt"
        Requests["requests>=2.31.0"]
        BS4["beautifulsoup4>=4.12.0"]
        HTML2Text["html2text>=2020.1.16"]
    end

    BaseImage --> CopyUV
    CopyUV --> CopyReqs
    CopyReqs --> InstallDeps
    Requests --> InstallDeps
    BS4 --> InstallDeps
    HTML2Text --> InstallDeps
```

Sources: Dockerfile:8 Dockerfile:13 Dockerfile:16-17 tools/requirements.txt:1-3

Installation Command

The dependencies are installed with a single uv pip install command at Dockerfile:17:
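Reconstructed from the flags listed below (only the flags and the requirements path are documented; the exact line is an approximation):

```dockerfile
RUN uv pip install --system --no-cache -r /tmp/requirements.txt
```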

Flags:

  • --system : Installs into system Python, not a virtual environment
  • --no-cache : Avoids caching to reduce Docker image size
  • -r /tmp/requirements.txt : Specifies requirements file path

The uv tool is significantly faster than standard pip and provides deterministic dependency resolution, making builds more reliable and reproducible.

Sources: Dockerfile:16-17

Version Requirements

The minimum version constraints specified in tools/requirements.txt:1-3 ensure compatibility with required features:

requests >= 2.31.0

This version requirement ensures:

  • Security fixes : Addresses CVE-2023-32681 (Proxy-Authorization header leakage)
  • Session improvements : Enhanced connection pooling and retry mechanisms
  • urllib3 2.x compatibility : Works with the current major release of urllib3

The codebase relies on stable Session API behavior introduced in 2.x releases.

beautifulsoup4 >= 4.12.0

This version requirement ensures:

  • Python 3.12 compatibility : Required for the base image python:3.12-slim
  • Parser stability : Consistent behavior with html.parser backend
  • Maintenance updates : Ongoing bug and security fixes in the 4.12 release line

The codebase uses standard find/select methods that are stable across 4.x versions.

html2text >= 2020.1.16

This version requirement ensures:

  • Python 3 compatibility : Earlier versions targeted Python 2.7
  • Markdown formatting fixes : Improved handling of nested lists and code blocks
  • Link preservation : Proper conversion of HTML links to Markdown syntax

The codebase uses the body_width=0 configuration which was stabilized in this version.

Sources: tools/requirements.txt:1-3

Import Locations

All three dependencies are imported at the top of deepwiki-scraper.py:
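Based on the usage documented on this page, the import block at lines 17-19 of the script reads:

```python
import requests                  # HTTP client (line 17)
from bs4 import BeautifulSoup    # HTML parsing (line 18)
import html2text                 # HTML-to-Markdown conversion (line 19)
```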

These are the only external dependencies required by the Python layer. The script uses only standard library modules for all other functionality (sys, re, time, pathlib, urllib.parse).

Sources: tools/deepwiki-scraper.py:17-19
