Docker Multi-Stage Build

Purpose and Scope

This document explains the Docker multi-stage build strategy used to create the deepwiki-scraper container image. It details how the system combines a Rust toolchain for compiling documentation tools with a Python runtime for web scraping, while optimizing the final image size.

For information about how the container orchestrates the build process, see build-docs.sh Orchestrator. For details on the Python scraper implementation, see deepwiki-scraper.py.

Multi-Stage Build Strategy

The Dockerfile implements a two-stage build pattern that separates compilation from runtime. Stage 1 uses a full Rust development environment to compile mdBook binaries from source. Stage 2 creates a minimal Python runtime and extracts only the compiled binaries, discarding the build toolchain.

Build Stages Flow

graph TD
    subgraph Stage1["Stage 1: Rust Builder (rust:latest)"]
        RustBase["rust:latest base image\n~1.5 GB with toolchain"]
        CargoInstall["cargo install mdbook\ncargo install mdbook-mermaid"]
        Binaries["/usr/local/cargo/bin/\nmdbook\nmdbook-mermaid"]
        RustBase --> CargoInstall
        CargoInstall --> Binaries
    end
    subgraph Stage2["Stage 2: Python Runtime (python:3.12-slim)"]
        PyBase["python:3.12-slim base\n~150 MB"]
        UVInstall["Copy uv from\nghcr.io/astral-sh/uv:latest"]
        PipInstall["uv pip install --system\nrequirements.txt"]
        CopyBinaries["COPY --from=builder\n/usr/local/cargo/bin/"]
        CopyScripts["COPY tools/ and\nbuild-docs.sh"]
        PyBase --> UVInstall
        UVInstall --> PipInstall
        PipInstall --> CopyBinaries
        CopyBinaries --> CopyScripts
    end
    Binaries -.->|Extract only binaries| CopyBinaries
    CopyScripts --> FinalImage["Final Image\n~300-400 MB"]
    Stage1 -.->|Discarded after build| Discard["Discard"]
    style RustBase fill:#f5f5f5
    style PyBase fill:#f5f5f5
    style FinalImage fill:#e8e8e8
    style Discard fill:#fff,stroke-dasharray: 5 5

Sources: Dockerfile:1-33

Stage 1: Rust Builder

Stage 1 uses rust:latest as the base image, providing the complete Rust toolchain including cargo, the Rust package manager and build tool.

Rust Builder Configuration

| Aspect | Details |
| --- | --- |
| Base Image | rust:latest |
| Size | ~1.5 GB (includes rustc, cargo, stdlib) |
| Build Commands | cargo install mdbook, cargo install mdbook-mermaid |
| Output Location | /usr/local/cargo/bin/ |
| Stage Identifier | builder |

The cargo install commands download the mdbook and mdbook-mermaid crates from crates.io, compile them in release mode, and install the resulting binaries to /usr/local/cargo/bin/.

flowchart LR
    subgraph BuilderStage["builder stage"]
        CratesIO["crates.io\n(package registry)"]
        CargoFetch["cargo fetch\n(download sources)"]
        CargoCompile["cargo build --release\n(compile to binary)"]
        CargoInstallBin["Install to\n/usr/local/cargo/bin/"]
        CratesIO --> CargoFetch
        CargoFetch --> CargoCompile
        CargoCompile --> CargoInstallBin
    end
    CargoInstallBin --> MdBookBin["mdbook binary"]
    CargoInstallBin --> MermaidBin["mdbook-mermaid binary"]
    MdBookBin -.->|Copied to Stage 2| NextStage["NextStage"]
    MermaidBin -.->|Copied to Stage 2| NextStage

Sources: Dockerfile:1-5

Stage 2: Python Runtime Assembly

Stage 2 builds the final runtime image starting from python:3.12-slim, a minimal Python base image that omits development headers and unnecessary packages.

Python Runtime Components

graph TB
    subgraph PythonBase["python:3.12-slim"]
        PyInterpreter["Python 3.12 interpreter"]
        PyStdlib["Python standard library"]
        BaseUtils["Essential utilities\n(bash, sh, coreutils)"]
    end
    subgraph InstalledTools["Installed via COPY"]
        UV["uv package manager\n/bin/uv, /bin/uvx"]
        PyDeps["Python packages\n(requests, beautifulsoup4, html2text)"]
        RustBins["Rust binaries\n(mdbook, mdbook-mermaid)"]
        Scripts["Application scripts\n(deepwiki-scraper.py, build-docs.sh)"]
    end
    PythonBase --> UV
    UV --> PyDeps
    PythonBase --> RustBins
    PythonBase --> Scripts
    PyDeps --> Runtime["Runtime Environment"]
    RustBins --> Runtime
    Scripts --> Runtime

The installation sequence follows a specific order:

  1. Copy uv (Dockerfile:13) - Multi-stage copy from ghcr.io/astral-sh/uv:latest
  2. Install Python dependencies (Dockerfile:16-17) - Uses uv pip install --system --no-cache
  3. Copy Rust binaries (Dockerfile:20-21) - Extracts from builder stage
  4. Copy application scripts (Dockerfile:24-29) - Adds Python scraper and orchestrator

Sources: Dockerfile:8-29

Binary Extraction and Integration

The critical optimization occurs at Dockerfile:20-21 where the COPY --from=builder directive extracts only the compiled binaries without any build dependencies.

Binary Extraction Pattern

| Source (Stage 1) | Destination (Stage 2) | Purpose |
| --- | --- | --- |
| /usr/local/cargo/bin/mdbook | /usr/local/bin/mdbook | Documentation builder executable |
| /usr/local/cargo/bin/mdbook-mermaid | /usr/local/bin/mdbook-mermaid | Mermaid preprocessor executable |

flowchart LR
    subgraph BuilderFS["Builder Filesystem"]
        CargoDir["/usr/local/cargo/bin/"]
        MdBookSrc["mdbook\n(compiled binary)"]
        MermaidSrc["mdbook-mermaid\n(compiled binary)"]
        CargoDir --> MdBookSrc
        CargoDir --> MermaidSrc
    end
    subgraph RuntimeFS["Runtime Filesystem"]
        BinDir["/usr/local/bin/"]
        MdBookDst["mdbook\n(extracted)"]
        MermaidDst["mdbook-mermaid\n(extracted)"]
        BinDir --> MdBookDst
        BinDir --> MermaidDst
    end
    MdBookSrc -.->|COPY --from=builder| MdBookDst
    MermaidSrc -.->|COPY --from=builder| MermaidDst
    subgraph Discarded["Discarded (not copied)"]
        RustToolchain["rustc compiler"]
        CargoTool["cargo build tool"]
        SourceFiles["mdBook source files"]
        BuildCache["cargo build cache"]
    end

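Rust links its standard library and crate dependencies statically into each executable by default, so both binaries run in the Python base image without rustc or cargo. They typically still link dynamically against the system C library, which the Debian-based python:3.12-slim image provides.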
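
One way to confirm that the copied binaries are usable in the runtime image is to check that each one resolves on $PATH. The following is a minimal sketch, not part of the project: the function name missing_tools and the REQUIRED_TOOLS list are hypothetical, with the tool names taken from the tables in this document.

```python
import shutil

# Tool names taken from this document; the list itself is an assumption.
REQUIRED_TOOLS = ["mdbook", "mdbook-mermaid", "uv", "python"]


def missing_tools(names, path=None):
    """Return the names that do not resolve to an executable on PATH."""
    return [name for name in names if shutil.which(name, path=path) is None]


if __name__ == "__main__":
    # Inside the final image this list would be expected to be empty.
    print("missing executables:", missing_tools(REQUIRED_TOOLS))
```

Run inside the container, an empty list would indicate that every tool named here was found. It does not prove that a binary's shared-library dependencies are satisfied.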

Sources: Dockerfile:19-21

Python Dependency Installation

Python dependencies are installed with uv, a fast Python package installer written in Rust. They are declared in tools/requirements.txt:1-4.

Python Dependencies

| Package | Version | Purpose |
| --- | --- | --- |
| requests | ≥2.31.0 | HTTP client for scraping DeepWiki |
| beautifulsoup4 | ≥4.12.0 | HTML parsing and navigation |
| html2text | ≥2020.1.16 | HTML to Markdown conversion |

The installation command (Dockerfile:17) uses these flags:

  • --system: Install to system Python (not virtualenv)
  • --no-cache: Don't cache downloaded packages (reduces image size)
  • -r /tmp/requirements.txt: Read dependencies from file
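
To show how the flags above combine into one invocation, here is a small sketch that assembles the argument list. The helper name uv_install_command is hypothetical. The flags and the default path come from the list above, and the exact line in the Dockerfile may differ.

```python
import shlex


def uv_install_command(requirements_path="/tmp/requirements.txt"):
    """Build the argv for a system-wide, cache-free uv install."""
    return [
        "uv", "pip", "install",
        "--system",               # install into the system Python
        "--no-cache",             # do not keep downloaded packages
        "-r", requirements_path,  # read dependencies from this file
    ]


if __name__ == "__main__":
    # Print the command as a single shell-quoted string.
    print(shlex.join(uv_install_command()))
```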

Sources: Dockerfile:16-17 tools/requirements.txt:1-4

Image Size Optimization

The multi-stage strategy discards the build environment, so the final image is roughly 300-400 MB instead of the 2+ GB that a single-stage image carrying the Rust toolchain would need.

Size Comparison

graph LR
    subgraph Approach1["Single-Stage Approach (Hypothetical)"]
        Single["rust:latest + Python\n~2+ GB"]
    end
    subgraph Approach2["Multi-Stage Approach (Actual)"]
        Builder["Stage 1: rust:latest\n~1.5 GB\n(discarded)"]
        Runtime["Stage 2: python:3.12-slim\n+ binaries + dependencies\n~300-400 MB"]
        Builder -.->|Extract binaries only| Runtime
    end
    Single -->|Contains unnecessary build toolchain| Waste["Wasted Space"]
    Runtime -->|Contains only runtime essentials| Efficient["Efficient"]
    style Single fill:#f5f5f5
    style Builder fill:#f5f5f5
    style Runtime fill:#e8e8e8
    style Waste fill:#fff,stroke-dasharray: 5 5
    style Efficient fill:#fff,stroke-dasharray: 5 5

Size Breakdown of Final Image

| Component | Approximate Size |
| --- | --- |
| Python 3.12 slim base | ~150 MB |
| Python packages (requests, BeautifulSoup4, html2text) | ~20 MB |
| mdBook binary | ~8 MB |
| mdbook-mermaid binary | ~6 MB |
| uv package manager | ~10 MB |
| Application scripts | <1 MB |
| Total | ~300-400 MB |
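
As a quick arithmetic check on these estimates, the sketch below sums the itemized rows and computes the reduction relative to the hypothetical ~2 GB single-stage image. The dictionary keys and function names are illustrative. The scripts entry uses 1 MB as an upper bound for "<1 MB", and 2000 MB stands in for "~2+ GB".

```python
# Approximate component sizes in MB, copied from the breakdown above.
COMPONENT_SIZES_MB = {
    "python_slim_base": 150,
    "python_packages": 20,
    "mdbook": 8,
    "mdbook_mermaid": 6,
    "uv": 10,
    "application_scripts": 1,  # upper bound for "<1 MB"
}


def itemized_total_mb(sizes):
    """Sum the itemized component estimates."""
    return sum(sizes.values())


def reduction_percent(single_stage_mb, final_mb):
    """Whole-percent size reduction of the final image vs. single-stage."""
    return 100 * (single_stage_mb - final_mb) // single_stage_mb


if __name__ == "__main__":
    print(itemized_total_mb(COMPONENT_SIZES_MB))  # 195
    print(reduction_percent(2000, 400))           # 80
    print(reduction_percent(2000, 300))           # 85
```

Under these figures the itemized rows add up to about 195 MB, which is below the stated total of ~300-400 MB, so the rows are best read as rough estimates and not a complete accounting of the image. Against a 2 GB single-stage image, a 300-400 MB result is a reduction of roughly 80-85%.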

Sources: Dockerfile:1-33 README.md:156

Runtime Environment Structure

The final image combines a Python runtime with precompiled Rust binaries. The build-docs.sh orchestrator invokes both the Python scraper and the Rust binaries as subprocesses.

Runtime Component Locations

graph TB
    subgraph Filesystem["/usr/local/bin/"]
        BuildScript["build-docs.sh\n(orchestrator)"]
        ScraperScript["deepwiki-scraper.py\n(Python scraper)"]
        MdBookBin["mdbook\n(Rust binary)"]
        MermaidBin["mdbook-mermaid\n(Rust binary)"]
        UVBin["uv\n(Python installer)"]
    end
    subgraph SystemPython["/usr/local/lib/python3.12/"]
        Requests["requests package"]
        BS4["beautifulsoup4 package"]
        Html2Text["html2text package"]
    end
    subgraph Execution["Execution Flow"]
        Docker["docker run\n(CMD)"]
        Docker --> BuildScript
        BuildScript -->|python| ScraperScript
        BuildScript -->|subprocess| MdBookBin
        MdBookBin -->|preprocessor| MermaidBin
        ScraperScript --> Requests
        ScraperScript --> BS4
        ScraperScript --> Html2Text
    end

The default command (Dockerfile:32) runs /usr/local/bin/build-docs.sh, which orchestrates calls to both Python and Rust components. The script can execute:

  • python /usr/local/bin/deepwiki-scraper.py for web scraping
  • mdbook init for initialization
  • mdbook build for HTML generation
  • mdbook-mermaid install for asset installation

Sources: Dockerfile:28-32 build-docs.sh

Container Execution Model

When the container runs, Docker executes the CMD (Dockerfile:32), which invokes build-docs.sh. This shell script has access to all binaries in /usr/local/bin/, which is on $PATH by default.

Process Tree During Execution

graph TD
    Docker["docker run\n(container init)"]
    Docker --> CMD["CMD: build-docs.sh"]
    CMD --> Phase1["Phase 1:\npython deepwiki-scraper.py"]
    CMD --> Phase2["Phase 2: mdbook init"]
    CMD --> Phase3["Phase 3: mdbook-mermaid install"]
    CMD --> Phase4["Phase 4: mdbook build"]
    Phase1 --> PyProc["Python 3.12 process"]
    PyProc --> ReqLib["requests.get()"]
    PyProc --> BS4Lib["BeautifulSoup()"]
    PyProc --> H2TLib["html2text.HTML2Text()"]
    Phase2 --> MdBookProc1["mdbook binary process"]
    Phase3 --> MermaidProc["mdbook-mermaid binary process"]
    Phase4 --> MdBookProc2["mdbook binary process"]
    MdBookProc2 --> MermaidPreproc["mdbook-mermaid\n(as preprocessor)"]
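
The phase order in the process tree above can be sketched as a small sequential driver. This is illustrative only: the real orchestrator is the build-docs.sh shell script, and the Python form is used here just to make the ordering and fail-fast behavior explicit. The names PHASES and run_phases are hypothetical, and command-line arguments are omitted because this document does not list them.

```python
import subprocess

# Phase order follows the process tree above. Arguments are omitted
# because they are not given in this document.
PHASES = [
    ("scrape", ["python", "/usr/local/bin/deepwiki-scraper.py"]),
    ("init", ["mdbook", "init"]),
    ("mermaid-assets", ["mdbook-mermaid", "install"]),
    ("build", ["mdbook", "build"]),
]


def run_phases(phases=PHASES, runner=subprocess.run):
    """Run each phase in order; stop at the first failing command."""
    completed = []
    for name, argv in phases:
        # With check=True, subprocess.run raises CalledProcessError
        # when the command exits with a non-zero status.
        runner(argv, check=True)
        completed.append(name)
    return completed
```

Passing a custom runner makes the sequence easy to inspect without launching any processes, for example run_phases(runner=lambda argv, check: print(argv)).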

Sources: Dockerfile:32 README.md:122-145