
Building a Document Export Tool with Playwright and Crawlee

How I automated exporting online course materials to PDF using Playwright for browser automation, Crawlee for web crawling, and pdf-lib for document merging.


As an online learner, I found myself constantly switching between tabs to review course materials. I wanted everything in one place: a single PDF I could read offline, annotate, and search. Rather than spend hours manually saving each page, I saw an opportunity to learn browser automation. Here’s how I built a document export tool using Playwright, Crawlee, and pdf-lib.

The Problem

Online learning platforms are great for delivering content, but they’re not always optimized for offline study. I had dozens of pages scattered across different modules, and I wanted to:

  1. Consolidate all materials into a single, searchable PDF
  2. Study offline without needing an internet connection
  3. Annotate freely using my preferred PDF reader
  4. Learn automation skills that transfer to other projects

Instead of manually saving each page (which would take hours), I decided to automate the entire process.

The Tech Stack

This project gave me hands-on experience with several portfolio-worthy tools:

Tool       | What It Does           | What I Learned
Crawlee    | Web crawling framework | Queue management, request handling, routing
Playwright | Browser automation     | Headless browsers, page.pdf(), session reuse
pdf-lib    | PDF manipulation       | Merging documents, working with binary buffers
yargs      | CLI argument parsing   | Building user-friendly command-line tools
TypeScript | Type safety            | Interfaces, async patterns, strict typing

Architecture Overview

flowchart LR
    subgraph Input
        CLI[CLI Arguments]
        Browser[Browser Profile]
    end

    subgraph Crawlee
        Router[Router]
        Landing[Landing Page Handler]
        Item[Item Page Handler]
        Queue[Request Queue]
    end

    subgraph Output
        Buffers[PDF Buffers]
        Merge[pdf-lib Merge]
        Final[Final PDF]
    end

    CLI --> Router
    Browser --> Router
    Router --> Landing
    Landing -->|enqueueLinks| Queue
    Queue --> Item
    Item -->|page.pdf| Buffers
    Buffers --> Merge
    Merge --> Final

Technical Deep Dives

1. Crawlee & The Router Pattern

Crawlee is Apify’s open-source web crawling framework. It handles the messy parts of crawling—request queuing, retries, rate limiting—so you can focus on the extraction logic.

The key abstraction is the router pattern. Instead of one giant handler, you define specialized handlers for different page types:

import { createPlaywrightRouter, PlaywrightCrawler } from 'crawlee';

const router = createPlaywrightRouter();

// Handler for the landing page that lists all items
router.addHandler('module_landing_page', async ({ page, enqueueLinks, parseWithCheerio }) => {
  // Wait for the content wrapper to ensure page is loaded
  await page.waitForSelector('[data-main-content]');

  // Parse the page with Cheerio (jQuery-like syntax)
  const $ = await parseWithCheerio();

  // Extract links and add them to the crawl queue
  const links = $('.list-item a')
    .map((i, el) => $(el).attr('href'))
    .get();

  await enqueueLinks({
    urls: links,
    label: 'module_item_page'  // Route these to a different handler
  });
});

// Handler for individual content pages
router.addHandler('module_item_page', async ({ page, log }) => {
  await page.waitForSelector('.page-content');
  await page.waitForLoadState('load');

  // Process this page (e.g., generate PDF)
  const pdfBuffer = await page.pdf({ format: 'A4', printBackground: true });
  // Store buffer for later merging...
});

This pattern works well because each page type has its own isolated logic, making the code easier to maintain and debug. Adding new page types is straightforward since you don’t need to touch existing handlers. You can also see exactly which handler processes which URL, and Crawlee handles the request queue automatically.
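Under the hood, the router pattern boils down to a map from labels to handlers. Here's a minimal, framework-free sketch of the idea (the MiniRouter class is my own illustration, not Crawlee's actual implementation, which adds default handlers, typed contexts, and more):

```typescript
// A label-to-handler map: the essence of the router pattern.
type Handler = (url: string) => Promise<void>;

class MiniRouter {
  private handlers = new Map<string, Handler>();

  addHandler(label: string, handler: Handler): void {
    this.handlers.set(label, handler);
  }

  async dispatch(label: string, url: string): Promise<void> {
    const handler = this.handlers.get(label);
    if (!handler) throw new Error(`No handler registered for label "${label}"`);
    await handler(url);
  }
}

// Usage: each request carries a label, and the router picks the matching handler.
const mini = new MiniRouter();
const visited: string[] = [];
mini.addHandler('landing', async (url) => { visited.push(`landing:${url}`); });
mini.addHandler('item', async (url) => { visited.push(`item:${url}`); });

await mini.dispatch('landing', 'https://example.com/module');
await mini.dispatch('item', 'https://example.com/module/page-1');
```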

2. Browser Session Reuse (Authentication)

Most learning platforms require authentication. You have two options:

Option 1: Automate login (fragile)

  • Breaks when UI changes
  • Fails with 2FA/MFA
  • May trigger security alerts

Option 2: Reuse existing browser session (what I chose)

  • Use your already-authenticated browser profile
  • No need to handle login flows
  • Works with any auth method (SSO, 2FA, etc.)

Here’s how to configure Playwright to use your existing browser profile:

import { chromium } from 'playwright';
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
  launchContext: {
    launcher: chromium,
    // Point to your existing browser profile
    userDataDir: '/path/to/browser/profile',
    launchOptions: {
      headless: false,  // Visible browser for debugging
      args: ['--profile-directory=Default']
    }
  },
  requestHandler: router,
});

Finding your browser profile path:

Browser  | macOS Path
Chrome   | ~/Library/Application Support/Google/Chrome
Chromium | ~/Library/Application Support/Chromium
Firefox  | ~/Library/Application Support/Firefox/Profiles/[profile-name]

Browser  | Windows Path
Chrome   | %LOCALAPPDATA%\Google\Chrome\User Data
Firefox  | %APPDATA%\Mozilla\Firefox\Profiles\[profile-name]
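To avoid hardcoding these paths, you can resolve them at runtime. Here's a hedged sketch for Chrome (chromeProfileDir is a hypothetical helper I'm introducing for illustration; the Linux fallback isn't covered by the tables above):

```typescript
import os from 'node:os';
import path from 'node:path';

// Hypothetical helper: resolves the default Chrome profile directory for the
// current (or a given) platform, using the paths from the tables above.
function chromeProfileDir(platform: NodeJS.Platform = os.platform()): string {
  const home = os.homedir();
  switch (platform) {
    case 'darwin':
      return path.join(home, 'Library', 'Application Support', 'Google', 'Chrome');
    case 'win32':
      // %LOCALAPPDATA% usually resolves to <home>\AppData\Local
      return path.join(
        process.env.LOCALAPPDATA ?? path.join(home, 'AppData', 'Local'),
        'Google', 'Chrome', 'User Data'
      );
    default:
      // Common Linux default; verify on your own machine
      return path.join(home, '.config', 'google-chrome');
  }
}
```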

Security considerations:

  • Only use this on your own machine
  • Never commit profile paths to version control
  • This pattern is for personal automation tools, not production apps

3. PDF Generation with Playwright

Playwright’s page.pdf() method generates PDFs directly from rendered pages:

const pdfBuffer = await page.pdf({
  format: 'A4',
  printBackground: true,  // Include background colors/images
  margin: {
    top: '20px',
    bottom: '20px',
    left: '20px',
    right: '20px'
  }
});

Key considerations:

  • Pages render differently in print mode (CSS @media print)
  • Some elements may be hidden or styled differently
  • Background images require printBackground: true
  • The returned value is a Node.js Buffer (a Uint8Array subclass), ready for file operations

I stored each PDF buffer in memory for later merging:

const tempBuffers: Uint8Array[] = [];

router.addHandler('module_item_page', async ({ page }) => {
  const pdfBuffer = await page.pdf({ format: 'A4', printBackground: true });
  tempBuffers.push(pdfBuffer);
});

4. PDF Merging with pdf-lib

Having separate PDFs per page isn’t ideal. I wanted a single document for the entire module. Enter pdf-lib—a pure JavaScript library for PDF manipulation.

import { PDFDocument } from 'pdf-lib';
import fs from 'fs-extra';

// Create a new PDF document
const mergedPdf = await PDFDocument.create();

// Merge all collected buffers
for (const buf of tempBuffers) {
  // Load the source PDF
  const donor = await PDFDocument.load(buf);

  // Copy all pages from source to merged document
  const copiedPages = await mergedPdf.copyPages(donor, donor.getPageIndices());

  // Add each page to the merged document
  copiedPages.forEach(page => mergedPdf.addPage(page));
}

// Save the final merged PDF
const finalBytes = await mergedPdf.save();
await fs.writeFile('merged-output.pdf', finalBytes);

I chose pdf-lib because it’s pure JavaScript with no native dependencies, so it works seamlessly in both Node.js and browsers. It preserves page content and formatting when merging documents, and you can also add metadata, watermarks, or a table of contents if needed.

Important: Page order matters! The buffers must be merged in the correct sequence. Crawlee processes requests concurrently, so completion order isn’t guaranteed; in a production version you’d want to track each page’s original position.
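A sketch of that fix, assuming each enqueued request carries its original list position (for example via Crawlee's request.userData — the exact plumbing is up to you); only the pure collect-and-sort logic is shown here:

```typescript
// Order-preserving collection. Each entry records where its link appeared
// on the landing page, so buffers can be restored to that order later.
interface OrderedBuffer {
  order: number;        // position of the link on the landing page
  buffer: Uint8Array;   // output of page.pdf()
}

const collected: OrderedBuffer[] = [];

// In the item handler you would push { order: request.userData.order, buffer }.
// Requests may finish in any order:
collected.push({ order: 2, buffer: new Uint8Array([2]) });
collected.push({ order: 0, buffer: new Uint8Array([0]) });
collected.push({ order: 1, buffer: new Uint8Array([1]) });

// Sort by original position before handing the buffers to pdf-lib for merging.
const ordered = collected
  .sort((a, b) => a.order - b.order)
  .map((entry) => entry.buffer);
```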

5. DOM Waiting Strategies

Modern web apps load content dynamically. If you try to extract content immediately, you’ll get an empty page. Here are the waiting strategies I used:

Wait for a specific element:

await page.waitForSelector('.page-content');

Wait for network activity to settle:

await page.waitForLoadState('networkidle');

Wait for the page load event:

await page.waitForLoadState('load');

Parse with Cheerio after loading:

const $ = await parseWithCheerio();
const links = $('a.content-link').map((i, el) => $(el).attr('href')).get();

Debugging tip: Run with headless: false to watch what’s happening. You’ll immediately see if content isn’t loading or if you’re waiting for the wrong element.
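All of these strategies are variations of one idea: block until a condition holds or a timeout expires. A generic polling sketch of that idea (my own illustration, not Playwright's actual implementation, which uses smarter event-driven waiting):

```typescript
// Poll a condition until it holds or the deadline passes.
async function waitUntil(
  condition: () => boolean | Promise<boolean>,
  timeoutMs = 5000,
  intervalMs = 100,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await condition()) return;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Condition not met within ${timeoutMs}ms`);
}
```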

6. Building the CLI with yargs

I wanted a proper command-line interface, not hardcoded values. yargs makes this easy:

import yargs from 'yargs';
import { hideBin } from 'yargs/helpers';

interface Args {
  course: string;
  module: string;
  output?: string;
}

const argv = yargs(hideBin(process.argv))
  .option('course', {
    alias: 'c',
    type: 'string',
    demandOption: true,
    description: 'Course identifier',
  })
  .option('module', {
    alias: 'm',
    type: 'string',
    demandOption: true,
    description: 'Module identifier',
  })
  .option('output', {
    alias: 'o',
    type: 'string',
    default: './exports',
    description: 'Output directory',
  })
  .parseSync() as Args;

Now the tool can be invoked with:

npx tsx src/index.ts --course 123 --module 456 --output ./my-exports

Challenges & Solutions

Challenge 1: Dynamic Content Loading

Problem: Pages loaded content via JavaScript after the initial HTML.
Solution: Use waitForSelector() to wait for specific elements before processing.

Challenge 2: Maintaining PDF Order

Problem: Crawlee’s queue doesn’t guarantee processing order.
Solution: Store buffers with their original index, then sort before merging. (In my MVP, I accepted the limitation—a future improvement!)

Challenge 3: Session Authentication

Problem: Content required login, and automating auth was fragile.
Solution: Reuse the existing browser profile with userDataDir.

Challenge 4: Print Styling

Problem: Some pages looked different when converted to PDF.
Solution: Use printBackground: true and accept some styling differences.

The Result

The final tool:

  • Accepts course and module IDs via CLI
  • Crawls all pages in the specified module
  • Generates individual PDFs from each page
  • Merges everything into a single, organized PDF
  • Outputs to a structured directory
$ npx tsx src/index.ts -c 123 -m 456

Merged PDF saved to ./exports/course_123/module_456/course_123_module_456.pdf
Exported module 456 of course 123 to ./exports/course_123/module_456

Conclusion

This project taught me practical browser automation skills that transfer to many domains. Playwright doubles as a testing framework, so these skills apply directly to end-to-end testing. The Crawlee patterns work for any web scraping project, and the general techniques apply to any repetitive web task you want to automate. The pdf-lib knowledge enables custom document generation workflows.

The combination of Crawlee’s crawling infrastructure, Playwright’s browser control, and pdf-lib’s document manipulation created a versatile automation pipeline that I’ve already reused in other projects.

Resources