# Building a Document Export Tool with Playwright and Crawlee
How I automated exporting online course materials to PDF using Playwright for browser automation, Crawlee for web crawling, and pdf-lib for document merging.
As an online learner, I found myself constantly switching between tabs to review course materials. I wanted everything in one place: a single PDF I could read offline, annotate, and search. Rather than spend hours manually saving each page, I saw an opportunity to learn browser automation. Here’s how I built a document export tool using Playwright, Crawlee, and pdf-lib.
## The Problem
Online learning platforms are great for delivering content, but they’re not always optimized for offline study. I had dozens of pages scattered across different modules, and I wanted to:
- Consolidate all materials into a single, searchable PDF
- Study offline without needing an internet connection
- Annotate freely using my preferred PDF reader
- Learn automation skills that transfer to other projects
Instead of manually saving each page (which would take hours), I decided to automate the entire process.
## The Tech Stack
This project gave me hands-on experience with several portfolio-worthy tools:
| Tool | What It Does | What I Learned |
|---|---|---|
| Crawlee | Web crawling framework | Queue management, request handling, routing |
| Playwright | Browser automation | Headless browsers, page.pdf(), session reuse |
| pdf-lib | PDF manipulation | Merging documents, working with binary buffers |
| yargs | CLI argument parsing | Building user-friendly command-line tools |
| TypeScript | Type safety | Interfaces, async patterns, strict typing |
## Architecture Overview

```mermaid
flowchart LR
  subgraph Input
    CLI[CLI Arguments]
    Browser[Browser Profile]
  end
  subgraph Crawlee
    Router[Router]
    Landing[Landing Page Handler]
    Item[Item Page Handler]
    Queue[Request Queue]
  end
  subgraph Output
    Buffers[PDF Buffers]
    Merge[pdf-lib Merge]
    Final[Final PDF]
  end
  CLI --> Router
  Browser --> Router
  Router --> Landing
  Landing -->|enqueueLinks| Queue
  Queue --> Item
  Item -->|page.pdf| Buffers
  Buffers --> Merge
  Merge --> Final
```
## Technical Deep Dives

### 1. Crawlee & The Router Pattern
Crawlee is Apify’s open-source web crawling framework. It handles the messy parts of crawling—request queuing, retries, rate limiting—so you can focus on the extraction logic.
The key abstraction is the router pattern. Instead of one giant handler, you define specialized handlers for different page types:
```typescript
import { createPlaywrightRouter, PlaywrightCrawler } from 'crawlee';

const router = createPlaywrightRouter();

// Handler for the landing page that lists all items
router.addHandler('module_landing_page', async ({ page, enqueueLinks, parseWithCheerio }) => {
  // Wait for the content wrapper to ensure the page is loaded
  await page.waitForSelector('[data-main-content]');

  // Parse the page with Cheerio (jQuery-like syntax)
  const $ = await parseWithCheerio();

  // Extract links and add them to the crawl queue
  const links = $('.list-item a')
    .map((i, el) => $(el).attr('href'))
    .get();

  await enqueueLinks({
    urls: links,
    label: 'module_item_page', // Route these to a different handler
  });
});

// Handler for individual content pages
router.addHandler('module_item_page', async ({ page, log }) => {
  await page.waitForSelector('.page-content');
  await page.waitForLoadState('load');

  // Process this page (e.g., generate PDF)
  const pdfBuffer = await page.pdf({ format: 'A4', printBackground: true });
  // Store buffer for later merging...
});
```
This pattern works well because each page type has its own isolated logic, making the code easier to maintain and debug. Adding new page types is straightforward since you don’t need to touch existing handlers. You can also see exactly which handler processes which URL, and Crawlee handles the request queue automatically.
### 2. Browser Session Reuse (Authentication)
Most learning platforms require authentication. You have two options:
Option 1: Automate login (fragile)
- Breaks when UI changes
- Fails with 2FA/MFA
- May trigger security alerts
Option 2: Reuse existing browser session (what I chose)
- Use your already-authenticated browser profile
- No need to handle login flows
- Works with any auth method (SSO, 2FA, etc.)
Here’s how to configure Playwright to use your existing browser profile:
```typescript
import { chromium } from 'playwright';
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
  launchContext: {
    launcher: chromium,
    // Point to your existing browser profile
    userDataDir: '/path/to/browser/profile',
    launchOptions: {
      headless: false, // Visible browser for debugging
      args: ['--profile-directory=Default'],
    },
  },
  requestHandler: router,
});
```
Finding your browser profile path:
| Browser | macOS Path |
|---|---|
| Chrome | ~/Library/Application Support/Google/Chrome |
| Chromium | ~/Library/Application Support/Chromium |
| Firefox | ~/Library/Application Support/Firefox/Profiles/[profile-name] |

| Browser | Windows Path |
|---|---|
| Chrome | %LOCALAPPDATA%\Google\Chrome\User Data |
| Firefox | %APPDATA%\Mozilla\Firefox\Profiles\[profile-name] |
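These defaults can also be resolved programmatically. A minimal sketch (the function name is mine; the macOS and Windows values mirror the tables above, and the Linux fallback is my assumption):

```typescript
import os from 'node:os';
import path from 'node:path';

// Default Chrome user-data directory per platform.
// macOS and Windows values follow the tables above; the Linux path is assumed.
function defaultChromeProfileDir(): string {
  switch (process.platform) {
    case 'darwin': // macOS
      return path.join(os.homedir(), 'Library', 'Application Support', 'Google', 'Chrome');
    case 'win32': // Windows
      return path.join(process.env.LOCALAPPDATA ?? '', 'Google', 'Chrome', 'User Data');
    default: // Linux and others (assumed)
      return path.join(os.homedir(), '.config', 'google-chrome');
  }
}
```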
Security considerations:
- Only use this on your own machine
- Never commit profile paths to version control
- This pattern is for personal automation tools, not production apps
### 3. PDF Generation with Playwright
Playwright’s `page.pdf()` method generates PDFs directly from rendered pages:
```typescript
const pdfBuffer = await page.pdf({
  format: 'A4',
  printBackground: true, // Include background colors/images
  margin: {
    top: '20px',
    bottom: '20px',
    left: '20px',
    right: '20px',
  },
});
```
Key considerations:
- Pages render differently in print mode (CSS `@media print`); some elements may be hidden or styled differently
- Background images require `printBackground: true`
- The returned buffer is a `Uint8Array` ready for file operations
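If the print stylesheet hides content you need, Playwright can emulate screen media before exporting. A sketch using a structural type in place of Playwright's `Page` so it compiles standalone (the helper name is mine; `emulateMedia` is the real Playwright API):

```typescript
// Structural stand-in for the parts of Playwright's Page used here,
// so this sketch does not require the playwright package to compile.
interface PdfPage {
  emulateMedia(opts: { media: 'screen' | 'print' }): Promise<void>;
  pdf(opts: { format: string; printBackground: boolean }): Promise<Uint8Array>;
}

// Hypothetical helper: render with screen CSS instead of @media print rules,
// then export the page as a PDF buffer.
async function exportWithScreenStyles(page: PdfPage): Promise<Uint8Array> {
  await page.emulateMedia({ media: 'screen' });
  return page.pdf({ format: 'A4', printBackground: true });
}
```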
I stored each PDF buffer in memory for later merging:
```typescript
const tempBuffers: Uint8Array[] = [];

router.addHandler('module_item_page', async ({ page }) => {
  const pdfBuffer = await page.pdf({ format: 'A4', printBackground: true });
  tempBuffers.push(pdfBuffer);
});
```
### 4. PDF Merging with pdf-lib
Having separate PDFs per page isn’t ideal. I wanted a single document for the entire module. Enter pdf-lib—a pure JavaScript library for PDF manipulation.
```typescript
import { PDFDocument } from 'pdf-lib';
import fs from 'fs-extra';

// Create a new PDF document
const mergedPdf = await PDFDocument.create();

// Merge all collected buffers
for (const buf of tempBuffers) {
  // Load the source PDF
  const donor = await PDFDocument.load(buf);

  // Copy all pages from source to merged document
  const copiedPages = await mergedPdf.copyPages(donor, donor.getPageIndices());

  // Add each page to the merged document
  copiedPages.forEach(page => mergedPdf.addPage(page));
}

// Save the final merged PDF
const finalBytes = await mergedPdf.save();
await fs.writeFile('merged-output.pdf', finalBytes);
```
I chose pdf-lib because it’s pure JavaScript with no native dependencies, so it works seamlessly in both Node.js and browsers. It preserves formatting, links, and embedded content when merging documents. You can also add metadata, watermarks, and a table of contents if needed.
Important: Page order matters! The buffers must be merged in the correct sequence, but Crawlee may process requests concurrently, so completion order isn’t guaranteed. In a production version you’d want to track the original order.
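One way to track the original order, sketched under the assumption that each request carries its landing-page index in `userData` (attached via Crawlee’s `transformRequestFunction`, or by enqueuing requests individually):

```typescript
// A PDF buffer paired with the position of its link on the landing page
interface OrderedBuffer {
  index: number;
  buffer: Uint8Array;
}

const orderedBuffers: OrderedBuffer[] = [];

// In the item handler you would push something like:
//   orderedBuffers.push({ index: request.userData.index, buffer: await page.pdf(...) });

// Restore the landing-page order before merging
function inOriginalOrder(buffers: OrderedBuffer[]): Uint8Array[] {
  return [...buffers].sort((a, b) => a.index - b.index).map(b => b.buffer);
}
```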
### 5. DOM Waiting Strategies
Modern web apps load content dynamically. If you try to extract content immediately, you’ll get an empty page. Here are the waiting strategies I used:
Wait for a specific element:

```typescript
await page.waitForSelector('.page-content');
```

Wait for network activity to settle:

```typescript
await page.waitForLoadState('networkidle');
```

Wait for the page load event:

```typescript
await page.waitForLoadState('load');
```

Parse with Cheerio after loading:

```typescript
const $ = await parseWithCheerio();
const links = $('a.content-link').map((i, el) => $(el).attr('href')).get();
```

Debugging tip: Run with `headless: false` to watch what’s happening. You’ll immediately see if content isn’t loading or if you’re waiting for the wrong element.
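These strategies can be combined into a single small helper. A sketch using a structural type so it compiles without Playwright installed (the helper name and default timeout are my choices):

```typescript
// Structural stand-in for the parts of Playwright's Page used here
interface WaitablePage {
  waitForLoadState(state: 'load' | 'networkidle'): Promise<void>;
  waitForSelector(selector: string, opts?: { timeout?: number }): Promise<unknown>;
}

// Hypothetical helper: wait for the load event, then for app-rendered content
async function waitForContent(
  page: WaitablePage,
  selector: string,
  timeoutMs = 10_000,
): Promise<void> {
  await page.waitForLoadState('load');                         // static HTML done
  await page.waitForSelector(selector, { timeout: timeoutMs }); // dynamic content done
}
```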
### 6. Building the CLI with yargs
I wanted a proper command-line interface, not hardcoded values. yargs makes this easy:
```typescript
import yargs from 'yargs';
import { hideBin } from 'yargs/helpers';

interface Args {
  course: string;
  module: string;
  output: string; // always set, thanks to the default below
}

const argv = yargs(hideBin(process.argv))
  .option('course', {
    alias: 'c',
    type: 'string',
    demandOption: true,
    description: 'Course identifier',
  })
  .option('module', {
    alias: 'm',
    type: 'string',
    demandOption: true,
    description: 'Module identifier',
  })
  .option('output', {
    alias: 'o',
    type: 'string',
    default: './exports',
    description: 'Output directory',
  })
  .parseSync() as Args;
```
Now the tool can be invoked with:
```shell
npx tsx src/index.ts --course 123 --module 456 --output ./my-exports
```
## Challenges & Solutions

### Challenge 1: Dynamic Content Loading
Problem: Pages loaded content via JavaScript after the initial HTML.
Solution: Use `waitForSelector()` to wait for specific elements before processing.

### Challenge 2: Maintaining PDF Order
Problem: Crawlee’s queue doesn’t guarantee processing order.
Solution: Store buffers with their original index, then sort before merging. (In my MVP, I accepted the limitation—a future improvement!)

### Challenge 3: Session Authentication
Problem: Content required login, and automating auth was fragile.
Solution: Reuse the existing browser profile with `userDataDir`.

### Challenge 4: Print Styling
Problem: Some pages looked different when converted to PDF.
Solution: Use `printBackground: true` and accept some styling differences.
## The Result
The final tool:
- Accepts course and module IDs via CLI
- Crawls all pages in the specified module
- Generates individual PDFs from each page
- Merges everything into a single, organized PDF
- Outputs to a structured directory
```shell
$ npx tsx src/index.ts -c 123 -m 456
✅ Merged PDF saved to ./exports/course_123/module_456/course_123_module_456.pdf
Exported module 456 of course 123 to ./exports/course_123/module_456
```
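The nested directory layout shown above can be derived from the CLI arguments. A minimal sketch (the function name is mine; the path segments mirror the example output):

```typescript
import path from 'node:path';

// Hypothetical helper: build the export directory and merged-PDF path
// from the course and module identifiers.
function outputPaths(outputDir: string, courseId: string, moduleId: string) {
  const dir = path.join(outputDir, `course_${courseId}`, `module_${moduleId}`);
  const file = path.join(dir, `course_${courseId}_module_${moduleId}.pdf`);
  return { dir, file };
}
```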
## Conclusion
This project taught me practical browser automation skills that transfer to many domains. Playwright doubles as a testing framework, so these skills apply directly to end-to-end testing. The Crawlee patterns work for any web scraping project, and the general techniques apply to any repetitive web task you want to automate. The pdf-lib knowledge enables custom document generation workflows.
The combination of Crawlee’s crawling infrastructure, Playwright’s browser control, and pdf-lib’s document manipulation created a versatile automation pipeline that I’ve already reused in other projects.