Multi-Page Document Processing

FileLens can generate a preview image for every page of a multi-page document, including PDF, DOC, DOCX, PPT, PPTX, and other document formats. This guide covers supported formats, processing options, output file naming, worked examples, and best practices.


Supported Formats

Document Types

FileLens supports multi-page processing for the following document formats:

PDF Documents

  • Native PDF processing
  • Preserves vector graphics when possible
  • Supports password-protected PDFs
  • Handles complex layouts and fonts

Microsoft Office

  • DOC, DOCX (Word documents)
  • XLS, XLSX (Excel spreadsheets)
  • PPT, PPTX (PowerPoint presentations)
  • Converts via LibreOffice pipeline

OpenDocument Formats

  • ODT (Text documents)
  • ODS (Spreadsheets)
  • ODP (Presentations)
  • Full compatibility with LibreOffice

Other Formats

  • RTF (Rich Text Format)
  • TXT (Plain text files)
  • CSV (Comma-separated values)
  • Many other document formats via LibreOffice

Processing Pipeline


Processing Options

Page Control

Set the all_pages option to true to generate a preview for every page of the document:

{
  "input": "https://example.com/document.pdf",
  "output_format": "jpg",
  "options": {
    "all_pages": true,
    "width": 800,
    "height": 600
  }
}

Quality Settings

Different quality settings work better for different document types:

The quality option is an integer. Typical ranges:

  • High quality (90-100): Best for presentations and graphics-heavy documents
  • Medium quality (70-89): Good balance for text documents
  • Low quality (50-69): Fastest processing, suitable for thumbnails

Resolution Settings

Choose appropriate dimensions based on your use case:

Thumbnail Size

  • Width: 200-400px
  • Height: 200-400px
  • Best for file browsers and quick previews

Standard Preview

  • Width: 600-800px
  • Height: 800-1000px
  • Good for web display and general viewing

High Resolution

  • Width: 1200-1920px
  • Height: 1600-2560px
  • Best for detailed viewing and printing

Custom Aspect Ratios

  • Maintain document proportions
  • Consider target display requirements
  • Balance file size vs. quality
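
One way to maintain document proportions is to scale the page to fit inside a target bounding box before choosing width and height. The sketch below is only an illustration: fitWithin is a hypothetical helper, not part of FileLens, and the page size (US Letter, 612 x 792 points) is an assumed example. Whether the service itself preserves aspect ratio when given both dimensions is not stated in this guide.

```javascript
// Hypothetical helper (not a FileLens API): scale a page to fit inside a
// bounding box without changing its aspect ratio.
function fitWithin(pageWidth, pageHeight, maxWidth, maxHeight) {
  if (pageWidth <= 0 || pageHeight <= 0) {
    throw new RangeError('page dimensions must be positive');
  }
  // The smaller of the two ratios is the largest scale that fits both axes.
  const scale = Math.min(maxWidth / pageWidth, maxHeight / pageHeight);
  return {
    width: Math.round(pageWidth * scale),
    height: Math.round(pageHeight * scale),
  };
}

// US Letter portrait inside the 800 x 1000 "Standard Preview" box
console.log(fitWithin(612, 792, 800, 1000)); // { width: 773, height: 1000 }
```

The resulting width and height can then be passed in the options object of a preview request.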

File Naming

Naming Convention

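Generated files follow a consistent naming pattern. Synchronous requests produce:

{type}_{timestamp}_{process_id}_{hash}_{page_number}.{extension}

Asynchronous jobs place the job UUID before the timestamp and omit the hash, as the examples below show:

{type}_{process_id}_{timestamp}_{page_number}.{extension}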

Synchronous Files

sync_1641312000_12345_abc1_1.jpg    # Page 1
sync_1641312000_12345_abc1_2.jpg    # Page 2
sync_1641312000_12345_abc1_3.jpg    # Page 3

Asynchronous Files

result_550e8400-e29b-41d4-a716-446655440000_1641312000_1.png    # Page 1
result_550e8400-e29b-41d4-a716-446655440000_1641312000_2.png    # Page 2
result_550e8400-e29b-41d4-a716-446655440000_1641312000_3.png    # Page 3

File Components

  • type (string): sync for synchronous requests, result for asynchronous jobs
  • timestamp (string): Unix timestamp when processing started
  • process_id (string): Process ID (sync) or job UUID (async)
  • hash (string): Short hash of the input (sync only)
  • page_number (integer): Sequential page number starting from 1
  • extension (string): File extension matching output_format
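
The components above can be recovered from a file name with two regular expressions. This is a minimal sketch: the patterns are inferred from the sample names in this guide rather than from an official grammar, and parsePreviewName is a hypothetical helper.

```javascript
// Patterns inferred from the sample file names in this guide (an assumption,
// not an official specification).
const SYNC_NAME = /^sync_(\d+)_(\d+)_([0-9a-z]+)_(\d+)\.([0-9a-z]+)$/i;
const ASYNC_NAME = /^result_([0-9a-f-]{36})_(\d+)_(\d+)\.([0-9a-z]+)$/i;

// Hypothetical helper: split a generated file name (or download URL path)
// into its components. Returns null if the name matches neither pattern.
function parsePreviewName(nameOrPath) {
  const name = nameOrPath.split('/').pop();

  let m = SYNC_NAME.exec(name);
  if (m) {
    return {
      type: 'sync', timestamp: m[1], processId: m[2], hash: m[3],
      pageNumber: Number(m[4]), extension: m[5],
    };
  }

  m = ASYNC_NAME.exec(name);
  if (m) {
    // Async names carry the job UUID first and have no hash component.
    return {
      type: 'result', processId: m[1], timestamp: m[2], hash: null,
      pageNumber: Number(m[3]), extension: m[4],
    };
  }

  return null;
}

console.log(parsePreviewName('/download/sync_1641312000_12345_abc1_2.jpg'));
```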


Examples

PDF Multi-Page Processing

curl -X POST http://localhost:3000/preview \
  -H "Content-Type: application/json" \
  -d '{
    "input": "https://example.com/report.pdf",
    "output_format": "jpg",
    "options": {
      "width": 800,
      "height": 600,
      "quality": 90,
      "all_pages": true
    }
  }'

Response:

{
  "success": true,
  "message": "Preview generated successfully",
  "preview_urls": [
    "/download/sync_1641312000_12345_def4_1.jpg",
    "/download/sync_1641312000_12345_def4_2.jpg",
    "/download/sync_1641312000_12345_def4_3.jpg",
    "/download/sync_1641312000_12345_def4_4.jpg",
    "/download/sync_1641312000_12345_def4_5.jpg"
  ],
  "total_pages": 5,
  "job_id": null
}

PowerPoint Presentation Processing

# Submit job
curl -X POST http://localhost:3000/preview/async \
  -H "Content-Type: application/json" \
  -d '{
    "input": "https://example.com/presentation.pptx",
    "output_format": "png",
    "options": {
      "width": 1920,
      "height": 1080,
      "quality": 95,
      "all_pages": true
    }
  }'

# Check status (replace the UUID with the job_id returned by the submit call)
curl http://localhost:3000/preview/status/550e8400-e29b-41d4-a716-446655440000

# Download slides
curl http://localhost:3000/download/result_550e8400-e29b-41d4-a716-446655440000_1641312000_1.png -o slide1.png
curl http://localhost:3000/download/result_550e8400-e29b-41d4-a716-446655440000_1641312000_2.png -o slide2.png
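
In a script, the "check status" step is usually repeated until the job finishes. The sketch below shows a generic polling loop with a timeout. Because this guide does not describe the status response, the completion check is supplied by the caller; pollUntil is a hypothetical helper, and the field names in the commented usage are assumptions.

```javascript
// Hypothetical helper: call `check` repeatedly until it resolves to a
// non-null value, waiting `intervalMs` between attempts and giving up
// once another wait would exceed `timeoutMs`.
async function pollUntil(check, { intervalMs = 1000, timeoutMs = 60000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const result = await check();
    if (result !== null && result !== undefined) return result;
    if (Date.now() + intervalMs > deadline) {
      throw new Error(`job did not finish within ${timeoutMs} ms`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Example wiring (the `status` field and 'completed' value are assumptions
// about the status response, which this guide does not document):
//
// const job = await pollUntil(async () => {
//   const res = await fetch(`http://localhost:3000/preview/status/${jobId}`);
//   const body = await res.json();
//   return body.status === 'completed' ? body : null;
// }, { intervalMs: 2000, timeoutMs: 120000 });
```

Passing the check as a function keeps the loop independent of the response format, so only the small callback needs to change once the actual fields are known.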

Excel Spreadsheet Processing

<?php
class ExcelProcessor {
    private $client;

    public function __construct() {
        $this->client = new FileLensClient();
    }

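    // Note: FileLensClient is assumed to be defined elsewhere in your project.
    public function processSpreadsheet($input, $outputDir = './sheets/') {
        $result = $this->client->generatePreview($input, 'png', [
            'width' => 1200,
            'height' => 900,
            'quality' => 85,
            'all_pages' => true
        ]);

        // Create the output directory if it does not exist yet
        if (!is_dir($outputDir) && !mkdir($outputDir, 0755, true) && !is_dir($outputDir)) {
            throw new RuntimeException("Could not create directory: {$outputDir}");
        }

        $downloadedSheets = [];
        foreach ($result['preview_urls'] as $index => $url) {
            $sheetNumber = $index + 1;
            $outputPath = rtrim($outputDir, '/') . "/sheet_{$sheetNumber}.png";

            // file_get_contents returns false on failure, so check before writing
            $fileContent = file_get_contents('http://localhost:3000' . $url);
            if ($fileContent === false) {
                throw new RuntimeException("Failed to download {$url}");
            }
            if (file_put_contents($outputPath, $fileContent) === false) {
                throw new RuntimeException("Failed to write {$outputPath}");
            }

            $downloadedSheets[] = [
                'sheet' => $sheetNumber,
                'file' => $outputPath,
                'url' => $url
            ];
        }

        return [
            'total_sheets' => $result['total_pages'],
            'sheets' => $downloadedSheets
        ];
    }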
}

// Usage
$processor = new ExcelProcessor();
$result = $processor->processSpreadsheet('https://example.com/data.xlsx');

echo "Processed {$result['total_sheets']} sheets:\n";
foreach ($result['sheets'] as $sheet) {
    echo "Sheet {$sheet['sheet']}: {$sheet['file']}\n";
}
?>

Best Practices

Performance Optimization

File Size Considerations

  • Documents > 50 pages: Use async processing
  • Large presentations: Use async processing
  • Complex spreadsheets: Consider smaller preview sizes
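
The 50-page guideline can be applied when choosing which endpoint to call. A minimal sketch, assuming the page count is known before submission; chooseEndpoint is a hypothetical helper, and falling back to async when the count is unknown is a design choice of this example, not documented behavior.

```javascript
// Threshold taken from the guideline above.
const ASYNC_PAGE_THRESHOLD = 50;

// Hypothetical helper: pick the endpoint path for a document.
function chooseEndpoint(pageCount) {
  // If the page count is unknown, prefer async so a large document
  // cannot run into a synchronous timeout.
  if (!Number.isInteger(pageCount) || pageCount > ASYNC_PAGE_THRESHOLD) {
    return '/preview/async';
  }
  return '/preview';
}

console.log(chooseEndpoint(12));  // /preview
console.log(chooseEndpoint(120)); // /preview/async
```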

Quality vs. Speed

  • Lower quality (50-69) for thumbnails
  • Higher quality (90-95) for detailed viewing
  • Balance based on use case requirements

Batch Processing

  • Process multiple documents concurrently
  • Use connection pooling for efficiency
  • Implement proper error handling for batches
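
Concurrent processing with per-document error handling can be sketched as a small worker pool. mapWithLimit is a hypothetical helper, not part of any FileLens client. It caps the number of requests in flight and records each failure instead of aborting the whole batch.

```javascript
// Hypothetical helper: run `worker` over `items` with at most `limit`
// tasks in flight. Each result is { ok: true, value } or { ok: false, error }.
async function mapWithLimit(items, limit, worker) {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError('limit must be a positive integer');
  }
  const results = new Array(items.length);
  let next = 0;

  // Each runner pulls the next unclaimed index until none are left.
  async function run() {
    while (next < items.length) {
      const index = next++;
      try {
        results[index] = { ok: true, value: await worker(items[index], index) };
      } catch (error) {
        results[index] = { ok: false, error };
      }
    }
  }

  const runners = Array.from({ length: Math.min(limit, items.length) }, () => run());
  await Promise.all(runners);
  return results;
}
```

A caller would pass a worker that submits one document and resolves to its response, then inspect the entries with ok set to false to decide which documents to retry.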

Caching Strategy

  • Cache preview URLs to avoid reprocessing
  • Store metadata about processed documents
  • Implement TTL for cache invalidation
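
The caching points above can be illustrated with a small in-memory cache keyed by the request parameters. This is a sketch only: PreviewCache and cacheKey are hypothetical names, and a suitable TTL depends on how long the service keeps generated files, which this guide does not specify.

```javascript
// Hypothetical helper: build a stable key from the request parameters.
// Option keys are sorted so that property order does not change the key.
function cacheKey(input, outputFormat, options = {}) {
  const sorted = Object.keys(options).sort().map((k) => [k, options[k]]);
  return JSON.stringify([input, outputFormat, sorted]);
}

// Hypothetical in-memory cache with a fixed time-to-live per entry.
class PreviewCache {
  constructor(ttlMs, now = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, useful for testing
    this.entries = new Map();
  }

  set(key, previewUrls) {
    this.entries.set(key, { previewUrls, expiresAt: this.now() + this.ttlMs });
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    return entry.previewUrls;
  }
}
```

On a cache miss, the caller would request a new preview and store the returned preview_urls under the same key.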

Memory Management

// Requires Node.js 18+ for the built-in fetch
const fs = require('fs');
const path = require('path');

async function downloadLargeDocument(previewUrls, outputDir) {
  // Create output directory (no-op if it already exists)
  await fs.promises.mkdir(outputDir, { recursive: true });

  // Download pages in batches so only a few page buffers are in memory at once
  const batchSize = 5;
  for (let i = 0; i < previewUrls.length; i += batchSize) {
    const batch = previewUrls.slice(i, i + batchSize);

    await Promise.all(batch.map(async (url, index) => {
      const pageNumber = i + index + 1;
      // Keep the extension of the generated file (jpg, png, ...)
      const extension = path.extname(url) || '.jpg';
      const outputPath = path.join(outputDir, `page_${pageNumber}${extension}`);

      const response = await fetch(`http://localhost:3000${url}`);
      if (!response.ok) {
        throw new Error(`Failed to download page ${pageNumber}: HTTP ${response.status}`);
      }
      const buffer = Buffer.from(await response.arrayBuffer());
      await fs.promises.writeFile(outputPath, buffer);

      console.log(`Downloaded page ${pageNumber}`);
    }));

    // Small delay between batches
    if (i + batchSize < previewUrls.length) {
      await new Promise(resolve => setTimeout(resolve, 100));
    }
  }
}

Error Recovery

  • Page-level errors (info): If processing fails for specific pages, the service will still return successfully processed pages. Check the total_pages count against the number of URLs returned.
  • Memory limits (warning): Very large documents (>200 pages) may hit memory limits. Consider splitting into smaller batches or using lower resolution settings.
  • Timeout handling (error): Long processing times can cause timeouts. Use async processing for documents with >50 pages or complex layouts.
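
The page-level check can be sketched as follows. It assumes, based on the naming convention in this guide, that the page number is the last underscore-separated component of each file name. findMissingPages is a hypothetical helper.

```javascript
// Hypothetical helper: list the page numbers that are absent from
// preview_urls, given the total_pages reported by the service.
function findMissingPages(response) {
  const present = new Set();
  for (const url of response.preview_urls) {
    // Page number is the final "_<digits>" before the file extension.
    const m = /_(\d+)\.[0-9a-z]+$/i.exec(url);
    if (m) present.add(Number(m[1]));
  }

  const missing = [];
  for (let page = 1; page <= response.total_pages; page++) {
    if (!present.has(page)) missing.push(page);
  }
  return missing;
}

const partial = {
  total_pages: 5,
  preview_urls: [
    '/download/sync_1641312000_12345_def4_1.jpg',
    '/download/sync_1641312000_12345_def4_2.jpg',
    '/download/sync_1641312000_12345_def4_4.jpg',
  ],
};
console.log(findMissingPages(partial)); // [ 3, 5 ]
```

An empty result means every page was rendered. A non-empty one identifies the pages to retry or report.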

Use Case Examples

Document Viewer

  • Generate thumbnails for navigation
  • Use progressive loading for large documents
  • Implement zoom functionality with higher-resolution versions

Archive System

  • Process documents in background jobs
  • Store previews alongside metadata
  • Implement search within document content

Presentation Tools

  • Generate slide thumbnails for editing interface
  • Create animated previews from slide sequences
  • Export individual slides as images

Reporting Dashboard

  • Create visual summaries of document content
  • Generate thumbnail galleries
  • Provide quick document previews