ADR-005: 2-Second Document Processing Polling Interval

Status: Accepted
Date: 2024 (initial implementation)
Deciders: BrainDrive Team
Tags: performance, ux, async-processing, polling

Context

Document processing flow:

User uploads document (PDF, DOCX, etc.)
Backend processes document asynchronously
- Extracts text
- Chunks content
- Generates embeddings
- Indexes for search
Frontend polls for completion status
User can chat with document once processed

Processing times (observed):

Small documents (<5 pages): 10-30 seconds
Medium documents (5-50 pages): 30-90 seconds
Large documents (50-200 pages): 90-180 seconds

Constraints:

Can't use WebSockets (plugin architecture limitation)
Server-Sent Events (SSE) not supported for this endpoint
Must balance: responsiveness vs server load

Problem Statement

What polling interval provides best UX without overloading backend?

Key questions:

How fast should we poll? (interval)
When should we give up? (timeout)
How many errors before stopping? (error threshold)
How many concurrent uploads can we support?

Trade-offs:

Faster polling: Better UX, higher server load
Slower polling: Lower load, feels unresponsive
Longer timeout: Handles large docs, ties up resources longer
Shorter timeout: Quick failure, may abort valid processing

Decision

Chosen approach: 2-second interval with 2-minute timeout

Configuration (documentPolling.ts):

const POLL_INTERVAL = 2000;        // 2 seconds
const MAX_POLL_ATTEMPTS = 60;      // 60 attempts × 2s = 120s (2 minutes)
const ERROR_THRESHOLD = 5;         // Stop after 5 consecutive errors

// Not async: the function only schedules the timer and returns void
function startDocumentPolling(
  documentId: string,
  onStatusUpdate: (status: DocumentStatus) => void
): void {
  let attempts = 0;
  let consecutiveErrors = 0;

  const intervalId = setInterval(async () => {
    attempts++;

    // Timeout check
    if (attempts >= MAX_POLL_ATTEMPTS) {
      // stopDocumentPolling clears the interval and removes the tracking entry,
      // so the same document can be polled again later (e.g. on retry)
      stopDocumentPolling(documentId);
      onStatusUpdate({ status: 'timeout' });
      return;
    }

    try {
      const doc = await fetchDocumentStatus(documentId);
      consecutiveErrors = 0; // Reset on success

      if (doc.status === 'processed' || doc.status === 'failed') {
        stopDocumentPolling(documentId);
        onStatusUpdate(doc);
      }
    } catch (error) {
      consecutiveErrors++;

      // Error threshold check
      if (consecutiveErrors >= ERROR_THRESHOLD) {
        stopDocumentPolling(documentId);
        onStatusUpdate({ status: 'error', error });
      }
    }
  }, POLL_INTERVAL);

  // Track for cleanup
  activePollingIntervals.set(documentId, intervalId);
}

Rationale:

Why 2 seconds:

Fast enough: Feels responsive (users see progress within 2-4s)
Not excessive: 30 polls/min per upload (acceptable load)
Handles burst: 10 concurrent uploads = 300 polls/min (still OK)
Psychological: 2s feels active, 5s feels slow

Why 2-minute timeout:

Covers 95% of documents (90% finish < 90s)
Safety net: Most large docs finish, stuck docs fail within a bounded time
User patience: 2min is threshold before user gives up anyway

Why 5 consecutive errors:

Network blips: 1-2 errors tolerated (transient failures)
Real problems: 5 errors = something's broken, stop wasting resources
Recovery time: 5 errors × 2s = 10s to detect failure

Server load calculation:

Single upload: 60 polls over 2 minutes = 30 polls/min
10 concurrent: 300 polls/min = 5 polls/sec
50 concurrent: 1500 polls/min = 25 polls/sec

Backend capacity: ~100 polls/sec
Safe threshold: <50 concurrent uploads
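
The arithmetic above can be reproduced with a small helper. This is an illustrative sketch, not code from the repository: the function names are hypothetical, and the capacity constant is the approximate figure quoted in this ADR.

```typescript
// Values taken from the configuration and capacity estimate in this ADR.
const POLL_INTERVAL_MS = 2000;
const BACKEND_CAPACITY_POLLS_PER_SEC = 100; // approximate

/** Polls per second generated by `concurrentUploads` active pollers. */
function pollsPerSecond(
  concurrentUploads: number,
  intervalMs: number = POLL_INTERVAL_MS
): number {
  return concurrentUploads * (1000 / intervalMs);
}

/** Fraction of the estimated backend capacity consumed by polling. */
function capacityUsed(
  concurrentUploads: number,
  intervalMs: number = POLL_INTERVAL_MS
): number {
  return pollsPerSecond(concurrentUploads, intervalMs) / BACKEND_CAPACITY_POLLS_PER_SEC;
}

console.log(pollsPerSecond(10)); // 5
console.log(pollsPerSecond(50)); // 25
console.log(capacityUsed(50));   // 0.25
```

At the stated safe threshold of 50 concurrent uploads, polling uses roughly a quarter of the estimated capacity, which leaves headroom for other traffic.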

Consequences

Positive

✅ Responsive UX (2s feels active)
✅ Acceptable server load (5 polls/sec per 10 uploads)
✅ Handles large documents (2min timeout)
✅ Graceful error handling (5-error threshold)
✅ Simple implementation (setInterval)

Negative

❌ Wastes polls for fast docs (10s doc gets 5 polls)
❌ 2min timeout may be too short for very large docs (200+ pages)
❌ No backoff strategy (polls at constant rate)
❌ Network inefficient (could use WebSockets)
❌ Ties up client resources (interval keeps running)

Risks

Server overload: 100+ concurrent uploads
- Mitigation: Rate limiting on backend, upload queue
Zombie pollers: Intervals not cleaned up
- Mitigation: stopAllPolling() on component unmount
Fast docs waste polls: A poller that is not stopped keeps polling after a 10s document has finished
- Mitigation: Stop immediately on success (no waste after completion)
Slow network: Polls timeout before response arrives
- Mitigation: 10s fetch timeout, error threshold handles it
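
The source does not show how the 10s fetch timeout is implemented. One possible sketch, assuming a hypothetical `withTimeout` helper wrapped around the status request, is:

```typescript
/**
 * Rejects if `promise` does not settle within `ms` milliseconds.
 * Hypothetical helper for illustration; not taken from documentPolling.ts.
 */
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`Timed out after ${ms} ms`)),
      ms
    );
    promise.then(
      (value) => {
        clearTimeout(timer); // settled in time: cancel the pending rejection
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      }
    );
  });
}

// Possible usage inside the poll callback (fetchDocumentStatus as in this ADR):
//   const doc = await withTimeout(fetchDocumentStatus(documentId), 10000);
```

A rejection from this helper lands in the existing `catch` branch and counts toward the consecutive-error threshold. Note that it does not cancel the underlying request. If the request is made with `fetch`, passing an `AbortController` signal would cancel it as well.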

Neutral

Alternative: Exponential backoff (future enhancement)
Works well for expected use case (1-10 concurrent uploads)

Alternatives Considered

Alternative 1: 5-second interval

Description: Poll every 5 seconds instead of 2

Pros:

Lower server load (12 polls/min vs 30)
Better for very large docs
More efficient network usage

Cons:

Feels sluggish (5-10s before first status update)
Poor perceived performance
Users think app is frozen

Why rejected: UX too poor, 2-3x difference in responsiveness matters

Alternative 2: Exponential backoff

Description: Start fast (1s), slow down (2s, 4s, 8s...) over time

Pros:

Fast for quick docs (1s polls initially)
Efficient for slow docs (backs off to 8s+)
Reduced overall server load
Industry standard pattern

Cons:

More complex implementation
Harder to reason about total timeout
May feel inconsistent (fast → slow)

Why rejected: Added complexity, 2s constant is simpler and works well enough
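
For comparison, the schedule this alternative would produce can be sketched as follows. The function name and the 8s cap are assumptions for illustration; the ADR only describes the 1s, 2s, 4s, 8s progression.

```typescript
/**
 * Returns the delay (ms) before each poll, doubling from `initialMs`
 * up to `maxMs`, for as long as the cumulative wait fits in `budgetMs`.
 */
function backoffSchedule(
  initialMs: number,
  maxMs: number,
  budgetMs: number
): number[] {
  const delays: number[] = [];
  let delay = initialMs;
  let elapsed = 0;
  while (elapsed + delay <= budgetMs) {
    delays.push(delay);
    elapsed += delay;
    delay = Math.min(delay * 2, maxMs);
  }
  return delays;
}

// 1s start, 8s cap, same 2-minute budget as the chosen approach.
const schedule = backoffSchedule(1000, 8000, 120000);
console.log(schedule.slice(0, 5)); // [ 1000, 2000, 4000, 8000, 8000 ]
console.log(schedule.length);      // 17
```

Under these assumptions a document that takes the full two minutes costs 17 polls instead of about 60. The price is that status changes late in processing can be noticed up to 8s after they happen.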

Alternative 3: WebSockets / SSE

Description: Server pushes status updates to client

Pros:

Real-time updates (no polling)
Zero wasted requests
Scales better (one connection vs many polls)

Cons:

Requires WebSocket server infrastructure
Plugin architecture complexity (Module Federation + WS)
Fallback still needed (firewall/proxy issues)
More complex error handling

Why rejected: Infrastructure overhead, polling works for this use case

Alternative 4: 500ms interval (aggressive)

Description: Poll every 500ms for fast feedback

Pros:

Immediate feedback (sub-second)
Excellent perceived performance

Cons:

120 polls/min per upload (4x current)
Server overload risk (10 uploads = 1200 polls/min)
Wasted requests (most time waiting for processing)
Network inefficient

Why rejected: Server load too high, marginal UX gain

References

src/services/documentPolling.ts (implementation)
src/services/documentService.ts (usage)
src/collection-chat-view/components/DocumentManagerModal.tsx (UI integration)
Related: ADR-002 (evaluation polling uses similar pattern)

Implementation Notes

File paths affected:

src/services/documentPolling.ts - Core polling logic
src/services/documentService.ts - Upload + polling orchestration
src/collection-chat-view/components/DocumentManagerModal.tsx - UI

Polling state tracking:

// Prevent duplicate pollers for same document
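// ReturnType<typeof setInterval> type-checks in both browser (number) and Node (Timeout) typings
const activePollingIntervals = new Map<string, ReturnType<typeof setInterval>>();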

export function startDocumentPolling(
  documentId: string,
  onStatusUpdate: (status: DocumentStatus) => void
): void {
  // Check if already polling
  if (activePollingIntervals.has(documentId)) {
    console.warn(`Already polling document ${documentId}`);
    return;
  }

  const intervalId = setInterval(() => { /* poll logic */ }, POLL_INTERVAL);
  activePollingIntervals.set(documentId, intervalId);
}

export function stopDocumentPolling(documentId: string): void {
  const intervalId = activePollingIntervals.get(documentId);
  if (intervalId) {
    clearInterval(intervalId);
    activePollingIntervals.delete(documentId);
  }
}

export function stopAllPolling(): void {
  activePollingIntervals.forEach(intervalId => clearInterval(intervalId));
  activePollingIntervals.clear();
}

Critical cleanup pattern:

// Component unmount
componentWillUnmount() {
  stopAllPolling(); // CRITICAL: Prevent memory leaks
}

Status flow:

uploaded → processing → processed ✅
uploaded → processing → failed ❌
uploaded → (timeout after 2min) → timeout ⏱️
uploaded → (5 errors) → error 💥
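
The status flow can also be expressed as a union type with a terminal-state check. This is a sketch only: the status strings come from the flow above, while `PollingStatus`, `TERMINAL_STATUSES` and `isTerminal` are hypothetical names that do not appear in the source.

```typescript
// Every status that appears in the flow above.
type PollingStatus =
  | 'uploaded'
  | 'processing'
  | 'processed'
  | 'failed'
  | 'timeout'
  | 'error';

// Statuses after which polling must stop and the interval must be cleared.
const TERMINAL_STATUSES: ReadonlySet<PollingStatus> = new Set<PollingStatus>([
  'processed',
  'failed',
  'timeout',
  'error',
]);

/** True when no further polling should happen for this status. */
function isTerminal(status: PollingStatus): boolean {
  return TERMINAL_STATUSES.has(status);
}

console.log(isTerminal('processing')); // false
console.log(isTerminal('timeout'));    // true
```

Keeping the terminal states in one place makes it harder to forget an exit path when clearing the interval.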

UI patterns:

Show progress indicator while polling
Display status: "Processing...", "Completed", "Failed"
Allow retry on failure
Disable upload button during processing

Critical gotchas:

Must call stopAllPolling() on unmount (memory leak otherwise)
Check activePollingIntervals before starting (prevent duplicate pollers)
Reset consecutiveErrors on success (don't count past errors)
Clear interval on all exit paths (success, failure, timeout, error)

Monitoring/tuning: Track these metrics to adjust intervals:

Average processing time per document size
95th percentile processing time
Timeout rate (% of docs hitting 2min)
Error rate (% of polls failing)
Concurrent uploads (peak)

When to adjust:

Timeout rate >5% → Increase MAX_POLL_ATTEMPTS
Server load high → Increase POLL_INTERVAL or reduce MAX_POLL_ATTEMPTS
Complaints about slowness → Decrease POLL_INTERVAL (if server can handle)

Migration path: None - this was initial implementation

Rollback plan: If 2s proves too fast (server overload):

Change to 5s interval (POLL_INTERVAL = 5000)
Increase timeout (MAX_POLL_ATTEMPTS = 48; 48 × 5s = 240s = 4min)
Add backend rate limiting
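
Because the timeout is the product of the interval and the attempt count, changing one without the other shifts the timeout. A small sketch, with hypothetical helper names, makes the relationship explicit:

```typescript
/** Total time before the poller reports a timeout. */
function totalTimeoutMs(intervalMs: number, maxAttempts: number): number {
  return intervalMs * maxAttempts;
}

/** Attempts needed to wait at least `timeoutMs` at a given interval. */
function attemptsForTimeout(intervalMs: number, timeoutMs: number): number {
  return Math.ceil(timeoutMs / intervalMs);
}

console.log(totalTimeoutMs(2000, 60));         // 120000 (current: 2 minutes)
console.log(totalTimeoutMs(5000, 48));         // 240000 (rollback values: 4 minutes)
console.log(attemptsForTimeout(5000, 120000)); // 24 (keeps 2 minutes at a 5s interval)
```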

If too slow (UX complaints):

Implement exponential backoff (Alternative 2)
Start at 1s, max at 5s
