Section 01

Executive Overview

Plain-English explanation of the Document Processing Pipeline, its business value, and end-to-end function — written for founders, operators, and leadership.

From 7-step manual paper workflow to zero-touch automation
Replaces printing, manual AWB typing, barcode generation, physical arrangement, and paper scanning. Adds a duplication check that never existed before.
Automation Rate: +90% (docs matched without human input)
Fast Match Speed: <100ms (for filename or text-layer hits)
DB Source: Excel (AWB set reloads every 30s)
False Positives: ~0 (all AWBs verified against DB)

What is the Document Processing Pipeline?

The Document Processing Pipeline is an automated document processing system built for FedEx shipment operations. It watches a local "hot-folder" (INBOX), and whenever a new PDF arrives — whether an Air Waybill, commercial invoice, or packing list — it immediately identifies the 12-digit Air Waybill Number embedded in that document.

Once the AWB is confirmed against the live Excel-backed AWB set, the system automatically renames the file, moves it to the correct folder, and records every action in multiple audit trails. Runtime is typically sub-second for easy filename/text-layer matches and can extend to tens of seconds for hard rotated/OCR rescue documents — all without a human touching a keyboard.

The Manual Process — Then & Now

Before automation, every incoming shipment document required a multi-step physical workflow involving printing, manual data entry, barcode generation, physical arrangement, and paper scanning. The pipeline eliminates every single one of those steps.

OLD MANUAL WORKFLOW — 7 Steps Per Document Batch
01 Document Received: PDF arrived from multiple sources
02 Print Document: Employee prints the PDF on paper
03 Stack on Desk: Placed on physical scanning desk stack
04 Manual AWB Entry: Person reads AWB from page, types into barcode software
05 Print Barcodes: Barcode generator prints a barcode label per AWB
06 Arrange & Attach: Barcodes placed on top of each matching AWB doc
07 Scan Physical Batch: Entire paper stack fed through physical scanner
Time-Intensive: Each document requires manual attention (reading, typing, printing, arranging, scanning)
Human Error Risk: Manual AWB typing from document to barcode software introduces typos and misreads
No Duplicate Check: No automated way to detect if the same shipment document was submitted twice
Document Processing Pipeline V3 replaces all 7 steps
AUTOMATED PIPELINE — Fully Hands-Free
PDF Arrives in INBOX: Watchdog detects within 2 seconds. No printing, no physical handling.
AWB Auto-Extracted: Stages 0–7 find and verify the 12-digit AWB against the currently loaded Excel AWB set.
Duplicate Check: EDM API check + MD5 hash dedup catches any document submitted more than once.
Batch Ready to Print: Barcode cover pages generated automatically. Batch PDF + TIFF assembled for scanner.
Metric | Manual Process | Document Processing Pipeline V3
Steps per batch | 7 manual steps | 1 (drop file in INBOX)
AWB identification | Human reads & types manually | Automated OCR + DB verify
Barcode generation | Manual software entry per AWB | Auto-generated with batch cover page
Duplicate detection | None (manual check only) | EDM gate/layer screening with HASH, PHASH, TEXT, and OCR signals
Error source | Typos in manual AWB entry | Zero false positives (DB-verified only)
Audit trail | None | JSONL + Excel + 4-sheet workbook
Processing speed | Minutes per document | <100ms – 20s per document

End-to-End in Six Steps

1 · Document Arrives

PDF lands in INBOX. Watchdog detects it within 2 seconds. Stability guard ensures it is fully written before processing begins.

2 · AWB Extracted

Up to 8 pipeline stages search for the 12-digit AWB: filename pattern, embedded text, OCR at 320 and 420 DPI, rotation handling, and rescue crop passes.

3 · DB Confirmation

Every candidate AWB is cross-checked against the live Excel database of 70,000+ records. No match = not accepted, which is how the pipeline keeps false positives near zero.

4 · File Organised

File renamed to its AWB number and moved to PROCESSED. Byte-for-byte duplicates detected via MD5 hash and removed safely without data loss.

5 · Logged & Audited

Every event recorded in: Excel AWB log, JSONL audit log, stage-performance CSV, and a 4-sheet audit workbook with a live Dashboard tab.

6 · Print-Ready

After EDM duplicate checking, clean documents are collated into batch PDFs (max 48 pages, barcode cover) then converted to multi-page TIFFs.

What about difficult documents? Sideways scans, image-only PDFs, and ambiguous documents are handled via multi-stage fallback. If all stages fail, the file moves to NEEDS_REVIEW for manual handling. During long-pass, difficult files are held to a LONG_PASS_TIMEOUT_SECONDS=45.0 budget: a file that exceeds it is deferred to a third pass that runs once easier documents clear the queue, with its full OCR state preserved so work resumes where it stopped.

Business Impact for FedEx Operations

This automation is valuable because it solves real operational delays, not just document processing. It reduces shipment risk, protects service timelines, cuts manual workload, and gives teams real visibility into flow and backlog.

Operational Problems Solved
01 Late scanning risk: a perishable-shipment invoice scanned 3 hours late can push movement by 1-2 days.
02 After-hours pileups: documents received post-shift sit unprocessed and miss next-day clearance windows.
03 Duplicate exposure: same shipment documents can re-enter flow without reliable duplicate screening.
04 Paper-heavy handling: printing, arranging, rescanning, and manual indexing consume time and resources.
05 No real-time queue visibility: teams cannot easily see what is waiting, what is blocked, or what completed.
06 Exception handling overload: hard scans consume senior operator time that should be reserved for true edge cases.
07 Shift-change discontinuity: work quality and speed vary by shift handoff and manual prioritization.
08 Audit reconstruction cost: proving what happened to a file requires manual log chasing across tools.
How Automation Changes Outcomes
Service Speed: Night and backlog documents can be processed continuously, reducing clearance and dispatch delays.
Accuracy: Duplicate checks reduce repeat handling, rework, and downstream delivery confusion.
Paperless Shift: Printing/scanning dependency drops sharply, enabling near paperless operation for this lane.
Visibility: Document counts, queue state, and status become visible in real time, enabling auditable operations.
Labor Efficiency: Manual scan-touch hours are replaced by no-touch routing for routine files.
Continuity: Standardized automation behavior reduces dependence on shift-specific tribal workflows.
Predictability: Deferred/timeout recovery keeps difficult files from freezing intake lanes.
Leadership Control: Live operational telemetry supports faster escalation and staffing decisions.
Operational Dimension | Without Automation | With Automation | Business Effect
Perishable Shipment Timeliness | Invoice scan delay can hold shipment 1-2 extra days | Continuous intake + automated extraction reduces waiting time before clearance | Higher on-time movement probability for sensitive cargo
After-Hours Intake | Documents accumulate overnight and spill into next-day operations | Queue is monitored and processed with less dependency on immediate manual action | Lower morning backlog and fewer next-day misses
Duplicate Handling | Manual memory/checklists; high inconsistency risk | Automated duplicate screening with conservative decision policy | Lower rework, fewer unnecessary touches, stronger control
Labor & Paper | Heavy print/scan/manual sort effort | Near paperless flow and no-touch processing for routine files | Significant hours saved and reduced paper cost/waste
Operational Visibility | Little live visibility into document volume and state | Live counts, status, queue depth, and event logs available | True auditability and better staffing/decision planning
Exception Throughput | Hard files can block operators and delay normal workload | Timeout-defer + resume path isolates hard files from routine flow | Better SLA protection for standard-volume processing
Shift-to-Shift Consistency | Output quality depends heavily on who is on shift | Unified automated rules enforce same decision logic across all shifts | More consistent operational outcomes and fewer handoff surprises
Audit Readiness | Manual evidence collection is slow and fragmented | Structured logs and tracker records available per processing event | Faster internal/external review with higher confidence
Strategic outcome: this shifts teams from endless scan/print/manual handling to exception-based operations. Routine documents move with minimal touch, critical shipments face less delay risk, and leadership gains measurable control through real-time visibility and audit evidence.
Section 02

Visual Pipeline Flow

The flowchart shows every major stage, from document arrival through three-tier scheduling to final output.

Legend: Extraction · OCR Stage · Rotation · Decision · Success · Review
Exits: → PROCESSED (match) · → DEFERRED (Long Pass Queue) · → NEEDS_REVIEW (Manual Review)

INBOX MONITOR [INPUT]: Watchdog observer · 0.8s de-bounce · 30s rescan
STABILITY GUARD [GUARD]: 2 polls at 0.3s · size non-zero and stable
PAGE COUNT GUARD [GUARD]: PyMuPDF page_count > 0 · bad PDFs need review
STAGE 0 · FILENAME [S-0]: Filename regex · 12-digit or 4-4-4 · DB confirm
STAGE 1 · TEXT LAYER [S-1]: get_text() · rotation fallback · keyword proximity
PRE-OCR ANGLE DETECTION [FREE]: Zero-cost checks · rotation · ratio · char spread
STAGE 2 · OCR MAIN · 320 DPI [S-2]: PSM 6/11 · soft pass · invert · digit whitelist
FAST-LANE DECISION [SCHEDULE]: ProbeLite strict check · defer only if no match
STAGE 3.1 · ROTATION PROBE [S-3.1]: 140 DPI · 4 angles · weighted score · flip guard
STAGE 3.2 / 3 · STRONG OCR [S-3]: Cached probe reuse · 420 DPI OCR · 45s timeout
STAGE 4 / 5 / 5.5 · RESCUE [S-4/5]: Rotation fallback · context rescue · label crops
STAGE 7 · NEEDS_REVIEW [S-7]: All stages exhausted · diagnostic log · audit updated

Phase 2 — Post-Match Processing Pipeline

Every file that exits the AWB extraction engine as MATCHED enters Phase 2.

PROCESSED: AWB-renamed files land here
EDM CHECK: 3-layer duplicate detection (MD5 · pHASH · OCR)
CLEAN: Unique → batch eligible
REJECTED: Duplicate → discarded
BATCH BUILDER: Sort → sequence → pack ≤48pp (HIGH · MED · LOW)
COVER PAGE: Code128 barcode + seq + metadata
PENDING_PRINT: Batch PDFs + MD5 dedup copy
TIFF CONVERT: 200 DPI · LZW · 4 parallel workers
PRINT READY: Batch PDF + TIFF in scanner queue
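
The "pack ≤48pp" step of the batch builder can be pictured as greedy packing. This is only a sketch under stated assumptions: `pack_batches` is a hypothetical name, documents are never split across batches, the cover page is not counted against the limit, and input order is taken as already sorted. None of these details are confirmed by the text above.

```python
def pack_batches(docs, max_pages=48):
    """Greedily pack (name, page_count) pairs into batches of <= max_pages.

    Assumption: a document is never split, so a single document larger
    than max_pages ends up alone in its own batch.
    """
    batches, current, used = [], [], 0
    for name, pages in docs:
        # Close the current batch when the next document would overflow it.
        if current and used + pages > max_pages:
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += pages
    if current:
        batches.append(current)
    return batches
```

Because the packing is sequential, the sort order chosen upstream decides which documents share a batch.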

Three-Tier Processing Scheduler

Fast Lane

Stages 0–3 only, plus a strict post-Stage-3 ProbeLite check (400 / exact-high only) before defer. Returns DEFERRED only if probe-lite also fails. Drains the inbox at maximum speed. Up to 5 files per cycle.

Long Pass

Full pipeline on deferred files when fast queue empties. LONG_PASS_TIMEOUT_SECONDS=45.0 per-file timeout. On timeout, full state is captured and file queued for third pass.

Third Pass

Resumes timeout-deferred files with no timeout. OCR cache, candidate pool, rotation state all restored from snapshot. Zero repeated work.

Section 03

Intake & Detection

How the system watches the INBOX folder, ensures files are fully written, and manages queuing, de-bounce, and safety rescans.

Watchdog Observer

The watchdog library attaches InboxPDFHandler to INBOX_DIR. The handler fires on on_created, on_moved, and on_modified events, and only .pdf files are enqueued. A 0.8-second de-bounce prevents double-queuing on OS event bursts (for example, the Windows Copy → Flush → Close sequence). PDFs already in the inbox are seeded into the queue at startup.
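
The de-bounce behaviour can be sketched independently of the watchdog library. This is a minimal illustration, not the pipeline's code: the `Debouncer` class and its injectable clock are hypothetical, and only the 0.8-second window comes from the description above.

```python
import time


class Debouncer:
    """Suppress repeat events for the same path inside a short window."""

    def __init__(self, window_seconds=0.8, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock
        self._last_seen = {}  # path -> time of the most recent event

    def should_enqueue(self, path):
        now = self.clock()
        last = self._last_seen.get(path)
        self._last_seen[path] = now
        # Accept the first event, or any event that arrives after the window.
        return last is None or (now - last) >= self.window_seconds
```

An event handler would call `should_enqueue(path)` before adding a file to the queue, so a burst of created/modified events for one copy operation results in a single queue entry.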

File Stability Check

file_is_stable() polls the file size twice with a 0.3s delay. A file is safe only when size is non-zero and identical on both reads. Prevents processing a file still being written, transferred, or copied from a network share — a common issue with large EDM document batches.
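
A stability check of this kind might look like the sketch below. The function name matches the text, but the signature, the parameter names, and the handling of a missing file are assumptions made for illustration.

```python
import os
import time


def file_is_stable(path, polls=2, delay=0.3):
    """Return True when the size is non-zero and identical on every poll."""
    sizes = []
    for i in range(polls):
        try:
            sizes.append(os.path.getsize(path))
        except OSError:
            return False  # file vanished or is not readable yet
        if i < polls - 1:
            time.sleep(delay)  # wait before the next size reading
    return sizes[0] > 0 and len(set(sizes)) == 1
```

Two equal readings do not prove that a writer has finished, which is presumably why the safety rescan re-queues files that fail this guard.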

Safety Rescan

Every 30 seconds, the main loop re-scans INBOX and re-enqueues any PDFs found. Catches files the watchdog missed during high system load or temporary filesystem unavailability.

Heartbeat Log

Every 10 seconds: logs INBOX PDF count, AWB DB size, deferred long-pass queue depth, and timeout-deferred queue depth — real-time health monitoring.

Excel DB Refresh

Every 30 seconds: checks AWB Excel file mtime. On change, reloads full AWB set and rebuilds prefix/suffix buckets. UI can force immediate reload via reload_awb.trigger sentinel file.
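
The mtime-driven reload can be sketched as a small cache object. `AwbSetCache` and the `loader` callable are hypothetical names, and the prefix/suffix bucket rebuild and the reload_awb.trigger sentinel are left out to keep the example short.

```python
import os


class AwbSetCache:
    """Reload a set of AWBs only when the source file's mtime changes."""

    def __init__(self, path, loader):
        self.path = path
        self.loader = loader  # callable: path -> iterable of AWB strings
        self.mtime = None
        self.awbs = set()

    def refresh_if_changed(self):
        mtime = os.path.getmtime(self.path)
        if mtime == self.mtime:
            return False  # unchanged since the last check
        self.awbs = set(self.loader(self.path))
        self.mtime = mtime
        return True
```

Calling `refresh_if_changed()` on a 30-second timer keeps the check cheap, because the workbook is parsed only when its modification time moves.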

Pre-Processing Guards — In Order

Guard | Check | On Failure
GUARD 1 | File path exists & ends in .pdf (case-insensitive) | Silently dropped from queue
GUARD 2 | file_is_stable(): 2 polls at 0.3s, non-zero identical size | NEEDS_REVIEW · rescan re-queues later
GUARD 3 | PyMuPDF page_count > 0 (catches corrupt, truncated PDFs) | NEEDS_REVIEW with reason string
GUARD 4 | No inbox-level deduplication: every file processed independently | Dedup only at PROCESSED level via MD5

Outlook Email Intake Script (Companion Flow)

The companion intake script handles email attachments before AWB hotfolder processing. Attachments are saved to INBOX, and filename renaming is strict and deterministic so that no ambiguity is introduced upstream.

Intake Decision Logic
1. Read Outlook email attachments and keep PDF files only.
2. Run strict AWB extraction patterns on attachment context/name.
3a. If exactly one strict AWB candidate exists: save as <AWB>.pdf in INBOX.
3b. If zero or multiple candidates: keep original filename and save to INBOX unchanged.
4. AWB hotfolder picks up from INBOX and runs normal Stage 0–7 logic.
Strict Pattern Rule
Pattern Type | Accepted Form | Result
12-digit strict | (?<!\d)\d{12}(?!\d) | Eligible candidate
4-4-4 strict | (?<!\d)\d{4}[\s-]\d{4}[\s-]\d{4}(?!\d) | Normalized to 12 digits
Exactly one match | Single strict candidate only | Rename to <AWB>.pdf
0 or >1 matches | Ambiguous or absent AWB | Keep original filename
Safety intent: strict rename only on one unambiguous candidate preserves traceability and avoids wrong pre-labeling; all ambiguous files still enter INBOX for the app’s full extraction pipeline. This intake script is treated as an operational companion layer, keeping core V3 extraction/matching behavior unchanged.
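
The strict rename rule can be illustrated with the two patterns from the table. The function names are hypothetical, and counting distinct normalized candidates (so the same AWB written twice still counts as one) is an assumption about how "exactly one" is interpreted.

```python
import re

STRICT_12 = re.compile(r"(?<!\d)\d{12}(?!\d)")
STRICT_444 = re.compile(r"(?<!\d)\d{4}[\s-]\d{4}[\s-]\d{4}(?!\d)")


def strict_awb_candidates(text):
    """Return the distinct strict AWB candidates found in text."""
    found = set(STRICT_12.findall(text))
    for match in STRICT_444.findall(text):
        found.add(re.sub(r"[\s-]", "", match))  # normalise 4-4-4 to 12 digits
    return found


def intake_filename(original_name, context_text):
    """Rename to <AWB>.pdf only when exactly one strict candidate exists."""
    candidates = strict_awb_candidates(context_text)
    if len(candidates) == 1:
        return f"{next(iter(candidates))}.pdf"
    return original_name  # zero or multiple candidates: leave unchanged
```

The lookbehind and lookahead guards stop a 13-digit run from being read as a 12-digit AWB, which is what keeps the rename decision conservative.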
Section 04

AWB Extraction Engine

Eight pipeline stages, each building on a shared candidate pool.

Unified Candidate Pool: All stages write into running_high and running_standard. Noisy candidates from invert or rotation passes enter a quarantine set until a second independent stage confirms them — preventing one-off OCR artifacts from ever reaching the matching step.
Stage | Method | DPI | Typical Time | DB Check
Stage 0 | Filename regex | - | ~0–5ms | YES
Stage 1 | PyMuPDF text layer · 5 sub-stages | - | 50–500ms | YES (excl. 400-pattern)
Pre-OCR | Angle detection · 3 free checks | - | ~0ms | -
Stage 2 | OCR Main · PSM 6+11 · digit whitelist | 320 | 1–4s | YES
3.1/3.2 | Rotation probe · probe text exit | 140 | 0.5–2s | YES
Stage 3 | OCR Strong · PSM 6+11 · 420 DPI | 420 | 2–6s | YES
Stage 4 | Rotation fallback at remaining angles | 420 | 3–12s | YES
Stage 5/5.5 | Context rescue · airway label crops · 3× upscale | 420 | 2–15s | YES
Stage 6 | EDM persistence fallback (runtime ON/OFF) | - | 0.1–1.5s (when called) | EDM
Stage 7 | NEEDS_REVIEW terminal | - | - | -
Stage 0
Filename Extraction — ~0ms
Instant match with no OCR — handles most EDM-downloaded files

Calls extract_awb_from_filename_strict() using two compiled regexes: a bare 12-digit run \b\d{12}\b and a spaced/hyphenated 4-4-4 format. If found and confirmed against the AWB DB, the file is completed immediately — all OCR is skipped entirely.

DB check required · Cost: ~0ms · Pattern: \b\d{12}\b · Pattern: \d{4}[\s-]\d{4}[\s-]\d{4}
Stage 1
Text Layer Extraction — 5 Sub-Stages
PyMuPDF embedded text · No image render · 60-keyword proximity search
1a — set_rotation fallback: If text layer is empty, tries rotating the PDF metadata by 90°/270°/180° to unlock text stored at an angle in the PDF's internal stream. Resets to 0° after.
1b — Spatial word sort: For scrambled multi-column PDFs, get_text("words") re-sorted by (row, x) to reconstruct reading order. Used only when it yields more candidates than the raw stream.
1c — 400-pattern: Tight 400\d{12} regex. The only extraction that bypasses the DB check — the FedEx 400 carrier prefix strongly implies a valid AWB.
1d — Clean gate + tiered extraction: Extracts candidates adjacent to AWB keywords and runs full priority match cascade (Exact-High → Exact-Standard → Tolerance).
1e — Keyword proximity (60 terms): Scans 5 lines ahead / 2 lines behind every keyword in the 60-item list including "TRACKING NO", "MAWB", "CARGO CONTROL NUMBER", "ACI NO".
DB check: YES · No image render · Keywords: 60 terms · Lookahead: 5 lines
Pre-OCR
Angle Detection — Zero Cost
3 free checks before any image is rendered
Check 1 — PDF metadata rotation: page.rotation reports if the PDF viewer displays at 90°/180°/270°.
Check 2 — Aspect ratio: width/height > 1.3 → landscape page → likely rotated 90°.
Check 3 — Character spread: Bounding box distribution of text characters detects vertical vs horizontal flow.

Results stored as _rotation_hint. Narrows the rotation probe angle set later. If the probe is uncertain, the pre-check hint overrides it.

Stage 2
OCR Main — 320 DPI · PSM 6 + 11
Digit whitelist · soft pass · invert variants · table line removal
Pass A — PSM 6, digits-only: Uniform block mode with whitelist 0123456789. Clean gate → tiered extract → priority match. Early exit if quality candidates found (skips PSM 11).
Pass B — PSM 11, digits-only: Sparse text mode — suited for non-uniform layouts. OpenCV table line removal applied before OCR when available.
Pass C — Soft pass (general text): Tesseract unrestricted (PSM 11 then PSM 6) + spatial bounding-box analysis (image_to_data) finding digit clusters adjacent to AWB keyword tokens.
Invert passes: PSM 6+11 with image inversion (white text on dark — common in FedEx label formats). Invert-sourced STANDARD candidates go to quarantine until confirmed by another stage.
DPI: 320 · PSM: 6, 11 · cv2 table removal · Threshold: 175
Stage 3.1
Rotation Probe — 140 DPI
Keyword-scored · 4 angles · flip guard margin 80pts

Renders at 140 DPI, tests 0°/90°/180°/270°. Score formula: score = digit_count + keyword_hits×120 + coherent_words×2. High keyword weight means one "AWB" token outweighs hundreds of isolated digits.

Fast path: If 0° wins digit-score by ≥24 margin, skips expensive general-text OCR and returns 0° immediately.
Certainty thresholds: Score margin ≥300 = CERTAIN · ≥120 = LIKELY · below = UNCERTAIN. Uncertain probes defer to the pre-check hint.
Stage 3.2 — Probe Text Exit: Reuses OCR text from the probe at zero additional cost. A confident match here skips all of Stage 3 (420 DPI OCR) — saving 3–8 seconds.
DPI: 140 · Keywords: 30+ · Flip guard: 80pts · Stage 3.2 exit: free
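
The score formula and certainty thresholds can be worked through in a few lines. This is a sketch, not the pipeline's implementation: the function names are hypothetical, the margin is assumed to be the gap between the best and second-best angle, and the flip guard and the 0° fast path are omitted.

```python
def probe_score(digit_count, keyword_hits, coherent_words):
    """Score one angle: keyword hits dominate, isolated digits count least."""
    return digit_count + keyword_hits * 120 + coherent_words * 2


def classify_margin(scores):
    """Pick the best angle and label certainty from the winning margin."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best_angle, best), (_, runner_up) = ranked[0], ranked[1]
    margin = best - runner_up
    if margin >= 300:
        certainty = "CERTAIN"
    elif margin >= 120:
        certainty = "LIKELY"
    else:
        certainty = "UNCERTAIN"
    return best_angle, certainty
```

With a weight of 120 per keyword hit, three keyword hits at one angle already outscore 300 stray digits at another, which matches the stated intent of the formula.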
Stage 3
OCR Strong — 420 DPI
Identical sub-pass structure to Stage 2 · fully cached renders · timeout-guarded in long-pass

Applies the same sub-pass structure as Stage 2 (PSM 6 → PSM 11 → Soft → Invert) at 420 DPI at base_angle. Higher DPI yields better clarity for thin-text documents, fine-print labels, and compressed scans. All rendered images are cached: if a 420 DPI render at 0° already exists, Stage 3 retrieves it from cache without re-rendering. In fast-lane mode, a Stage 3 failure triggers a strict ProbeLite check (limited low-DPI rotation probe + 400/exact-high only); only if that check also fails can the file return DEFERRED.

Stage 4
Rotation Fallback
Full OCR at remaining angles · timeout-guarded at each boundary

Runs as rotation last-resort logic using probe certainty and route ordering (not only when base_angle ≠ 0°). It applies the full Stage 2+3 sub-pass sequence at deferred angles, then executes a full-priority sweep on accumulated fallback candidates. Timeout checks fire at angle boundaries — if LONG_PASS_TIMEOUT_SECONDS=45.0 is exceeded, _TimeoutDeferred captures all state (candidates, OCR cache, probe scores, timings) and defers to third-pass.

Stage 5 / 5.5
Context Rescue + Airway Label Rescue
60s budget · ACI / 400-labeled / carrier row / 3× upscale crops
Stage 5 — Context Rescue: ACI-prefixed patterns, "400 NO. XXXX" labeled format, FedEx carrier row positional matching, and 2-3× upscaled region crops for micro-print labels. 60-second time budget.
Stage 5.5 — Airway Label: Crops three overlapping right-side regions (RightMid, UpperRight, RightWide) at each angle. Each crop upscaled 3×. Two-step efficiency: cheap digit-only pass first; if <10 digit chars found, skips general-text OCR for that crop.
Budget: 60s · Uses cached images · Upscale: 3×
Stage 6
EDM Persistence Fallback
Runtime-gated API check · single persistent HIGH candidate only
Entry criteria: Exactly one 12-digit candidate remains, tagged HIGH confidence, seen in at least 2 independent stages, and not disqualified as date/noise.
Runtime gate: EDM ON/OFF is read live from data/edm_toggle.json (with env/config fallback). OFF immediately bypasses API calls.
Outcomes: True confirms the candidate as MATCHED (EDM-Exists-Persistent). False or None falls through to Stage 7. Multiple persistent candidates trigger a tie to NEEDS_REVIEW.
UI toggle: EDM ON/OFF · Cache: edm_awb_exists_cache.json · Bypass-safe on missing token/auth
Stage 7
NEEDS_REVIEW — Terminal State
Full diagnostic log · safe_move() · audit updated

Terminal state when no AWB match is found after all stages. Full diagnostic log written: every candidate tried, confidence tier per candidate, which stages saw each one, and all quarantined candidates. File moved via safe_move() to NEEDS_REVIEW_DIR. Audit tracker updated via record_hotfolder_needs_review() with reason string and complete candidate list.

Section 05

Document Classification

How the pipeline characterises each document's type and difficulty tier before committing to expensive processing paths.

Tier 1 — Named Files
<5ms

Filename already contains a 12-digit AWB. No OCR required. Stage 0 match. Most EDM-downloaded files arrive pre-named and land here.

Tier 2 — Text PDFs
50–500ms

PDF has embedded vector text. AWB extracted via Stage 1 without any Tesseract invocation. Handles multi-column scrambling and metadata rotation recovery.

Tier 3 — Scanned / Image
2–20s

Image-only PDF — no embedded text (_is_image_only=True). Full OCR pipeline, rotation probe, and potentially all rescue stages.

Image-Only Detection

Set when len(txt_layer.strip()) == 0 after the set_rotation fallback attempt. Affects scheduling: rotation probe immediately narrowed to (0°, hint°) if a strong pre-check hint exists — reducing probe cost by ~50%.

Text Layer Quality Signal

A rich text layer (≥25 chars, MIN_EMBEDDED_TEXT_LENGTH) triggers the early rotation probe after Stage 2 failure. A thin layer triggers it immediately after Stage 1, routing weakly-embedded documents into OCR faster.

Section 06

Matching & Validation

How raw OCR candidates are validated against the AWB database — confidence tiers, Hamming distance tolerance, prefix/suffix buckets, and the tie-guard.

Candidate Confidence Tiers

HIGH

Extracted from strong, unambiguous context: adjacent to a known AWB keyword, matching a structured label pattern (ACI, 400-labeled, airway-bill label row), or confirmed by spatial bounding-box analysis. Eligible for Hamming tolerance up to 2 digits.

STANDARD

Extracted from looser pattern matching — a 12-digit run without strong keyword adjacency. Eligible for tolerance only if sole STANDARD candidate AND seen in ≥2 stages. Tolerance limited to 1-digit Hamming.

QUARANTINE

Candidates from noisy passes (invert, Rotation-180/270, AngFallback) that produced >3 candidates in one pass. Excluded until a second independent stage confirms the same candidate — preventing OCR artifacts from forcing matches.

Priority Match Cascade

1. Exact-High: HIGH candidates ∩ AWB DB. Unique → immediate match. Multiple → ambiguous tie → NEEDS_REVIEW.
2. Exact-Standard: STANDARD candidates ∩ AWB DB. Same tie logic applies.
3. Tolerance2-High: Hamming ≤2 against HIGH candidates. Accepted only if evidence candidate seen in ≥1 stage (dist≤1) or ≥2 stages (dist=2). Prefix/suffix bucket lookup limits the search space to ~4-digit matches.
4. Tolerance2-Standard: Hamming ≤1 against STANDARD, only when pool has exactly 1 candidate AND it was seen in ≥2 stages. Most conservative tolerance path.
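
The cascade above can be sketched as one function. Several details are assumptions made for illustration: all names are hypothetical, the AWB DB is scanned linearly instead of through prefix/suffix buckets, and a tie on a tolerance step is routed to NEEDS_REVIEW in the same way as an exact tie.

```python
def hamming(a, b):
    """Count differing positions between two equal-length digit strings."""
    return sum(x != y for x, y in zip(a, b))


def unique_or_tie(hits):
    """One hit is a match, several hits are a tie, no hits means keep going."""
    if len(hits) == 1:
        return ("MATCHED", next(iter(hits)))
    return ("NEEDS_REVIEW", None) if hits else None


def priority_match(high, standard, awb_db, stage_hits):
    """Exact-High, Exact-Standard, then conservative Hamming tolerance."""
    for pool in (high, standard):  # steps 1 and 2: exact matches
        result = unique_or_tie(pool & awb_db)
        if result:
            return result
    hits = set()  # step 3: HIGH candidates, distance <= 2
    for cand in high:
        seen = len(stage_hits.get(cand, ()))
        for awb in awb_db:
            dist = hamming(cand, awb)
            if (dist <= 1 and seen >= 1) or (dist == 2 and seen >= 2):
                hits.add(awb)
    result = unique_or_tie(hits)
    if result:
        return result
    if len(standard) == 1:  # step 4: sole STANDARD candidate, distance <= 1
        cand = next(iter(standard))
        if len(stage_hits.get(cand, ())) >= 2:
            return unique_or_tie({a for a in awb_db if hamming(cand, a) <= 1})
    return None
```

A return value of `None` means no step produced a decision, which in the pipeline would hand the file on to later stages.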
Candidate Disqualification Rules
More than 3 leading zeros (e.g. 000012345678) — numeric code or padding artifact
Matches a date pattern (YYYYMMDD, DDMMYYYY variants) — disqualified as date reference
All same digit (e.g. 111111111111) — OCR artifact or table cell filler
Matches known HS code or phone number patterns — commercial document noise
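
Two of these rules are simple enough to show directly. The sketch below uses a hypothetical `is_disqualified` helper and covers only the leading-zero and same-digit rules; the date, HS code, and phone number rules are left out because their exact patterns are not given here.

```python
def is_disqualified(candidate):
    """Apply the leading-zero and same-digit rules to a digit string."""
    leading_zeros = len(candidate) - len(candidate.lstrip("0"))
    if leading_zeros > 3:
        return True  # numeric code or padding artifact
    if len(set(candidate)) == 1:
        return True  # e.g. 111111111111, an OCR artifact or filler
    return False
```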
Section 07

Routing & Decision Logic

How the orchestrator decides which path a file takes at every decision boundary — three-tier scheduler, timeout management, and full state serialisation for seamless resume.

Return States

MATCHED: AWB confirmed. File renamed and moved to PROCESSED.
NEEDS_REVIEW: No match after all stages. File moved to NEEDS_REVIEW dir.
DEFERRED: Fast-lane exit after Stage 3 fail. Queued for long-pass.
TIMEOUT: Budget exceeded. State fully captured for third-pass resume.

45-Second Timeout Budget

The long-pass budget is LONG_PASS_TIMEOUT_SECONDS=45.0. The timeout check fires only at natural angle-pass boundaries — never mid-subpass. On timeout, _TimeoutDeferred exception is raised internally, which triggers complete state capture into a _state_out dict passed back to the scheduler for third-pass resumption.
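
Checking the budget only at pass boundaries can be sketched as follows. The exception and function names are hypothetical stand-ins (the text names the real exception _TimeoutDeferred), and the captured state here is reduced to two illustrative keys.

```python
import time


class TimeoutDeferred(Exception):
    """Carries captured state so a later pass can resume."""

    def __init__(self, state):
        super().__init__("long-pass budget exceeded")
        self.state = state


def run_angle_passes(angles, run_pass, budget_seconds=45.0, clock=time.monotonic):
    """Run one pass per angle, checking the budget only between passes."""
    started = clock()
    results = {}
    for angle in angles:
        # Boundary check: a pass that has started is never interrupted.
        if clock() - started > budget_seconds:
            remaining = [a for a in angles if a not in results]
            raise TimeoutDeferred({"done": results, "remaining": remaining})
        results[angle] = run_pass(angle)
    return results
```

A consequence of boundary-only checks is that a file can overrun the budget by up to the duration of one pass, in exchange for never leaving a pass half finished.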

State Capture on Timeout

Everything needed for seamless third-pass resume:

# Placeholder values only; each comment describes what the key holds at capture time.
captured_state = {
  "probe_scores": {},               # rotation probe scores per angle, e.g. {0: 891, 90: 42, ...}
  "probe_texts": {},                # (digit_text, general_text) per angle, reused free
  "base_angle": 0,                  # determined rotation in degrees (0/90/180/270)
  "_angle_certainty": "UNCERTAIN",  # "CERTAIN" / "LIKELY" / "UNCERTAIN"
  "running_high": [],               # accumulated HIGH candidate set as list
  "running_standard": [],           # accumulated STANDARD candidate set as list
  "candidate_stage_hits": {},       # {candidate: [stages_that_saw_it]}
  "quarantine": [],                 # noisy candidates pending second-stage confirmation
  "ocr_cache": [],                  # [[key_list, text], ...] as serialisable pairs
  "timings": {},                    # accumulated ms per stage, carried forward on resume
}
Section 08

Output Actions

What happens to a file once a match is confirmed or denied — file lifecycle from INBOX to final resting place across all output directories.

On Match → PROCESSED
1. Rename & Move: PROCESSED/<AWB>.pdf. If exists: MD5 compare → identical → source deleted. Different content → _2, _3 suffix appended.
2. AWB Logs Excel: Row buffered to CSV sidecar, flushed to Excel after 10 rows accumulate. Contains: AWB, source filename, timestamp, match method, status.
3. Stage Cache CSV: Per-file performance row: input filename, processed filename, AWB, detection type, extraction seconds.
4. Audit Tracker: record_hotfolder_end() writes to HotfolderV2 sheet in the 4-sheet Excel audit workbook.
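
The rename-and-move step with MD5 comparison might be sketched like this. `md5_of` and `place_in_processed` are hypothetical names, and comparing the incoming file against every existing suffix variant (not just the base name) is an assumption.

```python
import hashlib
import os
import shutil


def md5_of(path):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def place_in_processed(src, processed_dir, awb):
    """Move src to <AWB>.pdf, dropping exact copies or adding _2, _3, ..."""
    target = os.path.join(processed_dir, f"{awb}.pdf")
    n = 1
    while os.path.exists(target):
        if md5_of(target) == md5_of(src):
            os.remove(src)  # byte-identical copy is already stored
            return target
        n += 1
        target = os.path.join(processed_dir, f"{awb}_{n}.pdf")
    shutil.move(src, target)
    return target
```

The source is deleted only after a stored copy with the same digest has been found, so no content is lost when a duplicate is removed.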
On No-Match → NEEDS_REVIEW
1. Safe Move: safe_move() transfers to NEEDS_REVIEW_DIR. Timestamp suffix appended if name collision exists.
2. Diagnostic Log: Every candidate tried: confidence tier, stages that saw it, quarantine status — full post-mortem for manual review.
3. Audit Tracker: record_hotfolder_needs_review() with reason string and sorted candidate list.

Output Directory Map

PROCESSED/: Matched, renamed AWB files. Source for EDM-aware downstream handling.
NEEDS_REVIEW/: Unmatched files. Manual operator review required.
CLEAN/: EDM-checked non-duplicate files. Feeds batch builder.
REJECTED/: EDM-confirmed duplicate documents.
PENDING_PRINT/: Batch PDFs and TIFF files ready for print queue.
data/OUT/: AWB sequence Excel log and awb_list.csv for downstream.

What Happens Next — Downstream Journey

PROCESSED → EDM 3-Layer Check → CLEAN → Batch Builder → Barcode Cover → PENDING_PRINT → TIFF Convert → Print Queue
See Pipeline Flow §2 for interactive Phase 2 diagram
Section 09

EDM Duplicate Check

In the current V3, EDM is runtime-gated from the UI and used for the Stage 6 AWB persistence fallback. This section also documents the downstream duplicate-screening lane used in EDM observability flows.

V3 update: EDM calls are now controlled at runtime (EDM: ON/OFF button in UI). Stage 6 in the hotfolder pipeline uses EDM for persistent-candidate confirmation, and bypasses safely when EDM is OFF, token is missing, or response is inconclusive.
Layer 1 — MD5 Hash

Byte-for-byte identity check and the fastest comparison available: two files with the same MD5 are treated as identical content, since an accidental collision is vanishingly unlikely. Runs before any image processing.

Cost: ~1ms
Layer 2 — Perceptual Hash

Compares visual fingerprints of page renders. Catches near-identical documents that differ only in metadata, timestamps, or minor encoding differences. Threshold: pHash distance ≤ 10.

Threshold: ≤10
Layer 3 — OCR Text Similarity

RapidFuzz text similarity on OCR/text-layer content with layered safeguards: text-vs-text only when both sides have enough embedded text, OCR fallback when one side lacks text layer. Similarity ≥60% is tracked; ≥85% becomes a strong text signal.

Track: ≥60% Strong: ≥85%
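
The two text-similarity thresholds can be expressed as a small classifier. This is a sketch: the function names and the "STRONG" / "TRACKED" / "NONE" labels are hypothetical, and the standard-library difflib stands in for RapidFuzz, so its scores will not be identical to the pipeline's.

```python
from difflib import SequenceMatcher


def text_similarity(a, b):
    """Similarity in percent; difflib is a stand-in for RapidFuzz here."""
    return SequenceMatcher(None, a, b).ratio() * 100.0


def classify_text_similarity(score_percent, track_at=60.0, strong_at=85.0):
    """Map a 0-100 similarity score to the signal levels described above."""
    if score_percent >= strong_at:
        return "STRONG"
    if score_percent >= track_at:
        return "TRACKED"
    return "NONE"
```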
Document Flow Through EDM
1. Runtime gate evaluated (EDM: ON/OFF in UI)
2. Stage 6 checks a single persistent HIGH candidate only
3. EDM metadata API queried (token + auth validated)
4a. Confirmed in EDM → MATCHED (EDM-Exists-Persistent)
4b. Not confirmed / bypassed → continue to Stage 7 path
5. Downstream duplicate lane applies Gate-1 hash-all-pages, Gate-2 bounded probes, then Tier-2 full checks with CCD exemptions and conservative reject rules
EDM API Integration

Connects to the FedEx EDM production endpoint using a Bearer token stored in data/token.txt. Two endpoints used:

Metadata endpoint: Retrieves document group info for the AWB (operatingCompany: "FXE")
Download endpoint: Returns a ZIP of existing PDFs for the AWB, extracted for comparison
Fallback contract handling: Metadata path supports legacy payload first, then body/query fallback variants; inconclusive/auth/network paths route to CLEAN-UNCHECKED without stopping flow.
Token expiry is checked on first call each session. When expired, EDM checks are skipped gracefully and logged as warnings — the pipeline continues without interruption.
Live toggle behavior: the desktop UI button switches EDM between ON and OFF immediately. Running hotfolder checks read the updated value without restart, so API calls are allowed or bypassed in real time.
Runtime Toggle Resolution Order
Priority | Source | Purpose
1 | data/edm_toggle.json | Persisted UI state from EDM: ON/OFF button
2 | PIPELINE_EDM_ENABLED | Per-process override for launched scripts
3 | ENABLE_EDM_FALLBACK | Default from .env / config
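
The resolution order in the table can be sketched as a single lookup function. Several details are assumptions made for illustration: the JSON key name `enabled`, the set of strings treated as truthy, and the final default of OFF when nothing is configured.

```python
import json
import os


def _parse_flag(value):
    """Interpret common truthy strings; None means the variable is not set."""
    if value is None:
        return None
    return str(value).strip().lower() in {"1", "true", "on", "yes"}


def is_edm_enabled(toggle_path="data/edm_toggle.json", env=None):
    """Resolve the EDM toggle: JSON file, then per-process env, then default."""
    env = os.environ if env is None else env
    try:
        with open(toggle_path, encoding="utf-8") as fh:
            state = json.load(fh)
        if isinstance(state, dict) and "enabled" in state:  # hypothetical key
            return bool(state["enabled"])
    except (OSError, ValueError):
        pass  # missing or unreadable file: fall through to the environment
    for name in ("PIPELINE_EDM_ENABLED", "ENABLE_EDM_FALLBACK"):
        flag = _parse_flag(env.get(name))
        if flag is not None:
            return flag
    return False  # assumed default when nothing is configured
```

Reading the file on every call is what lets a running hotfolder process pick up a UI toggle change without a restart.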
Stage 6 Decision Path
Gate: Run only for one persistent HIGH candidate seen across 2+ stages
ON: Call EDM metadata endpoint with token from data/token.txt (or env)
True: Complete match with method EDM-Exists-Persistent
False: Candidate not confirmed; pipeline continues to Stage 7 logic
None: Bypassed/unchecked path (OFF, token missing, auth/network inconclusive)
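# Simplified runtime behavior used in V3.
# The wrapper name stage6_edm_check is illustrative; the helpers it calls live elsewhere in the pipeline.
def stage6_edm_check(candidate_awb):
  if not is_edm_enabled():
    return None  # EDM is OFF: bypass the API
  token = get_token()
  if not token:
    return None  # no token: skip safely
  exists = edm_awb_exists_fallback(candidate_awb)  # True / False / None
  if exists:
    complete_match(candidate_awb, "EDM-Exists-Persistent")
  return exists  # False or None falls through to Stage 7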
Why originals reach CLEAN faster: the comparison lane avoids brute-force permutations with Gate-1 exact hash over all pages, Gate-2 bounded probes, and Tier-2 expansion only after evidence hit.
Permutation/Combination Pruning
1. Gate 1: incoming all-pages vs EDM all-pages hash index gives exact-hit/no-hit cheaply.
2. Gate 2: smart sampled probes (first pages + 1/3 + mid + 2/3 + last) using pHash/text/OCR only when Gate 1 misses.
3. Tier 2 full checks run only after evidence and apply CCD exemption + bounded OCR parallel prewarm.
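
The Gate 2 sampling pattern (first pages + 1/3 + mid + 2/3 + last) can be sketched as an index picker. The function name and the exact rounding of the fractional positions are assumptions; the default of 3 first pages mirrors the EDM_TIER1_INCOMING_PAGES default listed in this section.

```python
def probe_page_indices(page_count, first_pages=3):
    """Pick first pages plus 1/3, mid, 2/3, and last page indices (0-based)."""
    if page_count <= 0:
        return []
    picks = set(range(min(first_pages, page_count)))
    last = page_count - 1
    # Fractional positions are floored; duplicates collapse in the set.
    picks.update({last // 3, last // 2, (2 * last) // 3, last})
    return sorted(picks)
```

For a 30-page document this selects 7 pages instead of 30, which is the kind of bound that keeps Gate 2 cheap when Gate 1 misses.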
Page Limits + Top Picks
Control | Default | Effect
EARLY_FOCUS_MATCH_THRESHOLD | 3 | Promotes likely candidates first (top-pick path)
PAGE_OCR_LIMIT | 8 | Caps OCR pages per document to bound heavy comparisons
EDM_OCR_COMPARE_LIMIT | 10 | Tier-2 OCR compare page/doc bound for heavy checks
PHASH_THRESHOLD | 10 | Fast visual near-match gate before OCR text similarity
EDM_TIER1_INCOMING_PAGES | 3 | Base sampled incoming pages before extra strategic probe indices
EDM_TIER1_EDM_PAGE_LIMIT | 5 | Tier-1 probe depth per EDM document
EDM_TIER2_EDM_PAGE_LIMIT | 10 | Tier-2 full-compare depth per EDM document
P-Cross Handling for Multi-Page Input vs Multi-Page EDM Docs

When both sides have many pages, the lane uses bounded cross-page comparison (P-Cross) instead of unbounded all-page expansion. Duplicate evidence is accumulated as page hits and ratio, then resolved by configured rejection thresholds.

# Simplified bounded P-Cross decision idea
top_in = first_pages(incoming, PAGE_OCR_LIMIT)
top_edm = top_docs(existing_docs, EDM_OCR_COMPARE_LIMIT)
dup_pages, total_pages = compare_cross_pages(top_in, top_edm) # bounded matrix
dup_ratio = dup_pages / max(total_pages, 1)
if dup_pages > EDM_REJECT_IF_DUP_PAGES_OVER and dup_ratio >= EDM_REJECT_IF_DUP_RATIO:
  return "REJECTED"
return "CLEAN" # true originals pass through faster
Operational result: efficient ranking + bounded P-Cross keeps processing fast for genuine originals while still protecting against high-page-count duplicate overlap.
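A self-contained version of the bounded P-Cross idea is sketched below, with the page comparison passed in as a function so the example runs without OCR or image libraries. Several choices here are assumptions: the name p_cross_decision, applying EDM_OCR_COMPARE_LIMIT as a per-document page cap, and computing the ratio over the incoming pages that were actually compared.

```python
from itertools import islice

PAGE_OCR_LIMIT = 8                # defaults taken from the config table
EDM_OCR_COMPARE_LIMIT = 10
EDM_REJECT_IF_DUP_PAGES_OVER = 5
EDM_REJECT_IF_DUP_RATIO = 0.70


def p_cross_decision(incoming_pages, edm_docs, pages_match):
    """Bounded cross-page comparison; returns (decision, dup_pages, dup_ratio)."""
    top_in = list(islice(incoming_pages, PAGE_OCR_LIMIT))
    top_edm = [list(islice(doc, EDM_OCR_COMPARE_LIMIT)) for doc in edm_docs]

    dup_pages = 0
    for page in top_in:
        # Count each incoming page once, however many EDM pages it matches.
        if any(pages_match(page, other) for doc in top_edm for other in doc):
            dup_pages += 1

    dup_ratio = dup_pages / max(len(top_in), 1)  # guard against zero pages
    if dup_pages > EDM_REJECT_IF_DUP_PAGES_OVER and dup_ratio >= EDM_REJECT_IF_DUP_RATIO:
        return "REJECTED", dup_pages, dup_ratio
    return "CLEAN", dup_pages, dup_ratio
```

Both conditions must hold for a rejection, so a short document with a few matching pages stays CLEAN even at a high ratio, in line with the conservative routing described in this section.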
Gate-and-Layer Cascade: Gate 1 hash-all-pages, Gate 2 bounded probes, Tier-2 full checks. Strong duplicate evidence → REJECTED. Otherwise CLEAN / CLEAN-UNCHECKED.
[Diagram: INCOMING (from PROCESSED) → Layer 1 → Layer 2 → Layer 3 → CLEAN (→ batch queue). At each layer, MATCH → REJECTED and PASS → next layer.]
Layer 1 | MD5 HASH | Byte-for-byte identity check on raw file content | hashlib.md5(), ~1ms
Layer 2 | PERCEPTUAL HASH | Visual fingerprint comparison of page renders. Catches near-identical docs with different encoding. | ImageHash distance ≤ 10, limit: 10 pages
Layer 3 | OCR TEXT SIMILARITY | Text-layer compare (only when both sides have enough chars) plus OCR fallback for mixed text/image pages. Catches rescanned content. | rapidfuzz: ≥60% tracked, ≥85% strong, 8 pages max, auto: 5pp ≥70%
False positive philosophy: A false positive (wrongly rejecting a clean doc) is treated as worse than a missed duplicate. Automatic reject/split decisions require strong methods (HASH/PHASH). TEXT/OCR-only similarity is retained as observability signal and can still pass through when strong evidence is absent.
Layer 1: MD5 Hash (Byte Identity)

Computed using hashlib.md5() with 64KB read chunks. If two files produce the same MD5 digest, they are treated as byte-for-byte identical: an accidental collision between two different shipment PDFs is vanishingly unlikely, although MD5 is not collision-resistant against deliberately crafted files. This is the cheapest check at ~1ms and runs first, before any image processing.

Speed: ~1ms · No false positives in practice · Chunk size: 64KB
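Chunked hashing of the kind described for Layer 1 can be written with the standard library alone. The helper name file_md5 is illustrative; only hashlib.md5() and the 64KB chunk size come from this guide.

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # 64KB read chunks, as described above


def file_md5(path, chunk_size=CHUNK_SIZE):
    """Return the hex MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks keeps memory use flat regardless of PDF size, which matters when large scanned batches pass through the check.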
Layer 2: Perceptual Hash (Visual Fingerprint)

Uses the ImageHash library to compute a perceptual hash of each page render. Hamming distance between two pHash values ≤ 10 → duplicate. Catches near-identical documents that differ only in metadata, compression level, or minor encoding artefacts. OCR compare limit caps checked pages at 10.

Threshold: distance ≤ 10 · Page limit: 10 · Library: ImageHash
Layer 3: OCR/Text Similarity (Content Match)

When hashes differ but content may still be duplicated (for example re-scanned paper), text comparison runs with safeguards: text-layer compare only when both pages have enough embedded text; otherwise OCR fallback compares mixed text/image pages. Score ≥60% is recorded as TEXT; score ≥85% is recorded as TEXT_STRONG. Conservative routing still requires strong evidence for auto-reject decisions.

Similarity threshold: 60% · Strong threshold: 85% · Page OCR limit: 8 · Library: rapidfuzz · Strong methods: HASH/PHASH/TEXT_STRONG
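The two text thresholds translate into a simple three-way classification. The sketch below uses difflib from the standard library as a stand-in scorer so that it runs without extra packages; the pipeline itself is described as using rapidfuzz, whose scores will differ. The function names are illustrative.

```python
from difflib import SequenceMatcher

TEXT_SIMILARITY_THRESHOLD = 60  # defaults from the config table
TEXT_STRONG_THRESHOLD = 85


def similarity_percent(a, b):
    """Stand-in scorer (difflib, not rapidfuzz), scaled to 0-100."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def classify_text_signal(score):
    """Map a 0-100 similarity score to the recorded signal label, or None."""
    if score >= TEXT_STRONG_THRESHOLD:
        return "TEXT_STRONG"
    if score >= TEXT_SIMILARITY_THRESHOLD:
        return "TEXT"
    return None  # below the tracking threshold: nothing recorded
```

The label only records evidence strength; whether it leads to a rejection is decided by the routing rules described in this section.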
Config Key | Default | What It Controls
PHASH_THRESHOLD | 10 | Maximum pHash Hamming distance to classify as visual duplicate. Lower = stricter.
TEXT_SIMILARITY_THRESHOLD | 60 | Minimum RapidFuzz similarity % to record a TEXT duplicate signal.
TEXT_STRONG_THRESHOLD | 85 | Minimum RapidFuzz similarity % to promote TEXT to TEXT_STRONG evidence.
PAGE_OCR_LIMIT | 8 | Maximum pages to OCR-compare per document during similarity check.
EDM_OCR_COMPARE_LIMIT | 10 | Tier-2 OCR compare limit during heavy matching.
EDM_TIER1_INCOMING_PAGES | 3 | Seed size for Gate-2 incoming probe sampling.
EDM_TIER1_EDM_PAGE_LIMIT | 5 | Tier-1 probe max pages per EDM doc.
EDM_TIER2_EDM_PAGE_LIMIT | 10 | Tier-2 full compare max pages per EDM doc.
EDM_TEXT_LAYER_MIN_CHARS | 30 | Minimum text chars before using text-layer compare path.
EDM_OCR_WORKERS | 2 | Parallel OCR workers for Tier-2 prewarm.
EDM_OCR_PARALLEL_MIN_TASKS | 4 | Minimum pending OCR tasks before enabling parallel prewarm.
EDM_REJECT_IF_DUP_PAGES_OVER | 5 | Auto-rejection requires more than this many duplicate-matching pages (strictly greater than).
EDM_REJECT_IF_DUP_RATIO | 0.70 | Minimum proportion of pages that must match to trigger auto-rejection.
FILE_SETTLE_SECONDS | 3 | Wait time after file detected before comparison begins (ensures file is fully written).
EARLY_FOCUS_MATCH_THRESHOLD | 3 | Minimum early-focus text matches to prioritise a candidate for full comparison.
MIN_EMBEDDED_TEXT_LENGTH | 25 | Minimum chars in text layer before skipping OCR extraction for comparison.
# EDM duplicate decision logic (pseudo-code)
# record_signal and count_duplicate_pages are illustrative helper names
def is_duplicate(incoming, existing):
  # Layer 1: byte-identical
  if md5(incoming) == md5(existing): return True
  # Layer 2: visually identical
  if phash_distance(incoming, existing) <= PHASH_THRESHOLD: return True
  # Layer 3: text similarity is recorded as a signal, not an automatic reject
  score = rapidfuzz_similarity(ocr(incoming), ocr(existing))
  if score >= TEXT_SIMILARITY_THRESHOLD: record_signal("TEXT", score)
  # Auto-reject if enough pages carry duplicate evidence
  dup_pages, dup_ratio = count_duplicate_pages(incoming, existing)
  if dup_pages > EDM_REJECT_IF_DUP_PAGES_OVER and dup_ratio >= EDM_REJECT_IF_DUP_RATIO:
    return True
  return False # when in doubt: pass through
CLEAN — Passes EDM Check
File moved to CLEAN/ folder
When EDM is bypassed/inconclusive (toggle OFF, no token, metadata/download issue), route is CLEAN-UNCHECKED with explicit compare method reason
When all incoming pages are CCD, duplicate checks are intentionally bypassed (all_ccd_bypass) and routed clean
When only TEXT/OCR similarity exists without strong HASH/PHASH evidence, route can remain clean via conservative bypass
EDM audit row written: result, timing, compare method
File becomes eligible for batch builder ingestion
JSONL audit event appended: stage="EDM_CHECK"
REJECTED — Duplicate Detected
File moved to REJECTED/ folder (or partially split into CLEAN+REJECTED when only some pages are strong duplicates)
EDM audit row written: which layer triggered, dup pages, dup ratio
Original in EDM system is not overwritten or modified
JSONL audit event appended with full comparison details
Audit Field | Description | Example Value
EDMResult | Outcome of the check | CLEAN, REJECTED, or CLEAN-UNCHECKED
DupPageCount | Number of pages flagged as duplicates | 3
TotalPages | Total pages in incoming document | 4
DupRatio | Proportion of duplicate pages | 0.75
EDMSecs | Time taken for the full EDM check | 2.4
CompareMethod | Which path/layer determined the decision | MD5 / pHash / OCR-text / CCD-BYPASS / BYPASS-NO-TOKEN
Section 10

Batch Builder & Sequence Priority

Transforms the CLEAN folder into numbered, print-ready batch PDFs with auto-generated barcode cover pages — replacing the entire manual stack-and-scan workflow.

Max Pages / Batch: 48 (configurable via MAX_PAGES_PER_BATCH)
Cover Page Format: 1 per AWB · barcode + metadata
Sort Order: mtime (oldest file in each AWB group first)
Write Safety: Atomic (tmp → os.replace, no partial files)
Batch Construction Pipeline: Step by Step
01 SCAN CLEAN (mtime cached): Find all \d{12}(?:_\d+)?.pdf files. Group by AWB. Open each once for page count.
02 SORT FIFO (oldest first): Sort groups globally by oldest file's mtime. Files within each AWB group also sorted by mtime.
03 ASSIGN TIER (High / Med / Low): Read stage_cache.csv to map each AWB's detection method to High / Medium / Low tier.
04 ASSIGN SEQ (FIFO → SEQ): Global sequential integer per AWB. Resets per tier when tier batching is on. Logged to Excel.
05 PACK BATCHES (≤48 pages/batch): Greedy first-fit. Cost = 1 cover + N invoice pages. When next AWB would exceed 48pp → open new batch.
06 BUILD PDF (atomic write · reportlab): Insert Code128 cover page + all AWB PDFs via PyMuPDF. Write atomically: .tmp → os.replace.
07 PENDING_PRINT (MD5 dedup copy): Copy batch PDF with MD5 dedup check. Delete CLEAN sources only after all copies succeed.
Batch Packing Algorithm

Each AWB costs 1 cover page + N invoice pages. The packer walks AWBs in sequence order and assigns to the current batch. When the next AWB would exceed 48 pages, a new batch opens. Greedy first-fit — no backtracking or re-ordering.

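# Batch packing (greedy first-fit)
batch_no, pages = 1, 0 # start with batch 1, no pages used yet
for awb in resolved:
  cost = 1 + awb.inv_pages # cover + docs
  if pages + cost > MAX_PAGES_PER_BATCH:
    batch_no += 1 # current batch is full: open a new one
    pages = 0
  awb.batch_no = batch_no
  pages += cost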
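A runnable version of the packing loop is sketched below so the arithmetic can be checked by hand. The AwbGroup dataclass and pack_batches name are illustrative, and the "pages > 0" guard is an added assumption: it stops an AWB that is larger than a whole batch from leaving an empty batch number behind it.

```python
from dataclasses import dataclass

MAX_PAGES_PER_BATCH = 48


@dataclass
class AwbGroup:
    awb: str
    inv_pages: int
    batch_no: int = 0


def pack_batches(groups, max_pages=MAX_PAGES_PER_BATCH):
    """Assign batch numbers in sequence order using the greedy rule above."""
    batch_no, pages = 1, 0
    for group in groups:
        cost = 1 + group.inv_pages  # one cover page plus the document pages
        # Open a new batch only if the current one already holds something.
        if pages > 0 and pages + cost > max_pages:
            batch_no += 1
            pages = 0
        group.batch_no = batch_no
        pages += cost
    return groups
```

With invoice page counts of 10, 20, 30 and 5, the costs are 11, 21, 31 and 6: the first two AWBs share batch 1 (32 pages), the third would bring it to 63 and so opens batch 2, and the fourth fits beside it (37 pages).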
CLEAN Folder Scan

Scans CLEAN for \d{12}(?:_\d+)?.pdf. Files grouped by AWB and sorted within group by mtime ascending. Groups globally ordered by each group's oldest file's mtime — FIFO. Each PDF opened exactly once during scan; page counts cached for the build phase.

Pattern: \d{12}(?:_\d+)?.pdf · Sort: mtime asc · Each PDF opened once
Page count efficiency: Each PDF in CLEAN is opened once during the scan phase to read its page count. The counts are cached in memory and reused during the build phase — no file is opened twice. Keeps the batch builder fast even for large CLEAN folders.
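The grouping and FIFO ordering of the CLEAN scan can be illustrated on plain (filename, mtime) pairs, which keeps the example independent of the filesystem. The name group_fifo is illustrative, and the regex here is a stricter variant chosen for the sketch: it anchors the match, escapes the dot before "pdf", and ignores case, none of which is confirmed for the real scanner.

```python
import re
from collections import defaultdict

# Stricter variant of the documented pattern (anchored, escaped dot).
AWB_FILE = re.compile(r"^(\d{12})(?:_\d+)?\.pdf$", re.IGNORECASE)


def group_fifo(files):
    """files: iterable of (filename, mtime) pairs.

    Returns [(awb, [filenames])]: groups ordered by their oldest file,
    files inside each group ordered by mtime ascending.
    """
    groups = defaultdict(list)
    for name, mtime in files:
        match = AWB_FILE.match(name)
        if match:  # ignore anything that is not an AWB-named PDF
            groups[match.group(1)].append((mtime, name))
    ordered = sorted(groups.items(), key=lambda item: min(item[1]))
    return [(awb, [name for _, name in sorted(entries)]) for awb, entries in ordered]
```

A group's position is set by its oldest file, so a late-arriving second document for an early AWB does not push that AWB back in the queue.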
Sequence numbers are assigned globally across all AWBs in FIFO order (oldest file first). Each AWB gets a unique sequential integer (seq) that appears on its barcode cover page. This number is the primary key in the Excel sequence log and allows tracing any printed batch document back to its original PDF.

Sequencing Rules

1. FIFO by file mtime: CLEAN folder is scanned and groups sorted by the oldest file's mtime within each AWB group. Earliest-arrived files get the lowest sequence numbers.
2. Multiple files per AWB: When an AWB has multiple documents (e.g. 123456789012.pdf and 123456789012_2.pdf), all are grouped under the same sequence number. All files are inserted after the single barcode cover page.
3. Append-only sequence log: Every batch run appends rows to data/OUT/awb_sequence.xlsx; history is never overwritten. Columns: Seq, AWB, PDF Files, Timestamp, DocCount, InvoicePages, TotalPages, Batch, Tier.
4. Tier batching resets seq: When ENABLE_TIER_BATCHING=True, sequence numbers restart at 1 within each tier (High, Medium, Low). Each tier is built as a separate batch series; their cover pages are independently numbered.

Excel Sequence Log Schema

Column | Type | Description
Seq | integer | Global sequence number, unique per AWB per build run
AWB | string | 12-digit Air Waybill Number
PDF Files | string | Pipe-separated list of source PDFs: 123456789012.pdf | 123456789012_2.pdf
Timestamp | datetime | ISO timestamp of when this entry was built
DocCount | integer | Number of PDF documents included for this AWB
InvoicePages | integer | Total invoice/document page count (excludes cover)
TotalPages | integer | Cover page + invoice pages = what occupies batch space
Batch | integer | Batch PDF number this AWB was assigned to
Tier | string | Detection tier (High / Medium / Low) when tier batching enabled
What's on the Cover Page
SEQ: Sequence number in large Helvetica-Bold 18pt
AWB: 12-digit Air Waybill Number in 22pt bold
BATCH: Batch number, zero-padded to 3 digits (e.g. 003)
PAGE: Cover's position within the batch (e.g. Page 7 of 48)
DOCS: Document count and total invoice page count
TIER: Detection tier label (when tier batching enabled)
BARCODE: Code128 barcode of the AWB number, height 60pt, width 1.2pt
FOOTER: Generation timestamp at bottom of page
Cover Page Generation

Built using reportlab's canvas API. Generated in-memory as raw PDF bytes (BytesIO), then merged into the batch PDF via PyMuPDF's insert_pdf(). Page size is configurable: LETTER (default) or A4.

# Cover page structure (letter size)
c.setFont("Helvetica-Bold", 18)
c.drawString(60, h-80, f"SEQ: {seq}")
c.setFont("Helvetica-Bold", 22)
c.drawString(60, h-120, f"AWB: {awb}")
barcode = Code128(awb, barHeight=60, barWidth=1.2)
barcode.drawOn(c, 60, h-290)
Library: reportlab · Barcode: Code128 · Page: LETTER or A4
When ENABLE_TIER_BATCHING=True, the builder creates separate batch series for each detection confidence tier. High-confidence AWBs (matched from filename or text layer) are batched separately from OCR-matched or lower-confidence documents. Each tier's batches are independently numbered and file-named.
HIGH Tier

AWBs matched via filename or text layer extraction. Prefixes: FILENAME, TEXTLAYER-EXACT, TEXT-LAYER.

Batch file format: PRINT_STACK_BATCH_TH_001.pdf
MEDIUM Tier

AWBs matched via exact OCR. Prefixes: OCR-EXACT. These required image rendering but produced a clean exact match.

Batch file format: PRINT_STACK_BATCH_TM_001.pdf
LOW Tier

All other AWBs — tolerance-matched, rescue-stage matches, or any detection method not in the high/medium prefixes.

Batch file format: PRINT_STACK_BATCH_TL_001.pdf
How Tier is Determined

The batch builder reads data/stage_cache.csv which the hotfolder writes after each successful match. The AWB_Detection_Type column stores the method label (e.g. Filename, OCR-Strong-Exact-High). Tier is assigned by prefix matching:

Detection Method Prefix | Assigned Tier
FILENAME, TEXTLAYER-EXACT, TEXT-LAYER | High
OCR-EXACT | Medium
All others (tolerance, rescue, rotation fallback) | Low
AWB not found in stage_cache.csv | Low (default)
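The prefix table maps directly onto a small lookup. In this sketch the function name tier_for_method is illustrative and the case-insensitive comparison is an assumption, made because the guide shows both "Filename" and "FILENAME" spellings.

```python
HIGH_PREFIXES = ("FILENAME", "TEXTLAYER-EXACT", "TEXT-LAYER")
MEDIUM_PREFIXES = ("OCR-EXACT",)


def tier_for_method(method):
    """Map a detection-method label to High / Medium / Low by prefix."""
    label = (method or "").strip().upper()  # case-insensitive: an assumption
    if label.startswith(HIGH_PREFIXES):
        return "High"
    if label.startswith(MEDIUM_PREFIXES):
        return "Medium"
    return "Low"  # also covers AWBs missing from stage_cache.csv
```

Under strict prefix matching, a label such as OCR-Strong-Exact-High falls through to Low because it does not begin with OCR-EXACT; whether the real builder treats that label differently is not stated here.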
Atomic Write Safety

Every batch PDF is written atomically: the PDF is saved to a .tmp file first, then renamed to the final name via os.replace(), which is atomic on both Windows NTFS and macOS APFS as long as the .tmp file sits on the same volume as the final file. If the process crashes mid-write, no partial batch PDF can be mistaken for a complete one.

# Atomic write (no partial files on crash)
doc.save(str(tmp_path))
doc.close()
os.replace(str(tmp_path), str(out_path)) # atomic
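The same write-then-rename pattern can be shown for arbitrary bytes using only the standard library. The helper name atomic_write_bytes is illustrative; creating the temporary file in the target's own directory and calling os.fsync before the rename are extra precautions added in this sketch, not steps confirmed for the batch builder.

```python
import os
import tempfile


def atomic_write_bytes(out_path, data):
    """Write data to out_path so readers never observe a partial file."""
    directory = os.path.dirname(os.path.abspath(out_path))
    # Create the temp file beside the target so both sit on the same volume.
    fd, tmp_path = tempfile.mkstemp(suffix=".tmp", dir=directory)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before the rename
        os.replace(tmp_path, out_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # leave no stray .tmp behind on failure
        raise
```

Anything scanning the output folder sees either the previous complete file or the new complete file, never a half-written one.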
PENDING_PRINT Copy with MD5 Dedup

After batch PDFs are built in data/OUT/, they are copied to PENDING_PRINT/ for TIFF conversion and printing. Before each copy, MD5 hashes are compared: if an identical batch already exists in PENDING_PRINT, the copy is skipped. Different-content files with the same name get a _v2, _v3 suffix.

MD5 dedup on copy · Suffix: _v2, _v3...
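The copy rule (skip identical content, add a version suffix for different content under the same name) can be sketched as follows. The function name copy_with_dedup and the returned status strings are illustrative, and checking every existing _vN variant for an identical hash is an assumption about how far the dedup comparison reaches.

```python
import hashlib
import shutil
from pathlib import Path


def _md5(path):
    """Hex MD5 of a whole file (small helper for this sketch)."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def copy_with_dedup(src, dest_dir):
    """Copy src into dest_dir; return (status, final_path)."""
    src, dest_dir = Path(src), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    src_hash = _md5(src)
    candidate, version = dest_dir / src.name, 1
    while candidate.exists():
        if _md5(candidate) == src_hash:
            return "skipped_dup", candidate  # identical batch already staged
        version += 1
        candidate = dest_dir / f"{src.stem}_v{version}{src.suffix}"
    shutil.copy2(src, candidate)
    return "copied", candidate
```

The status value is what a caller would tally into the copied and skipped_dup counters used by the deletion safety guard.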
CLEAN Source Deletion Safety

Source files in CLEAN are only deleted after all batch PDFs have been successfully copied to PENDING_PRINT. If any copy fails, deletion is skipped entirely with a safety message logged. This prevents data loss if the copy step is interrupted.

# Safety guard before source deletion
if (failed == 0 and
    copied + skipped_dup == expected):
  delete_clean_sources(resolved)
else:
  log("[SAFETY] Skipping deletion — not all files copied")
data/OUT/: Batch PDFs named PRINT_STACK_BATCH_NNN.pdf
PENDING_PRINT/: Copy of batches + TIFF conversions for scanner
awb_sequence.xlsx: Append-only sequence log, history preserved
Section 11

Tracking / Audit Layer

Every file, every match, every decision — permanently recorded across four complementary audit mechanisms with concurrent-write safety.

4-Sheet Audit Workbook

Central file: pipeline_audit.xlsx.

HotfolderV2: One row per AWB detection — timestamp, employee ID, AWB, filenames, detection method, tier, timing (ms), result, notes.
EDM: One row per EDM check — result, dup page count, dup ratio, compare method.
BatchTIFF: One row per batch build or TIFF conversion — batch number, AWB count, page count, output path.
Dashboard: Programmatically computed summary, rewritten on every write. Session totals and running stats.
JSONL Audit Log

Every event is appended to logs/pipeline_audit.jsonl. The log auto-rotates at 50MB (one backup). Logging never interrupts pipeline flow: all writes are wrapped in try/except.
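A minimal sketch of this append-and-rotate behavior is given below. The function name append_event and the ".1" backup suffix are assumptions; the 50MB limit, the single backup, and the rule that logging failures must not stop the pipeline come from the description above.

```python
import json
import os

MAX_BYTES = 50 * 1024 * 1024  # rotate at 50MB


def append_event(log_path, event, max_bytes=MAX_BYTES):
    """Append one JSON line; return False instead of raising on failure."""
    try:
        if os.path.exists(log_path) and os.path.getsize(log_path) >= max_bytes:
            # Keep a single backup; an older backup is overwritten.
            os.replace(log_path, log_path + ".1")
        with open(log_path, "a", encoding="utf-8") as handle:
            handle.write(json.dumps(event, ensure_ascii=False) + "\n")
        return True
    except (OSError, TypeError, ValueError):
        return False  # audit problems must never interrupt the pipeline
```

Returning a boolean rather than raising lets the caller log a warning and carry on with file processing.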

{"ts": "2025-03-15 08:14:22",
"stage": "AWB_HOTFOLDER",
"file": "INVOICE_2025.pdf",
"status": "MATCHED",
"awb": "400856290147",
"match_method": "OCR-Strong-Exact-High",
"hotfolder_secs": 8.4}
Concurrent Write Safety

Lock-file pattern via os.O_CREAT | os.O_EXCL, which is atomic on both Windows NTFS and macOS APFS. Timeout: 15 seconds. Stale locks (held >30s) are auto-cleared. This enables safe concurrent access from hotfolder + batch_builder + TIFF converter simultaneously.
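The lock-file pattern can be sketched with os.open and the exclusive-create flags. The function names, the 50ms polling interval, and using the lock file's mtime to judge staleness are assumptions for illustration; the 15-second timeout and 30-second stale limit are the values quoted above.

```python
import os
import time


def release_lock(lock_path):
    """Remove the lock file; a missing file is not an error."""
    try:
        os.remove(lock_path)
    except FileNotFoundError:
        pass


def acquire_lock(lock_path, timeout=15.0, stale_after=30.0):
    """Create lock_path exclusively; True on success, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so only
            # one process can win the creation.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return True
        except FileExistsError:
            try:
                age = time.time() - os.path.getmtime(lock_path)
            except OSError:
                age = 0.0  # lock vanished between checks; just retry
            if age > stale_after:
                release_lock(lock_path)  # clear a lock its owner abandoned
                continue
        if time.monotonic() >= deadline:
            return False
        time.sleep(0.05)
```

A writer would call acquire_lock before opening the workbook and release_lock in a finally block, so a crash inside the write leaves at worst a stale lock that the next writer clears.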

Known Limitation: When multiple INBOX files share the same AWB (duplicates at different paths), the tracker creates a separate row per file — not per AWB. This is correct behaviour (unique processing records) but AWB-level aggregation requires grouping by the AWB column across rows. A future processing_id UUID field would resolve this cleanly.
Section 12

Error Handling & Recovery

How the pipeline handles every failure mode — from unstable files and corrupt PDFs to OCR noise, rotation ambiguity, and the three-tier timeout recovery cycle.

File Stability Failure

If file_is_stable() fails, pipeline returns NEEDS_REVIEW immediately. The 30-second safety rescan re-enqueues the file once writing completes, allowing processing to succeed on the next cycle without manual intervention.

Corrupt or Zero-Page PDFs

PyMuPDF's open() and a page_count check catch corrupt files before any extraction stage. Such files are moved to NEEDS_REVIEW with a clear reason string ("0-page PDF" or "Corrupt/unreadable PDF: {exception}"). The guard runs before any closures are defined, ensuring clean state even on immediate failure.

45-Second Long-Pass Timeout

High-resolution scans with complex rotation can exceed the budget. Rather than blocking the entire pipeline, the timeout captures full state (candidates, OCR cache, probe scores, timings) and defers to the third-pass queue. Third pass processes up to 3 deferred files per cycle with no timeout, only when both fast and long queues are empty.

OCR Noise Quarantine

Passes known to be noisy (invert, Rotation-180°/270°, AngFallback) that produce >3 STANDARD candidates in one pass have those candidates quarantined. Promoted to active pool only when a second, independent stage sees the same number — preventing one-off OCR artifacts from ever reaching the matching step.

Three-Layer Rotation Defence

(1) Free metadata checks before any render, (2) low-DPI keyword-scored probe, (3) full rotation fallback at all remaining angles. If probe is uncertain, the pre-check hint overrides it. Saved timeout state includes probe scores — third pass skips re-probing entirely.

Table Line Removal

For documents where ruling lines cross digit characters and confuse Tesseract, OpenCV morphological operations detect and remove horizontal and vertical lines before OCR. Applied when CV2_AVAILABLE=True. Graceful fallback when OpenCV is unavailable — only the table cleaning optimisation is skipped.

Section 13

Technical Architecture

Module structure, data flow, key dependencies, and the .env / config.py configuration system. Single-operator Windows deployment with Mac development environment.

Module Dependency Map

Entry Points
app.py · hotfolder.py · batch_builder.py · tiff_converter.py
Configuration
config.py reads .env → all paths, DPIs, thresholds, feature flags
Orchestrator
pipeline.py — Stages 0–7, timeout, state capture (~1,700 lines)
OCR Engine
ocr_engine.py — render, preprocess, Tesseract wrappers, cv2 table removal
Extractor
awb_extractor.py — 15+ regex patterns, keyword proximity, candidate tiers (~640 lines)
Matcher
awb_matcher.py — Hamming distance, priority cascade, tie guard
File Ops
file_ops.py — stability, safe_move, MD5 dedup, Excel AWB logs, stage cache CSV
Audit Layer
tracker.py (4-sheet Excel, lock-file) · logger.py (JSONL, 50MB rotation)
V3 addition: services/edm_checker.py now owns runtime EDM toggle state, cache-backed AWB existence checks, and V1/V2-compatible EDM metadata request payloads used by Stage 6 fallback.

Configuration System

# .env — machine-specific values, never committed to version control
PIPELINE_BASE_DIR=C:\Users\5834089\Downloads\CCD_Filler
TESSERACT_PATH=C:\Users\5834089\Downloads\CCD_Filler\tesseract.exe

# OCR tuning
OCR_DPI_MAIN=320 # Main OCR render resolution
OCR_DPI_STRONG=420 # Strong OCR render resolution
ROTATION_PROBE_DPI=140 # Low-cost rotation probe DPI
LONG_PASS_TIMEOUT_SECONDS=45.0

# Matching thresholds
TOLERANCE_HIGH_MAX_DISTANCE=2 # Max Hamming distance for HIGH candidates
TOLERANCE_STANDARD_MAX_DISTANCE=1 # Max Hamming distance for STANDARD
MIN_STAGE_HITS_HIGH_TOL2=2 # 2 stages required for 2-digit tolerance

# Output
MAX_PAGES_PER_BATCH=48
TIFF_DPI=200
TIFF_PARALLEL_WORKERS=4

Key Dependencies

PyMuPDF ≥1.24
PDF text extraction, page rendering, metadata access
pytesseract
Tesseract 5.x OCR — PSM modes 6, 7, 11; digit whitelist mode
watchdog ≥4.0
Cross-platform filesystem event monitoring for INBOX
openpyxl ≥3.1
Excel read/write for AWB DB, audit workbook, logs
opencv-python-headless
Table line removal via morphological operations (optional)
Pillow ≥10 · numpy
Image preprocessing, rotation, thresholding, Lanczos upscaling
Section 14

Live Status Simulation

Watch the pipeline process three different document scenarios in real time — from an instant filename match to a complex rotated scan with timeout and third-pass resume.

Section 15

Risks & Edge Cases

Known failure modes, ambiguous scenarios, and the mitigations currently in place — or flagged as future work.

HIGH · Same AWB in Multiple INBOX Files

Multiple PDFs can arrive containing the same AWB — re-downloads from EDM or duplicates sent by shippers. Each is processed independently. At the PROCESSED level, MD5 comparison catches byte-for-byte identical files. Different content with same AWB gets an auto-incrementing suffix (_2, _3). The audit tracker creates separate rows per file (not per AWB), which is correct but requires join logic for per-AWB reporting.

MED · OCR Digit Misread (1–2 digit errors)

Tesseract commonly misreads 0↔8, 1↔7, 5↔6 in compressed scans. Hamming tolerance (≤2 for HIGH, ≤1 for STANDARD) handles this — but only when the candidate has been seen in multiple stages. The stage-hits requirement prevents a single noisy pass from generating a tolerance match. Risk is higher for AWBs differing from each other by only 1–2 digits, guarded by the unique-match requirement (no ties allowed).
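The interaction between Hamming tolerance, the stage-hits requirement, and the no-ties rule can be sketched as below. This is a simplification under stated assumptions: the function names are illustrative, and a single min_stage_hits value is applied to every tolerance match, whereas the configuration describes the two-stage requirement specifically for 2-digit tolerance.

```python
def hamming(a, b):
    """Number of differing positions between two equal-length digit strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def tolerance_match(candidate, awb_set, max_distance, stage_hits, min_stage_hits=2):
    """Return the single AWB within max_distance of candidate, else None."""
    if candidate in awb_set:
        return candidate  # exact hit needs no tolerance
    if stage_hits < min_stage_hits:
        return None  # one noisy pass is not enough to use tolerance
    near = [awb for awb in awb_set if hamming(candidate, awb) <= max_distance]
    return near[0] if len(near) == 1 else None  # ties are rejected
```

When two loaded AWBs both sit within tolerance of the same misread candidate, the function returns None rather than guessing, which is the tie guard described above.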

MED · Rotation Probe Choosing the Wrong Angle

The probe can be misled by documents with large blocks of digits at non-AWB angles (barcodes, HS code tables). Stage 4 rotation fallback mitigates this by running OCR at remaining angles. The 80-point flip margin threshold guards against switching from 0° without strong evidence. The quarantine logic prevents any rotation fallback candidates from being accepted without multi-stage confirmation.

MED · HS Codes and Phone Numbers as False Candidates

Commercial invoices contain many 12-digit sequences: HS commodity codes, phone numbers with country codes, and reference numbers. The disqualification filter catches leading-zero sequences and date patterns. For remaining false candidates, the DB check is the final gate — unless the false candidate coincidentally matches a currently loaded AWB record (rare but theoretically possible). Stage-hits requirement reduces this risk significantly.

MED · EDM Token Expiry During Session

The EDM Bearer token in data/token.txt can expire after a few hours. In V3 this is fail-safe: Stage 6 logs a warning and returns an unchecked path (None), so the pipeline continues without hard failure. The UI toggle can also set EDM OFF, which intentionally bypasses API calls until token/auth is restored.

LOW · AWB Excel DB Out of Sync (stale mtime on NFS/SMB)

DB refreshes every 30s by mtime comparison. On some NFS and SMB mounts, mtime does not reflect changes promptly, so the pipeline may keep using a stale AWB set past the 30-second refresh window, until the new mtime becomes visible. The UI can force an immediate reload via the trigger file. Files that fail to match during a stale window route to NEEDS_REVIEW and can be retried after DB refresh.

LOW · Audit Lock File Contention at Scale

The 4-sheet audit workbook uses a file-lock pattern. If two pipeline processes attempt to write simultaneously, one waits up to 15 seconds. Locks held >30s are forcibly cleared. In a single-operator, single-process deployment this is rarely a problem, but at scale (multiple simultaneous pipeline instances) this becomes a serialisation bottleneck. Solution: switch to append-only SQLite audit store.

Section 16

Future Roadmap

Planned improvements and architectural upgrades — from near-term wins to long-term platform evolution.

NEAR-TERM · EDM Fallback Hardening

Stage 6 is live in V3. Next step is hardening: add endpoint-contract smoke tests, token expiry telemetry, and richer reason codes in audit logs so bypass causes (OFF/no token/auth/network) are visible at a glance.

NEAR-TERM · Per-Record Lineage in Audit Tracker

Introduce a unique processing_id UUID per INBOX file, threading through all audit events from intake to output. Enables complete per-file lineage queries without joining on filenames. Directly resolves the multi-document same-AWB aggregation challenge.

MID-TERM · Persistent OCR Result Caching

Persist OCR cache between pipeline runs. Files that return to NEEDS_REVIEW and are re-submitted could skip re-OCR if their content hasn't changed (MD5 match). Saves 5–15 seconds per re-submission on scanned documents — significant for high-volume operations.

MID-TERM · Operator Review Web Console

A web-based UI (FastAPI + React) for reviewing NEEDS_REVIEW files. Shows the PDF alongside candidates tried, rotation scores, and OCR snippets. Operator confirms, corrects, or rejects the AWB with one click — feeding corrections back to improve future matches.

LONG-TERM · ML-Assisted Confidence Tuning

Train a lightweight classifier on accumulated stage-cache and audit data to predict: (a) which DPI/PSM combination will yield the best result for a given document type, (b) optimal rotation probe angle narrowing from file metadata features. Could reduce average processing time by 30–40% for the hardest document tier.

LONG-TERM · Multi-Worker Parallel Processing

Current architecture is single-worker. A ProcessPoolExecutor could run multiple files in parallel. Would require switching the audit layer from Excel workbook to append-only SQLite — eliminating lock contention at scale and enabling true concurrent processing.

Section 17

Operator Control Centre

The current Tkinter app is the live command surface: login, hotfolder control, EDM runtime toggle, automatic orchestration, manual batch/TIFF actions, folder counts, logs, and safe cleanup.

Primary Runtime: Tk (V3.ui.app_window.App)
Auto Interval: 10s (AUTO_INTERVAL_SEC default)
Batch Gate: 2 (min clean batches before AUTO build)
Manual Escape: Live (operators can intervene at every folder)
Primary Controls
Start AWB: Starts or stops the watchdog hotfolder process and the EDM duplicate checker when enabled.
EDM ON/OFF: Writes the runtime toggle in data/edm_toggle.json; Stage 6 and downstream duplicate checks read it live.
AUTO MODE: Runs the unattended loop: wait for INBOX drain, wait for PROCESSED drain, batch when ready, then convert TIFF.
Full Cycle: Runs one supervised cycle without leaving the app in continuous automation mode.
Recovery + Manual Actions
Prepare Batch: Runs V3.services.batch_builder against CLEAN and stages outputs into PENDING_PRINT.
Convert TIFF: Runs the parallel PDF-to-TIFF converter beside the batch PDFs.
Retry Failed: Moves NEEDS_REVIEW PDFs back into INBOX for another pass after DB/token/input fixes.
Clear All: Stops running work, preserves protected files, and clears operational working outputs safely.
Auto Phase | Current Behavior | Safety Guard
1 INBOX drain | Waits until INBOX stays empty for INBOX_EMPTY_STABLE_SECONDS=8. | Max wait 1800s; stop event checked every half second.
2 PROCESSED drain | Lets the EDM duplicate checker move matched files into CLEAN, REJECTED, or CLEAN-UNCHECKED. | If EDM checker is not running, direct PROCESSED-to-CLEAN fallback is logged.
3 Batch readiness | Estimates batch count before building; default AUTO build waits for at least 2 batches. | Force-batches old CLEAN files after AUTO_FORCE_BATCH_AGE_SECONDS=1800.
4 TIFF conversion | Runs after successful batch copy to PENDING_PRINT. | Skip-if-exists prevents repeat conversion of healthy TIFFs.
Current operator model: the app is intentionally local-first. It uses folder counts, line-capped colored logs, progress animation, session state, protected-file cleanup rules, and one-click launchers instead of requiring a browser server or external queue service.
Section 18

Current Process Snapshot

A concise version map of the repo as it exists now: what is live, what is configurable, and which files own each part of the process.

Live In V3

Two-pass hotfolder, fast-lane ProbeLite, long-pass timeout, third-pass resume, Stage-6 EDM fallback, downstream EDM duplicate checker, batch builder, TIFF converter, and Tk control centre.

Config Driven

Paths, Tesseract, OCR DPI, rotation thresholds, EDM endpoints, duplicate thresholds, batch limits, TIFF settings, and AUTO mode timing all flow through V3/config.py and .env.

Documented Now

README, operations runbook, and this visual guide now line up around the same end-to-end flow: INBOX -> PROCESSED -> CLEAN/REJECTED -> OUT -> PENDING_PRINT.

Process Area | Owner File | Current Responsibility
Launcher + App Entry | V3/launcher.py, V3/app.py | Starts the branded UI and cross-platform control surface.
Configuration | V3/config.py | Centralizes paths, feature flags, timeouts, thresholds, EDM endpoints, and TIFF/batch settings.
Hotfolder Scheduler | V3/services/hotfolder.py | Watchdog queue, AWB DB refresh, fast lane, long pass, timeout resume, heartbeat, attempt caps.
AWB Extraction | V3/stages/pipeline.py | Stages 0-7, OCR, rotation, rescue passes, EDM persistence fallback, diagnostics.
Duplicate Screening | V3/services/edm_duplicate_checker.py | EDM metadata/download, Gate 1 hash, Gate 2 probes, Tier 2 HASH/PHASH/TEXT/OCR compare, routing.
Batch Assembly | V3/services/batch_builder.py | Scans CLEAN, sequences AWBs, writes barcode covers, atomic print-stack PDFs, copies to PENDING_PRINT.
TIFF Output | V3/services/tiff_converter.py | Parallel, streaming multi-page TIFF generation beside print-stack PDFs.
Audit + Logs | V3/audit/*, logs/* | Structured JSONL events, dashboard rebuilds, workbook trackers, pipeline and EDM logs.
Unreleased Documentation Delta

The current changelog records documentation alignment work as Unreleased: README and operations docs were updated for ProbeLite, third-pass resume, EDM Stage 6, downstream duplicate routing, current config keys, and controlled shutdown behavior.

Reality Check

This page avoids hard-coding operational volume claims where the code reads the live Excel DB at runtime. The reliable source of truth is the current data/AWB_dB.xlsx load count printed by the hotfolder log.