Document Processing Pipeline V3 — FedEx Document Automation

Section 01

Executive Overview

Plain-English explanation of the Document Processing Pipeline, its business value, and end-to-end function — written for founders, operators, and leadership.

From 7-step manual paper workflow to zero-touch automation

Replaces printing, manual AWB typing, barcode generation, physical arrangement, and paper scanning. Adds a duplication check that never existed before.

Automation Rate

+90%

docs matched without human input

Fast Match Speed

<100ms

for filename or text-layer hits

DB Source

Excel

AWB set reloads every 30s

False Positives

Zero

all AWBs verified against DB

What is the Document Processing Pipeline?

The Document Processing Pipeline is an automated document processing system built for FedEx shipment operations. It watches a local "hot-folder" (INBOX), and whenever a new PDF arrives — whether an Air Waybill, commercial invoice, or packing list — it immediately identifies the 12-digit Air Waybill Number embedded in that document.

Once the AWB is confirmed against the live Excel-backed AWB set, the system automatically renames the file, moves it to the correct folder, and records every action in multiple audit trails. Runtime is typically sub-second for easy filename/text-layer matches and can extend to tens of seconds for hard rotated/OCR rescue documents — all without a human touching a keyboard.

The Manual Process — Then & Now

Before automation, every incoming shipment document required a multi-step physical workflow involving printing, manual data entry, barcode generation, physical arrangement, and paper scanning. The pipeline eliminates every single one of those steps.

OLD MANUAL WORKFLOW — 7 Steps Per Document Batch

Document Received

PDF arrived from multiple sources

Print Document

Employee prints the PDF on paper

Stack on Desk

Placed on physical scanning desk stack

Manual AWB Entry

Person reads AWB from page, types into barcode software

Print Barcodes

Barcode generator prints a barcode label per AWB

Arrange & Attach

Barcodes placed on top of each matching AWB doc

Scan Physical Batch

Entire paper stack fed through physical scanner

Time-Intensive

Each document requires manual attention — reading, typing, printing, arranging, scanning

Human Error Risk

Manual AWB typing from document to barcode software introduces typos and misreads

No Duplicate Check

No automated way to detect if the same shipment document was submitted twice

Document Processing Pipeline V3 replaces all 7 steps

AUTOMATED PIPELINE — Fully Hands-Free

PDF Arrives in INBOX

Watchdog detects within 2 seconds. No printing, no physical handling.

AWB Auto-Extracted

Stages 0–7 find and verify the 12-digit AWB against the currently loaded Excel AWB set.

Duplicate Check

EDM API check + MD5 hash dedup catches any document submitted more than once.

Batch Ready to Print

Barcode cover pages generated automatically. Batch PDF + TIFF assembled for scanner.

Metric	Manual Process	Document Processing Pipeline V3
Steps per batch	7 manual steps	1 — drop file in INBOX
AWB identification	Human reads & types manually	Automated OCR + DB verify
Barcode generation	Manual software entry per AWB	Auto-generated with batch cover page
Duplicate detection	None — manual check only	EDM gate/layer screening with HASH, PHASH, TEXT, and OCR signals
Error source	Typos in manual AWB entry	Zero false positives — DB-verified only
Audit trail	None	JSONL + Excel + 4-sheet workbook
Processing speed	Minutes per document	<100ms – 20s per document

End-to-End in Six Steps

1 · Document Arrives

PDF lands in INBOX. Watchdog detects it within 2 seconds. Stability guard ensures it is fully written before processing begins.

2 · AWB Extracted

Up to 8 pipeline stages search for the 12-digit AWB: filename pattern, embedded text, OCR at 320 and 420 DPI, rotation handling, and rescue crop passes.

3 · DB Confirmation

Every candidate AWB cross-checked against the live Excel database of 70,000+ records. No match = not accepted. Zero false positives guaranteed.

4 · File Organised

File renamed to its AWB number and moved to PROCESSED. Byte-for-byte duplicates detected via MD5 hash and removed safely without data loss.

5 · Logged & Audited

Every event recorded in: Excel AWB log, JSONL audit log, stage-performance CSV, and a 4-sheet audit workbook with a live Dashboard tab.

6 · Print-Ready

After EDM duplicate checking, clean documents are collated into batch PDFs (max 48 pages, barcode cover) then converted to multi-page TIFFs.

What about difficult documents? Sideways scans, image-only PDFs, and ambiguous documents are handled via multi-stage fallback. If all stages fail, the file moves to NEEDS_REVIEW for manual handling. During the long pass, difficult files are also held to a per-file budget of LONG_PASS_TIMEOUT_SECONDS=45.0: a file that exceeds it is deferred to a third pass that runs once easier documents clear the queue, with its full OCR state preserved so work resumes where it stopped.

Business Impact for FedEx Operations

This automation is valuable because it solves real operational delays, not just document processing. It reduces shipment risk, protects service timelines, cuts manual workload, and gives teams real visibility into flow and backlog.

Operational Problems Solved

01 · Late scanning risk: a perishable-shipment invoice scanned 3 hours late can push movement by 1-2 days.

02 · After-hours pileups: documents received post-shift sit unprocessed and miss next-day clearance windows.

03 · Duplicate exposure: the same shipment documents can re-enter the flow without reliable duplicate screening.

04 · Paper-heavy handling: printing, arranging, rescanning, and manual indexing consume time and resources.

05 · No real-time queue visibility: teams cannot easily see what is waiting, what is blocked, or what has completed.

06 · Exception handling overload: hard scans consume senior operator time that should be reserved for true edge cases.

07 · Shift-change discontinuity: work quality and speed vary by shift handoff and manual prioritization.

08 · Audit reconstruction cost: proving what happened to a file requires manual log chasing across tools.

How Automation Changes Outcomes

Service Speed: Night and backlog documents can be processed continuously, reducing clearance and dispatch delays.

Accuracy: Duplicate checks reduce repeat handling, rework, and downstream delivery confusion.

Paperless Shift: Printing/scanning dependency drops sharply, enabling near paperless operation for this lane.

Visibility: Document counts, queue state, and status become visible in real time, enabling auditable operations.

Labor Efficiency: Countless manual scan-touch hours are replaced by no-touch routing for routine files.

Continuity: Standardized automation behavior reduces dependence on shift-specific tribal workflows.

Predictability: Deferred/timeout recovery keeps difficult files from freezing intake lanes.

Leadership Control: Live operational telemetry supports faster escalation and staffing decisions.

Operational Dimension	Without Automation	With Automation	Business Effect
Perishable Shipment Timeliness	Invoice scan delay can hold shipment 1-2 extra days	Continuous intake + automated extraction reduces waiting time before clearance	Higher on-time movement probability for sensitive cargo
After-Hours Intake	Documents accumulate overnight and spill into next-day operations	Queue is monitored and processed with less dependency on immediate manual action	Lower morning backlog and fewer next-day misses
Duplicate Handling	Manual memory/checklists; high inconsistency risk	Automated duplicate screening with conservative decision policy	Lower rework, fewer unnecessary touches, stronger control
Labor & Paper	Heavy print/scan/manual sort effort	Near paperless flow and no-touch processing for routine files	Significant hours saved and reduced paper cost/waste
Operational Visibility	Little live visibility into document volume and state	Live counts, status, queue depth, and event logs available	True auditability and better staffing/decision planning
Exception Throughput	Hard files can block operators and delay normal workload	Timeout-defer + resume path isolates hard files from routine flow	Better SLA protection for standard-volume processing
Shift-to-Shift Consistency	Output quality depends heavily on who is on shift	Unified automated rules enforce same decision logic across all shifts	More consistent operational outcomes and fewer handoff surprises
Audit Readiness	Manual evidence collection is slow and fragmented	Structured logs and tracker records available per processing event	Faster internal/external review with higher confidence

Strategic outcome: this shifts teams from endless scan/print/manual handling to exception-based operations. Routine documents move with minimal touch, critical shipments face less delay risk, and leadership gains measurable control through real-time visibility and audit evidence.

Section 02

Visual Pipeline Flow

The flowchart shows every major stage, from document arrival through three-tier scheduling to final output.

Legend: Extraction · OCR Stage · Rotation · Decision · Success · Review

Phase 2 — Post-Match Processing Pipeline

Every file that exits the AWB extraction engine as MATCHED enters Phase 2.

PROCESSED

AWB-renamed files land here

EDM CHECK

3-layer duplicate detection

MD5 · pHash · OCR

CLEAN

Unique → batch eligible

REJECTED

Duplicate → discarded

BATCH BUILDER

Sort → sequence → pack ≤48pp

HIGH · MED · LOW

COVER PAGE

Code128 barcode + seq + metadata

PENDING_PRINT

Batch PDFs + MD5 dedup copy

TIFF CONVERT

200 DPI · LZW · 4 parallel workers

PRINT READY

Batch PDF + TIFF in scanner queue

Three-Tier Processing Scheduler

Fast Lane

Stages 0–3 only, plus a strict post-Stage-3 ProbeLite check (400 / exact-high only) before defer. Returns DEFERRED only if probe-lite also fails. Drains the inbox at maximum speed. Up to 5 files per cycle.

Long Pass

Full pipeline on deferred files when fast queue empties. LONG_PASS_TIMEOUT_SECONDS=45.0 per-file timeout. On timeout, full state is captured and file queued for third pass.

Third Pass

Resumes timeout-deferred files with no timeout. OCR cache, candidate pool, rotation state all restored from snapshot. Zero repeated work.

Section 03

Intake & Detection

How the system watches the INBOX folder, ensures files are fully written, and manages queuing, de-bounce, and safety rescans.

Watchdog Observer

The watchdog library attaches InboxPDFHandler to INBOX_DIR. Fires on on_created, on_moved, and on_modified events. Only .pdf files enqueued. A 0.8-second de-bounce prevents double-queuing on OS event bursts (Windows Copy → Flush → Close sequence). Existing inbox PDFs are seeded into the queue at startup.
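The de-bounce rule can be sketched without the watchdog dependency. This is a minimal illustration only: the class name `DebouncedQueue`, its `offer` method, and the per-path timestamp map are assumptions made for the example, not the pipeline's actual `InboxPDFHandler`.

```python
import time
from collections import deque

DEBOUNCE_SECONDS = 0.8  # de-bounce window quoted in this section


class DebouncedQueue:
    """Hypothetical helper: enqueue a path at most once per de-bounce window."""

    def __init__(self, window=DEBOUNCE_SECONDS, clock=time.monotonic):
        self.window = window
        self.clock = clock      # injectable so the behaviour can be tested
        self.last_seen = {}     # path -> time it was last enqueued
        self.queue = deque()

    def offer(self, path):
        """Return True if the path was enqueued, False if it was dropped."""
        if not path.lower().endswith(".pdf"):
            return False        # only PDFs are enqueued
        now = self.clock()
        last = self.last_seen.get(path)
        if last is not None and now - last < self.window:
            return False        # burst event inside the window
        self.last_seen[path] = now
        self.queue.append(path)
        return True
```

In a real handler, each of `on_created`, `on_moved`, and `on_modified` would call something like `offer()` with the event path, so a Copy → Flush → Close burst results in a single queue entry.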

File Stability Check

file_is_stable() polls the file size twice with a 0.3s delay. A file is safe only when size is non-zero and identical on both reads. Prevents processing a file still being written, transferred, or copied from a network share — a common issue with large EDM document batches.
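A minimal sketch of the stability check described above. The `polls` and `delay` parameters are assumptions introduced to make the values visible; the real `file_is_stable()` signature is not shown in this document.

```python
import os
import time


def file_is_stable(path, polls=2, delay=0.3):
    """Sketch: True when the size is non-zero and identical on every poll."""
    sizes = []
    for i in range(polls):
        try:
            sizes.append(os.path.getsize(path))
        except OSError:
            return False        # file vanished or is not accessible yet
        if i < polls - 1:
            time.sleep(delay)   # wait before the next size read
    return sizes[0] > 0 and len(set(sizes)) == 1
```

A size-only check is cheap but cannot detect a writer that pauses for longer than the poll delay, which is presumably why the safety rescan re-queues files later.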

Safety Rescan

Every 30 seconds, the main loop re-scans INBOX and re-enqueues any PDFs found. Catches files the watchdog missed during high system load or temporary filesystem unavailability.

Heartbeat Log

Every 10 seconds: logs INBOX PDF count, AWB DB size, deferred long-pass queue depth, and timeout-deferred queue depth — real-time health monitoring.

Excel DB Refresh

Every 30 seconds: checks AWB Excel file mtime. On change, reloads full AWB set and rebuilds prefix/suffix buckets. UI can force immediate reload via reload_awb.trigger sentinel file.
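The mtime-based refresh can be sketched as follows. This is an illustration under stated assumptions: `AwbSetCache` and its `loader` callback are hypothetical names, the loader stands in for whatever reads the Excel file, and the prefix/suffix bucket rebuild is omitted.

```python
import os


class AwbSetCache:
    """Hypothetical cache that reloads only when the source file's mtime changes."""

    def __init__(self, path, loader):
        self.path = path
        self.loader = loader    # callable(path) -> iterable of AWB strings
        self.mtime = None       # mtime seen at the last successful load
        self.awbs = frozenset()

    def refresh(self):
        """Return True if the AWB set was reloaded, False if the file is unchanged."""
        mtime = os.path.getmtime(self.path)
        if mtime == self.mtime:
            return False
        self.awbs = frozenset(self.loader(self.path))
        self.mtime = mtime
        return True
```

A main loop would call `refresh()` on its 30-second tick; a forced reload (the sentinel-file path) could reset `mtime` to `None` before calling it.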

Pre-Processing Guards — In Order

Guard	Check	On Failure
GUARD 1	File path exists & ends in `.pdf` (case-insensitive)	Silently dropped from queue
GUARD 2	`file_is_stable()` — 2 polls at 0.3s, non-zero identical size	NEEDS_REVIEW · rescan re-queues later
GUARD 3	PyMuPDF `page_count > 0` — catches corrupt, truncated PDFs	NEEDS_REVIEW with reason string
GUARD 4	No inbox-level deduplication — every file processed independently	Dedup only at PROCESSED level via MD5

Outlook Email Intake Script (Companion Flow)

Companion intake script behavior for email attachments before AWB hotfolder processing: attachments are saved to INBOX, and filename renaming is strict and deterministic so no ambiguity is introduced upstream.

Intake Decision Logic

1 · Read Outlook email attachments and keep PDF files only.

2 · Run strict AWB extraction patterns on attachment context/name.

3a · If exactly one strict AWB candidate exists: save as <AWB>.pdf in INBOX.

3b · If zero or multiple candidates: keep original filename and save to INBOX unchanged.

4 · AWB hotfolder picks up from INBOX and runs normal Stage 0–7 logic.

Strict Pattern Rule

Pattern Type	Accepted Form	Result
`12-digit strict`	`(?<!\d)\d{12}(?!\d)`	Eligible candidate
`4-4-4 strict`	`(?<!\d)\d{4}[\s-]\d{4}[\s-]\d{4}(?!\d)`	Normalized to 12 digits
`Exactly one match`	Single strict candidate only	Rename to `<AWB>.pdf`
`0 or >1 matches`	Ambiguous or absent AWB	Keep original filename

Safety intent: strict rename only on one unambiguous candidate preserves traceability and avoids wrong pre-labeling; all ambiguous files still enter INBOX for the app’s full extraction pipeline. This intake script is treated as an operational companion layer, keeping core V3 extraction/matching behavior unchanged.
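The strict rename rule can be sketched with the two patterns from the table above. The function names `strict_awb_candidates` and `intake_filename` are hypothetical, and treating the same AWB found in both formats as a single candidate is an assumption of this sketch.

```python
import re

STRICT_12 = re.compile(r"(?<!\d)\d{12}(?!\d)")
STRICT_444 = re.compile(r"(?<!\d)\d{4}[\s-]\d{4}[\s-]\d{4}(?!\d)")


def strict_awb_candidates(text):
    """Collect distinct 12-digit candidates from both strict patterns."""
    found = set(STRICT_12.findall(text))
    for match in STRICT_444.findall(text):
        found.add(re.sub(r"[\s-]", "", match))  # normalise 4-4-4 to 12 digits
    return found


def intake_filename(original_name, context):
    """Rename to <AWB>.pdf only when exactly one strict candidate exists."""
    candidates = strict_awb_candidates(context)
    if len(candidates) == 1:
        return f"{next(iter(candidates))}.pdf"
    return original_name  # zero or several candidates: keep the name unchanged
```

The lookbehind and lookahead reject digit runs longer than 12, so a 13-digit reference number never yields a candidate.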

Section 04

AWB Extraction Engine

Eight pipeline stages, each building on a shared candidate pool.

Unified Candidate Pool: All stages write into running_high and running_standard. Noisy candidates from invert or rotation passes enter a quarantine set until a second independent stage confirms them — preventing one-off OCR artifacts from ever reaching the matching step.

Stage	Method	DPI	Typical Time	DB Check
Stage 0	Filename regex	—	~0–5ms	YES
Stage 1	PyMuPDF text layer · 5 sub-stages	—	50–500ms	YES (excl. 400-pattern)
Pre-OCR	Angle detection · 3 free checks	—	~0ms	—
Stage 2	OCR Main · PSM 6+11 · digit whitelist	320	1–4s	YES
3.1/3.2	Rotation probe · probe text exit	140	0.5–2s	YES
Stage 3	OCR Strong · PSM 6+11 · 420 DPI	420	2–6s	YES
Stage 4	Rotation fallback at remaining angles	420	3–12s	YES
Stage 5/5.5	Context rescue · airway label crops · 3× upscale	420	2–15s	YES
Stage 6	EDM persistence fallback (runtime ON/OFF)	—	0.1–1.5s (when called)	EDM
Stage 7	NEEDS_REVIEW terminal	—	—	—

Stage 0

Filename Extraction — ~0ms

Instant match with no OCR — handles most EDM-downloaded files

Calls extract_awb_from_filename_strict() using two compiled regexes: a bare 12-digit run \b\d{12}\b and a spaced/hyphenated 4-4-4 format. If found and confirmed against the AWB DB, the file is completed immediately — all OCR is skipped entirely.

DB check required · Cost: ~0ms · Pattern: \b\d{12}\b · Pattern: \d{4}[\s-]\d{4}[\s-]\d{4}

Stage 1

Text Layer Extraction — 5 Sub-Stages

PyMuPDF embedded text · No image render · 60-keyword proximity search

1a — set_rotation fallback: If text layer is empty, tries rotating the PDF metadata by 90°/270°/180° to unlock text stored at an angle in the PDF's internal stream. Resets to 0° after.

1b — Spatial word sort: For scrambled multi-column PDFs, get_text("words") re-sorted by (row, x) to reconstruct reading order. Used only when it yields more candidates than the raw stream.

1c — 400-pattern: Tight 400\d{12} regex. The only extraction that bypasses the DB check — the FedEx 400 carrier prefix strongly implies a valid AWB.

1d — Clean gate + tiered extraction: Extracts candidates adjacent to AWB keywords and runs full priority match cascade (Exact-High → Exact-Standard → Tolerance).

1e — Keyword proximity (60 terms): Scans 5 lines ahead / 2 lines behind every keyword in the 60-item list including "TRACKING NO", "MAWB", "CARGO CONTROL NUMBER", "ACI NO".

DB check: YES · No image render · Keywords: 60 terms · Lookahead: 5 lines

Pre-OCR

Angle Detection — Zero Cost

3 free checks before any image is rendered

Check 1 — PDF metadata rotation: page.rotation reports if the PDF viewer displays at 90°/180°/270°.

Check 2 — Aspect ratio: width/height > 1.3 → landscape page → likely rotated 90°.

Check 3 — Character spread: Bounding box distribution of text characters detects vertical vs horizontal flow.

Results stored as _rotation_hint. Narrows the rotation probe angle set later. If the probe is uncertain, the pre-check hint overrides it.

Stage 2

OCR Main — 320 DPI · PSM 6 + 11

Digit whitelist · soft pass · invert variants · table line removal

Pass A — PSM 6, digits-only: Uniform block mode with whitelist 0123456789. Clean gate → tiered extract → priority match. Early exit if quality candidates found (skips PSM 11).

Pass B — PSM 11, digits-only: Sparse text mode — suited for non-uniform layouts. OpenCV table line removal applied before OCR when available.

Pass C — Soft pass (general text): Tesseract unrestricted (PSM 11 then PSM 6) + spatial bounding-box analysis (image_to_data) finding digit clusters adjacent to AWB keyword tokens.

Invert passes: PSM 6+11 with image inversion (white text on dark — common in FedEx label formats). Invert-sourced STANDARD candidates go to quarantine until confirmed by another stage.

DPI: 320 · PSM: 6, 11 · cv2 table removal · Threshold: 175

Stage 3.1

Rotation Probe — 140 DPI

Keyword-scored · 4 angles · flip guard margin 80pts

Renders at 140 DPI, tests 0°/90°/180°/270°. Score formula: score = digit_count + keyword_hits×120 + coherent_words×2. High keyword weight means one "AWB" token outweighs hundreds of isolated digits.

Fast path: If 0° wins digit-score by ≥24 margin, skips expensive general-text OCR and returns 0° immediately.

Certainty thresholds: Score margin ≥300 = CERTAIN · ≥120 = LIKELY · below = UNCERTAIN. Uncertain probes defer to the pre-check hint.

Stage 3.2 — Probe Text Exit: Reuses OCR text from the probe at zero additional cost. A confident match here skips all of Stage 3 (420 DPI OCR) — saving 3–8 seconds.

DPI: 140 · Keywords: 30+ · Flip guard: 80pts · Stage 3.2 exit: free
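The probe score and certainty thresholds can be illustrated as below. This sketch assumes the "score margin" is the winning angle's score minus the runner-up's score; the function names are hypothetical, and the fast path and flip guard are left out.

```python
KEYWORD_WEIGHT = 120
COHERENT_WORD_WEIGHT = 2


def probe_score(digit_count, keyword_hits, coherent_words):
    """Score formula quoted in this section."""
    return (digit_count
            + keyword_hits * KEYWORD_WEIGHT
            + coherent_words * COHERENT_WORD_WEIGHT)


def classify_probe(scores):
    """Pick the best angle and grade its margin over the runner-up.

    `scores` maps an angle in degrees to its probe score (at least two angles).
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best_angle, best), (_, second) = ranked[0], ranked[1]
    margin = best - second
    if margin >= 300:
        certainty = "CERTAIN"
    elif margin >= 120:
        certainty = "LIKELY"
    else:
        certainty = "UNCERTAIN"  # caller would fall back to the pre-check hint
    return best_angle, certainty
```

With a keyword weight of 120, a single keyword hit is enough to move a probe from UNCERTAIN to LIKELY when the angles are otherwise tied.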

Stage 3

OCR Strong — 420 DPI

Identical sub-pass structure to Stage 2 · fully cached renders · timeout-guarded in long-pass

Identical sub-pass structure to Stage 2 (PSM 6 → PSM 11 → Soft → Invert) applied at 420 DPI at base_angle. Higher DPI yields better clarity for thin-text documents, fine-print labels, and compressed scans. All images cached — if Stage 2 already rendered 420 DPI at 0°, Stage 3 retrieves from cache without re-render. In fast-lane mode, Stage 3 failure triggers a strict ProbeLite check (limited low-DPI rotation probe + 400/exact-high only); only then can it return DEFERRED.

Stage 4

Rotation Fallback

Full OCR at remaining angles · timeout-guarded at each boundary

Runs as rotation last-resort logic using probe certainty and route ordering (not only when base_angle ≠ 0°). It applies the full Stage 2+3 sub-pass sequence at deferred angles, then executes a full-priority sweep on accumulated fallback candidates. Timeout checks fire at angle boundaries — if LONG_PASS_TIMEOUT_SECONDS=45.0 is exceeded, _TimeoutDeferred captures all state (candidates, OCR cache, probe scores, timings) and defers to third-pass.

Stage 5 / 5.5

Context Rescue + Airway Label Rescue

60s budget · ACI / 400-labeled / carrier row / 3× upscale crops

Stage 5 — Context Rescue: ACI-prefixed patterns, "400 NO. XXXX" labeled format, FedEx carrier row positional matching, and 2-3× upscaled region crops for micro-print labels. 60-second time budget.

Stage 5.5 — Airway Label: Crops three overlapping right-side regions (RightMid, UpperRight, RightWide) at each angle. Each crop upscaled 3×. Two-step efficiency: cheap digit-only pass first; if <10 digit chars found, skips general-text OCR for that crop.

Budget: 60s · Uses cached images · Upscale: 3×

Stage 6

EDM Persistence Fallback

Runtime-gated API check · single persistent HIGH candidate only

Entry criteria: Exactly one 12-digit candidate remains, tagged HIGH confidence, seen in at least 2 independent stages, and not disqualified as date/noise.

Runtime gate: EDM ON/OFF is read live from data/edm_toggle.json (with env/config fallback). OFF immediately bypasses API calls.

Outcomes: True confirms the candidate as MATCHED (EDM-Exists-Persistent). False or None falls through to Stage 7. Multiple persistent candidates trigger a tie to NEEDS_REVIEW.

UI toggle: EDM ON/OFF · Cache: edm_awb_exists_cache.json · Bypass-safe on missing token/auth

Stage 7

NEEDS_REVIEW — Terminal State

Full diagnostic log · safe_move() · audit updated

Terminal state when no AWB match is found after all stages. Full diagnostic log written: every candidate tried, confidence tier per candidate, which stages saw each one, and all quarantined candidates. File moved via safe_move() to NEEDS_REVIEW_DIR. Audit tracker updated via record_hotfolder_needs_review() with reason string and complete candidate list.

Section 05

Document Classification

How the pipeline characterises each document's type and difficulty tier before committing to expensive processing paths.

Tier 1 — Named Files

<5ms

Filename already contains a 12-digit AWB. No OCR required. Stage 0 match. Most EDM-downloaded files arrive pre-named and land here.

Tier 2 — Text PDFs

50–500ms

PDF has embedded vector text. AWB extracted via Stage 1 without any Tesseract invocation. Handles multi-column scrambling and metadata rotation recovery.

Tier 3 — Scanned / Image

2–20s

Image-only PDF — no embedded text (_is_image_only=True). Full OCR pipeline, rotation probe, and potentially all rescue stages.

Image-Only Detection

Set when len(txt_layer.strip()) == 0 after the set_rotation fallback attempt. Affects scheduling: rotation probe immediately narrowed to (0°, hint°) if a strong pre-check hint exists — reducing probe cost by ~50%.

Text Layer Quality Signal

A rich text layer (≥25 chars, MIN_EMBEDDED_TEXT_LENGTH) triggers the early rotation probe after Stage 2 failure. A thin layer triggers it immediately after Stage 1, routing weakly-embedded documents into OCR faster.

Section 06

Matching & Validation

How raw OCR candidates are validated against the AWB database — confidence tiers, Hamming distance tolerance, prefix/suffix buckets, and the tie-guard.

Candidate Confidence Tiers

HIGH

Extracted from strong, unambiguous context: adjacent to a known AWB keyword, matching a structured label pattern (ACI, 400-labeled, airway-bill label row), or confirmed by spatial bounding-box analysis. Eligible for Hamming tolerance up to 2 digits.

STANDARD

Extracted from looser pattern matching — a 12-digit run without strong keyword adjacency. Eligible for tolerance only if sole STANDARD candidate AND seen in ≥2 stages. Tolerance limited to 1-digit Hamming.

QUARANTINE

Candidates from noisy passes (invert, Rotation-180/270, AngFallback) that produced >3 candidates in one pass. Excluded until a second independent stage confirms the same candidate — preventing OCR artifacts from forcing matches.
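The quarantine rule can be sketched as a small pool that releases a candidate only once a different stage reports it. `CandidatePool` is a hypothetical name, and the `noisy` flag stands in for the pipeline's own test (noisy pass type plus more than 3 candidates in one pass), which is not reproduced here.

```python
class CandidatePool:
    """Hypothetical pool: noisy candidates wait for a second, independent stage."""

    def __init__(self):
        self.confirmed = set()
        self.quarantine = {}    # candidate -> stage that first produced it

    def add(self, candidate, stage, noisy=False):
        if candidate in self.confirmed:
            return
        first_stage = self.quarantine.get(candidate)
        if first_stage is not None and first_stage != stage:
            # A different stage saw the same digits: release from quarantine.
            del self.quarantine[candidate]
            self.confirmed.add(candidate)
        elif noisy:
            self.quarantine.setdefault(candidate, stage)
        else:
            # Clean-pass candidates are accepted directly.
            self.quarantine.pop(candidate, None)
            self.confirmed.add(candidate)
```

Only `confirmed` would be handed to the matching step, so a one-off artifact from an invert pass never reaches the database lookup.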

Priority Match Cascade

Exact-High: HIGH candidates ∩ AWB DB. Unique → immediate match. Multiple → ambiguous tie → NEEDS_REVIEW.

Exact-Standard: STANDARD candidates ∩ AWB DB. Same tie logic applies.

Tolerance2-High: Hamming ≤2 against HIGH candidates. Accepted only if evidence candidate seen in ≥1 stage (dist≤1) or ≥2 stages (dist=2). Prefix/suffix bucket lookup limits the search space to ~4-digit matches.

Tolerance2-Standard: Hamming ≤1 against STANDARD, only when pool has exactly 1 candidate AND it was seen in ≥2 stages. Most conservative tolerance path.
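The exact and tolerance tiers can be sketched as follows. This is a simplified illustration: it scans the whole AWB set linearly instead of using prefix/suffix buckets, returns the string "TIE" as a stand-in for the ambiguous-tie outcome, and assumes that several in-range database entries also count as a tie. The function names are hypothetical.

```python
def hamming(a, b):
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def exact_match(candidates, awb_db):
    """Exact tier: unique hit -> AWB, several hits -> 'TIE', none -> None."""
    hits = set(candidates) & awb_db
    if len(hits) == 1:
        return next(iter(hits))
    return "TIE" if hits else None


def tolerance_high(candidate, stage_hits, awb_db):
    """Tolerance tier for one HIGH candidate seen in `stage_hits` stages."""
    in_range = set()
    for awb in awb_db:
        dist = hamming(candidate, awb)
        needed = 1 if dist <= 1 else 2   # dist=2 needs evidence from 2 stages
        if dist <= 2 and stage_hits >= needed:
            in_range.add(awb)
    if len(in_range) == 1:
        return next(iter(in_range))
    return "TIE" if in_range else None
```

A caller would try `exact_match` on the HIGH set, then on the STANDARD set, and only then fall back to the tolerance tiers.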

Candidate Disqualification Rules

✕ More than 3 leading zeros (e.g. 000012345678): numeric code or padding artifact

✕ Matches a date pattern (YYYYMMDD, DDMMYYYY variants): disqualified as date reference

✕ All same digit (e.g. 111111111111): OCR artifact or table cell filler

✕ Matches known HS code or phone number patterns: commercial document noise
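Two of these rules are precise enough to sketch directly. The date, HS code, and phone number rules are left out because their exact patterns are not given in this document; `disqualify_reason` and its reason strings are hypothetical.

```python
import re

LEADING_ZEROS = re.compile(r"^0{4,}")      # "more than 3" leading zeros
SAME_DIGIT = re.compile(r"^(\d)\1{11}$")   # one digit repeated 12 times


def disqualify_reason(candidate):
    """Return a reason string, or None if the candidate passes both rules."""
    if LEADING_ZEROS.match(candidate):
        return "leading-zeros"
    if SAME_DIGIT.match(candidate):
        return "same-digit"
    return None
```

Running such filters before the database lookup keeps padding artifacts and table fillers out of the tolerance tiers, where a near-miss could otherwise match a real AWB.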

Section 07

Routing & Decision Logic

How the orchestrator decides which path a file takes at every decision boundary — three-tier scheduler, timeout management, and full state serialisation for seamless resume.

Return States

MATCHED: AWB confirmed. File renamed and moved to PROCESSED.

NEEDS_REVIEW: No match after all stages. File moved to NEEDS_REVIEW dir.

DEFERRED: Fast-lane exit after Stage 3 fail. Queued for long-pass.

TIMEOUT: Budget exceeded. State fully captured for third-pass resume.

45-Second Timeout Budget

The long-pass budget is LONG_PASS_TIMEOUT_SECONDS=45.0. The timeout check fires only at natural angle-pass boundaries — never mid-subpass. On timeout, _TimeoutDeferred exception is raised internally, which triggers complete state capture into a _state_out dict passed back to the scheduler for third-pass resumption.

State Capture on Timeout

Everything needed for seamless third-pass resume:

# Illustrative shape only: values are placeholders and container types are assumptions
captured_state = {
    "probe_scores": {0: 891, 90: 42},   # rotation probe scores per angle (other angles omitted)
    "probe_texts": {},                  # (digit_text, general_text) per angle, reused at no cost
    "base_angle": 0,                    # determined rotation in degrees (0/90/180/270)
    "_angle_certainty": "CERTAIN",      # "CERTAIN" / "LIKELY" / "UNCERTAIN"
    "running_high": [],                 # accumulated HIGH candidate set, as a list
    "running_standard": [],             # accumulated STANDARD candidate set, as a list
    "candidate_stage_hits": {},         # {candidate: [stages_that_saw_it]}
    "quarantine": [],                   # noisy candidates pending second-stage confirmation
    "ocr_cache": [],                    # [[key_list, text], ...] as serialisable pairs
    "timings": {},                      # accumulated ms per stage, carried forward on resume
}

Section 08

Output Actions

What happens to a file once a match is confirmed or denied — file lifecycle from INBOX to final resting place across all output directories.

On Match → PROCESSED

1. Rename & Move: PROCESSED/<AWB>.pdf. If exists: MD5 compare → identical → source deleted. Different content → _2, _3 suffix appended.

2. AWB Logs Excel: Row buffered to CSV sidecar, flushed to Excel after 10 rows accumulate. Contains: AWB, source filename, timestamp, match method, status.

3. Stage Cache CSV: Per-file performance row: input filename, processed filename, AWB, detection type, extraction seconds.

4. Audit Tracker: record_hotfolder_end() writes to HotfolderV2 sheet in the 4-sheet Excel audit workbook.
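Step 1 (rename, MD5 compare, suffix on conflict) can be sketched as below. The function names are hypothetical, and comparing the source against already-suffixed files as well as the base name is an assumption of this sketch.

```python
import hashlib
import os
import shutil


def md5_of(path, chunk_size=64 * 1024):
    """MD5 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def move_to_processed(source, processed_dir, awb):
    """Move `source` to <processed_dir>/<AWB>.pdf, resolving name collisions."""
    target = os.path.join(processed_dir, f"{awb}.pdf")
    suffix = 1
    while os.path.exists(target):
        if md5_of(target) == md5_of(source):
            os.remove(source)   # identical copy is already filed
            return target
        suffix += 1             # different content: try _2, _3, ...
        target = os.path.join(processed_dir, f"{awb}_{suffix}.pdf")
    shutil.move(source, target)
    return target
```

The source is deleted only after a digest match against a file that already exists in the target folder, so no content is lost when a duplicate is dropped.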

On No-Match → NEEDS_REVIEW

1. Safe Move: safe_move() transfers to NEEDS_REVIEW_DIR. Timestamp suffix appended if name collision exists.

2. Diagnostic Log: Every candidate tried: confidence tier, stages that saw it, quarantine status — full post-mortem for manual review.

3. Audit Tracker: record_hotfolder_needs_review() with reason string and sorted candidate list.

Output Directory Map

PROCESSED/

Matched, renamed AWB files. Source for EDM-aware downstream handling.

NEEDS_REVIEW/

Unmatched files. Manual operator review required.

CLEAN/

EDM-checked non-duplicate files. Feeds batch builder.

REJECTED/

EDM-confirmed duplicate documents.

PENDING_PRINT/

Batch PDFs and TIFF files ready for print queue.

data/OUT/

AWB sequence Excel log and awb_list.csv for downstream.

What Happens Next — Downstream Journey

PROCESSED

EDM 3-Layer Check

CLEAN

Batch Builder

Barcode Cover

PENDING_PRINT

TIFF Convert

Print Queue

See Section 02 (Visual Pipeline Flow) for the Phase 2 diagram.

Section 09

EDM Duplicate Check

In the current V3, EDM is runtime-gated from the UI and used for the Stage 6 AWB persistence fallback. This section also documents the downstream duplicate-screening lane used in EDM observability flows.

V3 update: EDM calls are now controlled at runtime (EDM: ON/OFF button in UI). Stage 6 in the hotfolder pipeline uses EDM for persistent-candidate confirmation, and bypasses safely when EDM is OFF, token is missing, or response is inconclusive.

Layer 1 — MD5 Hash

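Byte-for-byte identity check and the fastest comparison available: if two files produce the same MD5 digest, they are treated as identical content. Accidental collisions are astronomically unlikely, so false positives are not a practical concern, although MD5 is not collision-resistant against deliberately crafted files. Runs before any image processing.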

Cost: ~1ms

Layer 2 — Perceptual Hash

Compares visual fingerprints of page renders. Catches near-identical documents that differ only in metadata, timestamps, or minor encoding differences. Threshold: pHash distance ≤ 10.

Threshold: ≤10

Layer 3 — OCR Text Similarity

RapidFuzz text similarity on OCR/text-layer content with layered safeguards: text-vs-text only when both sides have enough embedded text, OCR fallback when one side lacks text layer. Similarity ≥60% is tracked; ≥85% becomes a strong text signal.

Track: ≥60% · Strong: ≥85%

Document Flow Through EDM

1 · Runtime gate evaluated (EDM: ON/OFF in UI)

2 · Stage 6 checks a single persistent HIGH candidate only

3 · EDM metadata API queried (token + auth validated)

4a · Confirmed in EDM → MATCHED (EDM-Exists-Persistent)

4b · Not confirmed / bypassed → continue to Stage 7 path

5 · Downstream duplicate lane applies Gate-1 hash-all-pages, Gate-2 bounded probes, then Tier-2 full checks with CCD exemptions and conservative reject rules

EDM API Integration

Connects to the FedEx EDM production endpoint using a Bearer token stored in data/token.txt. Two endpoints used:

Metadata endpoint: Retrieves document group info for the AWB (operatingCompany: "FXE")

Download endpoint: Returns a ZIP of existing PDFs for the AWB, extracted for comparison

Fallback contract handling: Metadata path supports legacy payload first, then body/query fallback variants; inconclusive/auth/network paths route to CLEAN-UNCHECKED without stopping flow.

Token expiry is checked on first call each session. When expired, EDM checks are skipped gracefully and logged as warnings — the pipeline continues without interruption.

Live toggle behavior: the desktop UI button switches EDM between ON and OFF immediately. Running hotfolder checks read the updated value without restart, so API calls are allowed or bypassed in real time.

Runtime Toggle Resolution Order

Priority	Source	Purpose
1	`data/edm_toggle.json`	Persisted UI state from `EDM: ON/OFF` button
2	`PIPELINE_EDM_ENABLED`	Per-process override for launched scripts
3	`ENABLE_EDM_FALLBACK`	Default from `.env` / config
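The resolution order in the table above can be sketched as follows. `is_edm_enabled` appears elsewhere in this section, but its signature here is assumed, as are the JSON key name `enabled`, the accepted truthy strings, and the default of OFF when nothing is configured.

```python
import json
import os

TRUE_VALUES = {"1", "true", "yes", "on"}


def is_edm_enabled(toggle_path="data/edm_toggle.json", env=os.environ):
    """Resolve the EDM switch: toggle file, then per-process env, then default env."""
    try:
        with open(toggle_path, encoding="utf-8") as handle:
            data = json.load(handle)
        if isinstance(data, dict) and "enabled" in data:  # key name is assumed
            return bool(data["enabled"])
    except (OSError, ValueError):
        pass                    # missing or unreadable file: fall through
    for name in ("PIPELINE_EDM_ENABLED", "ENABLE_EDM_FALLBACK"):
        value = env.get(name)
        if value is not None:
            return value.strip().lower() in TRUE_VALUES
    return False                # assumed default when nothing is configured
```

Reading the file on every call, rather than caching it, is what would let a running hotfolder process pick up a UI toggle without a restart.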

Stage 6 Decision Path

Gate: Run only for one persistent HIGH candidate seen across 2+ stages

ON: Call EDM metadata endpoint with token from data/token.txt (or env)

True: Complete match with method EDM-Exists-Persistent

False: Candidate not confirmed; pipeline continues to Stage 7 logic

None: Bypassed/unchecked path (OFF, token missing, auth/network inconclusive)

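# Simplified runtime behavior used in V3
# (wrapper function name is illustrative; helper functions are not shown here)
def stage6_edm_fallback(candidate_awb):
    if not is_edm_enabled():
        return None  # bypass API
    token = get_token()
    if not token:
        return None  # skip safely
    exists = edm_awb_exists_fallback(candidate_awb)
    if exists:
        complete_match(candidate_awb, "EDM-Exists-Persistent")
    return exists  # False or None falls through to Stage 7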

Why originals reach CLEAN faster: the comparison lane avoids brute-force permutations with Gate-1 exact hash over all pages, Gate-2 bounded probes, and Tier-2 expansion only after evidence hit.

Permutation/Combination Pruning

1 · Gate 1: incoming all-pages vs EDM all-pages hash index gives exact-hit/no-hit cheaply.

2 · Gate 2: smart sampled probes (first pages + 1/3 + mid + 2/3 + last) using pHash/text/OCR only when Gate 1 misses.

3 · Tier 2: full checks run only after evidence and apply CCD exemption + bounded OCR parallel prewarm.
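The Gate 2 sampling pattern (first pages + 1/3 + mid + 2/3 + last) can be sketched as an index picker. The function name is hypothetical, the use of integer division for the fractional positions is an assumption, and the default of 3 base pages mirrors the `EDM_TIER1_INCOMING_PAGES` default listed in this section.

```python
def probe_page_indices(page_count, base_pages=3):
    """Zero-based page indices: first pages + 1/3 + mid + 2/3 + last."""
    if page_count <= 0:
        return []
    picks = set(range(min(base_pages, page_count)))   # leading pages
    picks.add(page_count // 3)                        # one-third mark
    picks.add(page_count // 2)                        # midpoint
    picks.add((2 * page_count) // 3)                  # two-thirds mark
    picks.add(page_count - 1)                         # last page
    return sorted(i for i in picks if i < page_count)
```

For a 30-page document this probes 7 pages instead of 30, which is what keeps the comparison bounded when Gate 1 misses.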

Page Limits + Top Picks

Control	Default	Effect
`EARLY_FOCUS_MATCH_THRESHOLD`	3	Promotes likely candidates first (top-pick path)
`PAGE_OCR_LIMIT`	8	Caps OCR pages per document to bound heavy comparisons
`EDM_OCR_COMPARE_LIMIT`	10	Tier-2 OCR compare page/doc bound for heavy checks
`PHASH_THRESHOLD`	10	Fast visual near-match gate before OCR text similarity
`EDM_TIER1_INCOMING_PAGES`	3	Base sampled incoming pages before extra strategic probe indices
`EDM_TIER1_EDM_PAGE_LIMIT`	5	Tier-1 probe depth per EDM document
`EDM_TIER2_EDM_PAGE_LIMIT`	10	Tier-2 full-compare depth per EDM document

P-Cross Handling for Multi-Page Input vs Multi-Page EDM Docs

When both sides have many pages, the lane uses bounded cross-page comparison (P-Cross) instead of unbounded all-page expansion. Duplicate evidence is accumulated as page hits and ratio, then resolved by configured rejection thresholds.

# Simplified bounded P-Cross decision idea
def p_cross_decision(incoming, existing_docs):  # illustrative name
    top_in = first_pages(incoming, PAGE_OCR_LIMIT)
    top_edm = top_docs(existing_docs, EDM_OCR_COMPARE_LIMIT)
    dup_pages, total_pages = compare_cross_pages(top_in, top_edm)  # bounded matrix
    dup_ratio = dup_pages / max(total_pages, 1)
    if dup_pages > EDM_REJECT_IF_DUP_PAGES_OVER and dup_ratio >= EDM_REJECT_IF_DUP_RATIO:
        return "REJECTED"
    return "CLEAN"  # true originals pass through faster

Operational result: efficient ranking + bounded P-Cross keeps processing fast for genuine originals while still protecting against high-page-count duplicate overlap.

Gate-and-Layer Cascade — Gate 1 hash-all-pages, Gate 2 bounded probes, Tier-2 full checks. Strong duplicate evidence → REJECTED. Otherwise CLEAN / CLEAN-UNCHECKED

INCOMING

From PROCESSED

MD5 HASH

Byte-for-byte identity check on raw file content

hashlib.md5() · ~1ms

MATCH → REJECTED

PASS

PERCEPTUAL HASH

Visual fingerprint comparison of page renders. Catches near-identical docs with different encoding.

ImageHash · distance ≤ 10 · limit: 10 pages

MATCH → REJECTED

PASS

OCR TEXT SIMILARITY

Text-layer compare (only when both sides have enough chars) plus OCR fallback for mixed text/image pages. Catches rescanned content.

rapidfuzz · ≥60% tracked · ≥85% strong · 8 pages max · auto: 5pp ≥70%

MATCH → REJECTED

ALL PASS

CLEAN

→ Batch queue

False positive philosophy: A false positive (wrongly rejecting a clean doc) is treated as worse than a missed duplicate. Automatic reject/split decisions require strong methods (HASH/PHASH). TEXT/OCR-only similarity is retained as an observability signal, and such documents can still pass through when strong evidence is absent.

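Layer 1: MD5 Hash (Byte Identity)

Computed using hashlib.md5() with 64KB read chunks. If two files produce the same MD5 digest, they are treated as byte-for-byte identical: an accidental collision between two different files is so unlikely that false positives are effectively zero, although MD5 does not rule out deliberately crafted collisions. This is the cheapest check (about 1ms for typical files) and runs first, before any image processing.

Speed: ~1ms · Effectively zero false positives · Chunk size: 64KB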
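A chunked MD5 read of the kind described above might look like this. It is a sketch under stated assumptions: `file_md5` and `CHUNK_SIZE` are illustrative names, not confirmed identifiers from the pipeline.

```python
import hashlib
from pathlib import Path

CHUNK_SIZE = 64 * 1024  # 64KB read chunks, as described above


def file_md5(path: Path) -> str:
    """Hex MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops at the first empty read (end of file)
        for chunk in iter(lambda: handle.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks keeps memory use flat regardless of file size, which matters for large scanned PDFs.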

Layer 2: Perceptual Hash (Visual Fingerprint)

Uses the ImageHash library to compute a perceptual hash of each page render. Hamming distance between two pHash values ≤ 10 → duplicate. Catches near-identical documents that differ only in metadata, compression level, or minor encoding artefacts. OCR compare limit caps checked pages at 10.

Threshold: distance ≤ 10 · Page limit: 10 · Library: ImageHash
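The distance test can be illustrated without any imaging dependency. In the sketch below, plain integers stand in for perceptual hash values, and both function names are hypothetical. With the ImageHash library itself, subtracting two hash objects yields the same Hamming distance.

```python
PHASH_THRESHOLD = 10  # default from the config table


def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Number of differing bits between two integer hash values."""
    return bin(hash_a ^ hash_b).count("1")


def is_visual_duplicate(hash_a: int, hash_b: int,
                        threshold: int = PHASH_THRESHOLD) -> bool:
    """True when the two hashes differ in at most `threshold` bits."""
    return hamming_distance(hash_a, hash_b) <= threshold
```

Lowering the threshold makes the gate stricter: fewer page pairs qualify as visual duplicates.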

Layer 3: OCR/Text Similarity (Content Match)

When hashes differ but content may still be duplicated (for example re-scanned paper), text comparison runs with safeguards: text-layer compare only when both pages have enough embedded text; otherwise OCR fallback compares mixed text/image pages. Score ≥60% is recorded as TEXT; score ≥85% is recorded as TEXT_STRONG. Conservative routing still requires strong evidence for auto-reject decisions.

Similarity threshold: 60% · Strong threshold: 85% · Page OCR limit: 8 · Library: rapidfuzz · Strong methods: HASH/PHASH/TEXT_STRONG
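The two text thresholds map a score to an evidence label roughly as below. This is a sketch, not the pipeline's scorer: the standard-library `difflib` is swapped in for rapidfuzz so that the example runs without extra packages, and both function names are illustrative.

```python
from difflib import SequenceMatcher
from typing import Optional

TEXT_SIMILARITY_THRESHOLD = 60
TEXT_STRONG_THRESHOLD = 85


def similarity_percent(text_a: str, text_b: str) -> float:
    """Stand-in scorer (difflib, not rapidfuzz) on a 0-100 scale."""
    return 100.0 * SequenceMatcher(None, text_a, text_b).ratio()


def classify_text_score(score: float) -> Optional[str]:
    """Map a similarity score to the recorded evidence label."""
    if score >= TEXT_STRONG_THRESHOLD:
        return "TEXT_STRONG"
    if score >= TEXT_SIMILARITY_THRESHOLD:
        return "TEXT"
    return None  # below the tracking threshold: nothing recorded
```

The label is evidence only. Whether it leads to a rejection is decided separately by the routing rules described above.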

Config Key	Default	What It Controls
`PHASH_THRESHOLD`	10	Maximum pHash Hamming distance to classify as visual duplicate. Lower = stricter.
`TEXT_SIMILARITY_THRESHOLD`	60	Minimum RapidFuzz similarity % to record a TEXT duplicate signal.
`TEXT_STRONG_THRESHOLD`	85	Minimum RapidFuzz similarity % to promote TEXT to TEXT_STRONG evidence.
`PAGE_OCR_LIMIT`	8	Maximum pages to OCR-compare per document during similarity check.
`EDM_OCR_COMPARE_LIMIT`	10	Tier-2 OCR compare limit during heavy matching.
`EDM_TIER1_INCOMING_PAGES`	3	Seed size for Gate-2 incoming probe sampling.
`EDM_TIER1_EDM_PAGE_LIMIT`	5	Tier-1 probe max pages per EDM doc.
`EDM_TIER2_EDM_PAGE_LIMIT`	10	Tier-2 full compare max pages per EDM doc.
`EDM_TEXT_LAYER_MIN_CHARS`	30	Minimum text chars before using text-layer compare path.
`EDM_OCR_WORKERS`	2	Parallel OCR workers for Tier-2 prewarm.
`EDM_OCR_PARALLEL_MIN_TASKS`	4	Minimum pending OCR tasks before enabling parallel prewarm.
`EDM_REJECT_IF_DUP_PAGES_OVER`	5	Duplicate-page count that must be exceeded (more than 5 pages by default) to trigger auto-rejection.
`EDM_REJECT_IF_DUP_RATIO`	0.70	Minimum proportion of pages that must match to trigger auto-rejection.
`FILE_SETTLE_SECONDS`	3	Wait time after file detected before comparison begins (ensures file is fully written).
`EARLY_FOCUS_MATCH_THRESHOLD`	3	Minimum early-focus text matches to prioritise a candidate for full comparison.
`MIN_EMBEDDED_TEXT_LENGTH`	25	Minimum chars in text layer before skipping OCR extraction for comparison.

# EDM duplicate decision logic (pseudo-code)
def is_duplicate(incoming, existing):
    # Layer 1: byte-identical
    if md5(incoming) == md5(existing):
        return True
    # Layer 2: visually identical
    if phash_distance(incoming, existing) <= PHASH_THRESHOLD:
        return True
    # Layer 3: text similarity is recorded as evidence (TEXT / TEXT_STRONG);
    # per the false-positive philosophy it does not reject on its own
    score = rapidfuzz_similarity(ocr(incoming), ocr(existing))
    record_text_signal(score)  # hypothetical helper
    # Auto-reject if majority of pages match
    dup_pages, dup_ratio = count_duplicate_pages(incoming, existing)  # hypothetical helper
    if dup_pages > EDM_REJECT_IF_DUP_PAGES_OVER and dup_ratio >= EDM_REJECT_IF_DUP_RATIO:
        return True
    return False  # when in doubt: pass through

CLEAN — Passes EDM Check

File moved to CLEAN/ folder

When EDM is bypassed/inconclusive (toggle OFF, no token, metadata/download issue), route is CLEAN-UNCHECKED with explicit compare method reason

When all incoming pages are CCD, duplicate checks are intentionally bypassed (all_ccd_bypass) and routed clean

When only TEXT/OCR similarity exists without strong HASH/PHASH evidence, route can remain clean via conservative bypass

EDM audit row written: result, timing, compare method

File becomes eligible for batch builder ingestion

JSONL audit event appended: stage="EDM_CHECK"

REJECTED — Duplicate Detected

File moved to REJECTED/ folder (or partially split into CLEAN+REJECTED when only some pages are strong duplicates)

EDM audit row written: which layer triggered, dup pages, dup ratio

Original in EDM system is not overwritten or modified

JSONL audit event appended with full comparison details

Audit Field	Description	Example Value
`EDMResult`	Outcome of the check	CLEAN, REJECTED, or CLEAN-UNCHECKED
`DupPageCount`	Number of pages flagged as duplicates	3
`TotalPages`	Total pages in incoming document	4
`DupRatio`	Proportion of duplicate pages	0.75
`EDMSecs`	Time taken for the full EDM check	2.4
`CompareMethod`	Which path/layer determined the decision	MD5 / pHash / OCR-text / CCD-BYPASS / BYPASS-NO-TOKEN

Section 10

Batch Builder & Sequence Priority

Transforms the CLEAN folder into numbered, print-ready batch PDFs with auto-generated barcode cover pages — replacing the entire manual stack-and-scan workflow.

Max Pages / Batch

configurable via MAX_PAGES_PER_BATCH

Cover Page Format

per AWB · barcode + metadata

Sort Order

mtime

oldest file in each AWB group first

Write Safety

Atomic

tmp → os.replace — no partial files

Batch Construction Pipeline — Step by Step

01 SCAN CLEAN

Find all \d{12}(?:_\d+)?\.pdf files. Group by AWB. Open each once for page count.

mtime cached

02 SORT FIFO

Sort groups globally by oldest file's mtime. Files within each AWB group also sorted by mtime.

oldest first

03 ASSIGN TIER

Read stage_cache.csv to map each AWB's detection method to High / Medium / Low tier.

High · Med · Low

04 ASSIGN SEQ

Global sequential integer per AWB. Resets per tier when tier batching is on. Logged to Excel.

FIFO → SEQ

05 PACK BATCHES

Greedy, in sequence order. Cost = 1 cover + N invoice pages. When next AWB would exceed MAX_PAGES_PER_BATCH (48pp by default) → open new batch.

≤48 pages/batch

06 BUILD PDF

Insert Code128 cover page + all AWB PDFs via PyMuPDF. Write atomically: .tmp → os.replace.

atomic write · reportlab

07 PENDING_PRINT

Copy batch PDF with MD5 dedup check. Delete CLEAN sources only after all copies succeed.

MD5 dedup copy

Batch Packing Algorithm

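Each AWB costs 1 cover page + N invoice pages. The packer walks AWBs in sequence order and assigns each to the current batch. When the next AWB would push the batch past MAX_PAGES_PER_BATCH (48 by default), a new batch opens. The packing is greedy: earlier batches are never revisited (next-fit in bin-packing terms), so there is no backtracking or re-ordering.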

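# Batch packing (greedy, in sequence order)
batch_no, pages = 1, 0
for awb in resolved:
    cost = 1 + awb.inv_pages  # cover + docs
    # Open a new batch only if the current one already holds something,
    # so an oversized AWB never leaves an empty batch behind
    if pages and pages + cost > MAX_PAGES_PER_BATCH:
        batch_no += 1
        pages = 0
    awb.batch_no = batch_no
    pages += cost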

CLEAN Folder Scan

Scans CLEAN for \d{12}(?:_\d+)?\.pdf. Files are grouped by AWB and sorted within each group by mtime ascending. Groups are ordered globally by the mtime of each group's oldest file (FIFO). Each PDF is opened exactly once during the scan; page counts are cached for the build phase.

Pattern: \d{12}(?:_\d+)?\.pdf · Sort: mtime asc · Each PDF opened once

Page count efficiency: Each PDF in CLEAN is opened once during the scan phase to read its page count. The counts are cached in memory and reused during the build phase — no file is opened twice. Keeps the batch builder fast even for large CLEAN folders.
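The grouping and FIFO ordering can be sketched on plain (filename, mtime) pairs. This is an illustration under assumptions: `group_fifo` is a hypothetical name, and the real scanner also reads page counts, which is omitted here.

```python
import re
from collections import defaultdict

AWB_FILE = re.compile(r"(\d{12})(?:_\d+)?\.pdf")


def group_fifo(files: list[tuple[str, float]]) -> list[tuple[str, list[str]]]:
    """Group (filename, mtime) pairs by AWB, oldest group first.

    Files inside a group are sorted by mtime ascending; groups are
    ordered by the mtime of their oldest file.
    """
    groups: dict[str, list[tuple[float, str]]] = defaultdict(list)
    for name, mtime in files:
        match = AWB_FILE.fullmatch(name)
        if match:  # ignore anything that is not an AWB-named PDF
            groups[match.group(1)].append((mtime, name))
    ordered = sorted(groups.items(), key=lambda item: min(item[1]))
    return [(awb, [name for _, name in sorted(members)])
            for awb, members in ordered]
```

Using `fullmatch` keeps stray files such as notes or temporary copies out of the batch without a separate filter step.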

Sequence numbers are assigned globally across all AWBs in FIFO order (oldest file first). Each AWB gets a unique sequential integer (seq) that appears on its barcode cover page. This number is the primary key in the Excel sequence log and allows tracing any printed batch document back to its original PDF.

Sequencing Rules

FIFO by file mtime: CLEAN folder is scanned and groups sorted by the oldest file's mtime within each AWB group. Earliest-arrived files get the lowest sequence numbers.

Multiple files per AWB: When an AWB has multiple documents (e.g. 123456789012.pdf and 123456789012_2.pdf), all are grouped under the same sequence number. All files are inserted after the single barcode cover page.

Append-only sequence log: Every batch run appends rows to data/OUT/awb_sequence.xlsx — history is never overwritten. Columns: Seq, AWB, PDF Files, Timestamp, DocCount, InvoicePages, TotalPages, Batch, Tier.

Tier batching resets seq: When ENABLE_TIER_BATCHING=True, sequence numbers restart at 1 within each tier (High, Medium, Low). Each tier is built as a separate batch series — their cover pages are independently numbered.
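The numbering rules above can be sketched as follows. The function name and the (awb, tier) input shape are assumptions made for illustration. The input is expected to be in FIFO order already.

```python
def assign_sequences(awbs: list[tuple[str, str]],
                     tier_batching: bool) -> dict[str, int]:
    """Assign seq numbers to (awb, tier) pairs given in FIFO order.

    With tier batching off, numbering is global. With it on, each
    tier keeps its own counter starting at 1.
    """
    counters: dict[str, int] = {}
    seqs: dict[str, int] = {}
    for awb, tier in awbs:
        key = tier if tier_batching else "ALL"
        counters[key] = counters.get(key, 0) + 1
        seqs[awb] = counters[key]
    return seqs
```

With tier batching enabled, a seq value is only unique within its tier, so the Tier column is needed alongside Seq to identify a cover page.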

Excel Sequence Log Schema

Column	Type	Description
`Seq`	integer	Global sequence number — unique per AWB per build run
`AWB`	string	12-digit Air Waybill Number
`PDF Files`	string	Pipe-separated list of source PDFs: `123456789012.pdf | 123456789012_2.pdf`
`Timestamp`	datetime	ISO timestamp of when this entry was built
`DocCount`	integer	Number of PDF documents included for this AWB
`InvoicePages`	integer	Total invoice/document page count (excludes cover)
`TotalPages`	integer	Cover page + invoice pages = what occupies batch space
`Batch`	integer	Batch PDF number this AWB was assigned to
`Tier`	string	Detection tier (High / Medium / Low) when tier batching enabled

What's on the Cover Page

SEQ: Sequence number in large Helvetica-Bold 18pt

AWB: 12-digit Air Waybill Number in 22pt bold

BATCH: Batch number, zero-padded to 3 digits (e.g. 003)

PAGE: Cover's position within the batch (e.g. Page 7 of 48)

DOCS: Document count and total invoice page count

TIER: Detection tier label (when tier batching enabled)

BARCODE: Code128 barcode of the AWB number, height 60pt, width 1.2pt

FOOTER: Generation timestamp at bottom of page

Cover Page Generation

Built using reportlab's canvas API. Generated in-memory as raw PDF bytes (BytesIO), then merged into the batch PDF via PyMuPDF's insert_pdf(). Page size is configurable: LETTER (default) or A4.

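# Cover page structure (letter size); seq and awb come from the sequencing step
from io import BytesIO
from reportlab.graphics.barcode.code128 import Code128
from reportlab.lib.pagesizes import LETTER
from reportlab.pdfgen import canvas
buf = BytesIO()  # cover is built in memory
c = canvas.Canvas(buf, pagesize=LETTER)
h = LETTER[1]  # page height in points
c.setFont("Helvetica-Bold", 18)
c.drawString(60, h-80, f"SEQ: {seq}")
c.setFont("Helvetica-Bold", 22)
c.drawString(60, h-120, f"AWB: {awb}")
barcode = Code128(awb, barHeight=60, barWidth=1.2)
barcode.drawOn(c, 60, h-290)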

Library: reportlab · Barcode: Code128 · Page: LETTER or A4

When ENABLE_TIER_BATCHING=True, the builder creates separate batch series for each detection confidence tier. High-confidence AWBs (matched from filename or text layer) are batched separately from OCR-matched or lower-confidence documents. Each tier's batches are independently numbered and file-named.

HIGH Tier

AWBs matched via filename or text layer extraction. Prefixes: FILENAME, TEXTLAYER-EXACT, TEXT-LAYER.

Batch file format: PRINT_STACK_BATCH_TH_001.pdf

MEDIUM Tier

AWBs matched via exact OCR. Prefixes: OCR-EXACT. These required image rendering but produced a clean exact match.

Batch file format: PRINT_STACK_BATCH_TM_001.pdf

LOW Tier

All other AWBs — tolerance-matched, rescue-stage matches, or any detection method not in the high/medium prefixes.

Batch file format: PRINT_STACK_BATCH_TL_001.pdf

How Tier is Determined

The batch builder reads data/stage_cache.csv which the hotfolder writes after each successful match. The AWB_Detection_Type column stores the method label (e.g. Filename, OCR-Strong-Exact-High). Tier is assigned by prefix matching:

Detection Method Prefix	Assigned Tier
`FILENAME`, `TEXTLAYER-EXACT`, `TEXT-LAYER`	High
`OCR-EXACT`	Medium
All others (tolerance, rescue, rotation fallback)	Low
AWB not found in stage_cache.csv	Low (default)
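The prefix table translates into a small lookup. In this sketch `tier_for` is a hypothetical name, and the comparison is made case-insensitive as an assumption, since the example labels above appear in mixed case.

```python
TIER_PREFIXES = (
    ("High", ("FILENAME", "TEXTLAYER-EXACT", "TEXT-LAYER")),
    ("Medium", ("OCR-EXACT",)),
)


def tier_for(method):
    """Map a detection-method label to High / Medium / Low by prefix."""
    if not method:
        return "Low"  # AWB missing from stage_cache.csv
    label = method.strip().upper()  # assumption: case-insensitive compare
    for tier, prefixes in TIER_PREFIXES:
        if label.startswith(prefixes):  # str.startswith accepts a tuple
            return tier
    return "Low"  # tolerance, rescue, rotation fallback, anything else
```

Defaulting to Low keeps unknown or missing labels in the most cautious batch series.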

Atomic Write Safety

Every batch PDF is written atomically: the PDF is saved to a .tmp file first, then renamed to the final name via os.replace(), which is atomic on both Windows NTFS and macOS APFS as long as the .tmp file sits on the same volume as the final path. If the process crashes mid-write, no partial batch PDF can be mistaken for a complete one.

# Atomic write (no partial files on crash)
doc.save(str(tmp_path))
doc.close()
os.replace(str(tmp_path), str(out_path)) # atomic

PENDING_PRINT Copy with MD5 Dedup

After batch PDFs are built in data/OUT/, they are copied to PENDING_PRINT/ for TIFF conversion and printing. Before each copy, MD5 hashes are compared: if an identical batch already exists in PENDING_PRINT, the copy is skipped. Different-content files with the same name get a _v2, _v3 suffix.

MD5 dedup on copy · Suffix: _v2, _v3...
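The copy rule (skip identical content, add a version suffix on a name clash) could be sketched like this. `copy_with_dedup` and its return strings are illustrative assumptions. Only the MD5 comparison and the _v2, _v3 suffix scheme come from the description above.

```python
import hashlib
import shutil
from pathlib import Path


def _md5(path: Path) -> str:
    return hashlib.md5(path.read_bytes()).hexdigest()


def copy_with_dedup(src: Path, dest_dir: Path) -> str:
    """Copy src into dest_dir; return "copied" or "skipped_dup".

    Identical content under a candidate name is skipped; a name clash
    with different content moves on to a _v2, _v3, ... suffix.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    src_hash = _md5(src)
    target = dest_dir / src.name
    version = 1
    while target.exists():
        if _md5(target) == src_hash:
            return "skipped_dup"
        version += 1
        target = dest_dir / f"{src.stem}_v{version}{src.suffix}"
    shutil.copy2(src, target)
    return "copied"
```

Returning a status string lets the caller keep the copied and skipped counts that the deletion safety check relies on.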

CLEAN Source Deletion Safety

Source files in CLEAN are only deleted after all batch PDFs have been successfully copied to PENDING_PRINT. If any copy fails, deletion is skipped entirely with a safety message logged. This prevents data loss if the copy step is interrupted.

# Safety guard before source deletion
if (failed == 0 and
    copied + skipped_dup == expected):
  delete_clean_sources(resolved)
else:
  log("[SAFETY] Skipping deletion — not all files copied")

data/OUT/

Batch PDFs named PRINT_STACK_BATCH_NNN.pdf

PENDING_PRINT/

Copy of batches + TIFF conversions for scanner

awb_sequence.xlsx

Append-only sequence log — history preserved

Section 11

Tracking / Audit Layer

Every file, every match, every decision — permanently recorded across four complementary audit mechanisms with concurrent-write safety.

4-Sheet Audit Workbook

Central file: pipeline_audit.xlsx.

HotfolderV2: One row per AWB detection — timestamp, employee ID, AWB, filenames, detection method, tier, timing (ms), result, notes.

EDM: One row per EDM check — result, dup page count, dup ratio, compare method.

BatchTIFF: One row per batch build or TIFF conversion — batch number, AWB count, page count, output path.

Dashboard: Programmatically computed summary, rewritten on every write. Session totals and running stats.

JSONL Audit Log

Every event is appended to logs/pipeline_audit.jsonl. The log auto-rotates at 50MB (one backup). Logging never interrupts pipeline flow: all writes are wrapped in try/except.
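A non-raising append with single-backup rotation might be written as below. This is a sketch: the function name, the `.1` backup suffix, and the boolean return value are assumptions, not details confirmed by the source.

```python
import json
import os
from pathlib import Path

MAX_BYTES = 50 * 1024 * 1024  # rotate at 50MB


def append_event(log_path: Path, event: dict,
                 max_bytes: int = MAX_BYTES) -> bool:
    """Append one JSON line; never raise into the caller.

    Returns True on success, False if the write failed.
    """
    try:
        log_path.parent.mkdir(parents=True, exist_ok=True)
        if log_path.exists() and log_path.stat().st_size >= max_bytes:
            # Keep a single backup, overwriting any previous one
            os.replace(log_path, log_path.with_name(log_path.name + ".1"))
        with open(log_path, "a", encoding="utf-8") as handle:
            handle.write(json.dumps(event, ensure_ascii=False) + "\n")
        return True
    except Exception:
        return False  # audit failures must not interrupt the pipeline
```

Each call writes one line. The sample event in this section is the kind of dict that would be passed in.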

{"ts": "2025-03-15 08:14:22",
"stage": "AWB_HOTFOLDER",
"file": "INVOICE_2025.pdf",
"status": "MATCHED",
"awb": "400856290147",
"match_method": "OCR-Strong-Exact-High",
"hotfolder_secs": 8.4}

Concurrent Write Safety

Lock-file pattern via os.O_CREAT | os.O_EXCL, which is atomic on both Windows NTFS and macOS APFS. Timeout: 15 seconds. Stale locks (held >30s) are auto-cleared. This enables safe concurrent access from hotfolder + batch_builder + TIFF converter simultaneously.
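The lock-file pattern can be sketched as follows. The function names, the polling interval, and the use of the lock file's mtime to judge staleness are assumptions for illustration. The 15-second timeout and 30-second stale limit are the values quoted above.

```python
import os
import time


def acquire_lock(lock_path: str, timeout: float = 15.0,
                 stale_after: float = 30.0, poll: float = 0.1) -> bool:
    """Try to create lock_path exclusively; True if the lock is held."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file already exists
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return True
        except FileExistsError:
            try:
                if time.time() - os.path.getmtime(lock_path) > stale_after:
                    os.remove(lock_path)  # clear a stale lock
                    continue  # retry immediately
            except OSError:
                pass  # lock changed under us; fall through to the wait
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)


def release_lock(lock_path: str) -> None:
    """Remove the lock file; a missing file is not an error."""
    try:
        os.remove(lock_path)
    except FileNotFoundError:
        pass
```

A caller would typically wrap the workbook write in `try`/`finally` so that `release_lock` runs even when the write fails.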

Known Limitation: When multiple INBOX files share the same AWB (duplicates at different paths), the tracker creates a separate row per file — not per AWB. This is correct behaviour (unique processing records) but AWB-level aggregation requires grouping by the AWB column across rows. A future processing_id UUID field would resolve this cleanly.

Section 12

Error Handling & Recovery

How the pipeline handles every failure mode — from unstable files and corrupt PDFs to OCR noise, rotation ambiguity, and the three-tier timeout recovery cycle.

File Stability Failure

If file_is_stable() fails, pipeline returns NEEDS_REVIEW immediately. The 30-second safety rescan re-enqueues the file once writing completes, allowing processing to succeed on the next cycle without manual intervention.
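A stability check of this kind is usually a size-and-mtime comparison across a short wait. The sketch below assumes that approach. The signature, the default of 3 seconds, and the non-empty requirement are illustrative and may differ from the pipeline's actual `file_is_stable()`.

```python
import os
import time


def file_is_stable(path: str, settle_seconds: float = 3.0,
                   checks: int = 2) -> bool:
    """True if size and mtime stay unchanged across `checks` intervals."""
    try:
        last = os.stat(path)
        for _ in range(checks):
            time.sleep(settle_seconds / checks)
            current = os.stat(path)
            if (current.st_size, current.st_mtime_ns) != (last.st_size, last.st_mtime_ns):
                return False  # still being written
            last = current
    except OSError:
        return False  # file vanished or is not readable yet
    return last.st_size > 0  # an empty file is treated as not ready
```

A False result is cheap to recover from here, because the periodic rescan picks the file up again once the writer has finished.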

Corrupt or Zero-Page PDFs

PyMuPDF open() and a page_count check catch corrupt files before any extraction stage. Such files are moved to NEEDS_REVIEW with a clear reason string ("0-page PDF" or "Corrupt/unreadable PDF: {exception}"). The guard runs before any closures are defined, ensuring clean state even on immediate failure.

45-Second Long-Pass Timeout

High-resolution scans with complex rotation can exceed the budget. Rather than blocking the entire pipeline, the timeout captures full state (candidates, OCR cache, probe scores, timings) and defers to the third-pass queue. Third pass processes up to 3 deferred files per cycle with no timeout, only when both fast and long queues are empty.

OCR Noise Quarantine

Passes known to be noisy (invert, Rotation-180°/270°, AngFallback) that produce >3 STANDARD candidates in one pass have those candidates quarantined. Promoted to active pool only when a second, independent stage sees the same number — preventing one-off OCR artifacts from ever reaching the matching step.

Three-Layer Rotation Defence

(1) Free metadata checks before any render, (2) low-DPI keyword-scored probe, (3) full rotation fallback at all remaining angles. If probe is uncertain, the pre-check hint overrides it. Saved timeout state includes probe scores — third pass skips re-probing entirely.

Table Line Removal

For documents where ruling lines cross digit characters and confuse Tesseract, OpenCV morphological operations detect and remove horizontal and vertical lines before OCR. Applied when CV2_AVAILABLE=True. Graceful fallback when OpenCV is unavailable — only the table cleaning optimisation is skipped.

Section 13

Technical Architecture

Module structure, data flow, key dependencies, and the .env → config.py configuration system. Single-operator Windows deployment with Mac development environment.

Module Dependency Map

Entry Points

app.py · hotfolder.py · batch_builder.py · tiff_converter.py

↓

Configuration

config.py reads .env → all paths, DPIs, thresholds, feature flags

Orchestrator

pipeline.py — Stages 0–7, timeout, state capture (~1,700 lines)

↔

OCR Engine

ocr_engine.py — render, preprocess, Tesseract wrappers, cv2 table removal

Extractor

awb_extractor.py — 15+ regex patterns, keyword proximity, candidate tiers (~640 lines)

↔

Matcher

awb_matcher.py — Hamming distance, priority cascade, tie guard

File Ops

file_ops.py — stability, safe_move, MD5 dedup, Excel AWB logs, stage cache CSV

↔

Audit Layer

tracker.py (4-sheet Excel, lock-file) · logger.py (JSONL, 50MB rotation)

V3 addition: services/edm_checker.py now owns runtime EDM toggle state, cache-backed AWB existence checks, and V1/V2-compatible EDM metadata request payloads used by Stage 6 fallback.

Configuration System

# .env — machine-specific values, never committed to version control
PIPELINE_BASE_DIR=C:\Users\5834089\Downloads\CCD_Filler
TESSERACT_PATH=C:\Users\5834089\Downloads\CCD_Filler\tesseract.exe

# OCR tuning
OCR_DPI_MAIN=320 # Main OCR render resolution
OCR_DPI_STRONG=420 # Strong OCR render resolution
ROTATION_PROBE_DPI=140 # Low-cost rotation probe DPI
LONG_PASS_TIMEOUT_SECONDS=45.0

# Matching thresholds
TOLERANCE_HIGH_MAX_DISTANCE=2 # Max Hamming distance for HIGH candidates
TOLERANCE_STANDARD_MAX_DISTANCE=1 # Max Hamming distance for STANDARD
MIN_STAGE_HITS_HIGH_TOL2=2 # 2 stages required for 2-digit tolerance

# Output
MAX_PAGES_PER_BATCH=48
TIFF_DPI=200
TIFF_PARALLEL_WORKERS=4
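On the Python side, a config module typically turns these strings into typed values with fallbacks. The sketch below is an assumption about how that could look, not the contents of config.py: `load_settings` is a hypothetical name, and the sample values above are reused as defaults.

```python
import os


def load_settings(env=None) -> dict:
    """Read a few pipeline settings with typed defaults.

    `env` defaults to os.environ; pass a dict to test without touching
    the real environment.
    """
    env = os.environ if env is None else env
    return {
        "OCR_DPI_MAIN": int(env.get("OCR_DPI_MAIN", "320")),
        "OCR_DPI_STRONG": int(env.get("OCR_DPI_STRONG", "420")),
        "LONG_PASS_TIMEOUT_SECONDS": float(env.get("LONG_PASS_TIMEOUT_SECONDS", "45.0")),
        "MAX_PAGES_PER_BATCH": int(env.get("MAX_PAGES_PER_BATCH", "48")),
        "TIFF_DPI": int(env.get("TIFF_DPI", "200")),
    }
```

Converting at load time means a malformed value fails once at startup rather than in the middle of a batch run.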

Key Dependencies

PyMuPDF ≥1.24

PDF text extraction, page rendering, metadata access

pytesseract

Tesseract 5.x OCR — PSM modes 6, 7, 11; digit whitelist mode

watchdog ≥4.0

Cross-platform filesystem event monitoring for INBOX

openpyxl ≥3.1

Excel read/write for AWB DB, audit workbook, logs

opencv-python-headless

Table line removal via morphological operations (optional)

Pillow ≥10 · numpy

Image preprocessing, rotation, thresholding, Lanczos upscaling

Section 14

Live Status Simulation

Watch the pipeline process three different document scenarios in real time — from an instant filename match to a complex rotated scan with timeout and third-pass resume.


Section 15

Risks & Edge Cases

Known failure modes, ambiguous scenarios, and the mitigations currently in place — or flagged as future work.

HIGH: Same AWB in Multiple INBOX Files

Multiple PDFs can arrive containing the same AWB — re-downloads from EDM or duplicates sent by shippers. Each is processed independently. At the PROCESSED level, MD5 comparison catches byte-for-byte identical files. Different content with same AWB gets an auto-incrementing suffix (_2, _3). The audit tracker creates separate rows per file (not per AWB), which is correct but requires join logic for per-AWB reporting.

MED: OCR Digit Misread (1–2 digit errors)

Tesseract commonly misreads 0↔8, 1↔7, 5↔6 in compressed scans. Hamming tolerance (≤2 for HIGH, ≤1 for STANDARD) handles this — but only when the candidate has been seen in multiple stages. The stage-hits requirement prevents a single noisy pass from generating a tolerance match. Risk is higher for AWBs differing from each other by only 1–2 digits, guarded by the unique-match requirement (no ties allowed).
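The tolerance and unique-match rules can be illustrated with a small matcher. This is a sketch under assumptions: the function names are hypothetical, the stage-hits requirement is left out, and a tie is read conservatively as "more than one AWB within tolerance".

```python
def hamming(a: str, b: str) -> int:
    """Count differing positions between two equal-length digit strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def tolerance_match(candidate: str, awb_set: set[str], max_distance: int):
    """Return the single AWB within max_distance, or None on miss or tie."""
    if candidate in awb_set:
        return candidate  # exact hit wins outright
    close = [awb for awb in awb_set if hamming(candidate, awb) <= max_distance]
    return close[0] if len(close) == 1 else None
```

Returning None on a tie is what protects AWBs that differ from each other by only one or two digits: an ambiguous candidate is never resolved by guessing.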

MED: Rotation Probe Choosing the Wrong Angle

The probe can be misled by documents with large blocks of digits at non-AWB angles (barcodes, HS code tables). Stage 4 rotation fallback mitigates this by running OCR at remaining angles. The 80-point flip margin threshold guards against switching from 0° without strong evidence. The quarantine logic prevents any rotation fallback candidates from being accepted without multi-stage confirmation.

MED: HS Codes and Phone Numbers as False Candidates

Commercial invoices contain many 12-digit sequences: HS commodity codes, phone numbers with country codes, and reference numbers. The disqualification filter catches leading-zero sequences and date patterns. For remaining false candidates, the DB check is the final gate — unless the false candidate coincidentally matches a currently loaded AWB record (rare but theoretically possible). Stage-hits requirement reduces this risk significantly.

MED: EDM Token Expiry During Session

The EDM Bearer token in data/token.txt can expire after a few hours. In V3 this is fail-safe: Stage 6 logs a warning and returns an unchecked path (None), so the pipeline continues without hard failure. The UI toggle can also set EDM OFF, which intentionally bypasses API calls until token/auth is restored.

LOW: AWB Excel DB Out of Sync (stale mtime on NFS/SMB)

DB refreshes every 30s by mtime comparison. On some NFS and SMB mounts, mtime doesn't reflect changes correctly, so the pipeline may keep operating on a stale AWB set past the normal 30-second refresh until the change becomes visible. The UI can force an immediate reload via the trigger file. Files that fail to match during a stale window route to NEEDS_REVIEW and can be retried after DB refresh.

LOW: Audit Lock File Contention at Scale

The 4-sheet audit workbook uses a file-lock pattern. If two pipeline processes attempt to write simultaneously, one waits up to 15 seconds. Locks held >30s are forcibly cleared. In a single-operator, single-process deployment this is rarely a problem, but at scale (multiple simultaneous pipeline instances) this becomes a serialisation bottleneck. Solution: switch to append-only SQLite audit store.

Section 16

Future Roadmap

Planned improvements and architectural upgrades — from near-term wins to long-term platform evolution.

NEAR-TERM: EDM Fallback Hardening

Stage 6 is live in V3. Next step is hardening: add endpoint-contract smoke tests, token expiry telemetry, and richer reason codes in audit logs so bypass causes (OFF/no token/auth/network) are visible at a glance.

NEAR-TERM: Per-Record Lineage in Audit Tracker

Introduce a unique processing_id UUID per INBOX file, threading through all audit events from intake to output. Enables complete per-file lineage queries without joining on filenames. Directly resolves the multi-document same-AWB aggregation challenge.

MID-TERM: Persistent OCR Result Caching

Persist OCR cache between pipeline runs. Files that return to NEEDS_REVIEW and are re-submitted could skip re-OCR if their content hasn't changed (MD5 match). Saves 5–15 seconds per re-submission on scanned documents — significant for high-volume operations.

MID-TERM: Operator Review Web Console

A web-based UI (FastAPI + React) for reviewing NEEDS_REVIEW files. Shows the PDF alongside candidates tried, rotation scores, and OCR snippets. Operator confirms, corrects, or rejects the AWB with one click — feeding corrections back to improve future matches.

LONG-TERM: ML-Assisted Confidence Tuning

Train a lightweight classifier on accumulated stage-cache and audit data to predict: (a) which DPI/PSM combination will yield the best result for a given document type, (b) optimal rotation probe angle narrowing from file metadata features. Could reduce average processing time by 30–40% for the hardest document tier.

LONG-TERM: Multi-Worker Parallel Processing

Current architecture is single-worker. A ProcessPoolExecutor could run multiple files in parallel. Would require switching the audit layer from Excel workbook to append-only SQLite — eliminating lock contention at scale and enabling true concurrent processing.

Section 17

Operator Control Centre

The current Tkinter app is the live command surface: login, hotfolder control, EDM runtime toggle, automatic orchestration, manual batch/TIFF actions, folder counts, logs, and safe cleanup.

Primary Runtime

V3.ui.app_window.App

Auto Interval

10s

AUTO_INTERVAL_SEC default

Batch Gate

min clean batches before AUTO build

Manual Escape

Live

operators can intervene at every folder

Primary Controls

Start AWB: Starts or stops the watchdog hotfolder process and the EDM duplicate checker when enabled.

EDM ON/OFF: Writes the runtime toggle in data/edm_toggle.json; Stage 6 and downstream duplicate checks read it live.

AUTO MODE: Runs the unattended loop: wait for INBOX drain, wait for PROCESSED drain, batch when ready, then convert TIFF.

Full Cycle: Runs one supervised cycle without leaving the app in continuous automation mode.

Recovery + Manual Actions

Prepare Batch: Runs V3.services.batch_builder against CLEAN and stages outputs into PENDING_PRINT.

Convert TIFF: Runs the parallel PDF-to-TIFF converter beside the batch PDFs.

Retry Failed: Moves NEEDS_REVIEW PDFs back into INBOX for another pass after DB/token/input fixes.

Clear All: Stops running work, preserves protected files, and clears operational working outputs safely.

Auto Phase	Current Behavior	Safety Guard
1 INBOX drain	Waits until INBOX stays empty for `INBOX_EMPTY_STABLE_SECONDS=8`.	Max wait `1800s`; stop event checked every half second.
2 PROCESSED drain	Lets the EDM duplicate checker move matched files into CLEAN, REJECTED, or CLEAN-UNCHECKED.	If EDM checker is not running, direct PROCESSED-to-CLEAN fallback is logged.
3 Batch readiness	Estimates batch count before building; default AUTO build waits for at least 2 batches.	Force-batches old CLEAN files after `AUTO_FORCE_BATCH_AGE_SECONDS=1800`.
4 TIFF conversion	Runs after successful batch copy to PENDING_PRINT.	Skip-if-exists prevents repeat conversion of healthy TIFFs.
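The batch readiness rule in phase 3 reduces to a small decision function. The sketch below is illustrative: `should_build` and `AUTO_MIN_BATCHES` are hypothetical names, while the threshold of 2 batches and `AUTO_FORCE_BATCH_AGE_SECONDS=1800` are taken from the table.

```python
AUTO_MIN_BATCHES = 2                   # hypothetical name for the 2-batch default
AUTO_FORCE_BATCH_AGE_SECONDS = 1800    # from the table above


def should_build(estimated_batches: int, oldest_clean_age: float) -> bool:
    """Decide whether AUTO mode should start a batch build now.

    estimated_batches: batches the current CLEAN contents would produce.
    oldest_clean_age:  age in seconds of the oldest file in CLEAN.
    """
    if estimated_batches <= 0:
        return False  # nothing in CLEAN to batch
    if estimated_batches >= AUTO_MIN_BATCHES:
        return True
    # Below the gate: build anyway once the oldest file has waited too long
    return oldest_clean_age >= AUTO_FORCE_BATCH_AGE_SECONDS
```

The age override keeps a quiet period from stranding a few documents in CLEAN indefinitely.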

Current operator model: the app is intentionally local-first. It uses folder counts, line-capped colored logs, progress animation, session state, protected-file cleanup rules, and one-click launchers instead of requiring a browser server or external queue service.

Section 18

Current Process Snapshot

A concise version map of the repo as it exists now: what is live, what is configurable, and which files own each part of the process.

Live In V3

Two-pass hotfolder, fast-lane ProbeLite, long-pass timeout, third-pass resume, Stage-6 EDM fallback, downstream EDM duplicate checker, batch builder, TIFF converter, and Tk control centre.

Config Driven

Paths, Tesseract, OCR DPI, rotation thresholds, EDM endpoints, duplicate thresholds, batch limits, TIFF settings, and AUTO mode timing all flow through V3/config.py and .env.

Documented Now

README, operations runbook, and this visual guide now line up around the same end-to-end flow: INBOX -> PROCESSED -> CLEAN/REJECTED -> OUT -> PENDING_PRINT.

Process Area	Owner File	Current Responsibility
Launcher + App Entry	`V3/launcher.py`, `V3/app.py`	Starts the branded UI and cross-platform control surface.
Configuration	`V3/config.py`	Centralizes paths, feature flags, timeouts, thresholds, EDM endpoints, and TIFF/batch settings.
Hotfolder Scheduler	`V3/services/hotfolder.py`	Watchdog queue, AWB DB refresh, fast lane, long pass, timeout resume, heartbeat, attempt caps.
AWB Extraction	`V3/stages/pipeline.py`	Stages 0-7, OCR, rotation, rescue passes, EDM persistence fallback, diagnostics.
Duplicate Screening	`V3/services/edm_duplicate_checker.py`	EDM metadata/download, Gate 1 hash, Gate 2 probes, Tier 2 HASH/PHASH/TEXT/OCR compare, routing.
Batch Assembly	`V3/services/batch_builder.py`	Scans CLEAN, sequences AWBs, writes barcode covers, atomic print-stack PDFs, copies to PENDING_PRINT.
TIFF Output	`V3/services/tiff_converter.py`	Parallel, streaming multi-page TIFF generation beside print-stack PDFs.
Audit + Logs	`V3/audit/`, `logs/`	Structured JSONL events, dashboard rebuilds, workbook trackers, pipeline and EDM logs.

Unreleased Documentation Delta

The current changelog records documentation alignment work as Unreleased: README and operations docs were updated for ProbeLite, third-pass resume, EDM Stage 6, downstream duplicate routing, current config keys, and controlled shutdown behavior.

Reality Check

This page avoids hard-coding operational volume claims where the code reads the live Excel DB at runtime. The reliable source of truth is the current data/AWB_dB.xlsx load count printed by the hotfolder log.