Plain-English explanation of the Document Processing Pipeline, its business value, and end-to-end function — written for founders, operators, and leadership.
From 7-step manual paper workflow to zero-touch automation
Replaces printing, manual AWB typing, barcode generation, physical arrangement, and paper scanning. Adds a duplication check that never existed before.
Automation Rate
+90%
docs matched without human input
Fast Match Speed
<100ms
for filename or text-layer hits
DB Source
Excel
AWB set reloads every 30s
False Positives
~0
all AWBs verified against DB
What is the Document Processing Pipeline?
The Document Processing Pipeline is an automated document processing system built for FedEx shipment operations. It watches a local "hot-folder" (INBOX), and whenever a new PDF arrives — whether an Air Waybill, commercial invoice, or packing list — it immediately identifies the 12-digit Air Waybill Number embedded in that document.
Once the AWB is confirmed against the live Excel-backed AWB set, the system automatically renames the file, moves it to the correct folder, and records every action in multiple audit trails. Runtime is typically sub-second for easy filename/text-layer matches and can extend to tens of seconds for hard rotated/OCR rescue documents — all without a human touching a keyboard.
The Manual Process — Then & Now
Before automation, every incoming shipment document required a multi-step physical workflow involving printing, manual data entry, barcode generation, physical arrangement, and paper scanning. The pipeline eliminates every single one of those steps.
OLD MANUAL WORKFLOW — 7 Steps Per Document Batch
01
Document Received
PDF arrived from multiple sources
02
Print Document
Employee prints the PDF on paper
03
Stack on Desk
Placed on physical scanning desk stack
04
Manual AWB Entry
Person reads AWB from page, types into barcode software
Manual AWB typing from document to barcode software introduces typos and misreads
No Duplicate Check
No automated way to detect if the same shipment document was submitted twice
Document Processing Pipeline V3 replaces all 7 steps
AUTOMATED PIPELINE — Fully Hands-Free
PDF Arrives in INBOX
Watchdog detects within 2 seconds. No printing, no physical handling.
AWB Auto-Extracted
Stages 0–7 find and verify the 12-digit AWB against the currently loaded Excel AWB set.
Duplicate Check
EDM API check + MD5 hash dedup catches any document submitted more than once.
Batch Ready to Print
Barcode cover pages generated automatically. Batch PDF + TIFF assembled for scanner.
Metric | Manual Process | Document Processing Pipeline V3
Steps per batch | 7 manual steps | 1 (drop file in INBOX)
AWB identification | Human reads & types manually | Automated OCR + DB verify
Barcode generation | Manual software entry per AWB | Auto-generated with batch cover page
Duplicate detection | None (manual check only) | EDM gate/layer screening with HASH, PHASH, TEXT, and OCR signals
Error source | Typos in manual AWB entry | Zero false positives, DB-verified only
Audit trail | None | JSONL + Excel + 4-sheet workbook
Processing speed | Minutes per document | <100ms to 20s per document
End-to-End in Six Steps
1 · Document Arrives
PDF lands in INBOX. Watchdog detects it within 2 seconds. Stability guard ensures it is fully written before processing begins.
2 · AWB Extracted
Up to 8 pipeline stages search for the 12-digit AWB: filename pattern, embedded text, OCR at 320 and 420 DPI, rotation handling, and rescue crop passes.
3 · DB Confirmation
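Every candidate AWB is cross-checked against the live Excel database of 70,000+ records. No match means the candidate is not accepted, which keeps false positives near zero. The one documented exception is the Stage 1c 400-pattern, which bypasses the DB check.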
4 · File Organised
File renamed to its AWB number and moved to PROCESSED. Byte-for-byte duplicates detected via MD5 hash and removed safely without data loss.
5 · Logged & Audited
Every event recorded in: Excel AWB log, JSONL audit log, stage-performance CSV, and a 4-sheet audit workbook with a live Dashboard tab.
6 · Print-Ready
After EDM duplicate checking, clean documents are collated into batch PDFs (max 48 pages, barcode cover) then converted to multi-page TIFFs.
What about difficult documents? Sideways scans, image-only PDFs, and ambiguous documents are handled via multi-stage fallback. If all stages fail, the file moves to NEEDS_REVIEW for manual handling. Difficult files are also bounded by a LONG_PASS_TIMEOUT_SECONDS=45.0 budget during the long pass: a file that exceeds it is deferred to a third pass that runs once easier documents clear the queue, with its full OCR state preserved for seamless resumption.
Business Impact for FedEx Operations
This automation is valuable because it solves real operational delays, not just document processing. It reduces shipment risk, protects service timelines, cuts manual workload, and gives teams real visibility into flow and backlog.
Operational Problems Solved
01 Late scanning risk: a perishable-shipment invoice scanned 3 hours late can push movement by 1-2 days.
02 After-hours pileups: documents received post-shift sit unprocessed and miss next-day clearance windows.
03 Duplicate exposure: same shipment documents can re-enter flow without reliable duplicate screening.
04 Paper-heavy handling: printing, arranging, rescanning, and manual indexing consume time and resources.
05 No real-time queue visibility: teams cannot easily see what is waiting, what is blocked, or what completed.
06 Exception handling overload: hard scans consume senior operator time that should be reserved for true edge cases.
07 Shift-change discontinuity: work quality and speed vary by shift handoff and manual prioritization.
08 Audit reconstruction cost: proving what happened to a file requires manual log chasing across tools.
How Automation Changes Outcomes
Service Speed: Night and backlog documents can be processed continuously, reducing clearance and dispatch delays.
Accuracy: Duplicate checks reduce repeat handling, rework, and downstream delivery confusion.
Paperless Shift: Printing/scanning dependency drops sharply, enabling near paperless operation for this lane.
Visibility: Document counts, queue state, and status become visible in real time, enabling auditable operations.
Labor Efficiency: Countless manual scan-touch hours are replaced by no-touch routing for routine files.
Continuity: Standardized automation behavior reduces dependence on shift-specific tribal workflows.
Predictability: Deferred/timeout recovery keeps difficult files from freezing intake lanes.
Leadership Control: Live operational telemetry supports faster escalation and staffing decisions.
Operational Dimension | Without Automation | With Automation | Business Effect
Perishable Shipment Timeliness | Invoice scan delay can hold shipment 1-2 extra days | Continuous intake + automated extraction reduces waiting time before clearance | Higher on-time movement probability for sensitive cargo
After-Hours Intake | Documents accumulate overnight and spill into next-day operations | Queue is monitored and processed with less dependency on immediate manual action | Lower morning backlog and fewer next-day misses
Duplicate Handling | Manual memory/checklists; high inconsistency risk | Automated duplicate screening with conservative decision policy | Lower rework, fewer unnecessary touches, stronger control
Labor & Paper | Heavy print/scan/manual sort effort | Near paperless flow and no-touch processing for routine files | Significant hours saved and reduced paper cost/waste
Operational Visibility | Little live visibility into document volume and state | Live counts, status, queue depth, and event logs available | True auditability and better staffing/decision planning
Exception Throughput | Hard files can block operators and delay normal workload | Timeout-defer + resume path isolates hard files from routine flow | Better SLA protection for standard-volume processing
Shift-to-Shift Consistency | Output quality depends heavily on who is on shift | Unified automated rules enforce same decision logic across all shifts | More consistent operational outcomes and fewer handoff surprises
Audit Readiness | Manual evidence collection is slow and fragmented | Structured logs and tracker records available per processing event | Faster internal/external review with higher confidence
Strategic outcome: this shifts teams from endless scan/print/manual handling to exception-based operations. Routine documents move with minimal touch, critical shipments face less delay risk, and leadership gains measurable control through real-time visibility and audit evidence.
Section 02
Visual Pipeline Flow
The flowchart shows every major stage, from document arrival through three-tier scheduling to final output.
Legend: Extraction · OCR Stage · Rotation · Decision · Success · Review
Phase 2 — Post-Match Processing Pipeline
Every file that exits the AWB extraction engine as MATCHED enters Phase 2.
PROCESSED
AWB-renamed files land here
EDM CHECK
3-layer duplicate detection
MD5 · pHash · OCR
CLEAN
Unique → batch eligible
REJECTED
Duplicate → discarded
BATCH BUILDER
Sort → sequence → pack ≤48pp
HIGH · MED · LOW
COVER PAGE
Code128 barcode + seq + metadata
PENDING_PRINT
Batch PDFs + MD5 dedup copy
TIFF CONVERT
200 DPI · LZW · 4 parallel workers
PRINT READY
Batch PDF + TIFF in scanner queue
Three-Tier Processing Scheduler
Fast Lane
Stages 0–3 only, plus a strict post-Stage-3 ProbeLite check (400 / exact-high only) before defer. Returns DEFERRED only if probe-lite also fails. Drains the inbox at maximum speed. Up to 5 files per cycle.
Long Pass
Full pipeline on deferred files when fast queue empties. LONG_PASS_TIMEOUT_SECONDS=45.0 per-file timeout. On timeout, full state is captured and file queued for third pass.
Third Pass
Resumes timeout-deferred files with no timeout. OCR cache, candidate pool, rotation state all restored from snapshot. Zero repeated work.
Section 03
Intake & Detection
How the system watches the INBOX folder, ensures files are fully written, and manages queuing, de-bounce, and safety rescans.
Watchdog Observer
The watchdog library attaches InboxPDFHandler to INBOX_DIR. Fires on on_created, on_moved, and on_modified events. Only .pdf files enqueued. A 0.8-second de-bounce prevents double-queuing on OS event bursts (Windows Copy → Flush → Close sequence). Existing inbox PDFs are seeded into the queue at startup.
File Stability Check
file_is_stable() polls the file size twice with a 0.3s delay. A file is safe only when size is non-zero and identical on both reads. Prevents processing a file still being written, transferred, or copied from a network share — a common issue with large EDM document batches.
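The stability rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the signature, the `polls` and `delay` parameter names, and the error handling are hypothetical. Only the two-poll, 0.3-second, non-zero identical-size rule comes from the text.

```python
import os
import time


def file_is_stable(path: str, polls: int = 2, delay: float = 0.3) -> bool:
    """Return True when the file size is non-zero and identical on every poll.

    Hypothetical sketch: the real function's signature is not shown here.
    """
    sizes = []
    for i in range(polls):
        try:
            sizes.append(os.path.getsize(path))
        except OSError:
            # File vanished or is not readable yet: treat as not stable.
            return False
        if i < polls - 1:
            time.sleep(delay)
    return sizes[0] > 0 and len(set(sizes)) == 1
```

A caller would typically leave an unstable file in the queue and retry on the next cycle rather than drop it.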
Safety Rescan
Every 30 seconds, the main loop re-scans INBOX and re-enqueues any PDFs found. Catches files the watchdog missed during high system load or temporary filesystem unavailability.
Heartbeat Log
Every 10 seconds: logs INBOX PDF count, AWB DB size, deferred long-pass queue depth, and timeout-deferred queue depth — real-time health monitoring.
Excel DB Refresh
Every 30 seconds: checks AWB Excel file mtime. On change, reloads full AWB set and rebuilds prefix/suffix buckets. UI can force immediate reload via reload_awb.trigger sentinel file.
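The mtime-driven reload can be sketched as a small cache object. Everything named here (`AwbSetCache`, `loader`, `refresh`, `force`) is hypothetical. The loader is injected so the sketch stays independent of whichever Excel library the pipeline uses, and the prefix/suffix bucket rebuild is omitted.

```python
import os


class AwbSetCache:
    """Reloads the AWB set only when the source file's mtime changes."""

    def __init__(self, path, loader):
        self.path = path
        self.loader = loader  # callable: path -> iterable of AWB strings
        self._mtime = None
        self.awbs = frozenset()

    def refresh(self, force: bool = False) -> bool:
        """Return True if the set was reloaded, False if nothing changed."""
        mtime = os.path.getmtime(self.path)
        if not force and mtime == self._mtime:
            return False
        self.awbs = frozenset(self.loader(self.path))
        self._mtime = mtime
        return True
```

Calling `refresh()` on the 30-second cycle costs one stat call when nothing changed. `force=True` corresponds to the sentinel-file path described above.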
Pre-Processing Guards — In Order
Guard
Check
On Failure
GUARD 1
File path exists & ends in .pdf (case-insensitive)
Silently dropped from queue
GUARD 2
file_is_stable() — 2 polls at 0.3s, non-zero identical size
No inbox-level deduplication — every file processed independently
Dedup only at PROCESSED level via MD5
Outlook Email Intake Script (Companion Flow)
A companion intake script handles email attachments before AWB hotfolder processing. Attachments are saved to INBOX, and filename renaming is strict and deterministic so that no ambiguity is introduced upstream.
Intake Decision Logic
1. Read Outlook email attachments and keep PDF files only.
2. Run strict AWB extraction patterns on attachment context/name.
3a. If exactly one strict AWB candidate exists: save as <AWB>.pdf in INBOX.
3b. If zero or multiple candidates: keep original filename and save to INBOX unchanged.
4. AWB hotfolder picks up from INBOX and runs normal Stage 0–7 logic.
Strict Pattern Rule
Pattern Type | Accepted Form | Result
12-digit strict | (?<!\d)\d{12}(?!\d) | Eligible candidate
4-4-4 strict | (?<!\d)\d{4}[\s-]\d{4}[\s-]\d{4}(?!\d) | Normalized to 12 digits
Exactly one match | Single strict candidate only | Rename to <AWB>.pdf
0 or >1 matches | Ambiguous or absent AWB | Keep original filename
Safety intent: strict rename only on one unambiguous candidate preserves traceability and avoids wrong pre-labeling; all ambiguous files still enter INBOX for the app’s full extraction pipeline. This intake script is treated as an operational companion layer, keeping core V3 extraction/matching behavior unchanged.
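The strict rename rule can be illustrated with the two patterns from the table. The function names `strict_awb_candidates` and `intake_filename` are hypothetical, and treating a bare match and a 4-4-4 match of the same digits as one candidate is an assumption of this sketch.

```python
import re

# Patterns as given in the Strict Pattern Rule table.
STRICT_12 = re.compile(r"(?<!\d)\d{12}(?!\d)")
STRICT_444 = re.compile(r"(?<!\d)\d{4}[\s-]\d{4}[\s-]\d{4}(?!\d)")


def strict_awb_candidates(text: str) -> set:
    """Collect distinct 12-digit candidates, normalizing 4-4-4 matches."""
    found = set(STRICT_12.findall(text))
    for match in STRICT_444.findall(text):
        found.add(re.sub(r"[\s-]", "", match))
    return found


def intake_filename(original_name: str, context: str) -> str:
    """Rename to <AWB>.pdf only when exactly one strict candidate exists."""
    candidates = strict_awb_candidates(context)
    if len(candidates) == 1:
        return f"{next(iter(candidates))}.pdf"
    return original_name  # zero or multiple candidates: keep the name
```

The lookarounds keep a 13-digit run from yielding a 12-digit candidate, so longer reference numbers never trigger a rename.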
Section 04
AWB Extraction Engine
Eight pipeline stages, each building on a shared candidate pool.
Unified Candidate Pool: All stages write into running_high and running_standard. Noisy candidates from invert or rotation passes enter a quarantine set until a second independent stage confirms them — preventing one-off OCR artifacts from ever reaching the matching step.
Stage | Method | DPI | Typical Time | DB Check
Stage 0 | Filename regex | - | ~0–5ms | YES
Stage 1 | PyMuPDF text layer · 5 sub-stages | - | 50–500ms | YES (excl. 400-pattern)
Pre-OCR | Angle detection · 3 free checks | - | ~0ms | -
Stage 2 | OCR Main · PSM 6+11 · digit whitelist | 320 | 1–4s | YES
3.1/3.2 | Rotation probe · probe text exit | 140 | 0.5–2s | YES
Stage 3 | OCR Strong · PSM 6+11 · 420 DPI | 420 | 2–6s | YES
Stage 4 | Rotation fallback at remaining angles | 420 | 3–12s | YES
Stage 5/5.5 | Context rescue · airway label crops · 3× upscale | 420 | 2–15s | YES
Stage 6 | EDM persistence fallback (runtime ON/OFF) | - | 0.1–1.5s (when called) | EDM
Stage 7 | NEEDS_REVIEW terminal | - | - | -
Stage 0
Filename Extraction — ~0ms
Instant match with no OCR — handles most EDM-downloaded files
Calls extract_awb_from_filename_strict() using two compiled regexes: a bare 12-digit run \b\d{12}\b and a spaced/hyphenated 4-4-4 format. If found and confirmed against the AWB DB, the file is completed immediately — all OCR is skipped entirely.
DB check required · Cost: ~0ms · Pattern: \b\d{12}\b · Pattern: \d{4}[\s-]\d{4}[\s-]\d{4}
Stage 1
Text Layer Extraction — 5 Sub-Stages
PyMuPDF embedded text · No image render · 60-keyword proximity search
1a (set_rotation fallback): If text layer is empty, tries rotating the PDF metadata by 90°/270°/180° to unlock text stored at an angle in the PDF's internal stream. Resets to 0° after.
1b (Spatial word sort): For scrambled multi-column PDFs, get_text("words") re-sorted by (row, x) to reconstruct reading order. Used only when it yields more candidates than the raw stream.
1c (400-pattern): Tight 400\d{12} regex. The only extraction that bypasses the DB check, because the FedEx 400 carrier prefix strongly implies a valid AWB.
1d (Clean gate + tiered extraction): Extracts candidates adjacent to AWB keywords and runs full priority match cascade (Exact-High → Exact-Standard → Tolerance).
1e (Keyword proximity, 60 terms): Scans 5 lines ahead / 2 lines behind every keyword in the 60-item list including "TRACKING NO", "MAWB", "CARGO CONTROL NUMBER", "ACI NO".
DB check: YES · No image render · Keywords: 60 terms · Lookahead: 5 lines
Pre-OCR
Angle Detection — Zero Cost
3 free checks before any image is rendered
Check 1 (PDF metadata rotation): page.rotation reports if the PDF viewer displays at 90°/180°/270°.
Pass A (PSM 6, digits-only): Uniform block mode with whitelist 0123456789. Clean gate → tiered extract → priority match. Early exit if quality candidates found (skips PSM 11).
Pass B (PSM 11, digits-only): Sparse text mode, suited for non-uniform layouts. OpenCV table line removal applied before OCR when available.
Pass C (soft pass, general text): Tesseract unrestricted (PSM 11 then PSM 6) + spatial bounding-box analysis (image_to_data) finding digit clusters adjacent to AWB keyword tokens.
Invert passes: PSM 6+11 with image inversion (white text on dark, common in FedEx label formats). Invert-sourced STANDARD candidates go to quarantine until confirmed by another stage.
Renders at 140 DPI, tests 0°/90°/180°/270°. Score formula: score = digit_count + keyword_hits×120 + coherent_words×2. High keyword weight means one "AWB" token outweighs hundreds of isolated digits.
Fast path: If 0° wins digit-score by ≥24 margin, skips expensive general-text OCR and returns 0° immediately.
Certainty thresholds: Score margin ≥300 = CERTAIN · ≥120 = LIKELY · below = UNCERTAIN. Uncertain probes defer to the pre-check hint.
Stage 3.2 (Probe Text Exit): Reuses OCR text from the probe at zero additional cost. A confident match here skips all of Stage 3 (420 DPI OCR), saving 3–8 seconds.
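The probe score formula and certainty thresholds can be expressed directly. In this sketch the function names are hypothetical, and "score margin" is assumed to mean the gap between the best angle and the runner-up, which the text does not spell out.

```python
KEYWORD_WEIGHT = 120  # from the score formula
WORD_WEIGHT = 2


def probe_score(digit_count: int, keyword_hits: int, coherent_words: int) -> int:
    """score = digit_count + keyword_hits*120 + coherent_words*2"""
    return digit_count + keyword_hits * KEYWORD_WEIGHT + coherent_words * WORD_WEIGHT


def pick_angle(scores: dict) -> tuple:
    """Return (best_angle, certainty) from a non-empty {angle: score} mapping."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_angle, best = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    margin = best - runner_up  # assumption: margin is measured against the runner-up
    if margin >= 300:
        certainty = "CERTAIN"
    elif margin >= 120:
        certainty = "LIKELY"
    else:
        certainty = "UNCERTAIN"
    return best_angle, certainty
```

With these weights, a single keyword hit is worth as much as 120 isolated digits, which is why a page with one clear "AWB" token tends to win the probe.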
Identical sub-pass structure to Stage 2 · fully cached renders · timeout-guarded in long-pass
Identical sub-pass structure to Stage 2 (PSM 6 → PSM 11 → Soft → Invert) applied at 420 DPI at base_angle. Higher DPI yields better clarity for thin-text documents, fine-print labels, and compressed scans. All images cached — if Stage 2 already rendered 420 DPI at 0°, Stage 3 retrieves from cache without re-render. In fast-lane mode, Stage 3 failure triggers a strict ProbeLite check (limited low-DPI rotation probe + 400/exact-high only); only then can it return DEFERRED.
Stage 4
Rotation Fallback
Full OCR at remaining angles · timeout-guarded at each boundary
Runs as rotation last-resort logic using probe certainty and route ordering (not only when base_angle ≠ 0°). It applies the full Stage 2+3 sub-pass sequence at deferred angles, then executes a full-priority sweep on accumulated fallback candidates. Timeout checks fire at angle boundaries — if LONG_PASS_TIMEOUT_SECONDS=45.0 is exceeded, _TimeoutDeferred captures all state (candidates, OCR cache, probe scores, timings) and defers to third-pass.
Stage 5 (Context Rescue): ACI-prefixed patterns, "400 NO. XXXX" labeled format, FedEx carrier row positional matching, and 2-3× upscaled region crops for micro-print labels. 60-second time budget.
Stage 5.5 (Airway Label): Crops three overlapping right-side regions (RightMid, UpperRight, RightWide) at each angle. Each crop upscaled 3×. Two-step efficiency: cheap digit-only pass first; if <10 digit chars found, skips general-text OCR for that crop.
Budget: 60s · Uses cached images · Upscale: 3×
Stage 6
EDM Persistence Fallback
Runtime-gated API check · single persistent HIGH candidate only
Entry criteria: Exactly one 12-digit candidate remains, tagged HIGH confidence, seen in at least 2 independent stages, and not disqualified as date/noise.
Runtime gate: EDM ON/OFF is read live from data/edm_toggle.json (with env/config fallback). OFF immediately bypasses API calls.
Outcomes: True confirms the candidate as MATCHED (EDM-Exists-Persistent). False or None falls through to Stage 7. Multiple persistent candidates trigger a tie to NEEDS_REVIEW.
UI toggle: EDM ON/OFF · Cache: edm_awb_exists_cache.json · Bypass-safe on missing token/auth
Stage 7
NEEDS_REVIEW — Terminal State
Full diagnostic log · safe_move() · audit updated
Terminal state when no AWB match is found after all stages. Full diagnostic log written: every candidate tried, confidence tier per candidate, which stages saw each one, and all quarantined candidates. File moved via safe_move() to NEEDS_REVIEW_DIR. Audit tracker updated via record_hotfolder_needs_review() with reason string and complete candidate list.
Section 05
Document Classification
How the pipeline characterises each document's type and difficulty tier before committing to expensive processing paths.
Tier 1 — Named Files
<5ms
Filename already contains a 12-digit AWB. No OCR required. Stage 0 match. Most EDM-downloaded files arrive pre-named and land here.
Tier 2 — Text PDFs
50–500ms
PDF has embedded vector text. AWB extracted via Stage 1 without any Tesseract invocation. Handles multi-column scrambling and metadata rotation recovery.
Tier 3 — Scanned / Image
2–20s
Image-only PDF — no embedded text (_is_image_only=True). Full OCR pipeline, rotation probe, and potentially all rescue stages.
Image-Only Detection
Set when len(txt_layer.strip()) == 0 after the set_rotation fallback attempt. Affects scheduling: rotation probe immediately narrowed to (0°, hint°) if a strong pre-check hint exists — reducing probe cost by ~50%.
Text Layer Quality Signal
A rich text layer (≥25 chars, MIN_EMBEDDED_TEXT_LENGTH) triggers the early rotation probe after Stage 2 failure. A thin layer triggers it immediately after Stage 1, routing weakly-embedded documents into OCR faster.
Section 06
Matching & Validation
How raw OCR candidates are validated against the AWB database — confidence tiers, Hamming distance tolerance, prefix/suffix buckets, and the tie-guard.
Candidate Confidence Tiers
HIGH
Extracted from strong, unambiguous context: adjacent to a known AWB keyword, matching a structured label pattern (ACI, 400-labeled, airway-bill label row), or confirmed by spatial bounding-box analysis. Eligible for Hamming tolerance up to 2 digits.
STANDARD
Extracted from looser pattern matching — a 12-digit run without strong keyword adjacency. Eligible for tolerance only if sole STANDARD candidate AND seen in ≥2 stages. Tolerance limited to 1-digit Hamming.
QUARANTINE
Candidates from noisy passes (invert, Rotation-180/270, AngFallback) that produced >3 candidates in one pass. Excluded until a second independent stage confirms the same candidate — preventing OCR artifacts from forcing matches.
2. Exact-Standard: STANDARD candidates ∩ AWB DB. Same tie logic applies.
3. Tolerance2-High: Hamming ≤2 against HIGH candidates. Accepted only if evidence candidate seen in ≥1 stage (dist≤1) or ≥2 stages (dist=2). Prefix/suffix bucket lookup limits the search space to ~4-digit matches.
4. Tolerance2-Standard: Hamming ≤1 against STANDARD, only when pool has exactly 1 candidate AND it was seen in ≥2 stages. Most conservative tolerance path.
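The HIGH-tier tolerance rule can be sketched as below. This is an illustration under assumptions: `tolerance_match` is a hypothetical name, the tie-guard is modeled as "exactly one DB entry within range", and a plain linear scan stands in for the prefix/suffix bucket lookup the pipeline uses to narrow the search.

```python
def hamming(a: str, b: str) -> int:
    """Count positions where two equal-length digit strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def tolerance_match(candidate, awb_db, max_dist, stage_hits):
    """Return the single DB AWB within max_dist of candidate, else None."""
    if candidate in awb_db:
        return candidate  # exact hit needs no tolerance
    near = [(hamming(candidate, awb), awb) for awb in awb_db]
    near = [(dist, awb) for dist, awb in near if dist <= max_dist]
    if len(near) != 1:
        return None  # no match, or a tie between several DB entries
    dist, awb = near[0]
    # dist <= 1 needs the candidate seen in >= 1 stage; dist == 2 needs >= 2.
    required_hits = 1 if dist <= 1 else 2
    return awb if stage_hits >= required_hits else None
```

Refusing ties is the conservative choice: when two database entries are equally close, the OCR evidence cannot say which one the document belongs to.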
Candidate Disqualification Rules
✕ More than 3 leading zeros (e.g. 000012345678): numeric code or padding artifact
✕ Matches a date pattern (YYYYMMDD, DDMMYYYY variants): disqualified as date reference
✕ All same digit (e.g. 111111111111): OCR artifact or table cell filler
✕ Matches known HS code or phone number patterns: commercial document noise
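Two of the rules above are fully specified and can be sketched directly. The function name `disqualify_reason` is hypothetical. The date and HS-code/phone rules are deliberately left out because their exact patterns are not given here.

```python
def disqualify_reason(candidate: str):
    """Return a reason string if the candidate should be dropped, else None.

    Covers only the leading-zero and same-digit rules; the date and
    HS-code/phone rules are omitted because their patterns are not specified.
    """
    if len(set(candidate)) == 1:
        return "all same digit"
    leading_zeros = len(candidate) - len(candidate.lstrip("0"))
    if leading_zeros > 3:
        return "more than 3 leading zeros"
    return None
```

For example, `000012345678` has four leading zeros and is rejected, while `000123456789` has exactly three and passes.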
Section 07
Routing & Decision Logic
How the orchestrator decides which path a file takes at every decision boundary — three-tier scheduler, timeout management, and full state serialisation for seamless resume.
Return States
MATCHED: AWB confirmed. File renamed and moved to PROCESSED.
NEEDS_REVIEW: No match after all stages. File moved to NEEDS_REVIEW dir.
DEFERRED: Fast-lane exit after Stage 3 fail. Queued for long-pass.
TIMEOUT: Budget exceeded. State fully captured for third-pass resume.
45-Second Timeout Budget
The long-pass budget is LONG_PASS_TIMEOUT_SECONDS=45.0. The timeout check fires only at natural angle-pass boundaries — never mid-subpass. On timeout, _TimeoutDeferred exception is raised internally, which triggers complete state capture into a _state_out dict passed back to the scheduler for third-pass resumption.
State Capture on Timeout
Everything needed for seamless third-pass resume:
# Values are elided with "..."; each comment describes the field.
captured_state = {
    "probe_scores": ...,          # rotation probe scores per angle, e.g. {0: 891, 90: 42, ...}
    "probe_texts": ...,           # (digit_text, general_text) per angle, reused free
    "base_angle": ...,            # determined rotation in degrees (0/90/180/270)
    "_angle_certainty": ...,      # "CERTAIN" / "LIKELY" / "UNCERTAIN"
    "running_high": ...,          # accumulated HIGH candidate set as list
    "running_standard": ...,      # accumulated STANDARD candidate set as list
    "candidate_stage_hits": ...,  # {candidate: [stages_that_saw_it]}
    "quarantine": ...,            # noisy candidates pending second-stage confirmation
    "ocr_cache": ...,             # [[key_list, text], ...] as serialisable pairs
    "timings": ...,               # accumulated ms per stage, carried forward on resume
}
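The boundary-only timeout check can be sketched as a loop that consults the clock between angle passes and raises an exception carrying the captured state. All names here are hypothetical stand-ins (`TimeoutDeferred` for the internal `_TimeoutDeferred`, plus `run_angle_passes`, `run_pass`, and the `done_angles` key), and the state is reduced to two fields for brevity.

```python
import time


class TimeoutDeferred(Exception):
    """Carries captured state; stand-in for the internal _TimeoutDeferred."""

    def __init__(self, state):
        super().__init__("long-pass budget exceeded")
        self.state = state


def run_angle_passes(angles, run_pass, budget_seconds=45.0, clock=time.monotonic):
    """Run one pass per angle, checking the budget only between passes."""
    start = clock()
    state = {"running_high": [], "done_angles": []}
    for angle in angles:
        # Boundary check: a pass that has started always runs to completion.
        if clock() - start > budget_seconds:
            raise TimeoutDeferred(state)
        state["running_high"].extend(run_pass(angle))
        state["done_angles"].append(angle)
    return state
```

A scheduler would catch `TimeoutDeferred`, store `exc.state`, and pass it back on the third pass so completed angles are not repeated.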
Section 08
Output Actions
What happens to a file once a match is confirmed or denied — file lifecycle from INBOX to final resting place across all output directories.
On Match → PROCESSED
1. Rename & Move: PROCESSED/<AWB>.pdf. If exists: MD5 compare → identical → source deleted. Different content → _2, _3 suffix appended.
2. AWB Logs Excel: Row buffered to CSV sidecar, flushed to Excel after 10 rows accumulate. Contains: AWB, source filename, timestamp, match method, status.
AWB sequence Excel log and awb_list.csv for downstream.
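The rename-and-move rule can be sketched with the standard library. The names `md5_of` and `move_to_processed` are hypothetical, and comparing against existing `_2`, `_3` copies (not only the base name) is an assumption of this sketch.

```python
import hashlib
import os
import shutil


def md5_of(path, chunk_size=64 * 1024):
    """MD5 hex digest of a file, read in 64KB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def move_to_processed(src, processed_dir, awb):
    """Move src to processed_dir/<AWB>.pdf and return the final path."""
    os.makedirs(processed_dir, exist_ok=True)
    src_hash = md5_of(src)
    n = 1
    while True:
        name = f"{awb}.pdf" if n == 1 else f"{awb}_{n}.pdf"
        dest = os.path.join(processed_dir, name)
        if not os.path.exists(dest):
            shutil.move(src, dest)
            return dest
        if md5_of(dest) == src_hash:
            os.remove(src)  # identical content is already stored
            return dest
        n += 1  # different content: try the next suffix
```

The source is deleted only after a stored copy with the same digest has been found, so a failed comparison never loses data.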
What Happens Next — Downstream Journey
PROCESSED
EDM 3-Layer Check
CLEAN
Batch Builder
Barcode Cover
PENDING_PRINT
TIFF Convert
Print Queue
See Pipeline Flow §2 for interactive Phase 2 diagram
Section 09
EDM Duplicate Check
In the current V3, EDM is runtime-gated from the UI and used for the Stage 6 AWB persistence fallback. This section also documents the downstream duplicate-screening lane used in EDM observability flows.
V3 update: EDM calls are now controlled at runtime (EDM: ON/OFF button in UI). Stage 6 in the hotfolder pipeline uses EDM for persistent-candidate confirmation, and bypasses safely when EDM is OFF, token is missing, or response is inconclusive.
Layer 1 — MD5 Hash
Byte-for-byte identity check. Fastest possible comparison: two files with the same MD5 digest are treated as identical content, since an accidental collision between genuine documents is vanishingly unlikely. Effectively zero false positives. Runs before any image processing.
Cost: ~1ms
Layer 2 — Perceptual Hash
Compares visual fingerprints of page renders. Catches near-identical documents that differ only in metadata, timestamps, or minor encoding differences. Threshold: pHash distance ≤ 10.
Threshold: ≤10
Layer 3 — OCR Text Similarity
RapidFuzz text similarity on OCR/text-layer content with layered safeguards: text-vs-text only when both sides have enough embedded text, OCR fallback when one side lacks text layer. Similarity ≥60% is tracked; ≥85% becomes a strong text signal.
Track: ≥60% · Strong: ≥85%
Document Flow Through EDM
1. Runtime gate evaluated (EDM: ON/OFF in UI)
2. Stage 6 checks a single persistent HIGH candidate only
3. EDM metadata API queried (token + auth validated)
4a. Confirmed in EDM → MATCHED (EDM-Exists-Persistent)
4b. Not confirmed / bypassed → continue to Stage 7 path
5. Downstream duplicate lane applies Gate-1 hash-all-pages, Gate-2 bounded probes, then Tier-2 full checks with CCD exemptions and conservative reject rules
EDM API Integration
Connects to the FedEx EDM production endpoint using a Bearer token stored in data/token.txt. Two endpoints used:
Metadata endpoint: Retrieves document group info for the AWB (operatingCompany: "FXE")
Download endpoint: Returns a ZIP of existing PDFs for the AWB, extracted for comparison
Fallback contract handling: Metadata path supports legacy payload first, then body/query fallback variants; inconclusive/auth/network paths route to CLEAN-UNCHECKED without stopping flow.
Token expiry is checked on first call each session. When expired, EDM checks are skipped gracefully and logged as warnings — the pipeline continues without interruption.
Live toggle behavior: the desktop UI button switches EDM between ON and OFF immediately. Running hotfolder checks read the updated value without restart, so API calls are allowed or bypassed in real time.
Runtime Toggle Resolution Order
Priority | Source | Purpose
1 | data/edm_toggle.json | Persisted UI state from EDM: ON/OFF button
2 | PIPELINE_EDM_ENABLED | Per-process override for launched scripts
3 | ENABLE_EDM_FALLBACK | Default from .env / config
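The resolution order can be sketched as a three-step lookup. Several details are assumptions: the JSON file is assumed to hold an `"enabled"` key, the truthy strings accepted from the environment are a guess, and the real `is_edm_enabled()` signature is not shown in this document.

```python
import json
import os


def _as_bool(value) -> bool:
    """Interpret common truthy strings; the accepted set is an assumption."""
    return str(value).strip().lower() in {"1", "true", "yes", "on"}


def is_edm_enabled(toggle_path="data/edm_toggle.json", env=None, config_default=False):
    """Resolve the EDM toggle: UI file, then process override, then default."""
    env = os.environ if env is None else env
    # 1. Persisted UI state (hypothetical schema: {"enabled": true})
    try:
        with open(toggle_path, encoding="utf-8") as fh:
            data = json.load(fh)
        if isinstance(data, dict) and "enabled" in data:
            return bool(data["enabled"])
    except (OSError, ValueError):
        pass  # missing or unreadable file: fall through to the next source
    # 2. Per-process override for launched scripts
    if "PIPELINE_EDM_ENABLED" in env:
        return _as_bool(env["PIPELINE_EDM_ENABLED"])
    # 3. Default from .env / config
    if "ENABLE_EDM_FALLBACK" in env:
        return _as_bool(env["ENABLE_EDM_FALLBACK"])
    return config_default
```

Because the file is re-read on every call, a UI toggle takes effect on the next check without restarting the hotfolder process.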
Stage 6 Decision Path
Gate: Run only for one persistent HIGH candidate seen across 2+ stages
ON: Call EDM metadata endpoint with token from data/token.txt (or env)
True: Complete match with method EDM-Exists-Persistent
False: Candidate not confirmed; pipeline continues to Stage 7 logic
# Simplified runtime behavior used in V3
if not is_edm_enabled():
    return None  # bypass API

token = get_token()
if not token:
    return None  # skip safely

exists = edm_awb_exists_fallback(candidate_awb)
if exists:
    complete_match(candidate_awb, "EDM-Exists-Persistent")
Why originals reach CLEAN faster: the comparison lane avoids brute-force permutations with Gate-1 exact hash over all pages, Gate-2 bounded probes, and Tier-2 expansion only after evidence hit.
Permutation/Combination Pruning
1. Gate 1: incoming all-pages vs EDM all-pages hash index gives exact-hit/no-hit cheaply.
2. Gate 2: smart sampled probes (first pages + 1/3 + mid + 2/3 + last) using pHash/text/OCR only when Gate 1 misses.
3. Tier 2 full checks run only after evidence and apply CCD exemption + bounded OCR parallel prewarm.
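The Gate 2 sampling pattern (first pages plus 1/3, mid, 2/3, and last) can be sketched as an index picker. The function name and the integer-division rounding are assumptions. The default of 3 base pages mirrors EDM_TIER1_INCOMING_PAGES.

```python
def probe_page_indices(page_count: int, base_pages: int = 3) -> list:
    """Return sorted, de-duplicated 0-based page indices to probe."""
    if page_count <= 0:
        return []
    picks = set(range(min(base_pages, page_count)))  # first pages
    last = page_count - 1
    picks.update({last // 3, last // 2, (2 * last) // 3})  # 1/3, mid, 2/3
    picks.add(last)  # last page
    return sorted(picks)
```

For a 30-page document this probes 7 pages instead of 30, and short documents collapse to their full page list.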
Page Limits + Top Picks
Control | Default | Effect
EARLY_FOCUS_MATCH_THRESHOLD | 3 | Promotes likely candidates first (top-pick path)
PAGE_OCR_LIMIT | 8 | Caps OCR pages per document to bound heavy comparisons
EDM_OCR_COMPARE_LIMIT | 10 | Tier-2 OCR compare page/doc bound for heavy checks
PHASH_THRESHOLD | 10 | Fast visual near-match gate before OCR text similarity
EDM_TIER1_INCOMING_PAGES | 3 | Base sampled incoming pages before extra strategic probe indices
EDM_TIER1_EDM_PAGE_LIMIT | 5 | Tier-1 probe depth per EDM document
EDM_TIER2_EDM_PAGE_LIMIT | 10 | Tier-2 full-compare depth per EDM document
P-Cross Handling for Multi-Page Input vs Multi-Page EDM Docs
When both sides have many pages, the lane uses bounded cross-page comparison (P-Cross) instead of unbounded all-page expansion. Duplicate evidence is accumulated as page hits and ratio, then resolved by configured rejection thresholds.
Operational result: efficient ranking + bounded P-Cross keeps processing fast for genuine originals while still protecting against high-page-count duplicate overlap.
False positive philosophy: A false positive (wrongly rejecting a clean doc) is treated as worse than a missed duplicate. Automatic reject/split decisions require strong methods (HASH/PHASH). TEXT/OCR-only similarity is retained as observability signal and can still pass through when strong evidence is absent.
Layer 1 · MD5 Hash: Byte Identity
Computed using hashlib.md5() with 64KB read chunks. Two files with the same MD5 digest are treated as byte-for-byte identical; an accidental collision between genuine documents is vanishingly unlikely, so this layer produces effectively zero false positives. This is the cheapest check at ~1ms and runs first, before any image processing.
Speed: ~1ms · Zero false positives · Chunk size: 64KB
Layer 2 · Perceptual Hash: Visual Fingerprint
Uses the ImageHash library to compute a perceptual hash of each page render. Hamming distance between two pHash values ≤ 10 → duplicate. Catches near-identical documents that differ only in metadata, compression level, or minor encoding artefacts. OCR compare limit caps checked pages at 10.
When hashes differ but content may still be duplicated (for example re-scanned paper), text comparison runs with safeguards: text-layer compare only when both pages have enough embedded text; otherwise OCR fallback compares mixed text/image pages. Score ≥60% is recorded as TEXT; score ≥85% is recorded as TEXT_STRONG. Conservative routing still requires strong evidence for auto-reject decisions.
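How the three layers combine into a conservative decision can be sketched as follows. This is an illustration only: `difflib.SequenceMatcher` is swapped in for RapidFuzz so the sketch needs no third-party packages, perceptual hashes are modeled as plain integers rather than ImageHash objects, and the `"NONE"` label and function names are hypothetical.

```python
from difflib import SequenceMatcher


def phash_distance(a: int, b: int) -> int:
    """Hamming distance between two perceptual hashes held as integers."""
    return bin(a ^ b).count("1")


def text_similarity(a: str, b: str) -> float:
    """Similarity on a 0-100 scale (difflib stand-in for RapidFuzz)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def classify_pair(md5_a, md5_b, phash_a, phash_b, text_a, text_b, phash_threshold=10):
    """Return (signal, auto_reject) for one incoming/existing page pair."""
    if md5_a == md5_b:
        return "HASH", True
    if phash_distance(phash_a, phash_b) <= phash_threshold:
        return "PHASH", True
    score = text_similarity(text_a, text_b)
    if score >= 85:
        return "TEXT_STRONG", False  # observability signal only
    if score >= 60:
        return "TEXT", False
    return "NONE", False
```

Only HASH and PHASH set `auto_reject`, matching the stated philosophy that text-only similarity is recorded but does not reject a document on its own.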
Transforms the CLEAN folder into numbered, print-ready batch PDFs with auto-generated barcode cover pages — replacing the entire manual stack-and-scan workflow.
Max Pages / Batch
48
configurable via MAX_PAGES_PER_BATCH
Cover Page Format
1
per AWB · barcode + metadata
Sort Order
mtime
oldest file in each AWB group first
Write Safety
Atomic
tmp → os.replace — no partial files
Batch Construction Pipeline — Step by Step
01 SCAN CLEAN
Find all \d{12}(?:_\d+)?.pdf files. Group by AWB. Open each once for page count.
mtime cached
02 SORT FIFO
Sort groups globally by oldest file's mtime. Files within each AWB group also sorted by mtime.
oldest first
03 ASSIGN TIER
Read stage_cache.csv to map each AWB's detection method to High / Medium / Low tier.
High · Med · Low
04 ASSIGN SEQ
Global sequential integer per AWB. Resets per tier when tier batching is on. Logged to Excel.
FIFO → SEQ
05 PACK BATCHES
Greedy first-fit. Cost = 1 cover + N invoice pages. When next AWB would exceed 48pp → open new batch.
≤48 pages/batch
06 BUILD PDF
Insert Code128 cover page + all AWB PDFs via PyMuPDF. Write atomically: .tmp → os.replace.
atomic write · reportlab
07 PENDING_PRINT
Copy batch PDF with MD5 dedup check. Delete CLEAN sources only after all copies succeed.
MD5 dedup copy
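The atomic write in step 06 follows a common pattern: write to a temporary file next to the destination, then swap it into place with `os.replace`. The helper name is hypothetical, and the `fsync` call is an extra safety step assumed here rather than something the text confirms.

```python
import os


def atomic_write_bytes(dest_path: str, data: bytes) -> None:
    """Write data so readers never observe a partially written dest_path."""
    tmp_path = dest_path + ".tmp"
    with open(tmp_path, "wb") as fh:
        fh.write(data)
        fh.flush()
        os.fsync(fh.fileno())  # assumption: flush to disk before the swap
    os.replace(tmp_path, dest_path)  # atomic when both paths share a filesystem
```

If the process dies mid-write, only the `.tmp` file is left behind and any existing batch PDF stays intact.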
Batch Packing Algorithm
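Each AWB costs 1 cover page + N invoice pages. The packer walks AWBs in sequence order and assigns each one to the current batch. When the next AWB would push the batch past 48 pages, a new batch opens. The packing is greedy and strictly sequential (next-fit): earlier batches are never revisited, and there is no backtracking or re-ordering, so print order always follows sequence order.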
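The packing rule can be sketched as follows. This is an illustrative function, not the builder's actual code; `pack_batches` is a hypothetical name, and giving an AWB that alone exceeds the limit its own batch is an assumption the text does not confirm.

```python
def pack_batches(awbs, max_pages=48):
    """Pack (awb, invoice_pages) pairs, given in sequence order, into batches.

    Each AWB costs 1 cover page plus its invoice pages. Only the current
    batch is ever considered, so sequence order is preserved.
    """
    batches, current, used = [], [], 0
    for awb, invoice_pages in awbs:
        cost = 1 + invoice_pages
        # Open a new batch when this AWB would push the current one past the limit.
        if current and used + cost > max_pages:
            batches.append(current)
            current, used = [], 0
        # Assumption: an AWB larger than max_pages still gets a batch of its own.
        current.append(awb)
        used += cost
    if current:
        batches.append(current)
    return batches
```

For example, with AWBs of 20, 20, 10 and 5 invoice pages, the third AWB opens batch 2, and the fourth joins it even though it would still have fitted in batch 1. That is the price of keeping the order fixed.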
Scans CLEAN for \d{12}(?:_\d+)?.pdf. Files grouped by AWB and sorted within group by mtime ascending. Groups globally ordered by each group's oldest file's mtime — FIFO. Each PDF opened exactly once during scan; page counts cached for the build phase.
Pattern: \d{12}(_\d+)?.pdf · Sort: mtime asc · Each PDF opened once
Page count efficiency: Each PDF in CLEAN is opened once during the scan phase to read its page count. The counts are cached in memory and reused during the build phase — no file is opened twice. Keeps the batch builder fast even for large CLEAN folders.
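A minimal sketch of the scan, group and FIFO sort steps, assuming the file names and mtimes have already been collected as `(filename, mtime)` pairs. The function name is hypothetical, and the regex here escapes the dot before `pdf`, which is slightly stricter than the pattern quoted above.

```python
import re
from collections import defaultdict

# 12-digit AWB, optional _N suffix, literal ".pdf" extension.
AWB_FILE = re.compile(r"(\d{12})(?:_\d+)?\.pdf")


def group_fifo(files):
    """Group (filename, mtime) pairs by AWB and return them in FIFO order.

    Returns a list of (awb, [filenames]) where groups are ordered by their
    oldest file and files inside a group are ordered by mtime ascending.
    """
    groups = defaultdict(list)
    for name, mtime in files:
        match = AWB_FILE.fullmatch(name)
        if match:  # anything that is not an AWB-named PDF is ignored
            groups[match.group(1)].append((mtime, name))
    ordered = sorted(groups.items(), key=lambda kv: min(kv[1]))
    return [(awb, [name for _, name in sorted(items)]) for awb, items in ordered]
```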
Sequence numbers are assigned globally across all AWBs in FIFO order (oldest file first). Each AWB gets a sequential integer (seq), unique within a build run, that appears on its barcode cover page. Because the Excel sequence log is append-only and numbering restarts per tier when tier batching is on, a row is identified by Seq together with its Timestamp and Tier; that combination allows tracing any printed batch document back to its original PDF.
Sequencing Rules
1. FIFO by file mtime: The CLEAN folder is scanned and groups are sorted by the oldest file's mtime within each AWB group. Earliest-arrived files get the lowest sequence numbers.
2. Multiple files per AWB: When an AWB has multiple documents (e.g. 123456789012.pdf and 123456789012_2.pdf), all are grouped under the same sequence number. All files are inserted after the single barcode cover page.
3. Append-only sequence log: Every batch run appends rows to data/OUT/awb_sequence.xlsx; history is never overwritten. Columns: Seq, AWB, PDF Files, Timestamp, DocCount, InvoicePages, TotalPages, Batch, Tier.
4. Tier batching resets seq: When ENABLE_TIER_BATCHING=True, sequence numbers restart at 1 within each tier (High, Medium, Low). Each tier is built as a separate batch series, and their cover pages are independently numbered.
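The sequencing rules can be illustrated with a short sketch. The function and parameter names are hypothetical; it assumes the AWBs arrive already sorted in FIFO order and that an AWB with no known tier falls back to Low.

```python
def assign_sequences(awbs, tier_of=None, tier_batching=False):
    """Map each AWB (given in FIFO order) to its sequence number.

    With tier batching off there is one global counter; with it on, each
    tier keeps its own counter starting at 1.
    """
    tier_of = tier_of or {}
    counters = {}
    sequences = {}
    for awb in awbs:
        # One shared counter key when tier batching is off.
        key = tier_of.get(awb, "Low") if tier_batching else "ALL"
        counters[key] = counters.get(key, 0) + 1
        sequences[awb] = counters[key]
    return sequences
```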
Excel Sequence Log Schema
Column | Type | Description
Seq | integer | Global sequence number, unique per AWB per build run
AWB | string | 12-digit Air Waybill Number
PDF Files | string | Pipe-separated list of source PDFs, e.g. "123456789012.pdf | 123456789012_2.pdf"
Timestamp | datetime | ISO timestamp of when this entry was built
DocCount | integer | Number of PDF documents included for this AWB
InvoicePages | integer | Total invoice/document page count (excludes cover)
TotalPages | integer | Cover page + invoice pages = what occupies batch space
Batch | integer | Batch PDF number this AWB was assigned to
Tier | string | Detection tier (High / Medium / Low) when tier batching enabled
What's on the Cover Page
SEQ: Sequence number in large Helvetica-Bold 18pt
AWB: 12-digit Air Waybill Number in 22pt bold
BATCH: Batch number, zero-padded to 3 digits (e.g. 003)
PAGE: Cover's position within the batch (e.g. Page 7 of 48)
BARCODE: Code128 barcode of the AWB number, height 60pt, width 1.2pt
FOOTER: Generation timestamp at bottom of page
Cover Page Generation
Built using reportlab's canvas API. Generated in-memory as raw PDF bytes (BytesIO), then merged into the batch PDF via PyMuPDF's insert_pdf(). Page size is configurable: LETTER (default) or A4.
Library: reportlab · Barcode: Code128 · Page: LETTER or A4
When ENABLE_TIER_BATCHING=True, the builder creates separate batch series for each detection confidence tier. High-confidence AWBs (matched from filename or text layer) are batched separately from OCR-matched or lower-confidence documents. Each tier's batches are independently numbered and file-named.
HIGH Tier
AWBs matched via filename or text layer extraction. Prefixes: FILENAME, TEXTLAYER-EXACT, TEXT-LAYER.
Batch file format: PRINT_STACK_BATCH_TH_001.pdf
MEDIUM Tier
AWBs matched via exact OCR. Prefixes: OCR-EXACT. These required image rendering but produced a clean exact match.
Batch file format: PRINT_STACK_BATCH_TM_001.pdf
LOW Tier
All other AWBs — tolerance-matched, rescue-stage matches, or any detection method not in the high/medium prefixes.
Batch file format: PRINT_STACK_BATCH_TL_001.pdf
How Tier is Determined
The batch builder reads data/stage_cache.csv which the hotfolder writes after each successful match. The AWB_Detection_Type column stores the method label (e.g. Filename, OCR-Strong-Exact-High). Tier is assigned by prefix matching:
Detection Method Prefix | Assigned Tier
FILENAME, TEXTLAYER-EXACT, TEXT-LAYER | High
OCR-EXACT | Medium
All others (tolerance, rescue, rotation fallback) | Low
AWB not found in stage_cache.csv | Low (default)
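The prefix table translates directly into a lookup. The sketch below is illustrative only: `tier_for` is a hypothetical name, and matching case-insensitively is an assumption made because the example labels in the text are mixed-case while the prefixes are upper-case.

```python
HIGH_PREFIXES = ("FILENAME", "TEXTLAYER-EXACT", "TEXT-LAYER")
MEDIUM_PREFIXES = ("OCR-EXACT",)


def tier_for(method):
    """Return High / Medium / Low for a detection method label.

    A missing label (AWB not found in stage_cache.csv) defaults to Low.
    """
    if not method:
        return "Low"
    label = method.strip().upper()  # assumption: prefixes match case-insensitively
    if label.startswith(HIGH_PREFIXES):
        return "High"
    if label.startswith(MEDIUM_PREFIXES):
        return "Medium"
    return "Low"
```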
Atomic Write Safety
Every batch PDF is written atomically: the PDF is saved to a .tmp file first, then renamed to the final name via os.replace() — which is atomic on both Windows NTFS and macOS APFS. If the process crashes mid-write, no partial batch PDF can be mistaken for a complete one.
# Atomic write (no partial files on crash)
doc.save(str(tmp_path))
doc.close()
os.replace(str(tmp_path), str(out_path)) # atomic
PENDING_PRINT Copy with MD5 Dedup
After batch PDFs are built in data/OUT/, they are copied to PENDING_PRINT/ for TIFF conversion and printing. Before each copy, MD5 hashes are compared: if an identical batch already exists in PENDING_PRINT, the copy is skipped. Different-content files with the same name get a _v2, _v3 suffix.
MD5 dedup on copySuffix: _v2, _v3...
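A sketch of the copy step under stated assumptions: the helper names are hypothetical, and the suffix search simply tries _v2, _v3 and so on until it finds either an identical file or a free name.

```python
import hashlib
import shutil
from pathlib import Path


def md5_of(path):
    """Return the MD5 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_with_dedup(src, dest_dir):
    """Copy src into dest_dir unless an identical file is already there.

    Returns ("skipped_dup", existing_path) or ("copied", new_path).
    """
    src, dest_dir = Path(src), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    src_md5 = md5_of(src)
    target, version = dest_dir / src.name, 1
    while target.exists():
        if md5_of(target) == src_md5:
            return "skipped_dup", target
        # Same name, different content: try the next versioned name.
        version += 1
        target = dest_dir / f"{src.stem}_v{version}{src.suffix}"
    shutil.copy2(src, target)
    return "copied", target
```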
CLEAN Source Deletion Safety
Source files in CLEAN are only deleted after all batch PDFs have been successfully copied to PENDING_PRINT. If any copy fails, deletion is skipped entirely with a safety message logged. This prevents data loss if the copy step is interrupted.
# Safety guard before source deletion
if failed == 0 and copied + skipped_dup == expected:
    delete_clean_sources(resolved)
else:
    log("[SAFETY] Skipping deletion — not all files copied")
data/OUT/
Batch PDFs named PRINT_STACK_BATCH_NNN.pdf
PENDING_PRINT/
Copy of batches + TIFF conversions for scanner
awb_sequence.xlsx
Append-only sequence log — history preserved
Section 11
Tracking / Audit Layer
Every file, every match, every decision — permanently recorded across four complementary audit mechanisms with concurrent-write safety.
4-Sheet Audit Workbook
Central file: pipeline_audit.xlsx.
HotfolderV2: One row per AWB detection — timestamp, employee ID, AWB, filenames, detection method, tier, timing (ms), result, notes.
EDM: One row per EDM check — result, dup page count, dup ratio, compare method.
BatchTIFF: One row per batch build or TIFF conversion — batch number, AWB count, page count, output path.
Dashboard: Programmatically computed summary, rewritten on every write. Session totals and running stats.
JSONL Audit Log
Every event is appended to logs/pipeline_audit.jsonl. The log auto-rotates at 50MB (one backup). Logging never interrupts pipeline flow: all writes are wrapped in try/except.
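A minimal sketch of such a logger, assuming a single `.1` backup file name and a hypothetical `append_event` helper; the real rotation naming is not specified in the text.

```python
import json
import os


def append_event(log_path, event, max_bytes=50 * 1024 * 1024):
    """Append one JSON line to the audit log; never raise to the caller.

    When the log has reached max_bytes it is moved to a single backup
    (log_path + ".1", an assumed name) before the new line is written.
    Returns True on success and False on any failure.
    """
    try:
        if os.path.exists(log_path) and os.path.getsize(log_path) >= max_bytes:
            os.replace(log_path, log_path + ".1")  # overwrites any older backup
        with open(log_path, "a", encoding="utf-8") as handle:
            handle.write(json.dumps(event, ensure_ascii=False) + "\n")
        return True
    except Exception:
        # Audit logging must not break document processing.
        return False
```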
Lock-file pattern via os.O_CREAT | os.O_EXCL, which is atomic on both Windows NTFS and macOS APFS. Timeout: 15 seconds. Stale locks (held >30s) are auto-cleared. Enables safe concurrent access from hotfolder + batch_builder + TIFF converter simultaneously.
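The lock-file pattern can be sketched as below. The function names and polling interval are hypothetical, and judging staleness by the lock file's mtime is an assumption; the actual tracker may record the holder differently.

```python
import os
import time


def acquire_lock(lock_path, timeout=15.0, stale_after=30.0, poll=0.1):
    """Try to create lock_path exclusively; return True once it is held."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file already exists.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return True
        except FileExistsError:
            try:
                age = time.time() - os.path.getmtime(lock_path)
                if age > stale_after:
                    os.remove(lock_path)  # clear a stale lock, then retry at once
                    continue
            except OSError:
                pass  # lock changed under us; fall through and retry after a pause
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)


def release_lock(lock_path):
    """Remove the lock file; a missing file is not an error."""
    try:
        os.remove(lock_path)
    except FileNotFoundError:
        pass
```

A caller would wrap each workbook write in `acquire_lock(...)` and `release_lock(...)`, ideally with `try/finally` so the lock is released even when the write fails.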
Known Limitation: When multiple INBOX files share the same AWB (duplicates at different paths), the tracker creates a separate row per file — not per AWB. This is correct behaviour (unique processing records) but AWB-level aggregation requires grouping by the AWB column across rows. A future processing_id UUID field would resolve this cleanly.
Section 12
Error Handling & Recovery
How the pipeline handles every failure mode — from unstable files and corrupt PDFs to OCR noise, rotation ambiguity, and the three-tier timeout recovery cycle.
File Stability Failure
If file_is_stable() fails, pipeline returns NEEDS_REVIEW immediately. The 30-second safety rescan re-enqueues the file once writing completes, allowing processing to succeed on the next cycle without manual intervention.
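The body of file_is_stable() is not shown here, so the following is only one plausible shape for such a check: treat a file as stable when its size and mtime stay unchanged across a few short polls. The name `looks_stable` and its parameters are hypothetical.

```python
import os
import time


def looks_stable(path, checks=2, interval=0.5):
    """Return True if the file is non-empty and unchanged across `checks` polls."""
    try:
        last = os.stat(path)
        for _ in range(checks):
            time.sleep(interval)
            current = os.stat(path)
            changed = (current.st_size, current.st_mtime_ns) != (last.st_size, last.st_mtime_ns)
            if changed or current.st_size == 0:
                return False  # still being written, or nothing written yet
            last = current
        return True
    except OSError:
        return False  # missing or unreadable files are never stable
```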
Corrupt or Zero-Page PDFs
PyMuPDF open() and page_count check catches corrupt files before any extraction stage. Moved to NEEDS_REVIEW with clear reason string ("0-page PDF" or "Corrupt/unreadable PDF: {exception}"). Guard runs before any closures are defined — ensuring clean state even on immediate failure.
45-Second Long-Pass Timeout
High-resolution scans with complex rotation can exceed the budget. Rather than blocking the entire pipeline, the timeout captures full state (candidates, OCR cache, probe scores, timings) and defers to the third-pass queue. Third pass processes up to 3 deferred files per cycle with no timeout, only when both fast and long queues are empty.
OCR Noise Quarantine
Passes known to be noisy (invert, Rotation-180°/270°, AngFallback) that produce >3 STANDARD candidates in one pass have those candidates quarantined. Promoted to active pool only when a second, independent stage sees the same number — preventing one-off OCR artifacts from ever reaching the matching step.
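The quarantine rule can be sketched as a small candidate pool. This is a simplified illustration: the class name and stage labels are hypothetical, and it ignores the HIGH/STANDARD distinction by counting every candidate from a noisy pass.

```python
NOISY_PASSES = {"invert", "Rotation-180", "Rotation-270", "AngFallback"}


class CandidatePool:
    """Track which stages saw each candidate and hold back noisy one-offs."""

    def __init__(self, noisy_limit=3):
        self.noisy_limit = noisy_limit
        self.active = {}       # candidate -> set of stages that saw it
        self.quarantined = {}  # candidate -> set of stages that saw it

    def add_pass(self, stage, candidates):
        found = set(candidates)
        # A known-noisy pass that yields too many candidates is not trusted alone.
        noisy = stage in NOISY_PASSES and len(found) > self.noisy_limit
        for cand in found:
            if cand in self.active:
                self.active[cand].add(stage)
            elif cand in self.quarantined:
                stages = self.quarantined[cand]
                stages.add(stage)
                if len(stages) >= 2:  # a second, different stage confirms it
                    self.active[cand] = self.quarantined.pop(cand)
            elif noisy:
                self.quarantined[cand] = {stage}
            else:
                self.active[cand] = {stage}
```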
Three-Layer Rotation Defence
(1) Free metadata checks before any render, (2) low-DPI keyword-scored probe, (3) full rotation fallback at all remaining angles. If probe is uncertain, the pre-check hint overrides it. Saved timeout state includes probe scores — third pass skips re-probing entirely.
Table Line Removal
For documents where ruling lines cross digit characters and confuse Tesseract, OpenCV morphological operations detect and remove horizontal and vertical lines before OCR. Applied when CV2_AVAILABLE=True. Graceful fallback when OpenCV is unavailable — only the table cleaning optimisation is skipped.
Section 13
Technical Architecture
Module structure, data flow, key dependencies, and the .env → config.py configuration system. Single-operator Windows deployment with Mac development environment.
V3 addition: services/edm_checker.py now owns runtime EDM toggle state, cache-backed AWB existence checks, and V1/V2-compatible EDM metadata request payloads used by Stage 6 fallback.
Configuration System
# .env — machine-specific values, never committed to version control
PIPELINE_BASE_DIR=C:\Users\5834089\Downloads\CCD_Filler
TESSERACT_PATH=C:\Users\5834089\Downloads\CCD_Filler\tesseract.exe
# Matching thresholds
TOLERANCE_HIGH_MAX_DISTANCE=2      # Max Hamming distance for HIGH candidates
TOLERANCE_STANDARD_MAX_DISTANCE=1  # Max Hamming distance for STANDARD
MIN_STAGE_HITS_HIGH_TOL2=2         # 2 stages required for 2-digit tolerance
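How config.py reads these values is not shown, so the following is a hedged sketch of one defensive way to parse an integer threshold. `env_int` is a hypothetical helper; it strips a trailing inline comment so that a value written as `2# comment` still parses as 2.

```python
import os


def env_int(name, default, environ=None):
    """Read an integer setting, tolerating a trailing '# comment' in the value."""
    environ = os.environ if environ is None else environ
    raw = str(environ.get(name, default))
    value = raw.split("#", 1)[0].strip()  # drop any inline comment
    return int(value)


# Example usage with the keys shown above (defaults mirror the documented values).
TOLERANCE_HIGH_MAX_DISTANCE = env_int("TOLERANCE_HIGH_MAX_DISTANCE", 2)
TOLERANCE_STANDARD_MAX_DISTANCE = env_int("TOLERANCE_STANDARD_MAX_DISTANCE", 1)
MIN_STAGE_HITS_HIGH_TOL2 = env_int("MIN_STAGE_HITS_HIGH_TOL2", 2)
```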
Section 15
Risks & Edge Cases
Known failure modes, ambiguous scenarios, and the mitigations currently in place — or flagged as future work.
HIGH · Same AWB in Multiple INBOX Files
Multiple PDFs can arrive containing the same AWB, for example re-downloads from EDM or duplicates sent by shippers. Each is processed independently. At the PROCESSED level, MD5 comparison catches byte-for-byte identical files. Different content with the same AWB gets an auto-incrementing suffix (_2, _3). The audit tracker creates separate rows per file (not per AWB), which is correct but means per-AWB reporting has to group rows by the AWB column.
MED · OCR Digit Misread (1–2 digit errors)
Tesseract commonly misreads 0↔8, 1↔7, 5↔6 in compressed scans. Hamming tolerance (≤2 for HIGH, ≤1 for STANDARD) handles this — but only when the candidate has been seen in multiple stages. The stage-hits requirement prevents a single noisy pass from generating a tolerance match. Risk is higher for AWBs differing from each other by only 1–2 digits, guarded by the unique-match requirement (no ties allowed).
MED · Rotation Probe Choosing the Wrong Angle
The probe can be misled by documents with large blocks of digits at non-AWB angles (barcodes, HS code tables). Stage 4 rotation fallback mitigates this by running OCR at remaining angles. The 80-point flip margin threshold guards against switching from 0° without strong evidence. The quarantine logic prevents any rotation fallback candidates from being accepted without multi-stage confirmation.
MED · HS Codes and Phone Numbers as False Candidates
Commercial invoices contain many 12-digit sequences: HS commodity codes, phone numbers with country codes, and reference numbers. The disqualification filter catches leading-zero sequences and date patterns. For remaining false candidates, the DB check is the final gate — unless the false candidate coincidentally matches a currently loaded AWB record (rare but theoretically possible). Stage-hits requirement reduces this risk significantly.
MED · EDM Token Expiry During Session
The EDM Bearer token in data/token.txt can expire after a few hours. In V3 this is fail-safe: Stage 6 logs a warning and returns an unchecked path (None), so the pipeline continues without hard failure. The UI toggle can also set EDM OFF, which intentionally bypasses API calls until token/auth is restored.
LOW · AWB Excel DB Out of Sync (stale mtime on NFS/SMB)
DB refreshes every 30s by mtime comparison. On some NFS and SMB mounts, mtime doesn't reflect changes correctly, so the pipeline can keep operating on a stale AWB set past the usual 30-second window, until the mtime updates or a reload is forced. The UI can force an immediate reload via the trigger file. Files that fail to match during a stale window route to NEEDS_REVIEW and can be retried after DB refresh.
LOW · Audit Lock File Contention at Scale
The 4-sheet audit workbook uses a file-lock pattern. If two pipeline processes attempt to write simultaneously, one waits up to 15 seconds. Locks held >30s are forcibly cleared. In a single-operator, single-process deployment this is rarely a problem, but at scale (multiple simultaneous pipeline instances) this becomes a serialisation bottleneck. Solution: switch to append-only SQLite audit store.
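The Hamming-tolerance rule described under OCR Digit Misread can be sketched as follows. This is an illustration under assumptions, not the pipeline's matcher: the function names are hypothetical, the stage-hits requirement is applied to every tolerance match for simplicity, and "no ties" is read strictly as exactly one DB AWB within the allowed distance.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def tolerance_match(candidate, db_awbs, max_distance, stage_hits, min_stage_hits=2):
    """Return the matched AWB, or None when the evidence is too weak.

    An exact DB hit is accepted directly. A near miss is accepted only when
    the candidate was seen in enough stages and exactly one DB AWB lies
    within max_distance digit substitutions.
    """
    if candidate in db_awbs:
        return candidate
    if stage_hits < min_stage_hits:
        return None  # a single noisy pass cannot produce a tolerance match
    near = [
        awb for awb in db_awbs
        if len(awb) == len(candidate) and hamming(candidate, awb) <= max_distance
    ]
    return near[0] if len(near) == 1 else None  # ambiguous or empty: no match
```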
Section 16
Future Roadmap
Planned improvements and architectural upgrades — from near-term wins to long-term platform evolution.
NEAR-TERM · EDM Fallback Hardening
Stage 6 is live in V3. Next step is hardening: add endpoint-contract smoke tests, token expiry telemetry, and richer reason codes in audit logs so bypass causes (OFF/no token/auth/network) are visible at a glance.
NEAR-TERM · Per-Record Lineage in Audit Tracker
Introduce a unique processing_id UUID per INBOX file, threading through all audit events from intake to output. Enables complete per-file lineage queries without joining on filenames. Directly resolves the multi-document same-AWB aggregation challenge.
MID-TERM · Persistent OCR Result Caching
Persist OCR cache between pipeline runs. Files that return to NEEDS_REVIEW and are re-submitted could skip re-OCR if their content hasn't changed (MD5 match). Saves 5–15 seconds per re-submission on scanned documents — significant for high-volume operations.
MID-TERM · Operator Review Web Console
A web-based UI (FastAPI + React) for reviewing NEEDS_REVIEW files. Shows the PDF alongside candidates tried, rotation scores, and OCR snippets. Operator confirms, corrects, or rejects the AWB with one click — feeding corrections back to improve future matches.
LONG-TERM · ML-Assisted Confidence Tuning
Train a lightweight classifier on accumulated stage-cache and audit data to predict: (a) which DPI/PSM combination will yield the best result for a given document type, (b) optimal rotation probe angle narrowing from file metadata features. Could reduce average processing time by 30–40% for the hardest document tier.
LONG-TERM · Multi-Worker Parallel Processing
Current architecture is single-worker. A ProcessPoolExecutor could run multiple files in parallel. Would require switching the audit layer from Excel workbook to append-only SQLite — eliminating lock contention at scale and enabling true concurrent processing.
Section 17
Operator Control Centre
The current Tkinter app is the live command surface: login, hotfolder control, EDM runtime toggle, automatic orchestration, manual batch/TIFF actions, folder counts, logs, and safe cleanup.
Primary Runtime
Tk
V3.ui.app_window.App
Auto Interval
10s
AUTO_INTERVAL_SEC default
Batch Gate
2
min clean batches before AUTO build
Manual Escape
Live
operators can intervene at every folder
Primary Controls
Start AWB: Starts or stops the watchdog hotfolder process and the EDM duplicate checker when enabled.
EDM ON/OFF: Writes the runtime toggle in data/edm_toggle.json; Stage 6 and downstream duplicate checks read it live.
AUTO MODE: Runs the unattended loop: wait for INBOX drain, wait for PROCESSED drain, batch when ready, then convert TIFF.
Full Cycle: Runs one supervised cycle without leaving the app in continuous automation mode.
Recovery + Manual Actions
Prepare Batch: Runs V3.services.batch_builder against CLEAN and stages outputs into PENDING_PRINT.
Convert TIFF: Runs the parallel PDF-to-TIFF converter beside the batch PDFs.
Retry Failed: Moves NEEDS_REVIEW PDFs back into INBOX for another pass after DB/token/input fixes.
Clear All: Stops running work, preserves protected files, and clears operational working outputs safely.
Auto Phase | Current Behavior | Safety Guard
1 INBOX drain | Waits until INBOX stays empty for INBOX_EMPTY_STABLE_SECONDS=8. | Max wait 1800s; stop event checked every half second.
2 PROCESSED drain | Lets the EDM duplicate checker move matched files into CLEAN, REJECTED, or CLEAN-UNCHECKED. | If EDM checker is not running, direct PROCESSED-to-CLEAN fallback is logged.
3 Batch readiness | Estimates batch count before building; default AUTO build waits for at least 2 batches. | Force-batches old CLEAN files after AUTO_FORCE_BATCH_AGE_SECONDS=1800.
4 TIFF conversion | Runs after successful batch copy to PENDING_PRINT. | Skip-if-exists prevents repeat conversion of healthy TIFFs.
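Phase 1 of the AUTO loop can be sketched as a small wait function. The name `wait_for_inbox_drain` and the injected `count_inbox` callable are hypothetical; the defaults mirror the values in the table (8 seconds of stable emptiness, 1800-second cap, half-second stop checks).

```python
import threading
import time


def wait_for_inbox_drain(count_inbox, stop_event,
                         stable_seconds=8.0, max_wait=1800.0, poll=0.5):
    """Wait until count_inbox() stays at 0 for stable_seconds.

    Returns "drained", "timeout" (max_wait exceeded) or "stopped"
    (stop_event was set by the operator).
    """
    start = time.monotonic()
    empty_since = None
    while not stop_event.is_set():
        now = time.monotonic()
        if now - start > max_wait:
            return "timeout"
        if count_inbox() == 0:
            if empty_since is None:
                empty_since = now  # start of the current empty streak
            if now - empty_since >= stable_seconds:
                return "drained"
        else:
            empty_since = None  # a new file arrived; restart the streak
        stop_event.wait(poll)  # sleeps, but wakes early if stop is requested
    return "stopped"
```

Using `stop_event.wait(poll)` instead of `time.sleep(poll)` lets a stop request interrupt the wait immediately rather than after the next poll.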
Current operator model: the app is intentionally local-first. It uses folder counts, line-capped colored logs, progress animation, session state, protected-file cleanup rules, and one-click launchers instead of requiring a browser server or external queue service.
Section 18
Current Process Snapshot
A concise version map of the repo as it exists now: what is live, what is configurable, and which files own each part of the process.
Live In V3
Two-pass hotfolder, fast-lane ProbeLite, long-pass timeout, third-pass resume, Stage-6 EDM fallback, downstream EDM duplicate checker, batch builder, TIFF converter, and Tk control centre.
Config Driven
Paths, Tesseract, OCR DPI, rotation thresholds, EDM endpoints, duplicate thresholds, batch limits, TIFF settings, and AUTO mode timing all flow through V3/config.py and .env.
Documented Now
README, operations runbook, and this visual guide now line up around the same end-to-end flow: INBOX -> PROCESSED -> CLEAN/REJECTED -> OUT -> PENDING_PRINT.
Process Area | Owner File | Current Responsibility
Launcher + App Entry | V3/launcher.py, V3/app.py | Starts the branded UI and cross-platform control surface.
The current changelog records documentation alignment work as Unreleased: README and operations docs were updated for ProbeLite, third-pass resume, EDM Stage 6, downstream duplicate routing, current config keys, and controlled shutdown behavior.
Reality Check
This page avoids hard-coding operational volume claims where the code reads the live Excel DB at runtime. The reliable source of truth is the current data/AWB_dB.xlsx load count printed by the hotfolder log.