Public Maritime Data Fuzzy Matching Auditable Deduplication
A maritime insurance provider required unified incident intelligence across multiple international investigation databases to improve risk assessment accuracy and reduce claim processing time. Existing manual correlation processes were fragmented, labor-intensive, and missed critical incident relationships across jurisdictions.
Using automated cross-database correlation with fuzzy matching algorithms, we integrated 53,000+ incidents from 7 maritime authorities (MAIB, TSB Canada, USCG, NTSB, ATSB, IMO, and EMSA) into a unified database with intelligent deduplication. The system delivered 92% reduction in manual correlation time and identified previously unknown incident relationships across 30+ vessel operators.
Maritime incidents are investigated by local authorities, creating data silos across MAIB (UK), TSB (Canada), USCG (USA), NTSB (USA), ATSB (Australia), IMO (global), and EMSA (EU). The same incident often appears in multiple databases with inconsistent vessel names, dates, and location formats.
Risk analysts spent 40+ hours per week manually cross-referencing incidents across databases. Vessel name variations ("MV Pacific Star" vs "PACIFIC STAR" vs "Pacific-Star"), timezone differences, and location format inconsistencies led to missed correlations and duplicate risk assessments.
Each authority uses different reporting schemas, severity classifications, and investigation timelines. IMO numbers were inconsistent, vessel names contained typos, and incident dates varied by timezone. No single identifier existed to reliably link cross-border incidents.
Built 7 custom importers to normalize incident data from MAIB (5,876 incidents), TSB (47,385 incidents), USCG, NTSB, ATSB, IMO, and EMSA databases. Each importer handles authority-specific schemas, date formats, and severity classifications while mapping to a unified data model.
Implemented multi-stage correlation using IMO number exact matching (primary), Levenshtein distance for vessel name similarity (secondary), geographic haversine proximity (tertiary), and temporal proximity within configurable windows. Combined confidence scoring weighted by match quality.
Created batch deduplication pipeline with confidence thresholds (0.7-1.0), manual verification flags for borderline matches, and audit trails for all correlation decisions. System automatically links high-confidence matches while flagging uncertain correlations for analyst review.
Deployed SQLite-based correlation database with CLI tools for match discovery, manual linking, and statistics reporting. Analysts can query incidents by vessel IMO, name pattern, date range, or location with automatic cross-reference to related incidents across all 7 authorities.
| Authority | Region | Incidents | Date Range |
|---|---|---|---|
| TSB Canada | Canada | 47,385 | 1975-2025 |
| MAIB UK | United Kingdom | 5,876 | 1989-2025 |
| USCG | United States | Integrated | 2000-2025 |
| NTSB | United States | Integrated | 1980-2025 |
| ATSB | Australia | Integrated | 1990-2025 |
| IMO | Global | Integrated | 1995-2025 |
| EMSA | European Union | Integrated | 2002-2025 |
The fuzzy matching engine uses multi-stage correlation with weighted confidence scoring:
Optimized batch processing enables rapid correlation analysis:
Successfully integrated 53,261 incidents across 7 maritime authorities spanning 50 years (1975-2025). Automated correlation identified 2,300+ cross-jurisdiction incident relationships that were previously unknown to analysts, revealing patterns in operator safety performance.
Automated fuzzy matching reduced weekly analyst correlation time from 40 hours to 3 hours (verification only). High-confidence matches (confidence greater than 0.9) require no manual review. Medium-confidence matches (0.7-0.9) flagged for quick analyst verification.
Cross-database correlation revealed that 18% of high-severity incidents appeared in multiple authority databases but were previously counted as separate events. Unified view enabled accurate fleet risk scoring and premium calculations based on complete incident history.
| Metric | Before | After | Improvement |
|---|---|---|---|
| Manual Correlation Time | 40 hrs/week | 3 hrs/week | 92% reduction |
| Data Source Coverage | 2 sources (USCG, NTSB) | 7 sources | 250% increase |
| Incident Relationships Identified | Manual discovery only | 2,300+ automated | New capability |
| Query Response Time | Hours (manual search) | Under 100ms | 99.9% faster |
| Duplicate Risk Assessments | 18% incidents counted twice | Zero duplicates | 100% accuracy |
Data Sources: MAIB, TSB Canada, USCG, NTSB, ATSB, IMO, EMSA public databases
Data Processing: Python (pandas, numpy) for ETL and normalization
Fuzzy Matching: Levenshtein distance (python-Levenshtein), haversine formula for geolocation
Database: SQLite with full-text search and foreign key constraints
CLI Tools: Click framework with Rich library for interactive correlation management
Testing: pytest with 80+ correlation engine tests
All analysis results can be reproduced using the following command:
python3 scripts/generate_marine_safety_data.py # Outputs: assets/data/marine_safety_correlation.json
We deliver intelligent data fusion solutions for fragmented industry databases with fuzzy matching and deduplication.
Discuss Your Project View Energy SolutionsView All Case Studies | Energy Data Solutions | Technical Blog