Emotion Distribution
Sentiment Over Time
Emotions by Source
Emotions by Shape (Top 10)
Unified UFO Sightings Database — Methodology
This document describes how 614,505 sighting records were assembled from five major UFO/UAP databases, deduplicated across sources, and enriched with cross-source metadata. 126,730 duplicate candidate pairs are flagged for review. No records are deleted.
Source Databases
| Source | Raw Records | Imported | Skipped | Description |
|---|---|---|---|---|
| UFOCAT | 320,412 | 197,108 | 123,304 | CUFOS UFOCAT 2023 catalog. Richest metadata: Hynek/Vallee classifications, lat/lon, witness counts, durations. 123K NUFORC-origin records (SOURCE=UFOReportCtr) skipped; metadata transferred via enrichment. |
| NUFORC | 159,320 | 159,320 | 0 | National UFO Reporting Center. Self-reported sightings with detailed free-text descriptions. Enriched post-import with 102K Hynek and 83K Vallee classifications from UFOCAT. |
| MUFON | 138,310 | 138,310 | 0 | Mutual UFO Network case reports. Short + long descriptions, investigator summaries. |
| UPDB | 1,885,757 | 65,016 | 1,820,741 | Unified Phenomena Database (phenomenAInon). 1.82M rows skipped (MUFON/NUFORC already imported from richer originals). Remaining 65K from UFODNA (38K), Blue Book (14K), NICAP (5.8K), etc. |
| UFO-search | 54,751 | 54,751 | 0 | Majestic Timeline compilation from ufo-search.com. Historical records from 19 source compilations (Hatch, Eberhart, NICAP, Vallee, etc.). |
Total raw records across all sources: ~2.56 million. After removing known overlaps at import time: 614,505.
UFOCAT Sub-Source Landscape
UFOCAT is itself an aggregator. Its SOURCE column identifies where each record originated:
| UFOCAT SOURCE | Records | Overlap With |
|---|---|---|
UFOReportCtr | 123,304 | NUFORC (skipped, enriched) |
U (Hatch) | 17,184 | UFO-search Hatch (18K) |
BlueBook1 | 13,101 | UPDB Blue Book (14K) |
GEberhart1 | 11,643 | UFO-search Eberhart (7.9K) |
CanadUFOSurv | 10,785 | — |
NICAP | 2,315 | UPDB NICAP (5.8K), UFO-search NICAP (5.5K) |
MUFONJournal + MUFON* | 2,861 | MUFON |
Only UFOReportCtr is skipped at import time. Other overlaps are handled by the deduplication engine.
Import Methodology
Each source has a custom import script. Two aggregator sources skip known-duplicate sub-sources at import time:
- UFOCAT skips
SOURCE=UFOReportCtr(123K NUFORC-origin records) - UPDB skips
name=MUFONandname=NUFORC(1.82M records)
Source-Specific Handling
- UFOCAT — 55-column CSV with split date fields (YEAR, MO, DAY, TIME). City stored in ALL CAPS in
raw_text; copied tocitypost-import. Longitude negated for US/CA.UFOReportCtrrecords saved to enrichment sidecar. - NUFORC — Multi-line CSV with quoted descriptions. Dates:
1995-02-02 23:00 Local. Locations:City, ST, Country. - MUFON — 7-column CSV with embedded
\nin dates. Locations with escaped commas:Newscandia\, MN\, US. - UPDB — 1.9M rows;
namecolumn identifies sub-source. 1,820,741 MUFON/NUFORC rows skipped. Remaining 65,016 mapped tosource_originentries. - UFO-search — JSON array of 54,751 records from 19 historical compilations. Variable date formats ("Summer 1947", "4/34", "6/24/1947"). Regex-based date parser; free-text location parsing.
Data Quality Fixes
Applied automatically by rebuild_db.py in the apply_data_fixes() pipeline. All fixes are idempotent and preserve original values in date_event_raw and raw_json columns.
Location Fixes
- UFOCAT longitude sign — 30,822 Western Hemisphere locations had positive longitude; negated for US/CA records
- UFOCAT city field — 73,766 locations had city only in
raw_text; copied tocitycolumn - Country code normalization — USA→US, United Kingdom→GB, Canada→CA, Australia→AU
Date Fixes
- MUFON literal
\nin dates — 136,654 MUFON records contained a literal backslash-n (0x5C6E) indate_event(e.g.,2020-01-15\n3:00PM). Time portion extracted totime_raw, date truncated to ISO date - Year 0000 — Records with year
0000havedate_eventset to NULL - Negative years — Records with
date_eventstarting with-(e.g.,-009-02-10) havedate_eventset to NULL - Month 00 — 551 records with
YYYY-00-DDtruncated toYYYY - Day 00 — 3,391 records with
YYYY-MM-00truncated toYYYY-MM - Impossible calendar dates — 14 records with Feb 30+, Apr/Jun/Sep/Nov 31 truncated to
YYYY-MM - UFOCAT century-only
19//— 692 records with 2-digit raw year19(meaning “19xx, year unknown”) haddate_eventset to NULL. Audit logged indate_correctiontable - UFOCAT H-BOMB TEST
195//— 1 record with 3-digit year 195 and city “H-BOMB TEST” (clearly 1950s) set to NULL - NUFORC data entry errors — 2 records with wrong century corrected:
0205→2005(Falmouth),1721→2021(Crescent City, Starlink-era sighting) - UPDB mangled years — 19 records with broken years from upstream UPDB export: 1 corrected (
0196→1962, confirmed by description), 18 set to NULL (century-round years0200–0900and unconfirmed modern dates)
Shape / Classification Normalization
- Shape case normalization — 24 case-duplicate groups collapsed via title-case (e.g.,
circle→Circle), including hyphenated shapes (e.g.,cigar-shaped→Cigar-Shaped). 352→317 distinct values - Shape typo correction — 9 misspellings corrected (e.g.,
Triangel→Triangle,Rectagle→Rectangle) - Junk shapes removed — 3 non-shape values set to NULL (
WITNESS,0,12:45) - Hynek uppercase — 3 case-duplicate Hynek codes normalized (e.g.,
nl→NL). 43→40 distinct values - Vallee uppercase — 2 case-duplicate Vallee codes normalized (e.g.,
fb1→FB1). 43→41 distinct values
Description Cleanup
[MISSING DATA]removal — Records with description consisting solely of[MISSING DATA]or[missing data]set to NULL- MUFON boilerplate —
Submitted by razor via e-mailandInvestigator Notes:boilerplate stripped from descriptions
Historic Date Analysis
The database contains 8,046 sighting records with dates before 1901, spanning from 34 AD (a white round object over China) to 1900. Most are legitimate historical sightings from academic catalogs, but several categories of date errors were identified through systematic analysis.
Extraction & Analysis Method
All pre-1901 records were extracted into a standalone analysis database (temp/historic_pre1901.db) using extract_historic.py. Each record was auto-classified based on its source, raw date format, and year digit count, then flagged for manual review where ambiguous.
Categories Identified
| Category | Source | Records | Status | Description |
|---|---|---|---|---|
ufocat_ancient | UFOCAT | 4,436 | OK | 4-digit raw years (1001–1900). Legitimately pre-modern sightings. No action needed. |
ufocat_century_only | UFOCAT | 692 | Fixed | 2-digit raw year 19// = “sometime in the 1900s, year unknown.” ETL zero-padded to 0019. Descriptions confirm modern events (abductions, radar, motion pictures). Resolution: date_event set to NULL (year genuinely unknown). |
ufocat_3digit_review | UFOCAT | 88 | Fixed | 3-digit raw years (034–999). Mostly legitimate ancient dates. 1 confirmed modern mislabel corrected: 195// “H-BOMB TEST” (1950s) → NULL. 4 ambiguous 188// records left as-is (no descriptions to disambiguate). Remaining 83 are legitimate ancient sightings. |
other_source_review | UFO-search | 1,984 | OK | Geldreich Majestic Timeline. Historical records from 61 AD to 1900. All appear legitimate. |
other_source_review | UPDB | 780 | Fixed | ~760 legitimate (1000–1900). 19 records had mangled modern years from upstream data errors. Resolution: 1 corrected (0196→1962, confirmed by description), 18 set to NULL (century-round years 0200–0900 and unconfirmed modern dates). |
other_source_review | MUFON | 40 | OK | All 1890–1900. Appear legitimate. |
other_source_review | NUFORC | 26 | Fixed | ~23 legitimate historic reports. 2 data entry errors corrected: 0205→2005 (Falmouth), 1721→2021 (Crescent City). 1 ambiguous record (1071) left as-is (could be 1971 or 2007). |
Root Cause: UFOCAT Variable-Length Year Field
UFOCAT stores dates in separate YEAR, MO, DAY columns. The YEAR field uses variable-length encoding:
- 4 digits (196,295 records): Standard years like
1966,2001 - 3 digits (90 records): Ancient years like
034(34 AD),776,919 - 2 digits (715 records): Century-only indicator
19= “19xx” (20th century, unknown year)
The ETL’s parse_ufocat_date() zero-pads all years to 4 digits (f"{y:04d}"), which correctly handles 3-digit ancient years but misinterprets 2-digit 19 as year 19 AD instead of “19xx.”
Applied Resolution
After manual review of the annotated analysis dataset, 714 records were corrected or nulled via Fixes 15–18 in rebuild_db.py. Every correction is logged in the date_correction audit table with the original date, corrected date, correction type, and reason. See GitHub issue #1.
| Fix | Source | Action | Count |
|---|---|---|---|
| Fix 15 | UFOCAT | Century-only 19// → NULL (year unknown) | 692 |
| Fix 16 | UFOCAT | H-BOMB TEST 195// → NULL (1950s, not 195 AD) | 1 |
| Fix 17 | NUFORC | Data entry errors corrected (0205→2005, 1721→2021) | 2 |
| Fix 18 | UPDB | Mangled years corrected/nulled (0196→1962, rest → NULL) | 19 |
Conservative approach: only records with clear evidence were corrected. Ambiguous records (188// in UFOCAT, 1071 in NUFORC) were left unchanged. All fixes are idempotent and re-applied on each database rebuild.
Deduplication Methodology
Deduplication uses a two-phase strategy: known overlaps are eliminated at import time, then a three-tier matching engine flags remaining cross-source duplicates for review. No records are deleted — all 614,505 sightings remain, with 126,730 candidate pairs stored in the duplicate_candidate table.
Phase 1: Import-Time Filtering
Before deduplication runs, two aggregator sources skip sub-sources that would create known duplicates with higher-quality originals already imported:
| Source | Sub-Source Skipped | Records Skipped | Reason |
|---|---|---|---|
| UFOCAT | SOURCE=UFOReportCtr | 123,304 | Copies of NUFORC sightings |
| UPDB | name=MUFON | 131,506 | MUFON imported directly with richer descriptions |
| UPDB | name=NUFORC | 1,689,235 | NUFORC imported directly with richer descriptions |
This eliminates 1,944,045 known duplicates before dedup begins, reducing the working set from ~2.56M to 614,505. The UFOCAT skip triggers enrichment to preserve valuable Hynek/Vallee metadata.
Other overlapping sub-sources (e.g. UFOCAT's Hatch records vs UFO-search's Hatch records) are kept and handled by the dedup engine, since both copies may carry unique metadata.
Phase 1.5: Metadata Enrichment
UFOCAT's 123K skipped UFOReportCtr records carry Hynek and Vallee classifications that NUFORC natively lacks. Rather than lose this data, import_ufocat.py writes skipped records to a sidecar file (ufocat_enrichment.jsonl), and enrich.py transfers the metadata to matching NUFORC sightings post-import.
Matching: Date (YYYY-MM-DD) + normalized UPPER(city) + UPPER(state). City normalization strips parenthetical qualifiers, trailing punctuation, and collapses whitespace.
Transfer rules: Only fills NULL fields — never overwrites existing NUFORC values.
| Field | NUFORC Records Enriched |
|---|---|
| Hynek classification | 102,554 |
| Vallee classification | 83,710 |
| Shape | 1,697 |
| Unmatched (no NUFORC hit) | 19,637 |
Phase 2: Three-Tier Cross-Source Matching
After all imports and enrichment, the dedup engine (dedup.py) compares records across different sources using progressively broader matching strategies. Each tier builds on the previous, skipping pairs already flagged.
Tier 1: MUFON ↔ NUFORC (7,694 pairs)
The highest-overlap pair. Both sources cover modern US sightings with reliable date/location data.
- Match key: Exact date (YYYY-MM-DD) + UPPER(city) + UPPER(state)
- Scoring: Full description similarity with source-specific preprocessing
- Result: 7,694 candidate pairs
Tier 2: All Remaining Cross-Source Pairs (101,879 pairs)
Four sub-tiers cover every remaining source combination, using the match key best suited to each source's location data quality:
| Sub-tier | Sources | Match Key | Why This Key | Pairs |
|---|---|---|---|---|
| 2a | MUFON ↔ UFOCAT | date + city + state | Both have structured state fields | 2,295 |
| 2b | NUFORC ↔ UFOCAT | date + city + state | Both have structured state fields | 4,148 |
| 2c | UPDB ↔ MUFON/NUFORC/UFOCAT | date + city (no state) | UPDB has inconsistent state data | 63,459 |
| 2d | UFO-search ↔ MUFON/NUFORC/UFOCAT | date + city + state | UFO-search locations parsed via regex | 31,977 |
Source-specific notes:
- UFOCAT cities are stored in
raw_text(ALL CAPS), notcity— the loader readsraw_textinstead - UFO-search locations are free-text strings parsed by regex to extract
(city, state)pairs; only locations matching theCity, STpattern with a valid US/Canadian state code are matchable - UPDB sub-tier (2c) filters to US records only (
country='US') to reduce false positives from city-only matching - All candidate pairs are normalized so
sighting_id_a < sighting_id_bto enforce the UNIQUE constraint
Tier 3: Description Fuzzy Matching (17,157 pairs)
Catches duplicates that Tiers 1–2 miss due to location data differences (misspellings, missing state, different geocoding).
- Match key: Date only (no location requirement)
- Scope: Only dates with records from 2+ sources AND ≤20 total records on that date
- Skip: Pairs already found in Tiers 1–2 are excluded
- Two-stage filtering:
- Token Jaccard > 0.25 — Fast set-intersection filter on lowercased word tokens
- SequenceMatcher ≥ 0.5 — Python's
difflib.SequenceMatcheron the first 1,000 characters
- Result: 17,157 candidates from cross-source pairs sharing a date but not caught by location matching
Similarity Scoring
Every candidate pair receives a similarity score (0.0–1.0) computed by compute_similarity():
- Source-specific preprocessing:
- NUFORC: Strips
NUFORC UFO Sighting NNNNNprefix - MUFON: Strips
Submitted by razor via e-mailboilerplate, extracts investigator notes
- NUFORC: Strips
- "Starts with" shortcut: If both descriptions share the same first N characters (N ≥ 20), score = 0.95
- Token Jaccard pre-filter: If token Jaccard < 0.03, return that score immediately
- Full alignment:
difflib.SequenceMatcheron first 1,000 characters of each description
Pairs with no description on either side receive score = 0.0 (still flagged as candidates based on location matching).
Results
126,730 duplicate candidate pairs across 127,440 unique sightings (20.7% of all records).
| Confidence | Score Range | Pairs | Interpretation |
|---|---|---|---|
| Certain | 0.9 – 1.0 | 14,260 | Near-identical descriptions; safe to auto-merge |
| Likely | 0.7 – 0.9 | 9,567 | Strong match; minor wording differences |
| Possible | 0.5 – 0.7 | 13,303 | Same event reported differently across sources |
| Weak | 0.3 – 0.5 | 11,144 | Same date+location, descriptions partially overlap; needs review |
| Unlikely | 0.0 – 0.3 | 78,456 | Same date+location but likely different events |
By Match Method
| Method | Pairs | Avg Score |
|---|---|---|
tier2c_updb_ufocat | 59,620 | 0.225 |
tier2d_ufosearch_ufocat | 31,439 | 0.240 |
tier3_desc_fuzzy | 17,157 | 0.768 |
tier1a_mufon_nuforc | 7,694 | 0.226 |
tier2b_nuforc_ufocat | 4,148 | 0.129 |
tier2c_updb_nuforc | 3,519 | 0.234 |
tier2a_mufon_ufocat | 2,295 | 0.072 |
tier2d_ufosearch_nuforc | 397 | 0.044 |
tier2c_updb_mufon | 320 | 0.012 |
tier2d_ufosearch_mufon | 141 | 0.009 |
Note: The previous build flagged 242K candidates. The current build flags only 126K because the 123K UFOCAT-NUFORC duplicates (UFOReportCtr) are now prevented at import time rather than flagged after the fact.
What Dedup Does NOT Do
- No records are deleted or merged. The
duplicate_candidatetable is advisory. All 614,505 sightings remain queryable. - No within-source dedup. The engine only flags cross-source pairs (different
source_db_id). Duplicates within a single source are not flagged. - No transitive closure. If A↔B and B↔C are both flagged, A↔C is NOT automatically inferred. Each pair is independent.
- Multiple witnesses are preserved. If the same event has genuinely separate witness reports in different sources, both records remain. The similarity score distinguishes true duplicates (high score) from independent reports of the same event (low score).
Database Schema
sighting (614,505 rows, 42 columns)
The main table. Each row is one reported sighting event.
| Category | Fields |
|---|---|
| Provenance | source_db_id, source_record_id, origin_id, origin_record_id |
| Dates | date_event (ISO 8601), date_event_raw, date_end, time_raw, timezone, date_reported, date_posted |
| Location | location_id (FK to location table) |
| Description | summary, description |
| Observation | shape, color, size_estimated, angular_size, distance, duration, duration_seconds, num_objects, num_witnesses, sound, direction, elevation_angle, viewed_from |
| Witness | witness_age, witness_sex, witness_names |
| Classification | hynek, vallee, event_type, svp_rating |
| Resolution | explanation, characteristics |
| Context | weather, terrain, source_ref, page_volume, notes |
| Preservation | raw_json — complete original record as JSON |
Supporting Tables
location— Deduplicated locations withraw_text,city,county,state,country,region,latitude,longitudesource_collection(3 rows) — Top-level provenance grouping:- PUBLIUS — Compiled by Publius from original reporting sites and PhenomAInon downloads (MUFON, NUFORC, UPDB)
- GELDREICH — Rich Geldreich's Majestic Timeline compilation from 19+ historical sources (UFO-search)
- UFOCAT — CUFOS UFOCAT catalog, independent academic dataset
source_database(5 rows) — UFOCAT, NUFORC, MUFON, UPDB, UFO-search. Each linked to a collection viacollection_idsource_origin(31 rows) — Upstream sources within aggregator databases (Blue Book, NICAP, Hatch, etc.)duplicate_candidate(126,730 rows) — Flagged duplicate pairs with similarity scores
Reproducible Build Pipeline
The entire database can be rebuilt from source files with a single command:
python rebuild_db.py
This runs the full pipeline in order:
- Create schema (
create_schema.py) - Import all 5 sources (UFOCAT with enrichment sidecar, NUFORC, MUFON, UPDB, UFO-search)
- Apply data quality fixes — 14 fix categories covering dates, locations, shapes, classifications, and descriptions
- Geocode locations using GeoNames gazetteer (
geocode.py) - Run enrichment (
enrich.py) - Run three-tier deduplication (
dedup.py) - Copy database to explorer (
ufo-explorer/ufo_unified.db)
Total build time: ~2 minutes.
Test Suite
The pipeline is validated by 364 automated tests (pytest tests/):
- 115 dedup tests (
test_dedup.py) — All dedup functions, tier logic, similarity scoring, and edge cases - 114 ETL tests (
test_etl.py) — Schema creation, all 5 importers, data fix pipeline - 135 data quality tests (
test_data_quality.py) — Shape normalization, date validation, classification cleanup, description fixes, historic date corrections, and audit trail
All tests use an in-memory SQLite database with synthetic data — no production data required.
Geocoding
Only UFOCAT provides latitude/longitude coordinates natively. The other four sources have text-only locations (city, state, country). To enable map visualization for all sources, locations are geocoded using the GeoNames cities15000 gazetteer (33,000+ cities with population > 15,000).
Matching strategy (decreasing specificity):
- Exact: UPPER(city) + state + country → highest confidence
- City + country: No state available (e.g., UPDB) → picks largest matching city by population
- City only: No country available (e.g., UFO-search raw text) → picks largest city globally
- Raw text parsing: Regex extraction of city/state/country from free-text location strings
The geocode_src column on the location table tracks provenance: NULL = coordinates from original source data, geonames_exact / geonames_city_country / geonames_city_only = geocoded via GeoNames. The map includes a dropdown to toggle between “All Coords”, “Original Only”, and “Geocoded Only”.
Source Collection Provenance
Each source database belongs to a collection — a top-level grouping that identifies who aggregated or curated the data:
| Collection | Source(s) | Records | Description |
|---|---|---|---|
| PUBLIUS | MUFON, NUFORC, UPDB | 362,646 | Compiled by Publius from original reporting sites (MUFON, NUFORC) and PhenomAInon downloads (UPDB sub-sources: UFODNA, Blue Book, NICAP, UKTNA, CANADAGOV, NIDS, BRAZILGOV, SKINWALKER, PILOTS, BAASS) |
| UFOCAT | UFOCAT | 197,108 | CUFOS academic catalog (2023 release) |
| GELDREICH | UFO-search | 54,751 | Rich Geldreich's Majestic Timeline from 19+ historical compilations |
Collections are filterable in the explorer UI. The three-layer provenance model (source_collection → source_database → source_origin) traces every record back to its ultimate origin.