Loading map...
Sightings by Year
Sentiment & Emotion Insights

Emotion Distribution

Sentiment Over Time

Emotions by Source

Emotions by Shape (Top 10)

Unified UFO Sightings Database — Methodology

This document describes how 614,505 sighting records were assembled from five major UFO/UAP databases, deduplicated across sources, and enriched with cross-source metadata. 126,730 duplicate candidate pairs are flagged for review. No records are deleted.

Source Databases

SourceRaw RecordsImportedSkippedDescription
UFOCAT 320,412197,108123,304 CUFOS UFOCAT 2023 catalog. Richest metadata: Hynek/Vallee classifications, lat/lon, witness counts, durations. 123K NUFORC-origin records (SOURCE=UFOReportCtr) skipped; metadata transferred via enrichment.
NUFORC 159,320159,3200 National UFO Reporting Center. Self-reported sightings with detailed free-text descriptions. Enriched post-import with 102K Hynek and 83K Vallee classifications from UFOCAT.
MUFON 138,310138,3100 Mutual UFO Network case reports. Short + long descriptions, investigator summaries.
UPDB 1,885,75765,0161,820,741 Unified Phenomena Database (phenomenAInon). 1.82M rows skipped (MUFON/NUFORC already imported from richer originals). Remaining 65K from UFODNA (38K), Blue Book (14K), NICAP (5.8K), etc.
UFO-search 54,75154,7510 Majestic Timeline compilation from ufo-search.com. Historical records from 19 source compilations (Hatch, Eberhart, NICAP, Vallee, etc.).

Total raw records across all sources: ~2.56 million. After removing known overlaps at import time: 614,505.

UFOCAT Sub-Source Landscape

UFOCAT is itself an aggregator. Its SOURCE column identifies where each record originated:

UFOCAT SOURCERecordsOverlap With
UFOReportCtr123,304NUFORC (skipped, enriched)
U (Hatch)17,184UFO-search Hatch (18K)
BlueBook113,101UPDB Blue Book (14K)
GEberhart111,643UFO-search Eberhart (7.9K)
CanadUFOSurv10,785
NICAP2,315UPDB NICAP (5.8K), UFO-search NICAP (5.5K)
MUFONJournal + MUFON*2,861MUFON

Only UFOReportCtr is skipped at import time. Other overlaps are handled by the deduplication engine.

Import Methodology

Each source has a custom import script. Two aggregator sources skip known-duplicate sub-sources at import time:

  • UFOCAT skips SOURCE=UFOReportCtr (123K NUFORC-origin records)
  • UPDB skips name=MUFON and name=NUFORC (1.82M records)

Source-Specific Handling

  • UFOCAT — 55-column CSV with split date fields (YEAR, MO, DAY, TIME). City stored in ALL CAPS in raw_text; copied to city post-import. Longitude negated for US/CA. UFOReportCtr records saved to enrichment sidecar.
  • NUFORC — Multi-line CSV with quoted descriptions. Dates: 1995-02-02 23:00 Local. Locations: City, ST, Country.
  • MUFON — 7-column CSV with embedded \n in dates. Locations with escaped commas: Newscandia\, MN\, US.
  • UPDB — 1.9M rows; name column identifies sub-source. 1,820,741 MUFON/NUFORC rows skipped. Remaining 65,016 mapped to source_origin entries.
  • UFO-search — JSON array of 54,751 records from 19 historical compilations. Variable date formats ("Summer 1947", "4/34", "6/24/1947"). Regex-based date parser; free-text location parsing.

Data Quality Fixes

Applied automatically by rebuild_db.py in the apply_data_fixes() pipeline. All fixes are idempotent and preserve original values in date_event_raw and raw_json columns.

Location Fixes
  • UFOCAT longitude sign — 30,822 Western Hemisphere locations had positive longitude; negated for US/CA records
  • UFOCAT city field — 73,766 locations had city only in raw_text; copied to city column
  • Country code normalization — USA→US, United Kingdom→GB, Canada→CA, Australia→AU
Date Fixes
  • MUFON literal \n in dates — 136,654 MUFON records contained a literal backslash-n (0x5C6E) in date_event (e.g., 2020-01-15\n3:00PM). Time portion extracted to time_raw, date truncated to ISO date
  • Year 0000 — Records with year 0000 have date_event set to NULL
  • Negative years — Records with date_event starting with - (e.g., -009-02-10) have date_event set to NULL
  • Month 00 — 551 records with YYYY-00-DD truncated to YYYY
  • Day 00 — 3,391 records with YYYY-MM-00 truncated to YYYY-MM
  • Impossible calendar dates — 14 records with Feb 30+, Apr/Jun/Sep/Nov 31 truncated to YYYY-MM
  • UFOCAT century-only 19// — 692 records with 2-digit raw year 19 (meaning “19xx, year unknown”) had date_event set to NULL. Audit logged in date_correction table
  • UFOCAT H-BOMB TEST 195// — 1 record with 3-digit year 195 and city “H-BOMB TEST” (clearly 1950s) set to NULL
  • NUFORC data entry errors — 2 records with wrong century corrected: 02052005 (Falmouth), 17212021 (Crescent City, Starlink-era sighting)
  • UPDB mangled years — 19 records with broken years from upstream UPDB export: 1 corrected (01961962, confirmed by description), 18 set to NULL (century-round years 02000900 and unconfirmed modern dates)
Shape / Classification Normalization
  • Shape case normalization — 24 case-duplicate groups collapsed via title-case (e.g., circleCircle), including hyphenated shapes (e.g., cigar-shapedCigar-Shaped). 352→317 distinct values
  • Shape typo correction — 9 misspellings corrected (e.g., TriangelTriangle, RectagleRectangle)
  • Junk shapes removed — 3 non-shape values set to NULL (WITNESS, 0, 12:45)
  • Hynek uppercase — 3 case-duplicate Hynek codes normalized (e.g., nlNL). 43→40 distinct values
  • Vallee uppercase — 2 case-duplicate Vallee codes normalized (e.g., fb1FB1). 43→41 distinct values
Description Cleanup
  • [MISSING DATA] removal — Records with description consisting solely of [MISSING DATA] or [missing data] set to NULL
  • MUFON boilerplateSubmitted by razor via e-mail and Investigator Notes: boilerplate stripped from descriptions

Historic Date Analysis

The database contains 8,046 sighting records with dates before 1901, spanning from 34 AD (a white round object over China) to 1900. Most are legitimate historical sightings from academic catalogs, but several categories of date errors were identified through systematic analysis.

Extraction & Analysis Method

All pre-1901 records were extracted into a standalone analysis database (temp/historic_pre1901.db) using extract_historic.py. Each record was auto-classified based on its source, raw date format, and year digit count, then flagged for manual review where ambiguous.

Categories Identified

CategorySourceRecordsStatusDescription
ufocat_ancientUFOCAT4,436 OK 4-digit raw years (1001–1900). Legitimately pre-modern sightings. No action needed.
ufocat_century_onlyUFOCAT692 Fixed 2-digit raw year 19// = “sometime in the 1900s, year unknown.” ETL zero-padded to 0019. Descriptions confirm modern events (abductions, radar, motion pictures). Resolution: date_event set to NULL (year genuinely unknown).
ufocat_3digit_reviewUFOCAT88 Fixed 3-digit raw years (034–999). Mostly legitimate ancient dates. 1 confirmed modern mislabel corrected: 195// “H-BOMB TEST” (1950s) → NULL. 4 ambiguous 188// records left as-is (no descriptions to disambiguate). Remaining 83 are legitimate ancient sightings.
other_source_reviewUFO-search1,984 OK Geldreich Majestic Timeline. Historical records from 61 AD to 1900. All appear legitimate.
other_source_reviewUPDB780 Fixed ~760 legitimate (1000–1900). 19 records had mangled modern years from upstream data errors. Resolution: 1 corrected (0196→1962, confirmed by description), 18 set to NULL (century-round years 02000900 and unconfirmed modern dates).
other_source_reviewMUFON40 OK All 1890–1900. Appear legitimate.
other_source_reviewNUFORC26 Fixed ~23 legitimate historic reports. 2 data entry errors corrected: 0205→2005 (Falmouth), 1721→2021 (Crescent City). 1 ambiguous record (1071) left as-is (could be 1971 or 2007).

Root Cause: UFOCAT Variable-Length Year Field

UFOCAT stores dates in separate YEAR, MO, DAY columns. The YEAR field uses variable-length encoding:

  • 4 digits (196,295 records): Standard years like 1966, 2001
  • 3 digits (90 records): Ancient years like 034 (34 AD), 776, 919
  • 2 digits (715 records): Century-only indicator 19 = “19xx” (20th century, unknown year)

The ETL’s parse_ufocat_date() zero-pads all years to 4 digits (f"{y:04d}"), which correctly handles 3-digit ancient years but misinterprets 2-digit 19 as year 19 AD instead of “19xx.”

Applied Resolution

After manual review of the annotated analysis dataset, 714 records were corrected or nulled via Fixes 15–18 in rebuild_db.py. Every correction is logged in the date_correction audit table with the original date, corrected date, correction type, and reason. See GitHub issue #1.

FixSourceActionCount
Fix 15UFOCATCentury-only 19// → NULL (year unknown)692
Fix 16UFOCATH-BOMB TEST 195// → NULL (1950s, not 195 AD)1
Fix 17NUFORCData entry errors corrected (0205→2005, 1721→2021)2
Fix 18UPDBMangled years corrected/nulled (0196→1962, rest → NULL)19

Conservative approach: only records with clear evidence were corrected. Ambiguous records (188// in UFOCAT, 1071 in NUFORC) were left unchanged. All fixes are idempotent and re-applied on each database rebuild.

Deduplication Methodology

Deduplication uses a two-phase strategy: known overlaps are eliminated at import time, then a three-tier matching engine flags remaining cross-source duplicates for review. No records are deleted — all 614,505 sightings remain, with 126,730 candidate pairs stored in the duplicate_candidate table.

Phase 1: Import-Time Filtering

Before deduplication runs, two aggregator sources skip sub-sources that would create known duplicates with higher-quality originals already imported:

SourceSub-Source SkippedRecords SkippedReason
UFOCATSOURCE=UFOReportCtr123,304Copies of NUFORC sightings
UPDBname=MUFON131,506MUFON imported directly with richer descriptions
UPDBname=NUFORC1,689,235NUFORC imported directly with richer descriptions

This eliminates 1,944,045 known duplicates before dedup begins, reducing the working set from ~2.56M to 614,505. The UFOCAT skip triggers enrichment to preserve valuable Hynek/Vallee metadata.

Other overlapping sub-sources (e.g. UFOCAT's Hatch records vs UFO-search's Hatch records) are kept and handled by the dedup engine, since both copies may carry unique metadata.

Phase 1.5: Metadata Enrichment

UFOCAT's 123K skipped UFOReportCtr records carry Hynek and Vallee classifications that NUFORC natively lacks. Rather than lose this data, import_ufocat.py writes skipped records to a sidecar file (ufocat_enrichment.jsonl), and enrich.py transfers the metadata to matching NUFORC sightings post-import.

Matching: Date (YYYY-MM-DD) + normalized UPPER(city) + UPPER(state). City normalization strips parenthetical qualifiers, trailing punctuation, and collapses whitespace.

Transfer rules: Only fills NULL fields — never overwrites existing NUFORC values.

FieldNUFORC Records Enriched
Hynek classification102,554
Vallee classification83,710
Shape1,697
Unmatched (no NUFORC hit)19,637

Phase 2: Three-Tier Cross-Source Matching

After all imports and enrichment, the dedup engine (dedup.py) compares records across different sources using progressively broader matching strategies. Each tier builds on the previous, skipping pairs already flagged.

Tier 1: MUFON ↔ NUFORC (7,694 pairs)

The highest-overlap pair. Both sources cover modern US sightings with reliable date/location data.

  • Match key: Exact date (YYYY-MM-DD) + UPPER(city) + UPPER(state)
  • Scoring: Full description similarity with source-specific preprocessing
  • Result: 7,694 candidate pairs

Tier 2: All Remaining Cross-Source Pairs (101,879 pairs)

Four sub-tiers cover every remaining source combination, using the match key best suited to each source's location data quality:

Sub-tierSourcesMatch KeyWhy This KeyPairs
2aMUFON ↔ UFOCATdate + city + stateBoth have structured state fields2,295
2bNUFORC ↔ UFOCATdate + city + stateBoth have structured state fields4,148
2cUPDB ↔ MUFON/NUFORC/UFOCATdate + city (no state)UPDB has inconsistent state data63,459
2dUFO-search ↔ MUFON/NUFORC/UFOCATdate + city + stateUFO-search locations parsed via regex31,977

Source-specific notes:

  • UFOCAT cities are stored in raw_text (ALL CAPS), not city — the loader reads raw_text instead
  • UFO-search locations are free-text strings parsed by regex to extract (city, state) pairs; only locations matching the City, ST pattern with a valid US/Canadian state code are matchable
  • UPDB sub-tier (2c) filters to US records only (country='US') to reduce false positives from city-only matching
  • All candidate pairs are normalized so sighting_id_a < sighting_id_b to enforce the UNIQUE constraint

Tier 3: Description Fuzzy Matching (17,157 pairs)

Catches duplicates that Tiers 1–2 miss due to location data differences (misspellings, missing state, different geocoding).

  • Match key: Date only (no location requirement)
  • Scope: Only dates with records from 2+ sources AND ≤20 total records on that date
  • Skip: Pairs already found in Tiers 1–2 are excluded
  • Two-stage filtering:
    1. Token Jaccard > 0.25 — Fast set-intersection filter on lowercased word tokens
    2. SequenceMatcher ≥ 0.5 — Python's difflib.SequenceMatcher on the first 1,000 characters
  • Result: 17,157 candidates from cross-source pairs sharing a date but not caught by location matching

Similarity Scoring

Every candidate pair receives a similarity score (0.0–1.0) computed by compute_similarity():

  1. Source-specific preprocessing:
    • NUFORC: Strips NUFORC UFO Sighting NNNNN prefix
    • MUFON: Strips Submitted by razor via e-mail boilerplate, extracts investigator notes
  2. "Starts with" shortcut: If both descriptions share the same first N characters (N ≥ 20), score = 0.95
  3. Token Jaccard pre-filter: If token Jaccard < 0.03, return that score immediately
  4. Full alignment: difflib.SequenceMatcher on first 1,000 characters of each description

Pairs with no description on either side receive score = 0.0 (still flagged as candidates based on location matching).

Results

126,730 duplicate candidate pairs across 127,440 unique sightings (20.7% of all records).

ConfidenceScore RangePairsInterpretation
Certain0.9 – 1.014,260Near-identical descriptions; safe to auto-merge
Likely0.7 – 0.99,567Strong match; minor wording differences
Possible0.5 – 0.713,303Same event reported differently across sources
Weak0.3 – 0.511,144Same date+location, descriptions partially overlap; needs review
Unlikely0.0 – 0.378,456Same date+location but likely different events

By Match Method

MethodPairsAvg Score
tier2c_updb_ufocat59,6200.225
tier2d_ufosearch_ufocat31,4390.240
tier3_desc_fuzzy17,1570.768
tier1a_mufon_nuforc7,6940.226
tier2b_nuforc_ufocat4,1480.129
tier2c_updb_nuforc3,5190.234
tier2a_mufon_ufocat2,2950.072
tier2d_ufosearch_nuforc3970.044
tier2c_updb_mufon3200.012
tier2d_ufosearch_mufon1410.009

Note: The previous build flagged 242K candidates. The current build flags only 126K because the 123K UFOCAT-NUFORC duplicates (UFOReportCtr) are now prevented at import time rather than flagged after the fact.

What Dedup Does NOT Do

  • No records are deleted or merged. The duplicate_candidate table is advisory. All 614,505 sightings remain queryable.
  • No within-source dedup. The engine only flags cross-source pairs (different source_db_id). Duplicates within a single source are not flagged.
  • No transitive closure. If A↔B and B↔C are both flagged, A↔C is NOT automatically inferred. Each pair is independent.
  • Multiple witnesses are preserved. If the same event has genuinely separate witness reports in different sources, both records remain. The similarity score distinguishes true duplicates (high score) from independent reports of the same event (low score).

Database Schema

sighting (614,505 rows, 42 columns)

The main table. Each row is one reported sighting event.

CategoryFields
Provenancesource_db_id, source_record_id, origin_id, origin_record_id
Datesdate_event (ISO 8601), date_event_raw, date_end, time_raw, timezone, date_reported, date_posted
Locationlocation_id (FK to location table)
Descriptionsummary, description
Observationshape, color, size_estimated, angular_size, distance, duration, duration_seconds, num_objects, num_witnesses, sound, direction, elevation_angle, viewed_from
Witnesswitness_age, witness_sex, witness_names
Classificationhynek, vallee, event_type, svp_rating
Resolutionexplanation, characteristics
Contextweather, terrain, source_ref, page_volume, notes
Preservationraw_json — complete original record as JSON

Supporting Tables

  • location — Deduplicated locations with raw_text, city, county, state, country, region, latitude, longitude
  • source_collection (3 rows) — Top-level provenance grouping:
    • PUBLIUS — Compiled by Publius from original reporting sites and PhenomAInon downloads (MUFON, NUFORC, UPDB)
    • GELDREICH — Rich Geldreich's Majestic Timeline compilation from 19+ historical sources (UFO-search)
    • UFOCAT — CUFOS UFOCAT catalog, independent academic dataset
  • source_database (5 rows) — UFOCAT, NUFORC, MUFON, UPDB, UFO-search. Each linked to a collection via collection_id
  • source_origin (31 rows) — Upstream sources within aggregator databases (Blue Book, NICAP, Hatch, etc.)
  • duplicate_candidate (126,730 rows) — Flagged duplicate pairs with similarity scores

Reproducible Build Pipeline

The entire database can be rebuilt from source files with a single command:

python rebuild_db.py

This runs the full pipeline in order:

  1. Create schema (create_schema.py)
  2. Import all 5 sources (UFOCAT with enrichment sidecar, NUFORC, MUFON, UPDB, UFO-search)
  3. Apply data quality fixes — 14 fix categories covering dates, locations, shapes, classifications, and descriptions
  4. Geocode locations using GeoNames gazetteer (geocode.py)
  5. Run enrichment (enrich.py)
  6. Run three-tier deduplication (dedup.py)
  7. Copy database to explorer (ufo-explorer/ufo_unified.db)

Total build time: ~2 minutes.

Test Suite

The pipeline is validated by 364 automated tests (pytest tests/):

  • 115 dedup tests (test_dedup.py) — All dedup functions, tier logic, similarity scoring, and edge cases
  • 114 ETL tests (test_etl.py) — Schema creation, all 5 importers, data fix pipeline
  • 135 data quality tests (test_data_quality.py) — Shape normalization, date validation, classification cleanup, description fixes, historic date corrections, and audit trail

All tests use an in-memory SQLite database with synthetic data — no production data required.

Geocoding

Only UFOCAT provides latitude/longitude coordinates natively. The other four sources have text-only locations (city, state, country). To enable map visualization for all sources, locations are geocoded using the GeoNames cities15000 gazetteer (33,000+ cities with population > 15,000).

Matching strategy (decreasing specificity):

  1. Exact: UPPER(city) + state + country → highest confidence
  2. City + country: No state available (e.g., UPDB) → picks largest matching city by population
  3. City only: No country available (e.g., UFO-search raw text) → picks largest city globally
  4. Raw text parsing: Regex extraction of city/state/country from free-text location strings

The geocode_src column on the location table tracks provenance: NULL = coordinates from original source data, geonames_exact / geonames_city_country / geonames_city_only = geocoded via GeoNames. The map includes a dropdown to toggle between “All Coords”, “Original Only”, and “Geocoded Only”.

Source Collection Provenance

Each source database belongs to a collection — a top-level grouping that identifies who aggregated or curated the data:

CollectionSource(s)RecordsDescription
PUBLIUSMUFON, NUFORC, UPDB362,646Compiled by Publius from original reporting sites (MUFON, NUFORC) and PhenomAInon downloads (UPDB sub-sources: UFODNA, Blue Book, NICAP, UKTNA, CANADAGOV, NIDS, BRAZILGOV, SKINWALKER, PILOTS, BAASS)
UFOCATUFOCAT197,108CUFOS academic catalog (2023 release)
GELDREICHUFO-search54,751Rich Geldreich's Majestic Timeline from 19+ historical compilations

Collections are filterable in the explorer UI. The three-layer provenance model (source_collectionsource_databasesource_origin) traces every record back to its ultimate origin.