UFO Explorer

Loading map...

Sightings by Year

Sentiment & Emotion Insights

Emotion Distribution

Sentiment Over Time

Emotions by Source

Emotions by Shape (Top 10)

Unified UFO Sightings Database — Methodology

This document describes how 614,505 sighting records were assembled from five major UFO/UAP databases, deduplicated across sources, and enriched with cross-source metadata. 126,730 duplicate candidate pairs are flagged for review. No records are deleted.

Source Databases

Source	Raw Records	Imported	Skipped	Description
UFOCAT	320,412	197,108	123,304	CUFOS UFOCAT 2023 catalog. Richest metadata: Hynek/Vallee classifications, lat/lon, witness counts, durations. 123K NUFORC-origin records (`SOURCE=UFOReportCtr`) skipped; metadata transferred via enrichment.
NUFORC	159,320	159,320	0	National UFO Reporting Center. Self-reported sightings with detailed free-text descriptions. Enriched post-import with 102K Hynek and 83K Vallee classifications from UFOCAT.
MUFON	138,310	138,310	0	Mutual UFO Network case reports. Short + long descriptions, investigator summaries.
UPDB	1,885,757	65,016	1,820,741	Unified Phenomena Database (phenomenAInon). 1.82M rows skipped (MUFON/NUFORC already imported from richer originals). Remaining 65K from UFODNA (38K), Blue Book (14K), NICAP (5.8K), etc.
UFO-search	54,751	54,751	0	Majestic Timeline compilation from ufo-search.com. Historical records from 19 source compilations (Hatch, Eberhart, NICAP, Vallee, etc.).

Total raw records across all sources: ~2.56 million. After removing known overlaps at import time: 614,505.

UFOCAT Sub-Source Landscape

UFOCAT is itself an aggregator. Its SOURCE column identifies where each record originated:

UFOCAT SOURCE	Records	Overlap With
`UFOReportCtr`	123,304	NUFORC (skipped, enriched)
`U` (Hatch)	17,184	UFO-search Hatch (18K)
`BlueBook1`	13,101	UPDB Blue Book (14K)
`GEberhart1`	11,643	UFO-search Eberhart (7.9K)
`CanadUFOSurv`	10,785	—
`NICAP`	2,315	UPDB NICAP (5.8K), UFO-search NICAP (5.5K)
`MUFONJournal` + MUFON*	2,861	MUFON

Only UFOReportCtr is skipped at import time. Other overlaps are handled by the deduplication engine.

Import Methodology

Each source has a custom import script. Two aggregator sources skip known-duplicate sub-sources at import time:

UFOCAT skips SOURCE=UFOReportCtr (123K NUFORC-origin records)
UPDB skips name=MUFON and name=NUFORC (1.82M records)

Source-Specific Handling

UFOCAT — 55-column CSV with split date fields (YEAR, MO, DAY, TIME). City stored in ALL CAPS in raw_text; copied to city post-import. Longitude negated for US/CA. UFOReportCtr records saved to enrichment sidecar.
NUFORC — Multi-line CSV with quoted descriptions. Dates: 1995-02-02 23:00 Local. Locations: City, ST, Country.
MUFON — 7-column CSV with embedded \n in dates. Locations with escaped commas: Newscandia\, MN\, US.
UPDB — 1.9M rows; name column identifies sub-source. 1,820,741 MUFON/NUFORC rows skipped. Remaining 65,016 mapped to source_origin entries.
UFO-search — JSON array of 54,751 records from 19 historical compilations. Variable date formats ("Summer 1947", "4/34", "6/24/1947"). Regex-based date parser; free-text location parsing.

Data Quality Fixes

Applied automatically by rebuild_db.py in the apply_data_fixes() pipeline. All fixes are idempotent and preserve original values in date_event_raw and raw_json columns.

Location Fixes

UFOCAT longitude sign — 30,822 Western Hemisphere locations had positive longitude; negated for US/CA records
UFOCAT city field — 73,766 locations had city only in raw_text; copied to city column
Country code normalization — USA→US, United Kingdom→GB, Canada→CA, Australia→AU

Date Fixes

MUFON literal \n in dates — 136,654 MUFON records contained a literal backslash-n (0x5C6E) in date_event (e.g., 2020-01-15\n3:00PM). Time portion extracted to time_raw, date truncated to ISO date
Year 0000 — Records with year 0000 have date_event set to NULL
Negative years — Records with date_event starting with - (e.g., -009-02-10) have date_event set to NULL
Month 00 — 551 records with YYYY-00-DD truncated to YYYY
Day 00 — 3,391 records with YYYY-MM-00 truncated to YYYY-MM
Impossible calendar dates — 14 records with Feb 30+, Apr/Jun/Sep/Nov 31 truncated to YYYY-MM
UFOCAT century-only 19// — 692 records with 2-digit raw year 19 (meaning “19xx, year unknown”) had date_event set to NULL. Audit logged in date_correction table
UFOCAT H-BOMB TEST 195// — 1 record with 3-digit year 195 and city “H-BOMB TEST” (clearly 1950s) set to NULL
NUFORC data entry errors — 2 records with wrong century corrected: 0205→2005 (Falmouth), 1721→2021 (Crescent City, Starlink-era sighting)
UPDB mangled years — 19 records with broken years from upstream UPDB export: 1 corrected (0196→1962, confirmed by description), 18 set to NULL (century-round years 0200–0900 and unconfirmed modern dates)

Shape / Classification Normalization

Shape case normalization — 24 case-duplicate groups collapsed via title-case (e.g., circle→Circle), including hyphenated shapes (e.g., cigar-shaped→Cigar-Shaped). 352→317 distinct values
Shape typo correction — 9 misspellings corrected (e.g., Triangel→Triangle, Rectagle→Rectangle)
Junk shapes removed — 3 non-shape values set to NULL (WITNESS, 0, 12:45)
Hynek uppercase — 3 case-duplicate Hynek codes normalized (e.g., nl→NL). 43→40 distinct values
Vallee uppercase — 2 case-duplicate Vallee codes normalized (e.g., fb1→FB1). 43→41 distinct values

Description Cleanup

[MISSING DATA] removal — Records with description consisting solely of [MISSING DATA] or [missing data] set to NULL
MUFON boilerplate — Submitted by razor via e-mail and Investigator Notes: boilerplate stripped from descriptions

Historic Date Analysis

The database contains 8,046 sighting records with dates before 1901, spanning from 34 AD (a white round object over China) to 1900. Most are legitimate historical sightings from academic catalogs, but several categories of date errors were identified through systematic analysis.

Extraction & Analysis Method

All pre-1901 records were extracted into a standalone analysis database (temp/historic_pre1901.db) using extract_historic.py. Each record was auto-classified based on its source, raw date format, and year digit count, then flagged for manual review where ambiguous.

Categories Identified

Category	Source	Records	Status	Description
`ufocat_ancient`	UFOCAT	4,436	OK	4-digit raw years (1001–1900). Legitimately pre-modern sightings. No action needed.
`ufocat_century_only`	UFOCAT	692	Fixed	2-digit raw year `19//` = “sometime in the 1900s, year unknown.” ETL zero-padded to `0019`. Descriptions confirm modern events (abductions, radar, motion pictures). Resolution: `date_event` set to NULL (year genuinely unknown).
`ufocat_3digit_review`	UFOCAT	88	Fixed	3-digit raw years (034–999). Mostly legitimate ancient dates. 1 confirmed modern mislabel corrected: `195//` “H-BOMB TEST” (1950s) → NULL. 4 ambiguous `188//` records left as-is (no descriptions to disambiguate). Remaining 83 are legitimate ancient sightings.
`other_source_review`	UFO-search	1,984	OK	Geldreich Majestic Timeline. Historical records from 61 AD to 1900. All appear legitimate.
`other_source_review`	UPDB	780	Fixed	~760 legitimate (1000–1900). 19 records had mangled modern years from upstream data errors. Resolution: 1 corrected (`0196`→1962, confirmed by description), 18 set to NULL (century-round years `0200`–`0900` and unconfirmed modern dates).
`other_source_review`	MUFON	40	OK	All 1890–1900. Appear legitimate.
`other_source_review`	NUFORC	26	Fixed	~23 legitimate historic reports. 2 data entry errors corrected: `0205`→2005 (Falmouth), `1721`→2021 (Crescent City). 1 ambiguous record (`1071`) left as-is (could be 1971 or 2007).

Root Cause: UFOCAT Variable-Length Year Field

UFOCAT stores dates in separate YEAR, MO, DAY columns. The YEAR field uses variable-length encoding:

4 digits (196,295 records): Standard years like 1966, 2001
3 digits (90 records): Ancient years like 034 (34 AD), 776, 919
2 digits (715 records): Century-only indicator 19 = “19xx” (20th century, unknown year)

The ETL’s parse_ufocat_date() zero-pads all years to 4 digits (f"{y:04d}"), which correctly handles 3-digit ancient years but misinterprets 2-digit 19 as year 19 AD instead of “19xx.”

Applied Resolution

After manual review of the annotated analysis dataset, 714 records were corrected or nulled via Fixes 15–18 in rebuild_db.py. Every correction is logged in the date_correction audit table with the original date, corrected date, correction type, and reason. See GitHub issue #1.

Fix	Source	Action	Count
Fix 15	UFOCAT	Century-only `19//` → NULL (year unknown)	692
Fix 16	UFOCAT	H-BOMB TEST `195//` → NULL (1950s, not 195 AD)	1
Fix 17	NUFORC	Data entry errors corrected (`0205`→2005, `1721`→2021)	2
Fix 18	UPDB	Mangled years corrected/nulled (`0196`→1962, rest → NULL)	19

Conservative approach: only records with clear evidence were corrected. Ambiguous records (188// in UFOCAT, 1071 in NUFORC) were left unchanged. All fixes are idempotent and re-applied on each database rebuild.

Deduplication Methodology

Deduplication uses a two-phase strategy: known overlaps are eliminated at import time, then a three-tier matching engine flags remaining cross-source duplicates for review. No records are deleted — all 614,505 sightings remain, with 126,730 candidate pairs stored in the duplicate_candidate table.

Phase 1: Import-Time Filtering

Before deduplication runs, two aggregator sources skip sub-sources that would create known duplicates with higher-quality originals already imported:

Source	Sub-Source Skipped	Records Skipped	Reason
UFOCAT	`SOURCE=UFOReportCtr`	123,304	Copies of NUFORC sightings
UPDB	`name=MUFON`	131,506	MUFON imported directly with richer descriptions
UPDB	`name=NUFORC`	1,689,235	NUFORC imported directly with richer descriptions

This eliminates 1,944,045 known duplicates before dedup begins, reducing the working set from ~2.56M to 614,505. The UFOCAT skip triggers enrichment to preserve valuable Hynek/Vallee metadata.

Other overlapping sub-sources (e.g. UFOCAT's Hatch records vs UFO-search's Hatch records) are kept and handled by the dedup engine, since both copies may carry unique metadata.

Phase 1.5: Metadata Enrichment

UFOCAT's 123K skipped UFOReportCtr records carry Hynek and Vallee classifications that NUFORC natively lacks. Rather than lose this data, import_ufocat.py writes skipped records to a sidecar file (ufocat_enrichment.jsonl), and enrich.py transfers the metadata to matching NUFORC sightings post-import.

Matching: Date (YYYY-MM-DD) + normalized UPPER(city) + UPPER(state). City normalization strips parenthetical qualifiers, trailing punctuation, and collapses whitespace.

Transfer rules: Only fills NULL fields — never overwrites existing NUFORC values.

Field	NUFORC Records Enriched
Hynek classification	102,554
Vallee classification	83,710
Shape	1,697
Unmatched (no NUFORC hit)	19,637

Phase 2: Three-Tier Cross-Source Matching

After all imports and enrichment, the dedup engine (dedup.py) compares records across different sources using progressively broader matching strategies. Each tier builds on the previous, skipping pairs already flagged.

Tier 1: MUFON ↔ NUFORC (7,694 pairs)

The highest-overlap pair. Both sources cover modern US sightings with reliable date/location data.

Match key: Exact date (YYYY-MM-DD) + UPPER(city) + UPPER(state)
Scoring: Full description similarity with source-specific preprocessing
Result: 7,694 candidate pairs

Tier 2: All Remaining Cross-Source Pairs (101,879 pairs)

Four sub-tiers cover every remaining source combination, using the match key best suited to each source's location data quality:

Sub-tier	Sources	Match Key	Why This Key	Pairs
2a	MUFON ↔ UFOCAT	date + city + state	Both have structured state fields	2,295
2b	NUFORC ↔ UFOCAT	date + city + state	Both have structured state fields	4,148
2c	UPDB ↔ MUFON/NUFORC/UFOCAT	date + city (no state)	UPDB has inconsistent state data	63,459
2d	UFO-search ↔ MUFON/NUFORC/UFOCAT	date + city + state	UFO-search locations parsed via regex	31,977

Source-specific notes:

UFOCAT cities are stored in raw_text (ALL CAPS), not city — the loader reads raw_text instead
UFO-search locations are free-text strings parsed by regex to extract (city, state) pairs; only locations matching the City, ST pattern with a valid US/Canadian state code are matchable
UPDB sub-tier (2c) filters to US records only (country='US') to reduce false positives from city-only matching
All candidate pairs are normalized so sighting_id_a < sighting_id_b to enforce the UNIQUE constraint

Tier 3: Description Fuzzy Matching (17,157 pairs)

Catches duplicates that Tiers 1–2 miss due to location data differences (misspellings, missing state, different geocoding).

Match key: Date only (no location requirement)
Scope: Only dates with records from 2+ sources AND ≤20 total records on that date
Skip: Pairs already found in Tiers 1–2 are excluded
Two-stage filtering:
1. Token Jaccard > 0.25 — Fast set-intersection filter on lowercased word tokens
2. SequenceMatcher ≥ 0.5 — Python's difflib.SequenceMatcher on the first 1,000 characters
Result: 17,157 candidates from cross-source pairs sharing a date but not caught by location matching

Similarity Scoring

Every candidate pair receives a similarity score (0.0–1.0) computed by compute_similarity():

Source-specific preprocessing:
- NUFORC: Strips NUFORC UFO Sighting NNNNN prefix
- MUFON: Strips Submitted by razor via e-mail boilerplate, extracts investigator notes
"Starts with" shortcut: If both descriptions share the same first N characters (N ≥ 20), score = 0.95
Token Jaccard pre-filter: If token Jaccard < 0.03, return that score immediately
Full alignment: difflib.SequenceMatcher on first 1,000 characters of each description

Pairs with no description on either side receive score = 0.0 (still flagged as candidates based on location matching).

Results

126,730 duplicate candidate pairs across 127,440 unique sightings (20.7% of all records).

Confidence	Score Range	Pairs	Interpretation
Certain	0.9 – 1.0	14,260	Near-identical descriptions; safe to auto-merge
Likely	0.7 – 0.9	9,567	Strong match; minor wording differences
Possible	0.5 – 0.7	13,303	Same event reported differently across sources
Weak	0.3 – 0.5	11,144	Same date+location, descriptions partially overlap; needs review
Unlikely	0.0 – 0.3	78,456	Same date+location but likely different events

By Match Method

Method	Pairs	Avg Score
`tier2c_updb_ufocat`	59,620	0.225
`tier2d_ufosearch_ufocat`	31,439	0.240
`tier3_desc_fuzzy`	17,157	0.768
`tier1a_mufon_nuforc`	7,694	0.226
`tier2b_nuforc_ufocat`	4,148	0.129
`tier2c_updb_nuforc`	3,519	0.234
`tier2a_mufon_ufocat`	2,295	0.072
`tier2d_ufosearch_nuforc`	397	0.044
`tier2c_updb_mufon`	320	0.012
`tier2d_ufosearch_mufon`	141	0.009

Note: The previous build flagged 242K candidates. The current build flags only 126K because the 123K UFOCAT-NUFORC duplicates (UFOReportCtr) are now prevented at import time rather than flagged after the fact.

What Dedup Does NOT Do

No records are deleted or merged. The duplicate_candidate table is advisory. All 614,505 sightings remain queryable.
No within-source dedup. The engine only flags cross-source pairs (different source_db_id). Duplicates within a single source are not flagged.
No transitive closure. If A↔B and B↔C are both flagged, A↔C is NOT automatically inferred. Each pair is independent.
Multiple witnesses are preserved. If the same event has genuinely separate witness reports in different sources, both records remain. The similarity score distinguishes true duplicates (high score) from independent reports of the same event (low score).

Database Schema

`sighting` (614,505 rows, 42 columns)

The main table. Each row is one reported sighting event.

Category	Fields
Provenance	`source_db_id`, `source_record_id`, `origin_id`, `origin_record_id`
Dates	`date_event` (ISO 8601), `date_event_raw`, `date_end`, `time_raw`, `timezone`, `date_reported`, `date_posted`
Location	`location_id` (FK to `location` table)
Description	`summary`, `description`
Observation	`shape`, `color`, `size_estimated`, `angular_size`, `distance`, `duration`, `duration_seconds`, `num_objects`, `num_witnesses`, `sound`, `direction`, `elevation_angle`, `viewed_from`
Witness	`witness_age`, `witness_sex`, `witness_names`
Classification	`hynek`, `vallee`, `event_type`, `svp_rating`
Resolution	`explanation`, `characteristics`
Context	`weather`, `terrain`, `source_ref`, `page_volume`, `notes`
Preservation	`raw_json` — complete original record as JSON

Supporting Tables

location — Deduplicated locations with raw_text, city, county, state, country, region, latitude, longitude
source_collection (3 rows) — Top-level provenance grouping:
- PUBLIUS — Compiled by Publius from original reporting sites and PhenomAInon downloads (MUFON, NUFORC, UPDB)
- GELDREICH — Rich Geldreich's Majestic Timeline compilation from 19+ historical sources (UFO-search)
- UFOCAT — CUFOS UFOCAT catalog, independent academic dataset
source_database (5 rows) — UFOCAT, NUFORC, MUFON, UPDB, UFO-search. Each linked to a collection via collection_id
source_origin (31 rows) — Upstream sources within aggregator databases (Blue Book, NICAP, Hatch, etc.)
duplicate_candidate (126,730 rows) — Flagged duplicate pairs with similarity scores

Reproducible Build Pipeline

The entire database can be rebuilt from source files with a single command:

python rebuild_db.py

This runs the full pipeline in order:

Create schema (create_schema.py)
Import all 5 sources (UFOCAT with enrichment sidecar, NUFORC, MUFON, UPDB, UFO-search)
Apply data quality fixes — 14 fix categories covering dates, locations, shapes, classifications, and descriptions
Geocode locations using GeoNames gazetteer (geocode.py)
Run enrichment (enrich.py)
Run three-tier deduplication (dedup.py)
Copy database to explorer (ufo-explorer/ufo_unified.db)

Total build time: ~2 minutes.

Test Suite

The pipeline is validated by 364 automated tests (pytest tests/):

115 dedup tests (test_dedup.py) — All dedup functions, tier logic, similarity scoring, and edge cases
114 ETL tests (test_etl.py) — Schema creation, all 5 importers, data fix pipeline
135 data quality tests (test_data_quality.py) — Shape normalization, date validation, classification cleanup, description fixes, historic date corrections, and audit trail

All tests use an in-memory SQLite database with synthetic data — no production data required.

Geocoding

Only UFOCAT provides latitude/longitude coordinates natively. The other four sources have text-only locations (city, state, country). To enable map visualization for all sources, locations are geocoded using the GeoNames cities15000 gazetteer (33,000+ cities with population > 15,000).

Matching strategy (decreasing specificity):

Exact: UPPER(city) + state + country → highest confidence
City + country: No state available (e.g., UPDB) → picks largest matching city by population
City only: No country available (e.g., UFO-search raw text) → picks largest city globally
Raw text parsing: Regex extraction of city/state/country from free-text location strings

The geocode_src column on the location table tracks provenance: NULL = coordinates from original source data, geonames_exact / geonames_city_country / geonames_city_only = geocoded via GeoNames. The map includes a dropdown to toggle between “All Coords”, “Original Only”, and “Geocoded Only”.

Source Collection Provenance

Each source database belongs to a collection — a top-level grouping that identifies who aggregated or curated the data:

Collection	Source(s)	Records	Description
PUBLIUS	MUFON, NUFORC, UPDB	362,646	Compiled by Publius from original reporting sites (MUFON, NUFORC) and PhenomAInon downloads (UPDB sub-sources: UFODNA, Blue Book, NICAP, UKTNA, CANADAGOV, NIDS, BRAZILGOV, SKINWALKER, PILOTS, BAASS)
UFOCAT	UFOCAT	197,108	CUFOS academic catalog (2023 release)
GELDREICH	UFO-search	54,751	Rich Geldreich's Majestic Timeline from 19+ historical compilations

Collections are filterable in the explorer UI. The three-layer provenance model (source_collection → source_database → source_origin) traces every record back to its ultimate origin.