Sightings Timeline
All three charts reflect the current Observatory filter state. Change a filter on any tab and these cards re-tally instantly.

All sightings over time

Quality score over time (median per year)

Movement categories over time (yearly share)

Insights

Emotion & Sentiment Analysis

Sentiment Polarity

Emotion Distribution (7-Class)

GoEmotions Detail (28-Class)

Sentiment Score Distributions

Emotion Profile by Source

Data Quality & Red Flags

Quality Score Distribution

Narrative Red Flags (keyword heuristic)

Movement & Shape

Movement Taxonomy

Shape × Movement Matrix (Top 10 shapes)

Ask AI about the data
Bring your own API key — chat happens in your browser, the key never touches our server.
Ask anything about the unified UFO database
Try:
  • "What are the most common shapes?"
  • "Show me triangle sightings in California in the 1970s"
  • "How many sightings happened in October 1973?"
  • "Which states report the most sightings?"
You'll need an API key from your provider — open Settings above.
Powered by MCP-compatible tools. Your API key is stored locally in your browser only.

Connect your own AI to the UFOSINT data

Every tool the website's chatbot has access to is also exposed via the Model Context Protocol at a single HTTPS endpoint, so any MCP-compatible AI client can query the unified UFO sightings database with your own model and your own subscription. The endpoint is read-only and free to use.

MCP endpoint

https://ufosint-explorer.azurewebsites.net/mcp

6 tools available: search_sightings, get_sighting, get_stats, get_timeline, find_duplicates_for, count_by.

Claude Code (CLI / Desktop App)

One command to connect from any project directory:

claude mcp add --transport http ufosint https://ufosint-explorer.azurewebsites.net/mcp

Restart your Claude Code session. The 6 UFOSINT tools will be available immediately. Remove later with claude mcp remove ufosint.

Claude Desktop

Open ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows) and add:

{
  "mcpServers": {
    "ufosint": {
      "url": "https://ufosint-explorer.azurewebsites.net/mcp",
      "transport": "http"
    }
  }
}

Restart Claude Desktop. The 6 UFOSINT tools will appear in the tools panel.

Cursor / Cline / Continue / Windsurf

These all support remote MCP servers. Add the same URL to your client's MCP configuration. Each client documents the exact location, but the JSON shape is the same.

Direct API (curl / Python / any HTTP client)

The endpoint is JSON-RPC 2.0 over HTTPS. List the tools:

curl -s https://ufosint-explorer.azurewebsites.net/mcp \
  -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}'

Call a tool:

curl -s https://ufosint-explorer.azurewebsites.net/mcp \
  -H 'Content-Type: application/json' \
  -d '{
    "jsonrpc":"2.0",
    "id":2,
    "method":"tools/call",
    "params":{
      "name":"search_sightings",
      "arguments":{"q":"triangle","state":"CA","limit":5}
    }
  }'

OpenAI / OpenRouter function-calling format

If you're integrating with OpenAI or OpenRouter and want the tool definitions in their native format (instead of going through MCP), fetch:

GET https://ufosint-explorer.azurewebsites.net/api/tools-catalog

And invoke individual tools at:

POST https://ufosint-explorer.azurewebsites.net/api/tool/<tool_name>
Content-Type: application/json

{ "q": "triangle", "state": "CA", "limit": 5 }
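If you're scripting against those per-tool endpoints, a minimal Python sketch using only the standard library might look like this. The URL and argument names are the ones shown above; the helper names are illustrative:

```python
import json
import urllib.request

BASE = "https://ufosint-explorer.azurewebsites.net"

def build_tool_request(tool_name: str, arguments: dict) -> urllib.request.Request:
    """Build a POST request for /api/tool/<tool_name> with a JSON body."""
    return urllib.request.Request(
        f"{BASE}/api/tool/{tool_name}",
        data=json.dumps(arguments).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def call_tool(tool_name: str, arguments: dict) -> dict:
    """Send the request and decode the JSON response (requires network access)."""
    with urllib.request.urlopen(build_tool_request(tool_name, arguments)) as resp:
        return json.load(resp)

# e.g. call_tool("search_sightings", {"q": "triangle", "state": "CA", "limit": 5})
```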

Download the database (SQLite)

Want to run your own analysis, train models, or hack on the data offline? The full 508 MB SQLite snapshot is attached to every tagged release on GitHub — 614,505 deduplicated sightings, 502,985 with emotion analysis, and all derived columns.

curl -LO https://github.com/UFOSINT/ufosint-explorer/releases/latest/download/ufo_public.db

sqlite3 ufo_public.db "SELECT COUNT(*) FROM sighting;"
# 614505

See the Methodology tab for the full schema, derived-column definitions, and per-source licensing. Browse the releases page for older versions.

AI Discovery

This site exposes standard AI-readiness files so agents and LLMs can discover and understand the UFOSINT tools automatically:

Local stdio MCP server

Prefer to run an MCP server on your own machine? Clone the repo and use mcp_server.py:

git clone https://github.com/UFOSINT/ufosint-explorer
cd ufosint-explorer
pip install fastmcp psycopg[binary]
DATABASE_URL="postgresql://..." python mcp_server.py

Then point Claude Desktop at the local script via the command form of the MCP config. (You'll need read-only credentials to a PostgreSQL with the UFOSINT schema.)

All access is read-only. Source data is licensed by UFOSINT; the deduplicated database is built by the ufo-dedup pipeline.

Unified UFO Sightings Database — Methodology

This is not raw data. UFOSINT Explorer presents a processed scientific analysis of five major UFO/UAP databases — 614,505 sighting records deduplicated, cross-referenced, quality-scored, movement-classified, and emotion-analyzed using four transformer models. Every step of the pipeline is documented below and can be independently replicated from the source data using the open-source ufo-dedup pipeline. The web application source code is at ufosint-explorer.

Download the full database — the 508 MB SQLite snapshot (ufo_public.db) is attached to every tagged release. Download latest · Browse releases
Reproducibility statement. The entire database can be rebuilt from source files with a single command (python rebuild_db.py). All data quality fixes are idempotent and preserve original values. Derived columns (quality scores, movement categories, emotion classifications) are computed deterministically from the raw narratives. No records are deleted — duplicates are flagged, not merged. The deduplication pipeline described below documents the historical methodology; the current build applies deduplication at ingest time.

Source Databases

Source | Raw Records | Imported | Skipped | Description
UFOCAT | 320,412 | 197,108 | 123,304 | CUFOS UFOCAT 2023 catalog. Richest metadata: Hynek/Vallee classifications, lat/lon, witness counts, durations. 123K NUFORC-origin records (SOURCE=UFOReportCtr) skipped; metadata transferred via enrichment.
NUFORC | 159,320 | 159,320 | 0 | National UFO Reporting Center. Self-reported sightings with detailed free-text descriptions. Enriched post-import with 102K Hynek and 83K Vallee classifications from UFOCAT.
MUFON | 138,310 | 138,310 | 0 | Mutual UFO Network case reports. Short + long descriptions, investigator summaries.
UPDB | 1,885,757 | 65,016 | 1,820,741 | Unified Phenomena Database (phenomenAInon). 1.82M rows skipped (MUFON/NUFORC already imported from richer originals). Remaining 65K from UFODNA (38K), Blue Book (14K), NICAP (5.8K), etc.
UFO-search | 54,751 | 54,751 | 0 | Majestic Timeline compilation from ufo-search.com. Historical records from 19 source compilations (Hatch, Eberhart, NICAP, Vallee, etc.).

Total raw records across all sources: ~2.56 million. After removing known overlaps at import time: 614,505.

UFOCAT Sub-Source Landscape

UFOCAT is itself an aggregator. Its SOURCE column identifies where each record originated:

UFOCAT SOURCE | Records | Overlap With
UFOReportCtr | 123,304 | NUFORC (skipped, enriched)
U (Hatch) | 17,184 | UFO-search Hatch (18K)
BlueBook | 13,101 | UPDB Blue Book (14K)
GEberhart | 11,643 | UFO-search Eberhart (7.9K)
CanadUFOSurv | 10,785 | —
NICAP | 2,315 | UPDB NICAP (5.8K), UFO-search NICAP (5.5K)
MUFONJournal + MUFON* | 2,861 | MUFON

Only UFOReportCtr is skipped at import time. Other overlaps are handled by the deduplication engine.

Import Methodology

Each source has a custom import script. Two aggregator sources skip known-duplicate sub-sources at import time:

  • UFOCAT skips SOURCE=UFOReportCtr (123K NUFORC-origin records)
  • UPDB skips name=MUFON and name=NUFORC (1.82M records)

Source-Specific Handling

  • UFOCAT — 55-column CSV with split date fields (YEAR, MO, DAY, TIME). City stored in ALL CAPS in raw_text; copied to city post-import. Longitude negated for US/CA. UFOReportCtr records saved to enrichment sidecar.
  • NUFORC — Multi-line CSV with quoted descriptions. Dates: 1995-02-02 23:00 Local. Locations: City, ST, Country.
  • MUFON — 7-column CSV with embedded \n in dates. Locations with escaped commas: Newscandia\, MN\, US.
  • UPDB — 1.9M rows; name column identifies sub-source. 1,820,741 MUFON/NUFORC rows skipped. Remaining 65,016 mapped to source_origin entries.
  • UFO-search — JSON array of 54,751 records from 19 historical compilations. Variable date formats ("Summer 1947", "4/34", "6/24/1947"). Regex-based date parser; free-text location parsing.
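As an illustration of the kind of regex-based date parsing described for UFO-search, here is a toy Python sketch that maps the three sample formats above to ISO-style date prefixes. It is an assumption modeled on the description (e.g. it guesses 19xx for two-digit years), not the pipeline's actual parser:

```python
import re

def parse_loose_date(raw: str):
    """Map variable historical date formats to an ISO-style prefix, or None."""
    raw = raw.strip()
    m = re.fullmatch(r"(\d{1,2})/(\d{1,2})/(\d{4})", raw)   # "6/24/1947"
    if m:
        mo, day, yr = m.groups()
        return f"{yr}-{int(mo):02d}-{int(day):02d}"
    m = re.fullmatch(r"(\d{1,2})/(\d{2})", raw)             # "4/34" -> April 1934 (assumes 19xx)
    if m:
        mo, yy = m.groups()
        return f"19{yy}-{int(mo):02d}"
    m = re.search(r"\b(\d{4})\b", raw)                      # "Summer 1947" -> year only
    if m:
        return m.group(1)
    return None  # unparseable; leave the date empty rather than guess

assert parse_loose_date("6/24/1947") == "1947-06-24"
assert parse_loose_date("4/34") == "1934-04"
assert parse_loose_date("Summer 1947") == "1947"
```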

Data Quality Fixes

Applied automatically by rebuild_db.py in the apply_data_fixes() pipeline. All fixes are idempotent and preserve original values in date_event_raw and raw_json columns.

Location Fixes
  • UFOCAT longitude sign — 30,822 Western Hemisphere locations had positive longitude; negated for US/CA records
  • UFOCAT city field — 73,766 locations had city only in raw_text; copied to city column
  • Country code normalization — USA→US, United Kingdom→GB, Canada→CA, Australia→AU
Date Fixes
  • MUFON literal \n in dates — 136,654 MUFON records contained a literal backslash-n (bytes 0x5C 0x6E) in date_event (e.g., 2020-01-15\n3:00PM). Time portion extracted to time_raw, date truncated to ISO date
  • Year 0000 — Records with year 0000 have date_event set to NULL
  • Negative years — Records with date_event starting with - (e.g., -009-02-10) have date_event set to NULL
  • Month 00 — 551 records with YYYY-00-DD truncated to YYYY
  • Day 00 — 3,391 records with YYYY-MM-00 truncated to YYYY-MM
  • Impossible calendar dates — 14 records with Feb 30+, Apr/Jun/Sep/Nov 31 truncated to YYYY-MM
  • UFOCAT century-only 19// — 692 records with 2-digit raw year 19 (meaning “19xx, year unknown”) had date_event set to NULL. Audit logged in date_correction table
  • UFOCAT H-BOMB TEST 195// — 1 record with 3-digit year 195 and city “H-BOMB TEST” (clearly 1950s) set to NULL
  • NUFORC data entry errors — 2 records with wrong century corrected: 0205 → 2005 (Falmouth), 1721 → 2021 (Crescent City, Starlink-era sighting)
  • UPDB mangled years — 19 records with broken years from upstream UPDB export: 1 corrected (0196 → 1962, confirmed by description), 18 set to NULL (century-round years 0200 and 0900, and unconfirmed modern dates)
Shape / Classification Normalization
  • Shape case normalization — 24 case-duplicate groups collapsed via title-case (e.g., circle → Circle), including hyphenated shapes (e.g., cigar-shaped → Cigar-Shaped). 352 → 317 distinct values
  • Shape typo correction — 9 misspellings corrected (e.g., Triangel → Triangle, Rectagle → Rectangle)
  • Junk shapes removed — 3 non-shape values set to NULL (WITNESS, 0, 12:45)
  • Hynek uppercase — 3 case-duplicate Hynek codes normalized (e.g., nl → NL). 43 → 40 distinct values
  • Vallee uppercase — 2 case-duplicate Vallee codes normalized (e.g., fb1 → FB1). 43 → 41 distinct values
Description Cleanup
  • [MISSING DATA] removal — Records with description consisting solely of [MISSING DATA] or [missing data] set to NULL
  • MUFON boilerplate — Submitted by razor via e-mail and Investigator Notes: boilerplate stripped from descriptions

Historic Date Analysis

The database contains 8,046 sighting records with dates before 1901, spanning from 34 AD (a white round object over China) to 1900. Most are legitimate historical sightings from academic catalogs, but several categories of date errors were identified through systematic analysis.

Extraction & Analysis Method

All pre-1901 records were extracted into a standalone analysis database (temp/historic_pre1901.db) using extract_historic.py. Each record was auto-classified based on its source, raw date format, and year digit count, then flagged for manual review where ambiguous.

Categories Identified

Category | Source | Records | Status | Description
ufocat_ancient | UFOCAT | 4,436 | OK | 4-digit raw years (1001–1900). Legitimately pre-modern sightings. No action needed.
ufocat_century_only | UFOCAT | 692 | Fixed | 2-digit raw year 19// = “sometime in the 1900s, year unknown.” ETL zero-padded to 0019. Descriptions confirm modern events (abductions, radar, motion pictures). Resolution: date_event set to NULL (year genuinely unknown).
ufocat_3digit_review | UFOCAT | 88 | Fixed | 3-digit raw years (034–999). Mostly legitimate ancient dates. 1 confirmed modern mislabel corrected: 195// “H-BOMB TEST” (1950s) → NULL. 4 ambiguous 188// records left as-is (no descriptions to disambiguate). Remaining 83 are legitimate ancient sightings.
other_source_review | UFO-search | 1,984 | OK | Geldreich Majestic Timeline. Historical records from 61 AD to 1900. All appear legitimate.
other_source_review | UPDB | 780 | Fixed | ~760 legitimate (1000–1900). 19 records had mangled modern years from upstream data errors. Resolution: 1 corrected (0196 → 1962, confirmed by description), 18 set to NULL (century-round years 0200 and 0900, and unconfirmed modern dates).
other_source_review | MUFON | 40 | OK | All 1890–1900. Appear legitimate.
other_source_review | NUFORC | 26 | Fixed | ~23 legitimate historic reports. 2 data entry errors corrected: 0205 → 2005 (Falmouth), 1721 → 2021 (Crescent City). 1 ambiguous record (1071) left as-is (could be 1971 or 2007).

Root Cause: UFOCAT Variable-Length Year Field

UFOCAT stores dates in separate YEAR, MO, DAY columns. The YEAR field uses variable-length encoding:

  • 4 digits (196,295 records): Standard years like 1966, 2001
  • 3 digits (90 records): Ancient years like 034 (34 AD), 776, 919
  • 2 digits (715 records): Century-only indicator 19 = “19xx” (20th century, unknown year)

The ETL’s parse_ufocat_date() zero-pads all years to 4 digits (f"{y:04d}"), which correctly handles 3-digit ancient years but misinterprets 2-digit 19 as year 19 AD instead of “19xx.”
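A minimal sketch of the corrected year handling (the function name is illustrative, not the pipeline's actual API):

```python
from typing import Optional

def parse_ufocat_year(raw_year: str) -> Optional[int]:
    """Interpret UFOCAT's variable-length YEAR field as described above."""
    raw_year = raw_year.strip()
    if not raw_year.isdigit():
        return None
    if len(raw_year) == 2:
        # Century-only indicator: "19" means "sometime in the 1900s".
        # Zero-padding it to 0019 (the original ETL bug) would turn a
        # 20th-century sighting into 19 AD, so return None instead.
        return None
    return int(raw_year)  # 3-digit ancient years and 4-digit standard years

assert parse_ufocat_year("1966") == 1966
assert parse_ufocat_year("034") == 34     # 34 AD, a legitimate ancient date
assert parse_ufocat_year("19") is None    # year unknown, not 19 AD
```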

Applied Resolution

After manual review of the annotated analysis dataset, 714 records were corrected or nulled via Fixes 15–18 in rebuild_db.py. Every correction is logged in the date_correction audit table with the original date, corrected date, correction type, and reason. See GitHub issue #1.

Fix | Source | Action | Count
Fix 15 | UFOCAT | Century-only 19// → NULL (year unknown) | 692
Fix 16 | UFOCAT | H-BOMB TEST 195// → NULL (1950s, not 195 AD) | 1
Fix 17 | NUFORC | Data entry errors corrected (0205 → 2005, 1721 → 2021) | 2
Fix 18 | UPDB | Mangled years corrected/nulled (0196 → 1962, rest → NULL) | 19

Conservative approach: only records with clear evidence were corrected. Ambiguous records (188// in UFOCAT, 1071 in NUFORC) were left unchanged. All fixes are idempotent and re-applied on each database rebuild.

Deduplication Methodology

Deduplication uses a two-phase strategy: known overlaps are eliminated at import time, then a three-tier matching engine flags remaining cross-source duplicates for review. No records are deleted — all 614,505 sightings remain, with 126,730 candidate pairs stored in the duplicate_candidate table.

Phase 1: Import-Time Filtering

Before deduplication runs, two aggregator sources skip sub-sources that would create known duplicates with higher-quality originals already imported:

Source | Sub-Source Skipped | Records Skipped | Reason
UFOCAT | SOURCE=UFOReportCtr | 123,304 | Copies of NUFORC sightings
UPDB | name=MUFON | 131,506 | MUFON imported directly with richer descriptions
UPDB | name=NUFORC | 1,689,235 | NUFORC imported directly with richer descriptions

This eliminates 1,944,045 known duplicates before dedup begins, reducing the working set from ~2.56M to 614,505. The UFOCAT skip triggers enrichment to preserve valuable Hynek/Vallee metadata.

Other overlapping sub-sources (e.g. UFOCAT's Hatch records vs UFO-search's Hatch records) are kept and handled by the dedup engine, since both copies may carry unique metadata.

Phase 1.5: Metadata Enrichment

UFOCAT's 123K skipped UFOReportCtr records carry Hynek and Vallee classifications that NUFORC natively lacks. Rather than lose this data, import_ufocat.py writes skipped records to a sidecar file (ufocat_enrichment.jsonl), and enrich.py transfers the metadata to matching NUFORC sightings post-import.

Matching: Date (YYYY-MM-DD) + normalized UPPER(city) + UPPER(state). City normalization strips parenthetical qualifiers, trailing punctuation, and collapses whitespace.

Transfer rules: Only fills NULL fields — never overwrites existing NUFORC values.

Field | NUFORC Records Enriched
Hynek classification | 102,554
Vallee classification | 83,710
Shape | 1,697
Unmatched (no NUFORC hit) | 19,637
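The date + normalized city + state match key can be sketched in Python. normalize_city here is an assumption modeled on the normalization rules above (strip parentheticals and trailing punctuation, collapse whitespace, uppercase), not the pipeline's exact implementation:

```python
import re

def normalize_city(city: str) -> str:
    city = re.sub(r"\([^)]*\)", "", city)         # strip parenthetical qualifiers
    city = re.sub(r"[.,;:]+$", "", city.strip())  # strip trailing punctuation
    city = re.sub(r"\s+", " ", city)              # collapse whitespace
    return city.strip().upper()

def match_key(date: str, city: str, state: str) -> tuple:
    """Join key used to pair enrichment sidecar rows with NUFORC sightings."""
    return (date, normalize_city(city), (state or "").upper())

assert match_key("1997-03-13", "Phoenix (near)", "az") == ("1997-03-13", "PHOENIX", "AZ")
```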

Phase 2: Three-Tier Cross-Source Matching

After all imports and enrichment, the dedup engine (dedup.py) compares records across different sources using progressively broader matching strategies. Each tier builds on the previous, skipping pairs already flagged.

Tier 1: MUFON ↔ NUFORC (7,694 pairs)

The highest-overlap pair. Both sources cover modern US sightings with reliable date/location data.

  • Match key: Exact date (YYYY-MM-DD) + UPPER(city) + UPPER(state)
  • Scoring: Full description similarity with source-specific preprocessing
  • Result: 7,694 candidate pairs

Tier 2: All Remaining Cross-Source Pairs (101,879 pairs)

Four sub-tiers cover every remaining source combination, using the match key best suited to each source's location data quality:

Sub-tier | Sources | Match Key | Why This Key | Pairs
2a | MUFON ↔ UFOCAT | date + city + state | Both have structured state fields | 2,295
2b | NUFORC ↔ UFOCAT | date + city + state | Both have structured state fields | 4,148
2c | UPDB ↔ MUFON/NUFORC/UFOCAT | date + city (no state) | UPDB has inconsistent state data | 63,459
2d | UFO-search ↔ MUFON/NUFORC/UFOCAT | date + city + state | UFO-search locations parsed via regex | 31,977

Source-specific notes:

  • UFOCAT cities are stored in raw_text (ALL CAPS), not city — the loader reads raw_text instead
  • UFO-search locations are free-text strings parsed by regex to extract (city, state) pairs; only locations matching the City, ST pattern with a valid US/Canadian state code are matchable
  • UPDB sub-tier (2c) filters to US records only (country='US') to reduce false positives from city-only matching
  • All candidate pairs are normalized so sighting_id_a < sighting_id_b to enforce the UNIQUE constraint

Tier 3: Description Fuzzy Matching (17,157 pairs)

Catches duplicates that Tiers 1–2 miss due to location data differences (misspellings, missing state, different geocoding).

  • Match key: Date only (no location requirement)
  • Scope: Only dates with records from 2+ sources AND ≤20 total records on that date
  • Skip: Pairs already found in Tiers 1–2 are excluded
  • Two-stage filtering:
    1. Token Jaccard > 0.25 — Fast set-intersection filter on lowercased word tokens
    2. SequenceMatcher ≥ 0.5 — Python's difflib.SequenceMatcher on the first 1,000 characters
  • Result: 17,157 candidates from cross-source pairs sharing a date but not caught by location matching
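The two-stage filter can be sketched with the standard library (the thresholds come from the list above; the helper names are illustrative):

```python
from difflib import SequenceMatcher

JACCARD_MIN = 0.25
RATIO_MIN = 0.5

def token_jaccard(a: str, b: str) -> float:
    """Set overlap of lowercased word tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def is_fuzzy_duplicate(desc_a: str, desc_b: str) -> bool:
    if token_jaccard(desc_a, desc_b) <= JACCARD_MIN:
        return False  # cheap rejection before the expensive alignment
    ratio = SequenceMatcher(None, desc_a[:1000], desc_b[:1000]).ratio()
    return ratio >= RATIO_MIN

a = "Three bright orange lights in a triangle formation moving slowly north"
b = "Three bright orange lights in a triangle formation moved slowly northward"
assert is_fuzzy_duplicate(a, b)
assert not is_fuzzy_duplicate(a, "Single white flash in the sky")
```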

Similarity Scoring

Every candidate pair receives a similarity score (0.0–1.0) computed by compute_similarity():

  1. Source-specific preprocessing:
    • NUFORC: Strips NUFORC UFO Sighting NNNNN prefix
    • MUFON: Strips Submitted by razor via e-mail boilerplate, extracts investigator notes
  2. "Starts with" shortcut: If both descriptions share the same first N characters (N ≥ 20), score = 0.95
  3. Token Jaccard pre-filter: If token Jaccard < 0.03, return that score immediately
  4. Full alignment: difflib.SequenceMatcher on first 1,000 characters of each description

Pairs with no description on either side receive score = 0.0 (still flagged as candidates based on location matching).
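Putting the four steps together, a simplified Python sketch (the source-specific preprocessing from step 1 is omitted; the constants follow this page's description, and this is not the pipeline's exact compute_similarity):

```python
from difflib import SequenceMatcher

def compute_similarity(desc_a: str, desc_b: str, prefix_n: int = 20) -> float:
    if not desc_a or not desc_b:
        return 0.0  # no description on either side
    # "Starts with" shortcut: identical opening characters score 0.95
    if desc_a[:prefix_n] == desc_b[:prefix_n]:
        return 0.95
    # Token Jaccard pre-filter: bail out early on clearly unrelated text
    ta, tb = set(desc_a.lower().split()), set(desc_b.lower().split())
    jaccard = len(ta & tb) / len(ta | tb) if ta and tb else 0.0
    if jaccard < 0.03:
        return jaccard
    # Full alignment on the first 1,000 characters
    return SequenceMatcher(None, desc_a[:1000], desc_b[:1000]).ratio()

assert compute_similarity("", "anything") == 0.0
assert compute_similarity("A" * 25, "A" * 25 + " tail") == 0.95
```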

Results

126,730 duplicate candidate pairs across 127,440 unique sightings (20.7% of all records).

Confidence | Score Range | Pairs | Interpretation
Certain | 0.9 – 1.0 | 14,260 | Near-identical descriptions; safe to auto-merge
Likely | 0.7 – 0.9 | 9,567 | Strong match; minor wording differences
Possible | 0.5 – 0.7 | 13,303 | Same event reported differently across sources
Weak | 0.3 – 0.5 | 11,144 | Same date+location, descriptions partially overlap; needs review
Unlikely | 0.0 – 0.3 | 78,456 | Same date+location but likely different events

By Match Method

Method | Pairs | Avg Score
tier2c_updb_ufocat | 59,620 | 0.225
tier2d_ufosearch_ufocat | 31,439 | 0.240
tier3_desc_fuzzy | 17,157 | 0.768
tier1a_mufon_nuforc | 7,694 | 0.226
tier2b_nuforc_ufocat | 4,148 | 0.129
tier2c_updb_nuforc | 3,519 | 0.234
tier2a_mufon_ufocat | 2,295 | 0.072
tier2d_ufosearch_nuforc | 397 | 0.044
tier2c_updb_mufon | 320 | 0.012
tier2d_ufosearch_mufon | 141 | 0.009

Note: The previous build flagged 242K candidates. The current build flags only 126K because the 123K UFOCAT-NUFORC duplicates (UFOReportCtr) are now prevented at import time rather than flagged after the fact.

What Dedup Does NOT Do

  • No records are deleted or merged. The duplicate_candidate table is advisory. All 614,505 sightings remain queryable.
  • No within-source dedup. The engine only flags cross-source pairs (different source_db_id). Duplicates within a single source are not flagged.
  • No transitive closure. If A↔B and B↔C are both flagged, A↔C is NOT automatically inferred. Each pair is independent.
  • Multiple witnesses are preserved. If the same event has genuinely separate witness reports in different sources, both records remain. The similarity score distinguishes true duplicates (high score) from independent reports of the same event (low score).

Database Schema

sighting (614,505 rows, 42 columns)

The main table. Each row is one reported sighting event.

Category | Fields
Provenance | source_db_id, source_record_id, origin_id, origin_record_id
Dates | date_event (ISO 8601), date_event_raw, date_end, time_raw, timezone, date_reported, date_posted
Location | location_id (FK to location table)
Description | summary, description
Observation | shape, color, size_estimated, angular_size, distance, duration, duration_seconds, num_objects, num_witnesses, sound, direction, elevation_angle, viewed_from
Witness | witness_age, witness_sex, witness_names
Classification | hynek, vallee, event_type, svp_rating
Resolution | explanation, characteristics
Context | weather, terrain, source_ref, page_volume, notes
Preservation | raw_json — complete original record as JSON

Supporting Tables

  • location — Deduplicated locations with raw_text, city, county, state, country, region, latitude, longitude
  • source_collection (3 rows) — Top-level provenance grouping:
    • PUBLIUS — Compiled by Publius from original reporting sites and PhenomAInon downloads (MUFON, NUFORC, UPDB)
    • GELDREICH — Rich Geldreich's Majestic Timeline compilation from 19+ historical sources (UFO-search)
    • UFOCAT — CUFOS UFOCAT catalog, independent academic dataset
  • source_database (5 rows) — UFOCAT, NUFORC, MUFON, UPDB, UFO-search. Each linked to a collection via collection_id
  • source_origin (31 rows) — Upstream sources within aggregator databases (Blue Book, NICAP, Hatch, etc.)
  • duplicate_candidate (126,730 rows) — Flagged duplicate pairs with similarity scores

Reproducible Build Pipeline

The entire database can be rebuilt from source files with a single command:

python rebuild_db.py

This runs the full pipeline in order:

  1. Create schema (create_schema.py)
  2. Import all 5 sources (UFOCAT with enrichment sidecar, NUFORC, MUFON, UPDB, UFO-search)
  3. Apply data quality fixes — 14 fix categories covering dates, locations, shapes, classifications, and descriptions
  4. Geocode locations using GeoNames gazetteer (geocode.py)
  5. Run enrichment (enrich.py)
  6. Run three-tier deduplication (dedup.py)
  7. Copy database to explorer (ufo-explorer/ufo_unified.db)

Total build time: ~2 minutes.

Test Suite

The pipeline is validated by 364 automated tests (pytest tests/):

  • 115 dedup tests (test_dedup.py) — All dedup functions, tier logic, similarity scoring, and edge cases
  • 114 ETL tests (test_etl.py) — Schema creation, all 5 importers, data fix pipeline
  • 135 data quality tests (test_data_quality.py) — Shape normalization, date validation, classification cleanup, description fixes, historic date corrections, and audit trail

All tests use an in-memory SQLite database with synthetic data — no production data required.

Geocoding

Only UFOCAT provides latitude/longitude coordinates natively. The other four sources have text-only locations (city, state, country). To enable map visualization for all sources, locations are geocoded using the GeoNames cities15000 gazetteer (33,000+ cities with population > 15,000).

Matching strategy (decreasing specificity):

  1. Exact: UPPER(city) + state + country → highest confidence
  2. City + country: No state available (e.g., UPDB) → picks largest matching city by population
  3. City only: No country available (e.g., UFO-search raw text) → picks largest city globally
  4. Raw text parsing: Regex extraction of city/state/country from free-text location strings

The geocode_src column on the location table tracks provenance: NULL = coordinates from original source data, geonames_exact / geonames_city_country / geonames_city_only = geocoded via GeoNames.
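The tiered lookup can be illustrated over a toy gazetteer (the real pipeline reads the GeoNames cities15000 file; the data and function name here are made up for illustration):

```python
# Each toy gazetteer entry: (city, state, country, population, lat, lon)
GAZETTEER = [
    ("SPRINGFIELD", "IL", "US", 114_000, 39.80, -89.64),
    ("SPRINGFIELD", "MO", "US", 169_000, 37.22, -93.29),
    ("PHOENIX", "AZ", "US", 1_600_000, 33.45, -112.07),
]

def geocode(city, state=None, country=None):
    c = city.upper()
    exact = [g for g in GAZETTEER if g[0] == c and g[1] == state and g[2] == country]
    if exact:
        return exact[0][4:], "geonames_exact"
    by_country = [g for g in GAZETTEER if g[0] == c and g[2] == country]
    if by_country:
        # No state available: pick the largest matching city by population
        best = max(by_country, key=lambda g: g[3])
        return best[4:], "geonames_city_country"
    anywhere = [g for g in GAZETTEER if g[0] == c]
    if anywhere:
        best = max(anywhere, key=lambda g: g[3])
        return best[4:], "geonames_city_only"
    return None, None  # unresolvable: prefer silence over a wrong coordinate

assert geocode("Springfield", "IL", "US")[1] == "geonames_exact"
assert geocode("Springfield", country="US")[0] == (37.22, -93.29)  # largest by population
assert geocode("Atlantis")[1] is None
```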

Source Collection Provenance

Each source database belongs to a collection — a top-level grouping that identifies who aggregated or curated the data:

Collection | Source(s) | Records | Description
PUBLIUS | MUFON, NUFORC, UPDB | 362,646 | Compiled by Publius from original reporting sites (MUFON, NUFORC) and PhenomAInon downloads (UPDB sub-sources: UFODNA, Blue Book, NICAP, UKTNA, CANADAGOV, NIDS, BRAZILGOV, SKINWALKER, PILOTS, BAASS)
UFOCAT | UFOCAT | 197,108 | CUFOS academic catalog (2023 release)
GELDREICH | UFO-search | 54,751 | Rich Geldreich's Majestic Timeline from 19+ historical compilations

Collections are filterable in the explorer UI. The three-layer provenance model (source_collection → source_database → source_origin) traces every record back to its ultimate origin.

How Sightings Get Mapped

The Observatory map renders 396,158 markers out of the 614,505 total sightings — roughly 64.5% of the database. The other ~218k sightings exist in the DB but have no coordinates, so they never reach the map.

The sighting ↔ location split

The sighting table and the location table are joined through a foreign key: sighting.location_id → location.id. Multiple sightings can share a single location row — for example, Phoenix, AZ is one row in location, but hundreds of Phoenix Lights sightings all point at it through their location_id. This is the key to understanding why two different numbers show up in the UI:

Query | Count | Meaning
COUNT(*) FROM sighting | 614,505 | Total sightings in the database
COUNT(*) FROM sighting s JOIN location l ON s.location_id = l.id WHERE l.latitude IS NOT NULL | 396,158 | Sightings on the map — the "mapped" chip in the stats badge
COUNT(*) FROM location WHERE latitude IS NOT NULL | 105,854 | Distinct geocoded places (one row per unique coordinate pair)

The ratio 396,158 / 105,854 ≈ 3.74 sightings per place reflects the long tail of major UFO hotspots: a few hundred cities accumulate dozens to hundreds of sightings each, while the bulk of the location table is rural or historical one-offs.
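The join semantics are easy to verify on a toy in-memory SQLite database using the same table and column names as the schema above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE location (id INTEGER PRIMARY KEY, latitude REAL, longitude REAL);
CREATE TABLE sighting (id INTEGER PRIMARY KEY, location_id INTEGER REFERENCES location(id));
-- one geocoded place shared by two sightings, plus one place with no coordinates
INSERT INTO location VALUES (1, 33.45, -112.07), (2, NULL, NULL);
INSERT INTO sighting VALUES (10, 1), (11, 1), (12, 2);
""")

total = con.execute("SELECT COUNT(*) FROM sighting").fetchone()[0]
mapped = con.execute("""
    SELECT COUNT(*) FROM sighting s
    JOIN location l ON s.location_id = l.id
    WHERE l.latitude IS NOT NULL
""").fetchone()[0]
assert (total, mapped) == (3, 2)  # two sightings share the one geocoded place
```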

Why ~35% of sightings have no coordinates

The 218,347 unmapped sightings fall into four rough buckets:

  • Pre-GPS historical records — UFOCAT and UFO-search contain thousands of entries dating back to antiquity. The original catalogs often record only a country or region ("Ohio, 1952"), which the ETL can't resolve to a single coordinate pair.
  • Free-text locations — "my backyard", "en route to LAX", "over the Bermuda Triangle", "somewhere in the Pacific". The geocoder skips these rather than risk a wrong coordinate.
  • Ambiguous city names — "Springfield" without a state qualifier hits dozens of candidates. The ETL prefers silence over a wrong guess.
  • Structurally missing data — some source records legitimately have no location field at all (the original witness report didn't include one).

Original coordinates vs GeoNames lookup

The location table's geocode_src column distinguishes two provenance paths:

  • Original coordinates (geocode_src IS NULL) — latitude/longitude came directly from the source catalog. UFOCAT ships these for most records; MUFON and NUFORC ship them when the witness submitted a specific address.
  • GeoNames lookup (geocode_src = 'geonames_*') — the ETL resolved a city/state/country string against the open GeoNames gazetteer. Three granularity levels are tracked: geonames_exact (city + state + country matched), geonames_city_country (state-level ambiguity resolved by picking the largest city in the country), and geonames_city_only (no country at all; picks the globally largest city with that name).

The Observatory map treats both sources identically — once a row has a valid latitude/longitude, it's a marker. The geocoded_original and geocoded_geonames counts in the stats popover let you see the split.

Movement + Quality Classification

v0.8.3b added a set of derived columns on the sighting table that enrich each record with structured analysis extracted from its narrative. These are the columns the Observatory rail filters (Data Quality) and the Timeline/Insights dashboards (Quality Score Distribution, Movement Taxonomy, etc.) read from.

Movement categories (movement_categories, has_movement_mentioned)

A narrative text classifier scans each description for references to 10 movement categories. The classifier emits a JSON array of the categories it found, plus a boolean flag indicating any movement at all. 249,217 sightings (40.5% of the database) carry at least one movement tag.

Category | Sightings | Example narrative patterns
vanished | 102,178 | "disappeared", "faded out", "winked out", "vanished into thin air"
hovering | 89,964 | "stationary", "stayed in place for several minutes", "hovered silently"
followed | 51,499 | "followed the car", "paced the aircraft", "tracked us for miles"
descending | 27,148 | "came down", "dropped altitude", "descended toward the field"
ascending | 27,067 | "rose straight up", "shot skyward", "climbed at an impossible rate"
landed | 26,592 | "touched down", "on the ground", "set down in the clearing"
accelerating | 22,627 | "suddenly sped up", "took off at incredible speed", "bolted"
linear | 21,106 | "flew straight", "on a direct heading", "steady course"
rotating | 19,641 | "spinning", "rotating counter-clockwise", "wobbling as it turned"
erratic | 8,877 | "zig-zagging", "erratic motion", "changed direction abruptly"

Categories are not mutually exclusive — a single sighting can carry multiple tags (e.g. "hovered briefly, then accelerated away" gets both hovering and accelerating). The Observatory's Movement cluster uses OR semantics: a sighting matches if any of the checked categories' bits are set.

Under the hood, the category set is bit-packed into a uint16 on the binary wire format so the client can filter ~396k rows in a few milliseconds without a server round-trip.
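A sketch of that bit-packing and the OR-semantics filter (the bit order here is an assumption; only the category names come from the table above):

```python
MOVEMENT_CATEGORIES = [
    "vanished", "hovering", "followed", "descending", "ascending",
    "landed", "accelerating", "linear", "rotating", "erratic",
]
BIT = {name: 1 << i for i, name in enumerate(MOVEMENT_CATEGORIES)}

def pack(categories):
    """Fold a category list into a single integer mask (fits in a uint16)."""
    mask = 0
    for c in categories:
        mask |= BIT[c]
    return mask

def matches_any(row_mask, checked):
    # OR semantics: a row matches if any checked category's bit is set
    return (row_mask & pack(checked)) != 0

row = pack(["hovering", "accelerating"])
assert matches_any(row, ["accelerating"])
assert not matches_any(row, ["erratic", "landed"])
```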

Quality score (quality_score)

A composite 0–100 integer derived from the richness of structured metadata on each row. Higher means more data you can cross-reference, not necessarily "more credible". The rebalanced v0.8.3b formula weights:

  • Date precision (0–25 points) — full ISO date beats year-only beats decade-only
  • Location specificity (0–25 points) — city + state + country beats country-only beats region
  • Shape classification (0–15 points) — populated standardized_shape (one of the 25 canonical shapes) beats raw shape text beats NULL
  • Witness count (0–15 points) — multi-witness reports score higher than single-witness
  • Source reliability (0–10 points) — investigator-vetted sources (MUFON) score higher than self-reported (NUFORC)
  • Narrative presence (0–10 points) — has_description = 1 adds 10; media attachments add another bump

The Observatory's "High quality only" toggle filters to quality_score ≥ 60, which corresponds to 118,320 sightings (19.3% of the database). The threshold was calibrated against a hand-reviewed training set — below 60 the records are typically missing at least one major dimension (no date, no location, or no shape). The Insights tab's Quality Score Distribution chart shows the shape of the distribution; the 60+ buckets are highlighted in the accent colour.
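An illustrative re-implementation of the bucket weights (only the bucket maxima — date 0–25, location 0–25, shape 0–15, witnesses 0–15, source 0–10, narrative 0–10 — come from this page; the scoring inside each bucket is an assumption):

```python
def quality_score(rec: dict) -> int:
    score = 0
    # Date precision (0-25): full ISO date beats year-only beats nothing
    d = rec.get("date_event") or ""
    score += 25 if len(d) == 10 else (10 if len(d) >= 4 else 0)
    # Location specificity (0-25): city + state beats country-only
    score += 25 if rec.get("city") and rec.get("state") else (10 if rec.get("country") else 0)
    # Shape (0-15): canonical standardized_shape beats raw shape text
    score += 15 if rec.get("standardized_shape") else (5 if rec.get("shape") else 0)
    # Witness count (0-15): multi-witness scores higher than single-witness
    score += 15 if (rec.get("num_witnesses") or 0) > 1 else (5 if rec.get("num_witnesses") else 0)
    # Source reliability prior (0-10)
    score += {"MUFON": 10, "NUFORC": 5}.get(rec.get("source"), 0)
    # Narrative presence (0-10)
    score += 10 if rec.get("description") else 0
    return score  # 0-100

rich = {"date_event": "1997-03-13", "city": "Phoenix", "state": "AZ",
        "standardized_shape": "Triangle", "num_witnesses": 12,
        "source": "MUFON", "description": "V-shaped formation of lights..."}
assert quality_score(rich) == 100
assert quality_score({}) == 0
```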

Hoax likelihood (hoax_likelihood)

A 0–100 integer estimating the probability a record is a hoax, prank, or misidentification, based on patterns like:

  • Obvious hoax language in the narrative ("April Fools", "just kidding", "for Halloween")
  • Shapes that correlate strongly with known-hoax submissions (Chinese lanterns tagged as "triangle formation")
  • Date/location collisions with known viral hoaxes or film releases
  • Source reliability priors (some aggregator sub-sources have high hoax rates)

The "Hide likely hoaxes" toggle in the Quality rail filters to hoax_likelihood ≤ 50. The Insights tab's Hoax Likelihood Curve shows the distribution; the right tail (80–100) is red-shifted so "likely hoax" is visually distinct from "likely genuine". Like the quality score, this is a heuristic filter, not a verdict — no records are ever deleted, just de-emphasised.
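How the two heuristic toggles compose as row filters can be sketched as follows (thresholds are the documented ≥ 60 and ≤ 50; the function and field names are illustrative):

```python
# Illustrative sketch: composing the "High quality only" and
# "Hide likely hoaxes" toggles as row predicates.
def passes_toggles(row: dict, high_quality_only: bool, hide_likely_hoaxes: bool) -> bool:
    if high_quality_only and row["quality_score"] < 60:
        return False
    if hide_likely_hoaxes and row["hoax_likelihood"] > 50:
        return False
    return True

rows = [
    {"quality_score": 72, "hoax_likelihood": 10},
    {"quality_score": 55, "hoax_likelihood": 10},   # fails quality toggle
    {"quality_score": 90, "hoax_likelihood": 85},   # fails hoax toggle
]
kept = [r for r in rows if passes_toggles(r, True, True)]
assert kept == [{"quality_score": 72, "hoax_likelihood": 10}]
```

Note the toggles only hide rows from the current view; nothing is removed from the database.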

Richness score (richness_score)

A companion to quality_score. Where quality asks "how much structured data do we have", richness asks "how much narrative detail do we have". Higher richness means more distinct observation words (colours, durations, sound descriptions, object count, witness count, reaction details). It's the score that tells you how readable a record is going to be, not how verifiable it is. Also a 0–100 integer; no explicit threshold filter in the UI but it's available via the binary wire format for future use.

Primary color

primary_color — one of 22 colours (red, blue, metallic silver, etc.) extracted from the narrative via keyword matching. Populated for 145,209 sightings. Available as a dropdown in the Observatory filter bar.

Emotion & Sentiment Analysis (v0.11)

In v0.11, the science team ran four models against all 502,985 sightings with narrative text, replacing the earlier 8-class keyword classifier with transformer-based analysis. All models were run offline on the full private corpus; the results ship as 12 derived columns in the public database.

Models

Model · Type · Output · Coverage
RoBERTa (cardiffnlp/twitter-roberta-base-sentiment-latest) · 3-class sentiment · positive / negative / neutral + confidence scores · 502,985
RoBERTa (j-hartmann/emotion-english-distilroberta-base) · 7-class emotion · anger, disgust, fear, joy, neutral, sadness, surprise · 502,985
GoEmotions (SamLowe/roberta-base-go_emotions) · 28-class emotion · admiration, amusement, anger, annoyance, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realization, relief, remorse, sadness, surprise, neutral · 502,985
VADER (rule-based) · compound sentiment · score from -1.0 (most negative) to +1.0 (most positive) · 502,985

Why four models?

No single sentiment model captures the nuance of UFO sighting narratives. VADER is fast and rule-based but misses sarcasm and context. RoBERTa sentiment gives a reliable positive/negative/neutral split but lacks granularity. The 7-class RoBERTa emotion model distinguishes fear from sadness from surprise — crucial for UFO reports where fear and awe often coexist. GoEmotions' 28-class taxonomy catches subtler states like "curiosity", "confusion", and "realization" that are common in witness accounts. Running all four lets the Insights tab cross-reference models and surface patterns that no single classifier would catch.

Coverage and neutrality

86.7% of sightings are classified as "neutral" by GoEmotions. This is expected — most reports are factual descriptions ("I saw a light at 10pm heading north"). The Insights tab's GoEmotions card hides neutral by default (toggle available) so the 13.3% with non-neutral emotion are legible.

Derived columns

The 12 new columns on the sighting table:

Column · Type · Description
emotion_28_dominant · VARCHAR · Top GoEmotions label (e.g., "fear", "curiosity")
emotion_28_group · VARCHAR · Sentiment group derived from GoEmotions (positive/negative/neutral/ambiguous)
emotion_28_scores · JSONB · Full 28-class probability vector
emotion_7_dominant · VARCHAR · Top 7-class RoBERTa emotion label
emotion_7_scores · JSONB · Full 7-class probability vector
vader_compound · REAL · VADER compound score (-1 to +1)
vader_pos · REAL · VADER positive proportion
vader_neg · REAL · VADER negative proportion
vader_neu · REAL · VADER neutral proportion
roberta_sentiment · VARCHAR · RoBERTa 3-class label (positive/negative/neutral)
roberta_positive · REAL · RoBERTa positive confidence
roberta_negative · REAL · RoBERTa negative confidence

All emotion columns are packed into the 40-byte binary bulk buffer for client-side rendering. VADER compound and RoBERTa sentiment scores are scaled from [-1, +1] to [0, 255] uint8 via round((v+1)*127.5).
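The documented scaling and an approximate inverse (the inverse is an assumption for illustration; the site's decoder may differ) look like this:

```python
# Sketch of the documented [-1, +1] -> [0, 255] uint8 scaling,
# round((v + 1) * 127.5), plus an approximate inverse for display.
def to_uint8(v: float) -> int:
    return round((v + 1.0) * 127.5)

def from_uint8(b: int) -> float:
    return b / 127.5 - 1.0

assert to_uint8(-1.0) == 0
assert to_uint8(1.0) == 255
```

The quantisation step is 2/255 ≈ 0.008, so a round-trip through uint8 loses at most about half of that in compound-score precision, which is well below the noise floor of the models themselves.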

Replicating the analysis

The transformer analysis can be replicated independently:

  1. Obtain the raw narrative text from the source databases (NUFORC, MUFON, etc.)
  2. Run the three HuggingFace models against the text (model IDs listed in the table above)
  3. Run VADER (vaderSentiment Python package) against the same text
  4. Join results to the UFOSINT sighting table by source_record_id

The models are deterministic given the same input text and model weights. Minor version differences in the transformer libraries may produce slightly different confidence scores but the dominant labels should be stable.
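Step 4 of the recipe above can be sketched as a plain dict join keyed on source_record_id (field names other than source_record_id are assumptions for the example):

```python
# Illustrative sketch of step 4: joining model outputs back onto the
# sighting table by source_record_id.
sightings = [
    {"source_record_id": "NUFORC-001", "shape": "triangle"},
    {"source_record_id": "NUFORC-002", "shape": "light"},
]
model_results = {
    "NUFORC-001": {"vader_compound": -0.42, "emotion_7_dominant": "fear"},
    "NUFORC-002": {"vader_compound": 0.10, "emotion_7_dominant": "neutral"},
}
joined = [{**row, **model_results.get(row["source_record_id"], {})}
          for row in sightings]

assert joined[0]["emotion_7_dominant"] == "fear"
```

At corpus scale you would do the same join in SQL or pandas, but the keying logic is identical: rows without a matching model result simply keep their original columns.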

Notes on the Current Build

  • Raw narrative text is not in the public database. The description, summary, notes, and raw_json columns were stripped from the public export for privacy. All derived columns (quality scores, movement categories, emotion classifications) were computed from the private corpus before stripping and ship as structured fields.
  • Deduplication is applied at ingest time. Known duplicates (1.94M records from UFOCAT-NUFORC and UPDB-MUFON/NUFORC overlaps) are excluded during import. The historical three-tier dedup pipeline documented above produced 126,730 candidate pairs in earlier builds; these are no longer materialized in the current build.
  • This is a scientific analysis, not an editorial product. No records are deleted, ranked, or editorially curated. Quality scores, hoax flags, and emotion labels are algorithmic outputs with known limitations. The coverage strips on the Insights tab show exactly what percentage of the visible dataset each metric covers.