Connect your own AI to the UFOSINT data
Every tool the website's chatbot has access to is also exposed via the Model Context Protocol at a single HTTPS endpoint, so any MCP-compatible AI client can query the unified UFO sightings database with your own model and your own subscription. The endpoint is read-only and free to use.
MCP endpoint
https://ufosint-explorer.azurewebsites.net/mcp
6 tools available: search_sightings, get_sighting, get_stats, get_timeline, find_duplicates_for, count_by.
Claude Code (CLI / Desktop App)
One command to connect from any project directory:
claude mcp add --transport http ufosint https://ufosint-explorer.azurewebsites.net/mcp
Restart your Claude Code session. The 6 UFOSINT tools will be available immediately.
Remove later with claude mcp remove ufosint.
Claude Desktop
Open ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows) and add:
{
  "mcpServers": {
    "ufosint": {
      "url": "https://ufosint-explorer.azurewebsites.net/mcp",
      "transport": "http"
    }
  }
}
Restart Claude Desktop. The 6 UFOSINT tools will appear in the tools panel.
Cursor / Cline / Continue / Windsurf
These all support remote MCP servers. Add the same URL to your client's MCP configuration. Each client documents the exact location, but the JSON shape is the same.
Direct API (curl / Python / any HTTP client)
The endpoint is JSON-RPC 2.0 over HTTPS. List the tools:
curl -s https://ufosint-explorer.azurewebsites.net/mcp \
-H 'Content-Type: application/json' \
-d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}'
Call a tool:
curl -s https://ufosint-explorer.azurewebsites.net/mcp \
-H 'Content-Type: application/json' \
-d '{
  "jsonrpc":"2.0",
  "id":2,
  "method":"tools/call",
  "params":{
    "name":"search_sightings",
    "arguments":{"q":"triangle","state":"CA","limit":5}
  }
}'
OpenAI / OpenRouter function-calling format
If you're integrating with OpenAI or OpenRouter and want the tool definitions in their native format (instead of going through MCP), fetch:
GET https://ufosint-explorer.azurewebsites.net/api/tools-catalog
And invoke individual tools at:
POST https://ufosint-explorer.azurewebsites.net/api/tool/<tool_name>
Content-Type: application/json
{ "q": "triangle", "state": "CA", "limit": 5 }
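For quick experimentation, the two REST endpoints above can be wrapped in a few lines of standard-library Python. This is an illustrative sketch; the helper names `build_tool_request` and `call_tool` are ours, and the response shape is whatever the server returns.

```python
import json
import urllib.request

BASE = "https://ufosint-explorer.azurewebsites.net"

def build_tool_request(tool_name: str, arguments: dict) -> urllib.request.Request:
    """Build a POST request for the per-tool REST endpoint (/api/tool/<tool_name>)."""
    return urllib.request.Request(
        f"{BASE}/api/tool/{tool_name}",
        data=json.dumps(arguments).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def call_tool(tool_name: str, arguments: dict):
    """Invoke a tool and decode the JSON response."""
    with urllib.request.urlopen(build_tool_request(tool_name, arguments)) as resp:
        return json.load(resp)
```

Usage would look like `call_tool("search_sightings", {"q": "triangle", "state": "CA", "limit": 5})`; the tool catalog itself is a plain GET to `/api/tools-catalog`.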
Download the database (SQLite)
Want to run your own analysis, train models, or hack on the data offline? The full 508 MB SQLite snapshot is attached to every tagged release on GitHub — 614,505 deduplicated sightings, 502,985 with emotion analysis, and all derived columns.
curl -LO https://github.com/UFOSINT/ufosint-explorer/releases/latest/download/ufo_public.db
sqlite3 ufo_public.db "SELECT COUNT(*) FROM sighting;"
# 614505
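Once downloaded, the snapshot is an ordinary SQLite file, so it can be queried from Python directly. A minimal sketch, assuming the `sighting` table and `shape` column described in the schema section (the helper name `top_shapes` is ours):

```python
import sqlite3

def top_shapes(conn: sqlite3.Connection, n: int = 5):
    """Most common reported shapes, NULLs excluded, from the `sighting` table."""
    return conn.execute(
        "SELECT shape, COUNT(*) AS cnt FROM sighting "
        "WHERE shape IS NOT NULL GROUP BY shape ORDER BY cnt DESC LIMIT ?",
        (n,),
    ).fetchall()
```

For example, `top_shapes(sqlite3.connect("ufo_public.db"))` returns the five most frequent shape labels with their counts.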
See the Methodology tab for the full schema, derived-column definitions, and per-source licensing. Browse the releases page for older versions.
AI Discovery
This site exposes standard AI-readiness files so agents and LLMs can discover and understand the UFOSINT tools automatically:
- /llms.txt — Lightweight index of the site, tools, and data for LLMs
- /llms-full.txt — Full tool schemas and documentation in one file
- /.well-known/mcp.json — MCP server discovery manifest
- /robots.txt — All AI crawlers allowed
Local stdio MCP server
Prefer to run an MCP server on your own machine? Clone the repo and use mcp_server.py:
git clone https://github.com/UFOSINT/ufosint-explorer
cd ufosint-explorer
pip install fastmcp psycopg[binary]
DATABASE_URL="postgresql://..." python mcp_server.py
Then point Claude Desktop at the local script via the command form of the MCP config. (You'll need read-only credentials to a PostgreSQL database with the UFOSINT schema.)
All access is read-only. Source data is licensed by UFOSINT; the deduplicated database is built by the ufo-dedup pipeline.
Unified UFO Sightings Database — Methodology
This is not raw data. UFOSINT Explorer presents a processed scientific analysis of five major UFO/UAP databases — 614,505 sighting records deduplicated, cross-referenced, quality-scored, movement-classified, and emotion-analyzed using four sentiment and emotion models (three transformer-based plus the rule-based VADER). Every step of the pipeline is documented below and can be independently replicated from the source data using the open-source ufo-dedup pipeline. The web application source code is at ufosint-explorer.
Source Databases
| Source | Raw Records | Imported | Skipped | Description |
|---|---|---|---|---|
| UFOCAT | 320,412 | 197,108 | 123,304 | CUFOS UFOCAT 2023 catalog. Richest metadata: Hynek/Vallee classifications, lat/lon, witness counts, durations. 123K NUFORC-origin records (SOURCE=UFOReportCtr) skipped; metadata transferred via enrichment. |
| NUFORC | 159,320 | 159,320 | 0 | National UFO Reporting Center. Self-reported sightings with detailed free-text descriptions. Enriched post-import with 102K Hynek and 83K Vallee classifications from UFOCAT. |
| MUFON | 138,310 | 138,310 | 0 | Mutual UFO Network case reports. Short + long descriptions, investigator summaries. |
| UPDB | 1,885,757 | 65,016 | 1,820,741 | Unified Phenomena Database (phenomenAInon). 1.82M rows skipped (MUFON/NUFORC already imported from richer originals). Remaining 65K from UFODNA (38K), Blue Book (14K), NICAP (5.8K), etc. |
| UFO-search | 54,751 | 54,751 | 0 | Majestic Timeline compilation from ufo-search.com. Historical records from 19 source compilations (Hatch, Eberhart, NICAP, Vallee, etc.). |
Total raw records across all sources: ~2.56 million. After removing known overlaps at import time: 614,505.
UFOCAT Sub-Source Landscape
UFOCAT is itself an aggregator. Its SOURCE column identifies where each record originated:
| UFOCAT SOURCE | Records | Overlap With |
|---|---|---|
| UFOReportCtr | 123,304 | NUFORC (skipped, enriched) |
| U (Hatch) | 17,184 | UFO-search Hatch (18K) |
| BlueBook1 | 13,101 | UPDB Blue Book (14K) |
| GEberhart1 | 11,643 | UFO-search Eberhart (7.9K) |
| CanadUFOSurv | 10,785 | — |
| NICAP | 2,315 | UPDB NICAP (5.8K), UFO-search NICAP (5.5K) |
| MUFONJournal + MUFON* | 2,861 | MUFON |
Only UFOReportCtr is skipped at import time. Other overlaps are handled by the deduplication engine.
Import Methodology
Each source has a custom import script. Two aggregator sources skip known-duplicate sub-sources at import time:
- UFOCAT skips `SOURCE=UFOReportCtr` (123K NUFORC-origin records)
- UPDB skips `name=MUFON` and `name=NUFORC` (1.82M records)
Source-Specific Handling
- UFOCAT — 55-column CSV with split date fields (YEAR, MO, DAY, TIME). City stored in ALL CAPS in `raw_text`; copied to `city` post-import. Longitude negated for US/CA. `UFOReportCtr` records saved to enrichment sidecar.
- NUFORC — Multi-line CSV with quoted descriptions. Dates: `1995-02-02 23:00 Local`. Locations: `City, ST, Country`.
- MUFON — 7-column CSV with embedded `\n` in dates. Locations with escaped commas: `Newscandia\, MN\, US`.
- UPDB — 1.9M rows; `name` column identifies sub-source. 1,820,741 MUFON/NUFORC rows skipped. Remaining 65,016 mapped to `source_origin` entries.
- UFO-search — JSON array of 54,751 records from 19 historical compilations. Variable date formats ("Summer 1947", "4/34", "6/24/1947"). Regex-based date parser; free-text location parsing.
Data Quality Fixes
Applied automatically by rebuild_db.py in the apply_data_fixes() pipeline. All fixes are idempotent and preserve original values in date_event_raw and raw_json columns.
Location Fixes
- UFOCAT longitude sign — 30,822 Western Hemisphere locations had positive longitude; negated for US/CA records
- UFOCAT city field — 73,766 locations had city only in `raw_text`; copied to `city` column
- Country code normalization — USA→US, United Kingdom→GB, Canada→CA, Australia→AU
Date Fixes
- MUFON literal `\n` in dates — 136,654 MUFON records contained a literal backslash-n (0x5C6E) in `date_event` (e.g., `2020-01-15\n3:00PM`). Time portion extracted to `time_raw`, date truncated to ISO date
- Year 0000 — Records with year `0000` have `date_event` set to NULL
- Negative years — Records with `date_event` starting with `-` (e.g., `-009-02-10`) have `date_event` set to NULL
- Month 00 — 551 records with `YYYY-00-DD` truncated to `YYYY`
- Day 00 — 3,391 records with `YYYY-MM-00` truncated to `YYYY-MM`
- Impossible calendar dates — 14 records with Feb 30+, Apr/Jun/Sep/Nov 31 truncated to `YYYY-MM`
- UFOCAT century-only `19//` — 692 records with 2-digit raw year `19` (meaning "19xx, year unknown") had `date_event` set to NULL. Audit logged in `date_correction` table
- UFOCAT H-BOMB TEST `195//` — 1 record with 3-digit year 195 and city "H-BOMB TEST" (clearly 1950s) set to NULL
- NUFORC data entry errors — 2 records with wrong century corrected: `0205`→`2005` (Falmouth), `1721`→`2021` (Crescent City, Starlink-era sighting)
- UPDB mangled years — 19 records with broken years from upstream UPDB export: 1 corrected (`0196`→`1962`, confirmed by description), 18 set to NULL (century-round years `0200`–`0900` and unconfirmed modern dates)
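As a rough illustration of the date fixes above (not the actual `rebuild_db.py` code), the MUFON literal backslash-n split and the month-00/day-00 truncations might look like:

```python
import re

def fix_mufon_date(date_event: str):
    """Split a MUFON date like '2020-01-15\\n3:00PM' (literal backslash-n)
    into (iso_date, time_raw). Illustrative of the fix described above."""
    if "\\n" in date_event:
        date_part, _, time_part = date_event.partition("\\n")
        return date_part.strip(), time_part.strip() or None
    return date_event, None

def truncate_invalid(date_event: str) -> str:
    """Truncate YYYY-00-DD to YYYY and YYYY-MM-00 to YYYY-MM."""
    m = re.match(r"^(\d{4})-(\d{2})-(\d{2})", date_event)
    if not m:
        return date_event
    y, mo, d = m.groups()
    if mo == "00":
        return y          # month unknown -> keep year only
    if d == "00":
        return f"{y}-{mo}"  # day unknown -> keep year-month
    return date_event
```

Both helpers are idempotent, matching the pipeline's requirement that fixes can be re-applied on every rebuild.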
Shape / Classification Normalization
- Shape case normalization — 24 case-duplicate groups collapsed via title-case (e.g., `circle`→`Circle`), including hyphenated shapes (e.g., `cigar-shaped`→`Cigar-Shaped`). 352→317 distinct values
- Shape typo correction — 9 misspellings corrected (e.g., `Triangel`→`Triangle`, `Rectagle`→`Rectangle`)
- Junk shapes removed — 3 non-shape values set to NULL (`WITNESS`, `0`, `12:45`)
- Hynek uppercase — 3 case-duplicate Hynek codes normalized (e.g., `nl`→`NL`). 43→40 distinct values
- Vallee uppercase — 2 case-duplicate Vallee codes normalized (e.g., `fb1`→`FB1`). 43→41 distinct values
Description Cleanup
- `[MISSING DATA]` removal — Records with description consisting solely of `[MISSING DATA]` or `[missing data]` set to NULL
- MUFON boilerplate — `Submitted by razor via e-mail` and `Investigator Notes:` boilerplate stripped from descriptions
Historic Date Analysis
The database contains 8,046 sighting records with dates before 1901, spanning from 34 AD (a white round object over China) to 1900. Most are legitimate historical sightings from academic catalogs, but several categories of date errors were identified through systematic analysis.
Extraction & Analysis Method
All pre-1901 records were extracted into a standalone analysis database (temp/historic_pre1901.db) using extract_historic.py. Each record was auto-classified based on its source, raw date format, and year digit count, then flagged for manual review where ambiguous.
Categories Identified
| Category | Source | Records | Status | Description |
|---|---|---|---|---|
| ufocat_ancient | UFOCAT | 4,436 | OK | 4-digit raw years (1001–1900). Legitimately pre-modern sightings. No action needed. |
| ufocat_century_only | UFOCAT | 692 | Fixed | 2-digit raw year 19// = “sometime in the 1900s, year unknown.” ETL zero-padded to 0019. Descriptions confirm modern events (abductions, radar, motion pictures). Resolution: date_event set to NULL (year genuinely unknown). |
| ufocat_3digit_review | UFOCAT | 88 | Fixed | 3-digit raw years (034–999). Mostly legitimate ancient dates. 1 confirmed modern mislabel corrected: 195// “H-BOMB TEST” (1950s) → NULL. 4 ambiguous 188// records left as-is (no descriptions to disambiguate). Remaining 83 are legitimate ancient sightings. |
| other_source_review | UFO-search | 1,984 | OK | Geldreich Majestic Timeline. Historical records from 61 AD to 1900. All appear legitimate. |
| other_source_review | UPDB | 780 | Fixed | ~760 legitimate (1000–1900). 19 records had mangled modern years from upstream data errors. Resolution: 1 corrected (0196→1962, confirmed by description), 18 set to NULL (century-round years 0200–0900 and unconfirmed modern dates). |
| other_source_review | MUFON | 40 | OK | All 1890–1900. Appear legitimate. |
| other_source_review | NUFORC | 26 | Fixed | ~23 legitimate historic reports. 2 data entry errors corrected: 0205→2005 (Falmouth), 1721→2021 (Crescent City). 1 ambiguous record (1071) left as-is (could be 1971 or 2007). |
Root Cause: UFOCAT Variable-Length Year Field
UFOCAT stores dates in separate YEAR, MO, DAY columns. The YEAR field uses variable-length encoding:
- 4 digits (196,295 records): Standard years like `1966`, `2001`
- 3 digits (90 records): Ancient years like `034` (34 AD), `776`, `919`
- 2 digits (715 records): Century-only indicator `19` = "19xx" (20th century, unknown year)
The ETL’s parse_ufocat_date() zero-pads all years to 4 digits (f"{y:04d}"), which correctly handles 3-digit ancient years but misinterprets 2-digit 19 as year 19 AD instead of “19xx.”
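A minimal sketch of the bug and a guarded variant (`pad_year` and `parse_ufocat_year` are illustrative names, not the pipeline's actual functions):

```python
def pad_year(y: int) -> str:
    """The ETL's zero-pad: 34 -> '0034' (correct for ancient years),
    but 19 -> '0019' (wrong: UFOCAT's 2-digit 19 means '19xx, unknown')."""
    return f"{y:04d}"

def parse_ufocat_year(raw_year: str):
    """Guarded variant: treat the 2-digit century-only marker as unknown."""
    if len(raw_year) == 2:           # century-only indicator, e.g. '19'
        return None                  # year genuinely unknown -> NULL
    return f"{int(raw_year):04d}"    # 3- and 4-digit years pad correctly
```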
Applied Resolution
After manual review of the annotated analysis dataset, 714 records were corrected or nulled via Fixes 15–18 in rebuild_db.py. Every correction is logged in the date_correction audit table with the original date, corrected date, correction type, and reason. See GitHub issue #1.
| Fix | Source | Action | Count |
|---|---|---|---|
| Fix 15 | UFOCAT | Century-only 19// → NULL (year unknown) | 692 |
| Fix 16 | UFOCAT | H-BOMB TEST 195// → NULL (1950s, not 195 AD) | 1 |
| Fix 17 | NUFORC | Data entry errors corrected (0205→2005, 1721→2021) | 2 |
| Fix 18 | UPDB | Mangled years corrected/nulled (0196→1962, rest → NULL) | 19 |
Conservative approach: only records with clear evidence were corrected. Ambiguous records (188// in UFOCAT, 1071 in NUFORC) were left unchanged. All fixes are idempotent and re-applied on each database rebuild.
Deduplication Methodology
Deduplication uses a two-phase strategy: known overlaps are eliminated at import time, then a three-tier matching engine flags remaining cross-source duplicates for review. No records are deleted — all 614,505 sightings remain, with 126,730 candidate pairs stored in the duplicate_candidate table.
Phase 1: Import-Time Filtering
Before deduplication runs, two aggregator sources skip sub-sources that would create known duplicates with higher-quality originals already imported:
| Source | Sub-Source Skipped | Records Skipped | Reason |
|---|---|---|---|
| UFOCAT | SOURCE=UFOReportCtr | 123,304 | Copies of NUFORC sightings |
| UPDB | name=MUFON | 131,506 | MUFON imported directly with richer descriptions |
| UPDB | name=NUFORC | 1,689,235 | NUFORC imported directly with richer descriptions |
This eliminates 1,944,045 known duplicates before dedup begins, reducing the working set from ~2.56M to 614,505. The UFOCAT skip triggers enrichment to preserve valuable Hynek/Vallee metadata.
Other overlapping sub-sources (e.g. UFOCAT's Hatch records vs UFO-search's Hatch records) are kept and handled by the dedup engine, since both copies may carry unique metadata.
Phase 1.5: Metadata Enrichment
UFOCAT's 123K skipped UFOReportCtr records carry Hynek and Vallee classifications that NUFORC natively lacks. Rather than lose this data, import_ufocat.py writes skipped records to a sidecar file (ufocat_enrichment.jsonl), and enrich.py transfers the metadata to matching NUFORC sightings post-import.
Matching: Date (YYYY-MM-DD) + normalized UPPER(city) + UPPER(state). City normalization strips parenthetical qualifiers, trailing punctuation, and collapses whitespace.
Transfer rules: Only fills NULL fields — never overwrites existing NUFORC values.
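A sketch of that match-key construction, assuming only the normalization rules stated above (the function names are ours, not `enrich.py`'s):

```python
import re

def normalize_city(city: str) -> str:
    """Strip parenthetical qualifiers and trailing punctuation,
    collapse whitespace, uppercase."""
    city = re.sub(r"\([^)]*\)", "", city)          # drop "(near ...)" qualifiers
    city = re.sub(r"[.,;:]+$", "", city.strip())   # trailing punctuation
    city = re.sub(r"\s+", " ", city)               # collapse whitespace
    return city.strip().upper()

def match_key(date_event: str, city: str, state: str) -> tuple:
    """Enrichment join key: ISO date + normalized city + uppercased state."""
    return (date_event, normalize_city(city), state.upper())
```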
| Field | NUFORC Records Enriched |
|---|---|
| Hynek classification | 102,554 |
| Vallee classification | 83,710 |
| Shape | 1,697 |
| Unmatched (no NUFORC hit) | 19,637 |
Phase 2: Three-Tier Cross-Source Matching
After all imports and enrichment, the dedup engine (dedup.py) compares records across different sources using progressively broader matching strategies. Each tier builds on the previous, skipping pairs already flagged.
Tier 1: MUFON ↔ NUFORC (7,694 pairs)
The highest-overlap pair. Both sources cover modern US sightings with reliable date/location data.
- Match key: Exact date (YYYY-MM-DD) + UPPER(city) + UPPER(state)
- Scoring: Full description similarity with source-specific preprocessing
- Result: 7,694 candidate pairs
Tier 2: All Remaining Cross-Source Pairs (101,879 pairs)
Four sub-tiers cover every remaining source combination, using the match key best suited to each source's location data quality:
| Sub-tier | Sources | Match Key | Why This Key | Pairs |
|---|---|---|---|---|
| 2a | MUFON ↔ UFOCAT | date + city + state | Both have structured state fields | 2,295 |
| 2b | NUFORC ↔ UFOCAT | date + city + state | Both have structured state fields | 4,148 |
| 2c | UPDB ↔ MUFON/NUFORC/UFOCAT | date + city (no state) | UPDB has inconsistent state data | 63,459 |
| 2d | UFO-search ↔ MUFON/NUFORC/UFOCAT | date + city + state | UFO-search locations parsed via regex | 31,977 |
Source-specific notes:
- UFOCAT cities are stored in `raw_text` (ALL CAPS), not `city` — the loader reads `raw_text` instead
- UFO-search locations are free-text strings parsed by regex to extract `(city, state)` pairs; only locations matching the `City, ST` pattern with a valid US/Canadian state code are matchable
- UPDB sub-tier (2c) filters to US records only (`country='US'`) to reduce false positives from city-only matching
- All candidate pairs are normalized so `sighting_id_a < sighting_id_b` to enforce the UNIQUE constraint
Tier 3: Description Fuzzy Matching (17,157 pairs)
Catches duplicates that Tiers 1–2 miss due to location data differences (misspellings, missing state, different geocoding).
- Match key: Date only (no location requirement)
- Scope: Only dates with records from 2+ sources AND ≤20 total records on that date
- Skip: Pairs already found in Tiers 1–2 are excluded
- Two-stage filtering:
- Token Jaccard > 0.25 — Fast set-intersection filter on lowercased word tokens
- SequenceMatcher ≥ 0.5 — Python's `difflib.SequenceMatcher` on the first 1,000 characters
- Result: 17,157 candidates from cross-source pairs sharing a date but not caught by location matching
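The two-stage filter above can be sketched as follows; the thresholds come from the list, while the function name is ours:

```python
from difflib import SequenceMatcher

def is_tier3_candidate(desc_a: str, desc_b: str) -> bool:
    """Two-stage Tier-3 filter: a cheap token-Jaccard gate, then
    difflib.SequenceMatcher alignment on the first 1,000 characters."""
    tok_a, tok_b = set(desc_a.lower().split()), set(desc_b.lower().split())
    if not tok_a or not tok_b:
        return False
    jaccard = len(tok_a & tok_b) / len(tok_a | tok_b)
    if jaccard <= 0.25:                    # stage 1: fast set-intersection filter
        return False
    ratio = SequenceMatcher(None, desc_a[:1000], desc_b[:1000]).ratio()
    return ratio >= 0.5                    # stage 2: sequence alignment
```

The cheap Jaccard gate means the expensive alignment only runs on pairs that already share a quarter of their vocabulary.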
Similarity Scoring
Every candidate pair receives a similarity score (0.0–1.0) computed by compute_similarity():
- Source-specific preprocessing:
  - NUFORC: Strips `NUFORC UFO Sighting NNNNN` prefix
  - MUFON: Strips `Submitted by razor via e-mail` boilerplate, extracts investigator notes
- "Starts with" shortcut: If both descriptions share the same first N characters (N ≥ 20), score = 0.95
- Token Jaccard pre-filter: If token Jaccard < 0.03, return that score immediately
- Full alignment: `difflib.SequenceMatcher` on first 1,000 characters of each description
Pairs with no description on either side receive score = 0.0 (still flagged as candidates based on location matching).
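A simplified sketch of the scoring cascade (the real `compute_similarity()` also applies the source-specific preprocessing, which is omitted here):

```python
from difflib import SequenceMatcher

def compute_similarity(desc_a: str, desc_b: str) -> float:
    """Shortcut, Jaccard pre-filter, then full alignment, per the steps above."""
    if not desc_a or not desc_b:
        return 0.0                                # no description on a side
    if len(desc_a) >= 20 and len(desc_b) >= 20 and desc_a[:20] == desc_b[:20]:
        return 0.95                               # "starts with" shortcut
    tok_a, tok_b = set(desc_a.lower().split()), set(desc_b.lower().split())
    jaccard = len(tok_a & tok_b) / max(len(tok_a | tok_b), 1)
    if jaccard < 0.03:
        return jaccard                            # pre-filter: bail out early
    return SequenceMatcher(None, desc_a[:1000], desc_b[:1000]).ratio()
```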
Results
126,730 duplicate candidate pairs across 127,440 unique sightings (20.7% of all records).
| Confidence | Score Range | Pairs | Interpretation |
|---|---|---|---|
| Certain | 0.9 – 1.0 | 14,260 | Near-identical descriptions; safe to auto-merge |
| Likely | 0.7 – 0.9 | 9,567 | Strong match; minor wording differences |
| Possible | 0.5 – 0.7 | 13,303 | Same event reported differently across sources |
| Weak | 0.3 – 0.5 | 11,144 | Same date+location, descriptions partially overlap; needs review |
| Unlikely | 0.0 – 0.3 | 78,456 | Same date+location but likely different events |
By Match Method
| Method | Pairs | Avg Score |
|---|---|---|
| tier2c_updb_ufocat | 59,620 | 0.225 |
| tier2d_ufosearch_ufocat | 31,439 | 0.240 |
| tier3_desc_fuzzy | 17,157 | 0.768 |
| tier1a_mufon_nuforc | 7,694 | 0.226 |
| tier2b_nuforc_ufocat | 4,148 | 0.129 |
| tier2c_updb_nuforc | 3,519 | 0.234 |
| tier2a_mufon_ufocat | 2,295 | 0.072 |
| tier2d_ufosearch_nuforc | 397 | 0.044 |
| tier2c_updb_mufon | 320 | 0.012 |
| tier2d_ufosearch_mufon | 141 | 0.009 |
Note: The previous build flagged 242K candidates. The current build flags only 126K because the 123K UFOCAT-NUFORC duplicates (UFOReportCtr) are now prevented at import time rather than flagged after the fact.
What Dedup Does NOT Do
- No records are deleted or merged. The `duplicate_candidate` table is advisory. All 614,505 sightings remain queryable.
- No within-source dedup. The engine only flags cross-source pairs (different `source_db_id`). Duplicates within a single source are not flagged.
- No transitive closure. If A↔B and B↔C are both flagged, A↔C is NOT automatically inferred. Each pair is independent.
- Multiple witnesses are preserved. If the same event has genuinely separate witness reports in different sources, both records remain. The similarity score distinguishes true duplicates (high score) from independent reports of the same event (low score).
Database Schema
sighting (614,505 rows, 42 columns)
The main table. Each row is one reported sighting event.
| Category | Fields |
|---|---|
| Provenance | source_db_id, source_record_id, origin_id, origin_record_id |
| Dates | date_event (ISO 8601), date_event_raw, date_end, time_raw, timezone, date_reported, date_posted |
| Location | location_id (FK to location table) |
| Description | summary, description |
| Observation | shape, color, size_estimated, angular_size, distance, duration, duration_seconds, num_objects, num_witnesses, sound, direction, elevation_angle, viewed_from |
| Witness | witness_age, witness_sex, witness_names |
| Classification | hynek, vallee, event_type, svp_rating |
| Resolution | explanation, characteristics |
| Context | weather, terrain, source_ref, page_volume, notes |
| Preservation | raw_json — complete original record as JSON |
Supporting Tables
- `location` — Deduplicated locations with `raw_text`, `city`, `county`, `state`, `country`, `region`, `latitude`, `longitude`
- `source_collection` (3 rows) — Top-level provenance grouping:
  - PUBLIUS — Compiled by Publius from original reporting sites and PhenomAInon downloads (MUFON, NUFORC, UPDB)
  - GELDREICH — Rich Geldreich's Majestic Timeline compilation from 19+ historical sources (UFO-search)
  - UFOCAT — CUFOS UFOCAT catalog, independent academic dataset
- `source_database` (5 rows) — UFOCAT, NUFORC, MUFON, UPDB, UFO-search. Each linked to a collection via `collection_id`
- `source_origin` (31 rows) — Upstream sources within aggregator databases (Blue Book, NICAP, Hatch, etc.)
- `duplicate_candidate` (126,730 rows) — Flagged duplicate pairs with similarity scores
Reproducible Build Pipeline
The entire database can be rebuilt from source files with a single command:
python rebuild_db.py
This runs the full pipeline in order:
1. Create schema (`create_schema.py`)
2. Import all 5 sources (UFOCAT with enrichment sidecar, NUFORC, MUFON, UPDB, UFO-search)
3. Apply data quality fixes — 14 fix categories covering dates, locations, shapes, classifications, and descriptions
4. Geocode locations using GeoNames gazetteer (`geocode.py`)
5. Run enrichment (`enrich.py`)
6. Run three-tier deduplication (`dedup.py`)
7. Copy database to explorer (`ufo-explorer/ufo_unified.db`)
Total build time: ~2 minutes.
Test Suite
The pipeline is validated by 364 automated tests (pytest tests/):
- 115 dedup tests (`test_dedup.py`) — All dedup functions, tier logic, similarity scoring, and edge cases
- 114 ETL tests (`test_etl.py`) — Schema creation, all 5 importers, data fix pipeline
- 135 data quality tests (`test_data_quality.py`) — Shape normalization, date validation, classification cleanup, description fixes, historic date corrections, and audit trail
All tests use an in-memory SQLite database with synthetic data — no production data required.
Geocoding
Only UFOCAT provides latitude/longitude coordinates natively. The other four sources have text-only locations (city, state, country). To enable map visualization for all sources, locations are geocoded using the GeoNames cities15000 gazetteer (33,000+ cities with population > 15,000).
Matching strategy (decreasing specificity):
- Exact: UPPER(city) + state + country → highest confidence
- City + country: No state available (e.g., UPDB) → picks largest matching city by population
- City only: No country available (e.g., UFO-search raw text) → picks largest city globally
- Raw text parsing: Regex extraction of city/state/country from free-text location strings
The geocode_src column on the location table tracks provenance: NULL = coordinates from original source data, geonames_exact / geonames_city_country / geonames_city_only = geocoded via GeoNames.
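The cascade can be sketched as follows. The `gazetteer` record shape is a hypothetical in-memory stand-in for the parsed cities15000 rows, and the function is illustrative rather than `geocode.py`'s actual code:

```python
def geocode(city, state, country, gazetteer):
    """Decreasing-specificity lookup returning (lat, lon, geocode_src)."""
    if not city:
        return None
    rows = [r for r in gazetteer if r["city"].upper() == city.upper()]
    # 1. Exact: city + state + country (highest confidence)
    for r in rows:
        if state and country and r["state"] == state and r["country"] == country:
            return (r["lat"], r["lon"], "geonames_exact")
    # 2. City + country: pick the largest matching city by population
    in_country = [r for r in rows if country and r["country"] == country]
    if in_country:
        best = max(in_country, key=lambda r: r["population"])
        return (best["lat"], best["lon"], "geonames_city_country")
    # 3. City only: pick the largest city globally with that name
    if rows:
        best = max(rows, key=lambda r: r["population"])
        return (best["lat"], best["lon"], "geonames_city_only")
    return None
```

Returning the tier name alongside the coordinates is what lets the `geocode_src` column record provenance for every geocoded row.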
Source Collection Provenance
Each source database belongs to a collection — a top-level grouping that identifies who aggregated or curated the data:
| Collection | Source(s) | Records | Description |
|---|---|---|---|
| PUBLIUS | MUFON, NUFORC, UPDB | 362,646 | Compiled by Publius from original reporting sites (MUFON, NUFORC) and PhenomAInon downloads (UPDB sub-sources: UFODNA, Blue Book, NICAP, UKTNA, CANADAGOV, NIDS, BRAZILGOV, SKINWALKER, PILOTS, BAASS) |
| UFOCAT | UFOCAT | 197,108 | CUFOS academic catalog (2023 release) |
| GELDREICH | UFO-search | 54,751 | Rich Geldreich's Majestic Timeline from 19+ historical compilations |
Collections are filterable in the explorer UI. The three-layer provenance model (source_collection → source_database → source_origin) traces every record back to its ultimate origin.
How Sightings Get Mapped
The Observatory map renders 396,158 markers out of the 614,505 total sightings — roughly 64.5% of the database. The other ~218k sightings exist in the DB but have no coordinates, so they never reach the map.
The sighting ↔ location split
The sighting table and the location table are joined through a foreign key: sighting.location_id → location.id. Multiple sightings can share a single location row — for example, Phoenix, AZ is one row in location, but hundreds of Phoenix Lights sightings all point at it through their location_id. This is the key to understanding why two different numbers show up in the UI:
| Query | Count | Meaning |
|---|---|---|
| `COUNT(*) FROM sighting` | 614,505 | Total sightings in the database |
| `COUNT(*) FROM sighting s JOIN location l ON s.location_id = l.id WHERE l.latitude IS NOT NULL` | 396,158 | Sightings on the map — the "mapped" chip in the stats badge |
| `COUNT(*) FROM location WHERE latitude IS NOT NULL` | 105,854 | Distinct geocoded places (one row per unique coordinate pair) |
The ratio 396,158 / 105,854 ≈ 3.74 sightings per place reflects how skewed the distribution is: a few hundred hotspot cities accumulate dozens to hundreds of sightings each, while the bulk of the location table is rural or historical one-offs.
Why ~35% of sightings have no coordinates
The 218,347 unmapped sightings fall into four rough buckets:
- Pre-GPS historical records — UFOCAT and UFO-search contain thousands of entries dating back to antiquity. The original catalogs often record only a country or region ("Ohio, 1952"), which the ETL can't resolve to a single coordinate pair.
- Free-text locations — "my backyard", "en route to LAX", "over the Bermuda Triangle", "somewhere in the Pacific". The geocoder skips these rather than risk a wrong coordinate.
- Ambiguous city names — "Springfield" without a state qualifier hits dozens of candidates. The ETL prefers silence over a wrong guess.
- Structurally missing data — some source records legitimately have no location field at all (the original witness report didn't include one).
Original coordinates vs GeoNames lookup
The location table's geocode_src column distinguishes two provenance paths:
- Original coordinates (`geocode_src IS NULL`) — latitude/longitude came directly from the source catalog. UFOCAT ships these for most records; MUFON and NUFORC ship them when the witness submitted a specific address.
- GeoNames lookup (`geocode_src = 'geonames_*'`) — the ETL resolved a city/state/country string against the open GeoNames gazetteer. Three granularity levels are tracked: `geonames_exact` (city + state + country matched), `geonames_city_country` (state-level ambiguity resolved by picking the largest city in the country), and `geonames_city_only` (no country at all; picks the globally largest city with that name).
The Observatory map treats both sources identically — once a row has a valid latitude/longitude, it's a marker. The geocoded_original and geocoded_geonames counts in the stats popover let you see the split.
Movement + Quality Classification
v0.8.3b added a set of derived columns on the sighting table that enrich each record with structured analysis extracted from its narrative. These are the columns the Observatory rail filters (Data Quality) and the Timeline/Insights dashboards (Quality Score Distribution, Movement Taxonomy, etc.) read from.
Movement categories (movement_categories, has_movement_mentioned)
A narrative text classifier scans each description for references to 10 movement categories. The classifier emits a JSON array of the categories it found, plus a boolean flag indicating any movement at all. 249,217 sightings (40.5% of the database) carry at least one movement tag.
| Category | Sightings | Example narrative patterns |
|---|---|---|
| vanished | 102,178 | "disappeared", "faded out", "winked out", "vanished into thin air" |
| hovering | 89,964 | "stationary", "stayed in place for several minutes", "hovered silently" |
| followed | 51,499 | "followed the car", "paced the aircraft", "tracked us for miles" |
| descending | 27,148 | "came down", "dropped altitude", "descended toward the field" |
| ascending | 27,067 | "rose straight up", "shot skyward", "climbed at an impossible rate" |
| landed | 26,592 | "touched down", "on the ground", "set down in the clearing" |
| accelerating | 22,627 | "suddenly sped up", "took off at incredible speed", "bolted" |
| linear | 21,106 | "flew straight", "on a direct heading", "steady course" |
| rotating | 19,641 | "spinning", "rotating counter-clockwise", "wobbling as it turned" |
| erratic | 8,877 | "zig-zagging", "erratic motion", "changed direction abruptly" |
Categories are not mutually exclusive — a single sighting can carry multiple tags (e.g. "hovered briefly, then accelerated away" gets both hovering and accelerating). The Observatory's Movement cluster uses OR semantics: a sighting matches if any of the checked categories' bits are set.
Under the hood, the category set is bit-packed into a uint16 on the binary wire format so the client can filter ~396k rows in a few milliseconds without a server round-trip.
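A sketch of that bit-packing and OR-filter logic. The bit order here is an assumption; only the fact that the 10 categories fit a uint16 comes from the text above:

```python
# Bit assignments are illustrative, not the actual wire-format order.
CATEGORIES = ["vanished", "hovering", "followed", "descending", "ascending",
              "landed", "accelerating", "linear", "rotating", "erratic"]
BIT = {name: 1 << i for i, name in enumerate(CATEGORIES)}

def pack(categories) -> int:
    """Pack a sighting's movement tags into a uint16-sized bitmask."""
    mask = 0
    for name in categories:
        mask |= BIT[name]
    return mask

def matches(sighting_mask: int, checked) -> bool:
    """OR semantics: the sighting matches if any checked category bit is set."""
    return (sighting_mask & pack(checked)) != 0
```

Filtering then reduces to one AND per row, which is why the client can scan ~396k rows in milliseconds.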
Quality score (quality_score)
A composite 0–100 integer derived from the richness of structured metadata on each row. Higher means more data you can cross-reference, not necessarily "more credible". The rebalanced v0.8.3b formula weights:
- Date precision (0–25 points) — full ISO date beats year-only beats decade-only
- Location specificity (0–25 points) — city + state + country beats country-only beats region
- Shape classification (0–15 points) — populated `standardized_shape` (one of the 25 canonical shapes) beats raw shape text beats NULL
- Witness count (0–15 points) — multi-witness reports score higher than single-witness
- Source reliability (0–10 points) — investigator-vetted sources (MUFON) score higher than self-reported (NUFORC)
- Narrative presence (0–10 points) — `has_description = 1` adds 10; media attachments add another bump
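As a rough sketch of how such a composite could be computed: the per-bucket point values inside each dimension are assumptions, and the media-attachment bump is omitted; only the maxima (25/25/15/15/10/10) come from the list above.

```python
def quality_score(rec: dict) -> int:
    """Illustrative composite, capped at 100."""
    score = 0
    # Date precision: full ISO > year-only > decade-only (assumed sub-weights)
    score += {"full": 25, "year": 10, "decade": 5}.get(rec.get("date_precision"), 0)
    # Location specificity: city > country > region (assumed sub-weights)
    score += {"city": 25, "country": 10, "region": 5}.get(rec.get("loc_precision"), 0)
    # Shape: canonical standardized_shape > raw shape text > NULL
    if rec.get("standardized_shape"):
        score += 15
    elif rec.get("shape"):
        score += 8
    # Witnesses: multi-witness > single-witness
    witnesses = rec.get("num_witnesses", 0)
    if witnesses > 1:
        score += 15
    elif witnesses == 1:
        score += 8
    # Source reliability: investigator-vetted > self-reported (assumed values)
    score += {"MUFON": 10, "NUFORC": 5}.get(rec.get("source"), 0)
    # Narrative presence
    if rec.get("has_description"):
        score += 10
    return min(score, 100)
```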
The Observatory's "High quality only" toggle filters to quality_score ≥ 60, which corresponds to 118,320 sightings (19.3% of the database). The threshold was calibrated against a hand-reviewed training set — below 60 the records are typically missing at least one major dimension (no date, no location, or no shape). The Insights tab's Quality Score Distribution chart shows the shape of the distribution; the 60+ buckets are highlighted in the accent colour.
Hoax likelihood (hoax_likelihood)
A 0–100 integer estimating the probability a record is a hoax, prank, or misidentification, based on patterns like:
- Obvious hoax language in the narrative ("April Fools", "just kidding", "for Halloween")
- Shapes that correlate strongly with known-hoax submissions (Chinese lanterns tagged as "triangle formation")
- Date/location collisions with known viral hoaxes or film releases
- Source reliability priors (some aggregator sub-sources have high hoax rates)
The "Hide likely hoaxes" toggle in the Quality rail filters to hoax_likelihood ≤ 50. The Insights tab's Hoax Likelihood Curve shows the distribution; the right tail (80-100) is red-shifted so "likely hoax" is visually distinct from "likely genuine". Like the quality score, this is a heuristic filter, not a verdict — no records are ever deleted, just de-emphasised.
Richness score (richness_score)
A companion to quality_score. Where quality asks "how much structured data do we have", richness asks "how much narrative detail do we have". Higher richness means more distinct observation words (colours, durations, sound descriptions, object count, witness count, reaction details). It's the score that tells you how readable a record is going to be, not how verifiable it is. Also a 0–100 integer; no explicit threshold filter in the UI but it's available via the binary wire format for future use.
Primary color
primary_color — one of 22 colours (red, blue, metallic silver, etc.) extracted from the narrative via keyword matching. Populated for 145,209 sightings. Available as a dropdown in the Observatory filter bar.
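A keyword-matching extractor of this kind might look like the sketch below. The colour list here is a small illustrative subset of the 22 canonical colours, and the longest-phrase-first ordering is an assumption about how multi-word colours like "metallic silver" are disambiguated.

```python
# Hedged sketch of narrative keyword matching for primary_color.
# Small illustrative subset of the 22 canonical colours; first-match-wins
# ordering is an assumption. Note the known limitation of naive substring
# matching: e.g. "covered" contains "red".
COLOURS = ["metallic silver", "red", "blue", "orange", "green", "white"]

def extract_primary_color(narrative):
    text = narrative.lower()
    # Multi-word colours are listed first so "metallic silver" wins
    # before a bare "silver" or another colour would match.
    for colour in COLOURS:
        if colour in text:
            return colour
    return None

print(extract_primary_color("A metallic silver disc with red lights"))  # metallic silver
```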
Emotion & Sentiment Analysis (v0.11)
In v0.11, the science team ran four models against all 502,985 sightings with narrative text, replacing the earlier 8-class keyword classifier with transformer-based analysis. All models were run offline on the full private corpus; the results ship as 12 derived columns in the public database.
Models
| Model | Type | Output | Coverage |
|---|---|---|---|
| RoBERTa (cardiffnlp/twitter-roberta-base-sentiment-latest) | 3-class sentiment | positive / negative / neutral + confidence scores | 502,985 |
| RoBERTa (j-hartmann/emotion-english-distilroberta-base) | 7-class emotion | anger, disgust, fear, joy, neutral, sadness, surprise | 502,985 |
| GoEmotions (SamLowe/roberta-base-go_emotions) | 28-class emotion | admiration, amusement, anger, annoyance, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realization, relief, remorse, sadness, surprise, neutral | 502,985 |
| VADER (rule-based) | Compound sentiment | Score from -1.0 (most negative) to +1.0 (most positive) | 502,985 |
Why four models?
No single sentiment model captures the nuance of UFO sighting narratives. VADER is fast and rule-based but misses sarcasm and context. RoBERTa sentiment gives a reliable positive/negative/neutral split but lacks granularity. The 7-class RoBERTa emotion model distinguishes fear from sadness from surprise — crucial for UFO reports where fear and awe often coexist. GoEmotions' 28-class taxonomy catches subtler states like "curiosity", "confusion", and "realization" that are common in witness accounts. Running all four lets the Insights tab cross-reference models and surface patterns that no single classifier would catch.
Coverage and neutrality
86.7% of sightings are classified as "neutral" by GoEmotions. This is expected — most reports are factual descriptions ("I saw a light at 10pm heading north"). The Insights tab's GoEmotions card hides neutral by default (toggle available) so the 13.3% with non-neutral emotion are legible.
Derived columns
The 12 new columns on the sighting table:
| Column | Type | Description |
|---|---|---|
| emotion_28_dominant | VARCHAR | Top GoEmotions label (e.g., "fear", "curiosity") |
| emotion_28_group | VARCHAR | Sentiment group derived from GoEmotions (positive/negative/neutral/ambiguous) |
| emotion_28_scores | JSONB | Full 28-class probability vector |
| emotion_7_dominant | VARCHAR | Top 7-class RoBERTa emotion label |
| emotion_7_scores | JSONB | Full 7-class probability vector |
| vader_compound | REAL | VADER compound score (-1 to +1) |
| vader_pos | REAL | VADER positive proportion |
| vader_neg | REAL | VADER negative proportion |
| vader_neu | REAL | VADER neutral proportion |
| roberta_sentiment | VARCHAR | RoBERTa 3-class label (positive/negative/neutral) |
| roberta_positive | REAL | RoBERTa positive confidence |
| roberta_negative | REAL | RoBERTa negative confidence |
All emotion columns are packed into the 40-byte binary bulk buffer for client-side rendering. VADER compound and RoBERTa sentiment scores are scaled from [-1, +1] to [0, 255] uint8 via round((v+1)*127.5).
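The [-1, +1] → [0, 255] packing is a one-liner; a round-trip sketch makes the quantization error visible. The encoder below is the documented round((v+1)*127.5); the decoder is an assumed inverse for illustration.

```python
# Round-trip sketch of the [-1, +1] -> [0, 255] uint8 scaling described above.
# pack_score matches the documented round((v + 1) * 127.5); unpack_score is
# an assumed inverse. Maximum quantization error is 0.5 / 127.5 (about 0.004).
def pack_score(v: float) -> int:
    return round((v + 1) * 127.5)

def unpack_score(b: int) -> float:
    return b / 127.5 - 1

print(pack_score(-1.0), pack_score(0.0), pack_score(1.0))  # 0 128 255
print(abs(unpack_score(pack_score(0.42)) - 0.42) < 1 / 255)  # True
```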
Replicating the analysis
The transformer analysis can be replicated independently:
- Obtain the raw narrative text from the source databases (NUFORC, MUFON, etc.)
- Run the three HuggingFace models against the text (model IDs listed in the table above)
- Run VADER (`vaderSentiment` Python package) against the same text
- Join results to the UFOSINT sighting table by `source_record_id`
The models are deterministic given the same input text and model weights. Minor version differences in the transformer libraries may produce slightly different confidence scores but the dominant labels should be stable.
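The final join step can be sketched with an in-memory SQLite database standing in for the real one. The table and column names other than `source_record_id` are illustrative assumptions, as is the sample data.

```python
# Sketch of joining locally computed scores back to the sighting table on
# source_record_id. Table/column names beyond source_record_id and the sample
# rows are assumptions; an in-memory SQLite DB stands in for the real database.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sighting (source_record_id TEXT, shape TEXT)")
con.execute("CREATE TABLE my_scores (source_record_id TEXT, vader_compound REAL)")
con.execute("INSERT INTO sighting VALUES ('NUFORC-1234', 'triangle')")
con.execute("INSERT INTO my_scores VALUES ('NUFORC-1234', -0.42)")

row = con.execute(
    """SELECT s.source_record_id, s.shape, m.vader_compound
       FROM sighting s JOIN my_scores m USING (source_record_id)"""
).fetchone()
print(row)  # ('NUFORC-1234', 'triangle', -0.42)
```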
Notes on the Current Build
- Raw narrative text is not in the public database. The `description`, `summary`, `notes`, and `raw_json` columns were stripped from the public export for privacy. All derived columns (quality scores, movement categories, emotion classifications) were computed from the private corpus before stripping and ship as structured fields.
- Deduplication is applied at ingest time. Known duplicates (1.94M records from UFOCAT-NUFORC and UPDB-MUFON/NUFORC overlaps) are excluded during import. The historical three-tier dedup pipeline documented above produced 126,730 candidate pairs in earlier builds; these are no longer materialized in the current build.
- This is a scientific analysis, not an editorial product. No records are deleted, ranked, or editorially curated. Quality scores, hoax flags, and emotion labels are algorithmic outputs with known limitations. The coverage strips on the Insights tab show exactly what percentage of the visible dataset each metric covers.