All sightings over time
Quality score over time (median per year)
Movement categories over time (yearly share)
Emotion & Sentiment Analysis
Sentiment Polarity
Emotion Distribution (7-Class)
GoEmotions Detail (28-Class)
Sentiment Score Distributions
Emotion Profile by Source
NRC Emotion Lexicon ⓘ
Data Quality & Red Flags
Quality Score Distribution
Narrative Red Flags (keyword heuristic)
Movement & Shape
Movement Taxonomy
Shape × Movement Matrix (Top 10 shapes)
Connect your own AI to the UFOSINT data
Every tool the website's chatbot has access to is also exposed via the Model Context Protocol at a single HTTPS endpoint, so any MCP-compatible AI client can query the unified UFO sightings database with your own model and your own subscription. The endpoint is read-only and free to use.
MCP endpoint
https://ufosint-explorer.azurewebsites.net/mcp
6 tools available: search_sightings, get_sighting, get_stats, get_timeline, find_duplicates_for, count_by.
Claude Code (CLI / Desktop App)
One command to connect from any project directory:
claude mcp add --transport http ufosint https://ufosint-explorer.azurewebsites.net/mcp
Restart your Claude Code session. The 6 UFOSINT tools will be available immediately.
Remove later with claude mcp remove ufosint.
Claude Desktop
Open ~/Library/Application Support/Claude/claude_desktop_config.json
(macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows) and add:
{
"mcpServers": {
"ufosint": {
"url": "https://ufosint-explorer.azurewebsites.net/mcp",
"transport": "http"
}
}
}
Restart Claude Desktop. The 6 UFOSINT tools will appear in the tools panel.
Cursor / Cline / Continue / Windsurf
These all support remote MCP servers. Add the same URL to your client's MCP configuration. Each client documents the exact location, but the JSON shape is the same.
Direct API (curl / Python / any HTTP client)
The endpoint is JSON-RPC 2.0 over HTTPS. List the tools:
curl -s https://ufosint-explorer.azurewebsites.net/mcp \
-H 'Content-Type: application/json' \
-d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}'
Call a tool:
curl -s https://ufosint-explorer.azurewebsites.net/mcp \
-H 'Content-Type: application/json' \
-d '{
"jsonrpc":"2.0",
"id":2,
"method":"tools/call",
"params":{
"name":"search_sightings",
"arguments":{"q":"triangle","state":"CA","limit":5}
}
}'
OpenAI / OpenRouter function-calling format
If you're integrating with OpenAI or OpenRouter and want the tool definitions in their native format (instead of going through MCP), fetch:
GET https://ufosint-explorer.azurewebsites.net/api/tools-catalog
And invoke individual tools at:
POST https://ufosint-explorer.azurewebsites.net/api/tool/<tool_name>
Content-Type: application/json
{ "q": "triangle", "state": "CA", "limit": 5 }
Download the database (SQLite)
Want to run your own analysis, train models, or hack on the data offline? The full 508 MB SQLite snapshot is attached to every tagged release on GitHub — 614,505 deduplicated sightings, 502,985 with emotion analysis, and all derived columns.
curl -LO https://github.com/UFOSINT/ufosint-explorer/releases/latest/download/ufo_public.db
sqlite3 ufo_public.db "SELECT COUNT(*) FROM sighting;"
# 614505
See the Methodology tab for the full schema, derived-column definitions, and per-source licensing. Browse the releases page for older versions.
AI Discovery
This site exposes standard AI-readiness files so agents and LLMs can discover and understand the UFOSINT tools automatically:
/llms.txt— Lightweight index of the site, tools, and data for LLMs/llms-full.txt— Full tool schemas and documentation in one file/.well-known/mcp.json— MCP server discovery manifest/robots.txt— All AI crawlers allowed
Local stdio MCP server
Prefer to run an MCP server on your own machine? Clone the repo and use mcp_server.py:
git clone https://github.com/UFOSINT/ufosint-explorer
cd ufosint-explorer
pip install fastmcp psycopg[binary]
DATABASE_URL="postgresql://..." python mcp_server.py
Then point Claude Desktop at the local script via the command form
of the MCP config. (You'll need read-only credentials to a PostgreSQL with the
UFOSINT schema.)
All access is read-only. Source data is licensed by UFOSINT; the deduplicated database is built by the ufo-dedup pipeline.
Unified UFO Sightings Database — Methodology
This is not raw data. UFOSINT Explorer presents a processed scientific analysis of six major UFO/UAP databases — 618,316 sighting records deduplicated, cross-referenced, geocode-verified, quality-scored, LLM-enriched, movement-classified, and emotion-analyzed using four transformer models. Every step of the pipeline is documented below and can be independently replicated from the source data using the open-source ufo-dedup pipeline. The web application source code is at ufosint-explorer.
Pipeline Architecture
The 17-step pipeline transforms ~2.56 million raw records from 6 sources into a unified, analysis-ready database:
Raw Data (5 CSVs + 1 JSON + Reddit CSV)
|
v
rebuild_db.py (17 steps, ~30 min)
|
+-- Steps 1-7: Import 6 sources (618,316 sightings)
+-- Step 8: SQL data quality fixes (~30 corrections)
+-- Step 9: Geocode pass 1 (GeoNames offline gazetteer)
+-- Step 10: Audit: fix bad geocodes + replay LLM location fixes
+-- Step 11: Geocode pass 2 (picks up audit-improved locations)
+-- Steps 12-14: Enrich, dedup, sentiment analysis
+-- Step 15: Derived analysis (shapes, quality, duration, hoax)
+-- Step 16: Replay cached enrichments (emotions, LLM extractions)
+-- Step 17: Export
|
v
ufo_public.db (618,316 sightings, 100 columns, 628 MB)
Source Databases
| Source | Raw Records | Imported | Skipped | Description |
|---|---|---|---|---|
| UFOCAT | 320,412 | 197,108 | 123,304 | CUFOS UFOCAT 2023 catalog. Richest metadata: Hynek/Vallee classifications, lat/lon, witness counts, durations. 123K NUFORC-origin records (SOURCE=UFOReportCtr) skipped; metadata transferred via enrichment. |
| NUFORC | 159,320 | 159,320 | 0 | National UFO Reporting Center. Self-reported sightings with detailed free-text descriptions. Enriched post-import with 102K Hynek and 83K Vallee classifications from UFOCAT. |
| MUFON | 138,310 | 138,310 | 0 | Mutual UFO Network case reports. Short + long descriptions, investigator summaries. |
| UPDB | 1,885,757 | 65,016 | 1,820,741 | Unified Phenomena Database (phenomenAInon). 1.82M rows skipped (MUFON/NUFORC already imported from richer originals). Remaining 65K from UFODNA (38K), Blue Book (14K), NICAP (5.8K), etc. |
| UFO-search | 54,751 | 54,751 | 0 | Majestic Timeline compilation from ufo-search.com. Historical records from 19 source compilations (Hatch, Eberhart, NICAP, Vallee, etc.). |
| r/UFOs | 3,811 | 3,811 | 0 | Reddit r/UFOs community sighting reports. Three-pass pipeline: API scraping, LLM-structured extraction (Gemini Flash), database import. Includes anomaly assessments and strangeness ratings. |
Total raw records across all sources: ~2.56 million. After removing known overlaps at import time: 618,316.
LLM-Powered Data Quality Pipeline
After initial import and geocoding, the pipeline runs an AI-powered data quality audit using Google Gemini Flash. Results are cached for deterministic replay on future rebuilds.
| Audit Tier | What It Does | Records | Method |
|---|---|---|---|
| Tier A: Geocode Verification | Detects map pins in the wrong country/hemisphere (e.g., "Nelson, Nebraska" geocoded to New Zealand) | 29,029 fixed | Code: US/CA state bounding-box validation |
| Tier B: Location Normalization | Cleans 120K messy location strings for re-geocoding (e.g., "Toronto (Canada)" → city=Toronto, state=ON, country=CA) | 49,190 improved | LLM: Gemini Flash, cached to CSV |
| Field Extraction | Extracts shape, color, duration, witnesses, sound, and direction from narrative descriptions that had missing structured fields | 297,446 enriched | LLM: Gemini Flash, cached to CSV |
Impact: +48,503 new correct map pins, +38,347 records above the quality threshold, 574,269 new structured field values extracted from text.
Geocoding
Locations are geocoded offline using the GeoNames gazetteer (cities with population ≥15,000). Three matching strategies with decreasing specificity:
- Exact: city + state + country → coordinates (highest confidence)
- City + country: ignores state, picks largest city by population
- City only: global lookup, prefers matches in the expected country when a US/CA state code is present (v0.14 fix — prevents wrong-continent matches)
Coverage: 418,077 of 618,316 sightings (67.6%) have map coordinates. Two geocoding passes run: before and after the LLM location audit.
Quality Score (0–100)
Every sighting receives a quality score based on the richness and completeness of its data. Higher scores indicate more detailed, verifiable reports.
| Feature | Points | Notes |
|---|---|---|
| Description length | 0 / 5 / 15 / 25 | None / <50 / <200 / 200+ characters |
| Has media (photo/video) | +15 | |
| Number of witnesses | 0 / 5 / 10 / 15 | 0 / 1 / 2 / 3+ |
| Movement mentioned | +10 (+5 bonus) | +5 if 2+ movement categories detected |
| 9 structured fields | 3 pts each (max 27) | time, shape, color, duration, sound, direction, elevation, Hynek, Vallee |
| Coordinates present | +5 | |
| Specificity bonus | +5 | Time-of-day, compass direction, or altitude in description |
| Unknown-date cap | min(score, 15) | Relaxed to 35 if 8+ features and has description |
Distribution: 160,728 sightings (26%) score ≥60. Average score: 42.3. The structured fields row is where LLM extraction has the biggest impact — each newly extracted field adds 3 points.
Shape Normalization
Raw shape strings from 6 sources are normalized to 28 canonical shapes using a three-tier matcher:
- Exact match against canonical list (286,826 matches)
- Substring/alias match against 70+ extended mappings — e.g., "ovoid" → Oval, "V-shape" → Chevron, "torpedo" → Cigar (52,448 matches)
- Fuzzy match via rapidfuzz at ≥85% similarity (41 matches)
28 canonical shapes: Sphere, Disc, Triangle, Cigar, Oval, Circle, Light, Fireball, Cylinder, Diamond, Rectangle, Chevron, Cross, Teardrop, Star, Egg, Cone, Cube, Saucer, Boomerang, Flash, Formation, Changing, Crescent, Cloud, Dome, Unknown, Other.
Coverage: 343,602 sightings (55.6%) have a standardized shape.
Duration Parsing
Raw duration strings are parsed to seconds using a multi-strategy parser:
- Natural language: "5 minutes" → 300s, "about 2 hours" → 7200s
- UFOCAT codes: "B" (brief) → 3s, "M" (medium) → 120s, "H" (hour) → 3600s
- Bare numbers: "5" → 300s (UFOCAT convention: decimal minutes)
- Ranges: "5-10 minutes" → 450s (midpoint)
Parsed durations are bucketed: instant (<5s), seconds (<60s), minutes (<1h), hours (<24h), days (≥24h).
Coverage: 232,438 sightings (37.6%) have parsed durations.
Emotion & Sentiment Analysis
Four models classify the emotional content of sighting narratives:
| Model | Output | Coverage |
|---|---|---|
| GoEmotions 28-class | Dominant emotion label + sentiment group (positive/negative/ambiguous/neutral) | 506,788 |
| 7-class RoBERTa | Emotion label (surprise/fear/neutral/anger/disgust/sadness/joy) + full softmax probabilities | 506,788 |
| RoBERTa Sentiment | Compound sentiment score (-1.0 to +1.0) | 506,788 |
| VADER + NRC Lexicon | VADER compound score + 10 NRC emotion word counts (joy, fear, anger, sadness, surprise, disgust, trust, anticipation, positive, negative) | 506,788 |
Nuclear Proximity Analysis
Haversine distance computed from every geocoded sighting to the nearest of 50 nuclear-relevant facilities (military bases, national labs, nuclear test sites, weapons storage areas).
Results: 35,203 sightings within 50 km of a nuclear site. 69,912 within 100 km.
Supplemented by 14 crash-retrieval cases and 35 nuclear encounter records from the UAP Gerb research bundle.
Movement & Behavior Classification
Regex-based extraction from narrative text identifies 10 movement categories and 14 behavior tags:
- Movement: hovering, linear, erratic, accelerating, rotating, ascending, descending, vanished, followed, landed
- Behavior: hovering, silent, bright, pulsing, rotating, zigzag, vanished, accelerated, split, merged, formation, chased, followed, landed
Coverage: 250,820 sightings (40.6%) have at least one movement category.
Cross-Source Deduplication
A three-tier engine flags potential duplicates across sources. No records are deleted — duplicates are flagged for review.
| Tier | Method | Pairs Flagged |
|---|---|---|
| 1 | MUFON ↔ NUFORC: date + city + state exact match | ~48K |
| 2 | All remaining cross-source pairs: date + city + state | ~69K |
| 3 | Date-only matches with description fuzzy similarity ≥60% | ~10K |
Total: 126,729 duplicate candidate pairs across 618,316 sightings.
Content Policy & Privacy
The public database (ufo_public.db) contains only derived, non-copyrighted fields. Raw narrative text from NUFORC, MUFON, UFOCAT, UPDB, and UFO-search is stripped during export. Reddit sighting descriptions are LLM-generated summaries (transformative derivative works), not original Reddit posts.
No personally identifiable information (PII) is published. The witness_names column is NULL in the public export.
How to Reproduce This Database
pip install -r requirements-etl.txt
python geocode.py --download # one-time gazetteer (~30 MB)
python rebuild_db.py # 17-step pipeline (~30 min)
python emotions.py # GPU emotion classification (~35 min)
python run_enrich.py --limit 378000 # LLM field extraction (~$8, optional)
python run_enrich.py --apply # apply extractions to DB
python export_public.py # clean public export
LLM-derived results (location normalization, field extraction, emotion classification) are cached to CSV files in data/output/ and replayed automatically on future rebuilds without re-calling APIs or re-running GPU inference.