669 curated frames · 225 video clips · 6 categories · Bilingual annotations
Visual Question Answering — Persian Cultural Understanding
Models are shown Persian frames and asked to describe what they see, read any Persian text, and identify the cultural context. Responses are scored on cultural knowledge, visual accuracy, Persian text reading, and response language quality.
1 | google/gemini-3.1-flash-lite-preview | 0.5103 | 0.1447 | 0.1184 | 0.7781 | 0.9099 | 50 | 2026-03-16 23:24 |
Cross-Modal Retrieval — Persian Text ↔ Image Matching
Given 669 Persian descriptions and 669 frames, can an embedding model correctly match each description to its frame? Performance is measured by Recall@1, Recall@5, and Mean Reciprocal Rank (MRR).
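As a minimal sketch of how these three metrics can be computed, the snippet below assumes a square similarity matrix in which row `i` holds the scores of description `i` against every frame, and the correct frame for description `i` is frame `i`. The function name `retrieval_metrics` and the optimistic tie handling are illustrative assumptions, not the benchmark's actual scoring code.

```python
def retrieval_metrics(similarity, ks=(1, 5)):
    """Compute Recall@k and MRR for text-to-image retrieval.

    similarity[i][j] is the score between description i and frame j.
    Assumption: the correct frame for description i is frame i.
    """
    n = len(similarity)
    ranks = []
    for i, row in enumerate(similarity):
        # Rank = 1 + number of frames scored strictly higher than the true frame
        # (ties are resolved in the model's favour in this sketch).
        rank = 1 + sum(1 for score in row if score > row[i])
        ranks.append(rank)
    metrics = {f"recall@{k}": sum(r <= k for r in ranks) / n for k in ks}
    metrics["mrr"] = sum(1.0 / r for r in ranks) / n
    return metrics


# Toy 3x3 example: descriptions 0 and 2 rank their own frame first,
# description 1 ranks its own frame second.
sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.5, 0.3],
    [0.1, 0.2, 0.7],
]
scores = retrieval_metrics(sim)
```

In the toy example two of three descriptions retrieve their frame at rank 1, so Recall@1 is 2/3, Recall@5 is 1.0, and MRR is (1 + 1/2 + 1) / 3.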
No results yet.
Sample frames from the benchmark
Category Distribution
| Category | Frames | Description |
|---|---|---|
| 🎬 Documentary | 172 (26%) | Nature, travel, cultural programs |
| 📰 News | 167 (25%) | Broadcasts with Persian chyrons and overlays |
| 🍳 Cooking | 122 (18%) | Iranian dishes, ingredients, techniques |
| 📱 Vlog | 121 (18%) | Street scenes, daily life, casual content |
| 🎵 Music | 49 (7%) | Traditional and modern Iranian music |
| 🎙️ Podcast | 38 (6%) | Interviews and discussions |
About RaahNaameh Visual Benchmark
راهنامه (RaahNaameh) means "guidebook" in Persian — a play on شاهنامه (Shahnameh), the epic that told Iran's past. RaahNaameh shows the way forward for Persian AI.
Why this benchmark exists
Existing multimodal benchmarks test generic visual understanding, such as "there is a dog in the park." They don't test whether a model understands that the bread in the image is sangak, that the text overlay says "به خبر ۲۰:۳۰ خوش آمدید" ("Welcome to the 20:30 news"), or that the tower in the background is Milad Tower.
Persian visual understanding requires cultural knowledge that no existing benchmark measures. RaahNaameh fills that gap.
How it was built
- 300 clips downloaded from 24 Persian YouTube channels across 6 categories
- ~1,800 frames extracted (one every 5 seconds)
- AI-curated using Gemini reasoning to select culturally relevant frames
- Visual deduplication to remove near-identical frames
- 669 frames annotated with rich bilingual descriptions and cultural context
Every frame was selected because it contains something specifically Iranian or Persian — text, landmarks, food, customs, or cultural markers that require Persian knowledge to fully understand.
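The frame sampling and deduplication steps above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline that was actually used: the source does not say which deduplication method was applied, so an average hash with a Hamming-distance threshold stands in for it here. The function names, the `max_distance` threshold, and the tiny grayscale thumbnails (nested lists of pixel values) are all hypothetical.

```python
def sample_timestamps(duration_s, interval_s=5.0):
    """Timestamps in seconds at which to grab a frame, one every interval_s."""
    timestamps = []
    t = 0.0
    while t < duration_s:
        timestamps.append(t)
        t += interval_s
    return timestamps


def average_hash(gray):
    """Bit tuple: 1 where a pixel is brighter than the thumbnail's mean."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    return tuple(int(p > mean) for p in pixels)


def deduplicate(thumbnails, max_distance=1):
    """Return indices of thumbnails to keep.

    A thumbnail is kept only if its hash differs from every previously
    kept hash by more than max_distance bits (Hamming distance).
    """
    kept, kept_hashes = [], []
    for idx, thumb in enumerate(thumbnails):
        h = average_hash(thumb)
        if all(sum(a != b for a, b in zip(h, k)) > max_distance for k in kept_hashes):
            kept.append(idx)
            kept_hashes.append(h)
    return kept


# Toy 2x2 grayscale thumbnails: b is a near-copy of a, c is clearly different.
a = [[10, 200], [10, 200]]
b = [[12, 198], [11, 205]]
c = [[200, 10], [200, 10]]
timestamps = sample_timestamps(12)
kept = deduplicate([a, b, c])
```

Here a 12-second clip yields frames at 0, 5, and 10 seconds, and the near-copy `b` is dropped while `a` and `c` are kept. A real pipeline would downscale decoded video frames to small grayscale thumbnails before hashing.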
Evaluation Tracks
Track 1 — Embedding Retrieval: Tests whether multimodal embedding models (PE-AV, SigLIP 2, Gemini Embedding 2, CLIP, etc.) can match Persian text descriptions to their corresponding frames.
Track 2 — Visual QA: Tests whether vision-language models (GPT-4, Claude, Gemini, Llama, etc.) can describe Persian visual content, read Persian text in images, and demonstrate Iranian cultural knowledge.
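For Track 1, one common way to compare text and image embeddings is cosine similarity. The sketch below assumes each model produces one vector per description and one per frame in a shared space. The helper names and the two-dimensional toy vectors are hypothetical, and the benchmark's own scoring may differ.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def cosine_similarity_matrix(text_embs, image_embs):
    """similarity[i][j] = cosine similarity of description i and frame j."""
    texts = [l2_normalize(v) for v in text_embs]
    images = [l2_normalize(v) for v in image_embs]
    return [[sum(a * b for a, b in zip(t, im)) for im in images] for t in texts]


def best_match(similarity):
    """Index of the highest-scoring frame for each description."""
    return [max(range(len(row)), key=row.__getitem__) for row in similarity]


# Toy embeddings: description 0 points toward frame 0, description 1 toward frame 1.
text_embs = [[1.0, 0.0], [0.0, 2.0]]
image_embs = [[0.9, 0.1], [0.2, 0.8]]
sim = cosine_similarity_matrix(text_embs, image_embs)
matches = best_match(sim)
```

Normalizing first means vector length does not affect the ranking, so the second description matches frame 1 even though its embedding has a larger magnitude.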
Part of the RaahNaameh project
- RaahNaameh-1: Open Persian text embedding model (distilled from Gemini Embedding 2)
- RaahNaameh Visual Benchmark: This benchmark
- RaahNaameh-2 (coming): Multimodal Persian encoder (text + vision + audio)
Created by
Reza Sayar & Claude Opus 4.6 (Anthropic)
Built with caffeine, API credits, and a stubborn refusal to accept that Persian AI should be an afterthought.
License & Citation
The benchmark data is released under CC-BY-4.0.
@misc{raahnaameh2026visual,
title={RaahNaameh Visual Benchmark: A Persian Cultural Visual Understanding Dataset},
author={Sayar, Reza and {Claude Opus 4.6}},
year={2026},
url={https://huggingface.co/spaces/Reza2kn/RaahNaameh-Visual-Benchmark}
}