RaahNaameh Visual Benchmark

Visual Question Answering — Persian Cultural Understanding

Models are shown frames from Persian-language videos and asked to describe what they see, read any Persian text, and identify the cultural context. Responses are scored on cultural knowledge, visual accuracy, Persian text reading, and response language quality.


No results yet

Cross-Modal Retrieval — Persian Text ↔ Image Matching

Given 669 Persian descriptions and 669 frames, can the embedding model correctly match each description to its frame? Measured by Recall@1, Recall@5, and Mean Reciprocal Rank.
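As a rough illustration of how these three metrics relate, the sketch below computes them from a square similarity matrix. It assumes that description i belongs to frame i and that ties are ranked pessimistically. The function names and the toy scores are hypothetical and are not taken from the benchmark's own evaluation code.

```python
from typing import Sequence


def rank_of_match(scores: Sequence[float], target: int) -> int:
    """Return the 1-based rank of scores[target] when sorted descending.

    Ties are counted pessimistically: any other candidate scoring at least
    as high as the target is ranked ahead of it.
    """
    target_score = scores[target]
    return 1 + sum(
        1 for j, s in enumerate(scores) if j != target and s >= target_score
    )


def retrieval_metrics(similarity: Sequence[Sequence[float]], ks=(1, 5)) -> dict:
    """Compute Recall@k and Mean Reciprocal Rank for a square matrix.

    similarity[i][j] is the score between description i and frame j.
    Assumption: the correct frame for description i is frame i.
    """
    ranks = [rank_of_match(row, i) for i, row in enumerate(similarity)]
    n = len(ranks)
    metrics = {f"recall@{k}": sum(r <= k for r in ranks) / n for k in ks}
    metrics["mrr"] = sum(1.0 / r for r in ranks) / n
    return metrics


# Toy example: 3 descriptions x 3 frames with made-up scores.
toy = [
    [0.9, 0.1, 0.2],  # description 0: correct frame ranked 1st
    [0.8, 0.5, 0.3],  # description 1: correct frame ranked 2nd
    [0.1, 0.2, 0.7],  # description 2: correct frame ranked 1st
]
print(retrieval_metrics(toy, ks=(1, 2)))
```

For the image-to-text direction, the same function could be applied to the transposed matrix. On the full benchmark the matrix would be 669 x 669.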


No results yet

Sample frames from the benchmark

Category Distribution

Category	Frames	Description
🎬 Documentary	172 (26%)	Nature, travel, cultural programs
📰 News	167 (25%)	Broadcasts with Persian chyrons and overlays
🍳 Cooking	122 (18%)	Iranian dishes, ingredients, techniques
📱 Vlog	121 (18%)	Street scenes, daily life, casual content
🎵 Music	49 (7%)	Traditional and modern Iranian music
🎙️ Podcast	38 (6%)	Interviews and discussions

About RaahNaameh Visual Benchmark

راه‌نامه (RaahNaameh) means "guidebook" in Persian — a play on شاهنامه (Shahnameh), the epic that told Iran's past. RaahNaameh shows the way forward for Persian AI.

Why this benchmark exists

Existing multimodal benchmarks test generic visual understanding — "there is a dog in the park." They don't test whether a model understands that the bread in the image is sangak, that the text overlay says "به خبر ۲۰:۳۰ خوش آمدید", or that the tower in the background is Milad Tower.

Persian visual understanding requires cultural knowledge that no existing benchmark measures. RaahNaameh fills that gap.

How it was built

300 clips downloaded from 24 Persian YouTube channels across 6 categories
~1,800 frames extracted (one every 5 seconds)
AI-curated using Gemini reasoning to select culturally relevant frames
Visual deduplication to remove near-identical frames
669 frames annotated with rich bilingual descriptions and cultural context
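Two of the steps above, fixed-interval frame sampling and near-duplicate removal, can be sketched in a few lines. The source does not say which deduplication method was used, so the average hash with a Hamming-distance threshold below is a stand-in chosen for illustration. The 30-second clip length, the function names, and the tiny 2x2 "thumbnails" are likewise assumptions. A real pipeline would decode frames with a video library and hash downscaled grayscale images.

```python
from typing import List, Sequence

Gray = Sequence[Sequence[int]]  # small grayscale thumbnail, values 0-255


def sample_timestamps(duration_s: float, interval_s: float = 5.0) -> List[float]:
    """Timestamps (in seconds) at which to grab a frame: 0, 5, 10, ... < duration."""
    stamps = []
    t = 0.0
    while t < duration_s:
        stamps.append(t)
        t += interval_s
    return stamps


def average_hash(thumb: Gray) -> int:
    """Pack one bit per pixel: 1 if the pixel is brighter than the mean."""
    pixels = [p for row in thumb for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def deduplicate(thumbs: Sequence[Gray], max_distance: int = 1) -> List[int]:
    """Return indices of frames to keep, dropping near-duplicates of kept frames."""
    kept: List[int] = []
    kept_hashes: List[int] = []
    for i, thumb in enumerate(thumbs):
        h = average_hash(thumb)
        if all(hamming(h, k) > max_distance for k in kept_hashes):
            kept.append(i)
            kept_hashes.append(h)
    return kept


# Hypothetical 2x2 thumbnails: frame_b is a slightly noisy copy of frame_a.
frame_a = [[10, 200], [10, 200]]
frame_b = [[12, 198], [11, 205]]
frame_c = [[200, 10], [200, 10]]

print(sample_timestamps(30))                      # sampling a 30-second clip
print(deduplicate([frame_a, frame_b, frame_c]))   # indices of frames kept
```

Under the assumed 30-second clip length, sampling every 5 seconds gives 6 frames per clip, which is consistent with roughly 1,800 frames from 300 clips.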

Every frame was selected because it contains something specifically Iranian or Persian — text, landmarks, food, customs, or cultural markers that require Persian knowledge to fully understand.

Evaluation Tracks

Track 1 — Embedding Retrieval: Tests whether multimodal embedding models (PE-AV, SigLIP 2, Gemini Embedding 2, CLIP, etc.) can match Persian text descriptions to their corresponding frames.

Track 2 — Visual QA: Tests whether vision-language models (GPT-4, Claude, Gemini, Llama, etc.) can describe Persian visual content, read Persian text in images, and demonstrate Iranian cultural knowledge.

Part of the RaahNaameh project

RaahNaameh-1: Open Persian text embedding model (distilled from Gemini Embedding 2)
RaahNaameh Visual Benchmark: This benchmark
RaahNaameh-2 (coming): Multimodal Persian encoder (text + vision + audio)

Created by

Reza Sayar & Claude Opus 4.6 (Anthropic)

Built with caffeine, API credits, and a stubborn refusal to accept that Persian AI should be an afterthought.

License & Citation

The benchmark data is released under CC-BY-4.0.

@misc{raahnaameh2026visual,
    title={RaahNaameh Visual Benchmark: A Persian Cultural Visual Understanding Dataset},
    author={Sayar, Reza and {Claude Opus 4.6}},
    year={2026},
    url={https://huggingface.co/spaces/Reza2kn/RaahNaameh-Visual-Benchmark}
}