Case Study: Evaluating Multilingual AI in Humanitarian Contexts

According to the UN Refugee Agency (UNHCR), more than 120 million people are currently displaced worldwide. For refugees and asylum seekers, information is a matter of survival, covering essentials such as registration, shelter, healthcare, employment, and legal or administrative processes. Governments and humanitarian organizations, including UNHCR, the U.S. Department of Homeland Security (DHS), and the International Rescue Committee (IRC), are increasingly turning to AI-powered chatbots and digital assistants to deliver accurate, timely information at scale.

At the same time, many displaced individuals turn directly to general-purpose LLM chatbots such as ChatGPT, Gemini, and Claude. Questions from asylum seekers, refugees, undocumented individuals, and other displaced groups often involve sensitive or deeply personal topics, ones they may hesitate to raise even with guardrailed chatbots managed by humanitarian agencies.

However, most displaced people do not primarily speak English. Research consistently shows that LLMs underperform and apply weaker safeguards in non-English outputs, producing advice that is often less accurate, less actionable, and at times unsafe or biased. This underscores the need to evaluate large language models in multilingual settings and to determine the safeguards, oversight, and resources required for responsible deployment, so as to inform when and where these systems should be used, under what conditions, and in which languages.

In this case study, we examine how LLM-enabled chatbots (GPT-4o, Gemini 2.5 Flash, and Mistral Small) perform in refugee and asylum-related scenarios. The analysis compares responses across four languages: Arabic, Farsi (Iranian Persian), Pashto, and Kurdish. Readers will gain insight into our methodology, language-specific findings, cross-language comparisons, and the differences between human and LLM evaluators (LLM-as-a-judge). Results are presented through charts and analysis, with the full evaluation available via the Mozilla Data Collective.
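
To make the evaluation setup concrete, the sketch below illustrates one way such a cross-language, LLM-as-a-judge comparison could be structured: each scenario prompt is posed to every model in every target language, and a judge model scores each response against a rubric. This is a minimal illustration only; the helper names (`ask_model`, `judge_response`), the rubric dimensions, and the scenario text are hypothetical placeholders, not the study's actual prompts, scoring criteria, or code.

```python
# Hypothetical sketch of a cross-language, LLM-as-a-judge evaluation loop.
# ask_model() and judge_response() are placeholders to be wired to the
# chatbots under test and to the judge model; they are not real APIs.

from dataclasses import dataclass

# Target languages from the case study.
LANGUAGES = ["Arabic", "Farsi", "Pashto", "Kurdish"]

# Illustrative rubric dimensions (not the study's actual criteria).
RUBRIC = ["accuracy", "actionability", "safety"]


@dataclass
class Judgment:
    model: str
    language: str
    scores: dict[str, int]  # rubric dimension -> score


def ask_model(model: str, prompt: str, language: str) -> str:
    """Placeholder: send `prompt` to `model`, requesting a reply in `language`."""
    raise NotImplementedError("Wire this to the chatbot under test.")


def judge_response(prompt: str, response: str) -> dict[str, int]:
    """Placeholder: ask a judge model to score `response` on each RUBRIC dimension."""
    raise NotImplementedError("Wire this to the judge model.")


def evaluate(models: list[str], scenario_prompt: str) -> list[Judgment]:
    """Collect judge scores for every (model, language) combination."""
    judgments = []
    for model in models:
        for language in LANGUAGES:
            response = ask_model(model, scenario_prompt, language)
            scores = judge_response(scenario_prompt, response)
            judgments.append(Judgment(model, language, scores))
    return judgments
```

Human raters can score the same responses with the same rubric, which is what makes a side-by-side comparison of human and LLM judges possible.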

Go to:

Methodology

Findings

Recommendations

Access Data