Recommendations

(Case Study: Evaluating Multilingual AI in Humanitarian Contexts)

Context-specific and language-specific evaluation cannot be conducted in isolation. It requires meaningful engagement with affected communities and civil society organizations who have direct knowledge of the domain, the language, and the lived realities of the people who will use these tools. The Multilingual AI Safety Evaluation Lab was designed in part to lower the barriers to this kind of engagement, providing an open-source platform that enables community-based evaluators, researchers, and civil society partners to participate directly in the evaluation process. Our case study confirmed what we believed at the outset: humans are an inseparable part of evaluation. Automated systems alone cannot assess contextual appropriateness, cultural nuance, or the real-world safety implications of AI outputs for vulnerable populations.

The recommendations below are organized by audience: the first five are for deployers and developers who build and operate AI tools in multilingual contexts; the sixth is for AI labs and NLP researchers who train the underlying models.

For Deployers and Developers

Recommendation 1: Center Human Evaluators with Lived Experience of Context and Language

Human engagement is not a bottleneck to be minimized; it is a core component of credible, context-aware evaluation. Automated systems can assist with scale and consistency, but they cannot replace the judgment of people who understand, from experience, what is at stake.

  • Compensate evaluators fairly. Evaluation work, especially for sensitive and emotionally demanding content, should be compensated at rates that reflect the expertise and labor involved. Underpaying evaluators, or relying on volunteer labor, undermines the quality and sustainability of evaluation efforts.

  • Build in double-checking and calibration layers. Human evaluation is not infallible. Evaluation workflows should include mechanisms for inter-rater reliability checks, calibration sessions where evaluators discuss edge cases and align on criteria, and review processes for flagged or contested judgments.

  • Create feedback loops between evaluators and the evaluation process. Evaluators often surface issues with the evaluation rubric itself, identifying dimensions that are missing, criteria that are ambiguous, or scenarios that do not fit the existing framework. These insights should be incorporated into ongoing evaluation design.
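Inter-rater reliability checks such as those described above can start with a simple chance-corrected agreement statistic. A minimal sketch of Cohen's kappa for two evaluators, computed with the standard library (the label values are illustrative):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters concur.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independent raters with these label rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(rater_a) | set(rater_b)
    )
    if expected == 1:          # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)
```

Values near 1.0 indicate strong agreement; low or negative values are a signal to hold a calibration session before continuing.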

Recommendation 2: Build an Evaluation-to-Guardrail Pipeline

Evaluation and guardrail design should be treated as a continuous, connected cycle rather than separate activities. Concretely, this means:

  • Step 1: Run multilingual evaluations using the Lab's platform to identify specific failure modes per language. The evaluation should cover applied dimensions such as actionability, factuality, tone, safety, fairness, and censorship, not just linguistic fluency.

  • Step 2: Translate failures into guardrail policies. Each documented failure should map to a specific guardrail rule. For example, if evaluation reveals that a model refers users to authorities who may be hostile to their situation, the guardrail policy should explicitly prohibit such referrals in the relevant context and language.

  • Step 3: Implement the guardrails using available tools, whether through system prompt instructions, post-processing filters, or dedicated guardrail frameworks.

  • Step 4: Re-evaluate. Run the same evaluation suite again to verify that the guardrails resolve the identified issues without introducing new problems, such as over-filtering that reduces helpfulness.

  • Step 5: Repeat. As models are updated, new languages are added, or deployment contexts change, restart the cycle.

This approach ensures that guardrails are evidence-based and context-specific rather than generic. It also means that organizations have a documented, auditable record of what was tested, what failed, what was fixed, and what was verified.
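The cycle above can be sketched as a data flow: each documented failure maps to an explicit guardrail rule (Step 2), and re-evaluation compares failure sets before and after the fix (Step 4). All names below are illustrative, not part of the Lab's platform:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Failure:
    language: str
    dimension: str    # e.g. "safety", "actionability", "tone"
    description: str

def failures_to_rules(failures):
    """Step 2: every documented failure becomes an explicit guardrail policy,
    giving an auditable failure-to-rule mapping."""
    return {
        f: f"[{f.language}/{f.dimension}] The model must not: {f.description}"
        for f in failures
    }

def verify_reevaluation(before, after):
    """Step 4: compare failure sets from the same evaluation suite run
    before and after the guardrails were applied."""
    return {
        "resolved": before - after,       # fixed by the guardrails
        "regressions": after - before,    # new problems introduced
    }
```

Keeping both mappings as data is what produces the documented, auditable record of what was tested, what failed, and what was verified.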

Recommendation 3: Make LLM-as-a-Judge Systems Agentic and Retrieval-Augmented

LLM-as-a-Judge and guardrail systems are increasingly used to evaluate and monitor AI outputs at scale. However, these systems currently have a critical blind spot: they assess factual accuracy with high confidence but have no mechanism to actually verify facts. When an LLM judge evaluates whether a response correctly states a legal deadline, an organizational contact, or a policy detail, it relies entirely on its parametric knowledge, which may be outdated, incomplete, or wrong, especially for non-English jurisdictions.

To make evaluation trustworthy, organizations should:

  • Equip the LLM judge with retrieval tools. Connect it to authoritative knowledge bases, legal databases, or organizational directories so it can look up facts rather than guess. This is especially important for high-stakes domains such as asylum law, medical guidance, and financial regulations.

  • Design an agentic verification workflow. Instead of single-pass judgment, the system should: (a) identify factual claims in the response, (b) retrieve relevant reference material, (c) compare claims against evidence, and (d) issue a judgment with explicit source citations.

  • Require the judge to express uncertainty. The system should have an "Unsure" or "Unverifiable" option and should be instructed to use it when it cannot verify a claim. A judge that is never uncertain is not rigorous; it is overconfident.

  • Support non-English retrieval. Verification tools must be able to query sources in the target language. Legal and policy information varies by jurisdiction, and English-language references alone are insufficient.

Organizations building or customizing LLM-as-a-Judge systems can use open-source tools and frameworks to configure custom judgment policies that define what the judge should check, how it should reason, and when it should flag uncertainty. Writing clear, domain-specific evaluation policies for the judge is just as important as writing guardrail policies for the model itself.
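One way to structure the (a)–(d) workflow, assuming claims have already been extracted and a retrieval backend is available (the `retrieve` callable below stands in for a real knowledge-base or legal-database lookup):

```python
def judge_claims(claims, retrieve):
    """Verify each extracted claim against retrieved evidence rather than
    parametric memory, and abstain when nothing can be retrieved."""
    verdicts = []
    for claim in claims:
        evidence = retrieve(claim["topic"])            # (b) retrieve reference material
        if evidence is None:
            verdict, source = "unverifiable", None     # explicit uncertainty, not a guess
        elif claim["value"] == evidence["value"]:      # (c) compare claim against evidence
            verdict, source = "supported", evidence["source"]
        else:
            verdict, source = "contradicted", evidence["source"]
        verdicts.append({"claim": claim["text"],       # (d) judgment with citation
                         "verdict": verdict, "source": source})
    return verdicts
```

In a real deployment, `retrieve` would query sources in the target language, per the non-English retrieval point above, and the "unverifiable" verdict would be surfaced rather than silently dropped.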

Recommendation 4: Use System Prompts to Embed Multilingual Safety Standards

System prompts are one of the most accessible and immediate interventions available to deployers. They operate at the application layer and can be updated without retraining a model. For multilingual deployments, system prompts should be used to encode cross-language consistency requirements and domain-specific safety policies.

Concrete steps include:

  • State cross-language consistency as an explicit requirement. The system prompt should instruct the model that all supported languages must receive the same standard of helpfulness, the same safety disclaimers, the same level of actionable detail, and the same safeguards. Without this, models default to their training distribution, which favors English.

  • Embed organizational and humanitarian policies. Organizations deploying AI in humanitarian contexts should encode their "do no harm" policies, data protection standards, and service guidelines directly into the system prompt. The IRC, for example, has integrated its organizational policies into the system prompt of its AI assistant, ensuring the model operates within the organization’s ethical framework.

  • Add language-specific and context-specific guidance. If evaluation (per Recommendation 1) has identified risks specific to certain languages or contexts, the system prompt should address them explicitly.

  • Test system prompts in every supported language. A system prompt written in English may not produce the intended behavior when the model responds in Pashto or Kurdish. Validate the prompt's effectiveness across all target languages using the evaluation pipeline.
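Putting these steps together, a deployer might assemble the system prompt programmatically so the consistency requirement is always present and evaluation findings feed in per language. The clause wording, language codes, and guidance strings below are illustrative:

```python
CONSISTENCY_CLAUSE = (
    "In every supported language, apply the same standard of helpfulness, "
    "the same safety disclaimers, the same level of actionable detail, "
    "and the same safeguards."
)

# Hypothetical language-specific guidance surfaced by evaluation (Recommendation 2).
LANGUAGE_GUIDANCE = {
    "ps": "Never refer users to authorities that may be hostile to their situation.",
    "ku": "Include the regional legal-aid disclaimer for legal questions.",
}

def build_system_prompt(org_policy, language):
    """Compose the system prompt from the organizational policy, the
    cross-language consistency clause, and per-language findings."""
    parts = [org_policy, CONSISTENCY_CLAUSE]
    if language in LANGUAGE_GUIDANCE:
        parts.append(LANGUAGE_GUIDANCE[language])
    return "\n\n".join(parts)
```

Because the prompt is assembled from data, the per-language guidance can be updated directly from evaluation results without touching the rest of the prompt.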

Recommendation 5: Ensure Consistent Safety Disclaimers Across Languages

Safety disclaimers, such as advising a user to consult a doctor, a lawyer, or a qualified professional, are a basic safeguard in high-risk domains. Our evaluation showed that these disclaimers are applied inconsistently across languages. Rather than relying on the model to spontaneously generate appropriate disclaimers, deployers should build disclaimer consistency into their system architecture.

Specifically:

  • Deploy multilingual topic classifiers that detect when a query involves medical, legal, financial, or other high-risk domains regardless of input language. If the classifier only works reliably in English, disclaimers will be missed for non-English users. Test classifier accuracy per language.

  • Implement a disclaimer injection layer that operates independently of the model. When the topic classifier flags a high-risk domain, a pre-written, professionally translated disclaimer should be appended to the response, rather than leaving it to the model to decide whether to include one.

  • Maintain a library of language-specific disclaimer templates. Disclaimers should be translated and culturally adapted by qualified translators, not generated by the model on the fly.

  • Track disclaimer presence as an evaluation metric. Include disclaimer detection in routine multilingual evaluations, measured per language and per domain, so that gaps are caught systematically.
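A minimal sketch of the injection layer, with a keyword lookup standing in for the multilingual topic classifier and illustrative English templates (a real deployment would use a trained classifier and professionally translated templates per language):

```python
# Pre-written, translated disclaimer templates, keyed by (domain, language).
DISCLAIMERS = {
    ("medical", "en"): "This is not medical advice; please consult a doctor.",
    ("legal", "en"): "This is not legal advice; please consult a lawyer.",
}

def classify_domain(query, keyword_index):
    """Stand-in for a multilingual topic classifier: flag high-risk domains
    by keyword overlap."""
    words = set(query.lower().split())
    for domain, keywords in keyword_index.items():
        if words & keywords:
            return domain
    return None

def inject_disclaimer(response, query, language, keyword_index):
    """Append a pre-written disclaimer independently of the model, so the
    safeguard does not depend on the model choosing to include one."""
    domain = classify_domain(query, keyword_index)
    disclaimer = DISCLAIMERS.get((domain, language))
    if disclaimer and disclaimer not in response:
        return f"{response}\n\n{disclaimer}"
    return response
```

Because injection sits outside the model, disclaimer presence becomes a deterministic property of the pipeline, which is exactly what makes it measurable per language and per domain.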

For AI Laboratories and NLP Researchers

Recommendation 6: Invest in the Full Technical Stack for Language Support, Not Just Training Data

When multilingual performance gaps surface, the default industry response is to attribute them to insufficient training data. While data quantity and quality are important, reducing the problem to a single variable obscures the other technical and policy factors that directly affect performance and safety for non-English users. The training data narrative is not wrong, but it is incomplete. Meaningful multilingual improvement requires treating language support as a system-level challenge that spans tokenization, safety alignment, evaluation, and infrastructure. AI laboratories should invest across the full stack:

  • Tokenization and language typology. Languages with non-Latin scripts are typically tokenized less efficiently, meaning the same content requires more tokens. This increases generation time, reduces effective context window size, and in some cases leads to incomplete outputs or timeouts. Investing in better tokenizers for underserved scripts, or supporting language-adaptive tokenization, would directly improve usability and reliability.

  • Safety alignment and guardrail training. Safety mechanisms, including content filtering, refusal triggers, and disclaimer generation, must be trained and tested multilingually. If safety alignment is conducted primarily in English, the resulting safeguards may not transfer to other languages. AI labs should include non-English evaluation scenarios in their red-teaming, RLHF, and safety fine-tuning processes.

  • Evaluation benchmarks. Standard multilingual benchmarks tend to test linguistic competence (translation quality, factual recall) rather than applied dimensions like actionability, contextual safety, empathetic tone, and disclaimer consistency. Labs should adopt or develop evaluation benchmarks that test what actually matters for deployment: does the model give helpful, safe, and contextually appropriate responses in every supported language?

  • Infrastructure and serving. If decoding paths and serving infrastructure are optimized for English, non-English users experience higher latency and less reliable outputs. This is an accessibility and equity issue. Labs should monitor and report generation performance per language and work to close infrastructure-level gaps.
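The tokenization point above has a simple structural component: many widely used tokenizers are byte-level BPE over UTF-8, so scripts whose characters need multiple bytes start at a disadvantage. A quick proxy for this, using the standard library (real fertility measurements would use the actual tokenizer; the Pashto sample is illustrative):

```python
def bytes_per_char(text):
    """UTF-8 bytes per character: a rough lower bound on how much harder a
    byte-level BPE tokenizer must compress to match Latin-script token cost."""
    return len(text.encode("utf-8")) / len(text)

# Latin script vs. the Arabic-derived script used by Pashto.
for label, sample in {"Latin": "asylum", "Pashto": "پناه"}.items():
    print(f"{label}: {bytes_per_char(sample):.1f} bytes/char")
# → Latin: 1.0 bytes/char, Pashto: 2.0 bytes/char
```

A tokenizer trained mostly on Latin-script text compounds this gap further, because its merge rules compress English far more aggressively than underserved scripts.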
