Dataset Onboarding Support Team (DOST) for Bhashini and AIKosh

Context:

India’s linguistic diversity is one of its greatest assets, yet it remains significantly underrepresented in digital infrastructure and artificial intelligence (AI). While citizens increasingly rely on digital platforms to access public services, information, and opportunities, language barriers continue to exclude large sections of the population, particularly speakers of low-resource, tribal, and regional languages. Meanwhile, much of India’s language data, spanning text, speech, and multimodal content, remains fragmented in silos across government bodies, academic institutions, civil society organisations, cultural archives, and individual collections.

Our solution:

To address this challenge, the Dataset Onboarding Support Team (DOST) initiative was launched at the BHASHINI Samudaye IndiaAI Pre-Summit event. It is led by CivicDataLab, in partnership with the Gates Foundation and in collaboration with BHASHINI.

DOST acts as a structured support layer that enables the identification, preparation, and onboarding of high-quality language datasets for multilingual AI.

The initiative provides end-to-end onboarding support, guiding contributors from initial dataset identification through preparation, validation, and publication, while ensuring compliance with data quality, privacy, and interoperability standards.

It supports contributors in preparing datasets that are structured, machine-readable, clean and consistent, well-documented with metadata, and safe for public use with appropriate handling of sensitive information.
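
As an illustration only, the readiness criteria above could be turned into a simple automated check before a dataset is submitted. The sketch below is not DOST's actual validation process: the metadata field names, the `check_readiness` function, and the patterns used to flag personal information are all hypothetical, and the dataset is assumed to be a list of records with named fields.

```python
import re

# Hypothetical metadata fields; not an official BHASHINI or AIKosh schema.
REQUIRED_METADATA = ("title", "language", "modality", "license", "description")

# Deliberately simple patterns for illustration; real screening for
# sensitive information needs far more care than this.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_PATTERN = re.compile(r"\b\d{10}\b")


def check_readiness(metadata, records):
    """Return a list of issues; an empty list means none were found."""
    issues = []

    # Well-documented: every required metadata field is present and non-empty.
    for field in REQUIRED_METADATA:
        if not str(metadata.get(field, "")).strip():
            issues.append(f"missing metadata field: {field}")

    # Consistent: every record has the same fields as the first record.
    expected_keys = set(records[0]) if records else set()
    for index, record in enumerate(records):
        if set(record) != expected_keys:
            issues.append(f"record {index}: inconsistent fields")

        # Safe for public use: flag values that look like an email address
        # or a ten-digit phone number.
        text = " ".join(str(value) for value in record.values())
        if EMAIL_PATTERN.search(text) or PHONE_PATTERN.search(text):
            issues.append(f"record {index}: possible personal information")

    return issues


# Example with made-up data: the description is missing and the second
# record contains an email address.
example_metadata = {
    "title": "Sample corpus",
    "language": "hi",
    "modality": "text",
    "license": "CC-BY-4.0",
}
example_records = [
    {"id": 1, "text": "नमस्ते"},
    {"id": 2, "text": "Contact me at user@example.com"},
]
for issue in check_readiness(example_metadata, example_records):
    print(issue)
```

A check like this only catches mechanical problems; deciding whether a dataset is well documented and safe to publish still needs human review.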

DOST supports a wide range of dataset types, including text, speech, translation, conversational, cultural, and multimodal datasets, enabling diverse use cases across sectors and languages.

Beyond technical support, DOST connects contributors to a wider ecosystem of stakeholders working on multilingual AI, offering access to tools and services within the BHASHINI ecosystem, opportunities for collaboration, and visibility on national data platforms.

In partnership with: