Introduction
Generative AI promises significant benefits for people worldwide, and those benefits need to reach everyone. In India in particular, a large share of the population does not speak or write English, so making this technology available in all Indian languages is essential if the benefits are to reach the non-English-speaking population. With this in mind, several initiatives are under way in India to bring Generative AI features to different Indian languages.
On the auspicious occasion of Gudi Padwa (the start of the new year in the Hindu calendar) on 9th April 2024, we announced the soft launch of Shivneri LLM (Large Language Model), a bilingual Marathi and English LLM.
The model is named after Shivneri, a place in Maharashtra, India, and the birthplace of Shivaji Maharaj, the founder of the Maratha Empire. It is the place where a new beginning happened that led to the prosperity of the people. Along similar lines, we wish to bring the benefits of Generative AI to the non-English-speaking (especially Marathi-speaking) population of India.
Marathi has the third largest number of native speakers in India, after Hindi and Bengali. Almost 83 million people speak the language. Thus, there is a need for a Marathi LLM.
Goal
Shivneri LLM is an open-source initiative focused on building an open-source LLM that enables a range of applications in Marathi, and so brings the benefits of Generative AI to the Marathi-speaking population. We are starting with Marathi, but the initiative can be extended to other Indian languages as needed.
Most of the multilingual LLMs available today offer only very basic support for Indian languages.
There are three key problems in the way these LLMs support Indian languages, which we plan to solve with the Shivneri LLM initiative:
High token count: Most available LLMs, such as LLAMA-2 and OpenAI GPT, take too many tokens to encode a word in an Indian language. Indian-language queries need almost 2-3x the token count of comparable English queries. A high token count leads to higher cost and lower performance.
Lack of Indian context in training data: Most LLMs are trained mainly on English-language data that is available on the internet. Most Indian languages are low-resource languages, in the sense that relatively little text in them is available on the internet. Data for training models on Indian languages is therefore scarce, and native data specific to the Indian context is missing from the training of generally available LLMs.
Bilingual support with English: Most spoken Indian languages are mixed with English words, so it is necessary to support text or conversation that mixes Indian-language and English words.
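One contributing factor to the high token count can be illustrated with a minimal sketch. It assumes a tokenizer that falls back to one token per UTF-8 byte when it has no vocabulary entry for a piece of text. That assumption is a simplification: real subword tokenizers merge bytes into larger units, so actual counts will differ. The function names below are hypothetical and not part of Shivneri LLM or any library.

```python
def utf8_byte_count(word: str) -> int:
    """Number of UTF-8 bytes in `word`.

    Under the (assumed) worst case of one token per byte, this is the
    number of tokens the word would cost.
    """
    return len(word.encode("utf-8"))


def bytes_per_character(word: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point in `word`."""
    return utf8_byte_count(word) / len(word)


if __name__ == "__main__":
    # Illustrative sample words only: an English and a Marathi greeting.
    samples = {"english": "hello", "marathi": "नमस्कार"}
    for label, word in samples.items():
        print(label, len(word), utf8_byte_count(word), bytes_per_character(word))
```

Each Devanagari code point occupies three bytes in UTF-8, while each ASCII letter occupies one, so a Marathi word starts at a disadvantage unless the tokenizer's vocabulary contains Marathi subwords that absorb those bytes.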
Solution
Shivneri LLM aims to address the issues listed in the previous section. To do so, we are doing the following:
Optimize the token count for the Indian language.
Pretrain the model for language-specific context on original Indian-language data from different sources, such as Indian-language news sites, books, school textbooks, YouTube videos and other such original-language data.
Train the model to handle bilingual text, so that its usage matches the way people actually speak and use the language in practice.
Shivneri LLM is currently built on top of Google's Gemma 7B open-source LLM. We have launched a preliminary version of the Shivneri LLM 7B base model on Hugging Face (HF link). The following are in the pipeline:
Continue pre-training on a much larger corpus of original Marathi data.
Launch chat models soon.
Add support for bilingual language use.
This project is currently sponsored by Microsoft Azure with cloud credits. We are thankful to AI4Bharat, an IIT Madras initiative, for their great work on the Indic LLM Suite. Last but not least, we thank the wider open-source community for its amazing tools and support system.
You can subscribe to our pages or groups below for further updates:
LinkedIn group at linkedin group link.
GitHub repo: Repo link.
Mail us at amitagh@gmail.com.
HuggingFace repo: HuggingFace Repo.
Summary
Shivneri LLM is an open-source initiative focused on building an open-source LLM that enables a range of applications in Marathi, and so brings the benefits of Generative AI to the Marathi-speaking population. We are starting with Marathi, but the initiative can be extended to other Indian languages as needed.
If you want to use this LLM for a specific application, we are happy to help customize it to your needs.
We are looking for further sponsorship and support for this initiative, and we welcome all support and feedback related to it. You can reach out to me, Amit Ghadge, at amitagh@gmail.com.