Semantic Search On Indian Languages

Better search for Indian language data

Dec 23, 2020

A lot of good stuff happened in Indic NLP this year.

Multilingual Representations for Indian Languages (MuRIL)
Supports 17 Indian languages. I think it also works with code-mixed text, since it is trained on transliterated data. Handling code-mixing is needed for social media data, e.g. tweets and chats.
ai4bharat/indic-bert
Indic-Transformers: An Analysis of Transformer Language Models for Indian Languages

Other multilingual papers

Since MuRIL shows a large gain over mBERT on the Tatoeba dataset, a sentence-retrieval benchmark, it seems like the best choice for neural search in Indic languages. (I am not clear on how LaBSE compares with MuRIL on Indian data.)

Possible applications in India

News search
- Indic websites
- English websites with Indic-language search
FAQ chatbots in Indic languages
- Customer support for commercial websites
- Customer support for govt websites
Zero-shot article classification via similarity between title and categories
Unsupervised recommendation engine via neural search
- News articles
- Social content
  - Twitter
  - ShareChat
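
To make the zero-shot classification idea above concrete, here is a minimal sketch: embed the article title and each category name, then pick the category with the highest cosine similarity. The `embed` function here is only a toy stand-in (a bag of character trigrams) so the example runs without downloading anything. It matches surface text, not meaning. In practice you would swap it for sentence embeddings from a multilingual encoder such as MuRIL or LaBSE. The names `embed`, `cosine`, and `classify` are my own, not from any library.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Toy stand-in for a sentence encoder: counts of character n-grams.

    Assumption: a real system would return a dense vector from a
    multilingual model here instead of a sparse n-gram count.
    """
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(title, categories):
    """Return the best-matching category and the score for every category."""
    title_vec = embed(title)
    scores = {c: cosine(title_vec, embed(c)) for c in categories}
    return max(scores, key=scores.get), scores


# Hypothetical example: a Hindi headline and three candidate categories
# (cricket, politics, entertainment).
label, scores = classify(
    "क्रिकेट विश्व कप में भारत की जीत",
    ["क्रिकेट", "राजनीति", "मनोरंजन"],
)
print(label, scores)
```

The same scoring loop also covers the search and recommendation cases: embed the query (or the article the user just read), score it against every candidate, and sort by similarity instead of taking only the top one.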

Models can be improved by fine-tuning on domain- and task-specific data.

There were some Indic talks at the recent Forum for Information Retrieval Evaluation (FIRE 2020); see the schedule by IDRBT, Hyderabad.

Come join Maxpool - A Data Science community to discuss real ML problems!

Connect with me on Medium, Twitter & LinkedIn.

