Photo by Nathan Dumlao on Unsplash
In my previous posts I have covered the theory behind semantic search, so I thought: why not also write starter code for a multilingual search engine, one that understands the semantics of a language and doesn't need any machine-translation engine.
Want to know more on semantic search?
On Semantic Search
Make robust search engines (towardsdatascience.com)
I am using these components for the POC:
Model — Multilingual Universal Sentence Encoder
Vector search — FAISS
Data — Quora Question Pairs dataset from Kaggle
You can read more about USE in this paper. It supports 16 languages.
Multilingual Universal Sentence Encoder for Semantic Retrieval
We introduce two pre-trained retrieval focused multilingual sentence encoding models, respectively based on the… (arxiv.org)
STEP 1. LOAD DATA
Let’s first read the data. Because the Quora dataset is huge, we will take only 1% of it, which gives us around 4,000 questions; encoding and indexing this sample takes about 3 minutes.
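The loading step can be sketched roughly as follows. The file name, sampling fraction, and the `question1`/`question2` column names are assumptions based on the Kaggle question-pairs CSV; adjust them to your copy of the data.

```python
import pandas as pd

def load_questions(path="train.csv", frac=0.01, seed=42):
    """Load the Quora question-pairs CSV and keep a small random sample.

    Column names ('question1', 'question2') follow the Kaggle dataset;
    adjust if your file differs.
    """
    df = pd.read_csv(path)
    df = df.sample(frac=frac, random_state=seed)
    # Flatten the two question columns into one deduplicated list
    questions = pd.concat([df["question1"], df["question2"]])
    return questions.dropna().drop_duplicates().tolist()
```

Sampling with a fixed `random_state` keeps the run reproducible, which helps when comparing encoders later.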
STEP 2. CREATE ENCODER
Let’s write encoder classes that load a model and expose an encode method. I have created classes for several different models which you can use. All of them work with English, but only the multilingual USE works with other languages.
USE encodes text into a fixed-size vector of 512 dimensions.
I am using TF Hub to load USE and Flair to load the BERT models.
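A minimal sketch of the USE wrapper, assuming `tensorflow_hub` and `tensorflow_text` are installed; the TF Hub URL points to the published multilingual USE module, and the class name is my own choice:

```python
import numpy as np

class USEEncoder:
    """Wraps the multilingual Universal Sentence Encoder from TF Hub."""

    MODEL_URL = "https://tfhub.dev/google/universal-sentence-encoder-multilingual/3"

    def __init__(self):
        # Imported lazily so the class can be defined without TF installed
        import tensorflow_hub as hub
        import tensorflow_text  # noqa: F401 (registers ops the model needs)
        self.model = hub.load(self.MODEL_URL)

    def encode(self, texts):
        # Returns a (len(texts), 512) float32 matrix, one row per input text
        return np.asarray(self.model(texts))
```

Encoders for other models (e.g. BERT via Flair) can follow the same shape: load in `__init__`, return a 2-D array from `encode`.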
STEP 3. CREATE INDEXER
Now we will create a FAISS indexer class that stores all the embeddings efficiently for fast vector search.
STEP 4. ENCODE AND INDEX
Let's create embeddings for all the questions and store them in FAISS. We also define a search method that returns the top k most similar results for a given query.
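Tying the two pieces together could look like this. `build_search` is a hypothetical helper: it accepts any encoder with an `encode(texts)` method and any indexer with `add`/`search` methods, so the sketch is not tied to a particular model.

```python
def build_search(encoder, indexer, questions):
    """Encode all questions, add them to the index, and return a search function."""
    embeddings = encoder.encode(questions)
    indexer.add(embeddings)

    def search(query, k=5):
        scores, ids = indexer.search(encoder.encode([query]), k=k)
        # FAISS pads with -1 when fewer than k results exist, so filter those out
        return [(questions[i], float(s))
                for i, s in zip(ids[0], scores[0]) if i != -1]

    return search
```

Because the query goes through the same encoder as the corpus, a question asked in any of USE's 16 supported languages lands in the same vector space as the indexed English questions.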
STEP 5. SEARCH
Below we can see the model's results. We first write a question in English, and it returns the expected results. Then we translate the query into other languages using Google Translate, and the results are again great. Even though I made the spelling mistake of writing ‘loose’ instead of ‘lose’, the model still understands it, since it works at the subword level and is contextual.
As you can see, the results are impressive enough that the model is worth putting into production.
You can find the complete code in my colab notebook. You can download data from here.
bhavsarpratik/transformers
Want more?
Find different models for encoding text for semantic search over here!
On Variety Of Encoding Text
Master feature engineering for text (medium.com)