GHC 19 — The Story of Smart Compose: The Gmail Model That Loved Thanksgiving

Rashmi Raghunandan
5 min read · Oct 4, 2019

--

These are session notes for the GHC19 session “The Story of Smart Compose: The Gmail Model That Loved Thanksgiving,” held on October 3, 2019.

These notes are based on my understanding of the session. The pictures are photographs I took of the speakers’ slides.

Speakers:

Jackie Tsay — Staff Software Engineer, Google

Matthew Dierker — Software Engineer, Google

This talk gave an overview of Gmail Smart Compose. Every week, more than a billion people send emails via Gmail. When “Smart Reply” was released in 2017, the team realized that if you give users tools that genuinely help them, they will willingly use them. Back then, Smart Reply would autocomplete just one word.

Matthew then asked the audience to think of ways they would autocomplete a friend’s message. The audience said they would look at the syntax and context of the message, the message being replied to, and previous emails sent by that person.

The training data consisted of the many emails Gmail users send every week. Given some sample sentences, the team converted them into a bag of words to build a vocabulary, which they then used to train word embeddings. Predictions can be made as simply as computing cosine similarity against these embeddings.
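As a toy illustration of that idea (not Google’s actual pipeline — the words, vectors, and function names below are invented, and real embeddings are learned rather than hand-written), a vocabulary can be built from a bag of words and a related word predicted by cosine similarity between embeddings:

```python
import math
from collections import Counter

def build_vocab(sentences, min_count=1):
    """Build a bag-of-words vocabulary from sample sentences."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    return sorted(w for w, c in counts.items() if c >= min_count)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy 2-d embeddings; in practice these are learned from the email corpus.
embeddings = {
    "hello":    [0.90, 0.10],
    "hi":       [0.85, 0.20],
    "attached": [0.10, 0.90],
}

def nearest(word):
    """Predict the most similar other word by cosine similarity."""
    vec = embeddings[word]
    return max((w for w in embeddings if w != word),
               key=lambda w: cosine(vec, embeddings[w]))
```

Here “hello” maps to “hi” because their toy vectors point in nearly the same direction, while “attached” points elsewhere.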

Smart Compose is a recurrent neural network. Without going into the technical details, the speakers mentioned that it uses “a bunch of math” along with encoders and decoders, which can then be used to generate a tree of suggestions.

The flow of the model was to collect the subject, the previous message, and the current message, and send these to the machine learning model. The candidate suggestions are then scored, and the best one is returned as the prediction.
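A minimal sketch of that flow, with a hypothetical word-overlap scorer standing in for the real neural model (the context fields match the talk; everything else here is made up for illustration):

```python
def best_suggestion(context, candidates, score_fn):
    """Score each candidate completion for the given context, return the best."""
    return max(candidates, key=lambda c: score_fn(context, c))

def overlap_score(context, candidate):
    """Hypothetical scorer: count candidate words that echo the context."""
    ctx_words = set(" ".join(context.values()).lower().split())
    return sum(1 for w in candidate.lower().split() if w in ctx_words)

# The three context signals mentioned in the talk.
context = {
    "subject": "Thanksgiving dinner",
    "previous_message": "Are you coming to dinner on Thursday?",
    "current_text": "Thank",
}
best = best_suggestion(
    context,
    ["you for coming to dinner", "you very much"],
    overlap_score,
)
```

The first candidate wins because it shares more words (“you”, “coming”, “to”, “dinner”) with the subject and previous message.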

Below are some of the lessons the speakers learned while building Smart Compose.

Only Apply AI when helpful

After running user studies, this is what the team figured out:

  1. Speed is critical. If the suggestion arrives too slowly, people may type faster than the model can suggest. Initial analysis showed that the fastest 2% of typists would finish typing the next word/character in around 143 ms. At that point, the team could produce a suggestion in around 600 ms. On their quest to speed this up, they moved from CPUs to GPUs to TPUs. With TPUs, they reduced latency to 80 ms, and further to 40 ms using Cloud TPUs.
  2. Long suggestions are better. However, when they did not set a maximum length, it caused some noise, as shown in the slide below:
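The two constraints above can be sketched as a simple gate. The 143 ms figure comes from the talk; the word cap is a hypothetical stand-in for the maximum-length limit the team added:

```python
TYPING_BUDGET_MS = 143    # fastest ~2% of typists reach the next word in ~143 ms
MAX_SUGGESTION_WORDS = 8  # hypothetical cap to avoid noisy run-on suggestions

def should_show(suggestion, model_latency_ms):
    """Show a suggestion only if it beats the typist and stays short."""
    if model_latency_ms > TYPING_BUDGET_MS:
        return False  # too slow: the user has already typed past it
    if len(suggestion.split()) > MAX_SUGGESTION_WORDS:
        return False  # too long: unbounded suggestions became noisy
    return True
```

Under this gate, the team’s original 600 ms latency would suppress every suggestion, while the 40–80 ms TPU latency passes.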

Launch (Internally) and iterate

Internally, Smart Compose was tested to see which visual treatment would work best.

Gmail has a high bar: any new feature passes only if it reaches at least 75% positive user satisfaction. Even after Typeahead was selected as the best way to show the autocomplete, people had difficulty realizing they needed to press Tab, so the team started showing a small Tab icon next to the suggested phrase. Smart Compose also offered feedback options to help improve the quality of suggestions.

Data is biased, so use human judgement

The above picture shows how some of the suggestions were not really appropriate. To measure quality, the team used a metric called CTR (click-through rate), defined as clicks divided by views; essentially, more clicks would mean better suggestions. However, this led to biased data, as can be seen in the image below:
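CTR here is just clicks over views. A small sketch of ranking suggestions by it (the suggestion strings and counts are invented for illustration):

```python
def ctr(clicks, views):
    """Click-through rate: fraction of shown suggestions that were accepted."""
    return clicks / views if views else 0.0

# Hypothetical per-suggestion stats: (times clicked, times shown).
stats = {
    "thank you for the update": (120, 400),
    "hope this helps":          (30, 400),
}
best = max(stats, key=lambda s: ctr(*stats[s]))
```

The catch the speakers described is that optimizing this number alone inherits whatever biases are in the click data.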

An investor became associated with the pronoun “him”, while a nurse was associated with the pronoun “her”. The team then removed gender pronouns from suggestions completely.

In the early days, this was another common autocomplete that would show up:

Since ‘H’ is one of the most common characters used to start an email (“Hello”, “Hi”, “Hey”, etc.), this was a very bad suggestion. In this case too, they simply removed the phrase.

However, removing phrases does not work all the time, so pattern identification, such as time of day (morning vs. evening) and day of the week, was used to vary the suggestions.
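A minimal sketch of such context-dependent suggestions, assuming a simple rule on time of day and day of week (the actual rules Gmail uses were not shared in the talk):

```python
from datetime import datetime

def greeting_for(now):
    """Pick a greeting pattern from day of week and time of day (illustrative)."""
    if now.weekday() >= 5:        # Saturday or Sunday
        return "Happy weekend"
    if now.hour < 12:
        return "Good morning"
    if now.hour < 18:
        return "Good afternoon"
    return "Good evening"
```

For example, a draft started on a Thursday morning would get “Good morning”, while the same draft on a Saturday would get “Happy weekend”.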

The speakers said their favorite part was the reception the project received when it was released. Some people told them they liked it a lot. Some said they would deliberately type differently just to show Smart Compose that they could. Someone even called it the “death of creativity”. Below is one piece of feedback the speakers shared:

Today, Smart Compose exists in many languages and in the iOS and Android apps as well.

The audience then asked the speakers the following questions.

Q1. How did the team make sure that private data would not be suggested?

The speakers said that a word is used only if it appears a minimum number of times, so as to weed out secret or personal words.
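A sketch of that idea, assuming a simple count threshold (the real threshold and counting scheme were not disclosed):

```python
from collections import Counter

def filter_rare_words(words, min_count):
    """Keep only words seen at least min_count times, dropping rare
    (potentially private) tokens such as names or account numbers."""
    counts = Counter(words)
    return {w for w, c in counts.items() if c >= min_count}
```

A one-off token like an account number appears far fewer times than common words, so it never enters the suggestion vocabulary.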

Q2. Why was tab used for accepting suggestion instead of enter?

The team said they did build the Enter option, but it worked poorly because it would accept the suggestion even when you wanted a line break. The speaker mentioned that pressing the right arrow key also works.

Q3. How do you remove inappropriate suggestions, as there could be so many?

The team said they rely on Gmail’s spam filtering. They also do some cleaning before training, and again on the serving side.

Q4. Do you train on each individual language or do you translate from English?

The team said that they train in the language itself.

Q5. Do you plan to introduce typos to make the email look more like the writer?

The speakers said that if you type something a bunch of times it should eventually show up in the autocomplete.

Q6. Do you trigger at every keystroke?

Yes.

Q7. Are the suggestions based on what other people write as well?

The speakers said that they use a blend of both models. Part of their testing process was to find the right blend.

These notes are based on my understanding of the session; there may be mistakes in the technical details above.
