diff --git a/content/post/project-android-keyboard.md b/content/post/project-android-keyboard.md new file mode 100644 index 0000000..57a6b65 --- /dev/null +++ b/content/post/project-android-keyboard.md @@ -0,0 +1,370 @@ +--- +title: "Spelling Correction and Autocompletion for Mobile Devices" +date: 2021-03-11T21:24:44+01:00 +author: "Ziang Lu" +authorAvatar: "img/project-android-keyboard/ziang.png" +tags: [android, keyboard, nlp, n-gram, PED] +categories: ["project"] +image: "img/project-android-keyboard/title_pic.jpg" +draft: false +--- +A virtual keyboard is a powerful tool for smartphones, with which users can improve the quality and efficiency of their input. In this project, we will explore how to use n-gram models to develop an Android keyboard which gives accurate corrections and completions efficiently. + + +# Content + +1. Introduction +2. Algorithm and Data Structure for Similarity Calculation +3. N-gram Models +4. Corpus +5. App +6. Evaluation +7. Potential Improvements +8. Summary + +# Introduction + + + +An efficient keyboard saves users a lot of work and time when typing text. For instance, suppose we want to input the sentence below:
"We are going to watch a movie"
+ +
+

There are 29 characters (without spelling mistakes) to enter, which means one needs to press keys 29 times without any assisting function. However, if we had a magical keyboard which showed us candidates for the next word and helped correct spelling mistakes given our current input, we could save a lot of steps. Generally, we expect a scenario like this: +
+                         current input:           Wee (expecting correction)
+                         candidates to choose: We Lee Bee
+                         current input:           We (expecting prediction)
+                         candidates to choose: are do were
+                         current input:          We are g (expecting completion)
+                         candidates to choose: going gone getting
+                         current input:          We are going (expecting prediction)
+                         candidates to choose: to by on
+                         ............ + +
+In the next sections, we will explore how to make such a “smart” keyboard by combining an n-gram model, the Prefix Edit Distance (PED) and a q-gram index, and how well it works. + +# Algorithm and Data Structure for Similarity Calculation +
+Assume that we are going to type the word “_movie_” but have accidentally typed “_movvie_”. We hope that our keyboard finds this error quickly and shows us the correct version of the word. Or maybe you are too tired to type a very long word and hope your keyboard can guess your goal from the prefix you have typed so far: you want to type “_something_”, but you only need to input “_somet_” and you get the candidate “something” to choose. Now the question is: how can a keyboard measure the difference between a wrong word and a correct one, and the work still needed to reach your final complete word? In other words, why does this keyboard show you the candidate “_movie_” but not “_move_”? Why should “_anything_” not be expected based on “_somet_”? To find answers, we will first take a look at the **Prefix Edit Distance (PED)**. +
+
+ Edit Distance (ED) +
+
**Definition** For two strings x and y, ED(x, y) := the minimal number of transformations needed to get from x to y.
+
+Transformations allowed are:
+insert(i, c): insert character c at position i
+delete(i): delete character at position i
+replace(i, c): replace character at position i by c
+ +**Example:**
+ +$$ somethfinge \stackrel{replace}{\longrightarrow} somethinge \stackrel{delete}{\longrightarrow} something $$
+**ED** (somethfinge, something) = 2 +
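+
+To make this concrete, here is a minimal Python sketch of the classic dynamic-programming computation of ED (the function name is my own; this illustrates the definition and is not the app's actual code):
+
+```python
+def edit_distance(x: str, y: str) -> int:
+    """ED(x, y) via dynamic programming over insert/delete/replace."""
+    n, m = len(x), len(y)
+    # d[i][j] = ED between the first i characters of x and the first j of y.
+    d = [[0] * (m + 1) for _ in range(n + 1)]
+    for i in range(n + 1):
+        d[i][0] = i              # delete all i characters
+    for j in range(m + 1):
+        d[0][j] = j              # insert all j characters
+    for i in range(1, n + 1):
+        for j in range(1, m + 1):
+            cost = 0 if x[i - 1] == y[j - 1] else 1
+            d[i][j] = min(d[i - 1][j] + 1,         # delete
+                          d[i][j - 1] + 1,         # insert
+                          d[i - 1][j - 1] + cost)  # replace (or match)
+    return d[n][m]
+
+print(edit_distance("somethfinge", "something"))  # -> 2
+```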
+
+With a keyboard we always input a word starting from the first letter on the left and going to the end on the right. Therefore, if we have typed “_somet_” as part of the word “_something_”, we expect “_something_” to be shown as the best candidate. With plain Edit Distance, however, an unexpected candidate such as “_same_” with **ED**(somet, same) = 2 would be ranked higher than “_something_” with **ED**(somet, something) = 4. For this reason, we extend the Edit Distance to the **Prefix Edit Distance**. +
+
+
+ Prefix Edit Distance (PED) +
+
+**Definition** $$ \small PED(x, y) = \min_{y'} ED(x, y') $$ where y' ranges over the prefixes of y.
Given a string x, one task of the keyboard is to find all strings $$ \small y_{i}$$ with $$ \small PED (x, y_{i}) \leq 2 $$ and return them to the user. (For my keyboard the PED threshold is 2, but it can be changed as one wishes.)
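+
+PED can be computed with the same dynamic-programming table as ED: d[i][j] is the ED between the first i characters of x and the first j characters of y, so the minimum of the last row is exactly the minimum over all prefixes of y. A minimal sketch (again an illustration, not the app's actual code):
+
+```python
+def prefix_edit_distance(x: str, y: str) -> int:
+    """PED(x, y) = min over all prefixes y' of y of ED(x, y')."""
+    n, m = len(x), len(y)
+    d = [[0] * (m + 1) for _ in range(n + 1)]
+    for i in range(n + 1):
+        d[i][0] = i
+    for j in range(m + 1):
+        d[0][j] = j
+    for i in range(1, n + 1):
+        for j in range(1, m + 1):
+            cost = 0 if x[i - 1] == y[j - 1] else 1
+            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
+    return min(d[n])  # row minimum = minimum ED over all prefixes of y
+
+print(prefix_edit_distance("somet", "something"))  # -> 0
+print(prefix_edit_distance("somet", "same"))       # -> 2
+```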
A response time feels interactive up to around 200 ms. Comparing x with every word of a dictionary one by one takes a lot of time and makes the computation inefficient and unnecessarily slow. +
+
+For example, the PED between “_movies_” and “_cinema_” is intuitively larger than 2, so it is unnecessary to compute it at all. +
+
To filter out those “impossible” words, we use a **q-gram index** to minimize the set of words which actually need to be compared. +
+
+ q-Gram Index +
+
+
**Definition** The q-grams of a string are simply the set of all substrings of this string of length q. +
+
If q = 3, then the 3-grams of “_freiburg_” would be “_fre_”, “_rei_”, “_eib_”, “_ibu_”, “_bur_”, “_urg_”; the 3-grams of “_movie_” are “_mov_”, “_ovi_”, “_vie_”. +
+
To optimize the matching, we pad q − 1 special symbols (we use $) at the beginning of each word for PED (and at both the beginning and the end for ED). +
+
Consider x and y with PED (x, y) = δ. Intuitively, if x and y are not too short and δ is not too large, they will have one or more q-grams in common. So, based on the number of q-grams in common, we can decide whether the PED calculation needs to be executed at all. +
+ +**Example:** +
+ +x = freiburg
+y = breiberg
+**q** = 3, **δ** = 2 +
+ +after padding "_$$_" at the start, we get 3-grams of "_freiburg_" and "_breiberg_": +
+
“$$f” “$fr” “fre” “rei” “eib” “ibu” “bur” “urg” +
“$$b” “$br” “bre” “rei” “eib” “ibe” “ber” “erg” +
+
+number of q-grams in common: 2. +
+Formally: let x' and y' be the padded versions of x and y. +
+Then it holds: $$ comm(x', y') \geq |x| - q \cdot \delta $$ +|x| = 8, |y| = 8, **δ** = 2, **q** = 3
+Hence: comm(x', y') = 2 **≥** 8 − 3 · 2 = 2, fulfilled!
+Therefore, this formula can be applied to check whether a full PED computation needs to be executed.
+
+ +**Example:** +
+ +x = freiburg
**q** = 3, **δ** = 2 (that means we expect the PED to be at most 2)
+
+y1 = freiberg: comm = 5   |x| − 6 = 2 -> Yes
+y2 = nürnberg: comm = 0   |x| − 6 = 2 -> No ×
+y3 = hamgurg: comm = 1    |x| − 6 = 2 -> No ×
+ +
+So, for this example, we only have to compute **PED(freiburg, freiberg)** … which is 1, hence freiberg is returned as a match. +
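+
+A small Python sketch of this filtering step (the names and the plain list standing in for the dictionary are mine; a real implementation would precompute an inverted q-gram index over the dictionary instead of recomputing grams per query):
+
+```python
+from collections import Counter
+
+Q = 3
+PAD = "$" * (Q - 1)
+
+def qgrams(word: str) -> Counter:
+    """Padded q-grams of a word (padding only at the front, as for PED)."""
+    padded = PAD + word
+    return Counter(padded[i:i + Q] for i in range(len(padded) - Q + 1))
+
+def comm(x: str, y: str) -> int:
+    """Number of q-grams the padded versions of x and y have in common."""
+    return sum((qgrams(x) & qgrams(y)).values())
+
+def possible_matches(x, dictionary, delta=2):
+    """Keep only words y that can still satisfy PED(x, y) <= delta."""
+    bound = len(x) - Q * delta
+    return [y for y in dictionary if comm(x, y) >= bound]
+
+print(possible_matches("freiburg", ["freiberg", "nürnberg", "hamgurg"]))
+# -> ['freiberg']  (only this word needs an actual PED computation)
+```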
+
+_Note: for more details about Edit Distance and the q-gram index, see the lecture Information Retrieval._ +
+
+
# N-Gram Models

We can find many candidates which fulfill a given **PED** threshold for a string x. But some candidates are evidently impossible as the next word. For instance, if you have typed the incomplete sentence “_who is th_”, you may get candidates such as “_those_”, “_than_” or “_thanks_”. These words seem less convincing than “_there_” or “_that_”. If you share this intuition, you are already reasoning about probabilities. Hence, we additionally need a so-called n-gram model, which counts and analyzes the probabilities of different word combinations based on a corpus, to obtain precise completions and predictions. +
+
+**Definition** An n-gram is a sequence of n words: a 2-gram (or bigram) is a two-word sequence like “_we are_”, “_going to_”, or “_watch a_”, and a 3-gram (or trigram) is a three-word sequence like “_we are going_” or “_are going to_”. An n-gram model estimates word probabilities based on the frequencies of such sequences. +
+We will see how to use those n-gram models to estimate the probability of the last word given the previous words. +
+
+Let’s consider a word **w**, a history **h** (the start of the sentence) and a corpus **K**. +

+h: The weather is so good +
w: that
+

+we want to calculate the probability of the word w given the history h, which we denote as P (_that_ | _the weather is so good_). Since the size of the corpus **K** is limited, a long history like “the weather is so good” may never appear in it, so we instead approximate the history by just its last few words.

+In other words, instead of computing P (_that_ | _the weather is so good_), we approximate it with the bigram probability P( _that_ | _good_ ). In this case, we just need to count the frequency of the sequence “good that” and the frequency of “good” in the corpus K (here we denote such counts as C(sequence)). +$$ P(that|good) = \frac{C(good \ that)}{C(good)}$$ +
+C(_good that_) counts how many times the combination “good that” appears in the corpus; C(_good_) counts how many times the word “_good_” appears in **K**.
To extend our bigram model to a general n-gram model, we have the following formula: +$$ P(w_{n}|w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1}w_{n})}{C(w_{n-N+1}^{n-1})} $$ +For instance, when we use a trigram model in the case above, we need to compute: +$$ P(that|so \ good) = \frac{C(so \ good \ that)}{C(so \ good)} $$ +
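+
+A minimal sketch of how such counts turn into probabilities (the function names are mine; note the guard for unseen histories, which is exactly the problem the interpolation in the ranking section later addresses):
+
+```python
+from collections import Counter
+
+def train_ngrams(sentences, n=2):
+    """Count all n-grams and their (n-1)-word histories in a corpus."""
+    grams, histories = Counter(), Counter()
+    for words in sentences:
+        for i in range(len(words) - n + 1):
+            grams[tuple(words[i:i + n])] += 1
+            histories[tuple(words[i:i + n - 1])] += 1
+    return grams, histories
+
+def prob(word, history, grams, histories):
+    """P(word | history) = C(history word) / C(history)."""
+    if histories[history] == 0:
+        return 0.0  # unseen history: avoid a ZeroDivisionError
+    return grams[history + (word,)] / histories[history]
+
+corpus = [["the", "weather", "is", "so", "good", "that", "we", "stay"]]
+grams, hist = train_ngrams(corpus, n=2)
+print(prob("that", ("good",), grams, hist))  # -> 1.0
+```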
+
+# Corpus +To train the n-gram models, I used corpora of web text and tweets from https://www.nltk.org/howto/corpus.html, which cover different topics such as positive and negative expressions, political discussions and so on. The raw corpus includes a great number of emojis and special symbols, which introduce noise into the n-gram models. Hence, I filtered out those emojis and symbols so that my n-gram models become “cleaner” and concentrate only on plain sentences. +
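+
+A rough sketch of this cleaning step (the exact character classes and rules used in the project may differ; this only illustrates the idea):
+
+```python
+import re
+
+URL = re.compile(r"https?://\S+")
+NOT_ALLOWED = re.compile(r"[^A-Za-z0-9.,!?'\s-]+")  # emojis, special symbols
+
+def clean(text: str) -> str:
+    text = URL.sub(" ", text)          # URLs are frequent in tweets
+    text = NOT_ALLOWED.sub(" ", text)  # drop emojis and special symbols
+    return re.sub(r"\s+", " ", text).strip()
+
+print(clean("We are going to watch a movie 🎬🔥 http://t.co/xyz"))
+# -> 'We are going to watch a movie'
+```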

+
Corpus Info
+
+
+
||__number of characters__|__number of words__|__number of sentences__|__number of documents__|__topics__|
| --- | --- | --- | --- | --- | --- |
|corpus from tweets|1264807|223201|50070|3|negative, positive and political tweets|
|corpus from web |1469355|255328|57425|4|firefox, overheard, singles, wine|
+
Grams Info (after deleting some grams that appear only once or twice in corpus)
+
+
|__gram__|__amount__|
|---|---|
|unigram|9295|
|bigram |21561|
|trigram|10091|
+
Earlier we talked about the aim and construction of **q-grams**. Given a word **w**, we need to find all words from a dictionary whose **PED** fulfills the threshold **δ**. To reduce the response time, we compute the **q-grams** of all dictionary words in advance; once a query is executed, we only need to compute the number of common grams between **w** and the dictionary words.

+To minimize the internal storage of the app and its startup delay, the total number of words in the dictionary has been limited to 10000. For this dictionary I used the 10000 most commonly used English words from www.mit.edu/~ecprice/wordlist.10000 . Another issue is that some words from the corpus may not be included in the dictionary. Therefore, I kept the words which appear in both dictionary and corpus, removed the 4400 dictionary words which never appear in the corpus, and added the 4400 most frequent corpus words that were missing from the dictionary. +
+# App +

+ + + +

+
The keyboard is implemented in an Android app which can be used to test and evaluate the utility of the keyboard. +

+
+ App Design +

+The basic routine, whenever any of the functions above is triggered, is as follows (a small sketch of steps 2 and 3 follows the list):

+**1** The keyboard accepts the user’s input ↓
+**2** The program splits the input by punctuation ↓
+**3** The last part is used to decide whether the user expects a completion (or correction) or a prediction ↓
+**4** Using the PED and the n-gram probabilities, the program returns at most 3 candidates with the highest scores ↓
+**5** The user chooses a candidate +
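+
+A minimal sketch of steps 2 and 3 (the real app logic may differ in detail; the heuristic assumed here is that a trailing space means the last word is finished):
+
+```python
+import re
+
+def analyze(input_text: str):
+    """Return the expected action, the history words and the current prefix."""
+    # Step 2: split the input by punctuation and keep the last clause.
+    clause = re.split(r"[.!?,;]", input_text)[-1]
+    words = clause.split()
+    # Step 3: a trailing space (or empty clause) means the user expects a
+    # prediction; otherwise they are still typing a word, so we complete
+    # (or correct) it.
+    if input_text.endswith(" ") or not words:
+        return "prediction", tuple(words[-2:]), ""
+    return "completion", tuple(words[:-1][-2:]), words[-1]
+
+print(analyze("We are going "))  # -> ('prediction', ('are', 'going'), '')
+print(analyze("We are g"))       # -> ('completion', ('We', 'are'), 'g')
+```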

+
Completion Routine
+ + +




+ + +
Correction Routine (same as completion; here I just want to show how this routine can also be applied to correct spelling mistakes)
+ + +



+ +
Prediction Routine
+ + +


+ Ranking of Candidates +

+For the 4th step, we additionally need to consider two issues. Some combinations of words may never appear in the corpus, and an exception such as *ZeroDivisionError* would be raised in the program. Therefore, we apply a flexible, interpolated n-gram model: the final probability is initialized as 0.0 and is the weighted sum of the trigram, bigram and unigram probabilities. +$$ \small P = \lambda_1 \cdot P(w_{n}) \ (unigram \ probability) $$ +$$ \small \ \ \ \ \ \ \ \ \ \ + \lambda_2 \cdot P(w_{n} | w_{n-1}) \ (bigram \ probability) $$ +$$ \small \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ + \lambda_3 \cdot P(w_{n} | w_{n-2}, w_{n-1}) \ (trigram \ probability) $$  λ1, λ2 and λ3 are so-called **weights**, which need to be tuned through experiments. Generally, λ3 is assigned a relatively larger value because a longer history gives a more precise result and should play a decisive role. In this project, I set λ1 to 0.1, λ2 to 0.3 and λ3 to 0.6. +
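+
+A sketch of the interpolation (the lookup functions `uni`, `bi` and `tri` are hypothetical stand-ins for the n-gram tables; each returns 0.0 for unseen events, so every term is always defined):
+
+```python
+LAMBDA1, LAMBDA2, LAMBDA3 = 0.1, 0.3, 0.6  # unigram, bigram, trigram weights
+
+def interpolated_prob(w, history, uni, bi, tri):
+    """P = λ1·P(w) + λ2·P(w | w_{n-1}) + λ3·P(w | w_{n-2}, w_{n-1})."""
+    p = LAMBDA1 * uni(w)
+    if len(history) >= 1:
+        p += LAMBDA2 * bi(w, history[-1])
+    if len(history) >= 2:
+        p += LAMBDA3 * tri(w, history[-2], history[-1])
+    return p
+
+# Toy usage with hard-coded probabilities:
+uni = lambda w: {"that": 0.01}.get(w, 0.0)
+bi = lambda w, h1: {("that", "good"): 0.4}.get((w, h1), 0.0)
+tri = lambda w, h2, h1: {("that", "so", "good"): 0.7}.get((w, h2, h1), 0.0)
+print(interpolated_prob("that", ("so", "good"), uni, bi, tri))  # ~0.541
+```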

+The remaining issue is that sometimes we may encounter such a case: +

+Threshold = 2 **(PED)**
+count of “_support_” in corpus:      10000
+count of “_scotland_” in corpus:      8000
+count of “_should_” in corpus:         9000
+count of “_some_” in corpus:           5000
+count of “_something_” in corpus:   3000
+
+When we type “_som_” at the beginning of a text, the program gives us all words with **PED** <= 2 and returns three of them by their probabilities. Regarding only the unigram probability (start of a sentence), we would get a list consisting of “_support_”, “_should_” and “_scotland_”, because they appear at a very high frequency. Intuitively, the user rather expects a list consisting of “_some_” and “_something_” given the incomplete part “_som_”. Therefore, we need to “punish” words like “_support_” and “_scotland_” because of their larger PED compared to “_something_” or “_some_”.

+To further filter out words which we do not want to see, we define the final probability of a candidate **P** as follows, where **Pn** denotes the original probability based on the n-gram model, α a punishment weight, and **PED** keeps its original meaning:
+$$ P = P_{n} - \alpha \cdot PED $$  The probability of words with a larger **PED** is reduced by the punishment weight α, which is just what we want. Hence, the candidates the user really wants are ranked correctly. (In the case above, “some” and “something” would now be shown with higher probability.)
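+
+A sketch of this ranking step, mirroring the “som” example above (the unigram probabilities and PED values are made up for illustration):
+
+```python
+def rank(scored, alpha=1.0, top_k=3):
+    """Rank candidates by P = P_n - alpha * PED and keep the top k."""
+    final = {w: p_n - alpha * ped for w, (p_n, ped) in scored.items()}
+    return sorted(final, key=final.get, reverse=True)[:top_k]
+
+total = 35000.0
+candidates = {                     # (unigram probability, PED("som", word))
+    "support":   (10000 / total, 2),
+    "should":    (9000 / total, 2),
+    "scotland":  (8000 / total, 2),
+    "some":      (5000 / total, 0),
+    "something": (3000 / total, 0),
+}
+print(rank(candidates))  # -> ['some', 'something', 'support']
+```
+
+With α = 0 the three most frequent words would win; with the punishment active, the prefix-matching words come out on top, as intended.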
The evaluation below will show how the specific value of the punishment weight influences the result and will also demonstrate that this method works well. + +# Evaluation +In the evaluation, the utility of autocorrection and autocompletion is tested for my keyboard, denoted *ZKeyboard*, and for the *Android API 30* system keyboard.

+ + Design of Evaluation +

+**95%** of the corpus content is used as the train set to build the n-gram models. The remaining **5%** of the web and tweets content is used to evaluate my keyboard with an automatic program. In addition, I picked 100 sentences each from the web corpus and the tweets corpus so that I can manually evaluate the *API 30 keyboard* and my *ZKeyboard*. The evaluation **criterion** is the percentage of steps (key presses) saved by using the keyboard, assuming that without a helpful keyboard every letter must be typed once. Key presses used to change capitalization or to reach symbols are not counted in the total number, because the project does not focus on the design and layout of a keyboard but only on the models and algorithms.
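+
+To make the criterion concrete, a tiny sketch with illustrative numbers (these are not actual measurements):
+
+```python
+def saved_steps_percentage(typed_keys: int, total_chars: int) -> float:
+    """Percentage of key presses saved versus typing every letter once."""
+    return 100.0 * (total_chars - typed_keys) / total_chars
+
+# "We are going to watch a movie" needs 29 key presses without help.
+# If suggestions reduce this to, say, 17 presses:
+print(round(saved_steps_percentage(17, 29), 2))  # -> 41.38
+```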
First, I assign different values to the punishment weight α in the completion test on the 5% of the corpus. This shows whether the method really works and what a proper value should be.
For the autocorrection test, I change the first letter of every word longer than one character in the 5% of the corpus and in the 200 sentences. Generally, if we find one spelling mistake in a word, we need two steps to correct it (delete and insert). Hence, in this case, the total number of steps needed without a helpful keyboard increases accordingly.
Finally, I want to show how well a model trained on one corpus adapts to a different type of text. For this purpose, I use the train set from 95% of web to evaluate 5% of tweets, and 95% of tweets to evaluate 5% of web.
+ + Results +

+*Evaluation 1: looking for the best punishment weight α and seeing if it really works.*
+*Train set: 95% from web + tweets*
+*Test set: 5% from web*
+
+
| __ALPHA (punishment)__ | __Reduced steps in web (5%)__ |
| --- | --- |
| 0.0 |27.10%|
| 0.0005|36.04%|
| 0.005 |41.16%|
| 0.05 |41.20%|
| 0.5 |41.20%|
| 1.0 |41.20%|
+

+Earlier we discussed how the punishment weight α helps to filter out those words which are evidently far from the goal word in the sense of PED. The experiments show that once α is equal to or larger than 0.05, the utility of the keyboard is maximized. Therefore, for the next evaluations α is set to 1.0. +

+ +
+
Evaluation 2: evaluating the autocompletion function
+*Train set: 95% from web + tweets*
+*Test set: 5% contents from web, 5% contents from tweets, 100 sentences from 5% of web, 100 sentences from 5% of tweets*

| |__Web(small)__ | __Web (5%)__ | __Tweets(small)__|__Tweets(5%)__|
| --- | --- | --- | --- | --- |
|__API30__ |43.19% | -- |40.93% |-- |
|__ZKeyboard__|43.00% |41.20% |43.72% |38.59% |
+

+ZKeyboard performs better than the API30 keyboard when tested on the 100 sentences from tweets. The performance difference between the API30 keyboard and ZKeyboard is very small when tested on the 100 sentences from web. The 200 sentences picked from tweets and web are sentences which look more normal in the sense of syntax and grammar, so ZKeyboard performs better on them.

+
+
Evaluation 3: evaluating the autocorrection function
+*Train set: 95% from web + tweets*
+*Test set: 5% contents from web, 5% contents from tweets, 100 sentences from 5% of web, 100 sentences from 5% of tweets*


+
| |__Web(small)__ | __Web (5%)__ | __Tweets(small)__|__Tweets(5%)__|
| --- | --- | --- | --- | --- |
|__API30__ | 21.00% | -- | 18.60% |-- |
|__ZKeyboard__| 21.35% | 23.62% | 24.06% | 20.71% |
+

+For the evaluation of autocorrection, the first letter of every word whose length is larger than 1 was changed. ZKeyboard does better than the API30 keyboard, especially on the test data from tweets. One reason for this: the political part of the tweets includes a lot of special named entities which may not be recognized by the API30 keyboard. In contrast, during training the n-gram model that ZKeyboard uses has memorized those special named entities which appear at a higher frequency; therefore, ZKeyboard performs better. +

+
Evaluation 4: evaluating the adaptability of an n-gram model trained on a particular corpus, by means of the autocompletion function
+*Train sets: 95% from web, 95% from tweets, 95% from web + tweets*
+*Test set: 5% contents from web, 5% contents from tweets* +
+
| | __web (5%)__ |__tweets(5%)__|
| --- | --- | --- |
| __95% web__ |41.62% |31.33% |
| __95% tweets__ |33.83% |39.27% |
| __95% tweets + web__|41.20% |38.59% |
+

+The result shows that when we apply an n-gram model trained on corpus A to evaluate corpus B (with a different source), the accuracy drops. One can also observe that the performance drop of the model trained on the tweets corpus is much smaller than that of the model trained on the web corpus. The reason is not clear, but one possibility is that the content of tweets is more diversified than that of the web corpus.

+
+
# Potential Improvements

An in-app keyboard can only be used to evaluate the keyboard itself. In the next stage, I will implement the language models and algorithms in a system keyboard which people can use to input text in any edit view on a mobile device.
+More grams. The keyboard currently uses a flexible n-gram model (at most trigrams) to calculate probabilities. In the future, I will extend my n-gram model to 5-grams or 6-grams to improve accuracy.
+Using a database to accelerate startup. Currently, I store the data (model, gram info) in .txt format, so the app has to read it at startup. It takes the app about 2 seconds to read the data every time it starts (after the first installation it may take longer). This limits the size of the data set and makes the startup of the keyboard a little slow. In the next stage, I will use a database to store all data and restructure my program so that no data needs to be read at startup; interaction with the data will only be needed on queries.
+The keyboard cannot yet follow grammatical rules to filter out unsuitable candidates. This could be solved with POS tagging to improve the grammatical accuracy of the keyboard.
+Finally, the keyboard should memorize the user’s input so that the words a user types most often get a higher priority than the others.
+
# Summary

With the help of the n-gram model, the Prefix Edit Distance and the q-gram index, we have developed a smart keyboard (ZKeyboard) which gives relatively accurate corrections and completions. Compared with the API30 keyboard, ZKeyboard holds its own in completion as well as in spelling correction. But we have also seen many aspects which still need to be improved, such as ignored grammar rules, storage limits, the accuracy of the n-gram model and so on. To make a keyboard give more accurate corrections and completions efficiently, we need more sophisticated language models and every potential performance improvement. \ No newline at end of file diff --git a/static/img/project-android-keyboard/completion_routine.png b/static/img/project-android-keyboard/completion_routine.png new file mode 100644 index 0000000..fb5cb34 Binary files /dev/null and b/static/img/project-android-keyboard/completion_routine.png differ diff --git a/static/img/project-android-keyboard/correction_routine.png b/static/img/project-android-keyboard/correction_routine.png new file mode 100644 index 0000000..6748ac2 Binary files /dev/null and b/static/img/project-android-keyboard/correction_routine.png differ diff --git a/static/img/project-android-keyboard/prediction_routine.png b/static/img/project-android-keyboard/prediction_routine.png new file mode 100644 index 0000000..8abd719 Binary files /dev/null and b/static/img/project-android-keyboard/prediction_routine.png differ diff --git a/static/img/project-android-keyboard/screen_shoot1.jpg b/static/img/project-android-keyboard/screen_shoot1.jpg new file mode 100644 index 0000000..8810506 Binary files /dev/null and b/static/img/project-android-keyboard/screen_shoot1.jpg differ diff --git a/static/img/project-android-keyboard/screen_shoot2.png b/static/img/project-android-keyboard/screen_shoot2.png new file mode 100644 index 0000000..cf0a4f1 Binary files /dev/null and b/static/img/project-android-keyboard/screen_shoot2.png differ diff --git a/static/img/project-android-keyboard/title_pic.jpg b/static/img/project-android-keyboard/title_pic.jpg new file mode 100644 index 0000000..75d4f9f Binary files /dev/null and b/static/img/project-android-keyboard/title_pic.jpg differ diff --git a/static/img/project-android-keyboard/ziang.png b/static/img/project-android-keyboard/ziang.png new file mode 100644 index 0000000..7cf9219 Binary files /dev/null and b/static/img/project-android-keyboard/ziang.png differ