This project used a semi-automated tagging system to categorise the senses associated with each emoji. The resulting datasets were designed to improve the disambiguation of the most widely used emojis. The datasets were then evaluated with an Emoji-Lesk algorithm, which performed better on them than on all preceding datasets.
- Emojis do not have explicit dictionary meanings like words
- They are ambiguous and subjective, as their meanings are inferred from the surrounding text
- They do not have word equivalence and take on their own unique meanings
- They are extreme homonyms
For example, Emojipedia's interpretation of 'Upside-Down Face' 🙃: commonly used to convey irony, sarcasm, joking, or a sense of goofiness or silliness.
![Screenshot 2023-10-05 at 11 37 19](https://private-user-images.githubusercontent.com/53048127/272873488-d23139ad-d40a-4d10-8e83-0b89124e049a.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkyMzE1ODEsIm5iZiI6MTczOTIzMTI4MSwicGF0aCI6Ii81MzA0ODEyNy8yNzI4NzM0ODgtZDIzMTM5YWQtZDQwYS00ZDEwLThlODMtMGI4OTEyNGUwNDlhLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEwVDIzNDgwMVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWI3ODFmNjk0NmQwMmQ0YjBiYmQ1NGEzOGU0NjllYjkwY2FmZTc1ZjU1ZGExOWZhYzVmODA1YjAyZDc5ZDFhNGImWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.YYqoFKyMjLQOZWKNmqErF_8BBT34UMR8vF28oktE_t4)
- Emoji embeddings are vector representations of emoji meanings
- Current methods are not context dependent. Often, keywords describing the emoji are used to embed it
- This means the wrong meaning of the emoji is often assigned
![Screenshot 2023-10-05 at 11 34 34](https://private-user-images.githubusercontent.com/53048127/272873013-e5057a22-6b53-4ecc-bc4b-db90e25fdc99.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkyMzE1ODEsIm5iZiI6MTczOTIzMTI4MSwicGF0aCI6Ii81MzA0ODEyNy8yNzI4NzMwMTMtZTUwNTdhMjItNmI1My00ZWNjLWJjNGItZGI5MGUyNWZkYzk5LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEwVDIzNDgwMVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTkwNjA3NmFiNzkwNTdmNmU2ZmUxNzc4NGI5MTcxYzFmNTg4ODIzZWJmYzI3NWNhNzM0OTI3ODc3MmM0MjIzOTUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.TCcqvi-rkARg12XX601-QnDQ8gPkwixN4id2xifGHT0)
- Given a Tweet containing an emoji, compute a sentence embedding of the text only, using InferSent.
- Take the emoji from that Tweet and embed each of its senses.
- Assign the sense whose embedding has the highest cosine similarity to the embedding of the Tweet text.
Image from 'One emoji, many meanings: A corpus for the prediction and disambiguation of emoji sense', whose authors implemented the first Emoji-Lesk algorithm.
![Screenshot 2023-10-05 at 11 39 06](https://private-user-images.githubusercontent.com/53048127/272873864-f965328b-64dc-4a4f-982b-ae6048841d1c.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkyMzE1ODEsIm5iZiI6MTczOTIzMTI4MSwicGF0aCI6Ii81MzA0ODEyNy8yNzI4NzM4NjQtZjk2NTMyOGItNjRkYy00YTRmLTk4MmItYWU2MDQ4ODQxZDFjLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEwVDIzNDgwMVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTc3NDQ1NTk5ZTA0ZDZkYzQ5OGY0ZmQ1ZmMyMDJhODkwZDcxMjRjOTdkZjNiYjRhMzhhNDk1NWRmODE4MjRhZTEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.fziQOyTHlEeAE--vsgZ4A1BzZXbyJcbG5BjZMo90jVs)
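The three Emoji-Lesk steps listed above can be sketched in a few lines. This is only an illustration under stated assumptions: the `embed` function is a toy bag-of-words stand-in for a real sentence encoder such as InferSent, and the sense labels and sense words for 🙃 are hypothetical, not taken from the project's datasets.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence encoder (assumption): bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def emoji_lesk(tweet_text, senses):
    """Return the sense label whose sense words are closest to the Tweet text."""
    text_vec = embed(tweet_text)
    scores = {label: cosine(text_vec, embed(words)) for label, words in senses.items()}
    return max(scores, key=scores.get)


# Hypothetical sense inventory for the Upside-Down Face emoji.
senses = {
    "sarcasm": "sarcasm irony great love mocking",
    "silliness": "silly goofy joking fun playful",
}

tweet = "oh great another monday i love it"
print(emoji_lesk(tweet, senses))  # prints "sarcasm"
```

With a real encoder, only `embed` and `cosine` would change (dense vectors instead of count dictionaries); the sense selection step stays the same.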
For each emoji, a distinctive collection of sense words was selected using WordNet and online dictionaries.
![Screenshot 2023-10-05 at 11 43 44](https://private-user-images.githubusercontent.com/53048127/272874955-d42d2b55-7c3e-4f25-b553-811122953208.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkyMzE1ODEsIm5iZiI6MTczOTIzMTI4MSwicGF0aCI6Ii81MzA0ODEyNy8yNzI4NzQ5NTUtZDQyZDJiNTUtN2MzZS00ZjI1LWI1NTMtODExMTIyOTUzMjA4LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEwVDIzNDgwMVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTdkZDM5M2ZlM2VlNTY5ZDMxNzRhMTU0NzczZjM3NDViZjQ3ZmMxNTUyNDc0ZTVmMTMwNzViY2U2MjEwODI4MTYmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.FN8R42-Pv8MzSlYJT1-6T_V6nyj2YRO3Yi6zkDrVirs)
The open-source software 'tortus' was used to label the dataset. Tweets were fed in, and the senses were displayed as buttons below each Tweet so that the user could select the meaning of the emoji in context.
![Screenshot 2023-10-05 at 11 46 57](https://private-user-images.githubusercontent.com/53048127/272875832-2347b34c-8e20-4ff6-a889-3f5603a33674.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkyMzE1ODEsIm5iZiI6MTczOTIzMTI4MSwicGF0aCI6Ii81MzA0ODEyNy8yNzI4NzU4MzItMjM0N2IzNGMtOGUyMC00ZmY2LWE4ODktM2Y1NjAzYTMzNjc0LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEwVDIzNDgwMVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWJmMjdkYWZmNTViZjBhZTU0ZWQ0NTc3MzE0OGY1ZGQ1MTAyMjk3MjAxOWI1NDNkYzRmZTFmOTZkZTIyMThkMDMmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.u5pLTqCoNLVsasNpufg_ynH2o6MSS2xOWcR-ydGpIvs)
The datasets were double annotated, yielding a Cohen's Kappa score of 0.6. This score quantifies the level of agreement between the annotators while correcting for the agreement expected by chance.
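As a rough illustration of how such a score is computed, Cohen's Kappa compares the observed agreement p_o with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below uses made-up labels from two hypothetical annotators; it is not the project's annotation data or evaluation script.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions, summed over labels.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations for five Tweets containing the same emoji.
annotator_1 = ["sarcasm", "sarcasm", "silly", "silly", "sarcasm"]
annotator_2 = ["sarcasm", "silly", "silly", "silly", "sarcasm"]

print(round(cohens_kappa(annotator_1, annotator_2), 2))  # prints 0.62
```

Here the annotators agree on 4 of 5 items (p_o = 0.8) and chance agreement is p_e = 0.48, so kappa = 0.32 / 0.52.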
The most frequent sense (MFS) algorithm assigns each emoji the sense that occurs most often for it in the data. This baseline is known to be hard to outperform, especially for unsupervised approaches like the Lesk algorithm. The new datasets not only improve the MFS algorithm by using concise senses, but also allow Emoji-Lesk to outperform the MFS algorithm where emojis are more ambiguous.
![Screenshot 2023-10-05 at 11 48 24](https://private-user-images.githubusercontent.com/53048127/272876158-ddce5cfd-5701-41d9-8217-72139805e643.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkyMzE1ODEsIm5iZiI6MTczOTIzMTI4MSwicGF0aCI6Ii81MzA0ODEyNy8yNzI4NzYxNTgtZGRjZTVjZmQtNTcwMS00MWQ5LTgyMTctNzIxMzk4MDVlNjQzLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEwVDIzNDgwMVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWYzMTk3MjQyNzdjZTExODYzZTAzOWU3YjZkMzJkNjJkOTIxZDMwZGFkMTI2ZjAwYzAyOGVlZTcwNjY4ZjgwYTQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.XXC4xrZLM36o0V6gV3osJlmt10Rq3hDsFi35MB46Xvs)
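For readers unfamiliar with the MFS baseline, a minimal sketch follows. The `(emoji, sense)` pairs, sense labels, and function names are hypothetical and chosen only for illustration; they do not reproduce the project's data or results.

```python
from collections import Counter, defaultdict


def train_mfs(examples):
    """Map each emoji to the sense it carries most often in the labelled examples."""
    counts = defaultdict(Counter)
    for emoji, sense in examples:
        counts[emoji][sense] += 1
    return {emoji: senses.most_common(1)[0][0] for emoji, senses in counts.items()}


def accuracy(model, examples):
    """Fraction of examples whose gold sense equals the emoji's most frequent sense."""
    correct = sum(model.get(emoji) == sense for emoji, sense in examples)
    return correct / len(examples)


# Hypothetical labelled data: (emoji, annotated sense).
train = [
    ("🙃", "sarcasm"), ("🙃", "sarcasm"), ("🙃", "silliness"),
    ("🔥", "excellent"), ("🔥", "excellent"), ("🔥", "hot"),
]
held_out = [
    ("🙃", "sarcasm"), ("🙃", "silliness"),
    ("🔥", "excellent"), ("🔥", "hot"),
]

mfs = train_mfs(train)
print(mfs["🙃"])                # prints "sarcasm"
print(accuracy(mfs, held_out))  # prints 0.5
```

Because MFS ignores context entirely, it always predicts the same sense for a given emoji, which is why a context-aware method has the most room to beat it on highly ambiguous emojis.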