Emoji Sense Disambiguation 🤔

This project used a semi-automated tagging system to categorise the senses associated with each emoji. The resulting datasets were designed to improve disambiguation of the most widely used emojis. They were then tested with an Emoji-Lesk algorithm and outperformed all preceding datasets.

The problem

Emojis do not have explicit dictionary meanings like words
They are ambiguous and subjective, as their meanings are inferred from the surrounding text
They do not have word equivalence and take on their own unique meanings
They are extreme homonyms

For example, Emojipedia's interpretation of 'Upside-Down Face' 🙃: Commonly used to convey irony, sarcasm, joking, or a sense of goofiness or silliness.

Emoji Embeddings

Emoji embeddings are vector representations of emoji meanings
No current method is context dependent. Often, key words describing the emoji are used to embed it
This means the wrong meaning of the emoji is often assigned

The algorithm

Given a Tweet containing an emoji, compute a sentence embedding of the text only, using InferSent.
Take the emoji from that Tweet and embed each of its senses.
Select the sense whose embedding has the highest cosine similarity with the embedding of the Tweet text.

Image from 'One emoji, many meanings: A corpus for the prediction and disambiguation of emoji sense', whose authors implemented the first Emoji-Lesk algorithm.
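
The selection step can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: `embed` is a toy bag-of-words stand-in for a sentence encoder such as InferSent, and the sense inventory for 🙃 is hypothetical, built from the Emojipedia description quoted earlier.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence encoder (assumption): bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def emoji_lesk(tweet_text, senses):
    """Return the sense label whose sense words are most similar to the Tweet text.

    `senses` maps a sense label to a string of sense words.
    """
    tweet_vec = embed(tweet_text)
    return max(senses, key=lambda label: cosine(tweet_vec, embed(senses[label])))


# Hypothetical sense inventory for the Upside-Down Face emoji.
UPSIDE_DOWN_SENSES = {
    "irony": "irony sarcasm",
    "silliness": "joking goofiness silliness",
}

predicted = emoji_lesk("enough with the joking and silliness", UPSIDE_DOWN_SENSES)
print(predicted)
```

With a real sentence encoder, only `embed` and the vector arithmetic in `cosine` would change; the argmax over senses stays the same.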

Dataset Methodology

For each emoji, a distinctive collection of sense words was selected using WordNet and online dictionaries.

Labelling Method

The open source software 'tortus' was used to label the dataset. Tweets were fed in, and the senses were displayed as buttons below each Tweet so that the user could select the meaning of the emoji in context.

Datasets

The datasets were double annotated, giving a Cohen's Kappa score of 0.6. This score measures the level of agreement between the annotators while correcting for the agreement expected by chance.
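
For readers unfamiliar with the statistic, here is a minimal sketch of how Cohen's Kappa is computed from two annotators' labels. The function name and the example labels are illustrative only and are not taken from the project's annotation files.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement if each annotator labelled at random
    # according to their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative labels for four Tweets (not real project data).
annotator_1 = ["irony", "irony", "silliness", "silliness"]
annotator_2 = ["irony", "silliness", "silliness", "silliness"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(kappa)
```

In this toy case the annotators agree on 3 of 4 items (0.75), chance agreement is 0.5, and Kappa is therefore (0.75 - 0.5) / (1 - 0.5) = 0.5.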

Results of Datasets using Emoji-Lesk Algorithm

The most frequent sense (MFS) algorithm assigns the most frequent sense from the data. This approach is known to be hard to outperform, especially by unsupervised approaches such as the Lesk algorithm. The new datasets not only improve the MFS algorithm by using concise senses, but can also outperform the MFS algorithm where emojis are more ambiguous.
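
The MFS baseline itself is simple enough to sketch. The snippet below is an assumption-labelled illustration: the function names and the split into labels used to find the most frequent sense and labels used for scoring are hypothetical, and the sense labels are made up for one emoji.

```python
from collections import Counter


def most_frequent_sense(observed_labels):
    """Return the sense label that occurs most often in the given labels."""
    return Counter(observed_labels).most_common(1)[0][0]


def mfs_accuracy(observed_labels, gold_labels):
    """Accuracy obtained by always predicting the most frequent sense."""
    mfs = most_frequent_sense(observed_labels)
    return sum(label == mfs for label in gold_labels) / len(gold_labels)


# Illustrative labels for a single emoji (not real project data).
observed = ["silliness", "silliness", "irony"]
gold = ["silliness", "irony"]

baseline = mfs_accuracy(observed, gold)
print(most_frequent_sense(observed), baseline)
```

The more evenly an emoji's senses are distributed, the lower this baseline falls, which is where a context-aware method has room to do better.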
