elenabarry/Emoji-Disambiguation

Emoji Sense Disambiguation 🤔

This project uses a semi-automated tagging system to categorise the senses associated with each emoji. The resulting datasets are designed to improve disambiguation of the most widely used emojis. The datasets were then tested with an Emoji-Lesk algorithm, where they outperformed all preceding datasets.

The problem

  • Emojis do not have explicit dictionary meanings like words
  • They are ambiguous and subjective, as their meanings are inferred from the surrounding text
  • They have no direct word equivalents and take on their own unique meanings
  • They are extreme homonyms

e.g. Emojipedia's interpretation of 'Upside-Down Face' 🙃: commonly used to convey irony, sarcasm, joking, or a sense of goofiness or silliness.

Screenshot 2023-10-05 at 11 37 19

Emoji Embeddings

  • Emoji embeddings are vector representations of emoji meanings
  • None of the current methods is context dependent. Often, keywords describing the emoji are used to embed it
  • As a result, the wrong meaning is often assigned to the emoji
Screenshot 2023-10-05 at 11 34 34

The algorithm

  1. Given a Tweet containing an emoji, compute a sentence embedding of the text only, using InferSent.
  2. Take the emoji from that Tweet and embed each of its senses.
  3. Compute the cosine similarity between the Tweet text embedding and each sense embedding, and select the sense with the highest similarity.

Image from 'One emoji, many meanings: A corpus for the prediction and disambiguation of emoji sense', whose authors implemented the first Emoji-Lesk algorithm.

Screenshot 2023-10-05 at 11 39 06
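
The three steps above can be sketched in a few lines of Python. This is only an illustration, not the project's implementation: the `embed` function is a toy bag-of-words stand-in for a sentence encoder such as InferSent, and the `SENSES` inventory and its sense words are invented for this example rather than taken from the datasets.

```python
import math
import re
from collections import Counter

# Hypothetical sense inventory, invented for illustration only.
# The real sense words come from the project's datasets.
SENSES = {
    "🙃": {
        "sarcasm": "sarcasm irony ironic great thanks wonderful",
        "silliness": "silly goofy fun playful joking laugh",
    }
}


def embed(text):
    """Toy stand-in for a sentence encoder (assumption): bag-of-words counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def emoji_lesk(tweet, emoji):
    """Return the best-matching sense for the emoji and all sense scores."""
    # Step 1: embed only the text of the tweet, with the emoji removed.
    text_vec = embed(tweet.replace(emoji, " "))
    # Step 2: embed every candidate sense of the emoji.
    # Step 3: score each sense by cosine similarity with the text.
    scores = {
        sense: cosine(text_vec, embed(words))
        for sense, words in SENSES[emoji].items()
    }
    return max(scores, key=scores.get), scores


best, scores = emoji_lesk("Great, my train is cancelled again, thanks 🙃", "🙃")
print(best, scores)
```

With a real sentence encoder, only `embed` would change; the selection step stays the same.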

Dataset Methodology

For each emoji, a distinctive collection of sense words was selected using WordNet and online dictionaries.

Screenshot 2023-10-05 at 11 43 44

Labelling Method

The open-source software 'tortus' was used to label the dataset. Tweets were fed in, and the senses were displayed as buttons below each Tweet so that the user could select the meaning of the emoji in context.

Screenshot 2023-10-05 at 11 46 57

Datasets

The datasets have been double annotated, with a Cohen's Kappa score of 0.6. This score measures the level of agreement between the annotators while correcting for the agreement that would be expected by chance.
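
As a rough illustration of how such a score is computed, Cohen's Kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each annotator's label frequencies. The sketch below uses made-up toy labels, not the project's annotations; they were chosen so that 80% raw agreement with 50% chance agreement gives a Kappa of roughly 0.6.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels invented for illustration (not the project's data).
annotator_1 = ["sarcasm"] * 5 + ["silliness"] * 5
annotator_2 = ["sarcasm"] * 4 + ["silliness", "sarcasm"] + ["silliness"] * 4

print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

Here the annotators agree on 8 of 10 items, while chance alone would give agreement on 5 of 10, so Kappa is about 0.6.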

Results of Datasets using Emoji-Lesk Algorithm

The most frequent sense (MFS) algorithm assigns the most frequent sense from the data. This baseline is known to be hard to outperform, especially for unsupervised approaches like the Lesk algorithm. The new datasets not only improve the MFS algorithm's results by using concise senses, but also make it possible to outperform the MFS algorithm where emojis are more ambiguous.

Screenshot 2023-10-05 at 11 48 24
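
For reference, the MFS baseline described above can be sketched as follows. The function names and the toy labels are assumptions made for this example; they are not taken from the project's code or data.

```python
from collections import Counter


def most_frequent_sense(train_labels):
    """Return the sense that occurs most often in the labelled data."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Toy labels for a single emoji, invented for illustration.
train = ["sarcasm", "sarcasm", "silliness", "sarcasm"]
gold = ["sarcasm", "silliness", "sarcasm"]

mfs = most_frequent_sense(train)
# MFS ignores the tweet text and always predicts the same sense.
predictions = [mfs] * len(gold)
print(mfs, accuracy(predictions, gold))
```

Because MFS always predicts one sense, its accuracy equals that sense's share of the test labels, which is why it is strong when one sense dominates and weaker when an emoji's senses are more evenly spread.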

About

Datasets for disambiguating the most popular emojis.
