A longitudinal study of political school discourse 1962-2023

This codebase contains all material produced for Ema Demir’s research project on Swedish school discourse in a political setting from 1962 to 2023. Here, we have created the tools to study semantic shift in lexical units related to schools.

We do this by producing multiple vector representations of political text data using fastText embeddings, an extension of the word2vec model. We divide the data into buckets, each representing a time period (a small sketch of how a document’s year maps to a bucket follows the two lists below). The periods at hand are chosen using two different approaches:

A theory-driven approach with five buckets motivated by real-world events (political decisions, school reforms, etc.). These events are better explained by Ema Demir herself. The periods are as follows:

  • 1962-1967
  • 1968-1980
  • 1981-1991
  • 1992-2011
  • 2012-2023

Regular intervals of six years (the final bucket is extended to cover 2016-2023):

  • 1962-1967
  • 1968-1973
  • 1974-1979
  • 1980-1985
  • 1986-1991
  • 1992-1997
  • 1998-2003
  • 2004-2009
  • 2010-2015
  • 2016-2023
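
To make the bucket assignment concrete, here is a minimal sketch (not part of the codebase; the function and constant names are illustrative) of how a document’s publication year could be mapped to one of the theory-driven buckets:

# Illustrative only: map a publication year to a theory-driven bucket.
THEORY_BUCKETS = [
    (1962, 1967),
    (1968, 1980),
    (1981, 1991),
    (1992, 2011),
    (2012, 2023),
]

def bucket_for_year(year):
    """Return the (start, end) bucket that a publication year falls into."""
    for start, end in THEORY_BUCKETS:
        if start <= year <= end:
            return (start, end)
    raise ValueError(f"Year {year} lies outside the studied period 1962-2023")

print(bucket_for_year(1985))  # -> (1981, 1991)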

The political time buckets are also paired with what we might call control data, i.e. newspaper texts from the corresponding time periods. These serve as a kind of control group in our experiments. KBLab has an ongoing collaboration with Språkbanken Text in which we train newspaper text representations for Språkbanken to be used in research, so we have been able to use these embeddings as well. They were trained by Justyna Sikora at KBLab on data sampled from the four biggest newspapers during 1962-2021 (data from 2022 and 2023 were not available for this dataset). Note that the data is sampled from the newspapers, making the underlying corpus slightly smaller than the political corpus. This can be adjusted at a later time if it turns out to be a problem. Under the documentation directory, you’ll find a CSV with vector representation sizes in megabytes (MB).

The core idea of the research methodology is to compare the semantic shift of words and phrases using models whose text representations are based on data from different periods of time. We do this by creating fastText vector representations of the texts. There are other available models, but we chose fastText partly because of KBLab’s collaboration with Språkbanken and partly because of its user-friendly interface. Text representations are created using the fastText implementation in gensim, a fast C implementation of fastText with a Python interface. Gensim offers a streamlined training setup with fewer options and parameters to control the training process, which has both pros and cons: it reduces flexibility but simplifies usage, which in this specific application is an attractive feature. For example, it allows easy loading and offers compatibility with other gensim functionality.

Gensim’s implementation of fastText also extends beyond the original Facebook implementation by providing some additional functionality. For example, it includes methods for miscellaneous similarity queries. Another motivation is the fact that the gensim implementation is widely used in the Python ecosystem.
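
To give a rough idea of what training looks like with gensim’s FastText class, here is a minimal sketch. The corpus file name, parameter values and output file name are placeholders, not the settings used for the project’s actual models:

from gensim.models import FastText
from gensim.models.word2vec import LineSentence

# The corpus file is assumed to contain one tokenized sentence per line.
sentences = LineSentence("political_texts_1981_1991.txt")

model = FastText(
    sentences,
    vector_size=100,  # dimensionality of the word vectors
    window=5,         # context window size
    min_count=5,      # ignore words occurring fewer than 5 times
    sg=0,             # 0 = CBOW, 1 = skip-gram
    epochs=5,         # number of passes over the corpus
)

model.save("1981_1991_lemmatized.bin")  # can later be loaded with FastText.load()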

A detailed description of the process can be found in documentation/process.md.

fastText model

fastText is a vector embedding model for representing and understanding text. It captures the semantic meaning of words while taking morphological variation into account.

A key aspect of fastText is its use of subword embeddings. Not only does fastText capture the semantic meaning of lexical units such as words, it also considers subword information, making it effective for handling features such as prefixes, suffixes and out-of-vocabulary words. fastText does this by breaking words down into subword units called character n-grams. These character n-grams can be as short as a single character or as long as the whole word. By considering these subword units, fastText is able to learn representations for words that contain similar character sequences, even if they are morphologically different.
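
As a concrete, purely illustrative example, the snippet below lists the character n-grams that fastText would extract for the word "skola", using fastText’s default n-gram lengths of 3 to 6 and the word-boundary markers < and >:

# Illustration only: enumerate the character n-grams of a word the way fastText does.
def char_ngrams(word, min_n=3, max_n=6):
    token = f"<{word}>"  # fastText adds boundary symbols around the word
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(token) - n + 1):
            grams.append(token[i:i + n])
    return grams

print(char_ngrams("skola"))
# ['<sk', 'sko', 'kol', 'ola', 'la>', '<sko', 'skol', 'kola', 'ola>',
#  '<skol', 'skola', 'kola>', '<skola', 'skola>']

The full word itself also gets its own vector, and a word’s representation is the sum of these n-gram vectors.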

In the training process, a continuous bag-of-words (CBOW) model predicts a word from its context, just like word2vec. But where word2vec learns a single vector per word, fastText builds each word’s vector from the vectors of its character n-grams. By considering subwords, fastText can generate word representations even for unseen or misspelled words by combining the embeddings of their constituent subwords. Compared to word2vec, fastText is efficient: subword units are shared across words, which keeps the model compact and enables fastText to handle a large number of words, including rare or out-of-vocabulary words. This, in turn, creates more robust representations.
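
The practical consequence is easy to see once a model has been loaded (see the Usage section below). The words in this sketch are only illustrative:

# Assumes a model loaded as in the Usage section below, e.g.
# model = FastText.load("/Users/emademir/Documents/politics/1981_1991_lemmatized.bin")
misspelled = "lärrare"                            # deliberately misspelled, illustrative only
print(misspelled in model.wv.key_to_index)        # False: never seen during training
vector = model.wv[misspelled]                     # still works: built from its character n-grams
print(model.wv.similarity("lärare", misspelled))  # typically high, since they share most n-grams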

Usage

This is not a comprehensive guide to using gensim’s fastText; it is specifically aimed at Ema. Open the terminal as previously instructed, then start the Python interactive shell, in which we will perform our analyses, by typing:

python3

In the resulting shell, type:

from gensim.models import FastText

To load a model, type:

model = FastText.load("model.bin")

The file name model.bin is a placeholder, so you need to replace it with the full name of the actual file, including the path to where the file is stored on your computer. I think it’s something like /Users/emademir/Documents/politics/ and the file naming scheme is as follows: 1981_1991_lemmatized.bin, making the full path to the file /Users/emademir/Documents/politics/1981_1991_lemmatized.bin. If this is the case, you can load the model like this:

model = FastText.load("/Users/emademir/Documents/politics/1981_1991_lemmatized.bin")

Also note that you can load multiple models and keep them in memory. To do this, give each model a unique name, for example model_1, model_2, etc. instead of model as in the example above. Some similarity functions are listed below, followed by a sketch that loads two period models side by side, but I will also refer to some tutorials that might be helpful.

Get similarity of two terms:

model.wv.similarity("elev", "lärare")

Get the similarity between two sets of words:

model.wv.n_similarity(["elev", "lärare"], ["skola", "undervisning"])

Get most similar words:

model.wv.most_similar("elev")

Do some word arithmetic:

model.wv.most_similar(positive=["lärare", "rektor"], negative=["elev"])

Find the word that does not match the rest:

model.wv.doesnt_match("lärare elev undervisning pommac".split())
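
Since the point of the project is to compare time periods, a typical session loads two (or more) period models under unique names, as mentioned above, and runs the same query against each. A small illustrative sketch, assuming the file paths follow the naming scheme described earlier:

from gensim.models import FastText

# Load two period models side by side (paths follow the assumed naming scheme).
model_1 = FastText.load("/Users/emademir/Documents/politics/1981_1991_lemmatized.bin")
model_2 = FastText.load("/Users/emademir/Documents/politics/1992_2011_lemmatized.bin")

# Compare the nearest neighbours of the same word in the two periods.
word = "elev"
for period, m in [("1981-1991", model_1), ("1992-2011", model_2)]:
    print(period)
    for neighbour, score in m.wv.most_similar(word, topn=5):
        print(f"  {neighbour}\t{score:.3f}")

Note that the two models are trained independently, so their vectors do not share a coordinate system; comparisons are therefore made via neighbour lists and similarity scores within each model, not by comparing raw vectors across models.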

Some methodological choices

Models that intrinsically track semantic and lexical change are still experimental, so I wanted to use a well-established model for Ema's project. Be aware, however, that comparing lexical use over time across multiple separately trained models is not a standardised method.

Evaluation

No evaluation, other than ocular inspection, has been performed at this time. Some inspiration for quantitative evaluation is listed under Resources below.

Resources

Some resources for non-data scientists are listed below.

A good tutorial on similarity operations can be found here: https://notebook.community/ELind77/gensim/docs/notebooks/FastText_Tutorial

Here is a link to future me with some quantitative eval inspiration: https://github.com/PacktPublishing/fastText-Quick-Start-Guide/blob/master/chapter5/fasttext%20with%20gensim.ipynb

Further explanation of fastText model: https://towardsdatascience.com/word-embedding-with-word2vec-and-fasttext-a209c1d3e12c

Official documentation of gensim fastText: https://radimrehurek.com/gensim/models/fasttext.html
