The Birth of Bias: A case study on the evolution of gender bias in an English language model

Thu 14 July 2022

The language model (LM) is an essential building block of current AI systems dealing with natural language, and has proven useful in tasks as diverse as sentiment analysis, text generation, translation, and summarization. These LMs are typically based on neural networks and trained on vast amounts of data, which is what makes them so effective, but this does not come without problems.

LMs have also been shown to learn undesirable biases towards certain social groups, which may unfairly influence the decisions, recommendations or texts that AI systems building on those LMs generate. Research on detecting such biases is crucial, but as new LMs are continuously developed, it is equally important to study how LMs come to be biased in the first place, and what role the training data, architecture, and downstream application play at various phases in the life-cycle of an NLP model.

In this blog post, we outline our work towards a better understanding of the origins of bias in LMs by investigating the learning dynamics of an LSTM-based language model trained on an English Wikipedia corpus.

Gender bias for occupations in LMs

In our research, we studied a relatively small LM based on the LSTM architecture [1] and considered only one of the most studied types of bias in language models: gender bias. Because of this constrained focus, we were able to leverage previous work on measuring gender bias and get a detailed view of how it is learnt from the start of training, which wouldn't be feasible with a very large pre-trained LM.

In the image below you see a typical language modelling pipeline. As you can see there are multiple stages where the gender bias of a word can be measured: (I) in the (training) dataset of the model, (II) in the input embeddings (IE), which comprise the first layer of the LM, or (III) at the end of the pipeline in a downstream task that makes use of the contextual embeddings of the model.


At each of these stages, we measure the gender bias of the LM. Defining (undesirable) bias is a complicated matter, however, and there are many definitions and frameworks discussed in the literature [e.g. 2,3,4]. In our work, we use gender-neutrality as the norm, and define bias as any significant deviation from a 50-50% distribution in preferences, probabilities or similarities.

Inspired by previous work [5,6,7], we then considered the gender bias of 54 occupation terms. To quantify the gender bias of occupation words, we use a set of unambiguously gendered word-pairs (e.g. “man”-“woman”, “he”-“she”, “king”-“queen”) in our measures. It is important to keep in mind, however, that ‘gender’ is a multifaceted concept, which is much more complicated than a simple male-female dichotomy suggests [8,9].

Now that we have discussed what we mean by gender bias, we'll turn to what we found in our experiments.

The evolution of gender representation in the input embeddings

Before we studied the gender bias for occupation terms, we first focused on the question: How does the LM learn a representation of gender during training, and how is that representation encoded in the input embeddings? To answer this question, we trained several linear classifiers to predict the gender of 82 gendered word pairs (e.g. "he"-"she", "son"-"daughter", etc.). We did this for several snapshots during training time, one after each epoch (one pass over all the training data) and multiple during the first epoch. The LM was fully trained after 40 epochs.


How good is this classifier if we give it all the axes of the input embedding space? And what if we only give it the single best one (which we call the gender unit)? Interestingly, we find that after around 2 epochs of training, only using the gender unit as input outperforms using all the other axes of the input embedding space! It seems that the information on the gender is encoded highly locally in the input embeddings early on during training.
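To make the idea of a "gender unit" concrete, here is a minimal sketch of how one might search for the single axis that best separates female from male words. It rests on assumptions: the embeddings are made-up toy vectors, the function names are hypothetical, and the linear classifier is simplified to a threshold on one axis (which is what a linear classifier reduces to in one dimension). It is not the classifier setup used in our experiments.

```python
# Toy sketch: find the single embedding axis that best separates gendered words.
# All vectors below are invented illustration data, not real LM embeddings.

def best_threshold_accuracy(values, labels):
    """Best accuracy of a 1-D linear classifier (a threshold plus a sign)."""
    n = len(values)
    sorted_vals = sorted(set(values))
    # Candidate thresholds: one below all values, then midpoints between neighbours.
    candidates = [sorted_vals[0] - 1.0]
    candidates += [(a + b) / 2 for a, b in zip(sorted_vals, sorted_vals[1:])]
    best = 0.0
    for t in candidates:
        correct = sum((v > t) == bool(y) for v, y in zip(values, labels))
        # Flipping the sign of the classifier turns `correct` into `n - correct`.
        best = max(best, correct / n, (n - correct) / n)
    return best

def find_gender_unit(embeddings, labels):
    """Return (axis index, accuracy) of the most predictive single axis."""
    dims = len(embeddings[0])
    scores = [
        best_threshold_accuracy([e[d] for e in embeddings], labels)
        for d in range(dims)
    ]
    unit = max(range(dims), key=scores.__getitem__)
    return unit, scores[unit]

# Label 1 = female word, label 0 = male word (hypothetical toy data).
toy_embeddings = [
    [0.2, 0.9, -0.1],   # "she"
    [-0.3, 0.8, 0.4],   # "her"
    [0.1, 0.7, -0.5],   # "daughter"
    [0.3, -0.6, 0.0],   # "he"
    [-0.2, -0.8, 0.3],  # "his"
    [0.0, -0.7, -0.4],  # "son"
]
toy_labels = [1, 1, 1, 0, 0, 0]

unit, accuracy = find_gender_unit(toy_embeddings, toy_labels)
print(f"best single axis: {unit}, accuracy: {accuracy:.2f}")
```

In this toy data only the second axis (index 1) separates the two groups perfectly, so it plays the role of the gender unit. Comparing its accuracy with that of a classifier trained on all remaining axes would give the kind of comparison described above.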


Another interesting finding is that the dominant gender unit seems to be specialized for marking female words in particular, while information on the male words is distributed over the rest of the input embedding space. In the animation below, the top row contains the correct predictions for the gender unit classifier, while the right column contains the correct predictions for the classifier trained on all the other axes as input. The distance from the origin indicates how far the word is from the respective decision boundaries. Note how, for the gender unit, words like "she", "her", "herself", and "feminine" quickly move to the top, while by the end of the animation the masculine words are only predicted correctly by the classifier with the gender unit removed.


The evolution of gender bias

Using a linear decision boundary to study how gender is represented in an embedding space, as we did above, is actually very similar to one class of (gender) bias measures: those based on finding a linear gender subspace. Following Ravfogel, we define the gender subspace as the axis orthogonal to the decision boundary of a linear SVM that is trained to predict the gender for a set of male and female words [10].


Then, for finding the bias score of a word in the embedding space, you can look at the scalar projection on the gender subspace. In the example below, words like "engineer" and "scientist" are on the "male" side of the subspace (to the left of the purple decision boundary), while "nurse" and "receptionist" have a female bias (to the right). The farther away from the decision boundary, the higher the bias.
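As a rough illustration of the scalar projection, the sketch below computes the signed length of a word vector along a gender direction. The direction and the word vectors are invented two-dimensional toy values, and the names are hypothetical; in our setup the direction would come from the weights of the trained linear SVM. Which sign counts as "female" depends only on how the direction is oriented, and here we assume positive means female.

```python
import math

def scalar_projection(vec, direction):
    """Signed length of `vec` along `direction` (direction need not be unit length)."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(v * d for v, d in zip(vec, direction)) / norm

# Hypothetical gender direction: the normal of the classifier's decision boundary.
gender_direction = [0.0, 2.0]

# Invented toy embeddings for two occupation words.
toy_words = {
    "engineer": [0.5, -0.6],
    "nurse": [0.1, 0.8],
}

bias_scores = {w: scalar_projection(v, gender_direction) for w, v in toy_words.items()}
for word, score in bias_scores.items():
    side = "female" if score > 0 else "male"
    print(f"{word}: {score:+.2f} ({side} side)")
```

The sketch ignores the intercept of the classifier. If the decision boundary does not pass through the origin, the intercept divided by the norm of the direction would be added to get the signed distance to the boundary.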


Okay, so we have a measure for the gender bias in the input embeddings of the LM. However, as you may remember from the language modelling pipeline figure at the start of the blog post, this is only one part of the model. It is also important to consider the gender bias in a downstream task, as it is the bias here that is very likely to actually harm people (the input embeddings are hidden from the users). Unfortunately, it is not clear how the bias in the internal state of the model, such as the input embeddings, is related to its behaviour downstream.

For this reason, we also measure the gender bias in a downstream task, in particular the semantic textual similarity task for bias (STS-B) [11]. In this task, the LM is used to estimate the similarity between three sentences built from the same template, containing either the word "man", "woman", or an occupation term. The gender bias for that occupation term is then the difference between its similarity to the "man" sentence and its similarity to the "woman" sentence, averaged over 276 template sentences. Below you see an example for finding the gender bias for "janitor" using one template sentence, where the final score is 0.75 - 0.54 = 0.21 (male bias).
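The arithmetic of this score can be sketched in a few lines. The function name is hypothetical, and the similarity values would in practice come from the downstream model; only the single "janitor" pair below is taken from the example above, the second list is invented.

```python
def sts_bias(similarity_pairs):
    """Mean of (similarity to the "man" sentence - similarity to the "woman" sentence).

    `similarity_pairs` holds one (sim_man, sim_woman) tuple per template sentence.
    Positive values indicate a male bias, negative values a female bias.
    """
    differences = [sim_man - sim_woman for sim_man, sim_woman in similarity_pairs]
    return sum(differences) / len(differences)

# The single-template example from the text: 0.75 - 0.54, i.e. about 0.21 (male bias).
janitor_bias = sts_bias([(0.75, 0.54)])
print(f"janitor, one template: {janitor_bias:.2f}")

# With several templates the differences are averaged (invented values).
toy_bias = sts_bias([(0.75, 0.54), (0.60, 0.58), (0.70, 0.66)])
print(f"toy occupation, three templates: {toy_bias:.2f}")
```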


Equipped with these tools for measuring bias, we studied how the gender bias in the input embeddings and the STS-B task change during training. In our work, we have found a striking gender bias in both representations (which even correlates with the percentage of male versus female workers in official US Labour Statistics [5]).

In the figure below, you can find the bias scores in the LM at three points in time. We find that the bias measured in the input embeddings is fairly predictive of the bias for the occupation words in the STS-B task, although the correlation between the two measures is not higher than 0.6 at the end of training, indicating that there are still some important differences. One interesting difference between the two measures is that we can find a gender asymmetry in the input embedding bias (skewed towards a female bias at epoch 40), but this asymmetry seems to be masked at the level of the downstream task.
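For readers who want to run this kind of comparison on their own bias scores, here is a minimal sketch of a correlation between two lists of per-occupation scores. It assumes a Pearson correlation, which is our choice for the illustration and not something stated above, and the scores are invented toy values.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Invented per-occupation bias scores, listed in the same occupation order.
toy_embedding_bias = [-0.6, -0.2, 0.1, 0.5, 0.8]
toy_sts_bias = [-0.3, -0.4, 0.2, 0.1, 0.6]

toy_correlation = pearson(toy_embedding_bias, toy_sts_bias)
print(f"correlation between the two toy measures: {toy_correlation:.2f}")
```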


Diagnostic intervention: changing downstream bias by changing embeddings

In discussing the bias scores above, we have found that the bias in the input embeddings does in fact correlate with the STS-B bias. However, it is not clear whether there is also a causal relationship between the two representations. For this reason, we also did a diagnostic intervention: does debiasing the input embeddings result in a measurable effect on the bias measured downstream?

Again, we use a method proposed by Ravfogel to debias the input embeddings [10]. To do so, we simply project all the words on the null-space of the gender subspace (which is the decision boundary), effectively removing the gender information from this subspace. However, as Ravfogel has noted, there may be several linear representations, and therefore this procedure (finding a subspace and debiasing) should be repeated multiple times. We indicate the number of debiasing steps with k.
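The projection step itself is simple linear algebra, and the sketch below shows it on toy data. Two things are assumptions made for the illustration: the function names are hypothetical, and the gender direction is found with a difference of class means as a stand-in for the linear SVM. Because of that stand-in, the loop stops after one step here, whereas a retrained classifier may still find residual directions in later steps.

```python
import math

def project_out(vec, direction):
    """Project `vec` onto the null space of `direction` (remove that component)."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    coeff = sum(v * u for v, u in zip(vec, unit))
    return [v - coeff * u for v, u in zip(vec, unit)]

def mean_difference_direction(embeddings, labels):
    """Stand-in for the SVM: direction from the male mean to the female mean."""
    dims = len(embeddings[0])
    female = [e for e, y in zip(embeddings, labels) if y == 1]
    male = [e for e, y in zip(embeddings, labels) if y == 0]
    return [
        sum(e[d] for e in female) / len(female) - sum(e[d] for e in male) / len(male)
        for d in range(dims)
    ]

def iterative_debias(embeddings, labels, k, fit_direction):
    """Repeat 'fit a gender direction, project it out' up to k times."""
    current = [list(e) for e in embeddings]
    for _ in range(k):
        direction = fit_direction(current, labels)
        if math.sqrt(sum(d * d for d in direction)) < 1e-12:
            break  # the stand-in finds no remaining gender direction
        current = [project_out(e, direction) for e in current]
    return current

# Invented toy embeddings: label 1 = female word, label 0 = male word.
toy_embeddings = [[0.2, 0.9, -0.1], [-0.3, 0.8, 0.4], [0.3, -0.6, 0.0], [-0.2, -0.8, 0.3]]
toy_labels = [1, 1, 0, 0]

first_direction = mean_difference_direction(toy_embeddings, toy_labels)
debiased = iterative_debias(toy_embeddings, toy_labels, 3, mean_difference_direction)

# After debiasing, no vector has a component left along the first gender direction.
residual = max(abs(sum(a * b for a, b in zip(e, first_direction))) for e in debiased)
print(f"largest remaining component along the first direction: {residual:.1e}")
```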


To study the effect of debiasing the input embeddings on the bias in the downstream STS-B task, we debias 10 times. For each step k, we then measure (i) the STS-B bias, (ii) the effect on the perplexity of the LM, and (iii) the topological similarity of the embeddings compared to the original model. The perplexity values are normalized to one with respect to perplexity before debiasing.

In the figure below, you see the results for the LM after 1 and 40 epochs of training. A few surprising observations can be made here. First, while the bias of the original LM at epoch 40 is higher, debiasing is much more effective at the end of training and results in a lower bias in the end. This agrees with our earlier finding that gender information is represented more locally at the end of training, meaning that the bias is easier to remove effectively and selectively. Secondly, while applying many debiasing steps damages the LM considerably, applying only three reduces the bias without many side effects. Both the perplexity of the LM and the topological similarity of the input embeddings remain very similar to the situation before debiasing.


Since we have observed some gender asymmetries in our earlier experiments, we also wanted to study the effect on the female and male biased occupations separately. In the figure below you can see the effect of debiasing on the input embedding bias (left) and STS-B bias (right). Interestingly, when we consider the input embeddings, we see that debiasing once especially reduces the female bias (and even results in an increase of the measured male bias!). This could be explained by the fact that the dominant gender unit we identified earlier in this blog post specializes in female word information. Strangely enough, for the STS-B bias we see that the first debiasing step reduces the male bias more, and more research is needed to explain this behaviour.


While we have found that there is a causal relationship between the input embedding and STS-B bias, more work is needed on the precise nature of how they are connected. It is especially important to be aware of the possible (gender) asymmetries in the internal representations of the model, as naive mitigation strategies may introduce new undesirable biases by having a disparate effect on one social category.


  • We identify different phases in the learning dynamics of gender (bias), which differ in how locally gender is represented and in the dataset features that can explain them.
  • We have seen that the concept of gender is encoded increasingly locally in the input embeddings during training, and that the dominant gender unit specializes in feminine word information.
  • The bias measured in the input embeddings is fairly predictive of the bias measured downstream in a simple task. We even found that debiasing the input embeddings is effective in reducing the downstream bias, indicating some causal relationship.
  • We observed a striking gender asymmetry in our results, for instance, in how the dominant gender unit in the input embedding space specializes in female word information and how debiasing has a different effect depending on the gender bias of the word. This latter observation underlines the importance of understanding how the (gender) bias is represented in the internal state of the LM, as naive debiasing may introduce new biases.