Author: Christof Monz.
Machine translation has made great advances over the last few years, mostly due to novel architectures based on deep learning. Neural Machine Translation (NMT) architectures are extremely powerful: they are capable of encoding many aspects of the foreign (or source) sentence while producing fluent sentences in the native (or target) language. While the representational power of NMT systems is beyond doubt, it is not well understood what the hidden layers within an NMT system actually represent.
Earlier approaches indirectly studied the information captured by the hidden states of recurrent and non-recurrent neural machine translation models by feeding them into different classifiers.
To advance our understanding of what information is captured by the hidden layers of an NMT system, we look at the encoder hidden states of both transformer and recurrent machine translation models from a nearest-neighbor perspective. We investigate to what extent the nearest neighbors share information with the underlying word embeddings as well as with related WordNet entries. Additionally, we study the underlying syntactic structure of the nearest neighbors to shed light on the role of syntactic similarities in bringing the neighbors together. In contrast to the extrinsic approaches used in previous work, we compare transformer and recurrent models in a more intrinsic way, in terms of how well they capture lexical semantics and syntactic structures.
A Nearest-Neighbor Perspective
Following earlier work on word embeddings, we choose to look into the nearest neighbors of the hidden state representations to learn more about the information encoded in them. We treat each hidden state as the representation of the corresponding input token. This way, each occurrence of a word has its own representation. Based on this representation, we compute the list of n nearest neighbors of each word occurrence.
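The lookup described above can be sketched in a few lines. This is only an illustration under stated assumptions: the post does not say which similarity measure is used, so cosine similarity is an assumption here, and the function name `nearest_neighbors` and the toy vectors are invented for the example.

```python
import numpy as np

def nearest_neighbors(states, n):
    """Return, for each row of `states`, the indices of its n most similar rows.

    Cosine similarity is an assumption; the post does not name the measure.
    """
    # Normalize rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(states, axis=1, keepdims=True)
    unit = states / np.clip(norms, 1e-12, None)
    sims = unit @ unit.T
    # An occurrence should not count as its own neighbor.
    np.fill_diagonal(sims, -np.inf)
    # Sort each row by descending similarity and keep the first n indices.
    return np.argsort(-sims, axis=1)[:, :n]

# Toy data: four word occurrences with invented 3-dimensional "hidden states".
tokens = ["bank", "river", "bank", "money"]
states = np.array([
    [1.0, 0.1, 0.0],   # first occurrence of "bank"
    [0.9, 0.2, 0.0],   # "river"
    [0.0, 0.1, 1.0],   # second occurrence of "bank"
    [0.1, 0.0, 0.9],   # "money"
])
neighbors = nearest_neighbors(states, n=1)
for i, row in enumerate(neighbors):
    print(tokens[i], "->", [tokens[j] for j in row])
```

Because every occurrence has its own vector, the two occurrences of the same word in the toy data end up with different neighbors.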
Figure 1 shows an example of 5 nearest neighbors for two different occurrences of the word ‘deregulation’. Each item in this figure is a specific word occurrence, but we have removed occurrence information for the sake of simplicity. This example shows how the set of nearest neighbors changes depending on the occurrence of a word.
Nearest-Neighbors of Hidden States and Word Embeddings
Word embeddings tend to capture the dominant sense of a word, even when there is significant support for other senses in the training corpus. A hidden state corresponding to a word occurrence, on the other hand, can reasonably be assumed to capture more of the sense the word has in its current context. Comparing the nearest-neighbor lists of the two representations could therefore show which hidden state-based neighbors are not strongly related to the corresponding word embedding. To quantify the extent to which hidden layer representations overlap with the word embedding at the corresponding position, we count how many of the words in the nearest-neighbor lists of hidden states are covered by the nearest-neighbor list based on the corresponding word embeddings. In each experiment, the word embeddings used for computing the nearest neighbors come from the same system and the same trained model as the hidden states.
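A minimal sketch of this coverage count for a single word occurrence follows. The function name `coverage` and both word lists are hypothetical and chosen only for illustration; they are not taken from our data, and how the per-occurrence scores are aggregated into the reported figures is not shown here.

```python
def coverage(hidden_neighbors, embedding_neighbors):
    """Fraction of hidden-state neighbor words also found among the embedding neighbors."""
    if not hidden_neighbors:
        return 0.0
    embedding_set = set(embedding_neighbors)
    covered = sum(1 for word in hidden_neighbors if word in embedding_set)
    return covered / len(hidden_neighbors)

# Hypothetical neighbor lists for one occurrence of a word (invented for illustration).
hidden_nn = ["privatization", "liberalization", "reform", "markets", "competition"]
embedding_nn = ["regulation", "liberalization", "privatization", "deregulated", "reforms"]

score = coverage(hidden_nn, embedding_nn)
print(f"coverage = {score:.2f}")
```

Matching is done on exact word forms in this sketch, so "reform" and "reforms" do not count as a match.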
Table 1 shows the statistics of the coverage by the nearest neighbors based on embeddings in general and based on selected source POS tags for each of our models.
One can see that only between 18% and 37% of the hidden-state nearest neighbors are covered by the nearest neighbors of the corresponding word embeddings, showing that the information captured by the hidden layers differs substantially from the underlying word embeddings. This of course raises the question of how the hidden layer representations differ from word embeddings.
Nearest-Neighbors of Hidden States and Word Senses
To shed light on the capability of hidden states in terms of learning the sense of the word in the current context, we compute the coverage of the list of the nearest neighbors of hidden states with the directly related words from WordNet.
Table 2 shows the general and the POS-based coverage for our English-German system. The transformer model leads by a large margin. This means that more words from the WordNet relations of the word of interest are present among its hidden state nearest neighbors. A simple interpretation is that the hidden states of the transformer capture more lexical semantic information than the hidden states of the recurrent model. In other words, the hidden states of the recurrent model capture some other information, which brings words other than the WordNet relations of the word of interest into its neighborhood. This again raises the question of what type of information is better captured by the hidden layers of a recurrent model.
Nearest-Neighbors of Hidden States and Syntactic Structures
Recent comparisons of recurrent and non-recurrent architectures show that they differ in the extent to which they are capable of modeling syntactic structure. To measure the degree to which the hidden layers of an NMT encoder capture syntactic information, we compute the structural distance between each hidden state and its nearest neighbors. Here we use the standard PARSEVAL metric as the similarity measure between the trees.
Figure 2a shows the corresponding word and subtree of a hidden state of interest and Figure 2b-c shows the corresponding words and subtrees of its three neighbors. The leaves are substituted with dummy ‘XX’ labels to show that they do not influence the computed tree similarities.
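To make the tree comparison concrete, here is a small sketch of a PARSEVAL-style score between two subtrees. Several choices are assumptions made for illustration: trees are written as nested tuples, the score is the F1 over labeled brackets, and part-of-speech brackets are counted as well, which some PARSEVAL implementations exclude. The names `brackets` and `parseval_f1` are hypothetical.

```python
from collections import Counter

def brackets(tree, start=0):
    """Collect (label, start, end) spans of a tree given as nested tuples.

    A tree is (label, child, child, ...); a leaf is a plain string.
    Returns the multiset of spans and the position after the last leaf.
    """
    if isinstance(tree, str):
        # A leaf only advances the position; its word is never recorded.
        return Counter(), start + 1
    label, *children = tree
    spans = Counter()
    end = start
    for child in children:
        child_spans, end = brackets(child, end)
        spans += child_spans
    spans[(label, start, end)] += 1
    return spans, end

def parseval_f1(tree_a, tree_b):
    """Labeled-bracket F1 between two trees (assumed scoring variant)."""
    spans_a, _ = brackets(tree_a)
    spans_b, _ = brackets(tree_b)
    matched = sum((spans_a & spans_b).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(spans_b.values())
    recall = matched / sum(spans_a.values())
    return 2 * precision * recall / (precision + recall)

# Two invented subtrees with dummy 'XX' leaves.
tree_a = ("NP", ("DT", "XX"), ("NN", "XX"))
tree_b = ("NP", ("DT", "XX"), ("JJ", "XX"), ("NN", "XX"))

print(f"{parseval_f1(tree_a, tree_a):.3f}")
print(f"{parseval_f1(tree_a, tree_b):.3f}")
```

Since only labels and span positions enter the score, replacing the leaves with 'XX' leaves the result unchanged.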
Table 3 shows the average similarity between the corresponding constituent subtree of hidden states and the corresponding trees of their nearest neighbors, computed using PARSEVAL. Interestingly, the recurrent model takes the lead in average syntactic similarity. This confirms our hypothesis that the recurrent model dedicates more of its hidden-state capacity to capturing syntactic structures than the transformer model does. It is also in agreement with the results reported on learning syntactic structures using extrinsic tasks.
In summary, our findings show that a nearest-neighbor analysis of hidden layers can result in a better understanding of which information is captured by the hidden layer representations of an NMT encoder. We show (1) that the information captured by hidden layers differs substantially from the information captured by word embeddings, (2) that transformer-based architectures are more capable of capturing lexical information, while (3) recurrent architectures are better at capturing structural information.
Want to know more?
Our approach and findings are described in more detail in our upcoming MT-Summit 2019 paper: Hamidreza Ghader and Christof Monz. An Intrinsic Nearest Neighbor Analysis of Neural Machine Translation Architectures.
Christof Monz is an associate professor within ILPS and PhD supervisor of Hamidreza.