Learning to Transform, Combine, and Reason in Open-Domain Question Answering

Author: Mostafa Dehghani.

We have all come to expect direct answers to complex questions from search systems over large open-domain knowledge sources such as the Web. Open-domain question answering is therefore a critical task for building systems that address these complex information needs.

To be precise, open-domain question answering is the task of answering a user’s question in the form of a short text, rather than a list of relevant documents, using openly available external sources.

Most open-domain question answering systems described in the literature first retrieve relevant documents or passages, select one or a few of them as the context, and then feed the question and the context to a machine reading comprehension system to extract the answer.
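
To make this standard pipeline concrete, here is a minimal, self-contained sketch of retrieve-then-read. The bag-of-words retriever and the placeholder reader are illustrative assumptions, not any particular system’s implementation:

```python
# Toy retrieve-then-read pipeline: a bag-of-words retriever and a
# placeholder reader, for illustration only.

def retrieve(question: str, corpus: list[str], top_k: int = 50) -> list[str]:
    """Rank documents by word overlap with the question (toy retriever)."""
    q_words = set(question.lower().split())
    return sorted(corpus,
                  key=lambda doc: len(q_words & set(doc.lower().split())),
                  reverse=True)[:top_k]

def extract_answer(question: str, context: list[str]) -> str:
    """Placeholder reader: a real system runs an MRC model over `context`."""
    return context[0]  # toy: just return the highest-ranked passage

corpus = [
    "Georges Braque co-founded the Cubist movement together with Picasso.",
    "Malaga is a city in the south of Spain.",
]
question = "Who co-founded the Cubist movement?"
candidates = retrieve(question, corpus)
# The usual setup reads only the top one or few retrieved documents:
print(extract_answer(question, candidates[:1]))
```

Note how the answer can only come from the few documents that survive the selection step; everything below the cutoff is discarded.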

However, the information needed to answer complex questions is not always contained in a single, directly relevant document that is ranked high. In many cases, there is a need to take a broader context into account, e.g., by considering low-ranked documents that are not immediately relevant, combining information from multiple documents, and reasoning over multiple facts from these documents to infer the answer.

Why should we take a broader context into account?

In order to better understand why taking a broader context into account can be necessary or useful, let’s consider an example. Assume that a user asks this question: “Who is the Spanish artist, sculptor and draughtsman famous for co-founding the Cubist movement?”

We can use a search engine to retrieve the top-k relevant documents. The figure below shows the question along with a couple of retrieved documents.

In an attempt to infer the correct answer to the user’s question given only the top-ranked document, a reading comprehension system will most likely extract “Georges Braque” as the answer, which is incorrect.

In this example, inferring the correct answer requires going down the ranked list and gathering and encoding facts, even those that are not immediately relevant to the question, such as the fact that “Malaga is a city in Spain,” which can be inferred from the document at rank 66. Then, in a multi-step reasoning process, new facts must be inferred: first “Picasso was a Spanish artist,” given the documents at ranks 12 and 66, and finally “Picasso, who was a Spanish artist, co-founded the Cubist movement,” given the previously inferred fact and the document at rank 3.

In this example, and in many open-domain question answering cases in general, a piece of information in a low-ranked document that is not immediately relevant to the question may fill in the blanks and complete the information extracted from the top-ranked documents, eventually supporting the inference of the correct answer. However, most open-domain question answering methods focus on only one or a few candidate documents, filtering out the less relevant ones to avoid dealing with noisy information, and operate over this selected set to extract the answer.

TraCRNet: Transform, Combine, and Reason

In one of our recent papers, we propose a new model, TraCRNet (pronounced “Tracker Net”), to improve open-domain question answering by explicitly operating on a larger set of candidate documents during the whole process and learning how to aggregate and reason over information from these documents effectively, while trying not to be distracted by noisy documents.

Given the candidate documents and the question, TraCRNet generates the answer in two stages. It first transforms the documents and the question into vectors by applying a stack of Transformer blocks with self-attention over the words in each document. It then updates these learned representations by combining and enriching them through a multihop reasoning process, applying multiple steps of the Universal Transformer (UT).

The figure below shows the general schema of the TraCRNet architecture:

Let’s go through the main ingredients (a rough code sketch follows the list):

  • Input encoding: This layer is in charge of encoding each of the documents and the question into a single vector, given their word embeddings. For this layer, we used a stack of N transformer encoder blocks, followed by a depthwise separable convolution and a pooling function, to get a single vector representation of the whole document or question (see the transformer encoder in the above figure).
  • Multihop reasoning: In this layer, the universal transformer (UT) is employed to combine evidence from all documents with respect to the question in a multi-step process with the capacity for multihop reasoning. In TraCRNet, the input to the UT encoder is the set of vectors computed by the input encoding layer, each representing a candidate document or the question (see the universal transformer encoder in the above figure). At each step of the UT, we add two embeddings to the vectors representing the question or documents: (i) a rank embedding that encodes the rank of each document given by the retrieval system and also distinguishes the question from the documents, and (ii) a step embedding that encodes the current depth of the UT. In the multihop reasoning layer, the representations of all the documents and the question learned in the previous layer are updated over T steps. Self-attention in this layer allows the model to understand each document based on the information in all the other documents as well as the question.
  • Output decoder: Given the output of the multihop reasoning layer, we use a stack of N transformer decoder blocks (see the transformer decoder in the above figure) to decode the answer.
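
Putting the three ingredients together, here is a rough PyTorch sketch of the forward pass. All hyperparameters are made up, mean-pooling stands in for the paper’s depthwise separable convolution and pooling, and the output decoder is omitted; this illustrates the wiring rather than reproducing the authors’ implementation:

```python
import torch
import torch.nn as nn

class TraCRNetSketch(nn.Module):
    def __init__(self, vocab_size=10000, d_model=128, n_heads=4,
                 n_enc_layers=2, ut_steps=8, max_rank=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Input encoding: a stack of transformer encoder blocks over words.
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.word_encoder = nn.TransformerEncoder(enc_layer, n_enc_layers)
        # Multihop reasoning: one shared block applied T times
        # (recurrence in depth with tied weights, as in the UT).
        self.ut_block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.ut_steps = ut_steps
        self.rank_embed = nn.Embedding(max_rank + 1, d_model)  # rank 0 = question
        self.step_embed = nn.Embedding(ut_steps, d_model)

    def forward(self, docs, question):
        # docs: (n_docs, doc_len) token ids; question: (q_len,) token ids.
        # 1) Transform: encode each unit's words and pool to a single vector
        #    (the paper uses a depthwise separable convolution + pooling).
        doc_vecs = self.word_encoder(self.embed(docs)).mean(dim=1)
        q_vec = self.word_encoder(self.embed(question.unsqueeze(0))).mean(dim=1)
        vecs = torch.cat([q_vec, doc_vecs], dim=0)  # question first, then docs
        # 2) Combine and reason: T steps of the shared UT block over the set
        #    of vectors, adding rank and step embeddings before each step.
        ranks = torch.arange(vecs.size(0))  # 0 = question, 1..n = doc ranks
        for t in range(self.ut_steps):
            step = torch.full_like(ranks, t)
            x = vecs + self.rank_embed(ranks) + self.step_embed(step)
            vecs = self.ut_block(x.unsqueeze(0)).squeeze(0)
        # 3) These vectors would feed a transformer decoder that generates
        #    the answer; the decoder is omitted from this sketch.
        return vecs

model = TraCRNetSketch()
docs = torch.randint(0, 10000, (50, 40))   # 50 candidate documents, 40 tokens each
question = torch.randint(0, 10000, (12,))  # the question, 12 tokens
vecs = model(docs, question)               # (51, 128): question + 50 documents
```

Note that the self-attention in step 2 runs over the 51 document/question vectors rather than over words, which is what lets each document’s representation be updated in light of every other document.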

Multihop reasoning with TraCRNet

Returning to our earlier example about the Spanish artist, sculptor and draughtsman who is famous for co-founding the Cubist movement, after learning representations for each top-ranked document and the question, TraCRNet updates them by applying multiple steps of the universal transformer.

Given the self-attention mechanism and the recurrence in depth of the universal transformer, in the first step TraCRNet can update the representation of document #12 by attending to document #66, augmenting the information in document #12. In the next reasoning step, TraCRNet can update the representation of document #3 by attending over the vector representing document #12 estimated in the previous step, enriching the information in document #3. After that, during answer generation, the decoder can attend to the final vector representing document #3 and produce the correct answer.

We looked into the attention distributions for this particular example and found a relation between the attention distributions and the reasoning steps needed to arrive at the correct answer. The figure below presents the attention distribution of different UT heads over all documents and the question while encoding document #12 at step 3 and the question at step 7:

Step 3:

Step 7:

At step 3, while transforming document #12, TraCRNet attends strongly to document #66 using heads #1 and #4 (blue and red), as well as to the question using head #3 (green). This is in accordance with the fact that the model first needs to update the information encoded in document #12 with the fact that “Malaga is a city in Spain” from document #66.

Later, at step 7, while encoding the question, TraCRNet attends over document #12, which now carries the information that “Picasso is a Spanish artist” (updated in step 3), using heads #1 and #4, as well as document #3, which contains information about Picasso as a co-founder of Cubism, using head #2 (green).
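
For readers curious how such per-head attention maps can be extracted in practice, the following sketch uses PyTorch’s nn.MultiheadAttention directly on random stand-in vectors; the shapes and setup are illustrative assumptions, not the paper’s code:

```python
import torch
import torch.nn as nn

d_model, n_heads = 128, 4
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

# Random stand-ins for the question (index 0) plus 50 document vectors.
vecs = torch.randn(1, 51, d_model)
_, weights = attn(vecs, vecs, vecs,
                  need_weights=True, average_attn_weights=False)
# weights: (batch, n_heads, 51, 51); row i holds unit i's attention
# distribution over all units. E.g., document #12's attention per head:
print(weights[0, :, 12].shape)  # torch.Size([4, 51])
```

Inspecting these per-head rows at different reasoning steps is what the visualizations above summarize.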

Why is TraCRNet a great model for open-domain question answering?

TraCRNet has a number of desirable features:

  • All the building blocks of TraCRNet are self-attentive feed-forward neural networks, so the per-symbol hidden state transformations are fully parallelizable. This leads to an enormous speedup during training and very fast input encoding at inference time compared to RNN-based models.
  • While there is no recurrence in time in our model, the recurrence in depth of the universal transformer used in the multihop reasoning layer gives the model the inductive bias needed to go beyond understanding each document separately and to combine their information over multiple steps.
  • TraCRNet has the global receptive field of transformer-based models, which helps it better encode long documents during input encoding and perform better inference over a rather large set of documents during multihop reasoning.
  • The hierarchical use of self-attention, first over words and then over documents, lets TraCRNet control its attention at both the word and the document level, making it less sensitive to noisy input, which is of key importance when encoding many documents.

All these properties of TraCRNet come together and lead to an effective and efficient architecture for open-domain question answering.

We evaluated TraCRNet on two public open-domain question answering datasets, SearchQA and Quasar-T, and achieved results that meet or exceed the state of the art. We also analyzed the sensitivity of TraCRNet to the number of candidate documents and conducted ablation studies on its architecture; you can find these in the paper listed below.

Want to know more?

TraCRNet was introduced in Learning to Transform, Combine, and Reason in Open-Domain Question Answering, a paper presented at the 12th ACM International Conference on Web Search and Data Mining (WSDM 2019).

Mostafa Dehghani is a PhD student within ILPS. His doctorate research is focused on training neural networks with imperfect supervision.