Developing high-quality question-and-answer sets for chatbots

You've started work on introducing chatbots to your organisation. Now find out what's involved in development and implementation.


This guide is for anyone who has begun work on implementing chatbots and would like guidance on how to develop high-quality question-and-answer sets. It may also be of use to those who are strongly considering implementing a chatbot and want to find out more about the work involved in development and implementation.

This guide begins by explaining the importance of high-quality question-and-answer sets for chatbot performance. It then outlines the structure of question-and-answer sets, so that readers are:

  • Introduced to the key terms used when describing chatbot question-and-answer sets
  • Able to understand central concepts underpinning chatbot use
  • Able to decide whether the guidance in this document is relevant to the type of chatbot they are using

We then look at the initial steps that should be taken to develop high-quality question-and-answer sets and provide a few ‘rules of thumb’ that will help in developing them. Finally, we look at the steps that can be taken to refine question-and-answer sets and manage their ongoing development once the chatbot has gone live.

The importance of question-and-answer sets

For users to benefit from chatbot use, those developing chatbot content must anticipate the kinds of questions that users are likely to ask and the information they would need in response. As such, chatbot performance depends on more than just the effectiveness of the software being used: performance depends on the calibre of the content too. Ultimately, the quality of a question-and-answer set depends on:

  • How well it captures the information that users want to find out from using the chatbot
  • How well its structure aligns with the logic of how the chatbot works

The structure of question-and-answer sets

A typical question-and-answer set is built upon three fundamental elements:

  • Intents – The underlying meaning or intention behind a question
  • Utterances – Different ways of wording a particular question
  • Fulfilments – The answer to a user's question

For example:

  • Intent – A user wants to find out if the cafeteria caters for vegans
  • Utterances – “Do you serve vegan food in college?”, “Is the food in the café suitable for vegans?”, “Do you cater for vegans?”, “I’m vegan, will I be able to get lunch in college?”
  • Fulfilment – “The college cafeteria offers a range of dishes that are suitable for vegans. To see the menu for the next two weeks, please visit...”

Giving the chatbot multiple examples of how a question could be asked (in other words, inputting numerous utterances) enables the chatbot’s machine learning to build models of the intent, so that the chatbot can respond to questions that have been worded in a way that was not anticipated. Take the example above of the intent relating to vegan food. It would be very difficult – potentially impossible – to anticipate all the ways in which a user might express their intention of finding out whether the cafeteria caters for vegans.

That said, if the chatbot is given a large enough sample of example questions (ie utterances), it will be able to use machine learning models to establish the underlying meaning that unites them. This means that if a user were to ask “is there vegan food in the college cafeteria?”, the chatbot should be able to recognise the meaning (intent) behind this question, and hence respond with the answer (fulfilment) given in the example above.
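The three elements described in this section can be pictured as a simple data structure. The following is a minimal sketch only: the class name `Intent`, its field names and the label `cafeteria_vegan_options` are hypothetical and chosen for illustration. A real chatbot platform will have its own format for storing question-and-answer sets.

```python
from dataclasses import dataclass, field


@dataclass
class Intent:
    """One entry in a question-and-answer set (hypothetical structure)."""

    name: str  # short label for the underlying meaning
    utterances: list[str] = field(default_factory=list)  # example wordings
    fulfilment: str = ""  # the answer returned to the user


# The vegan food example from this guide, expressed in that structure.
vegan_food = Intent(
    name="cafeteria_vegan_options",  # hypothetical label
    utterances=[
        "Do you serve vegan food in college?",
        "Is the food in the café suitable for vegans?",
        "Do you cater for vegans?",
        "I'm vegan, will I be able to get lunch in college?",
    ],
    fulfilment=(
        "The college cafeteria offers a range of dishes that are "
        "suitable for vegans. To see the menu for the next two weeks, "
        "please visit..."
    ),
)

# A question-and-answer set is then a collection of such intents.
question_and_answer_set = [vegan_food]
```

Keeping each intent's utterances and fulfilment together in one record makes it easier to review whether every wording genuinely shares the same meaning and the same answer.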

First steps

Step 1: clarify the purpose of your chatbot

Before beginning work on the question-and-answer set, it is important to pinpoint the aims and objectives of the chatbot itself. Doing so will help you to shape the topics you want the question-and-answer set to focus on, while also providing useful steers on the chatbot’s tone and the kind of information that should be conveyed through its responses to users.

Step 2: identify the questions your chatbot needs to be able to answer

The use case for the chatbot will determine, in general terms, the topics that need to be captured in the question-and-answer set. After this, the next step is to think carefully about the kinds of queries that users are likely to raise, and the kinds of responses that will satisfy these queries. Where possible, a useful place to start is to gather evidence on what kinds of queries users are already raising. For instance, if a chatbot’s purpose is to answer new students’ queries around enrolment – queries that might normally be raised to staff at a student help desk – then evidence of the kinds of questions that are normally asked could be gained by:

  • Using the experience of staff at the help desk
  • Reviewing relevant emails sent by students to the help desk

In some cases, it may not be possible to gather an evidence base around the queries students are likely to have (the chatbot may be responding to student queries about a new process, for example). Where this is the case, it is suggested that professional judgement is used to anticipate the kinds of questions that students might raise.

For instance, if a chatbot’s purpose is to answer student questions about a new set of courses on offer, then those developing the question-and-answer set (alongside those with expertise in these new courses, perhaps) should review how students are receiving information and anticipate where they may want more detail or clarification on certain points.

Whatever means are used to predict which questions will be asked, it is important to capture them systematically so that they can be referred back to by all individuals involved in the question-and-answer set development.
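One possible way to capture anticipated questions systematically is a shared spreadsheet or CSV file. The sketch below assumes three hypothetical columns (`question`, `topic` and `source`) and uses made-up example rows. The in-memory buffer stands in for a real shared file.

```python
import csv
import io

FIELDS = ["question", "topic", "source"]  # hypothetical column names


def write_question_log(rows: list[dict], handle) -> None:
    """Write anticipated questions to a CSV file-like object."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Made-up entries, for illustration only.
anticipated = [
    {"question": "How do I enrol?", "topic": "enrolment", "source": "help desk staff"},
    {"question": "When does enrolment close?", "topic": "enrolment", "source": "email review"},
]

buffer = io.StringIO()  # stands in for a file shared by the team
write_question_log(anticipated, buffer)
print(buffer.getvalue())
```

Recording where each question came from makes it easier to revisit the evidence later when deciding which intents to add or split.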

Step 3: establish how users are likely to ask their questions

Having established which questions the chatbot will need to answer, the next step is to identify the variety of ways in which these questions could be worded. This is because the chatbot needs to be given a number of example wordings (utterances) of a particular question (intent) so that it can build machine learning-based models of the full range of ways in which that question could be worded.

It is difficult to predict precisely how users will word their questions. That said, there are a number of methods that can be employed to make the task easier:

  • Where possible, draw upon evidence of how students have raised such queries in the past (for example, ask relevant members of staff, review appropriate email records)
  • Work collaboratively within teams when generating multiple wordings for the same question, to draw upon different colleagues’ insights
  • Where appropriate, get input from users on how they might word questions (this can be particularly useful for anticipating the jargon users might use)

When generating different wordings for a question, it is important to know how many different wordings (utterances) should be input into the question-and-answer set. From experience, it has been found that using five to ten utterances (different ways of wording the same question) per intent (group of questions that all share the same meaning) tends to be more effective than using fewer than five utterances. It has also been found that using more than ten utterances can yield diminishing returns or be counterproductive.
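The five-to-ten guideline above can be checked automatically before content is loaded into a chatbot. This is a minimal sketch, assuming the question-and-answer set is held as a dictionary that maps a hypothetical intent name to its list of utterances. The thresholds are the ones suggested in this guide and are not hard limits.

```python
MIN_UTTERANCES = 5   # lower bound suggested in this guide
MAX_UTTERANCES = 10  # upper bound suggested in this guide


def check_utterance_counts(qa_set: dict[str, list[str]]) -> dict[str, str]:
    """Return a warning for each intent outside the five-to-ten range."""
    warnings = {}
    for intent, utterances in qa_set.items():
        count = len(utterances)
        if count < MIN_UTTERANCES:
            warnings[intent] = f"only {count} utterances; consider adding more"
        elif count > MAX_UTTERANCES:
            warnings[intent] = f"{count} utterances; consider trimming"
    return warnings


# The vegan food example from earlier in this guide has four utterances.
example_set = {
    "cafeteria_vegan_options": [
        "Do you serve vegan food in college?",
        "Is the food in the café suitable for vegans?",
        "Do you cater for vegans?",
        "I'm vegan, will I be able to get lunch in college?",
    ],
}
print(check_utterance_counts(example_set))
```

Intents that fall within the range produce no warning, so an empty result means every intent meets the guideline.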

Rules of thumb

High-quality question-and-answer sets need to be structured in a way that aligns with the logic of how the chatbot works. Chatbots use the utterances (different wordings for the same question) that have been inputted to build machine learning models, which then enable them to establish whether a user’s query has the same meaning/intention as a question in the question-and-answer set. This underlying mechanism gives rise to several rules of thumb for how to develop question-and-answer sets:

Make sure all utterances within a given intent are equivalent

The chatbot’s machine learning models will assume that, within a given intent, all the utterances are equivalent. In other words, it will assume that you are grouping questions into differently worded versions of the same question. Based on this assumption, the chatbot is then able to build models, which will establish the connections between these utterances.

One implication of this is that if those developing the question-and-answer sets input non-equivalent utterances (for instance, “where is the library?” and “are there lifts I can take to get to the library?” are not equivalent wordings of the same question), then the chatbot will build its model based on incorrect assumptions. This is likely to diminish the chatbot’s performance.

When listing possible ways of wording a question, consider whether you are in fact capturing multiple related questions. If you are, the best approach may be to split up the intent into multiple intents. With the example above, for instance, it may be advantageous to have two separate intents: one for finding out where the library is, and one for finding out whether the library can be accessed via lift.

Utterances should be unique to a particular intent

As well as being equivalent to the other utterances within its intent, each utterance must be unique to one intent.

Consider the following two groups of questions:

  • The first is a grouping of questions about getting support with UCAS applications.
  • The second is a grouping of questions about getting support with personal statements.

Someone inputting utterances for these intents may consider the utterance “support with university application” to be relevant to either.

If this utterance were inputted into both intents and a student asked, “I need support with my uni application”, the chatbot would then identify that the intention behind the user question is similar to both the intents on UCAS and personal statements. It would likely struggle to establish which was most relevant – which increases the overall chance of inaccuracy.

One solution in the above case would be to have a separate intent for generic queries about university support, in addition to having intents for specific queries (ie around UCAS and personal statements – or even more granular).
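Utterances that have been entered under more than one intent can be found with a simple check. The sketch below assumes the same dictionary layout of hypothetical intent names mapped to utterances. It only catches wordings that are identical once case and spacing are ignored, so near-duplicates such as “uni application” and “university application” would still need a manual review.

```python
from collections import defaultdict


def normalise(utterance: str) -> str:
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(utterance.lower().split())


def find_shared_utterances(qa_set: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each utterance found under more than one intent to those intents."""
    seen = defaultdict(set)
    for intent, utterances in qa_set.items():
        for utterance in utterances:
            seen[normalise(utterance)].add(intent)
    return {u: sorted(intents) for u, intents in seen.items() if len(intents) > 1}


# Hypothetical intents based on the UCAS and personal statement example.
example_set = {
    "ucas_support": ["help with UCAS", "Support with university application"],
    "personal_statement_support": [
        "help with my personal statement",
        "support with university application",
    ],
}
print(find_shared_utterances(example_set))
```

Any utterance reported by this check is a candidate for moving into a separate, more generic intent, as suggested above.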

Refining question-and-answer sets through pre-launch testing

The first draft of the question-and-answer set will have made many assumptions about the kinds of questions users will want answered, and the ways in which those questions will be asked. As such, it is unlikely to enable the chatbot to perform optimally.

Fortunately, pre-launch testing will provide useful insights that can improve the quality of the question-and-answer set, and thus the performance of the chatbot.

Step 1: selecting testing subjects

One way to diagnose areas for improvement in the question-and-answer set is to test the chatbot before it is launched, using either students, staff or a mixture of both as testers.

An advantage of testing the chatbot on students is that the questions they put to the chatbot are more likely to be representative of what students would ask the live chatbot (particularly where students are the intended users).

An advantage of testing the chatbot on staff is that this can be easier and quicker to organise, which also makes it straightforward to do multiple iterations of testing and improvement.

Step 2: collecting data from testing

Two core things to learn through testing are:

  • Is the chatbot able to respond to queries with relevant information? (in other words, is the chatbot accurate?)
  • Are the chatbot’s responses useful? (in other words, are the answers clear and informative?)

A formalised way of collecting this information would be to ask testing subjects these questions after each query they raise with the chatbot, and then to collate the overall results.

Because chatbot accuracy/usefulness may vary between topic areas, it may also be helpful to collect data on the topic areas asked for each question, so that you can judge whether this influences the two measures of success.
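Collating the results by topic area could look something like the following. This is a sketch under stated assumptions: each tester response is recorded as a dictionary with the hypothetical keys `topic`, `relevant` and `useful`, and the sample data is made up for illustration.

```python
from collections import defaultdict


def summarise_by_topic(responses: list[dict]) -> dict[str, dict[str, float]]:
    """Return the share of relevant and useful responses for each topic."""
    grouped = defaultdict(list)
    for response in responses:
        grouped[response["topic"]].append(response)

    summary = {}
    for topic, rows in grouped.items():
        total = len(rows)
        summary[topic] = {
            "queries": total,
            # True counts as 1 and False as 0 when summed.
            "relevant_rate": sum(r["relevant"] for r in rows) / total,
            "useful_rate": sum(r["useful"] for r in rows) / total,
        }
    return summary


# Made-up tester feedback, for illustration only.
test_responses = [
    {"topic": "enrolment", "relevant": True, "useful": True},
    {"topic": "enrolment", "relevant": True, "useful": False},
    {"topic": "catering", "relevant": False, "useful": False},
    {"topic": "catering", "relevant": True, "useful": True},
]
for topic, stats in summarise_by_topic(test_responses).items():
    print(topic, stats)
```

Comparing the two rates across topics shows where the question-and-answer set most needs attention: a low relevance rate points to problems with intents and utterances, while a low usefulness rate points to the wording of fulfilments.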

Step 3: responding to pre-launch testing, and conducting further testing

The main purpose of pre-launch testing is to diagnose areas for development in the question-and-answer set. As such, it is important to make changes based on the feedback received.

Once the first round of changes has been made, it is sensible to conduct further testing to evaluate whether any shortcomings have been addressed. This raises the question of when the cycle ends, and at what point it becomes appropriate to launch the chatbot.

One point to consider here is that the version of the chatbot that is launched will not be the finished product. Notably, there should be updates to the question-and-answer set as more and more data is received from the live queries raised by users, and the way in which the chatbot responds.

As such, consider the following as steers for when to deploy the chatbot:

  • The main risks of deploying the chatbot too late (ie after lengthy and extensive testing) are the opportunity costs of not having use of the chatbot during the testing period and the effort involved in ongoing testing.
  • The main risk of deploying the chatbot too early (ie after only minimal testing) is that it may perform poorly and put off users. Note, however, that users’ expectations can be managed in order to reduce the risk of this outcome.

About our work in AI

Our work in AI accelerates the adoption of effective and ethical AI solutions within colleges and universities. As part of working towards this goal, we have conducted a pilot in which chatbots have been deployed in four UK further education colleges.

Conducting this pilot has enabled us to develop expertise in educational chatbots, which has, in turn, motivated the publication of a series of guidance documents on the subject of chatbots. This guide is the third in that series. The additional guides are:

This guide is made available under a Creative Commons License (CC BY-NC-ND).