Multi-domain Neural Network Language Generation for Spoken Dialogue Systems

Authors

Tsung-Hsien Wen
Milica Gasic
Nikola Mrksic
Lina M. Rojas-Barahona
Pei-Hao Su
David Vandyke
Steve Young

Abstract
Introduction
Related Work
The Neural Language Generator
Model Fine-Tuning
Data Counterfeiting
- Figure 1
- Table 1
- Figure 2
- Figure 3
Discriminative Training
- Figure 4
- Table 2
Datasets
Corpus-Based Evaluation
Experimental Setup
Human Evaluation
- Table 3
Conclusion And Future Work
References

License

Abstract

©2016 Association for Computational Linguistics. Moving from limited-domain natural language generation (NLG) to open domain is difficult because the number of semantic input combinations grows exponentially with the number of domains. Therefore, it is important to leverage existing resources and exploit similarities between domains to facilitate domain adaptation. In this paper, we propose a procedure to train multi-domain, Recurrent Neural Network-based (RNN) language generators via multiple adaptation steps. In this procedure, a model is first trained on counterfeited data synthesised from an out-of-domain dataset, and then fine tuned on a small set of in-domain utterances with a discriminative objective function. Corpus-based evaluation results show that the proposed procedure can achieve competitive performance in terms of BLEU score and slot error rate while significantly reducing the data needed to train generators in new, unseen domains. In subjective testing, human judges confirm that the procedure greatly improves generator performance when only a small amount of data is available in the domain.

Introduction

Modern Spoken Dialogue Systems (SDS) are typically developed according to a well-defined ontology, which provides a structured representation of the domain data that the dialogue system can talk about, such as searching for a restaurant or shopping for a laptop. Unlike conventional approaches employing a substantial amount of handcrafting for each individual processing component (Ward and Issar, 1994; Bohus and Rudnicky, 2009) , statistical approaches to SDS promise a domain-scalable framework which requires a minimal amount of human intervention (Young et al., 2013) . Mrkšić et al. (2015) showed improved performance in belief tracking by training a general model and adapting it to specific domains. Similar benefit can be observed in Gašić et al. (2015) , in which a Bayesian committee machine (Tresp, 2000) was used to model policy learning in a multi-domain SDS regime.

In past decades, adaptive NLG has been studied from linguistic perspectives, such as systems that learn to tailor user preferences (Walker et al., 2007) , convey a specific personality trait (Mairesse and Walker, 2008; Mairesse and Walker, 2011) , or align with their conversational partner (Isard et al., 2006) . Domain adaptation was first addressed by Hogan et al. (2008) using a generator based on the Lexical Functional Grammar (LFG) fstructures (Kaplan and Bresnan, 1982) . Although these approaches can model rich linguistic phenomenon, they are not readily adaptable to data since they still require many handcrafted rules to define the search space. Recently, RNN-based language generation has been introduced (Wen et al., 2015a; Wen et al., 2015b) . This class of statistical generators can learn generation decisions directly from dialogue act (DA)-utterance pairs without any semantic annotations (Mairesse and Young, 2014) or hand-coded grammars (Langkilde and Knight, 1998; Walker et al., 2002) . Many existing adaptation approaches (Wen et al., 2013; Shi et al., 2015; Chen et al., 2015) can be directly applied due to the flexibility of the underlying RNN language model (RNNLM) architecture (Mikolov et al., 2010) .

Discriminative training (DT) has been successfully used to train RNNs for various tasks. By optimising directly against the desired objective function such as BLEU score or Word Error Rate (Kuo et al., 2002) , the model can explore its output space and learn to discriminate between good and bad hypotheses. In this paper we show that DT can enable a generator to learn more efficiently when in-domain data is scarce.

The paper presents an incremental recipe for training multi-domain language generators based on a purely data-driven, RNN-based generation model. Following a review of related work in section 2, section 3 describes the detailed RNN generator architecture. The data counterfeiting approach for synthesising an in-domain dataset is introduced in section 4, where it is compared to the simple model fine-tuning approach. In section 5, we describe our proposed DT procedure for training natural language generators. Following a brief review of the data sets used in section 6, corpus-based evaluation results are presented in section 7. In order to assess the subjective performance of our system, a quality test and a pairwise preference test are presented in section 8. The results show that the proposed adaptation recipe improves not only the objective scores but also the user's perceived quality of the system. We conclude with a brief summary in section 9.

Related Work

Domain adaptation problems arise when we have a sufficient amount of labeled data in one domain (the source domain), but have little or no labeled data in another related domain (the target domain). Domain adaptability for real world speech and language applications is especially important because both language usage and the topics of interest are constantly evolving. Historically, domain adaptation has been less well studied in the NLG community. The most relevant work was done by Hogan et al. (2008) . They showed that an LFG f-structure based generator could yield better performance when trained on in-domain sentences paired with pseudo parse tree inputs generated from a state-of-the-art, but out-ofdomain parser. The SPoT-based generator proposed by Walker et al. (2002) has the potential to address domain adaptation problems. However, their published work has focused on tailoring user preferences (Walker et al., 2007) and mimicking personality traits (Mairesse and Walker, 2011) . Lemon (2008) proposed a Reinforcement Learning (RL) framework in which policy and NLG components can be jointly optimised and adapted based on online user feedback. In contrast, Mairesse et al. (2010) has proposed using active learning to mitigate the data sparsity problem when training datadriven NLG systems. Furthermore, Cuayhuitl et al. (2014) trained statistical surface realisers from unlabelled data by an automatic slot labelling technique.

In general, feature-based adaptation is perhaps the most widely used technique (Blitzer et al., 2007; Pan and Yang, 2010; Duan et al., 2012) . By exploiting correlations and similarities between data points, it has been successfully applied to problems like speaker adaptation (Gauvain and Lee, 1994; Leggetter and Woodland, 1995) and various tasks in natural language processing (Daumé III, 2009) . In contrast, model-based adaptation is particularly useful for language modeling (LM) (Bellegarda, 2004) . Mixture-based topic LMs (Gildea and Hofmann, 1999) are widely used in N-gram LMs for domain adaptation. Similar ideas have been applied to applications that require adapting LMs, such as machine translation (MT) (Koehn and Schroeder, 2007) and personalised speech recognition (Wen et al., 2012) . Domain adaptation for Neural Network (NN)-based LMs has also been studied in the past. A feature augmented RNNLM was first proposed by Mikolov and Zweig (2012) , but later applied to multi-genre broadcast speech recognition (Chen et al., 2015) and personalised language modeling (Wen et al., 2013) . These methods are based on finetuning existing network parameters on adaptation data. However, careful regularisation is often necessary . In a slightly different area, Shi et al. (2015) applied curriculum learning to RNNLM adaptation.

Discriminative training (DT) (Collins, 2002) is an alternative to the maximum likelihood (ML) criterion. For classification, DT can be split into two phases: (1) decoding training examples using the current model and scoring them, and (2) adjusting the model parameters to maximise the separation between the correct target annotation and the competing incorrect annotations. It has been successfully applied to many research problems, such as speech recognition (Kuo et al., 2002; Voigtlaender et al., 2015) and MT (He and Deng, 2012; Auli et al., 2014) . Recently, trained an RNNLM with a DT objective and showed improved performance on an MT task. However, their RNN probabilities only served as input features to a phrase-based MT system.

The Neural Language Generator

The neural language generation model (Wen et al., 2015a; Wen et al., 2015b ) is a RNNLM (Mikolov et al., 2010) augmented with semantic input features such as a dialogue act 1 (DA) denoting the required semantics of the generated output. At every time step t, the model consumes the 1-hot representation of both the DA d t and a token w t 2 to update its internal state h t . Based on this new state, the output distribution over the next output token is calculated. The model can thus generate a sequence of tokens by repeatedly sampling the current output distribution to obtain the next input token until an end-ofsentence sign is generated. Finally, the generated sequence is lexicalised 3 to form the target utterance.

The Semantically Conditioned Long Short-term Memory Network (SC-LSTM) (Wen et al., 2015b ) is a specialised extension of the LSTM network (Hochreiter and Schmidhuber, 1997) for language generation which has previously been shown capable of learning generation decisions from paired DA-utterances end-to-end without a modular pipeline (Walker et al., 2002; Stent et al., 2004) . Like LSTM, SC-LSTM relies on a vector of memory cells c t ∈ R n and a set of elementwise multiplication gates to control how information is stored, forgotten, and exploited inside the network. The SC-LSTM architecture used in this paper is defined by 1 A combination of an action type and a set of slot-value pairs. e.g. inform(name="Seven days",food="chinese")

2 We use token instead of word because our model operates on text for which slot values are replaced by their corresponding slot tokens. We call this procedure delexicalisation. 3 The process of replacing slot token by its value.

the following equations,

where n is the hidden layer size, i t , f t , o t , r t ∈ [0, 1] n are input, forget, output, and reading gates respectively,ĉ t and c t are proposed cell value and true cell value at time t, W 5n,2n and W dc are the model parameters to be learned. The major difference of the SC-LSTM compared to the vanilla LSTM is the introduction of the reading gates for controlling the semantic input features presented to the network. It was shown in Wen et al. (2015b) that these reading gates act like keyword and key phrase detectors that learn the alignments between individual semantic input features and their corresponding realisations without additional supervision. After the hidden layer state is obtained, the computation of the next word distribution and sampling from it is straightforward,

where W ho is another weight matrix to learn. The entire network is trained end-to-end using a cross entropy cost function, between the predicted word distribution p t and the actual word distribution y t , with regularisations on DA transition dynamics,

T is the DA vector at the last index T, and η and ξ are constants set to 10 −4 and 100, respectively.

Model Fine-Tuning

A straightforward way to adapt NN-based models to a target domain is to continue training or fine-tuning a well-trained generator on whatever new target domain data is available. This training procedure is as follows:

1. Train a source domain generator θ S on source domain data {d i , Ω i } ∈ S with all values delexicalised 4 .

2. Divide the adaptation data into training and validation sets. Refine parameters by training on adaptation data {d i , Ω i } ∈ T with early stopping and a smaller starting learning rate. This yields the target domain generator θ T .

Although this method can benefit from parameter sharing of the LM part of the network, the parameters of similar input slot-value pairs are not shared 4 . In other words, realisation of any unseen slot-value pair in the target domain can only be learned from scratch. Adaptation offers no benefit in this case.

Data Counterfeiting

In order to maximise the effect of domain adaptation, the model should be able to (1) generate acceptable realisations for unseen slot-value pairs based on similar slot-value pairs seen in the training data, and (2) continue to distinguish slot-value pairs that are similar but nevertheless distinct. Instead of exploring weight tying strategies in different training stages (which is complex to implement and typically relies on ad hoc tying rules), we propose instead a data counterfeiting approach to synthesise target domain data from source domain data. The procedure is shown in Figure 1 and described as following:

1. Categorise slots in both source and target domain into classes, according to some similarity measure. In our case, we categorise them based on their functional type to yield three classes: informable, requestable, and binary 5 .

2. Delexicalise all slots and values.

3. For each slot s in a source instance (d i , Ω i ) ∈ S, randomly select a new slot s that belongs to both the target ontology and the class of s to replace s. Repeat this process for every slot in the instance and yield a new pseudo instance

4. Train a generatorθ T on the counterfeited dataset {d i ,Ω i } ∈ T.

5. Refine parameters on real in-domain data. This yields final model parameters θ T .

This approach allows the generator to share realisations among slot-value pairs that have similar functionalities, therefore facilitates the transfer learning Table 1 : of rare slot-value pairs in the target domain. Furthermore, the approach also preserves the co-occurrence statistics of slot-value pairs and their realisations. This allows the model to learn the gating mechanism even before adaptation data is introduced.

**Table 1. Ontologies for Laptop and TV domains**

We first compared the data counterfeiting (counterfeit) approach with the model fine-tuning (tune) method and models trained from scratch (scratch). Figure 2 shows the result of adapting models between similar domains, from laptop to TV. Because of the parameter sharing in the LM part of the network, model fine-tuning (tune) achieves a better BLEU score than training from scratch (scratch) when target domain data is limited. However, if we apply the data counterfeiting (counterfeit) method, we obtain an even greater BLEU score gain. This is mainly due to the better realisation of unseen slotvalue pairs. On the other hand, data counterfeiting (counterfeit) also brings a substantial reduction in slot error rate. This is because it preserves the co-occurrence statistics between slot-value pairs and realisations, which allows the model to learn good semantic alignments even before adaptation data is introduced. Similar results can be seen in Figure 3 , in which adaptation was performed on more disjoint domains: restaurant and hotel joint domain to laptop and TV joint domain. The data counterfeiting (counterfeit) method is still superior to the other methods.

Figure 2. Results evaluated on TV domain by adapting models from laptop domain. Comparing train-from-scratch model (scratch) with model fine-tuning approach (tune) and data counterfeiting method (counterfeit). 10% ≈ 700 examples.

**Figure 3. The same set of comparison as in Figure 2, but the results were evaluated by adapting from SF restaurant and hotel joint dataset to laptop and TV joint dataset. 10% ≈ 2K examples.**

Discriminative Training

In contrast to the traditional ML criteria (Equation 1) whose goal is to maximise the log-likelihood of correct examples, DT aims at separating correct examples from competing incorrect examples. Given a training instance (d i , Ω i ), the training process starts by generating a set of candidate sentences Gen(d i ) using the current model parameter θ and DA d i . The discriminative cost function can therefore be written as

where L(Ω, Ω i ) is the scoring function evaluating candidate Ω by taking ground truth Ω i as reference. p θ (Ω|d i ) is the normalised probability of the candidate and is calculated by

is a tuned scaling factor that flattens the distribution for γ < 1 and sharpens it for γ > 1. The unnormalised candidate likelihood log p(Ω|d i , θ) is produced by summing token likelihoods from the RNN generator output,

The scoring function L(Ω, Ω i ) can be further generalised to take several scoring functions into account

where β j is the weight for j-th scoring function.

Since the cost function presented here (Equation 2) is differentiable everywhere, back propagation can be applied to calculate the gradients and update parameters directly.

The generator parameters obtained from data counterfeiting and ML adaptation were further tuned by applying DT. In each case, the models were optimised using two objective functions: BLEU-4 score and slot error rate. However, we used a soft version of BLEU called sentence BLEU as described in Auli and Gao (2014) , to mitigate the sparse n-gram match problem of BLEU at the sentence level. In our experiments, we set γ to 5.0 and β j to 1.0 and -1.0 for BLEU and ERR, respectively. For each DA, we applied our generator 50 times to generate candidate sentences. Repeated candidates were removed. We treated the remaining candidates as a single batch and updated the model parameters by the procedure described in section 5. We evaluated performance of the algorithm on the laptop to TV adaptation scenario, and compared models with and without discriminative training (ML+DT & ML). The results are shown in Figure 4 where it can be seen that DT consistently improves generator performance on both metrics. Another interesting point to note is that slot error rate is easier to optimise compared to BLEU (ERR→ 0 after DT). This is probably because the sentence BLEU optimisation criterion is only an approximation of the corpus BLEU score used for evaluation. Table 2 : Human evaluation for utterance quality in two domains. Results are shown in two metrics (rating out of 3). Statistical significance was computed using a two-tailed Student's t-test, between the model trained with full data (scrALL) and all others.

**Figure 4. Effect of applying DT training after ML adaptation. The results were evaluated on laptop to TV adaptation. 10% ≈ 700 examples.**

Table 2. Human evaluation for utterance quality in two domains. Results are shown in two metrics (rating out of 3). Statistical significance was computed using a two-tailed Student’s t-test, between the model trained with full data (scrALL) and all others.

Datasets

In order to test our proposed recipe for training multi-domain language generators, we conducted experiments using four different domains: finding a restaurant, finding a hotel, buying a laptop, and buying a television. Datasets for the restaurant and hotel domains have been previously released by Wen et al. (2015b) . These were created by workers recruited by Amazon Mechanical Turk (AMT) by asking them to propose an appropriate natural language realisation corresponding to each system dialogue act actually generated by a dialogue system. However, the number of actually occurring DA combinations in the restaurant and hotel domains were rather limited (∼200) and since multiple references were collected for each DA, the resulting datasets are not sufficiently diverse to enable the assessment of the generalisation capability of the different training methods over unseen semantic inputs. In order to create more diverse datasets for the laptop and TV domains, we enumerated all possible combinations of dialogue act types and slots based on the ontology shown in Table 1 . This yielded about 13K distinct DAs in the laptop domain and 7K distinct DAs in the TV domain. We then used AMT workers to collect just one realisation for each DA. Since the resulting datasets have a much larger input space but only one training example for each DA, the system must learn partial realisations of concepts and be able to recombine and apply them to unseen DAs. Also note that the number of act types and slots of the new ontology is larger, which makes NLG in both laptop and TV domains much harder.

Corpus-Based Evaluation

We first assess generator performance using two objective evaluation metrics, the BLEU-4 score (Papineni et al., 2002) and slot error rate ERR (Wen et al., 2015b) . Slot error rates were calculated by averaging slot errors over each of the top 5 realisations in the entire corpus. We used multiple references to compute the BLEU scores when available (i.e. for the restaurant and hotel domains). In order to better compare results across different methods, we plotted the BLEU and slot error rate curves against different amounts of adaptation data. Note that in the graphs the x-axis is presented on a log-scale.

Experimental Setup

The generators were implemented using the Theano library (Bergstra et al., 2010; Bastien et al., 2012) , and trained by partitioning each of the collected corpora into a training, validation, and testing set in the ratio 3:1:1. All the generators were trained by treating each sentence as a mini-batch. An l 2 regularisation term was added to the objective function for every 10 training examples. The hidden layer size was set to be 100 for all cases. Stochastic gradient descent and back propagation through time (Werbos, 1990) were used to optimise the parameters. In order to prevent overfitting, early stopping was implemented using the validation set.

During decoding, we over-generated 20 utterances and selected the top 5 realisations for each DA according to the following reranking criteria,

where λ is a tradeoff constant, F (θ) is the cost generated by network parameters θ, and the slot error rate ERR is computed by exact matching of the slot tokens in the candidate utterances. λ is set to a large value (10) in order to severely penalise nonsensical outputs. Since our generator works stochastically and the trained networks can differ depending on the initialisation, all the results shown below were averaged over 5 randomly initialised networks.

Human Evaluation

Since automatic metrics may not consistently agree with human perception (Stent et al., 2005) , human testing is needed to assess subjective quality. To do this, a set of judges were recruited using AMT. We tested our models on two adaptation scenarios: laptop to TV and TV to laptop. For each task, two systems among the four were compared: training from scratch using full dataset (scrALL), adapting with DT training but only 10% of target domain data (DT-10%), adapting with ML training but only 10% of target domain data (ML-10%), and training from scratch using only 10% of target domain data (scr-10%). In order to evaluate system performance in the presence of language variation, each system generated 5 different surface realisations for each input DA and the human judges were asked to score each of them in terms of informativeness and naturalness (rating out of 3), and also asked to state a preference between the two. Here informativeness is defined as whether the utterance contains all the information specified in the DA, and naturalness is defined as whether the utterance could plausibly have been produced by a human. In order to decrease the amount of information presented to the judges, utterances that appeared identically in both systems were filtered out. We tested about 2000 DAs for each scenario distributed uniformly between contrasts except that allowed 50% more comparisons between ML-10% and DT-10% because they were close. Table 3 : available, training everything from scratch (scrALL) achieves a very good performance and adaptation is not necessary. However, if only a limited amount of in-domain data is available, efficient adaptation is critical (DT-10% & ML-10% > scr-10%). Moreover, judges also preferred the DT trained generator (DT-10%) compared to the ML trained generator (ML-10%), especially for informativeness. In the laptop to TV scenario, the informativeness score of DT method (DT-10%) was considered indistinguishable when comparing to the method trained with full training set (scrALL). The preference test results are shown in Table 3 . Again, adaptation methods (DT-10% & ML-10%) are crucial to bridge the gap between domains when the target domain data is scarce (DT-10% & ML-10% > scr-10%). The results also suggest that the DT training approach (DT-10%) was preferred compared to ML training (ML-10%), even though the preference in this case was not statistically significant.

**Table 3. Pairwise preference test among four approaches in two domains. Statistical significance was computed using two-tailed binomial test.**

Conclusion And Future Work

In this paper we have proposed a procedure for training multi-domain, RNN-based language generators, by data counterfeiting and discriminative training. The procedure is general and applicable to any datadriven language generator. Both corpus-based evaluation and human assessment were performed. Objective measures on corpus data have demonstrated that by applying this procedure to adapt models between four different dialogue domains, good performance can be achieved with much less training data. Subjective assessment by human judges confirm the effectiveness of the approach.

The proposed domain adaptation method requires a small amount of annotated data to be collected offline. In our future work, we intend to focus on training the generator on the fly with real user feedback during conversation.