Contrastive Learning for Neural Topic Model

23/11/2021

NeurIPS 2021 (to appear): Contrastive Learning for Neural Topic Model
Authors: Thong Nguyen, Luu Anh Tuan


Introduction

Topic models have been successfully applied in Natural Language Processing to various tasks such as information extraction, text clustering, summarization, and sentiment analysis [1–6]. The most popular conventional topic model, Latent Dirichlet Allocation [7], learns document-topic and topic-word distributions via Gibbs sampling and mean-field approximation. To apply deep neural networks to topic modeling, Miao et al. [8] proposed neural variational inference as the training method, while Srivastava and Sutton [9] employed the logistic normal prior distribution.

Motivation

Recent studies [10, 11] showed that both the Gaussian and logistic normal priors employed by most neural topic models fail to capture the multimodality and semantic patterns of a document, which are crucial to maintaining the quality of a topic model.

To cope with this issue, the Adversarial Topic Model (ATM) [10–13] was proposed, which introduces an adversarial mechanism built from a generator and a discriminator. By seeking the equilibrium between the generator and the discriminator, the generator becomes capable of learning meaningful semantic patterns of the document.

Nonetheless, this framework has two main limitations. 

  • First, ATM relies on a key ingredient: discriminating the real distribution from the fake (negative) distribution to guide training. This limits its ability to exploit the mutual information between the real sample and the positive sample, which has been demonstrated to be a key driver of learning useful representations in unsupervised learning [14–18].
  • Second, ATM feeds random samples drawn from a prior distribution to the generator. Previous work [19] has shown that incorporating additional variables, such as metadata or document sentiment, to estimate the topic distribution aids the learning of coherent topics. By relying on a pre-defined prior distribution, ATM hinders the integration of those variables.

How do we resolve the aforementioned problems?

To address the above drawbacks, in this paper we propose a novel method to model the relations among samples without relying on a generative-discriminative architecture. In particular, we formulate the objective as an optimization problem that aims to move the representation of the input (the prototype) closer to the representation that shares its semantic content, i.e., the positive sample. We also take into account the relation between the prototype and the negative sample by forming an auxiliary constraint that enforces the model to push the representation of the negative sample farther from the prototype. Our mathematical framework ends with a contrastive objective, which is jointly optimized with the evidence lower bound of the neural topic model.

Nonetheless, another challenge arises: how do we effectively generate positive and negative samples in the neural topic model setting? Recent efforts have addressed positive sampling strategies and methods to generate hard negative samples for images [20–23]. However, research adapting these techniques to the neural topic model setting has been neglected in the literature. In this work, we introduce a novel sampling method that mimics the way human beings assess the similarity of a pair of documents.

Methodology

a) Notations and Problem Setting

In this paper, we focus on improving the performance of neural topic models (NTMs), measured via topic coherence. The NTM inherits the architecture of the Variational Autoencoder, where the latent vector is interpreted as the topic distribution. Suppose the vocabulary has V unique words; each document is represented as a word count vector x ∈ ℝ^V and a latent distribution over T topics, z ∈ ℝ^T. The NTM assumes that z is generated from a prior distribution p(z) and that x is generated from the conditional distribution over topics p_φ(x | z), modeled by a decoder φ. The aim of the model is to infer the document-topic distribution given the word counts. In other words, it must estimate the posterior distribution p(z | x), which is approximated by the variational distribution q_θ(z | x) modeled by an encoder θ. The NTM is trained by minimizing the following objective.
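Concretely, under the standard VAE formulation this objective is the negative evidence lower bound (ELBO), written here in the notation above:

L_NTM(θ, φ) = −E_{q_θ(z|x)}[log p_φ(x | z)] + KL(q_θ(z | x) ‖ p(z))

The first term rewards reconstructing the word count vector x from the topic distribution z through the decoder, while the KL term keeps the approximate posterior q_θ(z | x) close to the prior p(z).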

b) Contrastive Learning Objective

Let X = {x} denote the set of document bag-of-words vectors. Each vector x is associated with a negative sample x⁻ and a positive sample x⁺. We assume a discrete set of latent classes C, such that (x, x⁺) share the same latent class while (x, x⁻) do not. In this work, we choose the semantic dot product to measure the similarity between the prototype x and the drawn samples.

Our goal is to learn a mapping function f_θ: ℝ^V → ℝ^T of the encoder θ which transforms x to the latent distribution z (x⁻ and x⁺ are transformed to z⁻ and z⁺, respectively). A reasonable mapping function must fulfill two qualities: (1) x and x⁺ are mapped onto nearby positions; (2) x and x⁻ are projected distantly. Regarding goal (1) as the main objective and goal (2) as the constraint enforcing the model to learn the relations among dissimilar samples, we specify our weighted-contrastive optimization objective as follows.
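A representative instantiation of this objective, with a coefficient β weighting the negative term (a simplified sketch of the full objective in the paper, using the dot-product similarity above), is:

L_cont(x, x⁺, x⁻) = −log [ exp(z · z⁺) / ( exp(z · z⁺) + β · exp(z · z⁻) ) ]

Minimizing L_cont pulls z toward z⁺ (goal 1), while the β-weighted term in the denominator pushes z away from z⁻ (goal 2).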

c) Word-based Sampling Strategy

For each document with its associated word count vector x ∈ X, we form the tf-idf representation x_tfidf. Then, we feed x to the neural topic model to obtain the latent vector z and the reconstructed document x_recon.

Negative sampling  We select the k tokens N = {n_1, n_2, …, n_k} that have the highest tf-idf scores. We hypothesize that these words contribute most to the topic of the document. By substituting the weights of the chosen tokens in the original input x with the weights of the reconstructed representation x_recon, i.e., x⁻_{n_j} = (x_recon)_{n_j} for j ∈ {1, …, k}, we enforce that the main content of the negative sample x⁻ deviates from that of the original input x.

Positive sampling  Contrary to the negative case, we select the k tokens possessing the lowest tf-idf scores, P = {p_1, p_2, …, p_k}. We obtain a positive sample that bears a theme resembling the original input by assigning the weights of the chosen tokens in x_recon to their counterparts in x⁺, i.e., x⁺_{p_j} = (x_recon)_{p_j} for j ∈ {1, …, k}. This forms a valid positive sampling procedure, since modifying the weights of insignificant tokens retains the salient topics of the source document.

Figure 1: Word-based Sampling Strategy
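To make the procedure concrete, below is a minimal NumPy sketch of the word-based sampling step (an illustrative re-implementation rather than the released code; the function name and array interface are our own):

import numpy as np

def word_based_sampling(x, x_recon, x_tfidf, k):
    # Illustrative sketch of the word-based sampling strategy, not the released code.
    # x       : word count vector of the document (length V)
    # x_recon : reconstruction of x produced by the neural topic model
    # x_tfidf : tf-idf weights of the document (length V)
    # k       : number of tokens to substitute

    # Negative sample: overwrite the k highest-tf-idf (topic-bearing) tokens
    # with their reconstructed weights, so the main content deviates from x.
    top_k = np.argsort(x_tfidf)[-k:]
    x_neg = x.copy()
    x_neg[top_k] = x_recon[top_k]

    # Positive sample: overwrite the k lowest-tf-idf (insignificant) tokens,
    # keeping the salient topical words of the source document intact.
    bottom_k = np.argsort(x_tfidf)[:k]
    x_pos = x.copy()
    x_pos[bottom_k] = x_recon[bottom_k]

    return x_pos, x_neg

The resulting x⁺ and x⁻ are then passed through the encoder to obtain z⁺ and z⁻, which enter the contrastive objective of Section b).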

d) Training objective

Joint objective  We jointly combine the goal of reconstructing the original input and matching the approximate posterior with the true posterior distribution (i.e., the NTM objective above) with the contrastive objective specified in Section b).
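In symbols, with γ denoting a weighting coefficient between the two terms (our shorthand here), the joint objective takes the form:

L = L_NTM + γ · L_cont

where L_NTM is the negative ELBO from Section a) and L_cont is the contrastive objective from Section b).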

How we evaluate

The purposes of our experiments are to:

  • Evaluate the performance of our proposed contrastive framework.
  • Analyze the effects of different sampling strategies.

We find that:

  • Our method achieves the best topic coherence on three benchmark datasets: 20Newsgroups, Wikitext-103, and IMDb.
  • Our model not only generates better topics on average but also on a topic-by-topic basis.
  • The word-based sampling method consistently outperforms other strategies by a large margin, whereas topic-based sampling is vulnerable to drawing insufficient or redundant topics and might harm performance.

Conclusion

In this paper, we propose a novel method to help neural topic models learn more meaningful representations. Approaching the problem from a mathematical perspective, we enforce our model to consider the effects of both positive and negative pairs. To better capture semantic patterns, we introduce a novel sampling strategy that takes inspiration from human behavior in differentiating documents. Experimental results on three common benchmark datasets show that our method outperforms other state-of-the-art neural topic models in terms of topic coherence.


References

[1] Y. Lu, Q. Mei, and C. Zhai, "Investigating task performance of probabilistic topic models: an empirical study of PLSA and LDA," Information Retrieval, vol. 14, no. 2, pp. 178–203, 2011.

[2] S. Subramani, V. Sridhar, and K. Shetty, "A novel approach of neural topic modelling for document clustering," in 2018 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 2169–2173, IEEE, 2018.

[3] L. A. Tuan, D. Shah, and R. Barzilay, "Capturing greater context for question generation," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 9065–9072, 2020.

[4] R. Wang, D. Zhou, and Y. He, "Open event extraction from online text using a generative adversarial network," arXiv preprint arXiv:1908.09246, 2019.

[5] M. Wang and P. Mengoni, "How pandemic spread in news: Text analysis using topic model," arXiv preprint arXiv:2102.04205, 2021.

[6] T. Nguyen, A. T. Luu, T. Lu, and T. Quan, "Enriching and controlling global semantics for text summarization," arXiv preprint arXiv:2109.10616, 2021.

[7] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.

[8] Y. Miao, E. Grefenstette, and P. Blunsom, "Discovering discrete latent topics with neural variational inference," in International Conference on Machine Learning, pp. 2410–2419, PMLR, 2017.

[9] A. Srivastava and C. Sutton, "Autoencoding variational inference for topic models," arXiv preprint arXiv:1703.01488, 2017.

[10] R. Wang, D. Zhou, and Y. He, "ATM: Adversarial-neural topic model," Information Processing & Management, vol. 56, no. 6, p. 102098, 2019.

[11] R. Wang, X. Hu, D. Zhou, Y. He, Y. Xiong, C. Ye, and H. Xu, "Neural topic modeling with bidirectional adversarial training," arXiv preprint arXiv:2004.12331, 2020.

[12] X. Hu, R. Wang, D. Zhou, and Y. Xiong, "Neural topic modeling with cycle-consistent adversarial training," arXiv preprint arXiv:2009.13971, 2020.

[13] F. Nan, R. Ding, R. Nallapati, and B. Xiang, "Topic modeling with Wasserstein autoencoders," arXiv preprint arXiv:1907.12374, 2019.

[14] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pp. 92–100, 1998.

[15] C. Xu, D. Tao, and C. Xu, "A survey on multi-view learning," arXiv preprint arXiv:1304.5634, 2013.

[16] P. Bachman, R. D. Hjelm, and W. Buchwalter, "Learning representations by maximizing mutual information across views," arXiv preprint arXiv:1906.00910, 2019.

[17] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, "A simple framework for contrastive learning of visual representations," in International Conference on Machine Learning, pp. 1597–1607, PMLR, 2020.

[18] Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola, "What makes for good views for contrastive learning?" arXiv preprint arXiv:2005.10243, 2020.

[19] D. Card, C. Tan, and N. A. Smith, "Neural models for documents with metadata," arXiv preprint arXiv:1705.09296, 2017.