Microbiome data have become relatively easy to obtain in recent years, and various methods for analyzing them are being proposed. Latent Dirichlet allocation (LDA) models, which are frequently used to extract latent topics from words in documents, have also been proposed for extracting information on microbial communities from microbiome data. To extract microbiome topics associated with a subject's attributes, LDA models that utilize supervisory information can be applied, including LDA with Dirichlet multinomial regression (DMR topic model) and the supervised topic model (SLDA). Further, Bayesian nonparametric models are often used to automatically decide the number of latent classes in a latent variable model, and LDA can be extended to a Bayesian nonparametric model using the hierarchical Dirichlet process. Although a Bayesian nonparametric DMR topic model has been previously proposed, it uses a normalized gamma process to generate the topic distribution, and it is unknown whether the number of topics can be automatically decided from the data. Using a stick-breaking process to generate the topic distribution is expected to restrict the number of topics with relatively large proportions to a smaller value. Therefore, we propose a Bayesian nonparametric DMR topic model using a stick-breaking process and compare it to existing models using two sets of real microbiome data. The results showed that the proposed model could extract topics that were more strongly associated with the attributes of a subject than those extracted by existing methods, and that it could automatically decide the number of topics from the data.
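The intuition behind the stick-breaking construction can be illustrated with a minimal sketch. This is a generic truncated stick-breaking draw, not the inference procedure of the proposed model; the function name, the concentration parameter `alpha`, the truncation level, and the 0.01 threshold are all assumptions chosen for illustration.

```python
import numpy as np

def stick_breaking_weights(alpha, truncation, rng):
    """Draw topic proportions from a truncated stick-breaking process.

    alpha and truncation are illustrative assumptions, not values
    taken from the paper.
    """
    # Fraction of the remaining stick broken off at each step.
    betas = rng.beta(1.0, alpha, size=truncation)
    betas[-1] = 1.0  # last piece takes the rest, so the weights sum to 1
    # Length of stick still remaining before each break.
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    return betas * remaining

rng = np.random.default_rng(0)
weights = stick_breaking_weights(alpha=1.0, truncation=50, rng=rng)

# Count topics whose proportion exceeds an (arbitrary) 1% threshold.
n_large = int((weights > 0.01).sum())
print(n_large, "of", weights.size, "topics have proportion > 0.01")
```

Because each break removes a fraction of what is left, later weights tend to shrink quickly, so only a few topics usually receive a large proportion even when the truncation level is generous.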
All Science Journal Classification (ASJC) codes
- Biochemistry, Genetics and Molecular Biology (miscellaneous)
- Computer Science Applications