Latent Dirichlet Allocation (LDA) is a popular machine-learning technique that identifies latent structures in a corpus of documents. This paper addresses the ongoing concern that formal procedures for determining the optimal LDA configuration do not exist by introducing a set of parametric tests that rely on the assumed multinomial distribution specification underlying the original LDA model. Our methodology defines a set of rigorous statistical procedures that identify and evaluate the optimal topic model. The U.S. Presidential Inaugural Address Corpus is used as a case study to show the numerical results. We find that 92 topics best describe the corpus. We further validate the method through a simulation study confirming the superiority of our approach compared to other standard heuristic metrics like the perplexity index.

A statistical approach for optimal topic model identification

Grossetti, Francesco
2022

Abstract

Latent Dirichlet Allocation (LDA) is a popular machine-learning technique that identifies latent structures in a corpus of documents. This paper addresses the ongoing concern that formal procedures for determining the optimal LDA configuration do not exist by introducing a set of parametric tests that rely on the assumed multinomial distribution specification underlying the original LDA model. Our methodology defines a set of rigorous statistical procedures that identify and evaluate the optimal topic model. The U.S. Presidential Inaugural Address Corpus is used as a case study to show the numerical results. We find that 92 topics best describe the corpus. We further validate the method through a simulation study confirming the superiority of our approach compared to other standard heuristic metrics like the perplexity index.
2022
2022
Lewis, Craig M.; Grossetti, Francesco
File in questo prodotto:
File Dimensione Formato  
OpTop_JMLR.pdf

accesso aperto

Descrizione: article
Tipologia: Pdf editoriale (Publisher's layout)
Licenza: Creative commons
Dimensione 2.32 MB
Formato Adobe PDF
2.32 MB Adobe PDF Visualizza/Apri
JMLR_Acceptance_letter.pdf

non disponibili

Descrizione: Acceptance Letter
Tipologia: Allegato per valutazione Bocconi (Attachment for Bocconi evaluation)
Licenza: NON PUBBLICO - Accesso privato/ristretto
Dimensione 51.86 kB
Formato Adobe PDF
51.86 kB Adobe PDF   Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11565/4046853
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 1
  • ???jsp.display-item.citation.isi??? 1
social impact