Latent Dirichlet Allocation (LDA) is a popular machine-learning technique that identifies latent structures in a corpus of documents. This paper addresses the ongoing concern that formal procedures for determining the optimal LDA configuration do not exist by introducing a set of parametric tests that rely on the assumed multinomial distribution specification underlying the original LDA model. Our methodology defines a set of rigorous statistical procedures that identify and evaluate the optimal topic model. The U.S. Presidential Inaugural Address Corpus is used as a case study to show the numerical results. We find that 92 topics best describe the corpus. We further validate the method through a simulation study confirming the superiority of our approach compared to other standard heuristic metrics like the perplexity index.
A statistical approach for optimal topic model identification
Grossetti, Francesco
2022
Abstract
Latent Dirichlet Allocation (LDA) is a popular machine-learning technique that identifies latent structures in a corpus of documents. This paper addresses the ongoing concern that formal procedures for determining the optimal LDA configuration do not exist by introducing a set of parametric tests that rely on the assumed multinomial distribution specification underlying the original LDA model. Our methodology defines a set of rigorous statistical procedures that identify and evaluate the optimal topic model. The U.S. Presidential Inaugural Address Corpus is used as a case study to show the numerical results. We find that 92 topics best describe the corpus. We further validate the method through a simulation study confirming the superiority of our approach compared to other standard heuristic metrics like the perplexity index.File | Dimensione | Formato | |
---|---|---|---|
OpTop_JMLR.pdf
accesso aperto
Descrizione: article
Tipologia:
Pdf editoriale (Publisher's layout)
Licenza:
Creative commons
Dimensione
2.32 MB
Formato
Adobe PDF
|
2.32 MB | Adobe PDF | Visualizza/Apri |
JMLR_Acceptance_letter.pdf
non disponibili
Descrizione: Acceptance Letter
Tipologia:
Allegato per valutazione Bocconi (Attachment for Bocconi evaluation)
Licenza:
NON PUBBLICO - Accesso privato/ristretto
Dimensione
51.86 kB
Formato
Adobe PDF
|
51.86 kB | Adobe PDF | Visualizza/Apri |
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.