

Specific Correspondence Topic Model



Code for the paper:
"Going Beyond Corr-LDA for Detecting Specific Comments on News & Blogs". In ACM international conference on Web Search and Data Mining (WSDM), 2014.

This implementation contains the following 3 models:

  1. Latent Dirichlet Allocation (LDA)
    Basic LDA model with collapsed Gibbs sampling. An additional feature is sparse topics, which learns sparse topic distributions whose sets of top words are more diverse (see paper for details).

  2. Correspondence LDA (CorrLDA)
    CorrLDA model for articles and comments (or any two paired sets of documents). The latent topic space is shared between the articles and comments. As an improvement over the vanilla model, it also includes an irrelevant topic for comments (see paper) and the sparse-topics feature. Inference is by collapsed Gibbs sampling.

  3. Specific Correspondence Topic Model (SCTM)
    This is the model proposed in the paper for modeling specific correspondence between articles and comments. It includes the irrelevant-topic and sparse-topics features, and implements multiple topic vectors and specific correspondence (see paper for details).
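All three models above are trained with collapsed Gibbs sampling. As a point of reference, here is a minimal sketch of a collapsed Gibbs sampler for plain LDA in Python. The variable names, hyperparameter defaults, and corpus format are assumptions for illustration; the C++ code in "Release" is the actual implementation (and adds the sparse-topics and irrelevant-topic machinery described above).

```python
import random

def lda_gibbs(docs, K, V, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).

    docs: list of documents, each a list of integer word ids in [0, V).
    Returns phi, the K x V topic-word distributions.
    """
    rng = random.Random(seed)
    ndk = [[0] * K for _ in docs]          # doc-topic counts
    nkw = [[0] * V for _ in range(K)]      # topic-word counts
    nk = [0] * K                           # tokens per topic
    z = []                                 # topic assignment per token
    # Random initialization of topic assignments.
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(K)
            zd.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove current assignment from the counts.
                k = z[d][i]
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # Full conditional p(z_i = t | z_-i, w), up to a constant.
                probs = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                         for t in range(K)]
                r = rng.random() * sum(probs)
                k = K - 1
                for t, p in enumerate(probs):
                    r -= p
                    if r <= 0:
                        k = t
                        break
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    # Smoothed topic-word distributions (the "beta" matrix the model writes out).
    return [[(nkw[k][w] + beta) / (nk[k] + V * beta) for w in range(V)]
            for k in range(K)]
```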

Using the Code

Download all the code files and go into the folder called "Release", which contains the Makefile. Compile the code with: make clean; make. The main executable is called "sctm".

Usage : ./sctm <1.article-file> <2.comments-file> <3.output-dir> <4.topics> <5.model> <6.train/test>

To print the topics use the python script provided:
python print_topics.py <beta> <vocab> <?topn>

There is a sample pre-processed dataset of 501 documents and some comments provided in the folder "input". To run a demo on this dataset with 100 topics, use the command:
./sctm ../input/abagf.AT.txt ../input/cbagf.AT.txt ../output 100 sctm
Then print the topics using the provided python script:
python print_topics.py ../output/beta ../input/words.AT.txt

Input Data Format

All words should be converted to integer vocabulary ids starting from 0.
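A minimal sketch of building such a 0-based vocabulary from tokenized documents. The function name and the in-memory representation are illustrative assumptions; the exact on-disk format expected by sctm is shown by the sample files in "input".

```python
def build_vocab(docs):
    """Map each distinct word to an integer id starting from 0.

    docs: list of documents, each a list of word strings.
    Returns (vocab, encoded): the word-to-id map and the documents
    rewritten as lists of integer ids.
    """
    vocab = {}
    encoded = []
    for doc in docs:
        ids = []
        for w in doc:
            if w not in vocab:
                vocab[w] = len(vocab)  # ids are assigned 0, 1, 2, ...
            ids.append(vocab[w])
        encoded.append(ids)
    return vocab, encoded
```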


Following are the four outputs from the model:

The dot products of the article and comment topic distributions are used for the applications described in the paper.
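As a sketch, scoring a comment against its article by this dot product looks like the following. The variable names are illustrative; the topic-proportion vectors come from the model's output files.

```python
def topic_dot(theta_article, theta_comment):
    """Dot product of an article's and a comment's topic proportions.

    Higher values indicate the comment concentrates probability mass on
    the same topics as the article.
    """
    return sum(a * c for a, c in zip(theta_article, theta_comment))
```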


Direct any queries to "trapitbansal at gmail dot com".