Specific Correspondence Topic Model

Code for the paper:
"Going Beyond Corr-LDA for Detecting Specific Comments on News & Blogs". In ACM international conference on Web Search and Data Mining (WSDM), 2014.

This implementation contains the following 3 models:

Latent Dirichlet Allocation (LDA)
Basic LDA model with collapsed Gibbs sampling. As a bonus, it includes a sparse-topics feature, which learns sparse topic distributions whose sets of top words are more diverse (see paper for details).
Correspondence LDA (CorrLDA)
CorrLDA model for articles and comments (or any two paired sets of documents). The latent topic space is shared between the articles and comments. As an improvement over the vanilla model, this implementation also includes an irrelevant topic for comments (see paper) and the sparse-topics feature. Inference uses collapsed Gibbs sampling.
Specific Correspondence Topic Model (SCTM)
This is the model proposed in the paper for modeling specific correspondence between articles and comments. It includes the irrelevant-topic and sparse-topics features, and implements multiple topic vectors and specific correspondence (see paper for details).

Using the Code

Download all the code files and go into the folder called "Release", which contains the Makefile. Compile the code with: make clean; make. The main executable is called "sctm".

Usage : ./sctm <1.article-file> <2.comments-file> <3.output-dir> <4.topics> <5.model> <6.train/test>
where

article-file is the location of the file containing the article contents
comments-file is the location of the file containing the comment contents (ignored for lda model)
output-dir is the location and name of the directory to write output
topics is the number of topics (K)
model is the model to train, one of: lda, corrlda, sctm
train/test (optional): pass 1 to run on test data (in this case output-dir should point to the location of the trained model)

To print the topics use the python script provided:
python print_topics.py <beta> <vocab> <?topn>
where

beta is the location of the output topic distribution file (named beta) from the sctm code
vocab is the vocabulary file containing the word mapping, where the i-th line contains word i
topn (optional) number of top words to print per topic

There is a sample pre-processed dataset of 501 documents and some comments provided in the folder "input". To run a demo on this dataset with 100 topics, use the command:
./sctm ../input/abagf.AT.txt ../input/cbagf.AT.txt ../output 100 sctm
Then print the topics using the provided python script (the paths here are relative to the directory containing "input" and "output"):
python print_topics.py output/beta input/words.AT.txt

Input Data Format

All words should be converted to integer vocabulary ids starting from 0.
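A minimal sketch of this conversion, assuming the text is already tokenized into lists of words. The helper names `build_vocab`, `encode`, and `vocab_lines` are hypothetical and not part of the released code. Writing the result of `vocab_lines` one word per line would give a vocabulary file in which the i-th line contains word i.

```python
def build_vocab(tokenized_docs):
    """Assign ids 0, 1, 2, ... to words in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for word in doc:
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Map a list of tokens to their integer vocabulary ids."""
    return [vocab[w] for w in tokens]


def vocab_lines(vocab):
    """Return the words ordered by id, so that entry i is word i."""
    return [w for w, _ in sorted(vocab.items(), key=lambda kv: kv[1])]


# Toy data, made up for illustration only.
docs = [["markets", "fell", "today"], ["markets", "rose"]]
vocab = build_vocab(docs)
ids = [encode(d, vocab) for d in docs]
```

Assigning ids in order of first appearance is only one possible choice; any consistent mapping starting from 0 should work as long as the vocabulary file uses the same order.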

Article Format
The first line contains the number of documents D. Each document begins with a line containing its number of sentences S. Each of the following S lines begins with the number of words in the sentence (N), followed by the word ids. Example:
2
S₁
N₁ w₁ w₂ .... w_N₁
S₂
N₂ w₁ w₂ .... w_N₂
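The layout above could be produced with a few lines of Python. This is a sketch that follows the format as described here and assumes the fields on a line are separated by single spaces; `format_articles` is a hypothetical helper, not part of the released code, and the word ids are made up.

```python
def format_articles(articles):
    """Serialize articles to the article-file layout.

    `articles` is a list of documents; each document is a list of
    sentences; each sentence is a list of integer word ids.
    """
    lines = [str(len(articles))]               # D: number of documents
    for sentences in articles:
        lines.append(str(len(sentences)))      # S: sentences in this document
        for sent in sentences:
            # N followed by the N word ids of the sentence
            lines.append(" ".join(map(str, [len(sent)] + sent)))
    return "\n".join(lines) + "\n"


articles = [
    [[0, 1, 2], [3, 4]],   # document 1: two sentences
    [[5, 1]],              # document 2: one sentence
]
text = format_articles(articles)
```

Writing `text` to a file would give something that could be passed as the article-file argument.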
Comments Format
Similar to the above. The first line is D. For each document, the first line is the number of comments C, followed by C lines, each beginning with the number of comment words N followed by the comment's word ids. Note that a document with no comments must still be present in the file, as a line containing 0 (for C) with no following lines.
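As an illustration, the sketch below serializes comments in this layout, including a document with no comments. It assumes space-separated fields; `format_comments` is a hypothetical helper, not part of the released code, and the word ids are made up.

```python
def format_comments(comments_per_doc):
    """Serialize comments to the comments-file layout.

    `comments_per_doc` has one entry per document, in the same order as
    the article file; each entry is a list of comments, and each comment
    is a list of integer word ids.
    """
    lines = [str(len(comments_per_doc))]       # D: number of documents
    for comments in comments_per_doc:
        lines.append(str(len(comments)))       # C: 0 if there are no comments
        for comment in comments:
            # N followed by the N word ids of the comment
            lines.append(" ".join(map(str, [len(comment)] + comment)))
    return "\n".join(lines) + "\n"


comments_per_doc = [
    [[7, 8], [9]],   # document 1: two comments
    [],              # document 2: no comments, still written as "0"
]
text = format_comments(comments_per_doc)
```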

Output

The model produces the following four outputs:

Topic distributions: The topics learned by the model. The last topic is always the irrelevant topic, so if the model is run with K topics, K+1 topics are output. The last topic should be ignored for the LDA model. File "beta".
Comment topic distribution: Topic distribution (over K+1 topics) of each comment. File "y_dist.txt".
Article topic distribution: Overall topic distribution (over K topics) of each article. File "z_dist.txt".
Sentence selection probability: The probability of correspondence of a comment to each article sentence (see paper). File "xi_prob.txt".

The dot products of the article and comment topic distributions are used for the applications described in the paper.
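A rough sketch of such a dot product is given below. The on-disk layout of "y_dist.txt" and "z_dist.txt" is not described here, so the sketch works on in-memory lists only. It also assumes that the first K entries of a comment's distribution line up with the article's K topics and that the extra last entry is the irrelevant topic, which is left out of the product; `topic_dot` is a hypothetical helper, and the numbers are made up.

```python
def topic_dot(article_dist, comment_dist):
    """Dot product of an article's K-topic distribution with the first K
    entries of a comment's (K+1)-topic distribution.

    Assumption: the last comment entry is the irrelevant topic and is
    skipped, since the article distribution has no matching entry.
    """
    k = len(article_dist)
    if len(comment_dist) != k + 1:
        raise ValueError("expected K+1 comment entries for K article entries")
    return sum(a * c for a, c in zip(article_dist, comment_dist[:k]))


article = [0.5, 0.25, 0.25]          # K = 3 topics
comment = [0.5, 0.0, 0.25, 0.25]     # K + 1 = 4 topics, last is irrelevant
score = topic_dot(article, comment)
```

A higher score would indicate that the comment and the article put weight on the same topics; how the scores are used in each application is described in the paper.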

Queries/Help

Direct any queries to "trapitbansal at gmail dot com".