Specific Correspondence Topic Model
Code for the paper:
"Going Beyond Corr-LDA for Detecting Specific Comments on News & Blogs". In ACM international conference on Web Search and Data Mining (WSDM), 2014.
This implementation contains the following 3 models:
Latent Dirichlet Allocation (LDA)
Basic LDA model with collapsed Gibbs sampling. As a bonus, it supports sparse topics: it can learn sparse topic distributions whose sets of top words are more diverse (see the paper for details).
Correspondence LDA (CorrLDA)
CorrLDA model for articles and comments (or any two paired sets of documents). The latent topic space is shared between the articles and comments. As an improvement over the vanilla model, this implementation also includes an irrelevant topic for comments (see the paper) and supports sparse topics. Inference uses collapsed Gibbs sampling.
Specific Correspondence Topic Model (SCTM)
This is the model proposed in the paper for modeling specific correspondence between articles and comments. It includes the irrelevant topic and sparse topics, and implements multiple topic vectors and specific correspondence (see the paper for details).
Using the Code
Download all the code files and go into the folder called "Release", which contains the Makefile. Compile the code with:
make clean; make
The main executable is called "sctm". Usage:
./sctm <1.article-file> <2.comments-file> <3.output-dir> <4.topics> <5.model> <6.train/test>
- article-file is the location of the file containing the article contents
- comments-file is the location of the file containing the comment contents (ignored for lda model)
- output-dir is the location and name of the directory to write output
- topics is the number of topics (K)
- model is the model to train, one of: lda, corrlda, sctm
- train/test (optional): pass 1 to run on test data; in this case output-dir should point to the location of the trained model
To print the topics, use the provided Python script:
python print_topics.py <beta> <vocab> <?topn>
- beta is the location of the output topic distribution file (named beta) from the sctm code
- vocab is the vocabulary file containing the word mapping, where the i-th line contains word i
- topn (optional) is the number of top words to print per topic
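As an illustration of what printing the top words of a topic involves, here is a minimal sketch. It is not the code of print_topics.py: the function name top_words is hypothetical, and it assumes one topic is already available in memory as a list of per-word weights indexed by vocabulary id (the on-disk layout of the beta file is not described here, so no file parsing is attempted).

```python
def top_words(topic_weights, vocab, topn=10):
    """Return the topn words with the largest weight in one topic.

    topic_weights: list of floats, one per vocabulary id (assumed layout).
    vocab: list of words, where vocab[i] is the word with id i.
    """
    # Rank vocabulary ids by weight, highest first.
    ranked = sorted(range(len(topic_weights)),
                    key=lambda i: topic_weights[i],
                    reverse=True)
    return [vocab[i] for i in ranked[:topn]]


# Toy data, made up for illustration only.
vocab = ["the", "cat", "sat", "dog"]
topic = [0.10, 0.50, 0.05, 0.35]
print(top_words(topic, vocab, topn=2))  # ['cat', 'dog']
```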
A sample pre-processed dataset of 501 documents with some comments is provided in the folder "input". To run a demo on this dataset with 100 topics, use the command:
./sctm ../input/abagf.AT.txt ../input/cbagf.AT.txt ../output 100 sctm
Then print the topics using the provided Python script (the data paths below are relative to the top-level folder, one level above "Release"):
python print_topics.py output/beta input/words.AT.txt
Input Data Format
All words should be converted to integer vocabulary ids starting from 0.
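A minimal sketch of one way to build such a mapping is shown below. The helper names build_vocab and vocab_lines are hypothetical and not part of this code release; ids are assigned in order of first appearance, and the vocabulary listing assumes that the first line of the vocabulary file corresponds to id 0.

```python
def build_vocab(tokenized_docs):
    """Assign each distinct word an integer id, starting from 0.

    tokenized_docs: list of documents, each a list of sentences,
    each sentence a list of word strings.
    """
    word_to_id = {}
    for doc in tokenized_docs:
        for sentence in doc:
            for word in sentence:
                if word not in word_to_id:
                    # Next unused id; ids follow order of first appearance.
                    word_to_id[word] = len(word_to_id)
    return word_to_id


def vocab_lines(word_to_id):
    """Return the words ordered by id, so that entry i is word i."""
    return [w for w, _ in sorted(word_to_id.items(), key=lambda kv: kv[1])]


# Toy data, made up for illustration only.
docs = [[["the", "cat", "sat"], ["the", "dog"]]]
vocab = build_vocab(docs)
print(vocab)               # {'the': 0, 'cat': 1, 'sat': 2, 'dog': 3}
print(vocab_lines(vocab))  # ['the', 'cat', 'sat', 'dog']
```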
Article file: the first line contains the number of documents D. Each document begins with a line containing its number of sentences S. Each of the following S lines begins with the number of words in the sentence (N), followed by the word ids. Example sentence lines:
N1 w1 w2 .... wN1
N2 w1 w2 .... wN2
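The article layout described above can be produced with a short helper. This is a sketch under assumptions: the function name format_articles is hypothetical, and single spaces between numbers and one line per sentence are assumed from the example lines.

```python
def format_articles(articles):
    """Serialize articles into the article-file layout.

    articles: list of documents; each document is a list of sentences;
    each sentence is a list of integer word ids.
    """
    lines = [str(len(articles))]            # first line: D
    for sentences in articles:
        lines.append(str(len(sentences)))   # per document: S
        for word_ids in sentences:
            # per sentence: N followed by the N word ids
            fields = [len(word_ids)] + list(word_ids)
            lines.append(" ".join(str(x) for x in fields))
    return "\n".join(lines) + "\n"


# Two toy documents: the first has two sentences, the second has one.
articles = [[[0, 1, 2], [0, 3]], [[4, 5]]]
print(format_articles(articles), end="")
```

For the toy input this yields the lines 2, 2, "3 0 1 2", "2 0 3", 1, "2 4 5", in that order.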
The comments file is similar. The first line is D. For each document, the first line is the number of comments C, followed by C lines, each beginning with the number of comment words N followed by the word ids of the comment. Note that a document with no comments must still be present in the file, with 0 (for C) and no following lines.
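To check a comments file against this layout, a small reader like the following can help. It is a sketch, not part of the release: the name parse_comments is hypothetical, and it assumes the numbers are separated by whitespace. Because every count is explicit, the reader only needs the token stream and handles documents with 0 comments naturally.

```python
def parse_comments(text):
    """Parse comments-file text into a list of documents.

    Returns a list with one entry per document; each entry is a list of
    comments, and each comment is a list of integer word ids.
    """
    tokens = iter(text.split())
    num_docs = int(next(tokens))              # D
    docs = []
    for _ in range(num_docs):
        num_comments = int(next(tokens))      # C (may be 0)
        comments = []
        for _ in range(num_comments):
            n = int(next(tokens))             # N words in this comment
            comments.append([int(next(tokens)) for _ in range(n)])
        docs.append(comments)
    return docs


# Three toy documents; the second one has no comments.
sample = "3\n2\n3 0 3 3\n1 1\n0\n1\n2 4 5\n"
print(parse_comments(sample))  # [[[0, 3, 3], [1]], [], [[4, 5]]]
```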
The model produces the following four outputs:
Topic distributions: The topics learned by the model. The last topic is always the irrelevant topic, so if the model is run with K topics, K+1 topics are output. The last topic should be ignored for the LDA model. File "beta".
Comment topic distribution: Topic distribution (over K+1 topics) of each comment. File "y_dist.txt".
Article topic distribution: Overall topic distribution (over K topics) of each article. File "z_dist.txt".
Sentence selection probability: The probability of correspondence of a comment to each article sentence (see paper). File "xi_prob.txt".
The dot products of the article and comment topic distributions are used for the applications described in the paper.
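A minimal sketch of such a dot product is given below. It is an illustration under assumptions, not the paper's exact procedure: the function name correspondence_score is hypothetical, the two distributions are assumed to be already loaded as lists of floats, and the last (irrelevant) topic of the comment distribution is dropped so that both vectors cover the same K topics.

```python
def correspondence_score(article_dist, comment_dist):
    """Dot product of an article and a comment topic distribution.

    article_dist: K probabilities (one per topic).
    comment_dist: K+1 probabilities; the last entry is the irrelevant topic.
    Assumption: the irrelevant topic is left out of the product.
    """
    k = len(article_dist)
    if len(comment_dist) != k + 1:
        raise ValueError("expected K+1 comment topics for K article topics")
    return sum(a * c for a, c in zip(article_dist, comment_dist[:k]))


# Toy distributions with K = 2, made up for illustration only.
article = [0.5, 0.5]
comment = [0.4, 0.1, 0.5]
print(correspondence_score(article, comment))
```

A comment whose mass sits mostly on the irrelevant topic gets a low score under this choice, which matches the intent of treating that topic as unrelated to the article.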
Direct any queries to "trapitbansal at gmail dot com".