Scores

The perplexity score is derived from perplexity, a measure widely used in language modeling to assess the predictive power of a model. It captures how surprising the words of a document are from the model's perspective, and is loosely equivalent to the effective branching factor. Here we compute it on abstracts (for specific authors) that the model has already seen in our CiteSeer training data set. Lower scores imply that the words of the abstract are less surprising to the model (the score is lower bounded by zero).
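As a minimal sketch (not the exact ATM computation), perplexity can be computed from the model's per-word probabilities; the probabilities passed in below are assumed inputs standing in for the author-topic model's predictive word probabilities:

    import math

    def perplexity(word_probs):
        """Perplexity of a document given the model's probability for each word token.

        word_probs: list of p(w_i | model), one entry per word.
        Perplexity = exp(-(1/N) * sum_i log p(w_i | model)).
        Lower values mean the words are less surprising to the model.
        """
        n = len(word_probs)
        log_likelihood = sum(math.log(p) for p in word_probs)
        return math.exp(-log_likelihood / n)

    # Example: a model that assigns probability 0.1 to every word has perplexity 10.
    print(perplexity([0.1] * 50))  # -> 10.0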

The contribution score is the model's estimate of what fraction of a paper was written by a given author. Clearly, it can be applied only to documents written by multiple authors. In each sample of the Gibbs sampler, every word in each training document is assigned to a single author from the observed author set A_d. The contribution score for a document d and an author A is defined as C_s(A) = (|W_A| / |W|) * |A_d| - 1, where |W_A| is the number of words in document d assigned by the Gibbs sampler to author A and |W| is the total number of words in the document. Thus, if all authors contributed equally to the paper, C_s(A) = 0 for every A; if author A did not contribute even a single word, C_s(A) = -1; and if she wrote all the words, C_s(A) = |A_d| - 1. For both scores we use 5 samples of the Gibbs sampler.
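A minimal sketch of the contribution score, assuming per-sample word-to-author assignments are available from the Gibbs sampler; how the 5 samples are combined is an assumption here (the counts are simply pooled over samples), and the function and variable names are illustrative:

    from collections import Counter

    def contribution_scores(samples, authors):
        """Contribution score C_s(A) = (|W_A| / |W|) * |A_d| - 1 for each author A.

        samples: list of Gibbs samples; each sample is a list with one author per
                 word token of the document (e.g. 5 samples, as used for both scores).
        authors: the observed author set A_d of the document.
        """
        n_authors = len(authors)
        totals = Counter()
        n_words = 0
        for assignment in samples:
            totals.update(assignment)   # accumulate |W_A| counts over samples
            n_words += len(assignment)  # accumulate |W| over samples
        return {a: totals[a] / n_words * n_authors - 1 for a in authors}

    # Example: a two-author document, 3 Gibbs samples of 4 words each.
    samples = [["A", "A", "A", "B"], ["A", "A", "B", "B"], ["A", "A", "A", "B"]]
    print(contribution_scores(samples, ["A", "B"]))
    # -> {'A': 0.333..., 'B': -0.333...}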
