
 

*    Assigning Topics and Authors to New Documents

Our strategy in this application is to apply an efficient Monte Carlo algorithm that runs only on the word tokens in the new document, leading quickly to likely assignments of words to authors and topics. We start by assigning words randomly to co-authors and topics. We then iteratively resample the assignments of words to topics and of words to authors, conditioned on the current assignments of all other words.
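The sampling loop above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the author-topic distributions `theta` (P(topic | author)) and topic-word distributions `phi` (P(word | topic)) come from a previously trained model, and for simplicity it resamples each token's (author, topic) pair jointly from its conditional distribution.

```python
import random

def sample_assignments(tokens, coauthors, theta, phi, n_iters=200, seed=0):
    """Gibbs-sample author and topic assignments for a new document.

    tokens    : list of word ids in the new document
    coauthors : list of author ids credited on the document
    theta     : theta[a][t] = P(topic t | author a), from a trained model
    phi       : phi[t][w]  = P(word w | topic t),  from a trained model
    """
    rng = random.Random(seed)
    n_topics = len(phi)
    # Start by assigning each token randomly to a co-author and a topic.
    authors = [rng.choice(coauthors) for _ in tokens]
    topics = [rng.randrange(n_topics) for _ in tokens]
    for _ in range(n_iters):
        for i, w in enumerate(tokens):
            # Resample (author, topic) for token i:
            # P(x = a, z = t | w) is proportional to P(t | a) * P(w | t).
            pairs, weights = [], []
            for a in coauthors:
                for t in range(n_topics):
                    pairs.append((a, t))
                    weights.append(theta[a][t] * phi[t][w])
            authors[i], topics[i] = rng.choices(pairs, weights=weights)[0]
    return authors, topics
```

A single draw is shown per token; in practice one would average over many samples (after burn-in) to estimate the posterior assignment probabilities.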

 

Abstracts from two authors, B_Scholkopf and A_Darwiche, were combined into a single ``pseudo-abstract'' and the document was treated as if both had written it. These two authors work in different but not entirely unrelated sub-areas of computer science: Scholkopf in machine learning and Darwiche in probabilistic reasoning. The document is then parsed by the model, i.e., words are assigned to these authors. We would hope that the author-topic model, conditioned now on these two authors, can separate the combined abstract into its component parts.

 

The linked figure shows the results after the model has classified each word according to its most likely author. Note that the model sees only a bag of words and is unaware of the word order that we see in the figure. For readers viewing the figure in color, the redder a word, the more likely it is to have been generated (according to the model) by Scholkopf (and the bluer, by Darwiche). For readers viewing the figure in black and white, the superscript 1 indicates words the model assigns to Scholkopf, and the superscript 2 words it assigns to Darwiche. The results show that all of the significant content words (such as kernel, support, vector, diagnoses, directed, graph) are classified correctly. As we might expect, most of the ``errors'' are words (such as ``based'' or ``criterion'') that are not specific to either author's area of research. Were we to use word order in the classification, and classify (for example) whole sentences, the accuracy would increase further. As it is, the model correctly classifies 69\% of Scholkopf's words and 72\% of Darwiche's.
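The per-word classification rule can be sketched as below. This is an illustrative simplification rather than the paper's exact procedure (which averages over Gibbs samples): here a word is attributed to the co-author maximizing P(a | w), obtained by marginalizing over topics with the same assumed `theta` and `phi` tables as before.

```python
def most_likely_author(word, coauthors, theta, phi):
    """Attribute one word to the co-author most likely to have generated it:
    P(a | w) is proportional to sum over topics t of P(t | a) * P(w | t),
    assuming a uniform prior over the document's co-authors."""
    def score(a):
        return sum(theta[a][t] * phi[t][word] for t in range(len(phi)))
    return max(coauthors, key=score)

def label_document(tokens, coauthors, theta, phi):
    """Label every token in a document with its most likely co-author."""
    return [most_likely_author(w, coauthors, theta, phi) for w in tokens]
```

Applied to the pseudo-abstract, this produces exactly the kind of red/blue (superscript 1/2) labeling shown in the figure: topic-specific content words get a clear winner, while generic words fall near the decision boundary.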
