We present here the author-topic and topic-word distributions learned from the CiteSeer data. The Figure below illustrates eight different topics (out of 300). The top four topics (#205, #209, #289, and #10) have direct relevance to data mining: data mining itself, probabilistic learning, information retrieval, and database querying and indexing. There are quite a few other related topics, in areas such as classification and neural networks; we show only four of those topics here (the full list is available in Excel format).

Each table shows the 10 words that are most likely to be produced if that topic is activated, and the 10 authors who are most likely to have produced a word that is known to have come from that topic. The words associated with each topic are quite intuitive and, indeed, quite precise in the sense of conveying a semantic summary of a particular field of research. The authors associated with each topic are also quite representative. Note that the top 10 authors associated with a topic by the model are not necessarily the best-known authors in that area, but rather the authors who tend to produce the most words for that topic (in the CiteSeer abstracts).

At the bottom of the Figure, topics #163, #87, and #20 show examples of three other quite specific and precise topics, on string matching, human-computer interaction, and astronomy respectively. There are many such topics spanning the full range of research areas encompassed by documents in CiteSeer. Not all of the topics are as research-specific as those illustrated in the tables discussed above. A fraction, perhaps 10 to 20%, is devoted to "non-research-specific" topics: the "glue" that makes up research papers, including general terminology for describing methods and experiments, funding acknowledgments, and parts of addresses (which inadvertently crept into the abstracts). Topic #273 provides an example of one of these non-specific topics.