We present here the author-topic and topic-word distributions learned from the CiteSeer data. The Figure below illustrates eight different topics (out of 300). The top four topics (#205, #209, #289, and #10) have direct relevance to data mining: data mining itself, probabilistic learning, information retrieval, and database querying and indexing. There are quite a few other related topics, in areas such as classification and neural networks; we show only four of those topics here (the full list is available in Excel format).

Each table shows the 10 words that are most likely to be produced if that topic is activated, and the 10 authors who are most likely to have produced a word that is known to have come from that topic. The words associated with each topic are quite intuitive and, indeed, quite precise in the sense of conveying a semantic summary of a particular field of research. The authors associated with each topic are also quite representative. Note that the top 10 authors associated with a topic by the model are not necessarily the best-known authors in that area, but rather the authors who tend to produce the most words for that topic (in the CiteSeer abstracts).

At the bottom of the Figure, topics #163, #87, and #20 show examples of three other quite specific and precise topics, on string matching, human-computer interaction, and astronomy respectively. There are many such topics spanning the full range of research areas encompassed by documents in CiteSeer. Not all of the topics are as research-specific as those illustrated in the tables discussed above. A fraction, perhaps 10 to 20%, is devoted to "non-research-specific" topics: the "glue" that makes up research papers, including general terminology for describing methods and experiments, funding acknowledgments, and parts of addresses (which inadvertently crept into the abstracts). Topic #273 provides an example of one of these non-specific topics.