In attempting to grasp topic modeling, I found it helpful to cluster together the readings from Andrew Goldstone and Ted Underwood, Megan R. Brett’s introduction to topic modeling, and reviews/responses from Andrew Perrin and Laura Nelson to the Poetics special issue on topic modeling. My understanding of topic modeling (TM) is that, given a large corpus of texts (huge, like 1000+), TM tools mine the texts by grouping words across the corpus into topics: patterns of words that tend to co-occur (a relationship of similarity). These topics are then examined by the researcher (who must know something about the corpus to be able to interpret the topics found) and presented in a fitting visualization that makes the topic relationships visible.
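To make "patterns of co-occurring words" concrete for myself, here is a minimal sketch in Python. It is not a topic model (tools in this area typically fit a probabilistic model such as LDA); it only counts how often pairs of words show up in the same document, which is the kind of raw signal I understand topic models to build on. The four-sentence corpus, the stopword list, and the function name `cooccurrence_counts` are all made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Toy corpus (hypothetical); a real corpus would hold thousands of documents.
docs = [
    "the whale sank the ship at sea",
    "the ship sailed the sea past a whale",
    "the court heard the trial and the judge ruled",
    "the judge opened the trial in court",
]

# Very small, hand-picked stopword list (an assumption for this example).
STOPWORDS = {"the", "a", "at", "and", "in", "past"}


def cooccurrence_counts(documents):
    """Count, for each unordered word pair, how many documents contain both words."""
    counts = Counter()
    for doc in documents:
        # Unique non-stopwords, sorted so each pair has one canonical order.
        words = sorted({w for w in doc.split() if w not in STOPWORDS})
        counts.update(combinations(words, 2))
    return counts


pair_counts = cooccurrence_counts(docs)

# Show the most frequently co-occurring pairs.
for pair, n in pair_counts.most_common(6):
    print(pair, n)
```

In this toy corpus, the pairs that co-occur in two documents fall into a sea/ship/whale group and a court/judge/trial group. Naming those groups ("seafaring," "law") is still left to a reader who knows the texts, which is the interpretive step the readings keep returning to.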
As I was reading, I noticed my questions seemed to echo questions I have had since I started reading about distant reading, text mining, and visualization a year ago. While I think my technical understanding of these methods is becoming more robust through engagement with work that uses them (and shows how they were used), the how is still obscure to me. I pulled quotes from these pieces (which are in some ways at odds with one another in how they see TM as useful) that are helping me with the how:
Goldstone and Underwood: “The strictly linguistic character of this technique is a limitation as well as a strength: it’s not designed to reveal motivation or conflict” + “This technique can reveal shifts of emphasis that are more gradual and less conscious than the ones we tend to celebrate.”
TM, as advocated by Goldstone and Underwood, is rigorous enough that its results should be considered evidence (a product of research) for making claims and raising questions within a discipline. Their account acknowledges the limitations of this method, in that it loses meta information within a text (does context still work at this scope? in these methods?), but affirms the method as a different way of seeing how knowledge has emerged within a discipline/disciplinary set of texts. The how might not have otherwise been visible.
Brett: “Topic modeling is not necessarily useful as evidence but it makes an excellent tool for discovery.”
Brett views TM as a part of research: making patterns visible to then pursue. I think the difference between this view of how TM is used and that of Goldstone and Underwood is a divide that emerges from differing scopes of text analysis. How this is determined I am still uncertain, other than through varying engagements with the methods; what I do notice is how discovery through these methods is differently valued.
Perrin: “But culture is not just language, language is not just text, and text is not just words. Since these methods actually analyze text (not language and not culture) we need to attend to the processes by which culture becomes language and language becomes text.”
Perrin critiques TM because it cannot account for the context of texts: texts (as a collection of words) read through these methods cannot make certain relationships visible. This doesn’t appear to be the argument that I once thought existed between the values/constraints/affordances of “close” and “distant” reading (time with one text read closely v. time with a group of texts read for patterns; we have read much that complicates this) but an assertion that while these methods may make certain patterns visible, they cannot make others visible. Or, they might give the illusion of patterns that are distorted. Other than making this disclaimer in a methods section, I wonder how else we might account for contexts, especially in large corpora.
Nelson: [topic modeling] “It definitely will not magically help us understand the black box of culture. It’s science, not magic, and any science takes work.”
Nelson takes Perrin to task, arguing that understanding texts is a way to help us understand society (the context that Perrin argues is lost in the texts that TM methods read). Nelson outlines what TM can and cannot do, making the usual disclaimers about understanding the available options, choosing methods that best fit the questions/data, and knowing the assumptions behind each method. What struck me in Nelson’s account (it should be noted that this is from sociology) was the description of how these methods are science.
And How: I’m left wondering about how we see relations/patterns and meaning; this sounds simple, but which relations have our attention makes for great variance in meaning. And while I think this is the point (the affordances and constraints of the methods), how are these steps (texts to topics, topics to relationships, relationships to patterns, patterns to visualizations, visualizations to relationships, relationships to meaning) considered?