reading/noting: distant reading, data mining, data visualization

Reading: “Graphs” from Graphs, Maps, Trees: Abstract Models for Literary History by Franco Moretti (New Left Review 28 (2003): 67-93); “Data Mining: A Hybrid Methodology for Complex and Dynamic Research” by Susan Lang and Craig Baehr (College Composition and Communication 64:1 (2012): 172-194); and “Grasping Rhetoric and Composition by Its Long Tail: What Graphs Can Tell Us about the Field’s Changing Shape” by Derek Mueller (College Composition and Communication 64:1 (2012): 195-223)

Moretti:

quoting Polish philosopher and historian Krzysztof Pomian: the old historical paradigm “directed the gaze of the historian towards extraordinary events…historians resembled collectors: both gathered only rare and curious objects, disregarding whatever looked banal, everyday, normal” (67)

what would happen if focus were shifted from exceptional texts to the “mass of facts”? (67) – “what a minimal fraction of the literary field we all work on…” Moretti proposes a move from close reading to what he conceptualizes as distant reading.

“a field this large cannot be understood by stitching together separate bits of knowledge about individual cases, because it isn’t a sum of individual cases: it’s a collective system, that should be grasped as such, as a whole” (68) Poses the question of “knowing” the field on the basis of its texts – there are far too many to ever read closely, and even when they are read at a close distance, how can we gain perspective on their relationships within the larger history/discipline?

the “novel is an unreliable commodity” (70) – I believe he is referring to its form as in flux based on its relations to political/social/etc. status (makes me wonder – rhetorical situation (Bitzer) or rhetorical ecology (Rice)?)

“But graphs are not models; they are not simplified versions of a theoretical structure [in the ways maps and (especially) evolutionary trees will be in the next two articles], Quantitative research provides a type of data which is ideally independent of interpretations, I said earlier, and that is of course also its limit: it provides data, not interpretation” (72) – a data handle, a way of beginning to see

multiplicity of time: patterns in event, cycle, longue duree

  • event: circumscribed domain of the event and the individual case
  • longue duree: very long span of nearly unchanging structures
  • cycle: temporary structures within historical flow
  • so, all flow and no structure (event), temporary structures for some time (cycle), all structure and no flow (longue duree) (76)

cluster (80) – appearance and disappearance of genres “punctuated by brief bursts of invention” – does he imagine the data graphing/mapping/arranging/assembling in clusters? Are there no outliers? No strange texts? No anomalies?

“What graphs make us see, in other words, are the constraints and the inertia of the literary field – the limits of the imaginable” (82). When we see, we also don’t see – this is potential.

“I began this article by saying that quantitative data are useful because they are independent of interpretation; then, that they are interesting because they demand an interpretation; and now, most radically, we see them challenge existing interpretations, and ask for a theory, not so much of ‘the’ novel, but a whole family of novelistic forms. A theory-of diversity” (91). Does Moretti view these texts as heterogeneous? Heterogeneous as a collective whole vs. homogenous as a whole? Or does he see more grouping/categorization through these graphs?

Does Moretti see the novel in flux due to rhetorical situations – “an uncertain relation to politics and social movements” (73)? – “The causal mechanism must thus be external to the genres, and common to all: like a sudden, total change of their ecosystem. Which is to say: a change of their audience. Books survive if they are read and disappear if they aren’t: and when an entire generic system vanishes at once, the likeliest explanation is that its readers vanished at once” (82). What does the difference between a rhetorical situation and a rhetorical ecology look like? Do graphs fit? Or are they not distributed enough? Are graphs too singular from the network? Do they need to be connected? Are they a step toward composing networks? And what of metanoia? kairos? chronos? in these graphs, rhetorical situations, rhetorical ecologies, and actor-networks?

Lang and Baehr

opens with a quote from John Naisbitt: “We are drowning in information but starved for knowledge” – taking stock of/inventorying our field

as a field, [composition] “we’ve often relied on lore, anecdotal evidence, or studies relying on small sample sizes to defend our assertions. Data mining won’t provide an instant or simple answer, since we still need to determine what data we have available, examine the data to see if any trends emerge, and then, most likely, ask more questions and turn again to the data, which offer us a new set of tools and strategies for research” (174) – calling for data mining to justify what composition does, and consequently, doesn’t do

compositionists “have rejected quantification and any attempts to reach Truth about our business by scientific means, just as we long ago rejected ‘truth’ as derivable by deduction from unquestioned first principles. . . .” (174) – we won’t reach “truth” by critique, but we won’t reach “knowledge” (of) by neglecting collecting/counting/sifting materials that constitute our field

working from Richard Haswell’s article, “NCTE/CCCC’s Recent War on Scholarship,” in which he “tracks the publishing trends of RAD (replicable, aggregable, and data supported) research in flagship journals.” Haswell defines such research as that which may or may not employ statistics, but is “explicitly enough systematicized in sampling, execution, and analysis to be replicated; exactly enough circumscribed to be extended; and factually enough supported to be verified” (174) – an attempt at bringing “hard” data into the “soft” humanities? (we’re not viewed as a science, not even a social science…). RAD data-driven inquiries (174):

  • Data results from a set procedure of observation, elicitation, and analysis – an illustration that the field doesn’t lose its human-centric tendencies, given our common methodologies of collecting data – ethnography?
  • Description of a system of text analysis or a research method or a research tool, application, and report of results
  • Establishment of a descriptive or validation system and then application to text, course, or program
  • Textual analysis with report of application, using a systematic scheme of analysis that others can apply to different texts and directly compare

quoting Chris Anson from his piece “The Intelligent Design of Writing Programs: Reliance on Belief or a Future of Evidence”, “Ultimately, changing the public discourse about writing from belief to evidence, from felt sense to investigation and inquiry, may help to move us all beyond a culture of ‘unrelenting contention’ (Tannen) and toward some common understandings based on what we can know, with some level of certainty, about what we do” (175). – the point is to be able to discuss what WPA and WP do with the public, motivation to turn to data-driven methodology/inquiry

“Data mining is loosely defined as the process of finding interesting information in large amounts of data” (176) – not “numbers” out of context

“It can also help us conduct research of a more exploratory nature, providing windows into the data that we can use to determine what questions to ask of that data” (176) – working against the bias that numerical data comes from a pre-determined hypothesis – or what is expected to be found

“Data and text mining, then, can be exploratory and, consequently, more descriptive, or they can serve a predictive function” (177) – working to explore, or working to illustrate (these aren’t mutually exclusive either)

“knowledge to be gained is implicit in the data. Data mining might be predictive, in that it seeks to forecast future actions or behaviors through examining patterns in the data, or descriptive, in that it attempts to explain those patterns and the implications thereof. It can be used to classify information or cluster it into groups according to similar characteristics and represent that information in more concise ways. Data mining can also be used to detect anomalies or outliers in a data set” (177-178) – separates it from other statistical methods

another mention of “cluster,” as in Moretti – “Clustering involves the grouping together of similar data items; unlike classification, the labels of the clusters are not preset” (178) – what do these clusters look like? How do they surface in the data?
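To make the clustering/classification distinction concrete for myself, here is a toy sketch in Python. Everything in it is invented for illustration (the mini-“abstracts,” the titles, the similarity threshold); the point is only that no labels are preset – groups emerge from overlapping vocabulary and get named, if at all, only afterward.

```python
def jaccard(a, b):
    """Overlap between two word sets (0 = disjoint, 1 = identical)."""
    return len(a & b) / len(a | b)

def cluster(docs, threshold=0.2):
    """Greedy single-pass clustering: attach a document to the first
    cluster it resembles, else start a new cluster. No preset labels."""
    clusters = []  # each cluster is a list of (title, wordset) pairs
    for title, text in docs.items():
        words = set(text.lower().split())
        for c in clusters:
            if any(jaccard(words, w) >= threshold for _, w in c):
                c.append((title, words))
                break
        else:
            clusters.append([(title, words)])
    return [[t for t, _ in c] for c in clusters]

docs = {  # hypothetical titles and phrases, not real data
    "A": "distant reading of novelistic genres over time",
    "B": "graphs and distant reading of literary genres",
    "C": "failure rates in first-year writing sections",
    "D": "first-year writing program failure rates by section",
}
print(cluster(docs))  # two clusters surface on their own
```

A classifier would instead start from labels like “literary history” and “program assessment” and sort documents into them; here the two groupings surface from the data itself.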

“Associations and patterns further assist with the understanding of data. Associations refer to relationships among data items that might predict behavior of a user group” (179) – determined from the clusters? That would be a logical progression – clusters to associations to patterns

“Data mining can also be inductive, unlike data analysis that often begins with a hypothesis that is to be proven or disproven by examining the data. Data mining allows for the fact that the relations between factors that will tell the user that the most interesting, nontrivial information may come from variables that do not initially seem to have any distinct relationship” (179) – what happens in these associations that we don’t otherwise see (?)
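The inductive move – letting associations between items surface without a prior hypothesis – can also be sketched. The works-cited sets below are invented (these are not Lang and Baehr’s data); the sketch just counts which pairs of cited figures co-occur, and their “support,” the fraction of articles containing both.

```python
from itertools import combinations
from collections import Counter

works_cited = [  # hypothetical articles, each a set of cited figures
    {"Moretti", "Latour", "Burke"},
    {"Moretti", "Latour"},
    {"Burke", "Bitzer"},
    {"Moretti", "Latour", "Bitzer"},
]

# Count every co-occurring pair; no hypothesis picks the pairs in advance.
pair_counts = Counter()
for refs in works_cited:
    for pair in combinations(sorted(refs), 2):
        pair_counts[pair] += 1

# "Support": fraction of articles in which a pair appears together.
support = {p: n / len(works_cited) for p, n in pair_counts.items()}
top = max(support, key=support.get)
print(top, support[top])  # the most strongly associated pair
```

The interesting output is often exactly the pair one would not have thought to test for – the “nontrivial information” from variables with no obvious relationship.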

RAD based study types according to Chris Anson: foundational research and syntheses, replications and extensions, graduate research, connections with the general public, increased scrutiny and critique, and improved research communities (180) – the first two apply to data mining, while the other four, with some work, can as well according to Lang and Baehr (180-183)

toward foundational research and syntheses and replications and extensions, data mining makes available:

  • revisiting these foundational works can help validate their findings or uncover how these accepted or “given” approaches have changed over time
  • the ability to enable researchers to examine the ever-increasing number of studies published and posted online and build connections and syntheses from those that can expand or contract in scope as most appropriate to answer a particular query
  • In addition to including larger sample sizes, artifacts could be examined through a variety of different lenses that might shed additional light on the core research questions of the study
  • Follow-up studies

The remaining four – graduate research, connections with the general public, increased scrutiny and critique, and improved research communities – Lang and Baehr connect to the public by examining “key questions or issues of popular interest or sustained issues over a period of time” (183). They quote from Kelly Ritter’s article “Extra-Institutional Agency and the Public Value of the WPA” to point out that:
First-year composition is a public enterprise historically. It’s no secret to WPAs that their necessary public defense of student writing—and the myths that require such defenses to be launched—are a result of this perceived communal ownership. Composition, unlike other academic disciplines, is perpetually at the mercy of cultural conceptions of literacy, whether through various levels of community “sponsorship” and that sponsorship’s accompanying costs (Brandt, “Sponsors of Literacy”) or complicit institutional structures, fueled by culturally skewed notions of “correctness” in discourse, all of which keep composition at the bottom of the academic hierarchy (Crowley) (183) – the exigence for their data mining. If we can show processes, theories, habits, ideas, practices in a more “concrete” (as in what the sciences show/do), then we can better account and demonstrate (prove?) what we do in composition studies. Or, what we have done, have done poorly, have done better, and what we have yet to do.

They call for a “stronger culture of research” by using the Graduate Research Network (online) as an example: “the Graduate Research Network (GRN) has thrived as a part of the Computers and Writing Conference since 2000. However, its listserv (founded February 2007) and its blog (founded February 2011) have met with less success. The listserv currently has thirty-five members and has circulated forty messages since its founding, and the blog has a single entry” (184) – yikes! Part of data mining, finding/connecting, is not only looking for these spaces, but looking at how they are living on the web. It is not enough that they exist.

“The process of data mining, an often iterative one, involves identifying problems, data sources, and heuristics; establishing a formal procedure; and interpreting results” (185). The process, in detail, is outlined below:

Selecting information resources…may be based on a number of factors, such as access, relevance, availability, or possibly the nature of inquiry. Once problems and sources are selected, heuristics, or what measures or criteria should be used to systematically probe data, must be decided. This may involve developing categories (clustering), classification (or classes), variables, or other methods that will be used to sort data. When these methods have been selected, developing a formal, systematic procedure, a repeatable process, should be used to sift through the data. A process involves a number of logistical details such as data collection, reduction, formatting, and storage. In developing a documented process, the researcher can help ensure replicability, so in future studies, methods used can be transferred elsewhere… Finally, addressing results, interpreting data, and identifying trends and conclusions are the last part… (185) – what data mining “looks” like (perhaps this is just making a case for the methodology?). One table from their study explores the cause of “seventeen sections out of the sixty-four sections offered by our first-year writing program that had a failure rate of 30 percent or more (defined as students earning a D or F in the course or withdrawing from the course).” Access to sources of data captures my attention, particularly because if this isn’t a widely adopted practice, and if its goal is to find/locate/collect data, where does one start? This is where my interest in Latour’s networks comes in, but perhaps the process of assembling networks can happen at the same time as data mining, meaning that there aren’t necessarily delineated steps in a chicken-or-the-egg type scenario. Can these explorations/discoveries happen simultaneously?
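Their outlined procedure (select sources, choose heuristics, run a repeatable process, interpret) can be sketched against their own example of flagging high-failure sections. The section records below are invented, not their data; the 30 percent criterion is the one from the article.

```python
sections = {  # section id -> (enrolled, D/F/withdraw count); hypothetical
    "101-01": (24, 3),
    "101-02": (22, 8),
    "101-03": (25, 5),
    "101-04": (20, 7),
}

THRESHOLD = 0.30  # the 30% failure-rate criterion from the article

def failure_rate(enrolled, dfw):
    """Heuristic: share of students earning D/F or withdrawing."""
    return dfw / enrolled

# Repeatable procedure: the same documented criterion is applied to every
# record, so the sift can be rerun, or replicated elsewhere, unchanged.
flagged = sorted(
    sid for sid, (n, dfw) in sections.items()
    if failure_rate(n, dfw) >= THRESHOLD
)
print(flagged)
```

Interpretation is the step code cannot do: why those sections failed is the question the flagged list only opens.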

“as a methodology, data mining typically requires longitudinal data collection in order to ensure stronger validity in findings. Without an existing source of longitudinal data, a significant time investment may be required to collect data, to front-load the project” (191) – Bold added by me. This ties into the previous long quotation about the process of data mining. It is my hunch that the data is assembled based on inquiry/exploration. A holy grail of longitudinal data doesn’t seem likely/possible/desirable. I still question what network looks like through data mining and visualizing – something like CUNY’s Writing Studies Tree? Where does this data go? How can it be found? And by whom?

“Composition studies, not unlike other humanities research, must continue to re-examine and re-evaluate foundational studies and findings, as part of its evolution and body of knowledge” (191) – what is composed can decompose, but it still must be accounted for for future compositions.

Gives a short list of data mining software currently available (but how many are free? This seems to impact who is/can do this work…) in a handy appendix – my annotations added as a curious/interested graduate student:

  • Any Count: Character, Word, and Line Count software: http://www.anycount.com/ – a purchase of 49 euros. Produces “automatic word counts, character counts, line counts, and page counts for all common file formats”.
  • Clarabridge Text Analytics: http://www.clarabridge.com/ – advertises “free trial” which then leads to a purchase. Does “High-fidelity NLP with Semantic Analysis, Advanced Classification, Advanced Sentiment Analysis, Sentiment Tuning & Scoring, Standard Reports & Visualizations, Automatic Structured Data Linking”.
  • Crawdad Text Analysis Software: http://www.crawdadtech.com/ – $95. Works as a “generator, visualizer, browser, finder, comparator, classifier”.
  • Eaagle Full Text Mapper: http://wp.eaagle.com/?page_id=16 – “Plans and Pricing” lead to “404 Error : File not found”. Does “Relevant words and topics identification through mapping, Weak or emerging signals identification, 3D Mapping, Reporting”.
  • MorphAdorner: http://morphadorner.northwestern.edu/ – could it be? Free. Works as an “annotation” and “tagging” tool.
  • NVivo Research Software: http://www.qsrinternational.com/products_nvivo.aspx – full license $670.00; student license $215.00. Works as a “coding” tool.
  • Predictive Data and Text Mining: http://www.data-miner.com/ – Must purchase book. Amazon listing $57.81. Does “(a) data preparation including tokenization, stemming, vectorization, and dictionary compilation (b) prediction by methods such as naive Bayes and advanced linear models (c) information retrieval by k-nearest neighbors and document matching (d) document clustering and (e) information extraction of named entities.”
  • SAS Data- and Text-Mining Software: http://www.sas.com/ – Call for price. Performs “predictive and descriptive modeling, data mining, text analytics, forecasting, optimization, simulation, experimental design and more”
  • SPSS Data Miner: http://www-01.ibm.com/software/analytics/spss/products/data-collection/ – Page requested cannot be displayed.
  • Tableau Visualization Software: http://www.tableausoftware.com/ – Personal edition at $999.00. Creates “data visualizations” that work from spreadsheet like forms. “As powerful as a freight train. As user-friendly as a kitten.” “Create maps, bar and line charts, heatmaps, dashboards and more”
  • VantagePoint: http://www.thevantagepoint.com/ – “VantagePoint helps you rapidly understand and navigate through large search results, giving you a better perspective—a better vantage point—on your information. The perspective provided by VantagePoint enables you to quickly find WHO, WHAT, WHEN and WHERE, enabling you to clarify relationships and find critical patterns—turning your information into knowledge.” Differentiates between literature and scientific literature. Need an account.

Mueller

Using the concept of distant reading from Franco Moretti, looking at the work done by Donna Burns Phillips, Ruth Greenberg, and Sharon Gibson in “College Composition and Communication: Chronicling a Discipline’s Genesis” (1993), and the snapshot Janice Lauer produced in her Rhetoric Review essay “Composition Studies: Dappled Discipline” (1984), Derek works to update and further the ongoing inventorying of rhetoric and composition. From the article’s abstract: “In its focus on graphing, the article demonstrates an application of distant reading methods to present patterns not only reflective of the most commonly cited figures in CCC over the past twenty-five years, but also attendant to a steady increase in the breadth of infrequently cited figures” (195) – he is looking at citation frequencies in CCC to offer perspective on the changing disciplinary density (197). As a graduate student/newcomer to the discipline, I take particular interest in ways of exploring the field in more focused ways than searching databases (if there is one) for keywords (only those familiar to me are available) or author (only those familiar to me are available). There is the process of referral by peers and professors, as well as reading about what others are reading on scholar blogs I follow (only those “of interest” to me – only those familiar to me). This usually amounts to many scraps of paper with scribbled names, open browser tabs that never get their due diligence, or IOUs in my Evernote mounting “things to read” notebook. Spending time with these pieces is difficult, if not unlikely. And these are only the ones I know about. Where are the others? Who are the others? How do I find (see) them?

using graphing/quantitative methods to “read” journals and the surrounding fields alters the scale to allow us to see aggregate patterns that link details and non-obvious phenomena and compiles replicable data that can represent impressions of changing conditions in the discipline of composition (196) – the field is in flux. its history not easily traced. what represents it? is it oil paintings or marble statues of “the greats”? are these greats established by adoption/circulation of ideas? does notice come to the offbeat? the counter? the on the fringe? Becoming an active participant in the field requires me to contribute, but how can I compose if I do not know what has been (de)composed?

reading tends to be local, a direct encounter with a focal text – words, sentences, paragraphs, but reading across, from a distance, allows us to zoom out, to see patterns that are unrecognizable when we’re reading at the level of the article (198) – a means of orientation, a handle. or an oar for orienteering.

as the field continues to develop/mature, there is a growing demand for theoretically sophisticated scholarship. From where (what materials) does this composing draw? The past? Does it trouble? Refute? Reimagine? Does it stray into other disciplines to borrow? Keeping track of these sources becomes critical. While I don’t fully understand the “complicated politics of citation”, I imagine it has to do with who is/isn’t given credit and for what. Creating something “new” only to realize it’s been composed/proposed before. Or, not realizing. And the process of composing, assembling materials, draws from the materials of composition – its texts. Graphs allow us to grasp non-obvious trends (200) – and I think non-obvious is flexible, perhaps, to the individual. Some theories will be more easily aligned/identifiable with established ideas in the field than others. And within/without those, juxtapositions, borrowings, and revisioning from the well worn and off the beaten paths.

heuretic discipliniography (Derek Mueller) “writing and rewriting the field by exploring the intersections across different scholars’ bodies of work as well as the associated pedagogical, theoretical, and methodological approaches they mobilize” (201). – This is a rich term, borrowing from Gregory Ulmer’s heuretic – the use of theory for the invention of new texts, -ography – a field of study that brings to mind geography, cartography, tomography, and so on, and discipline – as activity. It is action (re)search.

In graphing the works cited entries from the data set, the articles/works cited/name references published in CCC, Derek raises poignant questions: “What is at stake in knowing or not knowing any of the figures shown here? What presences and absences are most striking? To what degree are new scholars…overshadowed by well-established ones?” (201) – It seems odd to me, at this point, to realize that heavy hitters shape the field, but do not necessarily “make it” entirely. My own interests should make me aware of this, but I liken it to the field of “scientific knowledge” that most common people know of. There are only a handful of scientists/theories that I can explain/know of, but these are not the entirety of science. There are many in betweens, smaller composites that make a larger composition/contribution, there is borrowing, reshaping/recomposing, and the potential for something like dormancy in an idea’s circulation until a “eureka!” – “So while quantitative studies of authors cited in a well-known journal may offer a reasonable indication of the “common knowledge” of the field, this approach must not appear to produce a definitive roster of influences on the discipline.” (206)

Derek illuminates the (changing) practice of citations. He explains a conventional list creates a reduced record that affirms the presence of a source but makes no distinction/explanation of how a source was used and from what parts it was used (location in a text). This downplays the scope of the reference/author – production, reception, and circulation (205). This is flat. Nothing to see here. Until we gain a little distance (perspective).

Looking at the top hits of composition studies through the graph, these figures do not tell us what was happening across the entire sample of the names in the citations. The names are indicative of currents in conversation in the field (206) – but who was left out of the Burkean Parlor? (214) The room is full, loud, standing room only. And some invitations got lost in the mail.

Quoting Chris Anderson in The Long Tail “We have been trained, in other words, to see the world through a hit-colored lens” (207)

Anderson’s long tail is an inquiry into niche music interests in online markets (vs. store-shelf retail giants) – the long tail is thin and demonstrates the deviation from the head of the graph, or the high ranking hits commonly available on shelves in stores, to less popular albums/artists selling online (208). Applied to composition scholarship, the long tail provides an “abstract visual model” to potentially illuminate new insights, raise new questions, and explore the continuing maturation of the field. It allows us to engage with large scale data (209).

Derek looks at unique names of well known figures like Maya Angelou, Bill Gates, etc. invoked just once in the data set of name references. The figures at the top tell us something about citation practices and the scholarly conversation, but the long tail of unduplicated references allows us to begin to assess how broad based the conversations in the field have grown (211) and raises the question “How flat can the citation distribution become before it is no longer plausible to speak of a discipline?” (215). Being too fixed in the head of the graph, the frequently cited/evoked, has implications just as an absent-minded tail of strings of specialized references has – we have to mind where we stand, head and tail.
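The head/tail readout itself is a simple frequency count, which helps me see what the graphs are actually made of. The citation list below is invented (not Mueller’s data set): rank cited names by how often they appear, then measure how much of the roster is “tail” – names invoked only once.

```python
from collections import Counter

citations = [  # hypothetical name references across a set of articles
    "Berlin", "Berlin", "Berlin", "Elbow", "Elbow",
    "Angelou", "Gates", "Moretti", "Berlin", "Elbow",
]

freq = Counter(citations)
ranked = freq.most_common()          # head first, tail last
tail = [name for name, n in ranked if n == 1]
tail_share = len(tail) / len(freq)   # breadth of one-off references
print(ranked)
print(tail, tail_share)
```

Here three of five unique names are one-offs: most of the roster is tail even though most of the citing happens in the head – the asymmetry the graphs make visible.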

“Graphing provides a partial readout of the field’s pulse with respect to compactness and diffuseness, which complicates speculation about where the field stands at any given moment and where it is headed” (217) – what has changed over time in the relation between the head and the long tail? (215)

vantage point – “disciplinarity in general”: depending on one’s vantage point, the head or the tail, the field can appear highly focused or as a loose amalgamation. Recognized and shared principles vs. pocketed enclaves of unique interest that ignore disciplinarity. Both vantage points, generalist and specialist, are involved in seeing/shaping. Niche enclaves (specialized) negotiate a shared disciplinary frame, they contribute to the field’s shape. Graphing can help us better understand the ways specializations negotiate and cohabit (inter)(intra)disciplinary scenes (218-219). Bolded words added. Now what do we see?
