RStudio with Jockers: Hack Lab Report 2
In the past few hack labs, the class has been working with the open source programming language R in an exploration of Digital Humanities resources. R is a programming language for statistical computing and for generating graphic representations of data. While R lends itself to the more intimidating, esoteric spaces of coding executed in command prompts and cryptic web spaces, a free, open source integrated development environment (IDE) called RStudio is available for those of us without prior coding experience. RStudio provides an interface, written partly in C++, that allows less-experienced coders to manipulate data in an organized, consolidated space.
Matthew L. Jockers, a “computational humanist” and pioneering literary scholar in digital research in the Humanities, has written a comprehensive (and reassuring) tutorial for approaching RStudio as a tool for textual analysis, called Text Analysis with R for Students of Literature. Jockers’s instructional text is meant to be a launch pad from which the novice scholar in the nascent world of digital-literary analysis can efficiently learn to manipulate textual data in RStudio and apply those skills to any data set, or digitized novel. Conveniently suited to our field of inquiry, Text Analysis with R offers a walk-through guide to using RStudio with none other than Melville’s Moby Dick; however, the text also proves easily transposable to other data sets (as shown by the practice application to Jane Austen’s Sense and Sensibility). Using RStudio to examine word frequency and other simple mathematical occurrences is referred to as "microanalysis." Combining that simple number-crunching with larger considerations of relativity across an entire novel is "mesoanalysis." But the final chapters of Text Analysis with R (particularly chapter 12) demonstrate the utility of RStudio for comparing data from a huge corpus, which could be a collection of hundreds of novels in a data pool. Jockers calls this analytical work “macroanalysis”:
“Using computational analysis to retrieve key words, phrases, and linguistic patterns across thousands of texts in digital libraries, researchers can draw conclusions based on quantifiable evidence regarding how literary trends are employed over time, across periods, within regions, or within demographic groups, as well as how cultural, historical, and societal linkages may bind individual authors, texts, and genres into an aggregate literary culture.”
The use of RStudio as a tool for literary analysis is a tradition-in-the-making, a new territory to be explored. However, Jockers describes R and macroanalysis as a kind of augmentation to the time-tested practices of the reader, which I will discuss in detail later in this blog post.
With uneasy yet faithful trust, and spurred by Jockers’s promise of the utility of macroanalysis, the class downloaded RStudio and the Moby Dick text file, which served as the raw material to be manipulated. From chapter two, our “First Foray into Text Analysis with R,” onwards, the Jockers text reads as a sort of oscillation between commands-to-be-entered and results-to-follow, interspersed with commentary on what each command means in terms of its larger application to data sets.
The first thing I noted was an overwhelming anxiety that my lack of coding experience would hinder my ability to use RStudio, but the beauty of the tutorial text is that it knows, and it pacifies. One thing you might note in the Jockers text is how he uses a casual yet professional teaching voice to deliver a well-paced, reassuring experience. Just as that anxiety rises to the point where you are ready to quit, and all the commands are snowballing into a confusing mass of information, Jockers reminds the reader: “Deep breath” (Jockers, 14). Jockers’s ability to deliver this information in an informative but also literarily relevant manner is one of the most appealing aspects of the text. Jockers is a self-conscious narrator when he checks in with his readers’ emotions, because he is aware that we are mere “newbies” (Jockers, 19). This self-conscious awareness of the reader is not unlike that of Ishmael, who will not hesitate to go into great detail in the event that the reader, being a mere “tyro,” is not yet aware “with what wondrous habitude of unconscious skill the whaleman will maintain an erect posture in his boat…” (Melville, 221).
The next steps were aimed at stripping the metadata from Project Gutenberg’s Moby Dick, that is, the extraneous text included in the file, such as copyright information. Once the metadata is removed, the reader has set the scope of the text that will be included in the results of the analysis. In the preface to his text, Jockers reminds us that RStudio will be particularly useful for quantifying and comparing data when it comes to analyzing literature. This was evident in the “/” operator, which divides one value by another and so can express a word count as a relative frequency (its share of all the words in the text) rather than as a raw number. This is the point at which microanalysis turns to meso/macroanalysis. RStudio’s ability to tell you how often a significant word appears relative to another, topically related word of interest is crucial. Jockers reminds the reader of the larger, macroanalytical goal of “relativising in this way … [which allows] you to more easily compare the patterns of usage from one text to another" (Jockers, 26). This tip hints at the fact that RStudio lends itself not only to data analysis of one text, but also to inter-textual analysis as a means of finding out how the contents of canonical works are similar or different. RStudio's capacity to perform macroanalysis could generate literary research questions with larger comparative scopes. One might use RStudio to compare the word choice of narrators in several novels from a similar historical period and/or place. Examining differences in dialect and the political assertions implicit in word choice might bring out new historical information when contemporary novelists are set side by side. In fact, the new computational technology of macroanalysis has already demonstrated an impressive ability to match anonymous works with their reputed authors on the basis of statistical analysis.
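The metadata-stripping and relative-frequency steps described above can be sketched in a few lines of R. This is only a minimal illustration in the spirit of the tutorial, not Jockers's own code: the seven-line toy text, the START/END marker strings, and every variable name are assumptions made up for the example, standing in for the real Project Gutenberg file.

```r
# Toy stand-in for the lines of a Project Gutenberg file (hypothetical content).
text_lines <- c(
  "The Project Gutenberg EBook of Moby Dick",
  "*** START OF THE TOY TEXT ***",
  "Call me Ishmael.",
  "The whale, the white whale!",
  "A whale is a whale.",
  "*** END OF THE TOY TEXT ***",
  "Licensing information follows here."
)

# Locate the (assumed) marker lines and keep only what sits between them.
start_line <- which(text_lines == "*** START OF THE TOY TEXT ***")
end_line <- which(text_lines == "*** END OF THE TOY TEXT ***")
novel_lines <- text_lines[(start_line + 1):(end_line - 1)]

# Collapse to one string, lowercase it, and split on non-word characters.
novel_text <- tolower(paste(novel_lines, collapse = " "))
words <- unlist(strsplit(novel_text, "\\W+"))
words <- words[words != ""]  # drop any empty strings left by the split

# Raw count first, then the "/" step: hits divided by the total word count.
whale_hits <- sum(words == "whale")
total_words <- length(words)
whale_rel <- whale_hits / total_words

print(c(hits = whale_hits, total = total_words))
print(round(100 * whale_rel, 2))  # relative frequency as a percentage
```

Because the relative frequency is a share of the whole text rather than a raw tally, the same calculation run on a second novel of a different length gives a number that can be compared directly with this one.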
Notable new fields of analysis open up with new technology. I found the efficiency of RStudio for retrieving and computing data to be a welcome addition to analytical procedure; however, there were some problems encountered during the process as well. At times it was evident where technology does not surpass human capability, such as the instance where RStudio suddenly aborted while I was in the midst of analysis. Has your paperback novel ever suddenly aborted? The sense of purpose with which a human being works is not always matched by a piece of technology, which is liable to shut down suddenly (and you can’t tell it to just hold on!). This is an issue of communication between human and technology. Struggling with the grammar of R code was another obstacle, since you have to work with the program according to its inflexible grammatical prescriptions. So, like old Ahab’s bone prosthetic, our technological devices can sometimes be inflexible and break down on us, and it can be very disappointing.
For the student of literature, those time-tested practices of close reading (word-count frequencies, range of a narrator’s vocabulary, etc.) are generally marked by a less than scrupulous attention to statistics, aided by little technology other than the physical text, a pen, and a notebook. Even now, writing this blog post, I chose to reference Melville’s use of the term “tyro,” and thus needed page numbers. When I was presented with the choice between flipping through the 572 pages of my physical copy of Moby Dick scanning for the word “tyro,” or entering a command and instantaneously obtaining a list, the "immediately rewarding" nature of RStudio promised by Jockers in his preface came to fruition. This list, in fact, will also provide me with the frequency with which the word “tyro” appears, and by relativizing the word against another I could find out more about “tyro” as it appears in Moby Dick a lot faster with R than I could with a pen, a notebook, and the novel in front of me. Further, in physically searching my book for the chapter entitled “The Albatross,” I could not find the chapter until I realized that the table of contents lists the chapter’s full name, “The Pequod Meets the Albatross.” This is another testament to what digital analysis has to offer. While the human emotional capacity for interpretation could never be paralleled in a computer program, it is largely the leg-work of analysis (where humans often fail) that is aided by RStudio. With that being said, the data sources from which we draw the texts to analyze in RStudio are not always perfect. Remember that data can be misprinted and consequently misread by the computer where a human might otherwise catch the mistake. So we see the failings of the computer made up for by human sensibilities, and vice versa. It is important, then, to think of RStudio as an auxiliary device with which we co-analyze, and not a sentient being designed to write our theses.
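The “tyro” lookup can be sketched along the following lines. Again, this is a hedged illustration rather than a command from the Jockers text: the six toy lines and the variable names are invented for the example. Note one assumption that matters in practice: a plain text file carries no page numbers, so the sketch reports line numbers and the nearest chapter heading instead.

```r
# Toy stand-in for the lines of a novel (hypothetical content).
text_lines <- c(
  "CHAPTER 1",
  "The tyro learns the ropes slowly.",
  "No whale was seen that day.",
  "CHAPTER 2",
  "Even a tyro may stand in the boat.",
  "The old hands laughed at the Tyro."
)

# Line numbers containing "tyro" as a whole word, ignoring case.
hit_lines <- grep("\\btyro\\b", text_lines, ignore.case = TRUE, perl = TRUE)

# Line numbers where each chapter begins.
chapter_starts <- grep("^CHAPTER", text_lines)

# For each hit, count how many chapter headings precede it;
# that count is the index of the chapter the hit falls in.
chapter_of_hit <- findInterval(hit_lines, chapter_starts)

# One row per occurrence: where it is, which chapter, and the line itself.
hits <- data.frame(
  line = hit_lines,
  chapter = text_lines[chapter_starts[chapter_of_hit]],
  text = text_lines[hit_lines],
  stringsAsFactors = FALSE
)
print(hits)
print(nrow(hits))  # total number of occurrences
```

The same `grep` call with a partial title in place of the word would have spared me the hunt for “The Albatross,” since a pattern match does not care what the rest of the heading says.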
But the expediency with which R code can provide data that would take a human many hours to gather is crucial to the research environment; the faster data is mined, the sooner advances come about. Jockers describes the symbiotic relationship between RStudio and human analysis powerfully in his preface, which serves as a kind of defense of macroanalysis:
"If a new type of evidence happens to confirm what we have come to believe using far more speculative methods, shouldn’t that new evidence be viewed as a good thing? If the latest Mars rover returns more evidence that the planet could have once supported life, that new evidence would be important. Albeit, it would not be as shocking or exciting as the first discovery of microbes on Mars, or the first discovery of ice on Mars, but it would be important evidence nevertheless, and it would add one more piece to a larger puzzle. So, computational approaches to literary study can provide complementary evidence, and I think that is a good thing"(Jockers, viii).
If I may reframe the analogy in terms of whaling, I would consider RStudio to be akin to a sonar system handed to Ahab. Time-tested practice has proven to the Pequod that a group of whalemen is capable of locating a pod of whales and hunting those whales down. Would it not be nice, however, if they could review the size of every pod of whales beneath the sea and then compare those pods to determine which is the most lucrative? This is what RStudio allows for the literary analyst. Instead of finding the word “whale” a number of times and comparing those instances which you can recall or document, you push a button and a screen shows you every “whale” (as a word) in the novel.
As for the utility of RStudio as an asset for future literary analysis, I believe there is always a place for a technology that makes something faster and more efficient. I believe the multifarious skills required to master RStudio will make analysis a multi-disciplinary affair even more than it has been in the past. The involvement of professionals from many fields has been a constant in all the Digital Humanities technology we have studied. Working in RStudio draws on statistical, analytical, literary, and coding skills, among others. For this reason it encourages a collaborative environment in the Humanities. I believe the level of mathematics at which RStudio is capable of operating might require its processes to be divided among specialists in the future; it is plausible that a multi-disciplinary workforce could all work on an RStudio analysis project under the umbrella goal of literary analysis, each member simultaneously contributing to the project itself and to their separate fields of study. Not everyone can be a Jockers, capable of both advanced programming and sophisticated, literary whale jokes in one lifetime. The fundamental importance of RStudio is that it is forging new fields of literary analysis and presenting a new way to read, and that is exciting.