R is a free software program used primarily for statistics and graphing. It offers a wide variety of statistical and graphing techniques, including:
• Linear and nonlinear modeling
• Classical statistical tests
• Time-series analysis
• Classification
• Clustering
• And more
It is widely used among statisticians and data miners for developing statistical software and for data analysis.
Students have used this program to help solve problems in statistics. The program is good for finding the mean and standard deviation of a data set and for constructing bell curves (1). It is a quick way to find patterns in a large set of numbers, calculate percentages, and see whether the data support a hypothesis.
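As a minimal sketch of those classroom tasks, the R code below computes a mean and standard deviation and evaluates a bell curve. The scores are invented for illustration and the variable names are hypothetical, not taken from any course material.

```r
# Hypothetical exam scores, invented for illustration only.
scores <- c(62, 71, 75, 78, 80, 83, 85, 88, 91, 97)

score_mean <- mean(scores)  # arithmetic mean
score_sd   <- sd(scores)    # sample standard deviation (divides by n - 1)

# Height of the normal ("bell") curve with that mean and standard deviation,
# evaluated on a grid spanning three standard deviations on either side.
grid    <- seq(score_mean - 3 * score_sd, score_mean + 3 * score_sd,
               length.out = 200)
density <- dnorm(grid, mean = score_mean, sd = score_sd)

# Draw the curve only when run interactively (for example, at the R console).
if (interactive()) {
  plot(grid, density, type = "l", xlab = "Score", ylab = "Density",
       main = "Normal curve with the sample mean and SD")
}
```

The curve here is only a normal distribution sharing the sample's mean and standard deviation; whether real data actually follow a bell shape is a separate question.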
However, testing hypotheses is not limited to statistics classrooms.
Matthew Jockers
Mr. Jockers decided that R could be useful for more than statistics. He wrote “Text Analysis with R for Students of Literature.” This manual is meant for the statistical analysis of linguistic data, and he wrote it for students and scholars of literature. Each chapter works through a new technique or process. Using the novel Moby Dick from Project Gutenberg, we worked through the first four chapters in class. These chapters are as follows:
• Chapter 1: Basics
o Goes through how to set up R on your computer
• Chapter 2: First Foray into Text Analysis with R
o Sets up the text and begins analysis of word choice
This image is the top ten words of Moby Dick
• Chapter 3: Accessing and Comparing Word Frequency Data
o Takes the data we obtained in Chapter 2 and compares it to data from a different text, retrieved through the same techniques
This image is the top ten words of Jane Austen's Sense and Sensibility
• Chapter 4: Token Distribution Analysis
o Creates distribution plots for different words in the text
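The word-frequency step from Chapter 2 can be sketched roughly as follows. This is an illustrative outline, not the exact code from Jockers' book: it uses the opening of Moby Dick as a stand-in for the full Project Gutenberg file, and the variable names are hypothetical.

```r
# Stand-in text: the opening of Moby Dick, used here instead of the full
# Project Gutenberg file so the example runs on its own.
sample_text <- paste(
  "Call me Ishmael. Some years ago, never mind how long precisely,",
  "having little or no money in my purse, and nothing particular to",
  "interest me on shore, I thought I would sail about a little and see",
  "the watery part of the world."
)

# Lowercase the text and split it on runs of non-word characters.
words <- unlist(strsplit(tolower(sample_text), "\\W+"))
words <- words[words != ""]  # drop any empty strings left by the split

# Count each distinct word and sort from most to least frequent.
freqs <- sort(table(words), decreasing = TRUE)

# Show the ten most frequent words with their counts.
print(head(freqs, 10))
```

On a full novel the same three steps (lowercase, split, tabulate) apply; only the text loaded at the top changes.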
With R, the goal is to be able to access information that would otherwise be unobtainable through traditional methods of close reading and human synthesis. Jockers writes that "the real learning will begin when you put this book aside and build a project of your own" (2).
The first two graphs
The first graph shows the top ten words of Moby Dick. The words from highest to lowest are "the", "of", "and", "a", "to", "in", "that", "it", "his", and "I". The second graph is a comparison with the top words of Sense and Sensibility. The words from highest to lowest are "to", "the", "of", "and", "a", "I", "in", "was", and "it".
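Because two novels are rarely the same length, raw counts are hard to compare directly. One common approach, sketched below, is to convert counts to relative frequencies (occurrences per 100 words) before comparing. The two short strings are invented stand-ins for the novels, and `relative_freqs` is a hypothetical helper written for this example.

```r
# Hypothetical helper: relative frequency (per 100 words) of every word.
relative_freqs <- function(text) {
  words <- unlist(strsplit(tolower(text), "\\W+"))
  words <- words[words != ""]           # drop empty strings from the split
  counts <- table(words)
  sort(100 * counts / sum(counts), decreasing = TRUE)
}

# Invented stand-ins for the two novels, for illustration only.
text_a <- "the whale and the sea and the ship"
text_b <- "she was to go to town and he was to stay"

rel_a <- relative_freqs(text_a)
rel_b <- relative_freqs(text_b)

rel_a["the"]  # 3 of 8 words, so 37.5 per 100

# Side-by-side comparison for the words that occur in both texts.
shared <- intersect(names(rel_a), names(rel_b))
comparison <- data.frame(
  word   = shared,
  text_a = as.numeric(rel_a[shared]),
  text_b = as.numeric(rel_b[shared])
)
print(comparison)
```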
Results of Using R
In Chapter 2, with the help of Jockers’ instructions, we were able to construct a graph of the ten most used words in Moby Dick. While nine of the ten words are not interesting, the ninth most frequently used word gives us an unusual insight into the novel. Jockers points out the lack of females in the novel and brings our attention to that ninth word, “his.” While we knew that there are many men in the novel, it comes as a surprise that the word “his” was used so often. Jockers uses this information to look into a new pattern that he was curious about. Interested in the gap between males and females in the novel, he decided to compare female pronouns with male pronouns. He takes us through more techniques so that we can do this together, and repeat it for other novels with different information. We computed how many times “he,” “she,” “him,” and “her” were used in Moby Dick. The word “he” was used 1876 times, and “him” was used 1058 times. Comparing this to “she,” used 114 times, and “her,” used 350 times, you can see the drastic difference. We do not stop there. Jockers then takes us through comparing the frequencies. We were able to compute that “he” was used about 16.46 times for every time “she” appeared in the novel, and “him” was used 3.2 times for every time “her” was used.
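The ratios are simple divisions of the counts reported above, which can be checked in a few lines of R. The vector name is hypothetical. Note that with the counts as quoted, the him-to-her ratio comes out nearer 3.0 than 3.2; a ratio of 3.2 would correspond to a count of about 330 for “her,” so one of the reported figures may contain a typo.

```r
# Pronoun counts as reported in the paragraph above.
pronoun_counts <- c(he = 1876, she = 114, him = 1058, her = 350)

# How many male pronouns appear for each female counterpart.
he_per_she  <- pronoun_counts[["he"]]  / pronoun_counts[["she"]]
him_per_her <- pronoun_counts[["him"]] / pronoun_counts[["her"]]

round(he_per_she, 2)   # 16.46
round(him_per_her, 2)  # 3.02
```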
Why is this Information Relevant?
You may be asking why we care how many times a male pronoun was used compared with a female pronoun. This information could be used in a variety of discussions about Herman Melville and his book Moby Dick. When you compare the top ten words to those of Sense and Sensibility, it is evident that “his” is not a word that commonly makes the list. The fact that it appears in the list for Moby Dick tells us that it is very important. Also, the lack of female pronouns in the book displays a significant separation between males and females in the novel. Why Melville decided to exclude females to such a great degree may have to be discussed through many close readings of the book, but the prominence of the male pronouns tells us that the novel has a major focus on its male characters. Even though the novel is called Moby Dick, “Moby Dick” and “whale” are not in the top ten. This may suggest that the whale Moby Dick is not the main character talked about and dealt with in the novel. From this information, we could begin a discussion about the real focus of the novel.
Troubles with Using R
If you have managed to count correctly, you will have noticed one of my computational errors in programming with R. My graph for the top ten words of Sense and Sensibility only has nine words. Also, my graph for the top ten words of Moby Dick looks like this:
To work through R correctly, you have to make sure that each command is entered correctly. This was frustrating at times, especially when you could not figure out what was incorrect. R is not a forgiving program: if one thing is off, it will affect everything that follows.
While some people were praising their genius for producing the correct results according to Jockers, others were cursing themselves for not taking that introductory computer science class. We as a class fought valiantly, relying on each other along the way. Even though we struggled, the results and potential applications of this program could not go unnoticed.
Other Uses of R
We have only scratched the surface of how useful R is. While we used it to look at the top ten words in Moby Dick and Sense and Sensibility, you can use R to look up almost any word, depending on what you are searching for. You can also create graphs like those in Chapter 4, which offer another visual way to look at data and find patterns. The other chapters, not explored by my class, go through clustering, classification, topic modeling, correlation, lexical variety, and much more. Each chapter presents the opportunity to find different and unknown patterns, not only in Moby Dick but in any given text, and thanks to Matthew Jockers, we have the instructions to learn how to find these patterns on our own.
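A distribution plot of the kind described for Chapter 4 marks where a chosen word occurs across the running text. The sketch below shows one way to build such a plot in base R; the sentence is invented for illustration, the variable names are hypothetical, and this is not necessarily the exact procedure in Jockers' book.

```r
# Invented stand-in text, for illustration only.
sample_text <- paste(
  "the whale surfaced and the crew saw the whale",
  "then the ship turned and the whale dove"
)

words <- unlist(strsplit(tolower(sample_text), "\\W+"))
words <- words[words != ""]  # drop any empty strings left by the split

# Positions in the running text where the target word occurs.
target <- "whale"
hits   <- which(words == target)

# One slot per word: 1 where the target occurs, NA everywhere else.
presence <- rep(NA, length(words))
presence[hits] <- 1

# Vertical bars mark each occurrence; plot only in an interactive session.
if (interactive()) {
  plot(presence, type = "h", ylim = c(0, 1), yaxt = "n",
       xlab = "Position in text (word number)", ylab = target,
       main = paste("Distribution of", target))
}
```

Changing `target` is all it takes to look up a different word, which is what makes this kind of plot easy to reuse across searches.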
My Experience with R
If you have not noticed from my complications with the graphs, I had some difficulty getting the programming correct. I was not alone in my struggles; the entire class had difficulties at some point with some chapter. Despite our frustrations, nothing could beat the elation when everything went as planned and our results matched up with the Jockers text. I was able to get through the text with few casualties, but my success in programming was not the only benefit of working with R.
Working with R and experiencing the frustrations also helped me learn the programming better than if everything had gone as planned. If I had not had to go back and re-read certain steps so many times, not just to figure out why I was putting a phrase into R but to see what letter or number I had missed, I would not have paid as much attention to what I was doing as I did. Since you have to be so careful while learning the programming, you can really understand what Jockers is talking about and what each command does. This makes it more likely that I can use this program in the future on other texts for other classes and projects. Being able to find word frequencies gives me the opportunity to find similar patterns in different novels.
As I moved along with each chapter and got more information on Moby Dick, I began to see patterns in the text that would otherwise have been impossible to find. While we had discussed the lack of female characters in the book, seeing the ratio of male pronouns to female pronouns was eye-opening. It really showed us the gap between the two and gave us the chance to explore why.