Big Data, Tiny Computers: Making Data-Driven Methods Accessible with a Raspberry Pi

By Aaron Beveridge* and Nicholas M. Van Horn*

The phrase “Big Data” often implies the use of research practices requiring massive supercomputers to process datasets too large for traditional forms of analysis. Advertisements from companies like Microsoft and Amazon reinforce Big Data’s intimidation factor, as these companies no doubt prefer that researchers develop data-driven projects using their Azure or AWS platforms. While these platforms are feasible for researchers with extensive institutional support—either from their home institutions or from grant funding agencies—for many others the long-term financial costs and steep learning curves associated with such platforms force researchers to look elsewhere for more accessible options. Of course, making data-driven methods more accessible extends far beyond consumer choices—the work extends all the way down to the computer hardware and to the data itself. In academic research, we frequently ask questions that already-available datasets and ready-made software tools cannot answer. Therefore, the production of data-driven research rarely involves plugging a ready-made dataset into analytics software that returns a publishable result. Rather, data-driven research often requires that researchers employ an inventive maker approach when investigating novel questions with previously unexplored data.

Like many other fields, digital rhetoricians have been exploring the value of Big Data (data science) methods for our own research practices. While this is due, in part, to the broad influence of the digital humanities (Ridolfo, Jockers), it also stems from the long-standing interdisciplinary nature of the research questions that emerge in digital rhetoric (Eyman). Rhetoric and writing studies also have a long history of employing a maker approach to open new possibilities for the work we pursue (LeBlanc). More recently, projects like the Writing Studies Tree, Faciloscope, Hedge-O-Matic, and MassMine exemplify the vast potential for the type of work emerging from the intersection of data-driven methods and a maker research orientation.

This chapter extends this maker orientation into the realm of physical computing, and demonstrates how an inexpensive Raspberry Pi computer (~$35) can address key hardware and workflow issues for long-term data collection projects. As the co-creators of MassMine, the authors of this chapter have been developing tools that reduce the learning curve for scholars who need to collect novel datasets from digital sources. To date, our work has focused primarily on the accessibility of programming practices (software) associated with digital data collection and web scraping. Yet, from a hardware perspective, long-term data collection activities—those which are needed to create large (“Big”) research-quality datasets—usually require a computer or cloud server to run for days, weeks, months, and potentially, years. As computing technology continues to get smaller and more affordable, tiny computers and microprocessors—like the Arduino and Raspberry Pi—open many possibilities for maker projects that utilize the modularity, portability, and efficiency of these devices.

Efficiency, in particular, can have many meanings for Big Data: computational efficiency (how many steps does it take for a given program to complete its task?), processing efficiency (how many processors are necessary or available for completing the steps of a program?), temporal efficiency (how long will it take for a program to complete its task?), power efficiency (how much electricity is needed to power the computer processors?), and work efficiency (what are the pragmatic restrictions for completing everything involved in a project?). While these are among the more important efficiency considerations for data analytics work, the problems they pose often result in competing and contradictory answers. For example, programs that use multiple processors simultaneously usually produce the most efficient results when processing a large dataset, but these types of programs can be very difficult to design. A programmer could take 2 weeks to design an advanced program that uses multiple processors and runs 200% faster than a more simplified program that already exists. However, if the program that already exists finishes its task in 1 week’s time, then the 2 weeks required to develop the advanced program would result in at least 1 week of wasted effort. Still, other considerations can change how we judge efficiency for this example. If the newly designed program will be used on a regular basis in the future—saving an accumulated amount of time as the new program is reused indefinitely—then the future time savings justify the additional effort. In other words, there are many competing issues involved in trying to efficiently collect and process data for research.
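This development-versus-runtime trade-off can be reduced to a simple break-even calculation. The sketch below is illustrative only: the function name and numbers are our hypothetical placeholders, not figures from any particular project, and "200% faster" is read here as a 3x speedup (one-third the original runtime).

```python
def break_even_runs(dev_time, old_runtime, new_runtime):
    """Number of runs after which a faster program repays its
    development cost. All times share one unit (e.g., weeks)."""
    savings_per_run = old_runtime - new_runtime
    if savings_per_run <= 0:
        return float("inf")  # the "faster" program never pays off
    return dev_time / savings_per_run

# The example from the text: 2 weeks of development for a program
# that is 200% faster (runtime of 1/3 week) than an existing
# program that finishes in 1 week. Pays off after about 3 runs.
print(break_even_runs(dev_time=2, old_runtime=1, new_runtime=1 / 3))
```

Run once, the new program wastes effort; reused regularly, it pays for itself after roughly three runs, which is the intuition behind judging efficiency against a project's full lifespan.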

For many researchers, the issues that affect the efficiency and sustainability of a research project are often more pragmatic in nature. For example, when collecting Twitter data for our Digital Humanities Quarterly article, “Attention Ecology: Trend Circulation and the Virality Threshold” (co-authored with Sean Morey), we tracked 17,343 unique trends over seventy-four days—accessing Twitter’s Application Programming Interface (API) every 5 minutes (the maximum allowed by Twitter) to collect and archive data. The software used to access Twitter and manage the data collection was an early version of MassMine. While MassMine ran without any problems throughout the entire project, our team ran into multiple pragmatic issues with our hardware setup. When we first started to collect data, MassMine was running in the background on one of our personal laptops. Because MassMine uses limited system resources, it was able to collect data without negatively affecting the performance of other software on the laptop. However, three glaring issues were revealed a couple of days into the initial data collection: (1) the laptop could not be reset or turned off without having to restart the data collection, (2) MassMine needed to access the internet every 5 minutes to collect the maximum data allowed by Twitter, requiring a constant internet connection, and (3) other people were no longer able to borrow or use the computer, to ensure that the data collection was not accidentally shut down. While these issues may seem trivial, on a long enough timeline even the most trivial of issues can disrupt the feasibility of a research project. Given our intention to track Twitter trends for multiple months, we decided that a dedicated computer was a more feasible solution.
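The polling pattern described above—query an API on a fixed interval and archive each response locally—can be sketched in a few lines. This is a minimal illustration, not MassMine's implementation: the `fetch_trends()` stub, the archive file name, and the record format are all hypothetical, and a real project would call the Twitter API (or a tool like MassMine) in place of the stub.

```python
import json
import time
from datetime import datetime, timezone

ARCHIVE = "trends.jsonl"   # hypothetical archive: one JSON record per line
INTERVAL = 5 * 60          # the 5-minute window mentioned above, in seconds

def fetch_trends():
    """Placeholder standing in for a real API call."""
    return {"trends": ["#example"]}

def collect(iterations=None):
    """Poll the API every INTERVAL seconds, appending each response
    to the archive. Run with iterations=None to collect indefinitely."""
    count = 0
    while iterations is None or count < iterations:
        record = {
            "collected_at": datetime.now(timezone.utc).isoformat(),
            "data": fetch_trends(),
        }
        # Append-only writes mean a crash loses at most one cycle,
        # and a restarted collector resumes without clobbering data.
        with open(ARCHIVE, "a") as fh:
            fh.write(json.dumps(record) + "\n")
        count += 1
        if iterations is None or count < iterations:
            time.sleep(INTERVAL)
```

The append-only archive is the design choice that matters here: it is what makes the laptop-reboot problem a recoverable annoyance rather than a lost dataset.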

There are other approaches that we could have used to collect data with MassMine, like using a university research computer cluster (supercomputer) or running a cloud server—both of which are possible using MassMine. However, these options are usually available to only the most advanced researchers who have trained to use these systems. Therefore, we developed a research workflow to use a Raspberry Pi as a data collection server, and we have released a new version of MassMine to specifically support this workflow. This is similar, in many ways, to the pragmatic workflow developed for the “Attention Ecology” article mentioned in the previous paragraph, except that it additionally takes advantage of the resource- and space-saving efficiencies of the Raspberry Pi. While there are many scholarly benefits to a maker research orientation, we hope this work opens new possibilities for applying data-driven methods to well-established rhetoric and writing theory, as well as encouraging new research by enabling the collection of exigent and novel datasets.
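One practical detail of running a dedicated collection server such as a Raspberry Pi is keeping the collector alive across crashes and transient network outages. The supervisor loop below is a hedged sketch of that idea: the `collector.py` command is a hypothetical placeholder (not MassMine's actual command line), and production setups might instead rely on the operating system's own service manager.

```python
import subprocess
import time

def supervise(command, max_restarts=None, restart_delay=30):
    """Run a collector process and restart it whenever it exits
    abnormally. Returns the number of restarts performed."""
    restarts = 0
    while True:
        result = subprocess.run(command)
        if result.returncode == 0:
            break  # clean exit: stop supervising
        if max_restarts is not None and restarts >= max_restarts:
            break  # give up after too many failures
        restarts += 1
        time.sleep(restart_delay)
    return restarts

# Hypothetical usage on the Pi (collector.py stands in for the
# actual data collection process):
# supervise(["python3", "collector.py"])
```

On a low-power device left running for weeks or months, this kind of automatic restart is what turns an always-on computer into a dependable data collection appliance.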

This webtext is split into two halves. The “Overview,” along with the “Comparison” page of this webtext, addresses the efficiency and sustainability of the Raspberry Pi for long-term data collection projects. In the “Comparison” section that follows, we analyze how a Raspberry Pi compares to a typical middle-tier personal computer, showing how the Raspberry Pi provides significant resource savings. The second half of this webtext includes the “Raspberry Pi Server” and “Using MassMine” pages. From a maker perspective, these last two sections provide readers with everything they need to set up their own Raspberry Pi and collect data using MassMine.

*Beveridge and Van Horn have contributed equally to the research and writing for this chapter.