Dealing with big data: The case of Twitter

E. Tjong Kim Sang, A. van den Bosch

Research output: Contribution to journal/periodical, Article, Scientific, peer-review

50 Citations (Scopus)


As data sets keep growing, computational linguists are experiencing more big data problems: challenging demands on storage and processing caused by very large data sets. An example of this is dealing with social media data: including metadata, the messages of the social media site Twitter in 2012 comprise more than 250 terabytes of structured text. Handling data volumes like this requires parallel computing architectures with appropriate software tools. In this paper we present our experiences in working with such a big data set, a collection of two billion Dutch tweets. We show how we collected and stored the data. Next we deal with searching in the data using the Hadoop framework and visualizing search results. In order to determine the usefulness of this tweet analysis resource, we have performed three case studies based on the data: relating word frequency to real-life events, finding words related to a topic, and gathering information about conversations. The three case studies are presented in this paper. Access to this current and expanding tweet data set is offered via the website
Original language: English
Pages (from-to): 121-134
Number of pages: 14
Journal: Computational Linguistics in the Netherlands Journal
Issue number: 12/2013
Publication status: Published - 2013

