Dealing with big data: The case of Twitter

E. Tjong Kim Sang, A. van den Bosch

Research output: Contribution to journal/periodical, Article, Academic, peer-reviewed

51 Citations (Scopus)


As data sets keep growing, computational linguists are experiencing more big data problems: challenging demands on storage and processing caused by very large data sets. An example of this is dealing with social media data: including metadata, the messages of the social media site Twitter in 2012 comprise more than 250 terabytes of structured text. Handling data volumes like this requires parallel computing architectures with appropriate software tools. In this paper we present our experiences in working with such a big data set, a collection of two billion Dutch tweets. We show how we collected and stored the data. Next we deal with searching in the data using the Hadoop framework and visualizing search results. In order to determine the usefulness of this tweet analysis resource, we have performed three case studies based on the data: relating word frequency to real-life events, finding words related to a topic, and gathering information about conversations. The three case studies are presented in this paper. Access to this current and expanding tweet data set is offered via the website
Original language: English
Pages (from-to): 121-134
Number of pages: 14
Journal: Computational Linguistics in the Netherlands Journal
Journal issue: 12/2013
Status: Published - 2013

