Methods for Partitioning the Dialect Continuum

Activity: Talk or presentation (Academic)

Description

Dialectology has long employed the practice of dividing dialect continua into distinct areas, often visualized through maps that differentiate these regions using lines or color coding. Notable examples include maps by Te Winkel (1901) and Daan & Blok (1969), which highlight the various dialect areas within the Dutch continuum. These straightforward visual representations make the maps easily accessible and interpretable by a broad audience.
In this paper we consider several alternatives for creating similar maps dialectometrically. As a case study we use data for 166 items from 50 locations in the Series of Dutch Dialect Atlases. We calculated aggregated binary item distances and then applied five methods for partitioning the 50 varieties.
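As a minimal sketch of what an aggregated binary item distance could look like: assume each location's response to each item is coded as an integer variant label, score an item 0 when two locations share the variant and 1 otherwise, and average over the items. The integer coding, the averaging, and the name `aggregated_binary_distances` are illustrative assumptions, since the abstract does not spell out the computation.

```python
import numpy as np

def aggregated_binary_distances(variants):
    """Share of items on which two locations use different variants.

    variants: integer array of shape (n_locations, n_items). The integer
    coding of variants is an assumption made for this sketch.
    """
    variants = np.asarray(variants)
    # differs[i, j, k] is True when locations i and j disagree on item k
    differs = variants[:, None, :] != variants[None, :, :]
    return differs.mean(axis=2)

# Toy data: 3 locations, 3 items
toy = [[0, 1, 2],
       [0, 1, 3],
       [4, 5, 3]]
D = aggregated_binary_distances(toy)
```

Here locations 0 and 1 differ on one item out of three, so their distance is 1/3, while locations 0 and 2 differ on every item and are at distance 1.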
Method 1. A dendrogram is created by hierarchical cluster analysis of the aggregated distances among the local dialects. Areas are derived by drawing a vertical line through the dendrogram and counting the horizontal lines that it crosses. The local dialects connected to each crossed horizontal line belong to the same cluster (or group, or area). The vertical line is drawn between the two successive nodes that are farthest apart.
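The cutting rule of Method 1 can be sketched with SciPy: sort the merge heights of the dendrogram, find the largest gap between two successive merges, and cut in the middle of that gap. The average-linkage setting and the function name `largest_gap_partition` are assumptions for illustration; the abstract does not say which linkage was used.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def largest_gap_partition(D, method="average"):
    """Cut a dendrogram between the two successive merges that lie farthest apart.

    Needs at least three objects, so that there are two merges to compare.
    """
    Z = linkage(squareform(D, checks=False), method=method)
    heights = Z[:, 2]                    # merge heights, in increasing order
    k = int(np.argmax(np.diff(heights))) # index of the largest gap
    cut = (heights[k] + heights[k + 1]) / 2
    return fcluster(Z, t=cut, criterion="distance")

# Toy distance matrix: two tight pairs that are far from each other
D = np.array([[0.0, 0.1, 0.9, 0.9],
              [0.1, 0.0, 0.9, 0.9],
              [0.9, 0.9, 0.0, 0.1],
              [0.9, 0.9, 0.1, 0.0]])
labels = largest_gap_partition(D)
```

On the toy matrix the merge heights are 0.1, 0.1 and 0.9, so the cut falls at 0.5 and the four objects split into two groups of two.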
Method 2. This method uses bootstrap clustering to find dialect groups. It involves resampling data, performing hierarchical clustering to identify natural groups with the elbow method, and counting how often dialects co-occur in the same group. Local dialects are marked as connected if they appear in the same group in over 95% of iterations, resulting in networks that represent dialect groupings (Heeringa 2017).
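A rough sketch of the bootstrap procedure, under several assumptions: items are resampled with replacement, distances are recomputed as the share of differing items, and a largest-gap cut of an average-linkage dendrogram stands in for the elbow method, whose exact form the abstract does not give. The function names, the number of iterations, and the toy data are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def binary_distances(variants):
    # Share of items on which two locations use different variants
    return (variants[:, None, :] != variants[None, :, :]).mean(axis=2)

def largest_gap_labels(D):
    # Stand-in for the elbow method: cut at the largest gap in merge heights
    Z = linkage(squareform(D, checks=False), method="average")
    heights = Z[:, 2]
    k = int(np.argmax(np.diff(heights)))
    return fcluster(Z, t=(heights[k] + heights[k + 1]) / 2, criterion="distance")

def bootstrap_links(variants, n_iter=200, threshold=0.95, seed=0):
    """Boolean matrix: True where two locations share a group in > threshold of runs."""
    variants = np.asarray(variants)
    rng = np.random.default_rng(seed)
    n_loc, n_items = variants.shape
    together = np.zeros((n_loc, n_loc))
    for _ in range(n_iter):
        cols = rng.integers(0, n_items, size=n_items)  # resample items with replacement
        labels = largest_gap_labels(binary_distances(variants[:, cols]))
        together += labels[:, None] == labels[None, :]
    return together / n_iter > threshold

# Toy data: 6 locations, 30 items; locations 0-2 and 3-5 form two groups,
# and every location has one idiosyncratic variant of its own
toy = np.repeat([[0], [0], [0], [1], [1], [1]], 30, axis=1)
toy[np.arange(6), np.arange(6)] = 10 + np.arange(6)
links = bootstrap_links(toy)
```

The returned matrix can be read as the adjacency matrix of a network in which connected locations form a stable dialect group.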
Method 3. Similar to Method 2, but instead of resampling the data, noise is added to the distances.
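Clustering with noise can be sketched in the same spirit: perturb the distances, recluster, and count how often two locations end up together. The Gaussian noise, its scale (a fraction of the standard deviation of the distances), the largest-gap cut, and the name `noisy_links` are all assumptions made for illustration, not details taken from the abstract.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def noisy_links(D, n_iter=200, noise_scale=0.2, threshold=0.95, seed=0):
    """Boolean matrix: True where two objects share a group in > threshold of runs."""
    rng = np.random.default_rng(seed)
    condensed = squareform(np.asarray(D, dtype=float), checks=False)
    sd = condensed.std()
    n = np.asarray(D).shape[0]
    together = np.zeros((n, n))
    for _ in range(n_iter):
        # Perturb every pairwise distance; clip so distances stay non-negative
        noise = rng.normal(0.0, noise_scale * sd, size=condensed.shape)
        noisy = np.clip(condensed + noise, 0.0, None)
        Z = linkage(noisy, method="average")
        heights = Z[:, 2]
        k = int(np.argmax(np.diff(heights)))  # cut at the largest gap
        labels = fcluster(Z, t=(heights[k] + heights[k + 1]) / 2,
                          criterion="distance")
        together += labels[:, None] == labels[None, :]
    return together / n_iter > threshold

# Toy distances: two groups of three, 0.1 within a group and 0.9 between groups
group = np.array([0, 0, 0, 1, 1, 1])
D = np.where(group[:, None] == group[None, :], 0.1, 0.9)
np.fill_diagonal(D, 0.0)
links = noisy_links(D)
```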
Method 4. Affinity Propagation (AP) is a clustering algorithm that selects representative data points as exemplars and groups other data points into clusters based on their similarity to these exemplars. Exemplars are actual points from the dataset. In contrast to k-means there is no need to specify the number of clusters in advance (Frey & Dueck 2007).
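For illustration, scikit-learn's `AffinityPropagation` can be run on a precomputed similarity matrix. In this sketch the similarities are taken to be negated distances and the preference is left at the library default (the median similarity); both are assumptions, as is the toy distance matrix, which is built from made-up positions on a line.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Toy distances: two well-separated groups of three objects
positions = np.array([0.0, 0.1, 0.25, 5.0, 5.1, 5.3])
D = np.abs(positions[:, None] - positions[None, :])

# Affinity propagation expects similarities, so distances are negated
ap = AffinityPropagation(affinity="precomputed", random_state=0)
labels = ap.fit_predict(-D)

# Indices of the exemplars; each one is an object from the data set itself
exemplars = ap.cluster_centers_indices_
```

The number of clusters is not passed in; it follows from the similarities and the preference value.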
Method 5. DBSCAN is an algorithm that groups together points that are closely packed while marking points in low-density regions as noise (Ester et al. 1996). HDBSCAN is generally considered more robust than DBSCAN, particularly in handling datasets with varying densities and noisy data (McInnes et al. 2017). HDBSCAN only requires the user to specify the minimum number of points necessary to form a cluster.
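A small sketch of density-based clustering on a precomputed distance matrix, using scikit-learn's `DBSCAN`. The `eps` and `min_samples` values and the toy positions are assumptions chosen so that the example has two dense groups and one isolated point.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy distances: two dense groups of three objects and one isolated object
positions = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 20.0])
D = np.abs(positions[:, None] - positions[None, :])

# eps: neighbourhood radius; min_samples: neighbours (including the point
# itself) needed for a point to count as a core point
db = DBSCAN(eps=0.3, min_samples=3, metric="precomputed")
labels = db.fit_predict(D)  # noise points receive the label -1
```

The isolated object is labelled -1, that is, it is assigned to no cluster. Recent scikit-learn releases (1.3 and later) also provide `sklearn.cluster.HDBSCAN`, which takes `min_cluster_size` in place of `eps`.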
We evaluate the results by comparing the partitions with the original distance measurements using the Silhouette score (Rousseeuw 1987). Figure 2 shows that partitions with a higher Silhouette score reflect the beam map more faithfully.
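The Silhouette score can be computed directly from a distance matrix and a partition. For each object i, a(i) is its mean distance to the other members of its own cluster, b(i) is the smallest mean distance to the members of any other cluster, and s(i) = (b(i) - a(i)) / max(a(i), b(i)); the score is the mean of s(i). The sketch below assumes at least two clusters and no noise labels; the function name and the toy data are illustrative.

```python
import numpy as np

def silhouette_from_distances(D, labels):
    """Mean silhouette width for a partition, given pairwise distances."""
    D = np.asarray(D, dtype=float)
    labels = np.asarray(labels)
    clusters = set(labels.tolist())
    scores = np.zeros(len(labels))
    for i in range(len(labels)):
        same = labels == labels[i]
        same[i] = False                  # exclude the object itself
        if not same.any():
            continue                     # singleton cluster: s(i) = 0
        a = D[i, same].mean()
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return scores.mean()

# Toy distances: two tight pairs that are far from each other
D = np.array([[0.0, 0.1, 0.9, 0.9],
              [0.1, 0.0, 0.9, 0.9],
              [0.9, 0.9, 0.0, 0.1],
              [0.9, 0.9, 0.1, 0.0]])
good = silhouette_from_distances(D, [0, 0, 1, 1])  # matches the structure
bad = silhouette_from_distances(D, [0, 1, 0, 1])   # cuts across the pairs
```

The partition that follows the structure in the distances scores 8/9, while the one that cuts across it gets a negative score.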
Period: 24 Apr 2025
Event title: DiaClas. Dialect classification - past, present and future
Event type: Conference
Location: Ljubljana, Slovenia
Degree of Recognition: International

Keywords

  • dialectometry
  • dialect classification
  • partitioning
  • cluster analysis
  • bootstrap clustering
  • clustering with noise
  • affinity propagation
  • DBSCAN
  • Silhouette score