K-means Graphs

If you plot the distribution of the data, you can see with your own eyes how many groups the data could be divided into. From the computer's point of view, however, it is hard to tell how many groups yield the best clustering. And when the data are scattered, even a human may be unsure whether to split them into two groups or three.

In this post we look at how to determine the optimal number of clusters when partitioning data with k-means clustering.

Two methods are commonly used to determine the optimal number of clusters:

  • the elbow method
  • the silhouette method

The Elbow Method

In [Part 30] we said that k-means clustering chooses the cluster centers so that the within-cluster sum of squared errors (SSE) is minimized. If the SSE computed with two clusters is smaller than the SSE computed with one cluster, we can already guess that two clusters fit the data better than one.

Continuing in this way, we increase the number of clusters and plot the resulting SSE values. The SSE keeps decreasing, but at some point the rate of decrease drops sharply; the bend in the curve, the part that looks like an elbow, marks the optimal number of clusters we are after.

Add the following code at the bottom of skl_clusterdata.py from [Part 30].
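The original code listing is not reproduced here; the following is a minimal sketch of what such an elbow() function could look like, assuming X is the sample data already prepared in skl_clusterdata.py (the make_blobs fallback below is only a stand-in for that data).

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def elbow(X):
    sse = []
    for k in range(1, 11):                          # try 1 to 10 clusters
        km = KMeans(n_clusters=k, init='k-means++', random_state=0)
        km.fit(X)
        sse.append(km.inertia_)                     # within-cluster SSE for this k
    plt.plot(range(1, 11), sse, marker='o')
    plt.xlabel('Number of clusters')
    plt.ylabel('SSE')
    plt.show()

if __name__ == '__main__':
    # stand-in for the sample data created earlier in skl_clusterdata.py
    from sklearn.datasets import make_blobs
    X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)
    elbow(X)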

elbow(X) is a function that plots the SSE of the data X for each number of clusters. In the code, km.inertia_ is the SSE computed by k-means clustering.

Running the script with this code added displays the SSE graph.

The graph shows that the elbow occurs at three clusters. The optimal number of clusters for this data is therefore three.

The Silhouette Method

The silhouette method quantifies the quality of a clustering. The silhouette coefficient s(i) of the i-th data point x(i) is defined as

s(i) = (b(i) − a(i)) / max(a(i), b(i))

Here a(i) measures the cohesion within a cluster: it is the average distance between x(i) and all other data points in the same cluster. The smaller this distance, the higher the cohesion.

b(i) measures the separation between clusters: it is the average distance between x(i) and all data points in the nearest neighboring cluster.

If the number of clusters is well chosen, b(i) is large and a(i) is small, so s(i) is close to 1. Conversely, if the within-cluster cohesion and the between-cluster separation are equal, the silhouette coefficient s(i) is 0; a coefficient of 0 means that splitting the data into clusters is essentially meaningless.

In short, when the number of clusters is optimal, the silhouette coefficient is close to 1.

Let's look at the following code.

skl_silhouette.py
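The original listing is likewise not reproduced here. Below is a minimal sketch of what skl_silhouette.py might contain, based on the walkthrough that follows; the make_blobs data generation is an assumption standing in for the 150 sample points used in [Part 30].

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

def plotSilhouette(X, y_km):
    # unique cluster labels; their count equals the number of clusters
    cluster_labels = np.unique(y_km)
    n_clusters = cluster_labels.shape[0]

    # silhouette coefficient of every data point
    silhouette_vals = silhouette_samples(X, y_km, metric='euclidean')

    y_ax_lower, y_ax_upper = 0, 0
    yticks = []
    for i, c in enumerate(cluster_labels):
        # silhouette values of the points in cluster c, sorted for a clean bar plot
        c_silhouette_vals = silhouette_vals[y_km == c]
        c_silhouette_vals.sort()
        y_ax_upper += len(c_silhouette_vals)
        color = cm.jet(float(i) / n_clusters)        # pick a color from the jet colormap
        plt.barh(range(y_ax_lower, y_ax_upper), c_silhouette_vals,
                 height=1.0, edgecolor='none', color=color)
        yticks.append((y_ax_lower + y_ax_upper) / 2.0)
        y_ax_lower += len(c_silhouette_vals)

    # average silhouette coefficient of all points, drawn as a red dashed line
    silhouette_avg = np.mean(silhouette_vals)
    plt.axvline(silhouette_avg, color='red', linestyle='--')

    plt.yticks(yticks, cluster_labels + 1)
    plt.ylabel('Cluster')
    plt.xlabel('Silhouette coefficient')
    plt.show()

if __name__ == '__main__':
    # 150 sample points in 3 groups, standing in for the data used in [Part 30]
    X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)
    km = KMeans(n_clusters=3, random_state=0)
    y_km = km.fit_predict(X)
    plotSilhouette(X, y_km)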

In this code, plotSilhouette(X, y_km) is a function that takes the data X and the k-means result y_km (computed with some chosen number of clusters) and draws the silhouette coefficient of each data point in every cluster as a horizontal bar chart.

>>> cluster_labels = np.unique(y_km)

This stores the unique values of y_km in a numpy array called cluster_labels. The number of unique values in y_km equals the number of clusters.

>>> n_clusters = cluster_labels.shape[0] 

The number of clusters is stored in n_clusters.

>>> silhouette_vals = silhouette_samples(X, y_km, metric='euclidean')

This computes the silhouette coefficients and stores the result in silhouette_vals.

The for loop in plotSilhouette() draws the silhouette values of the data points in each cluster as horizontal bars. To distinguish the clusters, each one is colored using the jet colormap provided by matplotlib.cm.

For reference, the jet colormap runs from blue to red.

The blue end corresponds to 0.0 and the red end to 1.0; to pick a color in between, pass a real number between 0 and 1.

>>> silhouette_avg = np.mean(silhouette_vals)

>>> plt.axvline(silhouette_avg, color='red', linestyle='--')

This draws the mean silhouette coefficient of all data points as a red dashed vertical line.

skl_silhouette.py takes the 150 sample points from [Part 30], runs k-means clustering with three clusters, and plots the silhouette coefficients.

Running the code displays the resulting silhouette plot.

None of the data points in clusters 1–3 has a silhouette coefficient of 0, and the average silhouette coefficient is above 0.7, so the clustering can safely be considered good.

Now find the following line in skl_silhouette.py and change the n_clusters argument to 2.

>>> km = KMeans(n_clusters=2, random_state=0)

Running the modified code produces a different picture.

Cluster 1 still has fairly good silhouette values, but cluster 2 contains points with silhouette coefficients of 0, and its values stay below 0.6 overall, so the quality is poor.

The k-means clustering of the actual data comes out accordingly.

Naturally, the quality is worse than when the data were split into three clusters.

The elbow method only applies to k-means clustering, whereas the silhouette method can also be applied to other clustering algorithms.

Now that you know how to find the number of clusters that gives the best clustering result, you can apply it in practice.

Introduction

Graph clustering is an important problem in several fields, including physics [1, 2], engineering [3], image processing [4], and the medical [5] and social sciences [6]. A cluster in a graph is a set of nodes that has more connections within the set than outside the set [7].

Clustering can be useful for understanding large networks where the numbers of nodes and edges are too large for a human analyst to examine individually. Dividing the network into separate clusters and examining the content of those clusters and their relations can be more useful for the practitioners than examining the whole network.

The term “graph clustering” can refer to very different types of clustering problems, including network community detection [7,8,9,10,11,12,13], graph clustering [14], graph partitioning [4, 15, 16], graph set clustering [17], and graph-based clustering. These terms are often used interchangeably, which can cause confusion. They can refer to any one of the following cases:

  • clustering any kind of data by first converting the whole dataset into a graph [14];

  • grouping the nodes of a single graph into distinct clusters when the number of clusters is controlled by the user [4];

  • finding clusters in the graph without any control on the number of clusters [7,8,9,10,11,12,13];

  • clustering a set of graphs where each data object is a graph of its own [18, 19]; and

  • separating the graph into clusters while constraining the size of the clusters [15].

In this work, we focus on producing a disjoint clustering of the nodes of a weighted undirected graph. The number of clusters is controlled by the user.

Algorithms use different strategies for the clustering. Divisive or cut-based methods [4, 20, 21] split a graph recursively into sub-networks until some stopping condition is met. Agglomerative methods [9, 22, 23] start by placing each node in its own cluster and then merging the clusters. Iterative algorithms start from some initial clustering, which is then improved via small changes, such as switching an individual node from one cluster to another. Graph growing or seed expansion [11, 24] selects a seed node and then gradually grows a cluster using (for example) best-first-search.

Many algorithms also use a cost function to guide the clustering process. For example, a cost function can be used to select the optimal split operation [4] or best partition for a given node [3]. An algorithm can include several different strategies. The Louvain algorithm [9], for instance, applies both an agglomerative approach and iterative optimization.

Two of the most popular cost functions are modularity [1, 2, 23, 25] and conductance [4, 16]. Modularity has the deficiency that it cannot be directly optimized for a specific number of clusters. It also has a tendency to produce very large clusters, which is known as resolution limit [26]. While conductance can be optimized for a specific number of clusters, it is sensitive to outliers and can produce unbalanced clustering with tiny clusters.

To overcome these problems, we propose two new cost functions called inverse internal weight (IIW) and mean internal weight (MIW). IIW provides more balanced clustering than the alternatives, while MIW can detect dense clusters and is good at disregarding noise and outliers.

In this paper, we also introduce two algorithms to optimize these cost functions. The first, called the K-algorithm, is a direct derivation of the classical k-means algorithm. It applies similar iterative local optimization but without the need to calculate the means. It inherits the properties of k-means clustering in terms of both good local optimization ability and the tendency to get stuck at a local optimum. The second algorithm, called the M-algorithm, gradually improves on the results of the K-algorithm. It finds new local optima, which often provide a better solution. It works by repeatedly merging and splitting random clusters and tuning the results using the K-algorithm. Both algorithms work on all of the discussed cost functions. They can also be applied for any other cost function that satisfies certain criteria (for further discussion of this point, see Sect. 4.1).

We also introduce new graph benchmark datasets that make it easy to visualize our results. We compare the proposed methods against existing state-of-the-art algorithms and cost functions on these datasets.

As a case study, we analyze connections between diseases in a disease co-occurrence network [27], that is, a graph where correlating diseases are connected. We constructed this type of graph based on electronic health care records in the North Karelia region of Finland. We built two separate graphs, one for ICD10 blocks (with 188 nodes) and one for individual ICD10 disease codes (with 644 nodes). These are too large for manual investigation, but the clustering results are suitable for expert analysis. Here, we used clustering to find the most relevant connections from the network. This provides a good overview of the contents and structure of the disease network.

This type of clustering-based analysis tool is currently lacking in both the scientific literature and healthcare practices. Most community detection methods lack a proper way to control the size of the clusters, which often tend to become too large for manual investigation. This is a serious deficiency of the current algorithms and cost functions. Many algorithms are also missing a mechanism to control the number of clusters. In addition, the cost function used in the optimization may lead to very unbalanced clustering containing overly large clusters that are impossible to investigate manually.

Many papers consider it a drawback if the number of clusters needs to be specified by the user [8, 13, 28]. This is undoubtedly true in many scenarios. However, the hidden assumption of this view is that the data have some specific number of clusters and the task of the algorithm is to detect it. This scenario is often not true in real-world datasets. It may be possible to cluster one dataset in several meaningful ways, and the proper number of clusters may depend on the specific needs of the user.

In the disease co-occurrence network analysis, it is desirable to have clusters of roughly the same size, and one way of controlling this is to specify the number of clusters. If the clustering algorithm and cost function produces somewhat balanced clusters, and the goal is to split N nodes into clusters of roughly size n, then the number of clusters can be chosen simply as k = N/n. For example, if a network of 188 diseases is split into 10 parts, then each part will contain roughly 19 diseases—small enough that the contents of the cluster can be investigated manually (e.g., in the form of a similarity or correlation matrix).

The proposed clustering algorithms are able to work in this scenario. The clustering results reveal many interesting properties of the network. For example, we can see that mental health diagnoses are so connected to diagnoses related to alcohol and drug use that they form one cluster.

In summary, our work has the following contributions:

  • We propose two new graph clustering algorithms. The algorithms work with several different cost functions and allow running time to be adjusted in a flexible manner to ensure an acceptable compromise between speed and quality.

  • We propose two new cost functions: IIW, which provides more balanced clustering, and MIW, which detects dense parts as clusters and disregards noise and outliers.

  • We demonstrate the usefulness of the proposed algorithms via a case study of disease co-occurrence analysis.

Clustering cost functions

Many methods have been proposed for estimating clustering quality in the absence of ground truth knowledge (e.g., correct partition labels). Clustering quality functions typically favor good separation of clusters (low Ei) and high internal weight (high Wi), but there are differences in how these factors are emphasized and scaled.

In this section, we review the three cost functions implemented in the proposed algorithm. Two of them (MIW and IIW) are new; the third (conductance) has been studied previously. All of these cost functions are additive, which means that the quality of a clustering is given by the sum of the qualities of the individual clusters [7]. The studied cost functions are summarized in Table 1, and an example is shown in Fig. 1.

Table 1 Clustering cost functions


Fig. 1


Single cluster's contribution to the cost function. Most cost functions are based on internal (Wi), external (Ei), and total (Ti) weights of the clusters. The studied cost functions are additive, which means that the quality of clustering is given by summing up the contributions of individual clusters (and possibly scaling the result with some constant)


Notations

  • N: number of nodes
  • k: number of clusters
  • ni: size of cluster i
  • wij: weight of the edge between nodes i and j
  • Wi: sum of internal weights in cluster i
  • Wij: sum of weights from node j to nodes within cluster i
  • Mi: total weight (mass) of node i
  • M: total weight (mass) of the whole graph
  • Ei: sum of external weights from cluster i
  • Eij = Mj − Wij: external weights from node j to clusters other than i
  • Ti = Ei + Wi: total weight of edges connecting to nodes in cluster i

Conductance

The term conductance has been used in slightly different ways in several studies [7, 12, 29, 30]. In this work, we use a formulation of conductance based on a definition by Leskovec et al. [30]. We define the conductance (in Table 1) of a cluster as the weight of all external edges divided by the total weight of the nodes in the cluster. The sum of the values of individual clusters is normalized by dividing it by the number of clusters.

Minimizing conductance leads to clusters with good separation from the rest of the network (low Ei) and high internal weight (high Wi). Conductance also avoids creating overly small clusters. This can be understood by considering a case wherein a cluster consists of just a single node. Then, Ei = Ti and conductance is 1.0 (worst) for that cluster. In the case of an empty cluster, it would be undefined (0/0), which we interpret in this case as 1.0.
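As a small illustration (not the authors' code), the conductance cost can be computed directly from the external and total weights of the clusters; the 1/k normalization and the 0/0 convention follow the description above.

def conductance_cost(E, T, k):
    # E[i], T[i]: external and total weight of cluster i
    # an empty cluster (T = 0) counts as the worst value 1.0
    per_cluster = [(e / t) if t > 0 else 1.0 for e, t in zip(E, T)]
    return sum(per_cluster) / k          # normalized by the number of clusters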

Mean internal weight

The MIW cost function (Table 1) scales the internal weights Wi by dividing by the cluster size ni. Larger values are considered better. While the unweighted version of the cost function has previously been considered by Yang and Leskovec [31], to the best of our knowledge, it has not yet been used as a target function for optimization.

The cost function favors small dense clusters because the cluster size appears in the denominator. However, if the graph contains large and strongly connected subgraphs (i.e., almost complete graphs), it may favor large clusters instead. As an example of the second case, consider a complete graph of four nodes where all six edge weights have value 1. Splitting this graph into two clusters of size 2 would yield a cost function value of 2(1)/2 + 2(1)/2 = 2, whereas keeping it as one cluster would yield the value 2(1 + 1 + 1 + 1 + 1 + 1)/4 = 3. In other words, keeping one large cluster together with one empty cluster scores better than splitting the graph equally into two clusters. As a result, the cost function may sometimes produce empty clusters.
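The complete-graph example can be checked numerically. This sketch assumes that the internal weight Wi counts each edge twice (once per endpoint), as in Fig. 1, and that an empty cluster contributes zero.

def miw(W, n):
    # W[i], n[i]: internal weight and size of cluster i (larger total is better)
    return sum(w / ni for w, ni in zip(W, n) if ni > 0)

# complete graph of 4 nodes, all 6 edge weights equal to 1
print(miw([2 * 1, 2 * 1], [2, 2]))   # two clusters of size 2 -> 2.0
print(miw([2 * 6, 0], [4, 0]))       # one 4-node cluster + one empty cluster -> 3.0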

Inverse internal weight

The IIW cost function (Table 1) calculates the sum of inverse weights inside each cluster. Smaller values are considered better. The inverse weights are scaled to the range [1,∞] by multiplying them by the mean weight of a perfectly balanced cluster (M/k). In the case of optimal clustering of k completely separated and balanced clusters, all Wi would equal M/k and clustering would take the value 1.0.

There are two reasons for using the inverse weight 1/Wi rather than the mean weight Wi/ni. First, doing so ensures that all nodes will be assigned to a cluster to which they are connected. As an example, consider a case wherein a node is assigned to a small cluster A to which it has no connection, but there exists another large cluster B to which it does have a connection. If the node changes from cluster A to cluster B, WA will remain unchanged, but WB will increase, which will provide a smaller (better) cost value as expected. Mean weight, on the other hand, may do the opposite by moving the node from B to A, even if it has no connection there. This happens when the node has only a weak link to cluster B, but the penalty of the increased cluster size outweighs the effect of the node weight.

Second, inverse weighting favors more balanced clustering. If a node has an equally strong connection c to two clusters A and B with weights WA and WB such that WB > WA, then assigning the node to the smaller cluster A provides a more optimized result. This is because the derivative of f = 1/x is f′ = −1/x², and thus f′(WA) < f′(WB).

The cost function is also guaranteed to provide exactly k clusters for the optimal cost. This can be understood by considering a case wherein one cluster is empty. Since the weight of an empty cluster is Wi = 0, its inverse weight would be infinite. Thus, as long as there exist sufficient data for all k clusters and there exists a clustering with finite cost, an optimal algorithm would lead to a clustering containing k non-empty clusters.
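A sketch of the IIW cost as described above: the inverse weights are scaled by M/k, an empty cluster gets infinite cost, and the small numeric example illustrates why adding a node's weight to the smaller of two clusters lowers the summed inverse weights more. Any additional normalization in Table 1 (not reproduced here) is omitted.

def iiw(W, M, k):
    # W[i]: internal weight of cluster i; M: total weight of the graph (smaller total is better)
    scale = M / k                         # mean weight of a perfectly balanced cluster
    return sum(scale / w if w > 0 else float('inf') for w in W)

# equally strong connection c to a small cluster A and a large cluster B
WA, WB, c = 10.0, 40.0, 2.0
print(1 / (WA + c) + 1 / WB)   # assign to the smaller cluster A -> 0.108...
print(1 / WA + 1 / (WB + c))   # assign to the larger cluster B  -> 0.123...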

Examples of CND, MIW and IIW are provided in Fig. 2. While CND is based on the external links (Ei), the two proposed cost functions rely merely on the within-cluster statistics. This more closely follows the standard k-means cost function (sum of squared errors), which completely ignores between-cluster relations.

Fig. 2


Example of cost calculation of three different cost functions: mean internal weight (MIW), inverse internal weight (IIW), and conductance (CND). Every edge is counted twice, once for each node it connects


Existing algorithms

In this section, we briefly review the most relevant of the existing algorithms. For more extensive reviews, see [7, 8, 32,33,34].

Hierarchical algorithms

Hierarchical algorithms work in either a bottom-up or a top-down manner. Top-down algorithms are often referred to as cut-based methods [4, 20, 21]. They recursively split the graph into sub-networks until some stopping condition is met. Bottom-up methods, also known as agglomerative clustering [9, 22, 23], start by placing each node in its own cluster and then merging the clusters until a similar condition is met.

The Walktrap [22] is an agglomerative method based on the observation that random walks in graphs often become trapped within a single cluster. The probability that two nodes will appear in the same random walk is higher if they are in the same cluster. Walktrap is based on Ward's method, which minimizes squared distances inside the cluster.

The Louvain algorithm [9] is also an agglomerative method, but its goal is to optimize modularity. It starts with each node in its own cluster and then sequentially assigns nodes to the cluster that yields the best improvement in total modularity. This process is iterated until a local optimum is reached (i.e., there is no single move of a node to another cluster that would improve the cost value). After local optimization, it reduces clusters to single nodes and starts iterating the optimization again.

The Vieclus method [35] optimizes the modularity cost function using a combination of a genetic algorithm and local search. It uses clusterings from the Louvain algorithm to initialize the population. It is the current state-of-the-art for optimizing modularity. However, like the Louvain algorithm, it lacks the ability to control the number of clusters.

The NCut method [4] minimizes conductance by formulating graph clustering as an eigenvalue problem on the similarity matrix of the graph nodes. Tabatabaei et al. [16] proposed a faster O(N log2N) algorithm called GANC to optimize the same cost function.

The Sbm_dl method [36] fits the graph to a stochastic block model, aiming to find the most likely model parameters that generate the observed network. It uses a greedy agglomerative heuristic that also tries to detect the correct number of clusters. The implementation also allows minimum and maximum bounds to be given for the number of clusters it returns.

Iterative algorithms

Iterative algorithms start with an initial clustering that is improved by small changes. The changes are typically made on the cluster partitions by moving nodes (individual or small groups of nodes) from one partition to another, aiming to optimize some criterion for clustering fitness. The process stops when no change improves the solution.

An early example of this approach is the Kernighan–Lin algorithm [3], which aims to find an optimal way to cut a graph into two sub-graphs. The two arbitrary initial cluster partitions are improved by finding the two nodes to swap between partitions that lead to the largest improvement in the minimum cut criterion. This is continued until no further improvement is possible.

The Gemsec method embeds graphs into vector space and performs centroid based clustering in the vector space [37]. It combines node embedding cost and sum of squared error clustering cost into the same cost function and then uses gradient based iterative optimization for this cost function.

Balanced clustering

Several methods [38,39,40] aim to achieve a balanced clustering by minimizing the total cut on the graph (Ei) while constraining the maximum cluster size to (1 + ε) times the average cluster size. The constraint parameter is typically a small number (≤ 0.05). Complete balance (ε = 0) has also been considered [40]. KaffpaE is an evolutionary algorithm that optimizes this cost function. It uses a novel combine operator that guarantees that the fitness of the new offspring will be at least as good as that of the best of the parents.

The FluidC algorithm [41] is based on the idea of fluids (communities) interacting by expanding and pushing each other in an environment (graph) until a stable state is reached. It utilizes an update rule that maximizes the number of connections within a cluster scaled by the inverse of the number of vertices composing a cluster. This guarantees k non-empty clusters.

Graph growing and seed expansion

Clusters can also be formed by gradually growing them from a seed node, typically by using breadth-first search. Karypis and Kumar [24] aimed to achieve an equally sized split of the graph by starting from a random node and then expanding the cluster until half of the vertices are included. This is repeated 10 times. The result with the smallest edge cut is selected as an intermediate result which is improved using the Kernighan–Lin algorithm.

Whang et al. [11] aimed to find overlapping clusters using a process which they called seed expansion. They used several heuristics to select good seeds. One of these heuristics selects nodes in the order of node degree, while disregarding the neighbors of the node in subsequent selections. Clusters are then grown from the seeds by greedily optimizing conductance in each step.

K-algorithm and M-algorithm

Although many iterative and hierarchical algorithms for graph clustering exist, a k-means based solution is still missing. The main reason for this is that k-means requires calculating the mean of a cluster, which is seemingly not possible for graphs. However, we implement a sequential variant of k-means that assigns the nodes to the closest cluster without calculating the distance to the mean. This is done using a delta approach, which calculates the change in the cost function value before and after the assignment. The proposed algorithm is called the K-algorithm, as it resembles the k-means algorithm but does not use means. In Sect. 4.4, we also introduce a merge-and-split based M-algorithm, which improves on the results of the K-algorithm.

K-algorithm: greedy local optimization

The K-algorithm (Algorithm 1) starts from an initial solution and then iteratively tunes it toward a better solution. It can be used to improve any initial clustering. Nonetheless, the quality of the results depends heavily on the initial clustering. We, therefore, introduce a density-based initialization method (Sect. 4.3) that is more effective than a simple random partitioning. The effect of the initialization is studied in Sect. 6.3.

The K-algorithm iteratively improves the initial solution by processing the nodes sequentially, in random order (line 5). For each node, the method considers all clusters (line 14). It changes the cluster of the node if this improves the cost function (line 18). The delta calculation (line 15) for the selected cost functions requires recalculating Wxi, the sum of weights from node i to all possible clusters x. This is done by looping through all edges adjacent to node i (line 10) and adding to the weight sum of their cluster.

After all nodes have been processed, the algorithm starts another iteration. The iterations continue until no changes occur and the cost function can no longer be improved by changing the partition of a single node (line 22).

Algorithm 1: The K-algorithm (pseudocode)

In theory, the K-algorithm can work with any cost function. In practice, there are a few constraints. First, the delta of a given node should be calculated quickly without looping through the entire dataset. Second, the optimal value for a cost function should produce k non-empty clusters when k is a parameter of the algorithm.
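Algorithm 1 itself is not reproduced above, so the following rough Python sketch is our interpretation of the main loop described in the text, not the authors' implementation. The graph is assumed to be given as adjacency lists of (neighbor, weight) pairs, delta stands for one of the cost-specific delta functions of Sect. 4.2 (assumed to be minimized), and the bookkeeping of cluster-level sums such as Wx, nx and Tx is left to that function.

import random

def k_algorithm(adj, labels, k, delta):
    # adj[i]: list of (j, w_ij) pairs; labels[i]: current cluster of node i (0..k-1)
    changed = True
    while changed:                                            # repeat until convergence
        changed = False
        for i in random.sample(range(len(adj)), len(adj)):    # nodes in random order
            w_to = [0.0] * k                                  # W_xi: weight from node i to each cluster
            for j, w in adj[i]:
                w_to[labels[j]] += w
            x = labels[i]
            best_delta, best_y = 0.0, x
            for y in range(k):                                # consider every other cluster
                if y != x:
                    d = delta(i, x, y, w_to[x], w_to[y])
                    if d < best_delta:                        # negative delta = improvement
                        best_delta, best_y = d, y
            if best_y != x:
                labels[i] = best_y                            # apply the best move
                changed = True
    return labels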

Cost function delta calculation

A key part of the K-algorithm is finding the partition with the best cost function for a given node (lines 14–18). Finding this partition requires calculating the delta value of moving the node in question from its current partition x to another partition y for all possible partitions.

The delta value can be calculated by creating a new clustering (C⟶C′) where the node j is moved from its old cluster X to a new cluster Y. The delta is then calculated as the difference between these two clusterings:

$$ \Delta f\left( {j,X,Y} \right) = f\left( {C^{\prime}} \right) - f\left( C \right) $$

(1)

However, in practice, this is very slow, since calculating the cost function for an entire dataset requires looping through all edges of the graph. It is therefore better to directly calculate the delta value. Since the cost functions are additive (see Figs. 1, 2), only changes in cluster X, from which the node is removed, and cluster Y, to which it is added, affect the delta:

$$ \Delta f\left( {j,X,Y} \right) = \Delta f\left( {C_{X} } \right) + \Delta f\left( {C_{Y} } \right) $$

(2)

These can be further split into cost before change (CX, CY) and cost after change (CX′, CY′):

$$ \Delta f\left( {j,X,Y} \right) = f\left( {C_{X} ^{\prime}} \right) + f\left( {C_{Y} ^{\prime}} \right) - f\left( {C_{X} } \right) - f\left( {C_{Y} } \right) $$

(3)

These components of the delta are calculated differently for each cost function. In the case of mean weight, they are calculated as follows:

$$ f\left( {C_{x} ^{\prime}} \right) = \frac{{W_{x} - W_{{{\text{xj}}}} }}{{n_{x} - 1}}\;\;f\left( {C_{Y} ^{\prime}} \right) = \frac{{W_{y} + W_{{{\text{yj}}}} }}{{n_{y} + 1}}\;\;f\left( {C_{x} } \right) = \frac{{W_{x} }}{{n_{x} }}\;\;f\left( {C_{Y} } \right) = \frac{{W_{y} }}{{n_{y} }} $$

(4)

The full delta function for mean weight (MIW) is therefore:

$$ \Delta {\text{MIW}}\left( {j,X,Y} \right) = \frac{{\left( {W_{x} - W_{{{\text{xj}}}} } \right)}}{{n_{x} - 1}} + \frac{{\left( {W_{y} + W_{{{\text{yj}}}} } \right)}}{{n_{y} + 1}} - \frac{{W_{x} }}{{n_{x} }} - \frac{{W_{y} }}{{n_{y} }} $$

(5)

Applying a similar derivation in the case of IIW yields the following:

$$ \Delta {\text{IIW}}\left( {j,X,Y} \right) = \frac{1}{{\left( {W_{x} - W_{{{\text{xj}}}} } \right)}} + \frac{1}{{\left( {W_{y} + W_{{{\text{yj}}}} } \right)}} - \frac{1}{{W_{x} }} - \frac{1}{{W_{y} }} $$

(6)

For conductance (CND), where Tx − Wx = Ex, we get:

$$ \Delta {\text{CND}}\left( {j,X,Y} \right) = \frac{{\left( {T_{x} - M_{j} } \right) - \left( {W_{x} - W_{{{\text{xj}}}} } \right)}}{{T_{x} - M_{j} }} + \frac{{\left( {T_{y} + M_{j} } \right) - \left( {W_{y} + W_{{{\text{yj}}}} } \right)}}{{T_{y} + M_{j} }} - \frac{{T_{x} - W_{x} }}{{T_{x} }} - \frac{{T_{y} - W_{y} }}{{T_{y} }} $$

(7)
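The three delta formulas translate directly into code. This sketch follows Eqs. (5)-(7) literally, with the cluster statistics passed in explicitly; edge cases such as removing the last node of a cluster (nx = 1 or Wx = Wxj) are not handled.

def delta_miw(Wx, Wy, nx, ny, Wxj, Wyj):
    # Eq. (5): change in mean internal weight when node j moves from cluster X to Y
    return (Wx - Wxj) / (nx - 1) + (Wy + Wyj) / (ny + 1) - Wx / nx - Wy / ny

def delta_iiw(Wx, Wy, Wxj, Wyj):
    # Eq. (6): change in (unscaled) inverse internal weight
    return 1.0 / (Wx - Wxj) + 1.0 / (Wy + Wyj) - 1.0 / Wx - 1.0 / Wy

def delta_cnd(Tx, Ty, Wx, Wy, Mj, Wxj, Wyj):
    # Eq. (7): change in conductance, using Ex = Tx - Wx
    new_x = ((Tx - Mj) - (Wx - Wxj)) / (Tx - Mj)
    new_y = ((Ty + Mj) - (Wy + Wyj)) / (Ty + Mj)
    return new_x + new_y - (Tx - Wx) / Tx - (Ty - Wy) / Ty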

Density based initialization

The initialization method (Algorithm 2) grows new clusters into dense parts of the graph. The nodes are sorted based on density, and the densest (central) node is chosen as the seed node for the first cluster. A new cluster is then grown to this position by expanding from the seed node (line 10). The next cluster is grown from the second densest node that has not yet been assigned to another cluster.

Algorithm 2: Density-based initialization (pseudocode)

Cluster growing (Algorithm 3) is used both in the density-based initialization and later in the merge-and-split algorithm (Sect. 4.4). It starts by creating a new cluster Ci that consists of a single seed node. New nodes are then merged to it in a best-first search manner. That is, the nodes are sequentially added so that the algorithm always picks the node j that has the highest sum of edge weights (Wij) to the cluster i (variable nodeToCluWeight) and is not yet assigned to another cluster. This continues until a specific cluster size is reached (line 3), or there are no more nodes with Wij > 0 (line 6).
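A rough sketch of this best-first growth (our interpretation of Algorithm 3, not the authors' code): the unassigned node with the largest total edge weight to the growing cluster is added next.

def grow_cluster(adj, seed, target_size, assigned):
    # assigned[i] is True for nodes already claimed by another cluster
    cluster = {seed}
    weight_to = {}                                   # nodeToCluWeight: summed edge weight to the cluster
    def add_edges(i):
        for j, w in adj[i]:
            if j not in cluster and not assigned[j]:
                weight_to[j] = weight_to.get(j, 0.0) + w
    add_edges(seed)
    while len(cluster) < target_size and weight_to:  # stop at target size or when no W_ij > 0 remains
        j = max(weight_to, key=weight_to.get)        # best-first: strongest connection to the cluster
        del weight_to[j]
        cluster.add(j)
        add_edges(j)
    return cluster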

In density-based initialization, the size of the grown cluster is set to 80% of the average cluster size (0.8(N/k)). This means that 20% of the nodes are left without a cluster label. For these, the labels are chosen randomly (Algorithm 2, lines 12–14). The K-algorithm later fine-tunes these to more suitable partitions.

It is sufficient to cover 80% of the dataset with the cluster growing and leave the rest for the K-algorithm to optimize. If the cluster growing were to cover 100% of the dataset, the last clusters would sometimes be too fragmented for the K-algorithm to optimize.

Algorithm 3: Cluster growing (pseudocode)

The initialization method depends on the estimation of node density. This is calculated using Eq. (8) by looping over all connected nodes and multiplying each neighbor node's total weight (Mj) by the weight of the edge (wij) connecting to that neighbor. The idea is that a node can be considered central if it has strong ties to other nodes that are themselves central.

$$ \mathrm{dens}(i) = \sum_{j \in G(i)} w_{ij} M_{j} $$

(8)

Here, G(i) represents the set of nodes to which node i is connected (wij > 0). In our data, the size of G(i) is only a small fraction of the size of the data, N. If G(i) were in the order of O(N) (e.g., in the case of a complete graph), it would increase the time complexity of the initialization to O(N²). In this case, it would make sense to limit G(i) to only the most significant edges.
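A sketch of the density estimate of Eq. (8), assuming the same adjacency-list representation as before and that M[j] holds the total weight (mass) of node j:

def node_densities(adj, M):
    # dens(i) = sum over neighbors j of w_ij * M_j   (Eq. 8)
    return [sum(w * M[j] for j, w in adj[i]) for i in range(len(adj))]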

M-algorithm

In most cases, the K-algorithm clusters the dataset quickly and produces reasonable results. However, it always converges to a local optimum, wherein no single change in a node partition can improve the cost function. This can sometimes be the correct clustering, but often there is room for further improvement. For this reason, we next introduce the M-algorithm, which is based on a merge-and-split strategy.

The M-algorithm (Algorithm 4) repeats the K-algorithm multiple times. It disrupts the stable state produced by the K-algorithm by first merging two random clusters (line 5) and then splitting one random cluster (lines 8–12). This effectively reallocates one cluster into a different part of the data space. The solution is then fine-tuned to a new local optimum using the K-algorithm. If this new solution improves on the previous local optimum (line 15), it is kept as the current solution; otherwise, it is discarded. The merge-and-split process (Fig. 3) is repeated multiple times (line 2). The number of repeats is a user-given parameter that enables a flexible tradeoff between clustering quality and processing time.

Fig. 3


The K-algorithm needs an initial clustering as a starting point. We use density-based initialization (part 1), which grows new clusters from dense parts of the graph (nodes A, B and C). The local optimization of the K-algorithm (part 2) can then fine-tune this initial clustering. The merge-and-split algorithm can further improve the solution by first merging a random pair of clusters (part 3) and then splitting one cluster (part 4) by growing a new cluster from random node D. The results are fine-tuned with the K-algorithm (part 5). The algorithm usually needs to be repeated several times for successful clustering


It is usually good to merge clusters only if there are some links between them. Therefore, we choose the two clusters according to a probability defined by the strength of the connections between them. If the clusters have a strong connection, the merge probability is higher. If there are no connections between them, the merge probability is zero. For a pair of clusters (A, B), the merge probability is defined as:

$$ p(A,B) = \frac{{C_{AB} }}{E} $$

(9)

Here, the variable E is the total sum of weights between all clusters (E = ∑Ei), and CAB is the total sum of weights between clusters A and B. Therefore, summing p(A,B) for all possible pairs yields the value 1.0.

We perform the split by selecting a random cluster and then growing a new cluster from a random node inside that cluster. The size of the new cluster is selected randomly, between 5% and 95% of the size of the original cluster. This sometimes produces unequal splits and therefore offers a better chance of also solving datasets that have clusters of unequal sizes.
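A sketch of how the merging pair could be sampled according to Eq. (9); C is assumed to be a symmetric matrix where C[a][b] holds the total weight between clusters a and b, and at least one connected pair is assumed to exist.

import random

def pick_merge_pair(C):
    # p(A, B) = C_AB / E, where E is the total weight between all cluster pairs
    pairs = [(a, b) for a in range(len(C)) for b in range(a + 1, len(C)) if C[a][b] > 0]
    weights = [C[a][b] for a, b in pairs]
    return random.choices(pairs, weights=weights, k=1)[0]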

Algorithm 4: The M-algorithm (pseudocode)


Time complexity

The time complexity of the K-algorithm (Algorithm 1) can be calculated by analyzing the number of times each line is executed. The first for-loop (line 5) processes all N points. Inside this loop, the recalculation of Wxi loops over all connections in O(|E|/N) steps, and the calculation of the delta value for all clusters takes k steps. The first for-loop therefore takes a total of O(N(k + |E|/N)) steps. The other variables needed by the delta calculation (Wi, Ei) also need to be updated after changing the cluster of a node, but this can be done in O(1) time.

The entire process (lines 3–22) is repeated for I iterations, until the algorithm converges. This makes the total time complexity of the K-algorithm O(IN(k +|E|/N)). I is a small number that ranged from 4 to 78 in our experiments. It increases slightly with dataset size and other properties (see Table 5).

The initialization of the K-algorithm (Algorithm 2) is not a bottleneck in the algorithm. The time complexity of the initialization is k times (line 5) that of growing a new cluster. The time complexity of cluster growing (Algorithm 3) depends on the size of the cluster (line 3), which is O(N/k), and on the average number of edges (line 11) which is O(|E|/N). This makes the time complexity of growing one cluster O(|E|/k) and the initialization as a whole O(|E|).

The time complexity of the M-algorithm (Algorithm 4) is determined by the number of times the K-algorithm is repeated (line 14) and is O(RIN(k + |E|/N)) in total, where R is the number of repeats. The other parts of the algorithm, the merge and split operations, are minor compared with the O(IN(k + |E|/N)) complexity of the K-algorithm. The merge operation (line 5) is trivial and takes O(N/k) time. The splitting is done by cluster growing (line 12), which is O(|E|/k).

The time complexity of both the K- and M-algorithms is therefore almost linear O(N) with respect to the number of nodes. This can be verified in Fig. 5 in the experimental results.

Experimental setup

Datasets

We performed our experiments using three types of datasets (see Table 2):

  • a k-nearest neighbors (kNN) graph of numerical datasets;

  • artificial graphs; and

  • disease comorbidity networks.

Table 2 Datasets


To enable visual inspection of clustering results, we generated kNN graphs with parameter k = 30 from selected numerical 2D datasets taken from the clustering basic benchmark [42]. We used the datasets S1-S4 and Unbalance, which we converted to kNN graphs using the method in [43]. The resulting graphs were used as input for the clustering algorithm.

The distances in the graph are converted to weights (similarities) as follows:

$$ w_{{{\text{xy}}}} = \frac{{{\text{max}}\left( d \right) - d\left( {x,y} \right)}}{{{\text{max}}\left( d \right)}} $$

(10)

where max(d) is the maximum distance in the entire graph.
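A sketch of this conversion, assuming the kNN graph is given as a list of (x, y, distance) edges:

def distances_to_weights(edges):
    # w_xy = (max(d) - d(x, y)) / max(d)   (Eq. 10)
    d_max = max(d for _, _, d in edges)
    return [(x, y, (d_max - d) / d_max) for x, y, d in edges]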

We also generated three additional 2D sets (G3A, G3B, UNI) to illustrate the properties of the cost functions. The sets G3A and G3B contain three separate Gaussian clusters, each of which contains both a very dense and very sparse (low- and high variance) area. The clusters in G3B are more overlapping (less separated) than in G3A. The UNI dataset contains random noise uniformly distributed in the 2D plane.

The icdA and icdB datasets are disease comorbidity networks [27] where selected diseases represent nodes in the graph. The diseases have been recorded using the International Classification of Diseases (ICD10) [44]. In the ICD10, diagnoses are organized in a hierarchy where the first letter of a code refers to the category of the diagnosis. For example, in code F31, F indicates that the diagnosis relates to mental and behavioral disorders. F31 specifies a diagnosis of bipolar affective disorder. Codes are further grouped to form blocks such as F30-F39 for mood [affective] disorders, and F20-F29 for schizophrenia, schizotypal and delusional disorders. We constructed two variants of the network, each corresponding to different abstraction levels. The first network, named icdA, consists of individual diseases (e.g., F31) as the nodes. The second, named icdB, consists of the blocks (e.g., F20-F29).

Two diseases are linked if they co-occur in the same patients with sufficient frequency that they are correlated. We use relative risk [27] to measure the strength of the connection of the diseases. These networks were constructed using data for 51,909 patients. Data for this study were obtained from a regional electronic patient database. We use the first letter of the ICD10 coding scheme as the ground truth for the cluster labels. In the ICD data, most relative risk values are between 0.5 and 5.0 but they can also be over 100. These outliers would dominate the cost function optimization, and for this reason, we normalize them to the range of [0,1] as follows:

$$ w'_{xy} = \frac{w_{xy}}{1 + w_{xy}} $$

(11)

We also generated three series of artificial graph datasets using the weighted network benchmark generation tool proposed by Lancichinetti and Fortunato [10]. One key parameter in this tool is the mixing parameter μt, which controls the proportion of links inside versus between clusters. A μt value of 0 indicates that the clusters are completely separated; 0.5 means that 50% of the links are within the clusters; and 1 indicates that the clusters are completely overlapping. We generated the varMu series by varying μt between 0.60 and 0.80. In the varDeg series, the average degree was varied, and in the varN series, the size of the dataset was varied. For details, see Table 2.

Measuring clustering quality

To evaluate clustering quality with respect to the ground truth, we used two methods: normalized mutual information (NMI) [45] and centroid index (CI) [46]. The NMI values are within the range [0,1], where 1.0 indicates correct clustering. The CI value represents the number of incorrectly detected clusters in the clustering results so that a value of 0 indicates a correct clustering. The CI gives an intuitive cluster-level overview of the correctness of the clustering, whereas NMI provides a more fine-grained, data-level measure.
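NMI is available in scikit-learn; the centroid index is defined in [46] and is not part of standard libraries, so only the NMI part is sketched here with a toy example.

from sklearn.metrics import normalized_mutual_info_score

labels_true = [0, 0, 1, 1, 2, 2]   # ground-truth partition
labels_pred = [1, 1, 0, 0, 2, 2]   # same partition with renamed labels
print(normalized_mutual_info_score(labels_true, labels_pred))   # 1.0: identical up to label renaming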

Compared methods

Table 3 summarizes the algorithms used in our experiments (see Sect. 3). We compare the proposed methods (K-algorithm and M-algorithm) against eight other algorithms. We selected algorithms that (1) took the number of clusters k as a parameter, (2) could cluster weighted networks, (3) were fast enough to run on a dataset of size 6500, and (4) had a publicly available implementation. We found five algorithms (in Table 3) that matched these criteria. We also included the Vieclus and Louvain algorithms, which do not take the number of clusters as a parameter but fit the other criteria. FluidC was targeted for unweighted graphs. We also ran the standard k-means algorithm for datasets with numerical representations. We used the implementations for the methods sbm_dl, gemsec, and fluidc from the CDlib package [47] and Louvain and Walktrap from the iGraph library.

Table 3 Compared graph clustering methods


We mainly used the default parameters for the algorithms. For KaffpaE, we set the balance constraint ε = 20% to match that of the artificial graphs. The Vieclus algorithm required a time limit (seconds) as a parameter which we set according to the formula N/100 (10 s for dataset of size 1000).

Running times were measured on a single core. As the NCut implementation uses parallel processing, its running times are not comparable to the others. The experiments were run on an Ubuntu 20.04 server with 20 processing cores (Intel® Xeon® W-2255 CPU @ 3.70 GHz).

Results

In this section, we examine the experimental results in five different parts. We first visually compare the three cost functions using kNN graph datasets. We then show the numerical evaluation results of the benchmark data with respect to the ground truth. We also study the effect of the initialization on the K-algorithm’s performance. Next, we study how the dataset parameters affect the number of iterations. Finally, we demonstrate the usefulness of the method using disease comorbidity analysis as a case study.

Visual cost function comparison

We visually compared the cost functions to identify characteristics that simple clustering validity functions like NMI may hide. The usefulness of the cost function also depends on the clustering application. Therefore, the choice of a cost function should be left up to the application specialist. Our goal is to provide information that will be useful in making such decisions.

We performed the visual analysis based on kNN graphs of numerical 2D datasets. The graph is first clustered using the proposed method. To ensure that the results were as close to optimal as possible, we ran the M-algorithm for 100,000 splits and merges. The resulting partitions are visualized using the original 2D dataset.

The results in Fig. 4 show that in the two extreme cases—that is, uniformly distributed random data and three clearly separate clusters (with the correct k parameter)—there are no notable differences between cost functions. The differences become visible in cases between the extremes, when there exists (a) overlap between clusters, (b) random noise in the clusters, or (c) a different parameter k compared with the ground truth.

Fig. 4


Comparison of the three cost functions: conductance (CND), inverse internal weight (IIW), and mean internal weight (MIW). Three different datasets are used: three separate clusters (G3A), three sparse clusters with a dense middle area (G3B), and uniformly distributed random noise (UNI). We vary the algorithm parameter k (the number of clusters) to show how cost functions behave when the parameter is different from ground truth. Colors are determined by the size of the cluster. Black indicates the largest cluster, followed by red, blue, yellow, and green (in descending order)


The MIW cost function tends to form small compact clusters composed of nodes in the dense area. These clusters contribute most to the cost function value. Optimizing the cost function also forms one garbage cluster containing all the nodes in sparse areas. This can be perceived either as an error or as a type of outlier detection that filters out less relevant information. When k = 5, the other cost functions lead to splitting the denser areas instead of the sparse areas.

Optimizing IIW leads to more balanced clustering. This can be seen especially in the case of Gaussian clusters with k = 2, when conductance and MIW both split the three clusters into one part consisting of two clusters and one part containing only one cluster. IIW, on the other hand, splits the dataset into two equally sized parts containing 1.5 clusters. Otherwise, the results are similar to those for conductance.

In the case of completely uniform data, we would expect clusters of equal size, since the dataset does not have any visible clustering structure. However, MIW results in unbalanced cluster sizes, where the largest partition is double the size of the smallest partition. Conductance also provides noticeably unbalanced results. IIW results in the most balanced clustering.

In summary, we recommend that practitioners select their cost function depending on the application:

  • IIW is preferable when more balanced clustering is desired.

  • MIW is preferable for detecting dense clusters and disregarding noise and outliers. It tends to form one large garbage cluster from the noise points. Therefore, the number of clusters should be set to k + 1 if k true clusters are required.

  • Conductance generally worked well in many of the tested cases, but it may lead to very unbalanced clustering if the data have isolated tiny clusters with external weight Ei close to 0.

Algorithm benchmark

In this section, we present a clustering quality comparison of the proposed method (K-algorithm and M-algorithm) with eight other algorithms (Table 3). We ran the M-algorithm for R = 100 merge-and-split iterations. We ran each of the tests 10 times and present the mean values of the results.

The results are reported in Fig. 5 and Table 4. The prefixes K- and M- refer to the K- and M-algorithms. Conductance, IIW, and MIW are referred to by the suffixes -CND, -INV, and -MIW, respectively.

Fig. 5


Results for varying number of nodes N (varN) are shown on top, average degree |E|/N (varDeg) in the middle, and mixing parameter μt (varMu) on the bottom. Measurement of quality (NMI) is on the left, and time (seconds) is on the right. In the case of the K- and M-algorithms, we used the IIW cost function. The NCut runtimes are not comparable, as the implementation used parallel processing


Table 4 Summary of results: kNN datasets on the left, comorbidity network in the middle, and artificial graphs on the right


In the varN tests (Fig. 5), where data size is varied, the K-algorithm, M-algorithm, and Gemsec clearly stood out from the rest. However, Gemsec’s running time was an order of magnitude greater than the other methods, and it could not be run for the largest datasets. The M-algorithm was the only algorithm that could solve the largest dataset (1,024,000 nodes). The running times for most of the algorithms seem close to linear.

When varying the number of edges (varDeg; Fig. 5), the M-algorithm, NCut, and Vieclus clearly perform better than the others. A smaller number of edges also decreases the number of K-algorithm iterations (Table 5), which likely contributes to the K-algorithm’s poor performance compared with the varN tests. In most algorithms, running times increase slightly with larger |E|/N, but in the M- and K-algorithms, this increase is greater.

Table 5 Effect of dataset properties on number of K-algorithm iterations by varying the parameters N, µt, and |E|/N (Table 1)


In the case of the disease co-occurrence network datasets (icdA and icdB), the M-algorithm with IIW clearly performed better than the other methods. However, the ground truth here is the existing human-made disease-level categorization. The algorithms might find a different (possibly better) clustering than the human categorization. A CI = 0 result is therefore not expected for the icdA and icdB datasets.

No algorithm performed best on all datasets. However, the M-algorithm had the best overall performance in terms of quality, as shown in Table 4 (CI = 1.7, NMI = 0.78). NCut (CI = 2.7, NMI = 0.75) was second and the K-algorithm (CI = 2.7, NMI = 0.67) third. KaffpaE (CI = 4.0, NMI = 0.66) and Walktrap (CI = 3.5, NMI = 0.63) performed well on the kNN-based datasets (S1–G3) but had worse results for the Lancichinetti benchmark (Fig. 5). Vieclus (CI = 78.7, NMI = 0.61), which also determined the number of clusters by optimizing modularity, performed well on the Lancichinetti benchmark but not as well for the kNN graphs. It obtained the right number of clusters for many datasets but failed in other cases. It output 796 clusters (30 in the ground truth) for the dataset of size 1,024,000. This result highlights the need to be able to control the number of clusters.

The main weakness of both the M-algorithm and the IIW cost function is data with large cluster-size unbalance, as in unb, which has 20:1 unbalanced and well-separated clusters. For this type of data, the hierarchical approach performs better, and NCut and Walktrap achieved the best results. However, in additional tests, we were also able to solve this dataset with the M-algorithm and the conductance cost function if the number of repeats was increased from 100 to 1000.

The running times in Fig. 5 show that the K-algorithm was the second fastest (after the Louvain algorithm) in most cases. The quality of its results was also good. Table 4 shows that the K-algorithm was second best (tied with NCut) according to CI or third best according to NMI. The M-algorithm also improved on the K-algorithm’s results for almost all datasets except the easiest sets in the varN, varDeg, and varMu series and s1–s3, where the K-algorithm’s results are also good (NMI within 0.01, CI close to 0.0).

Effect of initialization

In this section, we study the impact of the initialization method on the K-algorithm and M-algorithm. We consider two different initialization methods: the density-based initialization described in Algorithm 2, and random partition initialization, which assigns a random cluster label to each node. For both cases, we record the NMI value (1) after initialization, (2) after K-algorithm convergence, and (3) after repeating the M-algorithm 100 times. The results are summarized in Fig. 6.

Fig. 6


Impact of initialization on the algorithm results. The area shown for the M-algorithm marks its improvement in NMI over the K-algorithm (and likewise for the K-algorithm over the initialization)


The results show that the initialization method hardly affects the results of the M-algorithm at all, as the corresponding results with both initializations are virtually the same. For the K-algorithm, however, the results are much worse for several datasets (S1, S4, unb, 20) when random partition initialization is used. The average degree of the graph seems to be one contributing factor: the K-algorithm works better even with a bad initialization if the number of neighbors in the graph is large. This was the case with dataset icdA (|E|/N = 53) and the artificial graphs with |E|/N ≥ 26.

The above observations are in line with the previous results for k-means initialization for numerical data, which indicates that k-means is highly sensitive to initialization [48]. There are two ways to address this issue. The first is to use better initialization [48], and the second is to change to a better algorithm—such as random swap [49]—that does not depend on the initialization. In the case of graph data, we used density-based initialization to implement the first approach and a split-and-merge strategy in the M-algorithm to implement the second.

K-algorithm iterations

The number of iterations I until convergence was the only part of the time complexity analysis (in Sect. 4.5) that could not be directly calculated by analyzing the algorithm pseudocode. In Table 5, we present experimental results for the number of iterations.

Both the size and the difficulty of the dataset caused the K-algorithm to need more iterations until convergence. As observed in Fig. 5, increasing N and µt makes the dataset more difficult; the same occurs when decreasing |E|/N. In all of these cases, the increased difficulty also increases the number of iterations. However, when the dataset is too difficult to be solved by any of the algorithms (|E|/N ≤ 15 and µt ≥ 0.74), the number of iterations starts to decrease. The dataset size also has an effect. The highest number of iterations occurred for the dataset of size 1,024,000, where it was 47% higher than the highest value for the varDeg series (where N = 5000).

Disease co-occurrence network

In the previous section, the clustering of the disease network was evaluated by how closely it matched the predefined categorization. However, this does not necessarily correspond to reality, as disease co-occurrences appear beyond category boundaries. In fact, the real goal of clustering is to uncover new information, not to study how well the clustering models existing predefined structures.

We next analyze what new information the clustering results revealed that might be relevant in medical practice. We used the icdB dataset, which has 188 nodes, but removed all symptom codes (starting with R), as these confused the analysis. This left 170 diagnoses. We selected the M-algorithm with the IIW cost function, as this combination produced accurate clusterings for most datasets. Although the dataset is seemingly small, the number of possible connections between nodes is still very large (14,365), which would make manual examination exhausting. Instead, using clustering (with k = 15), we can obtain groups that are small enough for realistic human analysis. A proof of concept of such an analysis is shown in Fig. 7. A more extensive analysis will be presented in a follow-up paper.

Fig. 7


Clustering result of the icdB dataset. For each cluster, we show a description based on medical expert analysis. Visualization was performed using Gephi. Edges with RR > 1.75 are drawn with a thicker line. The first diagnosis code of a range represents the whole range (e.g., T36 represents the range T36-T50)


Most of the clustering results are as expected. For example, cluster 4 shows the connection between mental health diagnoses and diagnoses related to alcohol and drug use. Furthermore, alcohol use and minor injuries such as limb fractures ended up together in cluster 9. However, some results are more unexpected. For example, cluster 1 consists of diseases of the eye and ear and viral infections; the clinical association between these conditions can be recognized, but they would likely not constitute a cluster of diseases if defined based only on clinical considerations.

Conclusions

We introduced two new cost functions for graph clustering: MIW and IIW. We also studied a third one, conductance. IIW gave the best overall results in the experimental tests. It works especially well in cases where a somewhat balanced clustering is required.

Nevertheless, the choice of a cost function depends on the dataset and the purpose of clustering. All of the studied cost functions are useful in some situations, and the choice of a cost function should ultimately be left up to the analyst. In Sect. 6.1, we provided information to aid in making this decision.

We also introduced two new algorithms, the K-algorithm and the M-algorithm, that optimize these cost functions. They are adaptations of the k-means algorithm to the graph clustering context. We compared the algorithms against eight previous methods. In the experimental tests, the M-algorithm clearly outperformed the existing state of the art: its average clustering error was CI = 1.7, whereas the best existing method (NCut) had CI = 2.7.

References

  1. Newman ME (2006) Modularity and community structure in networks. Proc Natl Acad Sci 103(23):8577–8582


  2. Newman ME (2004) Analysis of weighted networks. Phys Rev E 70(5):056131


  3. Kernighan BW, Lin S (1970) An efficient heuristic procedure for partitioning graphs. Bell Syst Tech J 49(2):291–307


  4. Shi J, Malik J (2000) Normalized cuts and image segmentation. IEEE Trans Pattern Anal Mach Intell 22(8):888–905


  5. Divo MJ, Casanova C, Marin JM et al (2015) Chronic obstructive pulmonary disease comorbidities network. Eur Respir J 22:113–118


  6. Hromic H, Prangnawarat N, Hulpuş I, Karnstedt M, Hayes C (2015) Graph-based methods for clustering topics of interest in twitter. In: International conference on web engineering, pp 701–704

  7. Fortunato S (2010) Community detection in graphs. Phys Rep 486(3–5):75–174


  8. Fortunato S, Hric D (2016) Community detection in networks: a user guide. Phys Rep 659:1–44

  9. Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. J Stat Mech Theory Exp 2008(10):10008

  10. Lancichinetti A, Fortunato S (2009) Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities. Phys Rev E 80(1):016118

  11. Whang JJ, Gleich DF, Dhillon IS (2016) Overlapping community detection using neighborhood-inflated seed expansion. IEEE Trans Knowl Data Eng 28(5):1272–1284

  12. Lu Z, Wen Y, Cao G (2013) Community detection in weighted networks: Algorithms and applications. In: 2013 IEEE international conference on pervasive computing and communications (PerCom), pp 179–184

  13. Lancichinetti A, Fortunato S (2009) Community detection algorithms: a comparative analysis. Phys Rev E 80(5):056117

  14. Zhang W, Wang X, Zhao D, Tang X (2012) Graph degree linkage: agglomerative clustering on a directed graph. In: European conference on computer vision, pp 428–441

  15. LaSalle D, Karypis G (2016) A parallel hill-climbing refinement algorithm for graph partitioning. In: 45th International conference on parallel processing (ICPP), pp 236–241

  16. Tabatabaei SS, Coates M, Rabbat M (2012) Ganc: greedy agglomerative normalized cut for graph clustering. Pattern Recogn 45(2):831–843

  17. Schäfer T, Mutzel P (2017) Struclus: scalable structural graph set clustering with representative sampling. In: International conference on advanced data mining and applications, pp 343–359

  18. Riesen K, Bunke H (2008) Kernel k-means clustering applied to vector space embeddings of graphs. In: IAPR workshop on artificial neural networks in pattern recognition, pp 24–35

  19. Ferrer M, Valveny E, Serratosa F, Bardají I, Bunke H (2009) Graph-based k-means clustering: a comparison of the set median versus the generalized median graph. In: International conference on computer analysis of images and patterns, pp 342–350

  20. Girvan M, Newman ME (2002) Community structure in social and biological networks. Proc Natl Acad Sci 99(12):7821–7826

  21. Kannan R, Vempala S, Vetta A (2004) On clusterings: good, bad and spectral. J ACM 51(3):497–515

  22. Pons P, Latapy M (2005) Computing communities in large networks using random walks. In: International symposium on computer and information sciences, pp 284–293

  23. Clauset A, Newman ME, Moore C (2004) Finding community structure in very large networks. Phys Rev E 70(6):066111

  24. Karypis G, Kumar V (1998) A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J Sci Comput 20(1):359–392

  25. Newman ME, Girvan M (2004) Finding and evaluating community structure in networks. Phys Rev E 69(2):026113

  26. Fortunato S, Barthelemy M (2007) Resolution limit in community detection. Proc Natl Acad Sci 104(1):36–41

  27. Hidalgo CA, Blumm N, Barabási A, Christakis NA (2009) A dynamic network approach for the study of human phenotypes. PLoS Comput Biol 5(4):e1000353

  28. Okuda M, Satoh S, Sato Y, Kidawara Y (2019) Community detection using restrained random-walk similarity. IEEE Trans Pattern Anal Mach Intell

  29. Sinclair A, Jerrum M (1989) Approximate counting, uniform generation and rapidly mixing markov chains. Inf Comput 82(1):93–133

  30. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on world wide web, pp 631–640

  31. Yang J, Leskovec J (2015) Defining and evaluating network communities based on ground-truth. Knowl Inf Syst 42(1):181–213

  32. Schaeffer SE (2007) Graph clustering. Comput Sci Rev 1(1):27–64

  33. Chakraborty T, Dalmia A, Mukherjee A, Ganguly N (2017) Metrics for community analysis: a survey. ACM Comput Surv 50(4):1–37

  34. Javed MA, Younis MS, Latif S, Qadir J, Baig A (2018) Community detection in networks: a multidisciplinary review. J Netw Comput Appl 108:87–111

  35. Biedermann S, Henzinger M, Schulz C, Schuster B (2018) Memetic graph clustering. In: Proceedings of the 17th international symposium on experimental algorithms (SEA'18), LIPIcs. Dagstuhl. Technical report arXiv: 1802.07034

  36. Peixoto TP (2014) Efficient Monte Carlo and greedy heuristic for the inference of stochastic block models. Phys Rev E 89(1):012804

  37. Rozemberczki B, Davies R, Sarkar R, Sutton C (2019) Gemsec: graph embedding with self clustering. In: Proceedings of the 2019 IEEE/ACM international conference on advances in social networks analysis and mining, pp 65–72

  38. Buluç A, Meyerhenke H, Safro I, Sanders P, Schulz C (2016) Recent advances in graph partitioning. In: Algorithm engineering, pp 117–158

  39. Sanders P, Schulz C (2012) Distributed evolutionary graph partitioning. In: 2012 Proceedings of the fourteenth workshop on algorithm engineering and experiments (ALENEX), pp 16–29

  40. Benlic U, Hao JK (2010) An effective multilevel memetic algorithm for balanced graph partitioning. In: 2010 22nd IEEE international conference on tools with artificial intelligence, vol 1, pp 121–128

  41. Parés F, Gasulla DG, Vilalta A, Moreno J, Ayguadé E, Labarta J, Cortés U, Suzumura T (2017) Fluid communities: a competitive, scalable and diverse community detection algorithm. In: International conference on complex networks and their applications, pp 229–240

  42. Fränti P, Sieranoja S (2018) K-means properties on six clustering benchmark datasets. Appl Intell 48(12):4743–4759

  43. Sieranoja S, Fränti P (2018) Fast random pair divisive construction of knn graph using generic distance measures. In: Proceedings of the 2018 international conference on big data and computing, pp 95–98

  44. WHO (2016) International statistical classification of diseases and related health problems, 10th revision. World Health Organization, Geneva

  45. Danon L, Diaz-Guilera A, Duch J, Arenas A (2005) Comparing community structure identification. J Stat Mech Theory Exp 2005(09):P09008

  46. Fränti P, Rezaei M, Zhao Q (2014) Centroid index: cluster level similarity measure. Pattern Recogn 47(9):3034–3045

  47. Rossetti G, Milli L, Cazabet R (2019) Cdlib: a python library to extract, compare and evaluate communities from complex networks. Appl Netw Sci 4(1):1–26

  48. Fränti P, Sieranoja S (2019) How much can k-means be improved by using better initialization and repeats? Pattern Recogn 93:95–112

  49. Fränti P (2018) Efficiency of random swap clustering. J Big Data 5(1):13
