How are cluster analysis diagrams generated?


This topic explains how the data underlying a cluster analysis diagram is generated.

Measuring similarity

To measure the similarity between each pair of items that will appear in a cluster diagram, NVivo first builds a table where:

| Table rows | Clustered by | Table columns | Table cells |
|---|---|---|---|
| Sources | Word similarity | Each different word that appears in the text of the sources | The number of times the column’s word appears in the row’s source |
| Sources | Coding similarity | Each node that codes the sources’ content | 1 if the column’s node codes the row’s source, 0 otherwise |
| Sources | Attribute value similarity | Each different attribute value of the sources (e.g. Book:Year = 2010) | 1 if the row’s source has the column’s attribute value, 0 otherwise |
| Nodes | Word similarity | Each different word that appears in the text of the nodes | The number of times the column’s word appears in the row’s node |
| Nodes | Coding similarity | Each source coded by the row’s node | 1 if the column’s source is coded by the row’s node, 0 otherwise |
| Nodes | Attribute value similarity | Each different attribute value of the nodes (e.g. Person:Sex = Female) | 1 if the row’s node has the column’s attribute value, 0 otherwise |
| Words (top 100 words in Word Frequency query results) | N/A | Each source or node that the query searches in | The number of times the row’s word appears in the column’s source or node |
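To make the table concrete, the following minimal Python sketch builds the "Sources clustered by word similarity" case. It is an illustration only, not NVivo's implementation: the source names, their text, and the `tokenize` helper are hypothetical, and the sketch does no stop-word removal or stemming.

```python
from collections import Counter
import re

# Hypothetical sources: names and text are made up for illustration.
sources = {
    "Interview A": "water quality and water policy",
    "Interview B": "policy change and fishing",
    "Interview C": "fishing boats and water",
}

def tokenize(text):
    """Lowercase the text and split it into words (hypothetical helper)."""
    return re.findall(r"[a-z']+", text.lower())

# One Counter per source: word -> number of occurrences in that source.
counts = {name: Counter(tokenize(text)) for name, text in sources.items()}

# Table columns: each different word that appears in any source.
columns = sorted(set().union(*counts.values()))

# Table cells: how often the column's word appears in the row's source.
table = {name: [counts[name][word] for word in columns] for name in sources}

print(columns)
for name, row in table.items():
    print(name, row)
```

Each row of `table` corresponds to one item in the cluster diagram. The coding and attribute value cases would fill the cells with 1 or 0 instead of counts.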

NVivo then calculates a similarity index between each pair of items (each pair of rows in the table) using the similarity metric you’ve selected.
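As a rough illustration of a pairwise similarity index, the sketch below compares every pair of rows using Jaccard's coefficient. The choice of metric is an assumption made for this example, since this topic does not list the available metrics. The table values are hypothetical.

```python
from itertools import combinations

# Hypothetical table rows (e.g. coding similarity: 1 = node codes the source).
table = {
    "Source A": [1, 1, 0, 1],
    "Source B": [1, 0, 0, 1],
    "Source C": [0, 0, 1, 0],
}

def jaccard(row_a, row_b):
    """Jaccard's coefficient on the presence/absence pattern of two rows."""
    both = sum(1 for a, b in zip(row_a, row_b) if a and b)
    either = sum(1 for a, b in zip(row_a, row_b) if a or b)
    return both / either if either else 0.0

# One similarity index per pair of rows.
similarity = {
    (a, b): jaccard(table[a], table[b]) for a, b in combinations(table, 2)
}
for pair, value in similarity.items():
    print(pair, round(value, 3))
```

Here Source A and Source B share two of the three columns that either of them uses, so their index is 2/3. Source C shares nothing with the others, so its indexes are 0.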

Forming clusters

Using the calculated similarity index between each pair of items, NVivo groups the items into a number of clusters (10 by default) using the complete linkage (farthest neighbor) hierarchical clustering algorithm. For more information, refer to the Wikipedia article Complete-linkage clustering.
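The following is a minimal sketch of complete linkage clustering, not NVivo's own code. It assumes that the distance between two items is 1 minus their similarity index. The similarity values and the `complete_linkage` function name are hypothetical.

```python
def complete_linkage(items, distance, n_clusters):
    """Agglomerative clustering: repeatedly merge the two clusters whose
    farthest pair of members is closest, until n_clusters remain."""
    clusters = [[item] for item in items]

    def cluster_distance(c1, c2):
        # Complete linkage: distance between the farthest pair of members.
        return max(distance(a, b) for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest complete-linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical similarity indexes between four items (values made up).
similarity = {
    frozenset(("A", "B")): 0.9,
    frozenset(("A", "C")): 0.2,
    frozenset(("A", "D")): 0.1,
    frozenset(("B", "C")): 0.3,
    frozenset(("B", "D")): 0.2,
    frozenset(("C", "D")): 0.8,
}

def distance(a, b):
    # Assumption: dissimilarity is taken as 1 minus the similarity index.
    return 1.0 - similarity[frozenset((a, b))]

clusters = complete_linkage(["A", "B", "C", "D"], distance, n_clusters=2)
print(clusters)
```

Because the distance between two clusters is set by their least similar members, complete linkage tends to produce compact clusters in which every item is reasonably similar to every other item.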

Generating a dendrogram

By default the results of the cluster analysis are displayed as a dendrogram, which is generated using the same complete linkage (farthest neighbor) hierarchical clustering technique that is used to form the clusters.
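A dendrogram can be thought of as a record of the order in which clusters merge and the distance at which each merge happens. The sketch below, written under the assumption that distances between items are already known, runs complete linkage down to a single cluster and records each merge. The `merge_history` function and the distance values are hypothetical, and no drawing is attempted.

```python
def merge_history(items, distance):
    """Run complete-linkage clustering to a single cluster and record
    each merge as (left members, right members, merge height)."""
    clusters = [(item,) for item in items]
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: height is the farthest pair of members.
                height = max(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or height < best[0]:
                    best = (height, i, j)
        height, i, j = best
        history.append((clusters[i], clusters[j], height))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return history

# Hypothetical distances between three items (values made up).
dist = {frozenset("AB"): 0.1, frozenset("AC"): 0.7, frozenset("BC"): 0.6}

history = merge_history(["A", "B", "C"], lambda a, b: dist[frozenset((a, b))])
for left, right, height in history:
    print("+".join(left), "joins", "+".join(right), "at height", height)
```

Each recorded merge corresponds to one branch point in the dendrogram, and the height indicates how dissimilar the joined groups are.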

Generating a cluster map

The cluster analysis results can also be displayed as a 2D or 3D cluster map, where the items in the cluster analysis are represented as points in space.

The cluster map is generated using an iterative multidimensional scaling algorithm. Initially, the items are placed randomly as data points in a square (2D) or cube (3D), and then a series of iterations is performed to optimize the positions of the items. The optimal distance between each pair of items is defined as 1.1 minus the similarity index between the items. At each iteration, the actual distance between each pair of items is compared to the optimal distance between them, and the data points are moved closer together or farther apart accordingly. The algorithm ends when an optimal configuration is reached that cannot be improved by further movement of the data points.
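The procedure described above can be sketched in a few lines of Python. This is an illustrative approximation rather than NVivo's algorithm: it runs for a fixed number of iterations instead of testing for an optimal configuration, and the `cluster_map` function, the `step` size, the `seed`, and the similarity values are all assumptions made for the example.

```python
import math
import random

def cluster_map(items, similarity, dims=2, iterations=500, step=0.05, seed=0):
    """Place items in 2D or 3D so distances approach 1.1 - similarity."""
    rng = random.Random(seed)
    # Initially, place each item at a random point in the unit square or cube.
    pos = {item: [rng.random() for _ in range(dims)] for item in items}
    for _ in range(iterations):
        for a in items:
            for b in items:
                if a == b:
                    continue
                # Optimal distance as defined in the text above.
                target = 1.1 - similarity(a, b)
                delta = [pa - pb for pa, pb in zip(pos[a], pos[b])]
                actual = math.sqrt(sum(d * d for d in delta)) or 1e-9
                # error > 0: the points are too far apart, so move a toward b;
                # error < 0: they are too close, so move a away from b.
                error = actual - target
                for k in range(dims):
                    pos[a][k] -= step * error * delta[k] / actual
    return pos

# Hypothetical similarity indexes (values made up for illustration).
sim = {frozenset("AB"): 0.9, frozenset("AC"): 0.1, frozenset("BC"): 0.1}
positions = cluster_map(["A", "B", "C"], lambda a, b: sim[frozenset((a, b))])

def gap(a, b):
    """Euclidean distance between two placed items."""
    return math.dist(positions[a], positions[b])

for pair in ("AB", "AC", "BC"):
    print(pair, round(gap(*pair), 2))
```

With these values, A and B have a high similarity index and therefore a small optimal distance (1.1 - 0.9 = 0.2), so they should end up much closer to each other than either is to C. Passing `dims=3` gives the 3D equivalent.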