Would you like an efficient method to find clusters of DNA matches relevant to your research subject? In this series, I’m sharing the steps to create a network graph using the free, open source Gephi application, available for Windows or Mac. I use Gephi to create network graphs of my AncestryDNA matches, but you can use matches from other companies as well.
Below are the previous steps in this tutorial:
This article will cover how to filter out clusters, compare two clusters to each other, find connected clusters, random connections, analyze connections, and adjust the modularity to find more or less clusters. I will be using a new example network graph for this tutorial.
In the previous tutorials, I used a network graph of my own matches from AncestryDNA with the range of 50-400 cM. This created a simple, manageable network graph with 6 clusters. For this tutorial, I will be using my mom’s network graph that includes matches from 15-886 cM from AncestryDNA. This graph includes 21 clusters and is a bit more complicated, due to including so many more matches. The number of matches increases drastically when you include matches under 40-50 cM. When you have a complicated network graph, you may want to employ some of the strategies in this tutorial to focus on clusters relevant to your research question.
In Gephi, a quick way to understand the clusters is to change the settings when you run modularity. Playing with the modularity can help you group matches into larger buckets or detect smaller communities.
After going to the statistics module on the right sidebar, find modularity, then click run. You’ll see the modularity settings popup box.
The default resolution is 1.0, but as the popup box says, you can lower the resolution to get more communities (smaller ones), and make the resolution higher than 1.0 to get less communities (bigger ones).
If you make the resolution high, you can often see just two clusters, one maternal and one paternal. I tried a resolution of 20, and the report said the algorithm created 11 communities. However when I applied color to these, I saw that there were just two large clusters, green and orange, and the other 9 were tiny (see the colors and percentages in the appearance module). When I compared close matches to their place in the graph, I could clearly see that the maternal matches were in the top left of the network graph (green). The paternal matches are orange.
Setting the modularity resolution to 5 gave me five large clusters. The maternal clusters were still all grouped into one, but the paternal clusters were separated into four. When you are first starting to analyze the network graph, it can be worthwhile to adjust the modularity to a higher number, like 20, and see the breakdown of maternal/paternal. Then try a resolution of 5 to see clusters divided into 4-5 clusters, hopefully giving you an idea of the four grandparent lines.
Some of the 21 clusters look like they could be subdivided into additional clusters. Let’s try a lower resolution to see if we can get more clusters. I changed the resolution to 0.5, but the maternal clusters didn’t split apart much. I lowered the resolution a bit more, to 0.3, and now I got four maternal clusters, and LOTS of paternal clusters.
Resolution 1 is on the left, resolution 0.5 is in the middle, and resolution 0.3 is on the right.
Filtering Out Clusters
Let’s go back to the original graph with 21 clusters and learn how to filter to just see the clusters you’re interested in. The top left clusters, purple (1) and green (2), are maternal. I know this because the larger nodes are close maternal cousins that I recognize. I want to focus on paternal clusters that relate to the Royston side of the family, to help my mom with her Cynthia (Dillard) Royston research. Cynthia is Diana’s 3rd-great-grandmother. So, I looked at the other large matches for close paternal matches on the Royston/Dillard side.
The orange cluster (14) in the center has a medium sized node, a close Royston cousin named Victor. Victor is two generations closer than my mother to Cynthia (Dillard) Royston. That cluster 14 is crowded by the other clusters. The center of the graph looks pretty messy, making the analysis of which clusters are connected to which other clusters challenging.
In order to focus on cluster 14, and see if we can sub-divide cluster 14 into smaller clusters, we will need to filter out the other clusters. These sub-clusters may reveal different common ancestral couples in the Royston line.
To filter out the other clusters, we will use the Filters module on the right sidebar of Gephi. Double click on attributes in the Filters library. Then, double click on partition.
Next, double click on the Modularity Class (Node) filter to add it to your queries. (If you have any other filters you were using before, right click on them to remove them first). Once you have added the modularity class filter, you’ll see the cluster numbers in the bottom right module. You can then check the boxes next to whichever clusters you want to see. They are listed in order by the largest cluster to the smallest cluster. Cluster 14 is the second largest cluster.
After clicking on the box for the cluster you want to focus on, click filter. Now I only see cluster 14!
Cluster 14 appears to have several sub-clusters. To analyze the sub-clusters of group 14, you can select all the nodes for group 14 and export them to a new workspace. Workspaces are tabs within the visualization window that allow you to analyze more than one graph simultaneously. To open a cluster in a new workspace, click the select button in the bottom right partition (modularity class) settings. Next, find the export filtered graph button. It’s a small icon at the top of the filters module. When you hover over the second icon, you’ll see “export filtered graph to a new workspace.” Click that icon.
After clicking the export filtered graph to a new workspace button, a new tab will appear at above the appearance module labelled “Workspace 2.” Click on Workspace 2 to go to that workspace and see the cluster you exported there. My cluster 14 is very zoomed in, so I will use the magnifying class icon to center my view on the whole graph.
The nodes look light colored and faint because they were selected. To unselect them, choose the direct selection (arrow) tool.
The next step is to redo the layout. In the layout module, choose Force Atlas 2, change the scaling to 1000, and check the box for stronger gravity. Click run.
Now we’ll run modularity again. In the filters module, click the statistics tab. Click the “run” button next to modularity. Gephi will detect communities and show the modularity report. Cluster 14 now has nine sub-clusters. These nine clusters have now been given their own numbers from 0-8.
Finally, we will add color to the sub-clusters. In the appearance module, click the nodes tab, and then click the color palette. Select the partition tab. In the “choose an attribute” dropdown box, select modularity class.
From there, you can either click apply with the default colors, or click palette to generate a new palette of colors.
Now we can analyze each smaller cluster within cluster 14, looking for common ancestors. Sometimes, in my analysis of sub-clusters, I find common ancestral couples further on the studied ancestral line. Sometimes, I can’t find common ancestors, and I wonder if they all share an “old” segment that was passed down from a distant common ancestor. The first step to finding matches relevant to your case is to evaluate as many matches within a cluster with a known cousin who descends from the research subject. Clusters often have subgroups within them. If that analysis doesn’t yield relevant matches, you may want to look for connected clusters to analyze.
Compare Two Clusters & Find Connected Clusters
Comparing the connection between clusters can help you find related clusters.
Using the “partition (modularity class)” filter, you can compare two clusters to each other. This can help you see the connection between the two clusters more clearly. Going back to Workspace 1, I selected the orange cluster (14) and the green cluster (12).
The two clusters appear to have many connections between them. After exporting them to a new workspace, laying out out the graph, and changing the force atlas settings to a scaling of 1000 and gravity of 0.2, I was able to see sub-clusters and connections between sub-clusters clearly. This smaller, focused graph is much easier to analyze than the original network graph for Diana that included all her matches.
How do you know which two clusters to focus on? Comparing the known cluster (which contains known descendants of the research subject) to each other cluster can help. Filter to see just two clusters at a time. Take screenshots to compare the strength of the connections. Below I’m comparing the connection of cluster 14 with clusters 2 and 15. Between 14 and 2 there are very few connecting lines. They are also far away from each other in the graph. The comparison between 14 and 15 is the opposite — the connection between them is strong, with many connecting lines, and the clusters are near to each other in the graph.
This visual analysis holds true in the documentary research as well. Reviewing the family trees for matches in cluster 2 revealed that it’s maternal. Cluster 14 is paternal. I wouldn’t expect a maternal and a paternal cluster to have any connections at all! But there can be shared DNA between two of your matches on another line, unrelated to the test-taker, which explains the random connections.
Cluster 15 (blue) is a Harris cluster. My mother’s grandparents were Charles Leslie Shults and Ettie Belle Harris. Her Royston line begins with Charles Leslie Shults’ mother, Dora Algie Royston. The Shults, Royston, and Harris lines all lived in Texas and Indian Territory/Oklahoma. With ancestors who share common geographical origins and migration paths, it’s normal to see more connections between clusters. The two large matches in the blue cluster (15) are a 1C and 1C1R from Diana’s paternal grandparents, Charles Leslie Shults and Ettie Belle Harris. Because these first cousins are large nodes in between clusters 14 and 15, you can see that they are important matches pointing to two clusters who are further back on Charles Leslie Shults line (orange/14) and Ettie Belle Harris’ line (blue/15).
Connections between clusters can mean several things, but the stronger the connections between clusters, the more likely it is that they represent DNA matches who share common ancestors with the test-taker and each other on the same ancestral line.
Calculate the Number of Connecting Edges
Another way to determine the strength of a connection between two clusters, beyond a visual comparison, is to calculate the exact number of edges/connections between them. In the context module at the top right of Gephi, you can see how many nodes and edges the selected cluster(s) have. (The percent visible refers to the proportion of nodes and edges visible out of the entire graph.) In the example below, I filtered to see just clusters 14 and 15, and together, they have 17,082 edges.
By itself, cluster 14 has 5,946 edges. Cluster 15 has 10,717 edges. To determine how many edges are connecting the two clusters, well subtract the edges in clusters 14 and 15 from the total edges when they are together.
17,082 – 5946 – 10,717= 419 edges
This means there are 419 edges connecting the two clusters.
By comparison, clusters 2 and 14 have a total of 11,118 edges. Cluster 2 has 5160 edges. Cluster 14 has 5946 edges.
11,118 – 5,160 – 5,946 = 12 edges
This means there are only 12 edges connecting clusters 2 and 14. That such a weak connection, I would consider it no connection at all (just random connections).
Another factor to take into account is the number of nodes in the cluster. I used a spreadsheet to help me find the proportion of connections to the number of nodes in the cluster.
What I learned from this spreadsheet analysis comparing cluster 14 to every other significant paternal cluster is that both clusters 15 and 17 have over 300 connections with cluster 14. Cluster 15 is twice as large as 17 though, so the proportion of connections per node is higher for 17, at an average of 0.67 connecting edges per node. Cluster 15 has an average of 0.42 connecting edges per node.
I also calculated the average connecting edges per node for cluster 14 in each comparison, then the last column is the average of columns J and K. Finally, I sorted the spreadsheet by column L to find which clusters had the highest average connecting edges per node for both clusters. This analysis led me to focus on the clusters with the highest connections to cluster 14: clusters 17, 15, 19, 12, 20 and 16. I compared the people in the clusters to their match pages with my notes and common ancestor hints. Then I noted the numbers for the ancestral couples in Diana’s paternal pedigree.
My analysis of MRCA couples for these clusters shows:
Cluster 14: Thomas B. Royston/Cynthia Dillard – this is the cluster with known descendants of the research subject, Cynthia (Dillard) Royston. I also found a descendant of Thomas B. Royston’s father, John Carey Royston.
Cluster 17: William H. Shults/Eliza Isenhour and ancestors
Cluster 15: Dock Harris/Alice Frazier and ancestors – this line has an instance of Harris siblings marrying Frazier siblings. Although not pedigree collapse, it causes multiple relationships among descendant test-takers, and making the genetic networks overlap and appear as one big cluster.
Cluster 19: H Weatherford/ Clemsy Cline
Cluster 12: Robert Cisney Royston/ Isabella Weatherford and H Weatherford/ Clemsy Cline
Cluster 20: Unknown
cluster 16: Unknown
The analysis shows that to find relevant matches on the Royston/Dillard line, I should focus on the matches in cluster 14 and then possibly expand to clusters 20 and 16. If I want to find more about the Weatherford ancestors, I can focus on clusters 19 and 12. I might want to take a closer look at cluster 12, in case there are matches whose common ancestors are on the Royston/Dillard side. For now, it looks like this cluster is both Royston/Weatherford (which might include matches with MRCA couples further back on the Royston side), and Weatherford/Cline (which includes DNA matches with MRCA couples further back on the Weatherford side).
It’s interesting to note that closer cousins, like 1st and 2nd cousins, cause a lot of connections between further back clusters. That’s why cluster 14 and 15 have many connections, though on opposite sides of the paternal pedigree chart. To reduce this noise, you can remove close cousins from the network graph. Just right click on them and click delete. You might want to copy your whole graph to a new workspace first.
We have been looking for shared match connections between clusters that show they are on the same ancestral line. But there are also random connections in the network graph. What can cause “random” connections between clusters?
- Individuals who share DNA on a different line than the line they share with the test-taker. After all, we all share DNA with thousands of other people. There’s a high chance that some of our matches will match some of our other matches with different MRCAs, common only to them, and not to us. This situation accounts for almost all the random connections.
- False matches. These are matches that don’t really share DNA.
- Individuals who share multiple relationships (and therefore multiple common ancestral couples) with the test-taker. This usually shows as a node with many connections to two separate clusters. These are not really random connections, but extra connections.
- Individuals who have pedigree collapse or endogamy in their family tree. Their network graphs will show connections to many clusters. Network graph analysis is not helpful for those with moderate to severe endogamy. These are not random connections, but so many connections that it makes the network graph unreadable.
Looking for connected clusters can be tricky. The more matches under about 30-50 cM are included, the more random connections you’ll see between clusters. Prioritize clusters that have significant connections between them when compared with the other clusters in the graph.
In the example below, looking at the connection between clusters 1 and 14, the clusters are only connected through one node. This can indicate that the DNA match shares two relationships with the test-taker. In this instance, the node is a DNA match who shares a common ancestral couple with Diana on her maternal side, and another common ancestral couple on her paternal side. Diana and this match are double cousins. The node could really be colored half purple and half orange.
Network graphs can give you valuable clues about your DNA matches and how they are related to one another and to the test-taker. They can be complicated to analyze, but hopefully some of these strategies gave you ideas for how to better understand your network graph. To take full advantage of your network graph, you must look at matches’ trees, find hypothesized common ancestors, and perform documentary research. The next post in this series will focus on how to do that. Alice Childs, one of the researchers at Family Locket Genealogists, uses GWorks, a tool of DNAGedcom, to help analyze network graphs. See her post here:
Childs, Alice. “How Pairing Gworks with a Network Graph Can Help Solve Your Research Objective.” 16 February 2023. Family Locket. https://familylocket.com/how-pairing-gworks-with-a-network-graph-can-help-solve-your-research-objective/.