Creating Network Graphs with Gephi Part 5: Analyze the Network Graph

October 28, 2023 Posted by Nicole Elder Dyer Research Tips

Would you like an efficient method to find clusters of DNA matches relevant to your research subject? In this series, I’m sharing the steps to create a network graph using the free, open source Gephi application, available for Windows or Mac. I use Gephi to create network graphs of my AncestryDNA matches, but you can use matches from other companies as well.

Below are the previous steps in this tutorial:

Creating Gephi Network Graphs Part 1: Gather Matches and Prepare Spreadsheets

Creating Network Graphs with Gephi Part 2: Import Spreadsheets and Run Layouts

Creating Gephi Network Graphs Part 3: Adjusting the Network Graph

Creating Gephi Network Graphs Part 4: Exporting and Saving the Graph

This article will cover how to filter out clusters, compare two clusters to each other, find connected clusters, random connections, analyze connections, and adjust the modularity to find more or less clusters. I will be using a new example network graph for this tutorial.

Example Graph

In the previous tutorials, I used a network graph of my own matches from AncestryDNA with the range of 50-400 cM. This created a simple, manageable network graph with 6 clusters. For this tutorial, I will be using my mom’s network graph that includes matches from 15-886 cM from AncestryDNA. This graph includes 21 clusters and is a bit more complicated, due to including so many more matches. The number of matches increases drastically when you include matches under 40-50 cM. When you have a complicated network graph, you may want to employ some of the strategies in this tutorial to focus on clusters relevant to your research question.

Diana’s AncestryDNA network graph from 15cM and up, with 21 clusters

Adjusting Modularity

In Gephi, a quick way to understand the clusters is to change the settings when you run modularity. Playing with the modularity can help you group matches into larger buckets or detect smaller communities.

After going to the statistics module on the right sidebar, find modularity, then click run. You’ll see the modularity settings popup box.

The default resolution is 1.0, but as the popup box says, you can lower the resolution to get more communities (smaller ones), and make the resolution higher than 1.0 to get less communities (bigger ones).

High Resolution

If you make the resolution high, you can often see just two clusters, one maternal and one paternal. I tried a resolution of 20, and the report said the algorithm created 11 communities. However when I applied color to these, I saw that there were just two large clusters, green and orange, and the other 9 were tiny (see the colors and percentages in the appearance module). When I compared close matches to their place in the graph, I could clearly see that the maternal matches were in the top left of the network graph (green). The paternal matches are orange.

Setting the modularity resolution to 5 gave me five large clusters. The maternal clusters were still all grouped into one, but the paternal clusters were separated into four. When you are first starting to analyze the network graph, it can be worthwhile to adjust the modularity to a higher number, like 20, and see the breakdown of maternal/paternal. Then try a resolution of 5 to see clusters divided into 4-5 clusters, hopefully giving you an idea of the four grandparent lines.

Lower Resolution

Some of the 21 clusters look like they could be subdivided into additional clusters. Let’s try a lower resolution to see if we can get more clusters. I changed the resolution to 0.5, but the maternal clusters didn’t split apart much. I lowered the resolution a bit more, to 0.3, and now I got four maternal clusters, and LOTS of paternal clusters.

Res 1.0 – 21 clusters

Res 0.5 – 36 clusters

Res 0.3 – 49 clusters

Resolution 1 is on the left, resolution 0.5 is in the middle, and resolution 0.3 is on the right.

Filtering Out Clusters

Let’s go back to the original graph with 21 clusters and learn how to filter to just see the clusters you’re interested in. The top left clusters, purple (1) and green (2), are maternal. I know this because the larger nodes are close maternal cousins that I recognize. I want to focus on paternal clusters that relate to the Royston side of the family, to help my mom with her Cynthia (Dillard) Royston research. Cynthia is Diana’s 3rd-great-grandmother. So, I looked at the other large matches for close paternal matches on the Royston/Dillard side.

Diana’s AncestryDNA network graph with matches from 15-886 cM

The orange cluster (14) in the center has a medium sized node, a close Royston cousin named Victor. Victor is two generations closer than my mother to Cynthia (Dillard) Royston. That cluster 14 is crowded by the other clusters. The center of the graph looks pretty messy, making the analysis of which clusters are connected to which other clusters challenging.

In order to focus on cluster 14, and see if we can sub-divide cluster 14 into smaller clusters, we will need to filter out the other clusters. These sub-clusters may reveal different common ancestral couples in the Royston line.

To filter out the other clusters, we will use the Filters module on the right sidebar of Gephi. Double click on attributes in the Filters library. Then, double click on partition.

Next, double click on the Modularity Class (Node) filter to add it to your queries. (If you have any other filters you were using before, right click on them to remove them first). Once you have added the modularity class filter, you’ll see the cluster numbers in the bottom right module. You can then check the boxes next to whichever clusters you want to see. They are listed in order by the largest cluster to the smallest cluster. Cluster 14 is the second largest cluster.

After clicking on the box for the cluster you want to focus on, click filter. Now I only see cluster 14!

Cluster 14 appears to have several sub-clusters. To analyze the sub-clusters of group 14, you can select all the nodes for group 14 and export them to a new workspace. Workspaces are tabs within the visualization window that allow you to analyze more than one graph simultaneously. To open a cluster in a new workspace, click the select button in the bottom right partition (modularity class) settings. Next, find the export filtered graph button. It’s a small icon at the top of the filters module. When you hover over the second icon, you’ll see “export filtered graph to a new workspace.” Click that icon.

After clicking the export filtered graph to a new workspace button, a new tab will appear at above the appearance module labelled “Workspace 2.” Click on Workspace 2 to go to that workspace and see the cluster you exported there. My cluster 14 is very zoomed in, so I will use the magnifying class icon to center my view on the whole graph.

The nodes look light colored and faint because they were selected. To unselect them, choose the direct selection (arrow) tool.

The next step is to redo the layout. In the layout module, choose Force Atlas 2, change the scaling to 1000, and check the box for stronger gravity. Click run.

Now we’ll run modularity again. In the filters module, click the statistics tab. Click the “run” button next to modularity. Gephi will detect communities and show the modularity report. Cluster 14 now has nine sub-clusters. These nine clusters have now been given their own numbers from 0-8.

Finally, we will add color to the sub-clusters. In the appearance module, click the nodes tab, and then click the color palette. Select the partition tab. In the “choose an attribute” dropdown box, select modularity class.

From there, you can either click apply with the default colors, or click palette to generate a new palette of colors.

Now we can analyze each smaller cluster within cluster 14, looking for common ancestors. Sometimes, in my analysis of sub-clusters, I find common ancestral couples further on the studied ancestral line. Sometimes, I can’t find common ancestors, and I wonder if they all share an “old” segment that was passed down from a distant common ancestor. The first step to finding matches relevant to your case is to evaluate as many matches within a cluster with a known cousin who descends from the research subject. Clusters often have subgroups within them. If that analysis doesn’t yield relevant matches, you may want to look for connected clusters to analyze.

Compare Two Clusters & Find Connected Clusters

Comparing the connection between clusters can help you find related clusters.

Using the “partition (modularity class)” filter, you can compare two clusters to each other. This can help you see the connection between the two clusters more clearly. Going back to Workspace 1, I selected the orange cluster (14) and the green cluster (12).

The two clusters appear to have many connections between them. After exporting them to a new workspace, laying out out the graph, and changing the force atlas settings to a scaling of 1000 and gravity of 0.2, I was able to see sub-clusters and connections between sub-clusters clearly. This smaller, focused graph is much easier to analyze than the original network graph for Diana that included all her matches.

How do you know which two clusters to focus on? Comparing the known cluster (which contains known descendants of the research subject) to each other cluster can help. Filter to see just two clusters at a time. Take screenshots to compare the strength of the connections. Below I’m comparing the connection of cluster 14 with clusters 2 and 15. Between 14 and 2 there are very few connecting lines. They are also far away from each other in the graph. The comparison between 14 and 15 is the opposite — the connection between them is strong, with many connecting lines, and the clusters are near to each other in the graph.

This visual analysis holds true in the documentary research as well. Reviewing the family trees for matches in cluster 2 revealed that it’s maternal. Cluster 14 is paternal. I wouldn’t expect a maternal and a paternal cluster to have any connections at all! But there can be shared DNA between two of your matches on another line, unrelated to the test-taker, which explains the random connections.

Cluster 15 (blue) is a Harris cluster. My mother’s grandparents were Charles Leslie Shults and Ettie Belle Harris. Her Royston line begins with Charles Leslie Shults’ mother, Dora Algie Royston. The Shults, Royston, and Harris lines all lived in Texas and Indian Territory/Oklahoma. With ancestors who share common geographical origins and migration paths, it’s normal to see more connections between clusters. The two large matches in the blue cluster (15) are a 1C and 1C1R from Diana’s paternal grandparents, Charles Leslie Shults and Ettie Belle Harris. Because these first cousins are large nodes in between clusters 14 and 15, you can see that they are important matches pointing to two clusters who are further back on Charles Leslie Shults line (orange/14) and Ettie Belle Harris’ line (blue/15).

Diana’s pedigree

Connections between clusters can mean several things, but the stronger the connections between clusters, the more likely it is that they represent DNA matches who share common ancestors with the test-taker and each other on the same ancestral line.

Calculate the Number of Connecting Edges

Another way to determine the strength of a connection between two clusters, beyond a visual comparison, is to calculate the exact number of edges/connections between them. In the context module at the top right of Gephi, you can see how many nodes and edges the selected cluster(s) have. (The percent visible refers to the proportion of nodes and edges visible out of the entire graph.) In the example below, I filtered to see just clusters 14 and 15, and together, they have 17,082 edges.

By itself, cluster 14 has 5,946 edges. Cluster 15 has 10,717 edges. To determine how many edges are connecting the two clusters, well subtract the edges in clusters 14 and 15 from the total edges when they are together.

17,082 – 5946 – 10,717= 419 edges

This means there are 419 edges connecting the two clusters.

By comparison, clusters 2 and 14 have a total of 11,118 edges. Cluster 2 has 5160 edges. Cluster 14 has 5946 edges.

11,118 – 5,160 – 5,946 = 12 edges

This means there are only 12 edges connecting clusters 2 and 14. That such a weak connection, I would consider it no connection at all (just random connections).

Another factor to take into account is the number of nodes in the cluster. I used a spreadsheet to help me find the proportion of connections to the number of nodes in the cluster.

What I learned from this spreadsheet analysis comparing cluster 14 to every other significant paternal cluster is that both clusters 15 and 17 have over 300 connections with cluster 14. Cluster 15 is twice as large as 17 though, so the proportion of connections per node is higher for 17, at an average of 0.67 connecting edges per node. Cluster 15 has an average of 0.42 connecting edges per node.

I also calculated the average connecting edges per node for cluster 14 in each comparison, then the last column is the average of columns J and K. Finally, I sorted the spreadsheet by column L to find which clusters had the highest average connecting edges per node for both clusters. This analysis led me to focus on the clusters with the highest connections to cluster 14: clusters 17, 15, 19, 12, 20 and 16. I compared the people in the clusters to their match pages with my notes and common ancestor hints. Then I noted the numbers for the ancestral couples in Diana’s paternal pedigree.

Blue boxes indicate which clusters matches with these MRCA couples were in

My analysis of MRCA couples for these clusters shows:

Cluster 14: Thomas B. Royston/Cynthia Dillard – this is the cluster with known descendants of the research subject, Cynthia (Dillard) Royston. I also found a descendant of Thomas B. Royston’s father, John Carey Royston.

Cluster 17: William H. Shults/Eliza Isenhour and ancestors

Cluster 15: Dock Harris/Alice Frazier and ancestors – this line has an instance of Harris siblings marrying Frazier siblings. Although not pedigree collapse, it causes multiple relationships among descendant test-takers, and making the genetic networks overlap and appear as one big cluster.

Cluster 19: H Weatherford/ Clemsy Cline

Cluster 12: Robert Cisney Royston/ Isabella Weatherford and H Weatherford/ Clemsy Cline

Cluster 20: Unknown

cluster 16: Unknown

The analysis shows that to find relevant matches on the Royston/Dillard line, I should focus on the matches in cluster 14 and then possibly expand to clusters 20 and 16. If I want to find more about the Weatherford ancestors, I can focus on clusters 19 and 12. I might want to take a closer look at cluster 12, in case there are matches whose common ancestors are on the Royston/Dillard side. For now, it looks like this cluster is both Royston/Weatherford (which might include matches with MRCA couples further back on the Royston side), and Weatherford/Cline (which includes DNA matches with MRCA couples further back on the Weatherford side).

It’s interesting to note that closer cousins, like 1st and 2nd cousins, cause a lot of connections between further back clusters. That’s why cluster 14 and 15 have many connections, though on opposite sides of the paternal pedigree chart. To reduce this noise, you can remove close cousins from the network graph. Just right click on them and click delete. You might want to copy your whole graph to a new workspace first.

Random Connections

We have been looking for shared match connections between clusters that show they are on the same ancestral line. But there are also random connections in the network graph. What can cause “random” connections between clusters?

Individuals who share DNA on a different line than the line they share with the test-taker. After all, we all share DNA with thousands of other people. There’s a high chance that some of our matches will match some of our other matches with different MRCAs, common only to them, and not to us. This situation accounts for almost all the random connections.
False matches. These are matches that don’t really share DNA.
Individuals who share multiple relationships (and therefore multiple common ancestral couples) with the test-taker. This usually shows as a node with many connections to two separate clusters. These are not really random connections, but extra connections.
Individuals who have pedigree collapse or endogamy in their family tree. Their network graphs will show connections to many clusters. Network graph analysis is not helpful for those with moderate to severe endogamy. These are not random connections, but so many connections that it makes the network graph unreadable.

Looking for connected clusters can be tricky. The more matches under about 30-50 cM are included, the more random connections you’ll see between clusters. Prioritize clusters that have significant connections between them when compared with the other clusters in the graph.

In the example below, looking at the connection between clusters 1 and 14, the clusters are only connected through one node. This can indicate that the DNA match shares two relationships with the test-taker. In this instance, the node is a DNA match who shares a common ancestral couple with Diana on her maternal side, and another common ancestral couple on her paternal side. Diana and this match are double cousins. The node could really be colored half purple and half orange.

Conclusion

Network graphs can give you valuable clues about your DNA matches and how they are related to one another and to the test-taker. They can be complicated to analyze, but hopefully some of these strategies gave you ideas for how to better understand your network graph. To take full advantage of your network graph, you must look at matches’ trees, find hypothesized common ancestors, and perform documentary research. The next post in this series will focus on how to do that. Alice Childs, one of the researchers at Family Locket Genealogists, uses GWorks, a tool of DNAGedcom, to help analyze network graphs. See her post here:

Childs, Alice. “How Pairing Gworks with a Network Graph Can Help Solve Your Research Objective.” 16 February 2023. Family Locket. https://familylocket.com/how-pairing-gworks-with-a-network-graph-can-help-solve-your-research-objective/.

About Nicole Elder Dyer

Nicole Dyer is a professional genealogist specializing in Southern United States research and genetic genealogy. She is the creator of FamilyLocket.com and the Research Like a Pro Genealogy Podcast. She co-authored Research Like a Pro: A Genealogist's Guide and Research Like a Pro with DNA and is an instructor for the study groups of the same name. She lectures at conferences and institutes and previously served as the secretary and publicity chair of the Pima County Genealogy Society. Nicole holds a bachelor's degree from Brigham Young University in History Teaching. At Family Locket Genealogists, Nicole is a project manager, editor, and researcher.

4 Comments

Leave your reply.

Ashley Young
· Reply

April 26, 2025 at 8:33 AM

Hi Nicole,
Thanks so much for putting together such a comprehensive tutorial on using Gephi for network graphs! I’ve wanted to try it out since first seeing you post about it a few years ago, but just recently got around to actually doing it. I’m new to both DNAGedcom & Gephi, but was able to follow your instructions and create my first network graph for my grandaunt’s matches from 25-320 cM.

When importing my ICW file I did get the warning, “13,413 mutual edges removed to fulfill undirected type.” In the data laboratory, my columns for “people” and “tree url” are empty, so maybe I did something wrong when gathering? Everything else went smoothly until I started to compare my 15 clusters. I used your spreadsheet example to create my own comparison spreadsheet. After doing so, I noticed that comparisons between clusters 7 to 9 as well as 3 to 9 resulted in a negative number of connecting edges. This doesn’t seem right. I do plan to redo my gather and try again, but wondered if you might have any insight as to why this may have happened or what I may have done to cause this issue? Thanks!

Loading...
- Nicole Elder Dyer
  · Reply
  
  Author
  
  June 7, 2025 at 7:31 AM
  
  Great job making a network graph! In answer to your question –
  1. “13,413 mutual edges removed to fulfill undirected type” warning
  This message is Gephi’s way of saying: “You had duplicate connections between nodes, and since your graph is undirected, I removed the extras.” This is totally normal and expected behavior when using DNAGedcom’s ICW file as input. ICW (In-Common-With) relationships are symmetrical—if A is ICW with B, B is also ICW with A—so it’s common for the raw data to list both directions. Gephi keeps just one connection per pair when you’re using an undirected graph (which you want for genetic genealogy).
  
  2. Empty “people” and “tree url” columns
  It sounds like the People file wasn’t successfully merged into your Nodes table. DNAGedcom creates multiple files: the ICW file (connections), and the People file (which has tree info, notes, surnames, etc.). If you used only the ICW file to create the network, your nodes won’t include the “people” or “tree url” data, which live in the People file. Try this fix:
  
  Open the Nodes table in the Gephi Data Laboratory.
  
  Use the “Import Spreadsheet” option to import the Match file (usually yourkit_m.csv) as a table to merge with existing nodes, using the test taker’s match ID as the key (Gephi calls this the “Id” column).
  
  Make sure you’re merging into existing nodes and not creating new ones.
  
  This should populate the columns like “people” and “tree url” so you can use them in your visualizations and analysis.
  
  If you used DNAGedcom 4 to gather matches, then perhaps it’s not working right because it’s still in beta. I’m not sure.
  
  3. Negative number of connecting edges between clusters
  This usually points to a mistake in how the comparison spreadsheet was set up, rather than a problem in Gephi itself.
  
  My guess is that when you calculated the number of edges connecting two clusters, you may have accidentally included internal edges (within a cluster) as negatives, or possibly subtracted the wrong value.
  
  Here’s what to double-check:
  
  Make sure your spreadsheet formula is:
  
  Counting the number of edges where the source node is in one cluster and the target is in the other.
  
  Not double-counting or subtracting total edges improperly.
  
  A negative result usually happens when you’re doing something like:
  
  edges_between = total_edges – edges_in_cluster_A – edges_in_cluster_B
  …which can go wrong quickly if the totals are off.
  
  Try filtering your edges in Gephi to show only those where the source and target belong to different clusters and count those directly. You can also export the edge table and do your counts manually or with a script.
  
  Loading...
Grace
· Reply

May 30, 2025 at 2:54 AM

Hi Nicole, Thank you so much for the tutorials on Gephi; they were good. I am new to Gephi, and I was wondering if you have any written piece on how to understand Data Laboratory in Gephi. I am interested in how to identify the colour palettes , the percentages ascribed to each colour and its subsequent explanations.

Loading...
- Nicole Elder Dyer
  · Reply
  
  Author
  
  June 7, 2025 at 7:37 AM
  
  1. What do the colors mean in Gephi?
  In most DNA network graph workflows (like the one in my tutorial), colors are assigned after running the Modularity algorithm. This algorithm detects clusters or communities—groups of matches that are more connected to each other than to others. When you run Modularity:
  
  Gephi creates a new column in the Nodes table (Data Laboratory) called Modularity Class.
  
  Each value (e.g., 0, 1, 2, …) represents a cluster, and Gephi automatically assigns a color to each one.
  
  These colors are not labeled by default, so they’re just visual cues—until you assign meaning based on your genealogy research.
  
  2. Where are the color palettes stored or displayed?
  Colors aren’t shown directly in the Data Laboratory. They’re applied visually in the Overview tab.
  
  However:
  
  The Modularity Class numbers in the Data Laboratory correspond to the colors shown in the Overview tab.
  
  You can see which color belongs to which class by hovering over the nodes in Overview, or by selecting a node and checking the “Appearance” > Nodes > Color settings.
  
  To manually see or change the colors:
  
  Go to Appearance (upper left panel in Overview).
  
  Choose Nodes > Color.
  
  Select “Partition” and pick Modularity Class.
  
  You’ll see the list of clusters (0, 1, 2…) with assigned colors. You can manually edit colors here too.
  
  Loading...