Would you like an efficient method to find clusters of DNA matches relevant to your research subject? In this series, I’m sharing the steps to create a network graph using the free, open source Gephi application, available for Windows, Mac, and Linux. I use Gephi to create network graphs of my AncestryDNA matches, but you can use matches from other companies as well. Throughout this series, I will be using my own matches from AncestryDNA, but I have changed their names for privacy. To see the first step in this tutorial, go here: Creating Gephi Network Graphs Part 1: Gather Matches and Prepare Spreadsheets.
This article goes over downloading Gephi, a tour of the Gephi application, naming your file, importing the spreadsheets, and running graph layouts.
Download and Update Gephi
Gephi is an open source software program for visualizing graphs and networks. It runs on Windows, Mac OS X and Linux.
Gephi is free to download and use, but it may have bugs and may not behave perfectly. Whenever I have an issue with Gephi, I search online for others who have had the same problem and can usually find ideas for what to do in online forums. Here’s the Gephi Support page: https://gephi.org/users/support/.
To download Gephi, go to this URL: https://gephi.org/.
Open the program and check for updates by going to Help > Check for Updates. Gephi should also notify you when it opens if updates are available. If there are any updates, install them so you have the latest version.
Tour of Gephi
As I show you what Gephi looks like, keep in mind that I’m using a Windows computer. If you’re using a Mac, it will look slightly different.
Within Gephi, there are three tabs along the top banner under the top menu: overview, data laboratory, and preview.
- Overview tab: lay out and view the network graph
- Data laboratory tab: table with data about the nodes
- Preview tab: options for exporting your graph as a PDF or an image file
In this post, we’ll be working in the overview tab. Within the overview tab, there is a left panel and a right panel with various modules. Within the left panel is the appearance module and the layout module.
- Appearance module: change the size and color of the nodes and edges
- Layout module: change the layout of the network graph, including how close the nodes are to each other and which type of layout your graph will have
The right panel includes the context module, filters module, and statistics module.
- Context module: see how many nodes (DNA matches) and edges (shared matches) you have in the graph
- Filters module: filter to a subset of DNA matches, e.g. a smaller range of matches, matches with no shared matches, etc.
- Statistics module: run the modularity report that separates the DNA matches into communities, AKA clusters
There is also a hidden panel at the bottom of the graph panel. You can view it by clicking the arrow at the bottom right.
The bottom panel includes global, edges, and labels tabs.
- Global tab: choose the graph’s background color and other settings
- Edges tab: choose whether you want to show the edges, edge thickness, and other settings
- Labels tab: turn labels for the nodes and edges on and off; select whether you want the label to be the amount of shared cM, the name of the match, your notes about the DNA match, or some combination of the data about the nodes/DNA matches
Now that you know some of the basics of the Gephi program, we are ready to get started on your first network graph project.
Name Your File
Before importing spreadsheets, give your new project a name. Open Gephi and click “New Project.” Click File > Save As, and then give your network graph project a descriptive file name. I usually include the following details:
- Name of the person whose DNA matches I’m using
- Name of the company where the DNA matches are found (usually I’m using AncestryDNA)
- The range of shared cM that I’ll be using for the initial graph (you may narrow it down further as you work within the graph)
Here’s what my file names usually look like: Nicole AncestryDNA Network Graph 50-400cM.Gephi
I have a folder in my file system called network graphs and sub-folders for each graph that I create with the associated spreadsheets and graph files in them.
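If you create a lot of graphs, a small helper can keep the names consistent. This is only a sketch of the naming pattern above; the function name and parameters are my own invention, and I’ve written the extension in lowercase here, so adjust it if your files look different.

```python
def graph_file_name(person, company, min_cm, max_cm):
    """Build a project file name from the three details listed above."""
    # Pattern: <person> <company> Network Graph <min>-<max>cM.gephi
    return f"{person} {company} Network Graph {min_cm}-{max_cm}cM.gephi"


print(graph_file_name("Nicole", "AncestryDNA", 50, 400))
```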
Import ICW Spreadsheet
In the last post in this series, I showed how to download the match and in common with (ICW) spreadsheets using DNAGedcom Client. See “Creating Gephi Network Graphs Part 1: Gather Matches and Prepare Spreadsheets.” At the end of that tutorial, you should have the match and ICW files with the headers adjusted to be ready to import into Gephi.
We are now going to import the ICW spreadsheet. To do this, go to File > Import Spreadsheet.
Choose the ICW csv file that you prepared and click open. In the import settings dialog box that opens, you can see that Gephi has detected this spreadsheet as an edges file. Click next, then click finish.
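If Gephi does not detect your file as an edges table, the header row is the first thing to check. The sketch below is not Gephi’s own detection logic. It assumes the header changes from Part 1 left the ICW file with columns named Source and Target and the match file with a column named Id, and it only looks at the first row of the file. The function name and sample data are made up for illustration.

```python
import csv
import io


def detect_table_type(csv_text):
    """Guess whether a spreadsheet looks like an edges or a nodes table."""
    header = next(csv.reader(io.StringIO(csv_text)))
    names = {name.strip().lower() for name in header}
    # Assumption: an edges table has both Source and Target columns.
    if {"source", "target"} <= names:
        return "edges"
    # Assumption: a nodes table has an Id column.
    if "id" in names:
        return "nodes"
    return "unknown"


sample_icw = "Source,Target\n101,102\n101,103\n"
print(detect_table_type(sample_icw))
```

If the result is "unknown" for your own file, go back to the header step in Part 1 before importing.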
Note: I am using my own matches from AncestryDNA, but I have changed their names for privacy. All names you see in the spreadsheets and screenshots in this series were generated using Random Name Generator.
Next, you will see a dialog box that reports the completion of importing the ICW file. In this box, you will change two things. First, in the graph type dropdown box, choose “undirected.” Second, click the radio button for “append to existing workspace” instead of “new workspace.” Then click “OK.”
You should see black nodes and edges in a clump in the center of the graph.
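To see why “undirected” matters, here is a rough sketch of what ignoring direction does to the counts. It assumes the ICW file may list the same pair of matches twice, once in each direction; your file may or may not do that. The function name and the ID values are invented for the example, and Gephi’s own handling of repeated edges may differ in the details.

```python
def undirected_summary(pairs):
    """Count distinct nodes and edges when direction is ignored."""
    # A frozenset treats (A, B) and (B, A) as the same edge.
    edges = {frozenset(pair) for pair in pairs if pair[0] != pair[1]}
    nodes = {node for pair in pairs for node in pair}
    return len(nodes), len(edges)


rows = [("101", "102"), ("102", "101"), ("101", "103")]
print(undirected_summary(rows))  # prints (3, 2)
```

You can compare numbers like these against the node and edge counts shown in the context module.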
Run Layouts
In the Layout module in the left side panel, click on the dropdown box that says “choose a layout.” Select OpenOrd and click “Run.”
Next, go to the layout dropdown box again and select “ForceAtlas 2.” Under tuning, change the scaling to 1000 and check the box for stronger gravity. Click “Run.” When the clusters have formed, click “Stop.” One additional tweak that will help your labels be more visible later is to check the box for “prevent overlap” and click “Run” again. This will spread out the nodes a bit.
Now you should have a graph with visible clusters.
Import Matches Spreadsheet
In order to know which nodes belong to which DNA matches, you need to associate the ID numbers in the ICW spreadsheet with the names of matches along with the details from the matches spreadsheet (shared cM, notes, etc.). These details will come in handy as you use your graph.
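Conceptually, this import is a lookup of each ID in the ICW file against the matches file. Here is a minimal sketch of that idea outside of Gephi. The column names Id, name, Source, and Target are assumptions about how your prepared files are labeled, and the function and sample rows are invented for illustration.

```python
import csv
import io


def name_edges(matches_csv, icw_csv):
    """Swap the IDs on each edge for match names; collect IDs with no match row."""
    names = {row["Id"]: row["name"]
             for row in csv.DictReader(io.StringIO(matches_csv))}
    named, missing = [], set()
    for row in csv.DictReader(io.StringIO(icw_csv)):
        ends = (row["Source"], row["Target"])
        missing.update(end for end in ends if end not in names)
        # Keep the raw ID when no name is available for it.
        named.append(tuple(names.get(end, end) for end in ends))
    return named, missing


matches = "Id,name,sharedCM\n101,Ann Lee,212\n102,Bo Ray,95\n"
icw = "Source,Target\n101,102\n101,103\n"
edges, unmatched = name_edges(matches, icw)
print(edges)
print(unmatched)
```

Any IDs that end up in the unmatched set would show up in the graph as nodes without a name or shared cM.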
Click File > Import Spreadsheet. This time, select the “m” file or matches spreadsheet – be sure that this is the file you prepared by changing the column header. The file should be a csv file.
In the import spreadsheet dialog box that appears, you’ll see that Gephi detects that it’s a nodes table.
Click Next. When you get to Import Settings, you have the option to deselect some of the columns. I typically uncheck the columns that aren’t useful to me, such as the following from an AncestryDNA match spreadsheet:
- starred
- private
- hint
- archived
- image URL
- scanned
- member since
- ethnic regions
- ethnic trace regions
You can bring in all the columns, but your data laboratory won’t show them all at once. The data laboratory lets you choose which columns are displayed, so you can hide the ones you don’t need there instead.
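A third option is to trim the columns from the csv file before importing it. The sketch below is one way to do that, not part of Gephi. The column names in UNWANTED are taken from the list above, but the exact spelling in your file’s header may differ, so treat them as assumptions and adjust them to match. The function name and sample data are invented.

```python
import csv
import io

# Assumed header names, compared case-insensitively.
UNWANTED = {"starred", "private", "hint", "archived", "image url",
            "scanned", "member since", "ethnic regions",
            "ethnic trace regions"}


def drop_columns(csv_text, unwanted=UNWANTED):
    """Return the csv text with the unwanted columns removed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    keep = [name for name in reader.fieldnames
            if name.strip().lower() not in unwanted]
    out = io.StringIO()
    # extrasaction="ignore" silently skips the columns we are dropping.
    writer = csv.DictWriter(out, fieldnames=keep,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(reader)
    return out.getvalue()


sample = "Id,name,starred,hint\n101,Ann Lee,False,True\n"
print(drop_columns(sample))
```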
Click “Finish.” In the import report that comes next, change the graph type to “undirected” and click the radio button for “append to existing workspace.” Then click “OK.”
Results
Now that you have associated the nodes in your graph with the IDs for DNA matches in the match spreadsheet, you can find out who is in your graph.
First, click on the data laboratory tab at the top (in between the overview and preview tabs). You will see a list of all your nodes. Each row is one of the nodes/DNA matches. If you don’t look at the data laboratory, you won’t be able to do the next step.
Zoom in using the wheel on your mouse so you can get close enough to accurately click on a node. If you aren’t close enough, the options will be grayed out when you right click. Hover over one of the nodes with your pointer. You will see that the rest of the graph is now grayed out except for the selected node and connected nodes. Right click on the node (command + click on a Mac), then choose “select in data laboratory.” This doesn’t automatically take you to the data lab; you will have to go there manually.
Now when you go to the data laboratory, you will see that one of the rows is highlighted.
Next – Go to Part 3
Part three will focus on adjusting the network graph and using the tools in Gephi to find communities (clusters), add color to the communities, change the node size, and more. Go here: Creating Gephi Network Graphs Part 3: Adjusting the Network Graph