Would you like an efficient method to find clusters of DNA matches relevant to your research subject? In this series, I’m sharing the steps to create a network graph using the free, open source Gephi application, available for Windows or Mac. I use Gephi to create network graphs of my AncestryDNA matches, but you can use matches from other companies as well. Throughout this series, I will be using my own matches from AncestryDNA, but I have changed their names for privacy. All names you see in the spreadsheets and screenshots in this post were generated using Random Name Generator.
I like working with AncestryDNA matches first because AncestryDNA has the largest database of matches. Also, many of the matches have trees, either linked to their results or unlinked. This helps to find common ancestral couples. When I perform professional research for clients, I often create a network graph of their AncestryDNA matches. This allows me to efficiently find the cluster of matches that descends from the ancestor they want to know more about. From there, I look for connected clusters that might help me answer their research question – usually to find the parents of that ancestor.
Would you like to try making a network graph? In this article, I will detail the steps you need to take before you use Gephi – downloading your matches with DNAGedcom and preparing the spreadsheets. In subsequent articles, I will share how to make the graph, adjust the settings to focus on relevant clusters, adjust the modularity, analyze the clusters, and more.
What is a network graph?
First, let’s go over the basics of network graphs for genetic genealogy. The nodes in a network graph represent your DNA matches. A line connecting two nodes indicates that those two matches are shared matches of each other. When a group of matches has many connecting lines to each other, they form a cluster. Matches in a cluster are likely related along a shared ancestral line. The MRCA (most recent common ancestor) couples they share with the test taker may not all be in the same generation, but they are typically on the same side of the family. A node (DNA match) in the graph might connect to several clusters. This could mean the match is a closer relative who connects to two or more clusters, each representing a more distant ancestral couple.
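To make the idea of nodes, lines, and clusters concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the names are invented, and it groups matches by simple connectedness (any chain of shared-match lines), which is a much cruder rule than the layout and modularity settings Gephi uses.

```python
from collections import defaultdict


def build_graph(shared_pairs):
    """Map each match to the set of matches it is connected to by a line."""
    graph = defaultdict(set)
    for a, b in shared_pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def connected_groups(graph):
    """Return groups of matches that are linked directly or indirectly."""
    seen = set()
    groups = []
    for start in graph:
        if start in seen:
            continue
        group = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            # Visit every neighbor that is not yet part of this group.
            stack.extend(graph[node] - group)
        seen |= group
        groups.append(group)
    return groups


# Hypothetical shared-match pairs (names invented for illustration).
pairs = [("Ann", "Bob"), ("Bob", "Cara"), ("Ann", "Cara"), ("Dev", "Eli")]
groups = connected_groups(build_graph(pairs))
```

In this toy data, Ann, Bob, and Cara all share matches with each other and form one group, while Dev and Eli form another. Real match lists are far more tangled, which is why a visual tool like Gephi is helpful.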
Set Up the DNAGedcom Client Application
In order to create a network graph, you need two spreadsheets: the match file and the icw (in common with) file. The match file is a list of your DNA matches and how much DNA they share with you. It also includes columns for the estimated relationship, number of shared segments, link to their tree, link to the match page, etc.
The icw file is a list of shared matches. Each row in the icw spreadsheet is one of your DNA matches and one of their shared matches. Since most of your matches have more than one shared match, the spreadsheet includes multiple rows for each match. In the icw spreadsheet below, you can see two rows of Sheila’s shared matches and nine rows of Janine’s shared matches.
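If you are comfortable with a little Python, the following sketch shows the one-row-per-pair layout in practice by counting how many rows each match has. It assumes the file still uses the original DNAGedcom column header "matchid"; the IDs in the sample are made up and much shorter than real Ancestry IDs.

```python
import csv
import io
from collections import Counter


def count_shared_matches(icw_file):
    """Count how many icw rows (shared matches) each match ID has."""
    reader = csv.DictReader(icw_file)
    return Counter(row["matchid"] for row in reader)


# Made-up sample data standing in for a real icw file.
sample = io.StringIO(
    "matchid,icwid\n"
    "A1,B2\n"
    "A1,C3\n"
    "B2,A1\n"
)
counts = count_shared_matches(sample)
```

Here match A1 appears in two rows because it has two shared matches. To run the same count on a real file, you could pass an opened file, for example `open(path, newline="", encoding="utf-8-sig")`, in place of the sample.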
To download these lists, you’ll need an application called the DNAGedcom Client, available for Windows or Mac. The DNAGedcom Client can help you “gather” or download lists of matches, lists of shared matches, and lists of ancestors in matches’ trees. The DNAGedcom Client can also create a cluster report with the Collins Leeds Method, create a chromosome matrix app clustering of matches, and more.
The client is separate from the tools available at the DNAGedcom website, which include the Autosomal DNA Segment Analyzer, JWorks, KWorks and GWorks. Learn more about these tools at DNAGedcom.com > Autosomal Tools or GWorks (Gedcom).
This help article at DNAGedcom is all about setting up the DNAGedcom Client on your computer: https://doc.dnagedcom.com/help/overview/.
In order to download and use the DNAGedcom Client application, you must first register for an account and subscribe. The silver subscription is $5 per month and the gold subscription is $10 per month.
Once you subscribe, you can download the DNAGedcom Client to your computer, install it, and open it. You’ll be prompted to log in to your DNAGedcom account. You’ll then see the home screen, which shows the folder where your “m” file and “icw” file (and all other files generated by the client) will be stored. To change the location, click the gray gear in the top right and select a different folder.
When you want to view your files, click the “open folder” button on this page to quickly get to them. My DNAGedcom folder looks like this:
Gathering the Match and ICW Files
To begin gathering (downloading) your matches and shared matches, click on the “Gather” tab at the top of the DNAGedcom Client. Choose the site you want to download your matches from. In my example, I chose Ancestry, and logged in using the web login option.
Next you will choose the profile. If you manage or have access to the DNA results of more than one person, you’ll have a dropdown list. I selected myself for this example.
After that, enter the cM range to limit the number of matches you download. The first time you download matches and make a network graph, I suggest using a range of 50-400 cM. If you try to download all your matches from Ancestry at once, it could take the DNAGedcom Client several days to gather all the matches and shared matches. Once you have entered a range, check the box for “Gather ICW.” Uncheck the boxes for “gather trees” and “gather ethnicity.” Finally, click “Gather DNA Data.”
DNAGedcom will give you an estimate for how long it will take to gather the matches, but the estimate is usually much longer than it actually takes when you’re downloading only matches from 50-400 cM. I have 219 matches in this range. If you have far fewer matches, consider expanding your range down to 40 or 35 cM. If you download all your matches, then the time estimate is more accurate.
While DNAGedcom Client is gathering your matches, it’s important that your computer not go to sleep. This is different from the screen turning off – which is fine. Make sure your computer settings are set to not go to sleep, because gathering the matches could take several hours. If your computer does go to sleep or is shut down, you can open the DNAGedcom Client again and enter the same parameters to restart the gathering process. The DNAGedcom database will pick up the gather where it left off. For tips on gathering your DNA matches from Ancestry and other companies, go to this DNAGedcom help article: Tips and Tactics for Gathering DNA Data.
When the gathering process is complete, you’ll see the message, “Creating Ancestry Reports Completed.”
This message means that the .csv files have been generated. When you go to the file location you specified on the home screen, you will find your “m” file and “icw” file. The naming convention is like this:
match file: m_test taker name.csv
icw file: icw_test taker name.csv
Change Column Headers in the Spreadsheets
The next step in preparing your spreadsheets is to open the “m” file and “icw” file and adjust a few of the column headers so Gephi will understand them.
My computer uses Windows and the default application for opening a csv file is Excel. I use Excel to edit the column headers, then save the files. When I do this, the file remains in the csv format. Some people I have helped, who are using a Mac computer, open the csv file with the Numbers application, which then saves the file in the numbers file format instead of csv. It’s important that you save the file in the csv format after you adjust the headers. In Numbers, you can export to a csv file if you need to.
Open the icw file.
- Find the column that says “matchid” and change it to “source”.
- Find the column that is labelled “icwid” and change it to “target”.
- Go to the last column, labelled “source” and change it to “company” or delete the entire column.
Each DNA match has their own unique ID from Ancestry which is a string of numbers and letters. The “source” and “target” columns tell Gephi which DNA matches (nodes) to connect with a line. In Gephi, lines connecting nodes are referred to as edges.
Next, click file > save as and put the file in a different folder. I set up a new folder outside of my DNAGedcom folder for network graphs. Within this network graphs folder, I have a folder for each test taker I’ve made network graphs for.
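If you would rather not edit the headers by hand, the same three changes to the icw file can be sketched in Python. This is an optional alternative, not part of the DNAGedcom or Gephi workflow itself. It assumes the file is UTF-8 encoded and that the original "source" column is the last column, as described above. The function names and paths are my own placeholders.

```python
import csv


def rename_icw_header(header):
    """Return a copy of the icw header row with the names Gephi expects."""
    renamed = list(header)
    # Rename the original trailing "source" column first, so it cannot be
    # confused with the new "source" column created from "matchid".
    if renamed and renamed[-1] == "source":
        renamed[-1] = "company"
    mapping = {"matchid": "source", "icwid": "target"}
    return [mapping.get(name, name) for name in renamed]


def rewrite_icw_file(in_path, out_path):
    """Copy the icw csv to out_path, changing only the header row."""
    # "utf-8-sig" reads the file whether or not it starts with a byte order mark.
    with open(in_path, newline="", encoding="utf-8-sig") as src:
        rows = list(csv.reader(src))
    if not rows:
        raise ValueError("the icw file is empty")
    rows[0] = rename_icw_header(rows[0])
    with open(out_path, "w", newline="", encoding="utf-8") as dst:
        csv.writer(dst).writerows(rows)
```

Calling `rewrite_icw_file` with the path to your original icw file and a new path in your network graphs folder leaves the original untouched, just like using save as. The output is always a csv file.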
Next, open the m file.
Find the column labelled “matchid.” Change this to be “ID”.
Click file > save as and save the file to your network graphs file folder.
Remember that after adjusting the column headers, both your “m” file and “icw” file should still be csv files (not Numbers or Excel files).
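As a final sanity check before moving on, a short Python sketch like the one below can confirm that each prepared file is still a csv and has the headers described above. The function names are placeholders of my own, and the check covers only the extension and the header row, not the rest of the data.

```python
import csv
from pathlib import Path


def header_problems(header, required_columns):
    """List missing or duplicated column names in a header row."""
    problems = [
        f"missing column {name!r}"
        for name in required_columns
        if name not in header
    ]
    duplicates = sorted({name for name in header if header.count(name) > 1})
    problems += [f"duplicate column {name!r}" for name in duplicates]
    return problems


def file_problems(path, required_columns):
    """Check one prepared file's extension and header row."""
    path = Path(path)
    if path.suffix.lower() != ".csv":
        return [f"extension is {path.suffix!r}, expected '.csv'"]
    with open(path, newline="", encoding="utf-8-sig") as f:
        # next(..., []) gives an empty header if the file has no rows.
        header = next(csv.reader(f), [])
    return header_problems(header, required_columns)
```

For the m file, the required column is "ID". For the icw file, the required columns are "source" and "target". An empty list means no problems were found. A duplicate "source" entry usually means the original last column was not renamed or deleted.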
Next – Go to Part 2
Great job! You’re ready to move on to the next part of this series, where we begin using Gephi. In the next post, I will walk you through downloading Gephi and importing the m and icw spreadsheets.