Exciting Innovations at the BYU Record Linking Lab

Joseph P. Price, a professor of economics at BYU, presented at Education Week about his innovative Record Linking Lab. According to the lab’s website, its purpose is to improve the quality and coverage of the FamilySearch Family Tree by developing tools that link families and individuals across records. BYU students and academic researchers assist in this effort, and the lab also works with FamilySearch.org. Their big ideas include automated indexing, extraction of genealogy data from digitized books, and linking people with “sparse data,” like those in yearbooks or books listing those who died in WWI.

The Facebook page for the Record Linking Lab is updated regularly with news and information about their projects. It also contains links to each of the four presentations Dr. Price gave at Education Week. I attended class 3, “Exciting New Sources of Data for Family History.” He asked us to think about all the papers created during our lives that carry our information: records, forms, bank slips, school records, DMV records, sports forms, medical records, insurance records, etc.

He then talked about auto-indexing of records. This is a big idea that genealogists have been following for the past several years as machine learning has grown. Dr. Price mentioned that the work of volunteers who do FamilySearch indexing can provide the training data needed to teach computers to auto-index similar records. One obstacle to indexing in foreign languages is that there aren’t enough volunteer indexers who read the language to complete the projects in a reasonable amount of time. The idea here is to have the foreign-language speakers on a particular project do enough indexing to create training data, then let the computer auto-index the rest. Machine-printed forms provide a good basis for auto-indexing, since the printed fields are easy for the machine to identify. Many fields on the 1930 and 1940 census records have not been indexed yet, and this is a place for auto-indexing to start. For example, the veteran status on the 1930 census has not been indexed at FamilySearch.
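To make the "index a sample, then let the computer do the rest" idea concrete, here is a minimal sketch in Python. It is not the lab's method. It assumes each form field has already been turned into a short list of numbers (the two-number feature vectors below are invented), and it uses a simple nearest-centroid classifier as a stand-in for a real handwriting model. The function names and the `min_margin` threshold are hypothetical.

```python
from collections import defaultdict
from math import dist


def train_centroids(labeled):
    """Average the feature vectors for each label that volunteers assigned."""
    sums = {}
    counts = defaultdict(int)
    for features, label in labeled:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def auto_index(features, centroids, min_margin=0.2):
    """Return (label, needs_review) for one unindexed field.

    The label is the closest centroid. The field is sent back to a human
    when the two closest centroids are almost equally far away.
    """
    ranked = sorted(centroids, key=lambda label: dist(features, centroids[label]))
    best = ranked[0]
    if len(ranked) == 1:
        return best, False
    d_best = dist(features, centroids[ranked[0]])
    d_next = dist(features, centroids[ranked[1]])
    return best, (d_next - d_best) < min_margin


# Invented training data: a veteran-status field indexed by volunteers.
volunteer_indexed = [
    ([0.9, 0.1], "Yes"),
    ([0.8, 0.2], "Yes"),
    ([0.1, 0.9], "No"),
    ([0.2, 0.8], "No"),
]
centroids = train_centroids(volunteer_indexed)

clear_case = auto_index([0.85, 0.15], centroids)   # close to the "Yes" examples
unclear_case = auto_index([0.5, 0.5], centroids)   # halfway between both labels
print(clear_case, unclear_case)
```

The review flag reflects one possible design choice: the computer handles the easy fields, and the scarce volunteers who read the language only see the ones it is unsure about.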

I have been following Transkribus for the past two years since I learned about them at RootsTech. They are on the cutting edge of handwritten text recognition – using machine learning to teach computers how to recognize common fields on handwritten documents and transcribe them.

In Dr. Price’s opinion, these innovations will eventually take over indexing, so that volunteers can focus their efforts on tasks that require a human, like record linking. One opportunity for auto-indexing that he shared is to re-index the 1850-1940 United States Federal Census records and fix errors in fields like gender, where over 100,000 people were indexed with the wrong gender.
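One simple way a computer might find gender errors like these is to look for records whose indexed gender disagrees with nearly everyone else who shares the same given name. The sketch below is only an illustration of that idea, not something described in the presentation. The record layout, the function name, and both thresholds are assumptions.

```python
from collections import Counter, defaultdict


def flag_suspect_genders(records, min_count=3, min_share=0.9):
    """Return records whose gender differs from the usual one for their name.

    A name is only used as evidence when it appears at least `min_count`
    times and one gender accounts for at least `min_share` of those records.
    """
    by_name = defaultdict(Counter)
    for rec in records:
        by_name[rec["given_name"].lower()][rec["gender"]] += 1

    suspects = []
    for rec in records:
        counts = by_name[rec["given_name"].lower()]
        total = sum(counts.values())
        common_gender, common_count = counts.most_common(1)[0]
        if (total >= min_count
                and common_count / total >= min_share
                and rec["gender"] != common_gender):
            suspects.append(rec)
    return suspects


# Invented index entries: 19 women named Mary, 1 Mary indexed as male,
# and a rare name with too few records to judge.
records = [{"id": i, "given_name": "Mary", "gender": "F"} for i in range(19)]
records.append({"id": 19, "given_name": "Mary", "gender": "M"})
records.append({"id": 20, "given_name": "Marion", "gender": "M"})

suspects = flag_suspect_genders(records)
print([rec["id"] for rec in suspects])
```

A flagged record is a candidate for review rather than an automatic correction, since some names were used for both men and women.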

Dr. Price went on to discuss extracting genealogical data from digitized books at Internet Archive. Text recognized from the books with OCR could then be linked to people in the FamilySearch Family Tree. He also talked about linking records with sparse data, including those of soldiers who died in WWI, to Family Tree. His hope is that many of these soldiers, who died without descendants, will not be forgotten.
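Linking a sparse record is hard because there are so few fields to compare. As a rough sketch of how it might work, the Python below compares a name, a birth year, and a state against a list of tree entries and keeps the close matches. The fields, the thresholds, the people, and the IDs are all invented for illustration, and the name comparison uses Python's standard `difflib` module rather than whatever the lab actually uses.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Score two names from 0.0 (nothing in common) to 1.0 (identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_matches(sparse, tree_people, min_name_score=0.85, max_year_gap=2):
    """Return (score, id) pairs for tree people who could be the same person."""
    matches = []
    for person in tree_people:
        score = name_similarity(sparse["name"], person["name"])
        year_gap = abs(sparse["birth_year"] - person["birth_year"])
        same_state = sparse["state"] == person["state"]
        if score >= min_name_score and year_gap <= max_year_gap and same_state:
            matches.append((score, person["id"]))
    return sorted(matches, reverse=True)  # best score first


# An invented entry from a book listing soldiers who died in WWI.
soldier = {"name": "John H. Miller", "birth_year": 1896, "state": "Utah"}

# Invented tree entries with placeholder IDs.
tree_people = [
    {"id": "P1", "name": "John H Miller", "birth_year": 1895, "state": "Utah"},
    {"id": "P2", "name": "Jonathan Mills", "birth_year": 1896, "state": "Utah"},
    {"id": "P3", "name": "John H. Miller", "birth_year": 1870, "state": "Utah"},
]

matches = candidate_matches(soldier, tree_people)
print([person_id for _, person_id in matches])
```

With so little data, more than one candidate can survive, which is where a human volunteer would step in to decide.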

Another exciting innovation that Dr. Price mentioned is the possibility of using geocode matching to link families through census records. As boundaries and jurisdictions change, it can be difficult to know where individuals lived from census year to census year. Establishing exact geo-coordinates for a family in each census can provide a more accurate way of determining where they lived. Linking the geo-coordinates of towns and enumeration districts could be a huge boon to linking census records and families.
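Once two census entries each have coordinates, comparing them comes down to measuring the distance between two points, no matter what the town or county was called that year. The sketch below uses the standard haversine formula for that distance. The household labels, the approximate coordinates, and the 10 km cutoff are assumptions chosen only to show the idea.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean radius of the Earth


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearby_households(household, candidates, max_km=10.0):
    """Return labels of candidate households within max_km of the given one."""
    return [
        c["label"]
        for c in candidates
        if haversine_km(household["lat"], household["lon"], c["lat"], c["lon"]) <= max_km
    ]


# Invented households with approximate coordinates for three Utah cities.
family_1930 = {"label": "1930 household", "lat": 40.2338, "lon": -111.6585}
candidates_1940 = [
    {"label": "1940 household A", "lat": 40.2969, "lon": -111.6946},  # a few km away
    {"label": "1940 household B", "lat": 40.7608, "lon": -111.8910},  # roughly 60 km away
]

close_by = nearby_households(family_1930, candidates_1940)
print(close_by)
```

Distance alone would not prove a match, but it can narrow the list of households worth comparing by name and age.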

To volunteer with the Record Linking Lab, email them at rll@byu.edu. They have many tasks for volunteers to help with.