IPED2: Inheritance Path based Pedigree Reconstruction Algorithm for Complicated Pedigrees

Our group, in an effort led by former UCLA PhD student Dan He, developed an algorithm for reconstructing pedigrees with genotype data. This novel approach is presented in a paper recently published in IEEE/ACM Transactions on Computational Biology and Bioinformatics.

Pedigree inference plays an important role in population genetics. Pedigrees, commonly known as family trees, represent genetic relationships between individuals of a family. A pedigree diagram provides a model to compute the inheritance probability for the observed genotype and encodes all possible inheritance options for an allele in an individual. Pedigree reconstruction methods face several challenges. First, there can be an exponential number of possible pedigree graphs, and, second, the number of unknown ancestors can become very large as the height of the pedigree increases.

Examples of sequentially labeling the half-sibling graph. For more information, see our paper.

Examples of sequentially labeling the half-sibling graph. For more information, see our paper.

Our project uses genotype data to reconstruct pedigrees with computational efficiency despite these challenges. Our previous method, IPED, is the only known algorithm scalable to large pedigrees with reasonable accuracy for cases involving both outbreeding and inbreeding. IPED starts from extant individuals and reconstructs the pedigree generation by generation backwards in time. For each generation, IPED predicts the pairwise relationships between the individuals at the current generation and create parents for them according to their relationships.

Existing methods, including IPED, only consider pedigrees with simple structure; they cannot handle populations where, for example, two children share only one parent. To improve pedigree reconstruction when populations have complex structure, we proposed the novel method IPED2. Our approach uses a new statistical test to detect half-sibling relationships and a new graph-based algorithm to reconstruct the pedigree when half-siblings are allowed.

In order to test the performance of our method on complicated pedigrees, we use simulated pedigrees with different parameter settings and, instead of genotype data, we simulate haplotypes
directly. Our experiments show that IPED2 outperforms IPED and two other existing approaches for cases where there are half-siblings.

To our knowledge, this is the first method that can, using just genotype data, reconstruct pedigrees with half-siblings and inbreeding. IPED2 is also scalable to large pedigrees. In future work, we would like to consider additional genetic actions, such as insertion, deletion, and replacement, to resolve the conflicts. We also plan to refine IPED2 to consider cases where genotypes of ancestral individuals are known and where genotypes of extant individuals that are not on the lowest generations are known.

For more information, see our paper, which is available for download through Bioinformaticshttp://ieeexplore.ieee.org/abstract/document/7888513/.

In addition, the open source implementation of IPED2, which was developed by Dan He, is freely available for download at http://genetics.cs.ucla.edu/Dan/Software/IPED2.html.

The full citation to our paper is:
He, D., Wang, Z., Parida, L. and Eskin, E., 2017. IPED2: Inheritance path based pedigree reconstruction algorithm for complicated pedigrees. IEEE/ACM Transactions on Computational Biology and Bioinformatics.

Identifying genetic relatives without compromising privacy

Our DNA can tell us a lot about who our relatives are. Recently, several companies including 23andMe and AncestryDNA now provide services where they collect DNA from individuals and then match the DNA to a database of the DNA of other people to identify relatives. Relatives are then informed by the company that their DNAs match. Our lab was interested if we can perform this same type of service but without involving a company and more generally without involving any third party. One way to do this would be to have individuals obtain their own DNA sequences and then share their DNA sequences directly with each other. Unfortunately, DNA sequences are considered medical information and it is inappropriate to share them in this way.

Through a collaboration between our lab and the UCLA cryptography group, we recently published a paper that combines cryptography and genetics which describes an approach for identifying relatives without compromising privacy. Our paper was published in the April 2014 issue of Genome Research. The key ideas is that individuals release an encrypted version of their DNA information. Another individual can download this encrypted version and then use their own DNA information to try to decrypt it. If the are related to each other, their DNA sequences will be close enough that the decryption will work telling the individual that they are related. While if they are unrelated, the decryption will fail. What is important in this approach is that individuals who are not related do not obtain any information about each other’s DNA sequences.

The intuitive idea behind the approach is the following. Individuals each release a copy of their own genomes encrypted with a key that is based on the genome itself. Other users then download this encrypted information and try to decrypt it using their own genomes as the key. The encryption scheme is designed to allow for decryption if the encrypting key and decrypting key are “close enough”. Since related individuals share a portion of their genomes, we set the threshold for “close enough” to be exactly the threshold of relatedness that we want to detect.

Our approach uses a relatively new type of cryptographic technique called Fuzzy Extractors which were pioneered by our co-authors on this study, Amit Sahai and Rafail Ostrovsky. This type of technique allows for encryption and decryption with keys that match inexactly. Students in our group who were involved are Dan He, Nick Furlotte, Farhad Hormozdiari, and Jong Wha (Joanne) Joo. This research was supported by National Science Foundation grant 1065276.

The full citation of our paper is here:

Sorry, no publications matched your criteria.

Haplotype Phasing from Sequence Data

The haplotype phasing problem.

The classical haplotype phasing problem.

Over the past few years, our group has written several papers on inferring haplotypes from sequence data.

The problem of Haplotype Inference referred to as Haplotype Phasing has had a long history in computational genetics and the problem itself has had several incarnations.  Genotyping technologies obtain “genotype” information on SNPs which mixes the genetic information from both chromosomes.  However, many genetic analyses require “haplotype” information which is the genetic information on each chromosome (see Figure).

In the early days before reference datasets were available, methods would be applied to large numbers of genotyped individuals which would attempt to identify a small number of haplotypes which explained the majority of the individual genotypes.  Methods from this period include PHASE (11254454) and HAP (14988101) (from our group with Eran Halperin).  The figure is actually one of Eran’s slides from around 2002.

Once reference datasets such as the HapMap became available, imputation based methods such as IMPUTE(10.1038/ng2088) and BEAGLE(10.1016/j.ajhg.2009.01.005) dominated previous phasing approaches because they leveraged information from the carefully curated reference datasets.

In principal, haplotype phasing or imputation methods can be applied directly to sequencing data by first calling genotypes in the sequencing data and then applying a phasing or imputation approach.  However, since each read originates from only one chromosome, if a read spans two genotypes it provides some information on haplotype phase.  Combining these reads to construct haplotypes is referred to as the “haplotypes assembly” problem which was pioneered by   Vikas Bansal and Vineet Bafna(10.1093/bioinformatics/btn298),(10.1101/gr.077065.108).  Dan He in our group developed an optimal method for haplotype assembly which guarantees finding the optimal solution for short reads and reduces the problem of haplotype assembly for longer reads to MaxSAT which finds the optimal solution for the vast majority of problem instances(10.1093/bioinformatics/btq215). More recently, others have developed methods that can discover optimal solutions for all problem instances(10.1093/bioinformatics/btt349). In his paper, Dan also showed that haplotype assembly will always underperform traditional phasing methods for short read sequencing data because too few of the reads span multiple genotypes.

To overcome this issue, Dan extended his methods to jointly perform imputation and haplotype assembly(10.1089/cmb.2012.0091),(10.1016/j.gene.2012.11.093).  These methods outperformed both imputation methods and haplotype assembly methods but unfortunately are too slow and memory intensive to apply in practice.  More recently, in our group, Wen-Yun Yang, Zhanyong Wang, Farhad Hormozdiari with Bogdan Pasaniuc developed a sampling method which is both fast and accurate for combining haplotype assembly and imputation(10.1093/bioinformatics/btt386).

Full citations of our papers are here:

Sorry, no publications matched your criteria.

Bibliography