B.I.G. Summer in ZarLab

This summer, six young adults took part in a unique eight-week program with ZarLab, learning practical skills in genomics and bioinformatics while conducting research on large-scale human genetic datasets. Four of them were undergraduate students participating in the Bruins-In-Genomics (B.I.G.) Summer Program, an intensive laboratory and seminar program aimed at providing real-world experience for students who are interested in pursuing interdisciplinary graduate education in the quantitative and biological sciences. In addition, two Los Angeles-area high school students participated in laboratory activities as volunteer researchers.

Eleazar Eskin, co-organizer of the summer program, and Serghei Mangul, postdoctoral scholar, hosted the young scholars in ZarLab, a UCLA computational genetics group affiliated with both the Computer Science Department and the Human Genetics Department. Mangul supervised a group of students who collaborated on a project aimed at developing computational methods for the study of the human immune system and microbiome. Working with data from one of the largest sequencing projects in the world, the Genotype-Tissue Expression (GTEx) study, the students analyzed more than 8,000 samples obtained from 544 individuals and representing 53 different tissue types. In doing so, they became familiar with current approaches to studying how changes in our genes contribute to common human diseases.

During a poster session on August 12, 2016, the B.I.G. participants presented the results of their work on GTEx:

  • Jeremy Rotman: “Studying the microbiome by analyzing the coverage of sequencing reads mapped to viruses, eukaryotes, and bacteria”
  • Benjamin Statz: “An improved method for analysis of variable domain of B and T cell receptors”
  • William Van Der Wey: “Functional profiling of microbial communities across multiple human tissues”
  • Kevin Wesel: “Profiling repeat elements across multiple human tissues”

In addition to mentoring B.I.G. Program students in ZarLab, Mangul developed and presented a three-part series of workshops introducing students to UNIX earlier in the program.

Eskin and Mangul also hosted a B.I.G. Program student, Samantha Jenson, who collaborated with Jonathan Flint, a world-renowned authority on the genetics of depression and co-director of UCLA’s Depression Grand Challenge. This year, Eskin facilitated a Neurogenetics working group and weekly neurogenetics seminar series for the B.I.G. Program. Participants in this group gained first-hand experience in developing methods for mapping the underlying genetic causes of Major Depressive Disorder. Jenson presented her work on “Structural variant discovery in Major Depression Disorder” during the August 12th poster session.

The annual B.I.G. Program is a collaboration between multiple labs and includes next generation sequencing analysis workshops, weekly science talks by researchers, a weekly student journal club, professional development seminars, social activities, concluding poster sessions, and an optional GRE test prep course. Participants also benefited from relevant workshops and research talks presented during the UCLA Computational Genomics Summer Institute (CGSI).

Congratulations to Benjamin, Jeremy, Kevin, Samantha, and William on their acceptance to and success in the B.I.G. Summer Program!

We thank the following generous institutions that made this year’s B.I.G. Summer Program a big success:

  • National Institutes of Health grant MH109172
  • UCOP for a UC-HBCU partnership Program in Genomics and Systems
  • NIH NIBIB for NGS Data Analysis Skills for the Biosciences Pipeline R25EB022364
  • NIH NIMH for Undergraduate Research Experience in Neuropsychiatric Genomics R25MH109172-01

Learn more about the B.I.G. Program:
UCLA Newsroom: UCLA hosts summer program for future biosciences leaders

Serghei Mangul’s Introduction to UNIX Workshops

We present three video recordings of workshops that ZarLab postdoctoral scholar Serghei Mangul developed under the UCLA Institute for Quantitative and Computational Biosciences Collaboratory and delivered to Bruins-In-Genomics (B.I.G.) SUMMER participants. B.I.G. SUMMER is an intensive, practical experience in genomics and bioinformatics for undergraduate students who are interested in integrating quantitative and biological knowledge and considering pursuing graduate degrees in the biological, biomedical, or health sciences.

An important question for undergraduates considering careers in the biosciences is whether biologists need to develop robust programming skills. Biology students without backgrounds in computer science are often intimidated by tools that require writing code or working in environments that lack a graphical interface, such as Unix, R, SAS, and Python.


“Becoming a programmer” may seem daunting to many students in biology, but an ability to analyze sequencing data represents a competitive advantage in today’s age of big data and next generation sequencing. By gaining familiarity with Unix, these students may find it easier to engage with other applications and programming languages commonly used in computational biology. In order to use Unix effectively, students must learn how to directly enter functional commands line-by-line into a workbench that manages multiple platforms and a unified filesystem—without the familiar aid of a graphical interface.

In this three-part series of workshops, Dr. Mangul provides just enough information for students with no computational background to get started using Unix for analytical tasks. These workshops aim to help participants learn key commands and develop fundamental skills, such as connecting to a cluster and writing and submitting basic shell scripts to it.

Slides and more information about the workshop are available at the following webpage:

Introduction to UNIX 1/3

Introduction to UNIX 2/3

Introduction to UNIX 3/3


Imputing Phenotypes for Genome-wide Association Studies


Pairwise correlation between each phenotype pair in the NFBC dataset.

In genome-wide association studies (GWAS), investigators identify variants that are significantly associated with a phenotype by collecting genotypes and phenotypes from a set of individuals and performing statistical tests on them. Recently, GWAS samples have increased in size to include tens or hundreds of thousands of individuals. Studies working with such large datasets have discovered hundreds of variants involved in multiple common diseases (Schunkert et al. 2011; Voight et al. 2010). For the most part, identified variants have very small effect sizes, suggesting that larger association studies are capable of implicating more variants.

Increasing the size of GWAS samples is a shared goal among bioinformatics researchers. Unfortunately, some phenotypes are either logistically difficult or very expensive to collect. For these phenotypes, it is impractical to perform GWAS with tens or hundreds of thousands of individuals. Examples of these difficult-to-collect phenotypes include those that require obtaining an inaccessible tissue (such as brain expression), those that require a complex intervention (such as a response to diet), and those that were not measured in the original cohort and would require re-contacting individuals. For these phenotypes, an investigator finds it difficult to collect samples large enough to discover variants with small effect sizes. As a result, it is unlikely that GWAS will perform effectively on these phenotypes.

To address this issue, we developed a novel approach we call phenotype imputation. In our method, we estimate and leverage the correlation structure between multiple phenotypes to impute the uncollected phenotype. A paper presenting our approach was accepted by and is in press with the American Journal of Human Genetics.

In order to leverage the correlation structure between multiple phenotypes, we first estimate the correlation structure from a complete dataset that includes all phenotypes. We then use the conditional distribution based on the multivariate normal (MVN) statistical framework to impute the uncollected phenotypes in an incomplete dataset. Our approach uses only phenotypic—not genetic—information, enabling subsequent use of these imputed phenotypes for association testing without incurring data re-use. For GWAS including both complete and incomplete datasets, we provide an optimal meta-analysis strategy that accounts for imputation uncertainties by combining association results from both collected and imputed phenotypes. Further, our paper demonstrates that phenotype imputation can be performed using summary statistics. This result makes our method applicable to datasets where we only have access to the summary statistics and not the raw genotypes and phenotypes.
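The conditional-distribution step described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name impute_phenotype, the array layout, and the simulated data are all assumptions made for illustration. It estimates the mean and covariance from a complete dataset, then imputes one uncollected phenotype as its conditional mean given the collected ones, using the standard MVN identity E[y_t | y_o] = mu_t + Sigma_to Sigma_oo^{-1} (y_o - mu_o).

```python
import numpy as np


def impute_phenotype(complete, observed, target):
    """Impute one missing phenotype as its MVN conditional mean.

    complete: (n, p) array in which all p phenotypes were collected.
    observed: (m, p - 1) array with the same phenotypes, in the same
              column order, minus the `target` column.
    target:   column index (in `complete`) of the uncollected phenotype.
    """
    # Estimate the correlation structure from the complete dataset.
    mu = complete.mean(axis=0)
    sigma = np.cov(complete, rowvar=False)

    obs_idx = [j for j in range(complete.shape[1]) if j != target]
    sigma_oo = sigma[np.ix_(obs_idx, obs_idx)]  # covariance among collected phenotypes
    sigma_to = sigma[target, obs_idx]           # covariance of target with collected ones

    # Conditional mean: mu_t + Sigma_to Sigma_oo^{-1} (y_o - mu_o)
    weights = np.linalg.solve(sigma_oo, sigma_to)
    return mu[target] + (observed - mu[obs_idx]) @ weights


if __name__ == "__main__":
    # Hypothetical simulated data: three correlated phenotypes.
    rng = np.random.default_rng(0)
    cov = np.array([[1.0, 0.6, 0.5],
                    [0.6, 1.0, 0.4],
                    [0.5, 0.4, 1.0]])
    complete = rng.multivariate_normal(np.zeros(3), cov, size=2000)
    incomplete = rng.multivariate_normal(np.zeros(3), cov, size=500)

    truth = incomplete[:, 2]  # pretend phenotype 2 was never collected
    imputed = impute_phenotype(complete, incomplete[:, :2], target=2)
    print("correlation between imputed and true values:",
          np.corrcoef(imputed, truth)[0, 1])
```

Note that only phenotype values enter the calculation; no genotypes are used, which matches the point above that the imputed phenotypes can then be tested for association without re-using genetic data.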

In our forthcoming AJHG paper, we use the Northern Finland Birth Cohort (NFBC) data to assess the performance of our novel method. The NFBC dataset consists of 10 phenotypes collected from 5,327 individuals. The 10 phenotypes are triglycerides (TG), high-density lipoproteins (HDL), low-density lipoproteins (LDL), glucose (GLU), insulin (INS), body mass index (BMI), C-reactive protein (CRP) as a measure of inflammation, systolic blood pressure (SBP), diastolic blood pressure (DBP), and height. The genotype data consists of 331,476 SNPs.

Imputing the TG, BMI, and SBP phenotypes enables us to recover most of the significantly associated loci in the original data at the nominal significance level, as shown in the above figure. This result demonstrates that the imputed phenotype can effectively be used for replication purposes, even though it might not provide sufficient power for discovery purposes due to imputation uncertainties.

Because of our parametric assumptions, our approach gives us the exact distribution of the imputed phenotype. We can directly use the mean of this distribution as the imputed value. Furthermore, we use the variance of the missing phenotype in our analysis of statistical power. The primary advantage of our framework is that it increases the power of GWAS on phenotypes that are difficult to collect. We also provide an analytical power computation that allows investigators to determine prospectively the benefit of imputation for a given dataset. Another advantage of this method is that it allows the use of summary statistics when the raw genotypes are not available.
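To make the role of the mean and variance more concrete, the following Python sketch computes the conditional variance of an uncollected phenotype under the MVN assumption, Sigma_tt - Sigma_to Sigma_oo^{-1} Sigma_ot, together with the fraction of the phenotype's variance that the collected phenotypes explain. The function name imputation_uncertainty and the example covariance matrix are hypothetical, and this is only a way to get a feel for imputation uncertainty; it is not the analytical power computation from the paper.

```python
import numpy as np


def imputation_uncertainty(sigma, target):
    """Return (conditional variance, fraction of variance explained).

    sigma:  (p, p) covariance matrix of all phenotypes.
    target: index of the uncollected phenotype.
    """
    obs_idx = [j for j in range(sigma.shape[0]) if j != target]
    sigma_oo = sigma[np.ix_(obs_idx, obs_idx)]
    sigma_to = sigma[target, obs_idx]

    # Variance of the target captured by the collected phenotypes.
    explained = sigma_to @ np.linalg.solve(sigma_oo, sigma_to)
    # Remaining (conditional) variance: Sigma_tt - Sigma_to Sigma_oo^{-1} Sigma_ot
    cond_var = sigma[target, target] - explained
    return cond_var, explained / sigma[target, target]


if __name__ == "__main__":
    # Hypothetical covariance among three phenotypes.
    sigma = np.array([[1.0, 0.6, 0.5],
                      [0.6, 1.0, 0.4],
                      [0.5, 0.4, 1.0]])
    cond_var, frac = imputation_uncertainty(sigma, target=2)
    print("conditional variance:", cond_var)
    print("fraction of variance explained:", frac)
```

The more strongly the uncollected phenotype correlates with the collected ones, the smaller the conditional variance and the more informative the imputed values are expected to be.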

This project was led by Farhad Hormozdiari and involved Michael Bilow. The article is available at: http://dx.doi.org/10.1016/j.ajhg.2016.04.013.

The full citation to our paper is:

Hormozdiari, Farhad; Kang, Eun Yong; Bilow, Michael; Ben-David, Eyal; Vulpe, Chris; McLachlan, Stela; Lusis, Aldons; Han, Buhm; Eskin, Eleazar (2016): Imputing Phenotypes for Genome-wide Association Studies. In: American Journal of Human Genetics, in press, 2016.