Our group publishes papers presenting new methodologies, describing the results of studies that use our software, and reviewing current topics in the field of bioinformatics. Scroll down for a complete list of papers produced by our lab. Since 2013, we have written blog posts summarizing new research papers and review articles:
GWAS
- Fine Mapping Causal Variants and Allelic Heterogeneity
- Widespread Allelic Heterogeneity in Complex Traits
- Selection in Europeans on Fatty Acid Desaturases Associated with Dietary Changes
- Incorporating prior information into association studies
- Characterization of Expression Quantitative Trait Loci in Pedigrees from Colombia and Costa Rica Ascertained for Bipolar Disorder
- Simultaneous modeling of disease status and clinical phenotypes to increase power in GWAS
- Efficient and accurate multiple-phenotype regression method for high dimensional data considering population structure
- Review Article: Population Structure in Genetic Studies: Confounding Factors and Mixed Models
- Colocalization of GWAS and eQTL Signals Detects Target Genes
- Chromosome conformation elucidates regulatory relationships in developing human brain
Mouse Genetics
- Review Article: The Hybrid Mouse Diversity Panel
- Genes, Environments and Meta-Analysis
- Review Article: Mixed Models and Population Structure
- Identifying Genes Involved in Blood Cell Traits
- Genes, Diet, and Body Weight (in Mice)
- Review Article: Mouse Genetics
Population Structure
- Efficient and accurate multiple-phenotype regression method for high dimensional data considering population structure
- Review Article: Population Structure in Genetic Studies: Confounding Factors and Mixed Models
- Accounting for Population Structure in Gene-by-Environment Interactions in Genome-Wide Association Studies Using Mixed Models
- Multiple testing correction in linear mixed models
- Identification of causal genes for complex traits (CAVIAR-gene)
- Accurate viral population assembly from ultra-deep sequencing data
- GRAT: Speeding up Expression Quantitative Trait Loci (eQTL) Studies
- Correcting Population Structure using Mixed Models Webcast
- Mixed models can correct for population structure for genomic regions under selection
Review Articles
- Review Article: Population Structure in Genetic Studies: Confounding Factors and Mixed Models
- Review Article: The Hybrid Mouse Diversity Panel
- Review Article: GWAS and Missing Heritability
- Review Article: Mixed Models and Population Structure
- Review Article: Mouse Genetics
Publications
2004
Eskin, Eleazar From profiles to patterns and back again: a branch and bound algorithm for finding near optimal motif profiles Conference RECOMB '04: Proceedings of the eighth annual international conference on Research in computational molecular biology, pp. 115-124, ACM, New York, NY, USA, 2004, ISBN: 1-58113-755-9. Link: http://dx.doi.org/10.1145/974614.974630 | Tags: Motif Finding
An important part of deciphering gene regulatory mechanisms is discovering transcription factor binding sites. In many cases, these sites can be detected because they are often overrepresented in genomic sequences. The detection of overrepresented signals in sequences, or motif finding, has become a central problem in computational biology. There are two major computational frameworks for attacking the motif finding problem, which differ in their representation of the signals. The most popular is the profile or PSSM (Position Specific Scoring Matrix) representation, where the goal is to obtain probabilistic representations of the overrepresented signals. The other is the consensus pattern (pattern with mismatches) representation, which represents a signal as a discrete consensus pattern and allows some mismatches to occur in each instance of the pattern. The advantage of profiles is the expressiveness of their representation, while the advantage of the consensus pattern approach is the existence of efficient algorithms that guarantee discovery of the best patterns. In this paper we present a unified framework for motif finding which encompasses both the profile representation and the consensus pattern representation. We prove that the problem of discovering the best profiles can be solved by considering a degenerate version of the problem of finding the best consensus patterns. The main advantage of our framework is that it motivates a novel algorithm, MITRA-PSSM, which discovers profiles, yet provides some of the guarantees of discovering the best signals. The algorithm searches for the best profiles with respect to information content, the same criterion used by popular algorithms such as MEME and CONSENSUS. MITRA-PSSM is specifically designed for searching for profiles in this framework and introduces a novel notion of scoring consensus patterns, discrete information content. MITRA-PSSM is available for public use via webserver at http://www.calit2.net/compbio/mitra/.
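The information content criterion mentioned above is simple to state concretely. Below is a minimal Python sketch that scores a profile against a uniform background; this is our own toy illustration, not the MITRA-PSSM implementation, and the example profile values are made up.

```python
import math

def information_content(profile, background=None):
    """Score a DNA profile (PSSM) by information content in bits.

    profile: list of columns, each a dict mapping 'A','C','G','T'
    to a probability (columns are assumed to sum to 1).
    Information content is the criterion that MEME-, CONSENSUS-,
    and MITRA-PSSM-style algorithms optimize.
    """
    if background is None:
        background = {b: 0.25 for b in "ACGT"}
    total = 0.0
    for column in profile:
        for base, p in column.items():
            if p > 0.0:
                total += p * math.log2(p / background[base])
    return total

# A toy 3-column profile strongly favoring the pattern "TAT".
profile = [
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
]
print(f"information content: {information_content(profile):.2f} bits")
```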
Halperin, Eran; Eskin, Eleazar Haplotype reconstruction from genotype data using Imperfect Phylogeny. Journal Article Bioinformatics, 20 (12), pp. 1842-9, 2004, ISSN: 1367-4803. Link: http://dx.doi.org/10.1093/bioinformatics/bth149 | Tags: Haplotype Phasing
Critical to the understanding of the genetic basis for complex diseases is the modeling of human variation. Most of this variation can be characterized by single nucleotide polymorphisms (SNPs), which are mutations at a single nucleotide position. To characterize the genetic variation between different people, we must determine an individual's haplotype, or which nucleotide base occurs at each position of these common SNPs for each chromosome. In this paper, we present results for a highly accurate method for haplotype resolution from genotype data. Our method leverages a new insight into the underlying structure of haplotypes that shows that SNPs are organized in highly correlated 'blocks'. In a few recent studies, considerable parts of the human genome were partitioned into blocks, such that the majority of the sequenced genotypes have one of about four common haplotypes in each block. Our method partitions the SNPs into blocks, and for each block, we predict the common haplotypes and each individual's haplotype. We evaluate our method over biological data. Our method predicts the common haplotypes perfectly and has a very low error rate (<2% over the data) when taking into account the predictions for the uncommon haplotypes. Our method is extremely efficient compared with previous methods such as PHASE and HAPLOTYPER. Its efficiency allows us to find the block partition of the haplotypes, to cope with missing data, and to work with large datasets. Availability: the algorithm is available via a Web server at http://www.calit2.net/compbio/hap/
Leslie, Christina S; Eskin, Eleazar; Cohen, Adiel; Weston, Jason; Noble, William Stafford Mismatch string kernels for discriminative protein classification. Journal Article Bioinformatics, 20 (4), pp. 467-76, 2004, ISSN: 1367-4803. Link: http://dx.doi.org/10.1093/bioinformatics/btg431
MOTIVATION: Classification of protein sequences into functional and structural families based on sequence homology is a central problem in computational biology. Discriminative supervised machine learning approaches provide good performance, but simplicity and computational efficiency of training and prediction are also important concerns. RESULTS: We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the problem of protein classification and remote homology detection. These kernels measure sequence similarity based on shared occurrences of fixed-length patterns in the data, allowing for mutations between patterns. Thus, the kernels provide a biologically well-motivated way to compare protein sequences without relying on family-based generative models such as hidden Markov models. We compute the kernels efficiently using a mismatch tree data structure, allowing us to calculate the contributions of all patterns occurring in the data in one pass while traversing the tree. When used with an SVM, the kernels enable fast prediction on test sequences. We report experiments on two benchmark SCOP datasets, where we show that the mismatch kernel used with an SVM classifier performs competitively with state-of-the-art methods for homology detection, particularly when very few training examples are available. Examination of the highest-weighted patterns learned by the SVM classifier recovers biologically important motifs in protein families and superfamilies.
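To make the (k,m)-mismatch idea concrete: each k-mer in a sequence contributes a count to every k-mer within Hamming distance m of it, and the kernel is the inner product of the resulting count vectors. The Python sketch below enumerates neighbors naively; it is our own illustration, practical only for tiny k and m, whereas the paper's mismatch tree computes the same quantity far more efficiently.

```python
from itertools import product

def mismatch_features(seq, k, m, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Map a sequence to its (k,m)-mismatch feature counts.

    Each k-mer in seq adds a count for every k-mer within Hamming
    distance m of it. Enumerating all |alphabet|^k neighbors is
    exponential in k; the paper avoids this with a mismatch tree.
    """
    counts = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        for neighbor in product(alphabet, repeat=k):
            mismatches = sum(a != b for a, b in zip(kmer, neighbor))
            if mismatches <= m:
                counts[neighbor] = counts.get(neighbor, 0) + 1
    return counts

def mismatch_kernel(x, y, k=3, m=1):
    """Kernel value = dot product of the two mismatch feature vectors."""
    fx, fy = mismatch_features(x, k, m), mismatch_features(y, k, m)
    return sum(v * fy.get(f, 0) for f, v in fx.items())

print(mismatch_kernel("ARNDCQ", "ARNECQ", k=3, m=1))
```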
Price, Alkes L; Eskin, Eleazar; Pevzner, Pavel A Whole-genome analysis of Alu repeat elements reveals complex evolutionary history. Journal Article Genome Res, 14 (11), pp. 2245-52, 2004, ISSN: 1088-9051. Link: http://dx.doi.org/10.1101/gr.2693004
Alu repeats are the most abundant family of repeats in the human genome, with over 1 million copies comprising 10% of the genome. They have been implicated in human genetic disease and in the enrichment of gene-rich segmental duplications in the human genome, and they form a rich fossil record of primate and human history. Alu repeat elements are believed to have arisen from the replication of a small number of source elements, whose evolution over time gives rise to the 31 Alu subfamilies currently reported in Repbase Update. We apply a novel method to identify and statistically validate 213 Alu subfamilies. We build an evolutionary tree of these subfamilies and conclude that the history of Alu evolution is more complex than previous studies had indicated.
2003
Noble, William Stafford; Weston, Jason; Schölkopf, Bernhard; Eskin, Eleazar; Leslie, Christina Dealing with large diagonals in kernel matrices Journal Article Annals of the Institute of Statistical Mathematics, pp. 391-408, Springer Netherlands, 2003, ISSN: 0020-3157 (Print) 1572-9052 (Online). Link: http://dx.doi.org/10.1007/BF02530507
In kernel methods, all the information about the training data is contained in the Gram matrix. If this matrix has large diagonal values, which arises for many types of kernels, then kernel methods do not perform well. We propose and test several methods for dealing with this problem by reducing the dynamic range of the matrix while preserving the positive definiteness of the Hessian of the quadratic programming problem that one has to solve when training a Support Vector Machine, which is a common kernel approach for pattern recognition.
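One simple recipe in the spirit of this line of work: compress the kernel values elementwise, then restore positive semi-definiteness by treating the transformed rows as feature vectors (an empirical kernel map). This is our own hedged sketch, not the paper's exact procedure; the exponent and the random data are illustrative assumptions.

```python
import numpy as np

def reduce_dynamic_range(K, p=0.5):
    """Shrink a kernel matrix's dynamic range, then restore PSD-ness.

    Applying |K_ij|^p elementwise (p < 1) compresses a dominant
    diagonal but can break positive definiteness, so each
    transformed row is treated as a feature vector and A @ A.T is
    returned, which is PSD by construction. A sketch in the spirit
    of the paper, not its exact method.
    """
    A = np.sign(K) * np.abs(K) ** p
    return A @ A.T

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
K = X @ X.T + 10.0 * np.eye(5)   # Gram matrix with a large diagonal
K2 = reduce_dynamic_range(K)
print(np.linalg.eigvalsh(K2).min() >= -1e-9)  # PSD check: True
```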
Eskin, Eleazar; Halperin, Eran; Karp, Richard M Efficient reconstruction of haplotype structure via perfect phylogeny. Journal Article J Bioinform Comput Biol, 1 (1), pp. 1-20, 2003, ISSN: 0219-7200. Link: https://www.ncbi.nlm.nih.gov/pubmed/15290779 | Tags: Haplotype Phasing
Each person's genome contains two copies of each chromosome, one inherited from the father and the other from the mother. A person's genotype specifies the pair of bases at each site, but does not specify which base occurs on which chromosome. The sequence of each chromosome separately is called a haplotype. The determination of the haplotypes within a population is essential for understanding genetic variation and the inheritance of complex diseases. The haplotype mapping project, a successor to the human genome project, seeks to determine the common haplotypes in the human population. Since experimental determination of a person's genotype is less expensive than determining its component haplotypes, algorithms are required for computing haplotypes from genotypes. Two observations aid in this process: first, the human genome contains short blocks within which only a few different haplotypes occur; second, as suggested by Gusfield, it is reasonable to assume that the haplotypes observed within a block have evolved according to a perfect phylogeny, in which at most one mutation event has occurred at any site and no recombination occurred in the given region. We present a simple and efficient polynomial-time algorithm for inferring haplotypes from the genotypes of a set of individuals assuming a perfect phylogeny. Using a reduction to 2-SAT, we extend this algorithm to handle constraints that apply when we have genotypes from both parents and child. We also present a hardness result for the problem of removing the minimum number of individuals from a population to ensure that the genotypes of the remaining individuals are consistent with a perfect phylogeny. Our algorithms have been tested on real data and give biologically meaningful results. Our webserver (http://www.cs.columbia.edu/compbio/hap/) is publicly available for predicting haplotypes from genotype data and partitioning genotype data into blocks.
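Why phasing needs constraints such as a perfect phylogeny: every heterozygous site doubles the number of haplotype pairs consistent with a genotype. The short Python sketch below enumerates that ambiguity; it is our own illustration, and the 0/1/2 genotype encoding is an assumed convention, not the paper's notation.

```python
from itertools import product

def consistent_haplotype_pairs(genotype):
    """Enumerate all haplotype pairs consistent with a genotype.

    Encoding (illustrative): at each site, 0 = homozygous 0/0,
    1 = homozygous 1/1, 2 = heterozygous. Each heterozygous site
    doubles the number of phasings (pairs below include both
    orderings), which is why structural constraints such as a
    perfect phylogeny are needed to pick the right one.
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 2]
    pairs = []
    for assignment in product([0, 1], repeat=len(het_sites)):
        h1, h2 = list(genotype), list(genotype)
        for site, allele in zip(het_sites, assignment):
            h1[site], h2[site] = allele, 1 - allele
        pairs.append((tuple(h1), tuple(h2)))
    return pairs

for h1, h2 in consistent_haplotype_pairs((0, 2, 1, 2)):
    print(h1, h2)
```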
Eskin, Eleazar; Keich, Uri; Gelfand, Mikhail S; Pevzner, Pavel A Genome-wide analysis of bacterial promoter regions. Journal Article Pac Symp Biocomput, pp. 29-40, 2003, ISSN: 1793-5091. Link: https://www.ncbi.nlm.nih.gov/pubmed/12603015 | Tags: Motif Finding
Identifying prokaryotic promoter sequences is notoriously difficult, and for most sequenced bacterial genomes the promoter sequences are still unknown. Since experimental analysis trails behind sequencing, genome-wide computational promoter discovery is often the only realistic way to discover these sequences in newly sequenced bacterial genomes. However, genome-wide samples for promoter discovery may be very large and corrupted, complicating promoter discovery. We discuss three aspects of genome-wide promoter discovery: sample generation, signal finding algorithms, and scoring signals. We applied our new MITRA algorithm to analyze samples of divergent and convergent genes in 20 bacterial genomes and found strong putative dyad signals in 17 out of the 20 genomes. Moreover, in 12 out of 20 genomes the signals found are identical or similar to the known regulatory patterns (Pribnow-Gilbert boxes and CRP binding sites). Since many of the putative signals correspond to previously known elements of bacterial transcriptional regulation, the remaining discovered signals are good candidates for unknown regulatory elements.
Eskin, Eleazar; Halperin, Eran; Karp, Richard Large scale reconstruction of haplotypes from genotype data Conference RECOMB '03: Proceedings of the seventh annual international conference on Research in computational molecular biology, pp. 104-113, ACM, New York, NY, USA, 2003, ISBN: 1-58113-635-8. Link: http://dx.doi.org/10.1145/640075.640088 | Tags: Haplotype Phasing
Critical to the understanding of the genetic basis for complex diseases is the modeling of human variation. Most of this variation can be characterized by single nucleotide polymorphisms (SNPs), which are mutations at a single nucleotide position. To characterize an individual's variation, we must determine an individual's haplotype, or which nucleotide base occurs at each position of these common SNPs for each chromosome. In this paper, we present results for a highly accurate method for haplotype resolution from genotype data. Our method leverages a new insight into the underlying structure of haplotypes which shows that SNPs are organized in highly correlated "blocks". The majority of individuals have one of about four common haplotypes in each block. Our method partitions the SNPs into blocks and, for each block, we predict the common haplotypes and each individual's haplotype. We evaluate our method over biological data. Our method predicts the common haplotypes perfectly and has a very low error rate (0.47%) when taking into account the predictions for the uncommon haplotypes. Our method is extremely efficient compared to previous methods (a matter of seconds where previous methods needed hours). Its efficiency allows us to find the block partition of the haplotypes, to cope with missing data, and to work with large data sets such as genotypes for thousands of SNPs for hundreds of individuals. The algorithm is available via webserver at http://www.cs.columbia.edu/compbio/hap.
Eskin, Eleazar; Noble, William Stafford; Singer, Yoram Protein family classification using sparse Markov transducers. Journal Article J Comput Biol, 10 (2), pp. 187-213, 2003, ISSN: 1066-5277. Link: http://dx.doi.org/10.1089/106652703321825964
We present a method for classifying proteins into families based on short subsequences of amino acids using a new probabilistic model called sparse Markov transducers (SMT). We classify a protein by estimating probability distributions over subsequences of amino acids from the protein. Sparse Markov transducers, similar to probabilistic suffix trees, estimate a probability distribution conditioned on an input sequence. SMTs generalize probabilistic suffix trees by allowing for wild-cards in the conditioning sequences. Since substitutions of amino acids are common in protein families, incorporating wild-cards into the model significantly improves classification performance. We present two models for building protein family classifiers using SMTs. As protein databases become larger, data driven learning algorithms for probabilistic models such as SMTs will require vast amounts of memory. We therefore describe and use efficient data structures to improve the memory usage of SMTs. We evaluate SMTs by building protein family classifiers using the Pfam and SCOP databases and compare our results to previously published results and state-of-the-art protein homology detection methods. SMTs outperform previous probabilistic suffix tree methods and under certain conditions perform comparably to state-of-the-art protein homology methods.
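A drastically simplified sketch of the wildcard-context idea follows. This is our own toy, not an SMT: real SMTs learn which wildcard contexts to keep, weight them adaptively, and use efficient data structures, whereas here every wildcard context is kept and weighted uniformly.

```python
from collections import defaultdict
from itertools import product

def train_wildcard_contexts(sequences, k=2):
    """Count next-symbol occurrences for every length-k context,
    including versions with wildcards ('*') at any subset of
    positions."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(len(seq) - k):
            context, nxt = seq[i:i + k], seq[i + k]
            for mask in product([False, True], repeat=k):
                key = "".join("*" if w else c
                              for c, w in zip(context, mask))
                counts[key][nxt] += 1
    return counts

def predict(counts, context):
    """Average the conditional distributions of all matching
    wildcard contexts, then renormalize."""
    k = len(context)
    mixed = defaultdict(float)
    for mask in product([False, True], repeat=k):
        key = "".join("*" if w else c for c, w in zip(context, mask))
        total = sum(counts[key].values())
        if total:
            for sym, n in counts[key].items():
                mixed[sym] += n / total
    z = sum(mixed.values())
    return {s: p / z for s, p in mixed.items()} if z else {}

counts = train_wildcard_contexts(["ACDEACDF", "ACDEACDE"], k=2)
print(predict(counts, "CD"))
```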
2002
Noble, William Stafford; Schölkopf, Bernhard; Weston, Jason; Eskin, Eleazar; Leslie, Christina A Kernel Approach for Learning from almost Orthogonal Patterns Conference Lecture Notes in Computer Science, 2430/2002, pp. 151-167, Springer Berlin / Heidelberg, 2002, ISSN: 0302-9743 (Print) 1611-3349 (Online). Link: http://dx.doi.org/10.1007/3-540-36755-1
In kernel methods, all the information about the training data is contained in the Gram matrix. If this matrix has large diagonal values, which arises for many types of kernels, then kernel methods do not perform well. We propose and test several methods for dealing with this problem by reducing the dynamic range of the matrix while preserving the positive definiteness of the Hessian of the quadratic programming problem that one has to solve when training a Support Vector Machine.
Noble, William Stafford; Schölkopf, Bernhard; Weston, Jason; Eskin, Eleazar; Leslie, Christina A Kernel Approach for Learning from Almost Orthogonal Patterns Conference Lecture Notes in Computer Science, 2431/2002, pp. 225-240, Springer Berlin / Heidelberg, 2002, ISSN: 0302-9743 (Print) 1611-3349 (Online). Link: http://dx.doi.org/10.1007/3-540-45681-3
In kernel methods, all the information about the training data is contained in the Gram matrix. If this matrix has large diagonal values, which arises for many types of kernels, then kernel methods do not perform well. We propose and test several methods for dealing with this problem by reducing the dynamic range of the matrix while preserving the positive definiteness of the Hessian of the quadratic programming problem that one has to solve when training a Support Vector Machine.
Apap, Frank; Honig, Andrew; Hershkop, Shlomo; Eskin, Eleazar; Stolfo, Sal Detecting Malicious Software by Monitoring Anomalous Windows Registry Accesses Conference Lecture Notes in Computer Science, 2516/2002, pp. 36-53, Springer Berlin / Heidelberg, 2002, ISSN: 0302-9743 (Print) 1611-3349 (Online). Link: http://dx.doi.org/10.1007/3-540-36084-0
We present a host-based intrusion detection system (IDS) for Microsoft Windows. The core of the system is an algorithm that detects attacks on a host machine by looking for anomalous accesses to the Windows Registry. The key idea is to first train a model of normal registry behavior on a Windows host, and use this model to detect abnormal registry accesses at run-time. The normal model is trained using clean (attack-free) data. At run-time the model is used to check each access to the registry in real time to determine whether or not the behavior is abnormal and (possibly) corresponds to an attack. The system is effective in detecting the actions of malicious software while maintaining a low rate of false alarms.
Eskin, Eleazar; Pevzner, Pavel A Finding composite regulatory patterns in DNA sequences. Journal Article Bioinformatics, 18 Suppl 1, pp. S354-63, 2002, ISSN: 1367-4803. Link: https://www.ncbi.nlm.nih.gov/pubmed/12169566 | Tags: Motif Finding
Pattern discovery in unaligned DNA sequences is a fundamental problem in computational biology with important applications in finding regulatory signals. Current approaches to pattern discovery focus on monad patterns that correspond to relatively short contiguous strings. However, many of the actual regulatory signals are composite patterns that are groups of monad patterns that occur near each other. A difficulty in discovering composite patterns is that one or both of the component monad patterns in the group may be 'too weak'. Since the traditional monad-based motif finding algorithms usually output one (or a few) high scoring patterns, they often fail to find composite regulatory signals consisting of weak monad parts. In this paper, we present a MITRA (MIsmatch TRee Algorithm) approach for discovering composite signals. We demonstrate that MITRA performs well for both monad and composite patterns by presenting experiments over biological and synthetic data.
Bhattacharyya, Manasi; Hershkop, Shlomo; Eskin, Eleazar MET: an experimental system for Malicious Email Tracking Conference NSPW '02: Proceedings of the 2002 workshop on New security paradigms, pp. 3-10, ACM, New York, NY, USA, 2002, ISBN: 1-58113-598-X. Link: http://dx.doi.org/10.1145/844102.844104
Despite the use of state of the art methods to protect against malicious programs, they continue to threaten and damage computer systems around the world. In this paper we present MET, the Malicious Email Tracking system, designed to automatically report statistics on the flow behavior of malicious software delivered via email attachments at both a local and global level. MET can help reduce the spread of malicious software worldwide, especially self-replicating viruses, as well as provide further insight toward minimizing damage caused by malicious programs in the future. In addition, the system can help system administrators detect all of the points of entry of a malicious email into a network. The core of MET's operation is a database of statistics about the trajectory of email attachments in and out of a network system, and the culling together of these statistics across networks to present a global view of the spread of the malicious software. From a statistical perspective, sampling only a small amount of traffic (for example, 0.1%) of a very large email stream is sufficient to detect suspicious or otherwise new email viruses that may be undetected by standard signature-based scanners. Therefore, relatively few MET installations would be necessary to gather sufficient data in order to provide broad protection services. Small scale simulations are presented to demonstrate MET in operation and suggest how detection of new virus propagations via flow statistics can be automated.
Eskin, Eleazar Sparse sequence modeling with applications to computational biology and intrusion detection PhD Thesis, Columbia University, 2002.
Leslie, Christina; Eskin, Eleazar; Noble, William Stafford The spectrum kernel: a string kernel for SVM protein classification. Journal Article Pac Symp Biocomput, pp. 564-75, 2002, ISSN: 1793-5091. Link: https://www.ncbi.nlm.nih.gov/pubmed/11928508
We introduce a new sequence-similarity kernel, the spectrum kernel, for use with support vector machines (SVMs) in a discriminative approach to the protein classification problem. Our kernel is conceptually simple and efficient to compute and, in experiments on the SCOP database, performs well in comparison with state-of-the-art methods for homology detection. Moreover, our method produces an SVM classifier that allows linear time classification of test sequences. Our experiments provide evidence that string-based kernels, in conjunction with SVMs, could offer a viable and computationally efficient alternative to other methods of protein classification and homology detection.
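The k-spectrum kernel is simple enough to state in a few lines: map each sequence to its vector of k-mer counts and take the inner product. Below is a minimal Python sketch; it is our own illustration, not the authors' code.

```python
from collections import Counter

def spectrum(seq, k=3):
    """k-spectrum of a sequence: counts of every length-k substring."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(x, y, k=3):
    """Kernel value = inner product of the two k-mer count vectors."""
    sx, sy = spectrum(x, k), spectrum(y, k)
    return sum(c * sy[kmer] for kmer, c in sx.items())

# The shared 3-mer "ARN" occurs twice in each sequence, so the
# kernel value is 2 * 2 = 4.
print(spectrum_kernel("ARNDARN", "ARNKARN", k=3))
```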
Eskin, Eleazar; Noble, William Stafford; Singer, Yoram Using substitution matrices to estimate probability distributions for biological sequences. Journal Article J Comput Biol, 9 (6), pp. 775-91, 2002, ISSN: 1066-5277. Link: http://dx.doi.org/10.1089/10665270260518263
Accurately estimating probabilities from observations is important for probabilistic-based approaches to problems in computational biology. In this paper we present a biologically-motivated method for estimating probability distributions over discrete alphabets from observations using a mixture model of common ancestors. The method is an extension of substitution matrix-based probability estimation methods. In contrast to previous such methods, our method has a simple Bayesian interpretation and has the advantage over Dirichlet mixtures that it is both effective and simple to compute for large alphabets. The method is applied to estimate amino acid probabilities based on observed counts in an alignment and is shown to perform comparably to previous methods. The method is also applied to estimate probability distributions over protein families and improves protein classification accuracy.
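A hedged sketch of the mixture-of-common-ancestors estimate as we read it from the abstract: each observed residue votes, in proportion to its count, for the conditional distribution a substitution matrix assigns to it. The three-letter matrix below is a made-up toy, not BLOSUM or the paper's parameters.

```python
def ancestor_mixture_estimate(observed_counts, p_sub):
    """Estimate residue probabilities from counts in an alignment
    column using a mixture of common ancestors.

    p_sub[b][a] is the probability of observing residue a given
    ancestor b. Each observed residue b votes with weight
    count(b)/N for the conditional distribution p_sub[b], so
    unseen but chemically similar residues (here K, and the
    leucine/isoleucine pair) receive sensible probability.
    """
    n = sum(observed_counts.values())
    alphabet = p_sub.keys()
    return {a: sum((c / n) * p_sub[b][a]
                   for b, c in observed_counts.items())
            for a in alphabet}

p_sub = {  # toy conditional substitution probabilities
    "L": {"L": 0.7, "I": 0.25, "K": 0.05},
    "I": {"L": 0.25, "I": 0.7, "K": 0.05},
    "K": {"L": 0.05, "I": 0.05, "K": 0.9},
}
print(ancestor_mixture_estimate({"L": 8, "I": 2}, p_sub))
```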
2001
Schultz, M G; Eskin, E; Zadok, F; Stolfo, S J Data mining methods for detection of new malicious executables Conference Security and Privacy, 2001 S&P 2001 Proceedings 2001 IEEE Symposium on, pp. 38-49, 2001. Link: http://dx.doi.org/10.1109/SECPRI.2001.924286
A serious security threat today is malicious executables, especially new, unseen malicious executables often arriving as email attachments. These new malicious executables are created at the rate of thousands every year and pose a serious security threat. Current anti-virus systems attempt to detect these new malicious programs with heuristics generated by hand. This approach is costly and oftentimes ineffective. We present a data mining framework that detects new, previously unseen malicious executables accurately and automatically. The data mining framework automatically found patterns in our data set and used these patterns to detect a set of new malicious binaries. Comparing our detection methods with a traditional signature-based method, our method more than doubles the current detection rates for new malicious executables.
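One classifier style evaluated in this line of work is Naive Bayes over byte-sequence features extracted from binaries. The Python sketch below is our own hedged toy version: the n-gram presence features, smoothing constant, and data are all illustrative assumptions, not the paper's exact setup.

```python
import math
from collections import Counter

def ngrams(data, n=2):
    """Byte n-grams of a binary, as a set (presence features)."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}

def train(samples):
    """samples: list of (bytes, label). Returns per-class n-gram
    document frequencies and class counts for Naive Bayes."""
    freq = {"malicious": Counter(), "benign": Counter()}
    totals = Counter()
    for data, label in samples:
        totals[label] += 1
        freq[label].update(ngrams(data))
    return freq, totals

def score(data, freq, totals, alpha=1.0):
    """Log-posterior odds of 'malicious' with add-alpha smoothing;
    positive values mean the binary looks malicious."""
    logodds = math.log(totals["malicious"] / totals["benign"])
    for g in ngrams(data):
        p_mal = (freq["malicious"][g] + alpha) / (totals["malicious"] + 2 * alpha)
        p_ben = (freq["benign"][g] + alpha) / (totals["benign"] + 2 * alpha)
        logodds += math.log(p_mal / p_ben)
    return logodds

samples = [(b"\x4d\x5a\x90\x00bad", "malicious"),
           (b"\x4d\x5a\x90\x00ok!", "benign")]
freq, totals = train(samples)
print(score(b"\x4d\x5abad", freq, totals) > 0)  # True on this toy data
```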
Stolfo, Salvatore; Lee, Wenke; Chan, Philip; Fan, Wei; Eskin, Eleazar Data mining-based intrusion detectors: an overview of the Columbia IDS project Journal Article SIGMOD Rec., 30 (4), pp. 5-14, ACM, New York, NY, USA, 2001, ISSN: 0163-5808. Link: http://dx.doi.org/10.1145/604264.604267
Eskin, E; Lee, Wenke; Stolfo, S J Modeling system calls for intrusion detection with dynamic window sizes Conference DARPA Information Survivability Conference & Exposition II, 2001 DISCEX '01 Proceedings, vol. 1, pp. 165-175, 2001. Link: http://dx.doi.org/10.1109/DISCEX.2001.932213
We extend prior research on system call anomaly detection modeling methods for intrusion detection by incorporating dynamic window sizes. The window size is the length of the subsequence of a system call trace which is used as the basic unit for modeling program or process behavior. In this work we incorporate dynamic window sizes and show marked improvements in anomaly detection. We present two methods for estimating the optimal window size based on the available training data. The first method is an entropy modeling method which determines the optimal single window size for the data. The second method is a probability modeling method that takes into account context dependent window sizes. A context dependent window size model is motivated by the way that system calls are generated by processes. Sparse Markov transducers (SMTs) are used to compute the context dependent window size model. We show over actual system call traces that the entropy modeling methods lead to the optimal single window size. We also show that context dependent window sizes outperform traditional system call modeling methods.
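A minimal sketch of the entropy-modeling idea: for each candidate window size, estimate the conditional entropy of the next system call given the preceding window, and prefer the smallest window size at which entropy stops dropping. This is our own illustration with a synthetic trace; the paper's estimator also has to guard against overfitting to long, rarely repeated contexts.

```python
import math
from collections import defaultdict

def conditional_entropy(trace, w):
    """H(next call | previous w calls), estimated from one trace.
    Lower entropy means window size w captures more regularity."""
    ctx = defaultdict(lambda: defaultdict(int))
    for i in range(len(trace) - w):
        ctx[tuple(trace[i:i + w])][trace[i + w]] += 1
    n = len(trace) - w
    h = 0.0
    for nexts in ctx.values():
        total = sum(nexts.values())
        for c in nexts.values():
            p = c / total
            h -= (total / n) * p * math.log2(p)
    return h

# Synthetic trace mixing two call patterns: windows of size 2
# resolve most of the ambiguity that size 1 cannot.
trace = (["open", "read", "write", "close"]
         + ["open", "read", "read", "close"]) * 30
for w in (1, 2, 3):
    print(w, round(conditional_entropy(trace, w), 3))
```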
Lee, Wenke; Stolfo, S J; Chan, P K; Eskin, E; Fan, Wei; Miller, M; Hershkop, S; Zhang, Junxin: Real time data mining-based intrusion detection. Conference. DARPA Information Survivability Conference & Exposition II (DISCEX '01) Proceedings, 1, pp. 89-100, 2001. http://dx.doi.org/10.1109/DISCEX.2001.932195
We present an overview of our research in real time data mining-based intrusion detection systems (IDSs), focusing on issues that arise when deploying a data mining-based IDS in a real time environment. We describe our approaches to three types of issues: accuracy, efficiency, and usability. To improve accuracy, data mining programs analyze audit data and extract features that can distinguish normal activities from intrusions; we use artificial anomalies along with normal and/or intrusion data to produce more effective misuse and anomaly detection models. To improve efficiency, the computational costs of features are analyzed and a multiple-model, cost-based approach is used to produce detection models with low cost and high accuracy; we also present a distributed architecture for evaluating cost-sensitive models in real time. To improve usability, adaptive learning algorithms facilitate model construction and incremental updates, and unsupervised anomaly detection algorithms reduce the reliance on labeled data. Finally, we present an architecture consisting of sensors, detectors, a data warehouse, and model generation components. This architecture facilitates the sharing and storage of audit data and the distribution of new or updated models, and it improves the efficiency and scalability of the IDS. |
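The multiple-model, cost-based idea can be illustrated with a short sketch (hypothetical names and interfaces, not the system's actual API): detection models are ordered by the cost of computing their features, and a more expensive model is consulted only when a cheaper one is not confident enough.

```python
# A minimal sketch, assuming each stage exposes (feature_cost,
# extract_features, predict) and predict returns (label, confidence).

def cascade_predict(event, models, confidence_threshold=0.9):
    """models: non-empty list of (feature_cost, extract_features, predict)
    tuples, sorted by ascending feature_cost."""
    label = None
    for feature_cost, extract_features, predict in models:
        features = extract_features(event)   # pay this stage's feature cost
        label, confidence = predict(features)
        if confidence >= confidence_threshold:
            return label                     # stop early, skipping the
                                             # costlier feature computations
    return label                             # fall back to the costliest model
```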
Eskin, Eleazar; Grundy, William N; Singer, Yoram: Using mixtures of common ancestors for estimating the probabilities of discrete events in biological sequences. Journal Article. Bioinformatics, 17 Suppl 1, pp. S65-73, 2001, ISSN: 1367-4803. https://www.ncbi.nlm.nih.gov/pubmed/11472994
Accurately estimating probabilities from observations is important for probabilistic approaches to problems in computational biology. In this paper we present a biologically motivated method for estimating probability distributions over discrete alphabets from observations using a mixture model of common ancestors. The method is an extension of substitution matrix-based probability estimation methods. In contrast to previous such methods, our method has a simple Bayesian interpretation and has the advantage over Dirichlet mixtures that it is both effective and simple to compute for large alphabets. The method is applied to estimate amino acid probabilities based on observed counts in an alignment and is shown to perform comparably to previous methods. It is also applied to estimate probability distributions over protein families, where it improves protein classification accuracy. |
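In spirit, the estimator can be sketched as follows (a simplified toy under stated assumptions, not the paper's exact formulation): treat the observed counts in an alignment column as emissions from an unknown common ancestor, form a posterior over candidate ancestors using a substitution matrix, and average the ancestors' emission distributions under that posterior.

```python
import numpy as np

def mixture_of_ancestors(counts, sub_matrix, prior):
    """counts: (20,) observed amino acid counts in one alignment column.
    sub_matrix: (20, 20) row-stochastic P(observed aa | ancestor aa).
    prior: (20,) prior probability of each ancestor amino acid."""
    # log-likelihood of the observed counts under each candidate ancestor
    log_lik = counts @ np.log(sub_matrix).T
    log_post = np.log(prior) + log_lik
    post = np.exp(log_post - log_post.max())
    post /= post.sum()                 # posterior over common ancestors
    return post @ sub_matrix           # mixture estimate of P(amino acid)
```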
2000 |
Eskin, Eleazar: Detecting errors within a corpus using anomaly detection. Conference. Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference, pp. 148-153, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2000. Tags: anomaly detection
We present a method for automatically detecting errors in a manually marked corpus using anomaly detection. Anomaly detection is a method for determining which elements of a large data set do not conform to the whole: a probability distribution is fit over the data and a statistical test is applied to detect anomalous elements. In the corpus error detection problem, anomalous elements are typically marking errors. We present the results of applying this method to the tagged portion of the Penn Treebank corpus. |
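The fit-then-test recipe can be sketched in a few lines (a toy stand-in, not the paper's model): score each (word, tag) pair by its smoothed conditional probability and flag pairs whose score falls below a threshold as candidate marking errors.

```python
from collections import Counter

def candidate_errors(tagged_corpus, threshold=0.01):
    """tagged_corpus: sequence of (word, tag) pairs; yields suspicious pairs."""
    pairs = list(tagged_corpus)
    pair_counts = Counter(pairs)
    word_counts = Counter(word for word, _ in pairs)
    tagset_size = len({tag for _, tag in pairs})
    for (word, tag), n in pair_counts.items():
        # add-one smoothed P(tag | word); a rare tag for a word is anomalous
        p = (n + 1) / (word_counts[word] + tagset_size)
        if p < threshold:
            yield word, tag, p
```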
Eskin, Eleazar; Grundy, William N; Singer, Yoram: Protein family classification using sparse Markov transducers. Journal Article. Proc Int Conf Intell Syst Mol Biol, 8, pp. 134-45, 2000, ISSN: 1553-0833. https://www.ncbi.nlm.nih.gov/pubmed/12804091 Tags: classification, proteins, sparse Markov transducers
In this paper we present a method for classifying proteins into families using sparse Markov transducers (SMTs). Sparse Markov transducers, like probabilistic suffix trees, estimate a probability distribution conditioned on an input sequence; SMTs generalize probabilistic suffix trees by allowing wildcards in the conditioning sequences. Because substitutions of amino acids are common in protein families, incorporating wildcards into the model significantly improves classification performance. We present two models for building protein family classifiers using SMTs, along with efficient data structures that improve the memory usage of the models. We evaluate SMTs by building protein family classifiers using the Pfam database and compare our results to previously published results. |
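The wildcard idea can be sketched as follows (a heavily simplified toy, not the SMT algorithm itself): condition the next-symbol distribution not only on the exact length-k context but also on variants of that context with wildcard positions, so that a substituted amino acid in a context can still match.

```python
from collections import Counter, defaultdict
from itertools import combinations

def wildcard_contexts(ctx, max_wild=1):
    """Yield ctx and every variant with up to max_wild positions set to '*'."""
    yield ctx
    for k in range(1, max_wild + 1):
        for positions in combinations(range(len(ctx)), k):
            variant = list(ctx)
            for p in positions:
                variant[p] = '*'
            yield tuple(variant)

def train(sequences, k=3, max_wild=1):
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - k):
            for ctx in wildcard_contexts(tuple(seq[i:i + k]), max_wild):
                counts[ctx][seq[i + k]] += 1
    return counts

def next_symbol_distribution(counts, ctx, max_wild=1):
    # Mix the exact context with its wildcarded variants (uniform mixing
    # here; real SMTs learn the mixture weights).
    mixed = Counter()
    for c in wildcard_contexts(tuple(ctx), max_wild):
        mixed.update(counts.get(c, {}))
    total = sum(mixed.values())
    return {sym: n / total for sym, n in mixed.items()} if total else {}
```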
1999 |
McKeown, Kathleen R; Klavans, Judith L; Hatzivassiloglou, Vasileios; Barzilay, Regina; Eskin, Eleazar: Towards multidocument summarization by reformulation: progress and prospects. Conference. AAAI '99/IAAI '99: Proceedings of the sixteenth national conference on Artificial intelligence and the eleventh Innovative applications of artificial intelligence conference, pp. 453-460, American Association for Artificial Intelligence, Menlo Park, CA, USA, 1999, ISBN: 0-262-51106-1. Tags: artificial intelligence |
Eskin, Eleazar; Siegel, Eric: Genetic programming applied to Othello: introducing students to machine learning research. Conference. SIGCSE '99: Proceedings of the thirtieth SIGCSE technical symposium on Computer science education, pp. 242-246, ACM, New York, NY, USA, 1999, ISBN: 1-58113-085-6. http://dx.doi.org/10.1145/299649.299771 Tags: instruction |