Reduction in cost of DNA analysis

Discussion in 'Biology & Genetics' started by Billy T, Jun 9, 2007.

  1. Billy T Use Sugar Cane Alcohol car Fuel Valued Senior Member

    Messages:
    23,198
    I did not intend to start this thread, but SamCDkey moved my post below (originally in her "Biology and Genetic Links" thread) into this thread.

    Not exactly a link, and it may appear to be promotional, but independent confirmation of a reduction in the cost of massive DNA analysis by a factor of 10 to 30 is big news. (In the interest of full disclosure, I again acknowledge I hold shares in ILMN. They are again leaving the competition in their dust.) I have made a few sections bold, as some may not want to read the science etc.

    8June07: Illumina announced that researchers from the California Institute of Technology and Stanford University School of Medicine, and the National Heart, Lung, and Blood Institute of the National Institutes of Health are the first in the industry to use the Solexa Sequencing technology with the Genome Analyzer to generate significant chromatin immunoprecipitation sequencing, or ChIP-Seq results within months of system installation, and have study findings accepted for publication by high-profile peer-reviewed journals. These studies appear in the June 8, 2007 edition of Science and the May 18, 2007 issue of Cell, respectively.
    In both studies, investigators used Illumina’s Solexa Sequencing technology to quickly and easily identify how proteins interact with deoxyribonucleic acid (DNA) across the entire human genome. Findings demonstrate that this next-generation technology offers the scientific community the ability to scan millions of binding sites within one channel of a flow cell for 10 to 30 times lower cost than ChIP-chip approaches. Using the ChIP-Seq method, researchers also obtain unprecedented resolution, specificity, and signal-to-noise as compared to microarray-based alternatives.

    “Our ChIP-Seq study took advantage of the Genome Analyzer’s ability to generate tens of millions of sequence reads per run,” said Richard M. Myers, Ph.D., Professor and Chair of Genetics at Stanford University School of Medicine and key author on the Science publication. “This allowed us to turn ChIP into a simple counting assay, in which one sequence read per molecule of DNA was mapped from the sample back onto the reference genome across millions of molecules per sample. By analyzing the distribution of these positive “hits” across the entire genome we mapped and learned new things about thousands of sites that selectively bind to our factor of interest. The amount and nature of these data meant we got exceptional positional resolution and statistical confidence.”

    “We are excited about applying other similar “sequence census” methods to measure and map RNA expression, DNA modification assays, and the like,” said Barbara Wold, Ph.D., Professor of Molecular Biology and Director of the Beckman Institute at Caltech and key author on the Science paper. “It seems inevitable that sequence reads will soon be the common digital currency for genome-scale DNA and RNA measurements.”

    According to the paper published in Cell on May 18, 2007, researchers at the NHLBI “demonstrated that direct sequencing of ChIP DNA using the Illumina Genome Analyzer is an efficient method for mapping genome-wide distributions of histone modifications and chromatin protein targets.” Investigators used the Solexa sequencing technology to generate genome-wide data for more than 20 epigenetic marks in human T cells.

    “As indicated by the Science and Cell studies, scientists are now able to explore entire genomes, regardless of organism type, and unearth rich, new information -- easily, rapidly, and affordably,” said John West, Illumina's Senior Vice President and General Manager, DNA Sequencing. “The ChIP-SEQ approach is transforming the way scientists are looking at the DNA-protein interactions that alter gene function and regulation, critical to the understanding of many complex diseases.”

    Parties interested in hearing more about study findings published in Science are invited to join an Illumina-sponsored webinar, hosted by principal investigators Richard M. Myers, Ph.D., and David Johnson, Ph.D., both from Stanford University Medical Center, and Ali Mortazavi, Ph.D., from The California Institute of Technology on Thursday, June 21, 2007 at 1:00 pm Eastern Time. To register for the event, please visit www.Illumina.com/webinars.
     
    Last edited by a moderator: Jun 9, 2007
  3. CharonZ Registered Senior Member

    Messages:
    786
    Well, while that combination (ChIP-Seq) is nice, it will still rely on organisms that have already been sequenced. Thus the claim that scientists can now explore entire genomes "regardless of organism type" is not completely true. At the very least one needs something to align the reads to.


    And just to avoid possible confusion: ChIP-Seq is about identifying protein binding sites on DNA, not about sequencing organisms.
    As of yet Solexa has very short reads (25 bp or so), so new genomes cannot be sequenced efficiently (even 454 with ~100 bp has problems with that).
     
  5. Billy T Use Sugar Cane Alcohol car Fuel Valued Senior Member

    Messages:
    23,198
    I thank you for your comments. I have zero formal training in any part of modern biology, so I do not always fully understand what I read. Also nearly all of my interest (financial and otherwise) is in early stage drug developers, about 40 different ones, not equipment makers (only own two others: www.abaxis.com & www.qiagen.com).

    From what you state, the advance is only in identifying sites on the DNA (or RNA, presumably) that actively bind to selected agents/molecules. Is this correct? If so, and they know the locations of these sites on the DNA, could they not also run four "binding tests" (one with each of the four molecules that bridge between the two strands of DNA) and thus learn the sequence?

    Also, in the third paragraph of the article, Barbara Wold speaks of "sequence reads". Is this not a reference to determining the DNA sequence? (Perhaps by the "four letters separately bound" I just mentioned?)

    What is your background? Have you read either of the journal papers mentioned in Post 1?
     
    Last edited by a moderator: Jun 9, 2007
  7. CharonZ Registered Senior Member

    Messages:
    786
    It works differently. (BTW, by "four letters" you mean the bases of the DNA, right?)

    The main point of this technique (which btw is not completely new, only used in a higher-throughput manner) is to identify where proteins bind to DNA. What one usually wants to find are the binding sites of transcription factors (TFs), proteins that are involved in the regulation of gene expression. These TFs bind to specific DNA sequences and either enhance or inhibit the expression (transcription into mRNA) of the respective genes.
    Essentially it is a two-step process.
    First, the TFs are allowed to bind to DNA in vivo or in vitro; the (sheared) DNA-protein complexes are then extracted with antibodies directed against the TF. This of course relies on the TFs being known in the first place, otherwise you wouldn't be able to extract them.

    Then the DNA is separated from the proteins and, in this case, sequenced. So you end up with the DNA sequences to which the proteins were bound. In a high-throughput manner you can thus try to identify all binding sites (DNA-stretches) to which the TF in question can bind.

    In other words you do get sequences, but only very short ones. Usual TF binding sites hardly exceed 30 base pairs. With this technique you cannot assess the whole genome of a novel organism; you can only map the TF binding sites on a genome with known sequence.
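
    The "map the short reads back onto a known genome and count" idea can be illustrated with a minimal sketch. It assumes exact-match mapping against a short made-up reference; the function names `map_read` and `count_hits`, the bin size and all sequences are hypothetical and are not taken from the papers or from any Illumina software.

```python
from collections import Counter

def map_read(reference, read):
    """Return the start of the read's single exact match, or None if it is absent or ambiguous."""
    first = reference.find(read)
    if first == -1:
        return None                      # read does not occur in the reference
    if reference.find(read, first + 1) != -1:
        return None                      # read occurs more than once: cannot be placed
    return first

def count_hits(reference, reads, bin_size):
    """Count uniquely mapped reads per fixed-size bin along the reference."""
    counts = Counter()
    for read in reads:
        pos = map_read(reference, read)
        if pos is not None:
            counts[pos // bin_size] += 1
    return counts

# Made-up reference and reads, purely for illustration.
reference = "ACGTTGCAAGGCTTACCGATGGATCCTAGA"
reads = ["GGCTTA", "GCTTAC", "CTTACC", "GATCCT", "TTTTTT"]

hits = count_hits(reference, reads, bin_size=10)
for bin_index in sorted(hits):
    print(f"bases {bin_index * 10}-{bin_index * 10 + 9}: {hits[bin_index]} read(s)")
```

    Bins that collect many more reads than their neighbours would be the candidate binding regions. Note that the whole approach only works because the reference sequence is already known.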

    As for my background: I am a scientist involved in "omics" research. I actually got a flyer for the Solexa sequencing system.

    The main problem of the Solexa system (which is actually independent of the ChIP assay) is the short read length of 25 bp. It is atm impossible to assemble such short sequences into a complete DNA sequence, even for smaller genomes. Imagine chopping 3 billion bases (roughly human genome) in random fragments then sequencing only 25 bp of each fragment and then trying to assemble all the fragments (which are also redundant) back into the original 3-billion-base sequence.
     
  8. Billy T Use Sugar Cane Alcohol car Fuel Valued Senior Member

    Messages:
    23,198
    To Charonz:
    Thanks again.

    I did not understand everything, so here are some questions:

    What sort of work is "omics"? (I doubt you study electrical resistors, but that has a silent "h" in it anyway.)

    Are the "binding sites (DNA-stretches) to which the TF in question can bind" the same DNA stretches that are sometimes called "Snips"? Or are Snips more general, i.e. any short DNA sequence?

    Are the TFs the same sort of thing as the short chains that the siRNA (and Amylyn) people use to block protein expression? (This year's Nobel Prize stuff.) I owned SRNAi more than a year before Merck or Abbott bought it. (Memory fails as to which bought it, because I also owned KOSS, which the other bought out. Fortunately for me, one transaction was in very late Dec06 and the other in the first week of Jan07, so the tax hit I took was not too bad, but it did push me off the tax tables last year.)

    I thought the sequencing of large genomes was first done by turning loose, perhaps separately, a bunch of "enzymatic hatchets" to cut it up and then using clever computer programs to figure out how it was originally hooked together from data on the cut-up pieces - obviously a very crude understanding.

    If that idea is at least partially correct, are you telling me that the enzymatically cut pieces were much longer than the 25 or so bp when you said: "It is atm impossible to assemble such short sequences into a complete DNA sequence, even for smaller genomes. Imagine chopping 3 billion bases (roughly human genome) in random fragments then sequencing only 25 bp of each fragment..."

    Have you used Illumina's existing "BeadArray", or do you know about it? Are you also getting information from them about the two new systems (which should be released about now)? They say: "The Human 1M* and Human 450S BeadChips** has coverage in greater than 99 percent of known genes. BeadChip system will profile over one million diverse genetic variants. The new Human 1M BeadChip combines an unprecedented level of content for both whole-genome (WG) and copy number variation (CNV) analysis, along with additional unique, high-value genomic regions of interest - all on a single micro array chip. The Human 1M and Human 450S BeadChips uses “Infinium Assay“ (a machine?) to provide industry-leading data quality, genomic coverage, and intelligent probe selection."

    Is this mainly "hot air," self promotion, advanced PR, etc. or does it sound like a significant advance?
     
    Last edited by a moderator: Jun 10, 2007
  9. CharonZ Registered Senior Member

    Messages:
    786
    I am a bit short on time, so I probably cannot answer all questions. I'll just try to get as far as possible. Also, I won't proofread.

    Omics is lab-speak for a bunch of areas that are loosely referred to as postgenomics research. This includes transcriptomics, proteomics, metabolomics and so on. In general these involve high-throughput analysis of a whole analyte class (mRNA, proteins, metabolites) in an organism.

    Snips are something totally different. Single nucleotide polymorphisms (SNPs) are differences in a single base at the same position in the DNA of related organisms. They can e.g. be used to determine the relatedness of individuals. Snips are therefore not stretches at all, only exchanges at a defined position.
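
    As a small illustration of "exchanges at a defined position", the sketch below compares two already aligned sequences of equal length and lists the positions where they differ. The function name `find_snps` and both sequences are made up for the example; real SNP calling works on many individuals and has to deal with alignment and sequencing errors, which are ignored here.

```python
def find_snps(seq_a, seq_b):
    """List (position, base_in_a, base_in_b) for every position where two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]

# Two made-up sequences from hypothetical related individuals.
individual_1 = "ACGTACGT"
individual_2 = "ACGTTCGT"

snps = find_snps(individual_1, individual_2)
print(snps)  # [(4, 'A', 'T')]
```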

    TFs are not siRNAs. The latter are, as the name implies, RNAs, whereas TFs are proteins.

    There are basically two genome sequencing strategies: top-down and bottom-up. You refer to the latter, which, depending on the precise procedure, is often also called shotgun sequencing. Quite often one actually uses mechanical shearing rather than enzymatic digestion, as shearing is less specific and thus the chance is higher that you obtain shredded DNA with a more homogeneous size distribution.

    The pieces obtained are then cloned into suitable vectors which, depending on the application, take in DNA sequences of a couple of thousand base pairs. You must keep in mind that each run gives you at most the maximum read length of the method (25 bp in Solexa, around 1 kb in traditional Sanger-type sequencing systems). Then you need to put your puzzle, consisting of 25 bp or 1 kb fragments, together. However, as there are only four possible bases at any given position, a 25 bp fragment is much less likely to be unique than a 1 kb stretch. So for many (if not most) 25 bp fragments you cannot ascertain the correct position in the genome. Especially in novel genomes you do not have a scaffold onto which you can assemble the sequences.
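
    The uniqueness argument can be made concrete with a toy experiment. The sketch below builds a made-up "genome" that contains one 200 bp segment twice, then asks what fraction of all possible reads of a given length occur at exactly one position. Everything here is an assumption for illustration: the genome is random apart from the repeat, 300 bp stands in for a "long" read to keep the example small, and `unique_fraction` is a hypothetical helper, not part of any assembler.

```python
import random
from collections import Counter

def unique_fraction(genome, k):
    """Fraction of the k-long windows of `genome` whose sequence occurs only once."""
    windows = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(windows)
    return sum(1 for w in windows if counts[w] == 1) / len(windows)

rng = random.Random(0)  # fixed seed so the toy genome is reproducible

def random_dna(n):
    """Random string of n bases, used only to build the toy genome."""
    return "".join(rng.choice("ACGT") for _ in range(n))

# One 200 bp segment is inserted twice, mimicking a repeated region.
repeat = random_dna(200)
toy_genome = random_dna(1500) + repeat + random_dna(1500) + repeat + random_dna(1500)

short_unique = unique_fraction(toy_genome, 25)
long_unique = unique_fraction(toy_genome, 300)
print(f"25 bp reads that can be placed unambiguously:  {short_unique:.1%}")
print(f"300 bp reads that can be placed unambiguously: {long_unique:.1%}")
```

    In this toy case the short reads that fall entirely inside the repeat cannot be placed, whereas any read longer than the repeat reaches into the distinct sequence next to it.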

    So in general, as the manufacturer states, the Solexa is not for sequencing novel organisms, but for resequencing known ones, or for special applications like ChIP-Seq.

    These arrays are, btw, something completely different again. They fall into the realm of microarrays and in this case are mostly used for SNP or expression analyses. They have nothing to do with sequencing, as you first need to have the sequence and then print small stretches of it onto array slides.
     
  10. S.A.M. uniquely dreadful Valued Senior Member

    Messages:
    72,822
    I hope you don't mind, Billy. I thought it was interesting enough to warrant/generate some discussion.
     
  11. Billy T Use Sugar Cane Alcohol car Fuel Valued Senior Member

    Messages:
    23,198
    12June07 news release from ILMN, prompted by the sale of 75 new Analyzers, described briefly below. (Stock jumped about 8% on the news.)

    Illumina’s Genome Analyzer employs proprietary sequencing-by-synthesis chemistry, enabling researchers to sequence a human genome for less than one percent of the cost of today’s state-of-the-art capillary sequencing platforms. By producing greater than 100 times the output of current genetic analyzers at a fraction of their costs, the Genome Analyzer is enabling researchers to successfully carry out novel genome wide analysis, such as global mapping of DNA-protein interaction, digital expression profiling, small RNA discovery and profiling, and targeted and genome-wide sequencing for discovery of novel polymorphisms.
     
    Last edited by a moderator: Jun 13, 2007
  12. Billy T Use Sugar Cane Alcohol car Fuel Valued Senior Member

    Messages:
    23,198
    Here are a couple more links on the falling cost and the applications of DNA (and in this case RNA) analysis. The following application, to the mysterious death of bees (CCD), is very cheap and broadly based and, as I understand it, has many potential applications.

    "... A new way of using gene-sequencing machines could be the best scientific tool for understanding germs since the invention of the microscope. It might help scientists deal with sudden outbreaks like SARS faster and with more certainty. But first, researchers are going to try to save the humble honeybee. ...

    Columbia University, and the United States Department of Agriculture have found that collapsing hives are much more likely to be infected by the Israeli acute paralysis virus (IAPV), using a new approach that aims to sequence all the genes in an environment almost as if they were single organisms. The new technique, called metagenomics, for studying bacteria, viruses and other bugs, shows promise for studying human diseases. ...

    Metagenomics could also open a window on the invisible microbes, still undiscovered, that live in the earth, air and sea.

    For 15 years, scientists have been sequencing the genes of individual organisms, from lowly bacteria to malaria parasites to dogs and human beings. Metagenomics takes another approach, basically putting an entire environment--the water in an ocean, the soil in a forest or the insides of an animal--into a DNA sequencer. What comes out is a mess of data, strings of genetic material where it is not always easy to tell what DNA comes from what microbe. But discoveries can be made with this method that are otherwise impossible. ...

    there are some limitations to the bee research, says Jonathan Eisen, a rising star in metagenomics and a professor at the University of California-Davis. First, he says, the experimenters basically ground up whole bees, when dissecting them first might have yielded more information. Second, the researchers sequenced only RNA, not DNA, meaning they might have missed some types of bacteria. And the group used a problematic method for searching for particular organisms in the mess of genetic material they sequenced. ...

    That complaint in itself highlights the promise of this new field, which a committee convened by the National Academies of Science in March said could be the greatest opportunity in three centuries for understanding microbes. ..."
    From:
    http://www.forbes.com/2007/09/06/ge...-cx_mh_0907bees.html?partner=daily_newsletter

    -------------------------------

    "Illumina, Inc. (NASDAQ:ILMN) today {6Sept07} announced the introduction of the Infinium HumanLinkage-12 Genotyping BeadChip, Illumina’s fifth multi-sample Infinium BeadChip and the Company’s first standard panel to take advantage of a twelve-sample format for linkage analysis. The HumanLinkage-12 BeadChip offers the lowest cost per sample for linkage analysis plus industry-leading call rates, uniform marker distribution, and superior SNP content. Powered by the Infinium Assay, this linkage panel is available for $90/sample, a competitive price with a PCR-free protocol and easy workflow. ...

    Linkage analysis maps the location of disease-causing loci by identifying genetic markers that are co-inherited with the phenotype of interest. According to a paper published in the journal Human Molecular Genetics by the International Multiple Sclerosis Consortium significant additional power can be obtained using high-throughput SNP genotyping for linkage analysis. This study, which revisited previously typed complex disease family cohorts, also reported that higher success rates and accuracy were found with Illumina’s technology. ..."
    From:
    http://news.morningstar.com/news/ViewNews.asp?article=/BW/20070906005387_univ.xml


    Better and much cheaper - No wonder ILMN stock is doing well for me and other shareholders.
     
    Last edited by a moderator: Sep 9, 2007
  13. CharonZ Registered Senior Member

    Messages:
    786
    Uhm, I really do not want to dampen your enthusiasm, but:

    Metagenomics is not really that new. What is even worse is that if you check the literature (meaning peer-reviewed articles) you'll find that comparatively little information has been gained by this approach. I expect that this might improve with newer techniques in the near future, though.
    Just as a comment, though: the sequencers presented in this thread are all short-read sequencers and are totally useless for metagenomic approaches (where you don't have a scaffold for assembly).

    However, within the next five years or so a new sequencer is supposed to arrive (forgot which company, though) which will combine the long reads of the old Sanger sequencing method with the speed of the newer sequencers. Essentially it uses chips with immobilized polymerases. This tool will be far more interesting for gaining new sequences.

    On the other hand, 454 (by Roche) is supposed to increase its read length to 500 bp soonish (atm it is at ~200, which is still longer than the Solexa system).

    Everything aside I do acknowledge that for shareholders (or stock values) the marketing is probably more important than the precise use in actual science.
     
  14. Billy T Use Sugar Cane Alcohol car Fuel Valued Senior Member

    Messages:
    23,198
    I am sure that they do all they can to make it appear better than the invention of "sliced bread", but you must admit that the Nat. Acad. of Sciences calling it the "most significant advance since the microscope" does indicate it has some novelty and importance! The fact that Forbes prints an article on "metagenomics" now also makes me think there is something new here.

    I know less than you about all this, but cheap beads with internal holograms to self-indicate type when randomly placed in the array, with optical readout by laser etc., seem (at least to a physicist like me) damn clever manufacturing. I own shares in about 40 biotechs and follow about 60 more, so I cannot go too deeply into each, but I try to listen to all their major presentations. For example, two days ago I was listening live to:
    Annual BioCentury Newsmakers in the Biotech Industry Conference 6Sept07 at:
    http://phx.corporate-ir.net/phoenix.zhtml?c=136783&p=conferenceAgenda&id=1621473&day=1
    and the talks are still available if you are interested. (sometimes I am posting here while listening to talks I have basically heard before.)
     
    Last edited by a moderator: Sep 9, 2007
  15. CharonZ Registered Senior Member

    Messages:
    786
    Two things: Metagenomics is kind of a new hope for microbiologists because the genome sequencing projects that were believed to be the "sliced bread" of their time did not deliver what they were supposed to, despite the millions poured into them (you will notice that most academic bodies claimed the same about genomics as they do now for metagenomics). Metagenomics is kind of an extension of this approach and, as I said, it is not really completely new. Metagenomics papers have already been around for some time (the turnover in that field is pretty high) and technically the only real limiting factor is funding.

    That argument aside, I find your posting an article about metagenomics together with the parallel sequencing or bead techniques pretty misleading, as none of these is actually able to perform metagenomics. As such, metagenomics will not benefit from the cost cut by Solexa or similar machines.
    All metagenomic studies to date have been carried out on conventional systems, which have cut costs significantly over the years (and which are not actually better funded).

    I am not saying that metagenomics is useless (far from it), but it is clearly an overhyped field (like systems biology). The real use of the sequences is far harder to convey to laymen and, in truth, far less sensational (e.g. for the generation of databases for metaproteome analyses). Also, the new sequencing systems described in this thread have atm zero impact on this field.
     
  16. Billy T Use Sugar Cane Alcohol car Fuel Valued Senior Member

    Messages:
    23,198
    Sorry, I apologize. I am just too ignorant of all the fine distinctions of the field and its equipment. I appreciate your comments and corrections.

    Perhaps if you have time and inclination, you could make a quick Tutorial outline to educate me (and others who may be interested) in understanding the rapidly evolving instrument capabilities of the field.

    I have tried to teach myself, mainly at Illumina's web site, but there is so much hype (they can do anything better than anyone else) that it almost seems that if my car will not start, I only need to get their model xyz to have a cheap, good solution to that problem.

    I would really appreciate an outline of the types of machines, what they can (and cannot) do, and how they are used, including how many institutions do that type of work. Just order-of-magnitude estimates will do, as I am really mainly interested in knowing whether some application is only of interest to two guys in lower Slovakia or to every university biology department and research lab.
     
  17. CharonZ Registered Senior Member

    Messages:
    786
    Hmm, this would take some time to compile, especially if the very basics of genome sequencing (and, just as important, sequencing strategies) need to be incorporated.
    Due to lack of time I usually only make comparatively short and typo-ridden posts during my coffee breaks.
    When I get time I might actually do that. I think I even have some old PowerPoint slides that I used for talks, but I cannot promise anything at this point.
    Some catchwords that might be of interest to you are:
    -sequencing strategies (top down, bottom up, shotgun)
    -Sanger sequencing
    -pyrosequencing
    -sequencing by synthesis (I guess you know that one already, it is used by the Solexa system)
    -genome assembly

    Generally I just want to point out here that in order to sequence a genome or even a metagenome (which is just an environmental mixture of a number of genomes) one does not only need to get the DNA and sequence it. The real challenge is to take the so-called reads that the machine gives you and assemble the (meta-)genome from them. In principle you can imagine that the short sequences that the sequencer gives you are puzzle pieces that you have to arrange in a specific order. You determine the arrangement by finding overlapping sequences and placing them one after another. The problem here is the read length.
    As you can imagine, a lot of short reads are harder to assemble than long reads. Now if you compare the read lengths of different sequencers you will find the following:
    old "Sanger" type sequencers: ~750-1000 bp
    454 pyrosequencer: 100-200 bp
    Solexa "sequence by synthesis": 30 bp
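
    To get a feeling for how many puzzle pieces these read lengths imply, here is a back-of-the-envelope sketch. It takes the roughly 3 billion bases of the human genome mentioned earlier in the thread and the upper read lengths from the list above, and assumes (unrealistically) that every read is full length and that reads tile the genome perfectly without overlap. The function name `reads_needed` is made up for the example.

```python
GENOME_SIZE = 3_000_000_000  # roughly the human genome, in bases

# Upper read lengths taken from the list above, in base pairs.
READ_LENGTHS = {"Sanger": 1000, "454": 200, "Solexa": 30}

def reads_needed(genome_size, read_length, coverage=1):
    """Minimum number of reads to cover the genome `coverage` times (ceiling division)."""
    total_bases = coverage * genome_size
    return -(-total_bases // read_length)

for machine, length in READ_LENGTHS.items():
    print(f"{machine:>6}: {reads_needed(GENOME_SIZE, length):>11,} reads for 1x coverage")
```

    Real shotgun projects need every base covered several times by overlapping reads, so the actual numbers are a multiple of these, but the ratio between the machines stays the same: the shorter the read, the more pieces have to be put in order.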

    As you can see, the modern parallel sequencers have inferior read lengths, making it hard if not impossible to assemble complete genomes. They can be used for re-sequencing, though. For instance, the human genome has been sequenced once already, and if you want to sequence another human you can use these machines, as you can take the sequenced genome as a scaffold and align the short reads to it. One can also try to create extensive libraries and use them as a scaffold, but then the price advantage of the parallel machines is all but gone.
    Metagenomes on the other hand are pretty much unique (as you usually do not know in detail what species you will find) and thus are almost by definition scaffold-free.
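
    Re-sequencing against a scaffold can be sketched in a few lines. The toy example below places each short read at the single position of a known reference where it fits with at most one mismatch, then takes the most frequent base at each covered position as the new individual's sequence. This is only an illustration under simplifying assumptions (brute-force search, no insertions or deletions, made-up sequences); `best_position` and `resequence` are hypothetical names, not functions of any real aligner.

```python
from collections import Counter, defaultdict

def best_position(reference, read, max_mismatches=1):
    """Return the only start where `read` fits with <= max_mismatches, else None."""
    candidates = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(1 for r, q in zip(window, read) if r != q)
        if mismatches <= max_mismatches:
            candidates.append(start)
    return candidates[0] if len(candidates) == 1 else None

def resequence(reference, reads):
    """Pile reads onto the reference and call the most common base per position."""
    piles = defaultdict(Counter)
    for read in reads:
        start = best_position(reference, read)
        if start is None:
            continue  # unplaceable or ambiguous read is discarded
        for offset, base in enumerate(read):
            piles[start + offset][base] += 1
    consensus = list(reference)  # uncovered positions keep the reference base
    for pos, votes in piles.items():
        consensus[pos] = votes.most_common(1)[0][0]
    return "".join(consensus)

# Made-up reference and a second individual that differs at one position.
reference = "ACGTTGCAAGGCTTACCGAT"
individual = reference[:10] + "A" + reference[11:]
reads = [individual[5:13], individual[7:15], individual[9:17]]

result = resequence(reference, reads)
differences = [(i, r, s) for i, (r, s) in enumerate(zip(reference, result)) if r != s]
print(differences)  # [(10, 'G', 'A')]
```

    Without the reference there would be nothing to place the reads against, which is the situation for a novel genome or a metagenome.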

    In the near future there will be developments that will actually allow fast sequencing of novel organisms, in fact a number are in development right now, but one has to keep in mind that the machines now on the market fulfill rather specialized purposes and will not yet displace the older ones. Regardless, it is a very important development on the sequencing market.

    Regarding how many are using these systems:
    Sanger-type (capillary) sequencers are quite common. Many labs that need to do sequencing in a specialized way (e.g. DNase footprinting) or with a certain throughput have one. However, due to the drop in costs many more are just sending their samples to a sequencing service.

    Roche 454 is less common, mostly due to the price tag. They cost around half a million and need a specific lab set-up. I know of a growing handful of institutes buying them (including my old workplace), though. As a rough estimate I would say that in most well-funded countries one can expect to find around half a dozen of them, more in the USA due to the larger volume of sequencing projects (but that is just an out-of-thin-air estimate).
    In addition to the price tag, a disadvantage is that it is atm still not very scalable, meaning that you have to do all the reads at once. So it is totally useless if you only want to sequence a short fragment (e.g. to confirm correct cloning of PCR products), as you will have to pay the same as for sequencing a whole genome.
    I guess that similar limitations exist for the Solexa, maybe even more due to the extremely short read length. I only know of a few papers using the Solexa and all of them were proof-of-principle papers. But then it has not been on the market for long. Personally, I have not yet met anyone owning one.

    Coffee is empty, back to work.

     
    Last edited: Sep 11, 2007
  18. Ophiolite Valued Senior Member

    Messages:
    9,232
    Best thread on sciforums this year. (Perhaps ever). This is what science forums should be about. Congratulations and thanks to BillyT, CharonZ and Sam for extracting the exchange as a distinct thread.
     
  19. Billy T Use Sugar Cane Alcohol car Fuel Valued Senior Member

    Messages:
    23,198
    Thanks CharonZ. Very informative. Hope you drink more coffee soon, find your old PP slides etc. (and can upload them here - I never can).

    One thing especially is still a mystery to me:
    The different machines and approaches to sequencing have limited ability to process bp strings (they only do up to a max length). Why? (Also, do you choose the enzymes used so as to cut into pieces of roughly this size?) I do not understand why in my illustration below bp string "c" could not be very long and help the computer solve for the original long bp string. Is it something to do with the ability of the machine to convert the organic bp string information into equivalent computer-compatible code?

    I understand now much of how sequencing is done and why fast computers are part of it. For others, who do not, I will draw an illustration:

    You take some genetic material (you need not separate it from proteins etc., I think, as the DNA & RNA parts of interest will consist of sequences of only four letters in the code, always with the same pair-wise binding, but I will illustrate with binary coded information).

    Before you cut it up into short pieces it is thus a very long string like:
    ...1000101110010100001010o1010101001010101010111111000901000010101010000001010000010011110001010100000101010101...
    (The "9" got there by my sloppy typing, but I suspect that there are sometimes errors in the DNA & RNA strings also, so I did not correct.)

    I know that the "cut it up" can be (and I think usually is) done by enzymes which tend to "snip" at particular sequences, but I bet very high frequency ultrasound might work also.

    After cutting with many different enzymes (I think) you have a mess of short strings whose length CharonZ gave in terms of "bp", which I am 99% sure is "base pairs", referring to the possible / permitted pairs of the coding letters (the molecules cross-linking the strands in the DNA helix or, after the helix is split, the part still attached to the "backbone" of each helix strand). (RNA is usually only single stranded, I think.)

    I.e. you have a soup like:

    00101001110001 (only 14 bp as I am lazy and do not want to count many just to illustrate)

    11001111001101010101010101000001

    00011100010100001010101001010100001

    10101010001010101010 (etc. for a "zillion" more relative short strings of bp, not all different from each other, but some at least unique)

    Now, by means not entirely clear to me, these shorter bp chains are transformed into computer-compatible data strings (and I assume the cat now gets to eat the biological material, if the enzymes were not toxic, as it has no further use).

    Now clever computer programs look for overlaps such as:

    00010000111101000101010010 (bp string "a")
    .................11110100010010101010001000110111101010100010101001010 (bp string "b")
    ....................................................1000100010100110001101000101011100001010101000010101010101010000101011100010101000100111 (bp string "c")

    and so on until you have rebuilt the original "uncut" very long bp chain (in the computer only of course)
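
    The overlap-and-join idea illustrated above can be written down as a minimal greedy sketch: repeatedly find the two fragments with the longest suffix-to-prefix overlap and merge them. This is an assumption-laden toy (exact overlaps only, no read errors, no repeats, letters instead of the binary strings used above), and `overlap` and `greedy_assemble` are hypothetical names; real assemblers are considerably more involved.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b` (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(fragments, min_len=3):
    """Merge the pair with the longest overlap until no overlaps remain."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # nothing overlaps any more: the remaining pieces stay separate
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
    return frags

# Made-up original sequence, cut into three overlapping pieces and shuffled.
original = "ATGGCGTACCTGAAGTCA"
pieces = [original[10:18], original[0:8], original[5:13]]

print(greedy_assemble(pieces))  # ['ATGGCGTACCTGAAGTCA']
```

    With very short or repetitive pieces several different joins can look equally good, which is exactly where the difficulty with 25 bp reads comes from.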

    Note that only for string "a" did I show the bp sub-sequences prior to the "matching part," but they often would exist. (However, as the enzymes do cut at specific sub-strings of bp, in many cases I think the "match point" may be the end of a bp string.) I am not exactly clear how the computer "knows" which way the string hooks on to the matching bp string in those rare cases when the "matching" sub-string is symmetric about its midpoint, such as 010001100010. Perhaps it must consider both possibilities before reaching the final solution of a reconstructed sequence of the original long bp string.

    I did not check to see if different sections of my strings a, b, & c could also be aligned, but am sure this is part of the "headaches" the computer has in concluding what is the actual original long bp string. There must be some "voting procedure" to decide. I.e. if, from the data set of many bp strings it is fed, it can make the expected length (known roughly from the "before cut" long bp passing thru a mass spectrograph or something like that) 746 times with one particular sequence and only 21 times with the "second place" solution, then it puts its money on the 746 case string. Something like that, I think.
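    The "voting" guess above can be sketched in a few lines of Python. This is only an illustration of backing the candidate with the most support; the function name, the two candidate strings and the way the counts are produced are all made up here (only the 746 and 21 come from the example above), and real assembly software uses more elaborate statistics.

```python
from collections import Counter


def pick_by_vote(candidates):
    """Return (winner, winner_votes, runner_up_votes) for a list of candidates."""
    counts = Counter(candidates)
    ranked = counts.most_common(2)  # best two candidates, most frequent first
    winner, votes = ranked[0]
    runner_up_votes = ranked[1][1] if len(ranked) > 1 else 0
    return winner, votes, runner_up_votes


# Hypothetical data: one reconstruction turns up 746 times, another 21 times.
candidates = ["0001000011110100"] * 746 + ["0001000011010100"] * 21
winner, votes, runner_up = pick_by_vote(candidates)
print(f"bet on {winner}: {votes} votes against {runner_up}")
```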

    Perhaps Charon will correct my errors, but I am pretty confident the basic idea illustrated is nearly correct.

    I will return by edit soon and try to make the strings a, b, & c overlap better by adjusting the number of my "spacing dots," which should be ignored, as that is all they are.
     
    Last edited by a moderator: Sep 11, 2007
  20. CharonZ Registered Senior Member

    Messages:
    786
    Maybe I gotta step back a lil' more. DNA is a fairly uniform molecule with essentially only one site of variation. These are the bases of DNA and there can be only one of four possible bases, which in short are termed A, C, G, T. I refer to these when I say bp or base pairs.
    The reason why it is referred to as base pairs as opposed to merely bases is that the two strands of the DNA double helix form pairs, with A pairing with T and C with G. This property of specific pairing is exploited by the sequencing reactions.

    The limitation in length is totally unrelated to "chopping" up the genome with enzymes or other means. This part is only necessary for setting up a DNA library, and for explanatory purposes one can assume that it has no impact on the actual sequencing reaction.

    All sequencers depend on a PCR reaction. The DNA fragment to be sequenced is denatured (split into its two halves) and a PCR reaction is started: an enzyme (the DNA-dependent DNA polymerase) starts at a defined point (depending on the use of so-called primers) and synthesizes a second DNA strand, thus recreating a whole double helix again. The specificity is given by the fact that bases can, as mentioned above, only pair in a certain way, so that one half of the DNA is automatically the template for the second (for details please refer to one of the numerous web sites dealing with PCR).
    The detection of which nucleotide (base + sugar backbone) was built in depends on the system and is not important here. What is important, however, is that the polymerase can only move along the DNA for a limited time before it falls off. In other words, in a single reaction the sequencing stops after that length.

    In the Sanger sequencing reaction the PCR can run almost optimally, yielding the long read length. Both the 454 and Solexa systems require additional enzymes and washing steps, which arguably decrease the PCR efficiency and thus the read length.
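    To make the template idea concrete, here is a minimal Python sketch of building the second strand from the first using only the A-T and C-G pairing rule. The function name and the `max_length` parameter (a stand-in for the polymerase falling off after a limited stretch) are invented for illustration, the example template is arbitrary, and strand direction (5' to 3') is ignored to keep things simple.

```python
# Pairing rule: A pairs with T, C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(template, max_length=None):
    """Build the strand that pairs with `template`, base by base.

    `max_length` is a hypothetical cap that mimics the polymerase
    falling off: synthesis stops after that many bases.
    """
    bases = template.upper()
    if max_length is not None:
        bases = bases[:max_length]
    return "".join(PAIRS[base] for base in bases)


print(complement("ACGAAATG"))                # full second strand
print(complement("ACGAAATG", max_length=3))  # synthesis cut short
```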

    Now regarding assembly, what was stated is basically correct (although one would usually use A, C, G and T instead of binary strings). However, none of the displayed overlaps would be a correct assembly. It should rather look like this:


    acgaaatgctgctagatcgtg-end
    _________________cgtgaatagacgatgat-end
    ______________________________tgatggcggtaata-end

    With "_" just being spaces to align the sequence. The sequencing started with acg and cgt and tga, respectively.
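    For those who prefer code, the end-to-start overlap shown above can be sketched in Python. This is a toy, assuming the reads are error-free and already given in left-to-right order; the function names and the `min_len` cutoff are made up for illustration, and real assemblers must also work out the order, cope with sequencing errors and consider both strands.

```python
def overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_in_order(reads, min_len=3):
    """Merge reads that are assumed to be in left-to-right order already."""
    assembled = reads[0]
    for read in reads[1:]:
        size = overlap(assembled, read, min_len)
        if size == 0:
            raise ValueError(f"no overlap of at least {min_len} bases with {read!r}")
        assembled += read[size:]  # append only the part that is new
    return assembled


# The three reads from the example above.
reads = ["acgaaatgctgctagatcgtg", "cgtgaatagacgatgat", "tgatggcggtaata"]
print(merge_in_order(reads))
```

    Note that the match is always the end of one read against the start of the next, which is why the overlaps have to sit at the read ends.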

    Now imagine you got an overlap of only a very short sequence (as one would expect e.g. from the Solexa system) e.g.: GGACT

    Then you got a genome of, say, 3 billion bases (roughly the human genome). Just by chance you will find many, many regions with that particular sequence, so it won't be possible to really say where the sequence belongs. Also, there are a lot of repetitive regions that have a low string diversity. If the reads start and end within such a region they cannot be assembled.
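    A rough back-of-the-envelope check of the "just by chance" point, as a Python sketch. It assumes the four bases are equally likely and independent at every position and looks at one strand only, which a real genome does not satisfy (see the repetitive regions mentioned above), so treat the numbers as order-of-magnitude only. The function name is made up for illustration.

```python
def expected_chance_hits(genome_length, seq_length, alphabet_size=4):
    """Expected number of places a given sequence occurs in a random genome."""
    positions = genome_length - seq_length + 1   # possible start positions
    return positions / alphabet_size ** seq_length


GENOME = 3_000_000_000  # roughly the human genome, as above

for k in (5, 10, 16, 20):
    print(f"{k:2d} bases: about {expected_chance_hits(GENOME, k):,.2f} chance hits")
```

    Under these assumptions a 5-base sequence like GGACT is expected nearly 3 million times, and the expected count only drops below one at around 16 bases.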
     
  21. Billy T Use Sugar Cane Alcohol car Fuel Valued Senior Member

    Messages:
    23,198
    Not trying to promote them, but I get notices that I do not fully understand, such as:

    Illumina announced today that Cancer Research UK will fund two studies designed to uncover genetic factors linked to the development of lung and ovarian cancers. These studies are part of a dual agreement with UK-based research centers that will total more than 15,500 samples. Both studies will initially use Illumina’s Infinium HumanHap550 Genotyping BeadChip, followed by customized analysis using Illumina’s iSelect Genotyping BeadChip. These studies mark the third and fourth service projects to be conducted for Cancer Research UK by Illumina’s FastTrack Genotyping Services team.

    Collaborating with Illumina for one study will be Richard Houlston, Ph.D., with his research team at the Institute of Cancer Research, and Tim Eisen at the University of Cambridge. More than 2,000 samples will be run on the HumanHap550 BeadChip, followed by 5,600 samples on an iSelect Custom BeadChip. ...A second study led by principal investigators Paul Pharoah, Ph.D., at the University of Cambridge and Simon Gayther, Ph.D., at the University of London will attempt to uncover genetic links to ovarian cancer. This study will process more than 2,000 samples using the HumanHap550 BeadChip and 6,000 samples using an iSelect Custom BeadChip. ...

    “We are very pleased to work with Illumina’s FastTrack Services team again. We initiated our first study with them in 2005, which resulted in discovery and a speedy publication in Nature Genetics of a gene linked to colorectal cancer. "

    From:
    http://news.morningstar.com/news/ViewNews.asp?article=/BW/20070912005433_univ.xml
    (Morningstar is a free service that you can set up to report news related to a list of stocks you submit.)

    Why do they need to switch machines? I.e. what does the second machine do that the first can not and conversely?
     
  22. CharonZ Registered Senior Member

    Messages:
    786
    These bead chips are something completely different, and not related to sequencers or sequencing machines at all. In fact they are not machines at all.

    They are a variation of microarray technologies which are used to screen RNA/cDNA or DNA for a specific sequence or to quantify gene expression.
    Basically you immobilize short DNA strands of known sequence onto beads or on a chip and then you add your labelled samples. If a sample binds and thus gives a signal, you know that a certain DNA sequence is present in your sample (and with a variation of this you can quantify RNA changes).
    So you can, for instance, put samples of patients with a genetic disease on such a chip and then do the same with a healthy sample. Those regions that give a signal with the diseased sample and not with the healthy one might then be associated with the given disease. These are then targets for further analyses.
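    The disease-versus-healthy comparison boils down to a set difference, sketched here in Python. The probe names, signal values and the `threshold` cutoff are all hypothetical, and real analyses compare many samples with proper statistics rather than a single cutoff.

```python
def candidate_probes(disease_signals, healthy_signals, threshold=1.0):
    """Probes that give a signal with the diseased sample but not the healthy one.

    Each argument maps a probe name to its measured signal; a probe counts
    as giving a signal when its value is at or above `threshold`.
    """
    disease_hits = {p for p, s in disease_signals.items() if s >= threshold}
    healthy_hits = {p for p, s in healthy_signals.items() if s >= threshold}
    return sorted(disease_hits - healthy_hits)


# Hypothetical readings from two chips with the same three probes.
disease = {"probe_1": 2.4, "probe_2": 0.1, "probe_3": 1.8}
healthy = {"probe_1": 2.2, "probe_2": 0.2, "probe_3": 0.3}
print(candidate_probes(disease, healthy))
```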

    Different chips have different DNA stretches (also called oligonucleotides if they are short) immobilized onto them. The custom chips therefore have sequences specified by the buyer, depending on need. E.g. they may want to have specific regions with a higher coverage, if they want to narrow down the locus associated with a disease, and so on.
    The chips are essentially one-use slides: after hybridisation you throw them away. Sometimes you can wash and reuse them, but quality generally goes down.
    Other than that, one has to add that those chips are not as exclusive as the newer sequencing systems (454, Solexa): there are a lot of companies around nowadays that manufacture them (with slightly different techniques, though). To date there is no consensus on which system/manufacturer is superior, btw. Also, quite a number of labs (including one of my former labs) print their own chips.
     
  23. Billy T Use Sugar Cane Alcohol car Fuel Valued Senior Member

    Messages:
    23,198
    To CharonZ:

    Thanks again. With your continued help I may some day understand this mess. How do they know what sequence of BPs they bound to the chip? Do they make a sequence that they know exists and throw in some beads with a particular 3D hologram inside that they always use for that sequence? If so I would think they can only discover sequences that they look for, not perhaps the mutated one causing the disease.
     
