Storing movies in DNA

Meinolf Blawat, Jean Bolot, Christophe Diot, Technicolor

semanticscholar(2016)

Abstract
We describe in this paper, at a high level, a new approach to archive or store movies (or other media content) in the base sequence of DNA and to retrieve them without error. The approach provides compact storage for thousands of years without fear of obsolescence, since DNA is a universal information storage mechanism in biological organisms.

1. THE BASIC IDEA

You might remember DNA from your biology class in high school or college. DNA stands for Deoxyribonucleic Acid, and it is a molecule that encodes the genetic instructions used in living organisms. In cells it is organized in long structures called chromosomes. DNA is most famous for its double helix structure, which was first revealed in a series of papers in 1953 [13]. The double helix has since become an iconic image and an instantly recognizable visual representation of genetics and biology, or even science in general, and not surprisingly it often shows up in movies: on computer screens and holographic displays, and in science labs with genetic testing and experiments. Think “radioactive spiders” or “lost dinosaurs”, for example, the stuff of monsters or genetically altered creatures that run out of control... In fiction movies at least, Mr. DNA is a bad guy.

We will not be talking about this DNA character here. Instead, we will talk about actually storing the movie content in DNA form. You know that movies can be and have been stored on reels of celluloid film, on DVDs and other optical media of various kinds, or on magnetic media such as hard disk or solid state drives. The idea here is to use DNA as the physical medium to store movies.

Two questions might be popping into your mind. Number one: Why would you want to do that? And number two: How would you do that? It just seems, well, hard to even imagine or fathom. We will go over both questions in more detail later in the paper, but for now here are the short and intuitive answers.

Why would we want to store movies in DNA?
There are two main reasons why we are interested in storing movies on DNA.

First, unlike current storage methods such as film, optical media or magnetic media, DNA is extremely stable and robust to damage. If a movie is encoded using DNA molecules, or strands, those molecules would essentially remain in place for tens of thousands of years, especially if they are stored in cold, dry, and dark conditions. This is what allows researchers to analyze DNA from ancient frozen humans or mammoths (or what allows “movie researchers” to bring dinosaurs back to life in “Jurassic Park”). See also, for example, Reference [8] for a recent study that simulates the degradation of encoded DNA over a few millennia.

Second, if a movie is encoded using DNA molecules, the technology to read those molecules is well-known and simple. It is safe to assume that, unless some catastrophic event occurs and wipes out basic technical knowledge from the surface of the Earth, the technology to read information that was encoded in DNA today will still exist hundreds or thousands of years from now. This is very unlike the situation with optical or magnetic media, where new formats and read/write techniques are introduced every decade or so.

Taken together, the two points above (DNA is robust over tens of thousands of years, and technology will exist then to read it) make DNA a very compelling solution for long-term archival and storage of valuable data.

There is another “bonus” reason to use DNA to store movies: DNA is extremely compact, and biological encoding of data on DNA would lead to an information density several orders of magnitude higher than is possible on magnetic media. Church and others at Harvard Medical School achieved in 2012 experimental densities of several petabits per mm³ [3], [4], enough to (at least in theory) store a million-picture catalog in a small bottle of water.

How would we store movies in DNA?
The process to store and retrieve movies on DNA is, conceptually at least, quite simple, and it proceeds through the following steps:

1. Digitize: Starting from the original content (movie, show, or other), digitize the content to obtain a sequence of 0’s and 1’s.
2. Encode: Recall that DNA molecules consist of two strands coiled around each other to form the double helix mentioned earlier. The two strands are in turn composed of 4 types of nucleotides referred to as A, C, T, and G. This second step takes the sequence of 0’s and 1’s obtained at the end of step 1 and converts it into a sequence of nucleotides. One simple way to do it (although not recommended in practice) would be to map bits one-to-one to nucleotides, for example 0 randomly to A or C, and 1 randomly to T or G. The output of step 2 is then a long sequence of A, C, T and G’s.
3. Synthesize: This step takes the string of A, C, T, G obtained at the end of step 2 and creates or synthesizes artificial DNA (meaning non-biological DNA) with the same sequence of nucleotides, using commercially available synthesis machines.
4. Archive: Store the DNA obtained at the end of step 3, for however long is needed.
5. Sequence: Read the stored DNA strands using commercially available DNA sequencing machines, thus getting back a sequence of A, C, T and G’s.
6. Decode: Using the inverse of the coding technique used in step 2, convert the sequence obtained at the end of step 5 into a sequence of 0’s and 1’s.
7. Read: Read the sequence of 0’s and 1’s and play the movie or content.

Steps 1 to 3 are the “writing” or synthesis steps, with the goal of writing the DNA corresponding to the movie content. Steps 4 to 7 are the “reading” or sequencing steps, with the goal of reading the DNA back to the original movie content.

The overall process seems quite simple, but there are at least two key issues to overcome before DNA storage of movies can become feasible in practice.
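As an aside, the simple bit-to-nucleotide mapping mentioned in step 2, and its inverse from step 6, can be sketched in a few lines of Python. This is only an illustration of that naive mapping (which the text notes is not recommended in practice), not the coding scheme developed in the project. The function names and the byte/bit helpers are hypothetical and chosen here for illustration.

```python
import random

# Naive mapping from step 2 (illustration only): 0 -> A or C, 1 -> T or G.
ZERO_BASES = "AC"
ONE_BASES = "TG"


def bytes_to_bits(data):
    """Step 1 stand-in: turn raw bytes into a string of '0'/'1' characters."""
    return "".join(f"{byte:08b}" for byte in data)


def bits_to_bytes(bits):
    """Inverse of bytes_to_bits; expects a multiple of 8 bits."""
    if len(bits) % 8 != 0:
        raise ValueError("bit string length must be a multiple of 8")
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def encode_bits(bits, rng=None):
    """Step 2: map each bit to one nucleotide, picked at random from its pair."""
    if rng is None:
        rng = random.Random()
    bases = []
    for bit in bits:
        if bit == "0":
            bases.append(rng.choice(ZERO_BASES))
        elif bit == "1":
            bases.append(rng.choice(ONE_BASES))
        else:
            raise ValueError(f"not a bit: {bit!r}")
    return "".join(bases)


def decode_bases(bases):
    """Step 6: invert the mapping; A/C decode to 0 and T/G decode to 1."""
    bits = []
    for base in bases:
        if base in ZERO_BASES:
            bits.append("0")
        elif base in ONE_BASES:
            bits.append("1")
        else:
            raise ValueError(f"not a nucleotide: {base!r}")
    return "".join(bits)


if __name__ == "__main__":
    payload = b"DNA"
    strand = encode_bits(bytes_to_bits(payload))
    print(strand)  # 24 nucleotides; the exact letters vary from run to run
    print(bits_to_bytes(decode_bases(strand)))  # recovers the original bytes
```

Because each bit has two possible nucleotides, the same content can be written as many different strands, and all of them decode to the same bits. The sketch assumes error-free synthesis and sequencing; a substitution between the two groups (for example A read as T) would silently flip a bit.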
A first issue is that the process of writing DNA (synthesis, Step 3) and the process of reading DNA (sequencing, Step 5) are error prone, with error rates of up to several percent. The corresponding challenge is to design coding schemes such that the original movie content can be synthesized, then sequenced, then decoded and read completely error-free, even with potentially high error rates during the synthesis and sequencing steps.

A second issue is that it is only possible to write or synthesize relatively small amounts of DNA at this point. The most significant recent accomplishment was the synthesis of a 650-kbyte book (encoded in HTML format) by the team of George Church at Harvard in 2012, but movie content would require the synthesis of amounts of DNA that are orders of magnitude larger.

In collaboration with George Church at Harvard, we launched a project in 2013 aimed at developing i) new coding techniques for error-free DNA storage and retrieval of movie and media content and ii) new techniques to synthesize large amounts of DNA data. In this paper we consider in particular some of the work related to coding schemes (Steps 2 and 6 in the sequence of steps described above), and more generally provide relevant background information on bio- and other technologies to understand our approach and the goal of the project.

In Section 2, we describe the recent trends in biotechnology that led us to even consider the possibility of storing movies on DNA. In Section 3, we consider the specifics of storing movie content on DNA. In Section 4, we describe one of the coding schemes we developed to compensate for and correct the errors that naturally occur when synthesizing and sequencing DNA for movie or media content. Section 5 concludes the paper with an update on current status and future directions.

2. TRENDS IN BIO-TECHNOLOGY

Biotechnology is a recent story, which emerged around the time of the arrival of the transistor.
The double-helix model of DNA was revealed in 1953, along with experimental supporting evidence, in a series of five articles in Nature [3]. In 1977, Sanger et al. described a sequencing method to map the DNA of a complete bacteriophage genome. The Human Genome Project was launched in 1990 and completed in 2003, two years earlier than planned, at a cost estimated to be around $3 billion [4]. Human genome sequencing today takes a few days and costs less than $10,000, with the price steadily dropping.

The exponential increases in transistor and integrated circuit capabilities have been summarized using the celebrated Moore’s law, which observes that the number of transistors in integrated circuits doubles approximately every two years. The speed of genome sequencing has far more than doubled every two years since 2003; in other words, biotech has gone “faster than Moore’s law”. This is illustrated in Figure 1 (adapted from Reference [7]), which shows on the log-scaled y-axis the relative growth in capabilities over time for both DNA synthesis (writing) and sequencing (reading), specifically the synthesizable amount of oligos (“Oligos”) and the number of base pairs that can be sequenced for one dollar (“seq bp/$”). Because the y-axis is logarithmic, a straight line in Figure 1 indicates an exponential increase. We observe, starting in 2003, an exponential increase at a much higher rate than in the years 1980 to 2002.

In parallel with increased capabilities, technological advances have led to a massive decrease in the cost to synthesize or sequence DNA. This is visible in Figure 1 in the curve labelled “seq bp/$” and is also illustrated in Figure 2, which shows the striking decrease in sequencing cost, even compared to a Moore’s-law-type decrease. Sequencing a human genome cost roughly $100 million in 2001 and less than $10,000 in 2014.
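To put the two cost figures above in perspective, the short calculation below estimates how often the sequencing cost halved between 2001 and 2014. It is a rough sketch under two assumptions that are not in the text: that the decline was a steady exponential between the two quoted endpoints, and that a “Moore’s-law-type decrease” means the cost halves every two years. The function names are hypothetical.

```python
import math


def halving_time_years(cost_start, cost_end, years):
    """Years per cost halving, assuming a steady exponential decline."""
    halvings = math.log2(cost_start / cost_end)
    return years / halvings


def cost_after(cost_start, years, halving_time):
    """Cost after `years` if the cost halves once every `halving_time` years."""
    return cost_start / 2 ** (years / halving_time)


# Figures quoted in the text: roughly $100 million in 2001, under $10,000 in 2014.
years = 2014 - 2001
sequencing_halving = halving_time_years(100e6, 10e3, years)

# Assumed comparison: the same starting cost, halving every two years.
moore_type_cost = cost_after(100e6, years, halving_time=2.0)

print(f"sequencing cost halved about every {sequencing_halving:.2f} years")
print(f"two-year halving would give about ${moore_type_cost:,.0f} in 2014")
```

Under these assumptions the sequencing cost halved a little more often than once a year, whereas a two-year halving over the same 13 years would have left the cost at roughly $1.1 million rather than under $10,000. Since the 2014 figure is quoted as an upper bound, the actual halving time was, if anything, shorter.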
Arkinvest, a market analysis company specializing in the field of bio-technologies, expects the cost of genome sequencing to reach $100 in the relatively near future [9], with a correlated increase in the number of sequenced genomes, as shown in Figure 3 below (from [9]).

Figure 2: Cost of sequencing a human genome, with co