# Bioinformatics

**Paper, Order, or Assignment Requirements**

Objective: Design and write an OLC (Overlap Layout Consensus) program for DNA sequencing.

1. Data

Download the first two chromosome sequences of Yeast (S. cerevisiae) (NC_001133.fna and NC_001134.fna) from ftp://ftp.ncbi.nlm.nih.gov/genomes/Fungi/Saccharomyces_cerevisiae_uid128.
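
A minimal sketch of loading a downloaded chromosome, assuming each `.fna` file is a plain-text FASTA file with a single record (one `>` header line followed by sequence lines). The function name `parse_fasta` is hypothetical and not part of any library.

```python
def parse_fasta(lines):
    """Return the sequence of a single-record FASTA file as one uppercase string.

    `lines` can be an open file handle or any iterable of text lines.
    """
    parts = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and the ">" header line.
        if not line or line.startswith(">"):
            continue
        parts.append(line.upper())
    return "".join(parts)


# Usage (assumes the file was saved to the working directory):
# with open("NC_001133.fna") as handle:
#     chromosome_1 = parse_fasta(handle)
```

Accepting an iterable of lines rather than a path keeps the parser easy to try on small in-memory examples.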

Simulate reads by randomly sampling segments of the genome with an average length of 400 bp. For each read, generate two random numbers: a start position between 1 and the length of the chromosome, and a read length drawn so that the average is 400 bp. Ideally, you will use a mathematical distribution for read lengths (for example, a Gaussian distribution), but you may use a uniform length of 400 bp for all reads.

You have to generate enough reads so that an arbitrary base position is covered by 10 fragments on average; that is, the coverage depth is 10. As in the slides, to reach a coverage depth of 10, the total number of bases in all fragments has to be 10 times the genome size.
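
The read simulation described above could be sketched as follows. This is an illustration under stated assumptions, not a prescribed solution: the function name `simulate_reads`, the standard deviation of 40 bp, and the optional `seed` are all choices made for the example. Start positions are 0-based here, and each start is drawn so that the whole read fits inside the chromosome.

```python
import random


def simulate_reads(genome, coverage=10, mean_len=400, sd_len=40, seed=None):
    """Sample reads until the total number of read bases reaches coverage * genome size."""
    rng = random.Random(seed)
    reads = []
    total_bases = 0
    target_bases = coverage * len(genome)
    while total_bases < target_bases:
        # Gaussian read length (sd_len is an assumed value), clamped to [1, genome length].
        length = max(1, round(rng.gauss(mean_len, sd_len)))
        length = min(length, len(genome))
        # Uniform 0-based start position such that the read does not run off the end.
        start = rng.randint(0, len(genome) - length)
        reads.append(genome[start:start + length])
        total_bases += length
    return reads
```

With a fixed read length of 400 bp, the same stopping rule gives roughly `coverage * len(genome) / 400` reads.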

Replace every ‘N’ in the reads with an arbitrary nucleotide (A, C, G, or T).
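
One way to do the replacement, shown as a small sketch: each ‘N’ is swapped for a randomly chosen base. The helper name `replace_n` and the optional `rng` argument are hypothetical; a fixed base such as ‘A’ would satisfy the requirement equally well.

```python
import random


def replace_n(read, rng=None):
    """Replace every 'N' in a read with a randomly chosen nucleotide."""
    rng = rng or random
    return "".join(rng.choice("ACGT") if base == "N" else base for base in read)
```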

Real reads include both 5’ to 3’ and 3’ to 5’ reads. For simplicity, assume that all simulated reads are 5’ to 3’ reads.

2. Write a program (or programs) to create the overlap graph from the reads. overlap(si, sj) is defined as the length of the longest match between a suffix of si and a prefix of sj. To compute all overlaps, you need to perform n*(n-1) comparisons, where n denotes the number of fragments.

The following is an example with 7 reads. Each overlap entry gives the number of the overlapping read, with the overlap value in parentheses.

# | Fragment | Overlap |
---|---|---|
1 | TACCTTG | 2(3) 4(1) 7(1) |
2 | TTGAT | 3(3) |
3 | GATATGG | 4(2) 7(1) |
4 | GGAG | 3(1) 7(1) |
5 | CTCTA | 1(2) 6(3) |
6 | CTAGT | |
7 | GCTCT | 2(1) 5(4) 6(2) |
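
The overlap definition and the n*(n-1) pairwise comparisons can be sketched as below. This is a straightforward quadratic version for illustration, not an optimized one. The function names and the `min_overlap` threshold are assumptions made for the example; reads are numbered from 1 to match the listing above.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def overlap_graph(reads, min_overlap=1):
    """Map each read number to {target read number: overlap length}."""
    graph = {i: {} for i in range(1, len(reads) + 1)}
    for i, a in enumerate(reads, start=1):
        for j, b in enumerate(reads, start=1):
            if i == j:
                continue  # a read is not compared with itself
            k = overlap(a, b)
            if k >= min_overlap:
                graph[i][j] = k
    return graph


reads = ["TACCTTG", "TTGAT", "GATATGG", "GGAG", "CTCTA", "CTAGT", "GCTCT"]
# Keep only overlaps of at least 2 bases (threshold chosen for the example).
graph = overlap_graph(reads, min_overlap=2)
```

Storing the graph as a dictionary of dictionaries keeps only the non-zero edges, which matters because most read pairs in a real data set do not overlap at all.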

3. Each read from part 2 above becomes a node in a graph. Each directed link (edge) between two nodes is weighted by their overlap value. The sequencing problem then becomes a variant of the traveling salesman problem: find a path that visits every node and has the largest sum of overlap values. You can try a greedy algorithm: start with the edge that has the largest overlap value, then keep following the edge with the largest overlap value out of the current node.
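
A minimal sketch of the greedy path-following idea, under several assumptions that are not in the assignment text: ties are broken by the lowest read index, a node is never revisited, and when the current node has no usable outgoing edge the contig is closed and a new one is started from the largest remaining edge. The names `greedy_assemble` and `min_overlap` are hypothetical. Being greedy, this does not guarantee the path with the largest total overlap.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_overlap=1):
    """Follow the largest overlaps greedily and return the resulting contigs."""
    n = len(reads)
    ov = {(i, j): overlap(reads[i], reads[j])
          for i in range(n) for j in range(n) if i != j}
    unvisited = set(range(n))
    contigs = []
    while unvisited:
        # Seed a contig with the largest edge between two unvisited reads.
        seeds = [(w, -i, -j) for (i, j), w in ov.items()
                 if w >= min_overlap and i in unvisited and j in unvisited]
        if not seeds:
            # No usable edges left: remaining reads become single-read contigs.
            contigs.extend(reads[i] for i in sorted(unvisited))
            break
        _, neg_i, neg_j = max(seeds)
        cur, nxt = -neg_i, -neg_j
        contig = reads[cur]
        unvisited.discard(cur)
        while nxt is not None:
            # Append only the part of the next read that is not already covered.
            contig += reads[nxt][ov[(cur, nxt)]:]
            unvisited.discard(nxt)
            cur = nxt
            candidates = [(ov[(cur, j)], -j) for j in unvisited
                          if ov[(cur, j)] >= min_overlap]
            nxt = -max(candidates)[1] if candidates else None
        contigs.append(contig)
    return contigs


reads = ["TACCTTG", "TTGAT", "GATATGG", "GGAG", "CTCTA", "CTAGT", "GCTCT"]
contigs = greedy_assemble(reads)
```

On the seven example reads this starts at the edge from read 7 to read 5 (overlap 4) and ends with a single contig that contains every read as a substring.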

4. Compare the resulting assembly with the original sequence from NCBI.
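
One possible way to quantify the comparison, sketched here as an assumption rather than a required method: count how many k-mers the assembly shares with the reference, and how many contigs occur verbatim in the reference. The function names, the default `k=21`, and the returned keys are all choices made for this illustration; a sequence alignment tool would be another reasonable approach.

```python
def kmer_set(seq, k):
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def compare_assembly(reference, contigs, k=21):
    """Summarize how well a list of contigs agrees with the reference sequence."""
    ref_kmers = kmer_set(reference, k)
    asm_kmers = set()
    for contig in contigs:
        asm_kmers |= kmer_set(contig, k)
    shared = ref_kmers & asm_kmers
    return {
        # Fraction of reference k-mers recovered by the assembly.
        "recall": len(shared) / len(ref_kmers) if ref_kmers else 0.0,
        # Fraction of assembly k-mers that really occur in the reference.
        "precision": len(shared) / len(asm_kmers) if asm_kmers else 0.0,
        # Number of contigs that appear verbatim in the reference.
        "exact_contigs": sum(1 for c in contigs if c in reference),
    }
```

A low precision points to misjoined reads (k-mers that span a wrong junction), while a low recall points to regions of the chromosome that the assembly missed.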

**What to submit:**

Submit a report including the following:

• The data structure that you used to store the simulated read fragments, with a discussion of how it would extend to genomes the size of a mammal's.

• A discussion on the comparison of your assembled sequence with the original sequence.

• Any references, if used.

• Your program in an appendix.
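
For the data-structure item, one option worth discussing is packing each base into 2 bits instead of storing one character per base. The sketch below is only an illustration of that idea, assuming reads contain only A, C, G, and T (that is, every ‘N’ has already been replaced); the names `pack` and `unpack` are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(read):
    """Encode a read as (integer, length), using 2 bits per base."""
    value = 0
    for base in read:
        value = (value << 2) | CODE[base]
    # The length is kept so that leading 'A' bases (code 0) are not lost.
    return value, len(read)


def unpack(value, length):
    """Decode an (integer, length) pair back into the read string."""
    bases = []
    for _ in range(length):
        bases.append(BASES[value & 3])
        value >>= 2
    return "".join(reversed(bases))
```

Compact storage only addresses memory; at mammalian scale the n*(n-1) pairwise comparison is likely to be the larger obstacle and would need its own discussion.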
