Programming Exercise 6
Using Arrays - DNA Matching

This program was adapted from a version Rich Pattis did at CMU.

Provided classes and files:    DNAMain.java, Sequencer.java, FileHandler.java, Strand.java, StrandCollection.java, Timer.java, ConsoleInput.java, simple.txt, complex.txt, all files zipped
What to turn in: Strand.java, Sequencer.java, StrandCollection.java

This program is a simplification of the process involved in the Human Genome Project and other similar projects for mapping DNA. For example, see:

Vocabulary:

For this assignment, you'll be reading a file that identifies several strands in a DNA molecule and trying to reconstruct the DNA molecule from the strands. The data you'll be processing is artificially generated (by a program) but the process is similar to what's done by computer programs in the human genome project. In a laboratory, a fragment of DNA is duplicated, then all the fragments are broken into smaller, overlapping strands by methods like thermal dissociation and agitation. In a lab, the bases of the smaller strands are determined and entered into a file. At this point you're stepping in to process the file and reconstruct the original DNA sequence of bases. The data for this assignment is perfect, not noisy or with any mistakes. In this respect the data is unlike that processed by the Human Genome Project.

Basic Operations of the Sequencer:

The basic operation that's essential in doing this assignment is to process all the strands in a strand collection by matching two strands and merging them into a new strand. This process decreases the number of strands by one since two strands are merged into one strand. The match/merge operation is repeated until there is only one strand left in the strand collection---this strand will be the original DNA; or until no more matches are possible.

Matching is the process whereby 2 strands are checked for overlap. This is similar in nature to the String matching you did on assignment 1. Consider 2 strands labeled A1 and B2.

A1
a t c g g t a t a c t g

 

B2
g t a t a c t g g a c t

To see if A1 and B2 have any overlap lineup A1 and B2. One question is where to initially line up the strands. The matching threshold is the minimum number of characters that must match between the two strands to form a valid match or overlap. Assume the matching threshold is 4. This means there must be at least a 4 character overlap between strands to have a valid match. Start by lining up A1 and B2 with the minimum possible overlap:

A1
a t c g g t a t a c t g
g t a t a c t g g a c t
B2

This is not a valid match, (it fails on the first character, a vs. g) so slide B1 to the left by one character:

A1
a t c g g t a t a c t g
g t a t a c t g g a c t
B2

Again this is not a valid match so we slide B1 another character to the left. This continue until we consider all possible alignments or find an overlap. In this case an overlap exists if B1 is moved three more characters to left.

A1
a t c g g t a t a c t g
g t a t a c t g g a c t
B2

At this point there is a valid overlap, a match character for character between the strands at the overlapping characters and it is greater than the threshold value of 4. This indicates there is a valid match. between A1 and B2.

Now the strands are to be merged to form a new, single strand. Based on the overlap the new strand would be:

A1B2
a t c g g t a t a c t g g a c t

The label of the two new strands is the combination of the original two labels. The new, merged strand is a candidate for other possible matches, but A1 and B2 are no longer in the collection of strands as single entities. 

Notice this method finds the smallest possible match. It is possible that this method would produce a false positive, meaning it appears to be a match but really isn't. The bases in the overlap between the two strands matched by chance, not because they represent the same DNA chain. How would you process the strands to ensure the largest possible match is found? What is the chance of a false positive as a function of the threshold value?

Another possibility for matching between strands is a containing match. In this case one strand is completely present in another. There is not just an overlap, but a complete match, very much like the String matching algorithm in assignment 1. For example

A1B2
a t c g g t a t a c t g g a c t

 

C2
g g t a t

Notice strand C2 is contained completely within strand A1B2

A1B2
a t c g g t a t a c t g g a c t
g g t t t
C1

When a containing match occurs the strands are merged into a single strand, but the length is the same as the longest original strand in the pair.

A1B2C1
a t c g g t a t a c t g g a c t

Important note: If a containing match occurs it does not need to meet or exceed the threshold value. If the threshold value had been 6 the containing match between strand A1B2 and C1 would still have been valid.

What to do:

You will be completing three classes, Sequencer, StrandCollection and Strand.

The Sequencer class has been partially completed. I have created a method that reads in sequences of bases and their labels from a file. The file is sent as a parameter to the program, i.e. it is the first and only String in sent to the String[] args method in main. The three methods in Sequencer are

The job of the sequencer is to create strands from data in the file and load them into a StrandCollection object. The Sequencer then processes the StrandCollection by trying to match and merge strands. The Sequencer will do this until no more overlapping matches meeting the threshold value or containing matches exist  and then will print out the results.

The StrandCollection class is a container for Strands. The StrandCollection should be capable of adding, deleting, and accessing Strand objects. You must use a native array of Strand objects as your storage container for the StrandCollection

The Strand class represents a single strand of bases. Each strand has a label, which is a String, and a set of bases, which for this assignment shall be represented as a native array of characters. Note, it would be nice to use the String class for this, but again, CS307 is not a class on the Java API, it is a class of programming and so I want you to have practice working with arrays and non trivial algorithms involving arrays.

I am leaving a number of design decisions to you. Where should the match method be performed? The Strand class or the StrandCollection or the Sequencer? How should merging be handled? I strongly suggest you spend some time designing your classes and deciding on the methods of each class before you start coding things on the computer.

What to complete

Classes Sequencer.java, StrandCollection.java, and Strand.java. Also answer the following questions:

1. What is the probability that 2 strands of equal length will match exactly (character for character, a complete match) as a function of the length of the strands? For 2 strands of lengths 5, 10 and 15 what are the probabilities that they will match exactly purely by chance?

2. The method of matching strands described in this handout finds the minimum possible match. What is the difference in time between this method and ensuring the greatest possible match is found? Include two methods for matching, one that finds the first, minimum possible match, and one that ensures the greatest possible match is found. To find the difference in Time between these two methods use the provided Timer class. What are the results for simple.txt and complex.txt?

Provided Files:

DNAMain.java: This creates a Sequencer object and calls the go method. The main method expects 2 parameters, the first is the location of the file with the raw data and the second is the threshold value to use in processing the strands.

FileHanlder: reads data from a file. You should not have to deal with this.

simple.txt: A small collection of strands

complex.txt: A large collection of strands