The goal of this assignment is to use hardware-level parallelism as well as platform-level concurrency to solve a classic genomics problem: sequence alignment. You will gain experience programming FPGAs in Verilog and thinking about the kind of parallelism exposed by hardware, as well as experience with heterogeneous, or accelerator-based, programming using hardware specialized to particular programming tasks.
Roughly speaking, sequence alignment refers to a class of algorithms that compare nucleotide sequences, for example to determine measures of genetic similarity. There are multiple approaches to the problem, but the one we are interested in for this lab is often referred to as an optimal matching problem, or a global alignment problem. Specifically, given two DNA sequences, each a string over the alphabet {A,C,G,T}, align the two strings such that edit distance is minimized. Consider the two DNA sequences below:
ACGTTGCAGG
GTTGCAGGAT
Note that the problem is complicated by the possibility of insertions or deletions mid-sequence. For example, given the sequence on the left, the optimal alignment on the right shows that 'T' was inserted at (0-indexed) position 3 in the first string (or deleted from the second) and 'G' was inserted at position 7 in the second string (or deleted from the first). In keeping with the CS community's love of jargon, these insertions/deletions are called "INDELs."
In the most general form of the problem, it is possible to assign different weights to different pairwise combinations; for example, a mismatch of G+C might contribute -5 while G+T contributes -10. Additionally, there are numerous extensions in which additional letters may be added to represent ambiguity when more than one kind of nucleotide could occur at a position (e.g. R, purine, can represent an ambiguous choice between G and A). Since, for this class, we are interested less in the algorithmic nuances and more in the parallelization and concurrency aspects, your implementation will use the {A,C,G,T} alphabet for DNA, and will use the basic scoring scheme in which a match is worth 1, while mismatches and INDELs are worth -1.
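The classic algorithm for global alignment, Needleman-Wunsch, relies on a table of scores with one column per character of the first sequence and one row per character of the second, plus one extra leading row and column: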
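| | | A | C | G | T | T | G | C | A | G | G |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | | | | | | | | | |
| G | | | | | | | | | | | |
| T | | | | | | | | | | | |
| T | | | | | | | | | | | |
| G | | | | | | | | | | | |
| C | | | | | | | | | | | |
| A | | | | | | | | | | | |
| G | | | | | | | | | | | |
| G | | | | | | | | | | | |
| A | | | | | | | | | | | |
| T | | | | | | | | | | | |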
The cells at row and column index 1 are initialized with the negative
value of the corresponding index in the string. For example, [1,1]
corresponds to index 0 in both S1 and S2, and is initialized to 0.
[2,1] becomes -1, [3,1] becomes -2, and so on, as shown below in
step 1.
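The initialization can be sketched in a few lines of Python. This is only a software illustration under assumptions: the function name `init_table` is hypothetical, and the table is stored 0-based as `table[row][col]`, so the document's cell [c,r] corresponds to `table[r-1][c-1]` here.

```python
INDEL = -1  # INDEL penalty from the basic scoring scheme


def init_table(s1, s2):
    """Build a (len(s2)+1) x (len(s1)+1) table with its first row and column set.

    s1 runs across the top (columns), s2 runs down the side (rows).
    """
    rows, cols = len(s2) + 1, len(s1) + 1
    table = [[0] * cols for _ in range(rows)]
    for c in range(cols):
        table[0][c] = c * INDEL  # first row: 0, -1, -2, ...
    for r in range(rows):
        table[r][0] = r * INDEL  # first column: 0, -1, -2, ...
    return table


table = init_table("ACGTTGCAGG", "GTTGCAGGAT")
print(table[0])  # [0, -1, -2, -3, -4, -5, -6, -7, -8, -9, -10]
```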
The goal of the algorithm is to fill in the table with scores that
represent all possible alignments of the strings. At each cell of the table
a "local score" is computed, which depends on whether the {A, C, G, T}
values at the row and column headers for the cell match. A match gets a
positive value of 1, while a mismatch gets -1. For example,
in the table above, at [2,2], A is compared against G, which is a mismatch,
so the local score contribution would be -1, while at [2,7] A matches with A for a
local score contribution of +1. The total value at any given cell is the
maximum of three candidates: the score of the diagonal (upper-left) neighbor plus
the local score, and the scores of the two remaining neighbors (above and to the
left), each plus the INDEL penalty.
Initialized Row and Column 0
G[2,2] = max(G[1,1] + (local score(-1)),
G[1,2] + INDEL,
G[2,1] + INDEL)
G[2,2] = max((0+-1), (-1+-1), (-1+-1))
G[2,2] = -1
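The per-cell rule used in the calculation above can be written as a small function. This is a sketch only: the names `local_score` and `cell_score` are hypothetical, and the constants follow the basic scoring scheme stated earlier (match 1, mismatch -1, INDEL -1).

```python
MATCH, MISMATCH, INDEL = 1, -1, -1


def local_score(a, b):
    """Score contributed by comparing the column letter a with the row letter b."""
    return MATCH if a == b else MISMATCH


def cell_score(diag, up, left, a, b):
    """Value of one cell given its three already-computed neighbors."""
    return max(diag + local_score(a, b), up + INDEL, left + INDEL)


# G[2,2] in the document's notation: A vs. G, neighbors 0 (diagonal), -1 and -1.
print(cell_score(0, -1, -1, "A", "G"))  # -1
```

In a hardware design this function corresponds roughly to the combinational logic of a single grid cell: two adders, a comparator for the letters, and a three-way maximum.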
G[3,2] = max(G[2,1]+(local score(-1)),
G[2,2]+INDEL,
G[3,1]+INDEL)
= max(-1+-1, -1+-1, -2+-1) = -2
G[3,3] = max(G[2,2]+(local score(-1)),
G[2,3]+INDEL,
G[3,2]+INDEL)
= max(-1+-1, -2+-1, -2+-1) = -2
G[2,3] = max(G[1,2]+(local score(-1)),
G[2,2]+INDEL,
G[1,3]+INDEL)
= max(-1+-1, -1+-1, -2+-1) = -2
G[4,2] = max(G[3,1]+(local score(1)),
G[3,2]+INDEL,
G[4,1]+INDEL) = -1
G[4,3] = max(G[3,2]+(local score(-1)),
G[4,2]+INDEL,
G[3,3]+INDEL) = -2
G[2,4] = max(G[1,3]+(local score(-1)),
G[2,3]+INDEL,
G[1,4]+INDEL) = -3
...etc...
G[11,2] = max(G[10,1]+(local score(1)),
G[11,1]+INDEL,
G[10,2]+INDEL) = -8
G[11,3] = max(G[10,2]+(local score(-1)),
G[11,2]+INDEL,
G[10,3]+INDEL) = -6
...etc...
G[11,11] = max(G[10,10]+(local score(-1)),
G[11,10]+INDEL,
G[10,11]+INDEL) = 4
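Putting initialization and the per-cell rule together gives the complete table fill. The sketch below is a sequential software reference, not the FPGA design; `fill_table` is a hypothetical name, and the table is stored 0-based as `g[row][col]`, so the document's [c,r] is `g[r-1][c-1]`.

```python
MATCH, MISMATCH, INDEL = 1, -1, -1


def fill_table(s1, s2):
    """Fill the full score table; s1 indexes columns, s2 indexes rows."""
    rows, cols = len(s2) + 1, len(s1) + 1
    g = [[0] * cols for _ in range(rows)]
    for c in range(1, cols):
        g[0][c] = g[0][c - 1] + INDEL
    for r in range(1, rows):
        g[r][0] = g[r - 1][0] + INDEL
    for r in range(1, rows):
        for c in range(1, cols):
            local = MATCH if s1[c - 1] == s2[r - 1] else MISMATCH
            g[r][c] = max(g[r - 1][c - 1] + local,  # diagonal neighbor
                          g[r - 1][c] + INDEL,      # neighbor above
                          g[r][c - 1] + INDEL)      # neighbor to the left
    return g


g = fill_table("ACGTTGCAGG", "GTTGCAGGAT")
print(g[-1][-1])  # 4, the value of G[11,11] above
```

A reference like this is also handy for checking the grid your Verilog produces against known-good values.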
Once the score table has been filled in, the optimal alignment(s) correspond
to paths traced from the lower right [maxcols-1,maxrows-1] to the zero at the
upper left at [1,1]. In this case, the optimal alignment
yields a score of 4. The path corresponding to the optimal alignment is highlighted in gold in the following table.
Recovering optimal alignments is a matter of tracing back from the lower right
to the upper left. Each cell corresponds to an alignment entry pair (two letters,
or one letter and an INDEL), and the entry before it can be recovered by
deducing which score above it, to the left, or to the upper left is optimal,
and therefore contributed to the total score at that cell (there may be more than
one option in the general case). For example, at [10,8] the score 5 corresponds to a match (G+G);
the optimal alignment at the slot before it can be found by observing that the
score of 4 at [9,7] must have preceded it, since the slot does not correspond to
an INDEL, and the 4 at [9,7] plus the value of the match at [10,8] yields the observed
score. You may find it expedient in your own implementation to simply keep track of
which preceding cell contributed to the score at each cell.
| | | A | C | G | T | T | G | C | A | G | G |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | 0 | -1 | -2 | -3 | -4 | -5 | -6 | -7 | -8 | -9 | -10 |
| G | -1 | -1 | -2 | -1 | -2 | -3 | -4 | -5 | -6 | -7 | -8 |
| T | -2 | -2 | -2 | -2 | 0 | -1 | -2 | -3 | -4 | -5 | -6 |
| T | -3 | -3 | -3 | -3 | -1 | 1 | 0 | -1 | -2 | -3 | -4 |
| G | -4 | -4 | -4 | -2 | -2 | 0 | 2 | 1 | 0 | -1 | -2 |
| C | -5 | -5 | -3 | -3 | -3 | -1 | 1 | 3 | 2 | 1 | 0 |
| A | -6 | -4 | -4 | -4 | -4 | -2 | 0 | 2 | 4 | 3 | 2 |
| G | -7 | -5 | -5 | -3 | -4 | -3 | -1 | 1 | 3 | 5 | 4 |
| G | -8 | -6 | -6 | -4 | -4 | -4 | -2 | 0 | 2 | 4 | 6 |
| A | -9 | -7 | -7 | -5 | -5 | -5 | -3 | -1 | 1 | 3 | 5 |
| T | -10 | -8 | -8 | -6 | -4 | -4 | -4 | -2 | 0 | 2 | 4 |
The path shown in the table above corresponds to the optimal alignment:
     A  C  G  T  T  G  C  A  G  G  -  -
     -  -  G  T  T  G  C  A  G  G  A  T
    -1 -1  1  1  1  1  1  1  1  1 -1 -1
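The traceback can be sketched in software as follows. This is one possible approach under stated assumptions: `fill_table` and `traceback` are hypothetical names, the table is 0-based `g[row][col]`, and when several predecessors are optimal the sketch prefers the diagonal, then the cell above, then the cell to the left. A different tie-breaking order can produce a different, equally optimal alignment.

```python
MATCH, MISMATCH, INDEL = 1, -1, -1


def fill_table(s1, s2):
    """Fill the score table; s1 indexes columns, s2 indexes rows."""
    rows, cols = len(s2) + 1, len(s1) + 1
    g = [[0] * cols for _ in range(rows)]
    for c in range(cols):
        g[0][c] = c * INDEL
    for r in range(rows):
        g[r][0] = r * INDEL
    for r in range(1, rows):
        for c in range(1, cols):
            local = MATCH if s1[c - 1] == s2[r - 1] else MISMATCH
            g[r][c] = max(g[r - 1][c - 1] + local,
                          g[r - 1][c] + INDEL,
                          g[r][c - 1] + INDEL)
    return g


def traceback(g, s1, s2):
    """Walk from the lower right to the upper left, rebuilding one alignment."""
    r, c = len(s2), len(s1)
    a1, a2 = [], []
    while r > 0 or c > 0:
        if r > 0 and c > 0:
            local = MATCH if s1[c - 1] == s2[r - 1] else MISMATCH
            if g[r][c] == g[r - 1][c - 1] + local:
                # Diagonal move: two letters aligned (match or mismatch).
                a1.append(s1[c - 1])
                a2.append(s2[r - 1])
                r, c = r - 1, c - 1
                continue
        if r > 0 and g[r][c] == g[r - 1][c] + INDEL:
            # Move up: a letter of s2 aligned against a gap in s1.
            a1.append("-")
            a2.append(s2[r - 1])
            r -= 1
        else:
            # Move left: a letter of s1 aligned against a gap in s2.
            a1.append(s1[c - 1])
            a2.append("-")
            c -= 1
    return "".join(reversed(a1)), "".join(reversed(a2))


s1, s2 = "ACGTTGCAGG", "GTTGCAGGAT"
top, bottom = traceback(fill_table(s1, s2), s1, s2)
print(top)     # ACGTTGCAGG--
print(bottom)  # --GTTGCAGGAT
```

Recomputing which neighbor was optimal, as done here, trades a little extra work during traceback for not storing a direction per cell; storing the direction while filling the table is the alternative suggested above.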
You will develop your FPGA implementation using a runtime and JIT compiler called Cascade. You will use that toolchain to develop your Verilog code, and run it either in simulation, or on an Intel/Terasic DE10-nano FPGA board. Unlike other labs where we give you latitude to select an implementation platform and language, you must code in Verilog for this lab.
Due to COVID-19 and social distancing policies, the DE10-nano boards the department would typically loan you for the duration of this lab cannot be practically distributed. For those with access to a DE10-nano, we encourage you to do the complete lab. If you do not have access to a DE10-nano (most of you), it is expected that you will code, debug, and measure entirely in Cascade's simulation environment.
Deliverables will be detailed below, but the focus is on a writeup that provides performance measurements as graphs, and answers (perhaps speculatively) a number of questions. As with other labs in this course, spending some time setting yourself up to quickly and easily collect and visualize performance data is a worthwhile investment.
In step 1 of the lab, you will write a program that accepts command-line parameters to specify the following:
--S1: string of A,C,G,T
--S2: string of A,C,G,T
The output of your program should include the score table in CSV form and an optimal alignment, for example:
    ACGTTGCAGG--
    --GTTGCAGGAT
For example, the following command line, invoked on moonstone.csres.utexas.edu,
yields the corresponding CSV output in our sample solution:
    ./lab3 --S1 ACGTTGCAGG --S2 GTTGCAGGAT
    ,,A,C,G,T,T,G,C,A,G,G,
    , 0 , -1 , -2 , -3 , -4 , -5 , -6 , -7 , -8 , -9 , -10 ,
    G , -1 , -1 , -2 , -1 , -2 , -3 , -4 , -5 , -6 , -7 , -8 ,
    T , -2 , -2 , -2 , -2 , 0 , -1 , -2 , -3 , -4 , -5 , -6 ,
    T , -3 , -3 , -3 , -3 , -1 , 1 , 0 , -1 , -2 , -3 , -4 ,
    G , -4 , -4 , -4 , -2 , -2 , 0 , 2 , 1 , 0 , -1 , -2 ,
    C , -5 , -5 , -3 , -3 , -3 , -1 , 1 , 3 , 2 , 1 , 0 ,
    A , -6 , -4 , -4 , -4 , -4 , -2 , 0 , 2 , 4 , 3 , 2 ,
    G , -7 , -5 , -5 , -3 , -4 , -3 , -1 , 1 , 3 , 5 , 4 ,
    G , -8 , -6 , -6 , -4 , -4 , -4 , -2 , 0 , 2 , 4 , 6 ,
    A , -9 , -7 , -7 , -5 , -5 , -5 , -3 , -1 , 1 , 3 , 5 ,
    T , -10 , -8 , -8 , -6 , -4 , -4 , -4 , -2 , 0 , 2 , 4 ,
    ACGTTGCAGG--
    --GTTGCAGGAT
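The command-line handling and CSV layout could be prototyped as in the Python sketch below before committing to an implementation language. The helper names `parse_args` and `table_to_csv` are hypothetical, and the whitespace around the commas in the sample solution is not reproduced; only the overall layout (two leading empty header cells, one row label per line, a trailing comma) follows the output shown above.

```python
import argparse


def parse_args(argv=None):
    """Parse the --S1 and --S2 options described for step 1."""
    parser = argparse.ArgumentParser(description="Global sequence alignment")
    parser.add_argument("--S1", required=True, help="string of A,C,G,T")
    parser.add_argument("--S2", required=True, help="string of A,C,G,T")
    return parser.parse_args(argv)


def table_to_csv(g, s1, s2):
    """Render a filled score table g (rows follow s2, columns follow s1) as CSV."""
    lines = [",," + ",".join(s1) + ","]
    labels = [""] + list(s2)  # the first table row has no letter label
    for label, row in zip(labels, g):
        lines.append(label + "," + ",".join(str(v) for v in row) + ",")
    return "\n".join(lines)


# Tiny hand-filled table for S1 = "A", S2 = "A" to show the layout.
args = parse_args(["--S1", "A", "--S2", "A"])
csv_text = table_to_csv([[0, -1], [-1, 1]], args.S1, args.S2)
print(csv_text)
```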
The algorithm described above yields some very natural parallelizations for FPGAs. You will use Cascade to implement the algorithm in Verilog. Cascade's README.md provides a good overview of how to use Cascade. The README.md should be considered mandatory reading whether you wish to set Cascade up on your own system or use our virtual machine image, Cascade.ova. In particular, if you plan to use Cascade on anything other than a Linux host, you will need to use the virtual machine layer, as Cascade currently runs only on a Linux stack.
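One commonly used way to see the available parallelism (offered here as an illustration, not necessarily the decomposition this lab intends) is that every cell depends only on its upper, left, and upper-left neighbors, so all cells on the same anti-diagonal are independent of one another. The Python sketch below fills the table one anti-diagonal ("wave") at a time; the name `fill_by_wavefront` is hypothetical and the inner loop is where hardware could evaluate all cells at once.

```python
MATCH, MISMATCH, INDEL = 1, -1, -1


def fill_by_wavefront(s1, s2):
    """Fill the score table anti-diagonal by anti-diagonal.

    Returns the table and the number of waves needed.
    """
    rows, cols = len(s2) + 1, len(s1) + 1
    g = [[0] * cols for _ in range(rows)]
    for c in range(cols):
        g[0][c] = c * INDEL
    for r in range(rows):
        g[r][0] = r * INDEL
    waves = 0
    for d in range(2, rows + cols - 1):  # d = r + c for interior cells
        lo = max(1, d - (cols - 1))
        hi = min(rows - 1, d - 1)
        # Cells on this anti-diagonal read only diagonals d-1 and d-2,
        # so they do not depend on each other.
        for r in range(lo, hi + 1):
            c = d - r
            local = MATCH if s1[c - 1] == s2[r - 1] else MISMATCH
            g[r][c] = max(g[r - 1][c - 1] + local,
                          g[r - 1][c] + INDEL,
                          g[r][c - 1] + INDEL)
        waves += 1
    return g, waves


g, waves = fill_by_wavefront("ACGTTGCAGG", "GTTGCAGGAT")
print(g[-1][-1], waves)  # 4 19
```

For two length-10 strings this takes 19 waves rather than 100 sequential cell updates, which gives a rough sense of the speedup a spatial implementation might expose.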
The instructor is able to coordinate the loaning out of a small number of DE10-nano boards, but this is discouraged due to COVID-19. If you feel strongly that you want to use one, contact the instructor.
Cascade has some very nice properties that you should find helpful for this lab. In particular, it allows you to do "printf"-style debugging using "$display", which is otherwise impossible with FPGA hardware. More importantly, Cascade is a JIT compiler that encapsulates the programming of the actual FPGA hardware behind software emulation, allowing you to run/test/debug your changes immediately, rather than waiting for a lengthy hardware compilation to complete. It also has features for managing inputs and outputs using file I/O, which we will rely on for this lab.
Your implementation will accept inputs in a *.mem file, and will produce as output the complete grid, also in a *.mem file. We will provide tools and skeleton code for getting inputs and outputs to/from the FPGA.
Your implementation will have the following inputs and outputs:
The skeleton code we provide (main.v, constants.v, and debug.v) will help you create and manage inputs, and demonstrates how to use Cascade's "$display" debugging tools. The file cascade-files/nw.v is where you will write your code. We strongly recommend you decompose your solution by implementing a module that describes a single cell that does comparisons for a point in the grid described above, and a top-level module that composes those cells into a grid. The Verilog excerpt below, from main.v, uses Cascade's I/O support to instantiate your top-level module and populate its inputs.
    // Instantiate your top-level Needleman-Wunsch module:
    wire [LENGTH*CWIDTH-1:0] s1 = rdata[2*LENGTH*CWIDTH-1:1*LENGTH*CWIDTH];
    wire [LENGTH*CWIDTH-1:0] s2 = rdata[1*LENGTH*CWIDTH-1:0*LENGTH*CWIDTH];
    wire signed[SWIDTH-1:0] score;
    YOUR_TOP_LEVEL_MODULE#(
      .LENGTH(LENGTH),
      .CWIDTH(CWIDTH),
      .SWIDTH(SWIDTH),
      .MATCH(MATCH),
      .INDEL(INDEL),
      .MISMATCH(MISMATCH)
    ) grid (
      .s1(s1),
      .s2(s2),
      .score(score)
    );
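The two part-selects on rdata in the excerpt split one wide word into an upper half (s1) and a lower half (s2), each LENGTH*CWIDTH bits wide. The Python sketch below mirrors that slicing with shifts and masks; `split_rdata` is a hypothetical helper, and the example values of length and cwidth are chosen only for illustration (the real values come from constants.v).

```python
def split_rdata(rdata, length, cwidth):
    """Mirror the Verilog part-selects on rdata using integer arithmetic."""
    width = length * cwidth
    mask = (1 << width) - 1
    s1 = (rdata >> width) & mask  # rdata[2*LENGTH*CWIDTH-1 : 1*LENGTH*CWIDTH]
    s2 = rdata & mask             # rdata[1*LENGTH*CWIDTH-1 : 0*LENGTH*CWIDTH]
    return s1, s2


# Illustrative values only: two characters of two bits each per string.
s1, s2 = split_rdata(0b11010010, length=2, cwidth=2)
print(bin(s1), bin(s2))  # 0b1101 0b10
```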
The subsequent Verilog code in main.v manages the clock signal and inputs/outputs, and waits until your code has computed the score:
    // While there are still inputs coming out of the fifo, print the results:
    reg once = 0;
    always @(posedge clock.val) begin
      // Base case: Skip first input when fifo hasn't yet reported values
      if (!once) begin
        once <= 1;
      end
      // Edge case: Stop running when the fifo reports empty
      else if (empty) begin
        $finish(1);
      end
      // Common case: Print results as they become available
      else begin
        $display("align(%h,%h) = %d", s1, s2, score);
      end
    end
You will implement your code in nw.v, which is included by main.v and debug.v. The debug.v file is similar to main.v, with the exception that it uses "$display" to show the intermediate state of your internal FPGA logic. In your writeup, provide the following graphs and answer the following questions:
If you are not working on a Linux desktop or laptop where you have sudo
privilege (which is recommended but may not be practical), you will need to bring up the
lab in a Virtual Machine. Instructions for using VirtualBox to create and use a VM
with Windows, MacOS, and Linux can be found at:
For those who want to bring up their own VM and manage cascade installation on the DE10
themselves:
Windows 10-Specific VM and DE10 Bringup Instructions
MacOS-Specific VM and DE10 Bringup Instructions
If you are using the Cascade.ova VM with VirtualBox, there are a few things you need to know:
password. This might be worth knowing if you let your VM go to sleep and hit the lock screen.
cd ~/Desktop/cascade to find it.
git pull; make clean; make;
--march de10 should enable hardware measurements and does not invoke JIT compilation.
march flag you use.
You can then regenerate your FPGA scaling graphs using cycle count as the runtime.
For comparing against C++, assume your code is running on the DE10-Nano at a
frequency of 50MHz and graph the expected hardware runtime. This may be a bit
idealized, but should give you a good idea of how fast FPGAs can be.
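Converting a measured cycle count into an expected hardware runtime is a single division by the clock frequency. A minimal sketch, assuming the 50 MHz figure stated above; the function name and the example cycle count are hypothetical.

```python
CLOCK_HZ = 50_000_000  # assumed DE10-Nano clock frequency of 50 MHz


def cycles_to_seconds(cycles, clock_hz=CLOCK_HZ):
    """Idealized runtime in seconds for a design that takes `cycles` clock cycles."""
    return cycles / clock_hz


# Hypothetical measurement: 1,000 cycles is 20 microseconds at 50 MHz.
print(cycles_to_seconds(1_000))  # 2e-05
```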
Using the Canvas
turn-in utility, you should turn in, along with your code, Makefiles, and measurement scripts,
a brief writeup with the scalability graphs requested above. Be sure that your
writeup includes sufficient text to enable us to understand which graphs
are which. Note that, as with other labs in this course, we will check
solutions for plagiarism using Moss.
One of the goals of using cascade is to aid research efforts in improving the programmability of FPGAs. To this end, cascade is instrumented to collect information about compile times and compiler errors/successes that can be used in a subsequent (anonymized!) study. Cascade will produce a file called "cascade-log" in your home directory. We hope you will include this file in your submission out of support for the "good fight" that is computer science research. However, we will also provide an additional 5 points of extra credit for anyone who turns this file in with their submission. Thanks in advance for helping the research effort!
A LaTeX template that includes placeholders for graphs and reiterates any questions we expect answers for can be found here (a build of that template is here).
Please report how much time you spent on the lab.