CS378H: Concurrency: Honors

Lab #5: FPGAs!

The goal of this assignment is to use hardware-level parallelism as well as platform-level concurrency to solve a classic genomics problem: sequence alignment. You will gain experience programming FPGAs in Verilog and thinking about the kind of parallelism exposed by hardware, as well as experience with heterogeneous, or accelerator-based, programming using hardware specialized to particular tasks.

Sequence Alignment

Roughly speaking, sequence alignment refers to a class of algorithms that compare nucleotide sequences, for example to determine measures of genetic similarity. There are multiple approaches to the problem, but the one we are interested in for this lab is often referred to as an optimal matching problem, or a global alignment problem. Specifically, given two DNA sequences, represented as strings over the alphabet {A,C,G,T}, align those two strings such that edit distance is minimized. Consider the two DNA sequences below:

      
	ACGTTGCAGG
	GTTGCAGGAT 
  
The sequences can be "aligned" in several ways, with each way yielding a similarity score determined by the positions at which the letters do or do not match. Concretely, at each position, we assign a score of 1 when the letters match, and -1 when they do not (including when a letter is paired with a gap, written "-"). Four possible alignments are shown below, followed by their scores. The optimal alignment among those shown is the one with the highest score, or minimum edit distance: the fourth, with a score of 4. Note that in general, there may be more than one optimal alignment for a given pair of sequences.
      
 -  A  C  G  T  T  G  C  A  G  G
 G  T  T  G  C  A  G  G  A  T  -
-1 -1 -1  1 -1 -1  1 -1  1 -1 -1

  
      
 -  -  A  C  G  T  T  G  C  A  G  G
 G  T  T  G  C  A  G  G  A  T  -  -
-1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1

  
      
 A  C  G  T  T  G  C  A  G  G  -  -  -  -
 -  -  -  -  G  T  T  G  C  A  G  G  A  T
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1

  
      
 A  C  G  T  T  G  C  A  G  G  -  - 
 -  -  G  T  T  G  C  A  G  G  A  T
-1 -1  1  1  1  1  1  1  1  1 -1 -1 

  
SCORE (alignment 1): (8*-1)+(3*1) = -5
SCORE (alignment 2): (11*-1)+(1*1) = -10
SCORE (alignment 3): (14*-1)+(0*1) = -14
SCORE (alignment 4): (4*-1)+(8*1) = 4
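
The scoring rule above can be checked mechanically. The following is a minimal sketch, assuming each alignment is written as two equal-length strings with "-" for a gap; the function name score_alignment is ours, not part of the lab's skeleton code.

```python
def score_alignment(top, bottom, match=1, penalty=-1):
    """Score two equal-length gapped strings position by position."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        # A position earns `match` only when both letters are equal and
        # neither is a gap; mismatches and gaps both cost `penalty`.
        total += match if (a == b and a != "-") else penalty
    return total

# The four alignments shown above, in order.
alignments = [
    ("-ACGTTGCAGG", "GTTGCAGGAT-"),
    ("--ACGTTGCAGG", "GTTGCAGGAT--"),
    ("ACGTTGCAGG----", "----GTTGCAGGAT"),
    ("ACGTTGCAGG--", "--GTTGCAGGAT"),
]
scores = [score_alignment(t, b) for t, b in alignments]
print(scores)  # [-5, -10, -14, 4]
```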

Note that the problem is complicated by the possibility of insertions or deletions mid-sequence. For example, given the pair of sequences below, the optimal alignment that follows shows that 'T' was inserted at (0-indexed) position 3 in the first string (or deleted from the second) and 'G' was inserted at position 7 in the second string (or deleted from the first). In keeping with the CS community's love of jargon, these insertions/deletions are called "INDELs."
     
	ACGTTGCAGT
	ACGTGCGAGT
     
  
      
 0  1  2  3  4  5  6  7  8  9 10
 -------------------------------	
 A  C  G  T  T  G  C  -  A  G  T
 A  C  G  -  T  G  C  G  A  G  T 
 1  1  1 -1  1  1  1 -1  1  1  1 

  
SCORE: (2*-1)+(9*1) = 7
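
To make the INDEL positions concrete, here is a small sketch that reads them off an alignment written as two gapped strings. The helper name find_indels and the labels it returns are illustrative choices only.

```python
from typing import List, Tuple

def find_indels(top: str, bottom: str) -> List[Tuple[int, str]]:
    """Return (0-indexed position, which string has the gap) for each gap column."""
    indels = []
    for pos, (a, b) in enumerate(zip(top, bottom)):
        if a == "-":
            indels.append((pos, "gap in first"))
        elif b == "-":
            indels.append((pos, "gap in second"))
    return indels

indels = find_indels("ACGTTGC-AGT", "ACG-TGCGAGT")
print(indels)  # [(3, 'gap in second'), (7, 'gap in first')]
```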

In the most general form of the problem, it is possible to assign different weights to different pairwise combinations; for example, a mismatch of G+C contributes -5 while G+T contributes -10. Additionally, there are numerous extensions in which additional letters may be added to represent ambiguity when more than one kind of nucleotide could occur at a position (e.g. R, purine, can represent an ambiguous choice between G and A). Since, for this class, we are interested less in the algorithmic nuances and more in the parallelization and concurrency aspects, your implementation will use the GACT alphabet for DNA, and will use the basic scoring scheme in which a match is worth 1, while mismatches and INDELs are worth -1.
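
For illustration only (the lab does not require it), a weighted scheme can be modeled as a lookup table with the basic scheme as the fallback. In this sketch the names PAIR_WEIGHTS and pair_score are hypothetical; only the G+C and G+T weights come from the example in the text.

```python
# Hypothetical weight table keyed by unordered letter pairs.
PAIR_WEIGHTS = {
    frozenset("GC"): -5,
    frozenset("GT"): -10,
}

def pair_score(a, b, match=1, mismatch=-1, indel=-1, weights=None):
    """Score one alignment column; `weights` optionally overrides mismatches."""
    if a == "-" or b == "-":
        return indel
    if a == b:
        return match
    if weights is not None:
        # Unlisted pairs fall back to the basic mismatch score.
        return weights.get(frozenset((a, b)), mismatch)
    return mismatch

print(pair_score("G", "C", weights=PAIR_WEIGHTS))  # -5
print(pair_score("G", "C"))                        # -1, the lab's basic scheme
```

With weights left as None, this reduces to the scheme you will implement: 1 for a match, -1 for a mismatch or INDEL.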

The Algorithm

The classic algorithm for global alignment relies on dynamic programming. Given two sequences over the alphabet {A, C, G, T}, the first step is to construct a table whose columns are labeled with the letters of the first input string (call it S1) and whose rows are labeled with the letters of the second (call it S2). For the sample strings we used above, the initial table would look like:
         A   C   G   T   T   G   C   A   G   G

 G
 T
 T
 G
 C
 A
 G
 G
 A
 T

The cells in row 1 and column 1 are initialized to successive multiples of the INDEL score, so that each holds the score of aligning the corresponding prefix entirely against gaps. Cell coordinates are written [column, row] and are 1-indexed. For example, [1,1] corresponds to the empty prefix of both S1 and S2, and is initialized to 0; [2,1] becomes -1, [3,1] becomes -2, and so on, as shown below in step 1. The goal of the algorithm is to fill in the table with scores that represent all possible alignments of the strings. At each cell of the table a "local score" is computed, which reflects whether the {A, C, G, T} values at the row and column headers for the cell match. A match gets a value of 1, while a mismatch gets -1. For example, in the table above, at [2,2], A is compared against G, which is a mismatch, so the local score contribution is -1, while at [2,7] A matches A for a local score contribution of +1. The total value at any given cell is the maximum of: a) the local score plus the score to the upper left (corresponding to a match/mismatch), b) the score to the left plus the value of an INDEL (-1), and c) the score above plus the value of an INDEL (-1). Note that the minimum edit distance corresponds to the maximum score, and that a score taken from above or from the left corresponds to an INDEL, or gap, in the alignment. The table is filled in by moving down and to the right, filling in scores as the cells upon which they depend become available. For the example alignment we've been considering, the tables below show the first four steps and the final step.

After STEP 1:
         A   C   G   T   T   G   C   A   G   G
     0  -1  -2  -3  -4  -5  -6  -7  -8  -9 -10
 G  -1
 T  -2
 T  -3
 G  -4
 C  -5
 A  -6
 G  -7
 G  -8
 A  -9
 T -10

After STEP 2:
         A   C   G   T   T   G   C   A   G   G
     0  -1  -2  -3  -4  -5  -6  -7  -8  -9 -10
 G  -1  -1
 T  -2
 T  -3
 G  -4
 C  -5
 A  -6
 G  -7
 G  -8
 A  -9
 T -10

After STEP 3:
         A   C   G   T   T   G   C   A   G   G
     0  -1  -2  -3  -4  -5  -6  -7  -8  -9 -10
 G  -1  -1  -2
 T  -2  -2  -2
 T  -3
 G  -4
 C  -5
 A  -6
 G  -7
 G  -8
 A  -9
 T -10

After STEP 4:
         A   C   G   T   T   G   C   A   G   G
     0  -1  -2  -3  -4  -5  -6  -7  -8  -9 -10
 G  -1  -1  -2  -1
 T  -2  -2  -2  -2
 T  -3  -3  -3  -3
 G  -4
 C  -5
 A  -6
 G  -7
 G  -8
 A  -9
 T -10

...

After the FINAL STEP:
         A   C   G   T   T   G   C   A   G   G
     0  -1  -2  -3  -4  -5  -6  -7  -8  -9 -10
 G  -1  -1  -2  -1  -2  -3  -4  -5  -6  -7  -8
 T  -2  -2  -2  -2   0  -1  -2  -3  -4  -5  -6
 T  -3  -3  -3  -3  -1   1   0  -1  -2  -3  -4
 G  -4  -4  -4  -2  -2   0   2   1   0  -1  -2
 C  -5  -5  -3  -3  -3  -1   1   3   2   1   0
 A  -6  -4  -4  -4  -4  -2   0   2   4   3   2
 G  -7  -5  -5  -3  -4  -3  -1   1   3   5   4
 G  -8  -6  -6  -4  -4  -4  -2   0   2   4   6
 A  -9  -7  -7  -5  -5  -5  -3  -1   1   3   5
 T -10  -8  -8  -6  -4  -4  -4  -2   0   2   4

STEP 1
Initialized row and column 1.
STEP 2
G[2,2] = max(G[1,1]+(local score(-1)), G[1,2]+INDEL, G[2,1]+INDEL)
G[2,2] = max((0+-1), (-1+-1), (-1+-1))
G[2,2] = -1
STEP 3
G[3,2] = max(G[2,1]+(local score(-1)), G[2,2]+INDEL, G[3,1]+INDEL) = max(-1+-1, -1+-1, -2+-1) = -2
G[2,3] = max(G[1,2]+(local score(-1)), G[2,2]+INDEL, G[1,3]+INDEL) = max(-1+-1, -1+-1, -2+-1) = -2
G[3,3] = max(G[2,2]+(local score(-1)), G[2,3]+INDEL, G[3,2]+INDEL) = max(-1+-1, -2+-1, -2+-1) = -2
STEP 4

G[4,2] = max(G[3,1]+(local score(1)), G[3,2]+INDEL, G[4,1]+INDEL) = -1
G[4,3] = max(G[3,2]+(local score(-1)), G[4,2]+INDEL, G[3,3]+INDEL) = -2
G[2,4] = max(G[1,3]+(local score(-1)), G[2,3]+INDEL, G[1,4]+INDEL) = -3
...etc...
...
FINAL STEP

G[11,2] = max(G[10,1]+(local score(1)), G[11,1]+INDEL, G[10,2]+INDEL) = -8
G[11,3] = max(G[10,2]+(local score(-1)), G[11,2]+INDEL, G[10,3]+INDEL) = -6
...etc...
G[11,11] = max(G[10,10]+(local score(-1)), G[11,10]+INDEL, G[10,11]+INDEL) = 4
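
The fill procedure walked through above can be sketched on the host in a few lines. This is an illustrative sketch rather than the sample solution: the names cell_score and fill_table are ours, it fills the table in plain row-major order (any order that respects the dependencies works), and it uses 0-indexed table[row][col], so the text's G[col,row] is table[row-1][col-1].

```python
MATCH, MISMATCH, INDEL = 1, -1, -1  # the lab's basic scoring scheme

def cell_score(up_left, up, left, c1, c2):
    """Value of one cell given its three neighbors and its two header letters."""
    local = MATCH if c1 == c2 else MISMATCH
    return max(up_left + local, up + INDEL, left + INDEL)

def fill_table(s1, s2):
    """Return the score table with S2 on the rows and S1 on the columns."""
    rows, cols = len(s2) + 1, len(s1) + 1
    table = [[0] * cols for _ in range(rows)]
    # STEP 1: first row and first column hold multiples of the INDEL score.
    for c in range(cols):
        table[0][c] = c * INDEL
    for r in range(rows):
        table[r][0] = r * INDEL
    # Remaining cells: each depends only on its upper-left, upper, and left neighbors.
    for r in range(1, rows):
        for c in range(1, cols):
            table[r][c] = cell_score(table[r - 1][c - 1], table[r - 1][c],
                                     table[r][c - 1], s1[c - 1], s2[r - 1])
    return table

table = fill_table("ACGTTGCAGG", "GTTGCAGGAT")
print(table[1][1])    # the text's G[2,2]: -1
print(table[-1][-1])  # the text's G[11,11]: 4
```

Keeping the per-cell computation in its own function mirrors the single-cell decomposition that tends to work well in hardware.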

Once the score table has been filled in, the optimal alignment(s) correspond to paths traced from the lower right at [maxcols,maxrows] (here [11,11]) to the zero at the upper left at [1,1]. In this case, the optimal alignment yields a score of 4. The path corresponding to the optimal alignment is marked with asterisks in the following table. Recovering optimal alignments is a matter of tracing back from the lower right to the upper left. Each cell on the path corresponds to an alignment entry pair (two letters, or one letter and an INDEL), and the entry before it can be recovered by deducing which score above it, to the left, or to the upper left is optimal, and therefore contributed to the total score at that cell (there may be more than one option in the general case). For example, at [10,8] the score 5 corresponds to a match (G+G); the alignment entry before it can be found by observing that the score of 4 at [9,7] must have preceded it, since the entry does not correspond to an INDEL and the 4 at [9,7] plus the value of the match at [10,8] yields the observed score. You may find it expedient in your own implementation to simply keep track of which preceding cell contributed to the score at each cell.

          A    C    G    T    T    G    C    A    G    G
     0*  -1*  -2*  -3   -4   -5   -6   -7   -8   -9  -10
 G  -1   -1   -2   -1*  -2   -3   -4   -5   -6   -7   -8
 T  -2   -2   -2   -2    0*  -1   -2   -3   -4   -5   -6
 T  -3   -3   -3   -3   -1    1*   0   -1   -2   -3   -4
 G  -4   -4   -4   -2   -2    0    2*   1    0   -1   -2
 C  -5   -5   -3   -3   -3   -1    1    3*   2    1    0
 A  -6   -4   -4   -4   -4   -2    0    2    4*   3    2
 G  -7   -5   -5   -3   -4   -3   -1    1    3    5*   4
 G  -8   -6   -6   -4   -4   -4   -2    0    2    4    6*
 A  -9   -7   -7   -5   -5   -5   -3   -1    1    3    5*
 T -10   -8   -8   -6   -4   -4   -4   -2    0    2    4*

The path shown in the table above corresponds to the optimal alignment:


 A  C  G  T  T  G  C  A  G  G  -  - 
 -  -  G  T  T  G  C  A  G  G  A  T
-1 -1  1  1  1  1  1  1  1  1 -1 -1 
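
The traceback can be sketched as follows. This is one possible host-side approach, not the sample solution: the function name align is ours, and when several predecessors are equally good it arbitrarily prefers the diagonal, then the cell above, then the cell to the left, so it returns just one of the optimal alignments.

```python
def align(s1, s2, match=1, mismatch=-1, indel=-1):
    """Return (score, aligned s1, aligned s2) for a global alignment."""
    rows, cols = len(s2) + 1, len(s1) + 1
    score = [[0] * cols for _ in range(rows)]
    for c in range(cols):
        score[0][c] = c * indel
    for r in range(rows):
        score[r][0] = r * indel
    for r in range(1, rows):
        for c in range(1, cols):
            local = match if s1[c - 1] == s2[r - 1] else mismatch
            score[r][c] = max(score[r - 1][c - 1] + local,
                              score[r - 1][c] + indel,
                              score[r][c - 1] + indel)

    # Trace back from the lower right to the upper left, re-deriving which
    # neighbor produced each cell's score.
    top, bottom = [], []
    r, c = rows - 1, cols - 1
    while r > 0 or c > 0:
        if r > 0 and c > 0:
            local = match if s1[c - 1] == s2[r - 1] else mismatch
            if score[r][c] == score[r - 1][c - 1] + local:
                top.append(s1[c - 1]); bottom.append(s2[r - 1])
                r, c = r - 1, c - 1
                continue
        if r > 0 and score[r][c] == score[r - 1][c] + indel:
            top.append("-"); bottom.append(s2[r - 1])  # gap in S1
            r -= 1
        else:
            top.append(s1[c - 1]); bottom.append("-")  # gap in S2
            c -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))

result = align("ACGTTGCAGG", "GTTGCAGGAT")
print(result)  # (4, 'ACGTTGCAGG--', '--GTTGCAGGAT')
```

Storing a direction per cell while filling the table, as suggested above, avoids recomputing the comparisons during traceback at the cost of extra state per cell.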
  

The Implementation

You will develop your FPGA implementation using a runtime and JIT compiler called Cascade. You will use that toolchain to develop your Verilog code to run on an Intel/Terasic DE10-nano FPGA board, which the department will loan you for the duration of this lab. Unlike other labs where we give you latitude to select an implementation platform and language, you must code in Verilog for this lab. If you want to use a different FPGA toolchain and execution environment, we are willing to hear you out, but strongly advise against it. Talk to the instructor or TA if you are considering alternative toolchains.

Deliverables will be detailed below, but the focus is on a writeup that provides performance measurements as graphs, and answers (perhaps speculatively) a number of questions. Spending some time setting yourself up to quickly and easily collect and visualize performance data is a worthwhile time investment as with other labs in this course.

Step 1: Create a sequential host-based solution

In step 1 of the lab, you will write a program that accepts command-line parameters to specify the following:

The output of your program should include:

For example, the following command line, invoked on moonstone.csres.utexas.edu, yields the corresponding CSV output in our sample solution:

./lab3 --S1 ACGTTGCAGG --S2 GTTGCAGGAT
,,A,C,G,T,T,G,C,A,G,G,
, 0 , -1 , -2 , -3 , -4 , -5 , -6 , -7 , -8 , -9 , -10 ,
G , -1 , -1 , -2 , -1 , -2 , -3 , -4 , -5 , -6 , -7 , -8 ,
T , -2 , -2 , -2 , -2 , 0 , -1 , -2 , -3 , -4 , -5 , -6 ,
T , -3 , -3 , -3 , -3 , -1 , 1 , 0 , -1 , -2 , -3 , -4 ,
G , -4 , -4 , -4 , -2 , -2 , 0 , 2 , 1 , 0 , -1 , -2 ,
C , -5 , -5 , -3 , -3 , -3 , -1 , 1 , 3 , 2 , 1 , 0 ,
A , -6 , -4 , -4 , -4 , -4 , -2 , 0 , 2 , 4 , 3 , 2 ,
G , -7 , -5 , -5 , -3 , -4 , -3 , -1 , 1 , 3 , 5 , 4 ,
G , -8 , -6 , -6 , -4 , -4 , -4 , -2 , 0 , 2 , 4 , 6 ,
A , -9 , -7 , -7 , -5 , -5 , -5 , -3 , -1 , 1 , 3 , 5 ,
T , -10 , -8 , -8 , -6 , -4 , -4 , -4 , -2 , 0 , 2 , 4 ,
ACGTTGCAGG--
--GTTGCAGGAT
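
As a rough guide to the table portion of that output, the sketch below parses the --S1 and --S2 parameters shown in the sample command and formats the table with the same spacing and trailing commas. The function names, and the use of Python at all, are our assumptions; the two alignment lines that follow the table are omitted here.

```python
import argparse

def fill_table(s1, s2, match=1, mismatch=-1, indel=-1):
    """Score table with S2 on the rows and S1 on the columns."""
    rows, cols = len(s2) + 1, len(s1) + 1
    t = [[0] * cols for _ in range(rows)]
    for c in range(cols):
        t[0][c] = c * indel
    for r in range(rows):
        t[r][0] = r * indel
    for r in range(1, rows):
        for c in range(1, cols):
            local = match if s1[c - 1] == s2[r - 1] else mismatch
            t[r][c] = max(t[r - 1][c - 1] + local, t[r - 1][c] + indel, t[r][c - 1] + indel)
    return t

def format_csv(s1, s2, table):
    """Format the table as lines matching the sample output's layout."""
    lines = [",," + ",".join(s1) + ","]
    # The first table row has no letter label; every other row is labeled with S2.
    labels = [""] + [ch + " " for ch in s2]
    for label, row in zip(labels, table):
        lines.append(label + "," + "".join(f" {v} ," for v in row))
    return lines

def main(argv=None):
    parser = argparse.ArgumentParser(description="Sequential global alignment (sketch)")
    parser.add_argument("--S1", required=True, help="first DNA sequence")
    parser.add_argument("--S2", required=True, help="second DNA sequence")
    args = parser.parse_args(argv)
    return format_csv(args.S1, args.S2, fill_table(args.S1, args.S2))

lines = main(["--S1", "ACGTTGCAGG", "--S2", "GTTGCAGGAT"])
print("\n".join(lines))
```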
    

Step 2: Cascade Implementation

The algorithm described above yields some very natural parallelizations for FPGAs. You will use Cascade with the DE10-nano to implement the algorithm in Verilog. Cascade's README.md provides a good overview of how to use Cascade. The README.md should be considered mandatory reading whether you wish to set Cascade up on your own system or use our virtual machine image, Cascade.ova. In particular, if you plan to use Cascade on anything other than a Linux host, you will need to use the virtual machine layer, as Cascade currently runs only on a Linux stack. See below for OS-specific (Windows/MacOS) instructions for VirtualBox and DE10s.
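
One way to see the available parallelism, offered here as an illustration rather than as the required design, is that every cell depends only on its upper-left, upper, and left neighbors, so all cells on the same anti-diagonal are independent of one another. The Python model below (function names are ours) fills the table one anti-diagonal "wave" at a time and computes each wave purely from earlier waves.

```python
def antidiagonals(rows, cols):
    """Yield lists of interior (r, c) cells grouped by r + c."""
    for d in range(2, rows + cols - 1):
        lo, hi = max(1, d - cols + 1), min(rows - 1, d - 1)
        yield [(r, d - r) for r in range(lo, hi + 1)]

def fill_wavefront(s1, s2, match=1, mismatch=-1, indel=-1):
    """Fill the score table wave by wave; return (table, number of waves)."""
    rows, cols = len(s2) + 1, len(s1) + 1
    t = [[0] * cols for _ in range(rows)]
    for c in range(cols):
        t[0][c] = c * indel
    for r in range(rows):
        t[r][0] = r * indel
    waves = 0
    for wave in antidiagonals(rows, cols):
        # Every cell in this wave reads only cells from earlier waves, so the
        # loop body could run for all cells at once.
        new = {}
        for r, c in wave:
            local = match if s1[c - 1] == s2[r - 1] else mismatch
            new[(r, c)] = max(t[r - 1][c - 1] + local, t[r - 1][c] + indel, t[r][c - 1] + indel)
        for (r, c), v in new.items():
            t[r][c] = v
        waves += 1
    return t, waves

table, waves = fill_wavefront("ACGTTGCAGG", "GTTGCAGGAT")
print(table[-1][-1], waves)  # 4 19
```

For two length-10 strings the 100 interior cells are covered in 19 waves, the widest containing 10 cells. Whether your hardware design exploits this explicitly or lets combinational logic settle on its own is a design decision for you to make and discuss.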

We will be loaning out DE10-nano boards, and I plan to distribute them in class the day of the first FPGA lectures. If you happen to miss this class you can make arrangements to pick one up from my office or from the TA by sending us email. We are thankful to the CS department and to VMware Research Group, each of which paid for half of the DE10-nano boards we are using. The boards are expensive. To ensure that the boards are returned, we will not grade your lab until you have returned your board. Failure to return the board will result in a zero for the lab.

Cascade has some very nice properties that you should find helpful for this lab. In particular, it allows you to do "printf"-style debugging using "$display", which is otherwise impossible on FPGA hardware. More importantly, Cascade is a JIT compiler that hides the programming of the actual FPGA hardware behind software emulation, allowing you to run/test/debug your changes immediately, rather than waiting for a lengthy hardware compilation to complete. It also has features for managing inputs and outputs using file I/O, which we will rely on for this lab.

Your implementation will accept inputs in a *.mem file, and will produce as output the complete grid, also in a *.mem file. We will provide tools and skeleton code for getting inputs and outputs to/from the FPGA.

Your implementation will have the following inputs and outputs:

The skeleton code we provide (main.v, constants.v, and debug.v) will help you create and manage inputs, and demonstrates how to use Cascade's "$display" debugging tools. The file cascade-files/nw.v is where you will write your code. We strongly recommend you decompose your solution by implementing a module that describes a single cell that does comparisons for a point in the grid described above, and a top-level module that composes those cells into a grid. The Verilog excerpt below, from main.v, uses Cascade's I/O support to instantiate your top-level module and populate its inputs.

// Instantiate your top-level Needleman-Wunsch module:
wire [LENGTH*CWIDTH-1:0] s1 = rdata[2*LENGTH*CWIDTH-1:1*LENGTH*CWIDTH];
wire [LENGTH*CWIDTH-1:0] s2 = rdata[1*LENGTH*CWIDTH-1:0*LENGTH*CWIDTH];
wire signed[SWIDTH-1:0] score;
YOUR_TOP_LEVEL_MODULE#(
  .LENGTH(LENGTH),
  .CWIDTH(CWIDTH),
  .SWIDTH(SWIDTH),
  .MATCH(MATCH),
  .INDEL(INDEL),
  .MISMATCH(MISMATCH) 
) grid (
  .s1(s1),
  .s2(s2),
  .score(score)
);
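
The two part-selects at the top of that excerpt split one wide input word into the two sequences: s1 is the upper LENGTH*CWIDTH bits of rdata and s2 is the lower LENGTH*CWIDTH bits. The Python model below illustrates that slicing only. The values chosen for LENGTH and CWIDTH, the 2-bit letter encoding, and the order of letters within each vector are hypothetical; the real definitions live in the provided skeleton files.

```python
LENGTH, CWIDTH = 10, 2                        # hypothetical parameter values
ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}   # hypothetical 2-bit encoding

def pack(seq):
    """Pack a sequence into an integer, CWIDTH bits per letter (order assumed)."""
    value = 0
    for ch in seq:
        value = (value << CWIDTH) | ENCODING[ch]
    return value

def part_select(value, msb, lsb):
    """Model of the Verilog part-select value[msb:lsb]."""
    return (value >> lsb) & ((1 << (msb - lsb + 1)) - 1)

s1_bits, s2_bits = pack("ACGTTGCAGG"), pack("GTTGCAGGAT")
rdata = (s1_bits << (LENGTH * CWIDTH)) | s2_bits

# Same index expressions as in the excerpt from main.v.
s1 = part_select(rdata, 2 * LENGTH * CWIDTH - 1, 1 * LENGTH * CWIDTH)
s2 = part_select(rdata, 1 * LENGTH * CWIDTH - 1, 0 * LENGTH * CWIDTH)
print(s1 == s1_bits, s2 == s2_bits)  # True True
```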

The subsequent Verilog code in main.v manages the clock signal and inputs/outputs, and waits until your code has computed the score:

// While there are still inputs coming out of the fifo, print the results:
reg once = 0;
always @(posedge clock.val) begin
  // Base case: Skip first input when fifo hasn't yet reported values
  if (!once) begin 
    once <= 1;
  end 
  // Edge case: Stop running when the fifo reports empty
  else if (empty) begin
    $finish(1);
  end 
  // Common case: Print results as they become available
  else begin
    $display("align(%h,%h) = %d", s1, s2, score);
  end
end

You will implement your code in nw.v, which is included by main.v and debug.v. The debug.v file is similar to main.v, with the exception that it uses "$display" to show the intermediate state of your internal FPGA logic. In your writeup, provide the following graphs and answer the following questions:

If you are not working on a Linux desktop or laptop where you have sudo privilege (which is recommended but may not be practical), you will need to bring up the lab in a Virtual Machine. Instructions for using VirtualBox to create and use a VM with Windows, MacOS, and Linux can be found at:

Cascade Environment Instructions

For those who want to bring up their own VM and manage cascade installation on the DE10 themselves:
Windows 10-Specific VM and DE10 Bringup Instructions
MacOS-Specific VM and DE10 Bringup Instructions

If you are using the Cascade.ova VM with VirtualBox, there are a few things you need to know:

Note that some of the instructions in these files are specific to getting Cascade's JIT to work, and you don't strictly need the JIT to complete the lab or the measurements. If you have a correct Verilog implementation and you're able to connect to the DE10, running the lab with --march de10 should enable hardware measurements and does not invoke JIT compilation.

Step 3: Extra Credit Options

There are three options for extra credit:

Deliverables

Using the Canvas turn-in utility, you should turn in, along with your code, Makefiles, and measurement scripts, a brief writeup with the scalability graphs requested above. Be sure that your writeup includes sufficient text to enable us to understand which graphs are which. Note that, as with other labs in this course, we will check solutions for plagiarism using Moss.

One of the goals of using cascade is to aid research efforts in improving the programmability of FPGAs. To this end, cascade is instrumented to collect information about compile times and compiler errors/successes that can be used in a subsequent (anonymized!) study. Cascade will produce a file called "cascade-log" in your home directory. We hope you will include this file in your submission out of support for the "good fight" that is computer science research. However, we will also provide an additional 5 points of extra credit for anyone who turns this file in with their submission. Thanks in advance for helping the research effort!

A LaTeX template that includes placeholders for graphs and re-iterates any questions we expect answers for can be found here, (a build of that template is here).

Please report how much time you spent on the lab.

Acknowledgements

Thanks to Eric Schkufza and Michael Wei of VMware Research Group for supporting this lab. Thanks to our department head Don Fussell for supporting the project by helping find funds to enable us to loan DE10 hardware to every student.