229 Lab 5: Control-Flow Graph

CMPUT 229 - Computer Organization and Architecture I

Lab 5: Control-Flow Graph

Creator: Ayokunle Amodu

Introduction

This is the first of two labs focused on the fundamentals of flow analysis used in compiler design. The solution for this lab constructs and analyzes a Control-Flow Graph (CFG). A CFG is a fundamental data structure used in compilers to represent the possible execution paths of a program, establish dominance relationships, and help the compiler perform optimizations such as the Dead Code Elimination (DCE) implemented in Lab 6.

What is a Control-Flow Graph?

A Control-Flow Graph is a directed multi-graph that represents the control flow of a program. We will build our definition of a CFG by first defining its constituent components:

Basic Blocks
Edges
Predecessors and Successors

Basic Blocks

 Description:
 A basic block is a unit of control-flow that represents a sequential block of code where instructions execute 
 strictly from start to finish. A segment of code is considered a basic block if it follows these rules:

 • Once execution enters a basic block, all instructions within that block must be executed.
 • Branches or jumps can only occur as the final instruction of a basic block.
 • Only the first instruction of a basic block can be the target for a branch or jump.

Edges

 Description:
 The edges in a Control-Flow Graph denote the possible flow of control between basic blocks. A conditional branch 
 is the origin of two edges: one to the target of the branch and one fall-through edge. An unconditional jump is 
 the origin of a single edge to its target. Fall-through edges exist between consecutive basic blocks when control 
 may fall through from one to the other. A control-flow edge signifies a possible path that the program's execution 
 may take.

Predecessors and Successors

 Description:
 • Predecessor: A block Bi is a predecessor of a block Bj if there is an execution path from Bi to Bj.
 • Immediate Predecessor: A block Bi is an immediate predecessor of a block Bj if there is an edge from Bi to Bj.
 • Successor: A block Bj is a successor of a block Bi if there is an execution path from Bi to Bj.
 • Immediate Successor: A block Bj is an immediate successor of a block Bi if there is an edge from Bi to Bj.

 Note:
 (1) A block can have any number of immediate predecessors (multiple paths may converge).
 (2) A block can have at most two immediate successors (fall-through or branch) because a conditional branch 
     can only have two successors: the target and the fall-through.

Aside: An intra-procedural CFG represents the control flow within a function. An inter-procedural CFG also captures control flow across function calls and return statements. Inter-procedural CFGs are more complex because they must match each function call edge with its corresponding return edge. This lab focuses solely on
intra-procedural CFGs. In flow analysis used in compilers, the inter-procedural flow is represented using a call graph which is not explored in these labs.

Implementation Notes

Basic Block Leaders

To construct a Control-Flow Graph (CFG), the first step is to divide a function into its constituent basic blocks. The first instruction of each basic block is a block leader. The following rules identify block leaders:

The first statement in a function is a leader.
Any statement that is the target of a branch or jump is a leader.
Any statement that immediately follows a branch or jump is a leader.

Once the leaders are identified, the following rule determines the instructions in a basic block:

A basic block is formed by a leader and the subsequent instructions up to, but not including, the next leader.

According to these rules, the _basicblocks function has six leaders, resulting in six basic blocks as illustrated below:

Basic Block Leaders in _basicblocks Function — A sample function illustrating the identification of leaders and the basic-block formation.

Basic Block Identifiers

Often, the basic blocks in a CFG are labeled sequentially as B0, B1, B2, and so on, starting from the function's entry point (as seen in the CFG of the _sample_function earlier). In this lab, each block is uniquely identified by its leader offset (in words) from the function's base address.

The diagram below illustrates the assembled program in memory with resolved labels and hexadecimal instruction representations of the _basicblocks function. The function begins at address 0x00010000, where the leader of the first block is also located, making its identifier 0. The second block starts at 0x1001000c, which is three words from the function's start, therefore its identifier is 3.

Identified Basic Blocks in _basicblocks Function — For further reference, observe the divided `_basicblocks` function below, with identifiers assigned to each block.

Note: The labels in an assembly program are symbols representing instruction addresses. They are used to indicate the target offsets for branches and jumps, but do not appear in the final executable.

Pseudo Instructions

RISC-V assembly includes pseudo-instructions, which are higher-level operations that simplify coding but are translated into one or more actual instructions within its instruction set architechture (ISA) during assembly. To illustrate this, let's examine how the _basicblocks function is assembled in RARS below. Consider the mv (move) instruction. This pseudo-instruction is translated into an add instruction. For example, mv t0, zero becomes add t0, zero, zero. Both achieve the same result of setting the t0 register to 0, but the pseudo-instruction offers a more intuitive syntax for the programmer.

Decoding Instructions

The RISC-V ISA uses a fixed 32-bit instruction format. Decoding is a two-level process: a 7-bit opcode identifies the instruction format, and within each format, a 3-bit funct3 field specifies the specific instruction.

RISC-V Instruction Formats — The core instruction formats in RISC-V. Each format begins with a 7-bit `opcode` and has a 3-bit `funct3` field, `funct3` and `opcode` are always in the same position

Let’s disassemble the fourth instruction (0x0012f393) of the _basicblocks function as an example:

          0x0012f393 = 0000 0000 0001|00101|111|00111|0010011
                                            ^^^       ^^^^^^^
                                            funct3    opcode
          opcode  = 0010011
          funct3  = 111

The opcode indicated that this is an I-type instruction, and the funct3 field specifies that it is an andi instruction (Refer to the RISC-V Green Sheet for the table mapping opcodes to instructions).

Data Structure Descriptions

Some data structures are provided in this lab. Memory for these structures has already been pre-allocated in common.s, and pointers to them are passed to the function getControlFlowGraph, in your solution file. A solution to this lab will populate these structures.

instructionsArray

 Description:
 An array of words containing the input function's instructions.
 Ends with a sentinel value of -1 (0xFFFFFFFF).

In this lab, input functions are represented as an array of RISC-V instructions. Every RISC-V instruction can be represented as a word in memory (e.g., addi t1, t1, 1 = 0x00130313). Thus, a function can be represented as an array of words.

Instructions Array for _basicblocks Function — The diagram above illustrates the `instructionsArray` for the `_basicblocks` function, showing the instruction layout as it appears in memory.

leadersArray

 Description:
 An array of bytes where each index corresponds to an instruction in the instructionsArray at the same index.
 If the instruction is a leader, its corresponding byte will be 1; 0 otherwise.
 Ends with a sentinel value of -1 (0xFF).

blocksArray

 Description:
 An array of bytes representing the unique identifiers of each basic block in the function's CFG.
 Each identifier corresponds to the offset (in words) of the block's leader from the function's base address.
 Elements appear in execution order.
 Ends with a sentinel value of -1 (0xFF).

blockEndsArray

 Description:
 An array of bytes representing the last instruction in each basic block within the CFG.
 Each element corresponds directly to a block, stored in sequential order.
 Ends with a sentinel value of -1 (0xFF).

edgesList

 Description:
 An array of pairs of bytes representing the edges in the input function's CFG.
 Each pair encodes a source and target block index.
 Ends with a sentinel pair of bytes (-1, -1).

Constraint: The structure of the edgesList must satisfy the following ordering rules:

Edges must appear in ascending order of their source blocks.
For edges with the same source, targets must also be ordered.

successorsArrays

 Description:
 Successors are stored in two arrays of bytes: firstSuccessorsArray and secondSuccessorsArray.
 Each element corresponds to a basic block.

 • Two successors:
   - The earlier one goes in firstSuccessorsArray.
   - The later one goes in secondSuccessorsArray.

 • One successor:
   - Stored in firstSuccessorsArray.
   - -1 in secondSuccessorsArray.

 • No successors:
   - -1 in both arrays.

 Both arrays end with sentinel -2 (0xFE).

predecessorsArray

 Description:
 Predecessors are stored using two structures:

 (1) predecessorsArray - array of words holding pointers to each block’s list.
 (2) predecessorsList - array of bytes representing each block's predecessors.

 Rules:
 • If a block has predecessors:
   - Its array entry points to its list in predecessorsList.
   - List ends with sentinel -1 (0xFF).
 • If a block has no predecessors:
   - Entry is -1 (0xFFFFFFFF).

 predecessorsArray ends with sentinel -2 (0xFFFFFFFE).

Note: A block can have any number of immediate predecessors, making it impractical to allocate a fixed number of arrays as done with successors. In such cases, where the maximum size of a data structure are unknown, dynamic memory allocation is an excellent solution. To use dynamic memory allocation: (1) determine the required space at runtime; (2) allocate that memory on the heap; and (3) store the data in the allocated space. Use the sbrk ecall (code 9) in RARS to allocate heap memory (see RARS Environment Calls page).

The data structures (arrays) are designed to store elements in a consistent manner. The blocksArray acts as the primary array, where each index corresponds to a specific block. Use the block's index from the blocksArray to retrieve the corresponding element from any other array.

Writing a Solution

The directive .include "common.s" in the provided solution file (controlFlowGraph.s) causes RARS to execute the code in the file common.s before the solution file. The code in common.s reads a binary file provided in the program arguments, initializes the arguments, calls getControlFlowGraph, parses the custom data structures populated by the solution, and prints the CFG information extracted from the parsed data.

Function Signatures

Write all the following RISC-V functions in the file controlFlowGraph.s:

getControlFlowGraph

  Description:
  Constructs the control-flow graph (CFG) for a RISC-V assembly function by identifying basic blocks and computing 
  edges, block boundaries, successors and predecessors.

  Arguments:
      a0: a pointer to the instructionsArray
      a1: a pointer to the leadersArray
      a2: a pointer to the blocksArray
      a3: a pointer to the blockEndsArray
      a4: a pointer to the edgesList
      a5: a pointer to the firstSuccessorsArray
      a6: a pointer to the secondSuccessorsArray
      a7: a pointer to the predecessorsArray

  Returns:
      None

getLeaders

  Description:
  Iterates through the instructionsArray and identifies the basic blocks and their respective leaders
  to fill the leadersArray as described earlier.
 
  Arguments:
      a0: a pointer to the instructionsArray
      a1: a pointer to the leadersArray

  Returns:
      None

getBlocks

 Description:
 Responsible for filling in the blocksArray with the unique identifiers of each basic block and for 
 filling in the blockEndsArray with the last instruction of each basic block. 
 Refer to the data structure descriptions earlier for the format of these arrays.

 Arguments:
      a0: a pointer to the leadersArray
      a1: a pointer to the blocksArray
      a2: a pointer to the blockEndsArray

 Returns:
      None

getEdges

  Description:
  Stores the source-target pair of each edge in the edgesList for the function's control-flow graph,
  in ascending order of source-target pair. both source and target are identified by their block identifier
  as stored in the blocksArray.
   
  Arguments:
      a0: a pointer to the instructionsArray
      a1: a pointer to the blocksArray
      a2: a pointer to the blockEndsArray
      a3: a pointer to the edgesList

  Returns:
      None

getSuccessors

  Description:
  Stores the immediate successors of each basic block in ascending order in the firstSuccessorsArray 
  and secondSuccessorsArray. There can be at most two successors. Refer to the section of successorsArrays

  Arguments:
      a0: a pointer to the blocksArray
      a1: a pointer to the edgesList
      a2: a pointer to the firstSuccessorsArray
      a3: a pointer to the secondSuccessorsArray

  Returns:
      None

getPredecessors

  Description:
  Responsible for computing the correct predecessor information for a block by using 2 data structures.
  (1) predecessorsArray: An array of pointers, each pointing to a predecessorsList
  (2) predecessorsList: A dynamically allocated flat array of bytes listing the predecessor block IDs.

  Refer to the Data Strcuture descriptions above for more information.

  Arguments:
      a0: a pointer to the blocksArray
      a1: a pointer to the edgesList
      a2: a pointer to the predecessorsArray

  Returns:
      None

In RISC-V assembly, every function must have a return statement. The following are examples of valid return statements: ret, jr ra, and jalr zero, 0(ra). The controlFlowGraph.s file already includes the function labels above and each ends with a return statement. Do not change the names of these functions, as they are called for unit testing in the common.s file.

There are two helper functions located at the bottom of the solution file. They include:

getBranchImm: Extracts the sign-extended immediate from a branch instruction (SB-type), such as beq, bne, bge, bltu, bgeu, etc.
getJalImm (optional implementation): Extracts the sign-extended immediate from a jump instruction (UJ-type), such as jal. This function is not implemented by default, and no unit test is provided, so you will need to implement and test it yourself to ensure it works correctly.

Assumptions and Notes

The input function provided to controlFlowGraph.s will consist of at most 254 instructions.
There will be no calls to other functions within the input function.
All unconditional jumps within the input function will be jal, not jalr (which is used for returns). Therefore, do not handle jalr as a control-flow instruction; instead, treat it as a normal statement.

Testing your Lab

This lab has a single program argument: the binary file of the input function (e.g., nested_loop.bin). You must provide the full path of this file in the program arguments section in the Execute tab on RARS. Ensure that the path adheres to the following criteria:

No quotation marks around the path.
No spaces in the path or file names.

Consider the path Users/JohnDoe/CMPUT-229/Lab 5/Code/Tests/nested_loop.bin. The space in the folder name Lab 5 can cause issues when trying to locate the file. It is recommended to rename it to Lab-5 or another name without spaces to avoid any problems.

The common.s file contains some unit tests to help test the solution. This testing works on a predefined input function and checks whether the output from functions in controlFlowGraph.s matches the expected output. The tests are hardcoded and do not run on the file specified in the program arguments. The results of the unit tests can be viewed in the Run I/O panel of RARS.

The common.s file also aids in parsing the data structures popualted by your solution, and the CFG infromation can be viewed in the Run I/O panel of RARS, following the unit test results. Note that parsing the data structures relies on all of them ending with their respective sentinel values. This ensures that the parsing function can correctly identify the end of each structure's data. It also relies on the predecessorsArray having valid pointers (if any). There are a few input functions with their corresponding expected outputs (parsed CFG information) provided in the Tests folder, which is included in the Code folder. Although passing the tests is essential, it does not guarantee full marks as they are not extensive. You can create your own corner-case tests to ensure that your code is correct.

Creating Test Cases

To create a test case, write a RISC-V function (e.g, INPUT_FUNCTION.s) in an assembly file. Then, type in the following command in the terminal:
rars a dump .text Binary <INPUT_FUNCTION.bin> <INPUT_FUNCTION.s>
The binary file generated by this command can serve as a program argument in RARS for the solution file controlFlowGraph.s. Use the full file path, ensuring to specify the binary (.bin) file for input, not the assembly (.s) file.

Instruction Disassembly

One way to debug a solution to this lab, is to create the CFG of an input function and compare with the solution's output. The common.s file parses the data structures populated by the solution allowing you to inspect each block and view the instructions that it contains. Instructions are represented in hexadecimal format, so a disassembler can be helpful for verifying that the instructions are correctly placed within their respective blocks.

This lab has an additional folder in the Code folder called Disassembly. It contains files that perform instruction disassembly which is the process of translating hexadecimal instructions to their corresponding RISC-V instructions. This code is from an Open Source RISC-V Disassembler. Please contact a TA if you run into any issues using the code.

To utilize this disassembler, follow these steps:

First, compile the file print-instructions.c using a C compiler. For example, from the terminal, execute the command: gcc print-instructions.c -o disassembler or clang print-instructions.c -o disassembler.
Create a file containing the hexadecimal instructions to disassemble by copying and pasting your RARS output into a text file.
Run the executable with the hexadecimal text file as a program argument. For example: ./disassembler RARS-instructions.txt.

See the README.md in the Disassembly folder for more information.

Resources

Slides used for the introduction of the lab (.pptx) (.pdf)
Marksheet used to mark the lab (.txt)

Marking Guide

Assignments too short to be adequately judged for code quality will be given a zero for that portion of the evaluation.

5% for getControlFlowGraph
15% for getLeaders
10% for getBlocks
25% for getEdges
10% for getSuccessors
15% for getPredecessors
20% for code cleanliness, readability, and comments

Submission

There is a single file to be submitted for this lab (controlFlowGraph.s ) and it should contain the code for your solution.

Do not add a main label to this file.
Do not modify the function names.
Do not remove the CMPUT 229 Student Submission License at the top of the file containing your solution and remember include your name in the appropriate section of this text.
Do not modify the line .include "common.s".
Do not modify the common.s file.
Keep the files in the Code folder of the git repository.
Push your repository to GitHub before the deadline. Just committing will not upload your code. Check online to ensure your solution is submitted.