ABSTRACT

This article presents a method of synthesizing hardware from a given executable binary code with an external interrupt handler, where the normal flow and the interrupt handling are executed by separate hardware modules. Our previous method synthesized the whole program into a single hardware module, in which register save/restore imposed limitations on the timing to start interrupt handling and also impaired efficiency of the synthesized hardware. By executing the two tasks on separate modules, register save/restore can be eliminated, which allows interrupt handler to start at arbitrary timing and reduces the response time and cost of the hardware. By allowing two processes to run in parallel, total execution time is also reduced. An experiment with a simple program has shown that the execution cycles and the delay were reduced by about 80% and 20%, respectively, as compared with MIPS CPU. A motor controller driven by periodical interrupts from a timer has been successfully synthesized from C and assembly programs, which runs more than 20 times faster than the MIPS CPU.

CCS CONCEPTS

- Hardware → Hardware-software codesign; • Computer systems organization → Embedded hardware;

KEYWORDS

High-level synthesis, binary synthesis, interrupt handling

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

ACM Reference Format:

*Currently with Murata Manufacturing Co., Ltd., Nagaokakyo, Kyoto, Japan

1 INTRODUCTION

While embedded systems are getting increasingly rich in functionality, higher and higher performance is required within limited power consumption. To implement such systems efficiently within limited design periods, there have been a lot of attempts to utilize high-level synthesis technology [1] to automatically design hardware from existing software [3, 4].

Binary synthesis [2] is a kind of high-level synthesis which takes binary codes, instead of programs in high-level programming languages, as its inputs. Binary synthesis can handle wider range of programs than high-level synthesis and hence it is suitable for translating programs originally developed as software into hardware. However, applications where processors control external devices, the programs contain interrupt processing.

Ando et al. [5] proposed a method of synthesizing dedicated hardware driven by interrupts from an abstract system model that captured interrupts, interrupt handlers, and communication among devices. Each process may be implemented as either software or hardware according to the designer’s choice. However, regardless of the choice, a CPU was always necessary in the resulting system.

On the other hand, in our team’s previous work [6], a method was proposed which converted a given binary program containing an external interrupt handler into hardware whose behavior was equivalent to the CPU running the program. No rewriting was required on the binary program and the resulting hardware ran faster than the program on the CPU. However, in this method, jumps to the handler were allowed only at the end of the basic blocks, which hinders quick response to interrupts. Moreover, save/restore of registers took time even on the hardware for it resulted in sequential memory accesses.

We have found that the bottleneck was caused by register save/restore for context switching. Based on this observation, we propose in this paper a method of extending the previous method so that a main process and an interrupt handler are synthesized into separate hardware modules. Since the two modules hold their own context, operations for register save/restore will be eliminated. This allows the interrupt handler to start execution without waiting for the main process to reach the end of the basic blocks. Furthermore, two modules may run in parallel, which drastically reduces the total execution time.
The proposed method has been implemented on top of binary synthesizer ACAP [3]. An experiment with a simple program has shown that the execution cycles and the delay were reduced by about 80% and 20%, respectively, as compared with MIPS CPU, at the cost of 40% increase in the hardware cost. A motor controller driven by periodical interrupts from a timer has been successfully synthesized from C and assembly programs, which runs more than 20 times faster than the MIPS CPU.

2 BINARY SYNTHESIS OF MIPS INTERRUPT HANDLING

2.1 Binary Synthesis and ACAP

High-level synthesis takes behavioral specification as inputs which are usually written in high-level languages such as C. However, the behavior may be specified also in assembly or machine languages, and such synthesis techniques are called binary synthesis [2]. It can handle wider range of software programs than conventional high-level synthesis, although back-end technologies to generate hardware are common. While binary synthesis can be used to convert some computationally intensive parts of given software programs, it can also convert a whole binary code into a hardware module which is functionally equivalent to the processor running the program.

ACAP [3] is a binary synthesizer which generates hardware from MIPS R3000 binary codes. As shown in Figure 1, it takes a MIPS executable binary generated by GCC from C programs or by GAS from hand coded assembly programs. It disassembles (by objdump) a binary code to recover an assembly program, and converts it into a CDFG (control dataflow graph); an instruction sequence in each basic block is translated into a DFG (dataflow graph), and jumps and branches are compiled into transitions among the DFGs (delayed branches/jumps are properly recognized). Register jumps, used for multiway branches and return from subroutines/interrupt handlers, are handled by constructing a table that translates instruction address into the IDs of the corresponding DFGs. ACAP itself does not distinguish privilege modes; all the instructions might be executed as if it were in the kernel mode. It then performs scheduling and binding to generate a Verilog HDL code based on typical high-level synthesis algorithms.

While ACAP can transform user specified sections of linked executable code into a hardware accelerator which is tightly coupled with the CPU, it can also synthesize a whole linked executable code into a hardware module which is compatible with the CPU that runs the code. This paper utilizes the the latter mode where the synthesized hardware module can replace the CPU and the instruction memory. The source program may be written in C language or assembly/machine language, and may be linked with start-up routines and library codes for floating point emulation, string processing, etc. Synthesized hardware is expected to run faster than MIPS, due to instruction level parallelism and operation chaining. The hardware size is roughly proportional to the number of instructions in the input binary code, so trade-offs between speed-up and hardware cost need to be considered.

2.2 Interrupt Handling of MIPS R3000

MIPS R3000 handles interrupts by using CP0, the system control coprocessor, which is embedded in the CPU. CP0 has the following three registers to handle external interrupts.

- EPC register keeps the PC (Program Counter) value at which the interrupt is triggered.
- Cause register keeps the information regarding the cause of the interrupt, i.e., whether it is an external interrupt, an internal interrupt, or a system call, and also its detailed cause in the case of an external interrupt.
- Status register keeps the system’s status information such as the execution mode (the user mode or the kernel mode), and the interrupt mask.

When an interrupt signal is sent from an external device to CP0, it is handled in the following way. The flow assumes only single-level interrupts; during interrupt handling, all the other interrupts are ignored.

(1) CP0 saves the PC value and the cause of the interrupt in the EPC register and the Cause register, respectively, and updates the Status register so that the system may run in the kernel mode and the other interrupts will be prohibited.

(2) CP0 sends an interrupt execution signal to the CPU, and writes the starting address of the interrupt handler into PC so that the CPU will jump to the handler.

(3) The handler saves the values of the general purpose registers to the main memory and calls the routine corresponding to the cause in the Cause register.

(4) After the routine finishes its task, the handler restores the general purpose registers.

(5) The handler restores the execution mode and clears interrupt prohibition, and the address in the EPC register is written back to PC so that the CPU can resume execution (or just jumps to the specified address for error handling).

To handle interrupts, the following instructions are used.

- mfc0 and mtc0 (move from/to CP0)
  Instruction mfc0 rt,rd moves the value in a CP0 register rd to a general purpose register rt, and mtc0 rt,rd does the move...
2.3 Synthesis from Binary Codes Containing Interrupt Handler [6]

In [6], a method of synthesizing hardware from programs containing external interrupt handlers was proposed. As illustrated in Figure 2, it translates a linked executable code including an interrupt handler to a hardware module, which replaces the CPU and the instruction memory.

The structure of the hardware generated by this method is shown in Figure 3. The hardware has CP0 of MIPS R3000 as a functional unit along with three registers HW_sig, int_sig, and ESTATE. The flow of interrupt handling is as shown in Figure 4. DFG1 through DFG4 are DFGs (dataflow graphs) corresponding to the basic blocks of a given program, and s0 through s28 are states (control steps). DFG1 and DFG2 are for the normal task and DFG3 and DFG4 are for the interrupt handler. An external interrupt triggered during the execution of DFG1 is processed in the following way.

1. The interrupt signal is sent to CP0.
2. The interrupt execution signal from CP0 is latched in the int_sig register.
3. At the last state (s3) of DFG1, int_sig is tested. If it is 1, the next state of s3 (s8) is saved in the ESTATE register as the return state, and the hardware jumps to the starting state (s20) of the interrupt handler (Handler).
4. At the initial state of the handler, int_sig is cleared.
5. At the last state (s28) of the last DFG (DFG4) of the handler, the status is restored and the hardware returns to the state saved in the ESTATE register.

Moreover, in this method, synthesized hardware must save/restore registers, just as CPUs. Since this ends up with sequential memory accesses, even hardware needs almost as much cycles as the CPU.

3 SYNTHESIZING INTERRUPT HANDLER INTO SEPARATE HARDWARE

3.1 Overview

In order to solve the problem, we propose in this paper a binary synthesis method that synthesizes an interrupt handler into a separate hardware module. Henceforth, we will call a hardware module to compute normal execution flow the main module, and a module to handle interrupts the handler module. The both modules hold their contexts in their own register sets so that register save/restore operations will be eliminated and the handler module can start execution at an arbitrary timing. Moreover, the main module may continue its execution while interrupts are being handled.

In this paper, we assume binary programs for MIPS R3000, and make the following assumptions.

- Only external interrupts (hardware interrupts) are supported. Internal interrupts (software interrupts) are out of the scope of this paper. However, system calls can be synthesized to enable mutual exclusion.
- The interrupt handler cannot refer to the instruction address where the interrupt occurred nor access to the instruction memory.

Figure 2: Synthesis from binary code with interrupt handler.

Figure 3: Configuration of the synthesized hardware in [6].

Figure 4: Interrupt handling in [6].
• Multilevel interrupts are not supported. Any other interrupts during interrupt handling are ignored.

Basically, binary programs are synthesized into hardware without rewriting, but user must specify which parts of the input program are assigned to the handler module. This is specified by means of pragmas in the assembly code (obtained by disassembling the input binary).

### 3.2 Flow of Interrupt Processing

The configuration of the hardware synthesized by the proposed method is shown in Figure 5. Like in the previous configuration, it has CP0 along with the registers HW sig and int sig. Each of the main module and the handler module has an independent set of a finite state controller, a datapath, and registers. For the purpose of sharing data on system calls, two registers k0 and k1 are provided.

The flow of interrupt processing is shown in Figure 6, where DFG3 and DFG4 are of the interrupt handler and s0 – s28 are states (control steps). Basically, the handler module runs independently of the main module. Detailed steps for interrupt handling is as follows.

1. The handler module waits at the initial state s0, polling the int sig register.
2. The interrupt signal is sent to CP0.
3. The interrupt execution signal from CP0 is latched in the int sig register.
4. As soon as int sig becomes 1, the handler module jumps to s20 and executes DFG3 and DFG4.
5. When interrupt handling is finished (at s28), the handler module resets int sig and returns to s0.

Unlike the method in [6], the main module may continue its execution while the handler module is running. If there is a need for serialization or mutual exclusion between the main module and the handler module, we assume that it is implemented in the machine program by using system calls, as explained in the next subsection.

### 3.3 Mutual Exclusion and System Call

In the MIPS R3000 convention, mutual exclusion is realized by disabling interrupts (CPU lock). This is done via the system call instruction (syscall). The syscall instruction switches the CPU state to the kernel mode and transfer control to the interrupt handler, where interrupts are disabled by updating the Status register.

In the case of single core execution, syscall will never be executed by the main routine during a certain external interrupt is handled by the interrupt handler (because the main routine is idle while the interrupt handler is running). However, in the case of our hardware configuration, this may happen, because two modules may run in parallel. In order to guarantee the equivalence of the behavior between the original machine program on the CPU and the synthesized hardware, we let the main module postpone execution of syscall until the handler module completes its task.

This is implemented by translating syscall into the three state operations.

1. Test if the handler module is executing its task. If yes, goto (2). Otherwise goto (3).
2. Wait until the handler module completes its task and then goto (1).
3. Wait until the handler module to complete the task for syscall, and goto the next state. All the external interrupts occurring during this state are ignored.

We assume a calling convention that the kernel stack pointer is passed in register k0 and the syscall function is specified in k1. In our method, where the two modules have independent register sets, we somehow transfer data in the two registers. We do this via two external registers, depicted as k0 and k1 in Figure 5. After CDFG generation, operations to store k0 and k1 to the two external registers are inserted just before syscall operations in the main module, and operations to load the two registers are added in the head DFG of the handler module.

### 3.4 Deletion of Save/Restore Codes

In our method, the handler module does not have to save/restore registers. However, the save/restore codes are written in the given input binary codes, which must be deleted during the synthesis.

This is done just after the CDFG generation.

• Register save operations
(1) In the head DFG of the handler module, all the store operations that attempt to store undefined registers are recognized as save operations and are deleted.

(2) All the operations, whose destinations are temporary registers (created during CDFG generation) and are not referred to afterwards, are also deleted.

- Register restore operations

(1) In the DFGs that contain rfe operations, all the load operations that attempt to load to registers which will never be referred to are regarded as restore operations and are deleted.

(2) All the operations, whose destinations are temporary registers and are not referred to afterwards, are deleted.

4 EXPERIMENTAL RESULTS

4.1 Synthesis

The proposed synthesis method has been implemented on top of binary synthesizer ACAP. It is implemented by Perl5 and runs on Linux and Mac OS X, and outputs RTL description in Verilog HDL.

The test program is shown in Figure 7. Function main (lines 20–28) calls function filtering (lines 30–56) which performs Laplacian filtering (the function is called just once for the purpose of measuring execution cycles). Functions int_getpixel (lines 58–65) and int_average (lines 67–83) are service routines for external interrupts; int_getpixel returns the pixel value of a specified coordinate, and int_average the average pixel values in a specified rectangle region. The coordinates are given in variables in_x1, in_y1, in_x2, in_y2 and the result is stored in variable output.

The program shown in Figure 8 is the interrupt handler [6]. When an external interrupt occurs, CPU first jumps to the top of the handler. The handler 1) saves the general purpose registers, 2) saves the return address, 3) calls the interrupt service routine function, 4) restores the general purpose registers, and 5) returns from the interrupt.

The program shown in Figure 9 is a collection of library functions for interrupt handling which is written in C and inline assembly [6]. Function init_interrupt (lines 7–12) registers interrupt service routine execution function run_exc_handler (lines 21–42), which are called from the interrupt handler, and a routine run_syscall_handler (lines 44–54) for system call (EXC_Sys). Function int_prohibition (lines 56–60) is an interface function to set interrupt prohibition by calling run_syscall_handler through system call and then, calling int_prohibition_func (lines 62–69). Function int_permission (lines 71–76) sets interrupt permission.

All the programs were compiled/assembled and linked into a single binary code executable on MIPS R3000 processor. After confirming the behavior of the program by running the processor in RTL simulation where external interrupts were given from the testbench, hardware is synthesized by extended ACAP. It was confirmed that the memory dumps of variable output obtained by the MIPS processor and by the hardware agreed. It was also confirmed that the main module continued its execution while the handler module was processing its tasks.

4.2 Performance Evaluation

Logic synthesis was performed on the Verilog HDL description generated by ACAP. The logic synthesizer was Xilinx ISE 14.3 and the target was FPGA Xilinx Spartan-3E.

The result is shown in TABLE 1. “MIPS” indicates software execution on MIPS softcore processor, “HW [6]” and “HW (proposed)” stand for synthesized hardware by the method in [6] and the proposed method, respectively. “Slices” of MIPS includes estimated cost of the instruction memory; one instruction (32bit) is assumed to cost one slice. “Delay” is the critical path delay. Subcolumn “total (0)” under “Cycles” is the total number of execution cycles when the
Figure 8: Interrupt handler [6].

main routine in Figure 7 was executed without any interrupt. Subcolumns "int (1)" and "total (1)" are for the cases where interrupts for int_getpixel were raised 1,000 times; "int (1)" is the average number of cycles taken from the occurrence of an interrupt to finish the task for the interrupt, and "total (1)" the number of the total execution cycles. Subcolumns "int (2)" and "total (2)" are for the cases where interrupts for int_average for 40 x 30 pixels were raised 10 times.

As in "int (1)", our method took much less execution cycles than MIPS and the previous methods. This is considered mainly because register save/restore operations were eliminated and partly because interrupt handler starts at any time other than the end of DFGs. Our method is especially effective in reducing interrupt latencies of light and frequent interrupts.

The total execution cycles of our method in "total (1)" and "total (2)" are fewer than those of [6] and more than 80% fewer than the previous methods. This is considered mainly because interrupt handling did not increase total execution time.

Hardware cost ("Slices") is higher than that of MIPS (which includes estimated memory size for 1,753 instructions) by 41%, but this would be reasonable considering the improvement of the speed performance.

4.3 Synthesis of Motor Controller

As a more practical example, a controller of a motor driven by a timer is synthesized from a C program. The overview of the controller is shown in Figure 10 (a). It controls the rotational speed of a brushed DC motor by changing supply voltage. The control is sensorless; it does not use a speed sensor but estimates the speed by a motor simulator. The voltage is periodically adjusted upon interrupts from the timer.

The detailed block diagram of the transfer function for the PID control, with a motor model inclusive, is shown in Figure 10 (b) [7]. Three elements PID, Gain, and Limiter constitute the controller, which receives the target angular velocity \( \omega \) and decides the voltage \( v \) to drive the motor. The value of \( v \) as well as \( \omega \) is computed by expressing the transfer function by an ordinary differential equation and numerically solving it by the Runge-Kutta method.

A C program is described in the same framework as in Figure 7 through Figure 9. In this example, the main routine does nothing (just busy-loops) after initialization, while the handler is activated by the periodic interrupts from the timer.

A synthesis result is shown in TABLE 2. The logic synthesizer and the target FPGA are the same as those in TABLE 1. Note that our MIPS core was equipped with only fixed point arithmetic units
so floating point operations were carried out with GNU soft-float library, while the hardware had floating point units and ACAP converted the calls to the soft-float library by floating point arithmetic operations. As compared with software on the MIPS, the number of execution cycles ("Cycles") is drastically reduced, though circuit size ("Slices") is 2.72 times larger. The minimum period for the motor control has been reduced to 20.6 [µs], which enables extremely precise control.

5 CONCLUSION
This paper has proposed a method of synthesizing given binary programs with external interrupt handling into hardware where the normal task and the interrupt handler are implemented as separate modules. Efficient hardware compatible with the CPU running the binary code is generated without any rewriting on the binary code except for adding pragmas to specify which part should be synthesized as the handler module.

Future work includes extension of this method to multilevel interrupts and application of this method to other processors than MIPS.

ACKNOWLEDGMENTS
We would like to thank to Mr. Takayuki Nakatani who was with Ritsumeikan University, Mr. Masaharu Yano who was with Kyoto University, Mr. Shimpei Tamura who was with Kwansei Gakuin University and all the members of Ishiura Lab. of Kwansei Gakuin University for their discussion and advices on this research. This work is partly supported by JSPS KAKENHI under Grant Nos. 16K00088, 16K01207, and 15H02680.

REFERENCES