An Optimized Montgomery Modular Multiplication Algorithm for Cryptography

P. Shenbagapriya (ME-II VLSI-design) Dr. G. Mahendran ME., Ph.D,
ECE department, Associate professor and head
Syed ammal engineering college Ramanathapuram

Abstract:
Montgomery modular multiplication is one of the fundamental operations used in cryptographical algorithms, such as RSA and Elliptic Curve Cryptosystem. The previous Montgomery multipliers perform a single Montgomery multiplication in approximately \(2n\) clock cycles and it requires more number of addition stages for large word length addition, where \(n\) is the size of operand operands in bits. In this paper, new Montgomery modarlt multipliier is proposed which perform the same operation in approximately \(n\) clock cycles with almost same clockperiod. The proposed multiplier uses carry select adders (CSLAs) to perform large wordlength additions. Carry select adders are based on the concept of Binary to Excess-1 converter (BEC). The proposed algorithm uses the concept of precomputing partial results using two possible assumptions either zero or one regarding the most significant bit of the previous word. The optimized algorithm is simulated using Xilinx ISE 12.1i and is implemented using Virtex5 FPGA device.

Keyword - Rivest, Shamir, Adleman (RSA), Carry Select Adders (CSLAs) and Binary to Excess-1 converter (BEC).

1 Introduction

The Montgomery multiplication (MM) is the basic operation used in modular exponentiation, which is required in the Diffie-Hellman and RSA public-key cryptosystem. Recent implementations of the Montgomery Multiplication are focused on elliptic curve cryptography over the finite fields \(GF(p)\) and \(GF(2^m)\). A major design concern for multiplication units used in cryptography is the large number of operand bits, which causes large fan-out of signals, large wire delays, complex routing, and the cost of extra hardware resources.

Montgomery algorithm requires the built-in multipliers in the FPGA device to speed up the computation. This feature makes the implementation expensive. The systolic high-radix design by McVor et al. described in [6] is capable of very high-speed operation, but suffers from the disadvantage of large area requirements for fast multiplier units. A different approach based on processing multiprecision operands in carry-save (CS) form has been presented in [7]. This architecture is optimized for the minimum latency and is particularly suitable for repeated sequence of Montgomery multiplications, such as the sequence used in modular exponentiations.

In this paper, the focus is given to the optimization of the Montgomery algorithm in order to minimize the number of clock cycles required to compute an \(n\)-bit precision Montgomery multiplication and reducing number of gates used in multiplier by using carry select adders. Section 2 contains the introduction of the Montgomery multiplication. Then, the new modified carry select adders are discussed in section 3. The new optimized algorithm, which is able to perform the \(n\)-bit precision MWR2MM algorithm in approximately \(n\) clock cycles, is presented in Section 4. In Section 5, the experimental results of various architectures were discussed and the performance analysis has been done in section 6. Finally, the paper was concluded in section 7.
2 Montgomery Multiplication Algorithm

Let \( M > 0 \) be an odd integer. In many cryptosystems, such as RSA, computing \( X \cdot Y \pmod{M} \) is a crucial operation. The reduction of \( X \cdot Y \pmod{M} \) is a more time-consuming step than the multiplication \( X \cdot Y \) without reduction. Montgomery introduced a method for calculating products \( \pmod{M} \) without the costly reduction \( \pmod{M} \), since then known as Montgomery multiplication. He observed that the divisions can be converted into simple shifts if multiplication is instead performed on so-called Montgomery residues\((M\text{-residues})\). Montgomery multiplication\((\text{MM})\) of residues is defined as
\[
Z = \text{MM}(X,Y) = X \cdot Y \cdot r^{-1} \pmod{M}
\]

Observe that
\[
Z = X \cdot Y \cdot r^{-1} \pmod{M} = (x \cdot y) \cdot r \pmod{M} = z \pmod{M}
\]
so the Montgomery product of two Montgomery residues is the Montgomery residue of the product of the two corresponding integers.

\[
Z = 0 \\
\text{for } i = 0 \text{ to } n-1 \\
Z = Z + x_i \cdot y_i \\
\text{if } Z \text{ is odd then } Z = Z + M \\
\text{if } Z = M \text{ then } Z = Z - M
\]

Simple radix-2 Montgomery multiplication algorithm

This algorithm computes \( S = \text{MM}(X,Y) = X \cdot Y \cdot r^{-1} \pmod{M} \). It uses \( n \) steps, as in a word-serial multiplication algorithm. On step \( i \), it adds the \( Y \) to the running sum if the \( i \)-th bit of \( X \) \((x_i)\) is true. Also on each step, it divides by two. If the running sum was odd, it first adds \( M \) so the result can be divided by two without loss of information. This is permissible because adding \( M \) \( \pmod{M} \) does not change the result. After \( n \) steps of dividing by two, the algorithm has divided by \( r = 2^n \). The algorithm might produce a result as large as \( 2M-1 \), so it concludes by subtracting \( M \) if necessary to restore the result to the legal range. The final subtraction can be avoided for repetitive multiplications used in exponentiation.

3. Carry Select Adder

Carry Select Adder (CSLA) is one of the fastest adders used in many data-processing processors to perform fast arithmetic functions. However, the CSLA is not an area efficient because it uses multiple pairs of Ripple Carry Adders (RCA) to generate partial sum and carry by considering carry input \( c_{in} = 0 \) and \( c_{in} = 1 \) then the final sum and carry are selected by the multiplexers (mux).

The basic idea of this work is to use Binary to Excess-1 Converter(BEC) instead of RCA with \( c_{in} = 1 \) in the regular CSLA to achieve lower area and power consumption [2]. The main advantage of this BEC logic comes from the lesser number of logic gates than the \( n \)-bit Full Adder (FA) structure. A structure of a 4-b BEC and 16 bit CSLA are shown in Fig. 2 and Fig 1.

Fig. 2 4 bit BEC

4. Optimized Montgomery Multiplier Using Carry Select Adders

In this context, the optimization of hardware block for single word and multiword Radix-2\((\text{MWR2MM})\) in order to minimize the number of clock cycles required to compute an \( n \)-bit precision Montgomery multiplication was focussed. It uses carry select adders (CSLAs) to perform large word length additions [1].

Algorithm 1 shows the pseudocode for the radix-2 Montgomery multiplication, where we choose \( n = (\log_2 M) + 1 \). \( n \) is the size of \( M \) in bits.

**Algorithm 1**

In [3], Tenca and Koc¸ proposed a scalable architecture based on the Multiple-Word Radix-2 Montgomery Multiplication Algorithm, shown as Algorithm 2.

**Algorithm 2**

In Algorithm 2, [1] the operand \( Y \) (multiplicand) is scanned word-by-word, and the operand \( X \) is scanned bit-by-bit. The operand length is \( n \) bits, and the word length is \( w \) bits. \( e = (n+1)/w \) words are required to store \( S \) since its range is \([0, 2M-1]\). The original \( M \) and
Y are extended by one extra bit of 0 as the most
significant bit. Presented as vectors,

\[ Y = (0, Y^{(e-1)}, \ldots, Y^{(1)}, Y^{(0)}) \]

\[ S = (0, S^{(e-1)}, \ldots, S^{(1)}, S^{(0)}) \]

\[ X = (x_{n-1}, \ldots, x_1, x_0) \]

The carry variable \( C^{(i)} \) has two bits, as explained below. Assuming \( C^{(0)} = 0 \), each subsequent value of \( C^{(i+1)} \) is given by

\[ (c^{(i+1)}, s^{(i)}) = (c^{(i)} + x_{i} Y^{(i)} + q_{i} M^{(j)} + s^{(j)}) \]

Assuming that \( c^{(0)} \leq 3 \), we obtain

\[ (c^{(i+1)}, s^{(i)}) = (c^{(i)} + x_{i} Y^{(i)} + q_{i} M^{(j)} + s^{(j)}) \]

\[ \leq 3 + 3(2^{w-1}) \]

\[ \leq 3.2^{w} \]

we have \( c^{(i+1)} \leq 3 \). By induction, \( c(j) \leq 3 \) is ensured for any \( 0 \leq j \leq e-1 \). Additionally, based on the fact that \( S \leq 2M \), we have \( c^{(0)} \leq 1 \).

In order to reduce the two-clock-cycle delay to half, we propose an approach of precomputing the partial results using two possible assumptions regarding the most significant bit of the previous word in order to reduce the two-clock-cycle delay to half, we propose an approach of precomputing the partial results using two possible assumptions regarding the most significant bit of the previous word. As shown in Fig. 3, PE #1 can take the \( (w-1) \) most significant bits of \( S{(0)}{(i=0)} \) from PE #0 at the beginning of clock #1, do a right shift, and compute two versions of \( S{(0)}{(i=0)} \), based on the two different assumptions about the most significant bit of this word at the start of computations. At the beginning of the clock cycle #2, the previously missing bit becomes available as the least significant bit of \( S{(1)}{(i=0)} \). This bit can be used to choose between the two precomputed versions of \( S{(1)}{(i=1)} \). Similarly, in the clock cycle #2, two different versions of \( S{(0)}{(i=2)} \) and \( S{(1)}{(i=2)} \) are computed by PE #2 and PE #1, respectively, based on two different assumptions about the most significant bits.

**Fig. 3 Data Operation Of Optimized Algorithm**

of these words at the start of computations. At the beginning of the clock cycle #3, the previously missing bits become available as the least significant bits of \( S{(1)}{(i=1)} \) and \( S{(2)}{(i=0)} \), respectively. These two bits can be used to choose between the two precomputed versions of these words. The same pattern of computations is repeated in subsequent clock cycles.
Furthermore, since e words are enough to represent the values in S, S⁰(i) is discarded in our designs. Therefore, e clock cycles are required to compute one iteration of S.

Algorithm 4. Computations in Task E

```
Input: q_i, x_i, C(i), Y(i), M(i), s(i), s(i+1), s(i-1)
Output: C(i+1), s(i+1)
4.1 (CO(i+1), SQ(i+1), Eq(i+1) =
   (1, s(i), C(i) + x_i \cdot Y(i) + q_i \cdot M(i);
4.2 (CE(i+1), SE(i+1), Eq(i+1) =
   (0, s(i), C(i) + x_i \cdot Y(i) + q_i \cdot M(i);
4.3 if S(i) = 1 then
   C(i+1) = CO(i+1);
4.4 else
   C(i+1) = CE(i+1);
4.5 s(i+1) = (s(i), s(i));
```

In Task D consists of three steps, the computation of q_i, the calculation of two sets of possible results, and the selection between these two sets of results using an additional input S₀, which becomes available at the end of the processing time for Task D. These three steps are shown in Algorithm 3. Task E corresponds to two steps, as shown in Algorithm 4. The data forwarding of S₀ and S₁ from one circle E to the two circles in the right column takes place at the same time. However, S₀ is used for selecting the two partial results of S¹, and S₁ is used for generating the two partial results of S². Additions of words in above algorithm is calculated by CSLA shown in fig 4.

Instead of using normal adder we can use this carry select adder to reduce delay and the area consumption. This adder is used to add x_i, q_iM and S(i) which have been mentioned in the algorithm. The Montgomery multiplier unit is shown in fig 5.

5. Experimental Results

The proposed Montgomery multiplication algorithm, Montgomery algorithm with carry select adder (CSA) and with Binary to excess-3 Converter (BEC) are simulated by using Xilinx ISE 12.1i and implemented in Virtex-5 FPGA processor. The experimental results are given in Table 1.

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Montgomery algorithm</th>
<th>Montgomery algorithm with CSA</th>
<th>Montgomery algorithm with BEC</th>
</tr>
</thead>
<tbody>
<tr>
<td>Number of Slice LUTs</td>
<td>144</td>
<td>132</td>
<td>142</td>
</tr>
<tr>
<td>Number of bonded IOBs</td>
<td>22</td>
<td>22</td>
<td>22</td>
</tr>
<tr>
<td>Time (ns)</td>
<td>17.534</td>
<td>19.660</td>
<td>20.466</td>
</tr>
</tbody>
</table>

6. Performance Analysis

The performance of the Montgomery algorithm with various types of adders are analysed based on the area and time consumption which was given in table 1.

Table 1. Experimental Results

Based on the analysis, it is come to known that the Montgomery multiplication which uses the carry select adder gives the better reduction in area and time when compared to the other two methods and is shown in the fig.5.
7. Conclusion

In general, Montgomery multiplication algorithm used for cryptography requires more area and time. In order to reduce both the parameters required, the various adders are used in the case of addition process. Among three types of addition processes which have discussed above, the Montgomery algorithm with carry select adder (CSLA) produces the optimized result. Because, it reduces both area and time considerably. Further the time consumption may also be reduced with the other new technique used for Montgomery multiplication algorithm.

References


