Range Enhanced Packet Classification Design on FPGA

Yeim-Kuan Chang and Chun-Sheng Hsueh
Department of Computer Science and Information Engineering
National Cheng Kung University, Taiwan, R.O.C.

Abstract—The future of fast Internet needs powerful routers to support abundant network functionalities, such as firewall, QoS, and virtual private networks by classifying the packets into different categories based on a set of predefined rules, so-called multi-field packet classification. Traditional packet classification that considers only 5 tuple fields is not sufficient for today’s complicated network requirements. OpenFlow switch was born to take care of these complex requirements by using a rule set with rich definition as the software-hardware interface. This paper considers OpenFlow1.0, consisting of 12 tuple header fields [2]. We propose two schemes to process range fields. The first scheme has the same characteristic as StrideBV [15] by using specially designed codes to store the pre-computed results in memory. The second scheme uses a simple sub-range comparison method to find the matching result in a sequential fashion. To show the performance and compare with other proposed schemes, we implement the proposed schemes on Xilinx Virtex-6 XC6VLX760 FPGA device. Experimental results show that our designs can handle 5K more OpenFlow rules on Virtex-6 XC6VLX760. To our knowledge, our proposed scheme is the first range supported method that can sustain the throughputs of more than 380 MHz.

Keywords—Packet Classification; Pipelined Architecture; FPGA; OpenFlow;

I. INTRODUCTION

Network security has become more and more important these days because of the various attacks on the internet. Many hardware solutions and software functions have been widely invented to avoid attacks. Packet classification and deep packet inspection (DPI) are operated in these systems to detect and protect potential threats by dropping harmful traffic. Packet classification acts as the initial filter to the network in which network traffic is classified into flows based on a predefined set of rules. It needs the inspection of multiple fields of the packet header and is different with IP forwarding where only the destination IP address is inspected.

When using hardware to implement packet classification solutions, memory used to store the pre-defined rules is often the bottleneck. Especially, for the platforms like Field Programmable Gate Arrays (FPGAs), on-chip memory is limited. Using external memory will cause lower clock rate because of the connection delay between two devices. To conquer these problems, numerous solutions have been proposed in the literature to reduce the memory usage of ruleset storage. Most of the existing papers take advantages of some rule properties to reach the goal of memory efficiency [3], [4], [5], [6], [7]. While these properties may not always exist, different ruleset might affect the results. In some cases, the heavily feature-dependent solutions may yield poor memory efficiency.

The solutions that consider memory efficiency and high throughput demand at the same time are difficult to implement and challenging due to many reasons. For example, in trie or tree based approaches, on-chip resources are easily exhausted due to pipelined tree traversal and multi-field lookup. Therefore, it will be difficult to implement multiple parallel pipelines to improve the throughput. In this work, we consider improving throughput as the primary goal.

Software Defined Networking (SDN) has been proposed as an innovative architecture for enterprise networks. SDN separates the software based control plane from the hardware based data plane, and uses a flexible protocol - OpenFlow [1] to manage network traffic. One of the most important functions in OpenFlow platforms is the flow table lookup [8], [9]. The flow table lookup needs to check the incoming packet to see if it is matched against multiple fields in a prioritized flow table. The method is similar to the classic 5-field packet classification [10].

Many existing solutions for multi-field packet classification are implemented by using ternary content addressable memories (TCAMs) [11], [12]. But TCAMs are expensive and power hungry. TCAM based solutions also suffer from range explosion problem when converting ranges into prefixes [12]. In [23], an extended TCAM was proposed to solve the range explosion problem by introducing an iterative structure of 1-bit range check circuit.

Field Programmable Gate Array (FPGA) technology has been used to implement IP lookup, packet classification and other solutions for real time network processing [13], [14]. FPGA based packet classification engine can achieve very high throughput for rule sets of moderate size [15]. We can increase the throughput by creating numerous parallel pipelines. However, as the number of packet header fields or the rule set size increases, FPGA based approaches still suffer from clock rate degradation. Because OpenFlow protocol has many packet header fields to be examined [14]. OpenFlow packet classification remains a challenging research problem.

II. RELATED WORK

A. Field-Split Bit Vector (FSBV)

The Field-Split Bit Vector (FSBV) algorithm is designed to solve packet classification problems [3]. One of the most important things for FSBV is to reduce memory consumption. The studies on the 5-field traditional packet classification rules
appearing in the Snort [16] ruleset show that the source address, destination address, and protocol fields of the rules contain a pretty small number of unique values compared with the ruleset size. Because reducing the requirement of memory is the major goal in FSBV, the authors applied the field-splitting algorithm only to the source port and destination port fields.

In FSBV, a given field $F$ of width $w$ bits can be split to a set of $w$ sub-fields $F[w_i]$ for $i = 0$ to $w-1$. Each $w_i$ has two possible values, 0 and 1. Extending this to all the $N$ rules in the ruleset will result in two $N$-bit vectors (say $R_0[N]$ and $R_1[N]$) that correspond to the two possible input values of sub-field $F[w_i]$. If bit $j$ is set in $R_0[N]$, sub-field $F[w_i]$ of rule $j$ is zero or don’t care. Similarly, if bit $j$ is set in $R_1[N]$, sub-field $F[w_i]$ of rule $j$ is 1 or don’t care. Therefore, if the input sub-field $F[w_i]$ is zero, we will examine $R_0[N]$; otherwise we will examine $R_1[N]$. We use an example ruleset to explain this process. We set the header field bit length $w$ to 4. The bit vector generation method and packet lookup process are illustrated in Figure 1.

When packet header arrives, the corresponding field $F$ will be used to fetch the related bit vectors. These bit vectors will be sent to a bit-wise logical AND to compute matching results. A bit vector in this packet classification algorithm consists of $N$ bits. Each bit represents a rule of the ruleset. A bit in a bit vector is set to 1 to indicate a match or 0 to indicate a mismatch. The correctness of this method has already been proved in [3]. The result of the bit-wise AND operation is also an $N$-bit bit vector, where bit position $k$ indicates if the rule $k$ is a match for the incoming packet header or not.

This field split algorithm is fully supported for wildcard (*) matching. The method to support wildcards is to set the corresponding bit vector to 1. In this example, wildcards appear in rules R2 and R4, and the positions are at 0 and 2, respectively. A wildcard means that the corresponding bit in the header can be either 0 or 1. Therefore, both the bits in position 0 for R2 and in position 2 for R4 are set to 1.

Since the field split method does not support ranges, direct range-to-prefix conversion is usually used. However, the direct range-to-prefix conversion suffers a problem that a single rule will be partitioned into multiple rules.

B. Stride Bit Vector (StrideBV)

As mentioned earlier, the field split algorithm is only applied for source port and destination port fields due to the consideration of memory consumption. For the other fields, TCAM/CAM is used since the number of unique values is small. Their goal is to find out a solution that avoids relying on ruleset features and achieves high throughput. Implementing TCAM on FPGA can be inefficient due to high circuit complexity and poor performance compared with pipelined architectures [6]. Implementing TCAM on FPGA limits the scalability as well as performance of FSBV as ruleset features change. The author considers memory consumption as a secondary target mainly because FPGA has various memory resources. Hence, the proposed field split algorithm in [15] was applied to all the 5 fields.

In the case of traditional 5-field packet classification, using the original FSBV method will result in 104 stages in a single pipeline. This will cause serious latency problem. From a hardware perspective, numerous stages will cause significant routing delay, which leads to clock rate degradation. Reducing the pipeline stages can be done by using multiple bits instead of a single bit. This can be performed by storing bit vectors corresponding to the $2^k$ combinations of the $k$ bit stride and loading a single bit vector per stage. The authors call it StrideBV scheme. StrideBV uses more memory while it has lower memory bandwidth to achieve high throughput. To be specific, for a rule containing the prefixes of $W$ bits, using stride $d$ requires $2^d[W/d]$ bits.

C. OpenFlow

The Open Networking Foundation (ONF), a user-led organization dedicated to promotion and adoption of software-defined networking (SDN), manages the OpenFlow standard. ONF defines OpenFlow as the first standard communications interface defined between the controls and forwarding layers of an SDN architecture. OpenFlow allows direct access to and manipulation of the forwarding plane of network devices such as switches and routers, both physical and virtual (hypervisor-based). It is the absence of an open interface to the forwarding plane that has led to the characterization of today’s networking devices as monolithic, closed, and mainframe-like. A protocol like OpenFlow is needed to move network control out of proprietary network switches and into control software that’s open source and locally managed.

An OpenFlow Switch consists of one or more flow tables and a group table performs packet lookup and forwarding via a secure OpenFlow channel to an external controller. The switch communicates with the controller and the controller
manages the switch via the OpenFlow protocol. Based on the OpenFlow protocol, the controller can add, update, and delete flow entries in flow tables, both reactively (in response to packets) and proactively. Each flow table in the switch contains a set of flow entries and each flow entry consists of match fields, counters, and a set of instructions to apply to the matching packets.

Matching process starts at the first flow table and may continue to additional flow tables. Flow entries match packets in the order of increasing priority, with the first matching entry in each table being used. If a matching entry is found, the instructions associated with the specific flow entry are executed. If no match is found in a flow table, the outcome depends on configurations: to forward the packet to the controller over the OpenFlow channel, to drop the packet, or to continue to the next flow table.

OpenFlow version 1.0 packet header includes 12 fields: Ingress port, Ethernet source address, Ethernet destination address, Ethernet type, VLAN ID, VLAN priority, IP source address, IP destination address, IP protocol, IP ToS bits, Transport source port/ICMP Type, Transport destination port/ICMP Type. Each entry contains a specific value, prefix, or don’t care. Details on the properties of each field are described in Table 1.

OpenFlow-compliant switches come in two types: OpenFlow-only, and OpenFlow-hybrid. OpenFlow-only switches support only OpenFlow operations. In those switches, all packets are processed by the OpenFlow pipeline, and cannot be processed otherwise.

OpenFlow-hybrid switches support both OpenFlow operation and normal Ethernet switching operation, i.e. traditional L2 Ethernet switching, VLAN isolation, L3 routing (IPv4 routing, IPv6 routing...), ACL and QoS processing. Those switches must provide a classification mechanism outside of OpenFlow that routes traffic to either the OpenFlow pipeline or the normal pipeline. For example, a switch may use the VLAN tag or input port of the packet to decide whether to process the packet using one pipeline or the other, or it may direct all packets to the OpenFlow pipeline. An OpenFlow-hybrid switch may also allow a packet to go from the OpenFlow pipeline to the normal pipeline through the NORMAL and FLOOD reserved ports.

D. Field Programmable Gate Array

A field-programmable gate array (FPGA) is an integrated circuit designed to be configuration by a customer or a designer after manufacturing – hence “field-programmable”. The FPGA configuration is generally specified using a hardware description language (HDL), similar to that used for an application-specific integrated circuit (ASIC).

Contemporary FPGAs have large resources of logic gates and RAM blocks to implement complex digital computations. FPGAs can be used to implement any logical function that ASICs could perform. The ability to update the functionality after shipping, partial re-configuration of the design, and the low non-recurring engineering costs relative to an ASIC design offers advantages for many applications.

Historically, FPGAs have been slower, less energy efficient and generally achieved less functionality than their fixed ASIC counterparts. More recently, FPGAs such as the Xilinx Virtex-7 or the Altera Stratix 5 have come to rival corresponding ASIC and ASSP solutions by providing significantly reduced power, increased speed, lower materials cost, minimal implementation real-estate, and increased possibilities for re-configuration ‘on-the-fly’.

Vendors can take a middle road by developing their hardware on ordinary FPGAs, but manufacture their final version as an ASIC so that it can no longer be modified after the design has been committed.

FPGA can be used to solve any problem which is computable. This is trivially proven by the fact FPGA can be used to implement a soft microprocessor. Their advantage lies in that they are sometimes significantly faster for some applications due to their parallel nature and optimality in terms of the number of gates used for a certain process.

III. PROPOSED SCHEME

Our proposed scheme is based on the bit vector scheme [15]. A fixed number of rules are implemented in a pipelined architecture. Therefore, if the rule set is large, multiple pipelines are needed. In each pipeline, the incoming packet’s header first passes through several processing stages for range fields. The partial bit vector results of range field stages will be passed to the following stages of the prefix fields. Finally, a pipelined priority encoder will aggregate multiple pipeline results and find out the highest priority match. We use the OpenFlow version 1.0, and we set the ingress port to 6 bits, so the incoming packet header is of size 243 bits. Our design takes the advantage of the statistic results of [18] with consideration of reasonable number of stages.

A. Range Bit Vector Encoding (RBVE) scheme

As the traditional packet classification rule tables, OpenFlow rule tables only have the fields of source and destination ports that contain 16-bit ranges. A 16-bit range is denoted by \([LB, UB]\), where \(LB\) and \(UB\) are its lower and upper bounds. In this section, we propose a new range encoding scheme called range bit vector encoding (RBVE) scheme. In RBVE scheme, a 16-bit range is split into many \(d\)-bit sub-ranges, where \(d\) is the fixed stride size that can be 1, 2, 4, or 8. Note that the cases of variable stride sizes are not discussed in this paper. For a fixed stride size of \(d\), there are \(s = 2^d\) bit-ranges that can be implemented as a pipeline of \(s\) stages, where \(s = 16/d\). Figure 2 illustrates the general view of a 16-bit range \([LB, UB]\) being split into four \(4\)-bit sub-ranges, \([LB_i, UB_i]\) for \(i = 1\) to \(4\). Let \(A\) be the 16-bit input address and \(A_i\) is split into four sub-addresses, \(A_{i-1}\) for \(i = 1\) to \(4\). As we can see, if \(UB_i < A_i < LB_i\), we can conclude that \(UB < A < LB\) and so the first stride is sufficient to decide if the input address \(A\) matches the range \([LB, UB]\). Figure 2 also shows the other three possible conditions, \((1) A_i > UB_i\) \(A_i < LB_i\), \((2) LB_i < UB_i = A_i\), and \((3) A_i = LB_i < UB_i\). Another additional condition not shown in Figure 2 is \(A_i = LB_i = UB_i\).
Figure 2. General view of a range being split into 4 strides.

Therefore, we propose to use a set of output signals that are generated from each stage to perform the matching process. Figure 3 shows these output signals and their meanings. Take the first stride as an example. Output signals 000 and 111 mean always mismatch and always match for the two conditions $A_1 > UB_1 \land \neg A_1 \wedge UB_1$, and $LB_1 \land \neg A_1 \wedge UB_1$, respectively. Output signal 001 means $A_1 = LB_1 \lor UB_1$ and so a match happens only when $A_2 \wedge A_3 \wedge \ldots \wedge A_n \geq LB_2 \wedge LB_3 \wedge \ldots \wedge LB_n$, where $\circ$ is the concatenation operator. Similarly, output signal 100 means $LB_1 < UB_1 = A_i$ and so a match happens when $A_2 \wedge A_3 \wedge \ldots \wedge A_i \leq UB_2 \wedge UB_3 \wedge \ldots \wedge UB_n$. Output signal 111 means $A_1 = LB_1 = UB_1$ and so a match happens only when $LB_2 \wedge LB_3 \wedge \ldots \wedge LB_n$. Therefore, we set $f_{101} = 1$ when both $a_0$ and $f_0$ are one, a final match is generated. For the case of $A_1 = UB_1 = LB_1$, a match occurs only when $LB_2 \leq A_2 \leq UB_2$. Therefore, $a_1, f_1,$ and $f_0$ are all set to one. In addition to the cases considered above, the other two cases happening at the first stage are $A_1 > UB_1$ and $A_1 < LB_1$. For these two cases, the initial code 000 is used. It is not possible to have a match result for these two cases no matter what the code for the last stride is.

The code designs of the first and last stages can be understood clearly based on the relationship between $UB_1$ and $LB_1$ and between $LB_2$ and $UB_2$. Given a range $[LB, UB]$, the relationship between $UB_1$ and $LB_1$ only satisfies two conditions: $UB_1 > LB_1$ and $UB_1 = LB_1$ as shown in Figure 5(a). In other words, the condition of $UB_1 < LB_1$ will never occur. However, the relationship between $LB_2$ and $UB_2$ meets all three possible conditions, $UB_2 > LB_2$, $UB_2 = LB_2$, and $UB_2 < LB_2$ as shown in Figure 5(b). By using these conditions, all the possible cases are listed in Figure 5(a) and (b) and their associated codes can be given easily.

After reading the codes $Code[A_1] = a_0 a_1 a_2$ in the first stage and $Code[A_1] = f_0 f_1 f_2$ in the last stage, we still have to perform a simple computation to determine if it is a match or mismatch by the equation shown in Figure 6(c).
Figure 5. Code details for range split into two sub-ranges.

(a) Code design for $d = 4$ or 2

(b) Code design for $d = 1$

A.2. Code design for $d = 4$ or 2

The codes for the first and the last stages are the same as the case of $d = 8$. The codes for the middle stages are shown in Figure 4(b). Since all the three conditions of $A_i < UB$, $A_i = UB$, $A_i > UB$ are possible in the middle stages, we need two bits ($c_2$ and $c_3$) to distinguish them. We use $c_2$ and $c_3$ independently. Bit $c_3$ is enabled when $A_i = UB$, and $c_2$ is enabled when $A_i < UB$. Thus, $c_3$ and $c_2$ cannot be enabled simultaneously. Similarly, we need two bits ($c_1$ and $c_0$) to record the relationship between $A_i$ and $LB_i$. Bit $c_1$ is enabled when $A_i = LB_i$ and $c_0$ is enabled when $A_i > LB_i$. By following the same design rationale of bits $a_{2i}a_{i+1}$ at the first stage, we have to perform some computations to obtain the 3-bit output signals ($a_{2i+1}a_i$) as shown in Figure 6. As a result, the process of determining the final match in the last stage will be same as the one shown in Figure 6(c).

A.3. Code design for $d = 1$

The code design for $d = 1$ can be the same as that for $d = 8$, 4 or 2. However, this way seems wasting a lot of memory space since 1-bit stride is much simpler than the multibit stride. Therefore, we propose a memory efficient code design for $d = 1$ as follows. As defined before, $[LB, UB]$ is the sub-range in stride $i$. We know that $LB_i$ or $UB_i$ can only be 0 and 1. Therefore, the sub-range in the first stride can be [0, 0], [0, 1], or [1, 1]. The sub-range in the other strides can be [0, 0], [0, 1], [1, 0], or [1, 1]. The sub-address of stride $i$, $A_i$, can be 0 or 1. Similar to StrideBV scheme, we can store 2-bit codes $xy$ for input address $A_i$ based on the following pre-computation rule: (1) if $A_i = UB$, $x$ is to 1; otherwise $x$ is set to 0, and (2) if $A_i = LB$, $y$ is to 1; otherwise $y$ is set to 0. Figure 7 shows an example. Since the stored code $xy$ for $A_i = 0$ is the complement of that for $A_i = 1$, we only store the former. As a result, we propose to store the code $ab$ as shown in Figure 8.

With the stored codes $ab$, the matching process is the same as the design for $d = 8$, 4 or 2 by using the output signals defined in Figure 3. Figure 9 shows the matching process at each stage.

B. Sequential subrange compare (SSC) scheme

We propose a range matching scheme called sequential subrange compare (SSC) scheme that stores subranges directly in the memory and performs subrange match operations sequentially. SSC does not perform precomputations against input headers as in the bit vector based schemes. SSC is similar to iterative structure of 1-bit range check circuit of the extended TCAM [23]. The advantage of SSC over the extended TCAM is that SSC is more general than 1-bit range check circuit in that the stride size of SSC can be any number of bits. The original intention is to solve the range field processing errors occurring in the other schemes [18][22] since all the subranges split from an original 16-bit range are not totally independent. The matching process of a subrange cannot be done by using only two comparators, e.g., the ≤ lower bound comparator and the ≥ upper bound comparator. A correct and complete implementation is much more complex. The subranges split from a range are classified into three types, the first, the middle, and the last subranges. Infered from Figure 2, we need four types of matching results that are represented by the signals, Match, Mismatch, LBmatch, and U8match. When none of Match, LBmatch, and U8match is true, Mismatch becomes true. Therefore, Mismatch is not needed. Also, Match and LBmatch/U8match are mutually exclusive and so they cannot be true simultaneously. However, LBmatch, and U8match may be true at the same time because lower and upper bounds can be the same. The range be split into $s$ subranges, $[LB, UB]$ and the input address $A$ be split into $s$ sub-addresses $A_i$ for $i = 1$ to $s$. Therefore, each stage only has to compute three signals using the signals generated from previous stage, the subrange bounds $LB_i$ and $UB_i$, and the input sub-address $A_i$. The final stage only computes the signal Match that will be the final match result.
Figure 7. Code example of range [134, 171] with 1-bit stride.

![2-bit code design for range fields with 1-bit stride.](#)

Figure 8. Code design for range fields with 1-bit stride.

C. Hardware implementation of RBVE and SSC

The pipeline implementations of RBVE and SSC schemes are given in Figure 11 and Figure 12, respectively. These two figures show the required resources for a single 16-bit range that is split into $s = 16/d$ stages of $d$-bit sub-ranges, where $d = 8$, 4, 2, or 1. For the RBVE scheme in Figure 12, 5-input LUTs (denoted by 5-LUTs) are used to implement the logic of all the stages. A memory array of $2^d$-bit, 4-bit, and 2-bit entries is needed for the first, the middle, and the last stages, respectively. Notice that the middle stages are not needed when $d = 8$. The values stored in these memory arrays are the precomputed codes for all possible $2^d$-bit sub-addresses. For the SSC scheme in Figure 12, only two $d$-bit memory spaces are needed for a sub-range to store its lower and upper bounds. However, four comparators are required in each stage. Each comparator for $d$-bit strides is implemented by a $2d$-input LUT.

**Block RAM Storage**

The block RAM in Xilinx® 6 and 7 series FPGAs stores up to 36 Kbits of data and can be configured as either two independent 18 Kb RAMs, or one 36 Kb RAM. Since the stride size $d$ of the proposed RBVE scheme is not larger than 8 for the range fields, the basic unit of the block memory we use is the 18Kb block RAM which is configured as a 512x36 memory unit (i.e., an array of 512 36-bit entries) in true dual-port mode. In the true dual-port mode, two concurrent reads on the same block RAM can be performed and so the throughput of the search operations can be doubled.

Since the proposed RBVE scheme uses the bit array to store the pre-computed initial matching results, the block RAM is a good implementation choice. Also, the minimum number of entries for a block RAM is 512, the proper choice of the stride size $d$ is 8, which requires only 256 entries. As a result, a half of the block RAM becomes wasted. We will address this problem later. Figure 13 shows the proposed block RAM layout for RBVE scheme with stride size $d = 8$. The pre-computed codes of 36 ranges are concatenated for satisfying the minimum block RAM configuration of 512 36-bit entries. BRAM1 and BRAM2 are the block RAMs of 256 entries needed for Code$_1(A_1)$ and Code$_2(A_2)$ at the stages 1 and 2, respectively. As shown in the figure, the rightmost code of...
these 36 codes is for the range [3: 512]. So, for the input port number 513, the match signal for this range is 0, which is a mismatch.

**Distributed RAM Storage**

The function generators (LUTs) in SLICEMs of the Xilinx® 6 and 7 series FPGAs can be implemented as a synchronous RAM resource called a distributed RAM denoted by distRAM. The other type of slices in Xilinx® 6 and 7 series FPGAs is SLICEL that is usually 2-3 times larger than SLICEM. SLICEL cannot be configured as distRAM. Multiple 6-input LUTs in a SLICEM can be combined in various ways to store larger amount of data as follows, single-port distRAM of 32x1/64x1/128x1/256x1 bits, dual-port distRAM of 32x1/64x1/128x1 bits, and quad-port distRAM of 32x2/64x1 bits. Specifically, four 6-input LUTs can be configured as one single-port distRAM of 256x1 bits, or one dual-port distRAM of 128x1 bits, or one quad-port distRAM of 64x1 bits.

For choosing smaller stride size while still considering reasonable number of stages, we choose stride size \( d = 4 \) for the range fields of the proposed SSC scheme. Another advantage of choosing \( d = 4 \) is the comparators of two 4-bit inputs can be implemented with one 4-input and one 5-input LUTs. The cost of implementing comparators with two \( d \)-bit inputs for \( d > 4 \) will be too expensive. So, the number of stages for a 16-bit range field is 4. As a result, each range field value in the SSC scheme takes eight 4-bit memory space to store the lower and upper bounds of four 4-bit sub-ranges. Sixteen range values are grouped in a cluster and so that the distributed RAM of 8x16 bits is needed in each of the four stages for a 16-bit range field.

**D. Prefix Field Matching**

We use **StrideBV** method for the prefix field stages. Assume the stride size is \( k \). So, we will have \([243 – 32]/k\) stages for all prefix fields. For \( k \leq 8 \), we can use distRAM and 16 rules are grouped in a cluster. For \( k > 8 \), we can use block RAM and 36 rules are grouped in a cluster. The prefix field stages are placed after the range field stages and so the partial \( N \)-bit bit vector result of range field will be passed in, where \( N = 36 \) for **RBVE** scheme and \( N = 16 \) for **SSC** scheme.

Figure 14 shows an overall architecture of the range and prefix field implementations by **RBVE** and **StrideBV** schemes, respectively. The **RBVE** scheme for range fields is implemented by using block RAMs, where stride size is 8 bits and 36 ranges are grouped together in a range pipeline as shown in Figure 14. The **StrideBV** scheme for prefix fields is implemented by using distRAMs, where stride size is 5 bits and 16 prefixes are grouped together in a prefix pipeline. Since the sizes of the groups in the range and prefix fields are different, cross-pipeline links are needed as shown in the figure. We call it the **super-pipeline** that combines the range and prefix pipelines. As a result, totally 144 rules are in a super-pipeline when all fields are considered. We refer to this super-pipeline architecture as **Design I**. If the range fields are implemented by distRAM based on **SSC** scheme, we use 16 ranges in a group which is the same as the prefix fields implemented by StrideBV scheme. Therefore, no cross-pipeline link is needed. We refer to this architecture as **Design II**. Table 2 summarizes the implementations of **Design I** and **Design II**.

In order to process more rules, we need to implement multiple pipelines. Restricted by the on-chip resources, the
total number of rules that can be supported by design I is limited by the size of the block RAM. However, total number of rules that can be supported by design II is limited by the size of the distributed RAM.

E. Improving the utilization of block RAMs

As stated earlier, a half amount of the block RAM is wasted for range fields that are implemented by the RBVE scheme. We propose a simple indexing scheme to avoid wasting any block RAM. This indexing scheme is based on the observation that all the prefix fields except the source and destination IPs in the Openflow tables contain only the singleton values. In other words, \((211 - 64) = 147\) bits of a rule must not be don’t care. Therefore, it is possible to select one bit (say bit \(x\)) out of these 147 bits such that the number of rules with bit \(x = 0\) is roughly equal to that with \(x = 1\). If it is not possible, we can select more bits (say \(n\) bits) to divide all the rules into \(2^n\) segments. Then it is simple to find \(m\) out of \(2^n\) segments such that the number of rules in these \(m\) segments is roughly equal to that in the other \(2^n - m\) segments. Based on this indexing scheme, we can always divide the rules into two groups of the same size. The real implementation can be done by an additional stage. For a block RAM of 512 entries, the rules in the first group use the upper 256 entries and the rules in the other group use the lower 256 entries. Since one or more out of the 211 bits in the prefix fields are consumed in the additional stage of the indexing scheme, the number of stages in the prefix fields will be reduced at least one stage. As a result, the overall number of the stages after applying the proposed indexing scheme will not grow. Specifically, if the indexing scheme picks one bit, the number of prefix stages reduces from 43 to 42.

F. Pipelined Priority Encoder

The pipelined priority encoder is to aggregate multiple pipelines together to find the final matched rule. The input of pipelined priority encoder is multiple multi-match results produced by each pipeline. And the output is the ID of the single highest-priority matched rule. Because the number of pipelines is often large, this encoder uses the pipeline to reduce clock period. The rules are arranged from top to bottom in the decreasing order of rule priorities. Therefore, this priority encoder outputs the index of the first set bit from top in the bitmap of the matching results. The design of the priority pipeline for the set of 6K rules is shown in Figure 15. Figure 16 shows the overview of the complete pipeline architecture.

IV. EXPERIMENTAL RESULTS

We conducted experiments using Xilinx ISE Design Suite 14.7 where speed grade is set to -2. The device used in the experiments is Xilinx® 6, Virtex-6 XC6VLX760 [19]. Virtex-6 XC6VLX760 has 1200 I/O pins, 25,920 Kb BRAM (720 36Kb BRAM blocks), 118,560 logic slices where SLICEM can be configured to realize a large distributed RAM (up to 8,280 Kb). Each slice contains four 6-input LUTs and 8 flip-flops. The numbers of rules in Design I and Design II are 6K/3K (single/dual port) and 5K, respectively.

The resource utilization and performance statistics of the Design I and II are shown in Table 3. The SLICEM that can be configured as distRAM limits the size of rule tables in the implementation. In Design I with single port, one bit from 211 prefix field bits is used for the block RAM utilization improvement stated in section III-D. As a result, only 42 stages for prefix fields are needed. Utilizing 99% of the distRAM range pipelines (4 stages) Prefix pipelines (43 stages)

![Figure 13. The block RAM layout for RBVE scheme.](image)

![Figure 14. A super-pipeline of range pipelines using RBVE and prefix pipelines using StrideBV with cross-pipeline links.](image)

![Figure 15. The block RAM layout for RBVE scheme.](image)

![Figure 16. The block RAM layout for RBVE scheme.](image)

Table 2. Comparisons of Design I and II.

<table>
<thead>
<tr>
<th>Implementation</th>
<th>Ranges</th>
<th>Prefixes</th>
</tr>
</thead>
<tbody>
<tr>
<td>Design I</td>
<td>RBVE scheme Stride = 8 bits blockRAM 36 ranges/group</td>
<td>StrideBV Stride = 5 bits distRAM 16 ranges/group</td>
</tr>
<tr>
<td>Design II</td>
<td>SSC scheme Stride = 4 bits distRAM 16 ranges/group</td>
<td></td>
</tr>
</tbody>
</table>

2168-6750 (c) 2015 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
In this paper, we proposed two designs for processing the range fields. The proposed RBVE scheme for range fields is similar to the StrideBV scheme in that both use a special encoding to store the pre-computed results in memory. The proposed SSC scheme is a simple sub-range comparison method that uses pipeline to sequentially find matching result. By performing experiments on Xilinx Virtex-6 XC6VLX760 FPGA device, the proposed designs can achieve the throughputs of 380 MHz and 566 MHz based on single-ported and dual-ported memory.

**Figure 15. The pipelined priority encoder of 6K rules.**

**Figure 16. Multi-pipeline System Overview.**

provided in FPGA device can support up to 6K rules. In addition, the block RAM for range fields takes 59%. In Design I with double port, the number of supportable rules is 3K because double ported distRAM doubles the usage of LUTs in SLICEM. However, the throughput can be doubled. In Design II, only the distRAM can be used because the constraint of 512 entries in block RAM is hard to overcome. We set the stride of range fields to four because large distributed RAM causes serious clock degradation. Compared to Design I, the additional distRAM used for range fields in Design II is 128 bits per rule. Totally, 5K rules can be supported.

We will compare our proposed designs with other related work based on StrideBV algorithm or OpenFlow rule sets as shown in Table 4. Depending on ruleset characteristic, the performance of the scheme proposed in [21] varies because it uses decision tree based method. Others use StrideBV algorithm and so the performance will not vary. Scalable and Modular Architecture proposed in [22] supports 5-field rules and can achieve very high throughput. To our knowledge, our proposed designs are the first range supported method that can achieve the throughput of 380 MHz. Method in [18] can also sustain a high throughput but its range field design is not complete and thus cannot always get correct results. Methods in [22] also support a range processing solution. The proposed designs are also much faster than the original paper [15] and even store 10 times larger ruleset.

**V. CONCLUSION**

In this paper, we proposed two designs for processing the range fields. The proposed RBVE scheme for range fields is similar to the StrideBV scheme in that both use a special encoding to store the pre-computed results in memory. The proposed SSC scheme is a simple sub-range comparison method that uses pipeline to sequentially find matching result. By performing experiments on Xilinx Virtex-6 XC6VLX760 FPGA device, the proposed designs can achieve the throughputs of 380 MHz and 566 MHz based on single-ported and dual-ported memory.

**REFERENCES**


Table 3: Performance of the proposed schemes.

<table>
<thead>
<tr>
<th>Approach</th>
<th>Throughput (MHz)</th>
<th>Latency (Clocks)</th>
<th># of Fields Supported</th>
<th>Speed Depending on ruleset</th>
<th>Range Field Support</th>
<th># of Rules Supported</th>
</tr>
</thead>
<tbody>
<tr>
<td>Design I-dual port</td>
<td>566</td>
<td>53</td>
<td>12</td>
<td>No</td>
<td>Yes</td>
<td>3K</td>
</tr>
<tr>
<td>Design II</td>
<td>380</td>
<td>57</td>
<td>12</td>
<td>No</td>
<td>Yes</td>
<td>5K</td>
</tr>
<tr>
<td>Scalable Packet Classification [21]</td>
<td>125</td>
<td>36</td>
<td>12</td>
<td>Yes*</td>
<td>No</td>
<td>1K</td>
</tr>
<tr>
<td>High-performance architecture [18]</td>
<td>373</td>
<td>72</td>
<td>15</td>
<td>No</td>
<td>Yes*</td>
<td>1K</td>
</tr>
<tr>
<td>StrideBV [15]</td>
<td>235</td>
<td>30</td>
<td>5</td>
<td>No</td>
<td>No</td>
<td>0.5K</td>
</tr>
<tr>
<td>Scalable and Modular Architecture [22]</td>
<td>526</td>
<td>28</td>
<td>5</td>
<td>No</td>
<td>Yes*</td>
<td>28K</td>
</tr>
</tbody>
</table>

*: Range field is supported but incomplete


Yeim-Kuan Chang received PhD degree in computer science from Texas A&M University, College Station, in 1995. He is currently a professor in the Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan. His research interests include Internet router design, computer architecture, and multiprocessor systems.

Chun-Sheng Hsueh received the MS degree in computer science and information engineering from National Cheng Kung University, Tainan, Taiwan, in 2014. His research interests include high-speed packet processing in hardware and deep packet inspection architectures.