Fast String Search on Multicore Processors


The Heavy-Duty Solution

The aforementioned solution is fast but lacks capacity. Even in the series configuration, it supports a dictionary of no more than 1200 patterns. It also handles only 32 input symbols, which in practice limits us to case-insensitive English text matching. In contrast, popular rule sets available for Snort comprise about 5000 rules, including both text and binary patterns, with an average length greater than 10 characters. And because the bad guys keep themselves busy, this rule set is expected to keep growing.

Because a solution with more dictionary capacity is needed, we move the STT to a roomier place: main memory. The drawback is that accessing the STT costs more, because each state transition now requires a DMA transfer. We can still cache some frequently hit states in the LS, but optimization efforts must focus on accesses to main memory, where the majority of the STT resides. We must rethink the algorithm "bottom-up" around this goal: making DMA accesses as fast as possible.

Automata spend their lives performing state transitions. This breaks down into two phases:

  • Computation. Determining the address of the next state in the STT, based on the current state and the input symbol.
  • Data transfer. Fetching the data at that address. Because the STT is now in main memory, the transfer takes far longer than the computation: roughly 13 nanoseconds to compute versus 250 to transfer.
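As a rough illustration, the computation phase might look like the following C sketch. The row-per-state STT layout, the 256-symbol alphabet, and all names here are assumptions for illustration, not the article's actual code; on the SPE, the data-transfer phase would then issue a DMA get from the computed address.

```c
#include <stdint.h>

#define ALPHABET 256            /* assumed: one next-state slot per input byte */

typedef uint32_t state_t;

/* Phase 1 -- computation: derive the main-memory address of the
   next-state entry from the current state and the input byte.
   Phase 2 -- data transfer -- would DMA sizeof(state_t) bytes
   (or a whole row) from this address into the Local Store. */
static uint64_t next_entry_addr(uint64_t stt_base, state_t cur, uint8_t in)
{
    return stt_base + ((uint64_t)cur * ALPHABET + in) * sizeof(state_t);
}
```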

Again, a single automaton yields poor utilization of both the processing power and the memory subsystem (see Figure 2). And again, we employ multiple automata to improve this. In our code, we statically schedule these automata to run cyclically. The first automaton waits for its transfer to complete, computes the transition, and starts the next data transfer. The second automaton does the same, then the third, fourth, and so on, up to the last one. At that point, another cycle begins, with the first automaton running again. Figure 3 illustrates this pipelining scheme. We put this much work between two runs of the same automaton so that the DMA latency can be exploited to do something useful.
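The cyclic schedule can be sketched in C as below; the types and names are hypothetical, and the per-automaton DMA wait/start (the MFC intrinsics on a real SPE) is elided by reading the STT directly from memory.

```c
#include <stddef.h>
#include <stdint.h>

#define N_AUTOMATA 16           /* matches the 16 concurrent transfers used later */

typedef struct {
    uint32_t state;             /* current automaton state         */
    const uint8_t *in;          /* this automaton's slice of input */
    size_t pos, len;
} automaton_t;

/* One scheduling cycle: visit each automaton in turn. On the SPE this
   is where automaton i would wait for its pending DMA, compute the
   transition, and start the next transfer before moving on to i+1. */
void run_cycle(automaton_t a[N_AUTOMATA], const uint32_t *stt, int alphabet)
{
    for (int i = 0; i < N_AUTOMATA; i++) {
        automaton_t *p = &a[i];
        if (p->pos >= p->len)
            continue;           /* this automaton has consumed its input */
        uint8_t c = p->in[p->pos++];
        p->state = stt[p->state * alphabet + c];
    }
}
```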

Figure 2: Processing power and memory subsystem.

Figure 3: Pipelining scheme.

In this approach, the bottleneck is the time elapsed between the completion of two subsequent DMA transfers, the memory gap. A new state transition can be completed in no less time than the memory gap. We use 16 automata because benchmarks tell us that the best memory gap is experienced when at least 16 concurrent transfers happen at any time. Benchmarks also prescribe not to transfer blocks larger than 64 bytes, and keep the distribution of accessed memory locations as uniform as possible.

When these conditions hold, the gap falls to 40.68 nanoseconds when all 16 SPEs in a double-Cell platform are employed. Under these assumptions, transitions happen at a frequency of 24.58 MHz, yielding a maximum overall throughput of 3.15 Gbps.
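These figures hang together if each state transition consumes one input byte and each SPE completes one transition per memory-gap interval; the following helper reproduces the arithmetic.

```c
/* One byte consumed per transition, one transition per gap per SPE:
   1e9 / 40.68 ns = 24.58 MHz per SPE; x 8 bits x 16 SPEs = 3.15 Gbps. */
double throughput_gbps(double gap_ns, int n_spes)
{
    double transitions_per_sec = 1e9 / gap_ns;          /* per SPE  */
    return transitions_per_sec * 8.0 * n_spes / 1e9;    /* to Gbps */
}
```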

We designed a first implementation respecting these ideal conditions, and evaluated it in four scenarios: full-text search, network content monitoring, network intrusion detection, and virus scanning (see Figure 4). To our disappointment, performance was largely below the theoretical bound (the red curve in the figure) and scalability was poor. This degradation is due to congestion in the memory subsystem, caused by a nonuniform state-hit distribution. In fact, a tiny percentage of the states is responsible for the vast majority of the hits. For example, the states immediately reachable from the initial state are between 0.1 and 0.5 percent of all the states, but they receive between 46.8 and 89.8 percent of all hits. Concurrent accesses to the same or adjacent states by multiple automata cause hot spots in the memory subsystem, which show up as longer gaps and lower throughput.

Figure 4: Evaluation scenarios.

To fight congestion, we employ a combination of state caching, replication, and shuffling. First, we cache in the LS the states closest to the initial one, completely avoiding memory accesses for the states that are hit most frequently. Given the limited size of the LS, we can cache only 180 states out of the roughly 50,000 in the examples considered.
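One plausible way to choose those states is a breadth-first walk from the initial state until the LS budget is full; this is a sketch under the same assumed row-per-state STT layout, not the article's code.

```c
#include <stdint.h>
#include <string.h>

#define MAX_CACHED 180          /* the LS budget quoted in the text */

/* Marks cached[s] = 1 for up to MAX_CACHED states reachable from the
   initial state (state 0) in the fewest transitions; returns the count. */
int select_cached(const uint32_t *stt, int n_states, int alphabet,
                  uint8_t *cached)
{
    uint32_t queue[MAX_CACHED];
    int head = 0, tail = 0, count = 0;

    memset(cached, 0, (size_t)n_states);
    cached[0] = 1;
    queue[tail++] = 0;          /* breadth-first from the initial state */
    count = 1;

    while (head < tail && count < MAX_CACHED) {
        uint32_t s = queue[head++];
        for (int c = 0; c < alphabet && count < MAX_CACHED; c++) {
            uint32_t t = stt[s * alphabet + c];
            if (!cached[t]) {
                cached[t] = 1;
                queue[tail++] = t;
                count++;
            }
        }
    }
    return count;
}
```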

Then we shuffle the states by randomly renumbering them. This redistributes hot spots among the memory banks, ensuring that all banks are subject to the same pressure, but it does not relieve the impact of each single hot spot. For that, we replicate the most frequently hit of the noncached states, such that different automata access different replicas of the same STT entry. This way, each replica receives only a fraction of the accesses of the original state. As a drawback, replication causes the STT to grow from approximately 50 to 800 MB.
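The renumbering itself can be as simple as a Fisher-Yates permutation of the state indices; this sketch is illustrative (the article does not specify the shuffling procedure), and the replication step, not shown, would additionally duplicate the rows of the hottest states.

```c
#include <stdint.h>
#include <stdlib.h>

/* Fills perm with a random permutation of 0..n_states-1; state s is
   then stored at STT row perm[s], spreading hot states across banks. */
void shuffle_states(uint32_t *perm, int n_states, unsigned seed)
{
    srand(seed);
    for (int i = 0; i < n_states; i++)
        perm[i] = (uint32_t)i;
    for (int i = n_states - 1; i > 0; i--) {     /* Fisher-Yates */
        int j = rand() % (i + 1);
        uint32_t t = perm[i];
        perm[i] = perm[j];
        perm[j] = t;
    }
}
```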

But these optimizations solve only half of the problem. If you view the STT as a row/column table, where each row corresponds to a state and each column to an input character, the above optimizations balance pressure on the rows but not on the columns. Pressure on the columns is unbalanced, especially with English text patterns, because most hits fall on the lowercase a-z characters, while almost none fall on characters 128-255. To counter this imbalance, we also shuffle the alphabet space: the offsets at which next states are written in an STT entry are hashed, with a hash function that also depends on the state number.
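A minimal sketch of such a state-keyed column hash, assuming 256-entry rows: XORing the input byte with a key derived from the state number permutes the columns of each row, and the permutation differs from row to row. The mixing constant is illustrative, not the one used by the authors.

```c
#include <stdint.h>

/* For a fixed state, c -> offset is a bijection on 0..255 (XOR with a
   constant), so every input byte still gets a distinct slot; different
   states use different keys, spreading column pressure. */
static inline unsigned column_offset(uint32_t state, uint8_t c)
{
    uint32_t key = state * 2654435761u;  /* illustrative multiplicative mix */
    return (c ^ (key >> 24)) & 0xFFu;
}
```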

With these optimizations, the aggregate throughput reaches 3.3-4.3 Gbps depending on the scenario (see Figure 5). This suggests that a system composed of three double-Cell blades, each with 1 GB of RAM, can be employed to match traffic at 10 Gbps, against a dictionary comprising at least 5000-25,000 patterns with an average length of 16 bytes.

Figure 5: Aggregate throughput.

