Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Design

The "All-Pairs Closest Points" Problem


Jan03: Algorithm Alley

William is the senior vice president of Technical Support Inc. and a part-time computer science faculty member at the University of Nebraska at Omaha. He can be contacted at [email protected].


Each year, my wife and I dedicate one day near the end of the year to holiday shopping. I take the day off, the kids go to their grandparents, and we hit the road with plenty of cash. Last year, we dropped off the kids, hopped into the car, and I was asked the inevitable "Where to?" question. I immediately went with the safe answer: "I don't know, where do you want to go?"

"Well, are there any toy stores close together so that we don't have to drive as much?" she asked.

Which, of course, requires an efficient algorithm for finding the pair {A,B} of map coordinates from a set of map coordinates S, such that the pair {A,B} has the minimal Euclidean distance between all pairs in S. The Euclidean distance, remember, is discovered through the Pythagorean Theorem shown in Example 1.

Each map coordinate can be thought of as the x,y-coordinates of the point on the map (in this case, the front door of the toy store). I use the upper-right quadrant for the coordinates so that x increases left to right and y increases bottom to top. Although we have hills, I assume that all toy stores are at the same altitude. The challenge is to find the pair {A,B} such that this distance is minimal among all possible pairs. The cardinality of the set S is |S|, the number of elements in it. Obviously, there are |S|2 pairs of points, and the naive among us would resort to an algorithm that is O(|S|2) to locate the closest pair of toy stores—just check all possible points against all of the other ones. This method is demonstrated in the source code (available electronically; see "Resource Center," page 5) as a check of the alternate technique. But, of course, we need to be fuel efficient, so I really want to use a better algorithm.

Consequently, I describe here a divide-and-conquer algorithm for solving this "all-pairs closest points" problem. The algorithm works by recursively partitioning the set of points into two halves, then solving for the halves plus a "strip" in the middle. I show that the overall algorithm is O(|S|log|S|), which is better than the naive approach.

To find the closest pair, the coordinate set must first be sorted by the value of x for each coordinate. This takes O(|S|log|S|) to sort, so as far as the overall algorithm complexity is concerned, this doesn't cost anything. At various times, the coordinate set also needs to be sorted by y value. The same argument applies here—from the standpoint of the algorithm complexity, this is free. Because of these requirements, the algorithm simply sets aside two arrays—one containing indices into the coordinates so that scanning the array gives coordinates in x order, and the other gives them in y order. This is a preprocessing step. Sorting once is sufficient for the whole algorithm.

Once the sorted arrays are ready, the algorithm recursively subdivides the map into halves, finding the closest pair of points in each half. The smaller of these two is almost the correct answer. First, imagine a vertical line through the map called the "slicing line." This line can be positioned so that it puts half of the points in S on the left of the line, and half of the points in S on the right of the line (Figure 1). If the set of map coordinates happened to be sorted by the x map coordinate, this takes essentially no time to position the line because I know that I have |S| points. Thus, if the points are sorted, |S|/2 is the halfway point on the map as far as x-coordinates are concerned. Anything before this point in the sorted array is in the left, and anything after is in the right. (There is a subtlety here, but for now, this makes things simple.)

Now imagine that I make two recursive calls, which return the closest pair to the left of the slicing line and the closest pair to the right of the slicing line. While I've yet to describe the algorithm in detail, if you can assume it is possible to find the closest pair for the entire map—certainly it is possible to find the closest pair for half of it. Make these left and right recursive calls based on the slicing line; then save their return values. Call these pairs PL and PR with distances L and R, respectively. One or the other is smaller; call this PLR at a distance LR. You'd think that this shorter distance is the pair of points that is the closest in the set S.

However, when it comes to shopping, things are never as easy as they sound. That's the case here, for I could have a point just barely on the left of my slicing line that is close to a point just slightly to the right of my slicing line. The distance between them could be less than LR. This pair is not in the left half, so it was not found in the recursive call for the left side of the map. Neither is it in the right half for the same reason.

It is necessary to check for this condition on the set of coordinates so that I don't inadvertently miss the actual closest pair. The area where this could be a problem lies within a vertical strip centered on the vertical slicing line, and extends LR on the left and LR on the right; see Figure 2.

You can likely now see the problem: A point may be slightly to the left of the slicing line, but within LR of it, and another point could be slightly on the right of the slicing line yet within LR of it. Also, it is possible to have a point that is on the slicing line but considered "in the other half" during recursion, resulting in the need to check +/-LR. You need to check any and all points within this strip to make sure you have not missed the real closest pair of points.

Unfortunately, in some strange arrangements of points, the number of coordinates to check within this strip could be very close to |S|. So deciding upon a method to efficiently check this strip is crucial to the efficiency of the algorithm—you don't want to have to resort to checking all |S|2 points once again. If you are clever, this can be done in time proportional to the number of points in the strip; that is, O(|M|) for M strip points. Remember that I said that sorting by y-coordinate was a setup step for the algorithm. Here's why.

Suppose that only the toy stores within the vertical strip were sorted by their y-coordinates. This will let you order the coordinates going up from the bottom, or south to north, on the map. You could then check the lowest point, call it P1, against the next lowest, P2 , the second to next lowest, P3, the third to next lowest, P4, and so on. At first blush, this appears to force you once again to |S|2 or at least |M|2, checking P1 against P2, P3, P1,...|M|. But I am checking them in order of their y-coordinates. Once the distance between P1 and some other point, Pn, above it differs by a distance greater than LR, there is no further need to check any points above Pn. All other points are even farther above Pn, and so no other point in M could be less than LR away from P1.

This severely limits the number of actual tests that must be made in the strip. Since nothing that is greater than LR above a point needs to be considered, you can think of a bounding box that is LR left of the slicing line, LR right of the slicing line, and LR above the point you need to check. This is the close-up view shown in Figure 3.

Figure 3 is one of a small set of worst-case scenarios where the point in question is on a corner of the area you are concerned with. (You can quickly convince yourself that the worst-case test points will always be at the corners.) Remember that any point with a y-coordinate above the top of this rectangular area terminates the search—if you are looking in increasing order of y, there can be no more points beyond this and you can stop testing other points relative to the test coordinate.

Now consider what points could lie within the rectangular areas in Figure 3. Suppose that there was another point inside of Area B. This could be close to the test point, and also on the right of the slicing line; therefore, the distance from this point to the test point might be less than R. But if that were the case, the rectangular area would be smaller, since you would have returned this (closer) pair on the recursive call handling the right of the slicing line. If the other point inside of Area B is farther away than R, it is not better than R. You can picture an arc inside of Area B, beyond which the points are not closer and inside of which there could not be a point (Figure 4).

For this reason, the only possible improvement would be if the test point is closer to the slicing line and the other point is in Area A. Additional points could be located on the arc, but they would only tie for the closest distance and not improve upon it.

Obviously, if the test point is in the lower left corner of the region, the same argument applies to Area A, just in reverse.

A bit of work with scratch paper shows that the worst-case example in the rectangle only requires one to test four points before giving up and moving to the next point in the region. Think about the test point being on the slicing line, for instance.

If there are at most four points to test in Area A plus Area B, this is O(1); that is, constant time to test for a given point in the slice area. Because there are |M|<=|S| points in this region, the region can be completely tested in O(|S|) time, and often considerably less than this bound. The worst case happens only when all points are on the slicing line—certainly not realistic as far as the location of toy stores is concerned.

By the way, this approach is because |M| could be close to |S|. If this is not the case, and the points in S are distributed at random (as is the case in the sample code), then |M| is about |S|and you could argue that the brute-force O(|M|2) approach within the strip is not unreasonable in these cases.

Now that all of this is noted, the expense of actually sorting the coordinates within the strip needs to be eliminated. You definitely want to avoid the expense of sorting every time you check the strip, because this would cost O(|M|log|M|) every time the recursive calls return. That would tend to force the algorithm towards O(|S|log2|S|)—not as good. Instead, if you have a complete list of coordinates sorted by y (remember, I created this at the start), scanning this array retrieves the values in y order at a cost of O(|M|).

It now remains to see that overall you can find the closest pair in O(|S|log|S|) worst-case complexity. Table 1 lists the overall time requirements for the algorithm. As a result T(|S|)=2T(|S|/2)+|S|= O(|S|log|S|).

The C++ source code (also available electronically) has a few constants that can be changed at the top of the program. I selected a set of 2500 random points in an area that was 5000 units on a side. The function fill_coordinates() is charged with initializing the set of coordinates with random data. Duplicate points are allowed. The seed for the random number can be passed on the command line; while debugging, I used a shell script to check 10,000 runs of the program at a time, and to stop with the random seed printed if the program failed.

There are two functions in the program to calculate the closest pair—the first is O(|S|2), which checks the results of the more complex recursive version and is in the function closest_via_n_squared().

The actual algorithm starts in closest_via_n_log_n(), which does the preliminary work, assigning the coordinate indices to the arrays x_sort (coordinate indices sorted by x) and y_sort (coordinate indices sorted by y). It is important later for a coordinate to know where the corresponding x_sort entry is, so a final setup step is to set this pointer back to the address in the x_sort array where the entry resides. The array x_sort is never altered once it is sorted.

Once the sorts are ready, I call a function find_closest(), which is the actual recursive function to locate the closest pair. The initial call passes the sort arrays and the number of points in the region, denoted as N.

The recursive call checks the size of the arrays (that is, the number of points in the area). In small cases of one, two, or three points, the calculations are done directly on the points. A case where one point is passed returns an infinite distance. Assuming that N>3, you do the recursive calls. First decide on a split line, which can be easily determined by the middle point in the x_sort array. Once the split is located, the y_sort array is partitioned into halves, one for the left side and one for the right. This is trickier than it first appears, and many references to this algorithm explain it incorrectly.

The difficulty is semantics. Strictly speaking, you are not locating an x value (horizontal coordinate value) and partitioning around it. Rather, it is necessary to put those y_sort values that correspond to x_sort[0] through x_sort[N/2] in the left, and the remainder in the right. What's the difference? There could be coordinates with duplicate x distance values right at the split line. In other words, the true split must be by array index, not by Euclidean x distance. Splitting purely by x distance can result in the left recursive call receiving indices in x_sort that do not have a corresponding value in y_sort. The solution is that you compare by pointers, thus the need to have a coordinate know the address of its x_sort counterpart. The variable mid_ptr is used for this.

After the y_sort is split (at a cost of O(|S|)), recursive calls return values for pairs PL and PR, and distances L and R. The actual function returns an item of type pair, which contains the coordinates and their distance.

The next step is to examine the points in the strip. Using the original y_sort, any coordinates in the strip are placed into a temporary array y_strip. This is then scanned with the two for loops. The break statement (labeled "important") lets the loop with the index j terminate early. In fact, it always terminates with j<=4, just as it should. Without this statement, the two loops execute in quadratic time compared to the number of points in y_sort.

Finally, compare the best in the strip to the distances L and R, and return the smallest pair of all of them. The recursion goes all the way back to the calling function, where you have the answer.

Meanwhile, my wife and I had made a few successful stops right away. I was raring to go, having recently consumed a grandé café mocha with whipped cream. I was basking in the fact that I knew a good solution to the "all-pairs closest points" problem, when my wife continued: "What other toy store is the closest to this one?". Of course, since we were sitting in the parking lot at a particular store right at that moment, we could decide on that question in O(|S|) for that store. But I anticipated she would be asking that again and again as the day wore on, so I started thinking about how to solve the "all nearest neighbors problem"—for all toy stores, what is the closest other toy store? This can be done in O(|S|log|S|) as well. But that's next year's shopping puzzle.

DDJ


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.