Developer

C++: Removing duplicates from a range

Removing duplicates from ranges is a fairly common task in C++, but the std::unique algorithm in the Standard Template Library (STL) has some limitations. Find out how to work around them.


This article was originally published on Builder.com.

The Standard Template Library (STL) shows us how to deal with the complexity of algorithms over structures: Separate them. That’s why we have containers (std::vector, std::deque, std::list, etc.) and algorithms. To be as generic as possible, algorithms operate on ranges—sequences of elements. That is, you can apply an algorithm to only the first half of elements from a vector, if you want.

It is not an uncommon task to have to remove duplicates from a range, and the STL provides just the tool for that: the std::unique algorithm. But wait—std::unique needs a sorted range. It checks each two consecutive elements for equivalence; if you have two equivalent elements that are not consecutive, the latter won’t be removed. This renders std::unique useless when you have a nonsorted range and you want to keep it that way, which is most of the time.

I’ll show you how to implement a generic stable_unique, which will work for any range. After the duplicates have been removed, the order of the remaining elements is preserved. If an element exists more than once in a range, only its first occurrence will be preserved; the next occurrences will be removed.

As a bonus, stable_unique is easy to use and has a similar interface to the STL algorithms you’re used to. I won’t reinvent the wheel. The stable_unique is efficient, while reusing as much as possible of the existing STL algorithms, as you’ll soon see.

std::unique behavior
To understand what we need for stable_unique, first let's take a peek at std::unique and see what we can achieve with it. Listing A shows a simple example of trying to remove the duplicates from a vector. It succeeds, but the original order of the elements has been lost because we had to sort the vector first.

This points us in the right direction: Sort the elements, but keep them in the original order. Therefore, sort pointers to the elements.

Implementing stable_unique
Let’s take a shot at implementing stable_unique:
  1. 1.      Create an array (aPtr) holding pointers to the elements.
  2. 2.      Sort the array of pointers.
  3. 3.      Invoke std::unique on the array of pointers.
  4. 4.      For those elements that were removed by std::unique, remove them from the original range.

To achieve step 4, each element in aPtr, besides containing a pointer to an element, will have to contain its index as well (so that we’ll know what to remove). Let me explain this a little better. Suppose we didn’t hold any index. We would have a pointer to each element to remove. We would then have to traverse the original range until we found the element. The pseudocode would look like that in Listing B. This could be very costly; that’s why I chose to keep an index as well.

Removing efficiently
What we strive to achieve in the end is to have an array of indexes of elements to remove. Once we have that, it will be easy to remove them.

Our solution will model existing STL erase algorithms such as std::remove, std::remove_if, and std::unique. They work like this: Given a range—represented by a “first” and a “last” iterator—they move the remaining elements first and then return a “middle” iterator to the last remaining element. After the removal algorithm has been executed, the [first, middle) range contains the remaining elements, and the [middle, last) range contains the elements to be erased.

You can't make any assumption about the values that lie in the [middle, last) range. In particular, you can't assume that they are actually the values of the removed elements.

There are many ways to remove elements, given their indexes. To remove them efficiently, we should be able to go through the range only once. This implies that we should sort the array of indexes. Then, while traversing the range, we'll remove the problematic elements as we reach them (as we reach their index). Listing C shows how to do it.

Finding the right solution
This algorithm contains a flaw: You can't know the elements that were erased by std::unique—at least not right away.

std::unique is an STL erase algorithm. It returns an iterator to the last valid element. You have no idea what’s after the returned iterator, as the following example shows:
10, 100, 4, 10, 10, 4, 0, 100

After invoking std::unique, you’ll have:
10, 100, 4, 0, ??, ??, ??, ??

You can't assume anything about the ?? values. (You don’t know which elements were removed.) To overcome this, two possible solutions come to mind:
  • ·        After calling std::unique, to find out which elements should be removed, examine the indexes of the remaining elements. The indexes that are not found are the indexes of the elements to be deleted.
  • ·        Create a reverse_unique function, which will keep only the elements that would normally be removed by std::unique. The indexes that are found are the indexes of the elements to be deleted (the opposite of the previous solution).

Listing D shows a trimmed-down example of the first technique. I chose to go with the solution for creating a reverse_unique function for the following reasons:
  • ·        Implementing the first solution above is quite complicated, as Listing D shows.
  • ·        For calling std::unique, it’s harder to come up with a generic solution to remove the duplicates. (Remember: In Listing D, we just printed the indexes.)
  • ·        Usually, when calling stable_unique for a range, you’ll have fewer elements to remove than to keep. This is important, since both solutions require sorting some elements by index: for calling std::unique, the remaining elements, and for creating a reverse_unique function, the elements to be deleted. Of course, the fewer elements you have to sort, the more efficient your algorithm is. For calling std::unique, sorting is done at step 4 in the above numbered list. For creating a reverse function, the algorithm will become:
  1. 1.      Create an array (aPtr) holding pointers to the elements + their indexes.
  2. 2.      Sort the aPtr array.
  3. 3.      Invoke reverse_unique on the aPtr. This will yield the elements that should be removed; further on, we’ll need only the indexes of the elements to be removed.
  4. 4.      Re-sort aPtr by index.
  5. 5.      From the original array, remove the elements that exist in aPtr. (We know them by index.)

Note that implementing a reverse_unique function is quite simple, as Listing E shows. Now we’re ready to implement stable_unique. To better grasp it, check out  Listing F.

There’s room for improvement. As with STL erase algorithms, we should allow for predicates. In our case, stable_unique will allow for:
  • ·        A sorting predicate (a binary functor/ function that returns true if the first element is less than the second).
  • ·        An equivalence predicate (a binary functor/ function that returns true if the two elements are equivalent).

You’ll be able to call stable_unique like this:
  • ·        stable_unique( itFirst, itLast, comp, equivalence) will remove duplicates from [itFirst, itLast) range using comp as the sorting predicate and equivalence as the equivalence predicate.
  • ·        stable_unique( itFirst, itLast) will remove duplicates from the [itFirst, itLast) range; it will use std::less<>() as the sorting predicate and std::equal_to<>() as the equivalence predicate. You can download the code for stable_unique and an example here.
  • ·        The ptr_and_index keeps both a pointer to an element and its index.
  • ·        In stable_unique_impl, step {1} shows a nice way to insert elements of a different type into another container.
  • ·        compare_pointed_val compares two pointed values. You should note that if they are equivalent, we have an additional test: by index. If that test weren’t there, preserving the original order of the elements would not be possible. Comment it out, and see what happens.
  • ·        is_at_index is constructed with a sorted array of indexes. For an element, it returns true if it’s at one of those indexes. It’s used in correlation with Private::remove_if.
  • ·        Private details are found in namespace Private. This clearly separates the hidden details from the interface we show to the clients.
  • ·        stable_unique delegates to stable_unique_impl to make it work for VC6 as well. This is because the implementation of STL that comes with VC6 does not allow using std::iterator_traits as expected; it just has a workaround.

Conclusion
STL provides us with many algorithms, but you can use existing algorithms to build new ones. The stable_unique is such an algorithm. You can use it to complement STL algorithms to remove duplicates efficiently.

 

 

 

 

 

 

Editor's Picks

Free Newsletters, In your Inbox