Today I want to quickly introduce the first building block of the suffix array library that is part of Sufex: Trigram generation.
One of the most important modern algorithms for suffix array generation is the DC algorithm, also known as Skew algorithm (DOI 10.1007/3-540-45061-0_73), discovered by Juha Kärkkäinen and Peter Sanders in 2003.
I won’t describe the algorithm here, however, its first step is to
generate trigrams from the input text and sorting them
lexicographically. The algorithm does not generate all trigrams, but
only those located at positions not divisible by three, i.e. positions
p
such that p % 3
is 1
or 2
. The module of Sufex that performs
this step is
sux/trigram.hpp
.
Trigram implementation
In later steps of the algorithm, Sufex needs to access the trigrams at the place where they are found in the input text. Therefore, the trigram implementation must include either the relative position of the trigram within the text, or a pointer to the character location.
During the sorting step, lots of trigrams must be compared to each other, which requires access only to the three characters that belong to each trigram. For better cache-efficiency, it is therefore useful to actually store these three characters inside the trigram representation, along with the position information or pointer. However, this isn’t necessary.
Sufex provides three different implementations of trigrams, and
the sorting algorithms are designed to work with any of them. The
three types are specified by the values of an enumeration called
sux::TGImpl
, while the actual data type for trigrams is defined as a
struct called sux::TrigramImpl
:
1 2 3 4 5 6 7 8 |
|
Char
and Pos
represent data types for individual characters and
position/offset information, respectively. Hence, when using char
for characters and long
for positions, a single trigram can be
stored in one of three data structures shown below:
1 2 3 4 5 6 7 8 9 10 11 12 |
|
The first one, TGImpl::pointer
, consists of a single pointer char
*
only and is therefore small in memory but relatively slow during
sorting.
The second one, TGImpl::tuple
, uses a std::tuple
to store both the
relative position of the trigram and its contents. This is the fastest
implementation during sorting.
The third one, TGImpl::arraytuple
, uses a combination of
std::tuple
and std::array
to store the relative position and the
contents of the trigram. It is a little slower than TGImpl::tuple
,
but can more easily be extended to store n-grams with higher n, which
may become important in alternative implementations of suffix array
construction.
Generating trigrams
The DC algorithm involves creating an array of trigrams at positions p
such that p % 3 = 1,2
. For this, the sux::TrigramMaker
can be
used:
1 2 3 4 5 6 7 8 9 10 11 12 |
|
Evidently, TrigramMaker
takes three template arguments:
- The trigram implementation (
TGImpl::tuple
,TGImpl::arraytuple
orTGImpl::pointer
); - The character type (e.g.
char
) - The position type (e.g.
unsigned long
)
It’s make_23trigrams
function must be applied to a range of
characters specified by two iterators. The value returned is a
std::vector
of trigrams.
There is a convenience function sux::string_to_23trigrams()
that
can be used if the input is stored in a std::basic_string
:
1 2 3 4 5 6 7 8 9 10 11 |
|
It takes only one template argument, the trigram implementation. It
defaults to TGImpl::tuple
.
Sorting trigrams
Trigrams are sorted in three passes of radix sort. During the first pass, trigrams are put into buckets according to the third character of each trigrams. During the second pass, they are stably sorted into buckets defined by the second character, and during the third pass they are stably sorted into buckets defined by the first character. Each pass can be parallelized by running several radix-sort threads in parallel, each on a different chunk of the trigrams.
The implementation is encapsulated in the sux::TrigramSorter
template, which can be applied to any implementation of trigrams,
i.e. regardless of whether TGImpl::tuple
, TGImple::arraytuple
or
TGImpl::pointer
is used, and regardless of the data types used for
characters and positions.
Again, there is a convenience function sux::sort_23trigrams()
,
which takes two arguments:
- The trigram container (i.e. what was returned by
make_23trigrams()
) - The desired number of threads (defaults to 1)
Hence, the easiest way to create the 2,3-trigrams from a std::string
and sort them using parallel 2 threads is this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
|
Running times for different trigram implementations
I’ve compared running times of trigram sorting for the three
implementations of trigrams, and for two different sorting algorithms:
The radix sort described above, and std::stable_sort
.
The comparison involved running each version once on a dual-core Intel machine (2 T9300 cores, L1-cache 32KB per core, L2 cache 6144 KB for both cores, 4 GB RAM) and measuring user time, system time and real time in milliseconds. The results are reported in this Google spreadsheet.
Main observations
TGImpl::tuple
andTGImpl::arraytuple
are equally fast, butTGImpl::pointer
is considerably slower. This is no surprise, as using pointer requires derefencing it in each comparison step.- In one-thread mode,
std::stable_sort
is faster than my radix sort implementation. I believe this is mainly caused by the fact that radix sort needs to determine the alphabet and count the number of occurrences of every character, in each radix pass. This is acceptable, because during the recursive steps of suffix array construction (not yet implemented), an integer alphabet is used, which does not require this step. - In multithreading mode, radix sort is much faster in terms of
real time (yellow bars) than the built-in
std::stable_sort
(which runs on one thread only).