stringalgo::sufex

Thoughts, suggestions and reports on string algorithms, natural language processing and C++

Making Trigrams


Today I want to quickly introduce the first building block of the suffix array library that is part of Sufex: Trigram generation.

One of the most important modern algorithms for suffix array generation is the DC algorithm, also known as Skew algorithm (DOI 10.1007/3-540-45061-0_73), discovered by Juha Kärkkäinen and Peter Sanders in 2003.

I won’t describe the algorithm here; its first step, however, is to generate trigrams from the input text and sort them lexicographically. The algorithm does not generate all trigrams, only those located at positions not divisible by three, i.e. positions p such that p % 3 is 1 or 2. The module of Sufex that performs this step is sux/trigram.hpp.
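To make the position filter concrete, here is a minimal standalone sketch (illustrative only, not the Sufex implementation; it also ignores the padding at the end of the text that the real algorithm performs):

```cpp
#include <cassert>
#include <string>
#include <vector>

/* Collect the trigram starting at every position p with
   p % 3 == 1 or p % 3 == 2 (positions with fewer than three
   characters left are simply skipped in this sketch). */
inline std::vector<std::string> naive_23trigrams(const std::string &text)
{
  std::vector<std::string> result;
  for (std::string::size_type p = 1; p + 3 <= text.size(); ++p)
    if (p % 3 != 0)
      result.push_back(text.substr(p, 3));
  return result;
}
```

For the example string "abcabeabxd" used later in this post, this yields the trigrams at positions 1, 2, 4, 5 and 7.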

Trigram implementation

In later steps of the algorithm, Sufex needs to access the trigrams at the place where they are found in the input text. Therefore, the trigram implementation must include either the relative position of the trigram within the text, or a pointer to the character location.

During the sorting step, lots of trigrams must be compared to each other, which requires access only to the three characters that belong to each trigram. For better cache-efficiency, it is therefore useful to actually store these three characters inside the trigram representation, along with the position information or pointer. However, this isn’t necessary.

Sufex provides three different implementations of trigrams, and the sorting algorithms are designed to work with any of them. The three types are specified by the values of an enumeration called sux::TGImpl, while the actual data type for trigrams is defined as a struct called sux::TrigramImpl:

Trigram implementations
namespace sux {
  /* Three implementations of trigrams: */
  enum class TGImpl { tuple, arraytuple, pointer };

  /* Data type for the trigram implementation: */
  template <TGImpl tgimpl, typename Char, typename Pos>
  struct TrigramImpl;
}

Char and Pos represent data types for individual characters and position/offset information, respectively. Hence, when using char for characters and long for positions, a single trigram can be stored in one of three data structures shown below:

Trigram implementations
using namespace sux;

/* Using a single pointer to represent the trigram: */
TrigramImpl<TGImpl::pointer,char,long>    trigram1;

/* Using std::tuple<long,char,char,char> to encode position
   and contents of the trigram: */
TrigramImpl<TGImpl::tuple,char,long>      trigram2;

/* Using std::tuple<long,std::array<char,3>> to encode position
   and contents of the trigram: */
TrigramImpl<TGImpl::arraytuple,char,long> trigram3;

The first one, TGImpl::pointer, consists of a single pointer char * only and is therefore small in memory but relatively slow during sorting.

The second one, TGImpl::tuple, uses a std::tuple to store both the relative position of the trigram and its contents. This is the fastest implementation during sorting.

The third one, TGImpl::arraytuple, uses a combination of std::tuple and std::array to store the relative position and the contents of the trigram. It is a little slower than TGImpl::tuple, but can more easily be extended to store n-grams with higher n, which may become important in alternative implementations of suffix array construction.
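The size difference between the three variants can be illustrated with rough stand-ins for the three layouts (these type aliases are my approximations; the actual sux::TrigramImpl specializations may differ in members and padding):

```cpp
#include <array>
#include <cassert>
#include <tuple>

/* Approximate stand-ins for the three trigram layouts: */
using pointer_rep    = const char *;                          /* TGImpl::pointer    */
using tuple_rep      = std::tuple<long, char, char, char>;    /* TGImpl::tuple      */
using arraytuple_rep = std::tuple<long, std::array<char, 3>>; /* TGImpl::arraytuple */

/* On a typical 64-bit platform this gives 8 bytes for the pointer
   variant versus 16 bytes for each of the other two. */
```

The pointer variant saves memory at the price of a pointer dereference per comparison, which is exactly the trade-off visible in the benchmark at the end of this post.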

Generating trigrams

The DC algorithm involves creating an array of trigrams at positions p such that p % 3 is 1 or 2. For this, sux::TrigramMaker can be used:

Trigram generation
  using namespace std;
  using namespace sux;
  string input { "abcabeabxd" };

  /* Generating trigrams. */
  auto trigrams =
      TrigramMaker<TGImpl::tuple,string::value_type,size_t>::make_23trigrams(begin(input),end(input));

  /* Printing them. */
  for (const auto &trigram : trigrams)
    cout << triget1(trigram) << triget2(trigram) << triget3(trigram) << '\n';
  cout.flush();

Evidently, TrigramMaker takes three template arguments:

  • The trigram implementation (TGImpl::tuple, TGImpl::arraytuple or TGImpl::pointer);
  • The character type (e.g. char);
  • The position type (e.g. unsigned long).

Its make_23trigrams function must be applied to a range of characters specified by two iterators. The value returned is a std::vector of trigrams.

There is a convenience function sux::string_to_23trigrams() that can be used if the input is stored in a std::basic_string:

Trigram generation
  using namespace std;
  using namespace sux;
  string input { "abcabeabxd" };

  /* Generating trigrams. */
  auto trigrams = string_to_23trigrams<TGImpl::arraytuple>(input);

  /* Printing them. */
  for (const auto &trigram : trigrams)
    cout << triget1(trigram) << triget2(trigram) << triget3(trigram) << '\n';
  cout.flush();

It takes only one template argument, the trigram implementation, which defaults to TGImpl::tuple.

Sorting trigrams

Trigrams are sorted in three passes of radix sort. During the first pass, trigrams are put into buckets according to the third character of each trigram. During the second pass, they are stably sorted into buckets defined by the second character, and during the third pass they are stably sorted into buckets defined by the first character. Each pass can be parallelized by running several radix-sort threads in parallel, each on a different chunk of the trigrams.
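The three-pass scheme can be sketched as follows, operating on plain 3-character strings rather than sux trigram objects (illustrative only, not the sux::TrigramSorter implementation). Each pass is a stable counting sort on one character position:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

/* One stable counting-sort pass over the character at index pos. */
inline void counting_pass(std::vector<std::string> &v, int pos)
{
  std::vector<std::size_t> count(257, 0);
  for (const auto &s : v)
    ++count[static_cast<unsigned char>(s[pos]) + 1];
  for (int c = 1; c <= 256; ++c)         /* prefix sums -> bucket offsets */
    count[c] += count[c - 1];
  std::vector<std::string> out(v.size());
  for (const auto &s : v)                /* stable placement into buckets */
    out[count[static_cast<unsigned char>(s[pos])]++] = s;
  v.swap(out);
}

/* Radix sort: third character first, then second, then first. */
inline void radix_sort_trigrams(std::vector<std::string> &v)
{
  for (int pos = 2; pos >= 0; --pos)
    counting_pass(v, pos);
}
```

Because each pass is stable, the order established by the later characters survives the passes on the earlier ones, which is what makes the final order lexicographic.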

The implementation is encapsulated in the sux::TrigramSorter template, which can be applied to any implementation of trigrams, i.e. regardless of whether TGImpl::tuple, TGImpl::arraytuple or TGImpl::pointer is used, and regardless of the data types used for characters and positions.

Again, there is a convenience function sux::sort_23trigrams(), which takes two arguments:

  • The trigram container (i.e. what was returned by make_23trigrams())
  • The desired number of threads (defaults to 1)

Hence, the easiest way to create the 2,3-trigrams from a std::string and sort them using 2 parallel threads is this:

Trigram generation and sorting
  using namespace std;
  using namespace sux;

  /* Input. */
  string input { "abcabeabxd" };

  /* 2,3-Trigrams. */
  auto trigrams = string_to_23trigrams(input);

  /* Trigram sorting using 2 parallel threads. */
  sort_23trigrams(trigrams,2);

  /* Printing the results. */
  for (const auto &trigram : trigrams)
    cout << triget1(trigram) << triget2(trigram) << triget3(trigram) << '\n';
  cout.flush();

Running times for different trigram implementations

I’ve compared running times of trigram sorting for the three implementations of trigrams, and for two different sorting algorithms: The radix sort described above, and std::stable_sort.

The comparison involved running each version once on a dual-core Intel machine (2 T9300 cores, L1-cache 32KB per core, L2 cache 6144 KB for both cores, 4 GB RAM) and measuring user time, system time and real time in milliseconds. The results are reported in this Google spreadsheet.

Main observations

  1. TGImpl::tuple and TGImpl::arraytuple are equally fast, but TGImpl::pointer is considerably slower. This is no surprise, as using a pointer requires dereferencing it in each comparison step.
  2. In one-thread mode, std::stable_sort is faster than my radix sort implementation. I believe this is mainly caused by the fact that radix sort needs to determine the alphabet and count the number of occurrences of every character, in each radix pass. This is acceptable, because during the recursive steps of suffix array construction (not yet implemented), an integer alphabet is used, which does not require this step.
  3. In multithreading mode, radix sort is much faster in terms of real time (yellow bars) than the built-in std::stable_sort (which runs on one thread only).

Measuring CPU Time, Boost and C++11


The first set of algorithms I am going to discuss will be related to the elementary steps of suffix array construction, in particular trigram sorting and lexicographic renaming, and I will start a series of blog posts about this soon.

However, to be able to compare various implementations of those algorithms, it’s important to use a clearly defined and consistent approach to measuring time. In addition to using Linux tools like time and ps, I would like to be able to build precise time-measurements into the C++ code, so as to be able to measure precisely what amount of time is spent in which part of the program.

Two types of time

There are two types of time I need to be able to measure:

  • Real time (aka wall clock time): The actual time that passed between a predefined starting point and an end point, measured in seconds.

  • CPU time: The number of ticks (of the CPU clock) that passed while the CPU was busy performing the process being measured. This does not include time spent during non-CPU related activities, such as IO, and it does not include time the CPU spent performing other processes. While CPU time is measured in clock ticks, it is usually converted to seconds by dividing it by the number of clock ticks per second, which on Linux you can retrieve using a system call: ::sysconf(_SC_CLK_TCK).
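The tick-to-seconds conversion just described can be sketched in a few lines on Linux (a minimal sketch of my own, not part of any library discussed below):

```cpp
#include <sys/times.h>   /* ::times, struct tms */
#include <unistd.h>      /* ::sysconf, _SC_CLK_TCK */
#include <cassert>

/* User and system CPU time of the current process, in seconds. */
struct cpu_seconds { double user; double system; };

inline cpu_seconds current_cpu_seconds()
{
  struct tms t;
  ::times(&t);   /* fills in CPU times measured in clock ticks */
  const double ticks_per_sec =
      static_cast<double>(::sysconf(_SC_CLK_TCK));
  return { t.tms_utime / ticks_per_sec,
           t.tms_stime / ticks_per_sec };
}
```

This is essentially what happens under the hood of the combined clock described later in this post, except that the real implementation keeps the result in integer units rather than floating point.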

There is a third type of time that will be important: Thread-specific CPU time, but I won’t talk about this in the current post.

Library support

C++11

C++11 offers a convenient method to measure real time, but not CPU time, through the <chrono> library. Specifically, there are three functions that return a representation of the current point in time:

std::chrono::system_clock::now()  // System clock, may be adjusted by the OS while
                                  // the process is running, e.g. when the system
                                  // time is changed manually or by NTP

std::chrono::steady_clock::now() // Steady clock, guaranteed to run steadily from the
                                 // moment the process is started to its end

std::chrono::high_resolution_clock::now() // Steady clock with high resolution

The existence of all three is required by the Standard, so they can be used regardless of the platform the code is written for. The std::chrono::time_point object returned by these functions can easily be used to compute the time difference between two time points (using operator-), and that difference can be readily converted to any unit of choice (using std::chrono::duration_cast):

auto start = std::chrono::system_clock::now();
/* .. perform work .. */
auto end = std::chrono::system_clock::now();

int elapsed_seconds = std::chrono::duration_cast<std::chrono::seconds>
                         (end - start).count();

std::cout << "Elapsed time: " << elapsed_seconds << " seconds" << std::endl;

The benefits of this are portability and convenience, but unfortunately this can only be used to measure real time, not CPU time.
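For completeness, standard C++ does offer one portable, if coarse, CPU-time facility outside <chrono>: std::clock from <ctime>. It measures the CPU time of the whole process, with no user/system split and implementation-defined resolution, so it is no substitute for the clocks discussed below, but it can serve as a quick sanity check:

```cpp
#include <cassert>
#include <ctime>

/* Approximate CPU seconds consumed by the process so far.
   Resolution and exact semantics are implementation-defined. */
inline double process_cpu_seconds()
{
  return static_cast<double>(std::clock()) / CLOCKS_PER_SEC;
}
```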

Boost

The Boost library provides a similarly designed framework with the same clocks for real time (in fact, of course, std::chrono was designed using boost::chrono as a model). In addition, there are three more clocks:

boost::chrono::process_user_cpu_clock::now()   // User CPU time since the process
                                               // started (including user CPU time
                                               // of child processes)

boost::chrono::process_system_cpu_clock::now() // System CPU time since the process
                                               // started (including system CPU time
                                               // of child processes)

boost::chrono::process_real_cpu_clock::now()   // Wall-clock time, measured using the
                                               // CPU clock of the process

Boost also provides a combined clock, boost::chrono::process_cpu_clock, which gives access to time points for all three types of CPU clock. This is also portable and convenient, but I had a number of problems with this:

Problems with Boost

  • boost::chrono::time_point objects returned by the Boost clocks are not compatible with std::chrono::time_point, which makes the code look confusing when you use time points and duration objects from both boost::chrono and std::chrono in parallel.

  • The Boost CPU clocks measure time in CPU ticks and immediately convert this to nanoseconds. The problem on typical Linux installations is that a clock tick is a relatively long unit of time; on my machine it is 10 milliseconds. This means that even a relatively short measurement, e.g. 2 seconds of CPU time, is represented internally by the Boost library as 2 billion nanoseconds. The data type used to store the value is std::clock_t, which on 32-bit Linux is typically a signed 32-bit integer, so 2 billion is very close to overflow. As a result, you get overflows and negative run times even for processes of just a few seconds (even sub-second if several threads are involved).

  • There doesn’t seem to be a pretty-print function, nor unit converters, for the combined user/system/real clock Boost provides. I am usually interested in measuring all three of them, but boost::chrono::process_cpu_clock::now() returns time point objects that are triples, and none of the convenient duration casts and no pretty-printing seems to be available for them.
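The overflow arithmetic behind the second point above can be checked directly: with a signed 32-bit representation and nanosecond units, the longest representable duration is only a little over two seconds.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

/* Longest duration, in seconds, representable when a value of at most
   max_value stores a duration counted in units_per_second units. */
constexpr double max_seconds(std::intmax_t max_value,
                             std::intmax_t units_per_second)
{
  return static_cast<double>(max_value) / units_per_second;
}

/* A signed 32-bit integer holding nanoseconds overflows after ~2.15 s: */
constexpr double ns_headroom_32bit =
    max_seconds(std::numeric_limits<std::int32_t>::max(), 1000000000);
```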

New implementation: rlxutil::combined_clock

What I wanted is a clock that returns time points which include user, system and real time, where real time is measured using the high-resolution clock from std, and user and system time are measured using a platform-specific implementation, but using std::chrono::time_point objects (and std::chrono::duration objects when the difference between two time points is calculated).

I ended up implementing this myself. It now exists in the form of the util/proctime.hpp header within the sufex repository. The main class is rlxutil::combined_clock<Precision>. Note the template parameter Precision: You can use any of the std::ratio types defined in <ratio>, e.g. std::micro or std::nano as argument. This will determine the internal representation of the time points: As nanoseconds when std::nano is used, as microseconds when std::micro is used, and so on.

Usage

Here is an example of how it’s used:

combined_clock usage
#include "util/proctime.hpp"

#include <ratio>
#include <chrono>
#include <thread>
#include <utility>
#include <iostream>

int main()
{
  using std::chrono::duration_cast;
  using millisec   = std::chrono::milliseconds;
  using clock_type = rlxutil::combined_clock<std::micro>;

  auto tp1 = clock_type::now();

  /* Perform some random calculations. */
  unsigned long step1 = 1;
  unsigned long step2 = 1;
  for (int i = 0 ; i < 50000000 ; ++i) {
    unsigned long step3 = step1 + step2;
    std::swap(step1,step2);
    std::swap(step2,step3);
  }

  /* Sleep for a while (this adds to real time, but not CPU time). */
  std::this_thread::sleep_for(millisec(1000));

  auto tp2 = clock_type::now();

  std::cout << "Elapsed time: "
            << duration_cast<millisec>(tp2 - tp1)
            << std::endl;

  return 0;
}

Output generated by this program:

Elapsed time: [user 40, system 0, real 1070 millisec]

Implementation

The static now() function generates a time point by calling ::times to retrieve user and system time, as well as calling std::chrono::high_resolution_clock::now() to retrieve real time. The user and system times are then converted from ticks to the unit determined by Precision. This requires a tick factor, which depends on the number of ticks per second, ::sysconf(_SC_CLK_TCK), and on Precision:

rlxutil::tickfactor
/* Code related to log messages removed. */
template <typename Precision>
static std::intmax_t tickfactor() {
  static std::intmax_t result = 0;
  if (result == 0)
  {
    const std::intmax_t ticks_per_sec = ::sysconf(_SC_CLK_TCK);
    if (ticks_per_sec <= 0 || ticks_per_sec > Precision::den)
      result = -1;   /* tick rate unusable for this precision */
    else
      result = Precision::den / ticks_per_sec;
  }
  return result;
}

The key differences from the Boost implementation are:

  • We use std::intmax_t instead of clock_t, which on typical systems (including 32-bit systems) provides us with 64 bits to store the internal representation of the time point;

  • We use the unit suggested by Precision rather than defaulting to nanoseconds.

Both help avoid overflows and spurious negative duration results. Specifically, the Precision parameter can be used by the caller to obtain a representation that suits the purpose: For measuring very short-lived processes, std::micro may be best, while longer processes can be measured using std::milli.

The clock data type itself has been implemented following the requirements by the C++ Standard.

rlxutil::combined_clock interface
template <typename Precision>
class combined_clock
{
  static_assert(std::chrono::__is_ratio<Precision>::value,
      "Precision must be a std::ratio");
public:
  typedef rep_combined_clock<Precision>                     rep;
  typedef Precision                                         period;
  typedef std::chrono::duration<rep,period>                 duration;
  typedef std::chrono::time_point<combined_clock,duration>  time_point;

  static constexpr bool is_steady = true;

  static time_point now() noexcept;
};

The internal representation of a time point, combined_clock::rep, is a tuple nested inside a struct:

rlxutil::rep_combined_clock definition
template <typename Precision>
struct rep_combined_clock
{
  typedef Precision                             precision;
  typedef std::tuple<
      std::intmax_t,
      std::intmax_t,
      std::chrono::high_resolution_clock::rep>  rep;

  rep _d;
};

The struct wrapper is necessary to distinguish the representation clearly from tuple types that originate elsewhere: the operator- and duration_cast functions used to calculate elapsed time accept only the struct defined above, not general tuples.

The functions below were implemented inside namespace std to enable pretty-printing, elapsed-time calculation and duration-casts:

Convenience functions
template <typename Precision, typename Period>
std::ostream &operator<<(
    std::ostream &strm,
    const chrono::duration<rlxutil::rep_combined_clock<Precision>,Period> &dur);

template <typename Pr, typename Per>
constexpr duration<rlxutil::rep_combined_clock<Pr>,Per>
operator-(
    const time_point<rlxutil::combined_clock<Pr>,
                     duration<rlxutil::rep_combined_clock<Pr>,Per>> &lhs,
    const time_point<rlxutil::combined_clock<Pr>,
                     duration<rlxutil::rep_combined_clock<Pr>,Per>> &rhs);

template <typename ToDur, typename Pr, typename Period>
constexpr duration<rlxutil::rep_combined_clock<Pr>,typename ToDur::period>
duration_cast(const duration<rlxutil::rep_combined_clock<Pr>,Period> &dur);

Ultimate Goal: Better Natural Language Processing


Natural language processing has become big. Almost every university employs at least a few researchers specializing in it, almost every major internet company hires NLP specialists to help analyse the natural language part of “big data”, and hundreds of workshops and conferences are held every year.

Its methods have changed, too: Several decades ago, ideas based on mathematical logic dominated semantic analysis, and ideas based on phrase structure grammars dominated syntactic analysis. However, for the past ten years or so, statistical machine learning and large-scale corpus studies have dominated the field.

This blog is dedicated to its own variety of natural language processing. Its principles are different from those of today’s mainstream NLP, and they are different from most (though probably not all) of the NLP and computational linguistics of the past. The key elements are:

  • Lexeme-specific rules: The syntax of a language is described using rules, but there are far more rules than in traditional phrase structure grammars, and there are many rules that apply to a single lexeme only.

  • Structural analogy: Sentence structures that cannot be analysed using existing rules are analysed in the same way as sentences that are structurally similar or analogous. This notion of structural analogy will be defined in terms of predetermined transformation rules as well as approximate (fuzzy) matching against corpora of example sentences.

  • Interactive refinement: The formal description of the grammar (lexeme-specific rules, transformation rules and other elements) is created iteratively by testing intermediate stages against large corpora. To this end, a specialized corpus query system (CQS) is used.

A lot of this is still very unclear. I started my activities by creating a CQS, and doing this involved a significant amount of research into the architecture of search engines, and the data structures and algorithms they use.

This blog is going to describe my approach to NLP as it evolves, and it will do so in great detail. I’ll start with the CQS, which I decided to re-implement from scratch. I’ll blog about my choice of algorithms and implementation strategies, as well as specific – and sometimes very minor – programming techniques as I rebuild the CQS. This alone will take many months, perhaps more than a year.

I have also started to work on the definition of the lexeme-specific rule framework and prototypical designs for a parser. Posts related to this will start to appear in the blog soon, perhaps in a few weeks or months from now. Over time, the focus will gradually shift from string algorithms and implementation strategies towards grammar description and linguistics, and the blog title, Algorithms on Strings, will become an increasingly loose description of its contents.

My programming language is C++ (C++11 and a lot of Boost), and the operating system is Linux. The hardware I use includes both Intel-based PCs and large Xeon-based servers with tens of CPU cores and 1 TB of RAM.