This is a short README for our implementations of eSAIS and DC3 including LCP variants as described in the paper "Suffix and LCP Array in External Memory" presented at ALENEX'13 and submitted to JEA's Special Edition.
--- Source Code Overview
The main algorithm implementations are in the directory "src/external/", inputs are generated by the classes in "src/input/" and various auxiliary algorithms are encapsulated in "src/tools/" header files. The STXXL library contained in "stxxl/" is a modified version of 1.4.1 and contains many customizations we added while implementing eSAIS.
Only one binary program is built when compiling with standard options. This program is called "esactest" and includes all algorithms via the header files. It is used to run speed tests on the various input instances and automatically checks the calculated arrays.
--- Using esactest
This except from the esactest.cc describes how to call the test program:
- This main program is designed to quickly test different external memory
- suffix sorters on a wide variety of inputs. The <input> is specified on the
- command line and can be:
- plain data files located under $HOME/sac-corpus/
- gzip compressed data files in $HOME/sac-corpus/, which are decompressed
- on-the-fly. These must be named datafile.123456.gz, where 123456 is the
- _decompressed_ size of the data file.
- artificial inputs generated by functions in the input/ sub-directory.
- verbatim strings can be processed as "simple/<string>"
- Call the program without arguments for a list of available inputs. The input
- is loaded into a stxxl vector and then processed by the external memory
- suffix sorters that this main program was compiled with.
- Size of the input can be limited via the -s command line parameter.
The default external memory suffix sorter is eSAIS with LCP construction. The distributed source code is by default configured to use 4 GiB of RAM and the 40-bit position data type. The STXXL library requires further configuration to specify where temporary files are created; this .stxxl file is described below.
--- Building esactest
To build the program "cmake" version 2.8 or higher and a recent gcc C++ compiler are required. Having "eSAIS-DC3-LCP-X.Y.Z.tar" unpacked in the current directory, the following commands will compile and run the esactest program, compiled with the included custom STXXL library:
$ mkdir build $ cd build $ cmake -DCMAKE_BUILD_TYPE=Release .. $ make $ cd src/ $ ./esactest
If the cmake configuration routine does not finish successfully, you may need to install some development packages via your Linux distribution's package manager.
To build separate binaries for each algorithm variant add -DBUILD_ALL=ON to the cmake configuration line.
Here is a simple example to generate suffix and LCP array of a random input:
$ mkdir artificial $ ./esactest -s 10mb --writeinput --writeoutput artificial/rand10 $ ls -l artifical/
--- Configuring STXXL Temporary Files
The external memory library STXXL creates (huge) temporary files when processing large data. It is configured by the file ".stxxl" in the current directory. Each line specifies one temporary file, its maximum size, and access method. We generally use
for two 1 TiB disks mounted at /data01 and /data02. The method "syscall_unlink" creates a deleted file on the disk, which _does not appear_ in directory listings and is automatically freed by the kernel when the program ends.
--- Further Information
We have collected source code, some test instances, and further information at
For information about STXXL see
Written 2012-11-26 by Timo Bingmann, updated 2013-03-30.