Skip to content

Latest commit

 

History

History
405 lines (326 loc) · 12.3 KB

README.md

File metadata and controls

405 lines (326 loc) · 12.3 KB

FM-Index - Compressed full-text Index

A simple C++ based FM-Index [1] implementation using RRR [4] wavelet trees [5] which allows to build a full-text index over a given text T of size n supporting the following operations:

  • count(P,m) : count the number of occurences of pattern P of size m in T.
  • locate(P,m) : locate the text positions of all occurences of P of size m in T.
  • extract(A,B) : extract T[A,B] from the index.
  • recover() : recover T from the index.

The constructed index uses nH_k + o(n log sigma) bits of space [3] which is roughly the size of the compressed representation of T and can perform the above operations without the need to store T. An empirical evaluation of the index is sown in the Benchmark section below.

Drawbacks of the FM-Index is long construction time and high memory requirements during construction.

Usage

Compiling the Index

make

Building an Index

./fmbuild alice29.txt alice29.txt.fm

Builds and writes the FM-Index alice29.txt.fm.

Running count() queries

./fmcount -i alice29.txt.fm alice.qry

The queries are stored in a new line seperated file:

the
house
keep
Alice
and

The index returns the number of occurrences for each query:

./fmcount -i alice29.txt.fm alice.qry
the : 2101
house : 20
keep : 11
Alice : 395
and : 880

Running locate() queries

./fmlocate -i alice29.txt.fm alice.qry

The index returns a sorted list of the locations of all occurences for each query:

./fmlocate -i alice29.txt.fm alice.qry
Read 3 queries
keep (11) : 46385 51125 69491 74680 81562 83046 104830 105180 133621 149966 151623
poison (3) : 8151 8619 8731
tomorrow (1) : 63637

Running extract() queries

./fmextract -i alice29.txt.fm alice.extract

The queries are stored in a new line seperated file:

118 147
1213 1245
24 55

The index returns the extracted text snippet for each query:

./fmextract -i alice29.txt.fm alice.extract
118 - 147 : 'THE MILLENNIUM FULCRUM EDITION'
1213 - 1245 : 'TOOK A WATCH OUT OF ITS WAISTCOAT'
24 - 55 : 'ALICE'S ADVENTURES IN WONDERLAND'

Recover the original text from the index

./fmrecover -i alice29.txt.fm 

The index outputs the original text to stdout.

Verbose output

The -v command line parameter enables verbose messages:

./fmbuild -v alice29.txt
building index.
- remapping alphabet.
- creating cumulative counts C[].
- performing bwt.
- sample SA locations.
- creating bwt output.
- create RRR wavelet tree over bwt.
build FM-Index done. (0.101 sec)
space usage:
- remap_reverse: 75 bytes (0.07%)
- C: 1028 bytes (0.90%)
- Suffixes: 9508 bytes (8.31%)
- Positions: 9512 bytes (8.31%)
- Sampled: 7948 bytes (6.95%)
- T_bwt: 86088 bytes (75.23%)
input Size n = 152090 bytes
index Size = 114431 bytes (0.75 n)
writing FM Index to file 'alice29.txt.fm'

Using the FM-Index to provide full-text search on a given text T

A small code example illustrating the use of the FM class:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include "FM.h"

int
main(int argc,char** argv) {
	uint8_t* T = /* read text */
	uint32_t n = /* sizeof T */
	FM* = new FM(T,n);
	if(FM) {
		/* count the occurences of 'house' in T */
		uint32_t cnt = FM->count("house",strlen("house"));
		
		/* get all locations offsets of 'house' in T */
		uint32_t matches; /* number of matches */
		uint32_t* locs; /* list of location offsets */
		locs = FM->locate("house",strlen("house"),&matches);
		
		/* extract text snippet from T */
		uint8_t* snippet = FM->extract(5,251);
		
		/* recover T from index */
		uint32_t nnew; /* size of Tnew */
		uint8_t* Tnew = FM->reconstructText(&nnew);
		
		delete FM;
	}
}

Compile with g++ -o test main.cpp FM.cpp util.c libcds.a libdivsufsort.a

Benchmarks

Experiments are similar to the ones described in [1]. All experiments were run on a Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz using 4GB of RAM. The default samplerate s=64 was used in all experiments.

Test data

Test data was taken from the The Pizza&Chili Site and TREC.

FileDescriptionAlphabet sizeEntropy (bps)
wsjEnglish Text taken from the TREC wsj collection904.60
srcConcatenated source code (.c,.h,.C,.java) of linux-2.6.11.6 and gcc-4.0.02305.47
proteinsSequence of newline-separated protein sequences254.20
dnaGene DNA sequences41.97
xmlXML that provides bibliographic info on compsci pubs(dblp)965.26

Construction

Peak memory usage was measured using the valgrind --tool=massif tool. Running time was measured using the gettimeofday() system call.

FileSize [MB]Time [sec]Memory [MB]
wsj32.15720.02
wsj64.48739.73
wsj129.29379.15
wsj2519.028158.0
wsj5038.699315.7
wsj10078.287631.2
wsj200158.3411233
src5039.909315.7
src10080.489631.2
src200163.0721233
proteins5035.374315.7
proteins10072.824631.2
proteins200146.1731233
dna5025.941315.7
dna10053.287631.2
dna200109.4491233
xml5036.601315.7
xml10074.104631.2
xml200150.0501233

Construction time increases linearly O(n) with size n of T. Memory requirement is roughly 6n independent of the file type.

FileSize [MB]Index Size [MB]Index Size [relative to file size]
wsj25150.6
wsj50290.58
wsj100570.57
wsj2001120.56
src50300.77
src100590.77
src2001170.75
proteins50360.72
proteins100710.71
proteins2001370.685
dna50240.48
dna100480.48
dna200950.475
xml50250.50
xml100500.50
xml200990.495

Index size depends on the compressability of T. For each specific file type, as n increases, the size of the auxillary data becomes less significant which leads to better overall compression ratio.

Count

50000 random patterns for each length m=5 to 50 were extracted from each 200MB file.

FM-Index count() - 200MB - 50000 Queries

The graph shows the time in seconds it took the FM-Index to answer all 50000 queries on different file types. Note that the query time grows linearly O(m) with the pattern length. Files with smaller alphabet show faster query times overall.

Also note that the query time (m=20) does not depend on the size n of the text T:

FileSize [MB]Query Time [sec]
wsj313.37
wsj611.96
wsj1212.74
wsj2512.74
wsj5012.07
wsj10014.59
wsj20014.34

Overall, searching for 50000 patterns in T using a FM-Index is fast but requieres considerable amount of upfront work (time+space).

Locate

k queries pf length 5 are randomly selected from T so they roughly amount to a certain number of total occurences over all queries.

FM-Index locate() - 200MB - Query length 5

Note the running time increases linearly with the number of occurrences. Similar to count(), files with small alphabet size perform better most likely due to decreased height of the wavelet tree. Comparing the running times of both count() and locate() we observe that locate() is significantly slower than count() as patterns (especially short ones), on large text collections, tend to have many occurrences.

The samplerate s (default = 64) determines how often the underlying suffix array is sampled to enable faster locate() calls. For different samplerates the index achieves different time and space trade-offs during the locate() operation:

FM-Index locate() - 50MB - Query length 6 - 50000 occurrences - variable samplerate

The graph above shows the time required to retrieve the locations of 50000 occurrences compared to the size of the FM-Index for different suffix samplerates. Smaller samplerate decrease the locate() running time but the space used by the index increases. For s=8 the index is 1.5 times the sizes of the original text. For s=512, that is only every 512th position in the underlying suffix array is sampled, the index is less than 50% of the original text. The above graph shows that the best time-space trade-off is achieved for samplerates between 64 and 128.

Extract

Extracting snippets of T from the index shows the same properties as the locate() operation. Like locate(), the running time of extract() grows linearly with the size of the size of the snippet. The running time is also influenced by the samplerate s which provides the same time and space trade-offs described above.

Libraries

The following libraries are needed to use the index. Sourcecode for both libaries is included.

  • libcds -- succinct low level data structures
  • libdivsufsort -- fast suffix sorting (requires cmake to build)

Licence

GPL v3 (see LICENCE file)

References

  1. Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM, 52(4):552-581, 2005.
  2. Francisco Claude and Gonzalo Navarro. Practical Rank/Select Queries over Arbitrary Sequences. SPIRE'08 176-187, 2008.
  3. Veli M"akinen and Gonzalo Navarro. Implicit Compression Boosting with Applications to Self-Indexing. SPIRE'07 214-226.
  4. R. Raman, V. Raman, and S. Srinivasa Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. SODA'02, 233-242.
  5. R. Grossi, A. Gupta, and J. Vitter. High-order entropy-compressed text indexes. SODA'03, 841-850.
  6. The Pizza&Chili Site.

Author

Matthias Petri [email protected] see github.