- Given an input string `S`, generate |`S`| cyclic rotations.
- Sort these rotations into alphabetical (lexicographic) order, comparing by first character and breaking ties with the characters that follow.
- Collate the last character of every sorted string.
- These characters combined form the `BWT`, the encoded version.
- Then, compress repeated characters by placing a number in front of each character (the number of times it repeats consecutively).
- The final string is the compressed, encoded text.
For example, `CAR$` generates the following rotations (`$` represents the end of the text):
CAR$
$CAR
R$CA
AR$C
Sorted alphabetically, we get:
$CAR
AR$C
CAR$
R$CA
Collating the last characters, then labelling each with its number of repeats, we get: `1R1C1$1A`.
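The encoding steps above can be sketched in Python as follows. This is a minimal illustration rather than a definitive implementation: the function names `bwt_encode`, `run_length_encode`, and `compress` are hypothetical, the input is assumed to already end with a unique end marker such as `$`, and the text is assumed to contain no digits (otherwise the run-length counts would be ambiguous).

```python
from itertools import groupby


def bwt_encode(text: str) -> str:
    """Return the last column of the sorted cyclic rotations of `text`.

    Assumption: `text` already ends with a unique end marker such as "$".
    """
    # Build every cyclic rotation of the input.
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    # Python compares strings by code point, so "$" sorts before letters.
    rotations.sort()
    # Collate the last character of each sorted rotation.
    return "".join(rotation[-1] for rotation in rotations)


def run_length_encode(text: str) -> str:
    """Prefix each run of identical characters with its length."""
    return "".join(f"{len(list(run))}{char}" for char, run in groupby(text))


def compress(text: str) -> str:
    """Apply the transform, then the run-length compression."""
    return run_length_encode(bwt_encode(text))
```

With the example above, `bwt_encode("CAR$")` gives `RC$A` and `compress("CAR$")` gives `1R1C1$1A`. Building every rotation takes quadratic memory, which is fine for illustration but not for long inputs.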
- Let `bwt` be the input string.
- Create a list, `F`, that contains the sorted version of `bwt`.
- Create a list, `N`, that numbers each occurrence of every character in `bwt`.
- Create a list, `R`, that 'ranks' each unique character in `F`, i.e. records where it first appears in `F`.
- Create a list, `L`, that is essentially `bwt`.
- Set a counter called `row` to `0`, and create an empty string called `output`.
- Repeat $n-1$ times, where $n$ is the length of the string `bwt`:
  - Set the current character, `c`, to the `row`-th item of `L`.
  - Insert this character at the beginning of `output`.
  - Let $r$ be the Unicode number of the character `c`. Add together the $r$-th item of `R` and the `row`-th item of `N`. This gives a new integer; set it as the new `row` number. (This assumes `R` and `N` are both counted from 0, like `row`; if they are counted from 1, subtract 2 from the sum.)
- Return the `output` string.
Given `1R1C1$1A`, we undo the run-length encoding to get `RC$A`. Then, we create our lists:

- `F`irst: `$ACR`
- `N`umerals: `R1 C1 $1 A1`
- `R`ank:

| | $ | A | C | R |
|---|---|---|---|---|
| First appears in F at position | 1 | 2 | 3 | 4 |

- `L`ast: `RC$A`
Then we repeat our algorithm, building up the word one character at a time (shown here with the end marker `$` already in place):
$
R$
AR$
CAR$
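The decoding steps can be sketched in Python like this. It is a rough illustration under a few assumptions: the names `run_length_decode` and `bwt_decode` are hypothetical, `R` is stored as a dictionary keyed by character rather than a list indexed by Unicode number, both `R` and `N` are counted from 0, and the end marker is assumed to sort before every other character so that row `0` is the right place to start.

```python
import re


def run_length_decode(encoded: str) -> str:
    """Expand runs such as "2N" back into "NN".

    Assumption: the original text contains no digits.
    """
    runs = re.findall(r"(\d+)(\D)", encoded)
    return "".join(char * int(count) for count, char in runs)


def bwt_decode(bwt: str) -> str:
    """Invert the transform by following the steps listed above."""
    first = sorted(bwt)  # F: the sorted characters of bwt

    # R: 0-based position where each character first appears in F.
    rank = {}
    for index, char in enumerate(first):
        rank.setdefault(char, index)

    # N: 0-based occurrence number of each character of bwt.
    seen = {}
    numbers = []
    for char in bwt:
        numbers.append(seen.get(char, 0))
        seen[char] = numbers[-1] + 1

    row = 0
    output = ""
    for _ in range(len(bwt) - 1):
        char = bwt[row]                   # row-th item of L
        output = char + output            # insert at the beginning
        row = rank[char] + numbers[row]   # jump to the matching row of F
    return output
```

Because the loop runs $n-1$ times starting from an empty string, `bwt_decode("RC$A")` returns `CAR`, the text without its end marker; appending `$` gives the `CAR$` of the worked example.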
- Let `bwt` be the input string, and `S` be the query.
- Create a list, `F`, that contains the sorted version of `bwt`.
- Create a list, `L`, that is essentially `bwt`.
- Let `range` be the boundaries of a substring in `L`, in the form `(start, end)`.
- For every character `c` in `S`:
  - Find the first occurrence of `c` in `L` within the current `range` (say, position `i`).
  - Once found, find where this `i`-th character of `L` occurs in `F`.
  - Set the index of this found character to be the `start` of the `range`.
  - Repeat the above steps for the last occurrence of `c` in `L` (by traversing backwards).
  - Set `end` accordingly.
- If every character in `S` has been iterated over, the query has been found.
- But if any character can't be found during the loop above, the query is not found.
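The search steps can be sketched in Python as below. Several details are assumptions rather than something the steps state: the function name `bwt_contains` is hypothetical, the query is processed from its last character to its first (the usual order for this kind of backward search, since each row of the sorted rotations is extended to the left), `range` starts out covering every row, and a row of `L` is mapped to `F` using the same occurrence numbering as in decoding.

```python
def bwt_contains(bwt: str, query: str) -> bool:
    """Report whether `query` occurs in the text that produced `bwt`."""
    first = sorted(bwt)  # F: the sorted characters of bwt

    # 0-based position where each character first appears in F.
    rank = {}
    for index, char in enumerate(first):
        rank.setdefault(char, index)

    # 0-based occurrence number of each character of L (= bwt).
    seen = {}
    numbers = []
    for char in bwt:
        numbers.append(seen.get(char, 0))
        seen[char] = numbers[-1] + 1

    # range = (start, end), inclusive; initially every row.
    start, end = 0, len(bwt) - 1

    # Assumption: walk the query from its last character to its first.
    for char in reversed(query):
        # Rows of L inside the current range that hold this character.
        hits = [i for i in range(start, end + 1) if bwt[i] == char]
        if not hits:
            return False  # the character cannot extend the match
        # Map the first and last hit from L to their rows in F.
        start = rank[char] + numbers[hits[0]]
        end = rank[char] + numbers[hits[-1]]
    return True
```

For instance, `bwt_contains("ANNB$AA", "ANA")` is `True` and `bwt_contains("ANNB$AA", "NAB")` is `False`, where `ANNB$AA` is the transform of `BANANA$`. The linear scan for `hits` keeps the sketch close to the listed steps; a practical index would precompute occurrence counts instead.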