Compression techniques work by finding repeated patterns in the data and encoding the duplications more compactly. The Burrows–Wheeler transform (BWT, also called block-sorting compression) rearranges a character string into runs of similar characters. This is useful for compression, since a string that has runs of repeated characters tends to be easy to compress with techniques such as the move-to-front transform and run-length encoding. More importantly, the transformation is reversible, without needing to store any additional data. The BWT is thus a "free" method of improving the efficiency of text compression algorithms, costing only some extra computation.
Description
The Burrows–Wheeler transform is an algorithm used in data compression techniques such as bzip2. It was invented by Michael Burrows and David Wheeler in 1994 while working at DEC Systems Research Center in Palo Alto, California.^{[1]} It is based on a previously unpublished transformation discovered by Wheeler in 1983.
When a character string is transformed by the BWT, none of its characters change value. The transformation permutes the order of the characters. If the original string had several substrings that occurred often, then the transformed string will have several places where a single character is repeated multiple times in a row.
For example:
Input:  SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES
Output: TEXYDST.E.IXIXIXXSSMPPS.B..E..UESFXDIIOIIITS

The output is easier to compress because it has many repeated characters. In the transformed string shown above, there are seven runs of identical characters: XX, SS, PP, .., .., II, and III, which together make up 15 of the 44 characters in it.
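The runs in a string can be listed mechanically. The following is a minimal sketch, not part of the original article; the function names `run_length_encode` and `repeated_runs` are illustrative assumptions.

```python
from itertools import groupby


def run_length_encode(s):
    """Return (character, run length) pairs for consecutive equal characters."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def repeated_runs(s):
    """Return only the runs of length two or more, as strings."""
    return [ch * n for ch, n in run_length_encode(s) if n > 1]


if __name__ == "__main__":
    print(run_length_encode("AAB"))    # [('A', 2), ('B', 1)]
    print(repeated_runs("AABCCCD"))    # ['AA', 'CCC']
```

Applying `repeated_runs` to a transformed string lists the runs that a run-length coder could store as (character, count) pairs.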
Example
The transform is done by sorting all rotations of the text into lexicographic order: the 8 rotations of the 8-character text below appear again in the second column, with the 8 rows sorted. The output is the last column of the sorted table, together with the number k = 7 of the row in which the unrotated text ends up. For example, the text "^BANANA|" is transformed into "BNN^AA|A" through these steps (the | character marks the end of the text, the 'EOF' pointer):
Transformation

Input: ^BANANA|

All rotations   Sorted into lexicographic order   Last column
^BANANA|        ANANA|^B                          B
|^BANANA        ANA|^BAN                          N
A|^BANAN        A|^BANAN                          N
NA|^BANA        BANANA|^                          ^
ANA|^BAN        NANA|^BA                          A
NANA|^BA        NA|^BANA                          A
ANANA|^B        ^BANANA|                          |
BANANA|^        |^BANANA                          A

Output: BNN^AA|A


The following pseudocode gives a simple (though inefficient) way to calculate the BWT and its inverse. It assumes that the input string s contains a special character 'EOF' which is the last character, occurs nowhere else in the text, and is ignored during sorting.

    function BWT (string s)
        create a table, rows are all possible rotations of s
        sort rows alphabetically
        return (last column of the table)

    function inverseBWT (string s)
        create empty table
        repeat length(s) times
            // first insert creates first column
            insert s as a column of table before first column of the table
            sort rows of the table alphabetically
        return (row that ends with the 'EOF' character)
Explanation
To understand why this creates more easily compressible data, consider transforming a long English text that frequently contains the word "the". Sorting the rotations of this text will often group rotations starting with "he " together, and the last character of such a rotation (which is also the character before the "he ") will usually be "t". The result of the transform would therefore contain a number of "t" characters, with the perhaps less common exceptions (such as when the text contains "Brahe ") mixed in. The success of this transform thus depends upon one value having a high probability of occurring before a given sequence, so that in general it needs fairly long samples (a few kilobytes at least) of appropriate data (such as text).
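The grouping effect can be observed directly on a small sample. This is an illustrative sketch only: the helper name `preceding_chars` and the sample sentence are assumptions made for the example, and the text is far shorter than the few kilobytes mentioned above.

```python
def preceding_chars(text, context):
    """Last characters of the sorted rotations of text that start with context."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations if row.startswith(context))


if __name__ == "__main__":
    sample = "the cat saw the dog; Brahe saw the star. "
    # Three rotations come from "the " and one from "Brahe ".
    print(preceding_chars(sample, "he "))  # prints "ttat"
```

The four rotations beginning with "he " sit next to each other in the sorted table, so their final characters land next to each other in the output, with the "a" from "Brahe" as the exception.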
The remarkable thing about the BWT is not that it generates a more easily encoded output—an ordinary sort would do that—but that it is reversible, allowing the original document to be regenerated from the last column data.
The inverse can be understood this way. Take the final table in the BWT algorithm, and erase all but the last column. Given only this information, you can easily reconstruct the first column. The last column tells you all the characters in the text, so just sort these characters alphabetically to get the first column. Then, the last and first columns (of each row) together give you all pairs of successive characters in the document, where pairs are taken cyclically so that the last and first character form a pair. Sorting the list of pairs gives the first and second columns. Continuing in this manner, you can reconstruct the entire list. Then, the row with the "end of file" character at the end is the original text. Reversing the example above is done like this:
Inverse transformation

Input: BNN^AA|A

Add 1   Sort 1   Add 2   Sort 2
B       A        BA      AN
N       A        NA      AN
N       A        NA      A|
^       B        ^B      BA
A       N        AN      NA
A       N        AN      NA
|       ^        |^      ^B
A       |        A|      |^

Add 3   Sort 3   Add 4   Sort 4
BAN     ANA      BANA    ANAN
NAN     ANA      NANA    ANA|
NA|     A|^      NA|^    A|^B
^BA     BAN      ^BAN    BANA
ANA     NAN      ANAN    NANA
ANA     NA|      ANA|    NA|^
|^B     ^BA      |^BA    ^BAN
A|^     |^B      A|^B    |^BA

Add 5   Sort 5   Add 6    Sort 6
BANAN   ANANA    BANANA   ANANA|
NANA|   ANA|^    NANA|^   ANA|^B
NA|^B   A|^BA    NA|^BA   A|^BAN
^BANA   BANAN    ^BANAN   BANANA
ANANA   NANA|    ANANA|   NANA|^
ANA|^   NA|^B    ANA|^B   NA|^BA
|^BAN   ^BANA    |^BANA   ^BANAN
A|^BA   |^BAN    A|^BAN   |^BANA

Add 7     Sort 7    Add 8      Sort 8
BANANA|   ANANA|^   BANANA|^   ANANA|^B
NANA|^B   ANA|^BA   NANA|^BA   ANA|^BAN
NA|^BAN   A|^BANA   NA|^BANA   A|^BANAN
^BANANA   BANANA|   ^BANANA|   BANANA|^
ANANA|^   NANA|^B   ANANA|^B   NANA|^BA
ANA|^BA   NA|^BAN   ANA|^BAN   NA|^BANA
|^BANAN   ^BANANA   |^BANANA   ^BANANA|
A|^BANA   |^BANAN   A|^BANAN   |^BANANA

Output: ^BANANA|

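The add-and-sort rounds can be traced in code. The sketch below is illustrative rather than part of the original article: the generator name `inverse_bwt_tables` is an assumption, and so is the choice of "|" as the end-of-text marker, picked because it sorts after the letters and "^" in ASCII.

```python
def inverse_bwt_tables(last_column):
    """Yield the table after each 'add a column, then sort' round."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        # Prepend the transformed string as a new first column, then sort rows.
        table = sorted(ch + row for ch, row in zip(last_column, table))
        yield list(table)


if __name__ == "__main__":
    tables = list(inverse_bwt_tables("BNN^AA|A"))
    print(tables[0])  # first column: the characters of the input, sorted
    print(tables[1])  # sorted pairs of successive characters
    # The row that ends with the marker is the original text.
    print([row for row in tables[-1] if row.endswith("|")])  # ['^BANANA|']
```

Each yielded table corresponds to one "Sort" step; the final table contains every rotation of the original text.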

Optimization
A number of optimizations can make these algorithms run more efficiently without changing the output. There is no need to represent the table in either the encoder or decoder. In the encoder, each row of the table can be represented by a single pointer into the string, and the sort performed using the indices. Some care must be taken to ensure that the sort does not exhibit bad worst-case behavior: standard library sort functions are unlikely to be appropriate. In the decoder, there is also no need to store the table, and in fact no sort is needed at all. In time proportional to the alphabet size and string length, the decoded string may be generated one character at a time from right to left. A "character" in the algorithm can be a byte, or a bit, or any other convenient size.
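The encoder-side idea of sorting pointers rather than rows can be sketched as follows. This is an illustrative example, not the article's code: the name `bwt_by_indices` is an assumption, and the sketch uses Python's built-in sort with a character-by-character comparator, so it shows the memory saving only and not the worst-case safeguards mentioned above.

```python
from functools import cmp_to_key


def bwt_by_indices(s):
    """BWT of s (which should already contain its end marker), sorting indices only."""
    n = len(s)

    def compare(i, j):
        # Compare the rotations starting at i and j without building them.
        for k in range(n):
            a, b = s[(i + k) % n], s[(j + k) % n]
            if a != b:
                return -1 if a < b else 1
        return 0

    order = sorted(range(n), key=cmp_to_key(compare))
    # The last character of the rotation starting at i is s[i - 1].
    return "".join(s[i - 1] for i in order)


if __name__ == "__main__":
    print(bwt_by_indices("^BANANA|"))  # BNN^AA|A
```

Only the list of start positions is stored, so the extra memory is proportional to the length of the string rather than to its square.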
One may also make the observation that mathematically, the encoded string can be computed as a simple modification of the suffix array, and suffix arrays can be computed with linear time and memory.
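The relationship to the suffix array can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the null character is used as a unique smallest end marker, and the suffix array is built by naive sorting, which is not the linear-time construction referred to above.

```python
def suffix_array(s):
    """Naive suffix array: start positions of the suffixes of s in sorted order."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def bwt_from_suffix_array(text, sentinel="\0"):
    """BWT of text + sentinel, read off the suffix array."""
    assert sentinel not in text, "sentinel must not occur in the text"
    s = text + sentinel
    # With a unique smallest sentinel, suffix order equals rotation order,
    # and the character preceding suffix i is s[i - 1] (cyclically).
    return "".join(s[i - 1] for i in suffix_array(s))


if __name__ == "__main__":
    print(repr(bwt_from_suffix_array("banana")))  # 'annb\x00aa'
```

Swapping `suffix_array` for a linear-time construction would leave `bwt_from_suffix_array` unchanged.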
There is no need to have an actual 'EOF' character. Instead, a pointer can be used that remembers where in a string the 'EOF' would be if it existed. In this approach, the output of the BWT must include both the transformed string, and the final value of the pointer. That means the BWT does expand its input slightly. The inverse transform then shrinks it back down to the original size: it is given a string and a pointer, and returns just a string.
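A pointer-based variant can be sketched as below. The function names `bwt_with_index` and `inverse_bwt_with_index` are assumptions for this example, the index is zero-based, and the forward transform still builds rotations for brevity.

```python
def bwt_with_index(s):
    """Return (transformed string, row index of the unrotated input)."""
    n = len(s)
    order = sorted(range(n), key=lambda i: s[i:] + s[:i])
    last = "".join(s[i - 1] for i in order)
    return last, order.index(0)


def inverse_bwt_with_index(last, index):
    """Rebuild the original string from the last column and the row index."""
    n = len(last)
    # A stable sort of positions by character maps each sorted row to the
    # row obtained by rotating it left by one character.
    step = sorted(range(n), key=lambda i: last[i])
    out = []
    row = index
    for _ in range(n):
        row = step[row]
        out.append(last[row])
    return "".join(out)


if __name__ == "__main__":
    transformed, k = bwt_with_index("^BANANA")
    print(transformed, k)                          # BNN^AAA 6
    print(inverse_bwt_with_index(transformed, k))  # ^BANANA
```

The output is one integer larger than the input, and the inverse needs both parts to recover the text.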
A complete description of the algorithms can be found in Burrows and Wheeler's paper, or in a number of online sources.
Bijective variant
When a bijective variant of the Burrows–Wheeler transform is performed on "^BANANA", the result is "ANNBAA^", without the need for a special character for the end of the string. A special character forces one to increase the character space by one, or to have a separate field with a numerical value for an offset. Either of these features makes data compression more difficult. When dealing with short files, the savings are great percentage-wise.
The bijective transform is done by sorting all rotations of the Lyndon words. When two strings of unequal length are compared, the infinite periodic repetitions of each are compared in lexicographic order. The output is the last column of the base-rotated Lyndon words. For example, the text "^BANANA" is transformed into "ANNBAA^" through the steps below. The EOF character of the original string is unneeded in the bijective transform, so it is dropped during the transform and re-added to its proper place in the file.
The string is broken into Lyndon words so that the words in the sequence are non-increasing under the comparison method above. "^BANANA" becomes (^) (B) (AN) (AN) (A), and the two equal words can be combined to give (^) (B) (ANAN) (A).
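The factorization step can be sketched with Duval's algorithm. The function name `lyndon_factorization` is an assumption for this example, and characters are compared by their code points, so "^" sorts after the upper-case letters.

```python
def lyndon_factorization(s):
    """Split s into a non-increasing sequence of Lyndon words (Duval's algorithm)."""
    factors = []
    i, n = 0, len(s)
    while i < n:
        j, k = i + 1, i
        # Extend the current run of repeated Lyndon-word candidates.
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        # Emit every complete copy of the word of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


if __name__ == "__main__":
    print(lyndon_factorization("^BANANA"))  # ['^', 'B', 'AN', 'AN', 'A']
```

The result matches the factorization (^) (B) (AN) (AN) (A) given above; the two copies of AN are reported separately.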
Bijective transformation

Input: ^BANANA

All rotations         Sorted alphabetically   Last column of rotated Lyndon word
^^^^^^^^... (^)       AAAAAAAA... (A)         A
BBBBBBBB... (B)       ANANANAN... (ANAN)      N
ANANANAN... (ANAN)    ANANANAN... (ANAN)      N
NANANANA... (NANA)    BBBBBBBB... (B)         B
ANANANAN... (ANAN)    NANANANA... (NANA)      A
NANANANA... (NANA)    NANANANA... (NANA)      A
AAAAAAAA... (A)       ^^^^^^^^... (^)         ^

Output: ANNBAA^


Inverse bijective transform

Input: ANNBAA^

Add 1   Sort 1   Add 2   Sort 2
A       A        AA      AA
N       A        NA      AN
N       A        NA      AN
B       B        BB      BB
A       N        AN      NA
A       N        AN      NA
^       ^        ^^      ^^

Add 3   Sort 3   Add 4   Sort 4
AAA     AAA      AAAA    AAAA
NAN     ANA      NANA    ANAN
NAN     ANA      NANA    ANAN
BBB     BBB      BBBB    BBBB
ANA     NAN      ANAN    NANA
ANA     NAN      ANAN    NANA
^^^     ^^^      ^^^^    ^^^^

Output: ^BANANA


The above may be viewed as four cycles:
^ = (^)(^)... = ^^^^^...
B = (B)(B)... = BBBB...
ANAN = (ANAN)(ANAN)... = ANANANAN...
A = (A)(A)... = AAAAA...
or as five cycles, where ANAN is broken into two:
AN = (AN)(AN)... = ANANANAN...
AN = (AN)(AN)... = ANANANAN...
A cycle of N characters is repeated N times in the table. Writing out the cycles in the order (^) (B) (ANAN) (A), or equivalently (^) (B) (AN) (AN) (A), gives ^BANANA.
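The cycle view suggests a way to invert the bijective transform without building the table. The following is a hedged sketch, not the article's own procedure: the function name `inverse_bijective_bwt` is an assumption, and the approach (following a stable last-to-first mapping around each cycle and reversing the collected characters) is one common formulation.

```python
def inverse_bijective_bwt(last):
    """Invert the bijective transform by walking the cycles of the row mapping."""
    n = len(last)
    # order[j] is the position in `last` of the character that starts sorted row j.
    order = sorted(range(n), key=lambda i: last[i])
    to_front = [0] * n
    for j, i in enumerate(order):
        to_front[i] = j  # row reached by moving row i's last character to the front
    seen = [False] * n
    out = []
    for start in range(n):
        i = start
        # Each unvisited row starts a new cycle; the cycle spells a word backwards.
        while not seen[i]:
            seen[i] = True
            out.append(last[i])
            i = to_front[i]
    # Cycles are met from the smallest word to the largest, each reversed,
    # so reversing everything restores the original order.
    return "".join(reversed(out))


if __name__ == "__main__":
    print(inverse_bijective_bwt("ANNBAA^"))  # ^BANANA
```

For this input the stable mapping splits the repeated word into two cycles of length two, which corresponds to the five-cycle view.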
Since any rotation of the input string will lead to the same transformed string, the BWT cannot be inverted without adding an EOF marker to the input or augmenting the output with information, such as an index, that makes it possible to identify the input string among all its rotations.
There is a bijective version of the transform, by which the transformed string uniquely identifies the original. In this version, every string has a unique inverse of the same length.^{[2]}^{[3]}
The fastest versions are linear in time and space.
The bijective transform is computed by factoring the input into a non-increasing sequence of Lyndon words; such a factorization exists by the Chen–Fox–Lyndon theorem,^{[4]} and may be found in linear time.^{[5]} The algorithm sorts the rotations of all the words; as in the Burrows–Wheeler transform, this produces a sorted sequence of n strings. The transformed string is then obtained by picking the final character of each string in this sorted list.
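The steps just described can be put together in a short sketch. It is illustrative only and far from linear time: the names `lyndon_factors` and `bijective_bwt` are assumptions, and infinite repetitions are compared by repeating each rotation up to twice the input length, which is enough to decide the order of any two of them.

```python
def lyndon_factors(s):
    """Non-increasing Lyndon factorization of s (Duval's algorithm)."""
    factors, i, n = [], 0, len(s)
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


def bijective_bwt(s):
    """Sort all rotations of all Lyndon factors and take their last characters."""
    rotations = [w[i:] + w[:i] for w in lyndon_factors(s) for i in range(len(w))]
    limit = 2 * len(s)

    def periodic_prefix(rotation):
        # Finite stand-in for the infinite periodic repetition of the rotation.
        return (rotation * (limit // len(rotation) + 1))[:limit]

    rotations.sort(key=periodic_prefix)
    return "".join(rotation[-1] for rotation in rotations)


if __name__ == "__main__":
    print(bijective_bwt("^BANANA"))  # ANNBAA^
```

The output has the same length as the input and carries no index or end marker.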
For example, applying the bijective transform gives:
Input:        SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES
Lyndon words: SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES
Output:       STEYDST.E.IXXIIXXSMPPXS.B..EE..SUSFXDIOIIIIT

The bijective transform includes eight runs of identical characters. These runs are, in order: XX, II, XX, PP, .., EE, .., and IIII.
In total, 18 characters are used in these runs.
Dynamic Burrows–Wheeler transform
Instead of reconstructing the Burrows–Wheeler transform of an edited text, Salson et al.^{[6]} propose an algorithm that deduces the new Burrows–Wheeler transform from the original one, doing a limited number of local reorderings in the original Burrows–Wheeler transform.
Sample implementation
This Python implementation sacrifices speed for simplicity: the program is short, but takes more than the linear time that would be desired in a practical implementation.
Using the null character as the end-of-file marker, and using s[i:] + s[:i] to construct the i-th rotation of s, the forward transform takes the last character of each of the sorted rows:

    def bwt(s):
        """Apply Burrows-Wheeler transform to input string."""
        assert "\0" not in s, "Input string cannot contain null character ('\\0')"
        s += "\0"  # Add end-of-file marker
        table = sorted(s[i:] + s[:i] for i in range(len(s)))  # Table of rotations of string
        last_column = [row[-1:] for row in table]  # Last character of each row
        return "".join(last_column)  # Convert list of characters into string

The inverse transform repeatedly inserts r as the left column of the table and sorts the table. After the whole table is built, it returns the row that ends with null, minus the null.

    def ibwt(r):
        """Apply inverse Burrows-Wheeler transform."""
        table = [""] * len(r)  # Make empty table
        for _ in range(len(r)):
            table = sorted(r[i] + table[i] for i in range(len(r)))  # Add a column of r
        s = [row for row in table if row.endswith("\0")][0]  # Find the correct row (ending in "\0")
        return s.rstrip("\0")  # Get rid of trailing null character

Here is another, more efficient method for the inverse transform. Although more complex, it increases the speed greatly when decoding lengthy strings.

    def ibwt(r, *args):
        """Inverse Burrows-Wheeler transform.

        args is the original index, if it was not indicated by a null byte.
        Characters are assumed to have code points below 256.
        """
        first_col = "".join(sorted(r))
        count = [0] * 256
        byte_start = [-1] * 256
        output = [""] * len(r)
        shortcut = [None] * len(r)
        # Generate shortcut lists
        for i in range(len(r)):
            shortcut_index = ord(r[i])
            shortcut[i] = count[shortcut_index]
            count[shortcut_index] += 1
            shortcut_index = ord(first_col[i])
            if byte_start[shortcut_index] == -1:
                byte_start[shortcut_index] = i
        local_index = r.index("\x00") if not args else args[0]
        for i in range(len(r)):
            # Take the next index indicated by the transformation vector
            next_byte = r[local_index]
            output[len(r) - i - 1] = next_byte
            shortcut_index = ord(next_byte)
            # Assign local_index to the next index in the transformation vector
            local_index = byte_start[shortcut_index] + shortcut[local_index]
        return "".join(output).rstrip("\x00")

BWT in bioinformatics
Early programs for aligning short sequencing reads to a reference genome relied on hashing (e.g., Eland, SOAP,^{[7]} or Maq^{[8]}). In an effort to reduce the memory requirement for sequence alignment, several alignment programs were developed (Bowtie,^{[9]} BWA,^{[10]} and SOAP2^{[11]}) that use the Burrows–Wheeler transform.
References

^

^ Gil, J.; Scott, D. A. (2009), A bijective string sorting transform

^ Kufleitner, Manfred (2009), "On bijective variants of the Burrows–Wheeler transform", in Holub, Jan; Žďárek, Jan, Prague Stringology Conference, pp. 65–69.

^ *

^ Duval, Jean-Pierre (1983), "Factorizing words over an ordered alphabet", Journal of Algorithms 4 (4): 363–381, Zbl 0532.68061.

^ Salson M, Lecroq T, Léonard M and Mouchard L (2009). "A Four-Stage Algorithm for Updating a Burrows–Wheeler Transform". Theoretical Computer Science 410 (43): 4350.

^ Li R, et al. (2008). "SOAP: short oligonucleotide alignment program". Bioinformatics 24 (5): 713–714.

^ Li H, Ruan J, Durbin R (2008-08-19). "Mapping short DNA sequencing reads and calling variants using mapping quality scores". Genome Research 18 (11): 1851–1858.

^ Langmead B, Trapnell C, Pop M, Salzberg SL (2009). "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome". Genome Biology 10 (3): R25.

^ Li H, Durbin R (2009). "Fast and accurate short read alignment with Burrows–Wheeler Transform". Bioinformatics 25 (14): 1754–1760.

^ Li R, et al. (2009). "SOAP2: an improved ultrafast tool for short read alignment". Bioinformatics 25 (15): 1966–1967.
External links

Compression comparison of BWT-based file compressors

Article by Mark Nelson on the BWT

A Bijective String-Sorting Transform, by Gil and Scott

Yuta's openbwtv1.5.zip contains source code for various BWT routines including BWTS for bijective version

On Bijective Variants of the Burrows–Wheeler Transform, by Kufleitner

Blog post and project page for an open-source compression program and library based on the Burrows–Wheeler algorithm
This article was sourced from Creative Commons Attribution-ShareAlike License; additional terms may apply.