Editing Lempel–Ziv–Welch (section)

==Algorithm==
The scenario described by Welch's 1984 paper<ref name="WelchIEEE"/> encodes sequences of 8-bit data as fixed-length 12-bit codes.  The codes from 0 to 255 represent 1-character sequences consisting of the corresponding 8-bit character, and the codes 256 through 4095 are created in a dictionary for sequences encountered in the data as it is encoded.  At each stage in compression, input bytes are gathered into a sequence until the next character would make a sequence with no code yet in the dictionary.  The code for the sequence (without that character) is added to the output, and a new code (for the sequence with that character) is added to the dictionary.

The idea was quickly adapted to other situations.  In an image based on a color table, for example, the natural character alphabet is the set of color table indexes, and in the 1980s, many images had small color tables (on the order of 16 colors).  For such a reduced alphabet, the full 12-bit codes yielded poor compression unless the image was large, so the idea of a '''variable-width''' code was introduced:  codes typically start one bit wider than the symbols being encoded, and as each code size is used up, the code width increases by 1 bit, up to some prescribed maximum (typically 12 bits).  When the maximum code value is reached, encoding proceeds using the existing table, but new codes are not generated for addition to the table.

Further refinements include reserving a code to indicate that the code table should be cleared and restored to its initial state (a "clear code", typically the first value immediately after the values for the individual alphabet characters), and a code to indicate the end of data (a "stop code", typically one greater than the clear code).  The clear code lets the table be reinitialized after it fills up, which lets the encoding adapt to changing patterns in the input data.  Smart encoders can monitor the compression efficiency and clear the table whenever the existing table no longer matches the input well.

Since codes are added in a manner determined by the data, the decoder mimics building the table as it sees the resulting codes.  It is critical that the encoder and decoder agree on the variety of LZW used: the size of the alphabet, the maximum table size (and code width), whether variable-width encoding is used, initial code size, and whether to use the clear and stop codes (and what values they have).  Most formats that employ LZW build this information into the format specification or provide explicit fields for them in a compression header for the data.

===Encoding===
A high-level view of the encoding algorithm is shown here:

# Initialize the dictionary to contain all strings of length one.
# Find the longest string '''W''' in the dictionary that matches the current input.
# Emit the dictionary index for '''W''' to output and remove '''W''' from the input.
# Add '''W''' followed by the next symbol in the input to the dictionary.
# Go to Step 2.

A dictionary is initialized to contain the single-character strings corresponding to all the possible input characters (and nothing else except the clear and stop codes if they're being used). The algorithm works by scanning through the input string for successively longer substrings until it finds one that is not in the dictionary. When such a string is found, the index for the string without the last character (i.e., the longest substring that ''is'' in the dictionary) is retrieved from the dictionary and sent to output, and the new string (including the last character) is added to the dictionary with the next available code. The last input character is then used as the next starting point to scan for substrings.

In this way, successively longer strings are registered in the dictionary and available for subsequent encoding as single output values. The algorithm works best on data with repeated patterns, so the initial parts of a message see little compression. As the message grows, however, the [[data compression ratio|compression ratio]] tends asymptotically to the maximum (i.e., the compression factor or ratio improves on an increasing curve, and not linearly, approaching a theoretical maximum inside a limited time period rather than over infinite time).<ref>{{Cite journal | last1 = Ziv | first1 = J. | last2 = Lempel | first2 = A. | doi = 10.1109/TIT.1978.1055934 | url = http://www.cs.duke.edu/courses/spring03/cps296.5/papers/ziv_lempel_1978_variable-rate.pdf | title = Compression of individual sequences via variable-rate coding | journal = IEEE Transactions on Information Theory | volume = 24 | issue = 5 | pages = 530 | year = 1978 | citeseerx = 10.1.1.14.2892 | access-date = 2009-03-03 | archive-date = 2012-04-12 | archive-url = https://web.archive.org/web/20120412115357/http://www.cs.duke.edu/courses/spring03/cps296.5/papers/ziv_lempel_1978_variable-rate.pdf | url-status = dead }}</ref>

===Decoding===
A high-level view of the decoding algorithm is shown here:
# Initialize the dictionary to contain all strings of length one.
# Read the next encoded symbol: Is it encoded in the dictionary?
##Yes:
###Emit the corresponding string '''W''' to output.
###Concatenate the previous string emitted to output with the first symbol of '''W'''. Add this to the dictionary.
##No:
###Concatenate the previous string emitted to output with its first symbol. Call this string '''V'''.
###Add '''V''' to the dictionary and emit '''V''' to output.
#Repeat Step 2 until end of input string
The decoding algorithm works by reading a value from the encoded input and outputting the corresponding string from the dictionary. However, the full dictionary is not needed, only the initial dictionary that contains single-character strings (and that is usually [[Hard coding|hard coded]] in the program, instead of sent with the encoded data). Instead, the full dictionary is rebuilt during the decoding process the following way: after decoding a value and outputting a string, the decoder [[concatenation|concatenates]] it with the first character of the ''next'' decoded string (or the first character of current string, if the next one can't be decoded; since if the next value is unknown, then it must be the value added to the dictionary in ''this'' iteration, and so its first character is the same as the first character of the current string), and updates the dictionary with the new string. The decoder then proceeds to the next input (which was already read in the previous iteration) and processes it as before, and so on until it has exhausted the input stream.

===Variable-width codes===
If variable-width codes are being used, the encoder and decoder must be careful to change the width at the same points in the encoded data so they don't disagree on boundaries between individual codes in the stream. In the standard version, the encoder increases the width from ''p'' to ''p''&nbsp;+&nbsp;1 when a sequence ω&nbsp;+&nbsp;''s'' is encountered that is not in the table (so that a code must be added for it) but the next available code in the table is 2<sup>''p''</sup> (the first code requiring ''p''&nbsp;+&nbsp;1 bits).  The encoder emits the code for ω at width ''p'' (since that code does not require ''p''&nbsp;+&nbsp;1 bits), and then increases the code width so that the next code emitted is ''p''&nbsp;+&nbsp;1 bits wide.

The decoder is always one code behind the encoder in building the table, so when it sees the code for ω, it generates an entry for code 2<sup>''p''</sup>&nbsp;&minus;&nbsp;1.  Since this is the point where the encoder increases the code width, the decoder must increase the width here as well—at the point where it generates the largest code that fits in ''p'' bits.

Unfortunately, some early implementations of the encoding algorithm increase the code width and ''then'' emit ω at the new width instead of the old width, so that to the decoder it looks like the width changes one code too early.  This is called "early change"; it caused so much confusion that Adobe now allows both versions in PDF files, but includes an explicit flag in the header of each LZW-compressed stream to indicate whether early change is being used.  Of the graphics file formats that support LZW compression, [[TIFF]] uses early change, while [[GIF]] and most others don't.

When the table is cleared in response to a clear code, both encoder and decoder change the code width after the clear code back to the initial code width, starting with the code immediately following the clear code.

===Packing order===
Since the codes emitted typically do not fall on byte boundaries, the encoder and decoder must agree on how codes are packed into bytes.  The two common methods are ''LSB-first'' ("[[least significant bit]] first") and ''MSB-first'' ("[[most significant bit]] first").  In LSB-first packing, the first code is aligned so that the least significant bit of the code falls in the least significant bit of the first stream byte, and if the code has more than 8 bits, the high-order bits left over are aligned with the least significant bits of the next byte; further codes are packed with LSB going into the least significant bit not yet used in the current stream byte, proceeding into further bytes as necessary. MSB-first packing aligns the first code so that its ''most'' significant bit falls in the MSB of the first stream byte, with overflow aligned with the MSB of the next byte; further codes are written with MSB going into the most significant bit not yet used in the current stream byte.

GIF files use LSB-first packing order.  TIFF files and PDF files use MSB-first packing order.