Editing Lempel–Ziv–Welch (section)

==Example==
The following example illustrates the LZW algorithm in action, showing the status of the output and the [[Dictionary coder|dictionary]] at every stage, both in encoding and decoding the data. This example has been constructed to give reasonable compression on a very short message.  In real text data, repetition is generally less pronounced, so longer input streams are typically necessary before the compression builds up efficiency.

The plaintext to be encoded (from an alphabet using only the capital letters) is:

 TOBEORNOTTOBEORTOBEORNOT#

There are 26 symbols in the plaintext alphabet (the capital letters ''A'' through ''Z'').  ''#'' is used to represent a stop code: a code outside the plaintext alphabet that triggers special handling. We arbitrarily assign these the values 1 through 26 for the letters, and 0 for the stop code '#'. (Most flavors of LZW would put the stop code ''after'' the data alphabet, but nothing in the basic algorithm requires that.  The encoder and decoder only have to agree what value it has.)

A computer renders these as strings of [[bit]]s.  Five-bit codes are needed to give sufficient combinations to encompass this set of 27 values. The dictionary is initialized with these 27 values.  As the dictionary grows, the codes must grow in width to accommodate the additional entries. A 5-bit code gives 2<sup>5</sup> = 32 possible combinations of bits, so when the 33rd dictionary word is created, the algorithm must switch at that point from 5-bit strings to 6-bit strings (for ''all'' code values, including those previously output with only five bits). Note that since the all-zero code 00000 is used, and is labeled "0", the 33rd dictionary entry is labeled '''32'''.  (Previously generated output is not affected by the code-width change, but once a 6-bit value is generated in the dictionary, it could conceivably be the next code emitted, so the width for subsequent output shifts to 6 bits to accommodate that.)

The initial dictionary, then, consists of the following entries:

{| class="wikitable" style="text-align: right; margin-left: auto; margin-right: auto;"
|-
! Symbol !! Binary !! Decimal
|-
| # || 00000 ||  0
|-
| A || 00001 ||  1
|-
| B || 00010 ||  2
|-
| C || 00011 ||  3
|-
| D || 00100 ||  4
|-
| E || 00101 ||  5
|-
| F || 00110 ||  6
|-
| G || 00111 ||  7
|-
| H || 01000 ||  8
|-
| I || 01001 ||  9
|-
| J || 01010 || 10
|-
| K || 01011 || 11
|-
| L || 01100 || 12
|-
| M || 01101 || 13
|-
| N || 01110 || 14
|-
| O || 01111 || 15
|-
| P || 10000 || 16
|-
| Q || 10001 || 17
|-
| R || 10010 || 18
|-
| S || 10011 || 19
|-
| T || 10100 || 20
|-
| U || 10101 || 21
|-
| V || 10110 || 22
|-
| W || 10111 || 23
|-
| X || 11000 || 24
|-
| Y || 11001 || 25
|-
| Z || 11010 || 26
|-
|}

===Encoding===
Buffer input characters in a sequence ω until ω + next character is not in the dictionary.  Emit the code for ω, and add ω + next character to the dictionary.  Start buffering again with the next character. (The string to be encoded is "TOBEORNOTTOBEORTOBEORNOT#".)

{| class="wikitable" style="text-align: right; margin-left: auto; margin-right: auto;"
|-
! scope="col" width="6em" rowspan="2" | Current Sequence
! scope="col" width="4em" rowspan="2" | Next Char
! colspan="2" | Output
! scope="col" width="7em" rowspan="2" colspan="2" | Extended Dictionary
! rowspan="2" | Comments
|-
! Code || Bits
|-
| style="text-align: center;" | NULL
| style="text-align: center;" | T || ||
| style="border-right: none;" |
| style="border-left: none;" | ||
|-
| style="text-align: center;" | T
| style="text-align: center;" | O
| 20 || {{mono|10100}}
| style="border-right: none;" | 27:
| style="border-left: none;" | TO
| style="text-align: left;" | 27 = first available code after 0 through 26
|-
| style="text-align: center;" | O
| style="text-align: center;" | B
| 15 || {{mono|01111}}
| style="border-right: none;" | 28:
| style="border-left: none;" | OB ||
|-
| style="text-align: center;" | B
| style="text-align: center;" | E
| 2 || {{mono|00010}}
| style="border-right: none;" | 29:
| style="border-left: none;" | BE ||
|-
| style="text-align: center;" | E
| style="text-align: center;" | O
| 5 || {{mono|00101}}
| style="border-right: none;" | 30:
| style="border-left: none;" | EO ||
|-
| style="text-align: center;" | O
| style="text-align: center;" | R
| 15 || {{mono|01111}}
| style="border-right: none;" | 31:
| style="border-left: none;" | OR ||
|-
| style="text-align: center;" | R
| style="text-align: center;" | N
| 18 || {{mono|10010}}
| style="border-right: none;" | 32:
| style="border-left: none;" | RN
| style="text-align: left;" | 32 requires 6 bits, so for next output use 6 bits
|-
| style="text-align: center;" | N
| style="text-align: center;" | O
| 14 || {{mono|001110}}
| style="border-right: none;" | 33:
| style="border-left: none;" | NO ||
|-
| style="text-align: center;" | O
| style="text-align: center;" | T
| 15 || {{mono|001111}}
| style="border-right: none;" | 34:
| style="border-left: none;" | OT ||
|-
| style="text-align: center;" | T
| style="text-align: center;" | T
| 20 || {{mono|010100}}
| style="border-right: none;" | 35:
| style="border-left: none;" | TT ||
|-
| style="text-align: center;" | TO
| style="text-align: center;" | B
| 27 || {{mono|011011}}
| style="border-right: none;" | 36:
| style="border-left: none;" | TOB ||
|-
| style="text-align: center;" | BE
| style="text-align: center;" | O
| 29 || {{mono|011101}}
| style="border-right: none;" | 37:
| style="border-left: none;" | BEO ||
|-
| style="text-align: center;" | OR
| style="text-align: center;" | T
| 31 || {{mono|011111}}
| style="border-right: none;" | 38:
| style="border-left: none;" | ORT ||
|-
| style="text-align: center;" | TOB
| style="text-align: center;" | E
| 36 || {{mono|100100}}
| style="border-right: none;" | 39:
| style="border-left: none;" | TOBE ||
|-
| style="text-align: center;" | EO
| style="text-align: center;" | R
| 30 || {{mono|011110}}
| style="border-right: none;" | 40:
| style="border-left: none;" | EOR ||
|-
| style="text-align: center;" | RN
| style="text-align: center;" | O
| 32 || {{mono|100000}}
| style="border-right: none;" | 41:
| style="border-left: none;" | RNO ||
|-
| style="text-align: center;" | OT
| style="text-align: center;" | #
| 34 || {{mono|100010}}
| style="border-right: none;" |
| style="border-left: none;" |
| style="text-align: left;" | # stops the algorithm; send the cur seq
|-
| || || 0 || {{mono|000000}}
| style="border-right: none;" |
| style="border-left: none;" |
| style="text-align: left;" | and the stop code
|-
|}

:Unencoded length = 25 symbols × 5 bits/symbol = 125 bits
:Encoded length = (6 codes × 5 bits/code) + (11 codes × 6 bits/code) = 96 bits.

Using LZW has saved 29 bits out of 125, reducing the message by more than 23%. If the message were longer, then the dictionary words would begin to represent longer and longer sections of text, sending repeated words very compactly.

===Decoding===
To decode an LZW-compressed archive, one needs to know in advance the initial dictionary used, but additional entries can be reconstructed as they are always simply [[concatenation]]s of previous entries.

{| class="wikitable" style="text-align: right; margin-left: auto; margin-right: auto;"
|-
! colspan="2" | Input
! scope="col" width="6em" rowspan="2" | Output Sequence
! colspan="4" | New Dictionary Entry
! rowspan="2" | Comments
|-
! Bits !! Code
! scope="col" width="6em" colspan="2" | Full
! scope="col" width="6em" colspan="2" | Conjecture
|-
| 10100 || 20
| style="text-align: center;" | T
| style="border-right: none;" |
| style="border-left: none;" |
| style="border-right: none;" | 27:
| style="border-left: none;" | T? ||
|-
| 01111 || 15
| style="text-align: center;" | O
| style="border-right: none;" | 27:
| style="border-left: none;" | TO
| style="border-right: none;" | 28:
| style="border-left: none;" | O? ||
|-
| 00010 || 2
| style="text-align: center;" | B
| style="border-right: none;" | 28:
| style="border-left: none;" | OB
| style="border-right: none;" | 29:
| style="border-left: none;" | B? ||
|-
| 00101 || 5
| style="text-align: center;" | E
| style="border-right: none;" | 29:
| style="border-left: none;" | BE
| style="border-right: none;" | 30:
| style="border-left: none;" | E? ||
|-
| 01111 || 15
| style="text-align: center;" | O
| style="border-right: none;" | 30:
| style="border-left: none;" | EO
| style="border-right: none;" | 31:
| style="border-left: none;" | O? ||
|-
| 10010 || 18
| style="text-align: center;" | R
| style="border-right: none;" | 31:
| style="border-left: none;" | OR
| style="border-right: none;" | 32:
| style="border-left: none;" | R?
| style="text-align: left;" | created code 31 (last to fit in 5 bits)
|-
| 001110 || 14
| style="text-align: center;" | N
| style="border-right: none;" | 32:
| style="border-left: none;" | RN
| style="border-right: none;" | 33:
| style="border-left: none;" | N?
| style="text-align: left;" | so start reading input at 6 bits
|-
| 001111 || 15
| style="text-align: center;" | O
| style="border-right: none;" | 33:
| style="border-left: none;" | NO
| style="border-right: none;" | 34:
| style="border-left: none;" | O? ||
|-
| 010100 || 20
| style="text-align: center;" | T
| style="border-right: none;" | 34:
| style="border-left: none;" | OT
| style="border-right: none;" | 35:
| style="border-left: none;" | T? ||
|-
| 011011 || 27
| style="text-align: center;" | TO
| style="border-right: none;" | 35:
| style="border-left: none;" | TT
| style="border-right: none;" | 36:
| style="border-left: none;" | TO? ||
|-
| 011101 || 29
| style="text-align: center;" | BE
| style="border-right: none;" | 36:
| style="border-left: none;" | TOB
| style="border-right: none;" | 37:
| style="border-left: none;" | BE?
| style="text-align: left;" | 36 = TO + 1st symbol (B) of
|-
| 011111 || 31
| style="text-align: center;" | OR
| style="border-right: none;" | 37:
| style="border-left: none;" | BEO
| style="border-right: none;" | 38:
| style="border-left: none;" | OR?
| style="text-align: left;" | next coded sequence received (BE)
|-
| 100100 || 36
| style="text-align: center;" | TOB
| style="border-right: none;" | 38:
| style="border-left: none;" | ORT
| style="border-right: none;" | 39:
| style="border-left: none;" | TOB? ||
|-
| 011110 || 30
| style="text-align: center;" | EO
| style="border-right: none;" | 39:
| style="border-left: none;" | TOBE
| style="border-right: none;" | 40:
| style="border-left: none;" | EO? ||
|-
| 100000 || 32
| style="text-align: center;" | RN
| style="border-right: none;" | 40:
| style="border-left: none;" | EOR
| style="border-right: none;" | 41:
| style="border-left: none;" | RN? ||
|-
| 100010 || 34
| style="text-align: center;" | OT
| style="border-right: none;" | 41:
| style="border-left: none;" | RNO
| style="border-right: none;" | 42:
| style="border-left: none;" | OT? ||
|-
| 000000 || 0
| style="text-align: center;" | #
| style="border-right: none;" | 
| style="border-left: none;" |
| style="border-right: none;" |
| style="border-left: none;" | ||
|-
|}

At each stage, the decoder receives a code X; it looks X up in the table and outputs the sequence χ it codes, and it conjectures χ + ? as the entry the encoder just added – because the encoder emitted X for χ precisely because χ + ? was not in the table, and the encoder goes ahead and adds it.  But what is the missing letter?  It is the first letter in the sequence coded by the ''next'' code Z that the decoder receives.  So the decoder looks up Z, decodes it into the sequence ω and takes the first letter z and tacks it onto the end of χ as the next dictionary entry.

This works as long as the codes received are in the decoder's dictionary, so that they can be decoded into sequences.  What happens if the decoder receives a code Z that is not yet in its dictionary?  Since the decoder is always just one code behind the encoder, Z can be in the encoder's dictionary only if the encoder ''just'' generated it, when emitting the previous code X for χ.  Thus Z codes some ω that is χ + ?, and the decoder can determine the unknown character as follows:

# The decoder sees X and then Z, where X codes the sequence χ and Z codes some unknown sequence ω.
# The decoder knows that the encoder just added Z as a code for χ + some unknown character ''c'', so ω = χ + ''c''.
# Since ''c'' is the first character in the input stream after χ, and since ω is the string appearing immediately after χ, ''c'' must be the first character of the sequence ω.
# Since χ is an initial substring of ω, ''c'' must also be the first character of χ.
# So even though the Z code is not in the table, the decoder is able to infer the unknown sequence and adds χ + (the first character of χ) to the table as the value of Z.

This situation occurs whenever the encoder encounters input of the form ''cScSc'', where ''c'' is a single character, ''S'' is a string and ''cS'' is already in the dictionary, but ''cSc'' is not. The encoder emits the code for ''cS'', putting a new code for ''cSc'' into the dictionary. Next it sees ''cSc'' in the input (starting at the second ''c'' of ''cScSc'') and emits the new code it just inserted.  The argument above shows that whenever the decoder receives a code not in its dictionary, the situation must look like this.

Although input of form ''cScSc'' might seem unlikely, this pattern is fairly common when the input stream is characterized by significant repetition.  In particular, long strings of a single character (which are common in the kinds of images LZW is often used to encode) repeatedly generate patterns of this sort.