
==LZ77==
LZ77 algorithms achieve compression by replacing repeated occurrences of data with references to a single copy of that data existing earlier in the uncompressed data stream. A match is encoded by a pair of numbers called a ''length-distance pair'', which is equivalent to the statement "each of the next ''length'' characters is equal to the characters exactly ''distance'' characters behind it in the uncompressed stream". (The ''distance'' is sometimes called the ''offset'' instead.)

To spot matches, the encoder must keep track of some amount of the most recent data, such as the last 2&nbsp;[[kilobyte|KB]], 4&nbsp;KB, or 32&nbsp;KB. The structure in which this data is held is called a ''sliding window'', which is why LZ77 is sometimes called ''sliding-window compression''. The encoder needs this data to look for matches, and the decoder needs it to interpret the matches the encoder refers to. The larger the sliding window is, the farther back the encoder can search for matches to reference.

It is not only acceptable but frequently useful to allow length-distance pairs to specify a length that actually exceeds the distance. As a copy command, this is puzzling: "Go back ''four'' characters and copy ''ten'' characters from that position into the current position". How can ten characters be copied over when only four of them are actually in the buffer? Tackling one byte at a time, there is no problem serving this request, because as a byte is copied over, it may be fed again as input to the copy command. When the copy-from position makes it to the initial destination position, it is consequently fed data that was pasted from the ''beginning'' of the copy-from position. The operation is thus equivalent to the statement "copy the data you were given and repetitively paste it until it fits". As this type of pair repeats a single copy of data multiple times, it can be used to incorporate a flexible and easy form of [[run-length encoding]].
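
The byte-at-a-time copy described above can be sketched in a few lines of Python. This is only an illustration: the name <code>copy_match</code> and the use of a <code>bytearray</code> as the output buffer are assumptions made for the example, not part of any particular LZ77 implementation.

```python
def copy_match(out: bytearray, distance: int, length: int) -> None:
    """Append `length` bytes, each copied from `distance` bytes behind the write position."""
    if not 0 < distance <= len(out):
        raise ValueError("distance must point inside the data already decoded")
    for _ in range(length):
        # `out` grows on every iteration, so out[-distance] may be a byte
        # that was written earlier in this same copy.
        out.append(out[-distance])


buf = bytearray(b"abcd")
copy_match(buf, distance=4, length=10)  # "go back four, copy ten"
print(buf.decode())  # abcdabcdabcdab
```

With <code>distance=1</code> the same routine repeats the last byte <code>length</code> times, which is the run-length behaviour mentioned above.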

Another way to see this is as follows. While encoding, the search pointer can keep finding matches past the end of the search window only if every character from the start of the match at offset ''D'' to the end of the search window has matched the input. These previously seen characters make up a single run unit of length ''L''<sub>R</sub>, which must equal ''D''. As the search pointer moves past the end of the search window, the search and input pointers stay in sync and keep matching characters for as long as the run pattern repeats in the input. When the pattern is interrupted, ''L'' characters have been matched in total, with ''L'' > ''D'', and the code is [''D'', ''L'', ''c''].

Upon decoding [''D'', ''L'', ''c''], again, ''D'' = ''L''<sub>R</sub>.  When the first ''L''<sub>R</sub> characters are read to the output, this corresponds to a single run unit appended to the output buffer.  At this point, the read pointer could be thought of as only needing to return int(''L''/''L''<sub>R</sub>) + (1 if ''L'' mod ''L''<sub>R</sub> ≠ 0) times to the start of that single buffered run unit, read ''L''<sub>R</sub> characters (or maybe fewer on the last return), and repeat until a total of ''L'' characters are read. But mirroring the encoding process, since the pattern is repetitive, the read pointer need only trail in sync with the write pointer by a fixed distance equal to the run length ''L''<sub>R</sub> until ''L'' characters have been copied to output in total.
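
The run-unit view of decoding can be illustrated as follows. This sketch assumes the run unit is the last ''D'' bytes already in the output buffer and reads it ceil(''L''/''L''<sub>R</sub>) times in total, the last read possibly being partial. The function name <code>copy_by_run_units</code> is hypothetical.

```python
def copy_by_run_units(out: bytearray, distance: int, length: int) -> None:
    """Append `length` bytes by repeating the buffered run unit of size `distance`."""
    if not 0 < distance <= len(out):
        raise ValueError("distance must point inside the data already decoded")
    unit = bytes(out[-distance:])       # the single run unit, L_R = D
    reads = -(-length // distance)      # ceil(L / L_R) reads of the unit
    out.extend((unit * reads)[:length])  # the last read may be partial


buf = bytearray(b"abcd")
copy_by_run_units(buf, distance=4, length=10)
print(buf.decode())  # abcdabcdabcdab
```

The result is the same as letting the read pointer trail the write pointer by a fixed distance. Copying whole units merely trades the per-byte loop for a few bulk operations.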

Given the above, and especially if runs are expected to dominate the data, the window search should begin at the end of the window and proceed backwards. Run patterns, if present, are then found first and allow the search to end early: unconditionally once the maximum match length is reached, or at the implementer's discretion once a sufficiently long match is found. A further reason is that the most recent data may simply correlate better with the upcoming input.
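
A minimal sketch of such a backward search is given below. It tries the smallest distances (the most recent data) first and stops as soon as the longest possible match has been found. The function name and the parameters <code>window_size</code> and <code>max_length</code> are assumptions made for illustration, and the brute-force scan is not how production encoders search.

```python
from typing import Tuple


def find_longest_match(data: bytes, pos: int, window_size: int,
                       max_length: int) -> Tuple[int, int]:
    """Return (distance, length) of the longest match for data[pos:], or (0, 0)."""
    best_distance, best_length = 0, 0
    limit = min(max_length, len(data) - pos)
    # Distance 1 is the most recent byte; larger distances reach further back.
    for distance in range(1, min(window_size, pos) + 1):
        start = pos - distance
        length = 0
        # start + length may run past pos, which is what lets a match
        # be longer than its distance.
        while length < limit and data[start + length] == data[pos + length]:
            length += 1
        if length > best_length:
            best_distance, best_length = distance, length
            if best_length == limit:
                break  # no longer match is possible, so stop searching
    return best_distance, best_length


print(find_longest_match(b"abcabcabcx", pos=3, window_size=32, max_length=15))  # (3, 6)
```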

===Pseudocode===
The following pseudocode reproduces the LZ77 sliding-window compression algorithm.

 '''while''' input is not empty '''do'''
     match := longest repeated occurrence of input that begins in window
     
     '''if''' match exists '''then'''
         d := distance to start of match
         l := length of match
         c := char following match in input
     '''else'''
         d := 0
         l := 0
         c := first char of input
     '''end if'''
     
     '''output''' (d, l, c)
     
     discard ''l'' + 1 chars from front of window
     s := pop ''l'' + 1 chars from front of input
     append s to back of window
 '''repeat'''
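
The pseudocode above can be turned into a small runnable sketch. The version below is an illustration under stated assumptions, not a reference implementation: it works on Python strings, uses a brute-force search, emits plain <code>(d, l, c)</code> tuples rather than a bit-packed stream, and always holds back one character so that <code>c</code> exists even when a match reaches the end of the input. The names <code>lz77_encode</code> and <code>lz77_decode</code> are hypothetical.

```python
from typing import List, Tuple

Triple = Tuple[int, int, str]  # (distance, length, next character)


def lz77_encode(text: str, window_size: int = 4096) -> List[Triple]:
    triples = []
    pos = 0
    while pos < len(text):
        best_d, best_l = 0, 0
        # Keep one character in reserve so that `c` always exists.
        limit = len(text) - pos - 1
        for d in range(1, min(window_size, pos) + 1):
            l = 0
            while l < limit and text[pos - d + l] == text[pos + l]:
                l += 1
            if l > best_l:
                best_d, best_l = d, l
        triples.append((best_d, best_l, text[pos + best_l]))
        pos += best_l + 1  # consume the match plus the literal
    return triples


def lz77_decode(triples: List[Triple]) -> str:
    out: List[str] = []
    for d, l, c in triples:
        for _ in range(l):
            out.append(out[-d])  # may re-read characters copied just before
        out.append(c)
    return "".join(out)


codes = lz77_encode("abababab!")
print(codes)  # [(0, 0, 'a'), (0, 0, 'b'), (2, 6, '!')]
assert lz77_decode(codes) == "abababab!"
```

The last triple has a length (6) greater than its distance (2), so decoding it relies on the overlapping copy described earlier.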

===Implementations===
Although all LZ77 algorithms by definition work on the same basic principle, they vary widely in how they encode their compressed data: in the numerical ranges allowed for a length–distance pair, in the number of bits a length–distance pair consumes, and in how length–distance pairs are distinguished from ''literals'' (raw data encoded as itself, rather than as part of a length–distance pair). A few examples:
* The algorithm illustrated in Lempel and Ziv's original 1977 article outputs all its data three values at a time: the length and distance of the longest match found in the buffer, and the literal that followed that match. If two successive characters in the input stream could be encoded only as literals, the length of the length–distance pair would be 0.
* [[Lempel–Ziv–Storer–Szymanski|LZSS]] improves on LZ77 by using a 1-bit flag to indicate whether the next chunk of data is a literal or a length–distance pair, and using literals if a length–distance pair would be longer.
* In the PalmDoc format, a length–distance pair is always encoded by a two-byte sequence. Of the 16 bits that make up these two bytes, 11 bits go to encoding the distance, 3 go to encoding the length, and the remaining two are used to make sure the decoder can identify the first byte as the beginning of such a two-byte sequence.
* In the implementation used for many games by [[Electronic Arts]],<ref>{{cite web
| url=http://wiki.niotso.org/QFS_compression
| title=QFS Compression (RefPack)
| work=Niotso Wiki
| access-date=2014-11-09}}</ref> the size in bytes of a length–distance pair can be specified inside the first byte of the length–distance pair itself; depending on whether the first byte begins with a 0, 10, 110, or 111 (when read in [[Endianness|big-endian]] bit orientation), the length of the entire length–distance pair can be 1 to 4 bytes.
* {{As of|2008}}, the most popular LZ77-based compression method is [[DEFLATE]]; it combines LZSS with [[Huffman coding]].<ref>{{cite web
| url=https://www.zlib.net/feldspar.html
| title=An Explanation of the Deflate Algorithm
| first=Antaeus
| last=Feldspar
| date=23 August 1997
| work=comp.compression [[Usenet newsgroup|newsgroup]]
| publisher=zlib.net
| access-date=2014-11-09}}</ref> Literals, lengths, and a symbol to indicate the end of the current block of data are all placed together into one alphabet. Distances can be safely placed into a separate alphabet; because a distance only occurs just after a length, it cannot be mistaken for another kind of symbol or vice versa.
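
As an illustration of how a fixed-width length–distance pair can be packed, the sketch below follows the 2 + 11 + 3 bit split described for PalmDoc above. Only the bit counts come from the text. The marker value <code>0b10</code>, the field order (marker, distance, length), the big-endian byte order, and the function names are assumptions made for the example, and the 3-bit length field is stored as a raw value without any bias a real format might apply.

```python
from typing import Tuple

MARKER = 0b10  # assumed 2-bit marker identifying a two-byte pair


def pack_pair(distance: int, length_field: int) -> bytes:
    """Pack an 11-bit distance and a 3-bit length field into two bytes."""
    if not 0 <= distance < (1 << 11):
        raise ValueError("distance must fit in 11 bits")
    if not 0 <= length_field < (1 << 3):
        raise ValueError("length field must fit in 3 bits")
    word = (MARKER << 14) | (distance << 3) | length_field
    return word.to_bytes(2, "big")


def unpack_pair(pair: bytes) -> Tuple[int, int]:
    """Return (distance, length_field) from a two-byte pair."""
    word = int.from_bytes(pair[:2], "big")
    if word >> 14 != MARKER:
        raise ValueError("first byte does not start a length-distance pair")
    return (word >> 3) & 0x7FF, word & 0x7


packed = pack_pair(distance=5, length_field=2)
print(packed.hex())          # 802a
print(unpack_pair(packed))   # (5, 2)
```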