== <span class="anchor" id="Tokenization"></span><span class="anchor" id="Token"></span>Lexical token and lexical tokenization==
{{Distinguish|Large language model#Tokenization|tokenization (data security)}}
<!--[[Lexical token]] and [[Token (parser)]] and [[Tokenize]] and [[Tokenizing]] redirect here ([[MOS:HEAD]])-->

A ''lexical token'' is a [[String (computer science)|string]] with an assigned and thus identified meaning, in contrast to the probabilistic token used in [[large language model]]s. A lexical token consists of a ''token name'' and an optional ''token value''. The token name is a category of a rule-based lexical unit.<ref name="auto">page 111, "Compilers Principles, Techniques, & Tools, 2nd Ed." (WorldCat) by Aho, Lam, Sethi and Ullman, as quoted in https://stackoverflow.com/questions/14954721/what-is-the-difference-between-token-and-lexeme</ref>

{|class="wikitable"
|+ Examples of common tokens
! Token name (Lexical category) !! Explanation !! Sample token values
|-
| [[Identifier (computer languages)|identifier]] || Names assigned by the programmer. || {{code|x}}, {{code|color}}, {{code|UP}}
|-
| [[Reserved word|keyword]] || Reserved words of the language. || {{code|2=c|if}}, {{code|2=c|while}}, {{code|2=c|return}}
|-
| [[delimiter|separator/punctuator]] || Punctuation characters and paired delimiters. || <code>}</code>, <code>(</code>, <code>;</code>
|-
| [[Operator (computer programming)|operator]] || Symbols that operate on arguments and produce results. || {{code|2=c|1=+}}, {{code|2=c|1=<}}, {{code|2=c|1==}}
|-
| [[Literal (computer programming)|literal]] || Numeric, logical, textual, and reference literals. || {{code|2=c|true}}, {{code|2=c|6.02e23}}, {{code|2=c|"music"}}
|-
| [[Comment (computer programming)|comment]] || Line or block comments. Usually discarded. || {{code|2=c|/* Retrieves user data */}}, {{code|2=c|// must be negative}}
|-
| [[Whitespace character|whitespace]] || Groups of non-printable characters. Usually discarded. ||
|}

Consider this expression in the [[C (programming language)|C]] programming language:
: {{code|2=c|1=x = a + b * 2;}}

The lexical analysis of this expression yields the following sequence of tokens:
: <code>[(identifier, x), (operator, =), (identifier, a), (operator, +), (identifier, b), (operator, *), (literal, 2), (separator, ;)]</code>

A token name is what might be termed a [[part of speech]] in linguistics.

''Lexical tokenization'' is the conversion of raw text into (semantically or syntactically) meaningful lexical tokens, belonging to categories defined by a "lexer" program, such as identifiers, operators, grouping symbols, and data types. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of [[parsing]] input.

For example, in the text [[String (computer science)|string]]:
: <code>The quick brown fox jumps over the lazy dog</code>

the string is not implicitly segmented on spaces, as a [[natural language]] speaker would do. The raw input, the 43 characters, must be explicitly split into the 9 tokens with a given space delimiter (i.e., matching the string <code>" "</code> or [[regular expression]] <code>/\s{1}/</code>).

When a token class represents more than one possible lexeme, the lexer often saves enough information to reproduce the original lexeme, so that it can be used in [[Semantic analysis (compilers)|semantic analysis]]. The parser typically retrieves this information from the lexer and stores it in the [[abstract syntax tree]]. This is necessary to avoid information loss in the case where numbers may also be valid identifiers.

Tokens are identified based on the specific rules of the lexer. Some methods used to identify tokens include [[regular expression]]s, specific sequences of characters termed a [[Flag (computing)|flag]], specific separating characters called [[delimiter]]s, and explicit definition by a dictionary.
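The tokenization shown above can be sketched with a small regex-driven lexer. The following is a minimal illustration in Python, not production lexer code: the token names follow the table of common tokens, and the rule set is an assumption that covers only this example.

```python
import re

# Ordered (token name, pattern) rules; earlier rules win on ties.
TOKEN_SPEC = [
    ("literal",    r"\d+(?:\.\d+)?"),    # numeric literals such as 2 or 6.02
    ("identifier", r"[A-Za-z_]\w*"),     # names assigned by the programmer
    ("operator",   r"[+\-*/=<>]"),       # single-character operators
    ("separator",  r"[;,(){}]"),         # punctuation and paired delimiters
    ("whitespace", r"\s+"),              # discarded, as in the table above
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Return (token name, token value) pairs for the input string."""
    tokens = []
    for match in MASTER.finditer(text):
        name = match.lastgroup
        if name != "whitespace":  # whitespace carries no meaning here
            tokens.append((name, match.group()))
    return tokens

print(tokenize("x = a + b * 2;"))
# [('identifier', 'x'), ('operator', '='), ('identifier', 'a'), ('operator', '+'),
#  ('identifier', 'b'), ('operator', '*'), ('literal', '2'), ('separator', ';')]
```

Running this on the C expression above reproduces the token sequence given earlier; note that the whitespace rule matches but its lexemes are dropped rather than emitted.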
Special characters, including punctuation characters, are commonly used by lexers to identify tokens because of their natural use in written and programming languages. A lexical analyzer generally does nothing with combinations of tokens, a task left for a [[parser]]. For example, a typical lexical analyzer recognizes parentheses as tokens but does nothing to ensure that each "(" is matched with a ")".

When a lexer feeds tokens to the parser, the representation used is typically an [[enumerated type]], in which each token category is mapped to a number. For example, "Identifier" can be represented with 0, "Assignment operator" with 1, "Addition operator" with 2, etc.

Tokens are often defined by [[regular expression]]s, which are understood by a lexical analyzer generator such as [[lex (software)|lex]], or by hand-coded equivalent [[finite-state automata]]. The lexical analyzer (generated automatically by a tool like lex or hand-crafted) reads in a stream of characters, identifies the [[#Lexeme|lexemes]] in the stream, and categorizes them into tokens. This is termed ''tokenizing''. If the lexer finds an invalid token, it will report an error.

Following tokenizing is [[parsing]]. From there, the interpreted data may be loaded into data structures for general use, interpretation, or [[compiling]].
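The hand-coded alternative mentioned above can be sketched as a small finite-state scanner. This is an illustrative Python sketch under assumed token names (matching the table of common tokens), not the output of lex; it distinguishes identifiers from integer literals and reports an error on any invalid character.

```python
def scan(text):
    """A tiny hand-coded scanner: a finite-state automaton that separates
    identifiers from integer literals and reports invalid characters."""
    tokens, i = [], 0
    while i < len(text):
        c = text[i]
        if c.isspace():                   # skip whitespace between lexemes
            i += 1
        elif c.isalpha() or c == "_":     # identifier state: letters, digits, _
            j = i
            while j < len(text) and (text[j].isalnum() or text[j] == "_"):
                j += 1
            tokens.append(("identifier", text[i:j]))
            i = j
        elif c.isdigit():                 # integer-literal state: digits only
            j = i
            while j < len(text) and text[j].isdigit():
                j += 1
            tokens.append(("literal", text[i:j]))
            i = j
        else:                             # invalid token: report an error
            raise ValueError(f"invalid character {c!r} at position {i}")
    return tokens

print(scan("count 42 total"))
# [('identifier', 'count'), ('literal', '42'), ('identifier', 'total')]
```

Each branch of the loop corresponds to a state of the automaton: the scanner stays in the identifier or literal state while the character class continues, then emits the accumulated lexeme as a token.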