Jump to content
Main menu
Main menu
move to sidebar
hide
Navigation
Main page
Recent changes
Random page
Help about MediaWiki
Special pages
Niidae Wiki
Search
Search
Appearance
Create account
Log in
Personal tools
Create account
Log in
Pages for logged out editors
learn more
Contributions
Talk
Editing
String (computer science)
(section)
Page
Discussion
English
Read
Edit
View history
Tools
Tools
move to sidebar
hide
Actions
Read
Edit
View history
General
What links here
Related changes
Page information
Appearance
move to sidebar
hide
Warning:
You are not logged in. Your IP address will be publicly visible if you make any edits. If you
log in
or
create an account
, your edits will be attributed to your username, along with other benefits.
Anti-spam check. Do
not
fill this in!
=== Character encoding === String datatypes have historically allocated one byte per character, and, although the exact character set varied by region, character encodings were similar enough that programmers could often get away with ignoring this, since characters a program treated specially (such as period and space and comma) were in the same place in all the encodings a program would encounter. These character sets were typically based on [[ASCII]] or [[EBCDIC]]. If text in one encoding was displayed on a system using a different encoding, text was often [[mojibake|mangled]], though often somewhat readable and some computer users learned to read the mangled text. [[Logograph]]ic languages such as [[Chinese language|Chinese]], [[Japanese language|Japanese]], and [[Korean language|Korean]] (known collectively as [[CJK characters|CJK]]) need far more than 256 characters (the limit of a one 8-bit byte per-character encoding) for reasonable representation. The normal solutions involved keeping single-byte representations for ASCII and using two-byte representations for CJK [[ideographs]]. Use of these with existing code led to problems with matching and cutting of strings, the severity of which depended on how the character encoding was designed. Some encodings such as the [[Extended Unix Code|EUC]] family guarantee that a byte value in the ASCII range will represent only that ASCII character, making the encoding safe for systems that use those characters as field separators. Other encodings such as [[ISO-2022]] and [[Shift-JIS]] do not make such guarantees, making matching on byte codes unsafe. These encodings also were not "self-synchronizing", so that locating character boundaries required backing up to the start of a string, and pasting two strings together could result in corruption of the second string. [[Unicode]] has simplified the picture somewhat. Most programming languages now have a datatype for Unicode strings. Unicode's preferred byte stream format [[UTF-8]] is designed not to have the problems described above for older multibyte encodings. UTF-8, UTF-16 and [[UTF-32]] require the programmer to know that the fixed-size code units are different from the "characters", the main difficulty currently is incorrectly designed APIs that attempt to hide this difference (UTF-32 does make ''code points'' fixed-sized, but these are not "characters" due to composing codes).
Summary:
Please note that all contributions to Niidae Wiki may be edited, altered, or removed by other contributors. If you do not want your writing to be edited mercilessly, then do not submit it here.
You are also promising us that you wrote this yourself, or copied it from a public domain or similar free resource (see
Encyclopedia:Copyrights
for details).
Do not submit copyrighted work without permission!
Cancel
Editing help
(opens in new window)
Search
Search
Editing
String (computer science)
(section)
Add topic