==Encoding==
===Character encodings===
{{Unreferenced section|date=December 2023}}
{{main|Character encoding}}
Before the early 1960s, computers were mainly used for number-crunching rather than for text, and memory was extremely expensive. Computers often allocated only 6 bits for each character, permitting only 64 characters. Assigning codes for A-Z, a-z, and 0-9 would use 62 of them and leave only 2 for everything else, which is nowhere near enough, so most computers opted not to support lower-case letters. Thus, early text projects such as [[Roberto Busa]]'s [[Index Thomisticus]], the [[Brown Corpus]], and others had to resort to conventions such as keying an asterisk before letters intended to be upper-case.

[[Fred Brooks]] of [[IBM]] argued strongly for going to 8-bit bytes, because someday people might want to process text, and won. Although IBM used [[EBCDIC]], most text from then on came to be encoded in [[ASCII]], using values from 0 to 31 for (non-printing) [[control characters]], and values from 32 to 126 for graphic characters such as letters, digits, and punctuation (value 127 is the delete control character). Most machines stored characters in 8 bits rather than 7, ignoring the remaining bit or using it as a [[checksum]].

The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign ("$") was not as useful in England, and the accented characters used in Spanish, French, German, Portuguese, Italian and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed, often reassigning control characters or using values in the range from 128 to 255. Using values of 128 and above conflicts with using the 8th bit as a checksum, but the checksum usage gradually died out.

These additional characters were encoded differently in different countries, making texts impossible to decode without figuring out the originator's rules.
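
The ambiguity of values in the range 128 to 255 can be illustrated with a minimal Python sketch. It decodes a single byte under several legacy 8-bit encodings shipped with Python's standard codecs. The byte value (0xE9) and the particular encodings are arbitrary choices made for illustration only.

```python
# Illustrative sketch: one byte, several legacy 8-bit encodings.
# The byte value and the encoding list are arbitrary demonstration choices.
ENCODINGS = ["latin-1", "cp437", "mac_roman", "koi8_r"]


def decode_under_each(raw: bytes, encodings=ENCODINGS) -> dict:
    """Return a mapping of encoding name -> text obtained by decoding raw with it."""
    return {name: raw.decode(name) for name in encodings}


# Print what the same byte turns into under each set of rules.
for name, text in decode_under_each(b"\xe9").items():
    print(f"{name:10} -> {text!r} (U+{ord(text):04X})")
```

Each encoding maps the same byte to a different character, so a reader who guesses the wrong rules sees the wrong text.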
For instance, a browser might display '''¬A''' rather than '''`''' if it tried to interpret one character set as another.

The International Organization for Standardization ([[International Organization for Standardization|ISO]]) eventually developed several [[code pages]] under [[ISO 8859]] to accommodate various languages. The first of these ([[ISO 8859-1]]) is also known as "Latin-1", and covers the needs of most, but not all, European languages that use Latin-based characters; there was not quite enough room to cover them all. [[ISO 2022]] provided conventions for "switching" between different character sets in mid-file. Many other organisations developed variations on these, and for many years Windows and Macintosh computers used incompatible variations.

The text-encoding situation became more and more complex, leading to efforts by ISO and by the [[Unicode Consortium]] to develop a single, unified character encoding that could cover all known languages. After some conflict,<ref>{{Cite web |title=ISO/Unicode Merger: Ed Hart Memo |url=https://www.unicode.org/history/hartmemo.html |access-date=2024-10-21 |website=www.unicode.org}}</ref> these efforts were unified. [[Unicode]] currently allows for 1,114,112 code values, and assigns codes covering nearly all modern writing systems, many historical ones, and many non-linguistic characters such as printer's [[dingbat]]s and mathematical symbols.

Text is considered plain text regardless of its encoding. To properly understand or process it, the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data.

Perhaps the most common way of explicitly stating the specific encoding of plain text is with a [[MIME type]].
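
How a recipient might read the encoding out of a MIME type can be sketched in Python with the standard library's <code>email.message.Message</code> class. The helper names <code>parse_content_type</code> and <code>decode_body</code>, the sample header values, and the <code>fallback</code> parameter are hypothetical and exist only for this illustration; real mail and HTTP software applies more elaborate rules.

```python
from email.message import Message


def parse_content_type(header_value: str):
    """Split a Content-Type header value into (media type, charset or None)."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset parameter in lower case,
    # or None when the header does not declare one.
    return msg.get_content_type(), msg.get_content_charset()


def decode_body(body: bytes, header_value: str, fallback: str = "ascii") -> str:
    """Decode body with the declared charset, else with a caller-chosen fallback.

    The fallback is an assumption of this sketch, not a rule taken from any standard.
    """
    _, charset = parse_content_type(header_value)
    return body.decode(charset or fallback)


print(parse_content_type("text/html; charset=UTF-8"))  # ('text/html', 'utf-8')
print(parse_content_type("text/plain"))                # ('text/plain', None)
```

When no charset is declared, the sketch leaves the choice to the caller, which is the situation in which some applications resort to guessing.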
For email and [[HTTP]], the default MIME type is "[[text/plain]]", meaning plain text without markup. Another MIME type often used in both email and HTTP is "[[text/html]]; charset=UTF-8", meaning plain text represented using the UTF-8 character encoding with HTML markup. Another common MIME type is "application/json", meaning plain text represented using the UTF-8 character encoding with [[JSON]] markup.

When a document is received without any explicit indication of the character encoding, some applications use [[charset detection]] to attempt to guess what encoding was used.

===Control codes===
{{main|C0 and C1 control codes}}
[[ASCII]] reserves the first 32 codes (numbers 0–31 decimal) for [[control character]]s known as the "C0 set": codes originally intended not to represent printable information, but rather to control devices (such as [[Computer printer|printers]]) that make use of ASCII, or to provide [[Metadata|meta-information]] about data streams such as those stored on magnetic tape. They include common characters like the [[newline]] and the [[tab character]].

In 8-bit character sets such as [[ISO/IEC 8859-1|Latin-1]] and the other [[ISO/IEC 8859|ISO 8859]] sets, the first 32 characters of the "upper half" (128 to 159) are also control codes, known as the "C1 set". They are rarely used directly; when they turn up in documents which are ostensibly in an ISO 8859 encoding, their code positions generally refer instead to the characters at that position in a proprietary, system-specific encoding, such as [[Windows-1252]] or [[Mac OS Roman]], which uses those codes to provide additional graphic characters.
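
The C0 and C1 ranges, and the way Windows-1252 reuses the C1 positions, can be sketched in Python as follows. The function name <code>control_set</code> and the sample byte 0x93 are hypothetical choices for this illustration; the ranges themselves are the ones stated above.

```python
def control_set(code_point: int):
    """Return "C0", "C1", or None, using the ranges 0-31 and 128-159."""
    if 0 <= code_point <= 31:
        return "C0"
    if 128 <= code_point <= 159:
        return "C1"
    return None


raw = b"\x93"
as_latin1 = raw.decode("latin-1")  # U+0093, a C1 control code
as_cp1252 = raw.decode("cp1252")   # U+201C, a left double quotation mark

print(control_set(ord("\n")))       # C0
print(control_set(ord(as_latin1)))  # C1
print(control_set(ord(as_cp1252)))  # None
```

The same byte is a control code under Latin-1 but a graphic character under Windows-1252, which is why such bytes in a nominally ISO 8859 document usually indicate the latter.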
{{main|Unicode control characters}}
[[Unicode]] defines additional control characters, including [[bi-directional text]] direction override characters (used to explicitly mark right-to-left writing inside left-to-right writing and the other way around) and [[Variant form (Unicode)|variation selectors]] to select alternate forms of [[CJK ideographs]], [[emoji]] and other characters.
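
As a rough sketch of how such characters can be located in a string, the Python example below reports direction override characters and variation selectors by position and Unicode name. The function names are hypothetical. The code point values used (U+202D and U+202E for the overrides, U+FE00 to U+FE0F and U+E0100 to U+E01EF for variation selectors) come from the Unicode character database rather than from the text above.

```python
import unicodedata

# LEFT-TO-RIGHT OVERRIDE and RIGHT-TO-LEFT OVERRIDE
BIDI_OVERRIDES = {"\u202d", "\u202e"}


def is_variation_selector(ch: str) -> bool:
    """True for characters in the two variation selector blocks."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def find_special(text: str):
    """Return (index, Unicode name) for each override or variation selector in text."""
    return [
        (i, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ch in BIDI_OVERRIDES or is_variation_selector(ch)
    ]


print(find_special("a\u202eb"))      # [(1, 'RIGHT-TO-LEFT OVERRIDE')]
print(find_special("\u2764\ufe0f"))  # [(1, 'VARIATION SELECTOR-16')]
```

These characters are invisible when rendered, so tools that need to inspect or sanitize plain text typically have to search for them explicitly.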