Control character

In computing and telecommunication, a control character or non-printing character is a code point (a number) in a character set, that does not represent a written symbol. They are used as in-band signaling to cause effects other than the addition of a symbol to the text. All other characters are mainly printing, printable, or graphic characters, except perhaps for the "space" character (see ASCII printable characters).

All entries in the ASCII table below code 32 (technically the C0 control code set) are of this kind, including CR and LF used to separate lines of text. The code 127 (DEL) is also a control character. Extended ASCII sets defined by ISO 8859 added the codes 128 through 159 as control characters, this was primarily done so that if the high bit was stripped it would not change a printing character to a C0 control code, but there have been some assignments here, in particular NEL. This second set is called the C1 set.

These 65 control codes were carried over to Unicode. Unicode added more characters that could be considered controls, but it makes a distinction between these "Formatting characters" (such as the Zero-width non-joiner), and the 65 Control characters.

The Extended Binary Coded Decimal Interchange Code (EBCDIC) character set contains 65 control codes, including all of the ASCII control codes as well as additional codes which are mostly used to control IBM peripherals.

History

Procedural signs in Morse code are a form of control character.

A form of control characters were introduced in the 1870 Baudot code: NUL and DEL. The 1901 Murray code added the carriage return (CR) and line feed (LF), and other versions of the Baudot code included other control characters.

The bell character (BEL), which rang a bell to alert operators, was also an early teletype control character.

Control characters have also been called "format effectors".

In ASCII

The control characters in ASCII still in common use include:

0 (null, NUL, \0, ^@), originally intended to be an ignored character, but now used by many programming languages including C to mark the end of a string.
7 (bell, BEL, \a, ^G), which may cause the device to emit a warning such as a bell or beep sound or the screen flashing.
8 (backspace, BS, \b, ^H), may overprint the previous character.
9 (horizontal tab, HT, \t, ^I), moves the printing position right to the next tab stop.
10 (line feed, LF, \n, ^J), moves the print head down one line, or to the left edge and down. Used as the end of line marker in most UNIX systems and variants.
11 (vertical tab, VT, \v, ^K), vertical tabulation.
12 (form feed, FF, \f, ^L), to cause a printer to eject paper to the top of the next page, or a video terminal to clear the screen.
13 (carriage return, CR, \r, ^M), moves the printing position to the start of the line, allowing overprinting. Used as the end of line marker in Classic Mac OS, OS-9, FLEX (and variants). A CR+LF pair is used by CP/M-80 and its derivatives including DOS and Windows, and by Application Layer protocols such as FTP, SMTP, and HTTP.
26 (Control-Z, SUB, EOF, ^Z). Acts as an end-of-file for the Windows text-mode file i/o.
27 (escape, ESC, \e (GCC only), ^[). Introduces an escape sequence.

You may often see control characters described as doing something when the user inputs them, such as code 3 (End-of-Text character, ETX, ^C) to interrupt the running process, or code 4 (End of transmission, EOT, ^D), used to end text input or to exit a Unix shell. These uses usually have little to do with their use when they are in text being output, and on modern systems usually do not involve the transmission of the code number at all (instead the program gets the fact that the user is holding down the Ctrl key and pushing the key marked with a 'C').

There were quite a few control characters defined (33 in ASCII, and the ECMA-48 standard adds 32 more). This was because early terminals had very primitive mechanical or electrical controls that made any kind of state-remembering api quite expensive to implement, thus a different code for each and every function looked like a requirement. It quickly became possible and inexpensive to interpret sequences of codes to perform a function, and device makers found a way to send hundreds of device instructions. Specifically, they used ASCII code 27 (escape), followed by a series of characters called a "control sequence" or "escape sequence". The mechanism was invented by Bob Bemer, the father of ASCII. For example, the sequence of code 27, followed by the printable characters "[2;10H", would cause a DEC VT-102 terminal to move its cursor to the 10th cell of the 2nd line of the screen. Several standards exist for these sequences, notably ANSI X3.64. But the number of non-standard variations in use is large, especially among printers, where technology has advanced far faster than any standards body can possibly keep up with.

In Unicode

In Unicode, "Control-characters" are U+0000—U+001F (C0 controls), U+007F (delete), and U+0080—U+009F (C1 controls). Their General Category is "Cc". Formatting codes are distinct, in General Category "Cf". The Cc control characters have no Name in Unicode, but are given labels such as "<control-001A>" instead.^[2]

Display

There are a number of techniques to display non-printing characters, which may be illustrated with the bell character in ASCII encoding:

Code point: decimal 7, hexadecimal 0x07
An abbreviation, often three capital letters: BEL
A special character condensing the abbreviation: Unicode U+2407 (␇), "symbol for bell"
An ISO 2047 graphical representation: Unicode U+237E (⍾), "graphic for bell"
Caret notation in ASCII, where code point 00xxxxx is represented as a caret followed by the capital letter at code point 10xxxxx: ^G
An escape sequence, as in C/C++ character string codes: \a, \007, \x07, etc.

How control characters map to keyboards

ASCII-based keyboards have a key labelled "Control", "Ctrl", or (rarely) "Cntl" which is used much like a shift key, being pressed in combination with another letter or symbol key. In one implementation, the control key generates the code 64 places below the code for the (generally) uppercase letter it is pressed in combination with (i.e., subtract 64 from ASCII code value in decimal of the (generally) uppercase letter). The other implementation is to take the ASCII code produced by the key and bitwise AND it with 31, forcing bits 6 and 7 to zero. For example, pressing "control" and the letter "g" or "G" (code 107 in octal or 71 in base 10, which is 01000111 in binary, produces the code 7 (Bell, 7 in base 10, or 00000111 in binary). The NULL character (code 0) is represented by Ctrl-@, "@" being the code immediately before "A" in the ASCII character set. For convenience, a lot of terminals accept Ctrl-Space as an alias for Ctrl-@. In either case, this produces one of the 32 ASCII control codes between 0 and 31. This approach is not able to represent the DEL character because of its value (code 127), but Ctrl-? is often used for this character, as subtracting 64 from a '?' gives −1, which if masked to 7 bits is 127.^[3]

When the control key is held down, letter keys produce the same control characters regardless of the state of the shift or caps lock keys. In other words, it does not matter whether the key would have produced an upper-case or a lower-case letter. The interpretation of the control key with the space, graphics character, and digit keys (ASCII codes 32 to 63) vary between systems. Some will produce the same character code as if the control key were not held down. Other systems translate these keys into control characters when the control key is held down. The interpretation of the control key with non-ASCII ("foreign") keys also varies between systems.

Control characters are often rendered into a printable form known as caret notation by printing a caret (^) and then the ASCII character that has a value of the control character plus 64. Control characters generated using letter keys are thus displayed with the upper-case form of the letter. For example, ^G represents code 7, which is generated by pressing the G key when the control key is held down.

Keyboards also typically have a few single keys which produce control character codes. For example, the key labelled "Backspace" typically produces code 8, "Tab" code 9, "Enter" or "Return" code 13 (though some keyboards might produce code 10 for "Enter").

Many keyboards include keys that do not correspond to any ASCII printable or control character, for example cursor control arrows and word processing functions. The associated keypresses are communicated to computer programs by one of four methods: appropriating otherwise unused control characters; using some encoding other than ASCII; using multi-character control sequences; or using an additional mechanism outside of generating characters. "Dumb" computer terminals typically use control sequences. Keyboards attached to stand-alone personal computers made in the 1980s typically use one (or both) of the first two methods. Modern computer keyboards generate scancodes that identify the specific physical keys that are pressed; computer software then determines how to handle the keys that are pressed, including any of the four methods described above.

The design purpose

The control characters were designed to fall into a few groups: printing and display control, data structuring, transmission control, and miscellaneous.

Printing and display control

Printing control characters were first used to control the physical mechanism of printers, the earliest output device. An early implementation of this idea was the out-of-band ASA carriage control characters. Later, control characters were integrated into the stream of data to be printed. The carriage return character (CR), when sent to such a device, causes it to put the character at the edge of the paper at which writing begins (it may, or may not, also move the printing position to the next line). The line feed character (LF/NL) causes the device to put the printing position on the next line. It may (or may not), depending on the device and its configuration, also move the printing position to the start of the next line (which would be the leftmost position for left-to-right scripts, such as the alphabets used for Western languages, and the rightmost position for right-to-left scripts such as the Hebrew and Arabic alphabets). The vertical and horizontal tab characters (VT and HT/TAB) cause the output device to move the printing position to the next tab stop in the direction of reading. The form feed character (FF/NP) starts a new sheet of paper, and may or may not move to the start of the first line. The backspace character (BS) moves the printing position one character space backwards. On printers, this is most often used so the printer can overprint characters to make other, not normally available, characters. On terminals and other electronic output devices, there are often software (or hardware) configuration choices which will allow a destruct backspace (i.e., a BS, SP, BS sequence) which erases, or a non-destructive one which does not. The shift in and shift out characters (SO and SI) selected alternate character sets, fonts, underlining or other printing modes. Escape sequences were often used to do the same thing.

With the advent of computer terminals that did not physically print on paper and so offered more flexibility regarding screen placement, erasure, and so forth, printing control codes were adapted. Form feeds, for example, usually cleared the screen, there being no new paper page to move to. More complex escape sequences were developed to take advantage of the flexibility of the new terminals, and indeed of newer printers. The concept of a control character had always been somewhat limiting, and was extremely so when used with new, much more flexible, hardware. Control sequences (sometimes implemented as escape sequences) could match the new flexibility and power and became the standard method. However, there were, and remain, a large variety of standard sequences to choose from.

Data structuring

The separators (File, Group, Record, and Unit: FS, GS, RS and US) were made to structure data, usually on a tape, in order to simulate punched cards. End of medium (EM) warns that the tape (or other recording medium) is ending. While many systems use CR/LF and TAB for structuring data, it is possible to encounter the separator control characters in data that needs to be structured. The separator control characters are not overloaded; there is no general use of them except to separate data into structured groupings. Their numeric values are contiguous with the space character, which can be considered a member of the group, as a word separator.

Transmission control

The transmission control characters were intended to structure a data stream, and to manage re-transmission or graceful failure, as needed, in the face of transmission errors.

The start of heading (SOH) character was to mark a non-data section of a data stream—the part of a stream containing addresses and other housekeeping data. The start of text character (STX) marked the end of the header, and the start of the textual part of a stream. The end of text character (ETX) marked the end of the data of a message. A widely used convention is to make the two characters preceding ETX a checksum or CRC for error-detection purposes. The end of transmission block character (ETB) was used to indicate the end of a block of data, where data was divided into such blocks for transmission purposes.

The escape character (ESC) was intended to "quote" the next character, if it was another control character it would print it instead of performing the control function. It is almost never used for this purpose today.

The substitute character (SUB) was intended to request a translation of the next character from a printable character to another value, usually by setting bit 5 to zero. This is handy because some media (such as sheets of paper produced by typewriters) can transmit only printable characters. However, on MS-DOS systems with files opened in text mode, "end of text" or "end of file" is marked by this Ctrl-Z character, instead of the Ctrl-C or Ctrl-D, which are common on other operating systems.

The cancel character (CAN) signalled that the previous element should be discarded. The negative acknowledge character (NAK) is a definite flag for, usually, noting that reception was a problem, and, often, that the current element should be sent again. The acknowledge character (ACK) is normally used as a flag to indicate no problem detected with current element.

When a transmission medium is half duplex (that is, it can transmit in only one direction at a time), there is usually a master station that can transmit at any time, and one or more slave stations that transmit when they have permission. The enquire character (ENQ) is generally used by a master station to ask a slave station to send its next message. A slave station indicates that it has completed its transmission by sending the end of transmission character (EOT).

The device control codes (DC1 to DC4) were originally generic, to be implemented as necessary by each device. However, a universal need in data transmission is to request the sender to stop transmitting when a receiver is temporarily unable to accept any more data. Digital Equipment Corporation invented a convention which used 19 (the device control 3 character (DC3), also known as control-S, or XOFF) to "S"top transmission, and 17 (the device control 1 character (DC1), a.k.a. control-Q, or XON) to start transmission. It has become so widely used that most don't realize it is not part of official ASCII. This technique, however implemented, avoids additional wires in the data cable devoted only to transmission management, which saves money. A sensible protocol for the use of such transmission flow control signals must be used, to avoid potential deadlock conditions, however.

The data link escape character (DLE) was intended to be a signal to the other end of a data link that the following character is a control character such as STX or ETX. For example a packet may be structured in the following way (DLE) <STX> <PAYLOAD> (DLE) <ETX>.

Miscellaneous codes

Code 7 (BEL) is intended to cause an audible signal in the receiving terminal.^[4]

Many of the ASCII control characters were designed for devices of the time that are not often seen today. For example, code 22, "synchronous idle" (SYN), was originally sent by synchronous modems (which have to send data constantly) when there was no actual data to send. (Modern systems typically use a start bit to announce the beginning of a transmitted word— this is a feature of asynchronous communication. Synchronous communication links were more often seen with mainframes, where they were typically run over corporate leased lines to connect a mainframe to another mainframe or perhaps a minicomputer.)

Code 0 (ASCII code name NUL) is a special case. In paper tape, it is the case when there are no holes. It is convenient to treat this as a fill character with no meaning otherwise. Since the position of a NUL character has no holes punched, it can be replaced with any other character at a later time, so it was typically used to reserve space, either for correcting errors or for inserting information that would be available at a later time or in another place. In computing it is often used for padding in fixed length records and more commonly, to mark the end of a string.

Code 127 (DEL, a.k.a. "rubout") is likewise a special case. Its 7-bit code is all-bits-on in binary, which essentially erased a character cell on a paper tape when overpunched. Paper tape was a common storage medium when ASCII was developed, with a computing history dating back to WWII code breaking equipment at Biuro Szyfrów. Paper tape became obsolete in the 1970s, so this clever aspect of ASCII rarely saw any use after that. Some systems (such as the original Apples) converted it to a backspace. But because its code is in the range occupied by other printable characters, and because it had no official assigned glyph, many computer equipment vendors used it as an additional printable character (often an all-black "box" character useful for erasing text by overprinting with ink).

Non-erasable Programmable ROMs are typically implemented as arrays of fusible elements, each representing a bit, which can only be switched one way, usually from one to zero. In such PROMs, the DEL and NUL characters can be used in the same way that they were used on punched tape: one to reserve meaningless fill bytes that can be written later, and the other to convert written bytes to meaningless fill bytes. For PROMs that switch one to zero, the roles of NUL and DEL are reversed; also, DEL will only work with 7-bit characters, which are rarely used today; for 8-bit content, the character code 255, commonly defined as a nonbreaking space character, can be used instead of DEL.

Many file systems do not allow control characters in the filenames, as they may have reserved functions.

Notes and references

↑ MS-DOS QBasic v1.1 Documentation. Microsoft 1987-1991.
↑ "4.8 Name". The Unicode® Standard Version 11.0 – Core Specification (PDF). Unicode, Inc.
↑ "ASCII Characters". Archived from the original on October 28, 2009. Retrieved 2010-10-08.
↑ "RFC20". Retrieved 2013-11-03.An old RFC, which explains the structure and meaning of the control characters in chapters 4.1 and 5.2

External links

ISO IR 1 C0 Set of ISO 646 (PDF)

This article "Control character" is from Wikipedia. The list of its authors can be seen in its historical. Articles copied from Draft Namespace on Wikipedia could be seen on the Draft Namespace of Wikipedia and not main one. {{{{⚠️🚨COPIED from EverybodyWiki ❗❕⚠️😡😤Please respect Licence CC-BY-SA ❗}}}}

[1] MS-DOS QBasic v1.1 Documentation. Microsoft 1987-1991.

[2] "4.8 Name". The Unicode® Standard Version 11.0 – Core Specification (PDF). Unicode, Inc.

[3] "ASCII Characters". Archived from the original on October 28, 2009. Retrieved 2010-10-08.

[4] "RFC20". Retrieved 2013-11-03.An old RFC, which explains the structure and meaning of the control characters in chapters 4.1 and 5.2

[1]

[2]

[3]

[4]

v t e Character encodings
Early telecommunications	ASCII ISO/IEC 646 ISO/IEC 6937 T.61 BCDIC Baudot code Morse code Telegraph code Wabun code Special telegraphy codes Non-Latin Chinese Cyrillic Needle telegraph codes
ISO/IEC 8859	-1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12 -13 -14 -15 -16
Bibliographic use	ANSEL ISO 5426 / 5426-2 / 5427 / 5428 / 6438 / 6861 / 6862 / 10585 / 10586 / 10754 / 11822 MARC-8
National standards	ArmSCII BraSCII CNS 11643 ELOT 927 GOST 10859 GB 18030 HKSCS I.S. 434 ISCII JIS X 0201 JIS X 0208 JIS X 0212 JIS X 0213 KOI-7 KPS 9566 KS X 1001 PASCII SI 960 TIS-620 TSCII VISCII VSCII YUSCII
EUC	CN JP KR TW
ISO/IEC 2022	CN JP KR CCCII
MacOS code pages ("scripts")	Armenian Arabic Barents Cyrillic Celtic CentEuro ChineseSimp / EUC-CN ChineseTrad / Big5 Croatian Cyrillic Devanagari Dingbats Farsi (Persian) Gaelic Georgian Greek Gujarati Gurmukhi Hebrew Iceland Inuit Japanese / ShiftJIS Keyboard Korean / EUC-KR Latin (Kermit) Maltese/Esperanto Ogham / I.S. 434 Roman Romanian Sámi Symbol Thai / TIS-620 Turkish Turkic Latin Turkic Cyrillic Ukrainian VT100
DOS code pages	100 111 112 113 151 152 161 162 163 164 165 166 210 220 301 437 449 489 620 667 668 707 708 709 710 711 714 715 720 721 737 768 770 771 772 773 774 775 776 777 778 790 850 851 852 853 854 855/872 856 857 858 859 860 861 862 863 864/17248 865 866/808 867 868 869 874/1161/1162 876 877 878 881 882 883 884 885 891 895 896 897 898 899 900 903 904 906 907 909 910 911 926 927 928 929 932 934 936 938 941 942 943 944 946 947 948 949 950/1370 951 966 991 1034 1039 1040 1041 1042 1043 1044 1046 1086 1088 1092 1093 1098 1108 1109 1114 1115 1116 1117 1118 1119 1125/848 1126 1127 1131/849 1139 1167 1168 1300 1351 1361 1362 1363 1372 1373 1374 1375 1380 1381 1385 1386 1391 1392 1393 1394 CWI-2 Iran System Kamenický KOI8 Mazovia MIK
IBM AIX code pages	367 371 806 813 819 895 896 912 913 914 915 916 919 920 921/901 922/902 923 952 953 954 955 956 957 958 959 960 961 963 964 965 970 971 1004 1006 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1029 1036 1089 1111 1124 1129/1163 1133 1350 1382 1383
IBM Apple MacIntosh emulations	1275 1280 1281 1282 1283 1284 1285 1286
IBM Adobe emulations	1038 1276 1277
IBM DEC emulations	1020 1021 1023 1090 1100 1101 1102 1103 1104 1105 1106 1107 1287 1288
IBM HP emulations	1050 1051 1052 1053 1054 1055 1056 1057 1058
Windows code pages	CER-GS 874/1162 (TIS-620) 932/943 (Shift JIS) 936/1386 (GBK) 950/1370 (Big5) 949/1363 (EUC-KR) 1169 1174 Extended Latin-8 1200 (UTF-16LE) 1201 (UTF-16BE) 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1261 1270 54936 (GB18030) 65001 (UTF-8)
EBCDIC code pages	1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37/1140 37-2 38 39 40 251 252 254 256 257 258 259 260 264 273/1141 274 275 276 277/1142 278/1143 279 280/1144 281 282 283 284/1145 285/1146 286 287 288 289 290 293 297/1147 298 300 310 320 321 322 330 351 352 353 355 357 358 359 360 361 363 382 383 384 385 386 387 388 389 390 391 392 393 394 395 410 420/16804 421 423 424/8616/12712 425 435 500/1148 803 829 833 834 835 836 837 838/838 839 870/1110/1153 871/1149 875/4971/9067 880 881 882 883 884 885 886 887 888 889 890 892 893 905 918 924 930/1390 931 933/1364 935/1388 937/1371 939/1399 1001 1002 1003 1005 1007 1024 1025/1154 1026/1155 1027 1028 1030 1031 1032 1033 1037 1047 1068 1069 1070 1071 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1087 1091 1097 1112/1156 1113 1122/1157 1123/1158 1130/1164 1132 1136 1137 1150 1151 1152 1159 1165 1166 1278 1279 1303 1364 1376 1377 JEF KEIS
Platform specific	Acorn Adobe Standard Adobe Latin 1 Apple II Apple Sabine ATASCII Atari ST BICS Casio calculators CDC Compucolor II CPC DEC Radix-50 DEC MCS/NRCS DG International ELWRO-Junior FIELDATA GEM GEOS GSM 03.38 HP Roman Extension HP Roman-8 HP Roman-9 HP FOCAL HP RPL LICS LMBCS Mattel Aquarius Minitel MSX NEC APC NeXT OricSCII PCW PETSCII Sega SC-3000 Sharp calculators Sharp MZ Sinclair QL Teletext TI calculators TRS-80 Ventura International Ventura Symbol Videotex WISCII XCCS ZX80 ZX81 ZX Spectrum
Unicode / ISO/IEC 10646	UTF-1 UTF-7 UTF-8 UTF-16 (UTF-16LE/UTF-16BE) / UCS-2 UTF-32 (UTF-32LE/UTF-32BE) / UCS-4 UTF-EBCDIC GB 18030 BOCU-1 CESU-8 SCSU
TeX typesetting system	Cork LGR LY1 OML OMS OMX OT1 OT2 OT3 OT4 T2A T2B T2C T2D T3 T4 T5 TS1 TS3 U X2
Miscellaneous code pages	ABICOMP APL ARIB STD-B24 HZ INIS INIS-8 ISO-IR-111 ISO-IR-169 ISO-IR-182 ISO-IR-197 ISO-IR-200 ISO-IR-201 Johab SEASCII Stanford/ITS TACE16 TRON UTF-5 UTF-6 WTF-8
Related topics	Code page Control character (C0 C1) CCSID Character encodings in HTML Charset detection Han unification Hardware ISO 6429/IEC 6429/ANSI X3.64 Mojibake
Character sets