Module:UCS
This module is rated as alpha. It is ready for third-party input, and may be used on a few pages to see if problems arise, but should be watched. Suggestions for new features or changes in their input and output mechanisms are welcome.
The module “UCS” has a single usable call, table, which returns a specially formatted table of the specified UCS characters.
Usage
{{#invoke: UCS|table|format|list|annotations}}
Parameters
All three currently supported parameters of the table call are positional.
format
Currently ignored but reserved for forward compatibility.
list
Input data, as a sequence of ASCII characters, for building the table. Supported inputs are:
- +hexadecimal: jump to the specified UCS code point, usually (but not necessarily) given as four hexadecimal digits. Closes the current row if necessary. The default start location is U+0020 SPACE.
- ! string: name or description of the character block, in wiki code. Should not be used while the current row is unfinished. The string extends to the end of the line, so character specifications must start on the next line.
- Classifiers for exactly one code point:
  - - (hyphen-minus): the code point is disallowed; produces a purple empty cell.
  - Basic Latin letters (A–Z or a–z): the code point is an allowed character and belongs to the specified class (see below). Different classes make cells with different background colors. Class letters are case-insensitive, but lowercase letters make smaller character samples.
- Newline (0x0A): close the current table row. A special case is a row that consists of a block description and a single “-”: it produces a pink cell spanning the full table width, meaning that the specified code points are disallowed.
- #, ;, /: a comment that runs to the end of the line. The difference is that for # the terminating line feed is not part of the comment and takes effect as usual, whereas for ; and / the interpreter resumes from the next line as if there were no line feed.
- Spaces (0x20) are ignored and will likely remain ignored in future versions.
Tabs (0x09) are currently ignored, but may be interpreted in future versions. All other characters may cause errors or be ignored. Support for page transclusion is planned, but not implemented.
If list is omitted or empty, then a hard-coded list is processed that produces a table for ISO 8859-1.
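The input rules above can be illustrated with a minimal sketch in plain Lua. The function `parse_list` is hypothetical and is not part of the module: it only tracks which code point each classifier lands on and produces no wikitext. The rounding of the code point at a line feed is taken from the module source below, not from the rules above, and may change.

```lua
-- Hypothetical helper (not part of Module:UCS): walks a list string and
-- returns a sequence of { codepoint = n, class = "K" } entries.
local function parse_list( list )
    local cells = {}
    local codepoint = 0x20              -- documented default: U+0020 SPACE
    local i = 1
    while i <= #list do
        local ch = list:sub( i, i )
        if ch == "+" then
            -- "+hexadecimal": jump to the given code point
            local hex = list:match( "^%x+", i + 1 )
            if hex then
                codepoint = tonumber( hex, 16 )
                i = i + #hex
            end
        elseif ch == "!" or ch == "#" or ch == ";" or ch == "/" then
            -- block description or comment: skip to the end of the line
            local eol = list:find( "\n", i, true )
            if not eol then break end
            -- after "#" the line feed itself is still processed
            i = ( ch == "#" ) and eol - 1 or eol
        elseif ch == "\n" then
            -- assumption from the module source: the next row starts at
            -- the next code point divisible by 16
            codepoint = codepoint + ( -codepoint ) % 16
        elseif ch == "-" or ch:match( "^%a$" ) then
            -- a classifier consumes exactly one code point
            cells[#cells + 1] = { codepoint = codepoint, class = ch }
            codepoint = codepoint + 1
        end
        -- spaces, tabs, and anything else are skipped
        i = i + 1
    end
    return cells
end

local cells = parse_list( "+0030 NN\nK-\n" )
for _, cell in ipairs( cells ) do
    print( string.format( "U+%04X %s", cell.codepoint, cell.class ) )
end
```

Note that hexadecimal digits after “+” are read greedily, so a classifier letter in the range A to F placed directly after the number would be taken as part of it; a space or a line feed avoids the ambiguity.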
annotations
An optional list of lines that specify where to place #-links on characters. Currently only lines of the format
- c1c2…cn#Anchor_for_internal_link
are supported; each such line generates #-links to the given anchor on the specified characters.
Support for “+” code points, ranges, and other targets (links to the mainspace) is planned but not implemented.
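As a rough illustration of the annotation format, the sketch below parses such lines into a character-to-anchor map and wraps annotated characters in a link. The names `parse_annotations` and `link` are hypothetical. The sketch is byte-based and therefore only handles ASCII characters, and it does not allow “#” among the characters; the module itself iterates over code points with `mw.ustring` and uses a different pattern.

```lua
-- Hypothetical helper (not part of Module:UCS): maps each annotated
-- character to its anchor, e.g. "ab#Letters" gives a -> "#Letters".
local function parse_annotations( annots )
    local map = {}
    for line in annots:gmatch( "[^\n]+" ) do
        -- characters first, then "#" and the anchor up to the next space
        local chars, anchor = line:match( "^%s*([^%s#]+)(#%S+)" )
        if chars then
            for ch in chars:gmatch( "." ) do
                map[ch] = anchor
            end
        end
    end
    return map
end

-- Wraps ch in an internal link when an anchor is registered for it.
local function link( map, ch )
    local anchor = map[ch]
    if anchor then
        return "[[" .. anchor .. "|" .. ch .. "]]"
    end
    return ch
end

local map = parse_annotations( "ab#Letters\n0#Digits" )
print( link( map, "b" ) )
print( link( map, "z" ) )
```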
Character classes
This is an original classification; it does not correspond to Unicode character classes. Classifiers are not stored in the module or in any other permanent location but are taken from the list argument, so the same character can be classified differently in different tables.
- D: digraphs, ligatures, presentation forms, and other redundant characters. Currently a light gray background.
- I: IPA Extensions and other IPA symbols (except basic Latin). Currently a violet background.
- J: combining characters. Currently yellow on a black background.
- K, L, M: the Latin alphabet. “K” are basic (ASCII) Latin letters, “L” are less common letters, and “M” are exotic letters. Currently all have backgrounds in various tones of blue and cyan.
- N: numerals. Currently a pale red background.
- O: control characters, broadly construed. Only characters allowed in HTML are classified here. Currently an orange background.
- P, Q: punctuation marks, common (in English) and exotic respectively. Currently have backgrounds in shades of green.
- S, T, U: symbols. Can also include characters from non-Latin scripts, although most of them are not intended to be shown in tables. “S” are common symbols, “T” are semigraphics, and “U” are exotic symbols. Currently the backgrounds are yellow (S), lime (T), and olive (U).
- X: classification is unknown. Includes unallocated code points. Currently an empty (default) background.
Class letters A, B, C, E, F, G, H, R, V, W, Y, Z are currently reserved.
The classification has no firm basis and largely reflects the personal taste of its creator. In particular, the separate class for the International Phonetic Alphabet reflects its extensive use in Wikipedia, and there is no sharp criterion for distinguishing “common” from “exotic” characters. The distinction between “U” (exotic symbols) and “Q” (exotic punctuation) is rather arbitrary and has probably been applied inconsistently in places.
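The case rule for class letters can be sketched as follows. The table `classes` and the functions `classify` and `class_index` are hypothetical names for illustration; the short descriptions are abridged from the list above. The modulo-32 index mirrors how the module source below looks up cell colors, which is an implementation detail that may change.

```lua
-- Hypothetical lookup (not part of Module:UCS): abridged class names.
local classes = {
    D = "redundant", I = "IPA", J = "combining",
    K = "Latin, basic", L = "Latin, less common", M = "Latin, exotic",
    N = "numeral", O = "control",
    P = "punctuation, common", Q = "punctuation, exotic",
    S = "symbol, common", T = "semigraphics", U = "symbol, exotic",
    X = "unknown",
}

-- Returns the class name for a classifier letter and whether the sample
-- should be rendered small (lowercase letters request a smaller sample).
local function classify( letter )
    local upper = letter:upper()
    local name = classes[upper] or "reserved"
    return name, letter ~= upper
end

-- Upper- and lowercase letters share an index: byte code modulo 32
-- gives 1 for A or a, up to 26 for Z or z.
local function class_index( letter )
    return letter:byte() % 32
end

print( classify( "K" ) )
print( classify( "d" ) )
print( class_index( "K" ), class_index( "k" ) )
```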
Examples
| Block(s) | ⋯0 | ⋯1 | ⋯2 | ⋯3 | ⋯4 | ⋯5 | ⋯6 | ⋯7 | ⋯8 | ⋯9 | 10 0a | 11 0b | 12 0c | 13 0d | 14 0e | 15 0f | 16 10 | 17 11 | 18 12 | 19 13 | 20 14 | 21 15 | 22 16 | 23 17 | 24 18 | 25 19 | 26 1a | 27 1b | 28 1c | 29 1d | 30 1e | 31 1f |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| U+0030:Decimal digits | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | (skipped) | | | | | | | | | | | | | | | | | | | | | |
| U+0041:Basic Latin letters | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z | (skipped) | | | | | |
-- ====================================================================
-- = Makes the table of UCS (Unicode) characters for a reference page =
-- ====================================================================

-- Global variables.
local outbuff = { '{| class="wikitable"' } -- a sequence of (output) strings
local outptr = 1 -- global pointer in outbuff
local base_codepoint = 32
local Block = " [[Basic Latin (Unicode block)|Basic Latin]]"
local row_start = 1 -- usually, pointer to the last " |-" in outbuff
-- Utility functions start here.

-- Appends one line of wikitext to the output buffer.
local function puts( s )
    -- mw.log("Output: "..s)
    outptr = outptr + 1
    outbuff[outptr] = s
end

-- Closes the current table row if it has any cells. NoC is the expected
-- number of columns; a short row is padded with one red-text cell that
-- contains the message s.
local function close_row( NoC, s )
    -- mw.log("close_row("..NoC..", "..s..")")
    if ( outptr > row_start ) then
        local columns_deficit = row_start + NoC - outptr
        if ( columns_deficit > 0 ) then -- may not happen with correct input data
            local colspan = ''
            if ( columns_deficit > 1 ) then colspan = 'colspan='..columns_deficit..' ' end
            puts( ' | '..colspan..' style="color:red" |'..s )
        end
        puts( " |-" )
        row_start = outptr
    end
end
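
-- Returns the character for code point c, written as a numeric character
-- reference when the literal character would clash with wiki markup.
local function mkchar( c )
    if (
        ( c < 36 )                              -- C0, space, !, ", #
        or ( c == 38 )                          -- &
        or ( ( c >= 91 ) and ( c <= 93 ) )      -- [ \ ]
        or ( ( c >= 123 ) and ( c <= 125 ) )    -- { | }
        or ( c == 127 )                         -- DEL, and ( c < 160 ) (C1) pointless
    ) then
        return '&#'..c..';'
    end
    return mw.ustring.char( c )
end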
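
-- Returns the value (0..15) of the hexadecimal digit with byte code c,
-- or -1 if c is not a hexadecimal digit. c is nil past the end of the
-- input, which also counts as "not a digit".
local function is_hex( c )
    if not c then return -1 end
    if ( c >= 48 ) and ( c <= 57 ) then return c - 48 end    -- 0–9
    if ( c >= 65 ) and ( c <= 70 ) then return c - 55 end    -- A–F
    if ( c >= 97 ) and ( c <= 102 ) then return c - 87 end   -- a–f
    return -1
end

-- Reads hexadecimal digits from s starting at index i. Returns the number
-- read and the index of the first byte after it.
local function get_hex( s, i )
    local v = 0
    local h = is_hex( string.byte( s, i ) )
    while ( h >= 0 ) do
        v = 16*v + h
        i = i + 1
        h = is_hex( string.byte( s, i ) )
    end
    return v, i
end
-- Utility functions end here.
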
local p = {}

-- The annotations parser starts here.
p.annot_map = { }

-- Returns the cell content for code point c: the character itself, wrapped
-- in a link if an annotation was registered for it.
local function mk_item( c )
    if ( p.annot_map[c] ) then
        return ( '[['..p.annot_map[c]..'|'..mkchar(c)..']]' )
    end
    return mkchar(c)
end

function p.process_arg3( annots )
    -- mw.log(" annots = "..annots)
    -- The pattern needs whitespace after each anchor, so a line feed is
    -- appended in case the last line is not terminated.
    for t, a in mw.ustring.gmatch( annots .. "\n", "(%S+)(#.-)%s" ) do
        -- mw.log(t.." → "..a)
        for cpt in mw.ustring.gcodepoint( t ) do
            p.annot_map[cpt] = a
        end
    end
end
-- The annotations parser ends here.

-- The character list parser starts here.
local bubu = 'style="color:#9900FF" ' -- marks reserved class letters
-- Cell attributes indexed by (byte code % 32), so that 1 = A/a ... 26 = Z/z.
local bgg = {
    bubu, bubu, bubu, 'bgcolor=#999999 ', bubu, bubu, bubu, bubu, -- A–H (D: redundant characters)
    'bgcolor=#6600FF ', -- IPA (I)
    'style="background-color:#000000; color:#FFFF66" ', -- combining diacritics (J)
    -- Latin letters (K, L, M)
    'bgcolor=#3333FF ', -- ASCII
    'bgcolor=#3377FF ', -- less common
    'bgcolor=#0099FF ', -- exotic
    -- Numbers (N)
    'bgcolor=#FF9999 ',
    -- Control characters (O)
    'bgcolor=#FFAA66 ',
    -- Punctuation (P, Q)
    'bgcolor=#33FF33 ', -- common (English)
    'bgcolor=#22AA22 ', -- less common
    bubu, -- R (reserved)
    -- Symbols (S, T, U)
    'bgcolor=#FFFF66 ', -- common
    'bgcolor=#CCFF66 ', -- box drawing / pseudographics
    'bgcolor=#AAAA44 ', -- uncommon
    -- V, W, X (unknown: default background), Y, Z, then bytes 91–95; [0] is byte 96
    bubu, bubu, '', bubu, bubu, bubu, bubu, bubu, bubu, bubu, [0] = bubu
}

function p.process_arg2( charlist )
    local c_length = string.len( charlist )
    if ( c_length <= 1 ) then return 0 end
    local c_index = 1
    while ( c_index <= c_length ) do
        local c_code = string.byte( charlist, c_index )
        if ( c_code == 43 ) then -- “+”: jump to a code point
            base_codepoint, c_index = get_hex( charlist, c_index+1 )
            if (
                ( outptr == row_start + 1 )
                and string.match( outbuff[outptr], '^ | style=' )
            ) then
                -- the row holds only a block description: stretch it over the full width
                outbuff[outptr] = ' | colspan=33 ' .. string.sub( outbuff[outptr], 3 )
                puts( " |-" )
                row_start = outptr
            else
                close_row( 33, "Unfinished row" )
            end
        elseif ( c_code == 33 ) then -- “!”: block description up to the line feed
            close_row( 33, "Unexpected “!” command" )
            local eol = string.find( charlist, "\n", c_index+1, true )
            if ( eol == nil ) then break end
            Block = string.sub( charlist, c_index+1, eol-1 )
            puts(
                ' | style="font-size:80%" |U+' ..
                string.format( '%04x:', base_codepoint ) .. Block
            )
            local o = base_codepoint % 32
            if ( o > 0 ) then
                puts( ' | colspan='..o..' |' )
                row_start = row_start - o + 1 -- temporary kludge
            end
            c_index = eol + 1
        elseif ( ( c_code == 35 ) or ( c_code == 59 ) or ( c_code == 47 ) ) then -- “#”, “;”, “/”: comments
            local eol = string.find( charlist, "\n", c_index+1, true )
            if ( eol == nil ) then break end
            if ( c_code == 35 ) then
                c_index = eol       -- after “#” the line feed is still processed
            else
                c_index = eol + 1   -- after “;” and “/” the line feed is swallowed
            end
        elseif ( c_code == 10 ) then -- line feed
            if (
                ( outptr == row_start + 2 ) -- only one item in the row
                and ( string.byte( charlist, c_index - 1 ) == 45 ) -- it is “-”
                and string.match( outbuff[row_start+1], '^ | style=' )
            ) then
                outbuff[row_start+1] = ' | colspan=33 bgcolor=#FF6699 ' .. string.sub( outbuff[row_start+1], 3 )
                outbuff[outptr] = " |-"
                row_start = outptr
            else
                close_row( 33, "(skipped)" ) -- temporary
            end
            -- 2097152 is 2^21, so this rounds up to the next multiple of 16
            base_codepoint = base_codepoint + ( (2097152 - base_codepoint) % 16 )
            c_index = c_index + 1
        else
            local is_letter = ( c_code >= 65 ) and ( c_code <= 122 )
            if ( is_letter or ( c_code == 45 ) ) then
                -- A classifier: open a new row first if none is open. Ignored
                -- bytes such as spaces must not open a row.
                if ( outptr <= row_start ) then
                    puts(
                        ' | style="font-size:75%" |U+' ..
                        string.format( '%04x:', base_codepoint ) .. Block
                    )
                end
                if is_letter then
                    local dimin = ''
                    if ( c_code >= 96 ) then dimin = 'style="font-size:75%" ' end
                    local item = mk_item( base_codepoint )
                    if ( c_code%32 == 10 ) then item = '◌'..item end -- base for combining characters
                    puts( ' | '..bgg[c_code%32]..dimin..'|\t'..item )
                else -- “-”
                    puts( ' | bgcolor=#AA4466 | ' )
                end
                base_codepoint = base_codepoint + 1 --temporary
            end -- ignore all other bytes
            c_index = c_index + 1
        end
    end
    close_row( 33, "end of data" )
    return 1
end
-- The character list parser ends here.

-- The main routine starts here.
function p.table( frame )
    -- frame.args[1] is ignored now, but planned to affect the table format
    puts( " |Block(s)" )
    for k = 0, 9 do
        puts( " !⋯"..k )
    end
    for k = 10, 31 do
        puts( ' ! style="font-size:75%; line-height:1.25" |'..string.format( "%d<br/>%02x", k, k ) )
    end
    close_row( 33, "???" )
    if ( frame.args[3] ) then
        p.process_arg3( frame.args[3] )
    end
    local list = frame.args[2]
    if ( list and string.len( list ) > 1 ) then
        p.process_arg2( list )
    else
        -- list omitted or empty: fall back to the built-in ISO 8859-1 table
        p.process_arg2( [=[
PPPSSSSPPPSSPPPPNNNNNNNNNNPPSSSP
SKKKKKKKKKKKKKKKKKKKKKKKKKKPPPSS
DKKKKKKKKKKKKKKKKKKKKKKKKKKPPPS-
+00A0! [[Latin-1 Supplement (Unicode block)|Latin-1 Supplement]]
PQSSSSUPDSDQSOSDSSDDDSPPDDDQdddQ
LLLLLLlLLLLLLLLLLLLLLLLSLLLLLLLL
LLLLLLlLLLLLLLLLILLLLLLULLLLLLLL
]=] )
    end
    outbuff[outptr] = " |}"
    return table.concat( outbuff, "\n" )
end
-- The main routine ends here.

function p.sheet( frame )
    return '\nThe <code>sheet</code> call is discontinued.\t'
end
return p
