w3c · aphillips · Oct 18, 2024 · Oct 24, 2024 · Oct 25, 2024 · Oct 25, 2024
diff --git a/index.html b/index.html
@@ -757,7 +757,7 @@ <h3>Background information</h3>
   <h4>Important definitions</h4>
   <p>In order to correctly display text written in a 'right-to-left' script or left-to-right text containing bidirectional elements, it is important to establish the <a href="https://www.w3.org/International/articles/inline-bidi-markup/uba-basics#context" class="termref">base direction</a> that will be used to dictate the order in which elements of the text will be displayed.</p>
   <p>If you are not familiar with what the Unicode Bidirectional Algorithm (UBA) does and doesn't do, and why the base direction is so important, read <a href="https://www.w3.org/International/articles/inline-bidi-markup/uba-basics">Unicode Bidirectional Algorithm basics</a>.</p>
-  <aside class="example">
+  <aside class="example" id="sec-dir-example">
     <p>For example, the following annotation will not display correctly unless the application doing the display knows that the base direction needs to be right-to-left.</p>
     <pre>{
   "@context": "http://www.w3.org/ns/anno.jsonld",
@@ -772,10 +772,25 @@ <h4>Important definitions</h4>
   "target": "http://example.org/photo1"
 }
     </pre>
-    <p>You would expect the phrase in the <code class="kw" translate="no">text</code> property value to be displayed as</p>
-    <p><span dir="rtl">פעילות הבינאום, W3C</span></p>
-    <p>however, if there is no indication that the base direction should be right-to-left the following incorrect display will be produced:</p>
-    <p>פעילות הבינאום, W3C</p>
+    <p>Without an indication that the [=base direction=] is right-to-left, the value of the <code>text</code> property will display incorrectly when it is placed into a left-to-right context (such as the table below):</p>
+
+       <table dir="ltr" class="bidi-example-table">
+	   <thead>
+		   <tr><th>Description</th><th>HTML</th><th style="width:25%">Appearance</th></tr>
+	   </thead>
+	   <tbody>
+		   <tr>
+			   <td>Incorrect:<br>(without <code>dir</code>)</td>
+			   <td><pre class="html">&lt;span lang="he"&gt;פעילות הבינאום, W3C&lt;/span&gt;</pre></td>
+			   <td class="spilloverExample"><span lang="he">פעילות הבינאום, W3C</span></td>
+		   </tr>
+		   <tr>
+			   <td>Correct:<br>(with <code>dir</code>)</td>
+			   <td><pre class="html">&lt;span lang="he" dir="rtl"&gt;פעילות הבינאום, W3C&lt;/span&gt;</pre></td>
+			   <td class="spilloverExample"><span dir="rtl" lang="he">פעילות הבינאום, W3C</span></td>
+		   </tr>
+	   </tbody>
+   </table>
   </aside>
 
   <p>In this section, the word <dfn class="lint-ignore">paragraph</dfn> indicates a run of text followed by a hard line-break in plain text, but may signify different things in other situations. In CSV it equates to 'cell', so a single line of comma-separated items is actually a set of comma-separated paragraphs.&nbsp; In HTML it equates to the lowest level of block element, which is often a <code class="kw" translate="no">p</code> element, but may be things such as <code class="kw" translate="no">div</code>, <code class="kw" translate="no">li</code>, etc., if they only contain text and/or inline elements. In JSON, it often equates to a quoted string value, but if a string value uses markup then paragraphs are associated with block elements, and if the string value is multiple lines of plain text then each line is a paragraph.</p>
@@ -878,36 +893,63 @@ <h4>Problems with control characters</h4>
 <h4>Strong directional formatting characters: RLM, LRM, and ALM</h4>
 <p>A word about the Unicode characters <span class="codepoint" translate="no"><img alt="RLM" src="images/200F.png"><code class="uname">U+200F RIGHT-TO-LEFT MARK</code></span> (RLM), <span class="codepoint" translate="no"><img alt="LRM" src="images/200E.png"><code class="uname">U+200E LEFT-TO-RIGHT MARK</code></span> (LRM), and <span class="codepoint" translate="no"><img alt="ALM" src="images/061C.png"><code class="uname">U+061C ARABIC LETTER MARK</code></span> (ALM) is warranted at this point.</p>
 <p>The first point to be clear about is that these three characters do not establish the base direction for a range of text. They are simply invisible characters with strong directional properties.</p>
-<p>This means that you cannot use RLM for example, to make the text <kbd>W3C</kbd> appear to the left of the Hebrew text in the following example.</p>
-<p>The title is "<span dir="rtl" lang="he">פעילות הבינאום, W3C</span>".</p>
-<p>For this you can only use metadata or the paired control characters.</p>
-<p>Of course, if you are detecting base direction using first-strong heuristics (such as <code>dir="auto"</code> in HTML), then inserting an RLM, ALM, or LRM can be useful for influencing the base direction detected where the text in question begins with something that would otherwise give the wrong result. For example:</p>
-<p>"<span dir="rtl" lang="ar">نشاط التدويل</span>" is how you say "i18n Activity" in Arabic.</p>
-<p>Here an LRM could be placed at the start of the text, before the strong right-to-left Arabic characters, to prevent the algorithm from assuming that the text should be right-to-left. (Remember that if metadata is used to set the base direction, the strong directional formatting character is ignored, unless the metadata specifically says that first-strong heuristics should be used.)</p>
+<p>Recalling the <a href="#sec-dir-example">earlier example</a>, this means that you cannot use RLM to make the text <kbd>W3C</kbd> appear to the left of the Hebrew text. Only metadata or paired control characters produce the correct display.</p>
+
+<aside class="example" id="rlm-not-working" title="Use metadata instead of strongly directional formatting characters">
+
+	<table dir="ltr" class="bidi-example-table">
+	   <thead>
+		   <tr><th>Description</th><th>HTML</th><th style="width:25%">Result</th></tr>
+	   </thead>
+	   <tbody>
+		   <tr>
+			   <td>With RLM<br>(incorrect)</td>
+			   <td><pre class="html">&lt;span lang="he"&gt;&#x05E4;&#x05E2;&#x05D9;&#x05DC;&#x05D5;&#x05EA; &#x05D4;&#x05D1;&#x05D9;&#x05E0;&#x05D0;&#x05D5;&#x05DD;, W3C&amp;rlm;&lt;/span&gt;</pre></td>
+			   <td class="spilloverExample"><span lang="he">&#x05E4;&#x05E2;&#x05D9;&#x05DC;&#x05D5;&#x05EA; &#x05D4;&#x05D1;&#x05D9;&#x05E0;&#x05D0;&#x05D5;&#x05DD;, W3C&rlm;</span></td>
+		   </tr>
+		   <tr>
+			   <td>With metadata<br>(correct)</td>
+			   <td><pre class="html">&lt;span lang="he" dir="rtl"&gt;&#x05E4;&#x05E2;&#x05D9;&#x05DC;&#x05D5;&#x05EA; &#x05D4;&#x05D1;&#x05D9;&#x05E0;&#x05D0;&#x05D5;&#x05DD;, W3C&lt;/span&gt;</pre></td>
+			   <td class="spilloverExample"><span lang="he" dir="rtl">&#x05E4;&#x05E2;&#x05D9;&#x05DC;&#x05D5;&#x05EA; &#x05D4;&#x05D1;&#x05D9;&#x05E0;&#x05D0;&#x05D5;&#x05DD;, W3C</span></td>
+		   </tr>
+	   </tbody>
+   </table>
+</aside>
+
+<p>Of course, if you are detecting base direction using first-strong heuristics (such as <code>dir="auto"</code> in HTML), then inserting an RLM, ALM, or LRM can be useful for influencing the base direction detected where the text in question begins with something that would otherwise give the wrong result.</p>
+<aside class="example" title="Using a strong directional formatting character to assist first-strong heuristics">
+	<p>This HTML has strongly right-to-left Arabic characters near the start, where they will be picked up by a first-strong heuristic. Notice that there is a neutral character right at the start:</p>
+	<pre class="html">&lt;p dir="auto"&gt;"نشاط التدويل" is how you say "i18n Activity" in Arabic.&lt;/p&gt;</pre>
+	<p>This produces the wrong result:</p>
+	<p dir="auto" class="spilloverExample">"&#x0646;&#x0634;&#x0627;&#x0637; &#x0627;&#x0644;&#x062a;&#x062f;&#x0648;&#x064a;&#x0644;" is how you say "i18n Activity" in Arabic.</p>
+
+	<p>Here an LRM could be placed at the start of the text to prevent the algorithm from assuming that the text should be right-to-left.</p>
+	<pre class="html">&lt;p dir="auto"&gt;&amp;lrm;"نشاط التدويل" is how you say "i18n Activity" in Arabic.&lt;/p&gt;</pre>
+	<p dir="auto" class="spilloverExample">&lrm;"&#x0646;&#x0634;&#x0627;&#x0637; &#x0627;&#x0644;&#x062a;&#x062f;&#x0648;&#x064a;&#x0644;" is how you say "i18n Activity" in Arabic.</p>
+
+</aside>
+<p>Remember that if metadata is used to set the base direction, the strong directional formatting character is ignored, unless the metadata specifically says that first-strong heuristics should be used.</p>
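The effect of a leading LRM on first-strong detection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `first_strong_direction` is hypothetical, and the sketch looks only at the bidi class of each character, so it ignores details that real implementations handle (for example, skipping text inside directional isolates).

```python
import unicodedata

# Bidi classes treated as strong: L is left-to-right, R and AL are right-to-left.
_STRONG = {"L": "ltr", "R": "rtl", "AL": "rtl"}


def first_strong_direction(text):
    """Return 'ltr' or 'rtl' based on the first strong character, or None if there is none.

    Hypothetical helper for illustration; not the full HTML dir="auto" algorithm.
    """
    for ch in text:
        direction = _STRONG.get(unicodedata.bidirectional(ch))
        if direction is not None:
            return direction
    return None


# The sentence from the example: a neutral quotation mark, then Arabic letters.
sentence = (
    '"\u0646\u0634\u0627\u0637 \u0627\u0644\u062a\u062f\u0648\u064a\u0644"'
    ' is how you say "i18n Activity" in Arabic.'
)

print(first_strong_direction(sentence))             # rtl
print(first_strong_direction("\u200e" + sentence))  # ltr (LRM is the first strong character)
```

The quotation mark is skipped because it is neutral, so without the LRM the first Arabic letter decides the result.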
 <p>Finally, a note about the use of <span class="codepoint" translate="no"><img alt="ALM" src="images/061C.png"><code class="uname">U+061C ARABIC LETTER MARK</code></span> (ALM). This character is used to influence the display of sequences of numbers in Arabic script text in cases where no Arabic letters occur before the number.</p>
 <aside class="example" title="Example of ALM usage">
    <p>In some Arabic-script languages the range <code dir="rtl">100-200</code> should appear as <code dir="rtl">&#x061c;100-200</code>. If no Arabic letters appear before the numbers, the [=Unicode Bidirectional Algorithm=] will not perform this reordering. Note that the character sequence in both cases is "100-200" and that both have a <kbd>code</kbd> element with a <code>dir="rtl"</code> around them. In the third example, an ALM is used to provide the necessary hint, like so:</p>
-   <table>
+   <table class="bidi-example-table">
 	   <thead>
-		   <tr><th>Description</th><th>HTML / Appearance</th></tr>
+		   <tr><th>Description</th><th>HTML</th><th>Appearance</th></tr>
 	   </thead>
 	   <tbody>
 		   	<tr>
-			   <td rowspan="2">Preceded by Arabic letters</td>
-			   <td><pre class="html">&lt;code dir="rtl" lang="ar"&gt;&#x0634;&#x0627;&#x0637; &#x0627;&#x0644;&#x062A;&#x062F;&#x0648;&#x064A;&#x0644; 100-200&lt;/code&gt;</pre></td>
-			</tr><tr>
-			   <td dir="rtl" class="spilloverExample"><code dir="rtl" lang="ar">&#x0634;&#x0627;&#x0637; &#x0627;&#x0644;&#x062A;&#x062F;&#x0648;&#x064A;&#x0644; 100-200</code></td>
+			   <td>Preceded by Arabic letters</td>
+			   <td><pre class="html">&lt;code dir="rtl" lang="ar"&gt;&#x0646;&#x0634;&#x0627;&#x0637; &#x0627;&#x0644;&#x062A;&#x062F;&#x0648;&#x064A;&#x0644; 100-200&lt;/code&gt;</pre></td>
+			   <td dir="rtl" class="spilloverExample"><code dir="rtl" lang="ar">&#x0646;&#x0634;&#x0627;&#x0637; &#x0627;&#x0644;&#x062A;&#x062F;&#x0648;&#x064A;&#x0644; 100-200</code></td>
 		   </tr>
 		   <tr>
-			   <td rowspan="2">Without ALM</td>
+			   <td>Without ALM</td>
 			   <td><pre class="html">&lt;code dir="rtl" lang="ar"&gt;100-200&lt;/code&gt;</pre></td>
-		   </tr><tr>
 			   <td dir="rtl" class="spilloverExample"><code dir="rtl" lang="ar">100-200</code></td>
 		   </tr>
 		   <tr>
-			   <td rowspan="2">With ALM</td>
+			   <td>With ALM</td>
 			   <td><pre class="html">&lt;code dir="rtl" lang="ar"&gt;&amp;#x061C;100-200&lt;/code&gt;</pre></td>
-		   </tr><tr>
 			   <td dir="rtl" class="spilloverExample"><code dir="rtl" lang="ar" >&#x061C;100-200</code></td>
 		   </tr>
 	   </tbody>
@@ -1237,10 +1279,13 @@ <h2>Characters</h2>
 
     <p>At their simplest, user-perceived characters are a single shape that can be tied one-to-one to the underlying computing representation. But a user-perceived character can be formed, in some scripts, from more than one character. And a given logical character can take many different shapes due to such influences as font selection, style, or the surrounding context (such as adjacent characters). In some cases, a single user-perceived character might be formed from a long sequence of logical characters. And some logical characters (so-called "combining marks") are always used in conjunction with another character.</p>
 
-    <p>When user-perceived characters are represented visibly (on screen or in print), they are represented by individual rendering units. This visual unit is called a <a>grapheme</a> (the word <a>glyph</a> is also used). Graphemes are the visual units found in fonts and rendering software.</p>
+    <p>When user-perceived characters are represented visibly (on screen or in print), they are represented by individual rendering units. This visual unit is called a [=grapheme cluster=] (or [=grapheme=] for short; the word [=glyph=] is also sometimes used).</p>
+
+    <aside class=note>
+		<p>[[Unicode]] has several definitions for graphemes. Unless otherwise specified, the term [=grapheme cluster=] in this document refers to what [[UAX29]] refers to as an "extended default grapheme cluster".</p>
+    </aside>
 
-    <aside class=example>
-		<h5>Examples of user-perceived characters</h5>
+    <aside class=example title="Examples of grapheme clusters and user-perceived characters">
 		<p>Here is the word for "Unicode" in the Latin, Katakana, Arabic, and Devanagari scripts.</p>
 		<p class=bigtext>Unicode
 		   <span lang=ja>&#x30E6;&#x30CB;&#x30B3;&#x30FC;&#x30C9;</span>
@@ -1259,8 +1304,7 @@ <h5>Examples of user-perceived characters</h5>
 
     <p>The relationship between code points and graphemes can be complex. In most cases, a code point sequence that forms a single grapheme should be treated as a single textual unit. For example, when cursoring across text, an entire grapheme should be selected as a unit. It shouldn't be possible to cursor into the "middle" of a grapheme or delete only part of a user-perceived character. Because the relationship between code points and graphemes is not one-to-one and can be somewhat complex, [[Unicode]] defines a specific type of grapheme: the <a>extended grapheme cluster</a>, which most closely matches the mapping of the underlying logical character sequence to a user-perceived character. When referring to 'graphemes' in this document, we mean extended grapheme clusters (unless otherwise called out).</p>
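To make the gap between code points and graphemes concrete, here is a minimal Python sketch using the Hindi word for "Unicode". The helper `naive_clusters` is hypothetical and deliberately simplistic: it attaches each combining mark (general category `M*`) to the preceding character. It is not an implementation of the [[UAX29]] extended grapheme cluster rules and will give wrong answers for cases such as emoji sequences or Hangul jamo; production code should use a library that implements UAX29.

```python
import unicodedata


def naive_clusters(text):
    """Group each mark (category Mn, Mc, Me) with the character before it.

    Hypothetical, simplified approximation; NOT the UAX29 segmentation rules.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # attach the mark to the preceding cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


# The Hindi word for "Unicode", written as escaped code points.
word = "\u092f\u0942\u0928\u093f\u0915\u094b\u0921"

print(len(word))                  # 7 code points
print(len(naive_clusters(word)))  # 4 approximate grapheme clusters

# The second cluster is the syllable "ni": a letter plus a vowel sign.
for ch in naive_clusters(word)[1]:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+0928 DEVANAGARI LETTER NA
# U+093F DEVANAGARI VOWEL SIGN I
```

Note that `len(word)` counts code points, which is why string length alone is a poor proxy for what a user would count as characters.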
 
-    <aside class=example>
-		<h5>Hindi example showing mapping from graphemes to code points</h5>
+    <aside class=example title="Example of the difference between graphemes and code points">
 		<p>Returning to the example above, the Hindi word for Unicode is made of four graphemes:</p>
 		<p class=bigtext lang=hi>&#x092F;&#x0942;&nbsp;<span style="color:red">&#x0928;&#x093F;</span>&nbsp;&#x0915;&#x094B;&nbsp;&#x0921;</p>
 		<p>Several of these graphemes are made up of more than one Unicode character because of the way that the Devanagari script works. In Devanagari, the basic set of "letters" are syllables ending with the short 'a' vowel sound. When you want to use a different vowel, you add a combining vowel character that changes the shape of the grapheme. The red text in the example above is the syllable "ni" in "Unicode". It is made of two characters: U+0928 (the syllable "na") and U+093F (combining "short i" sound):</p>
@@ -1294,8 +1338,7 @@ <h5>Hindi example showing mapping from graphemes to code points</h5>
 
     <p>A set of rules for converting code points to or from code units is called a <a>character encoding form</a> (or just "character encoding" for short).</p>
 
-    <aside class=example>
-       <h2>UTF-8 Character Encoding Form</h2>
+    <aside class=example title="UTF-8 Character Encoding Form">
 
        <p>The most common character encoding used on the Web is UTF-8. UTF-8 uses 8-bit bytes as its code unit. Each Unicode code point encoded into UTF-8 takes between one and four bytes to encode. ASCII characters take one byte to encode. Code points from 0x80 to 0x7FF take two bytes. Code points from 0x800 to 0xFFFF take three bytes. And code points from 0x10000 to 0x10FFFF (that is, the rest of Unicode) take four bytes each.</p>
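The byte ranges above can be checked with a short Python sketch. The function name `utf8_length` is hypothetical and exists only for illustration; the sketch compares the ranges stated in the text against Python's built-in UTF-8 encoder. Surrogate code points (0xD800 to 0xDFFF) are not scalar values and cannot be encoded, so the sample values avoid them.

```python
def utf8_length(code_point):
    """Return how many bytes UTF-8 uses for a code point, per the ranges in the text.

    Hypothetical helper for illustration; it does not reject surrogates.
    """
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point <= 0x7F:
        return 1  # ASCII
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4      # 0x10000 to 0x10FFFF


# One sample from each range: 'A', 'é', '€', and an emoji.
samples = [0x41, 0xE9, 0x20AC, 0x1F600]

for cp in samples:
    encoded = chr(cp).encode("utf-8")
    # The range-based answer should match the real encoder's output length.
    print(f"U+{cp:04X}: {utf8_length(cp)} byte(s), encoder produced {len(encoded)}")
```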
 
@@ -5563,7 +5606,17 @@ <h2> Revision Log</h2>
 <section class="appendix" id="ack">
 <h2>Acknowledgements</h2>
 <p>Thanks to Addison Phillips for help reviewing old reviews for recommendations.</p>
-<p>Other people who contributed through reviews or issues include Steve Atkin, Andrew Cunningham, Martin Dürst, Asmus Freytag, John Klensin, Tomer Mahlin, Chaals McCathieNevile, Florian Rivoal. Some material about locale-neutral representation was adapted from [[DWBP]].</p>
+<p>Other people who contributed through reviews or issues include 
+Steve Atkin, 
+Andrew Cunningham, 
+Martin Dürst, 
+Asmus Freytag, 
+John Klensin, 
+Tomer Mahlin, 
+Chaals McCathieNevile, 
+Florian Rivoal,
+Najib Tounsi. 
+Some material about locale-neutral representation was adapted from [[DWBP]].</p>
 </section>
 
 

diff --git a/local.css b/local.css
@@ -455,9 +455,36 @@ td.exampleChar {
   font-size: 140%;
 }
 
+.spilloverExample :lang(ar) {
+  font-family: "Noto Sans Arabic", Tahoma, sans-serif;
+}
+
 .localdef {
     background-color:white;
     border: 1px solid brown; 
     margin:0.5em; 
     padding:0.5em;
 }
+
+table.bidi-example-table {
+    background-color: white;
+    border-collapse: collapse;
+    padding: 0;
+    width: 98%;
+}
+
+table.bidi-example-table td {
+    padding: 0;
+}
+
+table.bidi-example-table th {
+    text-align: center;
+}
+
+table.bidi-example-table tr {
+    border-bottom: 1px solid #ddd;
+}
+
+table.bidi-example-table tr td:last-child {
+    white-space: nowrap;
+}