DMN 1.3 RTF
  1. OMG Issue

DMN13 — Spec does not clarify meaning of hex value

  • Key: DMN13-142
  • Status: open  
  • Source: Fujitsu America (Keith Swenson)
  • Summary:

    Rule #66 on page 111 says that a character in a string can be expressed as:

    "\u", hex digit, hex digit, hex digit, hex digit

    For example "\uD83D"

    That is, exactly four hex digits. I believe the intent is that FEEL only allows exactly four digits, and does not allow the kinds of expressions that we see in the EBNF.

    What is never specified is the exact meaning of that hex value. There are two possibilities:

    (a) Is that value a Unicode code point? In this case the meaning is easy: the hex value is the code point value. However, four hex digits limit you to 64K values, not the roughly 1.1M code point range that Unicode normally covers, and not even all of the values that the spec itself mentions as having significance.

    (b) Or is it a UTF-16 code unit value? UTF-16 has encoding rules about values in the surrogate range. In UTF-16 a high-surrogate value must be followed by a low-surrogate value, or else the sequence of values is invalid and undefined. Using surrogate pairs you can address the entire range of roughly 1.1 million code points, but the user is required to understand surrogate pairs.
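
    As a rough illustration of how the two readings above diverge, the following Java sketch applies both to the same pair of four-digit values. The class and method names are hypothetical and chosen only for this example; nothing here is taken from the DMN specification.

```java
public class HexEscapeInterpretations {

    // Reading (a): every four-digit value must be a complete code point.
    // Values in the surrogate range are not Unicode scalar values, so they
    // cannot denote a character on their own under this reading.
    static boolean isValidAsCodePoints(int[] values) {
        for (int v : values) {
            if (v >= 0xD800 && v <= 0xDFFF) {
                return false;
            }
        }
        return true;
    }

    // Reading (b): every four-digit value is one UTF-16 code unit, and a
    // high surrogate followed by a low surrogate combines into one code point.
    static int[] decodeAsUtf16(int[] values) {
        char[] units = new char[values.length];
        for (int i = 0; i < values.length; i++) {
            units[i] = (char) values[i];
        }
        return new String(units).codePoints().toArray();
    }

    public static void main(String[] args) {
        int[] escapes = {0xD83D, 0xDE00};

        // Under (a) the sequence is rejected because both values are surrogates.
        System.out.println(isValidAsCodePoints(escapes)); // prints false

        // Under (b) the two values form a single supplementary code point.
        int[] codePoints = decodeAsUtf16(escapes);
        System.out.printf("U+%X%n", codePoints[0]); // prints U+1F600
    }
}
```

    The same input is therefore either invalid or a single character, depending on which reading an implementation picks.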

    The spec never mentions that UTF-16 encoding is required! It always uses "Unicode" and talks about "characters" and "code points". It does not mention anything about surrogate pairs. It never says that these values are "just like Java" or any other UTF-16 implementation.

    Page 124 says that the FEEL string value is the same as java.lang.String. Should we infer from that that internal representations must be in UTF-16? However, it also says that it is equivalent to an XML string (which is NOT constrained to UTF-16) and to a PMML string, which I looked up and which seems to be based on XML. XML allows characters to be expressed as &#nnnn; that is, an ampersand, a hash, a decimal number, and a terminating semicolon. In this case, the decimal value is the actual code point, not a UTF-16 value. So page 124 does not say unambiguously that Java defines the string values that can be used.
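
    To make the XML contrast concrete, here is a small Java sketch. The helper decodeDecimalReference is hypothetical and handles only the decimal &#nnnn; form. It treats the number as a code point, and the resulting java.lang.String then holds two UTF-16 code units for a supplementary character.

```java
public class NumericReferenceDemo {

    // Hypothetical helper: decodes a decimal reference such as "&#128512;".
    // The number between "&#" and ";" is read as a Unicode code point.
    static String decodeDecimalReference(String reference) {
        String digits = reference.substring(2, reference.length() - 1);
        int codePoint = Integer.parseInt(digits);
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = decodeDecimalReference("&#128512;");

        // One code point in XML terms, two code units in a Java string.
        System.out.println(s.codePointCount(0, s.length())); // prints 1
        System.out.println(s.length());                      // prints 2
        System.out.printf("%04X %04X%n", (int) s.charAt(0), (int) s.charAt(1)); // prints D83D DE00
    }
}
```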

    Unicode is mentioned in only three places: on page 108 (about EBNF character ranges), on page 111 (tokens are a sequence of Unicode characters), and on page 114 (in an example).

    While it might be nice for the value to be a code point, the syntax clearly limits you to four digits, leaving you no way to express larger code point values. If it were a code point, you would be limited to about 64K characters (minus several thousand code points that are not allowed for various reasons).

    The easiest repair is to state clearly that the \u notation assumes that UTF-16 is being used to encode the strings, and that UTF-16 rules must be used when specifying hex values for characters.
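
    A minimal sketch of what that repair could mean for an implementation, assuming each four-digit escape is taken as one UTF-16 code unit and the result is then checked against the UTF-16 pairing rules. The class FeelUnicodeEscapes and both methods are hypothetical names, not part of any FEEL engine.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FeelUnicodeEscapes {

    // Matches a backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    // Expands each four-digit escape into a single UTF-16 code unit.
    static String unescape(String text) {
        Matcher m = ESCAPE.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start());
            out.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    // Applies the UTF-16 rule: a high surrogate must be immediately
    // followed by a low surrogate, and a low surrogate may not stand alone.
    static boolean isWellFormedUtf16(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false;
                }
                i++; // skip the low surrogate that completes the pair
            } else if (Character.isLowSurrogate(c)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String paired = unescape("\\uD83D\\uDE00");
        String lone = unescape("\\uD83D");
        System.out.println(isWellFormedUtf16(paired)); // prints true
        System.out.println(isWellFormedUtf16(lone));   // prints false
    }
}
```

    Under this assumption, the example "\uD83D" from the issue would be rejected on its own and accepted only as the first half of a surrogate pair.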

    I believe most implementations to date have assumed that these are UTF-16 code unit values. That is what Java does. That is what JavaScript does. I don't know of any environments that do anything different for this kind of expression.

  • Reported: DMN 1.2b1 — Fri, 8 Feb 2019 18:33 GMT
  • Updated: Tue, 18 Jun 2019 16:17 GMT