Merge pull request #1032 from matrix-org/rav/mxid_grammar

Indentifier grammar updates
2026-04-29 05:44:10 +02:00 · 2017-10-23 10:57:47 +01:00 · 2017-10-23 10:57:47 +01:00 · 6282a53ca9
parent 8d8ea861ec 44fc033624
commit 6282a53ca9
6 changed files with 256 additions and 229 deletions
--- a/api/client-server/registration.yaml
+++ b/api/client-server/registration.yaml
@ -45,6 +45,16 @@ paths:
        If the client does not supply a ``device_id``, the server must
        auto-generate one.
        The server SHOULD register an account with a User ID based on the
        ``username`` provided, if any. Note that the grammar of Matrix User ID
        localparts is restricted, so the server MUST either map the provided
        ``username`` onto a ``user_id`` in a logical manner, or reject
        ``username``\s which do not comply to the grammar, with
        ``M_INVALID_USERNAME``.
        Matrix clients MUST NOT assume that localpart of the registered
        ``user_id`` matches the provided ``username``.
        The returned access token must be associated with the ``device_id``
        supplied by the client or generated by the server. The server may
        invalidate any access token previously associated with that device. See
@ -86,7 +96,7 @@ paths:
              username:
                type: string
                description: |-
-                  The local part of the desired Matrix ID. If omitted,
+                  The basis for the localpart of the desired Matrix ID. If omitted,
                  the homeserver MUST generate a Matrix ID local part.
                example: cheeky_monkey
              password:
@ -121,7 +131,11 @@ paths:
            properties:
              user_id:
                type: string
-                description: The fully-qualified Matrix ID that has been registered.
+                description: |-
                  The fully-qualified Matrix user ID (MXID) that has been registered.
                  Any user ID returned by this API must conform to the grammar given in the
                  `Matrix specification <https://matrix.org/docs/spec/appendices.html#user-identifiers>`_.
              access_token:
                type: string
                description: |-
--- a/changelogs/client_server.rst
+++ b/changelogs/client_server.rst
@ -92,6 +92,9 @@
  - Add some clarifying notes on the behaviour of rooms with no
    ``m.room.power_levels`` event
    (`#1026 <https://github.com/matrix-org/matrix-doc/pull/1026>`_).
  - Clarify the relationship between ``username`` and ``user_id`` in the
    ``/register`` API
    (`#1032 <https://github.com/matrix-org/matrix-doc/pull/1032>`_).
 r0.2.0
 ======
--- a/specification/appendices/identifier_grammar.rst
+++ b/specification/appendices/identifier_grammar.rst
@ -0,0 +1,225 @@
 .. Copyright 2016 Openmarket Ltd.
 .. Copyright 2017 New Vector Ltd.
 ..
 .. Licensed under the Apache License, Version 2.0 (the "License");
 .. you may not use this file except in compliance with the License.
 .. You may obtain a copy of the License at
 ..
 ..     http://www.apache.org/licenses/LICENSE-2.0
 ..
 .. Unless required by applicable law or agreed to in writing, software
 .. distributed under the License is distributed on an "AS IS" BASIS,
 .. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 .. See the License for the specific language governing permissions and
 .. limitations under the License.
 Identifier Grammar
 ------------------
 Server Name
 ~~~~~~~~~~~
 A homeserver is uniquely identified by its server name. This value is used in a
 number of identifiers, as described below.
 The server name represents the address at which the homeserver in question can
 be reached by other homeservers. The complete grammar is::
    server_name = dns_name [ ":" port]
    dns_name = host
    port = *DIGIT
 where ``host`` is as defined by `RFC3986, section 3.2.2
 <https://tools.ietf.org/html/rfc3986#section-3.2.2>`_.
 Examples of valid server names are:
 * ``matrix.org``
 * ``matrix.org:8888``
 * ``1.2.3.4`` (IPv4 literal)
 * ``1.2.3.4:1234`` (IPv4 literal with explicit port)
 * ``[1234:5678::abcd]`` (IPv6 literal)
 * ``[1234:5678::abcd]:5678`` (IPv6 literal with explicit port)
 Common Identifier Format
 ~~~~~~~~~~~~~~~~~~~~~~~~
 The Matrix protocol uses a common format to assign unique identifiers to a
 number of entities, including users, events and rooms. Each identifier takes
 the form::
  &localpart:domain
 where ``&`` represents a 'sigil' character; ``domain`` is the `server name`_ of
 the homeserver which allocated the identifier, and ``localpart`` is an
 identifier allocated by that homeserver.
 The sigil characters are as follows:
 * ``@``: User ID
 * ``!``: Room ID
 * ``$``: Event ID
 * ``#``: Room alias
 The precise grammar defining the allowable format of an identifier depends on
 the type of identifier.
 User Identifiers
 ++++++++++++++++
 Users within Matrix are uniquely identified by their Matrix user ID. The user
 ID is namespaced to the homeserver which allocated the account and has the
 form::
  @localpart:domain
 The ``localpart`` of a user ID is an opaque identifier for that user. It MUST
 NOT be empty, and MUST contain only the characters ``a-z``, ``0-9``, ``.``,
 ``_``, ``=``, ``-``, and ``/``.
 The ``domain`` of a user ID is the `server name`_ of the homeserver which
 allocated the account.
 The length of a user ID, including the ``@`` sigil and the domain, MUST NOT
 exceed 255 characters.
 The complete grammar for a legal user ID is::
  user_id = "@" user_id_localpart ":" server_name
  user_id_localpart = 1*user_id_char
  user_id_char = DIGIT
               / %x61-7A                   ; a-z
               / "-" / "." / "=" / "_" / "/"
 .. admonition:: Rationale
  A number of factors were considered when defining the allowable characters
  for a user ID.
  Firstly, we chose to exclude characters outside the basic US-ASCII character
  set. User IDs are primarily intended for use as an identifier at the protocol
  level, and their use as a human-readable handle is of secondary
  benefit. Furthermore, they are useful as a last-resort differentiator between
  users with similar display names. Allowing the full unicode character set
  would make very difficult for a human to distinguish two similar user IDs. The
  limited character set used has the advantage that even a user unfamiliar with
  the Latin alphabet should be able to distinguish similar user IDs manually, if
  somewhat laboriously.
  We chose to disallow upper-case characters because we do not consider it
  valid to have two user IDs which differ only in case: indeed it should be
  possible to reach ``@user:matrix.org`` as ``@USER:matrix.org``. However,
  user IDs are necessarily used in a number of situations which are inherently
  case-sensitive (notably in the ``state_key`` of ``m.room.member``
  events). Forbidding upper-case characters (and requiring homeservers to
  downcase usernames when creating user IDs for new users) is a relatively simple
  way to ensure that ``@USER:matrix.org`` cannot refer to a different user to
  ``@user:matrix.org``.
  Finally, we decided to restrict the allowable punctuation to a very basic set
  to reduce the possibility of conflicts with special characters in various
  situations. For example, "*" is used as a wildcard in some APIs (notably the
  filter API), so it cannot be a legal user ID character.
  The length restriction is derived from the limit on the length of the
  ``sender`` key on events; since the user ID appears in every event sent by the
  user, it is limited to ensure that the user ID does not dominate over the actual
  content of the events.
 Matrix user IDs are sometimes informally referred to as MXIDs.
 Historical User IDs
 <<<<<<<<<<<<<<<<<<<
 Older versions of this specification were more tolerant of the characters
 permitted in user ID localparts. There are currently active users whose user
 IDs do not conform to the permitted character set, and a number of rooms whose
 history includes events with a ``sender`` which does not conform. In order to
 handle these rooms successfully, clients and servers MUST accept user IDs with
 localparts from the expanded character set::
  extended_user_id_char = %x21-7E
 Mapping from other character sets
 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
 In certain circumstances it will be desirable to map from a wider character set
 onto the limited character set allowed in a user ID localpart. Examples include
 a homeserver creating a user ID for a new user based on the username passed to
 ``/register``, or a bridge mapping user ids from another protocol.
 .. TODO-spec
   We need to better define the mechanism by which homeservers can allow users
   to have non-Latin login credentials. The general idea is for clients to pass
   the non-Latin in the ``username`` field to ``/register`` and ``/login``, and
   the HS then maps it onto the MXID space when turning it into the
   fully-qualified ``user_id`` which is returned to the client and used in
   events.
 Implementations are free to do this mapping however they choose. Since the user
 ID is opaque except to the implementation which created it, the only
 requirement is that the implemention can perform the mapping
 consistently. However, we suggest the following algorithm:
 1. Encode character strings as UTF-8.
 2. Convert the bytes ``A-Z`` to lower-case.
   * In the case where a bridge must be able to distinguish two different users
     with ids which differ only by case, escape upper-case characters by
     prefixing with ``_`` before downcasing. For example, ``A`` becomes
     ``_a``. Escape a real ``_`` with a second ``_``.
 3. Encode any remaining bytes outside the allowed character set, as well as
   ``=``, as their hexadecimal value, prefixed with ``=``. For example, ``#``
   becomes ``=23``; ``á`` becomes ``=c3=a1``.
 .. admonition:: Rationale
  The suggested mapping is an attempt to preserve human-readability of simple
  ASCII identifiers (unlike, for example, base-32), whilst still allowing
  representation of *any* character (unlike punycode, which provides no way to
  encode ASCII punctuation).
 Room IDs and Event IDs
 ++++++++++++++++++++++
 A room has exactly one room ID. A room ID has the format::
  !opaque_id:domain
 An event has exactly one event ID. An event ID has the format::
  $opaque_id:domain
 The ``domain`` of a room/event ID is the `server name`_ of the homeserver which
 created the room/event. The domain is used only for namespacing to avoid the
 risk of clashes of identifiers between different homeservers. There is no
 implication that the room or event in question is still available at the
 corresponding homeserver.
 Event IDs and Room IDs are case-sensitive. They are not meant to be human
 readable.
 .. TODO-spec
  What is the grammar for the opaque part? https://matrix.org/jira/browse/SPEC-389
 Room Aliases
 ++++++++++++
 A room may have zero or more aliases. A room alias has the format::
      #room_alias:domain
 The ``domain`` of a room alias is the `server name`_ of the homeserver which
 created the alias. Other servers may contact this homeserver to look up the
 alias.
 Room aliases MUST NOT exceed 255 bytes (including the ``#`` sigil and the
 domain).
 .. TODO-spec
  - Need to specify precise grammar for Room Aliases. https://matrix.org/jira/browse/SPEC-391
--- a/specification/index.rst
+++ b/specification/index.rst
@ -27,17 +27,17 @@ Voice over IP (VoIP) signalling, Internet of Things (IoT) communication, and bri
 together existing communication silos - providing the basis of a new open real-time
 communication ecosystem.
 `Introduction to Matrix <intro.html>`_ provides a full introduction to Matrix and the spec.
 Matrix APIs
 -----------
-The following APIs are documented in this specification:
+The specification consists of the following parts:
 `Introduction to Matrix <intro.html>`_ provides a full introduction to Matrix and the spec.
 {{apis}}
-`Appendices <appendices.html>`_ with supplemental information not specific to
+The `Appendices <appendices.html>`_ contain supplemental information not specific to
-one of the above APIs are also available.
+one of the above APIs.
 Specification Versions
 ----------------------
--- a/specification/intro.rst
+++ b/specification/intro.rst
@ -157,9 +157,8 @@ allocated the account and has the form::
  @localpart:domain
-See the `Identifier Grammar`_ section for full details of the structure of
+See the `appendices <appendices.html#identifier-grammar>`_ for full details of
-user IDs.
+the structure of user IDs.
 Devices
 ~~~~~~~
@ -242,8 +241,8 @@ There is exactly one room ID for each room. Whilst the room ID does contain a
 domain, it is simply for globally namespacing room IDs. The room does NOT
 reside on the domain specified.
-See the `Identifier Grammar`_ section for full details of the structure of
+See the `appendices <appendices.html#identifier-grammar>`_ for full details of
-a room ID.
+the structure of a room ID.
 The following conceptual diagram shows an
 ``m.room.message`` event being sent to the room ``!qporfwt:matrix.org``::
@ -318,8 +317,8 @@ Each room can also have multiple "Room Aliases", which look like::
  #room_alias:domain
-See the `Identifier Grammar`_ section for full details of the structure of
+See the `appendices <appendices.html#identifier-grammar>`_ for full details of
-a room alias.
+the structure of a room alias.
 A room alias "points" to a room ID and is the human-readable label by which
 rooms are publicised and discovered.  The room ID the alias is pointing to can
@ -387,221 +386,6 @@ dedicated API.  The API is symmetrical to managing Profile data.
  Would it really be overengineered to use the same API for both profile &
  private user data, but with different ACLs?
 Identifier Grammar
 ------------------
 Server Name
 ~~~~~~~~~~~
 A homeserver is uniquely identified by its server name. This value is used in a
 number of identifiers, as described below.
 The server name represents the address at which the homeserver in question can
 be reached by other homeservers. The complete grammar is::
    server_name = dns_name [ ":" port]
    dns_name = host
    port = *DIGIT
 where ``host`` is as defined by `RFC3986, section 3.2.2
 <https://tools.ietf.org/html/rfc3986#section-3.2.2>`_.
 Examples of valid server names are:
 * ``matrix.org``
 * ``matrix.org:8888``
 * ``1.2.3.4`` (IPv4 literal)
 * ``1.2.3.4:1234`` (IPv4 literal with explicit port)
 * ``[1234:5678::abcd]`` (IPv6 literal)
 * ``[1234:5678::abcd]:5678`` (IPv6 literal with explicit port)
 Common Identifier Format
 ~~~~~~~~~~~~~~~~~~~~~~~~
 The Matrix protocol uses a common format to assign unique identifiers to a
 number of entities, including users, events and rooms. Each identifier takes
 the form::
  &localpart:domain
 where ``&`` represents a 'sigil' character; ``domain`` is the `server name`_ of
 the homeserver which allocated the identifier, and ``localpart`` is an
 identifier allocated by that homeserver.
 The sigil characters are as follows:
 * ``@``: User ID
 * ``!``: Room ID
 * ``$``: Event ID
 * ``#``: Room alias
 The precise grammar defining the allowable format of an identifier depends on
 the type of identifier.
 User Identifiers
 ++++++++++++++++
 Users within Matrix are uniquely identified by their Matrix user ID. The user
 ID is namespaced to the homeserver which allocated the account and has the
 form::
  @localpart:domain
 The ``localpart`` of a user ID is an opaque identifier for that user. It MUST
 NOT be empty, and MUST contain only the characters ``a-z``, ``0-9``, ``.``,
 ``_``, ``=``, and ``-``.
 The ``domain`` of a user ID is the `server name`_ of the homeserver which
 allocated the account.
 The length of a user ID, including the ``@`` sigil and the domain, MUST NOT
 exceed 255 characters.
 The complete grammar for a legal user ID is::
  user_id = "@" user_id_localpart ":" server_name
  user_id_localpart = 1*user_id_char
  user_id_char = DIGIT
               / %x61-7A                   ; a-z
               / "-" / "." / "=" / "_"
 .. admonition:: Rationale
  A number of factors were considered when defining the allowable characters
  for a user ID.
  Firstly, we chose to exclude characters outside the basic US-ASCII character
  set. User IDs are primarily intended for use as an identifier at the protocol
  level, and their use as a human-readable handle is of secondary
  benefit. Furthermore, they are useful as a last-resort differentiator between
  users with similar display names. Allowing the full unicode character set
  would make very difficult for a human to distinguish two similar user IDs. The
  limited character set used has the advantage that even a user unfamiliar with
  the Latin alphabet should be able to distinguish similar user IDs manually, if
  somewhat laboriously.
  We chose to disallow upper-case characters because we do not consider it
  valid to have two user IDs which differ only in case: indeed it should be
  possible to reach ``@user:matrix.org`` as ``@USER:matrix.org``. However,
  user IDs are necessarily used in a number of situations which are inherently
  case-sensitive (notably in the ``state_key`` of ``m.room.member``
  events). Forbidding upper-case characters (and requiring homeservers to
  downcase usernames when creating user IDs for new users) is a relatively simple
  way to ensure that ``@USER:matrix.org`` cannot refer to a different user to
  ``@user:matrix.org``.
  Finally, we decided to restrict the allowable punctuation to a very basic set
  to ensure that the identifier can be used as-is in as wide a number of
  situations as possible, without requiring escaping. For instance, allowing
  "%" or "/" would make it harder to use a user ID in a URI. "*" is used as a
  wildcard in some APIs (notably the filter API), so it also cannot be a legal
  user ID character.
  The length restriction is derived from the limit on the length of the
  ``sender`` key on events; since the user ID appears in every event sent by the
  user, it is limited to ensure that the user ID does not dominate over the actual
  content of the events.
 Matrix user IDs are sometimes informally referred to as MXIDs.
 Historical User IDs
 <<<<<<<<<<<<<<<<<<<
 Older versions of this specification were more tolerant of the characters
 permitted in user ID localparts. There are currently active users whose user
 IDs do not conform to the permitted character set, and a number of rooms whose
 history includes events with a ``sender`` which does not conform. In order to
 handle these rooms successfully, clients and servers MUST accept user IDs with
 localparts from the expanded character set::
  extended_user_id_char = %x21-7E
 Mapping from other character sets
 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
 In certain circumstances it will be desirable to map from a wider character set
 onto the limited character set allowed in a user ID localpart. Examples include
 a homeserver creating a user ID for a new user based on the username passed to
 ``/register``, or a bridge mapping user ids from another protocol.
 .. TODO-spec
   We need to better define the mechanism by which homeservers can allow users
   to have non-Latin login credentials. The general idea is for clients to pass
   the non-Latin in the ``username`` field to ``/register`` and ``/login``, and
   the HS then maps it onto the MXID space when turning it into the
   fully-qualified ``user_id`` which is returned to the client and used in
   events.
 Implementations are free to do this mapping however they choose. Since the user
 ID is opaque except to the implementation which created it, the only
 requirement is that the implemention can perform the mapping
 consistently. However, we suggest the following algorithm:
 1. Encode character strings as UTF-8.
 2. Convert the bytes ``A-Z`` to lower-case.
   * In the case where a bridge must be able to distinguish two different users
     with ids which differ only by case, escape upper-case characters by
     prefixing with ``_`` before downcasing. For example, ``A`` becomes
     ``_a``. Escape a real ``_`` with a second ``_``.
 3. Encode any remaining bytes outside the allowed character set, as well as
   ``=``, as their hexadecimal value, prefixed with ``=``. For example, ``#``
   becomes ``=23``; ``á`` becomes ``=c3=a1``.
 .. admonition:: Rationale
  The suggested mapping is an attempt to preserve human-readability of simple
  ASCII identifiers (unlike, for example, base-32), whilst still allowing
  representation of *any* character (unlike punycode, which provides no way to
  encode ASCII punctuation).
 Room IDs and Event IDs
 ++++++++++++++++++++++
 A room has exactly one room ID. A room ID has the format::
  !opaque_id:domain
 An event has exactly one event ID. An event ID has the format::
  $opaque_id:domain
 The ``domain`` of a room/event ID is the `server name`_ of the homeserver which
 created the room/event. The domain is used only for namespacing to avoid the
 risk of clashes of identifiers between different homeservers. There is no
 implication that the room or event in question is still available at the
 corresponding homeserver.
 Event IDs and Room IDs are case-sensitive. They are not meant to be human
 readable.
 .. TODO-spec
  What is the grammar for the opaque part? https://matrix.org/jira/browse/SPEC-389
 Room Aliases
 ++++++++++++
 A room may have zero or more aliases. A room alias has the format::
      #room_alias:domain
 The ``domain`` of a room alias is the `server name`_ of the homeserver which
 created the alias. Other servers may contact this homeserver to look up the
 alias.
 Room aliases MUST NOT exceed 255 bytes (including the ``#`` sigil and the
 domain).
 .. TODO-spec
  - Need to specify precise grammar for Room Aliases. https://matrix.org/jira/browse/SPEC-391
 License
 -------
--- a/specification/targets.yaml
+++ b/specification/targets.yaml
@ -34,6 +34,7 @@ targets:
      - appendices.rst
      - appendices/base64.rst
      - appendices/signing_json.rst
      - appendices/identifier_grammar.rst
      - appendices/threat_model.rst
      - appendices/test_vectors.rst
 groups:  # reusable blobs of files when prefixed with 'group:'