Support for multi-lingual candidate names #138

saumier · 2023-09-14T13:55:08Z

As a service provider, I would like clients to be able to query in any language and to receive candidate names in one or more languages specified in the client request.

Use Case

A client is reconciling a place in Canada using the Artsdata.ca Reconciliation service with the name "Studio Azrieli".

Current solution (not ideal)

The service returns multiple entities, including K11-15 "National Arts Centre - Azrieli Studio" and K11-15 "Centre National des Arts - Studio Azrieli", which appear as separate entities but have the same URI. This may look incorrect to the user because there are 2 candidates: if the user doesn't notice that they share a URI, they may be mistaken for duplicates.

Ideal solution

The service returns multiple entities but only a single K11-15 displaying both names "National Arts Centre - Azrieli Studio" and "Centre National des Arts - Studio Azrieli" together. Parameters can specify the languages the client would like to display.

fsteeg · 2023-11-09T10:41:25Z

So on the protocol level, would this mean allowing arrays of objects for candidate name and description?

"candidates": [
  {
    "id": "K11-15",
    "name": [
      {
        "str": "National Arts Centre - Azrieli Studio",
        "lang": "en"
      },
      {
        "str": "Centre National des Arts - Studio Azrieli",
        "lang": "fr"
      }
    ],
    ...
  }
]

wetneb · 2023-11-09T13:52:53Z

If we go down that route, I wonder if we should also add support for that for multiple names for properties (when returned in a property suggest response, or in a data extension response) or for types (when returned in a type suggest response, or in a reconciliation response as part of the reconciliation candidates). I guess it would make things look more uniform but I am not really sure about the use case. What do you think @saumier?

saumier · 2024-03-14T14:06:20Z

> If we go down that route, I wonder if we should also add support for that for multiple names for properties (when returned in a property suggest response, or in a data extension response) or for types (when returned in a type suggest response, or in a reconciliation response as part of the reconciliation candidates). I guess it would make things look more uniform but I am not really sure about the use case. What do you think @saumier?

Yes. Since the group is not recommending JSON-LD, then I think this is the next best approach.

I am implementing a bilingual website (en, fr) that includes a client for the reconciliation API, here: kg.artsdata.ca. The UI of this site can switch between English and French. When querying using the reconciliation API, a query string can be in any language. For example, I could query a Place using "Studio Azrieli" or "Azrieli Studio". The response would return candidates including K11-15. With this new approach, the website could display the name and description in the UI language.

It would also be good to add support for property and type suggestions.

wetneb · 2024-04-05T14:57:11Z

Summary of our discussion on the monthly call of last month: we could either

- always wrap the name and description fields of entities, properties and types in additional arrays/objects so that multiple values can be specified depending on the language (see Support for multi-lingual candidate names #138 (comment)). This would be done even if only a single language is used, with the benefit of using a consistent JSON structure regardless of the data.
- do this wrapping only in cases where multiple languages need to be returned, and fall back on the current syntax (bare strings) when a single language is provided. This has the benefit of offering a simpler JSON structure for most use cases.

Maybe there are other options?

We thought that it is worth bringing more attention to this issue from the broader community, to gather more feedback.

tfmorris · 2024-04-05T18:14:44Z

Unless the variable structure is backward compatible when the simple variant is used, I think it's better to be consistent and always use the array form, even for a single entry. I suspect that things have diverged enough that there's not a compatibility benefit.

thadguidry · 2024-04-06T02:11:31Z

I second @tfmorris's opinion. I like the consistency: when our API standards have a context that could be "one or many", we resort to the array form. (Mostly because the idea of a "simpler JSON structure" presupposes that JSON arrays of objects are complicated or noisy, when they really are not for developers and our 2024+ tooling.)

acka47 · 2024-04-08T13:25:16Z

Generally, this seems to be related to #52, as a solution to this issue will also resolve #52, won't it?

> Maybe there are other options?

I am late to the party (sorry) but am adding this for reference. Generally, I like the "language map" approach from JSON-LD (examples) for providing labels in multiple languages as it is simple, terse and easy to read. The example from #138 (comment) would look like this with language maps:

{
   "candidates":[
      {
         "id":"K11-15",
         "name":{
            "en":"National Arts Centre - Azrieli Studio",
            "fr":"Centre National des Arts - Studio Azrieli"
         }
      }
   ]
}

thadguidry · 2024-04-08T13:44:23Z

@acka47 If we went that route, we'd have to adopt a convention and document it. For example, should the key be an ISO 639-3 three-letter code? Hmm, what else?

wetneb · 2024-04-08T13:54:33Z

@acka47 I like the conciseness, but how would a service represent a name or description for which it does not know the language? (Use case: a tool like CSV-reconcile, which spins up a reconciliation service on arbitrary datasets, generally will not have access to this sort of information and shouldn't make up a language for the sake of fitting in.)

acka47 · 2024-04-08T14:57:17Z

> If we went that route, we'd have to adopt a convention and document it. That being the key should be an ISO 639-3 three letter code?

Yes, we could define it similar to JSON-LD like this: "keys must be strings representing [BCP47] language codes and the values must be a string."

> how would a service represent a name or description for which it does not know the language?

Good question. I guess for the other approach from #138 (comment) you would just omit the optional lang key. With the language map approach you would have to use und as the key (for "undetermined"), I guess.

awagner-mainz · 2024-04-10T14:09:33Z

Would the array approach allow for multiple alias names in the same language whereas the map approach would not? That could be an argument for choosing the array approach. On the other hand, I am not sure we actually want to allow this?
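To make the trade-off concrete, here is a minimal Python sketch that converts the array form into a language map. It assumes the "und" convention suggested above for names without a language; the function name `to_language_map` and the third alias entry are hypothetical and only serve as illustration. It shows that a second name in the same language has no slot in a map.

```python
from typing import Dict, List, Tuple


def to_language_map(names: List[dict]) -> Tuple[Dict[str, str], List[dict]]:
    """Convert array-form names to a language map.

    Returns the map plus the entries that could not be kept because
    their language key was already taken.
    """
    lang_map: Dict[str, str] = {}
    dropped: List[dict] = []
    for entry in names:
        # Assumption: "und" (undetermined) stands in for a missing lang.
        key = entry.get("lang", "und")
        if key in lang_map:
            dropped.append(entry)
        else:
            lang_map[key] = entry["str"]
    return lang_map, dropped


names = [
    {"str": "National Arts Centre - Azrieli Studio", "lang": "en"},
    {"str": "Centre National des Arts - Studio Azrieli", "lang": "fr"},
    {"str": "NAC Azrieli Studio", "lang": "en"},  # hypothetical alias
]
lang_map, dropped = to_language_map(names)
```

Here `dropped` ends up holding the second English entry, which is the information a plain string-valued language map would lose.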

fsteeg · 2024-04-11T11:41:18Z

Another aspect to consider for the lang field vs. language maps is that the field provides a general approach for all objects. To quote from the current draft:

> All objects used in this protocol (entities, types, properties, queries, candidates, features, etc.) MAY declare an explicit text-processing language in a lang field.

fsteeg · 2024-04-11T11:58:40Z

> [...] I think it's better to be consistent and always use the array form [...]

To be clear, this is not only about array vs. non-array, but also object vs. string.

The common, simple case currently:

"name": "National Arts Centre - Azrieli Studio"

The common case in the unified syntax:

"name": [
  {
    "str": "National Arts Centre - Azrieli Studio"
  }
]

If this was the first and only place where we introduce optional structure (string or array of objects), I'd agree we might want to avoid that. But since we do the same thing in other places (e.g. property values), I feel like the much simpler common case is worth having the option.
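If the optional structure were kept, a client would need to accept both shapes. A minimal Python sketch of that normalization, assuming only the two shapes shown above (a bare string, or an array of objects with a str field); `normalize_name` is a hypothetical helper, not part of the protocol:

```python
from typing import List, Union


def normalize_name(value: Union[str, List[dict]]) -> List[dict]:
    """Return the name as a list of string objects, whichever form was sent."""
    if isinstance(value, str):
        # Bare string: the simple, common case.
        return [{"str": value}]
    # Already an array of objects: copy so the caller can modify it safely.
    return [dict(entry) for entry in value]


simple = normalize_name("National Arts Centre - Azrieli Studio")
unified = normalize_name([{"str": "National Arts Centre - Azrieli Studio"}])
```

Both calls produce the same list, so the rest of the client can work with one structure regardless of which variant the service returns.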

saumier · 2024-04-11T14:01:10Z

> how would a service represent a name or description for which it does not know the language?

From JSON-LD https://www.w3.org/TR/json-ld/#example-102-indexing-languaged-tagged-strings-using-none-for-no-language

> ... the special index @none is used for indexing strings which do not have a language; this is useful to maintain a normalized representation for string values not having a datatype.

An example where a name has no language:

{
   "candidates":[
      {
         "id":"K11-15",
         "name":{
            "@none":"National Arts Centre - Azrieli Studio"
         }
      }
   ]
}

wetneb · 2024-04-11T14:11:37Z

I'm not really enthusiastic about any of the solutions, but the one that I find the least bad is @fsteeg's suggestion to use the existing language (+ text direction) mechanisms we have, and simply switch to this default syntax:

"name": [
  {
    "str": "National Arts Centre - Azrieli Studio"
  }
]

with the option to add lang and dir attributes at the same level as the str if needed, and to add more objects to the array.
This also has the benefit of allowing multiple names to be returned in the same language (for alternate names, such as acronyms).
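On the client side, choosing what to display from such an array could look like the following Python sketch. The fallback order (tagged match, then untagged name, then first name) is an assumption made for illustration, and the comparison is a plain string match on the lang value rather than full BCP 47 range matching; `pick_name` is a hypothetical helper.

```python
from typing import List, Optional


def pick_name(names: List[dict], ui_lang: str) -> Optional[str]:
    """Pick one display string for the given UI language."""
    # 1. First name explicitly tagged with the UI language.
    for entry in names:
        if entry.get("lang") == ui_lang:
            return entry["str"]
    # 2. Otherwise, first name without any language tag.
    for entry in names:
        if "lang" not in entry:
            return entry["str"]
    # 3. Otherwise, whatever comes first (or None for an empty array).
    return names[0]["str"] if names else None


names = [
    {"str": "National Arts Centre - Azrieli Studio", "lang": "en"},
    {"str": "Centre National des Arts - Studio Azrieli", "lang": "fr"},
]
french = pick_name(names, "fr")
fallback = pick_name(names, "de")
```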

wetneb · 2024-04-11T14:17:01Z

And I agree with @tfmorris on the preference to stick to the array form.

saumier · 2024-04-11T19:10:17Z

I also agree with @wetneb and @tfmorris to use an array of objects with the str attribute and optional lang and dir.

For the sake of comparison with other patterns, this somewhat resembles the keys @value, @language and @direction used in JSON-LD.

acka47 · 2024-04-12T07:11:46Z

I have no preference here but just felt that the language map approach should at least be discussed in this context. Thus, I am fine with an array of objects containing at least the str with optional lang and dir.

saumier · 2024-07-08T14:42:41Z

@wetneb My team has implemented an endpoint for the current draft spec and updated our branch of the test bench to support both v0.2 and v0.3 (draft).

Here are 2 screen grabs from our branch of test bench. One showing our production reconciliation endpoint v0.2 and a second screen grab showing our test reconciliation endpoint v0.3 with multi-lingual support meeting the needs of this use case. This is a work in progress.

v0.2 - current spec - showing Azrieli Studio returned 2 times with the same ID K11-15

v0.3 - draft spec - showing Azrieli Studio entity combined in a single response with en and fr.

- Extract existing string object definition to its own schema file
- Reference string-object.json in the suggest response schemas
- Update spec & examples to use string objects in suggest responses
- Redefine types used in suggest response as described in the spec (instead of referencing the actual type.json schema)
- Clarify in the spec that we don't return actual full entity, property, or type objects in the suggest response's `result` field

fsteeg · 2024-10-10T15:45:18Z

Quoting myself in the related PR #176 (comment):

> Did we ever consider implementing it by (only) specifying the language(s) in the client request (in the Accept-Language header), and returning the old structure? Do we actually need to return multiple languages at the same time, for a single request?

I feel like we, in particular myself in #138 (comment), might have jumped to the solution of changing the data structure too quickly. One approach we discussed in today's meeting is using multiple requests, one for each language, each returning the current, simple structure.

So instead of a single response in the new format:

"candidates": [
  {
    "id": "K11-15",
    "name": [
      {
        "str": "National Arts Centre - Azrieli Studio",
        "lang": "en"
      },
      {
        "str": "Centre National des Arts - Studio Azrieli",
        "lang": "fr"
      }
    ],
    ...
  }
]

We'd have two responses (for two requests with different Accept-Language headers) in the old format:

"candidates": [
  {
    "id": "K11-15",
    "name": "National Arts Centre - Azrieli Studio",
    "lang": "en",
    ...
  }
]

"candidates": [
  {
    "id": "K11-15",
    "name": "Centre National des Arts - Studio Azrieli",
    "lang": "fr",
    ...
  }
]

This seems way more lightweight and in line with the other internationalization support, which is completely optional (request and response headers, optional lang and dir fields on existing objects), instead of determining the structure of the protocol.

It's actually kind of close to the original workaround of returning multiple candidates with the same ID but different labels by @saumier in #138 (comment), but I guess in all cases the client will have to handle something (grouping candidates with the same ID or displaying the new structure).

So not sure how that would be implemented exactly, but wanted to ask for feedback on the basic idea.

tfmorris · 2024-10-12T00:59:54Z

Using multiple queries seems inefficient to me. I think doing it the way the Google KG Search does with an ordered list of requested languages would be simpler:

https://kgsearch.googleapis.com/v1/entities:search?languages=fr&languages=en&query=etage&key=<key>

which then returns the results in the same order as specified by the request:

        "@id": "kg:/m/02vk6kk",
        "name": [
          {
            "@language": "fr",
            "@value": "Étage"
          },
          {
            "@value": "Storey",
            "@language": "en"
          }
        ],
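For the protocol's own str/lang objects, the same "ordered list of requested languages" idea could be sketched in Python as below. This is only an illustration under assumptions: `order_by_languages` is a hypothetical helper, names in languages that were not requested are dropped, and language tags are compared as exact strings.

```python
from typing import List


def order_by_languages(names: List[dict], languages: List[str]) -> List[dict]:
    """Keep only names in the requested languages, in the requested order."""
    rank = {lang: position for position, lang in enumerate(languages)}
    wanted = [entry for entry in names if entry.get("lang") in rank]
    # sorted() is stable, so several names in one language keep their order.
    return sorted(wanted, key=lambda entry: rank[entry["lang"]])


names = [
    {"str": "National Arts Centre - Azrieli Studio", "lang": "en"},
    {"str": "Centre National des Arts - Studio Azrieli", "lang": "fr"},
]
ordered = order_by_languages(names, ["fr", "en"])
```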

saumier · 2024-11-04T16:51:52Z

> I feel like we, in particular myself in #138 (comment), might have jumped to the solution of changing the data structure too quickly.

@fsteeg I am also coming around to the idea that we maybe changed the data structure too quickly.

In my specific use case, the implementation of the reconciliation service is such that it always processes "matchType": "name" requests in both languages (ignoring the Text-processing language if specified). This is because in Canada it is not uncommon to speak one language but use the name of an entity in another language. So a person speaking English may talk about a place using its French name while continuing to speak in English.

The root of my problem is how to return the response that the user expects to see in the UI, especially when there is an exact match in one language but not the other. To illustrate with a concrete example (as in the original use case), imagine a reconciliation query for "Studio Azrieli" with the Language of the intended audience set to "en" in the Accept-Language request header. The service processes the request by searching in both languages, "en" and "fr". The response is formatted in "en" because of the Accept-Language header. I could just display the English name "National Arts Centre - Azrieli Studio" and stop there. But ideally I would also like to display the French name so the user recognizes their search and sees the exact match. The current live production server for Artsdata.ca returns 2 candidates with the same URI but different names for English and French, with the exact-match candidate highlighted. This has the downside (as mentioned in my original use case) that returning 2 candidates may appear incorrect to the user, and they may be mistaken for duplicates.
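As a rough illustration of the display behaviour described here, the following Python sketch returns the UI-language name and, when that name does not contain the query, the name in another language that does. It assumes names arrive as str/lang objects and treats "match" as a case-insensitive substring test, which is a simplification; `display_names` is a hypothetical helper, not Artsdata's implementation.

```python
from typing import List


def display_names(names: List[dict], ui_lang: str, query: str) -> List[str]:
    """Return the UI-language name, plus the name matching the query if different."""
    ui_name = next((n["str"] for n in names if n.get("lang") == ui_lang), None)
    needle = query.casefold()
    # If the UI-language name already contains the query, show only that.
    if ui_name is not None and needle in ui_name.casefold():
        return [ui_name]
    # Otherwise look for a name in any language that contains the query.
    matched = next((n["str"] for n in names if needle in n["str"].casefold()), None)
    return [s for s in (ui_name, matched) if s is not None]


names = [
    {"str": "National Arts Centre - Azrieli Studio", "lang": "en"},
    {"str": "Centre National des Arts - Studio Azrieli", "lang": "fr"},
]
shown = display_names(names, "en", "Studio Azrieli")
```

For the query "Studio Azrieli" with an English UI, `shown` contains the English name followed by the French one, so the user can still recognize their search.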

> This seems way more lightweight and in line with the other internationalization support, which is completely optional (request and response headers, optional lang and dir fields on existing objects), instead of determining the structure of the protocol.
>
> It's actually kind of close to the original workaround of returning multiple candidates with the same ID but different labels by @saumier in #138 (comment), but I guess in all cases the client will have to handle something (grouping candidates with the same ID or displaying the new structure).
>
> So not sure how that would be implemented exactly, but wanted to ask for feedback on the basic idea.

@fsteeg I understand your idea, but instead of doing two requests with different Accept-Language request headers, I am thinking of sacrificing one display language instead. The user may not recognize their "exact" match in the response if the response candidate is in a different language than their initial search string, but this may not really be a show stopper. I plan to do some usability testing on my end with the idea of displaying only the Language of the intended audience in the UI.

fsteeg · 2024-11-12T11:40:18Z

> This has the downside (as mentioned in my original use case) that returning 2 candidates may appear incorrect to the user and be mistaken as duplicates. [...] I am thinking of sacrificing one display language instead.

Isn't that mainly a client / UI issue? You could return both candidates, no matter which language(s) were requested:

"candidates": [
  {
    "id": "K11-15",
    "name": "National Arts Centre - Azrieli Studio",
    "lang": "en",
    ...
  },
  {
    "id": "K11-15",
    "name": "Centre National des Arts - Studio Azrieli",
    "lang": "fr",
    ...
  }
]

As mentioned above, it's up to the client to display multi-language candidates properly (no matter which approach we take), e.g. by grouping these candidates by ID, or by language, or by simply adding the language as a field in the UI.
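A minimal Python sketch of the "group by ID" option on the client, assuming flat candidates with id, name and an optional lang as in the example above. `group_by_id` is a hypothetical helper, and other candidate fields (score, type, and so on) are left out for brevity.

```python
from typing import Dict, List


def group_by_id(candidates: List[dict]) -> List[dict]:
    """Merge candidates that share an ID, collecting their names."""
    grouped: Dict[str, dict] = {}
    for candidate in candidates:
        entry = grouped.setdefault(
            candidate["id"], {"id": candidate["id"], "name": []}
        )
        name = {"str": candidate["name"]}
        if "lang" in candidate:
            name["lang"] = candidate["lang"]
        # Skip exact repeats of the same name and language.
        if name not in entry["name"]:
            entry["name"].append(name)
    # Dicts keep insertion order, so candidates stay in response order.
    return list(grouped.values())


candidates = [
    {"id": "K11-15", "name": "National Arts Centre - Azrieli Studio", "lang": "en"},
    {"id": "K11-15", "name": "Centre National des Arts - Studio Azrieli", "lang": "fr"},
]
merged = group_by_id(candidates)
```

With the two K11-15 candidates above, `merged` holds a single entry carrying both names, which the UI can then render as one row.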

wetneb mentioned this issue Jul 8, 2024

Enhancement/issue 13 reconciliation-api/testbench#74

Closed

fsteeg added a commit that referenced this issue Jul 10, 2024

Change candidate name from string to array of objects (#138)

1e5ca4a

With required `str`, optional `lang` and `dir` fields

fsteeg added a commit that referenced this issue Jul 10, 2024

Extract string_object definition from candidate name (#138)

42bdcbc

And use for candidate `description`

fsteeg linked a pull request Jul 10, 2024 that will close this issue

Change candidate name and description from string to array of objects with required str, optional lang and dir fields #176

Open

fsteeg added a commit that referenced this issue Oct 9, 2024

Reuse string-object schema in data-extension-response schema (#138)

1398197

fsteeg added a commit that referenced this issue Oct 9, 2024

Use string objects in property proposals, update changelog (#138)

1878f2b

Also add `description` to spec and example (was in schema already)

AbhishekPAnil mentioned this issue Nov 4, 2024

Reconciliaton API - iteration 3 multi-language culturecreates/artsdata-reconciliation#13

Open

6 tasks

acka47 mentioned this issue Nov 11, 2024

Do we need JSON-LD support? If yes, what for? #183

Open

fsteeg added a commit that referenced this issue Nov 12, 2024

Move string-object.json to new 1.0-draft/schemas (#138, #179)

634ed9f
