Issue Details (XML | Word | Printable)

Key: MBS-1799
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Normal Normal
Assignee: Kuno Woudt
Reporter: nikki
Votes: 7
Watchers: 5
Operations

If you were logged in you would be able to see more operations.
MusicBrainz Server

Add ISO 639-3 language codes to the database

Created: 02/May/11 02:55 AM   Updated: 15/May/12 07:40 PM  Due: 05/Dec/12   Resolved: 15/May/12 07:40 PM
Component/s: Admin, Edit system, Schema Change
Affects Version/s: NGS - Release Candidate 2
Fix Version/s: Schema change, 2012-05-15

Issue Links:
Depends
 
Relates
 


 Description  « Hide

We have stuff in the database that's only covered by ISO 639-3 (we're currently using ISO 639-2). It would be nice if we could use the proper codes for these (we could invent some, but that defeats the whole point of using standardised codes).

Some issues we would need to think about:

  • There are about 7500 languages in ISO 639-3. We either need to only enable them on request (probably better, we don't need that many of them) or we need a search rather than a dropdown.
  • ID3's TLAN frame specifically says ISO 639-2.
  • Not all ISO 639-2 codes are valid ISO 639-3. In general, the ones which were removed have their frequency set to 0 so they're not selectable, but we are using "Artificial, Other".

See http://forums.musicbrainz.org/viewtopic.php?id=2681 for example



Sort Order: Ascending order - Click to sort in descending order

Oliver Charles added a comment - 17/Feb/12 03:18 PM

As this ticket has open points, I'm moving this to decision required.


nikki added a comment - 06/Mar/12 02:30 AM

As per http://blog.musicbrainz.org/?p=1301 :

I'm not aware of any objections to being able to use ISO 639-3 codes. This ticket has a number of votes and there have been a number of requests in other places for languages only available in ISO 639-3.

My proposal is:

The list of languages we use should be based on ISO 639-3. This would affect release languages and, if implemented, work lyrics languages (see MBS-1798). Where we currently display ISO 639-2/T codes, we would display ISO 639-3 codes.

Addressing point 1:
We should import all of the languages from ISO 639-3 but set the frequency to 0, so that they're not displayed in the list by default. We can then enable languages on request as needed by changing the frequency.

Addressing point 2:
I think this should largely be solved on Picard's side. My preference would be to implement MBS-4342 so that Picard can do a request to the server to get information such as a list of languages and their ISO codes. It can then use the information to determine which codes are valid for the ID3 TLAN frame. Until then, I suppose the best thing for Picard to do would be to ship a list of valid codes with Picard, or ignore the problem entirely and write ISO 639-3 codes occasionally.

Addressing point 3:
We should migrate all existing 'art' (currently "[Artificial (Other)]") releases - http://musicbrainz.org/search?query=lang%3Aart&type=release&limit=25&advanced=1 - to another code which is available in ISO 639-3. This could be "mis" ("uncoded language") or a user-defined code (e.g. "qaa"). I would probably prefer the latter.

We should also make sure the releases on http://musicbrainz.org/search?query=lang%3A%28bih+him+afa+alg+apa+ath+aus+bad+bai+bat+ber+bnt+btk+cai+cau+cel+cmc+col+cpe+cpf+cpp+crp+cus+day+dra+fiu+gem+ijo+inc+ine+ira+iro+kar+khi+kro+map+mkh+mno+mun+myn+nah+nai+nic+nub+oto+paa+phi+pra+roa+sai+sal+sem+sgn+sio+sit+sla+smi+son+ssa+tai+tup+tut+wak+wen+ypk+znd%29&type=release&limit=25&advanced=1 have been changed, as those codes have no equivalent in ISO 639-3 (they represent groups of languages, so they correspond to multiple ISO 639-3 codes). As there are so few, this could easily be done by hand beforehand instead of needing a migration script.

Then we should set the frequency of those codes to 0 (although I think other than 'art', they are set to 0 already). I presume edits store old data and therefore we wouldn't want to remove those rows.

Notes:
1. It would also be good to have a list of requested languages which could be enabled immediately on release, although since it could be done afterwards, this is not essential. The ones which come to mind right now are vro, liv, rys, rcf and moe (Võro, Livonian, Yaeyama, Réunion Creole French, Montagnais respectively) for releases and cmn, yue, nan and pyu (Mandarin, Cantonese, Min Nan and Puyuma, respectively) for lyrics.

2. I assume we would implement this as a new column. The existing column names are confusing though, as they're based on the number of letters, not the ISO version, perhaps that's something we should fix if we're doing a schema change anyway.


Ian McEwen added a comment - 06/Mar/12 03:09 AM

I'm not sure it makes sense to keep the old ISO 639-2 columns if we're not using them anywhere; as far as I know we aren't using iso_code_2 or iso_code_3b now, only iso_code_3t. iso_code_2 we only have 186 of in the DB anyway (of 485 languages).

Pending your and developer approval, I'd say changing the language schema to (id serial primary key, iso_code char(3) not null, name varchar(100) not null, frequency int not null default 0) makes more sense. This also makes the language and script tables more alike (script has iso_number as well; not sure what that is/if it's used).


Ian McEwen added a comment - 06/Mar/12 03:11 AM

Pardon, looks like iso_code_3b is used by /ws/1 – so we need a plan there. How would /ws/2 deal with this change, also?


Ian McEwen added a comment - 06/Mar/12 06:13 AM

added schema change component


nikki added a comment - 06/Mar/12 03:11 PM

/ws/1 shouldn't even be using 2/B, I've entered MBS-4404 for that.

I don't think we can remove the 639-2/T codes. As I said: "I presume edits store old data and therefore we wouldn't want to remove those rows.". I had a look at http://musicbrainz.org/edit/8144388 on rika, and the edit data is

{"language_id":"91","script_id":"28","old":[{"release_name":"Bwarouz","language_id":"0","release_ids":["612553"],"script_id":"0"}]}

so the name in the edit is coming from id 91 in the language table. The edits don't use ISO codes though, so it shouldn't be a problem for displaying the name. We just need to make sure current data can be displayed properly when ISO codes are used, e.g. by the WS, etc.

My suggestion of implementing MBS-4342 so that Picard can use it to figure out which codes are valid would also rely on both 639-2 and 639-3 codes to be present and returned.

If we use the ISO versions for naming the columns, i.e. iso_code_1, iso_code_2t, iso_code_2b and iso_code_3, it would be clearer. I'm not sure about removing the iso_code_2 and iso_code_3b columns. I just have a feeling that we might find them useful later on when we do more work on i18n. It would be really annoying if we remove them only to realise a few months later that we need a mapping between 639-1 and 639-3.


Robert Kaye added a comment - 19/Mar/12 09:15 PM

This ticket is under consideration for the May 15h schema change release. However, it is under specified at this point in time. In order to keep this ticket in consideration for the release, please do the following:

  • Specify exactly what the user interface changes will be. Mock ups would be great. Due date: April 2
  • Specify exactly what the database changes will be. We will need to have an exact database table change proposal. Exactly which columns will be added/removed/modified and what new tables will be created. If you cannot specify the changes to be done, chase down a developer to specify them for you. Due date: March 26
  • This ticket is considered contentious and requires more community review. I will post a blog entry highlighting this ticket for more review.

nikki added a comment - 20/Mar/12 11:53 AM

User interface changes: No visible changes. Templates which display the language ISO code should be updated to use the value from the new column.

Database changes: Add one column, "iso_code_3", to the language table. Ideally we would also rename the existing columns because they're misleadingly named (see my previous comment), but it is not a requirement.


Paul Taylor added a comment - 20/Mar/12 04:57 PM

So ISO 629-3 is essentially a superset of ISO 629-2. Im not sure it is really an issue that TLAN specifically says ISO-639-2 because IS0-639-3 just didnt exist when the spec
was done, and the both specs use 3 charcater codes. However if an application depends upon the TLAN field being set to a known list there could be a problem.

The webservice (lookup and search) also need updating to read from the new column, unsure as to whether it should always return the IS0-639-3 instead of the ISO-639-2 value or both.


Philip Jägenstedt added a comment - 20/Mar/12 07:38 PM

Do I understand correctly that the suggestion is to have a single language concept that is used both for releases (written language of track list) and works (vocal language of lyrics)? It's a bit tricky to say that a track list is in a specific dialect, are we sure that all ISO 629-3 codes actually make sense when applied to text? In other words, should we have a separate set of languages to represent text and audio?


nikki added a comment - 20/Mar/12 09:47 PM

Yes, that's correct. I think the vast majority of ISO 639-3 codes would make sense for text at some point (I assume you're thinking of Chinese, in which case I would point out things like http://musicbrainz.org/release-group/5214f8d2-138a-333f-b50f-814a61856934 ). Don't forget that we'll only be enabling codes on request as well, so the list we use won't be massively different from the existing one, it will just have a few more languages that didn't make it into ISO 639-2.

If we do end up wanting to completely prevent people from using some codes for some situations, I don't think it would make sense to use separate lists, since the majority of the list would be identical. Instead I would just adjust the (poorly named) "frequency" column we already use to determine when languages are shown.


Philip Jägenstedt added a comment - 21/Mar/12 02:51 PM

OK, I guess we can deal with the problem when it occurs, by blacklisting certain combinations, or something.


Oliver Charles added a comment - 27/Mar/12 01:41 PM - edited

So far, it seems the suggestion is:

ALTER TABLE language ADD COLUMN iso_code_3 CHAR(3) NOT NULL;

I'm in favour of renaming the other columns at the same time, could you please provide a mapping of old column name to new column name?


nikki added a comment - 27/Mar/12 04:07 PM

iso_code_2 -> iso_code_1
iso_code_3t -> iso_code_2t
iso_code_3b -> iso_code_2b


Robert Kaye added a comment - 02/Apr/12 06:58 PM

Does this ticket require any UI changes?


Nicolás Tamargo added a comment - 02/Apr/12 07:58 PM

Not that I can see - same list, more contents.


Oliver Charles added a comment - 03/Apr/12 11:20 AM

As mentioned in nikki's comment, I don't think there are any user interface changes. I think this ticket is in a good state for inclusion.


Oliver Charles added a comment - 03/Apr/12 11:30 AM

I think the changes to our web services need to be discussed, however. Are we adding a new element/attribute for this code, or replacing the current iso_code_3t (which is really 2t) value currently shown with iso 639-3?


Kuno Woudt added a comment - 10/Apr/12 06:19 PM

I discussed the question raised by ocharles in the previous comment with nikki and ianmcorvidae on irc.

We concluded that we should just use the code from the new iso_code_3 column in the existing <language> element of the xml. Only 67 Part 2/T codes have no equivalent value in Part 3. These languages will remain in the database with NULL in the iso_code_3 column.

As I implemented these changes the webservice will emit the Part 2/T code if the Part 3 code is NULL, to avoid crashing. After the affected releases have been migrated we may want to remove these codes from the table, so that the iso_code_3 column can get NOT NULL and UNIQUE constraints – after that change the fallback to Part 2/T can be removed from the webservice code.