Thanks for your testing and explanations.
If you just searched by the artist name rather than artistname, sortname, alias you would get the expected result.
So you think a fix will become possible with a more recent version of Lucene, great.
Whether an artist has been on TV or not is completely irrelevant; we don't boost artists just for being popular. We only boost a few artists (composers) if they are best known by an alias rather than their actual name.
TV IS THE PROOF OF HIGHEST QUALITY ARTISTS (not). BARBARA
Yeah, well, the regression is caused by the fact that previously all alias matches were given the same length norm; now they can get a higher length norm if they contain just one or two words. This improves a number of searches but not your examples. However, we can fix your examples by not including artist credits that are identical to the artist in the alias field, and this part of the fix can be done now without waiting for Lucene 4.0.
I think I will update my Opera search URL to
because aren't those bogus artist credits a bug? I mean, how can we get rid of those, and how can we be sure every artist isn't getting such a bogus identical artist credit? All the artists I've gone through have this now. ← Maybe these (new, it seems) bogus artist credits are in fact the root of this bug/regression/ticket?
Apparently not, but they are going to be removed from the search anyway to fix this problem.
« Apparently not » You mean you found some artists without this duplicate AC?
No, I mean nikki has told me they always exist. I'm only going to remove artist credits that are identical to the artist name; if they are different they will be kept.
Now fixed: artist credits are not added to the alias field if they are the same as the artist name. You can try it out on the test server; the correct Suede now comes out on top:
http://test.musicbrainz.org/search?query=suede&type=artist&limit=25&method=indexed
I would consider this perfect: the real Suede gets a better score because it matches on alias and the other one no longer does, but because we are doing a disjunction query, a search term that matches multiple fields only gets a minor boost over one that matches just a single field.
http://test.musicbrainz.org/search?query=suede&type=artist&limit=25&method=advanced
The larger difference in score is due to the fact that advanced search sums matches over the three fields artist, alias and sortname, and the other Suede now only matches on two fields. Of course, the point of advanced search is that it is up to you to specify the query as you want it, so if you want the scoring to be closer you could adjust it.
Thanks very much Paul, it's very much better again!
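To make the difference between the two search methods concrete, here is a minimal Python sketch of the two ways of combining per-field scores described above. It is not the search server's code: the per-field scores and the `tie_breaker` value are made-up assumptions, and only the combining rules (maximum plus a small share of the remaining scores for a disjunction query, plain sum for the advanced query) are modelled.

```python
def dismax_score(field_scores, tie_breaker=0.1):
    """Disjunction-max combination: best field score plus a small share of the rest.

    tie_breaker=0.1 is a hypothetical value chosen for illustration.
    Requires at least one field score.
    """
    best = max(field_scores)
    return best + tie_breaker * (sum(field_scores) - best)


def summed_score(field_scores):
    """Advanced-search style combination: every matching field adds its full score."""
    return sum(field_scores)


# Hypothetical per-field scores for (artist, alias, sortname).
matches_three_fields = [1.0, 1.0, 1.0]
matches_two_fields = [1.0, 0.0, 1.0]

# With the disjunction query the extra field match only gives a minor boost...
print(dismax_score(matches_three_fields), dismax_score(matches_two_fields))
# ...while summing over fields makes the gap much larger.
print(summed_score(matches_three_fields), summed_score(matches_two_fields))
```

Under these assumed numbers the gap between the two artists is a tenth of a field score with the disjunction query and a whole field score when summing, which is the pattern described for the indexed and advanced results.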
OK, the essential problem is that, when scoring, Lucene cannot tell whether you have two values in one field or one value containing two words. This is a problem for us with aliases, because an artist can have multiple aliases. Scoring includes a component called lengthNorm: basically, a match on a short field scores better than a match on a long field, but this doesn't make much sense when multiple aliases are added to the same field. We address this in the http://svn.musicbrainz.org/search_server/trunk/index/src/main/java/org/musicbrainz/search/analysis/MusicbrainzSimilarity.java
file, but this cannot be fully solved until Lucene 4.0 is available.
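As a rough illustration of why multiple aliases confuse the length norm, the following Python sketch uses the classic Lucene default of 1/sqrt(number of terms in the field). It is not the code in MusicbrainzSimilarity.java; the alias strings are placeholders, and the lossy single-byte encoding that Lucene applies to stored norms is ignored.

```python
import math


def length_norm(num_terms):
    """Classic Lucene-style length norm: fewer terms in the field means a higher norm."""
    return 1.0 / math.sqrt(num_terms)


def field_terms(values):
    """All values of a multi-valued field end up in one term stream for the field."""
    terms = []
    for value in values:
        terms.extend(value.lower().split())
    return terms


# Placeholder aliases, not real data.
three_one_word_aliases = field_terms(["aaa", "bbb", "ccc"])
one_three_word_alias = field_terms(["aaa bbb ccc"])
single_alias = field_terms(["aaa"])

# Both fields contain three terms, so they get exactly the same norm:
print(length_norm(len(three_one_word_aliases)))
print(length_norm(len(one_three_word_alias)))
# A field holding a single one-word alias gets the highest possible norm:
print(length_norm(len(single_alias)))
```

An exact one-word alias match on an artist with several aliases is therefore penalised as if it were a partial match on one long alias.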
The problem is the same for both basic & advanced search, but I'm going to look at advanced search because it is simpler.
Equivalent ws/2 query
http://search.musicbrainz.org/ws/2/artist/?query=suede
With scoring explanation:
http://search.musicbrainz.org/ws/2/artist/?query=suede&explain=true
We can see it's ranking one Suede higher than the other because the wrong Suede gets a higher score on the alias field: its fieldNorm is higher, because it only contains a single one-word alias, whereas the real Suede contains three aliases. To minimize the effect of multiple aliases, once the field length is more than 3 words you always get the same field norm, but we want a match on a one-word field to do better than a match on a two-word field; otherwise a search for Suede might rank 'Alyssa Suede' as high as 'Suede'.
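The capping behaviour can be sketched in Python as follows. This only illustrates the idea: the 1/sqrt shape and the cut-off constant are assumptions, and the actual values used in MusicbrainzSimilarity.java may differ.

```python
import math

# Hypothetical cut-off: every field longer than three words is treated as having four.
MAX_COUNTED_TERMS = 4


def capped_length_norm(num_terms, max_counted=MAX_COUNTED_TERMS):
    """Length norm that stops shrinking once the field has more than three words."""
    return 1.0 / math.sqrt(min(num_terms, max_counted))


for n in (1, 2, 3, 5, 10):
    print(n, capped_length_norm(n))
```

A one-word field still beats a two-word field, so 'Suede' outranks 'Alyssa Suede' for the query suede, while an artist with many aliases is not penalised further for each extra alias. The side effect discussed in this ticket remains: an artist whose alias field holds a single one-word value still gets a higher norm than one whose alias field holds three.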
Now, you have actually discovered another bug: the top-scoring Suede doesn't actually have any aliases, just an artist credit which is the same as the artist name 'Suede'. The code lumps artist credits and aliases together, but I think it should ignore artist credits that are just identical to the artist name. I don't know if there was some change to do with artist credits that made these more common; I've never noticed before that artist credits were listed if they are the same as the artist name.
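The proposed handling of artist credits could look roughly like this Python sketch. The function name and data shapes are hypothetical, and exact string equality is an assumption about what "identical" means here (no case folding or normalisation).

```python
def alias_field_values(artist_name, aliases, artist_credits):
    """Collect the values to index in the alias field.

    Artist credits that are exactly equal to the artist name are skipped,
    so they no longer produce a short, high-scoring alias field.
    """
    values = list(aliases)
    for credit in artist_credits:
        if credit != artist_name:
            values.append(credit)
    return values


# An artist with no aliases and only the identical artist credit gets an empty alias field.
print(alias_field_values("Suede", [], ["Suede"]))
# A credit that differs from the artist name is kept (placeholder credit name).
print(alias_field_values("Suede", [], ["Suede", "Some Other Credit"]))
```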
Whether an artist has been on TV or not is completely irrelevant; we don't boost artists just for being popular. We only boost a few artists (composers) if they are best known by an alias rather than their actual name.
If you just searched by the artist name rather than artistname, sortname, alias you would get the expected result:
http://search.musicbrainz.org/ws/2/artist/?query=artist:suede
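For anyone scripting these comparisons, the three ws/2 URLs used in this comment can be built with a small Python helper. The helper name is made up; the base URL and the `query` and `explain` parameters are the ones shown in the URLs above, and nothing is fetched.

```python
from urllib.parse import urlencode

BASE_URL = "http://search.musicbrainz.org/ws/2/artist/"


def build_search_url(query, explain=False):
    """Build an artist search URL, optionally asking for the scoring explanation."""
    params = {"query": query}
    if explain:
        params["explain"] = "true"
    # safe=":" keeps the field:value syntax readable instead of encoding it as %3A.
    return BASE_URL + "?" + urlencode(params, safe=":")


print(build_search_url("suede"))
print(build_search_url("suede", explain=True))
print(build_search_url("artist:suede"))
```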
I don't think this is particularly a big deal as the best matches are at the top of the list, but this problem will get fixed in time.