Key: SEARCH-198
Type: Bug
Status: Closed
Resolution: Fixed
Priority: Normal
Assignee: Paul Taylor
Reporter: patate12
Votes: 0
Watchers: 0
MusicBrainz Search Server

The artist is getting a lowered score on MBS

Created: 28/Mar/12 05:49 PM   Updated: 29/Mar/12 09:48 PM   Resolved: 29/Mar/12 08:14 PM
Component/s: None
Affects Version/s: None
Fix Version/s: 2012-05-15


Description

Just today I've noticed that THE REAL AND ONLY suede (yes, always lowercase) is now getting a score of only 98% (normal search) and only 90% (advanced search) despite matching the name 100%.

BEHIND AN ARTIST THAT NEVER SHOWED UP ON TELEVISION !!!!!!

Search server is still saying 100%.
I could swear it was 100% before.
Maybe because of today's search fix deploys ?



Paul Taylor added a comment - 28/Mar/12 06:54 PM

OK, the essential problem is that when scoring, Lucene cannot tell whether a field holds two values or one value containing two words. This is a problem for us because an artist can have multiple aliases. Scoring includes a component called lengthNorm: basically, a match on a short field scores better than a match on a long field, but this doesn't make much sense when multiple aliases are added to the same field. We address this in the http://svn.musicbrainz.org/search_server/trunk/index/src/main/java/org/musicbrainz/search/analysis/MusicbrainzSimilarity.java file, but it cannot be fully solved until Lucene 4.0 is available.
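
To make the lengthNorm effect concrete, here is a minimal Python sketch of the arithmetic. It is not the search server's Java code. It assumes the classic Lucene default of 1/sqrt(number of terms) and plain whitespace tokenization, and the alias lists are made up for illustration.

```python
import math

def field_term_count(values):
    """Count terms the way a multi-valued field is seen at scoring time:
    every value is tokenized and the terms are lumped together, so the
    boundaries between values are lost."""
    return sum(len(value.split()) for value in values)

def length_norm(num_terms):
    """Classic Lucene-style length norm: fewer terms gives a higher norm."""
    return 1.0 / math.sqrt(num_terms)

# Hypothetical alias fields for two artists that are both named "Suede".
one_short_alias = ["Suede"]
three_aliases = ["Suede", "The London Suede", "London Suede"]

# Two one-word values are indistinguishable from one two-word value.
assert field_term_count(["London", "Suede"]) == field_term_count(["London Suede"])

for aliases in (one_short_alias, three_aliases):
    terms = field_term_count(aliases)
    print(aliases, terms, round(length_norm(terms), 3))
```

Under these assumptions the artist with more aliases ends up with the lower norm on the alias field, even though one of its aliases is an exact match.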

The problem is the same for both basic and advanced search, but I'm going to look at advanced search because it is simpler.

Equivalent ws/2 query
http://search.musicbrainz.org/ws/2/artist/?query=suede

With scoring explanation:
http://search.musicbrainz.org/ws/2/artist/?query=suede&explain=true

We can see it is scoring one Suede higher than the other because the wrong Suede gets a higher score on the alias field: its fieldNorm is higher because its alias field only contains a single one-word alias, whereas the real Suede has three aliases. To minimize the effect of multiple aliases, once the field length is more than 3 words you always get the same field norm, but we want a match on a one-word field to do better than a match on a two-word field; otherwise a search for Suede might score 'Alyssa Suede' as high as 'Suede'.
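
The capping rule described above could be sketched roughly as follows. This is an illustration in Python, not the contents of MusicbrainzSimilarity.java. The cap of 4 terms and the 1/sqrt(n) formula are assumptions, chosen so that fields of one, two and three words still get distinct norms while anything longer shares a single value.

```python
import math

# Assumption: every field longer than 3 terms is treated as having 4 terms.
CAP_TERMS = 4

def capped_length_norm(num_terms, cap=CAP_TERMS):
    """Return 1/sqrt(n) for short fields and a constant once n reaches the cap."""
    effective_terms = min(max(num_terms, 1), cap)
    return 1.0 / math.sqrt(effective_terms)

for n in (1, 2, 3, 4, 10):
    print(n, round(capped_length_norm(n), 3))
```

With such a rule, an artist with many aliases is no longer penalized more and more as aliases are added, but a one-word field still beats a two-word field.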

Now, you have actually discovered another bug: the top-scoring Suede doesn't actually have any aliases, just an artist credit that is the same as the artist name 'Suede'. The code lumps artist credits and aliases together, but I think it should ignore artist credits that are just identical to the artist name. I don't know if there was some change to artist credits that made these more common; I've never noticed artist credits being listed when they are the same as the artist name before.

Whether an artist has been on TV or not is completely irrelevant, we don't boost artists just for being popular. We only boost a few artists (composers) if they are best known by an alias rather than their actual name.

If you just searched by the artist name rather than artistname, sortname, alias you would get the expected result

http://search.musicbrainz.org/ws/2/artist/?query=artist:suede
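
For anyone scripting these lookups, the query URLs used in this comment can be built along these lines. This is only a sketch: the helper name artist_search_url is hypothetical, the base URL and the query and explain parameters are the ones quoted above, and names containing spaces or Lucene special characters would need extra quoting that is not handled here.

```python
from urllib.parse import urlencode

# Base URL as quoted in this ticket.
BASE_URL = "http://search.musicbrainz.org/ws/2/artist/"

def artist_search_url(name, field=None, explain=False):
    """Build a ws/2 artist search URL, optionally restricted to one field."""
    query = f"{field}:{name}" if field else name
    params = {"query": query}
    if explain:
        params["explain"] = "true"
    return BASE_URL + "?" + urlencode(params)

print(artist_search_url("suede"))
print(artist_search_url("suede", explain=True))
print(artist_search_url("suede", field="artist"))
```

Note that urlencode percent-encodes the colon, so the field-restricted query comes out as query=artist%3Asuede, the same form used in the advanced search URL later in this thread.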

I don't think this is particularly a big deal as the best matches are at the top of the list, but this problem will get fixed in time.


patate12 added a comment - 29/Mar/12 12:58 AM - edited

Thanks for your testing and explanations.

« If you just searched by the artist name rather than artistname, sortname, alias you would get the expected result »
I understand more now, although it looks strangely redundant to have to type something like query=artist:suede&type=artist&limit=25&advanced=1.
I didn't explicitly search for artistname, sortname, alias. I came to the artist search page and typed the artist name.

So you think this will be fixed with a more recent version of Lucene, great.

« Whether an artist has been on TV or not is completely irrelevant, we don't boost artists just for being popular. We only boost a few artists (composers) if they are best known by an alias rather than their actual name. »

TV IS THE PROOF OF HIGHEST QUALITY ARTISTS (not).
I mean, look at the real suede with lots of releases, recordings and everything. The 100% one has just one. I think a matching alias should always come after a matching name. Aliases should not be the kind of thing that would lower the score from 100 to 90 on a matching name.


patate12 added a comment - 29/Mar/12 01:16 AM

BARBARA (as seen on TV) also wrong now.
Seriously, I think it's a recent regression.
I would have noticed it because I use a userjs that redirects to the only 100% matching artist and I am almost certain that I was not redirected to the wrong one before.


Paul Taylor added a comment - 29/Mar/12 06:18 AM

Yeah, well, the regression is caused by the fact that previously all alias matches were given the same length norm; now they can get a higher length norm if they contain just one or two words. This improves a number of searches but not your examples. However, we can fix your examples by not including artist credits that are identical to the artist name in the alias field, and this part of the fix can be done now without waiting for Lucene 4.0.


patate12 added a comment - 29/Mar/12 07:00 AM

I think I will update my Opera search URL to

http://musicbrainz.org/search?type=artist&limit=100&advanced=1&query=artist%3A%s


because aren't those bogus artist credits a bug ? I mean how can we get rid of those and how can we be sure every artist isn't getting such a bogus identical artist credit ? All the artists I've gone through have this now. ← maybe these (new it seems) bogus artist credits are the root of this bug/regression/ticket in fact ?


Paul Taylor added a comment - 29/Mar/12 07:23 AM

Apparently not, but they are going to be removed from the search anyway to fix this problem.


patate12 added a comment - 29/Mar/12 07:30 AM - edited

« Apparently not » You mean you found some artists without this duplicate AC ?
It's good to have AC in the search IMO because this way we can find some artists as they are presented on the release we own, which can be different and in AC.
The problem is maybe just THIS bogus AC=name ← new stuff ?


Paul Taylor added a comment - 29/Mar/12 07:40 AM

No I mean nikki has told me they always exist
7:07am at http://chatlogs.musicbrainz.org/musicbrainz-devel/2012/2012-03/2012-03-29.html

I'm only going to remove artist credits that are identical to the artist name; if they are different they will be kept.


patate12 added a comment - 29/Mar/12 02:54 PM

Oh... So I don't really understand why it has changed then...
Hopefully it will be fixed with your Java file and Lucene 4.0. That's cool.


Paul Taylor added a comment - 29/Mar/12 08:14 PM

Now fixed: artist credits are not added to the alias field if they are the same as the artist name.
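
The rule can be illustrated with a short Python sketch. This is not the actual indexing code: the function name alias_field_values and the sample credits are hypothetical, and the comparison is assumed to be an exact string match.

```python
def alias_field_values(artist_name, aliases, artist_credits):
    """Collect the values indexed in the alias field, skipping any artist
    credit that is identical to the artist name."""
    values = list(aliases)
    for credit in artist_credits:
        if credit == artist_name:
            # An identical credit adds nothing beyond the artist field itself.
            continue
        values.append(credit)
    return values

# An artist whose only credit repeats the name gets an empty alias field.
print(alias_field_values("Suede", [], ["Suede"]))

# Credits that differ from the name are kept alongside the real aliases.
print(alias_field_values("Suede", ["The London Suede"], ["Suede", "Suede UK"]))
```

With the identical credit left out, an artist with no real aliases no longer gets an alias match with a high field norm.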

You can test it out on test; the correct Suede now comes out on top:

http://test.musicbrainz.org/search?query=suede&type=artist&limit=25&method=indexed
Suede 100
Suede (American) 98

I would consider this perfect: the real Suede gets a better score because it matches on alias and the other one no longer does, but because we are doing a disjunction query, a search term that matches multiple fields only gets a minor boost over one that matches just one field.

http://test.musicbrainz.org/search?query=suede&type=artist&limit=25&method=advanced
Suede 100
Suede (American) 71

The larger difference in score is due to the fact that advanced search sums matches over the three fields artist, alias and sortname, and now the other Suede only matches on two fields. Of course, the point of advanced search is that it is up to you to specify the query as you want it, so if you want the scoring to be closer you could adjust it.
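
The difference between the two search methods can be sketched in Python as below. This is a simplified illustration, not the server's scoring code. It assumes the indexed search combines fields the way Lucene's DisjunctionMaxQuery does (best field plus a small tie-breaker share of the others), that advanced search simply adds the field scores, and that displayed scores are percentages of the top hit. The per-field scores of 1.0 and the tie-breaker of 0.1 are made up, and real scores also depend on norms and other factors, so the numbers will not reproduce the 98 and 71 shown above.

```python
def dismax_score(field_scores, tie_breaker=0.1):
    """Best field score plus a small share of the other matching fields."""
    best = max(field_scores)
    return best + tie_breaker * (sum(field_scores) - best)

def summed_score(field_scores):
    """Plain sum over every matching field."""
    return sum(field_scores)

def as_percent(scores):
    """Express each score as a rounded percentage of the top score."""
    top = max(scores)
    return [round(100 * score / top) for score in scores]

# Hypothetical per-field scores: one artist matches on artist, sortname
# and alias, the other only on artist and sortname.
three_fields = [1.0, 1.0, 1.0]
two_fields = [1.0, 1.0]

print(as_percent([dismax_score(three_fields), dismax_score(two_fields)]))
print(as_percent([summed_score(three_fields), summed_score(two_fields)]))
```

Even with made-up inputs the pattern is the same as in the results above: losing one matching field costs little under the disjunction and much more when the fields are summed.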


patate12 added a comment - 29/Mar/12 09:48 PM

Thanks very much Paul, it's very much better again!
But I actually feel bad for the other Suede. She's called exactly Suede but is not score 100.
But anyway, we can't have everything. Toto works well on indexed search but gets less than 100 on advanced because of that totto alias.
Thanks very much anyway, it seems to be better like the way you changed it anyway.