Key: SEARCH-166
Type: Improvement
Status: Closed
Resolution: Duplicate
Priority: Normal
Assignee: Unassigned
Reporter: monxton
Votes: 0
Watchers: 0

MusicBrainz Search Server

Artist search does not always offer the obvious (to a human!) results

Created: 02/Jan/12 06:05 PM   Updated: 23/Mar/12 06:02 PM   Resolved: 02/Feb/12 04:17 PM
Component/s: None
Affects Version/s: None
Fix Version/s: 2012-03-23


Description

Example: use the standard search box, enter "Louis Clarke". Happily the artist "Louis Clarke" appears in the #1 spot. However the artist "Louis Clark" does not appear until page 5 (result #235), after just about everyone called Clarke, and even quite a lot of people called Clark.

Trying the same thing in reverse, the results are worse. Search for "Louis Clark", and you don't hit "Louis Clarke" until page 23 (result #578). Everyone named Louis gets in first.

This is a time-limited example, because I have submitted an edit to merge these two, as Louis Clarke should not have been created in the first place. But I guess it could be recreated on the test server if you wish.

(Note that the Louis Clarke artist was not marked as a Person, and the sort name was set to "Louis Clarke", not to "Clarke, Louis". Just in case either of these factors makes a difference.)

If I do a fuzzy search in Advanced Search mode then I get the results I want. But I guess the editor who created the duplicate artist in the first place did not know to do this.



Paul Taylor added a comment - 05/Jan/12 02:36 PM

The standard search query is rewritten behind the scenes, so we could change that rewrite to add some wildcard searches.

But there is a difficulty in differentiating between misspelt words and different words. When you are searching a large chunk of text this takes care of itself, but when searching only a couple of words per record it can really screw things up. I mean, should a search for clarke return

clarka
clarky
clarko
clarce

as well? And once you consider that MusicBrainz is not English-based, accepts names in any language, and has no way to reliably identify which language has been entered, how do you make sensible rules that won't break other things?
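To make the point above concrete, here is a minimal sketch that measures how far each of the listed variants is from "clarke". It assumes plain Levenshtein edit distance as the similarity measure, chosen only for illustration; `edit_distance` is a hypothetical helper, not code from the search server.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character edits to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


variants = ["clarka", "clarky", "clarko", "clarce", "clark"]
distances = {name: edit_distance("clarke", name) for name in variants}
# Every variant, including the wanted "clark", is exactly one edit away.
print(distances)
```

Under this measure the wanted match "clark" is indistinguishable from the four other variants, which is the ambiguity described above.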

I think we can consider making a few specific improvements that would improve advanced search as well. For example, I think http://tickets.musicbrainz.org/browse/SEARCH-160
could work so that a search for L. Clark would match Louise Clark and a search for L. Clarke would match Louise Clarke (if Louise Clarke has been marked as a person), but this won't solve your problem.


monxton added a comment - 05/Jan/12 04:08 PM

I do recognise that search is Very Difficult, and that adding lots of special cases just makes them fight with each other.

I just wonder if it is all too complicated. I mean, if the only thing I gave any extra significance to was the position of the white space, and simply tallied up how many letters were in matching sequences in the two strings, then the results for this example would be far better. You don't have to be an English speaker to know that Louis Clark is a better match for Louis Clarke than Dave Clarke is.
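One possible reading of "tallying up letters in matching sequences" can be sketched with Python's standard `difflib`. This is an illustration under assumptions, not a proposal for the search server: `matching_letters` and `rank` are hypothetical names, the comparison is case-insensitive, and the space between the words is counted like any other character.

```python
from difflib import SequenceMatcher


def matching_letters(query: str, candidate: str) -> int:
    """Total length of the matching runs that difflib finds in both strings."""
    matcher = SequenceMatcher(None, query.lower(), candidate.lower(), autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks())


def rank(query: str, candidates: list[str]) -> list[str]:
    """Order candidates by how many characters they share with the query in runs."""
    return sorted(candidates, key=lambda name: matching_letters(query, name), reverse=True)


# "Louis Clark" shares an 11-character run with the query; "Dave Clarke" shares 7.
print(rank("Louis Clarke", ["Dave Clarke", "Louis Clark"]))
```

For the example in this ticket, the sketch puts "Louis Clark" ahead of "Dave Clarke" when searching for "Louis Clarke".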

Of course there are far more considerations. But personally I am less concerned about advanced search and functions that advanced users can learn to use than making the default search work really well, so that naïve users get the best results possible.

I guess there are some regression tests for this? I don't often go digging in the repository.

BTW Should I infer from your reply that the fact that Louis Clarke was not marked as a Person was indeed a factor in the poor results?


Paul Taylor added a comment - 09/Jan/12 12:09 PM

I don't think your idea is very safe. I mean, which would you consider the best match for "Louise Clarke" from these two:
Louis Clarks
Lucy Clarke

Basic search will be improved; I'm probably going to add some kind of wildcard to the basic query, as described in
http://tickets.musicbrainz.org/browse/SEARCH-167

BTW, no, I only mentioned the Person type in relation to my comment about initials.


monxton added a comment - 09/Jan/12 12:54 PM

Was that a rhetorical question? I don't suppose either of them would be the best match. The first would be the better of the two because it has greater similarity. But I would hope to see both of them before I was offered any Dave Clark/Clarke/Clarks, because a naive user will not keep going to the tenth page of matches before they give up.

Perhaps my previous comment came over too literally, because I did not intend / am not so arrogant as to offer an alternative for search implementation; I was just trying to demonstrate that a simple-minded approach would give better results for my single example. I was serious about wanting to understand what the success criteria are for search though, that's why I asked what regression tests there are. The outputs won't get better unless there's consensus as to what "better" means.


Paul Taylor added a comment - 10/Jan/12 09:02 AM

I was pointing out that it is very difficult to say which option is better than another. If common misspellings are added as aliases (as they sometimes are) then they would be searched, but of course in many cases they do not exist. I've considered your idea a bit more, and maybe it could be used to augment the scoring after Lucene has done its work: once Lucene has done its scoring, I could check every record with the same score and increase the score of those that match more letters, whilst taking care not to increase the score above the next result. For example, I could match each letter of every result against the name I was searching for (Louise Clarke).

So say the results were

Louis Clarke = 100
Louis Clarks = 70
Lucy Clarke = 70
Dave Clarke = 70

I'd count up the matching letters for the tied 70 scorers
Louis Clarks = 10
Lucy Clarke = 7
Dave Clarke = 5

and then apportion up to the next score, giving something like
Louis Clarke = 100
Louis Clarks = 90
Lucy Clarke = 80
Dave Clarke = 70

But I don't know how costly this would be to calculate.
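The apportioning step above can be sketched as follows. This is a minimal illustration, assuming the letter counts have already been computed and that tied results are spread evenly between their shared score and the score of the result ranked just above them; `spread_tied_scores` is a hypothetical name and nothing here is taken from the search server's code.

```python
from itertools import groupby


def spread_tied_scores(results):
    """Break score ties using a matching-letter count.

    `results` is a list of (name, score, matching_letters) tuples.
    Returns (name, new_score) pairs, best first. Within a tied group the
    new scores stay strictly below the score of the group ranked above it.
    """
    ordered = sorted(results, key=lambda r: (-r[1], -r[2]))
    rescored = []
    ceiling = None  # score of the group ranked immediately above, if any
    for score, group in groupby(ordered, key=lambda r: r[1]):
        tied = list(group)
        # The top group has nothing above it, so its scores are left unchanged.
        step = 0.0 if ceiling is None else (ceiling - score) / len(tied)
        for position, (name, _, _) in enumerate(tied):
            rescored.append((name, score + step * (len(tied) - 1 - position)))
        ceiling = score
    return rescored


example = [
    ("Louis Clarke", 100, 0),  # letter count unused: nothing else scored 100
    ("Louis Clarks", 70, 10),
    ("Lucy Clarke", 70, 7),
    ("Dave Clarke", 70, 5),
]
print(spread_tied_scores(example))
```

With the figures from this comment the three tied results come out at 90, 80 and 70. The extra cost is one sort plus one pass over the results, on top of whatever it costs to count the matching letters for each tied record.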

As for tests, there are unit tests for about 95% of the code, but what I don't have are tests that run against the real database. So for example, I might have a test that checks that if you add an artist called "Louis Clarke" to the database and you search for "Louise Clark" it will return "Louise Clarke" as a possible match, but I don't have a test to check that when you search against the database "Louise Clarke" comes up higher than "Dave Clarke". Tests like this would be useful, but then whenever you run a build it would depend on having access to a live database or the tests would fail, and the build would also run slower, so I'd have to consider this.
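The ordering check described above could be expressed as a small helper that works on any ordered list of result names, wherever that list comes from. This is a hypothetical sketch: `ranked_above` and the sample list are illustrative and are not part of the search server's test suite.

```python
def ranked_above(results, better, worse):
    """Return True if `better` appears before `worse` in an ordered result list.

    Both names must be present; a missing name counts as a failure.
    """
    try:
        return results.index(better) < results.index(worse)
    except ValueError:
        return False


# Illustrative result list, best match first.
sample_results = ["Louise Clarke", "Louis Clarks", "Lucy Clarke", "Dave Clarke"]
print(ranked_above(sample_results, "Louise Clarke", "Dave Clarke"))
```

Because the helper only inspects a list, the same assertion could be run against results from a live database or against a saved list, which is the trade-off raised above.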


nikki added a comment - 10/Jan/12 09:29 AM

monxton added a comment - 11/Jan/12 01:19 PM

Yes, and I have no problem if you want to close this as a dupe.


nikki added a comment - 12/Jan/12 07:20 AM

Here's another case I just came across:
I wanted to search for "matsuura aya", but I made a typo and searched for "matsuura ata" instead. http://musicbrainz.org/search?query=matsuura+ata&type=artist does include the artist, but nowhere near the top, despite the difference being only one character. http://musicbrainz.org/search?query=matsuura%7E+ata%7E&type=artist&limit=25&advanced=1 has it as the first result.
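The second link above differs from the first by a `~` after each term and the `advanced=1` parameter. A minimal sketch of building such a query, assuming only the URL parameters visible in that link; `fuzzy_query` and `advanced_search_url` are hypothetical helpers, and no escaping of special query characters is attempted:

```python
from urllib.parse import urlencode


def fuzzy_query(text: str) -> str:
    """Append the fuzzy operator `~` to every whitespace-separated term."""
    return " ".join(term + "~" for term in text.split())


def advanced_search_url(text: str, entity_type: str = "artist", limit: int = 25) -> str:
    """Build an advanced-search URL shaped like the one quoted in the comment."""
    params = {
        "query": fuzzy_query(text),
        "type": entity_type,
        "limit": limit,
        "advanced": 1,
    }
    # Depending on the Python version, `~` may be left as-is or percent-encoded
    # as %7E; both forms decode to the same query string.
    return "http://musicbrainz.org/search?" + urlencode(params)


print(fuzzy_query("matsuura ata"))
print(advanced_search_url("matsuura ata"))
```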


Paul Taylor added a comment - 02/Feb/12 04:17 PM

Resolved following the changes made for SEARCH-167