I do recognise that search is Very Difficult, and that adding lots of special cases just makes them fight with each other. I just wonder if it is all too complicated. I mean, if the only thing I gave any extra significance to was the position of the white space, and simply tallied up how many letters were in matching sequences in the two strings, then the results for this example would be far better. You don't have to be an English speaker to know that Louis Clark is a better match for Louis Clarke than Dave Clarke is.

Of course there are far more considerations. But personally I am less concerned about advanced search, and functions that advanced users can learn to use, than about making the default search work really well, so that naïve users get the best results possible. I guess there are some regression tests for this? I don't often go digging in the repository.

BTW, should I infer from your reply that the fact that Louis Clarke was not marked as a Person was indeed a factor in the poor results?

I don't think your idea is very safe. I mean, which would you consider the best match for Louise Clarke from these two:

Basic search will be improved; I'm probably going to add some kind of wildcard into the basic query as described in

BTW, no, I only mentioned the person part in relation to my comment about initials.

Was that a rhetorical question? I don't suppose either of them would be the best match. The first would be the better of the two because it has greater similarity. But I would hope to see both of them before I was offered any Dave Clark/Clarke/Clarks, because a naïve user will not keep going to the tenth page of matches before they give up.

Perhaps my previous comment came over too literally. I did not intend, and am not so arrogant as, to offer an alternative search implementation; I was just trying to demonstrate that a simple-minded approach would give better results for my single example.
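The simple-minded tally described above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python's standard `difflib.SequenceMatcher` to find the matching runs, treats the space as an ordinary character, and the function name `matching_letters` is made up for the example. It is not how the search server scores anything.

```python
from difflib import SequenceMatcher


def matching_letters(query: str, candidate: str) -> int:
    """Count the characters that sit in matching runs shared by both strings."""
    a, b = query.lower(), candidate.lower()
    # autojunk=False disables difflib's heuristic that ignores very
    # frequent characters in long inputs.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    # The last block returned is a zero-length sentinel, so it adds nothing.
    return sum(block.size for block in matcher.get_matching_blocks())


query = "Louis Clarke"
for name in ("Louis Clark", "Dave Clarke"):
    print(name, matching_letters(query, name))
```

For the query Louis Clarke this tally gives Louis Clark 11 matching characters and Dave Clarke 7, which is the ordering argued for above.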
I was serious about wanting to understand what the success criteria are for search though; that's why I asked what regression tests there are. The outputs won't get better unless there's consensus as to what "better" means.

I was pointing out that it is very difficult to really say which option is better than another. If common misspellings are added as aliases (as they sometimes are) then they would be searched, but of course in many cases they do not exist.

I've considered your idea a bit more, and maybe it could be used to augment the scoring after Lucene has finished. Once Lucene has done its scoring, I could check every record with the same score and increase the score of those that match more letters, whilst taking care not to increase the score above the next result. So for example, if I matched each letter to the result I was searching for (Louise Clarke) I would get

So say the results were Louis Clarke = 100

I'd count up matching letters for the tied 70 scorers and then apportion up to the next score, giving something like

But I don't know how costly this would be to calculate.

As for tests, there are unit tests for about 95% of the code, but what I don't have are tests that run against the real database. So for example I might have a test that checks that if you add an artist called "Louis Clarke" to the database and you search for "Louise Clark" it will return "Louise Clarke" as a possible match, but I don't have a test to check that when you search against the database "Louise Clarke" comes up higher than "Dave Clarke". Tests like this would be useful, but it would mean that whenever you run a build it would depend on having access to a live database or the tests would fail, and the build would also run slower, so I'd have to consider this.

This is basically the same as

Here's another case I just came across:

Resolved following the changes made for
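The tie-breaking idea raised earlier in this thread (rescoring results that Lucene scored equally, without lifting any of them above the next result) might look roughly like the sketch below. Everything in it is an assumption made for illustration: the names `matching_letters` and `break_ties`, the use of `difflib` for the letter count, the one-point headroom when nothing scores higher, and the sample scores, which are invented apart from the "Louis Clarke = 100" and "tied 70" figures mentioned above.

```python
from difflib import SequenceMatcher
from itertools import groupby


def matching_letters(query, candidate):
    """Count characters that fall in matching runs of the two strings."""
    a, b = query.lower(), candidate.lower()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks())


def break_ties(query, scored):
    """Rescore tied results; `scored` is a list of (name, score) pairs."""
    ordered = sorted(scored, key=lambda pair: pair[1], reverse=True)
    rescored = []
    ceiling = None  # original score of the next-higher group, if any
    for score, group in groupby(ordered, key=lambda pair: pair[1]):
        members = list(group)
        # Assumption: with nothing above, allow at most one point of headroom.
        headroom = (ceiling - score) if ceiling is not None else 1.0
        for name, _ in members:
            if len(members) == 1:
                rescored.append((name, score))  # no tie, leave untouched
                continue
            # The count never exceeds len(query), so the fraction is below 1
            # and the boosted score stays strictly under the ceiling.
            fraction = matching_letters(query, name) / (len(query) + 1)
            rescored.append((name, score + headroom * fraction))
        ceiling = score
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Hypothetical scores: only the tied entries are adjusted.
results = [("Louis Clarke", 100.0), ("Dave Clarke", 70.0), ("Louise Clark", 70.0)]
for name, score in break_ties("Louise Clarke", results):
    print(f"{name}: {score:.1f}")
```

The extra cost is one string comparison per tied record, so it grows with the size of the tied groups rather than with the whole index; whether that is cheap enough in practice would need measuring.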
The standard search is rewritten behind the scenes, so we could rewrite it to add some wildcard searches.
But there is difficulty in differentiating between misspelt words and different words. When you are searching a large chunk of text this takes care of itself, but when searching only a couple of words per record it can really screw things up. I mean, should a search for clarke return
clarka
clarky
clarko
clarce
as well? And once you consider that MusicBrainz is not English-based, can hold names in any language, and has no way to reliably identify which language has been entered, how do you make sensible rules that won't break other things?
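To make the difficulty concrete, here is a small sketch using plain Levenshtein edit distance as a stand-in for fuzzy matching. That choice is an assumption for illustration only, and `edit_distance` is a hypothetical helper, not part of the search server.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                   # delete ch_a
                current[j - 1] + 1,                # insert ch_b
                previous[j - 1] + (ch_a != ch_b),  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


for variant in ("clarka", "clarky", "clarko", "clarce"):
    print(variant, edit_distance("clarke", variant))
```

All four variants come out one edit away from clarke, so distance on its own cannot tell a misspelling from a genuinely different name.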
I think we can consider making a few specific improvements that would improve advanced search as well. For example, I think http://tickets.musicbrainz.org/browse/SEARCH-160
could work so that a search for L. Clark would match Louise Clark and a search for L.Clarke would match Louise Clarke (if Louise Clarke has been marked as a person), but this won't solve your problem.
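A rough sketch of that initials behaviour follows. It is only a guess at how such a rule could work, not what SEARCH-160 specifies: the helper names `tokens` and `matches_with_initials`, the `is_person` flag, and the rule that a one-letter token stands for any name starting with that letter are all assumptions.

```python
import re


def tokens(text: str) -> list:
    """Split on whitespace and full stops, lower-casing each piece."""
    return [t.lower() for t in re.split(r"[\s.]+", text) if t]


def matches_with_initials(query: str, name: str, is_person: bool) -> bool:
    """Match word by word, letting initials expand only for Person entries."""
    q, n = tokens(query), tokens(name)
    if len(q) != len(n):
        return False
    for q_tok, n_tok in zip(q, n):
        if q_tok == n_tok:
            continue
        # A single letter may stand in for a full name, but only on a Person.
        if is_person and len(q_tok) == 1 and n_tok.startswith(q_tok):
            continue
        return False
    return True


print(matches_with_initials("L. Clark", "Louise Clark", is_person=True))
print(matches_with_initials("L.Clarke", "Louise Clarke", is_person=True))
print(matches_with_initials("L.Clarke", "Louise Clarke", is_person=False))
```

Under these assumptions the first two calls match and the third does not, because the entry is not marked as a person. As noted above, this helps with initials but does nothing for Louis versus Louise.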