Don't all ASINs start with B? That would already make the chances of it picking the wrong thing much smaller, wouldn't it?

ASINs for audiobooks, for example, don't start with B.

And their first three characters are not letter, digit, digit either? (Just trying to find something to make this somewhat more precise.)

No, for books (including audiobooks) the ASIN is often the ISBN, so it is all digits.

musicbrainz=# select url from url where url ~ '^http://www.amazon.(com|ca|co.uk|fr|de|it|es|co.jp|cn)/gp/product/[0-9A-Z]{10}$' and url !~ '^http://www.amazon.(com|ca|co.uk|fr|de|it|es|co.jp|cn)/gp/product/(B[0-9A-Z]{9}|[0-9]{9}[0-9X])$';
url
-----
(0 rows)
musicbrainz=# select url from url where url ~ '^http://www.amazon.(com|ca|co.uk|fr|de|it|es|co.jp|cn)/gp/product/[0-9A-Z]{10}$' and url !~ '^http://www.amazon.(com|ca|co.uk|fr|de|it|es|co.jp|cn)/gp/product/(B00[0-9A-Z]{7}|[0-9]{9}[0-9X])$';
url
-----------------------------------------------
http://www.amazon.com/gp/product/BT00CHI1V2
http://www.amazon.co.uk/gp/product/BT00CHI1V2
(2 rows)
musicbrainz=#
I think my preference would be to try multiple replacements: first match the typical URL formats (as hrglgrmpf mentioned) with a relatively strict regex for the ASIN (e.g. the one from the second query), which should work for the vast majority of cases, and then, if that fails, fall back to the current method of looking for anything that could be an ASIN.

+1 for that. I think it is the best thing we can do. Do you want to implement it?

Whoever rewrites the cleanup code should also take into account the issue with artist profiles in
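The two-pass approach proposed above could look roughly like the sketch below. It is only an illustration: `findAsin`, `STRICT_ASIN`, `TYPICAL_URL` and `LOOSE_ASIN` are hypothetical names, not identifiers from URLCleanup.js, and the strict pattern is simply the one from the second query combined with the `/dp/` and `/product/` prefixes suggested in this thread.

```javascript
// Sketch only: all names here are hypothetical, not taken from URLCleanup.js.

// Strict ASIN, as in the second query: "B00" plus seven characters,
// or an ISBN-10 (nine digits followed by a digit or X).
const STRICT_ASIN = "(B00[0-9A-Z]{7}|[0-9]{9}[0-9X])";

// Pass 1: typical product URL shapes (.../dp/ASIN or .../product/ASIN).
const TYPICAL_URL = new RegExp("/(?:dp|product)/" + STRICT_ASIN + "(?:[/?&%#]|$)");

// Pass 2: the existing loose pattern quoted in this thread.
const LOOSE_ASIN = /(?:\/|\ba=)([A-Z0-9]{10})(?:[/?&%#]|$)/;

function findAsin(url) {
  // Prefer a strict match in a known URL position.
  const strict = TYPICAL_URL.exec(url);
  if (strict) {
    return strict[1];
  }
  // Otherwise fall back to anything that looks like it could be an ASIN.
  const loose = LOOSE_ASIN.exec(url);
  return loose ? loose[1] : null;
}
```

With this ordering, an unusual ASIN such as BT00CHI1V2 is still found by the fallback, while a URL that contains some other ten-character segment before `/dp/` resolves to the ASIN after `/dp/`.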
Hmm, I don't have a perfect solution for this, so I'm unassigning myself. My best try would be to match for "/dp/([A-Z0-9]{10})" or "/product/([A-Z0-9]{10})" first. The regular expression which causes the bug is
/(?:\/|\ba=)([A-Z0-9]{10})(?:[/?&%#]|$)/
https://github.com/metabrainz/musicbrainz-server/blob/master/root/static/scripts/edit/MB/Control/URLCleanup.js#L197
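To see why the loose expression can pick the wrong thing, here is a small sketch that runs it against a made-up URL. The URL is an assumption for illustration only (it is not taken from the database); the point is that any ten-character run of uppercase letters and digits following a slash is accepted, and the leftmost one wins.

```javascript
// The loose pattern from this thread, copied verbatim.
const looseAsin = /(?:\/|\ba=)([A-Z0-9]{10})(?:[/?&%#]|$)/;

// Hypothetical URL: the first path segment is not an ASIN, but it happens
// to be ten uppercase alphanumeric characters long.
const sampleUrl = "http://www.amazon.com/ABCDEFGHIJ/dp/B00005NQ6Z";

const looseMatch = looseAsin.exec(sampleUrl);
const picked = looseMatch ? looseMatch[1] : null;

// The regex engine returns the leftmost candidate, not the segment after /dp/.
console.log(picked);
```

Anchoring the match on "/dp/" or "/product/" first, as suggested above, avoids this case.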