Key: MBS-4948
Type: New Feature
Status: Reopened
Priority: Normal
Assignee: Unassigned
Reporter: Oliver Charles
Votes: 2
Watchers: 0

MusicBrainz Server

Provide a way for users to export their own data

Created: 29/Jun/12 10:07 AM   Updated: 23/Nov/12 08:39 PM
Component/s: None
Affects Version/s: None
Fix Version/s: None

Description

A user should be able to export their tags, ratings and collections from the website into some sort of parseable file.


patate12 added a comment - 30/Jun/12 09:34 AM

And import (partial or full) too


nikki added a comment - 30/Jun/12 03:48 PM

Don't forget subscriptions. That was the whole reason I was even talking about it!


Oliver Charles added a comment - 01/Jul/12 01:11 AM

Import is not part of this ticket.


Ian McEwen added a comment - 09/Aug/12 09:16 PM

I'm considering this, making some notes to myself here; tables to use include:

Ratings:
artist_rating_raw
label_rating_raw
recording_rating_raw
release_group_rating_raw
work_rating_raw

Tags:
artist_tag_raw
label_tag_raw
recording_tag_raw
release_group_tag_raw
release_tag_raw
work_tag_raw

Collections:
editor_collection
editor_collection_release

Subscriptions:
editor_subscribe_artist
editor_subscribe_label
editor_subscribe_editor (maybe?)


Ian McEwen added a comment - 09/Aug/12 09:44 PM

Bits of information that should be included, for each type:

Everything: editor name & rowid

Ratings: entity (type, mbid), rating (appears to be out of 100)
Tags: entity (type, mbid), tags (name, possibly ID)
Subscriptions: type, entity (mbid) or editor (name, rowid)
Collections: collection MBID, name, publicity, releases (mbids)

Anyone who happens to be watching, anything that you think should be added? I'm additionally wondering if it might be valuable to provide a way to export edits/edit notes by an editor as well – though that's public information, available from the dumps, it seems adjacent to this ticket and might be worth including.


Oliver Charles added a comment - 09/Aug/12 10:29 PM

Edit notes stay once you've deleted your account anyway, so I think I was more interested in removing data that is private and linked closely to accounts (i.e., will not exist if you delete your account).


nikki added a comment - 10/Aug/12 02:31 AM

I wouldn't include edit notes or edits (or votes).


Ian McEwen added a comment - 10/Aug/12 04:03 AM

Sounds reasonable; we can consider edits/edit notes excluded then, and people who'd like that can use the dumps.

Does anyone have particular thoughts on format? I don't know of any standardized formats for any of these things, so I was figuring I'd go for a hopefully relatively self-documenting JSON structure in the vein of

{'editor': {'id': 1, 'name': 'whatever'}, 
 'ratings': {'artist': [{'mbid': 'blah', 'rating': 80}, ...], ...}, 
 'tags': {'artist': [{'mbid': 'blah', 'tags': ['rock', 'roll']}, ...], ...},
 'subscriptions': {'artist': [{'mbid': 'blah'}, ...], 'editor': [{'id': 3, 'name': 'nikki'}, ...], ...},
 'collections': [{'mbid': 'whatever', 'name': 'An Awesome Collection', 'public': 1, 'releases': []}, ...]
}
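
For illustration only, a minimal Python sketch of assembling a structure like the one above and serializing it as strict JSON (double-quoted keys). The helper name `build_export` and the shapes of its inputs are hypothetical and are not taken from the MusicBrainz Server code; it only shows how flat (type, mbid, value) rows could be grouped into the nested layout.

```python
import json
from collections import defaultdict


def build_export(editor, ratings, tags, subscriptions, collections):
    """Assemble a single export document (hypothetical helper).

    editor:        dict with 'id' and 'name'
    ratings:       iterable of (entity_type, mbid, rating) tuples
    tags:          iterable of (entity_type, mbid, tag_name) tuples
    subscriptions: dict mapping a type to a list of entries
    collections:   list of dicts with 'mbid', 'name', 'public', 'releases'
    """
    ratings_by_type = defaultdict(list)
    for entity_type, mbid, rating in ratings:
        ratings_by_type[entity_type].append({"mbid": mbid, "rating": rating})

    # Group tag names per entity so each entity appears once per type.
    tags_by_entity = defaultdict(lambda: defaultdict(list))
    for entity_type, mbid, tag_name in tags:
        tags_by_entity[entity_type][mbid].append(tag_name)
    tags_by_type = {
        entity_type: [{"mbid": mbid, "tags": names} for mbid, names in by_mbid.items()]
        for entity_type, by_mbid in tags_by_entity.items()
    }

    return {
        "editor": editor,
        "ratings": dict(ratings_by_type),
        "tags": tags_by_type,
        "subscriptions": subscriptions,
        "collections": collections,
    }


export = build_export(
    editor={"id": 1, "name": "whatever"},
    ratings=[("artist", "blah", 80)],
    tags=[("artist", "blah", "rock"), ("artist", "blah", "roll")],
    subscriptions={"artist": [{"mbid": "blah"}], "editor": [{"id": 3, "name": "nikki"}]},
    collections=[{"mbid": "whatever", "name": "An Awesome Collection", "public": 1, "releases": []}],
)
print(json.dumps(export, indent=2, sort_keys=True))
```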

If people agree with this sort of a unified format and I'm right there isn't any sort of standard at play here, there's a couple more questions:

1.) Do we want to support partial exports – e.g., if someone wants to export just their collections, or just one collection, or just their ratings, or just their artist ratings, do we support that?
2.) I'm not totally sure about which data to include for subscriptions; these tables have a somewhat strange format, and I'm not sure we care about replicating all of its parts. Editors are straightforward, but artist and label have 'deleted_by_edit' and 'merged_by_edit' columns, and some testing suggests that deleted and merged artists stay in the table – a new entry is put in for whatever they were merged into (e.g. artist 907879 (locally) merged into 2327, I have lines for both, with the 907879 one having an entry in merged_by_edit). This also prevents us from having a foreign key on those tables, which is also strange but outside of this scope. My intuition here is to just exclude anything I can't get an MBID for (deleted/merged artists) and ignore the deleted/merged_by_edit columns, but I'm also fine with including them if anyone thinks that information would be useful.

Thoughts?


Ian McEwen added a comment - 10/Aug/12 04:17 AM

Another option, which I somewhat like, is to always have the four subcategories separate (or even further), and have the "export everything" option package up an archive of some sort (probably .zip, which is reasonably cross-platform) including all the files. Then the above format is more like a schema; any given file won't have more than 'editor' and one of ['ratings', 'tags', 'subscriptions', 'collections'].
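
A rough Python sketch of that packaging idea, assuming the combined export is already available as a dict: each category is written to its own JSON file alongside the 'editor' key, and the files are bundled into a .zip in memory. The helper name `export_to_zip` and the `<category>.json` file names are made up for illustration.

```python
import io
import json
import zipfile

CATEGORIES = ["ratings", "tags", "subscriptions", "collections"]


def export_to_zip(export):
    """Return zip bytes holding one JSON file per category (hypothetical helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for category in CATEGORIES:
            # Each file carries the editor plus exactly one category.
            document = {"editor": export["editor"], category: export[category]}
            archive.writestr(category + ".json", json.dumps(document, indent=2))
    return buffer.getvalue()


sample = {
    "editor": {"id": 1, "name": "whatever"},
    "ratings": {"artist": [{"mbid": "blah", "rating": 80}]},
    "tags": {"artist": [{"mbid": "blah", "tags": ["rock", "roll"]}]},
    "subscriptions": {"artist": [{"mbid": "blah"}]},
    "collections": [],
}
zip_bytes = export_to_zip(sample)
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
    names = archive.namelist()
    tags_document = json.loads(archive.read("tags.json"))
print(names)
```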


Oliver Charles added a comment - 10/Aug/12 12:41 PM

1. I don't think we should support partial exports; let's just do the bare minimum to let people get their data out.

2. For subscriptions, just dump what's in /user/foo/subscriptions - that is, a list of entities that the editor is subscribed to. The merged/deleted columns are used to inform editors about deletions and merges in their emails; those subscriptions should be deleted after the email is sent. Thus I'd probably exclude rows where those columns have values.

3. I would just provide a single dump in JSON format, as that is easier for us to generate on demand. For zips and stuff that require more heavy IO, I wouldn't want the server to be responsible for this, which means we need a message queue or something to generate those dumps and this all gets drastically more complicated.
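
A small Python sketch of the filtering described in point 2, under stated assumptions: rows are represented as dicts that already carry an 'mbid' (in the real tables this would need a join against the entity table), and an unset merged_by_edit/deleted_by_edit is stored as either None or 0. The function name and row shape are hypothetical.

```python
def active_subscriptions(rows):
    """Keep only subscriptions not flagged as merged or deleted.

    Assumption: an unset column is None or 0, so any truthy value means the
    row only exists to notify the editor about a merge or deletion.
    """
    return [
        {"mbid": row["mbid"]}
        for row in rows
        if not row.get("merged_by_edit") and not row.get("deleted_by_edit")
    ]


# Mirrors the merge case discussed above: the old artist row is flagged,
# and a fresh row exists for the merge target. Edit ID and MBIDs are made up.
rows = [
    {"artist": 907879, "mbid": "mbid-of-907879", "merged_by_edit": 12345, "deleted_by_edit": None},
    {"artist": 2327, "mbid": "mbid-of-2327", "merged_by_edit": None, "deleted_by_edit": None},
]
print(active_subscriptions(rows))
```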


Ian McEwen added a comment - 16/Aug/12 07:01 AM - edited

I've changed the format slightly, but I have a sample: https://gist.github.com/3367628

Folks who care: questions/comments/concerns?

edit: changed the format more


Joachim LeBlanc added a comment - 16/Aug/12 08:24 AM

Format looks good and fairly intuitive. You do have an empty array (tags.label). Don't remove it from your example, but in production empty arrays shouldn't be included.

Also, I'd like a consistent approach to plurality of these names. In this case, if the key (releases, tags, artists, etc.) can have more than one value, it should be plural. If the key has only one element or one object (e.g. the first instance of editor, not the second), the key should be singular. Your example is inconsistent (release[s]). Or the alternative is to have all of the keys singular, but that doesn't feel right to me. And, apologies if this seems nitpicky. It's frustrating using other services that have inconsistent key plurality. If I expect it to be plural I would use the plural form while coding. If it changes, then I have to repeatedly check the docs or examples to make sure the key is right for the case it's being used.


Ian McEwen added a comment - 16/Aug/12 08:36 AM

Key plurality stuff seems spot-on, I'll note that to myself to fix.

As for empty arrays: I'm not sure they should be removed. Having the key with an empty array is a very clear "this could have tags but there aren't any", to me – and I don't think sticking a null there is correct, so choosing between "don't include the key at all" and "include with an empty array", the latter seems like the right option.


Oliver Charles added a comment - 16/Aug/12 09:55 AM

The empty array is correct. The property exists, but it's simply empty. I would keep that.
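
To illustrate the point from a consumer's side, a short Python sketch with made-up sample documents: when the key is always present, code reading the export can index it directly, whereas omitting empty arrays forces every consumer to special-case the missing key.

```python
import json

with_empty = json.loads('{"tags": {"artist": [{"mbid": "blah", "tags": ["rock"]}], "label": []}}')
without_key = json.loads('{"tags": {"artist": [{"mbid": "blah", "tags": ["rock"]}]}}')


def count_tagged(document, entity_type):
    # Direct indexing works only when the key is always present.
    return len(document["tags"][entity_type])


print(count_tagged(with_empty, "label"))

try:
    count_tagged(without_key, "label")
    missing_key_raised = False
except KeyError:
    missing_key_raised = True
    print("consumer must special-case the missing key")
```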

I'm not sure about including the editor row ID, because that's fairly internal, but I suppose it is necessary if you ever want to crawl public dumps for edit notes you left/edits you made.


Ian McEwen added a comment - 16/Aug/12 10:23 AM

That was my thought – there are instances where editor names change (deletion) but none where their ID will. If we assigned editors MBIDs I'd pass that, but we don't.


Oliver Charles added a comment - 25/Sep/12 11:44 AM

The merge window for 2012-10-01 is now closed, so this will have to wait until 2012-10-15.


Ian McEwen added a comment - 05/Oct/12 10:41 AM

Reopening, but I'll keep myself assigned; the code review needed more work, which I've been procrastinating on.


Ian McEwen added a comment - 23/Nov/12 08:39 PM

I'm not presently working on this; unassigning for clarity.