|
Import is not part of this ticket.

I'm considering this and making some notes to myself here. Tables to use include:
Ratings:
Tags:
Collections:
Subscriptions:

Bits of information that should be included, for each type:
Everything: editor name & rowid
Ratings: entity (type, mbid), rating (appears to be out of 100)

Anyone who happens to be watching, is there anything that you think should be added? I'm additionally wondering if it might be valuable to provide a way to export edits/edit notes by an editor as well. Though that's public information, available from the dumps, it seems adjacent to this ticket and might be worth including.

Edit notes stay once you've deleted your account anyway, so I think I was more interested in removing data that is private and linked closely to accounts (i.e., will not exist if you delete your account).

Sounds reasonable; we can consider edits/edit notes excluded then, and people who'd like that can use the dumps.

Does anyone have particular thoughts on format? I don't know of any standardized formats for any of these things, so I was figuring I'd go for a hopefully relatively self-documenting JSON structure in the vein of:
{'editor': {'id': 1, 'name': 'whatever'},
'ratings': {'artist': [{'mbid': 'blah', 'rating': 80}, ...], ...},
'tags': {'artist': [{'mbid': 'blah', 'tags': ['rock', 'roll']}, ...], ...},
'subscriptions': {'artist': [{'mbid': 'blah'}, ...], 'editor': [{'id': 3, 'name': 'nikki'}, ...], ...},
'collections': [{'mbid': 'whatever', 'name': 'An Awesome Collection', 'public': 1, 'releases': []}, ...]
}
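A minimal Python sketch of how a document in this shape might be assembled and serialized as strict JSON (double-quoted keys, no `...` placeholders). The function name `build_export` and its arguments are hypothetical and not taken from the MusicBrainz codebase; the values are the placeholder ones from the structure above.

```python
import json


def build_export(editor, ratings, tags, subscriptions, collections):
    """Assemble one export document from plain dicts/lists (hypothetical helper)."""
    return {
        "editor": {"id": editor["id"], "name": editor["name"]},
        "ratings": ratings,
        "tags": tags,
        "subscriptions": subscriptions,
        "collections": collections,
    }


export = build_export(
    editor={"id": 1, "name": "whatever"},
    ratings={"artist": [{"mbid": "blah", "rating": 80}]},
    tags={"artist": [{"mbid": "blah", "tags": ["rock", "roll"]}]},
    subscriptions={
        "artist": [{"mbid": "blah"}],
        "editor": [{"id": 3, "name": "nikki"}],
    },
    collections=[
        {"mbid": "whatever", "name": "An Awesome Collection", "public": 1, "releases": []}
    ],
)

# Serialize to a single JSON string; sort_keys keeps the output stable between runs.
text = json.dumps(export, indent=2, sort_keys=True)
```

Producing one string in memory like this matches the "single file" variant of the proposal; packaging several files into an archive would need extra machinery beyond this sketch.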
If people agree with this sort of unified format, and I'm right that there isn't any sort of standard at play here, there are a couple more questions:

1.) Do we want to support partial exports? E.g., if someone wants to export just their collections, or just one collection, or just their ratings, or just their artist ratings, do we support that? Thoughts?

Another option, which I somewhat like, is to always have the four subcategories separate (or even further), and have the "export everything" option package up an archive of some sort (probably .zip, which is reasonably cross-platform) including all the files. Then the above format is more like a schema; any given file won't have more than 'editor' and one of ['ratings', 'tags', 'subscriptions', 'collections'].

1. I don't think we should support partial exports; let's just do the bare minimum to let people get their data out.
2. For subscriptions, just dump what's in /user/foo/subscriptions, that is, a list of entities that the editor is subscribed to. The merged/deleted columns are used to inform editors about deletions and merges in their emails; those subscriptions should be deleted after the email is sent. Thus I'd probably skip rows where those columns have values.
3. I would just provide a single dump in JSON format, as that is easier for us to generate on demand. Zips and the like require heavier IO, and I wouldn't want the server to be responsible for that, which means we'd need a message queue or something to generate those dumps, and this all gets drastically more complicated.

I've changed the format slightly, but I have a sample: https://gist.github.com/3367628
Folks who care: questions/comments/concerns?
edit: changed the format more

Format looks good and fairly intuitive. You do have an empty array (tags.label). Don't remove it from your example, but in production empty arrays shouldn't be included. Also, I'd like a consistent approach to plurality of these names.
In this case, if the key (releases, tags, artists, etc.) can have more than one value, it should be plural. If the key has only one element or one object (e.g. the first instance of editor, not the second), the key should be singular. Your example is inconsistent (release[s]). The alternative is to have all of the keys singular, but that doesn't feel right to me.

And apologies if this seems nitpicky. It's frustrating using other services that have inconsistent key plurality. If I expect a key to be plural, I would use the plural form while coding. If it changes, then I have to repeatedly check the docs or examples to make sure the key is right for the case it's being used in.

Key plurality stuff seems spot-on; I'll note that to myself to fix. As for empty arrays: I'm not sure they should be removed. Having the key with an empty array is a very clear "this could have tags but there aren't any" to me, and I don't think sticking a null there is correct. So, choosing between "don't include the key at all" and "include with an empty array", the latter seems like the right option.

The empty array is correct. The property exists, but it's simply empty. I would keep that.

I'm not sure about including the editor row ID, because that's fairly internal, but I suppose it is necessary if you ever want to crawl public dumps for edit notes you left/edits you made.

That was my thought: there are instances where editor names change (deletion) but none where their ID will. If we assigned editors mbids I'd pass that, but we don't.

The merge window for 2012-10-01 is now closed, so this will have to wait until 2012-10-15.

Reopening, but I'll keep myself assigned; the codereview needed more work, which I've been procrastinating on.

I'm not presently working on this; unassigning for clarity. |
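Two of the conventions discussed in the thread (keep a key with an empty array when there is nothing to list, and leave out subscription rows whose merged/deleted columns have values) can be sketched in Python as follows. This is only an illustration under assumptions: `ENTITY_TYPES`, `group_tags`, `active_subscriptions`, and the row field names `merged` and `deleted` are hypothetical stand-ins, not the real schema or server code.

```python
from collections import defaultdict

# Assumed subset of entity types, for illustration only.
ENTITY_TYPES = ("artist", "label")


def group_tags(rows):
    """Group (entity_type, mbid, tag) rows by entity type.

    Every known type gets a key, so a type with no tags yields [] rather
    than a missing key or null.
    """
    by_type = {entity_type: [] for entity_type in ENTITY_TYPES}
    per_entity = defaultdict(list)
    for entity_type, mbid, tag in rows:
        per_entity[(entity_type, mbid)].append(tag)
    for (entity_type, mbid), tag_list in per_entity.items():
        by_type.setdefault(entity_type, []).append({"mbid": mbid, "tags": tag_list})
    return by_type


def active_subscriptions(rows):
    """Keep only rows whose (hypothetical) merged/deleted fields are unset."""
    return [
        {"mbid": row["mbid"]}
        for row in rows
        if row.get("merged") is None and row.get("deleted") is None
    ]


tags = group_tags([("artist", "blah", "rock"), ("artist", "blah", "roll")])
subs = active_subscriptions(
    [
        {"mbid": "a", "merged": None, "deleted": None},
        {"mbid": "b", "merged": None, "deleted": 42},
    ]
)
```

Here `tags["label"]` is an empty list rather than absent, which is the "this could have tags but there aren't any" reading argued for above.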
And import (partial or full) too