Issue Details (XML | Word | Printable)

Key: MBS-5285
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Normal Normal
Assignee: Oliver Charles
Reporter: Oliver Charles
Votes: 0
Watchers: 2
Operations

If you were logged in you would be able to see more operations.
MusicBrainz Server

Add NOT NULL constraints to disambiguation & join phrase column

Created: 10/Sep/12 11:57 AM   Updated: 17/Oct/12 02:07 AM   Resolved: 05/Oct/12 10:47 AM
Component/s: None
Affects Version/s: None
Fix Version/s: Schema change, 2012-10-15


 Description  « Hide

Both the disambiguation comment column and artist credit join_phrase columns contain data that we define in MusicBrainz, thus NULL never makes any sense for them (we always know what these columns should contain). We should not allow NULL data for these columns.



Sort Order: Ascending order - Click to sort in descending order
Paul Taylor added a comment - 10/Sep/12 12:57 PM

This seems like the wrong way round to me for disambuguation comments, in relational databases its normal to have nullable columns for fields that don't require a value. Would make more sense to enforce a constraint to not allow the value to be set to "", and in the UI if the user delete a comment by setting to "" this should cause the underlying value in the database to be set back to NULL.

There are two advantages to this.
1. The datamodel clearly shows which fields are mandatory i.e release title is mandatory therefore not nullable, disambugation column nor mandatory so nullable..
2. I assume that we only have disambuguation comments set to "" if we have deleted a previously entered comment, so that we must have many more database rows with disambuguation comment set to NULL, then "", so the size of replication packets required to change "" to NULL would be much smaller than changing NULL to "".


Oliver Charles added a comment - 10/Sep/12 01:46 PM - edited

I disagree that NULL is a correct value here, and I disagree that it's normal/good database design to have NULL mean 'not mandatory'. There are other more correct ways to model not mandatory, such as join tables. NULL is specifically undefined, unknown; a hole inside a row with no value. There is very little you can do with it, other than asserting its absence. In fact, the original relational database design didn't even have NULL! But I digress; my general point is that we define this data, and thus we are able to always provide a value for it. I don't see why having the empty string is not acceptable for the domain of this column - we certainly treat NULL in that manner anyway.

A good example of where NULL is required is with the barcode column. In this case the empty string and NULL do have different meanings - one means "we don't know" and the other means "we know there is no barcode". We will never be in a scenario when we don't know if an artist has a disambiguation comment or not.

1. Moot point - it is a required field.
2. This will be done as part of the schema upgrade process, which all clients will have to run, thus replication packets are not a problem.


Paul Taylor added a comment - 11/Sep/12 06:39 AM

Okay, we discussed this on irc and sounds like I have misunderstood a nullable column as meaning not mandatory, when in fact it means not known, usually this doesn't make much difference as often both conditions do apply but in this case I see that the disambuguation comment isn't unknown information that we are trying to find out. therefore missses the not known condition.


Oliver Charles added a comment - 11/Sep/12 11:28 AM

Ok, first inevitable problem has come up

By making clients and the master run this at different times, we introduce the problem of both having the last_updated column be out of sync. Lets consider a very simple database with only a single artist row with comment=NULL, last_updated=1.

1. The master runs the upgrade schema script at t=2. The last_updated column is thus updated to 2.
2 We allow edits to artists again, and start producing replication packets. Assume this artist is updated at t=3}, then we will have a replication packet setting {{last_updated to 3.
3. Clients run the upgrade script at t=4. They don't have triggers, so will leave last_updated column as 1 which is now out of sync with master.

I can see 3 solutions:

1. Forgeddaboudit! This problem isn't worth worrying about, thus we leave with this caveat.
2. Add an explicit last_updated date to the script, so both client and master will have the same date.
3. Disable last_updated triggers during the upgrade. This is basically 2, except the explicit date is 'whatever it was before we ran'.

I think my personal preference is option 3, and we should do this for all tables at the start of migration.


Ian McEwen added a comment - 11/Sep/12 09:19 PM

Questions:

for
1.), would it break the application of replication packets (i.e., do they care about where a value starts, or do they just update to whatever they're supposed to, ignoring the old value);
2.), we'd have to decide what value to use, and it seems likely that any choice would be wrong;
3.), are there performance issues?

Nevertheless, I think 3 is probably the best option, and I agree it should be for all tables – in fact, two scripts for disabling/enabling last_updated triggers would seem to me to be a generally valuable introduction to our codebase for future schema changes/debugging/etc.


Karl Dietz added a comment - 14/Sep/12 01:42 PM

Another important aspect is that 0 = 0 and '' = '' but NULL <> NULL.
So if you have a unique constraint on an optional attribute you can have multiple entries containing NULL but only one containing an empty string or zero.
By using NULL instead of '' for the artist disambiguation comment you can enforce "two artists with the same name must have different comments (or no comment at all)" with a simply unique key constraint.


Oliver Charles added a comment - 14/Sep/12 05:42 PM - edited

You can still do that with null, fwiw - you just need 2 indexes. One is UNIQUE INDEX artist (name) WHERE comment IS NULL and the other is UNIQUE INDEX artist (name, comment) WHERE comment IS NOT NULL


Oliver Charles added a comment - 05/Oct/12 10:47 AM

All merged to schema change branch


Duke Yin added a comment - 17/Oct/12 01:28 AM

What does this code change do to artist join phrases? Is a blank join phrase still possible?


Ulrich Klauer added a comment - 17/Oct/12 01:52 AM

Yes. It is just stored as an empty string '' instead of NULL.


Ian McEwen added a comment - 17/Oct/12 02:07 AM

@yindesu: yes, they are. This doesn't actually change anything on the frontend or the semantics of the data (other than places where our code is incorrect...) – it just means the empty join phrase is always stored as '' rather than sometimes as NULL. On the frontend, these two didn't display differently before; this just removes the duplication of '' and NULL being the same, by disallowing NULL.

There is one change (not this ticket, but within the same schema change) that affects join phrases, but only in one strange case: when the join phrase at the end of an artist credit (i.e., after the last artist) is entirely whitespace. Once again, this doesn't actually change semantics – we've always considered final join phrases without actual content to be meaningless (they're also usually mistakes, byproducts of the release editor interface being... less than ideal).