Zapped: AcousticBrainz / AB-101

frame-level ll data is too big to store in postgres


    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Normal
    • Component: Server

      We've started looking at how to add more detailed information to AB, by computing data for each 1024-sample frame instead of taking an average value over the duration of the entire song.

      This results in a huge increase in the size of the returned JSON, which we expected. A 3-minute song that used to give ~100 KB of JSON now returns something about 10 MB in size.
      Given the current size of AcousticBrainz (200 GB), we'd be looking at something like 20 TB to store the current 3 million entries.

      We're switching to Postgres 9.4 and jsonb; however, jsonb has a per-column limit of 0x0FFFFFFF bytes (~256 MiB). This means we can't fit data for long files (>25 minutes) in the database.
      We may be reaching the limit of storing this data as JSON, especially our numerical data, which we're storing as text in scientific notation. This is significantly larger than the same data in binary floating point. What's more, I don't think that Postgres gives us any advantages when we store this data. Indexes on jsonb make it easy to select and filter on specific values, but I don't think we're ever going to want to ask for "all files where the 3rd mel band of frame 6830 is > 2".
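      To put rough numbers on the text-vs-binary point, here is a quick stdlib sketch (synthetic frame data, not real AB output; the 1024×40 shape is only an illustration) comparing JSON text against packed 32-bit floats:

```python
import json
import struct

# Synthetic stand-in for frame-level data: 1024 frames x 40 mel bands.
frames = [[0.000123456789 * (i + j) for j in range(40)] for i in range(1024)]

as_text = json.dumps(frames).encode("utf-8")  # numbers serialised as decimal text
as_binary = b"".join(struct.pack("<40f", *row) for row in frames)  # 4 bytes per value

print(len(as_text), len(as_binary))  # the text form is several times larger
```

      Even before picking a format, this suggests most of the bloat is the textual number encoding, not JSON's structural overhead.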

      Perhaps we should continue to store the averaged data in Postgres, but store this complete frame-level data outside the database, for people to download for machine-learning tasks.
      Where should we store it? Do we continue to look at database solutions (Redis? other key-value stores?) or just put it on disk? What directory structure would we need to support 10 million files?
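      On the directory-structure question, one common approach (a sketch, not a decision; the root path and MBID below are made up) is to fan files out by the first characters of the MBID so no single directory holds millions of entries:

```python
import os

def shard_path(root: str, mbid: str) -> str:
    """Place a file under root/xx/yy/ using the first two hex pairs
    of the MBID: 256 * 256 = 65536 leaf directories, so 10 million
    files works out to roughly 150 per directory."""
    return os.path.join(root, mbid[:2], mbid[2:4], mbid + ".json")

print(shard_path("/data/lowlevel", "96685657-c6c2-4701-8e11-f1cd78a28847"))
```

      Because MBIDs are effectively random, the fan-out stays balanced without any bookkeeping.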

      I've been looking at alternative storage formats. Protocol Buffers gives us a 60% reduction in size, which is great, but requires generating bindings for each language from the schema, which needs updating every time we add new data. There seem to be some Matlab tools for this (https://github.com/farsounder/protobuf-matlab), but I'm not sure of their quality.
      I will also look into other formats (msgpack, HDF5) to see whether they give better results in terms of file size or accessibility from many languages.
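      Whichever format we compare, gzip-compressed JSON makes a cheap stdlib baseline to measure candidates against (again using synthetic data; real frame data will compress differently):

```python
import gzip
import json

# Same synthetic 1024 x 40 frame data as a stand-in for real output.
frames = [[0.000123456789 * (i + j) for j in range(40)] for i in range(1024)]
raw = json.dumps(frames).encode("utf-8")
packed = gzip.compress(raw)

print(len(raw), len(packed))  # compressed size is well under the raw JSON size
```

      A candidate format is only interesting if it beats (or roughly matches) this baseline while staying easy to read from many languages.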

            Assignee: Unassigned
            Reporter: Alastair Porter (alastairp)