Shared Schema Database

In this post I discuss shared schema databases and the practices that should be followed when using this pattern. A shared schema with multiple users is always a risk, because by default there is no limit to what a database user can do.

It is easy to forget a WHERE clause, which causes data leaks or data loss. Backing up and restoring an individual user's data is also hard because the data is not isolated.

Description

Tables in a shared schema can be accessed by multiple users, and by default the only thing tying a row to a user is a foreign key.

Let's assume we have two tables, User and IceCream.

User:

id | username
---+----------
1  | poppy
2  | leona

IceCream:

id | user | flavour
---+------+------------
1  | 1    | chocolate
2  | 2    | vanilla
3  | 2    | chocolate
4  | 1    | blueberry

In this architecture, you rely on queries that filter by the foreign key:

SELECT * FROM IceCream WHERE user=1;

This is fine if the data is yours, but when the data belongs to customers it isn't sufficient. You should consider using a schema per user instead. If you have already made this mistake, the next sections provide possible remedies.
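If you are starting fresh, the schema-per-user alternative can be sketched as follows; the user_N naming scheme and the table layout are my own assumptions, not a fixed convention:

```python
# Sketch: generate the DDL for a per-user schema holding that user's
# own copy of the tables, so no WHERE clause is needed for isolation.
def schema_statements(user_id: int) -> list[str]:
    schema = f'user_{user_id}'
    return [
        f'CREATE SCHEMA IF NOT EXISTS {schema};',
        # Each user's IceCream table lives in their own schema, so the
        # user foreign key column is no longer needed.
        f'CREATE TABLE IF NOT EXISTS {schema}.IceCream ('
        'id serial PRIMARY KEY, flavour text);',
    ]
```

The application then sets the search_path (or fully qualifies table names) per connection, so a forgotten WHERE clause can only ever touch one user's data.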

Row-level security

With row-level security you can restrict the visibility of rows; conceptually, the database always appends the WHERE clause for you.

https://www.postgresql.org/docs/current/ddl-rowsecurity.html
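As a rough sketch of what this looks like in PostgreSQL, the following helper builds the needed statements; the policy name user_isolation and the app.current_user session setting are illustrative assumptions:

```python
# Sketch: generate the PostgreSQL statements that enforce per-user row
# visibility on a shared table via row-level security.
def rls_statements(table: str, user_column: str = 'user') -> list[str]:
    # Enabling RLS makes the table invisible until a policy allows access.
    enable = f'ALTER TABLE "{table}" ENABLE ROW LEVEL SECURITY;'
    # The policy compares each row's user column to a session variable
    # that the application must set before querying.
    policy = (
        f'CREATE POLICY user_isolation ON "{table}" USING '
        f'("{user_column}" = current_setting(\'app.current_user\')::int);'
    )
    return [enable, policy]
```

The application would then run SET app.current_user = '1' at the start of each session or transaction so the policy knows which user is active. Note that table owners bypass policies unless FORCE ROW LEVEL SECURITY is also set.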

Query filters in application code

Row-level security does the restriction at the database level, but it can also be done at the application level. Especially when using an ORM, one can accidentally fetch all rows, e.g. in Django:

IceCream.objects.all()
IceCream.objects.filter()
IceCream.objects.filter(user=None)
# or
user = None
IceCream.objects.filter(user=user)

The first two statements return every row, and filter(user=None) silently matches rows whose user column is NULL instead of raising an error. A single bug, such as an empty filter or a user that is accidentally None, can delete far more rows than you intended!

The only remedy I can think of is to construct the queries through a small wrapper that appends the WHERE condition.

query: str = 'SELECT * FROM IceCream'

def find(cursor, query: str, params: dict, user: User):
    ...
    if user and user.id:
        sql = rls_wrap(query)
        cursor.execute(sql, {**params, 'user': user.id})
    else:
        raise ValueError('User id is not supplied to query.')


def rls_wrap(query: str) -> str:
    # Naive on purpose: this assumes the query has no subqueries and no
    # ORDER BY / GROUP BY after the point where the condition is appended.
    if 'WHERE' in query:
        return query + ' AND user=%(user)s'
    return query + ' WHERE user=%(user)s'

After implementing the wrapper, keep it as minimal as possible and well-tested, and avoid modifying it afterwards. Every table should use the same column name for the user foreign key.

If you use an ORM such as Django, override the default filter functions and always check that the user is passed as a parameter.
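A language-agnostic sketch of that idea, with plain dicts standing in for model instances; in Django the same check would live in a custom manager method:

```python
# Sketch: a filter that refuses to run without an explicit user,
# closing the "user is accidentally None" hole described above.
def filter_for_user(rows, user_id):
    """Return only the rows owned by user_id; reject a missing user."""
    if user_id is None:
        raise ValueError('User id is not supplied to query.')
    return [row for row in rows if row['user'] == user_id]
```

Because the guard raises instead of silently matching nothing (or everything), the bug surfaces in tests rather than in production data.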

Back up the subset of SQL rows with NoSQL

I haven't found a proper way to perform user-level backups, so the only idea I have is to use a document store such as a NoSQL database and save different versions as JSON.

I would store them like this

{
  "_shard": 0,
  "_version": 0,
  "date": "2020-12-14",
  "table": "IceCream",
  "user": 1,
  "rows": [
    {
      "id": 1,
      "flavour": "chocolate"
    },
    {
      "id": 4,
      "flavour": "blueberry"
    }
  ]
}

Here, I use shards because if a document happens to exceed the maximum size, we just continue in another shard containing the rest of the rows. You may want to delete very old backups and versions, but be careful not to delete unintended data.

Using this method you can back up either when the changes happen or periodically. It is also possible to put each row in its own document and version the rows individually, but a whole table is probably easier to restore.
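A minimal sketch of producing such documents, assuming a hypothetical MAX_ROWS_PER_SHARD limit:

```python
# Sketch: split one user's rows for one table into shard-sized backup
# documents matching the JSON shape shown above.
MAX_ROWS_PER_SHARD = 1000  # assumed limit; tune to your document store

def build_backup_docs(table, user_id, rows, version, date):
    """Return a list of backup documents, one per shard."""
    docs = []
    for shard, start in enumerate(range(0, len(rows), MAX_ROWS_PER_SHARD)):
        docs.append({
            '_shard': shard,
            '_version': version,
            'date': date,
            'table': table,
            'user': user_id,
            'rows': rows[start:start + MAX_ROWS_PER_SHARD],
        })
    return docs
```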

Restoring from NoSQL

Query the documents you want to restore, then check whether each row already exists by primary key and user id. If it exists, update the row to match the backup data; if it doesn't, insert a new row.
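That restore logic can be sketched like this, using an in-memory dict keyed by (user, id) in place of the real SQL table:

```python
# Sketch: upsert backed-up rows into a table represented as a dict.
# A real implementation would issue SQL UPDATE/INSERT statements instead.
def restore(table_rows, backup_doc):
    """Upsert the rows from one backup document; returns the table."""
    user_id = backup_doc['user']
    for row in backup_doc['rows']:
        key = (user_id, row['id'])
        if key in table_rows:
            # Row exists: overwrite it with the backed-up values.
            table_rows[key].update(row)
        else:
            # Row is missing: insert it, restoring the user foreign key.
            table_rows[key] = dict(row, user=user_id)
    return table_rows
```

Keying by both user and id keeps the restore scoped to one user, so a bad backup document cannot overwrite another user's rows.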