Shared Schema Database
In this post I talk about using shared schema databases and the practices that should be followed if this pattern is used. Having a shared schema and multiple users is always a risk because, by default, there is no limit to what a database user can do.
It is easy to forget a WHERE clause, which causes data leaks or data loss.
Backing up and restoring an individual user's data is also hard because the data is not isolated.
Description
Tables in a shared schema can be accessed by multiple users, and by default the only restriction is a foreign key that points to the user.
Let's assume we have two tables, User and IceCream:
id | username
---+----------
1 | poppy
2 | leona
id | user | flavour
---+------+------------
1 | 1 | chocolate
2 | 2 | vanilla
3 | 2 | chocolate
4 | 1 | blueberry
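For concreteness, the corresponding schema could look roughly like this in PostgreSQL (a sketch; the column types are my own choice, and the identifier user must be quoted because it is a reserved word):

CREATE TABLE "User" (
    id       serial PRIMARY KEY,
    username text NOT NULL
);

CREATE TABLE IceCream (
    id      serial PRIMARY KEY,
    "user"  integer NOT NULL REFERENCES "User" (id),
    flavour text NOT NULL
);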
In this architecture, you would rely on queries that filter by the foreign key:
SELECT * FROM IceCream WHERE user=1;
This is okay if the data is yours, but when the data belongs to customers it isn't sufficient. You should consider using a schema per user instead. If you have already made this mistake, the next sections provide possible remedies.
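For comparison, a schema per user isolates the rows structurally, e.g. in PostgreSQL (a sketch; the schema naming convention is my own):

CREATE SCHEMA user_poppy;

CREATE TABLE user_poppy.IceCream (
    id      serial PRIMARY KEY,
    flavour text NOT NULL
);

-- No user column or WHERE clause needed; the schema is the boundary.
SELECT * FROM user_poppy.IceCream;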
Row-level security
With row-level security you can restrict the visibility of rows at the database level; conceptually, it always appends the WHERE clause for you.
https://www.postgresql.org/docs/current/ddl-rowsecurity.html
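As a minimal sketch in PostgreSQL, assuming the application sets a session variable named app.current_user_id on each connection (the variable and policy names are my own choice; note that by default policies do not apply to the table owner):

ALTER TABLE IceCream ENABLE ROW LEVEL SECURITY;

-- Only rows whose "user" column matches the session variable are visible.
CREATE POLICY icecream_isolation ON IceCream
    USING ("user" = current_setting('app.current_user_id')::integer);

-- The application sets the variable before querying:
SET app.current_user_id = '1';
SELECT * FROM IceCream;  -- behaves like ... WHERE "user" = 1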
Query filters in application code
Row-level security does the restriction at the database level, but it can also be done at the application level. Especially when using an ORM, one can accidentally fetch all rows, for example, in Django:
IceCream.objects.all()
IceCream.objects.filter()
IceCream.objects.filter(user=None)
# or
user = None
IceCream.objects.filter(user=user)
The first two statements are equivalent and return every row, so chaining .delete() onto them deletes the whole table; the last two silently match rows whose user column is NULL instead of raising an error. A single bug, such as accidentally setting user to None, can therefore leak or destroy data you never meant to touch!
The only remedy I can think of is to construct the queries differently, using a wrapper that always appends the WHERE condition.
query: str = 'SELECT * FROM IceCream'

def find(cursor, query: str, params: dict, user: User):
    ...
    if user and user.id:
        sql = rls_wrap(query)
        cursor.execute(sql, {**params, 'user': user.id})
    else:
        raise Exception('User id is not supplied to query.')

def rls_wrap(query: str):
    wrapped_query: str = query
    # Assumes the WHERE keyword is always written in uppercase.
    if 'WHERE' in query:
        wrapped_query += ' AND user=%(user)s'
    else:
        wrapped_query += ' WHERE user=%(user)s'
    return wrapped_query
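Hypothetical usage, assuming a psycopg-style cursor and a current_user object with an id:

find(cursor, query, {}, current_user)
# executes: SELECT * FROM IceCream WHERE user=%(user)s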
After implementing the wrapper, it needs to be kept as minimal as possible, well tested, and not modified afterwards. Each table should use the same column name for the foreign key.
If you use an ORM such as Django, override the default filter functions and always check that the user is passed as a parameter.
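A sketch of one way to do this in Django (the UserScopedManager name and for_user method are my own; note this adds a required entry point rather than overriding filter itself, which still exists on the manager):

from django.db import models

class UserScopedManager(models.Manager):
    # Refuse to build a queryset without an explicit user.
    def for_user(self, user):
        if user is None or user.pk is None:
            raise ValueError('User id is not supplied to query.')
        return self.get_queryset().filter(user=user)

class IceCream(models.Model):
    user = models.ForeignKey('auth.User', on_delete=models.CASCADE)
    flavour = models.CharField(max_length=100)

    objects = UserScopedManager()

# Usage:
# IceCream.objects.for_user(request.user).all()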
Back up the subset of SQL rows with NoSQL
I haven't found a proper way to perform user-level backups, so the only idea I have is to use document storage, such as a NoSQL database, and save different versions as JSON.
I would store them like this:
{
"_shard": 0,
"_version": 0,
"date": "2020-14-12",
"table": "IceCream",
"user": 1,
"rows": [
{
"id": 1,
"flavour": "chocolate"
},
{
"id": 4,
"flavour": "blueberry"
}
]
}
Here, I use shards: if a document happens to exceed the maximum document size, we simply continue in another shard that contains the remaining rows. You may want to delete very old backups and versions, but be careful not to delete unintended data.
Using this method, you can back up either whenever changes happen or periodically. It is also possible to put each row in its own document and version the rows individually, but one document per table might be easier to restore.
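A sketch of the periodic variant, assuming MongoDB via pymongo and the document shape above (the backups collection and helper name are my own):

import datetime
import pymongo

def backup_icecream(cursor, backups, user_id: int):
    # Fetch the user's rows from the shared table.
    cursor.execute('SELECT id, flavour FROM IceCream WHERE "user"=%(user)s',
                   {'user': user_id})
    rows = [{'id': id_, 'flavour': flavour} for id_, flavour in cursor.fetchall()]

    # Find the next version number for this user and table.
    latest = backups.find_one({'table': 'IceCream', 'user': user_id},
                              sort=[('_version', pymongo.DESCENDING)])
    version = latest['_version'] + 1 if latest else 0

    backups.insert_one({
        '_shard': 0,  # open a new shard only if the document grows too large
        '_version': version,
        'date': datetime.date.today().isoformat(),
        'table': 'IceCream',
        'user': user_id,
        'rows': rows,
    })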
Restoring from NoSQL
Query the documents you want to restore, then check whether each row already exists by primary key and user id. If it exists, update the row to match the backup data; if it doesn't, insert a new row.
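In PostgreSQL, the exists-then-update-or-insert step can be expressed as a single upsert. A sketch under the same assumptions as the backup example (the helper name is my own; the extra WHERE guard ensures we never overwrite another user's row that happens to share a primary key):

def restore_icecream(cursor, backups, user_id: int, version: int):
    # Restore every shard of the chosen version.
    docs = backups.find({'table': 'IceCream', 'user': user_id, '_version': version})
    for doc in docs:
        for row in doc['rows']:
            cursor.execute(
                'INSERT INTO IceCream (id, "user", flavour) '
                'VALUES (%(id)s, %(user)s, %(flavour)s) '
                'ON CONFLICT (id) DO UPDATE SET flavour = EXCLUDED.flavour '
                'WHERE IceCream."user" = %(user)s',
                {'id': row['id'], 'user': user_id, 'flavour': row['flavour']})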