- Notifications
You must be signed in to change notification settings - Fork16
Description
Previously we thought that 1 MB can track changes page-to-page in the 1 GB of data files. However, recently it became evident that our ptrack map or basic hash table behaves more like a Bloom filter with a number of hash functionsk = 1
. See more here:https://en.wikipedia.org/wiki/Bloom_filter#Probability_of_false_positives. Such filter has naturally more collisions.
For example, for database of size 22 GB and ptrack.map_size = 32 MB previously we thought that we will get almost zero collisions. However, theory says that we should track an extra 640 MB by tracking a 1 GB of data:
importmathdefcol_proba(n,m,k=1):return (1-math.exp(-k*n/m))**kmap_size=1024*1024# 1 MB in bytesmap_size=map_size*32# ptrack.map_size in bytesmap_pages=map_size//8# 64-bit uint or 8 bytes per slotchange_size=1000*1024# size of changes in data files in KBchange_pages=change_size//8# number of changed pages in DBdb_pages=22*1024*1024//8# 22 GB in pagesproba=col_proba(change_pages,map_pages,1)print("Collision probability: ",proba)print("Extra data (MB): ",proba*(db_pages-change_pages)*8/1024)Collisionprobability:0.030056617868655988Extradata (MB):647.0588694764261
Unfortunately this seems to be true and experiment confirms that we mark 1.6 GB of pages instead of 1 GB. Same experiment with 16 MB of map, 24 GB database and 1 GB insert: theory says there should be 1375 MB of extra blocks, in reality it appears to be ~1275 MB, which is very close.
Theoretically we can substantially reduce collisions rate by storing each page LSN into two cells (i.e.k = 2
):
Collision probability: 0.003505804615154038Extra data (MB): 75.47296175503612
Let's try to do that.