HashingFileSystem

From Pickwiki
Jump to: navigation, search
Inside a MultiValue database, data is typically stored in hashed files. What this means is the file is divided into an equal number of groups or "buckets". The number of buckets is called the modulo and the size of each bucket is called the separation. When a basic program does
READ REC FROM MYFILE, CUST.NO
the database engine does the following:
  1. Take the key field, CUST.NO, and produces an offset into the file (a classic PICK system does this by adding together the ASCII values of the characters in the key, see [1] and [3] for a description).
  2. Adds the offset to the base address of the file (calculated when the file was opened) and then reads that bucket and checks to see if the key is there. If we have attempted to store too many records in one bucket, the additional records are stored in "overflow" and each overflow bucket is part of linked list. In this case the list is traversed reading each overflow bucket looking for the key. This is even slower than it sounds.

The choice of modulo is very important - if the number of buckets is too small, the file will very quickly go into overflow, and a read of one record will typically take many physical reads. If the number of buckets is too large, disk space will be wasted. Additionally, if the modulo if a muliple of 2 or 5, the hashing algorithm can break down, and you have the worst of both worlds - lots of overflown groups and lots of wasted space.

[2] has some good descriptions of the effects of different choices of modulo, though the conclusion not to use a prime number goes against orthodoxy :-)

Links


[1] http://hometown.aol.com/mbtexts/195.html#28

[2] http://hometown.aol.com/mbtexts/183.html

[3] Approximate code for the calculation, from Susan Lynch:

S = 0
L = LEN(KEY)
FOR I = 1 TO L
    S = S*10 + SEQ(KEY[I,1])
NEXT I
GROUP.NUMBER = MOD(S,file.modulo)