Hashing and Hash Tables, Trees & Binary Tree


Hashing

Suppose we want to design a system for storing employee records keyed using phone numbers. And we want following queries to be performed efficiently:
  •         Insert a phone number and corresponding information.
  •        Search a phone number and fetch the information.
  •      Delete a phone number and related information.

We can think of using the following data structures to maintain information about different phone numbers.
  •      Array of phone numbers and records.
  •       Linked List of phone numbers and records.
  •      Balanced binary search tree with phone numbers as keys.
  •        Direct Access Table.


Hashing is an improvement over Direct Access Table. The idea is to use hash function that converts a given phone number or any other key to a smaller number and uses the small number as index in a table called hash table.

Hash Function: A function that converts a given big phone number to a small practical integer value. The mapped integer value is used as an index in hash table. In simple terms, a hash function maps a big number or string to a small integer that can be used as index in hash table.

A good hash function should have following properties
1) Efficiently computable.
2) Should uniformly distribute the keys (Each table position equally likely for each key)

For example for phone numbers a bad hash function is to take first three digits. A better function is consider last three digits. Please note that this may not be the best hash function. There may be better ways.

There are many ways to hash a string into a key. The following are some of the important methods for constructing hash functions.

Folding

A folding hash code is produced by dividing the input into n sections of m bits, where 2^m is the table size, and using a parity-preserving bitwise operation like ADD or XOR, to combine the sections. The final operation is a mask or shift to trim off any excess bits at the high or low end. For example, for a table size of 15 bits and key value of 0x0123456789ABCDEF, there are 5 sections 0x4DEF, 0x1357, 0x159E, 0x091A and 0x8. Adding, we obtain 0x7AA4, a 15-bit value.

Mid-squares

A mid-squares hash code is produced by squaring the input and extracting an appropriate number of middle digits or bits. For example, if the input is 123,456,789 and the hash table size 10,000, squaring the key produces 1.524157875019e16, so the hash code is taken as the middle 4 digits of the 17-digit number (ignoring the high digit) 8750. The mid-squares method produces a reasonable hash code if there are not a lot of leading or trailing zeros in the key. This is a variant of multiplicative hashing, but not as good, because an arbitrary key is not a good multiplier.

Division hashing


A standard technique is to use a modulo function on the key, by selecting a divisor {\displaystyle M}M which is a prime number close to the table size, so {\displaystyle h(K)=K\mod M}{\displaystyle h(K)=K\mod M}. The table size is usually a power of 2. This gives a distribution from {\displaystyle \{0,M-1\}}{\displaystyle \{0,M-1\}}. This gives good results over a large number of key sets. A significant drawback of division hashing is that division is microprogrammed on most modern architectures including x86, and can be 10 times slower than multiply. A second drawback is that it won't break up clustered keys. For example, the keys 123000, 456000, 789000, etc. modulo 1000 all map to the same address. This technique works well in practice because many key sets are sufficiently random already, and the probability that a key set will be cyclical by a large prime number is small.

Hash Table: An array that stores pointers to records corresponding to a given phone number. An entry in hash table is NIL if no existing phone number has hash function value equal to the index for the entry.

Collision Handling: Since a hash function gets us a small number for a big key, there is possibility that two keys result in same value. The situation where a newly inserted key maps to an already occupied slot in hash table is called collision and must be handled using some collision handling technique. Following are the ways to handle collisions:
  • Chaining: The idea is to make each cell of hash table point to a linked list of records that have same hash function value. Chaining is simple, but requires additional memory outside the table.
  • Open Addressing: In open addressing, all elements are stored in the hash table itself. Each table entry contains either a record or NIL. When searching for an element, we one by one examine table slots until the desired element is found or it is clear that the element is not in the table.

Binary Tree

A tree whose elements have at most 2 children is called a binary tree. Since each element in a binary tree can have only 2 children, we typically name them the left and right child. Trees: Unlike Arrays, Linked Lists, Stack and queues, which are linear data structures, trees are hierarchical data structures. 

Tree Vocabulary: The topmost node is called root of the tree. The elements that are directly under an element are called its children. The element directly above something is called its parent. For example, ‘a’ is a child of ‘f’, and ‘f’ is the parent of ‘a’. Finally, elements with no children are called leaves.

Why using trees?

1. One reason to use trees might be because you want to store information that naturally forms a hierarchy. For example, the file system on a computer:
file system
-----------
     /    <-- root
  /      \
...       home
      /          \
   ugrad        course
    /       /      |     \
  ...      cs101  cs112  cs113  
2. Trees (with some ordering e.g., BST) provide moderate access/search (quicker than Linked List and slower than arrays).
3. Trees provide moderate insertion/deletion (quicker than Arrays and slower than Unordered Linked Lists).
4. Like Linked Lists and unlike Arrays, Trees don’t have an upper limit on number of nodes as nodes are linked using pointers.

Main applications of trees include:
  1. Manipulate hierarchical data.
  2. Make information easy to search (see tree traversal).
  3. Manipulate sorted lists of data.
  4. As a workflow for compositing digital images for visual effects.
  5. Router algorithms
  6. Form of a multi-stage decision-making (see business chess).


Binary Tree: A tree whose elements have at most 2 children is called a binary tree. Since each element in a binary tree can have only 2 children, we typically name them the left and right child. Binary Tree Representation in C: A tree is represented by a pointer to the topmost node in tree. If the tree is empty, then value of root is NULL.

A Tree node contains following parts.
  1. Data
  2. Pointer to left child
  3. Pointer to right child

Properties of Binary Trees

1) The maximum number of nodes at level ‘l’ of a binary tree is 2l-1.
Here level is number of nodes on path from root to the node (including root and node). Level of root is 1. This can be proved by induction.
For root, l = 1, number of nodes = 21-1 = 1
Assume that maximum number of nodes on level l is 2l-1
Since in Binary tree every node has at most 2 children, next level would have twice nodes, i.e. 2 * 2l-1
2) Maximum number of nodes in a binary tree of height ‘h’ is 2h – 1.
Here height of a tree is maximum number of nodes on root to leaf path. Height of a tree with single node is considered as 1. This result can be derived from point 2 above. A tree has maximum nodes if all levels have maximum nodes. So maximum number of nodes in a binary tree of height h is 1 + 2 + 4 + .. + 2h-1. This is a simple geometric series with h terms and sum of this series is 2h – 1.
In some books, height of the root is considered as 0. In this convention, the above formula becomes 2h+1 – 1
3) In a Binary Tree with N nodes, minimum possible height or minimum number of levels is  ? Log2(N+1) ?  
This can be directly derived from point 2 above. If we consider the convention where height of a leaf node is considered as 0, then above formula for minimum possible height becomes   ? Log2(N+1) ? – 1
4) A Binary Tree with L leaves has at least   ? Log2L ? + 1   levels
A Binary tree has maximum number of leaves (and minimum number of levels) when all levels are fully filled. Let all leaves be at level l, then below is true for number of leaves L.
   L   <=  2l-1  [From Point 1]
   l =   ? Log2L ? + 1 
   where l is the minimum number of levels. 
5) In Binary tree where every node has 0 or 2 children, number of leaf nodes is always one more than nodes with two children.
   L = T + 1
Where L = Number of leaf nodes
      T = Number of internal nodes with two children

Common types of Binary Trees

Full Binary Tree A Binary Tree is full if every node has 0 or 2 children. Following are examples of a full binary tree. We can also say a full binary tree is a binary tree in which all nodes except leaves have two children.
               18
           /       \  
         15         30  
        /  \        /  \
      40    50    100   40

             18
           /    \   
         15     20    
        /  \       
      40    50   
    /   \
   30   50

               18
            /     \  
          40       30  
                   /  \
                 100   40
In a Full Binary, number of leaf nodes is number of internal nodes plus 1
       L = I + 1
Where L = Number of leaf nodes, I = Number of internal nodes

Complete Binary Tree: A Binary Tree is complete Binary Tree if all levels are completely filled except possibly the last level and the last level has all keys as left as possible
Following are examples of Complete Binary Trees
               18
           /       \  
         15         30  
        /  \        /  \
      40    50    100   40


               18
           /       \  
         15         30  
        /  \        /  \
      40    50    100   40
     /  \   /
    8   7  9 

Perfect Binary Tree A Binary tree is Perfect Binary Tree in which all internal nodes have two children and all leaves are at the same level.
Following are examples of Perfect Binary Trees.
               18
           /       \  
         15         30  
        /  \        /  \
      40    50    100   40


               18
           /       \  
         15         30  
A Perfect Binary Tree of height h (where height is the number of nodes on the path from the root to leaf) has 2h – 1 node. Example of a Perfect binary tree is ancestors in the family. Keep a person at root, parents as children, parents of parents as their children.

Balanced Binary Tree
A binary tree is balanced if the height of the tree is O(Log n) where n is the number of nodes. For Example, AVL tree maintains O(Log n) height by making sure that the difference between heights of left and right subtrees is atmost 1. Red-Black trees maintain O(Log n) height by making sure that the number of Black nodes on every root to leaf paths are same and there are no adjacent red nodes. Balanced Binary Search trees are performance wise good as they provide O(log n) time for search, insert and delete.

A degenerate (or pathological) tree A Tree where every internal node has one child. Such trees are performance-wise same as linked list.
      10
      /
    20
     \
     30
      \
      40     

Source:

Comments

Popular posts from this blog

Circular Linked List, Doubly Linked list, and Circular Doubly Linked List

Binary Search Tree