Hash Table:Chaining

Posted by Chiu CC on 5 25, 2016


先備知識與注意事項

本篇文章將延續Hash Table:Intro(簡介)的議題,介紹Chaining來解決Collision

其中將會用到Linked list的概念,若不熟悉請參考:


目錄


Chaining的概念

如果利用Division Method實作Hash Function:

$$ h(Key)=Key\bmod m $$

若選擇\(m=6\),那麼對於Key\(14,16,26,36,47,50,71,90\)的item,進行Hashing後將會有如圖一的Collision發生。

解決的辦法,就是將被分配到同一個slot的item用Linked list串起來,這就是Chaining

cc

圖一:。

有了Linked list處理被分配到同ㄧ個slot的item,Hash Table的三項資料處理分別修正成:

Insert

  • 先利用Hash Function取得Tableindex
  • 接著,只要在每一個slot的list之front加入item,即可保證在\(O(1)\)的時間複雜度完成。

Search

  • 先利用Hash Function取得Tableindex
  • 再利用Linked list的traversal尋找item。

Delete

  • 先利用Hash Function取得Tableindex
  • 再利用Linked list的traversal尋找欲刪除的item。

關於SearchDelete的時間複雜度:

worst case\(O(n)\),所有item都被很遜的Hash Function分配到同一個slot。

average case\(O(1+\alpha)\),其中\(\alpha=\frac{n}{m}\)稱為load factor,其物理意義為:

  • 「資料數量(\(n\))」與「slot個數(\(m\))」的比例。
  • 也會是list的expected length(list的平均長度)。

以圖二為例,若\(m=3,n=6\),那麼\(\alpha=n\m=2\),也就是「平均」每個list會被分配到\(\alpha=2\)個item,所以Search的時間複雜度就是list長度\(O(\alpha)\),再加上Hash Function的時間複雜度\(O(1)\),便得到\(O(1+\alpha)\)

cc

圖二:以上四種皆為可能的情形,其中右下圖為平均的/期望值結果。

若「資料數量(\(n\))」與「slot個數(\(m\))」的比例具有\(n=O(m)\)的關係,再加上一個不會把所有item分配到同一個slot的正常Hash Function,那麼可以想像,SearchDelete的時間可以在接近

$$ O(1+\alpha)=O(1+constant)=O(1) $$

的情況下完成。
詳細證明請參考:Adnan Aziz:Hash Tables


程式碼

以下提供兩份基本的Hash Table實作方法:

第一份用標準模板函式庫(STL)的std::vector<std::list<struct>>處理Hash Table和Chaining
重點放在:Insert()Delete()Search()Prehashing()上。

第二份很老實地用pointer串出Linked list,重點將放在TableDoubling()TableShrinking()Rehashing()上。


偷懶:使用STL

以下範例程式碼包含了:

struct dict為自定義的dictionary,其中keyvalue皆為std::string

class HashChain_std表示以std::vector<std::list<dict>>建立Hash Table,其中包含了:

  • Insert()Delete()Search()DisplayTable()等基本函式。
    • 涉及Linked list中的push_front()traversal
    • traversalstd::list::iterator處理。
  • PreHashing():目的是把資料形態為string的key轉換成int。範例程式定義成:
    • 挑選一個正整數作為「底數(\(a\))」(稍後將進行指數運算),將字串先經由ASCII編碼轉換成正整數,再乘上\(a^{index}\)
    • 若字串key為「Jordan」,\(a=9\),便轉換出
      \(ASCII(J)\times9^5+ASCII(o)\times9^4+ASCII(r)\times9^3+ASCII(d)\times9^2+ASCII(a)\times9^1+ASCII(n)\times9^0\\ =74\times9^5+111\times9^4+114\times9^3+100\times9^2+97\times9+110\\ =5190086\)
  • HashFunction():此處使用Division method,將PreHashing()得到的整數除以Table大小(\(m\))後取餘數。

main()中以「假設NBA官網要建立球員資料,記錄每個球員所屬的球隊」為例示範Hash Table應用。

// C++ code
#include <iostream>
#include <vector>
#include <list>
#include <string>

using std::vector;
using std::list;
using std::string;
using std::cout;
using std::endl;

struct dict{                        // self-defined dictionary
    string key;                     //  key  for Name (eg:Jordan)
    string value;                   // value for Team (eg:Bulls)
    dict():key(""),value(""){};     
    dict(string Key, string Value):key(Key),value(Value){};  
};

class HashChain_std{
private:
    int size,                // size of table
        count;               // count: number of data

    vector<list<dict> > table;            // hash table with linked list

    int PreHashing(string key_str);       // turn string_type_key to int_type_key
    int HashFunction(string key_str);     // using Division method

public:
    HashChain_std(){};                       
    HashChain_std(int m):size(m),count(0){   
        table.resize(size);               // allocate memory for each slot
    }

    void Insert(dict data);               
    void Delete(string key);              
    string Search(string key);            
    void DisplayTable();                  
};

string HashChain_std::Search(string key_str){
    // two steps: 1. get index from hash function
    //            2. traversal in linked list
    int index = HashFunction(key_str);
    for (list<dict>::iterator itr = table[index].begin(); itr != table[index].end(); itr++) {
        if ((*itr).key == key_str) {
            return (*itr).value;
        }
    }
    return "...\nno such data";
}

void HashChain_std::Delete(string key_str){
    // two steps: 1. get index from hash function
    //            2. traversal in linked list
    int index = HashFunction(key_str);
    for (list<dict>::iterator itr = table[index].begin(); itr != table[index].end(); itr++) {
        if ((*itr).key == key_str) {
            table[index].erase(itr);
        }
    }
}

void HashChain_std::Insert(dict data){
    // two steps: 1. get index from hash function
    //            2. insert data at the front of linked list
    int index = HashFunction(data.key);
    table[index].push_front(data);             
}

int HashChain_std::PreHashing(string key_str){
    // if   key_str = Jordan, exp = 9
    // then key_int = ASCII(J)*9^5+ASCII(o)*9^4+ASCII(r)*9^3
    //               +ASCII(d)*9^2+ASCII(a)*9^1+ASCII(n)*9^0

    int exp = 9,        // choose randomly 
        key_int = 0,
        p = 1;

    for (int i = (int)key_str.size()-1; i >= 0; i--) {
        key_int += key_str[i]*p;
        p *= exp;
    }
    return key_int;
}

int HashChain_std::HashFunction(string key_str){

    return (PreHashing(key_str) % this->size);     // Division method
}

void HashChain_std::DisplayTable(){

    for (int i = 0; i < table.size(); i++) {
        cout << "slot#" << i << ": ";
        for (list<dict>::iterator itr = table[i].begin(); itr != table[i].end(); itr++) {
            cout << "(" << (*itr).key << "," << (*itr).value << ") ";
        }
        cout << endl;
    }
    cout << endl;
}

int main() {

    HashChain_std hash(5);
    hash.Insert(dict("T-Mac","Magic"));
    hash.Insert(dict("Bryant","Lakers"));
    hash.Insert(dict("Webber","Kings"));
    hash.Insert(dict("Arenas", "Wizards"));
    hash.Insert(dict("Davis","Clippers"));
    hash.Insert(dict("Kidd","Nets"));
    hash.DisplayTable();

    cout << "T-Mac is in " << hash.Search("T-Mac") << ". " << endl;
    cout << "Arenas is in " << hash.Search("Arenas") << ". " << endl;

    hash.Delete("Kidd");
    hash.Delete("T-Mac");
    cout << "\nAfter deleing Kidd and T-Mac:\n";
    hash.DisplayTable();

    return 0;
}

output:

slot#0: (Kidd,Nets) (Bryant,Lakers)
slot#1: (Arenas,Wizards)
slot#2: (Webber,Kings)
slot#3: (T-Mac,Magic)
slot#4: (Davis,Clippers)

T-Mac is in Magic.
Arenas is in Wizards.

After deleing Kidd and T-Mac:
slot#0: (Bryant,Lakers)
slot#1: (Arenas,Wizards)
slot#2: (Webber,Kings)
slot#3:
slot#4: (Davis,Clippers)


不偷懶:用pointer串出Linked list

以下的範例程式碼紮紮實實用pointer串出Linked list,其中的Insert()Delete()Search()DisplayTable()與上一小節大同小異,只是要加入Linked list的手法(改變pointer指向)。

HashFunction()採取Multiplication method,詳細討論請參考:Hash Table:Intro(簡介)/Multiplication Method

比較酷的是因應load factor(\(\alpha=\frac{n}{m}\))改變Table大小,以及改變之後把node從舊的Table搬到新的Table的Rehashing()

TableDoubling():當\(\alpha=\frac{n}{m}>1\)時,表示資料量大於slot數量,就把Table大小\(m\)加倍(並配置\(m\)加倍後的Table),如此在理論上可以儘量避免Collision發生,增加搜尋資料的效率。

TableShrinking():當\(\alpha=\frac{n}{m}<\frac{1}{4}\)時,表示資料量減少到Table大小\(m\)\(\frac{1}{4}\),就把Table大小\(m\)減半(並配置\(m\)減半後的Table),以節省記憶體空間。

  • 為什麼選擇\(\alpha=\frac{n}{m}<\frac{1}{4}\)而不是\(\alpha=\frac{n}{m}<\frac{1}{2}\)
    為了避免資料量在「臨界點」增增減減,造成不斷地動態配置記憶體(成本相當高)。
  • 例如:起初\(n=8,m=8\),現增加一筆資料,變成\(n=9,m=8\),將會觸發一次TableDoubling(),變成\(n=9,m=16\)
    若接下來連續刪除兩筆資料,變成\(n=7,m=16\),因為\(\frac{n}{m}<\frac{1}{2}\),將會觸發一次TableShrinking(),變成\(n=7,m=8\)
    若接下來又連續增加兩筆資料,又將觸發一次TableDoubling()...依此類推,為了避免這種不斷配置記憶體的情況發生,寧可犧牲一點記憶體空間,等到\(\frac{n}{m}<\frac{1}{4}\)再觸發TableShrinking(),重新為Table配置新的記憶體位置。

Rehashing():當TableDoubling()/TableShrinking()增加/減半Table大小\(m\)後,需要把舊的Table上的資料(node)搬到新的Table上,過程將會透過Hash Function根據各筆資料的key重新分配一次index(因此稱為Rehashing),此index即為資料在新的Table上的位置,如圖三。

  • 範例程式碼採取直接改變node之pointer的做法,不另外配置新的記憶體空間。

cc

圖三:。

main()以「唱片行老闆想要以編號來整理各種樂風(genre)」示範Hash Table。

// C++ code
#include <iostream>
#include <vector>
#include <string>
#include <math.h>       // floor()

using std::vector;
using std::string;
using std::cout;
using std::endl;

struct Node{
    int key;                    // number 
    string value;               // genre
    Node *next;                 // pointer to remember memory address of next node

    Node():key(0),value(""),next(0){};
    Node(int Key, string Value):key(Key),value(Value),next(0){};
    Node(Node const &data):key(data.key),value(data.value),next(data.next){};
};

class HashChainNode{
private:
    int size,                 // size: size of table, count: number of data
        count;                // count/size = load factor
    Node **table;             // pointer to pointer, hash table  

    int HashFunction(int key);      // Multiplication method
    void TableDoubling();           
    void TableShrinking();          
    void Rehashing(int size_orig);  

public:
    HashChainNode(){};
    HashChainNode(int m):size(m),count(0){
        table = new Node *[size];           // allocate the first demension of table
        for (int i = 0; i < size; i++) {    // initialization
            table[i] = 0;                   // ensure every slot points to NULL
        }
    }
    ~HashChainNode();

    void Insert(Node data);         // consider TableDoubling()
    void Delete(int key);           // consider TableShrinking()
    string Search(int key);
    void DisplayTable();
};

void HashChainNode::Insert(Node data){

    count++;
    if (count > size) {         // consider load factor
        TableDoubling();        // if n/m > 1, then double the size of table
    }

    int index = HashFunction(data.key);   // get index of slot
    Node *newNode = new Node(data);       // create new node to store data

    // push_front()
    if (table[index] == NULL) {           // eg: list: (empty), add4
        table[index] = newNode;           // eg: list: 4->NULL
    }
    else {                                // eg: list: 5->9->NULL  , add 4
        Node *next = table[index]->next;  //     list: 5->4->9->NULL
        table[index]->next = newNode;
        newNode->next = next;
    }
}

void HashChainNode::Delete(int key){

    int index = HashFunction(key);        // get index of slot
    Node *current = table[index],         // use two pointer for traversal in list
         *previous = NULL;

    while (current != NULL && current->key != key) {  
        previous = current;               // traversal in list, 3 cases:
        current = current->next;          // 1. data not found
    }                                     // 2. data found at first node in list
                                          // 3. data found at other position in list

    if (current == NULL) {                    // eg: list:5->2->9->NULL, want to delete 3
        cout << "data not found.\n\n";
        return;
    }
    else {
        if (previous == NULL) {               // eg: list:5->2->9->NULL, want to delete 5
            table[index] = current->next;     // after deleting 5, list:2->9->NULL
        }                                     // current points to 5     

        else {                                // eg: list:5->2->9->NULL, want to delete 2
            previous->next = current->next;   // after deleting 2, list:5->9->NULL
        }                                     // current points to 2
        delete current;
        current = 0;
    }

    count--;
    if (count < size/4) {      // consider load factor
        TableShrinking();      // if n/m < 4, then shrink the table
    }
}

string HashChainNode::Search(int key){

    int index = HashFunction(key);      // get index of slot
    Node *current = table[index];       // current points to the first node in list

    while (current != NULL) {           // traversal in list
        if ( current->key == key) {
            return current->value;
        }
        current = current->next;
    }
    return "...\nno such data";
}

int HashChainNode::HashFunction(int key){
    // Multiplication method
    double A = 0.6180339887,
           frac = key*A-floor(key*A);
    return floor(this->size*frac);
}

void HashChainNode::TableDoubling(){

    int size_orig = size;    // size_orig represents the original size of table
    size *= 2;               // double the size of table
    Rehashing(size_orig);;   // create new table with new larger size
}

void HashChainNode::TableShrinking(){

    int size_orig = size;    // size_orig represents the original size of table
    size /= 2;               // shrink the size of table
    Rehashing(size_orig);    // create new table with new smaller size
}

void HashChainNode::Rehashing(int size_orig){

    Node **newtable = new Node *[size];    // allocate memory for new table
    for (int i = 0; i < size; i++) {       // initializetion 
        newtable[i] = 0;                   // ensure every node in slot points to NULL
    }

    for (int i = 0; i < size_orig; i++) {  // visit every node in the original table

        Node *curr_orig = table[i],        // curr_orig: current node in original table
             *prev_orig = NULL;            // prev_orig: following curr_orig 

        while (curr_orig != NULL) {      // traversal in list of each slot in original table

            prev_orig = curr_orig->next; // curr_orig will be directly move to new table
                                         // need prev_orig to keep pointer in original table

            int index = HashFunction(curr_orig->key);    // get index of slot in new table

            // push_front(), do not allocate new memory space for data
            // directly move node in original table to new table
            if (newtable[index] == NULL) {       // means newtable[index] is empty
                newtable[index] = curr_orig;
                newtable[index]->next = 0;       // equivalent to curr_orig->next = 0;
            }
            // if there is no initialization for newtable, segmentation faults might happen
            // because newtable[index] might not point to NULL 
            // but newtable[index] is empty
            else {                                   // if newtable[index] is not empty
                Node *next = newtable[index]->next;  // push_front()
                newtable[index]->next = curr_orig;
                curr_orig->next = next;
            }
            curr_orig = prev_orig;          // visit the next node in list in original table
        }
    }
    delete [] table;               // release memory of original table
    this->table = newtable;        // point table of object to new table
}

HashChainNode::~HashChainNode(){

    for (int i = 0; i < size; i++) {    // visit every node in table 
                                        // and release the memory of each node 
        Node *current = table[i];       // point *current to first node in list
        while (current != NULL) {       // traversal in list
            Node *previous = current;
            current = current->next;
            delete previous;
            previous = 0;
        }
    }
    delete [] table;
}

void HashChainNode::DisplayTable(){

    for (int i = 0; i < size; i++) {    // visit every node in table 
        cout << "#slot#" << i << ": ";
        Node *current = table[i];
        while (current != NULL) {
            cout << "(" << current->key << "," << current->value << ") ";
            current = current->next;
        }
        cout << endl;
    }
    cout << endl;
}

int main(){

    HashChainNode hash(2);

    hash.Insert(Node(12,"post rock"));
    hash.Insert(Node(592,"shoegaze"));
    cout << "After inserting key(12),key(592):\n";
    hash.DisplayTable();
    hash.Insert(Node(6594,"blues"));        // evoke TableDoubling()
    cout << "After inserting key(6594), evoke TableDoubling():\n";
    hash.DisplayTable();
    hash.Insert(Node(7,"folk"));
    cout << "After inserting key(7):\n";
    hash.DisplayTable();
    hash.Insert(Node(123596,"hiphop"));     // evoke TableDoubling()
    cout << "After inserting key(123596), evoke TableDoubling():\n";
    hash.DisplayTable();
    hash.Insert(Node(93,"soul"));
    hash.Insert(Node(2288,"indie"));
    hash.Insert(Node(793,"jazz"));
    cout << "After inserting key(93),key(2288),key(793):\n";
    hash.DisplayTable();
    hash.Insert(Node(8491,"electro"));      // evoke TableDoubling()
    cout << "After inserting key(8491), evoke TableDoubling():\n";
    hash.DisplayTable();
    hash.Insert(Node(323359,"pop"));
    cout << "After inserting key(323359):\n";
    hash.DisplayTable();

    cout << "Searching: genre(8491) is " << hash.Search(8491) << ".\n\n";
    cout << "Searching: genre(7) is " << hash.Search(7) << ".\n\n";

    hash.Delete(7);
    cout << "After deleting key(7):\n";
    cout << "Searching: genre(7) is " << hash.Search(7) << ".\n\n";

    hash.Delete(592);
    cout << "After deleting key(592):\n";
    hash.DisplayTable();

    cout << "Want to  delete key(592) again:\n";
    hash.Delete(592);

    hash.Delete(123596);
    hash.Delete(323359);
    hash.Delete(793);
    hash.Delete(93);
    cout << "After deleting key(123596),key(323359),key(793),key(93):\n";
    hash.DisplayTable();

    hash.Delete(6594);      // evoke TableShrinking()
    cout << "After deleting key(6594), evoke TableShrinking():\n";
    hash.DisplayTable();

    return 0;
}

output:

After inserting key(12),key(592):
#slot#0: (12,post rock)
#slot#1: (592,shoegaze)

After inserting key(6594), evoke TableDoubling():
#slot#0:
#slot#1: (12,post rock) (6594,blues)
#slot#2:
#slot#3: (592,shoegaze)

After inserting key(7):
#slot#0:
#slot#1: (12,post rock) (7,folk) (6594,blues)
#slot#2:
#slot#3: (592,shoegaze)

After inserting key(123596), evoke TableDoubling():
#slot#0:
#slot#1:
#slot#2: (7,folk) (6594,blues)
#slot#3: (12,post rock)
#slot#4: (123596,hiphop)
#slot#5:
#slot#6:
#slot#7: (592,shoegaze)

After inserting key(93),key(2288),key(793):
#slot#0: (2288,indie) (793,jazz)
#slot#1:
#slot#2: (7,folk) (6594,blues)
#slot#3: (12,post rock) (93,soul)
#slot#4: (123596,hiphop)
#slot#5:
#slot#6:
#slot#7: (592,shoegaze)

After inserting key(8491), evoke TableDoubling():
#slot#0: (2288,indie)
#slot#1: (793,jazz)
#slot#2:
#slot#3:
#slot#4:
#slot#5: (7,folk) (6594,blues)
#slot#6: (12,post rock)
#slot#7: (93,soul)
#slot#8: (123596,hiphop)
#slot#9:
#slot#10:
#slot#11: (8491,electro)
#slot#12:
#slot#13:
#slot#14: (592,shoegaze)
#slot#15:

After inserting key(323359):
#slot#0: (2288,indie)
#slot#1: (793,jazz)
#slot#2:
#slot#3:
#slot#4:
#slot#5: (7,folk) (6594,blues)
#slot#6: (12,post rock)
#slot#7: (93,soul)
#slot#8: (123596,hiphop)
#slot#9:
#slot#10:
#slot#11: (8491,electro)
#slot#12:
#slot#13: (323359,pop)
#slot#14: (592,shoegaze)
#slot#15:

Searching: genre(8491) is electro.

Searching: genre(7) is folk.

After deleting key(7):
Searching: genre(7) is ...
no such data.

After deleting key(592):
#slot#0: (2288,indie)
#slot#1: (793,jazz)
#slot#2:
#slot#3:
#slot#4:
#slot#5: (6594,blues)
#slot#6: (12,post rock)
#slot#7: (93,soul)
#slot#8: (123596,hiphop)
#slot#9:
#slot#10:
#slot#11: (8491,electro)
#slot#12:
#slot#13: (323359,pop)
#slot#14:
#slot#15:

Want to  delete key(592) again:
data not found.

After deleting key(123596),key(323359),key(793),key(93):
#slot#0: (2288,indie)
#slot#1:
#slot#2:
#slot#3:
#slot#4:
#slot#5: (6594,blues)
#slot#6: (12,post rock)
#slot#7:
#slot#8:
#slot#9:
#slot#10:
#slot#11: (8491,electro)
#slot#12:
#slot#13:
#slot#14:
#slot#15:

After deleting key(6594), evoke TableShrinking():
#slot#0: (2288,indie)
#slot#1:
#slot#2:
#slot#3: (12,post rock)
#slot#4:
#slot#5: (8491,electro)
#slot#6:
#slot#7:


以上是以Chaining解決Collision之介紹。



參考資料:


Hash Table系列文章

Hash Table:Intro(簡介)
Hash Table:Chaining
Hash Table:Open Addressing

回到目錄:

目錄:演算法與資料結構