A data structure question, any suggestions on what i should use?

Sharky Forums


Results 1 to 2 of 2

Thread: A data structure question, any suggestions on what i should use?

  1. #1
    Mako Shark Mancora's Avatar
    Join Date
    Jun 2001
    Location
    The other side of where the fishes swim
    Posts
    4,313

    A data structure question, any suggestions on what i should use?

    ===========
    The problem
    ===========

    I’m whipping up a character frequency counting program, in final form we're going to feed it text files. Its not only going to count single character combinations but also double, triple and quadruple.

    Here’s an example of how it we want to chop up the data

    ie we're given the string
    "See Bob run."

    The trigrams created from that would look like
    Code:
    "See" "ee " "e B" " Bo" "Bob" "ob " "b r" " ru" "run" "un."
    So if we run it though say, 30MB of text files we're going to end up with a _lot_ of strings, and notice how with that sentance, while its only 12 chars long, the resulting trigrams total some 30 characters, so for small files we will be ending up with more character combinations than there are characters in the original string.

    This wont be true for larger files because as we find more combinations it will get progressively harder to find new combinations.


    I _could_ make this a lot smaller by basically saying, look theres only some 100 odd characters that can show up in a written text file, lets not store entire strings and just use some encoding scheme to turn a word into a number and store that number. Thing is I want to make this useable on unicode so I could push foreign languages through this thing, and that’s waaay to many combinations.


    ===========
    Actual data storage ideas
    ===========



    Since I don’t want to start off by storing every single character combination (the trigrams would be some 2 megs, and lets not even think about unicode [some 32 terabytes]) arrays are out for this data storage.

    So I’m thinking the structure is limited to something that uses some kind of linked nodes which will let me add combinations as I find them.

    Due to the size (which my guestamate puts at around 50k-100k items) I don’t want to have O(n) insertion times so a standard fair linked list is out. Which makes me tempted to go with something like a binary tree, a self balancing binary tree (something like AVL or red black), or a skip list and I think all of those should have around log2(n) insertion and search times.


    One other idea we were just toying with was to express this as some type of tree, where the first node points to say any other character in the alphabet, ie “abc” and “abd” the first two nodes, “a” and “b” would overlap. The upshot of this idea is that my memory imprint would be _probably_ be smaller than if I stored entire strings because I’m using the same node twice. I still have to do a memory analysis of this to see at what point of overlap it makes sense to use this system.

    I’d look something like this (the numbers are how many times that character has occured)
    [for implementing this my node would probably consist of an int counter, a char, and a linked list pointing to the next nodes (not an array since that could be a significant waste of space, especially if its unicode)]





    ======================

    Since I’m going over data structures and stuff now in school I‘m wondering if maybe some of you had come across some bright ideas for how to store this kind of stuff, or if maybe there’s a data structure sitting out there that I don’t know about that would fit this situation a lot better.

  2. #2
    Tiger Shark UmneyDurak's Avatar
    Join Date
    Apr 2004
    Location
    Escaped from 2nd floor of Soda Hall.
    Posts
    671
    Hmm maybe B+ Tree? Check out this link: B+ Tree Look Under tree structured indexing. I think it can be adjusted to store the data.
    Another option might some kind of variation of Huffman encoding scheme? I think it can be used to just count the instances of each combination.
    (\__/)
    (='.'=)
    (")_(")
    Last edited by UmneyDurak; 09-24-2005 at 12:18 AM.
    01+01=10

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •