Sorting over 1 million numbers

Sharky Forums


Page 1 of 3 123 LastLast
Results 1 to 15 of 33

Thread: Sorting over 1 million numbers

  1. #1
    Expensive Sushi
    Join Date
    Apr 2001
    Location
    Pittsburgh, PA
    Posts
    20

    Question Sorting over 1 million numbers

    I was wondering how you would be able to sort over 1 million numbers all random. They are all in 1 file and have to be made into another file. This is using C++.

    I cant use an array because an array of 100,000,000 would be just oo big so would a linked list. Somehow I was thinking about sorting 1000 at a time then sorting those over and over again like the mergesort but im not sure.

    any help?

    Bryan

    ------------------
    "Just remember the shark
    that bit you is the one
    gets to have a head full
    of lead later on"
    "Just remember the shark
    that bit you is the one
    gets to have a head full
    of lead later on"

  2. #2
    Ursus Arctos Moderatis Grizzly's Avatar
    Join Date
    Sep 2000
    Location
    Providence, RI USA
    Posts
    3,077

    Post

    The fastest way to sort and order large amounts of data is to organize it into a Binary Tree structure. Doing an insertion sort on a Binary tree is 100x quicker than any sort of sequential insertion sort.

    If you're unsure, or are unfamiliar with Binary Tree's, than you should definitely look into it. Not only is it the best method to sort 1 million units, but it's essential the *only* way to do it with any reasonable speed.

    Hope that helps

  3. #3
    Expensive Sushi
    Join Date
    Feb 2001
    Posts
    34

    Post

    Originally posted by Grizzly:
    The fastest way to sort and order large amounts of data is to organize it into a Binary Tree structure. Doing an insertion sort on a Binary tree is 100x quicker than any sort of sequential insertion sort.

    If you're unsure, or are unfamiliar with Binary Tree's, than you should definitely look into it. Not only is it the best method to sort 1 million units, but it's essential the *only* way to do it with any reasonable speed.

    Hope that helps
    Uhm.
    No.
    This almost sounds like heapsort.

    However, quicksort will outperform heapsort and will definitely outperform repeated insertions into a binary tree with an inorder traversal.
    to use c++'s stl, include <algorithm> and call sort()



  4. #4
    Hammerhead Shark
    Join Date
    Oct 2000
    Location
    Toronto, Canada
    Posts
    1,493

    Post

    You see why the binary tree might not work because he sort over a million number and it might too big to fit into memory, since we need to keep all those number, plus all those pointers in the binary tree.

    I think the original thought of using merge sort is probably the best way to do it.

    It is the same merge sort algorithm you learnt at school except when you divide your list into 2 arrays, you divide them into 2 files.

    PS If memory is a problem, you might initially divide your file into multiple files.

    ------------------
    Home Athlon600Mhz@600Mhz, Asus K7M, 128MB SDRAM, RivaTNT2 32MB, Viewsonic A90 19", iMac Greenish Case, Logitech SR30, Kodak DC240i (Bluesberry) Digital Camera
    Office PIII700Mhz@700Mhz, 192MB SDRAM, NVidia RivaTNT2 32MB, IBM P96 19"

    uwcdc.com or namgor.com
    DHAHL3seasons GP:73 G:121 A:55 Pts:176 GWG:12 +/-:184
    UWSWA6seasons GP:41 G:53 A:46 Pts:99 GWG:5 +/-:-25
    MCBHL3seasons GP:14 G:20 A:8 Pts:28 GWG:4 +/-:19

    uwcdc.com or monkis.com

  5. #5
    Reef Shark
    Join Date
    Oct 2000
    Posts
    397

    Smile

    Originally posted by Snakex:
    I was wondering how you would be able to sort over 1 million numbers all random. They are all in 1 file and have to be made into another file. This is using C++.

    I cant use an array because an array of 100,000,000 would be just oo big so would a linked list. Somehow I was thinking about sorting 1000 at a time then sorting those over and over again like the mergesort but im not sure.

    any help?

    Bryan

    Mergesort is definitely the way to go. In fact, you can find algorithms that use a small memory footprint by leaving the temporary results in files.

    The correct method is to sort in multiple chunks of the cluster (sector) size of the disk. First sort (using any algorithm) one cluster size of data in memory and storing the partial results in a file. Then mergesort in the file by 2 cluster sizes, then 4, etc.

  6. #6
    Expensive Sushi
    Join Date
    Feb 2001
    Posts
    34

    Post

    Originally posted by Conrad Song:
    Mergesort is definitely the way to go. In fact, you can find algorithms that use a small memory footprint by leaving the temporary results in files.

    The correct method is to sort in multiple chunks of the cluster (sector) size of the disk. First sort (using any algorithm) one cluster size of data in memory and storing the partial results in a file. Then mergesort in the file by 2 cluster sizes, then 4, etc.
    Yeah, maybe in archaic systems w/o virtual memory.
    And mergesort is not in-place, compounding any space requirements you already have.


  7. #7
    Ursus Arctos Moderatis Grizzly's Avatar
    Join Date
    Sep 2000
    Location
    Providence, RI USA
    Posts
    3,077

    Post

    I stand corrected! My limited knowledge of C++ sorting has shown through I didn't even consider attempting to store 1 million numbers in memory, all having links to another. That would definitely....ummm, suck big time!

    Well, I'm glad to see we got some real programming Guru's here now. Wasn't the case a few months ago, outside of a few veterans. Your contributions to the forums are most welcome. I'll most likely have a question or two in the future

  8. #8
    Reef Shark
    Join Date
    Oct 2000
    Posts
    397

    Smile

    The fastest database programs make no assumptions about the availability of memory. Mergesort is ideal for situations where memory is limited, but disk space is abundant. Under these conditions, a custom merge sort is significantly faster than those which rely on the O.S. to provide virtual memory.

    Of course, if you don't really care, relying on virtual memory is certainly the simplest solution.

    Originally posted by sjelkjd:
    Yeah, maybe in archaic systems w/o virtual memory.
    And mergesort is not in-place, compounding any space requirements you already have.
    [/B]

  9. #9
    Expensive Sushi
    Join Date
    Feb 2001
    Posts
    34

    Post

    Originally posted by Conrad Song:
    The fastest database programs make no assumptions about the availability of memory. Mergesort is ideal for situations where memory is limited, but disk space is abundant. Under these conditions, a custom merge sort is significantly faster than those which rely on the O.S. to provide virtual memory.

    Of course, if you don't really care, relying on virtual memory is certainly the simplest solution.

    Mergesort is not in place!!!!!
    Therefore, it uses more memory than an in-place sorting routine, ie heap/quicksort.

    Additionally, the pivoting done with quicksort could easily be distributed into files, just as you are proposing with this modified mergesort.

  10. #10
    Expensive Sushi
    Join Date
    Apr 2001
    Location
    Pittsburgh, PA
    Posts
    20

    Wink

    Thanks guys, i will try the mergesort when i get to school tommorow.

    By the way if you want to know more about what I have to work with. I have like a 5 meg filespace on my student server and the computer sall run windows NT.

    Bryan

    ------------------
    "Just remember the shark
    that bit you is the one
    gets to have a head full
    of lead later on"
    "Just remember the shark
    that bit you is the one
    gets to have a head full
    of lead later on"

  11. #11
    Reef Shark
    Join Date
    Oct 2000
    Posts
    397

    Post

    Understood, but the memory usage is not significantly greater than sorting which takes place in virtual memory, since this is disk as well!

    Also mergesort has the property of being an in-order sort.

    Your idea for quicksort is interesting. I will have to try it sometime.

    Originally posted by sjelkjd:
    Mergesort is not in place!!!!!
    Therefore, it uses more memory than an in-place sorting routine, ie heap/quicksort.

    Additionally, the pivoting done with quicksort could easily be distributed into files, just as you are proposing with this modified mergesort.

  12. #12
    Hammerhead Shark
    Join Date
    Oct 2000
    Location
    Toronto, Canada
    Posts
    1,493

    Post

    Well the idea of merge sort and quick sort are both partitioning them into pieces and use the idea divide-and-conqeur. In here, we divide them into files. Merge sort is in-place and has the advantage of easier to program, since you divide the partition always into 2 (or /2-1 )

    [edit for spelling err]

    ------------------
    Home Athlon600Mhz@600Mhz, Asus K7M, 128MB SDRAM, RivaTNT2 32MB, Viewsonic A90 19", iMac Greenish Case, Logitech SR30, Kodak DC240i (Bluesberry) Digital Camera
    Office PIII700Mhz@700Mhz, 192MB SDRAM, NVidia RivaTNT2 32MB, IBM P96 19"

    uwcdc.com or namgor.com

    [This message has been edited by namgor (edited May 23, 2001).]
    DHAHL3seasons GP:73 G:121 A:55 Pts:176 GWG:12 +/-:184
    UWSWA6seasons GP:41 G:53 A:46 Pts:99 GWG:5 +/-:-25
    MCBHL3seasons GP:14 G:20 A:8 Pts:28 GWG:4 +/-:19

    uwcdc.com or monkis.com

  13. #13
    Expensive Sushi
    Join Date
    Feb 2001
    Posts
    34

    Post

    Originally posted by Conrad Song:
    Understood, but the memory usage is not significantly greater than sorting which takes place in virtual memory, since this is disk as well!

    Also mergesort has the property of being an in-order sort.

    Your idea for quicksort is interesting. I will have to try it sometime.

    Well, if you go and set all your file sizes to be equal to disk cluster/ram frame size you're just emulating virtual memory. Why not let the OS do it?(Since in this case access is simple enough that the generic model is perfectly appropriate)

    Second, I said mergesort is not in place.

    You need to allocate a whole new array everytime you merge - ie, you can't do it in the same array it originally occupied.


  14. #14
    Expensive Sushi
    Join Date
    Feb 2001
    Posts
    34

    Post

    Originally posted by namgor:
    Well the idea of merge sort and quick sort are both partitioning them into pieces and use the idea divide-and-conqeur. In here, we divide them into files. Merge sort is in-place and has the advantage of easier to program, since you divide the partition always into 2 (or /2-1 )

    [edit for spelling err]

    Merge sort is not in place.
    (last step of mergesort)
    A[0]=2
    A[1]=3
    A[2]=4

    A[3]=1
    A[4]=2
    A[5]=3

    Try merging these two lists in place.
    Secondly, how is this tough code?

    #include <algorithm>
    int main()
    {
    int a[N]=...
    std::sort<int*>(a,a+N);
    }

    [This message has been edited by sjelkjd (edited May 23, 2001).]

  15. #15
    Hammerhead Shark
    Join Date
    Oct 2000
    Location
    Toronto, Canada
    Posts
    1,493

    Post

    Alright I see what you mean, but arent we going to put the elements into files at the first place? I thought dealing with a million number is the real problem, thats why we said we want to divide them into files, we aren't going to put them into array anyway!

    ------------------
    Home Athlon600Mhz@600Mhz, Asus K7M, 128MB SDRAM, RivaTNT2 32MB, Viewsonic A90 19", iMac Greenish Case, Logitech SR30, Kodak DC240i (Bluesberry) Digital Camera
    Office PIII700Mhz@700Mhz, 192MB SDRAM, NVidia RivaTNT2 32MB, IBM P96 19"

    uwcdc.com or namgor.com
    DHAHL3seasons GP:73 G:121 A:55 Pts:176 GWG:12 +/-:184
    UWSWA6seasons GP:41 G:53 A:46 Pts:99 GWG:5 +/-:-25
    MCBHL3seasons GP:14 G:20 A:8 Pts:28 GWG:4 +/-:19

    uwcdc.com or monkis.com

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •