Sorting over 1 million numbers

**Snakex** · 05-21-2001, 05:02 PM

I was wondering how you would be able to sort over 1 million numbers all random. They are all in 1 file and have to be made into another file. This is using C++.

I cant use an array because an array of 100,000,000 would be just oo big so would a linked list. Somehow I was thinking about sorting 1000 at a time then sorting those over and over again like the mergesort but im not sure.

any help?

Bryan

------------------
"Just remember the shark
that bit you is the one
gets to have a head full
of lead later on"

**Grizzly** · 05-21-2001, 09:58 PM

The fastest way to sort and order large amounts of data is to organize it into a Binary Tree structure. Doing an insertion sort on a Binary tree is 100x quicker than any sort of sequential insertion sort.

If you're unsure, or are unfamiliar with Binary Tree's, than you should definitely look into it. Not only is it the best method to sort 1 million units, but it's essential the *only* way to do it with any reasonable speed.

Hope that helps

**sjelkjd** · 05-22-2001, 03:46 AM

Originally posted by Grizzly:
The fastest way to sort and order large amounts of data is to organize it into a Binary Tree structure. Doing an insertion sort on a Binary tree is 100x quicker than any sort of sequential insertion sort.

If you're unsure, or are unfamiliar with Binary Tree's, than you should definitely look into it. Not only is it the best method to sort 1 million units, but it's essential the *only* way to do it with any reasonable speed.

Hope that helps

Uhm.
No.
This almost sounds like heapsort.

However, quicksort will outperform heapsort and will definitely outperform repeated insertions into a binary tree with an inorder traversal.
to use c++'s stl, include <algorithm> and call sort()

**namgor** · 05-22-2001, 10:15 AM

You see why the binary tree might not work because he sort over a million number and it might too big to fit into memory, since we need to keep all those number, plus all those pointers in the binary tree.

I think the original thought of using merge sort is probably the best way to do it.

It is the same merge sort algorithm you learnt at school except when you divide your list into 2 arrays, you divide them into 2 files.

PS If memory is a problem, you might initially divide your file into multiple files.

------------------
Home Athlon600Mhz@600Mhz, Asus K7M, 128MB SDRAM, RivaTNT2 32MB, Viewsonic A90 19", iMac Greenish Case, Logitech SR30, Kodak DC240i (Bluesberry) Digital Camera
Office PIII700Mhz@700Mhz, 192MB SDRAM, NVidia RivaTNT2 32MB, IBM P96 19"

uwcdc.com or namgor.com

**Conrad Song** · 05-22-2001, 10:17 AM

Originally posted by Snakex:
I was wondering how you would be able to sort over 1 million numbers all random. They are all in 1 file and have to be made into another file. This is using C++.

I cant use an array because an array of 100,000,000 would be just oo big so would a linked list. Somehow I was thinking about sorting 1000 at a time then sorting those over and over again like the mergesort but im not sure.

any help?

Bryan

Mergesort is definitely the way to go. In fact, you can find algorithms that use a small memory footprint by leaving the temporary results in files.

The correct method is to sort in multiple chunks of the cluster (sector) size of the disk. First sort (using any algorithm) one cluster size of data in memory and storing the partial results in a file. Then mergesort in the file by 2 cluster sizes, then 4, etc.

**sjelkjd** · 05-22-2001, 11:52 AM

Originally posted by Conrad Song:
Mergesort is definitely the way to go. In fact, you can find algorithms that use a small memory footprint by leaving the temporary results in files.

The correct method is to sort in multiple chunks of the cluster (sector) size of the disk. First sort (using any algorithm) one cluster size of data in memory and storing the partial results in a file. Then mergesort in the file by 2 cluster sizes, then 4, etc.

Yeah, maybe in archaic systems w/o virtual memory.
And mergesort is not in-place, compounding any space requirements you already have.

**Grizzly** · 05-22-2001, 05:53 PM

I stand corrected! My limited knowledge of C++ sorting has shown through

I didn't even consider attempting to store 1 million numbers in memory, all having links to another. That would definitely....ummm, suck big time!

Well, I'm glad to see we got some real programming Guru's here now. Wasn't the case a few months ago, outside of a few veterans. Your contributions to the forums are most welcome. I'll most likely have a question or two in the future

**Conrad Song** · 05-22-2001, 06:54 PM

The fastest database programs make no assumptions about the availability of memory. Mergesort is ideal for situations where memory is limited, but disk space is abundant. Under these conditions, a custom merge sort is significantly faster than those which rely on the O.S. to provide virtual memory.

Of course, if you don't really care, relying on virtual memory is certainly the simplest solution.

Originally posted by sjelkjd:
Yeah, maybe in archaic systems w/o virtual memory.
And mergesort is not in-place, compounding any space requirements you already have.
[/B]

**sjelkjd** · 05-22-2001, 07:20 PM

Originally posted by Conrad Song:
The fastest database programs make no assumptions about the availability of memory. Mergesort is ideal for situations where memory is limited, but disk space is abundant. Under these conditions, a custom merge sort is significantly faster than those which rely on the O.S. to provide virtual memory.

Of course, if you don't really care, relying on virtual memory is certainly the simplest solution.

Mergesort is not in place!!!!!
Therefore, it uses more memory than an in-place sorting routine, ie heap/quicksort.

Additionally, the pivoting done with quicksort could easily be distributed into files, just as you are proposing with this modified mergesort.

**Snakex** · 05-22-2001, 07:53 PM

Thanks guys, i will try the mergesort when i get to school tommorow.

By the way if you want to know more about what I have to work with. I have like a 5 meg filespace on my student server and the computer sall run windows NT.

Bryan

------------------
"Just remember the shark
that bit you is the one
gets to have a head full
of lead later on"

**Conrad Song** · 05-22-2001, 10:52 PM

Understood, but the memory usage is not significantly greater than sorting which takes place in virtual memory, since this is disk as well!

Also mergesort has the property of being an in-order sort.

Your idea for quicksort is interesting. I will have to try it sometime.

Originally posted by sjelkjd:
Mergesort is not in place!!!!!
Therefore, it uses more memory than an in-place sorting routine, ie heap/quicksort.

Additionally, the pivoting done with quicksort could easily be distributed into files, just as you are proposing with this modified mergesort.

**namgor** · 05-23-2001, 12:18 AM

Well the idea of merge sort and quick sort are both partitioning them into pieces and use the idea divide-and-conqeur. In here, we divide them into files. Merge sort is in-place and has the advantage of easier to program, since you divide the partition always into 2 (or /2-1 )

[edit for spelling err]

------------------
Home Athlon600Mhz@600Mhz, Asus K7M, 128MB SDRAM, RivaTNT2 32MB, Viewsonic A90 19", iMac Greenish Case, Logitech SR30, Kodak DC240i (Bluesberry) Digital Camera
Office PIII700Mhz@700Mhz, 192MB SDRAM, NVidia RivaTNT2 32MB, IBM P96 19"

uwcdc.com or namgor.com

[This message has been edited by namgor (edited May 23, 2001).]

**sjelkjd** · 05-23-2001, 01:27 AM

Originally posted by Conrad Song:
Understood, but the memory usage is not significantly greater than sorting which takes place in virtual memory, since this is disk as well!

Also mergesort has the property of being an in-order sort.

Your idea for quicksort is interesting. I will have to try it sometime.

Well, if you go and set all your file sizes to be equal to disk cluster/ram frame size you're just emulating virtual memory. Why not let the OS do it?(Since in this case access is simple enough that the generic model is perfectly appropriate)

Second, I said mergesort is not in place.

You need to allocate a whole new array everytime you merge - ie, you can't do it in the same array it originally occupied.

**sjelkjd** · 05-23-2001, 01:34 AM

Originally posted by namgor:
Well the idea of merge sort and quick sort are both partitioning them into pieces and use the idea divide-and-conqeur. In here, we divide them into files. Merge sort is in-place and has the advantage of easier to program, since you divide the partition always into 2 (or /2-1 )

[edit for spelling err]

Merge sort is not in place.
(last step of mergesort)
A[0]=2
A[1]=3
A[2]=4

A[3]=1
A[4]=2
A[5]=3

Try merging these two lists in place.
Secondly, how is this tough code?

#include <algorithm>
int main()
{
int a[N]=...
std::sort<int*>(a,a+N);
}

[This message has been edited by sjelkjd (edited May 23, 2001).]

**namgor** · 05-23-2001, 09:58 AM

Alright I see what you mean, but arent we going to put the elements into files at the first place? I thought dealing with a million number is the real problem, thats why we said we want to divide them into files, we aren't going to put them into array anyway!

------------------
Home Athlon600Mhz@600Mhz, Asus K7M, 128MB SDRAM, RivaTNT2 32MB, Viewsonic A90 19", iMac Greenish Case, Logitech SR30, Kodak DC240i (Bluesberry) Digital Camera
Office PIII700Mhz@700Mhz, 192MB SDRAM, NVidia RivaTNT2 32MB, IBM P96 19"

uwcdc.com or namgor.com

Thread: Sorting over 1 million numbers

Thread Tools

Display

Sorting over 1 million numbers

Posting Permissions