Click to See Complete Forum and Search --> : Sorting over 1 million numbers


Snakex
05-21-2001, 05:02 PM
I was wondering how you would be able to sort over 1 million numbers all random. They are all in 1 file and have to be made into another file. This is using C++.

I cant use an array because an array of 100,000,000 would be just oo big so would a linked list. Somehow I was thinking about sorting 1000 at a time then sorting those over and over again like the mergesort but im not sure.

any help?

Bryan

------------------
"Just remember the shark
that bit you is the one
gets to have a head full
of lead later on"

Grizzly
05-21-2001, 09:58 PM
The fastest way to sort and order large amounts of data is to organize it into a Binary Tree structure. Doing an insertion sort on a Binary tree is 100x quicker than any sort of sequential insertion sort.

If you're unsure, or are unfamiliar with Binary Tree's, than you should definitely look into it. Not only is it the best method to sort 1 million units, but it's essential the *only* way to do it with any reasonable speed.

Hope that helps http://www.sharkyforums.com/ubb/biggrin.gif

sjelkjd
05-22-2001, 03:46 AM
Originally posted by Grizzly:
The fastest way to sort and order large amounts of data is to organize it into a Binary Tree structure. Doing an insertion sort on a Binary tree is 100x quicker than any sort of sequential insertion sort.

If you're unsure, or are unfamiliar with Binary Tree's, than you should definitely look into it. Not only is it the best method to sort 1 million units, but it's essential the *only* way to do it with any reasonable speed.

Hope that helps http://www.sharkyforums.com/ubb/biggrin.gif
Uhm.
No.
This almost sounds like heapsort.

However, quicksort will outperform heapsort and will definitely outperform repeated insertions into a binary tree with an inorder traversal.
to use c++'s stl, include <algorithm> and call sort()

namgor
05-22-2001, 10:15 AM
You see why the binary tree might not work because he sort over a million number and it might too big to fit into memory, since we need to keep all those number, plus all those pointers in the binary tree.

I think the original thought of using merge sort is probably the best way to do it.

It is the same merge sort algorithm you learnt at school except when you divide your list into 2 arrays, you divide them into 2 files.

PS If memory is a problem, you might initially divide your file into multiple files.

------------------
Home Athlon600Mhz@600Mhz, Asus K7M, 128MB SDRAM, RivaTNT2 32MB, Viewsonic A90 19", iMac Greenish Case, Logitech SR30, Kodak DC240i (Bluesberry) Digital Camera
Office PIII700Mhz@700Mhz, 192MB SDRAM, NVidia RivaTNT2 32MB, IBM P96 19"

uwcdc.com (http://www.uwcdc.com) or namgor.com (http://www.namgor.com)

Conrad Song
05-22-2001, 10:17 AM
Originally posted by Snakex:
I was wondering how you would be able to sort over 1 million numbers all random. They are all in 1 file and have to be made into another file. This is using C++.

I cant use an array because an array of 100,000,000 would be just oo big so would a linked list. Somehow I was thinking about sorting 1000 at a time then sorting those over and over again like the mergesort but im not sure.

any help?

Bryan



Mergesort is definitely the way to go. In fact, you can find algorithms that use a small memory footprint by leaving the temporary results in files.

The correct method is to sort in multiple chunks of the cluster (sector) size of the disk. First sort (using any algorithm) one cluster size of data in memory and storing the partial results in a file. Then mergesort in the file by 2 cluster sizes, then 4, etc.

sjelkjd
05-22-2001, 11:52 AM
Originally posted by Conrad Song:
Mergesort is definitely the way to go. In fact, you can find algorithms that use a small memory footprint by leaving the temporary results in files.

The correct method is to sort in multiple chunks of the cluster (sector) size of the disk. First sort (using any algorithm) one cluster size of data in memory and storing the partial results in a file. Then mergesort in the file by 2 cluster sizes, then 4, etc.
Yeah, maybe in archaic systems w/o virtual memory.
And mergesort is not in-place, compounding any space requirements you already have.

Grizzly
05-22-2001, 05:53 PM
I stand corrected! My limited knowledge of C++ sorting has shown through http://www.sharkyforums.com/ubb/tongue.gif I didn't even consider attempting to store 1 million numbers in memory, all having links to another. That would definitely....ummm, suck big time!

Well, I'm glad to see we got some real programming Guru's here now. Wasn't the case a few months ago, outside of a few veterans. Your contributions to the forums are most welcome. I'll most likely have a question or two in the future http://www.sharkyforums.com/ubb/biggrin.gif

Conrad Song
05-22-2001, 06:54 PM
The fastest database programs make no assumptions about the availability of memory. Mergesort is ideal for situations where memory is limited, but disk space is abundant. Under these conditions, a custom merge sort is significantly faster than those which rely on the O.S. to provide virtual memory.

Of course, if you don't really care, relying on virtual memory is certainly the simplest solution.

Originally posted by sjelkjd:
Yeah, maybe in archaic systems w/o virtual memory.
And mergesort is not in-place, compounding any space requirements you already have.
[/B]

sjelkjd
05-22-2001, 07:20 PM
Originally posted by Conrad Song:
The fastest database programs make no assumptions about the availability of memory. Mergesort is ideal for situations where memory is limited, but disk space is abundant. Under these conditions, a custom merge sort is significantly faster than those which rely on the O.S. to provide virtual memory.

Of course, if you don't really care, relying on virtual memory is certainly the simplest solution.



Mergesort is not in place!!!!!
Therefore, it uses more memory than an in-place sorting routine, ie heap/quicksort.

Additionally, the pivoting done with quicksort could easily be distributed into files, just as you are proposing with this modified mergesort.

Snakex
05-22-2001, 07:53 PM
Thanks guys, i will try the mergesort when i get to school tommorow.

By the way if you want to know more about what I have to work with. I have like a 5 meg filespace on my student server and the computer sall run windows NT.

Bryan

------------------
"Just remember the shark
that bit you is the one
gets to have a head full
of lead later on"

Conrad Song
05-22-2001, 10:52 PM
Understood, but the memory usage is not significantly greater than sorting which takes place in virtual memory, since this is disk as well!

Also mergesort has the property of being an in-order sort.

Your idea for quicksort is interesting. I will have to try it sometime.

Originally posted by sjelkjd:
Mergesort is not in place!!!!!
Therefore, it uses more memory than an in-place sorting routine, ie heap/quicksort.

Additionally, the pivoting done with quicksort could easily be distributed into files, just as you are proposing with this modified mergesort.

namgor
05-23-2001, 12:18 AM
Well the idea of merge sort and quick sort are both partitioning them into pieces and use the idea divide-and-conqeur. In here, we divide them into files. Merge sort is in-place and has the advantage of easier to program, since you divide the partition always into 2 (or /2-1 )

[edit for spelling err]

------------------
Home Athlon600Mhz@600Mhz, Asus K7M, 128MB SDRAM, RivaTNT2 32MB, Viewsonic A90 19", iMac Greenish Case, Logitech SR30, Kodak DC240i (Bluesberry) Digital Camera
Office PIII700Mhz@700Mhz, 192MB SDRAM, NVidia RivaTNT2 32MB, IBM P96 19"

uwcdc.com (http://www.uwcdc.com) or namgor.com (http://www.namgor.com)

[This message has been edited by namgor (edited May 23, 2001).]

sjelkjd
05-23-2001, 01:27 AM
Originally posted by Conrad Song:
Understood, but the memory usage is not significantly greater than sorting which takes place in virtual memory, since this is disk as well!

Also mergesort has the property of being an in-order sort.

Your idea for quicksort is interesting. I will have to try it sometime.



Well, if you go and set all your file sizes to be equal to disk cluster/ram frame size you're just emulating virtual memory. Why not let the OS do it?(Since in this case access is simple enough that the generic model is perfectly appropriate)

Second, I said mergesort is not in place.

You need to allocate a whole new array everytime you merge - ie, you can't do it in the same array it originally occupied.

sjelkjd
05-23-2001, 01:34 AM
Originally posted by namgor:
Well the idea of merge sort and quick sort are both partitioning them into pieces and use the idea divide-and-conqeur. In here, we divide them into files. Merge sort is in-place and has the advantage of easier to program, since you divide the partition always into 2 (or /2-1 )

[edit for spelling err]



Merge sort is not in place.
(last step of mergesort)
A[0]=2
A[1]=3
A[2]=4

A[3]=1
A[4]=2
A[5]=3

Try merging these two lists in place.
Secondly, how is this tough code?

#include <algorithm>
int main()
{
int a[N]=...
std::sort<int*>(a,a+N);
}

[This message has been edited by sjelkjd (edited May 23, 2001).]

namgor
05-23-2001, 09:58 AM
Alright I see what you mean, but arent we going to put the elements into files at the first place? I thought dealing with a million number is the real problem, thats why we said we want to divide them into files, we aren't going to put them into array anyway!

------------------
Home Athlon600Mhz@600Mhz, Asus K7M, 128MB SDRAM, RivaTNT2 32MB, Viewsonic A90 19", iMac Greenish Case, Logitech SR30, Kodak DC240i (Bluesberry) Digital Camera
Office PIII700Mhz@700Mhz, 192MB SDRAM, NVidia RivaTNT2 32MB, IBM P96 19"

uwcdc.com (http://www.uwcdc.com) or namgor.com (http://www.namgor.com)

Conrad Song
05-23-2001, 10:47 AM
It's not the memory that makes virtual memory slow, it is the file system. When virtual memory is used there is overhead in managing the virtual memory mananger, and no guarantee that the order of pages will stay correctly ordered together to give the most efficient access to the disk.

I apologize for blanking on the term, I meant to say that the sort is order stable. My meaning is that subsequent sorts do not disturb the ordering of sorts done previously. This has nice properties.

If the merge sort is destructive to the original set of keys, then you need to allocate at most 1x more space. If you are more clever you can get away with less.

Originally posted by sjelkjd:
Well, if you go and set all your file sizes to be equal to disk cluster/ram frame size you're just emulating virtual memory. Why not let the OS do it?(Since in this case access is simple enough that the generic model is perfectly appropriate)

Second, I said mergesort is not in place.

You need to allocate a whole new array everytime you merge - ie, you can't do it in the same array it originally occupied.

sjelkjd
05-23-2001, 04:11 PM
Originally posted by Conrad Song:
It's not the memory that makes virtual memory slow, it is the file system. When virtual memory is used there is overhead in managing the virtual memory mananger, and no guarantee that the order of pages will stay correctly ordered together to give the most efficient access to the disk.

I apologize for blanking on the term, I meant to say that the sort is order stable. My meaning is that subsequent sorts do not disturb the ordering of sorts done previously. This has nice properties.

If the merge sort is destructive to the original set of keys, then you need to allocate at most 1x more space. If you are more clever you can get away with less.



You don't access the file system w/ virtual memory.
If you're splitting them into files then you have to access the file system.
It doesn't matter in which way the pages are ordered since you have to do a lookup for each page in the page table anyway.
However, the TLB is likely to accelerate this process.

Additionally, since quicksort is recursive on the data, the LRU scheme used by memory managers(which has a low overhead compared to the cost of the computation) will be perfectly acceptable.

Galen of Edgewood
05-23-2001, 04:25 PM
Originally posted by sjelkjd:
You don't access the file system w/ virtual memory.

Virtual memory is basically this.

Your OS uses space on your harddrive (often refered to as swap space) and it will act like RAM. So, to access the virtual memory, it must access the harddrive. This introduces the bottleneck of the harddrive access, seek, and transfer times. That is why a NT or 2000 machine with 256 MB of RAM runs faster than a similarly equiped NT or 2000 machine with 128 MB of RAM. The one with the lower RAM will require more Harddrive access because it does not have as much storage space in its physical RAM.

------------------
Do not meddle in the affairs of dragons, for you are crunchy and good with ketchup.

sjelkjd
05-23-2001, 10:24 PM
Originally posted by Galen of Edgewood:
Virtual memory is basically this.

Your OS uses space on your harddrive (often refered to as swap space) and it will act like RAM. So, to access the virtual memory, it must access the harddrive. This introduces the bottleneck of the harddrive access, seek, and transfer times. That is why a NT or 2000 machine with 256 MB of RAM runs faster than a similarly equiped NT or 2000 machine with 128 MB of RAM. The one with the lower RAM will require more Harddrive access because it does not have as much storage space in its physical RAM.


Virtual memory uses a page table to determine location of any given page, rather than the disk file system.

The operating system puts all this into one file. The file system has no notion of where the individual pages go.

Using virtual memory to go to hard disk will be faster than explicitly requesting files because trying to manage file access yourself requires you to go through the OS to provide the same services that are already available from the OS.

You're going to have several more layers of indirection if you try to manage it yourself, which will be slower.

You're basically trying to emulate virtual memory, but without access to privileged instructions.

sjelkjd
05-23-2001, 10:27 PM
Originally posted by Conrad Song:
If the merge sort is destructive to the original set of keys, then you need to allocate at most 1x more space. If you are more clever you can get away with less.



However, you have the run-time overhead of being clever, as well as the programming time overhead of figuring it out.

Snakex
05-23-2001, 10:39 PM
So what have we concluded the best way is here?

I tried to start doing the mergesort with files but I was trying to figure out how to find the hi low and mid for the function if any of you have any suggestions please tell me.

Thanks


------------------
"Just remember the shark
that bit you is the one
gets to have a head full
of lead later on"

namgor
05-23-2001, 11:31 PM
it isnt difficult and is the very similar to what's in your textbook, do you know what they are ?

------------------
Home Athlon600Mhz@600Mhz, Asus K7M, 128MB SDRAM, RivaTNT2 32MB, Viewsonic A90 19", iMac Greenish Case, Logitech SR30, Kodak DC240i (Bluesberry) Digital Camera
Office PIII700Mhz@700Mhz, 192MB SDRAM, NVidia RivaTNT2 32MB, IBM P96 19"

uwcdc.com (http://www.uwcdc.com) or namgor.com (http://www.namgor.com)

Conrad Song
05-24-2001, 03:02 AM
Originally posted by Snakex:
So what have we concluded the best way is here?

I tried to start doing the mergesort with files but I was trying to figure out how to find the hi low and mid for the function if any of you have any suggestions please tell me.

Thanks


Forget coding it. If you're using C, there is a qsort() routine you can call. If you're using C++, there is the stable_sort genetic function. I'm sure there's also an equivalent in Java. No point reinventing the wheel...

[This message has been edited by Conrad Song (edited May 24, 2001).]

gazuga
05-24-2001, 07:51 PM
You should look into Radix Sort. For sorting numbers it is an extremely fast algorithm and is pretty easy to understand. It's actually an O(N) algorithm whereas quicksort and mergesort would both be O(N log N). Of course the downside is that it only works for numbers, and there is always the problem of memory, as discussed above.

EdtheROM
05-31-2001, 08:35 PM
Originally posted by sjelkjd:
Originally posted by Conrad Song:
Mergesort is definitely the way to go. In fact, you can find algorithms that use a small memory footprint by leaving the temporary results in files.

The correct method is to sort in multiple chunks of the cluster (sector) size of the disk. First sort (using any algorithm) one cluster size of data in memory and storing the partial results in a file. Then mergesort in the file by 2 cluster sizes, then 4, etc.
Yeah, maybe in archaic systems w/o virtual memory.
And mergesort is not in-place, compounding any space requirements you already have.



The underlying disk/file structure hasn't changed much, so the hint to use multiples of the cluster size is a good one.

Just why would you think that virtual memory would obviate this??

I'm interested...

-ed-

Stasis
05-31-2001, 08:36 PM
Use Quick Sort.

------------------
//SyS Specs//
AMD T-Bird 1.333GHz
256 mb DDR RAM (Micron)
Iwill KA266 MoBo
40GB Maxtor ATA100 7200RPM
LeadTek GeForce 2 PRO 64mb w/TV OUT
19" Sylvania Monitor
Intelli Mouse Optical(Explorer sux)
^^^^^^^^^^^^^^^^^^^^^
This system is the best
in this thread!

EdtheROM
05-31-2001, 09:04 PM
Originally posted by Snakex:
I was wondering how you would be able to sort over 1 million numbers all random. They are all in 1 file and have to be made into another file. This is using C++.

I cant use an array because an array of 100,000,000 would be just oo big so would a linked list. Somehow I was thinking about sorting 1000 at a time then sorting those over and over again like the mergesort but im not sure.

any help?

Bryan



*****

perhaps with a Shell sort, or modified Shell sort??

Snakex
06-01-2001, 06:06 PM
heres what I did, it took the use of 3 temp files.

Take 2 batches of 10000 numbers from the big file then sort those in an array then place them in temp1.num and temp2.num. Then merge temp1.num and temp2.num to temp3.num and then get 10000 more numbers in temp1.num sorted. Now merge temp1.num and temp3.num into temp2.num and keep doing that untill all the numbers are sorted

------------------
"Just remember the shark
that bit you is the one
gets to have a head full
of lead later on"

sjelkjd
06-02-2001, 05:33 AM
Originally posted by EdtheROM:
The underlying disk/file structure hasn't changed much, so the hint to use multiples of the cluster size is a good one.

Just why would you think that virtual memory would obviate this??

I'm interested...

-ed-


I suspect that if you look at the number of page accesses for a "dumb" quicksort and this file-based mergesort, you're going to get very similar numbers.
The analysis of the spatial locality of the accesses will prove educational:
For quicksort, at every level, you're basically going to have to load in every page.
Your operating system will basically keep the location of the "less" and the "Greater" array in memory, forcing a page miss once per level for them, and then a page miss for every page of the array.
When you subdivide, you repeatedly access pages for each half.
So, approximately nlogn page misses total(on a good day) and near n^2 on a bad(where n in this case is number of pages of memory worth of numbers).

For mergesort, first we have the cost of splitting everything up into files. This is the cost of loading the array into memory, and writing it all back out to disk(approx n page misses per level, log n levels).

Then we're looking at a bottom level of 1 page miss per page.
Now we have files of size cluster*2.
going through it again, there are n/2 files, but still n pages worth of numbers.
So again, n page misses.
This continues until mergesort finishes:
n log n page misses.

As far as i can tell, both methods incur roughly the same number of page misses.
With pathologically bad input, sure, quicksort will have n^2 page faults.

The likely scenario is that both will have roughly similar disk access needs, with one caveat. Mergesort is not in-place. THerefore, You not only have the overhead of splitting the array into files in the first place, you also have to put it back together.

Now, conventionally, memory accesses are cheap and don't count. When we're not dealing with primary memory, however, they get real expensive, real fast.

That's why i think that this mergesort is a big loser in this case: you're going to have to go to disk roughly twice as much.

zboss
06-02-2001, 12:34 PM
Originally posted by Snakex:
I was wondering how you would be able to sort over 1 million numbers all random. They are all in 1 file and have to be made into another file. This is using C++.

I cant use an array because an array of 100,000,000 would be just oo big so would a linked list...


Use a Hash Function. This doesn't sort the items at all, but uses pointers. That way when you need to insert/delete a new item into the list, you don't have to move the items. So, the first time around may be a little slower, but subsequent performance will be much faster.



------------------
zBoss

namgor
06-02-2001, 02:24 PM
Originally posted by zboss:
Use a Hash Function. This doesn't sort the items at all, but uses pointers. That way when you need to insert/delete a new item into the list, you don't have to move the items. So, the first time around may be a little slower, but subsequent performance will be much faster.


I thought we talked about this, using pointers is very expensive in this case. Not a good soln'.

------------------
DHAHL3seasons GP:73 G:121 A:55 Pts:176 GWG:12 +/-:184
UWSWA1season GP:9 G:12 A:8 Pts:20 GWG:3 +/-:-3

uwcdc.com (http://www.uwcdc.com) or namgor.com (http://www.namgor.com)

EdtheROM
06-02-2001, 08:41 PM
Originally posted by sjelkjd:
Originally posted by EdtheROM:
The underlying disk/file structure hasn't changed much, so the hint to use multiples of the cluster size is a good one.

Just why would you think that virtual memory would obviate this??

I'm interested...

-ed-


I suspect that if you look at the number of page accesses for a "dumb" quicksort and

*********
*********
*********

That's why i think that this mergesort is a big loser in this case: you're going to have to go to disk roughly twice as much.

*****

given all of this, have you ever considered Singleton's sort??

-ed-

sjelkjd
06-02-2001, 09:07 PM
Originally posted by EdtheROM:
*****

given all of this, have you ever considered Singleton's sort??

-ed-
Actually, i've never heard of Singleton's sort.
Could you describe the algorithm?