Custom Parser Implementation help


PsychoKow51
06-06-2007, 10:54 PM
I work with a number of process automation systems (DCSs, PLCs). These systems have configuration export files that are not very easy to query due to their content structure. For one system whose files are especially tedious, I wrote a program in VBA to convert the file to an XML structure, just to see if it could be done without too much complexity. The VBA version is really slow: a 1GB file takes 45 minutes.

I would like to rewrite it in C# and do some smarter things, like buffering the input and output streams more and multithreading. I have never written anything multithreaded and wanted to share my idea for how to design the application to get some feedback:

I see four threads (I intend to use methods that rely on the thread pool):
Read data from the file
Parse to the initial XML format
Insert default values that are omitted from the export file using XSLT
Write data to the file

It will take multiple reads to get enough information through the parsing to be able to send it to the transforms. The output of the transform will be more than enough to send directly to the writer.

I am not sure how I keep the threads synchronized to process the right data in the right order.

Thoughts?
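
One way to picture the four-stage idea is a chain of single-threaded stages joined by FIFO queues: because each stage has exactly one thread and each queue preserves insertion order, the data stays in order without any extra bookkeeping. The following is only a minimal sketch in Python (used here for brevity, not the C# the rewrite would use); `stage`, `run_pipeline`, and the stage functions passed in are hypothetical names, not anything from the original program.

```python
import queue
import threading

_DONE = object()  # sentinel object that marks the end of the stream


def stage(func, inbox, outbox):
    """Apply func to every item taken from inbox and put the result on outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # pass the end marker down the chain
            return
        outbox.put(func(item))


def run_pipeline(items, funcs, maxsize=16):
    """Push items through funcs, one thread per stage; results keep input order."""
    queues = [queue.Queue(maxsize) for _ in range(len(funcs) + 1)]
    results = []

    def collect():
        # Stands in for the "write data to file" stage.
        while True:
            item = queues[-1].get()
            if item is _DONE:
                return
            results.append(item)

    threads = [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    threads.append(threading.Thread(target=collect))
    for t in threads:
        t.start()

    # The calling thread plays the "read data from file" stage.
    for item in items:
        queues[0].put(item)
    queues[0].put(_DONE)

    for t in threads:
        t.join()
    return results
```

The bounded queues (`maxsize`) keep a fast reader from running far ahead of a slow stage. The sketch does not handle a stage function that raises; a real version would need to propagate errors and shut the chain down.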

eshbach
06-06-2007, 11:27 PM
There are a few things to watch out for when working with big files in C#. One is that there's a size limit on the constructor for String and StringBuilder. I'm not sure if the limit is documented, or exactly what it is, but it's pretty small. I've seen string constructors fail with initializations as low as 200k.

In the interest of memory use, parallelism, and reliability you're going to want to work on this file in many small chunks.

One simple solution you could adapt would be as follows:

Decide upon a read block size, 8k maybe.

Create one Thread that will be your writer and create a FileStream object in it for the output file. Then have it call WaitOne() on a wait handle (an AutoResetEvent, for example) so that it blocks until a worker signals that data is ready.

Back in the original method, create a FileStream object for the input file. In a loop, read 8k into a buffer and create a Thread object pointing to a method that will work on the data (use ParameterizedThreadStart).

Start the threads, passing each the current buffer and the index of the loop iterator so that you can keep track of the order of the file. Have each thread operate on its buffer (parsing and populating as necessary).

When each worker thread finishes, have it raise an event and send its new xml data to the writer thread.

The writer thread should collect data until it has a certain amount (you decide) buffered (in order mind you) and then flush that out to the output file and continue waiting for more data.
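
The steps above can be sketched roughly as follows. This is a Python illustration of the ordering logic only, not C# and not a tuned implementation: `convert` is a placeholder for the real parse step, `FLUSH_THRESHOLD` is an arbitrary stand-in for the "certain amount", and a thread pool plus a queue replace the hand-made threads and events. In CPython, threads will not speed up CPU-bound parsing, so the point here is how the index keeps the output in file order.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 8 * 1024         # read block size suggested above
FLUSH_THRESHOLD = 64 * 1024   # assumed value for "a certain amount"


def convert(block):
    """Placeholder for the real parse step (assumption): upper-case the text."""
    return block.upper()


def writer(results, out, flush_threshold):
    """Collect (index, data) pairs, restore read order, write in batches."""
    pending = {}      # results that arrived ahead of their turn
    next_index = 0    # index of the next block allowed into the output
    ready, ready_size = [], 0
    while True:
        msg = results.get()
        if msg is None:           # None signals that all workers are done
            break
        index, data = msg
        pending[index] = data
        # Move every block that is now contiguous with the output so far.
        while next_index in pending:
            piece = pending.pop(next_index)
            ready.append(piece)
            ready_size += len(piece)
            next_index += 1
        if ready_size >= flush_threshold:
            out.write("".join(ready))
            ready, ready_size = [], 0
    out.write("".join(ready))     # final partial batch


def process(src, out, workers=4, block_size=BLOCK_SIZE,
            flush_threshold=FLUSH_THRESHOLD):
    results = queue.Queue()
    w = threading.Thread(target=writer, args=(results, out, flush_threshold))
    w.start()

    def work(index, block):
        results.put((index, convert(block)))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        index = 0
        while True:
            block = src.read(block_size)
            if not block:
                break
            pool.submit(work, index, block)
            index += 1
    # Leaving the with-block waits for every worker to finish.
    results.put(None)
    w.join()
```

Two caveats: fixed-size blocks can cut a token or record in half, which the placeholder `convert` does not care about but a real parser would, and nothing here limits how many blocks are queued if the workers fall behind the reader.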

PsychoKow51
06-07-2007, 06:16 AM
I was thinking of using a much larger input buffer. Is there any reason not to use something just below the string constructor limit (the 200k you mentioned)?

If I understand the process you outlined, I will need two indexes: one for the order of data sent to the parser, and one for the order of data sent to the output stream. Correct?

If it helps to understand the problem a bit better: a record of data in the export file I'll be reading can be spread across thousands of lines. Subsets of data within a record can be either spread across a few lines, or just one. Based on the 200k limit you mentioned, I might not be able to do the XSLT work in this process because I might not be able to hold onto a string that is large enough to send the entire element to the transform.
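
Since a record can span thousands of lines, one option is to split the input at record boundaries rather than at fixed byte counts, so that each unit of work is a complete record. The sketch below (Python, for illustration) assumes a record ends when the brace depth returns to zero, that braces inside double-quoted strings do not count, and that `""` inside a string is an escaped quote; these rules are guesses based on the sample posted later in the thread, and `split_records` is a hypothetical helper. Braces inside `/* */` comments are not handled.

```python
def split_records(lines):
    """Yield lists of lines, each list holding one complete top-level record."""
    record = []
    depth = 0          # current {} nesting depth
    in_string = False  # True while inside a double-quoted string
    for line in lines:
        record.append(line)
        closed = False
        for ch in line:
            if ch == '"':
                # A doubled quote ("") toggles twice, so the state is unchanged.
                in_string = not in_string
            elif not in_string:
                if ch == "{":
                    depth += 1
                elif ch == "}":
                    depth -= 1
                    if depth == 0:
                        closed = True
        if closed and depth == 0:
            yield record
            record = []
    # Hand back any trailing, incomplete record instead of dropping it.
    if any(l.strip() for l in record):
        yield record
```

Each yielded record could then be given to a worker together with its sequence number, which sidesteps the problem of a fixed-size buffer ending in the middle of a record.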

eshbach
06-07-2007, 11:44 PM
A larger buffer would probably be fine.

Rather than index the order of data sent to the writer, I would add the data to a buffer (perhaps a list) in order of its read index, and when the buffer surpasses a certain size, flush it to disk.

Also, it is very possible to have a string longer than 200k in C#. I have had strings of hundreds of megabytes work fine. The difficulty is in creating the string efficiently. Concatenation won't work because (since strings are immutable) you'll be making so many new objects that you'll run out of memory pretty quickly. Unfortunately, in my experience the StringBuilder class is not much help, as it will cause OutOfMemory exceptions with relatively small amounts of text.

In the past I've used char* instead of string and Marshal.AllocHGlobal along with Marshal.PtrToStringAuto to make large string objects.

slavik
06-13-2007, 03:54 PM
What do these files look like? (Just post 5-10 lines.)

Another question: do you think that using regular expressions would make this task simpler? (If yes, then consider learning some Perl.) :)

PsychoKow51
06-30-2007, 06:42 PM
Sorry I took so long to reply. Here's a snippet:


FUNCTION_BLOCK_DEFINITION NAME="__4540F3F9_726FDFC4__" CATEGORY=""
user="ADMINISTRATOR" time=1162300632/* "31-Oct-2006 08:17:12" */
{
SFC_ALGORITHM
{
STEP NAME="ABORT_LOGIC"
{
DESCRIPTION="START"
RECTANGLE= { X=196 Y=44 H=40 W=100 }
}
STEP NAME="IDLE_VLV_BS"
{
RECTANGLE= { X=196 Y=180 H=40 W=100 }
ACTION NAME="A1"
{
ACTION_TYPE=ASSIGN
QUALIFIER=P
EXPRESSION="'^/STEP_MSG.CV' := ""Idling Equipment"""
DELAY_TIME=0
}

Regular expressions (which I love dearly) really don't help unless I can stop the pattern matches from being greedy (matching the longest possible match). There are other considerations too:
1) It is not known up front what keywords will be in the file.
2) There are only about eleven rules to worry about to know what to do with the current character.
3) I only need to look ahead 5 characters to match the longest pattern.
4) All the patterns I need to find are string literals.
5) Some patterns that are multiline have single-line patterns as subsets, which makes it tricky to differentiate which pattern to apply if I work on whole lines at a time.

Basically, the code is not easier with RegExp (I can't believe I said that, but it's true)
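
To illustrate why a character-at-a-time scanner with a short lookahead can be simpler than regular expressions here, below is a minimal tokenizer sketch in Python. It is not the actual rule set (the roughly eleven rules are not listed in the thread); it only covers what is visible in the posted sample: bare words, `=`, braces, double-quoted strings with `""` as an escaped quote, and `/* */` comments. The function name `tokenize` and the token kinds are made up for this example.

```python
def tokenize(text):
    """Yield (kind, value) tokens from export-file text, scanning left to right."""
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():
            i += 1
        elif text.startswith("/*", i):
            # Comment: skip to the closing */ (or to the end if unterminated).
            end = text.find("*/", i + 2)
            i = n if end == -1 else end + 2
        elif ch in "{}=":
            yield (ch, ch)
            i += 1
        elif ch == '"':
            # Quoted string; a doubled quote inside it is an escaped quote.
            i += 1
            parts = []
            while i < n:
                if text.startswith('""', i):
                    parts.append('"')
                    i += 2
                elif text[i] == '"':
                    i += 1
                    break
                else:
                    parts.append(text[i])
                    i += 1
            yield ("STRING", "".join(parts))
        else:
            # Bare word: keyword, attribute name, or unquoted value.
            start = i
            while (i < n and not text[i].isspace()
                   and text[i] not in '{}="'
                   and not text.startswith("/*", i)):
                i += 1
            yield ("WORD", text[start:i])
```

The scanner never needs to know the keywords in advance, and the longest lookahead it uses is two characters. A parser on top of it could treat a `WORD` followed by `=` as a key and any other `WORD` as the start of a new element, though that grammar is an assumption drawn from the sample.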

I'm too old for perl - I use awk ;)
And for the other ancient dabblers, it's nawk on Solaris and gawk on my stupid Windows machine.

PsychoKow51
06-30-2007, 06:43 PM
The above file was "fixed" by the forum software. There is indenting for many of the embedded {} sections, but it is not like well-formed XML, where every embedded section gets indented.

Candyman
07-01-2007, 03:38 PM
If you put it inside [code] tags, it will preserve indentation.