modifying existing XML with SAX

Discussion:

p***@gmail.com

2006-04-24 20:34:46 UTC

I'm working on a C++ application that writes its output as XML. I
currently have it running using DOM (IXMLDOMDocument), but the load()
and save() calls can take a LONG time when the document grows beyond
a meg or two. That, and the fact that the entire document has to be
loaded into memory, makes this really bad for my Pocket PC application.

After some searching I found that SAX was probably my answer. Just load
a document with an IStream and modify it with SAX. Unfortunately, I
haven't found much in the way of examples of how to do that exactly.

Has anybody got a C++ example of how to modify an existing XML
document (see below for an example) with SAX, without using the DOM
load() and save() calls?

Thanks,
PaulH

Here is an example of the XML I'm working with. I would just want to
append a new "ENTRY" element and all sub-elements each time.
<?xml version="1.0" encoding="UTF-8"?>
<LOGFILE>
<ENTRY>
<TIME>timestamp</TIME>
<DATE>datestamp</DATE>
<MODULE>module data</MODULE>
<MODULE2>module data</MODULE2>
</ENTRY>
<ENTRY>
<TIME>timestamp</TIME>
<DATE>datestamp</DATE>
<MODULE>module data</MODULE>
<MODULE2>module data</MODULE2>
</ENTRY>
</LOGFILE>

Igor Tandetnik

2006-04-24 20:45:24 UTC

Post by p***@gmail.com
After some searching I found that SAX was probably my answer. Just
load a document with an IStream and modify it with SAX.
Unfortunately, I haven't found much in the way of examples of how to
do that exactly.

SAX does not build the document and does not really allow you to modify
it. SAX simply produces and sends your way a stream of events, such as
"element starts", "piece of text found", "element ends" and so on. The
SAX parser itself does not maintain any data (beyond whatever state is
necessary to properly match up tags). You are free to do whatever you
want with those events.

For example, you can write your SAX handler to simply format the data
you receive as text, and output it to a file or stream. With some work,
you can have a SAX handler that outputs an exact copy of the document.
With a little bit more work, you can actually change the document on the
fly. It would be particularly easy if the changes you need made are
localized to a small part of the document and don't require maintaining
a lot of context.

You might also want to look into XSLT: chances are you can perform your
transformation in a purely declarative manner, without writing a line of
code.

--
With best wishes,
Igor Tandetnik

With sufficient thrust, pigs fly just fine. However, this is not
necessarily a good idea. It is hard to be sure where they are going to
land, and it could be dangerous sitting under them as they fly
overhead. -- RFC 1925

p***@gmail.com

2006-04-24 22:13:46 UTC

I'm not sure I see how XSLT will help me. I'm not transforming the XML.
I'm trying to append a new
<ENTRY>
<TIME>timestamp</TIME>
<DATE>datestamp</DATE>
<MODULE>module data</MODULE>
<MODULE2>module data</MODULE2>
</ENTRY>
to the end of the existing XML just above the </LOGFILE>. I do this
constantly while the program is running at some user-defined interval
(typically every 30 seconds).

Is there a way to roll my own IStream implementation for
IXMLDOMDocument load() and save() such that it only keeps a few KB in
memory at any given time and not the whole document?

Igor Tandetnik

2006-04-24 22:24:22 UTC

Personally, for something like this, I'd probably not bother with an XML
parser at all. It should be easy to read backward from the end of the
file, find the offset of the line containing </LOGFILE>, write your
entry starting at this offset, and finally write </LOGFILE> line back at
the end. It does not sound particularly efficient to reread the whole
log every time you want to append an entry, with SAX or otherwise.
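
The approach described above could be sketched roughly as follows, using standard C++ file streams. The function name `appendBeforeClosingTag` and the 256-byte tail size are assumptions made for this sketch. It also assumes that only whitespace follows the closing tag and that the new entry is at least as long as that trailing whitespace, so the rewritten tail fully covers the old one.

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <string>

// Inserts `entry` just before the last `closingTag` in the file at `path`.
// Returns false if the file cannot be opened or the tag is not found within
// the last `tailSize` bytes. Only the tail of the file is read.
bool appendBeforeClosingTag(const std::string& path,
                            const std::string& entry,
                            const std::string& closingTag = "</LOGFILE>",
                            std::streamoff tailSize = 256)
{
    std::fstream f(path.c_str(), std::ios::in | std::ios::out | std::ios::binary);
    if (!f) return false;

    // Work out where the tail starts and read just that part.
    f.seekg(0, std::ios::end);
    const std::streamoff size = f.tellg();
    const std::streamoff start = std::max<std::streamoff>(0, size - tailSize);
    std::string tail(static_cast<std::size_t>(size - start), '\0');
    f.seekg(start, std::ios::beg);
    if (!tail.empty()) f.read(&tail[0], static_cast<std::streamsize>(tail.size()));
    if (!f) return false;

    // Locate the last occurrence of the closing tag within the tail.
    const std::size_t pos = tail.rfind(closingTag);
    if (pos == std::string::npos) return false;

    // Overwrite from the tag's offset: the new entry, then the tag again.
    f.seekp(start + static_cast<std::streamoff>(pos), std::ios::beg);
    f << entry << closingTag << "\n";
    f.flush();
    return static_cast<bool>(f);
}
```

The cost of each append depends only on the tail size and the entry size, not on how large the log has grown.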

Post by p***@gmail.com
Is there a way to roll my own IStream implementation for
IXMLDOMDocument load() and save() such that it only keeps a few KB in
memory at any given time and not the whole document?

You can have IXMLDOMDocument read directly from a disk file: just give it
a file:// URL. That won't help you, though. IXMLDOMDocument itself still
keeps the whole document in memory, not as raw data but as a DOM tree.

Simon Trew

2006-04-25 10:12:48 UTC

What you could do is use SAX to build your own mini-DOM-like-thing (MDLT).
For each element, in the MDLT mark the place in the stream that it starts
and ends (together with any other information you want to preserve). Then
you can actually discard its content since you know you can go back in the
stream and reread it when you need to. You would need to make an IStream
(based on an existing IStream) that can make the appropriate section of
stream look like a full stream (i.e. subtracts the starting stream offset
and reports end-of-stream when it reaches the end of stream offset). Of
course this only works for random-access streams.

If you then actually want to reproduce a document, you just reread the
original with SAX, and write it out again or replace it with any extra parts
that you happen to have in memory (e.g. stuff you have added).

I did this very successfully a few years ago with documents that were
gigabytes in size. Of course you'd need to decide the policy for what to
keep and what to discard. e.g. you might keep all tags at nesting levels 1-3
and discard all the other contents. The policy would of course depend very
much on the structure of the documents you were expecting to encounter.

S.

p***@gmail.com

2006-04-25 14:33:39 UTC

That sounds like the Right Way to do this. Do you know of any existing
examples anywhere of something similar that would get me going in the
right direction?

-PaulH