Discussion:
Parsing large XML files FAST
PedroX
2005-06-26 13:32:38 UTC
Hello:

I need to parse some large XML files and save the data in an Access DB. I
was using MSXML 2 and ASP, but it turns out to be extremely slow when the
XML documents are around 10 MB in size. It takes over an hour to parse files
of that size!

I don't really need to use ASP or a web server at all because I am parsing
all in my own computer. Is there any executable that can do this parsing
faster than the way I was doing it?

Thanks in advance.
Martin Honnen
2005-06-26 13:58:52 UTC
Post by PedroX
I need to parse some large XML files and save the data in an Access DB. I
was using MSXML 2 and ASP, but it turns out to be extremely slow when the
XML documents are around 10 MB in size. It takes over an hour to parse files
of that size!
I don't really need to use ASP or a web server at all because I am parsing
all in my own computer. Is there any executable that can do this parsing
faster than the way I was doing it?
If you want to use MSXML but not ASP then you could use VB and the SAX
interface that MSXML provides, that way the parser pushes stuff it has
parsed into your event handlers and you can process the huge file chunk
by chunk. That should be faster than the DOM parsing you can do in ASP.
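
To make the push model concrete, here is a minimal sketch in Python's standard-library SAX module rather than the MSXML/VB interfaces discussed in this thread (the idea is the same, the API is not). The element name node_value is the placeholder used later in the thread, and the sample document is invented for illustration.

```python
import io
import xml.sax


class NodeValueHandler(xml.sax.ContentHandler):
    """Collects the text of every <node_value> element as the parser streams by."""

    def __init__(self):
        super().__init__()
        self.values = []      # finished texts, in document order
        self._buffer = None   # None while we are outside <node_value>

    def startElement(self, name, attrs):
        if name == "node_value":
            self._buffer = []

    def characters(self, content):
        # Text can arrive in several pieces, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "node_value" and self._buffer is not None:
            self.values.append("".join(self._buffer))
            self._buffer = None


def extract_values(source):
    """Parse a file path or file-like object and return the collected texts."""
    handler = NodeValueHandler()
    xml.sax.parse(source, handler)
    return handler.values


# Hypothetical sample; a real run would pass the path of the large file instead.
SAMPLE = b"""<rootNode><childNode>
  <node_name><node_value>first</node_value></node_name>
  <node_name><node_value>second</node_value></node_name>
</childNode></rootNode>"""

values = extract_values(io.BytesIO(SAMPLE))
```

The handler only ever holds the text of the element it is currently inside, which is why memory use stays flat however large the file is.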

The SAX API is documented here:
<http://msdn.microsoft.com/library/default.asp?url=/library/en-us/xmlsdk/html/ac6be45a-177e-4b80-a918-dc73e357f7bb.asp>
--
Martin Honnen --- MVP XML
http://JavaScript.FAQTs.com/
PedroX
2005-06-26 15:02:00 UTC
I tried to learn more about SAX. It seems that it'd be easier for me if I
could call SAX from an ASP page. References for SAX with VBScript are almost
non-existent on the net, or very hard to find. From the VB examples I
saw, though, using SAX looks complicated no matter how you implement it. I'd
still give it a try if I found a good tutorial on SAX and VBScript.

My other option would then be to try to optimize my ASP code that calls MSXML
4.0. If anyone here is willing to take a look at it, I'd be very grateful.

I already posted (partially) the structure of the XML I am working with in a
recent post, to which Martin Honnen also replied. The subject was "MSXML2
and Xpath problem.".

Thanks.
Martin Honnen
2005-06-26 15:20:29 UTC
Post by PedroX
I tried to learn more about SAX. It seems that it'd be easier for me if I
could call SAX from an ASP page. References for SAX with VBScript are almost
non-existent on the net, or very hard to find. From the VB examples I
saw, though, using SAX looks complicated no matter how you implement it. I'd
still give it a try if I found a good tutorial on SAX and VBScript.
You cannot use SAX from VBScript; you need VB (or C++ if that is an option).
Post by PedroX
My other option would then be to try to optimize my ASP code that calls MSXML
4.0. If anyone here is willing to take a look at it, I'd be very grateful.
Do you need to use classic ASP? Or can you use ASP.NET? With .NET you
could use XmlTextReader. Like SAX, it does not read the complete XML
into memory, but XmlTextReader is easier to use than event-based SAX
parsing.
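
As a rough illustration of that pull style, where your own loop asks for the next piece instead of the parser calling handlers, here is a sketch using Python's standard-library iterparse. It is an analogue, not XmlTextReader itself. The element names are the placeholders from this thread, the sample data is invented, and the sketch assumes the document uses no namespaces (with namespaces the tag would read "{uri}node_name").

```python
import io
import xml.etree.ElementTree as ET


def iter_node_values(source):
    """Yield the text of each <node_value> while reading the file incrementally."""
    # "end" events fire once an element and all of its children are complete.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "node_name":
            value = elem.findtext("node_value")
            if value is not None:
                yield value
            elem.clear()  # drop the children we have already processed


# Hypothetical sample; a real run would pass the path of the large file instead.
SAMPLE = b"""<rootNode><childNode>
  <node_name><node_value>first</node_value></node_name>
  <node_name><node_value>second</node_value></node_name>
</childNode></rootNode>"""

values = list(iter_node_values(io.BytesIO(SAMPLE)))
```

The loop reads like ordinary sequential code, which is the main reason a pull reader tends to be easier to pick up than SAX handlers.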
--
Martin Honnen --- MVP XML
http://JavaScript.FAQTs.com/
PedroX
2005-06-26 16:37:29 UTC
Post by Martin Honnen
Do you need to use classic ASP? Or can you use ASP.NET? With .NET you
could use XmlTextReader. Like SAX, it does not read the complete XML
into memory, but XmlTextReader is easier to use than event-based SAX
parsing.
I could use ASP.NET, but I'd have to install it on my computer. I'll do that
and then try to figure out XmlTextReader and XML parsing in general. SAX
would involve a learning curve anyway, so I might as well get acquainted with
.NET, which I am not.
PedroX
2005-06-26 20:36:26 UTC
I can't believe that my ISP (read: SYMPATICO.CA), which is the biggest
communications monopoly in Canada, forces me half the time to resort to
google to post on usenet. They suck, they just suck, and they'll keep
on sucking forever. Anyway ... back to my XML issue....

I've installed .NET and searched for documentation that explains the use
of XmlTextReader. But I guess I cannot just skip the ABCs of ASP.NET. Most
of the scripts are in C#, and sometimes there's no indication of what
scripting language is used. Can someone post an equivalent of the code
below using ASP.NET and the XmlTextReader? I am really stuck, and I don't
have the time, the energy, or the brains to reinvent the wheel.
Thanks in advance.


'------- XML OBJECT ---------

Set objXMLDoc = Server.CreateObject("MSXML2.DOMDocument.4.0")
objXMLDoc.async = False
objXMLDoc.setProperty "SelectionLanguage", "XPath"
objXMLDoc.setProperty "SelectionNamespaces", " ... etc ..."

strURL = "D:\wwwroot\webdir\odp_split\CD_01\odp_content_002.xml"

'--- load XML (returns False if the file cannot be parsed)
bLoadResult = objXMLDoc.load(strURL)

'--- get the nodes I want to loop over

Set oNodes = objXMLDoc.selectNodes("//node_name")

For a = 1 To oNodes.length

'-- Get some data I need from each node ---
Set oNode = objXMLDoc.selectSingleNode("//node_name[" & a & "]/node_value")

'--- insert in DB. Commands for Connection to DB, etc are assumed
'--- (single quotes in the text are doubled so they don't break the SQL)

strSQL = "INSERT INTO MyTable (Field1) VALUES ('" & Replace(oNode.text, "'", "''") & "')"

oConn.Execute strSQL

Next
PedroX
2005-06-26 13:59:52 UTC
Post by PedroX
I need to parse some large XML files, and save the data in an Access DB. I
was using MSXML 2 and ASP, but it turns out to be extremely slow when the
I made a mistake. I am actually using MSXML 4.0.
Brian Staff
2005-06-27 02:49:07 UTC
Post by PedroX
Is there any executable that can do this parsing
faster than the way I was doing it?
Post by PedroX
objXMLDoc.selectNodes("//node_name")
I am not an expert on techniques of parsing, but if performance were a
problem for me, I would try and use as much explicit node naming as
possible...for instance I would maybe recode the above statement to be
something like this:

objXMLDoc.selectNodes("rootNode/childNode/node_name")

I know that if _I_ were the parser, I would be able to find those nodes in a
10 MB structure quicker using the second technique than using the first.
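
The two styles of query can be compared in a small experiment. The sketch below uses Python's standard-library ElementTree, whose path engine is not MSXML's, so the timings it prints say nothing definite about MSXML 4.0. The document is synthetic, and the element names other and detail are invented padding.

```python
import timeit
import xml.etree.ElementTree as ET


def build_sample(n_items=2000, n_noise=20):
    """Build rootNode/childNode/node_name/node_value plus unrelated padding."""
    root = ET.Element("rootNode")
    child = ET.SubElement(root, "childNode")
    for i in range(n_items):
        item = ET.SubElement(child, "node_name")
        ET.SubElement(item, "node_value").text = f"value {i}"
        # Unrelated nested elements that a descendant search still has to visit.
        extra = ET.SubElement(item, "other")
        for _ in range(n_noise):
            ET.SubElement(extra, "detail")
    return root


root = build_sample()
anywhere = root.findall(".//node_name")         # search all descendants
explicit = root.findall("childNode/node_name")  # follow one known path

# Both queries find the same elements; only the work done to find them differs.
same_nodes = anywhere == explicit

print("same nodes:", same_nodes)
print("descendant:", timeit.timeit(lambda: root.findall(".//node_name"), number=5))
print("explicit:  ", timeit.timeit(lambda: root.findall("childNode/node_name"), number=5))
```

The explicit path is relative to the root element here, so rootNode itself does not appear in it. In MSXML the path is evaluated from the document node and would start with rootNode, as in the statement above.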

JAT - Brian
PedroX
2005-06-27 15:06:41 UTC
Post by Brian Staff
I am not an expert on techniques of parsing, but if performance were a
problem for me, I would try and use as much explicit node naming as
possible...for instance I would maybe recode the above statement to be
objXMLDoc.selectNodes("rootNode/childNode/node_name")
WOW. That DID make a difference.
What was taking over an hour before now takes about 2 minutes!
Thank you !!!!!!!!!!!!!!!!
Bryce K. Nielsen
2005-06-27 16:46:23 UTC
Post by PedroX
WOW. That DID make a difference.
What was taking over an hour before now takes about 2 minutes!
Thank you !!!!!!!!!!!!!!!!
This will make a huge difference. Remember that with XPath, the //node_name
means that it will search *every* node in the entire document. If you make
it more specific, it will be a lot faster.
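
Because each `//` query rescans the document, issuing one per loop iteration, as the posted code does with "//node_name[" & a & "]/node_value", multiplies that cost by the number of nodes. Note also that in XPath `//node_name[n]` means every node_name that is the n-th such child of its own parent, which equals the n-th in the document only when they all share one parent. A sketch of the alternative, querying once and walking the result, using Python's standard-library ElementTree with an invented sample:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample using the placeholder element names from this thread.
SAMPLE = """<rootNode><childNode>
  <node_name><node_value>first</node_value></node_name>
  <node_name><node_value>second</node_value></node_name>
  <node_name><node_value>third</node_value></node_name>
</childNode></rootNode>"""

root = ET.fromstring(SAMPLE)

# Pattern from the thread: count the nodes, then run one positional query each.
count = len(root.findall("childNode/node_name"))
per_index = [
    root.find(f"childNode/node_name[{i}]/node_value").text
    for i in range(1, count + 1)  # XPath positions start at 1
]

# Alternative: run the query once and read each value relative to its node.
single_pass = [
    node.findtext("node_value")
    for node in root.findall("childNode/node_name")
]
```

Both lists hold the same values. The second version evaluates one query over the document instead of one per node. In MSXML the corresponding idea would be a For Each loop over the list returned by selectNodes.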

However, when dealing with 10 MB+ documents, you should really start using
SAX and not DOM. I was unaware that VBScript couldn't do SAX; since MSXML's
SAX parser is just a COM object, I figured you could (I've just never
tried). If you can't implement the interface, you could always create a COM
wrapper that does specifically what you need and call that from your ASP
page. I.e., using VB, create a COM object that takes an XML string, runs the
SAX parser over it, and does the inserts into Access from its event handlers.
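
The shape of such a component, a SAX handler that writes rows as it goes, can be sketched as follows. This is Python's standard library, not VB/COM, and an in-memory SQLite database stands in for Access. The table and field names MyTable and Field1 come from the code posted earlier in the thread, while the batch size and sample data are assumptions.

```python
import io
import sqlite3
import xml.sax


class InsertingHandler(xml.sax.ContentHandler):
    """Inserts the text of every <node_value> element into MyTable.Field1."""

    def __init__(self, conn, batch_size=500):
        super().__init__()
        self.conn = conn
        self.batch_size = batch_size  # assumed value; tune for the real database
        self._rows = []
        self._buffer = None

    def startElement(self, name, attrs):
        if name == "node_value":
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "node_value" and self._buffer is not None:
            self._rows.append(("".join(self._buffer),))
            self._buffer = None
            if len(self._rows) >= self.batch_size:
                self.flush()

    def endDocument(self):
        self.flush()  # write whatever is left in the last partial batch

    def flush(self):
        if self._rows:
            # Placeholders keep quotes in the data from breaking the SQL.
            self.conn.executemany(
                "INSERT INTO MyTable (Field1) VALUES (?)", self._rows
            )
            self.conn.commit()
            self._rows = []


def load_into_db(source, conn):
    """Stream a file path or file-like object into the database."""
    xml.sax.parse(source, InsertingHandler(conn))


# Hypothetical sample; a real run would pass the path of the large file instead.
SAMPLE = b"""<rootNode><childNode>
  <node_name><node_value>first</node_value></node_name>
  <node_name><node_value>O'Brien</node_value></node_name>
</childNode></rootNode>"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (Field1 TEXT)")
load_into_db(io.BytesIO(SAMPLE), conn)
rows = [r[0] for r in conn.execute("SELECT Field1 FROM MyTable ORDER BY rowid")]
```

Writing rows in batches and committing once per batch, rather than once per row, is a separate saving from the parsing itself and usually worth having for a bulk load.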

But the point is, at 10 MB+, stay away from DOM and use SAX...

Bryce K. Nielsen
SysOnyx, Inc. (www.sysonyx.com)
Makers of xmlDig, the XML-SQL Extractor
http://www.sysonyx.com/products/xmldig

P.S. Why did you cross-post this? I typically find better results when I
post messages to one board at a time...
PedroX
2005-06-27 20:43:36 UTC
Post by Bryce K. Nielsen
But the point is, 10mb+, stay away from DOM, use SAX...
I wanted to, but the whole thing (including the alternative, .NET's
XmlTextReader) is just beyond my comprehension. I found no tutorials that I
could understand.
I know VBScript, JavaScript / JScript, and that's pretty much it.
No Java, no C, no Visual Basic per se (although it is similar to VBScript).
Peter Flynn
2005-10-02 20:41:39 UTC
Post by PedroX
Post by Bryce K. Nielsen
But the point is, 10mb+, stay away from DOM, use SAX...
I wanted to, but the whole thing (including the alternative, .NET's
XmlTextReader) is just beyond my comprehension.
There's a brief explanation in the XML FAQ at the end of the question
on data/document XML at http://xml.silmaril.ie/developers/docdata/
Post by PedroX
I found no tutorials that I could understand.
I know VBScript, JavaScript / JScript, and that's pretty much it.
No Java, no C, no Visual Basic per se (although it is similar to VBScript).
I'm afraid those are all programming languages, and they won't help
very much with a markup language.

///Peter

Brian Staff
2005-06-27 20:27:25 UTC
Post by PedroX
WOW. That DID make a difference.
What was taking over an hour before now takes about 2 minutes!
Well, it was a bit of a guess on my part<g> - but it is encouraging to know
that explicit XPath naming really does make a difference.

Brian
Bryce K. Nielsen
2005-06-27 23:38:53 UTC
Post by Brian Staff
Well, it was a bit of a guess on my part<g> - but it is encouraging to know
that explicit XPath naming really does make a difference.
Yeah, it will. The double slash is like a wildcard: search *every* node for
this XPath. If you use an explicit path, it knows to look in only one area.
Also don't forget that the result set of a wildcard search could be large,
whereas an explicit one will probably only return the one node...

-BKN