Written by

Software Developer at J2 Interactive

Question Rizmaan Marikar · Sep 8, 2019

Efficient way to convert a large xml file to csv

#Caché #CSV #Performance #XML #ZEN

Hi,

I have a large XML zen report file that I want to convert to a CSV, what is the most efficient way to do it.

XML file:

<Details>
<Col1> aa1 </Col1>
<Col2> bb1</Col2>
<Col3> cc1 </Col3>
</Details>
<Details>
<Col1> aa2 </Col1>
<Col2> bb2</Col2>
<Col3> cc3 </Col3>
</Details>

Expected Output:

Col1,Col2,col3
aa1,bb1,bb1
aa2,bb2,bb2

Thank you.

Discussion (12)0

Add reply

Comments

Julius Kavay · Sep 8, 2019

It will depend on the structure of your xml file (next time a short sample would be helpfull).

Usually, you can do it with the help of %XLM.TextReader / %XML.Reader class.

0 0

Rizmaan Marikar Sep 9, 2019 to Julius Kavay

i have created a sample and expected output

0 0

Marc Mundt Jan 28, 2020 to Marc Mundt

Apache Tika is another option. Without writing any code, it can be run from the command-line and output an XLSX as a tab-separated file.

https://tika.apache.org/

java -jar tika-app-1.23.jar -t sample.xlsx > sample.tsv

0 0

Jonathan Levinson Sep 9, 2019 to Alexander Koblov

I think most XSLT implementations read the whole XML input file into memory before beginning to process it. You need to test if an XSLT approach works with files of the size with which you deal. Of course, you also need to test how files of a particular size work in the particular machine environment which is running the application. There is not a one size fits all solution for parsing XML files using XSLT when memory issues are a concern. I believe ISC supports a SAX parser. SAX parsers put less burdens on memory than XLST.

SAX is described here: https://docs.oracle.com/javase/tutorial/jaxp/intro/simple.html

0 0

Rizmaan Marikar Sep 10, 2019 to Alexander Koblov

thank you for the solution provided, i will definitely try this, the challenge is that in the XML file is generated, will always not follow the same amount of columns, so i need to generate the xdata stylesheet on runtime.

0 0

Julius Kavay Sep 10, 2019 to Rizmaan Marikar

The intention behind my post was, to give you one idea (of many other possibilities), how to convert XML to CSV. A empty (chars) element is just one of some other unhandled cases (missing tags, other tags inside of COLx tag, etc.).

If you need some speed and your XML-File obeys following constraints:

- the file is a correct XML-File
- contains only the 'Data', 'Details' and 'ColX' tags
- no selfclosing tags, like <sometag/>

then you could try the QAD-way (quick-and-dirty) of conversion.

Again, below an example routine (without excessive testing).
All ISC people and ordinary programer, please look the other way ;-))

Convert() Public
{
   set len=32000 // chunk size, a safe value is:
                 // 32767 - longestColumnValue - tagSizes   set fi="c:\temp\example-t.xml" // inp file (xml)   set fo="c:\temp\example-t.csv" // output file (csv)   open fi:"ru":1 open:$t fo:"nw":1   if '$t close fi quit   set xml=$$inp(fi,len) // first xml-chunk   set i=0, p=0   while 1 {      set i=$locate(xml,"\<\/?\w+\>",i,j,v) // next (opening or closing) tag      if i {                                            // let see, what we got          if v="<Details>" { set row="", p=-1 // start a new row         } elseif v="</Details>" { d out(fo,row) s p=0 // complete, write out         } elseif p,v["<Col" { s p=j, o=$zstrip(v,"*AP") // new column, keep start         } elseif p,v["</Col" {s $li(row,o)=$$val($e(xml,p,i-1)) // get value         }                                   // everything else is don't care         set i=j      } else {         set tmp=$$inp(fi,len) // next xml-chunk         if tmp="" quit                                   // done         // remove processed data, add new one         if p>0 { set xml=$e(xml,p,*)_tmp,p=1,i=0 } else { s xml=$e(xml,i,*)_tmp,i=0 }      }   }   close fi   close fo
}

val(val){   quit $zstrip(val,"<>w") // add handling of charcter entities like &lt; etc.}out(fo,row){   use fo
   write $listtostring(row,";",1),! // delimiter!}inp(fn,len){   use fn
   try { read xml#len } catch { s xml="" } // in case, $zeof-mode is off   quit xml}

The above converter reads a simple test XML-file with two million 'ColX' items in 5 seconds and creates a CSV file with 100000 rows and 20 columns (in each row).

Instead of file you can also use an stream.

0 0

Marc Mundt Jan 28, 2020 to Aaron Vail

You can use the same approaches that were suggested above, though you'll need to spend time figuring out the structure of the .xslx file so you can write your extraction logic.

To save the time/effort, the Apache POI project provides a Java library that can read/write MS Office formats, including XLSX:
https://poi.apache.org/
https://kalliphant.com/poi-convert-xlsx-to-csv-example/

0 0

Eduard Lebedyuk Jan 29, 2020 to Aaron Vail

Check

Apache POI would be faster and more specialized, LibreOffice is the best bet on converting arbitrary documents between arbitrary formats.

0 0

Julius Kavay Sep 9, 2019 to Rizmaan Marikar

If you "donate" your XML-File a version info and an extra root-node:

<?xml version="1.0" encoding="UTF-8" ?>
<Data>
  <Details>
   ...
  </Details>
</Data>

and use, for example, the %XML.Textreader class (see belov).

Then with few lines of code the job is done:

XML ; XML to CSV
#define ROOT "Data"
#define DEL    ";"        ##; your CSV-delimiter, $c(9) or ";" or ...
#;
   set inpFile="c:\temp\example.xml"
   set outFile="c:\temp\example.csv"
   if ##class(%XML.TextReader).ParseFile(inpFile, .rdr) {
      if rdr.Read(), rdr.NodeType="element", rdr.Name=$$$ROOT {
         open outFile:"nw":1
         if $t {
            use outFile
            while rdr.Read() {
               if rdr.NodeType="element",rdr.Name="Details" {
                  set line=""
               } elseif rdr.NodeType="chars" {
                 set line=line_$lb(rdr.Value)
               } elseif rdr.NodeType="endelement",rdr.Name="Details" {
                 w $lts(line,$$$DEL),!
               } elseif rdr.NodeType="endelement",rdr.Name=$$$ROOT {
                  close outFile
                  quit
               }
            }
         } else { w "file open problem",! }
      } else { w "XML root-element problem",! }
   } else { w "XML structure problem",! }

0 0

Alexander Koblov · Sep 9, 2019

I'm not sure for how much this is efficient, but you can use XSLT to do the transformation.

Class CMT.XmlToCsv [ Abstract ]
{

ClassMethod transform(
	infile As %String,
	outfile As %String) As %Status
{
	Set tXSL=##class(%Dictionary.CompiledXData).%OpenId(..%ClassName(1)_"||XmlToCsv").Data
	Set tSC=##class(%XML.XSLT.CompiledStyleSheet).CreateFromStream(tXSL,.tCompiledStyleSheet)
    If $$$ISERR(tSC) Quit tSC
	quit ##class(%XML.XSLT.Transformer).TransformFileWithCompiledXSL(infile,tCompiledStyleSheet,outfile)
}

XData XmlToCsv
{
<?xml version="1.0"?>

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text" />
    <xsl:variable name="separator" select="','" />
    <xsl:variable name="newline" select="'&#10;'" />

    <xsl:template match="/">
        <xsl:text>Col1,Col2,Col3</xsl:text>
        <xsl:value-of select="$newline" />
        <xsl:for-each select="//Details">
            <xsl:value-of select="Col1" />
            <xsl:value-of select="$separator" />
            <xsl:value-of select="Col2" />
            <xsl:value-of select="$separator" />
            <xsl:value-of select="Col3" />
            <xsl:value-of select="$newline" />
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
}

}

And then call it from terminal:

set p=##class(CMT.XmlToCsv).transform("c:\temp\input.xml","c:\temp\output.txt") 
zw p

I took XSLT from https://stackoverflow.com/a/46176699/82675

0 0

Rizmaan Marikar · Sep 10, 2019

Thank you, i tried your solution, one issue that i noticed is;

Suppose if <Col1> </Col1> is empty, with the above solution

rdr.NodeType="chars"

will not evaluate, and the CSV file will be wrong.

tweaked it as follows;

#define DEL ","  
 #define DC """"if ##class(%XML.TextReader).ParseFile(myfile, .rdr) {
      if rdr.Read(), rdr.NodeType="element", rdr.Name=ROOT {
         if $t {
            set (nodeType,nodeName,line,header,childElementValue) = ""
            set cnt=1
            while rdr.Read() {
           set nodeType = rdr.NodeType, nodeName = rdr.Name, length = $length(rdr.Path,"/")
     
               if nodeType="element",nodeName=NAME {
                  set line=""
               }
               elseif cnt=1,nodeType="element",(length>3){
               set header = header_$$$DC_nodeName_$$$DC_$$$DEL
               } 
               elseif ((nodeType="chars") & (length>3)) {
               set childElementValue=rdr.Value
               } 
               elseif nodeType="endelement",length=4 {
               set line=line_$$$DC_childElementValue_$$$DC_$$$DEL,childElementValue=""
               }
               elseif nodeType="endelement",nodeName=NAME {
               if cnt=1 {
               do stream.WriteLine(header)
               set cnt=0
               }
                   do stream.WriteLine(line)
               } 
               elseif nodeType="endelement",nodeName=ROOT {
                  quit
               }
               
            }
         } else { w "file open problem",! }
      } else { w "XML root-element problem",! }
   } else { w "XML structure problem",! }

Although when it comes to a large file ParseFile method takes a significant amount of time to process the file, and i want to improve the code to run faster.

for example calling stream.WriteLine for each line can slow down the process, i want to try and batch the data and send to the stream to make the process much faster, and there is a concern for the <MAXLENGH> error.

Any Ideas?

0 0

Aaron Vail · Jan 28, 2020

What if my XML is an Excel file? Here is an example section. Is it even possible to read this in so I can create a CSV file as output?

<?xml version="1.0" encoding="UTF-8"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:x2="http://schemas.microsoft.com/office/excel/2003/xml" xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882" xmlns:rs="urn:schemas-microsoft-com:rowset" xmlns:s="uuid:BDC6E3F0-6DA3-11d1-A2A3-00AA00C14882" xmlns:z="#RowsetSchema" xmlns:udc="http://schemas.microsoft.com/data/udc" xmlns:udcs="http://schemas.microsoft.com/data/udc/soap" xmlns:udcxf="http://schemas.microsoft.com/data/udc/xmfile" xmlns:c="urn:schemas-microsoft-com:office:component:spreadsheet" xmlns:html="http://www.w3.org/TR/REC-html40">
   <OfficeDocumentSettings xmlns="urn:schemas-microsoft-com:office:office">
       <DownloadComponents/>
   </OfficeDocumentSettings>
   <ss:Styles>
       <ss:Style ss:ID="BrdrLftWrp">
           <ss:Alignment ss:Vertical="Top" ss:WrapText="1"/>
           <ss:Borders>
               <ss:Border ss:Position="Left" ss:LineStyle="Continuous" ss:Weight="1"/>
           </ss:Borders>
           <ss:NumberFormat ss:Format="@"/>
       </ss:Style>

.....

0 0