Efficient way to convert a large xml file to csv
Hi,
I have a large XML zen report file that I want to convert to a CSV, what is the most efficient way to do it.
XML file:
<Details> <Col1> aa1 </Col1> <Col2> bb1</Col2> <Col3> cc1 </Col3> </Details> <Details> <Col1> aa2 </Col1> <Col2> bb2</Col2> <Col3> cc3 </Col3> </Details>
Expected Output:
Col1,Col2,col3 aa1,bb1,bb1 aa2,bb2,bb2
Thank you.
Comments
It will depend on the structure of your xml file (next time a short sample would be helpfull).
Usually, you can do it with the help of %XLM.TextReader / %XML.Reader class.
i have created a sample and expected output
Apache Tika is another option. Without writing any code, it can be run from the command-line and output an XLSX as a tab-separated file.
java -jar tika-app-1.23.jar -t sample.xlsx > sample.tsv
I think most XSLT implementations read the whole XML input file into memory before beginning to process it. You need to test if an XSLT approach works with files of the size with which you deal. Of course, you also need to test how files of a particular size work in the particular machine environment which is running the application. There is not a one size fits all solution for parsing XML files using XSLT when memory issues are a concern. I believe ISC supports a SAX parser. SAX parsers put less burdens on memory than XLST.
SAX is described here: https://docs.oracle.com/javase/tutorial/jaxp/intro/simple.html
thank you for the solution provided, i will definitely try this, the challenge is that in the XML file is generated, will always not follow the same amount of columns, so i need to generate the xdata stylesheet on runtime.
The intention behind my post was, to give you one idea (of many other possibilities), how to convert XML to CSV. A empty (chars) element is just one of some other unhandled cases (missing tags, other tags inside of COLx tag, etc.).
If you need some speed and your XML-File obeys following constraints:
- the file is a correct XML-File
- contains only the 'Data', 'Details' and 'ColX' tags
- no selfclosing tags, like <sometag/>
then you could try the QAD-way (quick-and-dirty) of conversion.
Again, below an example routine (without excessive testing).
All ISC people and ordinary programer, please look the other way ;-))
Convert() Public
{
set len=32000 // chunk size, a safe value is:
// 32767 - longestColumnValue - tagSizes set fi="c:\temp\example-t.xml" // inp file (xml) set fo="c:\temp\example-t.csv" // output file (csv) open fi:"ru":1 open:$t fo:"nw":1 if '$t close fi quit set xml=$$inp(fi,len) // first xml-chunk set i=0, p=0 while 1 { set i=$locate(xml,"\<\/?\w+\>",i,j,v) // next (opening or closing) tag if i { // let see, what we got if v="<Details>" { set row="", p=-1 // start a new row } elseif v="</Details>" { d out(fo,row) s p=0 // complete, write out } elseif p,v["<Col" { s p=j, o=$zstrip(v,"*AP") // new column, keep start } elseif p,v["</Col" {s $li(row,o)=$$val($e(xml,p,i-1)) // get value } // everything else is don't care set i=j } else { set tmp=$$inp(fi,len) // next xml-chunk if tmp="" quit // done // remove processed data, add new one if p>0 { set xml=$e(xml,p,*)_tmp,p=1,i=0 } else { s xml=$e(xml,i,*)_tmp,i=0 } } } close fi close fo
}
val(val){ quit $zstrip(val,"<>w") // add handling of charcter entities like < etc.}out(fo,row){ use fo
write $listtostring(row,";",1),! // delimiter!}inp(fn,len){ use fn
try { read xml#len } catch { s xml="" } // in case, $zeof-mode is off quit xml}The above converter reads a simple test XML-file with two million 'ColX' items in 5 seconds and creates a CSV file with 100000 rows and 20 columns (in each row).
Instead of file you can also use an stream.
You can use the same approaches that were suggested above, though you'll need to spend time figuring out the structure of the .xslx file so you can write your extraction logic.
To save the time/effort, the Apache POI project provides a Java library that can read/write MS Office formats, including XLSX:
https://poi.apache.org/
https://kalliphant.com/poi-convert-xlsx-to-csv-example/
Check
Apache POI would be faster and more specialized, LibreOffice is the best bet on converting arbitrary documents between arbitrary formats.
If you "donate" your XML-File a version info and an extra root-node:
<?xml version="1.0" encoding="UTF-8" ?> <Data> <Details> ... </Details> </Data>
and use, for example, the %XML.Textreader class (see belov).
Then with few lines of code the job is done:
XML ; XML to CSV
#define ROOT "Data"
#define DEL ";" ##; your CSV-delimiter, $c(9) or ";" or ...
#;
set inpFile="c:\temp\example.xml"
set outFile="c:\temp\example.csv"
if ##class(%XML.TextReader).ParseFile(inpFile, .rdr) {
if rdr.Read(), rdr.NodeType="element", rdr.Name=$$$ROOT {
open outFile:"nw":1
if $t {
use outFile
while rdr.Read() {
if rdr.NodeType="element",rdr.Name="Details" {
set line=""
} elseif rdr.NodeType="chars" {
set line=line_$lb(rdr.Value)
} elseif rdr.NodeType="endelement",rdr.Name="Details" {
w $lts(line,$$$DEL),!
} elseif rdr.NodeType="endelement",rdr.Name=$$$ROOT {
close outFile
quit
}
}
} else { w "file open problem",! }
} else { w "XML root-element problem",! }
} else { w "XML structure problem",! }
I'm not sure for how much this is efficient, but you can use XSLT to do the transformation.
Class CMT.XmlToCsv [ Abstract ]
{
ClassMethod transform(
infile As %String,
outfile As %String) As %Status
{
Set tXSL=##class(%Dictionary.CompiledXData).%OpenId(..%ClassName(1)_"||XmlToCsv").Data
Set tSC=##class(%XML.XSLT.CompiledStyleSheet).CreateFromStream(tXSL,.tCompiledStyleSheet)
If $$$ISERR(tSC) Quit tSC
quit ##class(%XML.XSLT.Transformer).TransformFileWithCompiledXSL(infile,tCompiledStyleSheet,outfile)
}
XData XmlToCsv
{
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" />
<xsl:variable name="separator" select="','" />
<xsl:variable name="newline" select="' '" />
<xsl:template match="/">
<xsl:text>Col1,Col2,Col3</xsl:text>
<xsl:value-of select="$newline" />
<xsl:for-each select="//Details">
<xsl:value-of select="Col1" />
<xsl:value-of select="$separator" />
<xsl:value-of select="Col2" />
<xsl:value-of select="$separator" />
<xsl:value-of select="Col3" />
<xsl:value-of select="$newline" />
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
}
}
And then call it from terminal:
set p=##class(CMT.XmlToCsv).transform("c:\temp\input.xml","c:\temp\output.txt")
zw p
I took XSLT from https://stackoverflow.com/a/46176699/82675
Thank you, i tried your solution, one issue that i noticed is;
Suppose if <Col1> </Col1> is empty, with the above solution
rdr.NodeType="chars"
will not evaluate, and the CSV file will be wrong.
tweaked it as follows;
#define DEL ","
#define DC """"if ##class(%XML.TextReader).ParseFile(myfile, .rdr) {
if rdr.Read(), rdr.NodeType="element", rdr.Name=ROOT {
if $t {
set (nodeType,nodeName,line,header,childElementValue) = ""
set cnt=1
while rdr.Read() {
set nodeType = rdr.NodeType, nodeName = rdr.Name, length = $length(rdr.Path,"/")
if nodeType="element",nodeName=NAME {
set line=""
}
elseif cnt=1,nodeType="element",(length>3){
set header = header_$$$DC_nodeName_$$$DC_$$$DEL
}
elseif ((nodeType="chars") & (length>3)) {
set childElementValue=rdr.Value
}
elseif nodeType="endelement",length=4 {
set line=line_$$$DC_childElementValue_$$$DC_$$$DEL,childElementValue=""
}
elseif nodeType="endelement",nodeName=NAME {
if cnt=1 {
do stream.WriteLine(header)
set cnt=0
}
do stream.WriteLine(line)
}
elseif nodeType="endelement",nodeName=ROOT {
quit
}
}
} else { w "file open problem",! }
} else { w "XML root-element problem",! }
} else { w "XML structure problem",! }Although when it comes to a large file ParseFile method takes a significant amount of time to process the file, and i want to improve the code to run faster.
for example calling stream.WriteLine for each line can slow down the process, i want to try and batch the data and send to the stream to make the process much faster, and there is a concern for the <MAXLENGH> error.
Any Ideas?
What if my XML is an Excel file? Here is an example section. Is it even possible to read this in so I can create a CSV file as output?
<?xml version="1.0" encoding="UTF-8"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:x2="http://schemas.microsoft.com/office/excel/2003/xml" xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882" xmlns:rs="urn:schemas-microsoft-com:rowset" xmlns:s="uuid:BDC6E3F0-6DA3-11d1-A2A3-00AA00C14882" xmlns:z="#RowsetSchema" xmlns:udc="http://schemas.microsoft.com/data/udc" xmlns:udcs="http://schemas.microsoft.com/data/udc/soap" xmlns:udcxf="http://schemas.microsoft.com/data/udc/xmfile" xmlns:c="urn:schemas-microsoft-com:office:component:spreadsheet" xmlns:html="http://www.w3.org/TR/REC-html40">
<OfficeDocumentSettings xmlns="urn:schemas-microsoft-com:office:office">
<DownloadComponents/>
</OfficeDocumentSettings>
<ss:Styles>
<ss:Style ss:ID="BrdrLftWrp">
<ss:Alignment ss:Vertical="Top" ss:WrapText="1"/>
<ss:Borders>
<ss:Border ss:Position="Left" ss:LineStyle="Continuous" ss:Weight="1"/>
</ss:Borders>
<ss:NumberFormat ss:Format="@"/>
</ss:Style>
.....