Zstrip to clean string
I'm extracting text from HTML (more on how here), and the extracted text has two problems:
- Lots of $c(10) control characters
- Runs of multiple whitespace characters
Here's an example of the text extracted from an HTML page:
set text = " "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" Word1"_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" Word2 "_$c(10)_"Word3 "_$c(10,10,10,10)_" "_$c(10)_" "_$c(10)_" © 2017 "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)
I want to remove the control characters and the repeated whitespace from this string, and there's the $zstrip function for that:
write $zstrip($zstrip(text, "*C"), "<=>P")
>Word1 Word2 Word3 © 2017
But that requires calling $zstrip twice. Is there any way to remove both the control characters and the repeated whitespace with a single $zstrip call?
Comments
I think your best bet is:
write $zstrip($zstrip(text,"*c"),"<=>w")
because you want to remove ALL (i.e. *) control characters but only SOME (i.e. <=>: leading, trailing and repeating) whitespace characters.
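To make the two action codes easier to see, here is a minimal sketch that runs the two calls as separate steps. The shortened sample string and the variable names noControl and clean are only for illustration, they are not from the original post.

```objectscript
 // Shortened, hypothetical sample: blanks mixed with $c(10) line feeds
 set text = "  "_$c(10)_" Word1 "_$c(10)_"  Word2  "_$c(10)
 // Step 1: "*C" removes ALL control characters, wherever they appear
 set noControl = $zstrip(text, "*C")
 // Step 2: "<=>W" removes leading and trailing whitespace and
 // reduces repeating whitespace to a single occurrence
 set clean = $zstrip(noControl, "<=>W")
 write clean   // Word1 Word2
```

The order matters here: the line feeds have to go first, so that the blanks around them become one continuous run that "=" can collapse.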
You could try something like:
set chars=$c(0,1,2,3,....31, 32 /* blank */, ...<maybe other control chars, above 128>)
write $tr(text,chars)
hth
Just a quick note: the last sample with $tr removes _all_ listed chars, so all blanks (rather than just leading, trailing and repeating ones) would be removed as well.
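If you want to stay with $tr, one possible variation (a sketch, not something tested against the original data) is to translate the control characters into blanks instead of deleting them, and then let $zstrip collapse the blanks. That also keeps words apart when they are separated only by a line feed. The sample string and the names chars, blanks and clean are hypothetical, and the sketch assumes only characters 0 through 31 need handling.

```objectscript
 // Hypothetical sample: Word2 and Word3 are separated only by a line feed
 set text = "Word2"_$c(10)_"Word3  "_$c(10,10)_" end"
 // Build the list of control characters 0..31 (assumption: no others occur)
 set chars = ""
 for i=0:1:31 { set chars = chars_$c(i) }
 // A string of blanks of the same length, used as the replacement list
 set blanks = $justify("", $length(chars))
 // Replace each control character with a blank, then trim and collapse blanks
 set clean = $zstrip($translate(text, chars, blanks), "<=>W")
 write clean   // Word2 Word3 end
```

Note this is still two function calls, so it does not answer the "single $zstrip" question. It only avoids gluing words together, which deleting the line feed would do here ("Word2Word3 end").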