Question Eduard Lebedyuk · Oct 30, 2017

Zstrip to clean string

I'm extracting text from HTML (more on how - here), and the extracted text has two problems:

  • Lots of $c(10) control characters
  • Runs of multiple whitespace characters

Here's an example of the text extracted from an HTML page:

set text = " "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" Word1"_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" Word2 "_$c(10)_"Word3 "_$c(10,10,10,10)_" "_$c(10)_" "_$c(10)_" © 2017 "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)_" "_$c(10)

I want to remove the control characters and the repeated whitespace from this string, and there's the $zstrip function for that:

write $zstrip($zstrip(text, "*C"), "<=>P")
>Word1 Word2 Word3 © 2017

But I need to use $zstrip twice. Is there any way to remove control characters and multiple whitespaces using one $zstrip?

Comments

Julius Kavay · Oct 30, 2017

I think your best bet is:

write $zstrip($zstrip(text,"*c"),"<=>w")

because you want to remove ALL (i.e.: *) control chars but only SOME (i.e.: <=>) whitespaces.
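To make the difference between the action codes concrete, here is a small sketch. The sample strings are made up for illustration and are not from the original post; the square brackets are only there to make leading and trailing blanks visible.

```objectscript
// Hypothetical sample values, chosen only to illustrate the action codes
set s = "  a  b  "
set t = "a"_$c(10)_"b"

// "*" removes every character that matches the mask
write "[", $zstrip(s, "*W"), "]", !      // [ab]
write "[", $zstrip(t, "*C"), "]", !      // [ab]

// "<", "=" and ">" remove only leading, repeating and trailing matches
write "[", $zstrip(s, "<=>W"), "]", !    // [a b]
```

Because the action codes apply to the whole mask of a single call, "remove all control chars" and "remove only some whitespace" end up in two separate calls.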

You could try something like:

set chars=$c(0,1,2,3,....31, 32 /* blank */, ...<maybe other control chars, above 128>)

write $tr(text,chars)

hth

Alexey Maslov · Oct 31, 2017 to Julius Kavay

Just a quick note: the last sample with $tr removes _all_ listed chars, so all blanks (not just the leading, trailing and repeating ones) would be removed as well.
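One way to act on this note, as a rough sketch: leave the blank out of the $tr list and let $zstrip handle the whitespace. The variable names chars and clean are illustrative, and the list is assumed to cover only $c(0) through $c(31); any control characters above 127 would have to be added separately.

```objectscript
// Assumption: only $c(0)..$c(31) are treated as control characters here;
// the blank ($c(32)) is deliberately NOT added to the list.
set chars = ""
for i=0:1:31 { set chars = chars_$char(i) }

// Two-argument $translate deletes every character listed in chars
set clean = $translate(text, chars)

// Strip leading, trailing and repeating whitespace from what is left
write $zstrip(clean, "<=>W")
```

This is still two passes over the string, so it does not answer the original question with a single call; it only keeps the single blanks between words that a $tr list containing $c(32) would delete.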
