How To: scripting: bash internals vs. sed | cut | awk (copy from #!)

iMBeCil · 2017-08-28 08:54:24

[This is a copy of the HowTo on the Crunchbang forum; I occasionally need it, so I figure others might as well; this version contains minor improvements suggested by other users. @mods: feel free to delete/(re)move this topic if you feel it doesn't belong here.]

1) Introduction
2) Examples
2.1) 'sed+cut'-way
2.2) 'bash-only'-way
3) Benchmarks
3.1) Test script
3.2) Benchmark results
4) Final words

1. Introduction
What?
Bash scripts often need to extract information from a string variable (or from command output) like:

variable="   something   somethingelse XXXX-YYYY-ZZ  0xab2345   whatever"
#         ^^^ note spaces at the beginning of the string

Needing this myself, and wandering around the internet for solutions, I noticed that most of them rely heavily on some combination of 'sed', 'cut' and/or 'awk'. Basically, one 'precooks' the string with 'sed', and then uses 'cut' or 'awk' to extract the required part of the string.

Why?
Being a somewhat stubborn (I don't want to use python/ruby/perl...) purist (why call other tools like 'sed' when 'bash' itself might be powerful enough?), I stumbled across the so-called 'surprising number of string manipulation operations' and 'arrays', both internal to bash, and discovered that most of this stuff can be done with them in place of 'sed', 'cut' and 'awk'. Since most Google searches point to the sed/cut/awk tools, I thought it might be good to promote the bash-internal approach with a few examples.

How?
It can be done surprisingly easily and, IMHO, with cleaner syntax than 'sed', 'cut' and 'awk'. There are some drawbacks, most notably that certain one-liners are impossible to construct (because of the way arrays and string manipulation work in bash), but my overall impression is that 'arrays' and 'string manipulation operations' make scripts more readable.
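To make this concrete for the sample string from the beginning of this post, here is a minimal sketch using only an array and two string operations. The names 'fields', 'code', 'hex' and 'first' are made up for illustration, and I assume the fields are separated by whitespace only:

```bash
#!/bin/bash
# Sample string from the introduction (note the leading spaces).
variable="   something   somethingelse XXXX-YYYY-ZZ  0xab2345   whatever"

# Split on whitespace into an array; leading and repeated spaces are ignored.
read -r -a fields <<< "$variable"

code=${fields[2]}        # third field
hex=${fields[3]#0x}      # fourth field with the leading "0x" removed
first=${code%%-*}        # everything before the first "-"

echo "code=$code hex=$hex first=$first"
```

No external program is started here; the splitting is done by 'read' and the trimming by the '#' and '%%' operators.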

Stay tuned for the next posts with examples ...

Last edited by iMBeCil (2017-08-28 09:12:56)

iMBeCil · 2017-08-28 08:55:19

2. Examples
OK, here is a very simple example. I want to extract all four numbers into separate variables from a string like:

1680x1050+2880+23

Of course, you recognize this as a typical 'geometry'-like string (WIDTHxHEIGHT+X+Y). You can get a lot of those from 'xrandr', for example.

2.1 'sed+cut'-way
Here is a simple script for doing it via 'sed':

# example-SED
# define string
xrandroutput="1680x1050+2880+23"

# define TAB ('\t') character; needed for 'sed'
# and convenient for 'cut'
TAB=$(echo -e "\t")

# replace 'x' with '\t'
array=`echo "$xrandroutput" | sed "s/x/$TAB/"`

# replace '+' with '\t'
array=`echo -e "$array" | sed "s/+/$TAB/g"`

# store values (geometry is WIDTHxHEIGHT+X+Y)
W=`echo -e "$array" | cut -f 1`
H=`echo -e "$array" | cut -f 2`
X=`echo -e "$array" | cut -f 3`
Y=`echo -e "$array" | cut -f 4`

Hackers will shout: 'Why two seds?' And they will be right; it can be a little shorter:

# example-SED-SINGLE
# define string
xrandroutput="1680x1050+2880+23"

# define TAB ('\t') character; needed for 'sed'
# and convenient for 'cut'
TAB=$(echo -e "\t")

# replace 'x' and '+' with '\t'
array=`echo "$xrandroutput" | sed "s/[x+]/$TAB/g"`

# store values (geometry is WIDTHxHEIGHT+X+Y)
W=`echo -e "$array" | cut -f 1`
H=`echo -e "$array" | cut -f 2`
X=`echo -e "$array" | cut -f 3`
Y=`echo -e "$array" | cut -f 4`

That's it ... that's, more or less, how I saw people do it. It can probably be optimized further, but this is the gist of it.
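As one possible shortening (a sketch, not necessarily how people usually write it), a single 'sed' or a single 'awk' call can be combined with the shell's 'read' to fill all four variables at once. I assume the usual WIDTHxHEIGHT+X+Y order here; the names W2, H2, X2 and Y2 exist only to keep the two variants apart:

```bash
#!/bin/bash
xrandroutput="1680x1050+2880+23"

# Variant 1: one sed call turns both separators into spaces,
# then 'read' splits the result into four variables.
read -r W H X Y <<< "$(sed 's/[x+]/ /g' <<< "$xrandroutput")"
echo "sed: W=$W H=$H X=$X Y=$Y"

# Variant 2: one awk call with a field separator matching 'x' or '+'.
read -r W2 H2 X2 Y2 <<< "$(awk -F'[x+]' '{ print $1, $2, $3, $4 }' <<< "$xrandroutput")"
echo "awk: W=$W2 H=$H2 X=$X2 Y=$Y2"
```

Each variant starts one external program instead of several, although it still relies on bash for 'read' and the here-strings.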

iMBeCil · 2017-08-28 08:56:38

2.2 'bash-only'-way
Here is the promised example using bash internal commands only:

# example-ARRAY
# define string
xrandroutput="1680x1050+2880+23"

# replace all 'x' with space " ", using powerful
# bash internal string search-replace pattern:
# ${string//substring/replacement},
# and store result in variable called 'array' 
# (not yet of array type)
array=${xrandroutput//x/" "}      # "1680 1050+2880+23"

# replace all '+' with space " ", and store result as
# an array in variable called 'array' using '(' and ')'
# note: parentheses will honor space as delimiter
# and make an array of values
array=( ${array//+/" "} )     # ( "1680" "1050" "2880" "23" )

# print data
echo "array[0] = ${array[0]}"
echo "array[1] = ${array[1]}"
echo "array[2] = ${array[2]}"
echo "array[3] = ${array[3]}"

Of course, the two replacement patterns can be combined into a single one:

# example-ARRAY-SINGLE
...
# replace all 'x' and '+' with space " "
# and store result in array variable called 'array'
array=( ${xrandroutput//[x+]/" "} )
...

Isn't it simpler and cleaner? Not to mention that we also have all the data in a single (array) variable, which is sometimes convenient, for example to reduce namespace clutter.
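A related bash-only variant, sketched here as an alternative rather than a replacement, lets 'read' do the splitting by setting IFS for that one command. It assumes the string contains only digits plus the 'x' and '+' separators:

```bash
#!/bin/bash
xrandroutput="1680x1050+2880+23"

# IFS applies only to this one 'read' command; 'x' and '+' act as separators.
IFS='x+' read -r W H X Y <<< "$xrandroutput"
echo "W=$W H=$H X=$X Y=$Y"

# The same idea with an array; the quoted expansion means
# no accidental filename globbing can take place.
IFS='x+' read -r -a array <<< "$xrandroutput"
echo "array has ${#array[@]} elements: ${array[*]}"
```

Because the IFS assignment is attached to the 'read' command, the global IFS is left untouched afterwards.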

OK, writing this made me very hungry ... going to eat and drink a beer, and then I will do some benchmarking.

iMBeCil · 2017-08-28 08:58:12

3. Benchmarks
3.1 Test script
Is it worth using arrays and string manipulation with internal bash commands, or is it just one more way of doing things? One way to find out is to run some benchmarks and see how fast each solution is. First, I will explain how I did the benchmarks. Below is a skeleton of the benchmark script 'test'. The idea is to run the above examples a (large) number of times and time the execution.

#!/bin/bash
# 
# Usage: test [ITER]

iter=10000
# use the number of iterations from the first argument, if supplied
if [ -n "$1" ]
then    
    iter="$1"
fi

TAB=$(echo -e "\t")

# iterate
for i in `seq $iter`
do
    #
    # do stuff (replace ':' with one of the examples below;
    # bash does not accept an empty loop body)
    #
    :
done

and then we run this script from the command line as:

$ /usr/bin/time -f "\nReal: %E\nUser: %U\nSys: %S" ./test 100000

Inside the 'for' loop we put stuff like this (I removed most of the comments to make it shorter):

# example-SED
xrandroutput="1680x1050+2880+23"
array=`echo "$xrandroutput" | sed "s/x/$TAB/"`
array=`echo -e "$array" | sed "s/+/$TAB/g"`
W=`echo -e "$array" | cut -f 1`

or

# example-ARRAY
xrandroutput="1680x1050+2880+23"
array=${xrandroutput//x/" "}      # "1680 1050+2880+23"
array=( ${array//+/" "} )     # ( "1680" "1050" "2880" "23" )

Note that 'example-SED' extracts only the first value, into the variable 'W', while in 'example-ARRAY' all four values are inside the 'array' array, accessible with the ${array[i]} syntax.

iMBeCil · 2017-08-28 08:59:27

3.2 Benchmark results
To get meaningful results, I ran every benchmark several times. (I could have done it much more systematically, with proper statistics, but the difference is so large that it is not necessary.) Furthermore, I tried to choose the number of iterations so that the script ran for about 10 seconds, and then normalized the results.

The results are:

example-SED:   1000 iterations = 2.17 secs
example-ARRAY: 1000 iterations = 0.011 secs

Yes, this is a factor of almost 200 in favor of example-ARRAY! 8)

Therefore, using internal bash commands is much faster than calling external tools like 'sed' and 'cut'!

Is it surprising? Well, I did expect some speedup, but not a factor of 200 ... Actually, I invite more knowledgeable people here to make 'example-SED' faster. Perhaps it can be programmed significantly better than I did.

iMBeCil · 2017-08-28 09:00:48

4. Final words
So is it worth using bash internals? Let me try to summarize:
PROS (for bash internals):
- it is significantly faster
- the code is cleaner (IMHO), with less namespace clutter
- it seems to be easier

CONS (against):
- it is strongly bash-dependent
- although arrays and string manipulation commands should be available in modern bash (version 3 and above), there might be compatibility problems with older bash versions (but only really old ones)
- you can't do certain one-liners (and impress friends) that are otherwise easy with piping.
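Regarding the bash dependence: where arrays are not available, a similar split can be sketched with plain POSIX shell features (IFS plus 'set --'). This is only an illustration; the variable names 'geometry' and 'old_ifs' are made up, and note that 'set --' overwrites the script's own positional parameters:

```bash
#!/bin/sh
geometry="1680x1050+2880+23"

old_ifs=$IFS
set -f              # turn off globbing while the variable is unquoted
IFS='x+'            # split on 'x' and '+'
set -- $geometry    # the pieces become $1 $2 $3 $4
IFS=$old_ifs
set +f

W=$1 H=$2 X=$3 Y=$4
echo "W=$W H=$H X=$X Y=$Y"
```

It is less tidy than the array version, since IFS has to be saved and restored by hand, but it starts no external program either.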

And, as final words:
a) I hope someone will find this TL;DR of mine useful. I know that some will say 'Oh, I know this', some will say 'What on earth is he talking about', but I hope there will be someone who will learn something from it.
b) Sorry for the TL;DR; I couldn't find a shorter way to explain it. Sorry for the awkward EngRish.
c) Do not hesitate to make a fool of me if I did something wrong and/or stupid above.

The End.

Note: there is a version of this HowTo translated into Ukrainian (thanks to user tivasyk).

Last edited by iMBeCil (2017-08-28 09:04:59)

brontosaurusrex · 2017-08-28 10:08:05

Thanks, this will come in handy.

Head_on_a_Stick · 2017-08-28 10:38:59

Thanks for this!

The remarkable Greg's Wiki has a good section on parameter expansion:

http://mywiki.wooledge.org/BashGuide/Pa … _Expansion

@OP, your computer must be very fast, here's my laptop:

empty@Diproton:~ $ time testshell 500000
    0m05.43s real     0m05.82s user     0m00.02s system
empty@Diproton:~ $

However, I can get a free speed boost by switching to a faster shell:

empty@Diproton:~ $ sed -i 's/bash/ksh93/' ~/bin/testshell
empty@Diproton:~ $ time testshell 500000
    0m02.78s real     0m02.98s user     0m00.00s system
empty@Diproton:~ $

https://packages.debian.org/jessie/ksh

iMBeCil · 2017-08-28 11:17:40

You're welcome.

Head_on_a_Stick wrote:

The remarkable Greg's Wiki has a good section on parameter expansion:
http://mywiki.wooledge.org/BashGuide/Pa … _Expansion

Nice wiki, but somehow I always return to Advanced Bash-Scripting Guide.

Head_on_a_Stick wrote:

Thanks for this!
@OP, your computer must be very fast, here's my laptop:
empty@Diproton:~ $ time testshell 500000
    0m05.43s real     0m05.82s user     0m00.02s system
empty@Diproton:~ $

I presented results normalized to 1000 iterations (but actually ran many more, to get realistic times). Your example on my laptop (MacBook Pro from 2013, i7-4850HQ 2.3 GHz) gives a result comparable to yours:

$ time ./test 50000
real  0m5.647s      user  0m5.637s      sys  0m0.020s

Head_on_a_Stick · 2017-08-28 11:33:46

^ Oh, OK, mine is an i5-4330M@2.6GHz so they should be roughly the same.

My `sed` script was taking much longer but it was a transcription error, sorry for the noise :8

By the way, the use of backticks is discouraged (although it makes no practical difference in your script); the en vogue method is to use $(foo) instead.
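One place where the difference does show up is nesting. A small sketch (the variable names 'old' and 'new' are just for illustration):

```bash
#!/bin/bash
# Backticks: the inner pair has to be escaped with backslashes.
old=`echo "outer(\`echo inner\`)"`

# $(...): nests without any escaping.
new=$(echo "outer($(echo inner))")

echo "$old"
echo "$new"
```

Both lines give the same result; the $(...) form is simply easier to read and to extend by another level.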

Did you try ksh93?

It really is blisteringly fast and supports most of the features of bash.

iMBeCil · 2017-08-28 12:37:54

^Yes, I'm completely familiar with the $( ... ) syntax, but at that time I was still using backticks. I'll change it when I find time. Thanks for pointing it out.

As for ksh93 and/or other shells, I'm not that 'advanced', or to put it differently, my spare time doesn't allow me to learn the details of yet another shell at the moment. Right now, I'm reasonably fluent in bash and a bit less so in tcsh ... and so far it works for me. For really complicated stuff, I resort to python.

brontosaurusrex · 2017-08-28 15:57:46

How about this? (It's a one-liner and it is bash.)

#!/bin/bash

stuff="1680x1050+2880+23"

read -r w h x y <<< $(echo ${stuff//[!0-9]/ })

# echo "$w $h $x $y"

bench

time for i in $(seq 1000); do mine; done

real    0m2.399s
user    0m0.016s
sys    0m0.248s

time for i in $(seq 1000); do yourArrays; done

real    0m2.816s
user    0m0.056s
sys    0m0.252s

P.S. If one needs arrays, it could be

#!/bin/bash

stuff="1680x1050+2880+23"

arr=(${stuff//[!0-9]/ })

# echo ${arr[0]} ${arr[1]} ${arr[2]} ${arr[3]}

According to #bash, this '${stuff//[!0-9]/ }' is a parameter expansion that replaces every non-numeric character with a space. http://mywiki.wooledge.org/BashFAQ/073
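To see the expansion by itself, here is a small sketch that stores the intermediate result in a variable (the name 'spaced' is made up) and then feeds it straight to 'read'. As far as I can tell, the $(echo ...) wrapper is not needed for this; width is assumed to come first in the string:

```bash
#!/bin/bash
stuff="1680x1050+2880+23"

# [!0-9] matches any single character that is not a digit;
# the double slash replaces every match, not just the first one.
spaced=${stuff//[!0-9]/ }
echo "$spaced"

# Feed the result straight to 'read'; no command substitution involved.
read -r w h x y <<< "$spaced"
echo "w=$w h=$h x=$x y=$y"
```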

Last edited by brontosaurusrex (2017-08-28 18:20:42)

iMBeCil · 2017-08-28 17:00:49

^Nice solution!

But I don't know what exactly you are comparing. What is 'yourArrays'? Is it my 'example-SED'-like script or my 'example-ARRAY'-like script?

Not to mention that we should agree on a benchmark test. For example, on my laptop:

$ time for i in $(seq 1000); do ./mine; done
real	0m1.163s
user	0m1.022s
sys	0m0.208s

(where 'mine' is your 'read'-oneliner).

OTOH, your 'read'-oneliner in 'my' benchmark script gives:

$ time ./test 1000
real	0m0.316s
user	0m0.237s
sys	0m0.105s

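It means that your loop from 1 to 1000 spends most of its time launching the script 'mine' on every iteration (reading it from HDD/SSD and starting a fresh bash process for it), not on the string handling itself.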
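One way to keep that start-up cost out of the measurement is to put the code under test in a function and loop over it inside a single shell. The following is only a sketch: 'parse_geometry' is a hypothetical stand-in for the contents of 'mine', and the default of 200 iterations is arbitrary:

```bash
#!/bin/bash
# Hypothetical stand-in for the contents of the 'mine' script.
parse_geometry() {
    read -r w h x y <<< "${1//[!0-9]/ }"
}

iter=${1:-200}    # iteration count; 200 is an arbitrary default

# 1) Call the function in the current shell: no new process per iteration.
TIMEFORMAT='in-process:         %R s'
time for ((i = 0; i < iter; i++)); do
    parse_geometry "1680x1050+2880+23"
done

# 2) Start a new bash that does nothing, once per iteration,
#    to see the start-up cost by itself.
TIMEFORMAT='new bash each time: %R s'
time for ((i = 0; i < iter; i++)); do
    bash -c ':'
done
```

The second loop does no parsing at all, so whatever time it reports is pure overhead of launching a shell.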

brontosaurusrex · 2017-08-28 17:07:05

Yeah, I compared it to your array script, and obviously you can compare both your way as well. I see the point about disk reads; yeah, that is a bad method then (I assumed the script would be magically cached).

Last edited by brontosaurusrex (2017-08-28 17:42:51)
