Friday 4 February 2022

How to use GNU Stream Editor (sed)

sed is a Unix stream editor for filtering and transforming text; GNU sed is the implementation used in the examples here.

From its manual:

       Sed  is a stream editor.  A stream editor is used to perform basic text
       transformations on an input stream (a file or input from a pipeline).
       While  in  some  ways similar to an editor which permits scripted edits
       (such as ed), sed works by making only one pass over the input(s), and
       is consequently more efficient. But it is sed's ability to filter text
       in a pipeline which particularly distinguishes it from other  types  of
       editors.


  • it uses regular expressions
  • it is used for text manipulation operations such as substitution, insertion, deletion and search
  • alternative tools: Perl, AWK
  • it reads text line by line from a file or input stream into an internal buffer known as the pattern space (each line of input is copied into the pattern space). It then applies one or more operations, described by a sed script, to the pattern space.
  • the sed script can be given on the command line or read from a separate file
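The last point can be illustrated with a minimal sketch. The file names (fruits.txt, upper.sed) and the temporary directory are made up for this example; the -f option is described in the options list in this post.

```shell
# Sample input in a throwaway directory (hypothetical file names).
tmpdir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$tmpdir/fruits.txt"

# 1) Script given directly on the command line.
inline_out=$(sed 's/banana/BANANA/' "$tmpdir/fruits.txt")

# 2) The same script stored in a separate file and loaded with -f.
printf 's/banana/BANANA/\n' > "$tmpdir/upper.sed"
file_out=$(sed -f "$tmpdir/upper.sed" "$tmpdir/fruits.txt")

# Both invocations print: apple, BANANA, cherry (one per line).
printf '%s\n' "$inline_out"
rm -r "$tmpdir"
```

Keeping the script in a file is mostly useful when it grows beyond one or two commands.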

Syntax:

sed [options] <script> [input_file...]

Some options:

       -f script-file, --file=script-file = add the contents of script-file to the commands to be executed
       -i, --in-place = edit files in place
       -n, --quiet, --silent = suppress automatic printing of pattern space
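A small sketch of what -n changes, assuming a three-line sample file (numbers.txt is a made-up name). It uses the p command, which prints the current pattern space.

```shell
tmpdir=$(mktemp -d)
printf 'one\ntwo\nthree\n' > "$tmpdir/numbers.txt"

# Without -n every line is printed automatically, and p prints line 2 once more.
default_out=$(sed '2p' "$tmpdir/numbers.txt")

# With -n the automatic printing is suppressed; only the explicit p prints.
quiet_out=$(sed -n '2p' "$tmpdir/numbers.txt")

printf 'default:\n%s\n' "$default_out"   # one, two, two, three
printf 'quiet:\n%s\n' "$quiet_out"       # two
rm -r "$tmpdir"
```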


From the sed manual:

       Addresses
       Sed commands can be given with no addresses, in which case the  command
       will  be  executed for all input lines; with one address, in which case
       the command will only be executed for input lines which match that
       address; or with two addresses, in which case the command will be
       executed for all input lines which match the inclusive range of lines
       starting  from  the first address and continuing to the second address.
       Three things to note about address ranges: the  syntax  is  addr1,addr2
       (i.e.,  the  addresses  are separated by a comma); the line which addr1
       matched will always be accepted, even if addr2 selects an earlier line;
       and  if  addr2 is a regexp, it will not be tested against the line that
       addr1 matched.

       After the address (or address-range), and before the command, a !   may
       be inserted, which specifies that the command shall only be executed if
       the address (or address-range) does not match.
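The address forms described above can be tried on a four-line input. This is only a sketch: the helper function lines and its sample words are invented for the example, and -n with p is used so that only the selected lines are printed.

```shell
# Hypothetical helper that emits four sample lines.
lines() { printf 'alpha\nbeta\ngamma\ndelta\n'; }

# One address (a line number): only line 2.
one_line=$(lines | sed -n '2p')              # beta

# One address (a regular expression): lines containing "mm".
by_regex=$(lines | sed -n '/mm/p')           # gamma

# Two addresses: the inclusive range from line 2 to line 3.
by_range=$(lines | sed -n '2,3p')            # beta, gamma

# A range delimited by two regular expressions.
regex_range=$(lines | sed -n '/beta/,/delta/p')   # beta, gamma, delta

# addr2 selects an earlier line: the addr1 line is still accepted.
backwards=$(lines | sed -n '3,1p')           # gamma

# ! negates the address: everything except lines 2 to 3.
negated=$(lines | sed -n '2,3!p')            # alpha, delta

printf '%s\n' "$negated"
```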

Commands which accept address ranges:
  • p = Print the current pattern space
  • s/regexp/replacement/ = Substitute the regex match(es) with replacement. Attempt to match regexp against the pattern space.  If  successful, replace that portion matched with replacement. The replacement may contain the special character & to refer to that portion of the pattern space which matched, and the special escapes \1 through \9 to refer to the corresponding matching sub-expressions in the regexp.

From man sed:

     [2addr]s/regular expression/replacement/flags
             Substitute the replacement string for the first instance of the regular expression in the pattern space.  Any character other than backslash or
             newline can be used instead of a slash to delimit the RE and the replacement.  Within the RE and the replacement, the RE delimiter itself can
             be used as a literal character if it is preceded by a backslash.

             An ampersand (“&”) appearing in the replacement is replaced by the string matching the RE.  The special meaning of “&” in this context can be
             suppressed by preceding it by a backslash.  The string “\#”, where “#” is a digit, is replaced by the text matched by the corresponding
             backreference expression (see re_format(7)).

             A line can be split by substituting a newline character into it.  To specify a newline character in the replacement string, precede it with a
             backslash.

             The value of flags in the substitute function is zero or more of the following:

                   N       Make the substitution only for the N'th occurrence of the regular expression in the pattern space.

                   g       Make the substitution for all non-overlapping matches of the regular expression, not just the first one.

                   p       Write the pattern space to standard output if a replacement was made.  If the replacement string is identical to that which it
                           replaces, it is still considered to have been a replacement.

                   w file  Append the pattern space to file if a replacement was made.  If the replacement string is identical to that which it replaces, it
                           is still considered to have been a replacement.

                   i or I  Match the regular expression in a case-insensitive way.
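A few substitutions that exercise the pieces described above: the default first-match behaviour, the g and N flags, the & character, backreferences and an alternative delimiter. The sample strings are invented for this sketch.

```shell
line='foo bar foo baz foo'

# Default: only the first match on the line is replaced.
first=$(printf '%s\n' "$line" | sed 's/foo/X/')      # X bar foo baz foo

# g flag: every non-overlapping match is replaced.
all=$(printf '%s\n' "$line" | sed 's/foo/X/g')       # X bar X baz X

# Numeric flag: only the 2nd match is replaced.
second=$(printf '%s\n' "$line" | sed 's/foo/X/2')    # foo bar X baz foo

# & in the replacement stands for the matched text.
wrapped=$(printf '%s\n' "$line" | sed 's/bar/[&]/')  # foo [bar] foo baz foo

# \1 and \2 refer to the parenthesised sub-expressions.
swapped=$(printf 'John Smith\n' | sed 's/\([A-Za-z]*\) \([A-Za-z]*\)/\2, \1/')   # Smith, John

# A different delimiter avoids escaping slashes in paths.
path=$(printf '/usr/local/bin\n' | sed 's|/usr/local|/opt|')   # /opt/bin

printf '%s\n' "$first" "$all" "$second" "$wrapped" "$swapped" "$path"
```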

 
How to remove specific bytes (given in hex) from the beginning of a file?
 
Example: The following command removes the UTF-8 byte order mark (BOM, bytes EF BB BF) from the beginning of the file. The removal is done in place:

$ sed -i '1s/^\xef\xbb\xbf//' commands.sql 
 
1 - run the command only on the first line; other lines are unaffected
s - the substitute command
^ - the beginning of the line (match only at the start of the line)
\xef\xbb\xbf - the bytes to remove: the UTF-8 BOM, written as hex escapes (\xHH escapes are a GNU sed extension)
// - the empty replacement, so the matched bytes are deleted
 
 
To keep the original file intact and write the result to a new file instead:

$ sed '1s/^\xef\xbb\xbf//' < commands.sql > new_commands.sql
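To check that the BOM is really gone, the first three bytes can be dumped before and after. The sketch below builds its own test file in a temporary directory (the SQL content is made up) and assumes GNU sed, since it relies on \xHH escapes. LC_ALL=C is an addition of this sketch, not part of the command above; it forces byte-wise matching so the result does not depend on the locale.

```shell
tmpdir=$(mktemp -d)

# Create a file that starts with the UTF-8 BOM (octal 357 273 277 = hex EF BB BF).
printf '\357\273\277SELECT 1;\n' > "$tmpdir/commands.sql"

# Hex dump of the first three bytes before the edit.
before=$(head -c 3 "$tmpdir/commands.sql" | od -An -tx1 | tr -d ' \n')       # efbbbf

# Strip the BOM into a new file, leaving the original untouched.
LC_ALL=C sed '1s/^\xef\xbb\xbf//' < "$tmpdir/commands.sql" > "$tmpdir/new_commands.sql"

# The new file now starts with "SEL" (hex 53 45 4c).
after=$(head -c 3 "$tmpdir/new_commands.sql" | od -An -tx1 | tr -d ' \n')    # 53454c

printf 'before: %s\nafter:  %s\n' "$before" "$after"
rm -r "$tmpdir"
```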

