V-TEK Weblog about webdevelopment and linux

25 Sep 2009

Learning Linux Part 5/15: Edit & transform textfiles

In this part I will try to explain some of the text processing features that Linux offers. The part is cut into 3 pieces:

  • REGULAR EXPRESSIONS
  • SEARCH AND SORT TOOLS
  • PROGRAMMABLE FILTERS

REGULAR EXPRESSIONS

A regular expression is a formula for matching strings that follow some pattern. For example, suppose you would like to find all files in the current directory whose names start with an "a", followed by any single character, and end with "cd". The -name test of 'find' takes shell wildcards rather than regular expressions, so you might filter the directory listing through 'grep' instead:

ls | grep '^a.cd$'

This 'grep' command keeps only the names starting with an "a" (^a), followed by ANY single character (.), and ending with the characters "cd" (cd$). This means that a file like "abcd.text" or "www.abcd" will not be included in the search results, while "abcd" or "axcd" will.

Below you will find a list of characters which can be used in regular expressions.

  • ^ - Matches the beginning of the line.
  • $ - Matches the end of the line.
  • . - The dot is used as a wildcard and stands for any -single- character (with the exception of the newline character).
  • [ ] - Encloses a set of alternative characters; exactly one character from the set has to match.
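
The characters above can be tried out with a small sketch like the following. The names in the list are made up for illustration; they are not real files.

```shell
# Sample data: one made-up name per line (an assumption for this illustration).
names=$(printf '%s\n' abcd axcd abcd.text www.abcd acd a1cd)

# ^a.cd$ : an "a", exactly one arbitrary character, then "cd", and nothing else.
anchored=$(printf '%s\n' "$names" | grep '^a.cd$')
echo "$anchored"

# ^a[bx]cd$ : the second character has to be either "b" or "x".
bracket=$(printf '%s\n' "$names" | grep '^a[bx]cd$')
echo "$bracket"
```

The first pattern keeps abcd, axcd and a1cd; the second one keeps only abcd and axcd.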

REPETITION OPERATORS

Below you will find the repetition operators, which are used to match a character or expression that occurs more than once. Each operator applies to the character or expression directly before it. With plain 'grep' the operators ?, + and | have to be escaped as \?, \+ and \| (a GNU extension); with 'grep -E' they work unescaped and the braces are written without backslashes.

  • * - The character before the asterisk may occur any number of times (including not at all!)
  • ? - The character before the question mark is optional: it may occur once or not at all.
  • + - The character before the plus sign has to occur at least one time
  • \{n\} - The preceding expression has to occur exactly n times
  • \{n,\} - The preceding expression has to occur at least n times
  • \{,m\} - The preceding expression can occur a maximum of m times, but it doesn't have to occur at all.
  • \{n,m\} - The preceding expression has to occur at least n times, and no more than m times.
  • regex|regex - this is a sort of OR statement, which matches either the expression before or the expression after the pipe sign.
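
A short sketch of the repetition operators, assuming GNU grep and a made-up word list. The parentheses in the last pattern are grouping syntax of 'grep -E' and are not part of the list above.

```shell
# Made-up test words, one per line.
words=$(printf '%s\n' color colour colouur ac abc abbc abbbc cat dog)

# colou?r : the "u" is optional (zero or one time). Needs grep -E.
optional=$(printf '%s\n' "$words" | grep -E '^colou?r$')

# ab+c : at least one "b". Needs grep -E.
plus=$(printf '%s\n' "$words" | grep -E '^ab+c$')

# ab*c : any number of "b" characters, including none.
star=$(printf '%s\n' "$words" | grep '^ab*c$')

# ab\{2,3\}c : two or three "b" characters (basic syntax, escaped braces).
interval=$(printf '%s\n' "$words" | grep '^ab\{2,3\}c$')

# cat|dog : either word; the parentheses keep ^ and $ applied to both.
either=$(printf '%s\n' "$words" | grep -E '^(cat|dog)$')

printf '%s\n' "$optional" "$plus" "$star" "$interval" "$either"
```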

SEARCH AND SORT TOOLS

Most Linux distributions come with a lot of search and sort tools, like grep, find, cut, uniq, etc.

I will describe these tools and their possibilities below.

GREP AND FIND

The command "grep" is used to search for a certain text phrase in a text file or in the output of a command. Another search command is "find", which searches for files and directories. With find you can search on any file property, like permissions, ownership, access or modification time, or the filename itself.

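The real power shows when you combine grep and find. For example, if you'd like to search for a certain text phrase "search_string" in all php files in a documentroot, you could use something like this:

find /path/to/documentroot -name '*.php' -exec grep search_string {} \;

Note - The -name test takes a shell wildcard, not a regular expression, so the pattern is '*.php' without a trailing "$". Also take a good look at the \; at the end of the line: the ";" marks the end of the -exec command, and the backslash keeps the shell from interpreting it.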
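
The combination can be tried safely in a throwaway directory. The sketch below builds a made-up documentroot with mktemp; the file names and contents are assumptions for illustration. It uses grep's -l option, which prints the names of the matching files instead of the matching lines.

```shell
# Build a small, made-up documentroot in a temporary directory.
docroot=$(mktemp -d)
mkdir -p "$docroot/lib"
printf '<?php echo "search_string"; ?>\n' > "$docroot/index.php"
printf '<?php echo "nothing here"; ?>\n' > "$docroot/lib/util.php"
printf 'search_string in a text file\n' > "$docroot/notes.txt"

# Look only at *.php files; grep -l prints the name of every file that matches.
hits=$(find "$docroot" -name '*.php' -exec grep -l search_string {} \;)
echo "$hits"

# Clean up the temporary directory again.
rm -rf "$docroot"
```

Only index.php is reported: util.php does not contain the phrase, and notes.txt is skipped by the -name test.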

SORT

Sort is a tool which sorts the lines of a file alphabetically or numerically. The command line usage of sort is:

sort [OPTION]... [FILE]...

Some of the options which can be given to sort are:

  • -m - Merge files that are already sorted
  • -u - Send only unique lines to the standard output
  • -o filename - Write the sorted output to the given file instead of the standard output
  • -b - Ignore blanks at the beginning of every line
  • -n - Sort numerically (without this option, the sort is alphabetical)
  • -t - Specify the field separator. If not given, fields are separated by blanks (spaces or tabs)
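
A minimal sketch of these options on made-up data. It also uses the -k option, which is not in the list above: -k selects the field to sort on, and -t only has an effect together with it.

```shell
tmp=$(mktemp -d)
# Made-up "name:score" records, one of them duplicated.
printf '%s\n' 'carol:30' 'alice:4' 'bob:100' 'alice:4' > "$tmp/scores.txt"

# Default: whole lines are compared as text.
alpha=$(sort "$tmp/scores.txt")

# -u: every distinct line only once.
unique=$(sort -u "$tmp/scores.txt")

# -t: sets the separator, -k2 picks the second field. Without -n the
# scores are compared as text, so "100" comes before "30" and "4".
text_order=$(sort -t: -k2 "$tmp/scores.txt")

# The same with -n: the scores are compared as numbers.
numeric=$(sort -t: -k2 -n "$tmp/scores.txt")

# -o: write the result to a file instead of the standard output.
sort -u -o "$tmp/sorted.txt" "$tmp/scores.txt"
saved=$(cat "$tmp/sorted.txt")

# -m: merge two files that are already sorted.
printf '%s\n' a c e > "$tmp/one.txt"
printf '%s\n' b d > "$tmp/two.txt"
merged=$(sort -m "$tmp/one.txt" "$tmp/two.txt")

printf '%s\n' "$numeric"
rm -rf "$tmp"
```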

CUT

The command "cut" prints selected fields of every line of a file or of the output of a command. For example, if you give the following command:

cut -d':' -f1 /etc/passwd

All the usernames of the /etc/passwd file will be printed. This is because the delimiter flag "-d" sets the colon as the field separator and the -f1 flag tells "cut" to print the first field.

A couple of other options of "cut" are listed below. Instead of a single number n you can also give a list or a range, such as 1,3 or 2-5.

  • -bn - selects the byte at position n of every line
  • -cn - selects the character at position n of every line
  • -fn - (as described above) - selects field number n
  • -dx - (as described above) - uses the character x as the field delimiter (the default is the tab character)
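
The options can be compared on a single line. The sketch below uses a typical passwd-style line as sample input (an assumption, so that the result does not depend on the local /etc/passwd).

```shell
# A passwd-style sample line.
line='root:x:0:0:root:/root:/bin/bash'

# -d and -f: split on ":" and print the first field.
first_field=$(printf '%s\n' "$line" | cut -d: -f1)

# A list of fields: the first and the seventh, still joined by ":".
two_fields=$(printf '%s\n' "$line" | cut -d: -f1,7)

# -c with a range: characters 1 to 4.
chars=$(printf '%s\n' "$line" | cut -c1-4)

# -b: the byte at position 6.
byte=$(printf '%s\n' "$line" | cut -b6)

printf '%s\n' "$first_field" "$two_fields" "$chars" "$byte"
```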

UNIQ

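If you have a file with a lot of duplicate lines, uniq can be of assistance. By default it collapses repeated lines into a single occurrence. Because it only compares adjacent lines, the input is usually sorted first. The flag "-d" prints only the lines that occur more than once (one copy of each), and the flag "-u" prints only the lines that occur exactly once.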
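
A small sketch with a made-up list shows why the input is normally piped through sort first: uniq only looks at neighbouring lines.

```shell
# Made-up input with duplicates that are not next to each other.
list=$(printf '%s\n' pear apple pear fig apple pear)

# Without sorting nothing is removed, because no two equal lines are adjacent.
unsorted=$(printf '%s\n' "$list" | uniq)

# Sorted first: every distinct line once.
collapsed=$(printf '%s\n' "$list" | sort | uniq)

# -d: only the lines that occur more than once.
dupes=$(printf '%s\n' "$list" | sort | uniq -d)

# -u: only the lines that occur exactly once.
singles=$(printf '%s\n' "$list" | sort | uniq -u)

printf '%s\n' "$collapsed"
```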

CAT

This tool simply displays the content of a file, as described earlier.

TR

The "tr" command replaces or deletes characters read from the standard input and sends the result to the standard output (most of the time, your screen). It works on single characters, not on words or text phrases. The -d flag deletes every occurrence of the given characters from the input. For example, if you'd like to remove all colons from the /etc/passwd file and save the result in /etc/passwd.new, you can enter the following command:

tr -d ":" < /etc/passwd > /etc/passwd.new
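
The same idea can be tried without touching /etc. The sketch below runs tr on one passwd-style sample line (an assumption for illustration) and also shows the replacing form, where every character of the first set is translated to the matching character of the second set.

```shell
# A passwd-style sample line.
line='www-data:x:33:33:www-data:/var/www:/usr/sbin/nologin'

# -d: delete every colon.
no_colons=$(printf '%s\n' "$line" | tr -d ':')

# Two arguments: replace every colon with a space.
spaced=$(printf '%s\n' "$line" | tr ':' ' ')

# Ranges work too: translate lowercase letters to uppercase.
upper=$(printf 'hello\n' | tr 'a-z' 'A-Z')

printf '%s\n' "$no_colons" "$spaced" "$upper"
```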

NL

This tool has the same functionality as "cat"; the only difference is that it shows line numbers (by default only for non-empty lines).

OD

Prints the contents of a file as octal numbers (other formats, such as hexadecimal, are available through options). This command is rarely used.

SPLIT

With split you can split a text file up into several smaller files. For example, if you have a big /var/log/messages which is hard to search, you can split it into several pieces using the "split" command. With the "-l" flag you specify the number of lines each output file may contain. The "-b" flag specifies the size of each output file instead of the number of lines (for example, to split a file up into pieces of 1 MB each).
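
A minimal sketch of "-l" on a made-up ten-line file. The prefix "part_" is a name chosen for this example; split appends the suffixes aa, ab, ac and so on to it.

```shell
tmp=$(mktemp -d)

# Create a made-up "log" of ten lines.
i=1
while [ "$i" -le 10 ]; do
    echo "line $i"
    i=$((i + 1))
done > "$tmp/big.log"

# Split it into pieces of at most 4 lines: part_aa, part_ab, part_ac.
split -l 4 "$tmp/big.log" "$tmp/part_"

pieces=$(ls "$tmp" | grep -c '^part_')
first_piece=$(cat "$tmp/part_aa")
last_piece=$(cat "$tmp/part_ac")

# Concatenating the pieces in order gives back the original file.
if cat "$tmp"/part_* | cmp -s - "$tmp/big.log"; then
    same=yes
else
    same=no
fi

echo "$pieces pieces, identical after joining: $same"
rm -rf "$tmp"
```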

JOIN

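Sometimes you would like to join several text files based on a common field. The "join" command makes this possible: it combines the lines of two files that have the same value in the join field, which is the first field by default. Both files have to be sorted on that field.
To perform a simple join operation on two files where the first fields are the same, enter: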

join phonedir names

If the phonedir file contains the following names:

Adams A.        555-6235
Dickerson B.    555-1842
Erwin G.        555-1234
Jackson J.      555-0256
Lewis B.        555-3237
Norwood M.      555-5341
Smartt D.       555-1540
Wright M.       555-1234
Xandy G.        555-5015

and the names file contains these names and department numbers:

Erwin           Dept. 389
Frost           Dept. 217
Nicholson       Dept. 311
Norwood         Dept. 454
Wright          Dept. 520
Xandy           Dept. 999

the join command displays (the output fields are separated by a single space):

Erwin G. 555-1234 Dept. 389
Norwood M. 555-5341 Dept. 454
Wright M. 555-1234 Dept. 520
Xandy G. 555-5015 Dept. 999

PROGRAMMABLE FILTERS

One of the hardest, but most powerful, ways to modify the contents of a file is the use of programmable filters. Two of these programmable filters are "sed" and "gawk".

SED

Sed is a so-called "stream" editor, used for filtering and transforming text (files). Sed reads its input line by line and checks for every line whether the given commands apply to it. The result is written to the standard output; the input file itself is not changed.

The syntax of the sed command is:

sed [options] 'script' filenames

If you have a very complex script, you can also put the commands in a separate text file. An example with such a command file:

sed [options] -f input-file-with-commands filename

Substitution

In my daily work I use the substitution feature of sed the most. Searching for a certain text phrase and replacing it with another text phrase is a job we all face regularly.

The syntax for substitution commands in sed goes like this:

sed 's/{text-phrase}/{replacement}/[flags]'

Where flags can be:

  • n - (a number) replace only the n-th occurrence of the text phrase on a line
  • g - replace every occurrence on a line
  • p - print the line if a replacement was made
  • w filename - if a replacement was made, write the modified line to 'filename'
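
A short sketch of the flags on two made-up lines. The p flag is combined here with the -n option of sed, which suppresses the normal output so that only the explicitly printed lines remain.

```shell
# Two made-up input lines.
text=$(printf '%s\n' 'one fish two fish' 'red bird')

# No flag: only the first occurrence on every line is replaced.
first=$(printf '%s\n' "$text" | sed 's/fish/cat/')

# g: every occurrence on every line.
all=$(printf '%s\n' "$text" | sed 's/fish/cat/g')

# 2: only the second occurrence on every line.
second=$(printf '%s\n' "$text" | sed 's/fish/cat/2')

# -n plus p: print only the lines on which a replacement was made.
changed=$(printf '%s\n' "$text" | sed -n 's/fish/cat/gp')

printf '%s\n' "$first" "$all" "$second" "$changed"
```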

Line based operations

Of course 'sed' is not only used for substitution; there are also a lot of commands which work on whole lines. Some of these commands are:

  • d - Every line that matches the given text phrase is deleted from the output.
    • sed '/textphrase/d' filename
  • a\ text - Appends text after the matching line
  • i\ text - Inserts text before the matching line
  • c\ text - Replaces the matching line with text
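
The line based commands can be sketched as follows on three made-up lines. The a\, i\ and c\ commands are written in their portable form here, with the new text on the line after the backslash; GNU sed also accepts the text on the same line.

```shell
# Three made-up input lines.
text=$(printf '%s\n' alpha beta gamma)

# d: drop every line that matches /beta/.
deleted=$(printf '%s\n' "$text" | sed '/beta/d')

# a\ : append a line after every match.
appended=$(printf '%s\n' "$text" | sed '/beta/a\
after beta')

# i\ : insert a line before every match.
inserted=$(printf '%s\n' "$text" | sed '/beta/i\
before beta')

# c\ : replace every matching line.
changed=$(printf '%s\n' "$text" | sed '/beta/c\
BETA')

printf '%s\n' "$appended"
```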

GAWK

Gawk is not an ordinary filter, but rather a small programming language. Because learning gawk falls outside the scope of this document, we won't go very deep into it. The basic syntax of gawk is:

gawk [options] '/text-pattern/{action}' filename

So, to find the UID of the user www-data, we enter this line at the prompt:

gawk -F':' '/www-data/{print $3}' /etc/passwd

The output displays the UID of the user "www-data". A short explanation of the above command:

  1. -F':' - sets the field delimiter to a colon (by default, fields are separated by whitespace).
  2. www-data - this is the text phrase we were looking for; it may occur anywhere in the line
  3. print $3 - this is the action to perform, which means that the 3rd field is printed
  4. /etc/passwd - this is the filename that we would like to process.

But there's more you can do with gawk. Instead of a text pattern you can also use a comparison, so that the action is only performed for the lines where the condition holds, like this:

gawk -F':' '$1 == "www-data" {print $0}' /etc/passwd

Again a little explanation:

  1. $1 == "www-data" - means that the action is only performed IF field 1 is exactly equal to "www-data".
  2. print $0 - The variable $0 is a reference to the entire line. So if the condition is true, the whole line will be printed.
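
The difference between a text pattern and a field comparison shows up as soon as the phrase occurs somewhere else in a line. The sketch below uses a made-up passwd-style file (the "deploy" user is hypothetical) and falls back to plain awk when gawk is not installed.

```shell
# Use gawk when available, otherwise whatever awk the system provides.
AWK=$(command -v gawk || command -v awk)

tmp=$(mktemp -d)
# Made-up passwd-style data; "deploy" mentions www-data in its comment field.
printf '%s\n' \
    'root:x:0:0:root:/root:/bin/bash' \
    'www-data:x:33:33:www-data:/var/www:/usr/sbin/nologin' \
    'deploy:x:1001:1001:runs as www-data:/home/deploy:/bin/sh' \
    > "$tmp/passwd"

# Text pattern: matches every line that contains "www-data" anywhere.
by_pattern=$("$AWK" -F':' '/www-data/{print $3}' "$tmp/passwd")

# Field comparison: matches only when the first field is exactly "www-data".
by_field=$("$AWK" -F':' '$1 == "www-data" {print $3}' "$tmp/passwd")

# $0 prints the whole matching line.
whole_line=$("$AWK" -F':' '$1 == "www-data" {print $0}' "$tmp/passwd")

printf '%s\n' "$by_pattern" "$by_field" "$whole_line"
rm -rf "$tmp"
```

The pattern version prints two UIDs (33 and 1001), while the comparison prints only 33.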
