File tips with unix commands

December 14, 2015

When dealing with strings and files, I think it is worth learning at least the basics of Unix commands.


String manipulation with sed

sed, a stream editor, is a Unix utility that parses and transforms text.

Replace a string in a file:

sed -i 's/to_be_replaced/new_string/g' file.txt
  • g - apply the replacement to all matches of the regexp on each line, not just the first.
  • -i or --in-place - this option specifies that files are to be edited in-place.
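
Since -i overwrites the file, it can be worth previewing the result first, or keeping a backup. Here is a minimal sketch of both, run on a throwaway file; the file contents and the foo/qux strings are made up for illustration.

```shell
# Work in a temporary directory so nothing real is modified.
tmpdir=$(mktemp -d)
printf 'foo bar foo\nbaz foo\n' > "$tmpdir/file.txt"

# Preview: without -i, sed prints the result and leaves the file untouched.
preview=$(sed 's/foo/qux/g' "$tmpdir/file.txt")

# Edit in place, keeping the original content as file.txt.bak.
sed -i.bak 's/foo/qux/g' "$tmpdir/file.txt"

edited=$(cat "$tmpdir/file.txt")
backup=$(cat "$tmpdir/file.txt.bak")
echo "$edited"

rm -rf "$tmpdir"
```

Attaching the suffix directly to -i (as in -i.bak) is the form I would expect to work with both GNU and BSD sed.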

You can also chain multiple commands by separating them with ;. They are applied in the order given.

sed -i 's/word1/new_string1/g; s/word2/new_string2/g' file.txt
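
The order matters because each command sees the output of the previous one. A tiny sketch with made-up words, piping a string through sed instead of editing a file:

```shell
# The second command rewrites what the first one just produced.
chained=$(echo 'cat' | sed 's/cat/dog/; s/dog/bird/')
echo "$chained"    # bird

# Swapping the commands changes the result: 'dog' does not exist yet
# when the first command runs, so only the second one has an effect.
reversed=$(echo 'cat' | sed 's/dog/bird/; s/cat/dog/')
echo "$reversed"   # dog
```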

If you need to reuse part of the match, wrap it in escaped parentheses \( ... \) and recall it with \1. For instance, to transform function (arg1, arg2) { ... into (arg1, arg2) => { ..., you can run

sed -i 's/function (\(.*\))/(\1) =>/g' file.txt

Here \(.*\) captures arg1, arg2, while the unescaped parentheses around it match the literal ( and ).
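
To try capture groups without touching a file, you can pipe a single line through sed. The sketch below uses made-up input lines; the [^)]* variant is my own suggestion for lines that contain more than one pair of parentheses, since .* is greedy and runs up to the last ) on the line.

```shell
# Two groups: swap the parts around the "=" sign.
swapped=$(echo 'name=alice' | sed 's/\(.*\)=\(.*\)/\2=\1/')
echo "$swapped"   # alice=name

# The arrow-function rewrite applied to one line of input.
arrow=$(echo 'function (arg1, arg2) {' | sed 's/function (\(.*\))/(\1) =>/')
echo "$arrow"     # (arg1, arg2) => {

# [^)]* stops at the first closing parenthesis instead of the last one.
tight=$(echo 'function (a) { g(b) }' | sed 's/function (\([^)]*\))/(\1) =>/')
echo "$tight"     # (a) => { g(b) }
```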


Data calculation

Suppose you have a data file data.csv with multiple columns and you want to extract only some columns.

data.csv

Col1;Col2;Col3;Col4
c1-1;c2-1;c3-1;c4-1
c1-2;c2-2;c3-2;c4-2
c1-3;c2-3;c3-3;c4-3

# Only columns 1 and 3
cut -f1,3 -d';' data.csv
# cut -f1,3 -d$'\t' data.tsv if you have tab instead

# From column 1 to column 3
cut -f1-3 -d';' data.csv

# Exclude column 2 (--complement is a GNU extension)
cut -f2 --complement -d';' data.csv
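
If you want to check what cut returns, the sketch below rebuilds the sample file in a temporary directory and captures the output. The -f1,3- form is an alternative I would reach for when --complement is not available: it lists the columns to keep (1, then 3 to the end) rather than the one to drop.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' \
  'Col1;Col2;Col3;Col4' \
  'c1-1;c2-1;c3-1;c4-1' \
  'c1-2;c2-2;c3-2;c4-2' \
  'c1-3;c2-3;c3-3;c4-3' > "$tmpdir/data.csv"

# Only columns 1 and 3.
cols_1_3=$(cut -d';' -f1,3 "$tmpdir/data.csv")
echo "$cols_1_3"

# Everything except column 2, without --complement.
without_2=$(cut -d';' -f1,3- "$tmpdir/data.csv")
echo "$without_2"

rm -rf "$tmpdir"
```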

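Calculate the average of column 2, assuming it holds numeric values and the first line is a header:

awk -F ';' 'NR > 1 { total += $2; count++ } END { if (count) print total/count }' data.csv
  • -F - specifies the field separator
  • NR > 1 - skips the header line so it is not counted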
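
The sample data.csv above has no numeric column, so here is a sketch on a hypothetical scores.csv (file name and values are mine) with a header line. It shows why the header matters: awk treats the non-numeric header cell as 0 but still counts the line.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' 'name;score' 'a;10' 'b;20' 'c;36' > "$tmpdir/scores.csv"

# Header counted as a data row: (0 + 10 + 20 + 36) / 4
naive=$(awk -F ';' '{ total += $2; count++ } END { print total / count }' "$tmpdir/scores.csv")
echo "$naive"     # 16.5

# Header skipped with NR > 1: (10 + 20 + 36) / 3
average=$(awk -F ';' 'NR > 1 { total += $2; count++ } END { if (count > 0) print total / count }' "$tmpdir/scores.csv")
echo "$average"   # 22

rm -rf "$tmpdir"
```

The count > 0 check is only there to avoid a division by zero on an empty file.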

Compressing files

tar.gz

Compress:

tar -zcvf file-name.tar.gz *
  • -c, --create - create a new archive
  • -f, --file FILE - gives tar the name and path of the archive. When flags are grouped, f must come last so that the file name follows it immediately.
  • -z, --gzip, --gunzip, --ungzip - filter the archive through gzip: compress when creating, decompress when extracting
  • -v, --verbose - list each file as it is processed

Extract:

tar -xvzf file-name.tar.gz -C /tmp
  • -x, --extract - extract files
  • -C, --directory DIR - optional, extract in DIR directory
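
A full round trip can be sketched as follows, using temporary directories and made-up file names. Two choices here are mine rather than part of the commands above: the archive is written outside the source directory so it cannot end up inside itself, and -t is used to list the contents before extracting.

```shell
src=$(mktemp -d)    # files to archive
out=$(mktemp -d)    # where the archive is written
dest=$(mktemp -d)   # where the archive is extracted

printf 'hello\n' > "$src/a.txt"
printf 'world\n' > "$src/b.txt"

# Create: -C makes tar change into $src before adding the files.
tar -czf "$out/backup.tar.gz" -C "$src" a.txt b.txt

# List the archive contents without extracting anything.
listing=$(tar -tzf "$out/backup.tar.gz")
echo "$listing"

# Extract into $dest and read the files back.
tar -xzf "$out/backup.tar.gz" -C "$dest"
restored=$(cat "$dest/a.txt" "$dest/b.txt")
echo "$restored"

rm -rf "$src" "$out" "$dest"
```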

Sorting csv

sort lets you sort the lines of a text file by column, much like ORDER BY in SQL.

Input file data.csv:

1;100;C
2;102;D
3;103;E
4;104;F
4;101;G

sort -t';' -k1,2 data.csv
  • -t, --field-separator=SEP - Use SEP instead of non-blank to blank transition
  • -k, --key=POS1[,POS2] - Start a key at POS1 (origin 1), end it at POS2 (default end of line)

Output:

1;100;C
2;102;D
3;103;E
4;101;G
4;104;F
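
By default sort compares keys as text, which happens to work here because all the numbers have the same width. For real numeric ordering you can add n to the key, and r to reverse it. A small sketch on the same data (the temporary file is only there to make it self-contained):

```shell
tmpdir=$(mktemp -d)
printf '%s\n' '1;100;C' '2;102;D' '3;103;E' '4;104;F' '4;101;G' > "$tmpdir/data.csv"

# Text vs numeric comparison on a tiny list.
as_text=$(printf '%s\n' 10 9 2 | sort)        # 10, 2, 9
as_number=$(printf '%s\n' 10 9 2 | sort -n)   # 2, 9, 10

# Column 2 only, numeric, descending (like ORDER BY col2 DESC).
by_col2_desc=$(sort -t';' -k2,2nr "$tmpdir/data.csv")
echo "$by_col2_desc"

rm -rf "$tmpdir"
```

Writing the key as -k2,2 rather than -k2 limits the comparison to column 2 instead of running from column 2 to the end of the line.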

Join files

join is a tool that joins the lines of two files on specific columns. Both files must be sorted on their join field.

Assume we have two files:

data1.csv

1;100
2;102
3;103
4;104
5;105

data2.csv

A;4
C;3
B;1
D;2

If you want to join the two files on column 1 of data1.csv and column 2 of data2.csv, you can run:

join -t';' -1 1 -2 2 -o 1.1,1.2,2.1 <(sort -t';' -k1,1 data1.csv) <(sort -t';' -k2,2 data2.csv)
  • -t CHAR - use CHAR as input and output field separator
  • -1 FIELD - join on this FIELD of file 1
  • -2 FIELD - join on this FIELD of file 2
  • -o FORMAT - fields to output, written as FILE.FIELD (here columns 1 and 2 of file 1, then column 1 of file 2)

Result:

1;100;B
2;102;D
3;103;C
4;104;A

Note that this is an inner join: the line with id 5 from data1.csv does not appear in the output.
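
If you also want the unmatched lines of the first file, the equivalent of a LEFT JOIN in SQL, join has the -a and -e options. The sketch below is one way to do it; the NULL placeholder is my own choice, and the sorted copies are written to temporary files so it does not depend on process substitution.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' '1;100' '2;102' '3;103' '4;104' '5;105' > "$tmpdir/data1.csv"
printf '%s\n' 'A;4' 'C;3' 'B;1' 'D;2' > "$tmpdir/data2.csv"

# Both inputs must be sorted on their join field.
sort -t';' -k1,1 "$tmpdir/data1.csv" > "$tmpdir/data1.sorted"
sort -t';' -k2,2 "$tmpdir/data2.csv" > "$tmpdir/data2.sorted"

# Inner join: only ids present in both files.
inner=$(join -t';' -1 1 -2 2 -o 1.1,1.2,2.1 \
  "$tmpdir/data1.sorted" "$tmpdir/data2.sorted")

# Left join: -a 1 also prints unpairable lines from file 1,
# and -e fills the missing output fields with a placeholder.
left=$(join -t';' -1 1 -2 2 -a 1 -e 'NULL' -o 1.1,1.2,2.1 \
  "$tmpdir/data1.sorted" "$tmpdir/data2.sorted")
echo "$left"

rm -rf "$tmpdir"
```

With this data, the only difference between the two results is the extra line 5;105;NULL at the end of the left join.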
