Bash is powerful for data processing. This post records some tips I have picked up while processing data, and it will be kept updated.
## Data statistics

### Handle duplicates

Obtain the unique lines of a file:

```bash
sort -u "<filename>" -o "<output_filename>"
# slightly slower:
sort "<filename>" | uniq > "<output_filename>"
```

### Random sort
```bash
# -R: sort in random order
sort -R "<filename>"
```

### Count the number of each line
```bash
# -i: ignore case
# -c: prefix each line with its number of occurrences
sort "<filename>" | uniq -ic
```

### Print duplicate lines
```bash
# print all duplicate lines
sort "<filename>" | uniq -iD
# print each duplicate line only once
sort "<filename>" | uniq -iD | uniq -i
# or case-sensitive:
sort "<filename>" | uniq -D | uniq -i
```
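The `uniq -c` counts above can also be sorted into a frequency table, most frequent lines first. A small sketch with throwaway sample data (file name and contents are illustrative):

```bash
# illustrative sample data
printf 'apple\nbanana\napple\nApple\n' > /tmp/fruits.txt

# count each distinct line case-insensitively, then sort by count, highest first
sort /tmp/fruits.txt | uniq -ic | sort -rn
```

Here the three case variants of "apple" collapse into a single line with count 3.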
### Count the occurrences of a string "xxx"
- vim mode:

```
:%s/xxx//gn
```
- grep:

```bash
$ grep -o "xxx" filename | wc -l
```

To count several strings ["xxx", "yyy"] at once:

```bash
$ grep -o "xxx\|yyy" filename | wc -l
```
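Note that `grep -c` counts matching lines, not occurrences, which is why `-o | wc -l` is used above. A quick sketch of the difference with throwaway data:

```bash
# two occurrences of "xxx" on a single line (illustrative data)
printf 'xxx xxx\nyyy\n' > /tmp/sample.txt

grep -o "xxx" /tmp/sample.txt | wc -l   # occurrences: 2
grep -c "xxx" /tmp/sample.txt           # matching lines: 1
```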
## Training and dev set split

Given an unsplit data file "data.tsv":
- Shuffle the lines:

```bash
$ shuf "data.tsv" -o "shuffle.tsv"
```
- Split into train/dev sets:
  - Count the # of lines:

```bash
$ wc -l "shuffle.tsv"
```

  - Put 90% into the train set and 10% into the dev set:

```bash
$ head -n <#0.9count> "shuffle.tsv" > "train.tsv"
$ tail -n <#0.1count> "shuffle.tsv" > "dev.tsv"
```
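The `<#0.9count>`/`<#0.1count>` placeholders can be computed with shell arithmetic instead of by hand. A sketch, assuming `shuffle.tsv` from the shuffle step above:

```bash
total=$(wc -l < shuffle.tsv)                 # total number of lines
train=$(( total * 9 / 10 ))                  # 90%, integer division
head -n "$train" shuffle.tsv > train.tsv
tail -n "+$(( train + 1 ))" shuffle.tsv > dev.tsv   # everything after the train portion
```

Using `tail -n "+K"` (start at line K) instead of a fixed line count avoids an off-by-one when the 90/10 split does not divide evenly.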
- Split into train/dev/test sets:
  - Count the # of lines:

```bash
$ wc -l "shuffle.tsv"
```

  - Split 80%/10%/10% into train/dev/test sets:

```bash
$ head -n <#0.8count> "shuffle.tsv" > "train.tsv"
$ sed -n "<#0.8count+1>,<#0.9count>p" shuffle.tsv > "dev.tsv"
$ tail -n <#0.1count> "shuffle.tsv" > "test.tsv"
```
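The three boundary placeholders can likewise be computed so that the `head`, `sed` range, and `tail` stay consistent with each other. A sketch, assuming `shuffle.tsv` from the shuffle step above:

```bash
total=$(wc -l < shuffle.tsv)
n80=$(( total * 8 / 10 ))                    # last line of the train portion
n90=$(( total * 9 / 10 ))                    # last line of the dev portion
head -n "$n80" shuffle.tsv > train.tsv
sed -n "$(( n80 + 1 )),${n90}p" shuffle.tsv > dev.tsv
tail -n "+$(( n90 + 1 ))" shuffle.tsv > test.tsv
```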
## Check data

Print a specific line of a given file:

```bash
sed -n <line_num>p <filename.txt>
```
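The same `sed -n ...p` pattern also prints a range of lines. A sketch with throwaway data:

```bash
printf 'a\nb\nc\nd\n' > /tmp/lines.txt
sed -n '2,3p' /tmp/lines.txt   # prints lines 2 through 3: b, c
```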
## Job running

### Redirection

#### Redirect output to file

Redirect the standard output (stdout) and standard error (stderr) to an output file:

```bash
[bash_command] > "<out_file_name>" 2>&1
# 0 is stdin
# 1 is stdout
# 2 is stderr
```

File descriptor 1 is the standard output (stdout) and file descriptor 2 is the standard error (stderr). The `&` indicates that what follows is a file descriptor, not a filename.
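The two descriptors can also be sent to separate files instead of being merged. A sketch using a toy command that writes to both streams:

```bash
# "ok" goes to stdout, "oops" goes to stderr (toy command for illustration)
{ echo "ok"; echo "oops" >&2; } > /tmp/stdout.log 2> /tmp/stderr.log

cat /tmp/stdout.log   # ok
cat /tmp/stderr.log   # oops
```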
#### Output stdout to both screen and file

`tee` copies standard input to each FILE, and also to standard output:

```bash
[bash_command] | tee "<out_filename>"
```
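To capture stderr in the file as well, combine `tee` with the `2>&1` redirection shown earlier. A sketch with a toy command that writes to both streams:

```bash
# merge stderr into stdout, then tee the combined stream to screen and file
{ echo "to stdout"; echo "to stderr" >&2; } 2>&1 | tee /tmp/both.log
```

Afterwards `/tmp/both.log` holds both lines.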
### Run jobs in background

#### Ampersand (`&`)

An ampersand (`&`) starts a subprocess (i.e., a child process of the current bash session), which terminates when the current session exits:

```bash
[bash_command] &
```
#### nohup

`nohup` catches the hangup signal, i.e., the subprocess keeps running after the current session is closed:

```bash
nohup [bash_command]
```

It can be stopped by pressing Ctrl+Z. Ctrl+Z does not work when `&` is used.
#### nohup + ampersand (`&`) + redirection

```bash
nohup [bash_command] > "<out.filename>" 2>&1 &
```
## File management

- Find files larger than 100M in the current directory:

```bash
$ find . -type <type-name> -size +/-<file-size>
# e.g. find . -type f -size +100M
# <type-name>: d: directory, f: file
# +: >, -: <
# <file-size>: k/M/G
```

- Find large directories:
```bash
$ du -h --max-depth=1
$ du -hm --max-depth=2 | sort -nr | head -12
```

- Report the amount of disk space used:
```bash
# report all dirs
cd /
du -sh *
# subdirectories, one level deep
du -lh --max-depth=1
```
## Count

```bash
# count the # of lines in a file
wc -l "<filename>"
```
## Unzip files with Chinese (GBK-encoded) filenames

```bash
unzip -O GBK <filename>.zip
```