This chapter is based on the Carpentries lessons The Unix Shell and Introduction to the Command Line for Genomics.
Learning objectives:
A shell is a computer program that presents a command line interface which allows you to control your computer using commands entered with a keyboard instead of controlling graphical user interfaces (GUIs) with a mouse/keyboard combination.
There are many reasons to learn about the shell:
In this lesson you will learn how to use the command line interface to move around in your file system.
On a Mac or Linux machine, you can access a shell through a program called Terminal, which is already available on your computer. If you’re using Windows, you’ll need to download a separate program to access the shell; for instance, you can install a Windows Subsystem for Linux. We are going to use the RStudio Terminal.
At any point, you can use the clear command or the Ctrl+L keyboard shortcut to clear the screen.
The shell is a program where users can type commands. With the shell, it’s possible to invoke complicated programs like climate modeling software or simple commands that create an empty directory with only one line of code. The most popular Unix shell is Bash (the Bourne Again SHell — so-called because it’s derived from a shell written by Stephen Bourne). Bash is the default shell on most modern implementations of Unix and in most packages that provide Unix-like tools for Windows.
When the shell is first opened, you are presented with a prompt, indicating that the shell is waiting for input.
$
The shell typically uses $ as the prompt, but may use a different symbol. In the examples for this lesson, we’ll show the prompt as $. Most importantly: when typing commands, either from these lessons or from other sources, do not type the prompt, only the commands that follow it.
So let’s try our first command, ls, which is short for listing. This command will list the contents of the current directory:
$ ls
Desktop Downloads Movies Pictures
Documents Library Music Public
If the shell can’t find a program whose name is the command you typed, it will print an error message such as:
$ ks
ks: command not found
This might happen if the command was mis-typed or if the program corresponding to that command is not installed.
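If you are unsure whether a program is installed, you can ask the shell before running it. The sketch below relies on the standard command -v builtin; the helper name is_available is made up for this illustration, and ks is simply the mistyped command from above (it could, in principle, exist on some systems).

```shell
# Return success if the shell can find a program called "$1".
# 'is_available' is a hypothetical helper name chosen for this sketch.
is_available() {
    command -v "$1" > /dev/null 2>&1
}

# Check one command that should exist and one that probably does not
for cmd in ls ks; do
    if is_available "$cmd"; then
        echo "$cmd: found"
    else
        echo "$cmd: command not found"
    fi
done
```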
To create a directory, you can use the mkdir command, short for make directory:
$ mkdir my_dir
To create (empty) files, use the touch command:
$ touch my_file.txt
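The two commands can be combined, as in the minimal sketch below. It works in a scratch directory created with mktemp -d so that none of your own files are touched; the variable name workdir is an assumption of this example, and "$workdir" stands for the path of that scratch directory.

```shell
# Create a scratch directory; mktemp chooses its name
workdir=$(mktemp -d)

mkdir "$workdir/my_dir"                # create a new, empty directory
touch "$workdir/my_dir/my_file.txt"    # create a new, empty file inside it

listing=$(ls "$workdir/my_dir")        # list the contents of my_dir
echo "$listing"

rm -r "$workdir"                       # clean up the scratch directory
```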
► Question
Make sure you are in the wsbim2122_data directory and create a new one called wsbim2122_notes and, in that directory, 3 files respectively called shell_notes.txt, report.Rmd and omics.md.
Note that the names of the files created above consist of a file name (shell_notes, report, omics) and a file extension (txt, Rmd and md). You can of course create a file without an extension, but you (and the computer) won’t know what type of file it is from its name.
The tips about variable naming that we saw in previous years also hold for file names. Complicated names of files and directories can make your life painful when working on the command line. Here we provide a few useful tips for the names of your files.
Don’t use spaces. Spaces can make a name more meaningful, but since spaces are used to separate arguments on the command line it is better to avoid them in names of files and directories. You can use - or _ instead.
Don’t begin the name with - (dash). Commands treat names starting with - as options.
Stick with letters, numbers, . (period or ‘full stop’), - (dash) and _ (underscore).
Many other characters have special meanings on the command line. There are special characters that can cause your command to not work as expected and can even result in data loss.
If you need to refer to names of files or directories that have spaces or other special characters, you should surround the name in quotes ("").
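The effect of quoting can be seen in a small experiment. This is only a sketch in a scratch directory (the variable name workdir and the file names are made up for the example): the same touch command is run with and without quotes.

```shell
workdir=$(mktemp -d)     # scratch directory for the experiment
cd "$workdir"

touch my notes.txt       # unquoted: two files, "my" and "notes.txt"
touch "my notes.txt"     # quoted: one file, "my notes.txt"

listing=$(ls)            # three entries in total
echo "$listing"

cd - > /dev/null         # go back to where we started
rm -r "$workdir"         # clean up
```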
To view a file, you can use cat to display the whole file at once. This might work for the file that you just edited (try it out), but not for any of the files in the wsbim2122_data/data directory.
► Question
Move to the wsbim2122_data/data directory.
Try cat on one of the seq files.
Use head or tail to view the first few and last few lines of that file.
Read the head manual page to find out how to view the first/last 20 lines.
cat is a terrific program, but when the file is really big, it can be annoying to use. The program less is useful for this case. less opens the file as read only, and lets you navigate through it. The navigation commands are identical to the man program:
b to go backward
g to go to the beginning
G to go to the end
q to quit
► Question
View any of the seq files using less.
less also gives you a way of searching through files. Use the / key to begin a search. Enter the word you would like to search for and press enter. The screen will jump to the next location where that word is found.
Shortcut: If you hit / then enter, less will repeat the previous search. less searches from the current location and works its way forward. Scroll up a couple of lines on your terminal to verify you are at the beginning of the file. Note that if you are at the end of the file and search for the sequence CAA, less will not find it. You either need to go to the beginning of the file (by typing g) and search again using /, or you can use ? to search backwards in the same way you used / previously.
► Question
Search for the sequence TATATA in your file.
For this course, given that we are using RStudio, you will be able to view and edit your files with it directly. However, this won’t be possible if you use the terminal outside of RStudio, such as when you connect to a server.
There exist several text-based editors, i.e. editors that can be used in the shell without a dedicated graphical user interface. Here, we will use nano for its simplicity.
Open one of the files that you just created with
$ nano shell_notes.txt
Let’s type in a few lines of text. Once we’re happy with our text, we can press Ctrl+O (press the Ctrl or Control key and, while holding it down, press the O key) to write our data to disk (we’ll be asked what file we want to save this to: press Return to accept the suggested default).
Once our file is saved, we can use Ctrl+X to quit the editor and return to the shell.
Make sure you are inside the wsbim2122_data/data directory and create two new directories, namely fas and sim.
To copy the seq_1.fas file into the fas directory:
$ cp seq_1.fas fas
► Question
Check the content of the fas directory.
Copy the sim_1.csv file into the sim directory and check that it worked.
To move the seq_2.fas file into the fas directory:
$ mv seq_2.fas fas
► Question
Check that the seq_2.fas file is now in the fas directory only.
Move the sim_2.csv file into the sim directory and check that it worked.
We now have two copies of the seq_1.fas and sim_1.csv files. The rm command can be used to delete them, either one by one
$ rm seq_1.fas
$ rm sim_1.csv
or in one go
$ rm seq_1.fas sim_1.csv
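The difference between copying, moving and deleting can be checked in a scratch directory. The sketch below uses made-up file names (a.txt, b.txt) and a made-up directory (backup); none of them are part of the course data.

```shell
workdir=$(mktemp -d)    # scratch directory for the experiment
cd "$workdir"
mkdir backup
touch a.txt b.txt

cp a.txt backup         # a.txt now exists in both places
mv b.txt backup         # b.txt now exists in backup only
rm a.txt                # delete the original a.txt; the copy survives

here=$(ls)              # what is left in the scratch directory
there=$(ls backup)      # what ended up in backup
echo "here:" $here
echo "backup:" $there

cd - > /dev/null        # go back to where we started
rm -r "$workdir"        # clean up
```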
We can use wildcards to access or manipulate multiple files at once. For instance, to move all fas files to the fas directory, we can:
$ mv seq*.fas fas
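It helps to remember that the shell, not mv, expands the wildcard: the pattern is replaced by the list of matching names before the command runs. A minimal sketch with made-up files in a scratch directory (echo is used here only to show what the pattern expands to):

```shell
workdir=$(mktemp -d)    # scratch directory for the experiment
cd "$workdir"
mkdir fas
touch seq_1.fas seq_2.fas seq_3.fas sim_1.csv

echo seq*.fas           # shows the names that the pattern expands to
mv seq*.fas fas         # same as typing every matching file name by hand

moved=$(ls fas)         # the three fas files
left=$(ls)              # the fas directory and the csv file
echo "moved:" $moved
echo "left:" $left

cd - > /dev/null        # go back to where we started
rm -r "$workdir"        # clean up
```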
► Question
Move all the csv files into the sim directory.
Check which csv files remain in the data directory.
► Question
Reproduce the following folder structure
wsbim2122_data/new_dir
wsbim2122_data/new_dir/file1.txt
wsbim2122_data/new_dir/file2.txt
wsbim2122_data/new_dir/data/seq_1.fas
wsbim2122_data/new_dir/data/sim_100.csv
Above, you searched for the TATATA pattern in one of the fas files. We could do the same thing without opening the file with the grep command.
$ grep TATATA seq_1.fas
The command above finds all the lines that match the pattern of interest and prints them to the standard output. If there are only a few, we could easily count the lines by eye. However, this approach is error prone, as it relies on a manual step, and it doesn’t scale. Instead, we can use the wc command, short for word count. Let’s first test it on the seq_1.fas file:
$ wc seq_1.fas
► Question
Read the wc manual page to understand the output.
Can you adapt the command above to return the number of lines only?
Below we first redirect the output of grep into a new file and then count the number of lines with wc:
$ grep TATATA seq_1.fas > tatata_1
$ wc tatata_1
The > file.txt syntax redirects the output to a file called file.txt. If a file named file.txt already exists, it will be overwritten. If, however, you want to add the output to the end of file.txt, then you should use >>. Execute the following lines to compare the two redirection operators.
$ wc -l seq_1.fas > nlines1.txt
$ wc -l seq_2.fas > nlines1.txt
$ wc -l seq_1.fas > nlines2.txt
$ wc -l seq_2.fas >> nlines2.txt
$ cat nlines1.txt
$ cat nlines2.txt
To avoid the creation of this intermediate file, the output of grep can be directly piped into wc:
$ grep TATATA seq_1.fas | wc
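To see a pipe at work on data whose content is known, the sketch below builds a tiny made-up sequence file (toy.fas is not part of the course data) and counts the matching lines in two ways: by piping grep into wc -l, and with grep's own -c option.

```shell
workdir=$(mktemp -d)    # scratch directory for the experiment

# A made-up file: one header line followed by three sequence lines,
# two of which contain TATATA
printf '>toy_sequence\nGGTATATACC\nACGTACGTAC\nTTTATATAGG\n' > "$workdir/toy.fas"

# Pipe: grep prints the matching lines, wc -l counts them
n_piped=$(grep TATATA "$workdir/toy.fas" | wc -l)

# grep can also count the matching lines itself
n_counted=$(grep -c TATATA "$workdir/toy.fas")

echo $n_piped $n_counted

rm -r "$workdir"        # clean up
```

Both approaches count lines, not occurrences: a line that contains the pattern twice is still counted once.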
► Question
Count the number of lines containing the TATATA pattern in all the fas files.
► Question
How many fas files do you have?
Use head to extract only the sequence headers (the first lines) from all the fas files.
Use grep to extract only the sequence headers (the first lines) from all the fas files.
Use grep to extract everything but the sequence headers from all the fas files. See the grep manual page to find an appropriate option.
► Question
Count the number of lines in all the sim_*.csv files.
How many id b were simulated over all files?
How many id d were simulated over all files?
► Question
Imagine that your task is to load all the simulation data into R to compare the x and y values for each id. You could read each file into R individually and then combine the data.
An alternative would be to use what we have learned so far to create a single file, all_sims.csv, containing all the simulation data.
Check that the number of lines in all_sims.csv matches what you found above.
A really powerful thing about the command line is that you can write scripts. Scripts let you save commands so that you can run them again, and let you put multiple commands together. Though writing scripts may require an additional time investment initially, this can save you time as you run them repeatedly. Scripts can also address the challenge of reproducibility: if you need to repeat an analysis, you retain a record of your commands within the script.
Let’s create a shell script that combines all the simulation files into a single new file. This will allow us to easily recreate that file once additional simulations are performed.
Create a new file with the extension .sh:
$ touch combine_all_simulations.sh
Add the shell commands from the exercise into that file. You can use an editor, either nano directly from the terminal, or RStudio.
Execute your script with
$ sh combine_all_simulations.sh
Let’s test the -l (for long) option of ls:
$ ls -l combine_all_simulations.sh
-rw-rw-r-- 1 lgatto lgatto 79 sept. 17 22:32 combine_all_simulations.sh
This long output gives us, in addition to the file name, the date and time it was last modified, the owner and group of the file (both lgatto here) and the file’s permissions.
There are ten slots in the permissions list. The first character in this list is related to file type, not permissions, so we’ll ignore it for now. The next three characters relate to the permissions that the file owner has, the next three relate to the permissions for group members, and the final three characters specify what other users outside of your group can do with the file. We’re going to concentrate on the three positions that deal with your permissions (as the file owner).
Here the three positions that relate to the file owner are rw-. The r means that you have permission to read the file, the w indicates that you have permission to write to (i.e. make changes to) the file, and the third position is a -, indicating that you don’t have permission to execute that file, i.e. to run it as a program.
Our goal for now is to change permissions on this file so that it becomes executable. And for security purposes, we want only the file owner to be able to execute it.
$ chmod u+x combine_all_simulations.sh
$ ls -l combine_all_simulations.sh
-rwxrw-r-- 1 lgatto lgatto 79 sept. 17 22:32 combine_all_simulations.sh
And now, it will be possible to execute the script without explicitly using sh. The leading ./ tells the shell to look for the script in the current directory.
$ ./combine_all_simulations.sh
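The whole cycle (write a script, run it with sh, make it executable, run it directly) can be tried on a throw-away script. In the sketch below, hello.sh and its content are made up for the illustration. Its first line, #!/bin/sh, is a common convention that is not covered above: it names the program that should interpret the file when it is executed directly.

```shell
workdir=$(mktemp -d)    # scratch directory for the experiment
cd "$workdir"

# Write a two-line script into hello.sh
printf '#!/bin/sh\necho "hello from a script"\n' > hello.sh

sh hello.sh             # works without execute permission: sh reads the file
chmod u+x hello.sh      # give the owner permission to execute the file
result=$(./hello.sh)    # the script can now be run directly
echo "$result"

cd - > /dev/null        # go back to where we started
rm -r "$workdir"        # clean up
```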
If you want to make sure you don’t inadvertently delete a file, it is possible to remove all w permissions:
$ chmod a-w combine_all_simulations.sh
$ ls -l combine_all_simulations.sh
-r-xr--r-- 1 lgatto lgatto 79 sept. 17 22:32 combine_all_simulations.sh
If you now try to delete the file with rm combine_all_simulations.sh, you’ll be asked if you want to override your file permissions:
rm: remove write-protected regular file 'combine_all_simulations.sh'?
If you enter n (for no), the file will not be deleted. If you enter y, you will delete the file. This gives us an extra measure of security, as there is one more step between us and deleting our data files.
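Write protection can be verified by looking at the permission string before and after the change. The sketch below uses a made-up file, notes.txt, in a scratch directory. It sets known starting permissions first so that the result does not depend on your system's defaults, and keeps only the first ten characters of the ls -l output, which hold the file type and permissions.

```shell
workdir=$(mktemp -d)    # scratch directory for the experiment
touch "$workdir/notes.txt"

chmod u=rw,g=r,o=r "$workdir/notes.txt"    # start from rw-r--r--
before=$(ls -l "$workdir/notes.txt" | cut -c1-10)

chmod a-w "$workdir/notes.txt"             # remove write permission for everyone
after=$(ls -l "$workdir/notes.txt" | cut -c1-10)

echo "$before -> $after"

# -f skips the confirmation that rm would otherwise ask for here
rm -rf "$workdir"
```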
Important: The rm command permanently removes the file. Be careful with this command. It doesn’t just nicely put the files in the Trash. They’re really gone.
By default, rm will not delete directories. You can tell rm to delete a directory using the -r (recursive) option.
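Both behaviours can be observed safely in a scratch directory. In this sketch, old_results and a.txt are made-up names; the error message that plain rm prints for a directory is hidden, and only its success or failure is recorded.

```shell
workdir=$(mktemp -d)    # scratch directory for the experiment
mkdir "$workdir/old_results"
touch "$workdir/old_results/a.txt"

# Plain rm refuses to delete a directory and reports a failure
if rm "$workdir/old_results" 2> /dev/null; then
    plain_rm="deleted"
else
    plain_rm="refused"
fi

# With -r, the directory and everything inside it are deleted
rm -r "$workdir/old_results"
if [ -e "$workdir/old_results" ]; then
    after="still there"
else
    after="gone"
fi

echo "plain rm: $plain_rm; after rm -r: $after"
rm -r "$workdir"        # clean up
```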
To complete this chapter on the shell, let’s conclude by mentioning the R system() and system2() functions, which make it possible to invoke a system command such as a shell script:
system("./combine_all_simulations.sh")
Page built: 2024-10-04 using R version 4.4.1 (2024-06-14)