Jake Heare Research Central: 1 8 2015 Remote ipython notebook and Github Usage

It's been a while since I've posted anything on here due to the holidays. My paper is still available on PeerJ and is about to have a new version with some updates.

In December, my first RAD-Seq data came back from sequencing, but I was not able to process it at the time because A) I was completing the manuscript and B) I lacked the technical skill to work with RAD-Seq data. Over the course of the next year I will be focusing my efforts on processing and analyzing the RAD-Seq data to make some unique discoveries based on the genotypes of my three populations and the reciprocal transplant experiment.

Before that, though, I need to do some housekeeping to set myself up for success. My personal computer is not capable of running some of the higher-end RAD-Seq programs such as CLC and Stacks. These programs can take hours to days to process even relatively simple RAD-Seq data and need to run uninterrupted to finish. Before my forays into RAD-Seq, my lab set up a private server to run Stacks, which I will use for my processing and analysis.

I'm also averse to working at the terminal physically attached to the server, because it means I have to be near the server, limiting when and where I can work. Instead I have opted to run everything remotely through an SSH tunnel (using PuTTY on Windows) so I can easily log in from anywhere and check on the status of my projects. When I program like this, I use Python with the ipython shell. This allows me to produce high-quality notebooks that I can look back on, and which I will be posting on here in the future. Using ipython notebook remotely can be a bit of a pain if you've never done it before, so below is a quick run-through of how I do it with PuTTY.

1. Add a tunnel in PuTTY (under Connection > SSH > Tunnels) using 8888 as the source port and the IP address for your server followed by ":8888" as the destination. The number can be different depending on whatever port you want to use, but I usually go with something in the upper thousands. The destination should look something like this: 128.00.000.00:8888, with the IP address replaced by your own. When you connect, this opens a tunnel to your server and gives you a local port to open your notebook on.

2. Start your ipython notebook in whatever directory you like with the command:

ipython notebook --no-browser --port=8888

This starts the notebook server on the remote machine without opening a browser there. As you can see, I've used the same port that I added to the IP address in step 1.

3. Start the notebook GUI in whatever browser you like by entering the following into the address bar:

localhost:8888

This directs your browser to the local end of the tunnel, which forwards the connection to the notebook running on the server. Now that you're in ipython notebook's GUI, you can code however you like.
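If you are not on Windows, the same tunnel can usually be opened with the OpenSSH client instead of PuTTY. The snippet below is only a sketch: the user name is a made-up placeholder, the IP address is the placeholder from step 1, and it assumes the notebook listens on the server's own loopback address (the ipython default). It prints the command rather than running it so you can check it first.

```shell
#!/bin/sh
# Hypothetical values; replace them with your own server details.
REMOTE_USER="youruser"        # assumption: your login name on the server
REMOTE_HOST="128.00.000.00"   # placeholder IP address from step 1
PORT=8888                     # must match the --port given to ipython notebook

# -N  open the connection without running a remote command
# -L  forward local PORT to localhost:PORT as seen from the server
TUNNEL_CMD="ssh -N -L ${PORT}:localhost:${PORT} ${REMOTE_USER}@${REMOTE_HOST}"

# Print the command instead of executing it.
echo "$TUNNEL_CMD"
```

Once the printed command is running in one terminal, steps 2 and 3 work the same way as with PuTTY.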

Once you've created a notebook or script, you want to keep a copy of it somewhere secure so that in the future you can reuse it or have a checkpoint to build new code from. The notebooks are saved on the remote server, so when you log off you no longer have access to them, which gets really annoying if you want a local copy to post on your blog or share with others. Another annoying feature is that when you update a notebook you immediately lose the previous version, as it's overwritten by the new one. While not necessarily bad, this can lead to trouble if you're like me and decide to do something wonky that breaks everything.

Version control is what developers use to make sure they have a working version to go back to if they screw everything up later on. It is also a great thing for scientists who want to produce reproducible scripts for others to share and use. The easiest way to get it is to use the program git and the website GitHub to back up your scripts, notebooks, and other files for future use and public sharing. Git is a version control system that tracks changes to the files and folders in a repository, so if you aren't using it now, definitely check it out. Luckily our server already has git installed, so I don't need to go through the process of downloading and installing it.

Setting up git and GitHub and backing things up is very simple on your local computer but becomes less intuitive if you are doing things remotely like I am. Being up for a challenge, I decided to figure out how to set up a repository on the server, download the current contents of the repository on GitHub, and upload any new files, folders, or changes I made on the server. Below is how I accomplished that in a few relatively easy steps.

The easiest way I've found to set up a remote repository on a server is this:

1. Make a new repository on your GitHub account, or use an existing one.
2. Clone the repository using the command:
git clone [URL of Github Repository]
You can find this URL on the lower right-hand corner of the GitHub repository page, above the download link. I suggest using the HTTPS option, as it's the one that I've gotten to work. This will make an exact duplicate of everything in the repository. Do not place brackets around the URL.
3. Check your repository's remote name with:
git remote
This should show that the remote alias is "origin", which is the name git gives to the GitHub repository you cloned from. This name can be changed, but for our purposes that is unnecessary. Additional aliases can be created as well.
4. Check your repository status with:
git status
This will tell you if you have any untracked files or changes that need to be added or committed. At this point there should be nothing that needs adding or committing, so we'll make a new ipynb file to commit to the repository.
5. Launch ipython notebook from your repository directory, create a new notebook, save the notebook, and shut down ipython.
6. Check your repository status again with:
git status
You should see a folder named .ipynb_checkpoints and a file for your new notebook listed as untracked files.
7. Add the untracked files to the repository using:
git add -A
This stages all new and changed files. If you have files or folders you don't want to add, you can add each file or folder individually instead, but for our purposes adding everything is the best option.
8. Commit the newly added files to the repository, with a message explaining the commit:
git commit -a -m 'your message here'
The -a stages any modified or deleted tracked files before committing; after git add -A it is redundant but harmless. Again, you can specify files individually, but it is best to do them all. The -m tells git that the following statement in quotation marks is the commit message. If you leave it out, git commit will open a text editor, which on many servers is Vi/Vim. Vi/Vim is a terminal text editor with its own set of commands. If you accidentally open Vi/Vim, don't worry: use the arrow keys to move the cursor to the last available line (it will beep at you when you try to go beyond that line), press i to start editing, then delete the hash mark and type your description (lines that start with a hash mark are ignored). Once you finish typing the description, hit ESC once followed by SHIFT+Z twice (ZZ) to save your message and quit Vi/Vim. Once this is done the files will be committed to the repository.
9. Push the commits to GitHub with:
git push origin master
Once you do this, GitHub will ask for your GitHub username and password. Here origin is the alias for the GitHub repository, while master is the branch being pushed. If you push a local branch with a different name instead, GitHub will create a branch of that name in the repository. Now you should have a copy of all your server's project files in the appropriate GitHub repository.
After adding new notebooks, updating current notebooks, or producing any new files in the server repository, YOU MUST RERUN STEPS 7, 8, AND 9 to add, commit, and push the files from the server to GitHub.
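The add, commit, and push cycle can be tried out safely before touching a real repository. The sketch below is a dry run under a few assumptions: git is installed, a throwaway bare repository in a temporary directory stands in for GitHub, and the file name, commit message, and demo identity are all made up for illustration.

```shell
#!/bin/sh
set -e

# A local bare repository stands in for GitHub (illustration only; on the
# real server the remote is the HTTPS URL of your GitHub repository).
WORK=$(mktemp -d)
git init --quiet --bare "$WORK/fake-github.git"
git clone --quiet "$WORK/fake-github.git" "$WORK/server-copy" 2>/dev/null
cd "$WORK/server-copy"

# Name the branch master to match the text, whatever the local git default is.
git symbolic-ref HEAD refs/heads/master

# Identity for the demo commit (hypothetical values).
git config user.name "Demo User"
git config user.email "demo@example.com"

# Stand-in for a notebook saved from ipython notebook.
echo '{}' > analysis.ipynb

git add -A                                      # stage new and changed files
git commit --quiet -m 'Add analysis notebook'   # record them with a message
git push --quiet origin master                  # send the commit to the remote

# Prints nothing when everything has been committed.
git status --short
```

On the real server only the three middle commands are needed each time, since the clone already points at GitHub.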

Also, if something changes on the GitHub master, the push command may fail because the server's repository is not up to date. You can quickly fix this with the command:
git fetch origin; git merge origin/master
Here origin is the alias for the GitHub repository. This downloads any new changes from GitHub and merges them into the server's copy, clearing the way for you to push any local commits you have. You can then rerun the push command from step 9 to update the GitHub repository.
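To see the rejected push and the fix in action without risking real work, here is a small simulation. It is a sketch under stated assumptions: git is installed, a temporary bare repository plays the part of GitHub, and a second clone called "laptop" (a made-up name) plays the part of whoever changed the GitHub master. File names and commit messages are illustrative.

```shell
#!/bin/sh
set -e

WORK=$(mktemp -d)
git init --quiet --bare "$WORK/fake-github.git"
git --git-dir="$WORK/fake-github.git" symbolic-ref HEAD refs/heads/master

# The server's clone makes and pushes the first commit.
git clone --quiet "$WORK/fake-github.git" "$WORK/server" 2>/dev/null
cd "$WORK/server"
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo User"
git config user.email "demo@example.com"
echo 'first' > notes.txt
git add -A
git commit --quiet -m 'First commit'
git push --quiet origin master

# A second clone pushes a change, so the remote master moves ahead.
git clone --quiet "$WORK/fake-github.git" "$WORK/laptop"
cd "$WORK/laptop"
git config user.name "Demo User"
git config user.email "demo@example.com"
echo 'from laptop' > laptop.txt
git add -A
git commit --quiet -m 'Laptop change'
git push --quiet origin master

# Back on the server: a new local commit can no longer be pushed directly.
cd "$WORK/server"
echo 'from server' > server.txt
git add -A
git commit --quiet -m 'Server change'
if git push --quiet origin master 2>/dev/null; then
    echo "unexpected: push succeeded"
else
    echo "push rejected: the remote has commits the server lacks"
fi

# The fix from the text: fetch, merge, then push again.
git fetch --quiet origin
git merge --quiet --no-edit origin/master
git push --quiet origin master
```

The single command git pull origin master performs the same fetch and merge in one step, if you prefer something shorter.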

You can find a more in depth guide to using remote functions at gitref.org.
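Once a notebook has been committed a few times, the overwritten-version problem described earlier goes away, because any committed version can be brought back. The following is a minimal sketch in a throwaway local repository, assuming git is installed; the file name, contents, and commit messages are invented for the demonstration.

```shell
#!/bin/sh
set -e

WORK=$(mktemp -d)
git init --quiet "$WORK/repo"
cd "$WORK/repo"
git config user.name "Demo User"
git config user.email "demo@example.com"

# First commit: a version that works.
echo 'working analysis' > analysis.ipynb
git add -A
git commit --quiet -m 'Working version'

# Second commit: an experiment that overwrites the file.
echo 'something wonky' > analysis.ipynb
git commit --quiet -a -m 'Experiment that broke everything'

# List the commits, newest first.
git log --oneline

# Bring back the file as it was one commit ago, then record the restore.
git checkout HEAD~1 -- analysis.ipynb
git commit --quiet -m 'Restore working version'

cat analysis.ipynb
```

Nothing is lost by restoring this way: the broken experiment stays in the history in case you want to look at it again.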

Once all this is done, you are good to go. You can program remotely in any language, including Python and R, and have all your scripts, notebooks, and other files backed up and publicly available for you or others to download and peruse. Hopefully this will make my RAD-Seq analyses easier to back up and share with others, including readers of this blog.

If you are interested in viewing my GitHub repositories for future coding or even collaborations, you can find me on GitHub with the username: jheare

Happy New Year!


Thursday, January 8, 2015
