Ok, first a quick update on the GenomeTweet project. I’ve added a new species! You can see the Caenorhabditis elegans nematode genome at @GenomeNematode. The only genome that has finished tweeting is the HIV genome. The rest are still running.
The next to finish is E. coli, which should be complete in a couple of weeks. Yeast will take another month after that. The fruit fly is going to take approximately 2.5 years. The nematode will take slightly less than the fruit fly, finishing just over 2 years from now. The human genome has been split into separate accounts for each chromosome but will still take approximately 5 years to finish.
The accounts have been offline every once in a while over the last few days as I made further improvements behind the scenes. I know some people checking these blog posts are interested in how the project works so I thought I’d give an update on the improvements. To see how it basically works, read this previous blog post. I’ve been changing the scripts almost continuously since I started this project. Once I’m happy that I’ve done all I can, I’ll share the code so others can use it. But I’ve still got a few changes I’d like to make.
The script described in the previous post was very simple. It would read a file that contained the genome already split into 140-character lines, then tweet each line. Simple as that. Since then the scripts have grown massive, then shrunk back down as I completely rewrote them. Although the tweeting was automatic, much of the project still relied on manual intervention while I learned more Python. Sometimes things went wrong at Twitter’s end, for example when Twitter was over capacity, and caused my scripts to fail. When this happened, I’d have to look at the Twitter account for that genome, copy the most recent tweet to the clipboard, open the genome file, find the line that was most recently tweeted, delete that line and all lines above it, then restart the tweeting script. That’s a lot of work for a supposedly automated project. It wasn’t so bad if one script failed, but potentially 28 could fail. I hadn’t anticipated that Twitter would cause so many problems, so I needed to rewrite the scripts to handle these problems themselves. Over the last week I’ve made multiple changes.
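For anyone curious, the original simple approach can be sketched roughly as below. This is only an illustration, not the actual project code: `split_into_tweets`, `tweet_genome` and the `post_tweet` argument are names I’ve made up here, and `post_tweet` stands in for whatever call really sends a tweet.

```python
def split_into_tweets(sequence, width=140):
    """Split a genome sequence into consecutive chunks of at most `width` characters."""
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


def tweet_genome(lines, post_tweet):
    """Tweet each prepared line in order.

    `post_tweet` is a hypothetical callable that sends one tweet.
    There is no error handling: if it raises, the whole run stops,
    which is the weakness of the simple version.
    """
    for line in lines:
        post_tweet(line)


# Example with a fake "Twitter" that just records what was posted.
sent = []
lines = split_into_tweets("ACGT" * 100)  # 400 bases -> chunks of 140, 140 and 120
tweet_genome(lines, sent.append)
```

Because nothing here recovers from a failed tweet, any error means trimming the file by hand before restarting.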
It currently works like this: each genome (or human chromosome) has a genome file and a script that tweets it. The script connects to Twitter, reads in the file, and tweets each line until the genome is complete. That’s the ideal world.
If something goes wrong at Twitter’s end, the script stops trying to tweet the genome, waits a few seconds (to give Twitter some time to start behaving), then checks Twitter and notes the most recent successful tweet. It then looks for that line in the genome file and deletes it and everything before it, so the genome file is reset and ready to be tweeted from where it left off. The script then runs again from the beginning of the file, tweeting until it either finishes or hits another error. This means the script keeps running even when there is a problem, but it only alters the genome file once it has already started trying to tweet.
Occasionally I want to prepare the genome files first and then start tweeting. For that I have a genomeRestart script, which works with all the genomes. It asks which genomes you want to restart, checks Twitter for the most recent tweet from each, updates the genome files, then runs each genome’s tweeting script. That should run fine, and any further problem is dealt with inside the tweeting script itself. The genomeRestart script is for the occasions when I deliberately stop a script from tweeting, perhaps because I’m adding improvements to the code, or when the scripts stop because of a problem at my end rather than Twitter’s (a loss of connection or a power cut).
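The recovery behaviour described above (wait, look up the last successful tweet, drop that line and everything before it, carry on) could look something like this minimal sketch. It is not the real code: `trim_tweeted`, `tweet_with_recovery`, `post_tweet` and `get_last_tweet` are hypothetical names, the two callables stand in for the real Twitter calls, and the lines are kept in memory here instead of rewriting the genome file on disk.

```python
import time


def trim_tweeted(lines, last_tweet):
    """Drop the most recently tweeted line and everything before it.

    Uses the first matching line; if there is no match, the list is
    returned unchanged so nothing is lost.
    """
    if last_tweet in lines:
        return lines[lines.index(last_tweet) + 1:]
    return lines


def tweet_with_recovery(lines, post_tweet, get_last_tweet,
                        wait_seconds=5, max_failures=10):
    """Tweet every line, re-syncing with Twitter after each failure.

    `post_tweet` and `get_last_tweet` are assumed callables: one sends
    a tweet, the other returns the text of the account's latest tweet.
    """
    remaining = list(lines)
    failures = 0
    while remaining:
        try:
            post_tweet(remaining[0])
            remaining.pop(0)
            failures = 0  # a success resets the consecutive-failure count
        except Exception:  # a real client would raise something more specific
            failures += 1
            if failures > max_failures:
                raise
            time.sleep(wait_seconds)  # give Twitter time to start behaving
            remaining = trim_tweeted(remaining, get_last_tweet())


# Example: a fake Twitter that fails on the second attempt only.
posted = []
attempts = {"count": 0}


def flaky_post(text):
    attempts["count"] += 1
    if attempts["count"] == 2:
        raise RuntimeError("over capacity")
    posted.append(text)


tweet_with_recovery(["AAA", "CCC", "GGG"], flaky_post,
                    lambda: posted[-1], wait_seconds=0)
```

Checking Twitter after a failure, rather than trusting the script’s own position, covers the case where a tweet went through even though an error came back. A restart helper along the lines of genomeRestart could reuse `trim_tweeted` on each genome file before launching its tweeting script.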
Now I’m free to sit back and let the scripts do their thing. The only manual interaction needed at the moment is after a power cut or a loss of internet connection (I haven’t had either yet). If that happens, all I need to do is run the genomeRestart script and select the genomes I want to restart (all of them), and it will prepare their genome files and then run the scripts. Easy as that. The scripts get better and better each week as I learn more Python, but I still have further improvements in mind. It’s addictive. The biggest improvement would be to host the whole project on a server so it isn’t running from one of my computers.