>>> dir()
['__builtins__', '__doc__', '__name__', '__package__']
>>> from test_function import combineSequences
>>> dir()
['__builtins__', '__doc__', '__name__', '__package__', 'combineSequences']
>>> help(combineSequences)
Help on function combineSequences in module test_function:
combineSequences(seqDict, key1, key2, key3)
combines the sequences stored in the seqDict dictionary parameter at key1 and key2, and stores it at key3, possibly overwriting key3's contents. Example usage:
>>> sd = {1:'AAGC', 2:'CAT'}
>>> combineSequences(sd, 1, 2, 3)
>>> sd[3]
'AAGCCAT'
>>> sd = {{0:'ABC', 2:'DEF'}
>>> combineSequences(sd, 0, 2, 4)
>>> sd[4]
This is an ok approach, and if you are good about planning the different inputs to give the function and the results I expect, I could completely test the function this way. The problem is that every time I change what I've done and want to re-test I'm going to have to do a lot of typing (and I could forget an important test). There are several ways to automate unit testing in python. We'll talk about one method.
Testing with doctest
The basic idea of doctest is that we've already given in our documentation an example of how to use our function. We want to make sure that our documented examples work (and keep working throughout the life of the code). We should be able to automate this somehow, which is what doctest does. It goes through function level comments and looks for things that look like python usage (">>> ") and then runs those python lines, testing that each line returns what you expected.
You can now test a file's behavior on the command line using your new-found knowledge of the doctest module:
$ python -m doctest -v test_function.py
Trying:
sd = {1:'AAGC', 2:'CAT'}
Expecting nothing
ok
Trying:
combineSequences(sd, 1, 2, 3)
Expecting nothing
ok
Trying:
sd[3]
Expecting:
'AAGCCAT'
ok
1 items had no tests:
test_function
1 items passed all tests:
3 tests in test_function.combineSequences
3 tests in 2 items.
3 passed and 0 failed.
Test passed.
But if you are going to be really through in testing a function, you might have more tests than you want in your documentation of the function. Don't worry, doctest can still handle that. Just put all the tests in a text file (it doesn't need to be a .py file, doctest will just treat it like one giant comment string, looking for things that look like python usage (">>> ") and run those lines). Then just start up a python session:
>>> import doctest
>>> doctest.testfile("example.txt")
TestResults(failed=0, attempted=4)
>>>
If you want you can import the .py file into a python session and use that session to test the function. When you come up with a good test, you can just paste it into the .txt file that you'l
l use later.
A good set of unit tests will check every line of code within a function. For example, if your function had the parameters "sequenceWindow" and "sequence" and the following if block
if(sequenceWindow > len(sequence)):
return False
elif(sequenceWindow < 3):
return False
else:
return "AAA" in sequence[0:sequenceWindow]
your unit tests for that function should at least pass one case where sequenceWindow > len(sequence), one case where sequenceWindow <= len(sequence) and less than 3, and one case where sequenceWindow is <= len(sequence) and greater than or equal to 3. If you don't have those three situations covered, at least one section of the if blocks behavior will not be tested.
You can find out more about doctest at http://docs.python.org/library/doctest.html and see a different approach to unit testing in Python at http://docs.python.org/library/unittest.html
Programming excercise 1
Download test_function.py and example.txt. Add tests where you pick different function parameters and the return value you expect. Is the function correct? Thre things to note:
1. doctest is very picky. For example, if python returns 'ABC', and your text file says "ABC" doctest will give an error.
2. Testing works best when you can think of the general types of inputs for a function and the general types of return behavior. In combineSequences(), if it works for the inputs of "ABC" and "DEF", is there anything in the code that implies that using "ABD" and "CEF" as parameters would cause a different type of behavior?
3. In general it's important to test both that the function works in the way you expect it to, and that it doesn't work in the way you expect. If you think a function should throw errors in certains cases, make sure it does so.
Biopython
There is a community-run open-source drive to provide a series of packages and modules for biology, including sequence features, support tools for alignments, database access, processing file formats (eg FASTA, GenBank). and tools to interface with other tools (eg Clustalw, NCBI, Blast). It's a community project that has excellent documentation, so you can figure out the full details of what biopython does pretty easily.Biopython documentation: We can see a list of the Entrez functions in the Python documentation. http://biopython.org/DIST/docs/api/toc-Bio.Entrez-module.html How to use BioPython Here is an example of connecting to Entrez using biopython.
import sys
from Bio import Entrez
# *Always* tell NCBI who you are
Entrez.email = "gabow@cbio.mskcc.org"
def retrieve_annotation(id_list):
"""Annotates Entrez Gene IDs using Bio.Entrez, in particular epost (to
submit the data to NCBI) and esummary to retrieve the information.
Returns a list of dictionaries with the annotations.""" #this is a docstring, useful with __doc__ and IDE
request = Entrez.epost("gene",id=",".join(id_list))
try:
result = Entrez.read(request)
except RuntimeError as e:
#FIXME: How generate NAs instead of causing an error with invalid IDs?
print "An error occurred while retrieving the annotations."
print "The error returned was %s" % e
webEnv = result["WebEnv"]
queryKey = result["QueryKey"]
data = Entrez.esummary(db="gene", webenv=webEnv, query_key = queryKey)
annotations = Entrez.read(data)
print "Retrieved %d annotations for %d genes" % (len(annotations), len(id_list))
#returns a list of dictionaries
return annotations
If you wanted to print the dictionaries
def print_data(annotation):
for gene_data in annotation:
gene_id = gene_data["Id"]
gene_symbol = gene_data["NomenclatureSymbol"]
gene_name = gene_data["Description"]
print "ID: %s - Gene Symbol: %s - Gene Name: %s" % (gene_id, gene_symbol, gene_name)
Let's look at what exactly happens in the code. The most important pint is that very little of it is anything you haven't seen in standard python. All of print_data is standard, as is most of retieve_annotation. The only part of the code that involves BioPython are the references to the Entrez object and the functions related to the Entrez object, such as "Entrez.esummary()"
You can see several examples, such as the above code, at http://biopython.org/wiki/Category:Cookbook and in the tutorial
When to work with BioPython
Biopython has some very specific modules, such as the above, that is used to connect to one specific online resource. The other modules are for more broad biological concepts, such as sequences.
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna
seq = Seq('gcatgacgttattacgactctgtcacgccgcggtgcgactgaggcgtggcgtctgctggg', generic_dna)
print seq
You can manipulate Seq objects similarly to how we manipulate string objects. You can also do some extra things with Seq objects, such as reading or writing FASTA sequences.
If you are going to work with a very standard biological file format or do a standard bioinformatics task, someone might already have implemented a way of doing it in BioPython.
For example, the SeqIO package can read abi, ace, clustal, embl, fasta, fastq, genebank, ig, img, nexus, phd, phylip, pir, seqxml, sff, stockholm,swiss, tab, qual, uniprot-xml files, and can write files into many of these formats too.
Programming excercise 2
Part 1: download seq_start.py. It gives a method for fetching from eutils a nucleotide sequence (nucleotide db) from a parameter id and returns a fasta file with xml encoding. It then creates a dictionary that contains the sequence. Modify this function to just return just the sequence. Bonus: try to write this function without the skeleton code. If you are writing this from the ground up, looking at this and this might be very helpful.
Part 2: take the sequence and encode it as a biopython Seq object.
Hint: make sure you are using the right alphabet.
Download this sequence file. Write a program to open it using SeqIO, get each SeqRecord object and print its sequence and sequnce complement
Answer1
Answer2