samedi 25 avril 2015

Matching Barcodes to sequences python?

I have sequence files and barcode files. The barcode files may have barcodes of any length that look like "ATTG, AGCT, ACGT" for example. The sequence files look like "ATTGCCCCCCCGGGGG, ATTGTTTTTTTT, AGCTAAAAA" for example. I need to match the barcodes to the sequences that contain them at the beginning. Then for each set of sequences with the same barcode I have to do calculations on them with the rest of the program (which is written already). I just dont know how to get them to match. Ive went through using print statements and The part where it is messed up is the "potential_barcode = line(:len(barcode)" line. Also, where it says #simple to fasta that is where I should be reading in the matched sequences. I'm pretty new at this so I probably made a lot of mistakes. Thanks for your help!

bcodefname = sys.argv[1]
infname = sys.argv[2]
barcodefile = open(bcodefname, "r")
for barcode in barcodefile:
        barcode = barcode.strip()
        print "barcode: %s" % barcode
        outfname = "%s.%s" % (bcodefname,barcode)
#           print outfname
        outf = open("outfname", "w")
        handle = open(infname, "r")
        for line in handle:
                potential_barcode = line[:len(barcode)]
                print potential_barcode
                if potential_barcode == barcode:
                        outseq = line[len(barcode):]
                        sys.stdout.write(outseq)
                        outf.write(outseq)
                        fastafname = infname + ".fasta"
                        print fastafname
                        mafftfname = fastafname + ".mafft"
                        stfname = mafftfname + ".stock"
                        print stfname
#simp to fasta#
#                       handle = open(infname, "r")
                        outf2 = open(fastafname, "w")
                        for line in handle:
                                linearr = line.split()
                                seqid = linearr[0]
                                seq = linearr[1]
                                outf2.write(">%s\n%s\n" % (seqid,seq))
#                       handle.close()
#                       outf.close()
#mafft#
                        cmd = "mafft %s > %s" % (fastafname,mafftfname)
                        sys.stderr.write("command: %s\n" % cmd)
                        os.system(cmd)
                        sys.stderr.write("command done\n")

Aucun commentaire:

Enregistrer un commentaire