Biopython - Sequence Operations

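The Biopython module provides a variety of built-in methods for performing basic and advanced operations on sequences. The basic operations closely resemble string methods: slicing, concatenation, finding, counting, stripping, splitting, and so on. Some of the advanced operations are described below.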
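Before turning to the advanced operations, here is a rough illustration of the basic operations named above. It uses a plain Python str rather than a Seq object, relying only on the statement that the two behave very similarly, so treat it as an analogy and not as Biopython output. The variable names are made up for this sketch.

```python
# Plain-str analogy for the basic operations; this is not Biopython code.
dna = "CTGACTGAAGCT"

first_codon = dna[:3]               # slicing
joined = dna + "TTAA"               # concatenation
position = dna.find("GAA")          # index of the first match, or -1 if absent
a_count = dna.count("A")            # number of non-overlapping occurrences
stripped = "  CTGA\n".strip()       # remove surrounding whitespace
parts = "CTG-ACT-GAA".split("-")    # split on a separator

print(first_codon, joined, position, a_count, stripped, parts)
```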

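Complement and reverse complement: Biopython provides the complement() and reverse_complement() methods, which return the complement of a given nucleotide sequence as a new sequence. Applying complement() to a complement returns the original sequence, whereas applying reverse_complement() to a complement returns the original sequence reversed. Here is a simple example: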

Syntax: complement(self)

Return Type: Bio.Seq.Seq


Python3

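# Import Libraries
# Note: Bio.Alphabet was removed in Biopython 1.78; on newer versions,
# drop this import and the alphabet argument below.
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC

# Creating sequence
seq = Seq('CTGACTGAAGCT', IUPAC.ambiguous_dna)

# Creating the complement of the sequence and printing its repr
comp = seq.complement()
print(repr(comp))

# Creating the reverse complement of the complement and printing its repr
rev_comp = comp.reverse_complement()
print(repr(rev_comp))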

Output:

Seq('GACTGACTTCGA', IUPACAmbiguousDNA()) 
Seq('TCGAAGTCAGTC', IUPACAmbiguousDNA())

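In the example above, the complement() method creates the complement of a DNA or RNA sequence, while the reverse_complement() method creates the complement and then reverses the result from left to right.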
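To make the two operations concrete, the following plain-Python sketch reproduces the sequences shown in the output above without using Biopython. The helper names are hypothetical and the mapping covers only the four unambiguous DNA bases; it illustrates the idea and is not how Biopython implements these methods.

```python
# Hypothetical helpers: a plain-Python illustration, not Biopython's implementation.
_DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(dna: str) -> str:
    """Swap each base for its pairing partner (A<->T, C<->G)."""
    return dna.translate(_DNA_COMPLEMENT)


def reverse_complement(dna: str) -> str:
    """Complement the sequence, then reverse the result."""
    return complement(dna)[::-1]


seq = "CTGACTGAAGCT"
comp = complement(seq)
print(comp)                       # GACTGACTTCGA
print(reverse_complement(comp))   # TCGAAGTCAGTC
```

Because complementing twice restores the original bases, the reverse complement of the complement is simply the original sequence read backwards.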

Biopython's Bio.Data.IUPACData module provides the ambiguous_dna_complement variable, which is used to perform complement operations.

Python3

# Import libraries
from Bio.Data import IUPACData
import pprint
 
# Printing the dataset
pprint.pprint(IUPACData.ambiguous_dna_complement)

Output:

{
   'A': 'T',
   'B': 'V',
   'C': 'G',
   'D': 'H',
   'G': 'C',
   'H': 'D',
   'K': 'M',
   'M': 'K',
   'N': 'N',
   'R': 'Y',
   'S': 'S',
   'T': 'A',
   'V': 'B',
   'W': 'W',
   'X': 'X',
   'Y': 'R'}
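As a small sketch of how such a mapping can be applied, the code below copies the dictionary from the output above into plain Python and uses it to complement a sequence that contains ambiguity codes. The function name and the sample sequence are made up for illustration.

```python
# Mapping copied from the ambiguous_dna_complement output shown above.
AMBIGUOUS_DNA_COMPLEMENT = {
    "A": "T", "B": "V", "C": "G", "D": "H",
    "G": "C", "H": "D", "K": "M", "M": "K",
    "N": "N", "R": "Y", "S": "S", "T": "A",
    "V": "B", "W": "W", "X": "X", "Y": "R",
}


def complement_ambiguous(dna: str) -> str:
    """Complement a DNA string that may contain IUPAC ambiguity codes."""
    return "".join(AMBIGUOUS_DNA_COMPLEMENT[base] for base in dna.upper())


# R (A or G) pairs with Y (T or C); K (G or T) pairs with M (C or A).
print(complement_ambiguous("ACGTRYKMN"))   # TGCAYRMKN
```

Every code maps to a partner that maps back to it, so complementing twice always returns the original sequence.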

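GC content (guanine-cytosine content): GC content is the percentage of nitrogenous bases in a DNA or RNA molecule that are either guanine or cytosine. It is calculated by dividing the number of G and C nucleotides by the total number of nucleotides. Here is a basic example of calculating GC content: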

Syntax: Bio.SeqUtils.GC(seq)

Return Type: float


Python3

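# Import Libraries
# Note: Bio.Alphabet was removed in Biopython 1.78; on newer versions,
# drop that import and the alphabet argument below. Newer releases also
# replace GC with gc_fraction, which returns a fraction (0 to 1)
# rather than a percentage.
from Bio.Seq import Seq
from Bio.SeqUtils import GC
from Bio.Alphabet import IUPAC

# Creating sequence
seq = Seq("CTGACTGAAGCT", IUPAC.unambiguous_dna)

# Getting GC content as a percentage
print(GC(seq))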

Output:

50.0
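The calculation described above (G and C count divided by total length) can be written out by hand. This is a minimal sketch with a hypothetical function name; it counts only the letters G and C and ignores ambiguity codes, so it may differ from Biopython's result on ambiguous sequences.

```python
def gc_percent(seq: str) -> float:
    """Return the percentage of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0  # avoid dividing by zero on an empty sequence
    gc_count = seq.count("G") + seq.count("C")
    return 100.0 * gc_count / len(seq)


# 3 G + 3 C out of 12 bases
print(gc_percent("CTGACTGAAGCT"))   # 50.0
```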

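Transcription: Transcription is the process of converting a DNA sequence into an RNA sequence. Actual biological transcription works from the template strand and takes its reverse complement (for example, GACT -> AGUC) to produce the mRNA. In Biopython, we usually work directly with the coding strand, so the mRNA is obtained simply by replacing the letter T with U. A simple example is given below: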

Syntax: transcribe(self)

Return Type: Bio.Seq.Seq


Python3

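# Import Libraries
# Note: Bio.Alphabet was removed in Biopython 1.78; on newer versions,
# drop that import and the alphabet argument below.
from Bio.Seq import Seq
from Bio.Seq import transcribe
from Bio.Alphabet import IUPAC

# Creating sequence (the coding strand)
dna_seq = Seq("CTGACTGAAGCT", IUPAC.unambiguous_dna)

# Transcription to RNA (T is replaced with U)
rna_seq = transcribe(dna_seq)
print(rna_seq)

# Back-transcription to DNA (U is replaced with T)
print(rna_seq.back_transcribe())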

Output:

CUGACUGAAGCU
CTGACTGAAGCT
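The relationship between the two views of transcription can be checked in plain Python. In this sketch (hypothetical function names, unambiguous bases only), transcribing the coding strand by swapping T for U gives the same mRNA as taking the reverse complement of the template strand.

```python
_DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def transcribe_coding(coding: str) -> str:
    """Shortcut used in bioinformatics: replace T with U on the coding strand."""
    return coding.replace("T", "U")


def transcribe_template(template: str) -> str:
    """Biological view: reverse-complement the template strand, then use U for T."""
    coding = template.translate(_DNA_COMPLEMENT)[::-1]
    return coding.replace("T", "U")


def back_transcribe(rna: str) -> str:
    """Recover the coding-strand DNA by replacing U with T."""
    return rna.replace("U", "T")


coding = "CTGACTGAAGCT"
template = coding.translate(_DNA_COMPLEMENT)[::-1]   # AGCTTCAGTCAG

print(transcribe_coding(coding))       # CUGACUGAAGCU
print(transcribe_template(template))   # CUGACUGAAGCU
print(back_transcribe("CUGACUGAAGCU")) # CTGACTGAAGCT
```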

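Translation: Translation is the process of converting an RNA sequence into a protein sequence. Seq objects have a built-in translate() method for this purpose. To stop translation at the first stop codon, pass to_stop=True to the translate() method.

Syntax: translate(self, table='Standard', stop_symbol='*', to_stop=False, cds=False, gap='-')
Return Type: Bio.Seq.Seq

Biopython uses the translation tables provided by NCBI's The Genetic Codes page. The standard table, for example, can be printed as follows: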

Python3

# import libraries
from Bio.Data import CodonTable
 
# Creating table
table = CodonTable.unambiguous_dna_by_name["Standard"]
 
# Print table
print(table)

Output:

Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--

A simple translation example is given below:

Python3

# Import Libraries
# Note: Bio.Alphabet was removed in Biopython 1.78; on newer versions,
# drop that import and the alphabet argument below.
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC

# Creating sequence
rna = Seq('UACCGGAUUGUUUUCCCGGGCUGAUCCUGUGCCCGA', IUPAC.unambiguous_rna)
print(rna)

# Translating the whole RNA (an asterisk '*' marks a stop codon)
print(rna.translate())

# Stopping translation at the first stop codon
print(rna.translate(to_stop=True))

Output:

UACCGGAUUGUUUUCCCGGGCUGAUCCUGUGCCCGA
YRIVFPG*SCAR
YRIVFPG
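To show what the translation step does with the standard table printed earlier, here is a plain-Python sketch that rebuilds that table and translates the same RNA codon by codon. The names CODON_TABLE and translate are chosen for this illustration; the sketch handles only unambiguous bases and the standard table, and it silently ignores a trailing partial codon, which is an assumption of the sketch rather than documented Biopython behaviour.

```python
from itertools import product

# Standard genetic code in T, C, A, G order, matching the table above
# row by row ('*' marks a stop codon).
_BASES = "TCAG"
_AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def translate(seq: str, to_stop: bool = False, stop_symbol: str = "*") -> str:
    """Translate an RNA or DNA string codon by codon using the standard table."""
    dna = seq.upper().replace("U", "T")
    protein = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            if to_stop:
                break
            amino_acid = stop_symbol
        protein.append(amino_acid)
    return "".join(protein)


rna = "UACCGGAUUGUUUUCCCGGGCUGAUCCUGUGCCCGA"
print(translate(rna))                 # YRIVFPG*SCAR
print(translate(rna, to_stop=True))   # YRIVFPG
```

The eighth codon, UGA, is a stop codon, which is why the to_stop=True result ends after the seventh amino acid.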