I'm trying to code the mismatch kernel in python. I would like to extract all substrings of a string given a numpy boolean array mask, where the extraction pattern is not necessarily continuous (e.g. mask = [False, True, False, True], such that from 'ABCD' I extract 'BD'). After extracting substrings according to this pattern I can then count all the common substrings between my two sequences. Concerning the extraction step string[theta]
doesn't work to extract such substring. I now have the following chunk of code which works:
def function(s1, s2, k, theta):
l1 = []
l2 = []
# substrings of s1
substrk_itr1 = (s1[i:i+k] for i in range(len(s1) - k + 1))
l1 = [''.join(substr[i] for i, b in enumerate(theta) if b)
for substr in substrk_itr1]
# substrings of s2
substrk_itr2 = (s2[i:i+k] for i in range(len(s2) - k + 1))
l2 = [''.join(substr[i] for i, b in enumerate(theta) if b)
for substr in substrk_itr2]
L = l1 + l2
C = Counter(L)
c1 = Counter(l1)
c2 = Counter(l2)
x = sum([c1[w] * c2[w] for w in C if w])
return x
where (s1,s2) are strings I want to extract all substrings from by first considering all substrings of length k, and then reextract a substring according to the boolean pattern theta. You can make tests with the following values, and you should theoretically get 2.
k = 5
theta = np.array([False,True, True, True, False])
X = 'AAATCGGGT'
Y = 'AAATTGGGT'
The issue is that this code is too slow, (I use it to compute a kernel, so I run it thousands of times). I profiled the code and the bottleneck is due to the join function mostly.
Is there a way to perform the extraction step faster with python code, or in a more pythonic way ? If I write such code in cython could it be faster ? On the doc they are saying :
In many use cases, C strings (a.k.a. character pointers) are slow and cumbersome. For one, they usually require manual memory management in one way or another, which makes it more likely to introduce bugs into your code.
Thank you for your help !
In addition to taking advantage of using types as suggested by @DavidW, using the appropriate data structures is important. Python's container objects like list
are slow since they are not contiguous in memory, and they should be avoided when writing performance-minded cython code.
In this case, we can also save some of those for
loops as well. Instead of iterating over theta
twice for both the s1
and the s2
sublists, we can handle both strings at the same time. We can also compare the characters one by one, and then break out of the comparison early as soon as we hit the first character/nucleotide that is not matching.
Below is my cython version of your code, which should runs well over an order of magnitude faster than the question's code (about 1 second for 1 million iterations versus 40 seconds or so). I have added some comments that hopefully will be helpful. As for your concern about C string management, at least for simple one-off functions like this, as long as you call an appropriate free
for each call to malloc
, you should be fine. Since no char*
had to be dynamically allocated directly at the C level via such means, there is no need for worrying about memory management here. Also, indexing into char*
rather than str
, especially in a tight loop like this, can avoid some slight python overhead.
from libc.stdint cimport int8_t
cimport cython
"""
I have included three @cython decorators here to save some checks.
The first one, boundscheck is the only useful one in this scenario, actually.
You can see their effect if you generate cython annotations!
Read more about it here:
http://cython.readthedocs.io/en/latest/src/reference/compilation.html
"""
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.initializedcheck(False)
def fast_count_substr_matches(str s1, str s2, int k, int8_t[:] theta):
cdef int i, j, m#you used k, unfortunately, so stuck with m...
cdef bytes b1 = s1.encode("utf-8")
#alternatively, could just pass in bytes strings instead
#by prefixing your string literals like b'AAATCGGGT'
cdef bytes b2 = s2.encode("utf-8")
cdef char* c1 = b1
cdef char* c2 = b2
#python str objects have minor overhead when accessing them with str[index]
#this is why I bother converting them to char* at the start
cdef int count = 0
cdef bint comp#A C-type int that can be treated as python bool nicely
for i in range(len(s1) - k + 1):
for j in range(len(s2) - k + 1):
comp = True
for m in range(k):
if theta[m] == True and c1[i + m] != c2[j + m]:
comp = False
break
if comp:
count += 1
return count
Please let me know if there is anything in this answer that could be cleared up. Hope this helps!
Collected from the Internet
Please contact [email protected] to delete if infringement.
Comments