How to (fastly) get a substring according to a specific pattern in python?

debugcn Published at Dev

Tib

I'm trying to code the mismatch kernel in python. I would like to extract all substrings of a string given a numpy boolean array mask, where the extraction pattern is not necessarily continuous (e.g. mask = [False, True, False, True], such that from 'ABCD' I extract 'BD'). After extracting substrings according to this pattern I can then count all the common substrings between my two sequences. Concerning the extraction step string[theta] doesn't work to extract such substring. I now have the following chunk of code which works:

def function(s1, s2, k, theta):
 l1 = []
 l2 = []

 # substrings of s1
 substrk_itr1 = (s1[i:i+k] for i in range(len(s1) - k + 1))
 l1 = [''.join(substr[i] for i, b in enumerate(theta) if b)
       for substr in substrk_itr1]

 # substrings of s2
 substrk_itr2 = (s2[i:i+k] for i in range(len(s2) - k + 1))
 l2 = [''.join(substr[i] for i, b in enumerate(theta) if b)
       for substr in substrk_itr2]

 L = l1 + l2
 C = Counter(L)
 c1 = Counter(l1)
 c2 = Counter(l2)
 x = sum([c1[w] * c2[w] for w in C if w])
 return x

where (s1,s2) are strings I want to extract all substrings from by first considering all substrings of length k, and then reextract a substring according to the boolean pattern theta. You can make tests with the following values, and you should theoretically get 2.

k = 5
theta = np.array([False,True, True, True, False])
X = 'AAATCGGGT'
Y = 'AAATTGGGT'

The issue is that this code is too slow, (I use it to compute a kernel, so I run it thousands of times). I profiled the code and the bottleneck is due to the join function mostly.

Is there a way to perform the extraction step faster with python code, or in a more pythonic way ? If I write such code in cython could it be faster ? On the doc they are saying :

In many use cases, C strings (a.k.a. character pointers) are slow and cumbersome. For one, they usually require manual memory management in one way or another, which makes it more likely to introduce bugs into your code.

Thank you for your help !

CodeSurgeon

In addition to taking advantage of using types as suggested by @DavidW, using the appropriate data structures is important. Python's container objects like list are slow since they are not contiguous in memory, and they should be avoided when writing performance-minded cython code.

In this case, we can also save some of those for loops as well. Instead of iterating over theta twice for both the s1 and the s2 sublists, we can handle both strings at the same time. We can also compare the characters one by one, and then break out of the comparison early as soon as we hit the first character/nucleotide that is not matching.

Below is my cython version of your code, which should runs well over an order of magnitude faster than the question's code (about 1 second for 1 million iterations versus 40 seconds or so). I have added some comments that hopefully will be helpful. As for your concern about C string management, at least for simple one-off functions like this, as long as you call an appropriate free for each call to malloc, you should be fine. Since no char* had to be dynamically allocated directly at the C level via such means, there is no need for worrying about memory management here. Also, indexing into char* rather than str, especially in a tight loop like this, can avoid some slight python overhead.

from libc.stdint cimport int8_t
cimport cython

"""
I have included three @cython decorators here to save some checks.
The first one, boundscheck is the only useful one in this scenario, actually.
You can see their effect if you generate cython annotations!
Read more about it here: 
http://cython.readthedocs.io/en/latest/src/reference/compilation.html
"""
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.initializedcheck(False)
def fast_count_substr_matches(str s1, str s2, int k, int8_t[:] theta):
    cdef int i, j, m#you used k, unfortunately, so stuck with m...
    cdef bytes b1 = s1.encode("utf-8")
    #alternatively, could just pass in bytes strings instead
    #by prefixing your string literals like b'AAATCGGGT'
    cdef bytes b2 = s2.encode("utf-8")
    cdef char* c1 = b1
    cdef char* c2 = b2
    #python str objects have minor overhead when accessing them with str[index]
    #this is why I bother converting them to char* at the start
    cdef int count = 0
    cdef bint comp#A C-type int that can be treated as python bool nicely

    for i in range(len(s1) - k + 1):
        for j in range(len(s2) - k + 1):
            comp = True
            for m in range(k):
                if theta[m] == True and c1[i + m] != c2[j + m]:
                    comp = False
                    break         
            if comp:
                count += 1
    return count

Please let me know if there is anything in this answer that could be cleared up. Hope this helps!

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at2021-08-9

Comments

0 comments

From Dev

Related Related

Article

How to (fastly) get a substring according to a specific pattern in python?

How to (fastly) get a substring according to a specific pattern in python?

How to get a substring value from a varchar in oracle based on specific pattern

Extract a substring according to a pattern

How to edit a specific pattern according to an array of patterns?

How to get substring after specific substring in javascript

Redis Python - how to delete all keys according to a specific pattern In python, without python iterating

How to get service according value with pattern?

How to get substring with pattern using Java

How to get a string after a specific substring?

How to match a substring following after a string satisfying the specific pattern

In ruby, how can I get the substring from a pattern to a pattern? Also, how do I get the last such pattern?

Extract substring with a specific pattern in MySql

How to write regex pattern to get a substring of a string in c#

How to get a specific line based on a pattern if found

How to get the pattern of Number Format of a specific locale?

Get substring from an optional pattern

Get substring according to occurrence, reg exp java

pandas python sorting according to a pattern

How to get node bounds according to its specific ancestor in JavaFX 8?

how to get the specific result according to a certain condition in MongoDb through mongoose

Move files according to a specific pattern in the file name

Permute rows in data frame according to a specific pattern

How do I get a substring of a string in Python?

How to get longest alphabetically ordered substring in python

How to get substring after last specific character in JavaScript?

How to get a substring from a specific character to the end of the string in swift 4?

How to substring a string Python from a specific index in Python?

How to extract a substring pattern in Postgresql

How to get substring(s) of a string if said substring(s) are between two specific characters

How to replace an unknown-substring in a string with specific pattern using Java code?