Find common sub string from the list of strings

thiruvenkadam Published at Dev

thiruvenkadam

How can I take out only a prefix of a string from the list of strings? The caveat is that I do not know the prefix before hand. Only through this function, I will know the prefix.

(eg):
string_list = ["test11", "test12", "test13"]
# Prefix is test1
string_list = ["test-a", "test-b", "test-c"]
# Prefix is test-
string_list = ["test1", "test1a", "test12"]
# Prefix is test1
string_list = ["testa-1", "testb-1", "testc-1"]
# Prefix is test

In case there is nothing common in all the strings of the list, it should be an empty string.

Mike Müller

Solution

This function works:

def find_prefix(string_list):
    prefix = []
    for chars in zip(*string_list):
        if len(set(chars)) == 1:
            prefix.append(chars[0])
        else:
            break
    return ''.join(prefix)

Tests

string_lists = [["test11", "test12", "test13"],
                ["test-a", "test-b", "test-c"],
                ["test1", "test1a", "test12"],
                ["testa-1", "testb-1", "testc-1"]]


for string_list in string_lists:
    print(string_list)
    print(find_prefix(string_list))

Output:

['test11', 'test12', 'test13']
test1
['test-a', 'test-b', 'test-c']
test-
['test1', 'test1a', 'test12']
test1
['testa-1', 'testb-1', 'testc-1']
test

Speed

It is always fun to time things:

string_list = ["test11", "test12", "test13"]

%timeit get_large_subset(string_list)
100000 loops, best of 3: 14.3 µs per loop

%timeit find_prefix(string_list)
100000 loops, best of 3: 6.19 µs per loop

long_string_list = ['test{}'.format(x) for x in range(int(1e4))]

%timeit get_large_subset(long_string_list)
100 loops, best of 3: 7.44 ms per loop

%timeit find_prefix(long_string_list)
100 loops, best of 3: 2.38 ms per loop

very_long_string_list = ['test{}'.format(x) for x in range(int(1e6))]

%timeit get_large_subset(very_long_string_list)
1 loops, best of 3: 761 ms per loop

%timeit find_prefix(very_long_string_list)
1 loops, best of 3: 354 ms per loop

Conclusion: Using sets in this way is fast.

Collected from the Internet

Please contact [email protected] to delete if infringement.