Compressing overlapping strings in Python
A friend published a challenge in one of our Discord groups:
The Challenge
Goal:
Given a list of N strings, your task is to create a new string that incorporates all the strings from the list. The resultant string should be the shortest possible by taking advantage of overlapping sections between the strings.
Guidelines:
- The resulting string should contain all the original strings entirely.
- The strings from the list can appear in any order in the resulting string.
- If multiple shortest strings are possible, returning any one of them is acceptable.
- If there's no possible overlap among the strings, the result will simply be a concatenation of all the strings in the list.
- All strings in the input list are unique.
A list of N strings, where each string has a length between 1 and 100 inclusive.
Output:
A single string that represents the most compressed combination of the given strings.
Example:
Given the input list:
strings = ["pleasure", "apple", "rendered", "clap", "suren"]
A possible output is:
"clappleasurendered"
I decided to give it a goal since it’s still almost 100 days until December and Advent of Code.
Let’s tackle it!
To start with, I want to be able to define an overlap between two words. Python’s namedtuples are great for defining data structures that are easier to understand than regular tuples and more lightweight than objects.
Setting things up
from collections import namedtuple
Overlap = namedtuple('Overlap', ['start', 'end', 'length'])
Overlap
is a tuple that has two words
(start and end) and length of the overlap. If you’ve ever followed my
solutions during Advent of Code, you know how I love to use them for puzzles
like these all the time. Tuples are immutable and namedtuples allow us to
document what they are and what each item means so it’s a win-win.
Finding overlap between two words
To find an overlap between two words, I decided to go with a TDD approach and start with two words that don’t overlap:
def find_overlap(first, second):
"""
Finds overlapping substrings between two words
>>> find_overlap('magnificent', 'challenge')
"""
return None
if __name__ == '__main__':
import doctest
doctest.testmod()
All tests green! Our function matches its first spec and returns
None
if there’s no overlap.
Next, I want to add a case that checks if the end of the first word overlaps with the beginning of the second word:
Failed example:
find_overlap('apple', 'pleasure')
Expected:
Overlap(start='apple', end='pleasure', length=3)
Got nothing
def find_overlap(first, second):
"""
Finds an overlapping substrings between two words
>>> find_overlap('magnificent', 'challenge')
>>> find_overlap('apple', 'pleasure')
Overlap(start='apple', end='pleasure', length=3)
"""
max_overlap = min(len(first), len(second))
for i in range(1, max_overlap):
if first[-i:] == second[:i]:
return Overlap(start=first, end=second, length=i)
return None
First I added a new test, got a failure and then implemented it to pass! 2 specs now work.
Next, I’ll add the mirror case of previous: second word’s end overlaps with the first one’s start.
Failed example:
find_overlap('erratic', 'player')
Expected:
Overlap(start='player', end='erratic', length=2)
Got nothing
Let’s write code that makes it pass!
def find_overlap(first, second):
"""
Finds an overlapping substrings between two words
>>> find_overlap('magnificent', 'challenge')
>>> find_overlap('apple', 'pleasure')
Overlap(start='apple', end='pleasure', length=3)
>>> find_overlap('erratic', 'player')
Overlap(start='player', end='erratic', length=2)
"""
max_overlap = min(len(first), len(second))
for i in range(1, max_overlap):
if first[-i:] == second[:i]:
return Overlap(start=first, end=second, length=i)
for i in range(1, max_overlap):
if second[-i:] == first[:i]:
return Overlap(start=second, end=first, length=i)
return None
We loop over again, this time turning them around and the tests pass!
However, we’ve now introduced an extra
for
loop that doubles the time of our
worst case.
Since we have passing tests, we can refactor this to take away the second
for
loop:
def find_overlap(first, second):
"""
Finds an overlapping substrings between two words
>>> find_overlap('magnificent', 'challenge')
>>> find_overlap('apple', 'pleasure')
Overlap(start='apple', end='pleasure', length=3)
>>> find_overlap('erratic', 'player')
Overlap(start='player', end='erratic', length=2)
"""
max_overlap = min(len(first), len(second))
for i in range(1, max_overlap):
if first[-i:] == second[:i]:
return Overlap(start=first, end=second, length=i)
if second[-i:] == first[:i]:
return Overlap(start=second, end=first, length=i)
return None
We now check for both options on each iteration, making the code much cleaner and reducing two for loops into one.
This works for words where the overlap is not symmetric.
However, adding another test case shows our fault:
>>> find_overlap('payee', 'eerily')
Overlap(start='payee', end='eerily', length=2)
Failed example:
find_overlap('payee', 'eerily')
Expected:
Overlap(start='payee', end='eerily', length=2)
Got:
Overlap(start='payee', end='eerily', length=1)
Our matching case was too eager and returned as soon as it found a match.
To fix this, I needed to make it go through all the options and return the end result:
def find_overlap(first, second):
"""
Finds an overlapping substrings between two words
>>> find_overlap('magnificent', 'challenge')
>>> find_overlap('apple', 'pleasure')
Overlap(start='apple', end='pleasure', length=3)
>>> find_overlap('erratic', 'player')
Overlap(start='player', end='erratic', length=2)
>>> find_overlap('payee', 'eerily')
Overlap(start='payee', end='eerily', length=2)
"""
max_overlap = min(len(first), len(second))
overlap = None
for i in range(1, max_overlap):
if first[-i:] == second[:i]:
overlap = Overlap(start=first, end=second, length=i)
if second[-i:] == first[:i]:
overlap = Overlap(start=second, end=first, length=i)
return overlap
Now our tests are green again but we do a lot of extra work as if we have two 10-character words that overlap for 2 characters, we still match it through for all variations between lengths of 1 and 10 – twice.
To tackle this, I added an extra check: if we had an overlap at some point and now we don’t anymore either direction, return the overlap:
def find_overlap(first, second):
"""
Finds an overlapping substrings between two words
>>> find_overlap('magnificent', 'challenge')
>>> find_overlap('apple', 'pleasure')
Overlap(start='apple', end='pleasure', length=3)
>>> find_overlap('erratic', 'player')
Overlap(start='player', end='erratic', length=2)
>>> find_overlap('payee', 'eerily')
Overlap(start='payee', end='eerily', length=2)
"""
max_overlap = min(len(first), len(second))
overlap = None
for i in range(1, max_overlap):
if first[-i:] == second[:i]:
overlap = Overlap(start=first, end=second, length=i)
elif second[-i:] == first[:i]:
overlap = Overlap(start=second, end=first, length=i)
elif overlap:
return overlap
return overlap
All still green and now we stop as soon as we found the maximum overlap.
What happens now if we have a special case where there’s overlap both ways?
Let’s add a test and see!
>>> find_overlap('eeriest', 'estimatee')
Overlap(start='eeriest', end='estimatee', length=3)
>>> find_overlap('estimatee', 'eeriest')
Overlap(start='eeriest', end='estimatee', length=3)
In this test, there’s overlap between the end
est
of
eeriest
and start of
estimatee
, but also another one with the
ending ee
of
estimatee
and start of
eeriest
. And I added a mirror test of
that to make sure our code doesn’t favor first word over the second (which it
does if they are equal length which I’m fine with at this point).
Time to combine the words
Now we have a function that finds overlaps and I’m quite happy with it. Let’s start looking at the next part of the challenge:
create a new string that incorporates all the strings from the list
def combine_words(words):
"""
Compresses words based on overlaps.
>>> combine_words(['magnificent', 'challenge'])
'magnificentchallenge'
"""
return ''.join(words)
The first case of this function is two words that don’t overlap and they just get combined in either direction.
Next case, we want to see if we can combine two words that overlap:
Failed example:
combine_words(['apple', 'pleasure'])
Expected:
'appleasure'
Got:
'applepleasure'
To make this pass, we check if there is an overlap and if is, combine words:
def combine_words(words):
"""
Compresses words based on overlaps.
>>> combine_words(['magnificent', 'challenge'])
'magnificentchallenge'
>>> combine_words(['apple', 'pleasure'])
'appleasure'
"""
if overlap := find_overlap(*words):
return f'{overlap.start}{overlap.end[overlap.length:]}'
return ''.join(words)
And we are green again! But this only works with two words.
A quick Python primer if there’s some unfamiliar parts in the above code:
-
overlap := find_overlap(*words)
uses a walrus operator (:=
) that enables us to make assignments inif
orwhile
clauses and match against that result. -
find_overlap(*words)
uses*words
that is argument unpacking and it expands a list into separate arguments for the function -
return f'{overlap.start}{overlap.end[overlap.length:]}'
uses f-strings where everything inside curly braces ({ .. }
) gets evaluated as Python
Failed example:
combine_words(['apple', 'pleasure', 'surely'])
Exception raised:
Traceback (most recent call last):
File "<doctest __main__.combine_words[2]>", line 1, in <module>
combine_words(['apple', 'pleasure', 'surely'])
File "/string-compression-challenge/tdd.py", line 43, in combine_words
if overlap := find_overlap(*words):
TypeError: find_overlap() takes 2 positional arguments but 3 were given
Now we tried to pass three words into our
find_overlap
function which breaks
because it only accepts two. Time to refactor!
This time, the complexity of the code and challenge increased a lot and I had to remind myself that when I have a failing test, my only goal is to make that test pass. I can write inefficient code, I can use extra variables, I can name them badly. This is not the moment when writing good code is the goal.
To get there, I made a modification to the
find_overlap
function so it does not
return anymore None
but an
Overlap
with
length
of 0:
# replace return None with
return Overlap(start=first, end=second, length=0)
With this change, I was able to change
combine_words
to pass our new test:
def combine_words(words):
"""
Compresses words based on overlaps.
>>> combine_words(['magnificent', 'challenge'])
'magnificentchallenge'
>>> combine_words(['apple', 'pleasure'])
'appleasure'
>>> combine_words(['apple', 'pleasure', 'surely'])
'appleasurely'
"""
overlap = find_overlap(*words[:2])
rest = words[2:]
combined = f'{overlap.start}{overlap.end[overlap.length:]}'
while next_word := rest.pop() if rest else False:
overlap = find_overlap(combined, next_word)
combined = f'{overlap.start}{overlap.end[overlap.length:]}'
return combined
This solution starts by combining the first two words and then tacks the next word in the list either in the beginning or end of that list based on the overlap of these two new words until there are no more words in the list.
Let’s tackle the example case from the challenge:
Failed example:
combine_words(['pleasure', 'apple', 'rendered', 'clap', 'suren'])
Expected:
'clappleasurendered'
Got:
'clappleasuresurendered'
Our current solution relies on the ordering of the words and does not find the most optimal solution. Let’s make the test green!
As I was working on this, I discovered a bug in our
find_overlap
:
Failed example:
find_overlap('sure', 'pleasure')
Expected:
Overlap(start='pleasure', end='sure', length=4)
Got:
Overlap(start='sure', end='pleasure', length=0)
Before I could continue on the
combine_words
, I needed to fix this case
where the entire word could be part of an overlap.
def find_overlap(first, second):
"""
Finds an overlapping substrings between two words
>>> find_overlap('magnificent', 'challenge')
Overlap(start='magnificent', end='challenge', length=0)
>>> find_overlap('apple', 'pleasure')
Overlap(start='apple', end='pleasure', length=3)
>>> find_overlap('erratic', 'player')
Overlap(start='player', end='erratic', length=2)
>>> find_overlap('payee', 'eerily')
Overlap(start='payee', end='eerily', length=2)
>>> find_overlap('eeriest', 'estimatee')
Overlap(start='eeriest', end='estimatee', length=3)
>>> find_overlap('estimatee', 'eeriest')
Overlap(start='eeriest', end='estimatee', length=3)
>>> find_overlap('suren', 'pleasure')
Overlap(start='pleasure', end='suren', length=4)
>>> find_overlap('sure', 'pleasure')
Overlap(start='pleasure', end='sure', length=4)
"""
max_overlap = min(len(first), len(second))
overlap = None
for i in range(0, max_overlap+1):
if first[-i:] == second[:i]:
overlap = Overlap(start=first, end=second, length=i)
elif second[-i:] == first[:i]:
overlap = Overlap(start=second, end=first, length=i)
elif overlap:
return overlap
if overlap:
return overlap
return Overlap(start=first, end=second, length=0)
I fixed two things here: one was the indices (from
1, max_overlap
to
0, max_overlap+1
to make sure we catch
entire words. This one was left because of my original attempt to avoid
matching just one letter eagerly (but since we fixed that earlier, it needed
to be changed here).
Second, if the entire shorter word was looped over and there was an overlap,
the code never ended into the
elif overlap
in the loop and needed
another one.
Now I’m bit more confident that function works.
def combine_overlap(overlap):
"""
>>> combine_overlap(Overlap(start='clap', end='apple', length=2))
'clapple'
"""
return f'{overlap.start}{overlap.end[overlap.length:]}'
def combine_words(words):
"""
Compresses words based on overlaps.
>>> combine_words(['magnificent', 'challenge'])
'challengemagnificent'
>>> combine_words(['apple', 'pleasure'])
'appleasure'
>>> combine_words(['apple', 'pleasure', 'surely'])
'appleasurely'
>>> combine_words(['pleasure', 'apple', 'rendered', 'clap', 'suren'])
'clappleasurendered'
"""
perms = permutations(words)
shortest = None
for permutation in perms:
combined = ''
permutation = list(permutation)
while next_word := permutation.pop() if permutation else False:
overlap = find_overlap(combined, next_word)
combined = combine_overlap(overlap)
if not shortest or len(combined) < len(shortest):
shortest = combined
return shortest
My next solution that passed the test was to create all the permutations possible, combine them in each order and find the shortest solution.
Tests are green but this is also going through a lot of permutations and with large number of input words will become very slow. It will require factorial of input length iterations and factorials grow real fast. With 5 words, it goes through 120 for loop iterations and with 10 words 3,6 million loop iterations. And since our list can be up to 100 words, that’s… too many.
Let’s look at the original challenge to see where we are:
- The resulting string should contain all the original strings entirely. ✅
- The strings from the list can appear in any order in the resulting string. ✅
- If multiple shortest strings are possible, returning any one of them is acceptable. ✅
- If there's no possible overlap among the strings, the result will simply be a concatenation of all the strings in the list. ✅
- All strings in the input list are unique. ✅
We match all the guidelines and I think we do reach the optimal solution, given enough time and processing power.
What’s next?
I was hoping to come up with an optimization but so far I haven’t and I figured this blog post would still be worthy of publishing so I considered a trick and calling it part 1 so it looks like I planned it up front. This blog post was less about solving this particular puzzle and more about the process of arriving at a solution through test-driven development.
I might revisit this a bit later if I find a good evening with inspiration to dive into it.
Comments
Loading comments...
Continue discussion in Mastodon »