Compressing overlapping strings in Python

Aug 27th, 2023

by Juha-Matti Santala

This was published in categories: python blaugust blaugust-2023

A friend published a challenge in one of our Discord groups:

The Challenge

Goal:

Given a list of N strings, your task is to create a new string that incorporates all the strings from the list. The resultant string should be the shortest possible by taking advantage of overlapping sections between the strings.

Guidelines:

The resulting string should contain all the original strings entirely.
The strings from the list can appear in any order in the resulting string.
If multiple shortest strings are possible, returning any one of them is acceptable.
If there's no possible overlap among the strings, the result will simply be a concatenation of all the strings in the list.
All strings in the input list are unique.

A list of N strings, where each string has a length between 1 and 100 inclusive.

Output:

A single string that represents the most compressed combination of the given strings.

Example:

Given the input list:

strings = ["pleasure", "apple", "rendered", "clap", "suren"]

A possible output is:

"clappleasurendered"

I decided to give it a goal since it’s still almost 100 days until December and Advent of Code.

Let’s tackle it!

To start with, I want to be able to define an overlap between two words. Python’s namedtuples are great for defining data structures that are easier to understand than regular tuples and more lightweight than objects.

Setting things up

from collections import namedtuple

Overlap = namedtuple('Overlap', ['start', 'end', 'length'])

Overlap is a tuple that has two words (start and end) and length of the overlap. If you’ve ever followed my solutions during Advent of Code, you know how I love to use them for puzzles like these all the time. Tuples are immutable and namedtuples allow us to document what they are and what each item means so it’s a win-win.

Finding overlap between two words

To find an overlap between two words, I decided to go with a TDD approach and start with two words that don’t overlap:

def find_overlap(first, second):
  """
  Finds overlapping substrings between two words
  >>> find_overlap('magnificent', 'challenge')
  """
  return None

if __name__ == '__main__':
    import doctest
    doctest.testmod()

All tests green! Our function matches its first spec and returns None if there’s no overlap.

Next, I want to add a case that checks if the end of the first word overlaps with the beginning of the second word:

Failed example:
    find_overlap('apple', 'pleasure')
Expected:
    Overlap(start='apple', end='pleasure', length=3)
Got nothing

def find_overlap(first, second):
    """
    Finds an overlapping substrings between two words
    >>> find_overlap('magnificent', 'challenge')
    >>> find_overlap('apple', 'pleasure')
    Overlap(start='apple', end='pleasure', length=3)
    """
    max_overlap = min(len(first), len(second))
    for i in range(1, max_overlap):
        if first[-i:] == second[:i]:
            return Overlap(start=first, end=second, length=i)
    
    return None

First I added a new test, got a failure and then implemented it to pass! 2 specs now work.

Next, I’ll add the mirror case of previous: second word’s end overlaps with the first one’s start.

Failed example:
    find_overlap('erratic', 'player')
Expected:
    Overlap(start='player', end='erratic', length=2)
Got nothing

Let’s write code that makes it pass!

def find_overlap(first, second):
    """
    Finds an overlapping substrings between two words
    >>> find_overlap('magnificent', 'challenge')
    >>> find_overlap('apple', 'pleasure')
    Overlap(start='apple', end='pleasure', length=3)
    >>> find_overlap('erratic', 'player')
    Overlap(start='player', end='erratic', length=2)
    """
    max_overlap = min(len(first), len(second))
    for i in range(1, max_overlap):
        if first[-i:] == second[:i]:
            return Overlap(start=first, end=second, length=i)
    for i in range(1, max_overlap):
        if second[-i:] == first[:i]:
            return Overlap(start=second, end=first, length=i)
    
    return None

We loop over again, this time turning them around and the tests pass!

However, we’ve now introduced an extra for loop that doubles the time of our worst case.

Since we have passing tests, we can refactor this to take away the second for loop:

def find_overlap(first, second):
    """
    Finds an overlapping substrings between two words
    >>> find_overlap('magnificent', 'challenge')
    >>> find_overlap('apple', 'pleasure')
    Overlap(start='apple', end='pleasure', length=3)
    >>> find_overlap('erratic', 'player')
    Overlap(start='player', end='erratic', length=2)
    """
    max_overlap = min(len(first), len(second))
    for i in range(1, max_overlap):
        if first[-i:] == second[:i]:
            return Overlap(start=first, end=second, length=i)
        if second[-i:] == first[:i]:
            return Overlap(start=second, end=first, length=i)
    
    return None

We now check for both options on each iteration, making the code much cleaner and reducing two for loops into one.

This works for words where the overlap is not symmetric.

However, adding another test case shows our fault:

>>> find_overlap('payee', 'eerily')
    Overlap(start='payee', end='eerily', length=2)

Failed example:
    find_overlap('payee', 'eerily')
Expected:
    Overlap(start='payee', end='eerily', length=2)
Got:
    Overlap(start='payee', end='eerily', length=1)

Our matching case was too eager and returned as soon as it found a match.

To fix this, I needed to make it go through all the options and return the end result:

def find_overlap(first, second):
    """
    Finds an overlapping substrings between two words
    >>> find_overlap('magnificent', 'challenge')
    >>> find_overlap('apple', 'pleasure')
    Overlap(start='apple', end='pleasure', length=3)
    >>> find_overlap('erratic', 'player')
    Overlap(start='player', end='erratic', length=2)
    >>> find_overlap('payee', 'eerily')
    Overlap(start='payee', end='eerily', length=2)
    """
    max_overlap = min(len(first), len(second))
    overlap = None
    for i in range(1, max_overlap):
        if first[-i:] == second[:i]:
            overlap = Overlap(start=first, end=second, length=i)
        if second[-i:] == first[:i]:
            overlap = Overlap(start=second, end=first, length=i)
    
    return overlap

Now our tests are green again but we do a lot of extra work as if we have two 10-character words that overlap for 2 characters, we still match it through for all variations between lengths of 1 and 10 – twice.

To tackle this, I added an extra check: if we had an overlap at some point and now we don’t anymore either direction, return the overlap:

def find_overlap(first, second):
    """
    Finds an overlapping substrings between two words
    >>> find_overlap('magnificent', 'challenge')
    >>> find_overlap('apple', 'pleasure')
    Overlap(start='apple', end='pleasure', length=3)
    >>> find_overlap('erratic', 'player')
    Overlap(start='player', end='erratic', length=2)
    >>> find_overlap('payee', 'eerily')
    Overlap(start='payee', end='eerily', length=2)
    """
    max_overlap = min(len(first), len(second))
    overlap = None
    for i in range(1, max_overlap):
        if first[-i:] == second[:i]:
            overlap = Overlap(start=first, end=second, length=i)
        elif second[-i:] == first[:i]:
            overlap = Overlap(start=second, end=first, length=i)
        elif overlap:
            return overlap
    
    return overlap

All still green and now we stop as soon as we found the maximum overlap.

What happens now if we have a special case where there’s overlap both ways?

Let’s add a test and see!

>>> find_overlap('eeriest', 'estimatee')
    Overlap(start='eeriest', end='estimatee', length=3)
>>> find_overlap('estimatee', 'eeriest')
    Overlap(start='eeriest', end='estimatee', length=3)

In this test, there’s overlap between the end est of eeriest and start of estimatee, but also another one with the ending ee of estimatee and start of eeriest. And I added a mirror test of that to make sure our code doesn’t favor first word over the second (which it does if they are equal length which I’m fine with at this point).

Time to combine the words

Now we have a function that finds overlaps and I’m quite happy with it. Let’s start looking at the next part of the challenge:

create a new string that incorporates all the strings from the list

def combine_words(words):
    """
    Compresses words based on overlaps.
    >>> combine_words(['magnificent', 'challenge'])
    'magnificentchallenge'
    """
    return ''.join(words)

The first case of this function is two words that don’t overlap and they just get combined in either direction.

Next case, we want to see if we can combine two words that overlap:

Failed example:
    combine_words(['apple', 'pleasure'])
Expected:
    'appleasure'
Got:
    'applepleasure'

To make this pass, we check if there is an overlap and if is, combine words:

def combine_words(words):
    """
    Compresses words based on overlaps.
    >>> combine_words(['magnificent', 'challenge'])
    'magnificentchallenge'
    >>> combine_words(['apple', 'pleasure'])
    'appleasure'
    """
    if overlap := find_overlap(*words):
        return f'{overlap.start}{overlap.end[overlap.length:]}'
    return ''.join(words)

And we are green again! But this only works with two words.

A quick Python primer if there’s some unfamiliar parts in the above code:

overlap := find_overlap(*words) uses a walrus operator (:=) that enables us to make assignments in if or while clauses and match against that result.
find_overlap(*words) uses *words that is argument unpacking and it expands a list into separate arguments for the function
return f'{overlap.start}{overlap.end[overlap.length:]}' uses f-strings where everything inside curly braces ({ .. }) gets evaluated as Python

Failed example:
    combine_words(['apple', 'pleasure', 'surely'])
Exception raised:
    Traceback (most recent call last):
      File "<doctest __main__.combine_words[2]>", line 1, in <module>
        combine_words(['apple', 'pleasure', 'surely'])
      File "/string-compression-challenge/tdd.py", line 43, in combine_words
        if overlap := find_overlap(*words):
    TypeError: find_overlap() takes 2 positional arguments but 3 were given

Now we tried to pass three words into our find_overlap function which breaks because it only accepts two. Time to refactor!

This time, the complexity of the code and challenge increased a lot and I had to remind myself that when I have a failing test, my only goal is to make that test pass. I can write inefficient code, I can use extra variables, I can name them badly. This is not the moment when writing good code is the goal.

To get there, I made a modification to the find_overlap function so it does not return anymore None but an Overlap with length of 0:

# replace return None with
return Overlap(start=first, end=second, length=0)

With this change, I was able to change combine_words to pass our new test:

def combine_words(words):
    """
    Compresses words based on overlaps.
    >>> combine_words(['magnificent', 'challenge'])
    'magnificentchallenge'
    >>> combine_words(['apple', 'pleasure'])
    'appleasure'
    >>> combine_words(['apple', 'pleasure', 'surely'])
    'appleasurely'
    """
    overlap = find_overlap(*words[:2])
    rest = words[2:]
    combined = f'{overlap.start}{overlap.end[overlap.length:]}'
    while next_word := rest.pop() if rest else False:
        overlap = find_overlap(combined, next_word)
        combined = f'{overlap.start}{overlap.end[overlap.length:]}'
    return combined

This solution starts by combining the first two words and then tacks the next word in the list either in the beginning or end of that list based on the overlap of these two new words until there are no more words in the list.

Let’s tackle the example case from the challenge:

Failed example:
    combine_words(['pleasure', 'apple', 'rendered', 'clap', 'suren'])
Expected:
    'clappleasurendered'
Got:
    'clappleasuresurendered'

Our current solution relies on the ordering of the words and does not find the most optimal solution. Let’s make the test green!

As I was working on this, I discovered a bug in our find_overlap:

Failed example:
    find_overlap('sure', 'pleasure')
Expected:
    Overlap(start='pleasure', end='sure', length=4)
Got:
    Overlap(start='sure', end='pleasure', length=0)

Before I could continue on the combine_words, I needed to fix this case where the entire word could be part of an overlap.

def find_overlap(first, second):
    """
    Finds an overlapping substrings between two words
    >>> find_overlap('magnificent', 'challenge')
    Overlap(start='magnificent', end='challenge', length=0)
    >>> find_overlap('apple', 'pleasure')
    Overlap(start='apple', end='pleasure', length=3)
    >>> find_overlap('erratic', 'player')
    Overlap(start='player', end='erratic', length=2)
    >>> find_overlap('payee', 'eerily')
    Overlap(start='payee', end='eerily', length=2)
    >>> find_overlap('eeriest', 'estimatee')
    Overlap(start='eeriest', end='estimatee', length=3)
    >>> find_overlap('estimatee', 'eeriest')
    Overlap(start='eeriest', end='estimatee', length=3)
    >>> find_overlap('suren', 'pleasure')
    Overlap(start='pleasure', end='suren', length=4)
    >>> find_overlap('sure', 'pleasure')
    Overlap(start='pleasure', end='sure', length=4)
    """
    max_overlap = min(len(first), len(second))
    overlap = None
    for i in range(0, max_overlap+1):
        if first[-i:] == second[:i]:
            overlap = Overlap(start=first, end=second, length=i)
        elif second[-i:] == first[:i]:
            overlap = Overlap(start=second, end=first, length=i)
        elif overlap:
            return overlap
    if overlap:
        return overlap
    
    return Overlap(start=first, end=second, length=0)

I fixed two things here: one was the indices (from 1, max_overlap to 0, max_overlap+1to make sure we catch entire words. This one was left because of my original attempt to avoid matching just one letter eagerly (but since we fixed that earlier, it needed to be changed here).

Second, if the entire shorter word was looped over and there was an overlap, the code never ended into the elif overlap in the loop and needed another one.

Now I’m bit more confident that function works.

def combine_overlap(overlap):
    """
    >>> combine_overlap(Overlap(start='clap', end='apple', length=2))
    'clapple'
    """
    return f'{overlap.start}{overlap.end[overlap.length:]}'

def combine_words(words):
    """
    Compresses words based on overlaps.
    >>> combine_words(['magnificent', 'challenge'])
    'challengemagnificent'
    >>> combine_words(['apple', 'pleasure'])
    'appleasure'
    >>> combine_words(['apple', 'pleasure', 'surely'])
    'appleasurely'
    >>> combine_words(['pleasure', 'apple', 'rendered', 'clap', 'suren'])
    'clappleasurendered'
    """
    perms = permutations(words)
    shortest = None
    for permutation in perms:
        combined = ''
        permutation = list(permutation)
        while next_word := permutation.pop() if permutation else False:
            overlap = find_overlap(combined, next_word)
            combined = combine_overlap(overlap)
        if not shortest or len(combined) < len(shortest):
            shortest = combined
    return shortest

My next solution that passed the test was to create all the permutations possible, combine them in each order and find the shortest solution.

Tests are green but this is also going through a lot of permutations and with large number of input words will become very slow. It will require factorial of input length iterations and factorials grow real fast. With 5 words, it goes through 120 for loop iterations and with 10 words 3,6 million loop iterations. And since our list can be up to 100 words, that’s… too many.

Let’s look at the original challenge to see where we are:

The resulting string should contain all the original strings entirely. ✅

The strings from the list can appear in any order in the resulting string. ✅

If multiple shortest strings are possible, returning any one of them is acceptable. ✅

If there's no possible overlap among the strings, the result will simply be a concatenation of all the strings in the list. ✅

All strings in the input list are unique. ✅

We match all the guidelines and I think we do reach the optimal solution, given enough time and processing power.

What’s next?

I was hoping to come up with an optimization but so far I haven’t and I figured this blog post would still be worthy of publishing so I considered a trick and calling it part 1 so it looks like I planned it up front. This blog post was less about solving this particular puzzle and more about the process of arriving at a solution through test-driven development.

I might revisit this a bit later if I find a good evening with inspiration to dive into it.

Comments

Comment by replying to this post in Mastodon.

Loading comments...

Continue discussion in Mastodon »