A portmanteau is a word that squishes together two overlapping words, like math + athlete = mathlete. You can make up your own, like eskimo + kimono = eskimono. Inspired by Darius Bacon, I covered this as a programming exercise in my 2012 Udacity course. Recently I was re-inspired by Tom Murphy VII, who added a new twist: portmantout words, which are defined as:
Given a set of words, W, a portmantout of W is a string, S, such that every word in W appears as a substring of S, and S is formed by joining a sequence of words from W in which each word overlaps the previous one in at least one letter.
To make sure we understand the definition, I'll define the function S = portman(L), which joins a list of overlapping words, L, into a single string. To find the overlap between two consecutive words, portman considers the suffixes of the previous word, longest first, and takes the first suffix that is also a prefix of the next word; that overlap is then dropped from the next word. (I'll also define functions to list the prefixes and suffixes of a word.)
def portman(words):
    "Join a sequence of words, eliminating from each word the overlap with the previous word."
    prev = words[0]
    result = [prev]
    for word in words[1:]:
        overlap = next(filter(word.startswith, suffixes(prev)))
        result.append(word[len(overlap):])
        prev = word
    return ''.join(result)
def prefixes(word) -> list:
    "All non-empty prefixes of word, shortest first."
    return [word[:i+1] for i in range(len(word))]

def suffixes(word) -> list:
    "All non-empty suffixes of word, longest first."
    return [word[i:] for i in range(len(word))]
# An example:
W = {'on', 'one', 'two', 'won'}
S = 'twone'
L = ['two', 'won', 'on', 'one']
assert portman(L) == S
assert all(w in S for w in W)
assert set(L) == W
assert portman(['eskimo', 'kimono', 'monolith']) == 'eskimonolith'
assert prefixes('word') == ['w', 'wo', 'wor', 'word']
assert suffixes('word') == ['word', 'ord', 'rd', 'd']
Note that portman(['one', 'two']) would raise an error, because there is no overlap from 'one' to 'two'.
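A quick way to see why, using only the suffixes helper from above: no suffix of 'one' is a prefix of 'two', so the next(filter(...)) call inside portman finds the filter empty and raises StopIteration.

```python
def suffixes(word):
    "All non-empty suffixes of word, longest first."
    return [word[i:] for i in range(len(word))]

# None of 'one', 'ne', 'e' is a prefix of 'two', so portman's
# next(filter(...)) has nothing to yield and raises StopIteration.
matches = [s for s in suffixes('one') if 'two'.startswith(s)]
assert matches == []
```

Going the other direction works: the suffix 'o' of 'two' is a prefix of 'one', which is why portman(['two', 'one']) succeeds.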
I watched part of Tom Murphy's entertaining video, but then I stopped because I wanted to solve the problem at least partially on my own. Here's my strategy: gather the words into a Dictionary data structure, have a function choose a sequence of overlapping words from the Dictionary, and join them with portman. My Dictionary data structure contains three things:
words: the set of all the words.
uniq: the subset of words that are not subwords of some other word (a subword comes along for free when we use the word that contains it).
pre: a multimap of prefixes, giving for each prefix the list of words that start with it, e.g. {'o': ['on', 'one', ...], ...}
from collections import defaultdict, Counter
from functools import lru_cache

class Dictionary:
    "A collection of words, with prefixes and unique words precomputed."
    def __init__(self, stream):
        self.words = {w.strip().lower() for w in stream}
        self.uniq  = self.words - subwords(self.words)
        self.pre   = multimap((p, w) for w in self.words for p in prefixes(w))

def subwords(words) -> set:
    "All words that are hiding inside any of these words."
    return {sub for word in words
                for i in range(len(word))
                for sub in prefixes(word[i:])
                if sub != word and sub in words}

def multimap(pairs) -> dict:
    "Given (key, val) pairs, make a dict of {key: [val, ...]}."
    result = defaultdict(list)
    for key, val in pairs:
        result[key].append(val)
    return result
# A small example:
W = {'on', 'one', 'two', 'won'}
d = Dictionary(W)
d.words, d.uniq, dict(d.pre)
({'on', 'one', 'two', 'won'}, {'one', 'two', 'won'}, {'w': ['won'], 'wo': ['won'], 'won': ['won'], 'o': ['one', 'on'], 'on': ['one', 'on'], 'one': ['one'], 't': ['two'], 'tw': ['two'], 'two': ['two']})
That looks right. Now I can load the big dictionary, call it D, and explore it a bit:
D = Dictionary(open('wordlist.asc'))
len(D.words), len(D.uniq), len(D.pre), D.pre['somew']
(108709, 64389, 249623, ['somewhats', 'somewhen', 'somewise', 'someway', 'somewhere', 'someways', 'somewhat'])
The function natalie will take a Dictionary as input and produce a list of overlapping words, e.g.
d = Dictionary({'on', 'one', 'two', 'won'})
natalie(d) ⇒ ['two', 'won', 'on', 'one']
and we will solve the whole problem as follows:
portman(natalie(d)) ⇒ 'twone'
Within natalie, we repeatedly add words to a result list until we have used up all the unique words from the dictionary. On each iteration we either add a new_word (thus decreasing the size of the remaining words), or we re-use a repeated_word, choosing one that will bridge to a word that we have not used yet.
def natalie(d: Dictionary) -> list:
    "Return a list of words that cover the dictionary."
    words = set(d.uniq)     # all the words we need to cover
    result = [words.pop()]  # a list of overlapping words
    firsts = Counter(word[0] for word in words)  # count of first letters of words
    while words:
        prev = result[-1]
        word = new_word(d, prev, words) or repeated_word(d, prev, firsts)
        result.append(word)
        if word in words:
            words.remove(word)
            B = word[0]
            firsts[B] -= 1
            if not firsts[B]: del firsts[B]
    return result
Selecting a new_word is easy: consider the suffixes of the previous word, longest suffix first, and if that suffix is a prefix of an unused word, then take that word.
def new_word(d: Dictionary, prev: str, words: set) -> str or None:
    "Find an overlapping word to follow the previous word (or None)."
    for suf in suffixes(prev):
        if suf in d.pre:
            for word in d.pre[suf]:
                if word in words:
                    return word
Now suppose we reach a situation where the previous word was 'one', and the only remaining unused word is 'two'. Since there is no overlap, new_word will fail, but we can find the shortest previously-used word that will bridge from the 'e' at the end of 'one' to the 't' at the start of 'two':
@lru_cache(1000)
def bridge(d, A, B) -> str:
    "Find a word that bridges A to B: it starts with A and ends with B."
    return shortest(word for word in d.pre[A] if word.endswith(B))

def shortest(items): return min(items, key=len, default=None)
bridge(D, 'e', 't')
'eat'
But in general, there will be several unused words that we might bridge to; we will greedily choose the shortest bridge to any of the unused words. That sounds expensive when there are thousands of unused words, so we will summarize them with a counter, firsts, that gives for each letter the number of unused words that start with that letter (so, if the unused words are {'two', 'three', 'four'}, then firsts will be {'t': 2, 'f': 1}).
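The firsts summary is a one-liner with collections.Counter; here is the worked example from the paragraph above:

```python
from collections import Counter

unused = {'two', 'three', 'four'}             # the unused words in the example
firsts = Counter(word[0] for word in unused)  # count unused words by first letter
assert firsts == {'t': 2, 'f': 1}             # 'two' and 'three' both start with 't'
```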
Furthermore, it might be that no word can bridge directly to an unused word. In that case we can take two steps, first bridging from A to an intermediate letter L, and then bridging from L to a letter B that starts some unused word. The function repeated_word handles all these cases:
def repeated_word(d: Dictionary, prev: str, firsts: Counter) -> str:
    "Find a previously-used word that will bridge to one of the letters in firsts."
    A = prev[-1]
    word = shortest(bridge(d, A, B) for B in firsts if bridge(d, A, B))
    if word:
        return word
    else:
        candidates = [[bridge(d, A, L), bridge(d, L, B)]
                      for L in alphabet for B in firsts
                      if A != L != B and bridge(d, A, L) and bridge(d, L, B)]
        return min(candidates, key=lambda seq: sum(map(len, seq)))[0]

alphabet = 'abcdefghijklmnopqrstuvwxyz'
For example, suppose the previous word is 'one', and the only remaining unused word is 'quit'. There is no word that bridges from 'e' to 'q', so we will have to get there in two steps. Here's the first step:
repeated_word(D, 'one', {'q': 1})
'elhi'
And here's the second:
repeated_word(D, 'elhi', {'q': 1})
'iraq'
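As a sanity check, the whole chain does join up: portman can glue 'one' to 'quit' through the two bridge words. Here is a self-contained sketch (repeating the earlier definitions so it runs on its own):

```python
def suffixes(word):
    "All non-empty suffixes of word, longest first."
    return [word[i:] for i in range(len(word))]

def portman(words):
    "Join words, eliminating from each word the overlap with the previous word."
    prev = words[0]
    result = [prev]
    for word in words[1:]:
        overlap = next(filter(word.startswith, suffixes(prev)))
        result.append(word[len(overlap):])
        prev = word
    return ''.join(result)

# 'one' -e-> 'elhi' -i-> 'iraq' -q-> 'quit'
print(portman(['one', 'elhi', 'iraq', 'quit']))  # prints onelhiraquit
```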
We're ready to solve the problem!
%%time
L = natalie(D)
S = portman(L)
CPU times: user 23.2 s, sys: 50.7 ms, total: 23.2 s Wall time: 23.3 s
len(L), len(D.uniq), len(S)
(101796, 64389, 575150)
Here we see the start of the list:
L[:20]
['hyperactivity', 'typewriting', 'tingling', 'lingeries', 'escarping', 'carpings', 'stropped', 'peddles', 'lesbians', 'answers', 'ersatzes', 'zestiest', 'establisher', 'sherlocks', 'locksmiths', 'swankily', 'lyricizing', 'zingers', 'erstwhile', 'whiled']
And the start of the string:
S[:2000]
'hyperactivitypewritinglingeriescarpingstroppeddlesbianswersatzestiestablisherlocksmithswankilyricizingerstwhileditorializingedifiestashedableachesapeakednessayistsunamicabilitieschewalsocializedshopliftsarsaparillasersheiksignalmanacsimonizedelweissestinescapablymphomassacrestingstowagesturalliescudosserscoundrellyinglycosideswipingrassessorshipmateshipwaysidestepsonshipshapelessnessentiallylstreptococcuspidalliersnorterservilitieschewinglessonedictallyhoedownstairsicknessesamesospheredityranniseismologyratessellatediouslyesteryearshotspursierrantrystsarinasmuchnessayeditorialistlesslyerbassoonistsarismsectarianismelteriescargotsardomsilhouettinglesseesaweddingsulfurizedematadorsallyingrowthsplurgesturingtailskidskinsmendeleviumbrageoushereditarilynxesculentsunamissilerysipelasticizeducationshorelinesmandragoraphobiaxialitympanumskullsleighingelessonslaughtsktskingedgewaysinewspeakshepherdingingivaerologistsaritzassagaislesionstagehandsomesthesisesquicentenniallymphocytestifyingestantalizationiumsketchingsubfamiliescapistsetsessilencesariannulightingstungstensilesiamesestinassimilableakishkestrelsewhereforestedhorseshoestringsideswipestilencesspitsawtoothpickstretchablenchedarwinistsimmeshinglersophsuperlativelysianticipatoryxescarpmentsktskedgesturescuestashingleditorializestsarevnasturtiumsavagedlyceesquiringboltsaristshamrockshenaniganswerabilitypographerselfedscoopsfullerythematologicalumnymphsilkweedinessentialleliciteddyingscreechymicsouvenirstargazersalutessellationspeedwaysquintingeingeniousnessayersequesterselysiumteenthronementsalivateddiesesquicentennialslumberserksadistsubclassessablearingdovestingsaguarosebaywoodsiestashesitatediousnessayingsluicewaywardlyristsumpsychrotherapiestradiologistshowingspansyllabicspikeletspoonerismswazilandownershipsidetrackingshipshottingliestablishmentsolarismsturdynamistickinessayscarrierspearmenianschlussreineducabilitypifyinguinalterablyricizeducativehiclessoningrainingloriouslynessesquipedalianestheticsextsoddynamisms
e'
If you want to see the whole string, I'll write it to the file natalie.txt.
open('natalie.txt', 'w').write(S)
575150
Each time I run this, I get a slightly different result (there are no random calls, but each Python re-start gets a random hash seed, which results in a different iteration order over dicts and sets). I had originally planned to add some randomness and take the best of k trials (like I did in my TSP notebook), but all the trials I have seen so far fall within a narrow range of 575,000 to 581,000 letters, so I think it is not worth the effort for what would probably be a 1% improvement at best. My string is 6% shorter than Murphy's solution with 611,820 letters.
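As an aside (not something this notebook does): if you ever want two runs to iterate sets in the same order, you can pin Python's hash seed with the PYTHONHASHSEED environment variable.

```shell
# With a fixed PYTHONHASHSEED, two fresh interpreters iterate a set in the
# same order; with randomized seeds, the order usually differs between runs.
PYTHONHASHSEED=0 python3 -c "print(list({'on', 'one', 'two', 'won'}))"
PYTHONHASHSEED=0 python3 -c "print(list({'on', 'one', 'two', 'won'}))"
```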
I'll stop here (but you should feel free to do some experimentation of your own). Some ideas:
With W = {'one', 'two'}, if my algorithm happens to choose 'one' as the starting word, it will fail, because no word can follow 'one'. You could fix that by allowing new words to be added to the start of the list (hint: use a collections.deque) as well as the end.

The test suite below gives some examples of how the functions are used, and some assurance that they are doing the right thing.
def test(D):
    W = {'on', 'one', 'two', 'won'}
    S = 'twone'
    L = ['two', 'won', 'on', 'one']
    assert portman(L) == S
    assert all(w in S for w in W)
    assert set(L) == W
    assert prefixes('word') == ['w', 'wo', 'wor', 'word']
    assert suffixes('word') == ['word', 'ord', 'rd', 'd']
    assert subwords({'hello', 'world', 'he', 'hell', 'el', 'or'}) == (
        {'el', 'he', 'hell', 'or'})
    assert multimap([(1, 10), (2, 20), (3, 30), (2, 22), (3, 33)]) == (
        {1: [10], 2: [20, 22], 3: [30, 33]})
    assert 'portmanteau' in D.words
    assert 'port' in D.words
    assert 'port' not in D.uniq
    assert set(D.pre['hello']) == {'hello', 'helloed', 'helloes', 'helloing', 'hellos'}
    assert bridge(D, 'a', 'z') == 'abuzz'
    assert bridge(D, 'a', 't') == 'at'
    assert bridge(D, 'e', 't') == 'eat'
    assert bridge(D, 'f', 'd') == 'fad'
    assert portman(['two', 'won', 'on', 'one']) == 'twone'
    assert portman(['eskimo', 'kimono', 'monolith']) == 'eskimonolith'
    return 'pass'
test(D)
'pass'