Dissertation

zur Erlangung des akademischen Grades

Doktor der Naturwissenschaften

(Dr. rer. nat.)

der Technischen Fakultät

der Christian-Albrechts-Universität zu Kiel

eingereicht im Jahr 2014

Kiel Computer Science Series (KCSS) 2014/8 v1.0 dated 2014-11-26

ISSN 2193-6781 (print version)

ISSN 2194-6639 (electronic version)

Electronic version, updates, errata available via https://www.informatik.uni-kiel.de/kcss

The author can be contacted via mail@mikemueller.name

Published by the Department of Computer Science, Kiel University

Dependable Systems Group

Please cite as:

@book{MMueller14,

author = {Mike M\"uller},
title = {Avoiding and Enforcing Repetitive Structures in Words},
publisher = {Department of Computer Science, CAU Kiel},
year = {2014},
number = {2014/8},
isbn = {978-3-73474-197-5},
series = {Kiel Computer Science Series},
note = {Dissertation, Faculty of Engineering, Kiel University}
}

© 2014 by Mike Müller

Herstellung und Verlag: BoD — Books on Demand GmbH, Norderstedt

About this Series

The Kiel Computer Science Series (KCSS) covers dissertations, habilitation theses, lecture notes, textbooks, surveys, collections, handbooks, etc. written at the Department of Computer Science at Kiel University. It was initiated in 2011 to support authors in the dissemination of their work in electronic and printed form, without restricting their rights to their work. The series provides a unified appearance and aims at high-quality typography. The KCSS is an open access series; all series titles are electronically available free of charge at the department’s website. In addition, authors are encouraged to make printed copies available at a reasonable price, typically with a print-on-demand service.

Please visit http://www.informatik.uni-kiel.de/kcss for more information, for instructions how to publish in the KCSS, and for access to all existing publications.

1. Gutachter:

Prof. Dr. Dirk Nowotka

Christian-Albrechts-Universität zu Kiel

2. Gutachter:

Prof. Dr. Narad Rampersad

University of Winnipeg

Datum der mündlichen Prüfung: 17. November 2014

Zusammenfassung

Der Fokus dieser Arbeit liegt auf der Untersuchung sich wiederholender Strukturen in Wörtern, einem zentralen Thema auf dem Gebiet der Wortkombinatorik. Die in der vorliegenden Arbeit präsentierten Ergebnisse zielen darauf ab, die bestehende Theorie bezüglich des Vorkommens beziehungsweise Fehlens solcher Strukturen zu erweitern.

Im ersten Teil untersuchen wir, ob derartige Strukturen zwangsweise in unendlichen Wörtern über endlichen Alphabeten auftauchen. Die Form von sich wiederholenden Strukturen, die wir untersuchen, beinhaltet funktionale Abhängigkeiten zwischen den wiederholten Teilen. Insbesondere gehen wir dabei Vermeidbarkeitsfragestellungen von Mustern nach, deren sich wiederholende Struktur durch die Anwendung einer Permutation verschleiert wurde. In diesem neuartigen Szenario tritt der überraschende Effekt auf, dass bestimmte vermeidbare Muster in einem größeren Alphabet nicht mehr vermieden werden können. Der zweite und umfangreichste Teil dieser Arbeit befasst sich mit Wortgleichungen, die eine bestimmte, auf Involutionen basierende, sich wiederholende Struktur in ihren Lösungsmengen erzwingen. Czeizler et al. (2009) führten eine verallgemeinerte Version der von Lyndon und Schützenberger untersuchten klassischen Wortgleichungen ul = vmwn ein. Wir lösen die beiden letzten und anspruchsvollsten Fälle und vervollständigen damit die Klassifikation dieser Gleichungen bezüglich der sich wiederholenden Strukturen, die in den Lösungen auftreten.

Im letzten Teil untersuchen wir die Auswirkungen des Mischens auf wiederholungsfreie Wörter. Wir konstruieren sowohl endliche als auch unendliche quadratfreie Wörter, die mit sich selbst so gemischt werden können, dass ein quadratfreies Wort entsteht. Zusätzlich zeigen wir, dass die sich wiederholende Struktur, die beim Mischen eines Wortes mit sich selbst entsteht, in unendlichen Wörtern vermieden werden kann.

Abstract

The focus of this thesis is on the study of repetitive structures in words, a central topic in the area of combinatorics on words. The results presented in the thesis at hand are meant to extend and enrich the existing theory concerning the appearance and absence of such structures.

In the first part we examine whether these structures necessarily appear in infinite words over a finite alphabet. The repetitive structures we are concerned with involve functional dependencies between the parts that are repeated. In particular, we study avoidability questions of patterns whose repetitive structure is disguised by the application of a permutation. This novel setting exhibits the surprising behaviour that avoidable patterns may become unavoidable in larger alphabets.

The second and major part of this thesis deals with equations on words that enforce a certain repetitive structure involving involutions in their solution set. Czeizler et al. (2009) introduced a generalised version of the classical equations ul = vmwn that were studied by Lyndon and Schützenberger. We solve the last two remaining and most challenging cases and thereby complete the classification of these equations in terms of the repetitive structures appearing in the admitted solutions.

In the final part we investigate the influence of the shuffle operation on words avoiding ordinary repetitions. We construct finite and infinite square-free words that can be shuffled with themselves in a way that preserves square-freeness. We also show that the repetitive structure obtained by shuffling a word with itself is avoidable in infinite words.

Acknowledgements

“Knowledge is in the end based on acknowledgement.”

 

LUDWIG WITTGENSTEIN

First and foremost I want to thank my supervisor Dirk Nowotka for introducing me to this interesting field called “combinatorics on words”, that I had never heard of before, and for providing me with both the freedom and the necessary funding to spend three and a half years pursuing my research interests in this area.

I would also like to thank Narad Rampersad for honouring me with his service as external reviewer of my thesis. Furthermore, I wish to express my gratitude towards him for being part of my examining committee, together with Steffen Börm, Otmar Spinas, and Thomas Wilke, whom I sincerely wish to thank as well.

I am delighted to thank all people that were and are still engaged in the Dependable Systems group at the Kiel University, that is Florin Manea, Robert Merca¸s, Tim Grebien, Philipp Sieweck, Thorsten Ehlers and Gesa Walsdorf, for being wonderful colleagues, providing a great environment to work in, and also sharing plenty of after-work hours with me.

Thanks are also due to the staff of the Department of Mathematics and Statistics at the University of Turku, where I had the pleasure to work one autumn and winter long. The people there made this Finnish winter so much nicer than it may sound. In particular, I would like to thank Juhani Karhumäki, Luca Zamboni, Svetlana Puzynina, and Yury Nikulin for hosting me, all the fruitful discussions we had, walls that we climbed, and in general the time that we shared. My deepest gratitude goes to Tero Harju with whom I had the unique pleasure of staring at a blackboard for countless hours, during which I learnt priceless lessons from him, not only about research, but life in general. It is certainly not exaggerated to say that this thesis owes its existence in large part also to his guidance and encouragement. I would like to deeply thank Dirk Nowotka again for making this collaboration possible.

Thanks also go to all my former colleagues at the University of Stuttgart, namely Volker Diekert, Ulrich Hertrampf, Manfred Kufleitner, Jürn Laun, Steffen Kopecki, Alexander Lauser, Heike Photien, and Horst Prote.

I would like to thank all coauthors that worked with me on the topics presented in this thesis; apart from those already mentioned these are Michaël Rao and Shinnosuke Seki. The latter also voluntarily proofread parts of this thesis, for which special thanks are in order. Any errors that might have remained are of course my sole responsibility.

Besides all the people that directly influenced my work, I would also like to thank everyone that made my private life enjoyable while working on this thesis. Therefore thanks to all friends and family members, flatmates and companions in Kiel, Stuttgart, Plüderhausen and surroundings for reminding me that there is more to life than work. Thank you, Tom and Joanna, Peter and Svenja, Tobi and Anne, Hanno, Malte, Max, Sven and Birte, Neel, Maika, Sonja, Axel and Monika, Jürgen and Anja, Leander and Alissa, Christoph, Marcel, Marco, Henning, Günther, and Thorben.

I owe special thanks to my parents, Roland and Elfriede, who were always there for me and always supported the choices I made, despite the fact that they couldn’t explain to others what exactly their son is working on.

Last but not least, I want to thank Izabela for her endless support, for believing in me even when I didn’t, and for patiently listening to all the problems I have been working on and my ideas how to solve them - both the good ones and the bad ones.

Contents

List of Figures

List of Tables

CHAPTER 1

Introduction

“Words are, of course, the most powerful drug used by mankind.”

 

RUDYARD KIPLING

The history of studying repetitive structures in sequences of symbols, called words here, can be traced back to the beginning of the last century, when the Norwegian mathematician Thue published two seminal papers [86, 87] concerning words that do not exhibit some particular repetitive structures.

More precisely, he showed how to construct an infinite sequence using two different symbols that does not contain three consecutive identical blocks, and how to derive a sequence on three symbols from it where every two consecutive blocks of the same length are different. Unfortunately, the venues he chose to publish his groundbreaking work (written in German) were largely unknown to the researchers interested in similar questions, and many of his results were independently rediscovered afterwards.

Various other aspects of words have been studied before, and the list of authors includes such illustrious names as Bernoulli [5] and Gauß [38]. The excellent survey by Berstel and Perrin [9] provides a comprehensive account of the history of word-related research. Nowadays however, Thue’s work is unanimously regarded as the first systematic investigation of combinatorial properties of words and, in particular, the first work in general on repetitive structures therein. His results are well-known and a summary as well as an annotated full translation of his papers were written by Berstel [6, 7].

Half a century after Thue’s initial work, the study of repetitions and other properties of words emerged into a research area of its own, which is nowadays known as combinatorics on words. Three monographs of Lothaire [63–65] have been exclusively devoted to topics this area is concerned with, and the connections to other fields of mathematics and computer science such as number theory or pattern matching algorithms are highlighted, for instance, in the textbooks of Allouche and Shallit [1] or Crochemore, Hancart, and Lecroq [23].

One topic of interest concerns relations between different words, commonly expressed as word equations. Makanin [67] showed that it is algorithmically decidable whether a given word equation is satisfiable, while other research focused on the form of solutions of some specific classes of equations. Lyndon and Schützenberger [66] and Lentin [61] for instance studied equations, whose only solutions consist of words featuring the kind of repetitive structure that Thue tried to avoid.

In some sense, Thue’s goal was to avoid repetitions, whereas Lyndon and Schützenberger as well as Lentin tried to enforce them. Both these approaches will be undertaken in this thesis, albeit using a different notion of repetitiveness.

The first kind of alternative repetitive structure, that is not so blatant and hence can be deemed as some sort of hidden repetition, was introduced by Erdoős [34], who asked whether there is an infinite word with the property that no block is a rearrangement of the symbols of the following block of the same length. This kind of repetitive structure is nowadays known as an abelian repetition, and Erdoős’ question has been positively answered by Evdokimov [35], who showed that such a word using 25 symbols exists. Evdokimov’s result was later improved on by Pleasants [74] and Keränen [58], who lowered the required number of different symbols to five and four, respectively.

In the last couple of years, various other notions of repetitiveness were introduced and studied from a combinatorial point of view. Most of these are based on an equivalence relation on words and consecutive blocks that are equivalent with respect to this equivalence relation are considered repetitive. The k-abelian [52], k-binomial [83] or the sum-of-digits equivalence [17] serve as examples of relations that were used to define generalised repetitions.

The notion we shall be concerned with in this thesis is inspired by a phenomenon arising in DNA-strands: such a strand, composed of the bases Adenine (A), Cytosine (C), Guanine (G), and Thymine (T) bonds with its Watson-Crick complement to form a double stranded DNA helix. Here A is complementary to T and C is the complement of G. Furthermore, the two joint strands are oriented in the opposite way after the bonding, where the orientation is determined using the so-called 3’- and 5’-end, see Figure 1.1 for a graphical representation.

Figure 1.1. The bonding of two Watson-Crick complementary DNA-strands.

A suitable abstraction of this natural principle is obtained by modelling the DNA-strand as a word over a finite alphabet (for instance {A, C, G, T}) and the Watson-Crick complement as a function θ on these words. In order to reflect the main properties of the Watson-Crick complement, the function θ is chosen to be an involution, meaning that θ(θ(w)) = w for all words w, that also acts as an antimorphism, which means that θ(uv) = θ(v)θ(u) for all words u and v. In this setting, two words are considered to be equivalent, if one is the image of the other under such a function θ, and a repetitive structure in a word is a concatenation of equivalent blocks. This notion of equivalence has been introduced recently and a series of papers has already been devoted to the study of its properties [19, 30, 32, 54–56, 89].

Outline

This thesis contributes to the theory of repetitive structures in words and provides solutions to some problems that were left open in the existing literature.

In Chapter 2 we introduce the terminology and notation from combinatorics on words that is relevant to this thesis, and recall some of the most fundamental theorems on repetitive structures on words, such as Thue’s and Lyndon and Schützenberger’s, as well as some recent results that will be used in the following chapters.

In Chapter 3 we will follow Thue’s approach and study avoidability questions involving functional dependencies between the blocks. We generalise the approach previously taken by Bischoff, Currie, and Nowotka [10], who allowed involutions to be applied to the blocks, by allowing permutations of higher order as well. This enables us to define patterns that are avoidable when using some finite set of symbols but can not be avoided in words using a larger set of symbols. Such behaviour with respect to avoidability is not known to exist using other notions of repetitivity. We provide a full classification of all such repetitive structures involving three blocks and one permutation in terms of symbol sets admitting infinite words avoiding such structures. The results of this chapter are based on joint work with Florin Manea and Dirk Nowotka and have been presented at the 16th International Conference on Developments in Language Theory 2012 [70].

In Chapter 4 we focus on equations enforcing repetitive structure, pursuing Lyndon and Schützenberger’s avenue. We close an open problem by Czeizler et al. [30] regarding the generalised repetitivity of solutions of equations of the form u1u2 · · · ul = v1v2 · · · vmw1w2 · · · wn, where ui ∈ {u, θ(u)} for all 1 ≤ i, ≤ l, vj ∈ {v,θ(v)} for all 1 ≤ jm, wk ∈ {w,θ(w)} for all 1 ≤ kn, and θ is an antimorphic involution. The results of this chapter comprise the main part of this thesis and are based on two joint papers, the first of which was written together with Florin Manea and Dirk Nowotka and presented at the IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science 2013 [69], and the second one was the result of a joint effort with Florin Manea, Dirk Nowotka and Shinnosuke Seki and was presented at the 39th International Symposium on Mathematical Foundations of Computer Science 2014 [71].

In Chapter 5 we study how words without repetitive structure behave when they are shuffled. We show how to construct square-free words that can be shuffled with themselves in a square-freeness preserving manner, and prove the existence of an infinite square-free word that can be shuffled with itself to reproduce itself, answering two open questions of Harju [45]. Furthermore, we improve upon a result by Currie [26] regarding positions of palindromes in square-free words. Finally, we give a simple proof showing that a certain repetitive structure called a shuffle-square is avoidable. The results contained in this chapter are based on common work with Tero Harju [47, 48], Michaël Rao and Svetlana Puzynina [73], and the theorem concerning shuffle-square avoidability was announced in [22].

CHAPTER 2

Preliminaries

“The secret to getting ahead is getting started.”

 

MARK TWAIN

2.1 Words

Let Σ be a non-empty, finite set, called alphabet. We call the elements of Σ letters. A word over Σ is a (finite or infinite) sequence of letters from Σ. The set of non-empty finite words over Σ, also known as the free semigroup generated by Σ, is denoted as Σ+. Endowing Σ+ with a unique neutral element, which is called empty word and denoted as ε, we obtain the free monoid generated by Σ, which is denoted as Σ* (hence, Σ* = Σ+ {ε}). The set of infinite words over Σ will be denoted as Σω. If S ⊆ Σ* is a set of words, then S+ and S* denote the free semigroup and free monoid generated by the words in S.

We make use of the notation Σm to describe an alphabet of m letters, which consists of the digits from 0 to m − 1, so Σm = {0, 1, · · · , m − 1}. Words over Σ2 are called binary words, and words over Σ3 will be referred to as ternary words.

For a finite word w, we denote its length, that is the number of letters it consists of, by |w|. So, if w = w1 w2 · · · wn, where wi ∈ Σ for all 1 ≤ in, then |w| = n. The empty word ε is the unique word of length 0.