The most common data type we use (or will be using) on our Swift journey used to be the worst headache. Join me to see what’s behind this collection.
Introduction
All of us have been using Strings during our careers; they are even one of the first types we learn in almost every programming language.
So you might think “if it is the first, it is the easiest”, but that is only partially correct. They might look like just a collection of characters surrounded by double quotes (other languages can even use single quotes, but that’s not a big deal).
What is a String
As I mentioned before, a String is a collection of characters, but let’s see what Apple says in its official documentation.
A Unicode string value that is a collection of characters.
🤔 Hmm… OK, at least we agree on something: it is a collection of characters (we’ll come back to the rest of that sentence later).
Let’s see what Apple says in the Overview section of the same documentation.
A string is a series of characters, such as “Swift”, that forms a collection
Well, that was easier to understand. If we follow what Apple says, the most basic way to create a string looks like this.
let myCoolString = "Hey I'm a cool collection of characters"
There you go 👆, that is a String.
Demystifying Characters
Now that we know that Strings are just collections of characters, it follows that each letter/value/item in a String is a Character.
Because it is a collection, we can loop through it just like any other collection.
for char: Character in myCoolString {
print(char)
}
Another feature that collections have is subscripts: we can access an element by passing its position in square brackets. So if we try to get one of the characters of our previous String, let’s say the 3rd character (position 2, because indexes start from zero), we would do it like this…
let yCharacter = myCoolString[2]
Did you try it? ~ No? Don’t worry, that’s fine, because it produces a compiler error with the following message: “‘subscript(_:)’ is unavailable: cannot subscript String with an Int, use a String.Index instead.”
So, can’t we use subscripts, and why should we use String.Index? ~ To answer these questions, let’s first look at what a character is.
What is a Character
The simplest way to explain a character might be to say that “a character refers to a single character in a collection of characters” (I know, that sounds redundant), but that is just half of the story.
Just like we did with Strings, let’s ask the Apple documentation…
A single extended grapheme cluster that approximates a user-perceived character.
Oh-oh! 😧 That sounds quite different, so you may be asking yourself “what the heck is a grapheme cluster?” Let’s keep digging deeper.
What is a grapheme cluster
Let’s read first what Unicode.org says about them.
It is important to recognize that what the user thinks of as a “character”—a basic unit of a writing system for a language—may not be just a single Unicode code point. Instead, that basic unit may be made up of multiple Unicode code points. To avoid ambiguity with the computer use of the term character, this is called a user-perceived character. For example, “G” + grave-accent is a user-perceived character: users think of it as a single character, yet is represented by two Unicode code points. These user-perceived characters are approximated by what is called a grapheme cluster, which can be determined programmatically.
Oh crap! You might be hating me right now because I’m making you read a lot (I hope you love reading, so you are not hating me, not yet at least).
Too much text, right? I will try to make it simple: a grapheme cluster is the combination of 1 or more Unicode scalars (🤣 you’re probably going to hate me now if you don’t already, please don’t 🙏). Let’s go one step deeper (I promise it will be the last).
What is Unicode Scalar
We are going to split this term to understand what it means. First things first, Unicode…
Unicode is the standard for digital representation of the characters used in writing all of the world’s languages.
I think that’s easy to understand: it is a digital representation of characters (that’s the important part). So what does Unicode scalar mean?
Any Unicode code point except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆ inclusive.
I know it is a lot of info, so let me put it in other words. According to the quote above, a Unicode scalar is any code point inside those two ranges, that is, any code point except the surrogates. Each one is a 21-bit number that identifies a specific character in the Unicode standard, and the characters we see on screen are built from these numbers.
And maybe the simplest way to say it is “characters are linked to codes (Unicode scalars) that represent them”.
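If you want to poke at a scalar directly, Swift’s standard library has a Unicode.Scalar type. The following is just a small sketch (the constant names are mine, picked for illustration) showing the numeric value behind a letter and how values outside the two ranges from the quote are rejected.

```swift
// Unicode.Scalar is the standard library type for one Unicode scalar value.
let letterH: Unicode.Scalar = "H"
print(letterH.value)                     // 72, which is 48 in hexadecimal
print(String(letterH.value, radix: 16))  // 48

// The failable initializer returns nil for values that are not scalars.
let highSurrogate = Unicode.Scalar(UInt32(0xD800))   // nil: surrogate code point
let beyondUnicode = Unicode.Scalar(UInt32(0x110000)) // nil: above 10FFFF
let snowmanScalar = Unicode.Scalar(UInt32(0x2603))   // valid scalar
```

The explicit UInt32 conversions are only there to make it obvious which initializer is being called.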
Using Unicode Scalars
The way to write a Unicode scalar is \u{0048}, where 0048 is a hexadecimal number; if we use that value in a String we get the letter H.
let letterHUsingUnicodeScalar = "\u{0048}" // H
Have you ever seen the little snowman ☃? There is a Unicode scalar for that one too.
let snowman = "\u{2603}"
Now that you know what Unicode scalars are, we can return to grapheme clusters.
Using grapheme clusters
As we mentioned earlier, a grapheme cluster is the combination of 1 or more Unicode scalars, so let’s build a grapheme cluster for a-acute (á).
let aAcute = "\u{0061}\u{0301}"
Can you see that we are now using 2 scalars to get the a-acute? But there is more: although we can combine these 2 scalars, we can also use the precomposed character
let aAcutePrecomposed = "\u{00E1}"
and we will get the same character.
Comparing grapheme clusters
One of the interesting things is that we got the a-acute in 2 different ways, combining 2 scalars and using the precomposed scalar. Although they were created differently, they compare as the same, and you can try it yourself.
let aAcute = "\u{0061}\u{0301}"
let aAcutePrecomposed = "\u{00E1}"
aAcute == aAcutePrecomposed // The result will be true!
😯 How is this possible? ~ The answer is canonical equivalence. It means that it doesn’t matter whether we combine scalars or use the precomposed scalar, because the resulting character is the same. This is the short answer from unicode.org:
Two character sequences are said to be canonical equivalents if their full canonical decompositions are identical.
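To see the difference between “same character” and “same scalars” in Swift, here is a minimal sketch. The names decomposed and precomposed are mine; the calls are plain standard library ones.

```swift
let decomposed = "\u{0061}\u{0301}"  // a followed by a combining acute accent
let precomposed = "\u{00E1}"         // á as a single scalar

// String equality takes canonical equivalence into account...
print(decomposed == precomposed)  // true

// ...while comparing the underlying scalars one by one does not.
print(decomposed.unicodeScalars.elementsEqual(precomposed.unicodeScalars))  // false

// Equal strings also hash equally, so a Set keeps only one of them.
print(Set([decomposed, precomposed]).count)  // 1
```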
And just a reminder 💡 ~ Literals built from scalars, like the ones above, are still Strings, so all our previous example variables are of type String. So if we read the count property on both variables, what do you think we will get? If you thought of 1, you’re correct. Does canonical equivalence make sense now? ~ I hope so 😅.
aAcute.count // 1
aAcutePrecomposed.count // 1
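The count above is a count of characters. A String also exposes other views of the same content, and there the two versions of á do differ. A quick sketch (combined and single are illustrative names of mine):

```swift
let combined = "\u{0061}\u{0301}"  // two scalars, one character
let single = "\u{00E1}"            // one scalar, one character

// Characters, Unicode scalars, UTF-16 code units, UTF-8 code units
print(combined.count, combined.unicodeScalars.count, combined.utf16.count, combined.utf8.count)  // 1 2 2 3
print(single.count, single.unicodeScalars.count, single.utf16.count, single.utf8.count)          // 1 1 1 2
```

So “how long is this string?” depends on which unit you are counting.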
Iterating through String Collection
Now that we have demystified characters, we can come back to what we were trying to do at the beginning, so let’s see the code again.
let yCharacter = myCoolString[2]
Using that we get the following error: “‘subscript(_:)’ is unavailable: cannot subscript String with an Int, use a String.Index instead.” ~ The first part should be easier to understand now that we know how a character is built.
We cannot use integers because each character may be made of a different number of scalars. Remember our a-acute example: \u{0061} and \u{0301} are two scalars, but together they are the single character á. An integer position does not tell the collection where a character starts in its storage, so it would have to walk the string from the beginning, grouping scalars into characters, to find it.
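A rough way to see why an integer position is not enough is to measure how many UTF-8 code units each character takes. This is only a sketch with a made-up string (word and bytesPerCharacter are my names), but it shows that characters do not have a fixed size.

```swift
let word = "cafe\u{0301} \u{2603}"  // "café ☃", with é written as e + combining accent

// Number of UTF-8 code units used by each character.
let bytesPerCharacter = word.map { String($0).utf8.count }

print(word.count)         // 6
print(bytesPerCharacter)  // [1, 1, 1, 3, 1, 3]
```

Since the sizes vary, the position of the nth character can only be found by walking through the ones before it.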
And… the solution is in the next part of our error message: “…use a String.Index instead.”
The String.Index type is nested inside String and holds what is needed to locate a character. If we ask Apple what this type is for, we find the following
A position of a character or code unit in a string.
So, that’s the solution: we delegate the work of walking through code units to String.Index. But now… how do we use it?
Using String.Index to get a character
Index is a struct nested inside String, and the cool part is that String already offers some straightforward methods and properties to work with it.
We are not going through each one, but we will see some of the most common, using the same example we tried at the beginning of this article.
We will try to get the 3rd character (the letter y) of the following string
let myCoolString = "Hey I'm a cool collection of characters"
Easiest things first: we can just look for the String.Index of y, and that is as straightforward as calling a simple method.
let firstIndexOfY = myCoolString.firstIndex(of: "y")
That returns an optional Index. It is optional because the character we are looking for might not be there, in which case it just returns nil. So we need either to force unwrap the value or, better, to unwrap it safely before using it in the subscript (please avoid force unwrapping; it is a code smell even when you are completely sure there will be a value).
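// Option 1: force unwrapping (crashes if "y" is not found)
let firstIndexOfY = myCoolString.firstIndex(of: "y")!
print(myCoolString[firstIndexOfY])
// Option 2: optional binding (preferred)
if let firstIndexOfY = myCoolString.firstIndex(of: "y") {
print(myCoolString[firstIndexOfY])
}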
It may seem odd to search for a character we already know, but there are situations where you need exactly this.
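One such situation could be cutting a string at a known separator. Here is a small sketch under that assumption (sentence, firstSpace, firstWord and offset are names I made up for the example):

```swift
let sentence = "Hey I'm a cool collection of characters"

if let firstSpace = sentence.firstIndex(of: " ") {
    // A range of String.Index values slices the string; the slice is a Substring.
    let firstWord = String(sentence[..<firstSpace])
    print(firstWord)  // Hey

    // distance(from:to:) turns an index back into a character offset.
    let offset = sentence.distance(from: sentence.startIndex, to: firstSpace)
    print(offset)     // 3
}
```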
Now, let’s assume we don’t know the characters in our String, and we want to get the 3rd one.
First, just to be safe, we can check the length of our String so we don’t look for a position that doesn’t exist.
if myCoolString.count >= 3 {
// we now know our String has at least 3 characters,
// so position 2 exists (remember positions start at zero)
}
Now that we are safe, we need the first index of our String. We know it is position zero, but we need a String.Index value. String already covers that: it is as easy as reading the startIndex property.
if myCoolString.count >= 3 {
let start = myCoolString.startIndex // this returns a String.Index type
}
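startIndex has a sibling, endIndex, that is worth knowing before we move around. The sketch below uses a short made-up string (greeting and the other constant names are mine) together with the index(after:) and index(before:) methods of String.

```swift
let greeting = "Hey"

let first = greeting[greeting.startIndex]                          // H
let second = greeting[greeting.index(after: greeting.startIndex)]  // e

// endIndex is the position just past the last character, so it cannot be
// subscripted; the last character sits one step before it.
let last = greeting[greeting.index(before: greeting.endIndex)]     // y

// In an empty string both properties are the same position.
let empty = ""
print(empty.startIndex == empty.endIndex)  // true
```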
We have our first index, but we want the third one. Because we know we are at the beginning of our String, the easiest way is to jump 2 more positions to reach the index we are looking for. And guess what, String has a method to do that: index(_:offsetBy:). Let’s first see what Apple says about this method.
Returns an index that is the specified distance from the given index.
Ok, pretty easy: we just need to give it an index, and we have one, our start constant. Now we just need a distance for offsetBy:, and if we look at the type of this parameter, 😯 oh dang! it is String.IndexDistance 😔 now let’s see how we are going to get that…
🤪 Just kidding… String.IndexDistance is just a typealias for Int (newer versions of the documentation simply show Int), so we only need to pass an integer with how many positions we want to move. If we are already at the first position and we want the third one, we just need to jump twice, so
if myCoolString.count >= 3 {
let start = myCoolString.startIndex
let end = myCoolString.index(start, offsetBy: 2) // This returns a String.Index type
}
Now we have the expected position, and just as we did in the first example of this section, we can use the subscript to get the character.
if myCoolString.count >= 3 {
let start = myCoolString.startIndex
let end = myCoolString.index(start, offsetBy: 2)
print(myCoolString[end]) // This prints 'y'
}
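If you find yourself repeating the count check, one possible alternative is index(_:offsetBy:limitedBy:), which returns nil instead of crashing when the jump would go past a limit. The helper below is hypothetical (character(atOffset:in:) is a name I invented, not something String provides), just a sketch of how the pieces could fit together.

```swift
// Hypothetical helper: returns the character at a given offset, or nil if it doesn't exist.
func character(atOffset offset: Int, in text: String) -> Character? {
    guard offset >= 0,  // a negative offset would ignore the endIndex limit
          let index = text.index(text.startIndex, offsetBy: offset, limitedBy: text.endIndex),
          index < text.endIndex  // landing exactly on endIndex is not a valid character
    else { return nil }
    return text[index]
}

let coolString = "Hey I'm a cool collection of characters"
print(character(atOffset: 2, in: coolString) as Any)    // Optional("y")
print(character(atOffset: 500, in: coolString) as Any)  // nil
```

Compared with checking count first, this walks the string only once, at the cost of having to deal with an optional result.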
Easy, right? ~ Well, kind of 😅.
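And if all this index juggling feels like too much for a case where you need many lookups by position, one workaround you could consider is copying the characters into an Array, which does accept Int subscripts. This is a sketch, not a recommendation for every case: the copy costs time and memory proportional to the length of the string. The names text and characters are mine.

```swift
let text = "Hey I'm a cool collection of characters"

// Array(text) walks the string once and stores each Character.
let characters = Array(text)

if characters.count > 2 {
    print(characters[2])  // y
}
```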
Conclusion
Although String might be the most basic/common type we learn in any programming language, it has its complex topics. Now you have learned how this collection is built, how its characters are built, and how to navigate it to look for something inside a String.
So there you go, you now have more demystified knowledge about Strings to avoid a couple of the headaches this type can cause so easily, just because we don’t pay attention to what is behind a “simple” String.
I hope this article helps you. I’d appreciate it if you shared it, and #HappyCoding 👨‍💻
References
- String - Apple Documentation
- Character - Apple Documentation
- Characters and Grapheme Clusters - Documentation Archive
- Grapheme Cluster Boundaries - Unicode.org
- Glossary - Unicode
- String.Index - Apple Documentation
- index(_:offsetBy:) - Apple Documentation
- Swift Programming: The Big Nerd Ranch Guide (3rd Edition)