A URL in Any Language
The Internet may touch all corners of the globe, but for millions of people it still remains frustratingly out of touch. The URL—that web address that connects users to content—has long remained limited to a handful of Latin-based characters.
This year, we are witnessing the emergence of full-length URLs in non-Latin scripts, known as internationalized domain names (IDNs). Here are three working examples:
- правительство.рф (http://правительство.рф)
- παράδειγμα.δοκιμή (http://παράδειγμα.δοκιμή)
- 실례.테스트 (http://실례.테스트)
To give you an idea of just how many different scripts can be supported, the following exhibit includes the top-level IDNs that have, as of May 2011, received formal approval by ICANN. The IDNs are displayed next to their country code equivalents.
The regions represented here are home to more than 2.5 billion people, most of whom do not speak English as a native language. These regions are also where most of the growth in Internet usage will occur over the next decade.
IDNs promise to improve the Internet for millions of people, but they also face obstacles to widespread adoption. This article provides an overview of IDNs as well as these obstacles, and the challenges that web developers will face as IDNs become more common.
The Year of the IDN
An IDN is a domain name that supports one or more non-ASCII characters. This means that any accented character (e.g., ä or é) would turn a conventional domain into an IDN. For example, the German liqueur Jägermeister uses the IDN jägermeister.com. But IDNs are getting the most attention these days for their support for non-Latin scripts such as Arabic, Hebrew, Greek, Cyrillic, and Chinese.
IDNs have been around for a number of years, but mostly as partial IDNs: names that are only partly composed of non-Latin characters. For example, президент.ru (the home page of the Russian President) is a partial IDN. While the second-level domain name (президент) is in Cyrillic, the top-level domain name (.ru) is still in ASCII. Protocol identifiers such as "http" and "ftp" also remain in ASCII.
Although partial IDNs function perfectly well, they’re not very practical because they require the user to switch keyboard inputs midstream. That's why the recent introduction of support for non-ASCII top-level domain names is significant.
What About IDNs and .com?
The .com domain name is the world’s most popular top-level domain (TLD) name. In June of this year, ICANN gave the green light to the expansion of generic TLDs (such as .com and .net) in different scripts. VeriSign, the owner of .com, has made it clear that it intends to offer language-specific versions of .com in the years ahead. Given VeriSign's market power, it's safe to say that many people will become aware of IDNs once .com goes multilingual.
How IDNs Work
The Internet was not designed to be multilingual. The domain name system (DNS) was intended only to support the characters a through z, A through Z, 0 through 9, and the hyphen. Upgrading the DNS to support upwards of 100,000 characters is no trivial feat.
In fact, the DNS has not been upgraded. The name servers scattered around the world—the servers that translate domain names into their numeric IP addresses—still only support ASCII characters. But what has changed is the addition of a sort of multilingual translation service at the front end (the user’s browser) to supplement the DNS.
When an IDN is input, the web browser algorithmically converts it into an odd-looking combination of ASCII characters (known as punycode).
For example, the IDN for McDonald’s Russia (макдональдс.рф) is converted to the punycode equivalent http://xn--80aalb1aicli8a5i.xn--p1ai/. It is this punycode string that is sent to the name server. When an IDN is initially registered, it is this punycode address that is stored within the name server.
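The conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular browser implements it: it relies on Python's built-in "idna" codec, which follows the older IDNA 2003 rules, and the helper names to_ascii and to_unicode are made up for this example.

```python
def to_ascii(domain: str) -> str:
    """Convert a Unicode domain name to its ASCII (punycode) form."""
    # The codec normalizes each label, then encodes any label that
    # contains non-ASCII characters and prefixes it with "xn--".
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Convert an ASCII (punycode) domain name back to Unicode."""
    return domain.encode("ascii").decode("idna")


if __name__ == "__main__":
    print(to_ascii("рф"))              # xn--p1ai
    print(to_unicode("xn--p1ai"))      # рф
    print(to_ascii("макдональдс.рф"))  # the string a name server would receive
```

Only labels that contain non-ASCII characters are rewritten; a plain ASCII name such as example.com passes through unchanged.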
Why Is There Punycode in My Browser?
Users of IDNs may notice punycode showing up in their web browsers, such as this example from Chrome:
This is usually intentional. With Internet Explorer, Chrome, and Firefox, the IDN is generally displayed in punycode format unless the language preference of the browser matches the language of the IDN.
There are real and perceived security fears that IDNs may be used to trick users into visiting phony websites. For example, рayрal.com and paypal.com are actually two different domains (the first uses the Cyrillic character "р" in place of the Latin "p"). To minimize this risk, some IDNs, such as the Russian IDN, do not permit mixed-script domain names. Such restrictions only help if registries enforce them, and whether they will remains to be seen.
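To make the mixed-script problem concrete, here is a rough sketch of how such a name could be flagged. It is a heuristic written for illustration only, not the rule any registry or browser actually applies: it guesses each letter's script from the first word of its Unicode character name, and the function names scripts_in_label and is_mixed_script are hypothetical.

```python
import unicodedata


def scripts_in_label(label: str) -> set[str]:
    """Return the apparent scripts of the letters in a domain label."""
    scripts = set()
    for ch in label:
        if not ch.isalpha():
            continue  # digits and hyphens belong to no particular script
        # Character names start with the script, for example
        # "LATIN SMALL LETTER A" or "CYRILLIC SMALL LETTER ER".
        name = unicodedata.name(ch, "UNKNOWN")
        scripts.add(name.split(" ")[0])
    return scripts


def is_mixed_script(label: str) -> bool:
    """True if the label's letters come from more than one script."""
    return len(scripts_in_label(label)) > 1


if __name__ == "__main__":
    print(is_mixed_script("paypal"))   # all Latin letters
    print(is_mixed_script("рayрal"))   # Cyrillic "р" mixed with Latin letters
```

A production check would rely on the Unicode script property rather than character names, but the idea is the same: a label that draws letters from more than one script deserves a closer look.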
For the time being, expect inconsistencies in how different browsers handle different IDNs.
Russia's IDN Land Grab
In late 2010, Russia opened its top-level IDN, .Рф (the Cyrillic abbreviation of Russian Federation) for registration. In less than a year, nearly one million IDNs have been registered, making .Рф not only the most popular IDN, but also one of the world’s most popular country codes. Russian President Dmitry Medvedev, one of the chief advocates for IDNs, said last year: “We taught the World Wide Web to speak Russian.”
Today, Russians can access a small number of government and corporate websites by inputting Cyrillic-only URLs. And a number of companies have gotten onboard as well; Yandex, Russia’s leading search engine, can be reached at http://Яндекс.рф, and Russia’s largest mobile carrier is at http://МТС.рф.
India: One Country With Multiple IDNs
India has more than 20 official languages, which use a range of different scripts. Needless to say, one IDN won’t easily support all of these languages. ICANN recently approved seven IDNs, as shown in the graphic to the right.
It’s far too early to know how adoption of IDNs in this country will unfold, particularly since English is still widely viewed as the informal official language of the country. However, companies that are serious about winning the consumer market should also be serious about speaking the native languages of consumers.
Obstacles to IDN Adoption
While IDNs are in many ways a natural evolution of the Internet, most Internet users are completely unaware that they are even a possibility. And companies have largely been slow to promote them. In some cases, executives are concerned that promoting an IDN address may conflict with existing efforts to promote the Latin-based addresses. The thinking goes: Even if the current URLs are not ideal, why confuse users any more than we have to?
There are also a number of technical and logistical challenges still to overcome. One concern is how best to manage domains that may be represented by “variant” characters. China, for instance, has two distinct IDNs, as shown to the right.
One IDN is in simplified Chinese script while the other is in traditional script. The question that still must be resolved is whether a domain registered under one IDN should be aligned with the other. This is just one example of numerous variant issues that must be resolved by ICANN in the coming months (it has working groups studying the challenges right now).
Challenges to Supporting IDNs on Websites
Although IDNs are a long way from becoming commonplace, web developers should begin to consider the implications of supporting IDNs. Looking ahead, one can no longer assume that a URL will support only ASCII characters. Here are some challenges to keep in mind:
- In a world of multilingual URLs, web developers will need to balance security with usability. URL input fields that automatically block non-Latin characters may need to be modified.
- How should IDNs be promoted? For instance, McDonald’s in Russia still displays its Latin URL on its home page instead of the Cyrillic equivalent. Should companies display both URLs, and which URL should be the “front door”? There are no easy answers here, but the questions need to at least be asked.
- If the domain is in a different script, should subdirectories also be in the same script? As you can see below on the home page of the Egyptian government, the top-level and second-level domains are in Arabic but the subdirectories are in Latin characters. If a company is going to ask users to input long URLs to go directly to subdirectory sites, usability needs to be taken into account.
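On the first challenge above, one possible approach for an input field is to convert the host name to its ASCII form first and validate that, rather than rejecting non-Latin characters outright. The sketch below assumes Python's built-in "idna" codec (IDNA 2003 rules) and the classic letters-digits-hyphen rule for DNS labels; the name is_valid_hostname is hypothetical.

```python
import re

# One DNS label: letters, digits, and hyphens only, 1 to 63 characters,
# with no hyphen at the start or end.
_LDH_LABEL = re.compile(r"[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?")


def is_valid_hostname(host: str) -> bool:
    """Accept ASCII and internationalized host names alike."""
    try:
        # Non-ASCII labels become "xn--..." punycode; ASCII passes through.
        ascii_host = host.encode("idna").decode("ascii").lower()
    except UnicodeError:
        return False  # empty or overlong label, or characters the codec rejects
    return all(_LDH_LABEL.fullmatch(label) for label in ascii_host.split("."))


if __name__ == "__main__":
    for candidate in ["example.com", "яндекс.рф", "bad_host.com"]:
        print(candidate, is_valid_hostname(candidate))
```

Validating the converted form keeps a single rule set for every script, while security checks such as mixed-script detection can still run on the original Unicode input.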
With the emergence of IDNs, we are inching closer to a more linguistically local Internet in which users no longer have to leave their native languages to get where they want to go. And though there are many obstacles ahead, this is a positive step towards making the Internet truly accessible to the world.