UX Magazine

Defining and Informing the Complex Field of User Experience (UX)
Article No. 715 August 16, 2011

A URL in Any Language: Getting to know the next generation of URLs

The Internet may touch all corners of the globe, but for millions of people it still remains frustratingly out of touch. The URL—that web address that connects users to content—has long remained limited to a handful of Latin-based characters.

Until now.
This year, we are witnessing the emergence of full-length URLs in non-Latin scripts, known as internationalized domain names (IDNs). Here are three working examples:

  • правительство.рф (http://правительство.рф)
  • παράδειγμα.δοκιμή (http://παράδειγμα.δοκιμή)
  • 실례.테스트 (http://실례.테스트)

To give you an idea of just how many different scripts can be supported, the following exhibit includes the top-level IDNs that have, as of May 2011, received formal approval by ICANN. The IDNs are displayed next to their country code equivalents.

Map of approved IDNs

The regions represented here constitute more than 2.5 billion people, most of whom do not speak English as a native language. The regions also represent where most of the growth in Internet usage will occur over the next decade.

IDNs promise to improve the Internet for millions of people, but they also face obstacles to widespread adoption. This article provides an overview of IDNs as well as these obstacles, and the challenges that web developers will face as IDNs become more common.

The Year of the IDN

An IDN is a domain name that supports one or more non-ASCII characters. This means that any accented character (e.g., ä or é) would turn a conventional domain into an IDN. For example, the German liqueur Jägermeister uses the IDN jägermeister.com. But IDNs are getting the most attention these days for their support for non-Latin scripts such as Arabic, Hebrew, Greek, Cyrillic, and Chinese.

IDNs have been around for a number of years, but mostly limited to functioning as partial IDNs—a name that’s only partly composed of non-Latin characters. For example, президент.ru (the home page of the Russian President) is a partial IDN. While the second-level domain name (президент) is in Cyrillic, the top-level domain name (.ru), is still in ASCII. The Internet protocol strings such as "http" and "ftp" remain in ASCII.

Although partial IDNs function perfectly well, they’re not very practical because they require the user to switch keyboard inputs midstream. That's why the recent introduction of support for non-ASCII top-level domain names is significant.

What About IDNs and .com?

The .com domain name is the world’s most popular top-level domain (TLD) name. In June of this year, ICANN gave the green light to the expansion of generic TLDs (such as .com and .net) in different scripts. VeriSign, the owner of .com, has made it clear that it intends to offer language-specific versions of .com in the years ahead. Given VeriSign's market power, it's safe to say that many people will become aware of IDNs once .com goes multilingual.

How IDNs Work

The Internet was not designed to be multilingual. The domain name system (DNS) was intended only to support the characters a through z, A through Z, 0 through 9, and the hyphen. Upgrading the DNS to support upwards of 100,000 characters is no trivial feat.

In fact, the DNS has not been upgraded. The name servers scattered around the world—the servers that translate domain names into their numeric IP addresses—still only support ASCII characters. But what has changed is the addition of a sort of multilingual translation service at the front end (the user’s browser) to supplement the DNS.

When an IDN is input, the web browser algorithmically converts it into an odd-looking combination of ASCII characters (known as punycode).

For example, the IDN for McDonald’s Russia (макдональдс.рф) is converted to the punycode equivalent http://xn--80aalb1aicli8a5i.xn--p1ai/. It is this punycode string that is sent to the name server. When an IDN is initially registered, it is this punycode address that is stored within the name server.
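The conversion described above can be tried with a few lines of Python. This is only a minimal sketch: it relies on Python's built-in "idna" codec, which implements the older IDNA 2003 rules, so its behavior may differ from what a given browser does. The helper names to_ascii and to_unicode are illustrative, not part of any standard.

```python
def to_ascii(hostname: str) -> str:
    """Convert a Unicode hostname to its ASCII (punycode) form."""
    # The built-in "idna" codec converts each dot-separated label separately
    # and prefixes converted labels with "xn--". It follows IDNA 2003.
    return hostname.encode("idna").decode("ascii")


def to_unicode(hostname: str) -> str:
    """Convert an ASCII (punycode) hostname back to Unicode for display."""
    return hostname.encode("ascii").decode("idna")


if __name__ == "__main__":
    for name in ["рф", "правительство.рф", "макдональдс.рф"]:
        ascii_form = to_ascii(name)
        # Print the original name, the form sent to the name server,
        # and the round-tripped Unicode form.
        print(name, "->", ascii_form, "->", to_unicode(ascii_form))
```

Names that are already plain ASCII, such as example.com, pass through unchanged, which is why existing domains are unaffected by the scheme.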

Why Is There Punycode in My Browser?

Users of IDNs may notice punycode showing up in their web browsers, such as this example from Chrome:

Punycode version of McDonald's Russian IDN in Chrome

This is usually intentional. With Internet Explorer, Chrome, and Firefox, the IDN is generally displayed in punycode format unless the language preference of the browser matches the language of the IDN.

There are real and perceived security fears that IDNs may be used to trick users into visiting phony websites. For example, рayрal.com and paypal.com are actually two different domains (the first uses the Cyrillic character "р"). To minimize this risk, some IDNs, such as the Russian IDN, do not permit mixed-script domain names. Of course, this protection depends on registries enforcing these limitations, and whether they will remains to be seen.
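To make the mixed-script problem concrete, here is a rough Python sketch of how a label could be flagged. It is a heuristic, not how any registry or browser actually enforces the rule: it guesses each letter's script from the first word of its Unicode character name (for example "CYRILLIC" or "LATIN"). The function names scripts_in_label and is_mixed_script are hypothetical.

```python
import unicodedata


def scripts_in_label(label: str) -> set:
    """Return the approximate set of scripts used by the letters in a label."""
    scripts = set()
    for ch in label:
        if not ch.isalpha():
            continue  # ignore digits and hyphens, which are shared across scripts
        # Heuristic (an assumption of this sketch): the first word of the
        # Unicode name, e.g. "CYRILLIC SMALL LETTER ER", names the script.
        scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_mixed_script(label: str) -> bool:
    """True when the letters in a label come from more than one script."""
    return len(scripts_in_label(label)) > 1


if __name__ == "__main__":
    spoof = "\u0440ay\u0440al"  # "paypal" with Cyrillic "р" (U+0440) in place of "p"
    print(is_mixed_script(spoof), scripts_in_label(spoof))
    print(is_mixed_script("paypal"), scripts_in_label("paypal"))
```

A check like this catches the рayрal example but not a spoof written entirely in one look-alike script, so it is at best one layer of defense.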

For the time being, expect inconsistencies in how different browsers handle different IDNs.

Russia's IDN Land Grab

In late 2010, Russia opened its top-level IDN, .Рф (the Cyrillic abbreviation of Russian Federation) for registration. In less than a year, nearly one million IDNs have been registered, making .Рф not only the most popular IDN, but also one of the world’s most popular country codes. Russian President Dmitry Medvedev, one of the chief advocates for IDNs, said last year: “We taught the World Wide Web to speak Russian.”

Today, Russians can access a small number of government and corporate websites by inputting Cyrillic-only URLs. And a number of companies have gotten onboard as well; Yandex, Russia’s leading search engine, can be reached at http://Яндекс.рф, and Russia’s largest mobile carrier is at http://МТС.рф.

India: One Country With Multiple IDNs

India has more than 20 official languages, which use a range of different scripts. Needless to say, one IDN won’t easily support all of these languages. ICANN recently approved seven IDNs, as shown in the graphic.

India's multiple IDNs

It’s far too early to know how adoption of IDNs in this country will unfold, particularly since English is still widely viewed as the informal official language of the country. However, companies that are serious about winning the consumer market should also be serious about speaking the native languages of consumers.

Obstacles to IDN Adoption

While IDNs are in many ways a natural evolution of the Internet, most Internet users are completely unaware that they are even a possibility. And companies have largely been slow to promote them. In some cases, executives are concerned that promoting an IDN address may conflict with existing efforts to promote the Latin-based addresses. The thinking goes: Even if the current URLs are not ideal, why confuse users any more than we have to?

In addition, the emergence of social networking platforms may hinder the growth of IDNs. For example, Twitter doesn’t currently recognize IDNs (though Facebook does).

There are also a number of technical and logistical challenges still to overcome. One concern is how best to manage domains that may be represented by “variant” characters. China, for instance, has two distinct IDNs.

China's two IDNs

One IDN is in simplified Chinese script while the other is in traditional script. The question still to be resolved is whether a domain registered under one IDN should be aligned with the other. This is just one example of numerous variant issues that must be resolved by ICANN in the coming months (it has working groups studying the challenges right now).

Challenges to Supporting IDNs on Websites

Although IDNs are a long way from becoming commonplace, web developers should begin to consider the implications of supporting IDNs. Looking ahead, one can no longer assume that a URL will support only ASCII characters. Here are some challenges to keep in mind:

  • In a world of multilingual URLs, web developers will need to balance security with usability. URL input fields that automatically block non-Latin characters may need to be modified.
  • How should IDNs be promoted? For instance, McDonald’s in Russia still displays its Latin URL on its home page instead of the Cyrillic equivalent. Should companies display both URLs, and which URL should be the “front door?” There are no easy answers here, but the questions need to at least be asked.
  • If the domain is in a different script, should subdirectories also be in the same script? As you can see below on the home page of the Egyptian government, the top-level and second-level domains are in Arabic but the subdirectories are in Latin characters. If a company is going to ask users to input long URLs to go directly to subdirectory sites, usability needs to be taken into account.

Egyptian government site using a URL with mixed scripts
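On the first challenge in the list above, one possible approach to an input field is to stop rejecting non-Latin characters outright and instead validate the ASCII form that the name converts to. The Python sketch below illustrates that idea under stated assumptions: it uses Python's built-in "idna" codec (IDNA 2003 rules), the function name is_acceptable_hostname is hypothetical, and the check covers only the hostname, not paths or query strings.

```python
import re

# One DNS label in ASCII form: letters, digits, and hyphens, with no leading
# or trailing hyphen, and at most 63 characters.
_LABEL = re.compile(r"^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$")


def is_acceptable_hostname(hostname: str) -> bool:
    """Accept ASCII and internationalized hostnames that convert cleanly."""
    try:
        # Convert to the punycode form first, so that Cyrillic, Arabic, and
        # other scripts are judged by what the DNS would actually see.
        ascii_form = hostname.encode("idna").decode("ascii").lower()
    except UnicodeError:
        return False  # empty labels, over-long labels, or disallowed characters
    if not ascii_form or len(ascii_form) > 253:
        return False
    labels = ascii_form.rstrip(".").split(".")
    return all(_LABEL.match(label) for label in labels)


if __name__ == "__main__":
    for candidate in ["президент.ru", "макдональдс.рф", "example.com", "bad_host.com"]:
        print(candidate, is_acceptable_hostname(candidate))
```

A form that accepts names this way would probably still want a separate safeguard against look-alike characters, since a name can be technically valid and still misleading.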

With the emergence of IDNs, we are inching closer to a more linguistically local Internet in which users no longer have to leave their native languages to get where they want to go. And though there are many obstacles ahead, this is a positive step towards making the Internet truly accessible to the world.

ABOUT THE AUTHOR(S)

John Yunker (@johnyunker) is a leading expert on web and content globalization and author of the book The Art of the Global Gateway: Strategies for Successful Multilingual Navigation. He writes the popular blog Global by Design and is co-founder of Byte Level Research.

Comments

Dear John, A great article on the change occurring in the internet world!
Those who are not familiar with Latin character sets but use the Internet will find that being able to use their own language (Arabic, Chinese, Hebrew, Hindi, Russian, just to name a few) is a major benefit.
Companies who want to reach out to non english speaking communities will be able to now advertise webnames in non english characters, the potential of what will happen over the next 2 years is absolutely mind boggling and the opportunity to purchase some of the most popular non latin web names is available
right now at http://internationalwebsitenames.com

Thanks John for a great article again, would like to correspond in the near future to keep up with events!

great article

"(...)since English is still widely viewed as the informal official language of the country (...)"

well, sorry for being picky but actually, English is an official language of India in a very formal way, it's stated in the Indian constitution...

New Dashcom (not Dotcom) IDN domain names are already available in ANY language or text.

Domains names in the format 祝你-好运 or अरे-दोस्त or добрый-вечер

Now free at http://dashworlds.com

Wondering why such an outdated map of India is used for representation...

Great to hear for those of us that speak languages that are not Latin-based or entirely Latin-based.

Frankly, I'm surprised it has taken this long.