
March 22, 2005

Local Languages, RSS and the Digital Divide

Susan Mernit recently posted a blog entry about RSS and its ability to display languages written in non-Latin scripts. Susan's post was in turn inspired by postings on Rebecca MacKinnon's blog about diversity in the blogosphere. Both blogs include a quote from Richard Sambrook of the BBC:


I was speaking to one of our BBC World Service software engineers yesterday who made a point I hadn't appreciated but which potentially has a hugely negative effect on diversity: The issue is RSS does not have a way to display right to left languages correctly and is not very compatible with non Latin languages. I believe it just was not thought about deeply by the people and development effort behind RSS.

This slows down the growth of non Latin RSS adoption. We need to develop multiple language RSS and hopefully redefine standards and approaches.

I've spent a lot of time in the last few years noting the importance of producing local-language content on the Internet. According to various surveys, it's believed that around two-thirds of all Internet content is produced in English, even though English speakers make up less than 10% of the world's population. Some languages, such as Spanish and Chinese, are finally beginning to blossom online, though they still trail behind English, the lingua franca of online discourse.

Unfortunately, there haven't been many global surveys regarding language and Internet content. One important study came from the Barcelona media company Vilaweb in 2001, which found around 68 percent of all websites to be in English. When the study came out, I decided to make a quick chart that compared the number of Web pages found in a given language with the number of people worldwide who spoke that language. Here's what I found. (Again, please note that this data is several years old, so take it with a grain of salt...)

Web Pages and Languages, ranked by the number of speakers per web page:

Language      Web pages  % of all sites     Speakers  % of humans  People per page
English     214,250,996           68.39  322,000,000        5.34%              1.5
Icelandic       136,788            0.04      250,000        .004%             1.83
Swedish       2,929,241            0.93    9,000,000         .14%             3.07
Danish        1,374,886            0.44    5,292,000        .085%             3.85
Norwegian     1,259,189            0.40    5,000,000         .08%             3.86
Finnish       1,198,956            0.38    6,000,000        .095%             5.00
German       18,069,744            5.77   98,000,000        1.57%              5.4
Dutch         3,161,844            1.01   20,000,000         .32%              6.3
Estonian        173,265            0.06    1,100,000        .018%             6.36
Japanese     18,335,739            5.85  125,000,000        2.01%              6.8
Italian       4,883,497            1.56   37,000,000         .59%             7.58
French        9,262,663            2.96   72,000,000        1.16%             7.77
Catalan         443,301            0.14    4,353,000         .07%              9.8
Czech           991,075            0.32   12,000,000         .19%             12.1
Basque           36,321            0.01      588,000       .0094%            16.19
Slovenian       134,454            0.04    2,218,000        .036%             16.5
Korean        4,046,530            1.29   75,000,000        1.21%             18.5
Latvian          60,959            0.02    1,550,000        .025%             25.4
Russian       5,900,956            1.88  170,000,000        2.73%             28.8
Hungarian       498,625            0.16   14,500,000         .23%             29.1
Portuguese    4,291,237            1.37  170,000,000        2.73%             39.6
Greek           287,980            0.09   12,000,000         .19%            41.67
Spanish       7,573,064            2.42  332,000,000        5.34%             43.8
Lithuanian       82,829            0.03    4,000,000        .064%            48.29
Polish          848,672            0.27   44,000,000         .71%             51.8
Hebrew          198,030            0.06   12,000,000         .19%             60.6
Chinese      12,113,803            3.87  885,000,000        14.2%             73.1
Turkish         430,996            0.14   59,000,000         .95%            136.9
Bulgarian        51,336            0.02    9,000,000         .14%            175.3
Romanian        141,587            0.05   26,000,000         .42%            183.6
Arabic          127,565            0.04  202,000,000        3.25%           1583.5

As you can see here, there were about one and a half English speakers for every Web page in English. Interestingly, the next-lowest ratio belonged to Icelandic; while there aren't many Icelandic speakers, they've produced a lot of online content, so their ratio of speakers to Web pages is nearly as low. But compare this to Arabic-language content: there were so few Web pages in Arabic at the time compared to the large population of Arabic speakers that you end up with more than 1,500 people per Arabic Web page. Of course, this data is a few years old, and I'd love to update this chart, but so far I haven't seen a recent study that tabulates the number of pages for each of these languages.
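For anyone who wants to check the arithmetic, the people-per-page figure is just the number of speakers divided by the number of Web pages. Here is a minimal Python sketch using three rows copied from the chart; the function name people_per_page is simply a label I've chosen for illustration.

```python
# Speaker and Web page counts copied from the chart (2001-era data).
rows = {
    "English": (322_000_000, 214_250_996),
    "Icelandic": (250_000, 136_788),
    "Arabic": (202_000_000, 127_565),
}


def people_per_page(speakers, pages):
    """Return the number of speakers for each Web page in a language."""
    return speakers / pages


for language, (speakers, pages) in rows.items():
    ratio = people_per_page(speakers, pages)
    print(f"{language}: {ratio:.1f} people/page")
```

The lower the number, the more content a language has relative to the size of its speaking population.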

So how does all of this relate to RSS? Well, RSS has become the de facto way to syndicate content on the Internet. Blogs and news services rely on RSS, as do a growing number of blog consumers. But as the recent Pew study demonstrated, the average blogger is white, well-educated, well-off and English speaking. There's no way we can seriously bridge the digital divide as long as people can't create or access knowledge in their native language. Even if we somehow managed to bring Internet access to every village in the developing world, it wouldn't mean much if those villagers were stuck using the Net only in English.

Fortunately, the Unicode project has helped bring local languages to the Internet by providing a universal scheme for encoding tens of thousands of non-Latin characters. This means that I can go to an Israeli newspaper and read it in Hebrew, or an Iranian blog in Farsi. Slowly but surely, language is becoming democratized online, even if the amount of content or readers for a particular language leaves much to be desired.
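To make that a little more concrete, here is a rough Python sketch of what the standard provides for two sample words. The words themselves (a Hebrew and a Persian greeting) are my own illustrative choices; the script prints each character's code point, official name and bidirectional class, where "R" and "AL" mark right-to-left letters, and then the UTF-8 bytes that would actually travel over the network.

```python
import unicodedata

# Illustrative sample words, written as escapes so the source stays ASCII:
# "shalom" in Hebrew and "salam" in Persian.
samples = {
    "Hebrew": "\u05e9\u05dc\u05d5\u05dd",
    "Persian": "\u0633\u0644\u0627\u0645",
}

for language, word in samples.items():
    for ch in word:
        # Code point, official character name, and bidirectional class.
        print(language, f"U+{ord(ch):04X}", unicodedata.name(ch),
              unicodedata.bidirectional(ch))
    # The same word serialized as UTF-8 bytes.
    print(language, word.encode("utf-8"))
```

The character data records which way each letter runs, but it is still up to the software displaying the text to lay it out properly.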

But what if RSS isn't capable of handling all of these Unicode-supported languages? Will a Gujarati family in New Jersey be able to read news feeds coming from Gujarati bloggers in India? Honestly, I'm not sure. Technically, it should work: if you take a look at Hoder's blog, written in Farsi (Persian), you can subscribe to his Farsi RSS feed. I don't speak Farsi, but when I tried to subscribe to his feed using Mozilla Thunderbird, I received tons of blog entries, all of which were indeed written in Farsi. So in this particular case, the system apparently works.
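RSS feeds are XML documents, so in principle they can carry any Unicode text. As a small experiment, here is a Python sketch (assuming Python 3.8 or later) that builds a tiny RSS 2.0 feed with a Persian title, serializes it as UTF-8 and parses it back. The feed contents and the example.org link are made up for illustration; this is not Hoder's actual feed.

```python
import xml.etree.ElementTree as ET

# Build a tiny RSS 2.0 feed. All names and URLs here are placeholders.
rss = ET.Element("rss", version="2.0")
channel = ET.SubElement(rss, "channel")
ET.SubElement(channel, "title").text = "Example feed"
ET.SubElement(channel, "link").text = "http://example.org/"
ET.SubElement(channel, "description").text = "Illustrative feed"

# A Persian item title ("Hello, world"), written as escapes.
persian_title = "\u0633\u0644\u0627\u0645 \u062f\u0646\u06cc\u0627"
item = ET.SubElement(channel, "item")
ET.SubElement(item, "title").text = persian_title

# Serialize as UTF-8 bytes, with an XML declaration naming the encoding.
feed_bytes = ET.tostring(rss, encoding="utf-8", xml_declaration=True)

# Parse the bytes back, roughly as a feed reader would.
parsed = ET.fromstring(feed_bytes)
round_tripped = parsed.find("channel/item/title").text
print(round_tripped == persian_title)
```

Keep in mind that this only checks whether the characters survive the trip. Whether a particular feed reader then displays right-to-left text correctly is a separate question that this sketch doesn't answer.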

But does this apply to all languages supported by Unicode? Frankly, I have no idea, so I'm hoping some Unicode techie will jump in and set the record straight. But I certainly hope it can work for all Unicode-supported languages. Otherwise, we'll see a new facet of the linguistic digital divide. For languages with RSS support, knowledge gets produced and disseminated at a rapid pace, allowing more online knowledge to be produced, and an expanding community of people able to talk about this knowledge and contribute even further to it. But languages that can't be transmitted via RSS will be stuck sharing content at a much slower pace, to a smaller, less-connected audience. Internet users shouldn't be penalized just because their language isn't on a global top 10 list. So let's make sure that people can blog and publish in the language of their choice -- and that their RSS feeds will support them, 100 percent. -andy

Posted by acarvin at March 22, 2005 2:29 PM
