Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

July 8, 2011

Languages of the World (Wide Web)
(Google Research Blog)

Filed under: Language — Patrick Durusau @ 3:54 pm

An interesting post about linking between sites in different languages, although it relies on somewhat outdated (2008) data. The authors allude to later data but give no specifics.

I mention it here as an example of different subjects (the websites in particular languages) being treated as collective subjects for the purpose of examining links (associations, in topic map speak) between those collective subjects.

Or as described by the authors:

To see the connections between languages, start by taking the several billion most important pages on the web in 2008, including all pages in smaller languages, and look at the off-site links between these pages. The particular choice of pages in our corpus here reflects decisions about what is ‘important’. For example, in a language with few pages every page is considered important, while for languages with more pages some selection method is required, based on pagerank for example.

We can use our corpus to draw a very simple graph of the web, with a node for each language and an edge between two languages if more than one percent of the offsite links in the first language land on pages in the second. To make things a little clearer, we only show the languages which have at least a hundred thousand pages and have a strong link with another language, meaning at least 1% of off-site links go to that language. We also leave out English, which we’ll discuss more in a moment. (Figure 1)
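The graph construction the authors describe can be sketched in a few lines of Python. This is only my reading of the quoted passage, not the authors' code: the function name, the input dictionaries, the language codes and all of the counts below are made up for illustration, and I have assumed that "at least 1%" is the cutoff and that the denominator is every off-site link leaving the first language.

```python
from collections import defaultdict


def language_graph(link_counts, page_counts, threshold=0.01,
                   min_pages=100_000, exclude=("en",)):
    """Return directed edges (src, dst, share) between languages.

    link_counts: {(src_lang, dst_lang): number of off-site links}
    page_counts: {lang: number of pages in the corpus}
    Both inputs are hypothetical structures, not the authors' data format.
    """
    # Total off-site links leaving each source language (all targets count).
    totals = defaultdict(int)
    for (src, _dst), n in link_counts.items():
        totals[src] += n

    def shown(lang):
        # A language is drawn only if it is not excluded and is large enough.
        return lang not in exclude and page_counts.get(lang, 0) >= min_pages

    edges = []
    for (src, dst), n in sorted(link_counts.items()):
        if src == dst or not (shown(src) and shown(dst)):
            continue
        if totals[src] == 0:
            continue  # no off-site links at all, so no share to compute
        share = n / totals[src]
        if share >= threshold:  # assumed cutoff: "at least 1%"
            edges.append((src, dst, share))
    return edges


# Invented toy counts, purely for illustration.
link_counts = {
    ("ca", "ca"): 9000, ("ca", "es"): 300, ("ca", "en"): 700,
    ("es", "es"): 9000, ("es", "ca"): 5, ("es", "en"): 995,
    ("xx", "xx"): 50, ("xx", "es"): 50,
}
page_counts = {"ca": 2_000_000, "es": 50_000_000,
               "en": 1_000_000_000, "xx": 20_000}

edges = language_graph(link_counts, page_counts)
for src, dst, share in edges:
    print(f"{src} -> {dst}: {share:.1%}")
```

With these invented numbers only one edge survives: "ca" sends 3% of its off-site links to "es", while "es" sends far less than 1% back, "xx" falls under the page minimum, and English is left out as in the original figure. Note that each node here is already a collective subject: the individual sites behind the counts are not recoverable from this structure.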

The study would have been more compelling had it been possible to decompose the collective subjects and reveal numbers for sites in particular locations, or for particular sites.
