From the webpage:
To foster the study of the structure and dynamics of Web traffic networks, we make available a large dataset (‘Click Dataset’) of HTTP requests made by users at Indiana University. Gathering anonymized requests directly from the network, rather than relying on server logs or browser instrumentation, makes it possible to examine large volumes of traffic data while minimizing the biases associated with those other data sources. It also yields valuable referrer information with which to reconstruct the subset of the Web graph actually traversed by users. The goal is to develop a better understanding of user behavior online and to create more realistic models of Web traffic. Potential applications of the data include improved designs for networks, sites, and server software; more accurate forecasting of traffic trends; classification of sites based on the patterns of activity they inspire; and improved ranking algorithms for search results.
The data was generated by applying a Berkeley Packet Filter to a mirror of the traffic passing through the border router of Indiana University. This filter matched all traffic destined for TCP port 80. A long-running collection process used the pcap library to gather these packets, then applied a small set of regular expressions to their payloads to determine whether they contained HTTP GET requests.
The data are available under a set of terms and restrictions, with delivery by physical hard drive (~2.5 TB of data).
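The pipeline the page describes (a BPF match on traffic to TCP port 80, pcap capture, a few regular expressions over payloads) is simple enough to sketch. Below is a minimal Python approximation using scapy; the specific regexes and the printed output are my own illustrative assumptions, not the collector IU actually ran, and like any per-packet matcher it misses requests that span multiple packets.

```python
import re
from scapy.all import sniff, Raw  # capture requires libpcap and root privileges

# Illustrative regexes -- the actual patterns used by the IU collector are not published.
GET_RE = re.compile(rb"^GET\s+(\S+)\s+HTTP/1\.[01]\r\n")
HOST_RE = re.compile(rb"\r\nHost:\s*(\S+)", re.IGNORECASE)
REFERER_RE = re.compile(rb"\r\nReferer:\s*(\S+)", re.IGNORECASE)

def handle(pkt):
    """Inspect one captured packet and print any HTTP GET request it carries."""
    if not pkt.haslayer(Raw):
        return
    payload = bytes(pkt[Raw].load)
    m = GET_RE.match(payload)
    if not m:
        return  # payload does not begin with a GET request line
    host = HOST_RE.search(payload)
    referer = REFERER_RE.search(payload)
    print(m.group(1).decode("ascii", "replace"),
          host.group(1).decode("ascii", "replace") if host else "-",
          referer.group(1).decode("ascii", "replace") if referer else "-")

# The BPF expression mirrors the one described above: all traffic destined for TCP port 80.
sniff(filter="tcp dst port 80", prn=handle, store=False)
```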
Intrigued by the notion of a “subset of the Web graph actually traversed by users.”
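To make that notion concrete: each logged request contributes a (referrer, target) pair, and those pairs are the directed edges of the subgraph users actually traversed. A minimal sketch, assuming a hypothetical list of such pairs (the dataset's real record format is not reproduced here):

```python
import networkx as nx

# Hypothetical anonymized request records as (referrer_url, target_url) pairs.
requests = [
    ("http://example.org/a", "http://example.org/b"),
    ("http://example.org/a", "http://example.org/c"),
    ("http://example.org/b", "http://example.org/c"),
]

g = nx.DiGraph()
for referrer, target in requests:
    # Each click adds (or reweights) a directed edge from the referring page
    # to the requested page, yielding the traversed subset of the Web graph.
    if g.has_edge(referrer, target):
        g[referrer][target]["weight"] += 1
    else:
        g.add_edge(referrer, target, weight=1)

print(g.number_of_nodes(), "pages,", g.number_of_edges(), "traversed links")
```

Keeping a weight per edge, rather than a bare adjacency, preserves how often each link was actually followed, which is exactly the traffic-modeling signal server logs and crawls tend to lack.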
Does that mean semantic annotation should occur first on the portion of the “…Web graph actually traversed by users,” before reaching other parts?
If the language of 4,148,237 English Wikipedia pages is never in doubt for any user, do we really need triples to record that for every page?