Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

September 14, 2014

The growing problem of “link rot” and best practices for media and online publishers

Filed under: Hypertext,WWW — Patrick Durusau @ 6:11 pm

The growing problem of “link rot” and best practices for media and online publishers by Leighton Walter Kille.

From the post:

The Internet is an endlessly rich world of sites, pages and posts — until it all ends with a click and a “404 not found” error message. While the hyperlink was conceived in the 1960s, it came into its own with the HTML protocol in 1991, and there’s no doubt that the first broken link soon followed.

On its surface, the problem is simple: A once-working URL is now a goner. The root cause can be any of a half-dozen things, however, and sometimes more: Content could have been renamed, moved or deleted, or an entire site could have evaporated. Across the Web, the content, design and infrastructure of millions of sites are constantly evolving, and while that’s generally good for users and the Web ecosystem as a whole, it’s bad for existing links.

In its own way, the Web is also a very literal-minded creature, and all it takes is a single-character change in a URL to break a link. For example, many sites have stopped using “www,” and even if their content remains the same, the original links may no longer work. The rise of CMS platforms such as WordPress have led to the fall of static HTML sites with their .htm and .html extensions, and with each relaunch, untold thousands of links die.

Even if a core URL remains the same, many sites frequently append login information or search terms to URLs, and those are ephemeral. And as the Web has grown, the problem has been complicated by Google and other search engines that crawl the Web and archive — briefly — URLs and pages. Many work, but their long-term stability is open to question.

Hmmm, link rot, do you think that impacts the Semantic Web? 😉

If you can have multiple IRIs for the same subject, well, one rotten identifier doesn’t have to be the end of the story.

Leighton has a number of suggestions for lessening your own link rot. For the link rot in other people’s identifiers, I suggest topic maps.
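If you want to audit your own pages for rot, here is a minimal Python sketch of a link checker, assuming the requests library; the URLs are placeholders:

    # Minimal link-rot audit: report URLs that no longer resolve cleanly.
    # Assumes the 'requests' library; the URLs below are placeholders.
    import requests

    urls = [
        "http://example.com/old-post.html",
        "http://example.org/moved-page",
    ]

    for url in urls:
        try:
            # HEAD keeps traffic light; some servers mishandle it, so fall back to GET.
            resp = requests.head(url, allow_redirects=True, timeout=10)
            if resp.status_code >= 400:
                resp = requests.get(url, allow_redirects=True, timeout=10)
            status = resp.status_code
        except requests.RequestException as err:
            status = "error: " + str(err)
        print(status, url)

Anything that comes back as a 404, a 410 or a connection error is a candidate for a redirect, an archive link or removal.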

I first saw this at Full Text Reports as: Website linking: The growing problem of “link rot” and best practices for media and online publishers.

September 12, 2014

A Greater Voice for Individuals in W3C – Tell Us What You Would Value [Deadline: 30 Sept 2014]

Filed under: Standards,WWW — Patrick Durusau @ 6:54 pm

A Greater Voice for Individuals in W3C – Tell Us What You Would Value by Coralie Mercier.

From the post:

How is the W3C changing as the world evolves?

Broadening in recent years the W3C focus on industry is one way. Another was the launch in 2011 of W3C Community Groups to make W3C the place for new standards. W3C has heard the call for increased affiliation with W3C, and making W3C more inclusive of the web community.

W3C responded through the development of a program for increasing developer engagement with W3C. Jeff Jaffe is leading a public open task force to establish a program which seeks to provide individuals a greater voice within W3C, and means to get involved and help shape web technologies through open web standards.

Since Jeff announced the version 2 of the Webizen Task Force, we focused on precise goals, success criteria and a selection of benefits, and we built a public survey.

The W3C is a membership based organisation supported by way of membership fees, as to form a common set of technologies, written to the specifications defined through the W3C, which the web is built upon.

The proposal (initially called Webizen but that name may change and we invite your suggestions in the survey), seeks to extend participation beyond the traditional forum of incorporated entities with an interest in supporting open web standards, through new channels into the sphere of individual participation, already supported through the W3C community groups.

Today the Webizen Task Force is releasing a survey which will identify whether or not sufficient interest exists. The survey asks if you are willing to become a W3C Webizen. It offers several candidate benefits and sees which ones are of interest; which ones would make it worthwhile to become Webizens.

I took the survey today and suggest that you do the same before 30 September 2014.

In part I took the survey because of one comment on the original post that reads:

What a crock of shit! The W3C is designed to not be of service to individuals, but to the corporate sponsors. Any ideas or methods to improve web standards should not be taken from sources other then the controlling corporate powers.

I do think that as a PR stunt the Webizen concept could be a good ploy to allow individuals to think they have a voice, but the danger is that they may be made to feel as if they should have a voice.

This could prove detrimental in the future.

I believe the focus of the organization should remain the same, namely as a organization that protects corporate interests and regulates what aspects of technology can be, and should be, used by individuals.

The commenter apparently believes in a fantasy world where those with the gold don’t make the rules.

I am untroubled by those with the gold making the rules, so long as the rest of us have the opportunity for persuasion, that is, to be heard by those making the rules.

My suggestion at #14 of the survey reads:

The anti-dilution of “value of membership” position creates a group of second class citizens, which can only lead to ill feelings and no benefit to the W3C. It is difficult to imagine that IBM, Oracle, HP or any of the other “members” of the W3C are all that concerned with voting on W3C specifications. They are likely more concerned with participating in the development of those standards. Which they could do without being members should they care to submit public comments, etc.

In fact, “non-members” can contribute to any work currently under development. If their suggestions have merit, I rather doubt their lack of membership is going to impact acceptance of their suggestions.

Rather than emphasizing the “member” versus “non-member” distinction, I would create “voting member” and “working member” categories, with different membership requirements. “Voting members” would carry on as they are presently and vote on the administrative aspects of the W3C. “Working members” would consist of employees of “voting members,” “invited experts,” and individuals who meet some criteria for interest in and expertise at a particular specification activity. Like an “invited expert” but without the heavyweight machinery.

Emphasizing the different concerns of different classes of membership would go a long way toward not creating a feeling of second-class citizenship. Or at least it would minimize that feeling more than the “in your face” approach that appears to be the present position.

Being able to participate in teleconferences for example, should be sufficient for most working members. After all, if you have to win votes for a technical position, you haven’t been very persuasive in presenting your position.

Nothing against “voting members” at the W3C but I would rather be a “working member” any day.

How about you?

Take the Webizen survey.

September 9, 2014

BootCaT: Simple Utilities to Bootstrap Corpora And Terms from the Web

Filed under: Corpora,Natural Language Processing,WWW — Patrick Durusau @ 6:10 pm

BootCaT: Simple Utilities to Bootstrap Corpora And Terms from the Web

From the webpage:

Despite certain obvious drawbacks (e.g. lack of control, sampling, documentation etc.), there is no doubt that the World Wide Web is a mine of language data of unprecedented richness and ease of access.

It is also the only viable source of “disposable” corpora built ad hoc for a specific purpose (e.g. a translation or interpreting task, the compilation of a terminological database, domain-specific machine learning tasks). These corpora are essential resources for language professionals who routinely work with specialized languages, often in areas where neologisms and new terms are introduced at a fast pace and where standard reference corpora have to be complemented by easy-to-construct, focused, up-to-date text collections.

While it is possible to construct a web-based corpus through manual queries and downloads, this process is extremely time-consuming. The time investment is particularly unjustified if the final result is meant to be a single-use corpus.

The command-line scripts included in the BootCaT toolkit implement an iterative procedure to bootstrap specialized corpora and terms from the web, requiring only a list of “seeds” (terms that are expected to be typical of the domain of interest) as input.

In implementing the algorithm, we followed the old UNIX adage that each program should do only one thing, but do it well. Thus, we developed a small, independent tool for each separate subtask of the algorithm.

As a result, BootCaT is extremely modular: one can easily run a subset of the programs, look at intermediate output files, add new tools to the suite, or change one program without having to worry about the others.

Any application following “the old UNIX adage that each program should do only one thing, but do it well” merits serious consideration.

Occurs to me that BootCaT would also be useful for creating small text collections for comparison to each other.
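The iterative procedure is simple enough to sketch. This is not BootCaT’s own code, just a rough Python outline of the seed-tuple step it describes, with a hypothetical search_web() standing in for whatever search API you have access to:

    # Rough outline of a BootCaT-style bootstrap step (not the actual toolkit).
    import itertools
    import random

    def search_web(query, max_hits=10):
        """Hypothetical search call; replace with a real search API client."""
        raise NotImplementedError

    def bootstrap_urls(seeds, tuple_size=3, n_tuples=10):
        # Combine seed terms into random tuples and use each tuple as a query.
        combos = list(itertools.combinations(seeds, tuple_size))
        urls = set()
        for combo in random.sample(combos, min(n_tuples, len(combos))):
            urls.update(search_web(" ".join(combo)))
        return urls  # next: fetch pages, strip boilerplate, extract new terms, repeat

    seeds = ["interpreting", "terminology", "glossary", "translation", "corpus"]

The fetched pages become the corpus; terms extracted from them become the next round’s seeds.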

Enjoy!

I first saw this in a tweet by Alyona Medelyan.

September 3, 2014

A Web Magna Carta?

Filed under: Government,Politics,WWW — Patrick Durusau @ 4:40 pm

Crowdsourcing a Magna Carta for the Web at the Internet Governance Forum by Harry Halpin.

From the post:

At the Internet Governance Forum this week in Istanbul, we’ve been discussing how to answer the question posed by Tim Berners-Lee and the World Wide Web Foundation at the occasion of the 25th anniversary of the Web: What is the Web We Want? How can a “Magna Carta” for Web rights be crowd-sourced directly from the users of the Web itself?

A session on the Magna Carta (panel and Q&A) is part of the agenda this week at IGF on Thursday [4] September at 10:00 CET in Room 4 and folks can participate remotely over WebEx, IRC, and Twitter. Please tweet your questions about the Magna Carta with #webwewant to Twitter or join the channel #webwewant at irc.freenode.org. The session will be livestreamed.

Before you get too excited about a Magna Carta for Web rights, recall some of the major events in the history of the Magna Carta. Or see: Treasures in Full: Magna Carta (British Library), which includes the ability to read an image of the Magna Carta.

First, the agreement was an attempt to limit the powers of King John by a group of feudal barons, who wanted to protect their rights and property, not those of all subjects of King John. Moreover, both the king and the barons were willing to use force against the other in order to prevail.

King John renounced the Magna Carta and, after about three months, the First Barons’ War ensued.

I welcome the conversation but for a Magna Carta for the Web to succeed, sovereign states (read nations) must agree to enforceable limits on their power, much as King John did.

Twenty-five feudal barons, under Article 61 of the Magna Carta (originally unnumbered), could enforce the Magna Carta:

Since, moreover, we have conceded all the above things (from reverence) for God, for the reform of our kingdom and the better quieting of the discord that has sprung up between us and our barons, and since we wish these things to flourish unimpaired and unshaken for ever, we constitute and concede to them the following guarantee:- namely, that the barons shall choose any twenty-five barons of the kingdom they wish, who with all their might are to observe, maintain and secure the observance of the peace and rights which we have conceded and confirmed to them by this present charter of ours; in this manner, that if we or our chief Justiciar or our bailiffs or any of our servants in any way do wrong to anyone, or transgress any of the articles of peace or security, and the wrong doing has been demonstrated to four of the aforesaid twenty-five barons, those four barons shall come to us or our chief Justiciar, (if we are out of the kingdom), and laying before us the grievance, shall ask that we will have it redressed without delay. And if we, or our chief Justiciar (should we be out of the kingdom) do not redress the grievance within forty days of the time when it was brought to the notice of us or our chief Justiciar (should we be out of the kingdom), the aforesaid four barons shall refer the case to the rest of the twenty-five barons and those twenty-five barons with the whole community of the land shall distrain and distress us in every way they can, namely by taking of castles, estates and possessions, and in such other ways as they can, excepting (attack on) our person and those of our queen and of our children until, in their judgment, satisfaction has been secured; and when satisfaction has been secured let them behave towards us as they did before. And let anyone in the country who wishes to do so take an oath to obey the orders of the said twenty-five barons in the execution of all the aforesaid matters and with them to oppress us to the best of his ability, and we publicly and freely give permission for the taking the oath to anyone who wishes to take it, and we will never prohibit anyone from taking it. [source: http://www.iamm.com/magnaarticles.htm]

To cut to the chase, the King in Article 61 agrees the twenty-five barons could seize his castles, estates and possessions, excepting they cannot attack the king, queen, and their children, in order to force the king to follow the terms of the Magna Carta.

In modern terms, the barons could seize the Treasury Department, Congress, etc., but not take the President and his family hostage.

Do we have twenty-five feudal barons, by which I mean the global IT companies, willing to join together to enforce a Magna Carta for the Web on nations and principalities?

Without enforcers, a modern Magna Carta for the Web will be a pale imitation of its inspiration.

May 28, 2014

The Deep Web you don’t know about

Filed under: Deep Web,Tor,WWW — Patrick Durusau @ 4:27 pm

The Deep Web you don’t know about by Jose Pagliery.

From the post:

Then there’s Tor, the darkest corner of the Internet. It’s a collection of secret websites (ending in .onion) that require special software to access them. People use Tor so that their Web activity can’t be traced — it runs on a relay system that bounces signals among different Tor-enabled computers around the world.

(video omitted)

It first debuted as The Onion Routing project in 2002, made by the U.S. Naval Research Laboratory as a method for communicating online anonymously. Some use it for sensitive communications, including political dissent. But in the last decade, it’s also become a hub for black markets that sell or distribute drugs (think Silk Road), stolen credit cards, illegal pornography, pirated media and more. You can even hire assassins.

If you take the figures of 54% of the deep web being databases, plus the 13% said to be on intranets, that leaves 33% of the deep web unaccounted for. How much of that is covered by Tor is hard to say.

But, we can intelligently guess that search doesn’t work any better in Tor than other segments of the Web, deep or not.

Given the risk of using even the Tor network (see Online privacy is dead by Jose Pagliery, on the NSA vs. Silk Road), finding what you want efficiently could be worth a premium price.

Is guarding online privacy the tipping point for paid collocation services?

May 23, 2014

The Secret History of Hypertext

Filed under: Hypertext,WWW — Patrick Durusau @ 2:23 pm

The Secret History of Hypertext by Alex Wright.

From the post:

When Vannevar Bush’s “As We May Think” first appeared in The Atlantic’s pages in July 1945, it set off an intellectual chain reaction that resulted, more than four decades later, in the creation of the World Wide Web.

In that landmark essay, Bush described a hypothetical machine called the Memex: a hypertext-like device capable of allowing its users to comb through a large set of documents stored on microfilm, connected via a network of “links” and “associative trails” that anticipated the hyperlinked structure of today’s Web.

Historians of technology often cite Bush’s essay as the conceptual forerunner of the Web. And hypertext pioneers like Douglas Engelbart, Ted Nelson, and Tim Berners-Lee have all acknowledged their debt to Bush’s vision. But for all his lasting influence, Bush was not the first person to imagine something like the Web.

Alex identifies several inventors in the early 20th century who proposed systems quite similar to Vannevar Bush’s, prior to the publication of “As We May Think”. A starting place that may get you interested in learning the details of these alternate proposals.

Personally I would separate the notion of “hypertext” from the notion of networking remote sites together (a conflation made not by Bush but by others), and that separation pushes the history of hypertext much further back in time.

Enjoy!

I first saw this in a tweet by Ed H. Chi.

May 20, 2014

Theorizing the Web, an experience

Filed under: WWW — Patrick Durusau @ 4:14 pm

Theorizing the Web, an experience by Chas Emerick.

From the post:

Last week, I attended Theorizing the Web (TtW). I can say without hesitation that it was one of the most challenging, enlightening, and useful conference experiences I’ve ever had. I’d like to provide a summary account of my experience, and maybe offer some (early, I’m still processing) personal takeaways that might be relevant to you, especially if you are involved professionally in building the software and technology that is part of what is theorized at TtW.

The first thing you need to know is that TtW is not a technology conference. Before I characterize it positively though, it’s worth considering the conference’s own statement:

Theorizing the Web is an inter- and non-disciplinary annual conference that brings together scholars, journalists, artists, activists, and commentators to ask big questions about the interrelationships between the Web and society.

While there were a few technologists in attendance, even fewer were presenting. As it said on the tin, TtW was fundamentally about the social, media, art, legal, and political aspects and impacts of the internet and related technologies.

Before I enumerate some of my highlights of TtW, I want to offer some context of my own, a thread that mostly winds around:

When I saw the tweet by Chas, I thought this was a technical conference, but I quickly learned my error. 😉

Before you watch videos from the conference, Theorizing the Web, take a slow read of Chas’ post.

Whether you will draw the same conclusions as Chas or different ones remains to be seen. What is clear from his post is that this conference covered many subjects that aren’t visible at many other conferences.

If you have a favorite video from the conference let me know. I will be watching at least some of them before offering my perspective.

May 12, 2014

High-Performance Browser Networking

Filed under: Networks,Topic Map Software,WWW — Patrick Durusau @ 10:42 am

High-Performance Browser Networking by Ilya Grigorik.

From the foreword:

In High Performance Browser Networking, Ilya explains many whys of networking: Why latency is the performance bottleneck. Why TCP isn’t always the best transport mechanism and UDP might be your better choice. Why reusing connections is a critical optimization. He then goes even further by providing specific actions for improving networking performance. Want to reduce latency? Terminate sessions at a server closer to the client. Want to increase connection reuse? Enable connection keep-alive. The combination of understanding what to do and why it matters turns this knowledge into action.

Ilya explains the foundation of networking and builds on that to introduce the latest advances in protocols and browsers. The benefits of HTTP 2.0 are explained. XHR is reviewed and its limitations motivate the introduction of Cross-Origin Resource Sharing. Server-Sent Events, WebSockets, and WebRTC are also covered, bringing us up to date on the latest in browser networking.

Viewing the foundation and latest advances in networking from the perspective of performance is what ties the book together. Performance is the context that helps us see the why of networking and translate that into how it affects our website and our users. It transforms abstract specifications into tools that we can wield to optimize our websites and create the best user experience possible. That’s important. That’s why you should read this book.
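The connection-reuse point is easy to see from client code. A minimal Python sketch (my own, not an example from the book), assuming the requests library:

    # Connection reuse: a Session keeps the TCP (and TLS) connection alive
    # across requests, avoiding a fresh handshake for every call.
    import requests

    # One-off requests: each call may open and tear down its own connection.
    for path in ("/", "/about", "/contact"):
        requests.get("https://example.com" + path, timeout=10)

    # Reused connection: the Session pools connections via HTTP keep-alive.
    with requests.Session() as session:
        for path in ("/", "/about", "/contact"):
            session.get("https://example.com" + path, timeout=10)

On a high-latency link the difference is dominated by the handshakes you didn’t repeat, which is exactly the “why” the book is after.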

Network latency may be responsible for a non-responsive app but can you guess who the user is going to blame?

Right in one, the app!

“Not my fault” isn’t a line item on any bank deposit form.

You or someone on your team needs to be tasked with performance, including reading High-Performance Browser Networking.

I first saw this in a tweet by Jonas Bonér.

April 27, 2014

Net Neutrality – Priority Check

Filed under: WWW — Patrick Durusau @ 3:36 pm

I remain puzzled over the “sky is falling” responses to rumors about possible FCC rules on Net Neutrality (NN). (See: New York Times, The Guardian and numerous others.) There are no proposed rules at the moment but a lack of content for comment hasn’t slowed the production of commentary.

Should I be concerned about Netflix being set upon by an even more rapacious predator (Comcast)? (A common NN example.) What priority should NN have among the issues vying for my attention? (Is net neutrality dying? Has the FCC killed it? What comes next? Here’s what you need to know)

Every opinion is from a point of view and mine is from the perspective of a lifetime of privilege, at least when compared to the vast majority of humanity. So what priority does NN have among the world at large? For one answer to that question, I turned to the MyWorld2015 Project.

MY World is a United Nations global survey for citizens. Working with partners, we aim to capture people’s voices, priorities and views, so world leaders can be informed as they begin the process of defining the next set of global goals to end poverty.

(chart of MY World survey priorities omitted)

If I am reading the chart correctly, “Phone and internet access” comes in at #14.

Perhaps being satiated with goods and services for the first thirteen priorities makes NN loom large.

Having 95% of all possible privileges isn’t the same as having 96% of all possible privileges.*

*(Estimate. Actual numbers for some concerned residents of the United States are significantly higher than 96%.)


Just in case you are interested:

FCC Inbox for Open Internet Comments

Tentative Agenda for 15 May 2014 Meeting, which includes the Open Internet. The Open Meeting is scheduled to commence at 10:30 a.m. in Room TW-C305, at 445 12th Street, S.W., Washington, D.C. The event will be shown live at FCC.gov/live.

FCC website

April 11, 2014

Navigating the WARC File Format

Filed under: Common Crawl,WWW — Patrick Durusau @ 1:22 pm

Navigating the WARC File Format by Stephen Merity.

From the post:

Recently CommonCrawl has switched to the Web ARChive (WARC) format. The WARC format allows for more efficient storage and processing of CommonCrawl’s free multi-billion page web archives, which can be hundreds of terabytes in size.

This document aims to give you an introduction to working with the new format, specifically the difference between:

  • WARC files which store the raw crawl data
  • WAT files which store computed metadata for the data stored in the WARC
  • WET files which store extracted plaintext from the data stored in the WARC

If you want all the nitty gritty details, the best source is the ISO standard, for which the final draft is available.

If you’re more interested in diving into code, we’ve provided three introductory examples in Java that use the Hadoop framework to process WAT, WET and WARC.

If you aren’t already using Common Crawl data, you should be.

Fresh Data Available:

The latest dataset is from March 2014, contains approximately 2.8 billion webpages and is located
in Amazon Public Data Sets at /common-crawl/crawl-data/CC-MAIN-2014-10.
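If you want to poke at the files directly, here is a minimal Python sketch of walking a WARC file, assuming the warcio library (the filename is a placeholder):

    # Iterate over response records in a (gzipped) WARC file, printing URL and size.
    # Assumes the 'warcio' library; the filename is a placeholder.
    from warcio.archiveiterator import ArchiveIterator

    with open("CC-MAIN-example.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                url = record.rec_headers.get_header("WARC-Target-URI")
                body = record.content_stream().read()
                print(url, len(body))

The WAT and WET files can be read the same way; only the record types and payloads differ.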

What are you going to look for in 2.8 billion webpages?

March 19, 2014

The Seven Parts of “HTML 5 Fundamentals”

Filed under: CSS3,HTML5,WWW — Patrick Durusau @ 2:08 pm

The Seven Parts of “HTML 5 Fundamentals” by Greg Duncan.

From the post:

It’s Web Wednesday and today we’re going to take a step back and share a series by David Giard, who’s going to give us a fundamental look at HTML5. Oh, I know YOU don’t need this, but you might have a "friend" who does (cough… like me… cough…).

HTML 5 Fundamentals

Read this series of articles to learn more about HTML5 and CSS3

Part 1- An Introduction to HTML5

Part 2 – New Tags

Part 3 – New Attributes

Part 4 – New Input Types

Part 5 – CSS3

Part 6 – More CSS3

Part 7 – HTML5 JavaScript APIs

Part 1 has this jewel:

Due to the enormous scope of HTML5 and the rate at which users tend to upgrade to new browsers, it is unlikely that HTML5 will be on all computers for another decade.

Let’s see, a web “year” is approximately 3 months according to Tim Berners-Lee, so in forty (40) web years, HTML5 will be on all computers.

That’s a long time to wait so I would suggest learning features as they are supported by the top three browsers. You won’t ever be terribly behind and at the same time, your webpages will keep working for the vast majority of users.

That would make an interesting listing if it doesn’t exist already. The features of HTML5 as a matrix against the top three browsers.

Legend for the matrix: One browser – start learning, Two browsers – start writing, Three browsers – deploy.

Yes?
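A toy Python sketch of that legend; the feature and browser entries are made up for illustration, and real support data would have to come from something like caniuse.com:

    # Toy HTML5 feature matrix: classify features by how many of the top three
    # browsers support them. The support data below is illustrative only.
    TOP_THREE = {"Chrome", "Firefox", "IE"}

    support = {
        "canvas": {"Chrome", "Firefox", "IE"},
        "websockets": {"Chrome", "Firefox"},
        "web-components": {"Chrome"},
    }

    LEGEND = {1: "start learning", 2: "start writing", 3: "deploy"}

    for feature, browsers in support.items():
        count = len(browsers & TOP_THREE)
        print(feature, "supported by", count, "of 3 ->", LEGEND.get(count, "wait"))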

I first saw this in a tweet by Microsoft Channel 9.

January 26, 2014

Pricing “The Internet of Everything”

Filed under: Transparency,WWW — Patrick Durusau @ 8:11 pm

I was reading Embracing the Internet of Everything To Capture Your Share of $14.4 Trillion by Joseph Bradley, Joel Barbier, and Doug Handler, when I realized their projected Value at Stake of $14.4 trillion left out an important number. The price for an Internet of Everything.

A total price is usually calculated as product price multiplied by quantity. Let’s start there to evaluate Cisco’s pricing.

In How Many Things Are Currently Connected To The “Internet of Things” (IoT)?, appearing in Forbes, Rob Soderberry, Cisco Executive, said that:

the number of connected devices reached 8.7 billion in 2012.

The Internet of Everything (IoE) paper projects 50 billion “things” being connected by 2020.

Roughly that’s 41.3 billion more connections than exist at present.

Let’s take some liberties with Cisco’s numbers. Assume the networking in each device, leaving aside the cost of a new device with networking capability, is $10. So $10 times 41.3 billion connections = $413 billion. The projected ROI just dropped from $14.4 trillion to about $14 trillion.

Let’s further assume that Internet connectivity has radically dropped in price and is only $10 per month. For our additional 41.3 billion devices, that’s $10 times 41.3 billion things times 12 months, or roughly $4.96 trillion per year. The projected ROI just dropped to about $9 trillion.
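The back-of-envelope arithmetic, spelled out in Python (the $10 figures are the liberties taken above, not Cisco’s numbers):

    # Back-of-envelope arithmetic for the figures above.
    value_at_stake = 14.4e12            # Cisco's projected Value at Stake, USD
    new_connections = 41.3e9            # additional connected "things" by 2020

    hardware = new_connections * 10           # $10 of networking per device
    connectivity = new_connections * 10 * 12  # $10/month per device, for a year

    print(hardware / 1e12)                                    # ~0.41 trillion
    print(connectivity / 1e12)                                # ~4.96 trillion
    print((value_at_stake - hardware - connectivity) / 1e12)  # ~9.03 trillion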

I say the ROI “dropped,” but that’s not really true. Someone is getting paid for Internet access, the infrastructure to support it, etc. Can you spell “C-i-s-c-o?”

In terms of complexity, consider Mark Zuckerberg’s (Facebook founder) Internet.org, which is working with Ericsson, MediaTek, Nokia, Opera, Qualcomm, and Samsung:

to help bring web access to the five billion people who are not yet connected. (From: Mark Zuckerberg launches Internet.org to help bring web access to the whole world by Mark Wilson.)

A coalition of major players working on connecting 5 billion people versus Cisco’s hand waving about connecting 50 billion “things.”

That’s not a cost estimate but it does illustrate the enormity of the problem of creating the IoE.

But the cost of the proposed IoE isn’t just connecting to the Internet.

For commercial ground vehicles the Cisco report says:

As vehicles become more connected with their environment (road, signals, toll booths, other vehicles, air quality reports, inventory systems), efficiencies and safety greatly increase. For example, the driver of a vending-machine truck will be able to look at a panel on the dashboard to see exactly which locations need to be replenished. This scenario saves time and reduces costs.

Just taking roads and signals, do you know how much is spent on highway and street construction in the United States every year?

Would you believe it runs between $77 billion and $83+ billion a year? US Highway and Street Construction Spending: $82.09 billion (seasonally adjusted annual rate) for Nov 2013.

And the current state of road infrastructures in the United States?

Forty-two percent of America’s major urban highways remain congested, costing the economy an estimated $101 billion in wasted time and fuel annually. While the conditions have improved in the near term, and Federal, state, and local capital investments increased to $91 billion annually, that level of investment is insufficient and still projected to result in a decline in conditions and performance in the long term. Currently, the Federal Highway Administration estimates that $170 billion in capital investment would be needed on an annual basis to significantly improve conditions and performance. (2013 Report Card: Roads D+. For more infrastructure reports see: 2013 Report Card )

I read that to say an estimated $170 billion is needed annually just to improve current roads. Yes?

That doesn’t include the costs of Internet infrastructure, the delivery vehicle, other vehicles, inventory systems, etc.

I am certain that however and whenever the Internet of Things comes into being, Cisco, as part of the core infrastructure now, will prosper. I can see Cisco’s ROI from the IoE.

What I don’t see is the ROI for the public or private sector, even assuming the Cisco numbers are spot on.

Why? Because there is no price tag for the infrastructure to make the IoE a reality. Someone, maybe a lot of someones, will be paying that cost.

If you encounter costs estimates sufficient for players in the public or private sectors to make their own ROI calculations, please point them out. Thanks!

PS: A future Internet more to my taste would have tagged Cisco’s article with “speculation,” “no cost data,” etc. as aids for unwary readers.

PPS: Apologies for only U.S. cost figures. Other countries will have similar issues but I am not as familiar with where to find their infrastructure data.

January 7, 2014

Small Crawl

Filed under: Common Crawl,Data,Webcrawler,WWW — Patrick Durusau @ 7:40 pm

meanpath Jan 2014 Torrent – 1.6TB of crawl data from 115m websites

From the post:

October 2012 was the official kick off date for development of meanpath – our source code search engine. Our goal was to crawl as much of the web as we could using mostly open source software and a decent (although not Google level) financial investment. Outside of many substantial technical challenges, we also needed to acquire a sizeable list of seed domains as the starting block for our crawler. Enter Common Crawl which is an open crawl of the web that can be accessed and analysed by everyone. Of specific interest to us was the Common Crawl URL Index which we combined with raw domain zone files and domains from the Internet Census 2012 to create our master domain list.

We are firm supporters of open access to information which is why we have chosen to release a free crawl of over 115 million sites. This index contains only the front page HTML, robots.txt, favicons, and server headers of every crawlable .com, .net, .org, .biz, .info, .us, .mobi, and .xxx that were in the 2nd of January 2014 zone file. It does not execute or follow JavaScript or CSS so is not 100% equivalent to what you see when you click on view source in your browser. The crawl itself started at 2:00am UTC 4th of January 2014 and finished the same day.

Get Started:
You can access the meanpath January 2014 Front Page Index in two ways:

  1. Bittorrent – We have set up a number of seeds that you can download from using this descriptor. Please seed if you can afford the bandwidth and make sure you have 1.6TB of disk space free if you plan on downloading the whole crawl.
  2. Web front end – If you are not interested in grappling with the raw crawl files you can use our web front end to do some sample searches.

Data Set Statistics:

  1. 149,369,860 seed domains. We started our crawl with a full zone file list of all domains in the .com (112,117,307), .net (15,226,877), .org (10,396,351), .info (5,884,505), .us (1,804,653), .biz (2,630,676), .mobi (1,197,682) and .xxx (111,809) top level domains (TLD) for a total of 149,369,860 domains. We have a much larger set of domains that cover all TLDs but very few allow you to download a zone file from the registrar so we cannot guarantee 100% coverage. For statistical purposes having a defined 100% starting point is necessary.
  2. 115,642,924 successfully crawled domains. Of the 149,369,860 domains only 115,642,924 were able to be crawled which is a coverage rate of 77.42%
  3. 476 minutes of crawling. It took us a total of 476 minutes to complete the crawl which was done in 5 passes. If a domain could not be crawled in the first pass we tried 4 more passes before giving up (those excluded by robots.txt are not retried). The most common reason domains are not able to be crawled is a lack of any valid A record for domain.com or www.domain.com
  4. 1,500GB of uncompressed data. This has been compressed down to 352.40gb using gzip for ease of download.

I just scanned the Net for 2TB hard drives and the average runs between $80 and $100. There doesn’t seem to be much difference between internal and external.

The only issue I foresee is that some ISPs limit downloads. You can always tunnel to another box using SSH but that requires enough storage on the other box as well.

Be sure to check out meanpath’s search capabilities.

Perhaps the day of boutique search engines is getting closer!

December 31, 2013

NSA Cloud On The “Open Internet”

Filed under: Cybersecurity,NSA,Security,WWW — Patrick Durusau @ 11:22 am

The FCC defines the “Open Internet” as:

The “Open Internet” is the Internet as we know it. It’s open because it uses free, publicly available standards that anyone can access and build to, and it treats all traffic that flows across the network in roughly the same way. The principle of the Open Internet is sometimes referred to as “net neutrality.” Under this principle, consumers can make their own choices about what applications and services to use and are free to decide what lawful content they want to access, create, or share with others. This openness promotes competition and enables investment and innovation.

The Open Internet also makes it possible for anyone, anywhere to easily launch innovative applications and services, revolutionizing the way people communicate, participate, create, and do business—think of email, blogs, voice and video conferencing, streaming video, and online shopping. Once you’re online, you don’t have to ask permission or pay tolls to broadband providers to reach others on the network. If you develop an innovative new website, you don’t have to get permission to share it with the world.

Pay particular attention to the line:

This openness promotes competition and enables investment and innovation.

The National Security Agency (NSA) and other state-sponsored cyber-criminals are dark clouds on that “openness.”

For years, many of us have seen:

(image omitted: Microsoft error report dialog)

But as the Spiegel staff report in: Inside TAO: Documents Reveal Top NSA Hacking Unit

NSA staff capture such reports and mock Microsoft with slides such as:

(image omitted: NSA slide mocking Microsoft error reporting)

(Both of the images are from the Spiegel story.)

It doesn’t require a lot of imagination to realize that Microsoft will have to rework its error reporting systems to encrypt such reports, resulting in more overhead for users, the Internet and Microsoft.

Other software vendors and services will be following suit, adding more cost and complexity to services on the Internet, rather than making services more innovative and useful.

The NSA and other state-sponsored cyber-criminals are a very dark cloud over the very idea of an “open Internet.”

What investments will be made to spur competition and innovation on the Internet in the future is unknown. What we do know is that left unchecked, the NSA and other state-sponsored cyber-criminals are going to make security, not innovation, the first priority in investment.

State-sponsored cyber-criminals are far more dangerous than state-sponsored terrorists. Terrorists harm a few people today. Cyber-criminals are stealing the future from everyone.

PS: The Spiegel story is in three parts: Part 1: Documents Reveal Top NSA Hacking Unit, Part 2: Targeting Mexico, Part 3: The NSA’s Shadow Network. Highly recommended for your reading.

November 28, 2013

2013 Arrives! (New Crawl Data)

Filed under: Common Crawl,Data,Dataset,WWW — Patrick Durusau @ 10:56 am

New Crawl Data Available! by Jordan Mendelson.

From the post:

We are very please to announce that new crawl data is now available! The data was collected in 2013, contains approximately 2 billion web pages and is 102TB in size (uncompressed).

We’ve made some changes to the data formats and the directory structure. Please see the details below and please share your thoughts and questions on the Common Crawl Google Group.

Format Changes

We have switched from ARC files to WARC files to better match what the industry has standardized on. WARC files allow us to include HTTP request information in the crawl data, add metadata about requests, and cross-reference the text extracts with the specific response that they were generated from. There are also many good open source tools for working with WARC files.

We have switched the metadata files from JSON to WAT files. The JSON format did not allow specifying the multiple offsets to files necessary for the WARC upgrade and WAT files provide more detail.


We have switched our text file format from Hadoop sequence files to WET files (WARC Encapsulated Text) that properly reference the original requests. This makes it far easier for your processes to disambiguate which text extracts belong to which specific page fetches.

Jordan continues to outline the directory structure of the 2013 crawl data and lists additional resources that will be of interest.

If you aren’t Google or some reasonable facsimile thereof (yet), the Common Crawl data set is your doorway into the wild wild content of the WWW.

How do your algorithms fare when matched against the full range of human expression?

August 28, 2013

the BOMB in the GARDEN

Filed under: Marketing,W3C,WWW — Patrick Durusau @ 6:17 pm

the BOMB in the GARDEN by Matthew Butterick.

From the post:

It’s now or never for the web. The web is a medium for creators, including designers. But after 20 years, the web still has no culture of design excellence. Why is that? Because design excellence is inhibited by two structural flaws in the web. First flaw: the web is good at making information free, but terrible at making it expensive. So the web has had to rely largely on an advertising economy, which is weakening under the strain. Second flaw: the process of adopting and enforcing web standards, as led by the W3C, is hopelessly broken. Evidence of both these flaws can be seen in a) the low design quality across the web, and b) the speed with which publishers, developers, and readers are migrating away from the web, and toward app platforms and media platforms. This evidence strongly suggests that the web is on its way to becoming a second-class platform. To address these flaws, I propose that the W3C be disbanded, and that the leadership of the web be reorganized around open-source software principles. I also encourage designers to advocate for a better web, lest they find themselves confined to a shrinking territory of possibilities.

Apologies to Matthew for my mangling of the typography of his title.

This rocks!

This is one of those rare, read this at least once a month posts.

That is if you want to see a Web that supports high quality design and content.

If you like the current low quality, ad driven Web, just ignore it.

August 22, 2013

Three RDFa Recommendations Published

Filed under: HTML5,RDF,RDFa,WWW — Patrick Durusau @ 2:52 pm

Three RDFa Recommendations Published

From the announcement:

  • HTML+RDFa 1.1, which defines rules and guidelines for adapting the RDFa Core 1.1 and RDFa Lite 1.1 specifications for use in HTML5 and XHTML5. The rules defined in this specification not only apply to HTML5 documents in non-XML and XML mode, but also to HTML4 and XHTML documents interpreted through the HTML5 parsing rules.
  • The group also published two Second Editions for RDFa Core 1.1 and XHTML+RDFa 1.1, folding in the errata reported by the community since their publication as Recommendations in June 2012; all changes were editorial.
  • The group also updated the RDFa 1.1 Primer.

The deeper I get into HTML+RDFa 1.1, the more I think a random RDFa generator would be an effective weapon against government snooping.

Something copies some percentage of your text and places it in a comment and generates random RDFa 1.1 markup for it, thus: <!-- your content + RDFa -->.

Improves the stats for the usage of RDFa 1.1 and if the government tries to follow all the RDFa 1.1 rules, well, let’s just say they will have less time for other mischief. 😉
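In the spirit of the joke, a Python sketch of such a generator; the vocabulary terms are arbitrary examples and the output is wrapped in an HTML comment so it never renders:

    # Joke sketch: wrap a slice of your text in random RDFa-looking markup
    # inside an HTML comment. Property and type names are arbitrary examples.
    import random

    PROPERTIES = ["schema:name", "schema:about", "dc:subject", "foaf:topic"]
    TYPES = ["schema:Thing", "schema:CreativeWork", "foaf:Document"]

    def rdfa_chaff(text, fraction=0.3):
        snippet = text[: max(1, int(len(text) * fraction))]
        prop = random.choice(PROPERTIES)
        typeof = random.choice(TYPES)
        return '<!-- <span typeof="%s" property="%s">%s</span> -->' % (typeof, prop, snippet)

    print(rdfa_chaff("The deeper I get into HTML+RDFa 1.1, the more I think..."))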

July 7, 2013

import.io

Filed under: Data,ETL,WWW — Patrick Durusau @ 4:18 pm

import.io

The steps listed by import.io on its “How it works” page:

Find: Find an online source for your data, whether it’s a single web page or a search engine within a site. Import•io doesn’t discriminate; it works with any web source.

Extract: When you have identified the data you want, you can begin to extract it. The first stage is to highlight the data that you want. You can do this by giving us a few examples and our algorithms will identify the rest. The next stage is to organise your data. This is as simple as creating columns to sort parts of the data into, much like you would do in a spreadsheet. Once you have done that we will extract the data into rows and columns.

If you want to use the data once, or infrequently, you can stop here. However, if you would like a live connection to the data or want to be able to access it programatically, the next step will create a real-time connection to the data.

Connect: This stage will allow you to create a real-time connection to the data. First you have to record how you obtained the data you extracted. Second, give us a couple of test cases so we can ensure that, if the website changes, your connection to the data will remain live.

Mix: One of the most powerful features of the platform is the ability to mix data from many sources to form a single data set. This allows you to create incredibly rich data sets by combing hundred of underlying data points from many different websites and access them via the application or API as a single source. Mixing is as easy a clicking the sources you want to mix together and saving that mix as a new real-time data set.

Use: Simply copy your data into your favourite spreadsheet software or use our APIs to access it in an application.
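For comparison, a do-it-yourself Python sketch of the “Extract” step, assuming requests and BeautifulSoup; the URL and selectors are hypothetical, and this is the kind of work import.io is promising to automate:

    # DIY extraction: pull rows and columns out of a page by hand.
    # Assumes 'requests' and 'beautifulsoup4'; URL and selectors are hypothetical.
    import requests
    from bs4 import BeautifulSoup

    html = requests.get("http://example.com/listings", timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    rows = []
    for item in soup.select("div.listing"):   # hypothetical page markup
        rows.append({
            "title": item.select_one("h2").get_text(strip=True),
            "price": item.select_one(".price").get_text(strip=True),
        })

    print(rows)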

Developer preview but interesting for a couple of reasons.

First simply as an import service. I haven’t tried it (yet) so your mileage may vary. Reports welcome.

Second, I like the (presented) ease of use approach.

Imagine a topic map application for some specific domain that was as matter-of-fact as what I quote above.

Something to think about.

June 18, 2013

Shortfall of Linked Data

Filed under: Linked Data,LOD,Semantics,WWW — Patrick Durusau @ 8:58 am

Preparing a presentation I stumbled upon a graphic illustration of why we need better semantic techniques for the average author:

Linked Data in 2011:

(image omitted: Linked Open Data cloud diagram, 2011)

Versus the WWW:

(image omitted: visualization of the WWW)

This must be why you don’t see any updated linked data clouds. The comparison is too shocking.

Particularly when you remember the WWW itself is only part of a much larger data cloud. (Ask the NSA about the percentages.)

Data is being produced every day, pushing us further and further behind with regard to its semantics. (And making the linked data cloud an even smaller percentage of all data.)

Authors have semantics in mind when they write.

The question: how do we capture those semantics in machine-readable form nearly as seamlessly as authors write?

Suggestions?

May 15, 2013

NSA — Untangling the Web: A Guide to Internet Research

Filed under: Humor,Requirements,Research Methods,WWW — Patrick Durusau @ 2:28 pm

NSA — Untangling the Web: A Guide to Internet Research

A Freedom of Information Act (FOIA) request caused the NSA to disgorge its guide to web research, which is some six years out of date.

From the post:

The National Security Agency just released “Untangling the Web,” an unclassified how-to guide to Internet search. It’s a sprawling document, clocking in at over 650 pages, and is the product of many years of research and updating by a NSA information specialist whose name is redacted on the official release, but who is identified as Robyn Winder of the Center for Digital Content on the Freedom of Information Act request that led to its release.

It’s a droll document on many levels. First and foremost, it’s funny to think of officials who control some of the most sophisticated supercomputers and satellites ever invented turning to a .pdf file for tricks on how to track down domain name system information on an enemy website. But “Untangling the Web” isn’t for code-breakers or wire-tappers. The target audience seems to be staffers looking for basic factual information, like the preferred spelling of Kazakhstan, or telephonic prefix information for East Timor.

I take it as guidance on how “good” your application or service needs to be to pitch to the government.

I keep thinking that to attract government attention, an application needs to fall just short of solving P = NP.

On the contrary, the government needs spell checkers, phone information and no doubt lots of other dull information, quickly.

Perhaps an app that signals fresh doughnuts from bakeries within X blocks would be just the thing. 😉

May 13, 2013

Seventh ACM International Conference on Web Search and Data Mining

Filed under: Conferences,Data Mining,Searching,WWW — Patrick Durusau @ 10:08 am

WSDM 2014 : Seventh ACM International Conference on Web Search and Data Mining

Abstract submission deadline: August 19, 2013
Paper submission deadline: August 26, 2013
Tutorial proposals due: September 9, 2013
Tutorial and paper acceptance notifications: November 25, 2013
Tutorials: February 24, 2014
Main Conference: February 25-28, 2014

From the call for papers:

WSDM (pronounced “wisdom”) is one of the premier conferences covering research in the areas of search and data mining on the Web. The Seventh ACM WSDM Conference will take place in New York City, USA during February 25-28, 2014.

WSDM publishes original, high-quality papers related to search and data mining on the Web and the Social Web, with an emphasis on practical but principled novel models of search, retrieval and data mining, algorithm design and analysis, economic implications, and in-depth experimental analysis of accuracy and performance.

WSDM 2014 is a highly selective, single track meeting that includes invited talks as well as refereed full papers. Topics covered include but are not limited to:

(…)

Papers emphasizing novel algorithmic approaches are particularly encouraged, as are empirical/analytical studies of specific data mining problems in other scientific disciplines, in business, engineering, or other application domains. Application-oriented papers that make innovative technical contributions to research are welcome. Visionary papers on new and emerging topics are also welcome.

Authors are explicitly discouraged from submitting papers that do not present clearly their contribution with respect to previous works, that contain only incremental results, and that do not provide significant advances over existing approaches.

Sets a high bar but one that can be met.

Would be very nice PR to have a topic map paper among those accepted.

May 1, 2013

Vote for Web Science MOOC!

Filed under: CS Lectures,WWW — Patrick Durusau @ 9:05 am

Please help me to realize my Web science massive open online course by René Pickhardt.

René has designed a Web Science MOOC but needs your vote at: https://moocfellowship.org/submissions/web-science to get the course funded.

Details on the course are at: Please help me to realize my Web science massive open online course.

The Web is important but to be honest, I am hopeful success here will encourage René to do a MOOC on graphs.

So I have an ulterior motive for promoting this particular MOOC. 😉

April 6, 2013

Ultimate library challenge: taming the internet

Filed under: Data,Indexing,Preservation,Search Data,WWW — Patrick Durusau @ 3:40 pm

Ultimate library challenge: taming the internet by Jill Lawless.

From the post:

Capturing the unruly, ever-changing internet is like trying to pin down a raging river. But the British Library is going to try.

For centuries, the library has kept a copy of every book, pamphlet, magazine and newspaper published in Britain. Starting on Saturday, it will also be bound to record every British website, e-book, online newsletter and blog in a bid to preserve the nation’s ”digital memory”.

As if that’s not a big enough task, the library also has to make this digital archive available to future researchers – come time, tide or technological change.

The library says the work is urgent. Ever since people began switching from paper and ink to computers and mobile phones, material that would fascinate future historians has been disappearing into a digital black hole. The library says firsthand accounts of everything from the 2005 London transit bombings to Britain’s 2010 election campaign have already vanished.

”Stuff out there on the web is ephemeral,” said Lucie Burgess the library’s head of content strategy. ”The average life of a web page is only 75 days, because websites change, the contents get taken down.

”If we don’t capture this material, a critical piece of the jigsaw puzzle of our understanding of the 21st century will be lost.”

For more details, see Jill’s post, or Click to save the nation’s digital memory (British Library press release), or 100 websites: Capturing the digital universe (a sample of results from archiving only 100 sites).

The content gathered by the project will be made available to the public.

A welcome venture, particularly since the results will be made available to the public.

It’s an unanswerable question, but I do wonder how we would view Greek drama if all of it had been preserved.

Hundreds if not thousands of plays were written and performed every year.

The Complete Greek Drama lists only forty-seven (47) that have survived to this day.

If wholesale preservation is the first step, how do we preserve paths to what’s worth reading in a data labyrinth as a second step?

I first saw this in a tweet by Jason Ronallo.

March 26, 2013

Our Internet Surveillance State [Intelligence Spam]

Filed under: Privacy,WWW — Patrick Durusau @ 3:21 pm

Our Internet Surveillance State by Bruce Schneier.

Nothing like a good rant to get your blood pumping during a snap of cold weather! 😉

Bruce writes:

Maintaining privacy on the Internet is nearly impossible. If you forget even once to enable your protections, or click on the wrong link, or type the wrong thing, and you’ve permanently attached your name to whatever anonymous service you’re using. Monsegur slipped up once, and the FBI got him. If the director of the CIA can’t maintain his privacy on the Internet, we’ve got no hope.

In today’s world, governments and corporations are working together to keep things that way. Governments are happy to use the data corporations collect — occasionally demanding that they collect more and save it longer — to spy on us. And corporations are happy to buy data from governments. Together the powerful spy on the powerless, and they’re not going to give up their positions of power, despite what the people want.

And welcome to a world where all of this, and everything else that you do or is done on a computer, is saved, correlated, studied, passed around from company to company without your knowledge or consent; and where the government accesses it at will without a warrant.

Welcome to an Internet without privacy, and we’ve ended up here with hardly a fight.

I don’t disagree with anything Bruce writes but I do not counsel despair.

Nor would I suggest anyone stop using the “Internet, email, cell phones, web browser, social networking sites, search engines,” in order to avoid spying.

But remember that one of the reasons U.S. intelligence services have fallen on hard times is the increased reliance on “easy” data to collect.

Clipping articles from newspapers, or now copying-and-pasting from emails and online zines, isn’t the same as having culturally aware human resources on the ground.

“Easy” data collection is far cheaper, but also less effective.

My suggestion is that everyone go “bare” and load up all listeners with as much junk as humanly possible.

Intelligence “spam” as it were.

Routinely threaten to murder fictitious characters in books or conspire to kidnap them. Terror plots, threats against Alderaan, for example.

Apparently even absurd threats, ‘One Definition of “Threat”,’ cannot be ignored.

A proliferation of fictional threats will leave them too little time to spy people going about their lawful activities.

BTW, not legal advice but I have heard that directly communicating any threat to any law enforcement agency is a crime. And not a good idea in any event.

Nor should you threaten any person or place or institution that isn’t entirely and provably fictional.

When someone who thinks mining social networking sites is a blow against terrorism overhears DC Comics characters being threatened, that should be enough.

March 13, 2013

Aaron Swartz’s A Programmable Web: An Unfinished Work

Filed under: Semantic Web,Semantics,WWW — Patrick Durusau @ 3:04 pm

Aaron Swartz’s A Programmable Web: An Unfinished Work

Abstract:

This short work is the first draft of a book manuscript by Aaron Swartz written for the series “Synthesis Lectures on the Semantic Web” at the invitation of its editor, James Hendler. Unfortunately, the book wasn’t completed before Aaron’s death in January 2013. As a tribute, the editor and publisher are publishing the work digitally without cost.

From the author’s introduction:

” . . . we will begin by trying to understand the architecture of the Web — what it got right and, occasionally, what it got wrong, but most importantly why it is the way it is. We will learn how it allows both users and search engines to co-exist peacefully while supporting everything from photo-sharing to financial transactions.

We will continue by considering what it means to build a program on top of the Web — how to write software that both fairly serves its immediate users as well as the developers who want to build on top of it. Too often, an API is bolted on top of an existing application, as an afterthought or a completely separate piece. But, as we’ll see, when a web application is designed properly, APIs naturally grow out of it and require little effort to maintain.

Then we’ll look into what it means for your application to be not just another tool for people and software to use, but part of the ecology — a section of the programmable web. This means exposing your data to be queried and copied and integrated, even without explicit permission, into the larger software ecosystem, while protecting users’ freedom.

Finally, we’ll close with a discussion of that much-maligned phrase, ‘the Semantic Web,’ and try to understand what it would really mean.”

Table of Contents: Introduction: A Programmable Web / Building for Users: Designing URLs / Building for Search Engines: Following REST / Building for Choice: Allowing Import and Export / Building a Platform: Providing APIs / Building a Database: Queries and Dumps / Building for Freedom: Open Data, Open Source / Conclusion: A Semantic Web?

Even if, like me, you disagree with Aaron on issues both large and small, it is a very worthwhile read.

But I will save my disagreements for another day. Enjoy the read!

January 22, 2013

Click Dataset [HTTP requests]

Filed under: Dataset,Graphs,Networks,WWW — Patrick Durusau @ 2:41 pm

Click Dataset

From the webpage:

To foster the study of the structure and dynamics of Web traffic networks, we make available a large dataset (‘Click Dataset’) of HTTP requests made by users at Indiana University. Gathering anonymized requests directly from the network rather than relying on server logs and browser instrumentation allows one to examine large volumes of traffic data while minimizing biases associated with other data sources. It also provides one with valuable referrer information to reconstruct the subset of the Web graph actually traversed by users. The goal is to develop a better understanding of user behavior online and create more realistic models of Web traffic. The potential applications of this data include improved designs for networks, sites, and server software; more accurate forecasting of traffic trends; classification of sites based on the patterns of activity they inspire; and improved ranking algorithms for search results.

The data was generated by applying a Berkeley Packet Filter to a mirror of the traffic passing through the border router of Indiana University. This filter matched all traffic destined for TCP port 80. A long-running collection process used the pcap library to gather these packets, then applied a small set of regular expressions to their payloads to determine whether they contained HTTP GET requests.

The data is available under terms and restrictions, including transfer by physical hard drive (~2.5 TB of data).
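For a sense of what that kind of capture looks like, here is a minimal sketch (mine, not the actual collection code) that applies a Berkeley Packet Filter expression via scapy to watch TCP port 80 and regex-matches HTTP GET requests in the payloads:

import re
from scapy.all import sniff, Raw, TCP  # assumes scapy is installed

GET_RE = re.compile(rb"^GET\s+(\S+)\s+HTTP/1\.[01]")

def handle(pkt):
    # Only look at packets that carry a payload on TCP.
    if pkt.haslayer(TCP) and pkt.haslayer(Raw):
        match = GET_RE.match(bytes(pkt[Raw].load))
        if match:
            print(match.group(1).decode(errors="replace"))

# The BPF expression limits capture to web traffic destined for port 80;
# running this normally requires root privileges.
sniff(filter="tcp dst port 80", prn=handle, store=False)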

Intrigued by the notion of a “subset of the Web graph actually traversed by users.”

Does that mean that semantic annotation should occur on the portion of the “…Web graph actually traversed by users” before reaching other parts?

If the language of 4,148,237 English Wikipedia pages is never in doubt for any user, do we really need triples to record that for every page?

January 10, 2013

Common Crawl URL Index

Filed under: Common Crawl,Data,WWW — Patrick Durusau @ 1:48 pm

Common Crawl URL Index by Lisa Green.

From the post:

We are thrilled to announce that Common Crawl now has a URL index! Scott Robertson, founder of triv.io graciously donated his time and skills to creating this valuable tool. You can read his guest blog post below and be sure to check out the triv.io site to learn more about how they help groups solve big data problems.

From Scott’s post:

If you want to create a new search engine, compile a list of congressional sentiment, monitor the spread of Facebook infection through the web, or create any other derivative work, that first starts when you think “if only I had the entire web on my hard drive.” Common Crawl is that hard drive, and using services like Amazon EC2 you can crunch through it all for a few hundred dollars. Others, like the gang at Lucky Oyster, would agree.

Which is great news! However if you wanted to extract only a small subset, say every page from Wikipedia you still would have to pay that few hundred dollars. The individual pages are randomly distributed in over 200,000 archive files, which you must download and unzip each one to find all the Wikipedia pages. Well you did, until now.

I’m happy to announce the first public release of the Common Crawl URL Index, designed to solve the problem of finding the locations of pages of interest within the archive based on their URL, domain, subdomain or even TLD (top level domain).
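The index itself is a purpose-built format, but the underlying idea is simple enough to sketch: keep URL keys sorted so that a domain, subdomain, or TLD becomes a prefix range you can search directly. A toy illustration of my own (not the actual triv.io code or key format):

import bisect

def url_key(url):
    """Reverse the host so related pages sort together, e.g.
    'en.wikipedia.org/wiki/Tur' -> 'org,wikipedia,en)/wiki/Tur'."""
    host, _, path = url.partition("/")
    return ",".join(reversed(host.split("."))) + ")/" + path

# Pretend index: sorted (url_key, archive_file, offset) entries.
entries = sorted(
    (url_key(u), archive, offset)
    for u, archive, offset in [
        ("en.wikipedia.org/wiki/Turing", "crawl-00421.arc.gz", 1048576),
        ("en.wikipedia.org/wiki/Topic_map", "crawl-00007.arc.gz", 2048),
        ("example.com/index.html", "crawl-19999.arc.gz", 512),
    ]
)
keys = [e[0] for e in entries]

def lookup(prefix_url):
    """Return all index entries whose key starts with the given prefix."""
    prefix = url_key(prefix_url)
    start = bisect.bisect_left(keys, prefix)
    results = []
    for key, archive, offset in entries[start:]:
        if not key.startswith(prefix):
            break
        results.append((key, archive, offset))
    return results

print(lookup("en.wikipedia.org/"))

With the real index, the answer to “every page from Wikipedia” becomes a list of archive files and offsets, so you only download what you need.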

What research project would you want to do first?

December 29, 2012

The Top 5 Website UX Trends of 2012

Filed under: Graphics,Interface Research/Design,Usability,Users,WWW — Patrick Durusau @ 7:00 pm

The Top 5 Website UX Trends of 2012

From the post:

User interface techniques continued to evolve in 2012, often blurring the lines between design, usability, and technology in positive ways to create an overall experience that has been both useful and pleasurable.

Infinite scrolling, for example, is a technological achievement that also helps the user by enabling a more seamless experience. Similarly, advances in Web typography have an aesthetic dimension but also represent a movement toward greater clarity of communication.

Quick coverage of:

  1. Single-Page Sites
  2. Infinite Scrolling
  3. Persistent Top Navigation or “Sticky Nav”
  4. The Death of Web 2.0 Aesthetics
  5. Typography Returns

There are examples of each trend, but you are left on your own for the details.

Good time to review your web presence for the coming year.

December 24, 2012

10 Rules for Persistent URIs [Actually only one]: Present of Persistent URIs

Filed under: Linked Data,Semantic Web,WWW — Patrick Durusau @ 2:11 pm

Interoperability Solutions for European Public Administrations got into the egg nog early:

D7.1.3 – Study on persistent URIs, with identification of best practices and recommendations on the topic for the MSs and the EC (PDF) (I’m not kidding, go see for yourself.)

Five (5) positive rules:

  1. Follow the pattern: http://(domain)/(type)/(concept)/(reference)
  2. Re-use existing identifiers
  3. Link multiple representations
  4. Implement 303 redirects for real-world objects
  5. Use a dedicated service

Five (5) negative rules:

  1. Avoid stating ownership
  2. Avoid version numbers
  3. Avoid using auto-increment
  4. Avoid query strings
  5. Avoid file extensions
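For concreteness, here is a minimal sketch (mine, not the report’s) of what rules 1 and 4 look like in practice: identifier URIs follow the http://(domain)/(type)/(concept)/(reference) pattern, and requests for a real-world object answer with a 303 redirect to a document about that object.

from http.server import BaseHTTPRequestHandler, HTTPServer

class Redirector(BaseHTTPRequestHandler):
    def do_GET(self):
        # /id/(type)/(concept)/(reference) identifies a real-world thing;
        # redirect the client to the document that describes it.
        if self.path.startswith("/id/"):
            self.send_response(303)
            self.send_header("Location", "/doc/" + self.path[len("/id/"):])
            self.end_headers()
        else:
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"Document describing " + self.path.encode())

if __name__ == "__main__":
    HTTPServer(("localhost", 8303), Redirector).serve_forever()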

If the goal is “persistent” URIs, only “Use a dedicated service” has any relationship to making a URI “persistent.”

That is, five (5) or ten (10) years from now, a URI used as an identifier will return the same value as it does today.

The other nine rules have no relationship to persistence. Good arguments can be made for some of them, but persistence isn’t one of them.

Why the report hides behind the rhetoric of persistence I cannot say.

But you can satisfy yourself that only a “dedicated service” can persist a URI, whatever its form.

W3C confusion over identifiers and locators for web resources continues to plague this area.

There isn’t anything particularly remarkable about using a URI as an identifier. So long as it is understood that URI identifiers are just like any other identifier.

That is they can be indexed, annotated, searched for and returned to users with data about the object of the identification.

Viewed that way, the fact that once upon a time there was a resource at the location specified by a URI has little or nothing to do with the persistence of that URI.

So long as we have indexed the URI, that index can resolve that URI/identifier for as long as the index persists, with additional information should we choose to create and provide it.
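A toy sketch of that argument (my own illustration, with hypothetical URIs and fields): once a URI is treated as an identifier in an index, resolution no longer depends on the original location answering at all.

# Toy resolver: treat URIs as identifiers and resolve them through an index,
# independent of whether the original location still exists.
index = {
    "http://example.org/id/person/alice": {
        "label": "Alice Example",
        "current_location": "https://archive.example.net/alice",
        "annotations": ["first indexed 2012-12-24"],
    },
}

def resolve(uri):
    """Return what we know about the identifier, even if the URL 404s today."""
    return index.get(uri, {"label": None, "note": "identifier not in index"})

print(resolve("http://example.org/id/person/alice"))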

The EU document concedes as much when it says:

Without exception, all the use cases discussed in section 3 where a policy of URI persistence has been adopted, have used a dedicated service that is independent of the data originator. The Australian National Data Service uses a handle resolver, Dublin Core uses purl.org, services, data.gov.uk and publications.europa.eu are all also independent of a specific government department and could readily be transferred and run by someone else if necessary. This does not imply that a single service should be adopted for multiple data providers. On the contrary – distribution is a key advantage of the Web. It simply means that the provision of persistent URIs should be independent of the data originator.

That is, if you read “…independent of the data originator” to mean independent of a particular location on the WWW.

No changes in form, content, protocols, server software, etc., required. And you get persistent URIs.

Merry Christmas to all and to all…, persistent URIs as identifiers (not locators)!

(I first saw this at: New Report: 10 Rules for Persistent URIs)

December 18, 2012

HTML5 and Canvas 2D – Feature Complete

Filed under: HTML5,Web Applications,Web Browser,WWW — Patrick Durusau @ 6:08 am

HTML5 and Canvas 2D have been released as feature complete drafts.

Not final but a stable target for development.

If you are interested in “testimonials,” see: HTML5 Definition Complete, W3C Moves to Interoperability Testing and Performance

Personally I prefer the single page HTML versions:

HTML5 single page version.

The Canvas 2D draft is already a single page version.

Now would be a good time to begin working on how you will use HTML5 and Canvas 2D for delivery of topic map based information.
