Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

December 18, 2012

HTML5 and Canvas 2D – Feature Complete

Filed under: HTML5,Web Applications,Web Browser,WWW — Patrick Durusau @ 6:08 am

HTML5 and Canvas 2D have been released as feature complete drafts.

Not final but a stable target for development.

If you are interested in “testimonials,” see: HTML5 Definition Complete, W3C Moves to Interoperability Testing and Performance

Personally I prefer the single page HTML versions:

HTML5 single page version.

The Canvas 2D draft is already a single page version.

Now would be a good time to begin working on how you will use HTML5 and Canvas 2D for delivery of topic map based information.

December 17, 2012

taxize: Taxonomic search and phylogeny retrieval [R]

Filed under: Bioinformatics,Biomedical,Phylogenetic Trees,Searching,Taxonomy — Patrick Durusau @ 4:58 pm

taxize: Taxonomic search and phylogeny retrieval by Scott Chamberlain, Eduard Szoecs and Carl Boettiger.

From the documentation:

We are developing taxize as a package to allow users to search over many websites for species names (scientific and common) and download up- and downstream taxonomic hierarchical information – and many other things. The functions in the package that hit a specific API have a prefix and suffix separated by an underscore. They follow the format of service_whatitdoes. For example, gnr_resolve uses the Global Names Resolver API to resolve species names. General functions in the package that don’t hit a specific API don’t have two words separated by an underscore, e.g., classification. You need API keys for Encyclopedia of Life (EOL), the Universal Biological Indexer and Organizer (uBio), Tropicos, and Plantminer.

Just in case you need species names and/or taxonomic hierarchy information for your topic map.

Apache Ambari: Hadoop Operations, Innovation, and Enterprise Readiness

Filed under: Apache Ambari,Hadoop,MapReduce — Patrick Durusau @ 4:23 pm

Apache Ambari: Hadoop Operations, Innovation, and Enterprise Readiness by Shaun Connolly

From the post:

Over the course of 2012, through Hortonworks’ leadership within the Apache Ambari community we have seen the rapid creation of an enterprise-class management platform required for enabling Apache Hadoop to be an enterprise viable data platform. Hortonworks engineers and the broader Ambari community have been working hard on their latest release, and we’d like to highlight the exciting progress that’s been made to Ambari, a 100% open and free solution that delivers the features required from an enterprise-class management platform for Apache Hadoop.

Why is the open source Ambari management platform important?

For Apache Hadoop to be an enterprise viable platform it not only needs the Data Services that sit atop core Hadoop (such as Pig, Hive, and HBase), but it also needs the Management Platform to be developed in an open and free manner. Ambari is a key operational component within the Hortonworks Data Platform (HDP), which helps make Hadoop deployments for our customers and partners easier and more manageable.

Stability and ease of management are two key requirements for enterprise adoption of Hadoop and Ambari delivers on both of these. Moreover, the rate at which this project is innovating is very exciting. In under a year, the community has accomplished what has taken years to complete for other solutions. As expected the “ship early and often” philosophy demonstrates innovation and helps encourage a vibrant and widespread following.

A reminder that tools can’t just be cool or clever.

Tools must fit within enterprise contexts where “those who lead from behind” are neither cool nor clever. But they do pay the bills and so are entitled to predictable and manageable outcomes.

Maybe. 😉 But that is the usual trade-off, and if Apache Ambari helps Hadoop meet those requirements, so much the better for Hadoop.

blekko donates search data to Common Crawl [uncommon knowledge graphs]

Filed under: Common Crawl,Searching — Patrick Durusau @ 4:10 pm

blekko donates search data to Common Crawl by Lisa Green.

From the post:

I am very excited to announce that blekko is donating search data to Common Crawl!

blekko was founded in 2007 to pursue innovations that would eliminate spam in search results. blekko has created a new type of search experience that enlists human editors in its efforts to eliminate spam and personalize search. blekko has raised $55 million in VC and currently has 48 employees, including former Google and Yahoo! Search engineers.

For details of their donation and collaboration with Common Crawl see the post from their blog below. Follow blekko on Twitter and subscribe to their blog to keep abreast of their news (lots of cool stuff going on over there!) and be sure to check out their search.

And from blekko:

At blekko, we believe the web and search should be open and transparent — it’s number one in the blekko Bill of Rights. To make web data accessible, blekko gives away our search results to innovative applications using our API. Today, we’re happy to announce the ongoing donation of our search engine ranking metadata for 140 million websites and 22 billion webpages to the Common Crawl Foundation.

That’s a fair sized chunk of metadata.

The advantage of having large scale crawling and storage capabilities is slowly fading.

Are you ready to take the next step beyond tweaking the same approach?

Yes, Google has the Knowledge Graph. Which is no mean achievement.

On the other hand, aren’t most enterprises interested in uncommon knowledge graphs? As in their knowledge graph?

The difference between a freebie calculator and a Sun workstation.

Which one do you want?

MOOCs have exploded!

Filed under: Education,Teaching — Patrick Durusau @ 3:13 pm

MOOCs have exploded! by John Johnson.

From the post:

About a year and two months ago, Stanford University taught three classes online: Intro to Databases, Machine Learning, and Artificial Intelligence. I took two of those classes (I did not feel I had time to take Artificial Intelligence), and found them very valuable. The success of those programs led to the development of at least two companies in a new area of online education: Coursera and Udacity. In the meantime, other efforts have been started (I’m thinking mainly edX, but there are others as well), and now many universities are scrambling to take advantage of either the framework of these companies or other platforms.

Put simply, if you have not already, then you need to make the time to do some of these classes. Education is the most important investment you can make in yourself, and at this point there are hundreds of free online university-level classes in everything from the arts to statistics. If ever you wanted to expand your horizons, now’s the time.

John mentions that the courses require self-discipline. For enrollment of any size, that would be true of the person offering the course as well.

If you have taken one or more MOOCs, I am interested to hear your thoughts on teaching topic maps via a MOOC.

The topic map syntaxes look amenable to a mini-test-with-automated-grading style of assessment. At a minimum, a submitted topic map could be checked for parsing validity (a sketch of such a check follows below).

Would that be enough? As a mini-introduction to topic maps?

Saving in-depth discussion of semantics, identity and such for smaller settings?
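
A rough sketch of what that automated parsing check might look like, assuming submissions arrive as XTM files (Python; elements are matched by local name, so either XTM 1.0 or 2.0 would pass):

  # Minimal "does it parse?" grader for a submitted topic map in XTM syntax.
  # It tests well-formedness and counts a few element types; it says nothing
  # about whether the map makes semantic sense.
  import sys
  import xml.etree.ElementTree as ET

  def check_submission(path):
      try:
          tree = ET.parse(path)
      except ET.ParseError as err:
          return False, "not well-formed XML: %s" % err

      def local(tag):                      # strip any namespace prefix
          return tag.rsplit("}", 1)[-1]

      if local(tree.getroot().tag) != "topicMap":
          return False, "root element is not <topicMap>"

      tags = [local(el.tag) for el in tree.iter()]
      return True, "parsed: %d topics, %d associations" % (
          tags.count("topic"), tags.count("association"))

  if __name__ == "__main__":
      ok, message = check_submission(sys.argv[1])
      print(("PASS: " if ok else "FAIL: ") + message)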

The Rewards of Ignoring Data

Filed under: Boosting,Machine Learning,Random Forests — Patrick Durusau @ 2:55 pm

The Rewards of Ignoring Data by Charles Parker.

From the post:

Can you make smarter decisions by ignoring data? It certainly runs counter to our mission, and sounds a little like an Orwellian dystopia. But as we’re going to see, ignoring some of your data some of the time can be a very useful thing to do.

Charlie does an excellent job of introducing the use of multiple models of data and includes deeper material:

There are fairly deep mathematical reasons for this, and ML scientist par excellence Robert Schapire lays out one of the most important arguments in the landmark paper “The Strength of Weak Learnability” in which he proves that a machine learning algorithm that performs only slightly better than randomly can be “boosted” into a classifier that is able to learn to an arbitrary degree of accuracy. For this incredible contribution (and for the later paper that gave us the Adaboost algorithm), he and his colleague Yoav Freund earned the Gödel Prize for computer science theory, the only time the award has been given for a machine learning paper.

Not content to stop there, Charles demonstrates how you can create a random decision forest from your data.

Which is possible without reading the deeper material.
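
The core trick, training each model on a random slice of the rows and columns (that is, deliberately ignoring data), fits in a few lines. A toy sketch, assuming scikit-learn and using its bundled iris data rather than anything from Charles's post:

  # Each tree sees only a bootstrap sample of the rows and two of the four
  # features; the ensemble then votes. Ignoring data per tree, better answers overall.
  import numpy as np
  from sklearn.datasets import load_iris
  from sklearn.tree import DecisionTreeClassifier

  rng = np.random.default_rng(0)
  X, y = load_iris(return_X_y=True)

  trees = []
  for _ in range(25):
      rows = rng.choice(len(X), size=len(X), replace=True)      # bootstrap sample of rows
      cols = rng.choice(X.shape[1], size=2, replace=False)      # ignore half the features
      tree = DecisionTreeClassifier().fit(X[rows][:, cols], y[rows])
      trees.append((tree, cols))

  def predict(sample):
      votes = [t.predict(sample[cols].reshape(1, -1))[0] for t, cols in trees]
      return np.bincount(votes).argmax()

  print(predict(X[0]), "expected", y[0])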

The Cooperative Computing Lab

Filed under: Cloud Computing,Clustering (servers),HPC,Parallel Programming,Programming — Patrick Durusau @ 2:39 pm

The Cooperative Computing Lab

I encountered this site while tracking down resources for the DASPOS post.

From the homepage:

The Cooperative Computing Lab at the University of Notre Dame seeks to give ordinary users the power to harness large systems of hundreds or thousands of machines, often called clusters, clouds, or grids. We create real software that helps people to attack extraordinary problems in fields such as physics, chemistry, bioinformatics, biometrics, and data mining. We welcome others at the University to make use of our computing systems for research and education.

As the computing requirements of your data mining or topic maps increase, so will your need for clusters, clouds, or grids.

The CCL offers several software packages for free download that you may find useful.

Data and Software Preservation for Open Science (DASPOS)

Filed under: BigData,Data Preservation,HEP - High Energy Physics,Software Preservation — Patrick Durusau @ 11:36 am

I first read in: Preserving Science Data and Software for Open Science:

One of the emerging, and soon to be defining, characteristics of science research is the collection, usage and storage of immense amounts of data. In fields as diverse as medicine, astronomy and economics, large data sets are becoming the foundation for new scientific advances. A new project led by University of Notre Dame researchers will explore solutions to the problems of preserving data, analysis software and computational work flows, and how these relate to results obtained from the analysis of large data sets.

Titled “Data and Software Preservation for Open Science (DASPOS),” the National Science Foundation-funded $1.8 million program is focused on high energy physics data from the Large Hadron Collider (LHC) and the Fermilab Tevatron.

The research group, which is led by Mike Hildreth, a professor of physics; Jarek Nabrzyski, director of the Center for Research Computing with a concurrent appointment as associate professor of computer science and engineering; and Douglas Thain, associate professor of computer science and engineering, also will survey and incorporate the preservation needs of other research communities, such as astrophysics and bioinformatics, where large data sets and the derived results are becoming the core of emerging science in these disciplines.

Preservation of data and software semantics. Sounds like topic maps!

Materials you may find useful:

Status Report of the DPHEP Study Group: Towards a Global Effort for Sustainable Data Preservation in High Energy Physics (May 2012, Omitted the last 40 authors so I am omitting the first 50 authors. See the paper for the complete list.)

Data Preservation in High Energy Physics (December 2009, forerunner to the 2012 report)

DASPOS: Common Formats? by Mike Hildreth (slides, 19 November 2012)

DASPOS Overview by Mike Hildreth (slides, 20 November 2012)

Perhaps the most important statement from the 20 November slides:

A “scouting party”: push forward in what looks like a good direction without worrying about full world-wide consensus

I have participated in, seen, or read about any number of projects, and this is quite refreshing.

Starting a project with final answers, or developing them prematurely, is a guarantee of poor results.

Both science and the humanities explore to find answers. Why should developing standards be any different?

A great deal to be learned here, even if you are just listening in on the conversations.

Learn You Some Erlang for Great Good!

Filed under: Erlang,Functional Programming — Patrick Durusau @ 6:48 am

Learn You Some Erlang for Great Good! is now a real book! by Paolo D’Incau.

From the post:

In my humble opinion if you want to learn or improve your Erlang, writing a lot of code is a good idea but is really not enough: you have to learn from other people’s work, you have to read more from blogs and books.

That’s the reason why in one of my oldest posts I recommended you to take a look at 7 Erlang related websites among which you will find the good old http://learnyousomeerlang.com/. I firmly believe that most of Erlangers out there learnt a lot from Fred Hébert’s work; the amount of information he provides is just impressive and his way to teach Erlang by small (well, not that small) examples is the best one I have seen so far online.

BTW, if you read Paolo’s post, you will find a 30% discount code for: Learn You Some Erlang for Great Good!.

Thanks to Paolo, I am now also waiting for my copy to arrive! (Misery loves company.)

Go3R [Searching for Alternatives to Animal Testing]

Go3R

A semantic search engine for finding alternatives to animal testing.

I mention it as an example of a search interface that assists the user in searching.

The help documentation is a bit sparse if you are looking for an opportunity to contribute to such a project.

I did locate some additional information on the project, all usefully with the same title to make locating it “easy.” 😉

[Introduction] Knowledge-based semantic search engine for alternative methods to animal experiments

[PubMed – entry] Go3R – semantic Internet search engine for alternative methods to animal testing by Sauer UG, Wächter T, Grune B, Doms A, Alvers MR, Spielmann H, Schroeder M. (ALTEX. 2009;26(1):17-31).

Abstract:

Consideration and incorporation of all available scientific information is an important part of the planning of any scientific project. As regards research with sentient animals, EU Directive 86/609/EEC for the protection of laboratory animals requires scientists to consider whether any planned animal experiment can be substituted by other scientifically satisfactory methods not entailing the use of animals or entailing less animals or less animal suffering, before performing the experiment. Thus, collection of relevant information is indispensable in order to meet this legal obligation. However, no standard procedures or services exist to provide convenient access to the information required to reliably determine whether it is possible to replace, reduce or refine a planned animal experiment in accordance with the 3Rs principle. The search engine Go3R, which is available free of charge under http://Go3R.org, runs up to become such a standard service. Go3R is the world-wide first search engine on alternative methods building on new semantic technologies that use an expert-knowledge based ontology to identify relevant documents. Due to Go3R’s concept and design, the search engine can be used without lengthy instructions. It enables all those involved in the planning, authorisation and performance of animal experiments to determine the availability of non-animal methodologies in a fast, comprehensive and transparent manner. Thereby, Go3R strives to significantly contribute to the avoidance and replacement of animal experiments.

[ALTEX entry – full text available] Go3R – Semantic Internet Search Engine for Alternative Methods to Animal Testing

Visualizing Facebook Friends With D3.js…

Filed under: D3,Graphs,Networks,Visualization — Patrick Durusau @ 5:30 am

Visualizing Facebook Friends With D3.js or “How Wolfram|Alpha Does That Cool Friend Network Graph” by Tony Young.

From the post:

A while ago, Wolfram|Alpha got the ability to generate personal analytics based on your Facebook profile. It made some cool numbers and stuff, but the friend network graph was the most impressive:

clustering of friends

Wolfram|Alpha neatly separates your various social circles into clusters, based on proximity — with freaky accuracy.

With the awesome D3.js library, along with some gratuitous abuse of the Facebook API, we can make our own!

If you’re impatient, skip through all this text and check out the example or the screenshot!

A good example of the ease of deduplication (read merging) where the source of ids is uniform.

A possible classroom exercise: create additional Facebook accounts so that each student has at least two (2) Facebook accounts, each with its own friend list.

Any overlapping friends will “merge,” but the different accounts won’t, even though they belong to the same person.

Then walk through solving the merging problem where the same person sits behind different accounts.
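
Here is a minimal sketch of that merging step (pure Python with made-up account and friend ids; nothing here touches the real Facebook API):

  # Merge accounts whose friend lists overlap heavily: a large shared friend
  # set is taken as evidence that two accounts identify the same person.
  from itertools import combinations

  accounts = {
      "alice_main": {"f1", "f2", "f3", "f4", "f5"},
      "alice_alt":  {"f2", "f3", "f4", "f5", "f9"},
      "bob":        {"f7", "f8"},
  }

  def should_merge(a, b, threshold=0.5):
      overlap = len(accounts[a] & accounts[b])
      return overlap / min(len(accounts[a]), len(accounts[b])) >= threshold

  # Union-find over account ids, so chains of pairwise matches collapse into one person.
  parent = {a: a for a in accounts}
  def find(a):
      while parent[a] != a:
          a = parent[a]
      return a

  for a, b in combinations(accounts, 2):
      if should_merge(a, b):
          parent[find(a)] = find(b)

  people = {}
  for a in accounts:
      people.setdefault(find(a), []).append(a)
  print(people)   # the two alice accounts land in one group, bob in another

The interesting classroom discussion starts where the sketch stops: what counts as "enough" overlap, and what other properties of the accounts should weigh in the decision.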

I first saw this in a tweet by Christophe Viau.

December 16, 2012

OrgOrgChart: The Dynamic Organization of an Organization

Filed under: Graphs,Networks,Visualization — Patrick Durusau @ 9:14 pm

OrgOrgChart: The Dynamic Organization of an Organization by Andrew Vande Moere.

From the post:

The Organic Organization Chart [autodeskresearch.com], developed by Justin Matejka at Autodesk Research, shows what a Human Resources manager dreams at night.

The animated force-directed network diagram shows how a company’s structure evolves over time, here the daily organizational changes within the company Autodesk over the last 4 years.

The entire hierarchy of AutoDesk is constructed as a single tree with each employee represented by a circle, and a line connecting each employee with his or her manager. Larger circles represent managers with more employees working under them.

It occurs to me that a similar diagram could be useful for tracking the flow of information from one person to another, by adding in email, phone, and observed personal contacts.

Say, from an internal briefing to a leak.

Impressive demonstration of changes over time. Very impressive.

Asterank: an Accurate 3D Model of the Asteroids in our Solar System

Filed under: Astroinformatics,Mapping,Maps — Patrick Durusau @ 9:02 pm

Asterank: an Accurate 3D Model of the Asteroids in our Solar System by Andrew Vande Moere.

From the post:

Asterank 3D Asteroid Orbit Space Simulation [asterank.com], developed by software engineer Ian Webster, is a 3D WebGL-based model of the first 5 planets and the 30 most valuable asteroids, together with their respective orbits in our inner solar system.

Asterank’s database contains the astronomically accurate locations, as well as some economic and scientific information, of over 580,000 asteroids in our solar system. Each asteroid is accompanied by its “Value of Materials”, in terms of the metals, volatile compounds, or water it seems to contain. The “Cost of Operations” provides a financial estimation of how much it would cost to travel to the asteroid and move the materials back to Earth.

Will you be ready as semantic diversity spreads from the Earth out into the Solar System?

Why There Shouldn’t Be A Single Version Of The Truth

Filed under: Diversity,Semantics — Patrick Durusau @ 8:35 pm

Why There Shouldn’t Be A Single Version Of The Truth by Chuck Hollis.

From the post:

Legacy thinking can get you in trouble in so many ways. The challenge is that — well — there’s so much of it around.

Maxims that seemed to make logical sense in one era quickly become the intellectual chains that hold so many of us back. Personally, I’ve come to enjoy blowing up conventional wisdom to make room for emerging realities.

I’m getting into more and more customer discussions with progressive IT organizations that are seriously contemplating building platforms and services that meet the broad goal of “analytically enabling the business” — business analytics as service, if you will.

The problem? The people in charge have done things a certain way for a very long time. And the new, emerging requirements are forcing them to go back and seriously reconsider some of their most deeply-held assumptions.

Like having “one version of the truth”. I’ve seen multiple examples of it get in the way of organizations who need to be doing more with their data.

As usual, a highly entertaining and well illustrated essay from Chuck.

Chuck makes the case for enough uniformity to enable communication but enough diversity to generate new ideas and interesting discussions.

The InChI and its influence on the domain of chemical information

Filed under: Cheminformatics,InChI — Patrick Durusau @ 8:26 pm

The InChI and its influence on the domain of chemical information by Bailey Fallon.

From the post:

A thematic series on the IUPAC International Chemical Identifier (InChI) and Its Influence on the Domain of Chemical Information has just seen its first articles published in Journal of Cheminformatics.

The InChI is a textual identifier for chemical substances, which provides a standard way of representing chemical information. It is machine readable, allowing it to be used for structure searching in databases and on the web. This thematic issue, edited by Dr Antony Williams at the Royal Society of Chemistry, aggregates a number of contributions demonstrating the value of InChI as an enabling technology in the world of cheminformatics and its continuing value for linking chemistry data.

Certainly should command your attention if you are in cheminformatics.

But also if you want to duplicate its success.
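
A small illustration of why the InChI works as a merge key (a sketch assuming RDKit built with its InChI support; the molecule is my own trivial example, not taken from the thematic series):

  # Two different SMILES spellings of ethanol reduce to the same InChI,
  # which is what lets databases recognize them as the same substance.
  from rdkit import Chem

  a = Chem.MolFromSmiles("CCO")   # ethanol, written one way
  b = Chem.MolFromSmiles("OCC")   # ethanol, written another way

  inchi_a = Chem.MolToInchi(a)
  inchi_b = Chem.MolToInchi(b)
  print(inchi_a)
  print(inchi_a == inchi_b)       # True: one identifier per substance

Substitute "subject identifier" for "identifier" and the relevance to topic maps is obvious.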

Closing the Knowledge Gap: A Case Study – How Cisco Unlocks Communications

Filed under: Indexing,Lucene,Searching — Patrick Durusau @ 8:12 pm

Closing the Knowledge Gap: A Case Study – How Cisco Unlocks Communications by Tony Frazier, Director of Product Management, Cisco Systems and David Fishman, Marketing, Lucid Imagination.

A bit dated (2010) but I think you will find this interesting reading.

A couple of snippets to tempt you into reading the full post:

Consider this: you’re looking for information and immediately search the documents at your disposal to find the answer. Are you the first person who conducted this search? If you are in a reasonably large organization, given the scope and mix of electronic communications today, there could be more than 10 other employees looking for the same answer. Unearthing documents, one employee at a time, may not be the best way of tapping into that collective intellect and maximizing resources across an organization. Wouldn’t it make more sense to tap into existing discussions taking place across the network—over email, voice and increasingly video communications?

and,

Cisco’s approach to this project centered on vocabulary-based tagging and search. Every organization has the ability to define keywords for their personalized library. Cisco Pulse then tags a user’s activity, content and behavior in electronic communications to match the vocabulary, presenting valuable information that simplifies and accelerates knowledge sharing across an organization. Vocabulary-based tagging makes unlocking the relevant content of electronic communications safe and efficient.

Add a little vocabulary mapping with topic maps, toss and serve!
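
A toy sketch of what that vocabulary mapping might look like (invented vocabulary and message, nothing to do with Cisco Pulse's actual implementation): synonyms are folded into one canonical subject before the tag is applied, so "Sev1" and "P1" land under the same heading.

  # Vocabulary-based tagging with a synonym map: every surface form is
  # normalized to a canonical subject before tagging.
  vocabulary = {
      # canonical subject: surface forms that should map to it
      "outage":     {"outage", "sev1", "p1", "downtime"},
      "escalation": {"escalation", "escalate", "tier-2"},
  }
  surface_to_subject = {form: subj for subj, forms in vocabulary.items() for form in forms}

  def tag(message):
      words = {w.strip(".,!?").lower() for w in message.split()}
      return sorted({surface_to_subject[w] for w in words if w in surface_to_subject})

  print(tag("Customer reported a Sev1, please escalate to tier-2"))
  # ['escalation', 'outage']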

Searching an Encrypted Document Collection with Solr4, MongoDB and JCE

Filed under: Encryption,MongoDB,Security,Solr — Patrick Durusau @ 8:00 pm

Searching an Encrypted Document Collection with Solr4, MongoDB and JCE by Sujit Pal.

From the post:

A while back, someone asked me if it was possible to make an encrypted document collection searchable through Solr. The use case was patient records – the patient is the owner of the records, and the only person who can search through them, unless he temporarily grants permission to someone else (for example his doctor) for diagnostic purposes. I couldn’t come up with a good way of doing it off the bat, but after some thought, came up with a design that roughly looked like the picture below:

With privacy being all the rage, a very timely post.

Not to mention an opportunity to try out Solr4.
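
Sujit's design is worth reading in full. As a much simpler sketch of the general idea (my own toy version, not his architecture; it assumes the cryptography, pymongo and pysolr packages and a Solr core with an id field plus a multi-valued terms field): keep only ciphertext in MongoDB and index keyed hashes of the terms in Solr, so the index can answer exact-term queries without ever holding plaintext.

  # Toy "blind index": MongoDB stores the encrypted document, Solr indexes
  # only HMACs of its terms, and nothing useful leaks without the owner's key.
  import hashlib, hmac, uuid
  from cryptography.fernet import Fernet
  import pymongo, pysolr

  key = Fernet.generate_key()          # held by the record owner, not the servers
  fernet = Fernet(key)

  def blind(term):
      return hmac.new(key, term.lower().encode(), hashlib.sha256).hexdigest()

  docs = pymongo.MongoClient()["records"]["documents"]
  solr = pysolr.Solr("http://localhost:8983/solr/records")

  def store(text):
      doc_id = str(uuid.uuid4())
      docs.insert_one({"_id": doc_id, "body": fernet.encrypt(text.encode())})
      solr.add([{"id": doc_id, "terms": [blind(t) for t in text.split()]}])
      return doc_id

  def search(term):
      hits = solr.search("terms:%s" % blind(term))
      return [fernet.decrypt(docs.find_one({"_id": h["id"]})["body"]).decode() for h in hits]

  store("patient shows elevated glucose")
  print(search("glucose"))

Granting temporary access to a doctor then becomes a key-management problem rather than a search problem.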

Collaborating, Online with LaTeX?

Filed under: Authoring Topic Maps,Collaboration,Editor — Patrick Durusau @ 6:17 am

I saw a tweet tonight that mentioned two online collaborative editors based on LaTeX:

writeLaTeX

and,

ShareLaTeX

I don’t have the time to look closely at them tonight but thought you would find them interesting.

If collaborative editing is possible for LaTeX, shouldn’t that also be possible for a topic map?

I saw this mentioned in a tweet by Jan-Piet Mens.

December 15, 2012

Neuroscience Information Framework (NIF)

Filed under: Bioinformatics,Biomedical,Medical Informatics,Neuroinformatics,Searching — Patrick Durusau @ 8:21 pm

Neuroscience Information Framework (NIF)

From the about page:

The Neuroscience Information Framework is a dynamic inventory of Web-based neuroscience resources: data, materials, and tools accessible via any computer connected to the Internet. An initiative of the NIH Blueprint for Neuroscience Research, NIF advances neuroscience research by enabling discovery and access to public research data and tools worldwide through an open source, networked environment.

An example of a subject-specific information resource that provides much deeper coverage than is possible with Google.

If you aren’t trying to index everything, you can outperform more general search solutions.

Rosalind

Filed under: Bioinformatics,Python,Teaching — Patrick Durusau @ 8:16 pm

Rosalind

From the homepage:

Rosalind is a platform for learning bioinformatics through problem solving.

Rather than teaching topic maps from the “basics” forward, what about teaching problems for which topic maps are a likely solution?

And introduce syntax/practices as solutions to particular issues?

Suggestions for problems?

DSpace 3.0 Released

Filed under: CMS,DSpace — Patrick Durusau @ 2:20 pm

DSpace 3.0 Released

From the post:

DSpace 3.0 was officially released to the public on November 30, 2012. The previous version of DSpace was 1.8.2 and DSpace has changed its numbering scheme and this is explained here. The demo version of this release is available for testing here.

The new DSpace 3.0 comes with a number of new features. There are two groups of features: those that are enabled by default and those that require deliberate activation.

Default features

Activation features

The features listed below are included in the DSpace 3.0 release, but they must be enabled manually.

DSpace 3.0 can be [down]loaded at

If you aren’t already familiar with DSpace, the DSpace homepage offers the following helpful summary:

DSpace open source software is a turnkey institutional repository application.

🙂

The DSpace Video is more forthcoming.

December 14, 2012

Semantic Technology ROI: Article of Faith? or Benchmarks for 1.28% of the web?

Filed under: Benchmarks,Marketing,Performance,RDFa,Semantic Web — Patrick Durusau @ 3:58 pm

Orri Erling, in LDBC: A Socio-technical Perspective, writes in part:

I had a conversation with Michael at a DERI meeting a couple of years ago about measuring the total cost of technology adoption, thus including socio-technical aspects such as acceptance by users, learning curves of various stakeholders, whether in fact one could demonstrate an overall gain in productivity arising from semantic technologies. [in my words, paraphrased]

“Can one measure the effectiveness of different approaches to data integration?” asked I.

“Of course one can,” answered Michael, “this only involves carrying out the same task with two different technologies, two different teams and then doing a double blind test with users. However, this never happens. Nobody does this because doing the task even once in a large organization is enormously costly and nobody will even seriously consider doubling the expense.”

LDBC does in fact intend to address technical aspects of data integration, i.e., schema conversion, entity resolution, and the like. Addressing the sociotechnical aspects of this (whether one should integrate in the first place, whether the integration result adds value, whether it violates privacy or security concerns, whether users will understand the result, what the learning curves are, etc.) is simply too diverse and so totally domain dependent that a general purpose metric cannot be developed, at least not in the time and budget constraints of the project. Further, adding a large human element in the experimental setting (e.g., how skilled the developers are, how well the stakeholders can explain their needs, how often these needs change, etc.) will lead to experiments that are so expensive to carry out and whose results will have so many unquantifiable factors that these will constitute an insuperable barrier to adoption.

The need for parallel systems to judge the benefits of a new technology is a straw man. And one that is easy to dispel.

For example, if your company provides technical support, you are tracking metrics on how quickly your staff can answer questions. And probably customer satisfaction with your technical support.

Both are common metrics in use today.

Assume a suggestion that linked data would improve technical support for your products. You begin with a pilot project to measure the benefit from the suggested change.

If the length of support calls goes down, or customer satisfaction goes up, or both, change to linked data. If not, don’t.
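
Measuring the pilot is ordinary before-and-after comparison, nothing exotic. A sketch, with invented call lengths and assuming SciPy, of the kind of check a pilot project would run:

  # Compare average handle time for tickets worked the old way and during the
  # linked-data pilot (numbers invented for illustration); a two-sample t-test
  # says whether the difference looks real.
  from scipy import stats

  baseline_minutes = [18, 22, 19, 25, 21, 30, 24, 20, 23, 27]   # current process
  pilot_minutes    = [15, 17, 20, 14, 18, 16, 19, 13, 17, 18]   # linked-data pilot

  t, p = stats.ttest_ind(baseline_minutes, pilot_minutes, equal_var=False)
  print("baseline mean %.1f min, pilot mean %.1f min, p = %.3f"
        % (sum(baseline_minutes) / 10.0, sum(pilot_minutes) / 10.0, p))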

Naming a technology as “semantic” doesn’t change how you measure the benefits of a change in process.

LDBC will find purely machine-based performance measures easier to produce than answers to the more difficult socio-technical questions.

But of what value are great benchmarks for a technology that no one wants to use?

See my comments under: Web Data Commons (2012) – [RDFa at 1.28% of 40.5 million websites]. Benchmarks for 1.28% of the web?

Setting Up a Neo4J Cluster on Amazon

Filed under: Amazon Web Services AWS,Graphs,Neo4j — Patrick Durusau @ 3:28 pm

Setting Up a Neo4J Cluster on Amazon by Max De Marzi.

From the post:

There are multiple ways to setup a Neo4j Cluster on Amazon Web Services (AWS) and I want to show you one way to do it.

Overview:

  1. Create a VPC
  2. Launch 1 Instance
  3. Install Neo4j HA
  4. Clone 2 Instances
  5. Configure the Instances
  6. Start the Coordinators
  7. Start the Neo4j Cluster
  8. Create 2 Load Balancers
  9. Next Steps

In case you are curious about moving off of your local box to something that can handle more demand.

mathURL

Filed under: Mathematics,TeX/LaTeX — Patrick Durusau @ 3:07 pm

mathURL live equation editing · permanent short links · LaTeX+AMS input

Try: http://mathurl.com/5euwuy

Includes layout, letters and symbols, operators and relations, punctuation and accents, functions, formatting and common forms as selectable items that generate LaTeX code in the editing window.

Interesting to think about use of such a link as a subject identifier.

I first saw this in a tweet from Tex tips.

How-To: Run a MapReduce Job in CDH4

Filed under: Cloudera,Hadoop,MapReduce — Patrick Durusau @ 2:56 pm

How-To: Run a MapReduce Job in CDH4 by Sandy Ryza.

From the post:

This is the first post in a series that will get you going on how to write, compile, and run a simple MapReduce job on Apache Hadoop. The full code, along with tests, is available at http://github.com/cloudera/mapreduce-tutorial. The program will run on either MR1 or MR2.

We’ll assume that you have a running Hadoop installation, either locally or on a cluster, and your environment is set up correctly so that typing “hadoop” into your command line gives you some notes on usage. Detailed instructions for installing CDH, Cloudera’s open-source, enterprise-ready distro of Hadoop and related projects, are available here: https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation. We’ll also assume you have Maven installed on your system, as this will make compiling your code easier. Note that Maven is not a strict dependency; we could also compile using Java on the command line or with an IDE like Eclipse.

The Use Case

There’s been a lot of brawling on our pirate ship recently. Not so rarely, one of the mates will punch another one in the mouth, knocking a tooth out onto the deck. Our poor sailors will wake up the next day with an empty bottle of rum, wondering who’s responsible for the gap between their teeth. All this violence has gotten out of hand, so as a deterrent, we’d like to provide everyone with a list of everyone that’s ever left them with a gap. Luckily, we’ve been able to set up a Flume source so that every time someone punches someone else, it gets written out as a line in a big log file in Hadoop. To turn this data into these lists, we need a MapReduce job that can 1) invert the mapping from attacker to their victim, 2) group by victims, and 3) eliminate duplicates.

Cool!

Imagine using the same technique while you watch the evening news!

On second thought, that would take too much data entry and be depressing.

Stick to the pirates!
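
The tutorial's code is Java (and runs on either MR1 or MR2), but the shape of the job is easy to see in a Hadoop-Streaming-style sketch. This is my own toy version, not Cloudera's code, and it assumes each log line is simply "attacker victim":

  # mapper.py -- emit the victim first so the shuffle groups punches by who got hit
  import sys
  for line in sys.stdin:
      attacker, victim = line.split()
      print("%s\t%s" % (victim, attacker))

  # reducer.py -- input arrives sorted by victim; collect attackers, dropping duplicates
  import sys
  from itertools import groupby
  pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
  for victim, group in groupby(pairs, key=lambda kv: kv[0]):
      attackers = sorted({attacker for _, attacker in group})
      print("%s\t%s" % (victim, ",".join(attackers)))

You can test the pair locally with cat punches.log | python mapper.py | sort | python reducer.py before handing them to the streaming jar, or stay with the Java version from the tutorial.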

Web Data Commons (2012) – [RDFa at 1.28% of 40.5 million websites]

Filed under: Common Crawl,Microdata,Microformats,RDFa — Patrick Durusau @ 2:34 pm

Web Data Commons announced the extraction results from the August 2012 Common Crawl corpus on 2012-12-10!

Access:

The August 2012 Common Crawl Corpus is available on Amazon S3 in the bucket aws-publicdatasets under the key prefix /common-crawl/parse-output/segment/ .

The numbers:

Extraction Statistics

Crawl Date: January-June 2012
Total Data: 40.1 Terabyte (compressed)
Parsed HTML URLs: 3,005,629,093
URLs with Triples: 369,254,196
Domains in Crawl: 40,600,000
Domains with Triples: 2,286,277
Typed Entities: 1,811,471,956
Triples: 7,350,953,995

See also:

Web Data Commons Extraction Report – August 2012 Corpus

and,

Additional Statistics and Analysis of the Web Data Commons August 2012 Corpus

Where the authors report:

Altogether we discovered structured data within 369 million of the 3 billion pages contained in the Common Crawl corpus (12.3%). The pages containing structured data originate from 2.29 million among the 40.5 million websites (PLDs) contained in the corpus (5.65%). Approximately 519 thousand websites use RDFa, while only 140 thousand websites use Microdata. Microformats are used on 1.7 million websites. It is interesting to see that Microformats are used by approximately 2.5 times as many websites as RDFa and Microdata together.

PLDs = Pay-Level-Domains.

The use of Microformats on “2.5 times as many websites as RDFa and Microdata together” has to make you wonder about the viability of RDFa.

Or to put it differently, if RDFa is 1.28% of the 40.5 million websites, eight (8) years after its introduction (2004) and four (4) years after reaching Recommendation status (2008), is it time to look for an alternative?
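
The percentages check out against the counts in the report's summary paragraph, for anyone who wants to follow along:

  # Recomputing the report's percentages from the counts quoted above.
  pages_total      = 3005629093
  pages_triples    = 369254196
  websites_total   = 40500000
  websites_triples = 2286277
  websites_rdfa    = 519000     # "approximately 519 thousand websites use RDFa"

  print("pages with triples:    %.1f%%" % (100.0 * pages_triples / pages_total))        # 12.3%
  print("websites with triples: %.2f%%" % (100.0 * websites_triples / websites_total))  # 5.65%
  print("websites with RDFa:    %.2f%%" % (100.0 * websites_rdfa / websites_total))     # 1.28%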

I first saw the news about the new Web Data Commons data drop in a tweet by Tobias Trapp.

Sitegeist:…

Filed under: Geographic Data,Geographic Information Retrieval,Geography — Patrick Durusau @ 10:58 am

Sitegeist: A mobile app that tells you about your data surroundings by Nathan Yau.

Nathan writes:

From businesses to demographics, there’s data for just about anywhere you are. Sitegeist, a mobile application by the Sunlight Foundation, puts the sources into perspective.

App is free and the Sunlight site lists the following data for a geographic location:

  • Age Distribution
  • Political Contributions
  • Average Rent
  • Popular Local Spots
  • Recommended Restaurants
  • How People Commute
  • Record Temperatures
  • Housing Units Over Time

If you have an iPhone or Android phone, can you report if other data is available?

I was thinking along the lines of:

  • # of drug arrests
  • # type of drug arrests
  • # of arrests for soliciting (graphed by day/time)
  • # location of bail bond agencies

More tourist type information. 😉

How would you enhance this data flow with a topic map?

Analyzing Big Data With Twitter

Filed under: BigData,CS Lectures,Tweets — Patrick Durusau @ 10:42 am

UC Berkeley Course Lectures: Analyzing Big Data With Twitter by Marti Hearst.

Marti gives a summary of this excellent class, with links to videos, slides and high level notes for the course.

If you enjoyed these materials, make a post about them, recommend them to others or even send Marti a note of appreciation.

Prof. Marti Hearst, hearst@ischool.berkeley.edu

OpenTopography Project

Filed under: LiDAR,Mapping,Maps,Topography — Patrick Durusau @ 10:19 am

OpenTopography: A Portal to High-Resolution Topography Data and Tools

Which ironically has its “spotlight” on:

Discover Lidar Data Hosted by NCALM and USGS from OpenTopography

Which is summarized in the “spotlight” as:

The OpenTopography Find Data page is updated to display not only OpenTopography hosted-data, but also provides linkages to data hosted at the NCALM Data Distribution Center and USGS Center for Lidar Coordination and Knowledge (CLICK). The goal of this collaboration is to make it easier for lidar users to discover and link to online sources of data regardless of host.

Non-self referential and/or paid links that lead to additional content of interest to the reader.

If enough people did that, why, we would have a useful WWW.

PS: Introduction to LiDAR video by the Idaho State University Geoscience Department

FutureLearn [MOOCs from Open University, UK]

Filed under: CS Lectures,Education — Patrick Durusau @ 7:05 am

Futurelearn

From the webpage:

Futurelearn will bring together a range of free, open, online courses from leading UK universities, in the same place and under the same brand.

The Company will be able to draw on The Open University’s unparalleled expertise in delivering distance learning and in pioneering open education resources. These will enable Futurelearn to present a single, coherent entry point for students to the best of the UK’s online education content.

Futurelearn will increase the accessibility of higher education, opening up a wide range of new online courses and learning materials to students across the UK and the rest of the world.

More details in 2013.

If you want to know more, now, try:

Open University launches British Mooc platform to rival US providers

or,

OU Launches FutureLearn Ltd

Have you noticed that the more players in a space the greater the semantic diversity?

Makes me suspect that semantic diversity is a characteristic of humanity.

Are there any counter examples?

PS: MOOCs should be fertile grounds for mapping across different vocabularies for the same content.

PPS: In case you are wondering why the Open University has the .com domain, consider that futurelearn.org was taken. Oh! There are those damned re-use of name issues! 😉

