Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

July 2, 2011

How Much Data…?

Filed under: Marketing — Patrick Durusau @ 3:09 pm

How Much Data Will Humans Create & Store This Year? [INFOGRAPHIC]

I started to create a category called scary graphic but then decided that marketing came close enough.

Use this graphic to emphasize how your client will need the ability to navigate the digital wilderness. Well, ok, their corner of the digital wilderness.

Which is a piece of sanity that seems to be missing from most presentations on data growth. Twitter/Facebook/SuperCollider data is exploding but how much of that is relevant to managing sales data at Walmart? Or HomeDepot? Or even the New York Stock Exchange?

Twitter traffic and the like makes great copy, but crunching numbers with economic significance is more likely to win and maintain contracts.


PS: That’s not to discount the potential for tracking Twitter traffic and insider sales on stocks. But discovering the basis for lawsuits qualifies as crunching numbers with economic significance. Topic maps would be a nice way to summarize and present such information.

Automatoon

Filed under: Graphics,Humor — Patrick Durusau @ 3:07 pm

Automatoon

My first reaction to this site was amusement. Not to mention realizing other people would be glad I lack the artistic skills to create animations. 😉

My second reaction was that a clever person could use these techniques to animate search, sort, merge and other algorithms.

Animations of many of those already exist, but another tool, this one in HTML5, won’t hurt.

July 1, 2011

Balisage 2011 – Final Program

Filed under: Conferences,XPath,XSLT — Patrick Durusau @ 2:58 pm

A recent post from Tommie Usdin announced the following additions to the Balisage 2011 program:

  • XQuery and SPARQL
  • XQuery and XSLT
  • the Logical Form of a Metadata Record
  • Why is XML a pain to produce?
  • XML Serialization of C# and Java Objects
  • testing XSLT in continuous integration
  • dealing with markup without using words
  • REST for document resource nodes
  • tagging journal article supplemental materials
  • using 15 year old SGML documents in current software

and then goes on to talk about why markup geeks should be at Balisage.

I’ll make that shorter:

If you see either < or > at work or anyone talks about them, you need to be at Balisage 2011.

If you are not a markup geek, you will be one by the time you leave. Road to Damascus sort of experience. Or you will decide to move to San Francisco. Either way, what do you have to lose?

August 2-5, 2011, Montreal, Canada. Time is running out!

Indexing The World Wide Web:…

Filed under: Indexing,Search Algorithms,Search Engines,Searching — Patrick Durusau @ 2:57 pm

Indexing The World Wide Web: The Journey So Far by Abhishek Das and Ankit Jain.

Abstract:

In this chapter, we describe the key indexing components of today’s web search engines. As the World Wide Web has grown, the systems and methods for indexing have changed significantly. We present the data structures used, the features extracted, the infrastructure needed, and the options available for designing a brand new search engine. We highlight techniques that improve relevance of results, discuss trade-offs to best utilize machine resources, and cover distributed processing concept in this context. In particular, we delve into the topics of indexing phrases instead of terms, storage in memory vs. on disk, and data partitioning. We will finish with some thoughts on information organization for the newly emerging data-forms.

A non-trivial survey of web indexing attempts and issues. This will take a while to digest, but it looks like a very good starting place for deciding what to try next.
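As a toy illustration of the phrase-indexing trade-off the abstract mentions (indexing phrases instead of bare terms), here is a minimal positional inverted index in Python. The corpus and function names are invented for the example; real engines add compression, partitioning and much more.

```python
from collections import defaultdict

# Hypothetical toy corpus; real engines index billions of pages.
docs = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
    3: "quick brown foxes leap",
}

# Inverted index: term -> list of (doc_id, position).
# Storing positions is what makes phrase queries possible, at the
# cost of a larger index -- one of the trade-offs the chapter covers.
index = defaultdict(list)
for doc_id, text in docs.items():
    for pos, term in enumerate(text.split()):
        index[term].append((doc_id, pos))

def phrase_search(phrase):
    """Return doc ids containing the exact phrase."""
    terms = phrase.split()
    postings = [index.get(t, []) for t in terms]
    if not postings or not postings[0]:
        return []
    hits = set()
    for doc_id, pos in postings[0]:
        # The phrase matches if each later term occurs at the
        # next consecutive position in the same document.
        if all((doc_id, pos + i) in plist
               for i, plist in enumerate(postings)):
            hits.add(doc_id)
    return sorted(hits)

print(phrase_search("quick brown"))  # [1, 3]
```

Note how "foxes" in doc 3 does not match the phrase "brown fox": term-level indexing treats them as unrelated tokens, which is exactly the kind of relevance issue the survey discusses.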

…filling space — without cubes

Filed under: Algorithms,Data Structures,Mathematics — Patrick Durusau @ 2:56 pm

Princeton researchers solve problem filling space — without cubes

From the post:

Whether packing oranges into a crate, fitting molecules into a human cell or getting data onto a compact disc, wasted space is usually not a good thing.

Now, in findings published June 20 in the Proceedings of the National Academy of Sciences, Princeton University chemist Salvatore Torquato and colleagues have solved a conundrum that has baffled mathematical minds since ancient times — how to fill three-dimensional space with multi-sided objects other than cubes without having any gaps.

The discovery could lead to scientists finding new materials and could lead to advances in communications systems and computer security.

“You know you can fill space with cubes,” Torquato said, “We were looking for another way.” In the article “New Family of Tilings of Three-Dimensional Euclidean Space by Tetrahedra and Octahedra,” he and his team show they have solved the problem.

Not immediately useful for topic maps but will be interesting to see if new data structures emerge from this work.

See the article: New Family of Tilings of Three-Dimensional Euclidean Space by Tetrahedra and Octahedra (pay-per-view site)

Cloning Tinkerpop Repositories

Filed under: Blueprints,Frames,Gremlin,Neo4j,Pipes,Rexster — Patrick Durusau @ 2:55 pm

Instructions on creating a local copy of the Gremlin wiki (posted to the gremlin-users@googlegroups.com mailing list by Pierre De Wilde).

The instructions (with minor formatting changes) from his post:

For those who want a local copy of Gremlin wiki:

cd gremlin
git clone https://github.com/tinkerpop/gremlin.wiki.git doc/wiki
cd doc/wiki
gollum

Open your browser at http://localhost:4567 and ta-da…

Moreover, the wiki is searchable and (unlike the github version) it’s printer-friendly.

Gollum is a simple wiki system built on top of Git that powers GitHub Wikis.

https://github.com/github/gollum

To install Gollum, use RubyGems (http://rubygems.org/):

[sudo] gem install gollum

Of course, the same procedure may be applied for other Tinkerpop repositories (blueprints, pipes, frames, rexster, rexster-kibbles).

Unfortunately, gollum cannot serve multiple repositories at once, so you will need to launch a separate instance for each, on a different port (gollum --port xxxx).

Thanks Pierre!

World Bank Data

Filed under: Data Source,Mapping,Marketing — Patrick Durusau @ 2:55 pm

World Bank Data

Available through other portals, the World Bank offers access to over 7,000 indicators at its site, along with widgets for displaying the data.

While the World Bank Data website is well done and a step towards “transparency,” it does not address the need for “transparency” in terms of financial auditing.

Take for example the Uganda – Financial Sector DPC Project. Admittedly it is only $50M, but given that it has a forty (40) year term with a ten (10) year grace period, who will be able to say with any certainty what happened to the funds in question?

If there were a mapping between the financial systems that disburse these funds and the financial systems in Uganda, then on whatever basis the information is updated, the World Bank would know and could assure others of the fate of the funds in question.

Granted, I am assuming that different institutions and countries have different financial systems and that uniformity of such applications or systems isn’t realistic. Still, it should certainly be possible to set up and maintain mappings between such systems. I suspect that mappings to banks and other financial institutions should be made as well, to enable off-site auditing of any and all transactions.

Lest it seem like I am picking on World Bank recipients, I would recommend such mapping/auditing practices for all countries before approval of big ticket items like defense budgets. The fact that an auditing mapping fails in a following year is an indication that something was changed for a reason. Once it is understood that changes attract attention and attention uncovers fraud, unexpected maintenance is unlikely to be an issue.

ScraperWiki

Filed under: Data,Data Mining,Data Source,Text Extraction — Patrick Durusau @ 2:49 pm

ScraperWiki

From the About page:

What is ScraperWiki?

There’s lots of useful data on the internet – crime statistics, government spending, missing kittens…

But getting at it isn’t always easy. There’s a table here, a report there, a few web pages, PDFs, spreadsheets… And it can be scattered over thousands of different places on the web, making it hard to see the whole picture and the story behind it. It’s like trying to build something from Lego when someone has hidden the bricks all over town and you have to find them before you can start building!

To get at data, programmers write bits of code called ‘screen scrapers’, which extract the useful bits so they can be reused in other apps, or rummaged through by journalists and researchers. But these bits of code tend to break, get thrown away or forgotten once they have been used, and so the data is lost again. Which is bad.

ScraperWiki is an online tool to make that process simpler and more collaborative. Anyone can write a screen scraper using the online editor. In the free version, the code and data are shared with the world. Because it’s a wiki, other programmers can contribute to and improve the code.

Something to keep an eye on and, whenever possible, to contribute to.

People make data difficult to access for a reason. Let’s disappoint them.
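For the curious, a minimal screen scraper in the spirit of what ScraperWiki hosts can be sketched with nothing but the Python standard library. The HTML fragment, class name and data here are all invented for the example; a real scraper would fetch a live page and cope with far messier markup.

```python
from html.parser import HTMLParser

# A hypothetical fragment of the kind of page a scraper targets:
# statistics locked up in an HTML table.
SAMPLE = """
<table>
  <tr><td>Burglary</td><td>120</td></tr>
  <tr><td>Fraud</td><td>45</td></tr>
</table>
"""

class TableScraper(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_td = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag == "td":
            self._in_td = False

    def handle_data(self, data):
        # Only text inside a <td> is part of the record.
        if self._in_td:
            self._row.append(data.strip())

scraper = TableScraper()
scraper.feed(SAMPLE)
print(scraper.rows)  # [['Burglary', '120'], ['Fraud', '45']]
```

The point of ScraperWiki is precisely that fragile little scripts like this get kept, shared and repaired instead of thrown away.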

Hadoop Developer Virtual Appliance

Filed under: Hadoop,MapReduce — Patrick Durusau @ 2:48 pm

Hadoop Developer Virtual Appliance

From the webpage:

The Karmasphere Studio Community All-in-One Virtual Appliance combines Apache Hadoop, Eclipse and Karmasphere Studio Community to make it easy to get started with the Hadoop development lifecycle. With pre-configured and documented examples, this easy-to-install environment gives the developer everything they need to learn, prototype, develop and test Hadoop applications.

Use Studio Community Edition to:

  • Learn how to develop Hadoop applications in a familiar graphical environment. Fast!
  • Prototype and develop your Hadoop projects with visual aids and wizards.
  • Test your Hadoop application on any version of Hadoop whether it runs on premise, in a private data center or in the cloud.

Studio Community Edition is perfect for those new to Hadoop, and a great way to explore Karmasphere Studio before jumping into Studio Professional Edition.

  • Supports all Hadoop platforms including Amazon Elastic MapReduce, Apache Hadoop, Cloudera CDH, EMC Greenplum HD Community Edition and IBM InfoSphere BigInsights.
  • Runs on Mac, Linux, and Windows.
  • Includes comprehensive “Getting Started Guide” with easy to use examples

The tools for Hadoop and Map/Reduce just keep getting better. There’s a lesson in there somewhere.
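The MapReduce model those tools wrap can be sketched in a few lines of plain Python, no Hadoop required. This is the classic word-count example; the function names and input lines are illustrative only, and the "shuffle" here is just an in-memory sort and group.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit (word, 1) for every word, as a map task would.
    for word in line.lower().split():
        yield (word, 1)

def reducer(word, counts):
    # Reduce phase: sum all counts for one key.
    return (word, sum(counts))

lines = ["Hadoop makes MapReduce easy",
         "MapReduce makes counting easy"]

# Shuffle/sort phase: gather all mapper output and group by key,
# which is what the framework does between map and reduce tasks.
pairs = sorted(kv for line in lines for kv in mapper(line))
result = dict(reducer(key, (count for _, count in group))
              for key, group in groupby(pairs, key=itemgetter(0)))

print(result)  # {'counting': 1, 'easy': 2, 'hadoop': 1, ...}
```

What Karmasphere Studio and friends add on top of this simple model is the tooling: packaging, cluster configuration, debugging and deployment.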

Apache Lucene 3.3 / Solr 3.3

Filed under: Indexing,Lucene,Search Engines,Solr — Patrick Durusau @ 2:47 pm

Lucene 3.3 Announcement

Lucene Features:

  • The spellchecker module now includes suggest/auto-complete functionality, with three implementations: Jaspell, Ternary Trie, and Finite State.
  • Support for merging results from multiple shards, for both “normal” search results (TopDocs.merge) as well as grouped results using the grouping module (SearchGroup.merge, TopGroups.merge).
  • An optimized implementation of KStem, a less aggressive stemmer for English.
  • Single-pass grouping implementation based on block document indexing.
  • Improvements to MMapDirectory (now also the default implementation returned by FSDirectory.open on 64-bit Linux).
  • NRTManager simplifies handling near-real-time search with multiple search threads, allowing the application to control which indexing changes must be visible to which search requests.
  • TwoPhaseCommitTool facilitates performing a multi-resource two-phased commit, including IndexWriter.
  • The default merge policy, TieredMergePolicy, has a new method (set/getReclaimDeletesWeight) to control how aggressively it targets segments with deletions, and is now more aggressive than before by default.
  • PKIndexSplitter tool splits an index by a mid-point term.

Solr 3.3 Announcement

Solr Features:

  • Grouping / Field Collapsing
  • A new, automaton-based suggest/autocomplete implementation offering an order of magnitude smaller RAM consumption.
  • KStemFilterFactory, an optimized implementation of a less aggressive stemmer for English.
  • Solr defaults to a new, more efficient merge policy (TieredMergePolicy). See http://s.apache.org/merging for more information.
  • Important bugfixes, including extremely high RAM usage in spellchecking.
  • Bugfixes and improvements from Apache Lucene 3.3
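To get a feel for what the suggest/auto-complete implementations in both releases do, here is a minimal prefix-trie suggester in Python. Lucene’s actual structures (Jaspell, ternary trie, finite state automata) are far more compact, and the terms below are invented for the example.

```python
class Trie:
    """A plain prefix trie. Lucene's suggesters solve the same
    problem with more memory-efficient structures."""
    def __init__(self):
        self.children, self.is_word = {}, False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def suggest(self, prefix):
        # Walk down to the node for the prefix, if it exists.
        node = self
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        # Collect every completion below that node, in sorted order.
        out = []
        def walk(n, path):
            if n.is_word:
                out.append(prefix + path)
            for ch, child in sorted(n.children.items()):
                walk(child, path + ch)
        walk(node, "")
        return out

t = Trie()
for term in ["lucene", "lucid", "solr", "solaris"]:
    t.insert(term)
print(t.suggest("lu"))  # ['lucene', 'lucid']
```

The automaton-based implementation Solr 3.3 ships trades this pointer-heavy layout for a shared-suffix machine, which is where the order-of-magnitude RAM saving comes from.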

Explore MongoDB

Filed under: MongoDB — Patrick Durusau @ 2:46 pm

Explore MongoDB by Joe Lennon (IBM).

From the summary:

In this article, you will learn about MongoDB, the open source, document-oriented database management system written in C++ that provides features for scaling your databases in a production environment. Discover what benefits document-oriented databases have over traditional relational database management systems (RDBMS). Install MongoDB and start creating databases, collections, and documents. Examine Mongo’s dynamic querying features, which provide key/value store efficiency in a way familiar to RDBMS database administrators and developers.

Great way to get your feet wet with MongoDB!
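You can get a rough feel for document-oriented, dynamic querying without even installing a server. The Python sketch below imitates the shape of a Mongo-style find() against plain dictionaries; the data and the helper function are invented for the example, but with a real server the pymongo call collection.find({...}) looks much the same.

```python
# A hypothetical in-memory stand-in for a MongoDB collection.
# Documents are schema-free: note that "Eva" has no "age" field.
people = [
    {"name": "Ada", "city": "London", "age": 36},
    {"name": "Joe", "city": "Cork", "age": 29},
    {"name": "Eva", "city": "London"},
]

def find(collection, query):
    """Return documents matching every key/value pair in the query,
    mimicking the query-by-example style of MongoDB's find()."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in query.items())]

print([d["name"] for d in find(people, {"city": "London"})])
# ['Ada', 'Eva']
```

The appeal for RDBMS folk is that the query is itself just a document, built at runtime, with no schema migration standing between you and a new field.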
